Computer Science Homework Help
Please answer this homework related to Data Science and Big Data Analysis in APA format with References and Citations
Q1 Perform:
a. Text extraction and creating a corpus
b. Text pre-processing
c. Create the DTM & TDM from the corpus
d. Exploratory text analysis
e. Feature extraction by removing sparsity
f. Build the classification models and compare Logistic Regression to Random Forest
https://medium.com/analytics-vidhya/customer-review-analytics-using-text-mining-cd1e17d6ee4e
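As a rough illustration of steps a to e, here is a minimal sketch using the tm package. The three reviews are made up for the example, and the 0.6 sparsity threshold is an arbitrary assumption rather than a recommended value.

```r
# Minimal sketch of steps (a)-(e) with tm; the reviews below are invented.
library(tm)

reviews <- c("The food was great and the service was great",
             "Terrible food, will not come back",
             "Great place, loved the pasta")

# (a) Build a corpus from a character vector
corpus <- VCorpus(VectorSource(reviews))

# (b) Basic pre-processing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# (c) Document-term matrix (documents in rows) and its transpose, the TDM
dtm <- DocumentTermMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)

# (d) Exploratory look: total count of each term, most frequent first
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(term_freq))

# (e) Keep only terms whose share of empty documents is below 60%
#     (0.6 is an assumed threshold for this toy corpus)
dtm_dense <- removeSparseTerms(dtm, sparse = 0.6)
print(dim(dtm))
print(dim(dtm_dense))
```

On a real review file the threshold is usually much closer to 1 (for example 0.99), because most terms appear in only a small fraction of documents.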
Q2 – Analyze the customer reviews in the file Restaurant_Reviews.tsv
a. Explain each step of the following text clean-up commands:
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
Uncomment in order to see the impact:
#as.character(corpus[[841]])
#as.character(corpus[[1]])
b. What is the classification question?
c. Use the confusion matrix (CM) of the Random Forest classifier to calculate:
TP = # True Positives,
TN = # True Negatives,
FP = # False Positives,
FN = # False Negatives,
Accuracy = (TP + TN) / (TP + TN + FP + FN)
d. Apply the logistic regression classifier to the problem and recalculate Q2c, i.e. TP, TN, FP, FN, and Accuracy.
e. Apply the SVM classifier to the same question and recalculate Q2c.
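The counts and the accuracy formula in Q2c can be sketched in base R as follows. The two label vectors are hypothetical. In the actual exercise they would be the true labels of the test set and the predictions of whichever classifier is being evaluated.

```r
# Hypothetical labels: 1 = positive review, 0 = negative review
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(Actual    = factor(actual,    levels = c(0, 1)),
            Predicted = factor(predicted, levels = c(0, 1)))

TN <- cm["0", "0"]  # actual 0, predicted 0
FP <- cm["0", "1"]  # actual 0, predicted 1
FN <- cm["1", "0"]  # actual 1, predicted 0
TP <- cm["1", "1"]  # actual 1, predicted 1

accuracy <- (TP + TN) / (TP + TN + FP + FN)
print(cm)
print(accuracy)
```

The same few lines can be reused unchanged for the logistic regression and SVM parts, since only the vector of predictions differs between classifiers.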
Q3: Study the quanteda toolkit for R
Q3a: Compare quanteda to alternative R packages for quantitative text analysis (tm, tidytext, corpus, and koRpus)
Q3b: Run install.packages("quanteda") and then library(quanteda), and explain the different features of the quanteda package for text analysis
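To give a feel for the basic quanteda workflow, here is a minimal sketch that goes from raw text to a document-feature matrix. The two example sentences and the document names d1 and d2 are made up, and the sketch assumes the package has already been installed.

```r
library(quanteda)

# Two invented documents, named d1 and d2
texts <- c(d1 = "Quanteda makes text analysis in R fast.",
           d2 = "Text analysis starts with tokens and a document-feature matrix.")

corp  <- corpus(texts)                         # corpus object
toks  <- tokens(corp, remove_punct = TRUE)     # tokenize, dropping punctuation
toks  <- tokens_tolower(toks)                  # lower-case all tokens
toks  <- tokens_remove(toks, stopwords("en"))  # remove English stopwords
dfmat <- dfm(toks)                             # document-feature matrix

print(ndoc(dfmat))
print(topfeatures(dfmat, 5))
```

The corpus, tokens, and dfm objects are the three core data structures of the package, and most other quanteda functions take one of them as input.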
Q4 Spam Text Message Classification – Use the quanteda package to perform “spam” classification on the text message file in Q4
The file name: Q4.spam-text-message-classification.zip
a. Create the "word" cloud for spam and ham messages
b. Apply a Naïve Bayes Classifier and compute TP, TN, FP, FN, Accuracy
c. Use a Logistic Regression Classifier and compute TP, TN, FP, FN, Accuracy
d. Use a Random Forest Classifier and compute TP, TN, FP, FN, Accuracy
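A minimal sketch of the Naïve Bayes step, assuming the quanteda.textmodels companion package is installed alongside quanteda. The messages and labels are invented stand-ins for the contents of the Q4 file, so the resulting numbers mean nothing beyond showing the mechanics.

```r
library(quanteda)
library(quanteda.textmodels)

# Invented training messages; the real data would come from the Q4 file
msgs <- c("win a free prize now", "free cash claim now", "urgent free offer win",
          "are we meeting for lunch", "see you at home tonight", "call me after lunch")
labels <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

# Train a Naive Bayes model on the document-feature matrix
train_dfm <- dfm(tokens(msgs))
nb <- textmodel_nb(train_dfm, labels)

# Invented held-out messages with their true labels
new_msgs   <- c("free prize win now", "lunch at home tonight")
new_actual <- factor(c("spam", "ham"), levels = levels(labels))

# Align the new features with the training features before predicting
new_dfm <- dfm_match(dfm(tokens(new_msgs)), features = featnames(train_dfm))
pred <- predict(nb, newdata = new_dfm)

# Confusion matrix (rows actual, columns predicted) and accuracy
cm <- table(Actual = new_actual, Predicted = pred)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```

The dfm_match call matters because the classifier can only score features it saw during training. Words that appear only in the new messages are dropped, and training features missing from them are added as zero counts.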
Q5. The State of the Union is an annual address by the President of the United States before a joint session of Congress. In it, the President reviews the previous year and lays out his legislative agenda for the coming year.
This dataset contains the full text of the State of the Union address from 1989 (Reagan) to 2017 (Trump).
a. Topic modelling: Which topics have become more popular over time? Which have become less popular?
b. Sentiment analysis: Are there differences in tone between different Presidents? Presidents from different parties?