
Monthly Archives: February 2015

Step 2 – R Sentiment Analysis


In my previous article, Step 1 – R Authentication for Twitter, we saw how to pull tweets from Twitter. In Step 2, we will look at how to do sentiment analysis on the pulled tweets. We can do this in two ways:

1. Write our own code to do sentiment analysis.
2. Use the sentiment package for R.

First, let's try to write our own code. Start by pulling tweets from Twitter.

Import the required packages


library(twitteR)
library(plyr)
library(ggplot2)
library(stringr)

Load the Twitter Cred file and verify that the credentials work properly.

load("twitter authentication.Rdata")
registerTwitterOAuth(Cred)

Finally, pull some tweets from Twitter. The search below uses the #ObamaInIndia hashtag; substitute any other term, such as one for Swachh Bharat Abhiyan.


tweets <- searchTwitter("#ObamaInIndia", n=4000, cainfo="cacert.pem")

Data Cleaning

Prepare the tweets for sentiment analysis. This includes removing special characters, retweet markers, and HTML links.


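# searchTwitter() returns a list of status objects, so extract the text of each one
tweet_txt = sapply(tweets, function(x) x$getText())
# remove retweet entities
tweet_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweet_txt)
# remove at people
tweet_txt = gsub("@\\w+", "", tweet_txt)
# remove punctuation
tweet_txt = gsub("[[:punct:]]", "", tweet_txt)
# remove numbers
tweet_txt = gsub("[[:digit:]]", "", tweet_txt)
# remove html links (punctuation is already gone, so a link is now one "http..." word)
tweet_txt = gsub("http\\w+", "", tweet_txt)
# collapse runs of spaces or tabs into a single space so neighbouring words stay separate
tweet_txt = gsub("[ \t]{2,}", " ", tweet_txt)
# trim leading and trailing whitespace
tweet_txt = gsub("^\\s+|\\s+$", "", tweet_txt)
# drop any remaining characters outside the allowed set
tweet_txt = gsub("[^0-9a-zA-Z ,./?><:;’~`!@#&*’]", "", tweet_txt)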
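
To see what the cleaning steps do, here is a minimal sketch that runs the same sequence on two made-up tweets. The sample strings and the `clean_tweet` helper name are assumptions for illustration and are not part of the original code. The sketch replaces runs of spaces with a single space so that neighbouring words are not merged.

```r
# Hypothetical sample tweets, invented for this illustration
sample_tweets <- c(
  "RT @someone: Great   speech today! http://t.co/abc123",
  "@friend 2 hours of waiting... #fail"
)

# Hypothetical helper that applies the cleaning steps in order
clean_tweet <- function(txt) {
  txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)  # retweet entities
  txt <- gsub("@\\w+", "", txt)                          # mentions
  txt <- gsub("[[:punct:]]", "", txt)                    # punctuation
  txt <- gsub("[[:digit:]]", "", txt)                    # digits
  txt <- gsub("http\\w+", "", txt)                       # links, now one "http..." word
  txt <- gsub("[ \t]{2,}", " ", txt)                     # collapse repeated spaces
  gsub("^\\s+|\\s+$", "", txt)                           # trim both ends
}

cleaned <- clean_tweet(sample_tweets)
print(cleaned)
```

Only plain words separated by single spaces remain, which is the form the word matching step later expects.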


Read the positive and negative word lists from their text files, then add extra words to each list to suit the tweets and the topic you are analysing.

#Load sentiment word lists

liu.pos = scan('positive-words.txt', what='character', comment.char=';')
liu.neg = scan('negative-words.txt', what='character', comment.char=';')

#Add words to list


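# Tweets are lower-cased before matching, so lower-case the extra words as well.
# Multi-word entries will not match, because tweets are split into single words.
pos.words = tolower(c(liu.pos, 'upgrade','india','credit','Milkha Singh','MannKiBaat','Modi','Obama'))
neg.words = tolower(c(liu.neg, 'wait','waiting', 'epicfail', 'mechanical','love jihaad','ghar waapsi','jihad'))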
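
As a small illustration of how the word lists are read and extended, the sketch below writes a tiny stand-in lexicon to a temporary file and loads it with `scan()`. The file contents and the names `lexicon_file`, `base.pos`, `custom.pos` and `pos.demo` are assumptions made up for this example; the real lexicon files are much longer.

```r
# Write a tiny stand-in lexicon; lines starting with ';' are comments
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; header comment, skipped by scan", "good", "great"), lexicon_file)

# comment.char = ';' makes scan() ignore the header line
base.pos <- scan(lexicon_file, what = "character", comment.char = ";", quiet = TRUE)

# Extra domain words (hypothetical), lower-cased because match() is case sensitive
custom.pos <- c("Upgrade", "MannKiBaat")
pos.demo <- unique(tolower(c(base.pos, custom.pos)))
print(pos.demo)

# A capitalised entry would never match a lower-cased tweet word
print(is.na(match("modi", "Modi")))
```

Lower-casing the extra words matters because the scoring step lower-cases each tweet before looking words up.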

Function to calculate the sentiment score of each tweet.


score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')

    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)

  }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
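
The scoring rule is simply the number of positive words minus the number of negative words. The sketch below reproduces that rule with base R only, on made-up word lists and sentences, so the arithmetic can be checked by hand. The names `score_one`, `pos.demo`, `neg.demo` and `demo_sentences` are hypothetical and exist only for this illustration.

```r
# Hypothetical miniature dictionaries
pos.demo <- c("good", "great", "love")
neg.demo <- c("bad", "waiting", "fail")

# Score one sentence: count of positive words minus count of negative words
score_one <- function(sentence, pos, neg) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  sum(!is.na(match(words, pos))) - sum(!is.na(match(words, neg)))
}

demo_sentences <- c(
  "Great speech good vibes",     # two positive words
  "still waiting what a fail",   # two negative words
  "nothing to report"            # no dictionary words
)

demo_scores <- vapply(demo_sentences, score_one, integer(1),
                      pos = pos.demo, neg = neg.demo, USE.NAMES = FALSE)
print(demo_scores)
```

A sentence with no dictionary words scores zero, so a large spike at zero in the results usually means the word lists do not cover the topic well.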

#Method call to calculate the sentiment score
tweet.scores = score.sentiment(tweet_txt, pos.words,neg.words, .progress='text')

#Draw a histogram of the scores, first with base graphics and then with ggplot2.
hist(tweet.scores$score)
qplot(tweet.scores$score)
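
Besides the histogram, it can help to bucket the scores into negative, neutral and positive and count each group. This is a minimal sketch on made-up scores, not part of the original code. The vector `demo.scores` and the labels are assumptions for illustration; with real data, `tweet.scores$score` would take its place.

```r
# Hypothetical scores standing in for tweet.scores$score
demo.scores <- c(2, -1, 0, 0, 3, -2, 1)

# sign() maps each score to -1, 0 or 1, which is then labelled
sentiment <- factor(sign(demo.scores),
                    levels = c(-1, 0, 1),
                    labels = c("negative", "neutral", "positive"))

# Count how many tweets fall into each group
counts <- table(sentiment)
print(counts)
```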


 
