R: Sentiment analysis of morning pages
A couple of months ago I came across a cool blog post by Julia Silge where she runs a sentiment analysis algorithm over her tweet stream to see how her tweet sentiment has varied over time.
I wanted to give it a try but couldn’t figure out how to get a dump of my tweets so I decided to try it out on the text from my morning pages writing which I’ve been experimenting with for a few months.
Here’s an explanation of morning pages if you haven’t come across it before:
Morning Pages are three pages of longhand, stream of consciousness writing, done first thing in the morning. There is no wrong way to do Morning Pages — they are not high art. They are not even “writing.” They are about anything and everything that crosses your mind-- and they are for your eyes only. Morning Pages provoke, clarify, comfort, cajole, prioritize and synchronize the day at hand. Do not over-think Morning Pages: just put three pages of anything on the page…and then do three more pages tomorrow.
Most of my writing is complete gibberish but I thought it’d be fun to see how my mood changes over time and see if it identifies any peaks or troughs in sentiment that I could then look into further.
I’ve got one file per day so we’ll start by building a data frame containing the text, one row per day:
library(syuzhet) library(lubridate) library(ggplot2) library(scales) library(reshape2) library(dplyr) root="/path/to/files" files = list.files(root) df = data.frame(file = files, stringsAsFactors=FALSE) df$fullPath = paste(root, df$file, sep = "/") df$text = sapply(df$fullPath, get_text_as_string)
We end up with a data frame with 3 fields:
> names(df)  "file" "fullPath" "text"
Next we’ll run the sentiment analysis function - syuzhet#get_nrc_sentiment - over the data frame and get a score for each type of sentiment for each entry:
get_nrc_sentiment(df$text) %>% head() anger anticipation disgust fear joy sadness surprise trust negative positive 1 7 14 5 7 8 6 6 12 14 27 2 11 12 2 13 9 10 4 11 22 24 3 6 12 3 8 7 7 5 13 16 21 4 5 17 4 7 10 6 7 13 16 37 5 4 13 3 7 7 9 5 14 16 25 6 7 11 5 7 8 8 6 15 16 26
Now we’ll merge these columns into our original data frame:
df = cbind(df, get_nrc_sentiment(df$text)) df$date = ymd(sapply(df$file, function(file) unlist(strsplit(file, "[.]")))) df %>% select(-text, -fullPath, -file) %>% head() anger anticipation disgust fear joy sadness surprise trust negative positive date 1 7 14 5 7 8 6 6 12 14 27 2016-01-02 2 11 12 2 13 9 10 4 11 22 24 2016-01-03 3 6 12 3 8 7 7 5 13 16 21 2016-01-04 4 5 17 4 7 10 6 7 13 16 37 2016-01-05 5 4 13 3 7 7 9 5 14 16 25 2016-01-06 6 7 11 5 7 8 8 6 15 16 26 2016-01-07
Finally we can build some 'sentiment over time' charts like Julia has in her post:
posnegtime <- df %>% group_by(date = cut(date, breaks="1 week")) %>% summarise(negative = mean(negative), positive = mean(positive)) %>% melt names(posnegtime) <- c("date", "sentiment", "meanvalue") posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)]) ggplot(data = posnegtime, aes(x = as.Date(date), y = meanvalue, group = sentiment)) + geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) + geom_point(size = 0.5) + ylim(0, NA) + scale_colour_manual(values = c("springgreen4", "firebrick3")) + theme(legend.title=element_blank(), axis.title.x = element_blank()) + scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) + ylab("Average sentiment score") + ggtitle("Sentiment Over Time")
So overall it seems like my writing displays more positive sentiment than negative which is nice to know. The chart shows a rolling one week average and there isn’t a single week where there’s more negative sentiment than positive.
I thought it’d be fun to drill into the highest negative and positive days to see what was going on there:
> df %>% filter(negative == max(negative)) %>% select(date) date 1 2016-03-19 > df %>% filter(positive == max(positive)) %>% select(date) date 1 2016-01-05 2 2016-06-20
On the 19th March I was really frustrated because my boiler had broken down and I had to buy a new one - I’d completely forgotten how annoyed I was, so thanks sentiment analysis for reminding me!
I couldn’t find anything particularly positive on the 5th January or 20th June. The 5th January was the day after my birthday so perhaps I was happy about that but I couldn’t see any particular evidence that was the case.
Playing around with the get_nrc_sentiment function it does seem to identify positive sentiment when I wouldn’t say there is any. For example here’s some example sentences from my writing today:
> get_nrc_sentiment("There was one section that I didn't quite understand so will have another go at reading that.") anger anticipation disgust fear joy sadness surprise trust negative positive 1 0 0 0 0 0 0 0 0 0 1
> get_nrc_sentiment("Bit of a delay in starting my writing for the day...for some reason was feeling wheezy again.") anger anticipation disgust fear joy sadness surprise trust negative positive 1 2 1 2 2 1 2 1 1 2 2
I don’t think there’s any positive sentiment in either of those sentences but the function claims 3 bits of positive sentiment! It would be interesting to see if I fare any better with Stanford’s sentiment extraction tool which you can use with syuzhet but requires a bit of setup first.
I’ll give that a try next but in terms of getting an overview of my mood I thought I might get a better picture if I looked for the difference between positive and negative sentiment rather than absolute values.
The following code does the trick:
difftime <- df %>% group_by(date = cut(date, breaks="1 week")) %>% summarise(diff = mean(positive) - mean(negative)) ggplot(data = difftime, aes(x = as.Date(date), y = diff)) + geom_line(size = 2.5, alpha = 0.7) + geom_point(size = 0.5) + ylim(0, NA) + scale_colour_manual(values = c("springgreen4", "firebrick3")) + theme(legend.title=element_blank(), axis.title.x = element_blank()) + scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) + ylab("Average sentiment difference score") + ggtitle("Sentiment Over Time")
This one identifies peak happiness in mid January/February. We can find the peak day for this measure as well:
> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(date) date 1 2016-02-25
Or if we want to see the individual scores:
> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(-text, -file, -fullPath) anger anticipation disgust fear joy sadness surprise trust negative positive date diff 1 0 11 2 3 7 1 6 6 3 31 2016-02-25 28
After reading through the entry for this day I’m wondering if the individual pieces of sentiment might be more interesting than the positive/negative score.
On the 25th February I was:
quite excited about reading a distributed systems book I’d just bought (I know?!)
thinking about how to apply the tag clustering technique to meetup topics
preparing my submission to PyData London and thinking about what was gonna go in it
thinking about the soak testing we were about to start doing on our project
Each of those is a type of anticipation so it makes sense that this day scores highly. I looked through some other days which specifically rank highly for anticipation and couldn’t figure out what I was anticipating so even this is a bit hit and miss!
I have a few avenues to explore further but if you have any other ideas for what I can try next let me know in the comments.
About the author
I'm currently working on real-time user-facing analytics with Apache Pinot at StarTree. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also I co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.