The growth of social media has changed the world of marketing. Delivering content has never been as easy as it is today. Any piece of information can spread across the world in seconds, so it is important to recognize what is going on with your product or brand as fast as possible. This is especially crucial when something goes wrong, or when you want to check how people perceive what you do. That's why we believe continuous, fully automatic monitoring of social media is a must-have for every business, and sentiment is one of the most important factors to keep an eye on.

What is sentiment analysis about?

Sentiment analysis is a well-known Natural Language Processing (NLP) problem whose goal is to determine whether a particular text is positive, negative, or neutral. It can be thought of as a simpler variant of emotion recognition, which tries to identify what the author felt while writing. The variety of emotions is much richer, which makes that problem harder to solve, but for our purposes sentiment is enough.
Twitter was our first choice when we thought about which medium to monitor, as it is commonly used and makes it possible to automatically retrieve messages that include a given phrase or hashtag. It is also a tool people use to inform others about things happening right now, so it seems almost designed for real-time analytics.

From text messages to sentiment

Humans have a natural ability to recognize somebody else's feelings, and we are quite good at it. Most of us can also deal with irony and read between the lines. An automated system needs to learn from provided examples and find a way to map the words of a text to a sentiment. The thing is, our minds don't rely on words alone, but take more subtle cues into consideration. What we found interesting was the usage of emojis, which are direct indicators of somebody's emotions and can help to improve recognition accuracy. That's why we decided to pay special attention to them.
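To give an idea of what we mean, here is a minimal sketch of an emoji-based feature extractor. The emoticon lists and the function name are purely illustrative; a production lexicon would be far larger and cover Unicode emojis as well.

```python
# Illustrative emoticon lists -- not the actual production lexicon,
# which would also include Unicode emojis.
POSITIVE = [":)", ":-)", ":D", ";)", "<3"]
NEGATIVE = [":(", ":-(", ":'(", ">:("]

def emoji_features(text: str) -> dict:
    """Count direct emotional indicators in a tweet."""
    return {
        "positive_emojis": sum(text.count(e) for e in POSITIVE),
        "negative_emojis": sum(text.count(e) for e in NEGATIVE),
    }

print(emoji_features("Loving the new release :) :D"))
# {'positive_emojis': 2, 'negative_emojis': 0}
```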

As machine learning models are usually mathematical, the texts need to be mapped from letters into numbers somehow. This is called vectorization, and for our purposes we chose TFIDF[1] vectorization, which splits the text into words and assigns each word a weight depending on its importance. Additionally, several features, such as text length and the presence of special characters (exclamation marks, for instance), were attached to the vectors. We also tested the standard count vectorization method, as well as feature extraction alone; choosing the right method here is crucial for overall accuracy and performance, and TFIDF turned out to be the best choice, as it led to the best results in terms of correctness.
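A sketch of how such a feature set can be assembled with scikit-learn is shown below. The parameters and the hand-crafted feature function are our illustration, not necessarily the exact configuration of the original system.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def extra_features(texts):
    """Hand-crafted features: text length and exclamation-mark presence."""
    return np.array([[len(t), int("!" in t)] for t in texts])

texts = [
    "I love this airline!",
    "Worst flight ever, two hours of delay",
]

# TFIDF turns each text into a weighted bag-of-words vector.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)

# Attach the additional features to the TFIDF vectors.
X = hstack([X_tfidf, extra_features(texts)])
print(X.shape)  # (2, vocabulary_size + 2)
```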

For the training phase we collected publicly available datasets of tweets[2],[3] labelled with their sentiment. It turned out that the number of unique words across these datasets was higher than 250,000, which means each text is mapped into a 250,000-dimensional space. That was definitely too much, so we reduced the dimensionality to 500 with the great help of PCA[4], which speeds up the computation a lot. In our experiments we compared different machine learning models, and the Random Forest Classifier[5] won the competition, as it had the highest accuracy among those that were also fast enough to be used in production.
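The reduction and classification steps could look roughly like the sketch below. Note one practical detail: scikit-learn's PCA requires dense input, which is impractical at 250,000 dimensions, so TruncatedSVD, the standard PCA analogue for sparse TFIDF matrices, stands in here. The synthetic data and all parameters are stand-ins for illustration only.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for the real TFIDF matrix and sentiment labels; in the actual
# pipeline X has ~250,000 columns and n_components is 500.
rng = np.random.default_rng(0)
X = sparse_random(200, 1000, density=0.05, random_state=0)
y = rng.integers(0, 3, size=200)  # 0 = negative, 1 = neutral, 2 = positive

# TruncatedSVD plays the role of PCA for sparse input.
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```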

An advantage over available tools

Stanford CoreNLP[6] is considered a state-of-the-art library in the field of NLP, and initially we wanted to use it for our content monitoring tool, but it achieved only about 50% accuracy, which was below our expectations. Stanford's tool was designed to work with longer, usually grammatically correct texts, but since Twitter limits the length of messages and its users do not care much about linguistic correctness, it struggled with the texts we had. That is why we decided to create our own solution that fits our needs.

Figure: the flow of data from the raw text of a tweet, through word splitting, TFIDF vectorization, and PCA dimensionality reduction, to sentiment classification with a Random Forest classifier.

The approach described above achieves an accuracy higher than 75% on the same dataset we tested Stanford CoreNLP on. Codete has created a tool that lets you enter any input phrase to be monitored on Twitter and visualizes the overall sentiment of all the matching tweets. It can also be used to analyze the perception of any brand.
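The monitoring loop itself could be wired up along these lines. This is a hypothetical sketch using tweepy's legacy streaming API (v3.x); the classify() call stands in for the trained pipeline (vectorizer, dimensionality reduction, Random Forest) and is not part of tweepy.

```python
import tweepy

class SentimentListener(tweepy.StreamListener):
    def on_status(self, status):
        # classify() is a hypothetical call into the trained pipeline.
        label = classify(status.text)
        print(f"[{label}] {status.text}")

# Credentials are placeholders; real keys come from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth, SentimentListener())
stream.filter(track=["your brand"])  # the monitored phrase or hashtag
```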

The video below shows how the system we created has been applied in our visualization, presented at different conferences:

We still believe there are plenty of opportunities to improve the accuracy of such a system. Having a high-quality training dataset is definitely one of the most important factors, and that's what we are going to focus on. Sounds interesting? Any progress will be reported on our blog, so stay tuned!

References:

  1. https://en.wikipedia.org/wiki/Tf%E2%80%93idf
  2. https://www.kaggle.com/crowdflower/twitter-airline-sentiment
  3. http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
  4. https://en.wikipedia.org/wiki/Principal_component_analysis
  5. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  6. https://stanfordnlp.github.io/CoreNLP/

Software Engineer

I am a big fan of AI and of applying machine learning methods to real-life problems, with experience in web development and databases. Currently, I'm involved in Big Data projects as well as in internal research at Codete.