10 Tips for Sentiment Analysis projects

In my thesis project for the MSc in Statistics I focused on the problem of Sentiment Analysis. Sentiment Analysis is an application of Natural Language Processing that aims to identify the sentiment (positive vs negative vs neutral), the subjectivity (objective vs subjective) and the emotional states expressed in a document. I worked on this project for over 9 months and used several different statistical methods and techniques under the supervision of Professors Tsiamyrtzis and Kakadiaris.

During my thesis, I had the opportunity to learn about new machine learning techniques, but I also ran into some interesting and non-obvious issues. In this article I discuss the things that I found most interesting while working on the Sentiment Analysis project, and I provide some tips that you should keep in mind while working on similar Natural Language Processing problems.

All of the tips and practices below were used to develop Datumbox’s Sentiment Analysis service, which powers our API.

1. Using Lexicon based VS Learning based techniques

Lexicon based techniques use a dictionary to perform entity-level sentiment analysis. This technique uses dictionaries of words annotated with their semantic orientation (polarity and strength) and calculates a score for the polarity of the document. Usually this method gives high precision but low recall.
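As a rough illustration of the scoring idea described above, here is a minimal sketch in Python. The tiny lexicon, the function names and the threshold are all made up for the example; a real project would load a published lexicon instead.

```python
import re

# Hypothetical toy lexicon (not from the article): word -> semantic orientation.
# The sign is the polarity and the magnitude is the strength.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "excellent": 3.0,
    "bad": -1.0, "awful": -2.0, "terrible": -3.0,
}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the orientation of every known word; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)

def lexicon_label(text, lexicon=TOY_LEXICON, threshold=0.5):
    """Map the score to a class; the threshold is an arbitrary illustrative choice."""
    score = lexicon_score(text, lexicon)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(lexicon_label("The food was great but the service was bad"))  # positive
print(lexicon_label("The meeting starts at noon"))                   # neutral
```

Any word that is missing from the dictionary contributes nothing to the score, which is one way to see why this family of methods tends to have low recall.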

Learning based techniques require creating a model by training the classifier with labeled examples. This means that you must first gather a dataset with examples for positive, negative and neutral classes, extract the features/words from the examples and then train the algorithm based on the examples.
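To make the gather, extract and train steps concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The six training sentences and all function names are invented for the illustration; this is not the Datumbox implementation.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the counts the model needs."""
    doc_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}

def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        logp = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # words never seen in training are skipped
                logp += math.log((counts[token] + 1) / (total_words + vocab_size))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Made-up labeled examples covering the three classes.
TRAIN = [
    ("I love this phone, it is great", "positive"),
    ("great screen and excellent battery", "positive"),
    ("I hate this phone, it is terrible", "negative"),
    ("terrible screen and awful battery", "negative"),
    ("the phone ships with a charger", "neutral"),
    ("the store opens at nine", "neutral"),
]
model = train_naive_bayes(TRAIN)
print(predict(model, "excellent battery"))  # positive
```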

Which method you choose depends heavily on the application, domain and language. Lexicon based techniques with large dictionaries can achieve very good results. Nevertheless they require a lexicon, which is not available in all languages. Learning based techniques also deliver good results, but they require obtaining labeled datasets and training a model.

2. Using Statistical VS Syntactic techniques

Similarly, when building a text analysis application you can choose between a Statistical technique and a Syntactic one. Syntactic techniques can deliver better accuracy because they make use of the syntactic rules of the language in order to detect the verbs, adjectives and nouns. Unfortunately such techniques depend heavily on the language of the document, and as a result the classifiers can’t be ported to other languages.

On the other hand, statistical techniques have a probabilistic background and focus on the relations between the words and the categories. Statistical techniques have 2 significant benefits over the Syntactic ones: we can use them in other languages with minor or no adaptations, and we can use Machine Translation of the original dataset and still get quite good results. This is not possible with syntactic techniques.

3. Don’t forget the Neutral Class

When performing Sentiment Analysis, most people tend to ignore the Neutral class and focus only on the positive and negative classes. Nevertheless it is important to understand that not all sentences carry a sentiment. Training the classifier to detect only the 2 classes forces several neutral words to be classified as either positive or negative, something that leads to overfitting.

As Koppel and Schler showed in their paper “The Importance of Neutral Examples for Learning Sentiment”, the neutral class not only should not be ignored, but it can also improve the overall accuracy of an SVM classifier. My research for my MSc thesis in this field also showed that the Max Entropy classifier can benefit from the neutral class. Within the next weeks I plan to publish an article on this.

4. Mind the Tokenization algorithm

How will you represent the documents? Are you going to take into account multiple occurrences of the same word? What type of tokenization will you use? Will you make use of the n-grams framework? If so, how many keyword combinations are you going to use?

There is no single answer to the above questions. The answers can change considerably depending on the topic, application and language. Thus make sure you run several preliminary tests to find the best algorithmic configuration.

Just remember that if you use the n-grams framework, n should not be too large. In Sentiment Analysis in particular you will see that using 2-grams or 3-grams is more than enough and that increasing the number of keyword combinations can hurt the results. Moreover keep in mind that in Sentiment Analysis the number of occurrences of a word in the text does not make much of a difference. Usually the Binarized versions of the algorithms (occurrences clipped to 1) perform better than the ones that use multiple occurrences.
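The two choices discussed here, the maximum n-gram length and whether to clip occurrences to 1, can be sketched as a small feature extractor. The function names and the simple regular-expression tokenizer are assumptions made for this example only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, max_n=2):
    """Return all 1-grams up to max_n-grams, each joined by a single space."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

def extract_features(text, max_n=2, binarized=True):
    """Map a document to {feature: value}; binarized clips every count to 1."""
    counts = Counter(ngrams(tokenize(text), max_n))
    if binarized:
        return {feature: 1 for feature in counts}
    return dict(counts)

print(extract_features("good good movie", binarized=False))
# {'good': 2, 'movie': 1, 'good good': 1, 'good movie': 1}
print(extract_features("good good movie", binarized=True))
# {'good': 1, 'movie': 1, 'good good': 1, 'good movie': 1}
```

Running the preliminary tests then amounts to repeating the same experiment with different values of max_n and binarized and comparing the results on held-out data.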

5. Mind the Feature Selection algorithm

In learning based techniques, before training the classifier, you must select the words/features that you will use in your model. You can’t just use all the words that the tokenization algorithm returned, simply because many of them are irrelevant.

Two commonly used feature selection algorithms in Text Classification are Mutual Information and the Chi-square test. Each algorithm evaluates the keywords in a different way and thus leads to different selections. Each algorithm also requires different configuration, such as the level of statistical significance, the number of selected features etc. Again you must use trial and error to find the configuration that works best in your project.
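Both scores can be computed from a 2x2 table of document counts for one feature and one class. The sketch below uses the usual textbook formulas; the argument naming (n11 = feature present and document in the class, n10 = present but not in the class, n01 = absent but in the class, n00 = absent and not in the class) and the helper rank_features are choices made for this illustration.

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of the 2x2 feature/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between feature presence and class membership."""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # feature present, in class
        (n10, n11 + n10, n10 + n00),  # feature present, not in class
        (n01, n01 + n00, n11 + n01),  # feature absent, in class
        (n00, n01 + n00, n10 + n00),  # feature absent, not in class
    ]
    mi = 0.0
    for joint, feature_total, class_total in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += (joint / n) * math.log2(n * joint / (feature_total * class_total))
    return mi

def rank_features(docs, target_label, score_fn=chi_square, top_k=3):
    """docs: list of (set_of_features, label). Returns the top_k feature names."""
    n_docs = len(docs)
    n_target = sum(1 for _, label in docs if label == target_label)
    vocab = set().union(*(features for features, _ in docs))
    scored = []
    for feature in vocab:
        n11 = sum(1 for fs, label in docs if feature in fs and label == target_label)
        n10 = sum(1 for fs, label in docs if feature in fs and label != target_label)
        n01 = n_target - n11
        n00 = n_docs - n_target - n10
        scored.append((score_fn(n11, n10, n01, n00), feature))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by name
    return [feature for _, feature in scored[:top_k]]
```

Note that both scores are high for any feature that is strongly associated with the class in either direction, so a word that appears only in negative documents also ranks highly for the positive class.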

6. Different Classifiers deliver different results

Make sure you try as many classification methods as possible. Keep in mind that different algorithms deliver different results. Also note that some classifiers might work better with a specific feature selection configuration.

Generally it is expected that state of the art classification techniques such as SVM would outperform simpler techniques such as Naïve Bayes. Nevertheless be prepared to see the opposite. Sometimes Naïve Bayes is able to provide the same or even better results than more advanced methods. Don’t eliminate a classification model only because of its reputation.
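One simple way to follow this advice is to score every candidate on the same held-out examples and rank them by the result rather than by reputation. The sketch below assumes each classifier is just a function from text to label; the two toy classifiers and the four test sentences are placeholders invented for the example.

```python
def accuracy(classify, test_set):
    """Fraction of (text, gold_label) pairs that the classifier labels correctly."""
    correct = sum(1 for text, gold in test_set if classify(text) == gold)
    return correct / len(test_set)

def compare_classifiers(classifiers, test_set):
    """classifiers: {name: function}. Returns (name, accuracy) pairs, best first."""
    results = {name: accuracy(fn, test_set) for name, fn in classifiers.items()}
    return sorted(results.items(), key=lambda item: (-item[1], item[0]))

# Placeholder classifiers standing in for real trained models.
def always_positive(text):
    return "positive"

def keyword_rule(text):
    if "great" in text:
        return "positive"
    if "bad" in text:
        return "negative"
    return "neutral"

TEST_SET = [
    ("great food", "positive"),
    ("bad food", "negative"),
    ("great view", "positive"),
    ("it opens at nine", "neutral"),
]
ranking = compare_classifiers(
    {"always_positive": always_positive, "keyword_rule": keyword_rule}, TEST_SET
)
print(ranking)  # [('keyword_rule', 1.0), ('always_positive', 0.5)]
```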

7. The domain/topic matters… a lot!

There is no single algorithm that performs well in all topics/domains/applications. Be prepared to see that the accuracy of your classifier can be as high as 90% in one domain/topic and as low as 60% in another.

For example you might find that Max Entropy with Chi-square feature selection is the best combination for restaurant reviews, while for Twitter the Binarized Naïve Bayes with Mutual Information feature selection outperforms even the SVMs. Be prepared to see lots of weird results. Particularly in the case of Twitter, avoid using lexicon based techniques, because users are known to use idioms, jargon and Twitter slang that heavily affect the polarity of the tweet.

8. Don’t expect every technique to work well for you

The best source of information for Sentiment Analysis is obviously the academic papers. Nevertheless don’t expect every suggested technique to work well for you. While papers can usually point you in the right direction, some techniques work only in specific domains. Also keep in mind that not all papers are of the same quality and that some authors overstate or “optimize” their results. Don’t make the mistake of using a particular technique just because you found it in a paper. Ask yourself whether it delivers the results that you expect, or whether it makes your algorithm unnecessarily complicated and its results difficult to explain.

9. Garbage in – Garbage out

Be careful what datasets you use when you train your classifiers. Simply reading a few examples from the most commonly used Sentiment Analysis datasets will make you understand that they contain a lot of garbage. Some of the examples are too ambiguous, contain mixed sentiments or make comparisons, and thus they are not ideal for training.

Try to use human annotated datasets as much as possible rather than automatically extracted examples. Scraping structured reviews from various websites is also a problematic approach, so be extra careful what examples you use. Finally remember that unless you know otherwise, the probability of classifying a document as positive, negative or neutral is equal. Thus the number of examples in each category of the dataset should be equal.
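If the collected examples are not evenly spread across the classes, one simple option is to downsample every class to the size of the smallest one. This is only a sketch of that one option, with a made-up function name; other strategies, such as collecting more examples for the rare class, may suit a project better.

```python
import random
from collections import defaultdict

def balance_by_downsampling(examples, seed=0):
    """examples: list of (text, label). Keeps an equal number of examples per class."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    if not by_label:
        return []
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced
```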

10. Ensemble learning might not be as powerful

One of the most powerful techniques for building highly accurate classifiers is using ensemble learning and combining the results of different classifiers. Ensemble learning has great applications in fields of computer vision where the same object can be presented in 3D, 2D, infrared etc. Thus using several different weak classifiers that focus on different areas can help us build strong high-accuracy classifiers. Unfortunately in text analysis this is not as effective. The options for looking at the problem from a different angle are limited, and the results of the classifiers are usually highly correlated. This makes the use of ensemble learning less practical and less useful.
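A quick way to check whether an ensemble is likely to help is to measure how often the individual classifiers already agree before combining them. The sketch below shows plain majority voting together with a pairwise agreement rate; the function names and the predictions are invented for the illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per classifier, aligned by document."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def agreement_rate(preds_a, preds_b):
    """Fraction of documents on which two classifiers predict the same label."""
    matches = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return matches / len(preds_a)

# Made-up predictions of three classifiers on four documents.
clf_a = ["positive", "negative", "positive", "positive"]
clf_b = ["positive", "negative", "positive", "negative"]
clf_c = ["positive", "positive", "neutral", "positive"]

print(majority_vote([clf_a, clf_b, clf_c]))
# ['positive', 'negative', 'positive', 'positive']
print(agreement_rate(clf_a, clf_b))  # 0.75
```

When the agreement rate between classifiers is close to 1, the vote almost always reproduces what each classifier would have said on its own, so the ensemble adds cost without adding much accuracy.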

Got a tip that I missed? Leave your comment below!

My name is Vasilis Vryniotis. I'm a Data Scientist, Software Engineer, Statistics & Machine Learning enthusiast, co-founder of WebSEOAnalytics.com and author of Datumbox Machine Learning Framework.

Latest Comments
  1. Timothy Potter

    Nice list … I’ve also had success with separating the passion from the polarity. In other words, decide if the text is positive or negative using the techniques you described above and then use other techniques to decide on the strength of the polarity, e.g. very positive or slightly negative, etc. For the latter, I’ve used mostly heuristics (which of course can lower recall), such as use of profanity, exclamation points, repeated letters (yummmm), etc.

    • Vasilis Vryniotis

      Hi @Timothy,

      Great suggestion, thanks for contributing. As an alternative to heuristics you can also use ordinal regression. Nevertheless, as you know, regression is usually more vulnerable to multicollinearity problems, which always appear in text classification problems by definition.

  2. Ali

    Thanks! I myself found some of these tips after a huge amount of trial and error! Especially the “Garbage-in Garbage-out” one!

    I think in the last tip you confused “ensemble learning” and “multi-view learning”. What you said about Vision applications, i.e. combining weak classifiers that look into the problem from different angles, is actually the definition of multi-view learning, and not ensembling.
    As far as I know, a classifier ensemble is made on top of different classifiers on different instances AND/OR using different parameters, but the feature set should be identical. If you use different feature sets, it is called multi-view learning.
    I think multi-view learning has no interpretation in text mining!
    (I’m not sure about it, I just inferred it from some machine vision papers. However, I agree that none of these techniques are useful in text analysis applications, or generally in any problem with big and sparse feature sets.)

  3. ranjith

    Sentiment analysis… why not extend this idea to assess stress levels in different scenarios and measure them quantitatively, so that a comparative analysis can be made?

  4. Ashleigh

    Great list! I would also add that you need to know the thresholds for the data set you are using. Using F-measure and BEP (among others) help identify what the best possible benchmark might be. If a data set is composed of all weak classifiers, based on parameters, scope and subject matter norms, you are probably shooting too high if you attempt to achieve 90% accuracy. Instead, understanding your data set can at the very best achieve a probable 70% accuracy will help you focus on what you can do to achieve that benchmark and not the 90%. In the process you may find that you can reach a higher level of accuracy but at least setting out with expectations based on the data you are working with will ensure you allocate the correct resources to the project and not get discouraged when you cannot reach an accuracy the data is not capable of.

  5. poonam

    Hi, great list! I am doing a master’s on Sentiment Analysis of tweets for social events. In your article you discuss supervised learning techniques but not unsupervised ones, so can you please give an idea regarding unsupervised techniques? As I am new to sentiment analysis, how should I start working on my project? Thanks. :)

    • Vasilis Vryniotis

      Hi @poonam,

      Since sentiment analysis is a classification problem, I mostly worked with supervised techniques. Most researchers follow this path, even though I suppose you could try using an unsupervised technique such as clustering.

      Good luck!

  6. amal


    I am new to sentiment analysis, but I wish to do my M.Tech thesis on sentiment analysis of products…

    Where should I start?
    Can you suggest the best method to do this?

    • Vasilis Vryniotis

      Hey Amal,

      I think you are in the right place. :)

      Check out the posts on this blog. I explain in detail how the Machine Learning models work, how to perform feature selection and even how to implement the algorithms.

      Good luck with your Thesis!

  7. Nani

    Great job! I read some of your writing on this site and it helped me a lot in understanding sentiment analysis.
    I have a question to ask.
    Do you know anything about implicit feature extraction? An implicit feature is a feature that does not appear explicitly in a sentence but is implied.
    I’m doing my research in this implicit feature extraction.
    If you have any idea or any relevant link to read or any relevant tools, please let me know.

  8. Daniel Einspanjer

    I just had to drop a line to let you know I found your blog post because I was debugging some rough sentiment analysis on real time tweets involving #bigdata, and I saw a tweet from @Datumbox pointing to your article. Unfortunately, the tweet was classified as slightly negative, which is incorrect. :) There is something intriguing about a sentiment analysis engine incorrectly classifying a tweet pointing to a blog post providing tips on improving sentiment analysis…

    • Vasilis Vryniotis

      Hey @Daniel,

      Hahaha… I like your comment… This situation was the implementation of Google’s random surfer model in real life. Let alone the irony… :p

      I think you are referring to the tweet “10 Tips for Sentiment Analysis projects: http://blog.datumbox.com/10-tips-for-sentiment-analysis-projects/ #machinelearning #bigdata”

      I understand the frustration; Sentiment Analysis is not always accurate and is sometimes very domain specific. Are you using a particular service or your own custom implementation? BTW check out the Datumbox classifier. The above tweet is classified as neutral, which I think makes sense since it does not specifically express a sentiment about the article. One could say that it is positive since “tips” carries a positive meaning; nevertheless this is not an opinion but rather the title of an article. So neutral is fine by my book. :)

  9. John Housty

    I am currently in my final year of my BSc Computer Science Program and for my project I’m focusing on sentiment Analysis. The question is which direction should I take? From what I’ve been reading the SA algorithms are still inefficient in terms of polarity, sarcastic statements etc. I also found that they are very domain specific.

    My question is: what is the current state of SA algorithms? Have they been improved? I’m contemplating which is more feasible: improving these algorithms or applying an algorithm to a specific context or domain.

    Much thanks for any advice you can offer!!

    • Vasilis Vryniotis

      Hi John,

      Indeed Sentiment Analysis is very domain specific. You can try using either Statistical approaches or lexicon/rule-based approaches. In this blog I discuss statistical machine learning approaches such as Max Entropy, Naive Bayes, Softmax Regression, SVM etc. The benefit of these approaches is that they are not language specific and most of them use a strong probabilistic framework. Given your background, I would advise you to focus on applying the previous techniques in a particular domain. Improving the algorithms usually requires a very good understanding and knowledge of the current methods and of Machine Learning in general, something that is not easy.

      Good luck!

  10. Jill

    What a great post! For learning based technique, we can get precision, recall and accuracy automatically by using weka for example. What about lexicon based technique such as using general inquirer or MPQA? Is there any standard formula to use? Really appreciate if you can give some references.

    • Vasilis Vryniotis

      Hi Jill,

      My knowledge of the best practices for lexicon based techniques is limited because I prefer the statistical methods. In any case, if you are doing classification, F1, precision, recall and accuracy can still be applied.


  11. Saraa Roy

    Hello Sir,
    I am new to the field of sentiment analysis. I want to know how to start working on this topic.
    I also want to do the analysis using BigData.

    • Vasilis Vryniotis

      Hi Saraa,

      In this blog you can find several articles on the subject. In each article I provide references that can help you explore the topic in greater depth.

      Good luck!

  12. MK

    I am going to do research on opinion mining and I think Twitter is the best corpus. Can you please suggest which new approach I should concentrate on? It is driving me nuts because the field is so vast. Please help.
