For my thesis project for the MSc in Statistics I focused on the problem of Sentiment Analysis. Sentiment Analysis is an application of Natural Language Processing that targets the identification of the sentiment (positive vs negative vs neutral), the subjectivity (objective vs subjective) and the emotional states expressed in a document. I worked on this project for over 9 months and used several different statistical methods and techniques under the supervision of Professors Tsiamyrtzis and Kakadiaris.
During my thesis I had the opportunity to learn about new machine learning techniques, but I also bumped into some interesting and non-obvious issues. In this article I discuss the things that I found most interesting while working on the Sentiment Analysis project, and I provide some tips that you should keep in mind while working on similar Natural Language Processing problems.
All of the tips and practices below were used to develop Datumbox’s Sentiment Analysis service, which powers our API.
1. Using Lexicon based VS Learning based techniques
Lexicon based techniques use a dictionary to perform entity-level sentiment analysis. They rely on dictionaries of words annotated with their semantic orientation (polarity and strength) and calculate a score for the polarity of the document. Usually this method gives high precision but low recall.
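To make the idea concrete, here is a minimal sketch of a lexicon based scorer. The tiny lexicon, its weights and the function names are made up for illustration; a real system would use a large annotated dictionary and would also handle negation and intensifiers.

```python
import re

# Hypothetical toy lexicon: word -> semantic orientation.
# The sign is the polarity, the magnitude is the strength.
LEXICON = {
    "good": 1.0, "great": 2.0, "excellent": 3.0,
    "bad": -1.0, "awful": -2.0, "terrible": -3.0,
}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the orientation of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)

def lexicon_label(text, lexicon=LEXICON):
    """Map the document score to a positive/negative/neutral label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Words that are missing from the lexicon contribute nothing to the score, which is one reason why this approach tends to have low recall.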
Learning based techniques require creating a model by training the classifier with labeled examples. This means that you must first gather a dataset with examples for positive, negative and neutral classes, extract the features/words from the examples and then train the algorithm based on the examples.
Which method you choose depends heavily on the application, domain and language. Lexicon based techniques with large dictionaries can achieve very good results. Nevertheless they require a lexicon, which is not available in every language. Learning based techniques, on the other hand, also deliver good results, but they require obtaining datasets and training a model.
2. Using Statistical VS Syntactic techniques
As above, when building a text analysis application you can choose between a Statistical technique and a Syntactic one. Syntactic techniques can deliver better accuracy because they make use of the syntactic rules of the language in order to detect the verbs, adjectives and nouns. Unfortunately such techniques depend heavily on the language of the document, and as a result the classifiers can’t be ported to other languages.
Statistical techniques, on the other hand, have a probabilistic background and focus on the relations between the words and the categories. They have 2 significant benefits over the Syntactic ones: we can use them in other languages with minor or no adaptations, and we can use Machine Translation of the original dataset and still get quite good results. This is obviously not possible with syntactic techniques.
3. Don’t forget the Neutral Class
While performing Sentiment Analysis most people tend to ignore the Neutral class and focus only on the positive and negative classes. Nevertheless it is important to understand that not all sentences carry a sentiment. Training the classifier to detect only 2 classes forces several neutral words to be classified as either positive or negative, which leads to overfitting.
As Koppel and Schler showed in their paper “The Importance of Neutral Examples for Learning Sentiment”, the neutral class not only should not be ignored, but it can also improve the overall accuracy of an SVM classifier. My research for my MSc thesis in this field also showed that the Max Entropy classifier can benefit from the neutral class. Within the next weeks I plan to publish an article on this.
4. Mind the Tokenization algorithm
How will you represent the documents? Are you going to take into account multiple occurrences of the same word? What type of tokenization will you use? Will you make use of the n-grams framework? If so, how many keyword combinations are you going to use?
There is no single answer to the above questions. The answers can change heavily depending on the topic, application and language. Thus make sure you run several preliminary tests to find the best algorithmic configuration.
Just remember that if you use the n-grams framework, n should not be too large. Particularly in Sentiment Analysis you will see that using 2-grams or 3-grams is more than enough and that increasing the number of keyword combinations can hurt the results. Moreover keep in mind that in Sentiment Analysis the number of occurrences of a word in the text does not make much of a difference. Usually the Binarized versions of the algorithms (occurrences clipped to 1) perform better than the ones that use multiple occurrences.
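The difference between counted and binarized n-gram features can be sketched as follows. The whitespace tokenizer and the function names are assumptions made for the example, not the tokenizer used in the thesis.

```python
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def ngrams(tokens, max_n=2):
    """Return all 1-grams up to max_n-grams as space-joined strings."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

def count_features(text, max_n=2):
    """Feature -> number of occurrences in the document."""
    return Counter(ngrams(tokenize(text), max_n))

def binarized_features(text, max_n=2):
    """Feature -> 1 if it occurs at all (occurrences clipped to 1)."""
    return {feature: 1 for feature in count_features(text, max_n)}
```

For the text "good good movie", `count_features` records "good" twice while `binarized_features` records it once. Raising `max_n` makes the feature space grow quickly, which is one reason to keep n small.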
5. Mind the Feature Selection algorithm
In learning based techniques, before training the classifier, you must select the words/features that you will use in your model. You can’t just use all the words that the tokenization algorithm returned, simply because many of them are irrelevant.
Two commonly used feature selection algorithms in Text Classification are Mutual Information and the Chi-square test. Each algorithm evaluates the keywords in a different way and thus leads to different selections. Each algorithm also requires different configuration, such as the level of statistical significance, the number of selected features etc. Again you must use trial and error to find the configuration that works best in your project.
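As a rough sketch, both scores can be computed from the same 2x2 table of document counts for one word and one class. The argument names are my own: `n11` is the number of documents that contain the word and belong to the class, `n10` contain the word but are outside the class, `n01` lack the word but belong to the class, and `n00` are neither.

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for the 2x2 word/class contingency table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between word presence and class membership."""
    n = n11 + n10 + n01 + n00
    cells = (
        # (joint count, word marginal, class marginal)
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    total = 0.0
    for joint, word_total, class_total in cells:
        if joint:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(n * joint / (word_total * class_total))
    return total
```

A word that is independent of the class scores 0 under both measures. In practice you would score every word, then either keep the top k words or keep those above a chosen significance threshold. These are exactly the configuration choices that need tuning.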
6. Different Classifiers deliver different results
Make sure you try as many classification methods as possible. Keep in mind that different algorithms deliver different results. Also note that some classifiers might work better with a specific feature selection configuration.
Generally it is expected that state of the art classification techniques such as SVM will outperform simpler techniques such as Naïve Bayes. Nevertheless be prepared to see the opposite. Sometimes Naïve Bayes is able to provide the same or even better results than more advanced methods. Don’t eliminate a classification model only because of its reputation.
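A simple baseline is cheap to build, so there is little reason to leave it out of your comparisons. Below is a minimal sketch of a binarized Naïve Bayes classifier with add-one smoothing. The class name, the whitespace tokenization and the decision to ignore unseen words are choices made for this example, not the implementation used in the thesis.

```python
import math
from collections import Counter, defaultdict

class BinarizedNaiveBayes:
    """Multinomial Naive Bayes where each word counts once per document."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)         # label -> number of documents
        self.word_counts = defaultdict(Counter)   # label -> word -> document count
        self.vocab = set()
        for text, label in zip(docs, labels):
            words = set(text.lower().split())     # binarize: clip occurrences to 1
            self.word_counts[label].update(words)
            self.vocab |= words
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        words = set(text.lower().split()) & self.vocab
        total_docs = sum(self.class_docs.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            logp = math.log(n_docs / total_docs)  # class prior
            for word in words:
                # Add-one (Laplace) smoothed word likelihood.
                count = self.word_counts[label][word]
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Usage is `BinarizedNaiveBayes().fit(train_texts, train_labels).predict(new_text)`, where the training lists are your own labeled examples.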
7. The domain/topic matters… a lot!
There is no single algorithm that performs well in all topics/domains/applications. Be prepared to see that the accuracy of your classifier can be as high as 90% in one domain/topic and as low as 60% in another.
For example you might find that Max Entropy with Chi-square feature selection is the best combination for restaurant reviews, while for Twitter the Binarized Naïve Bayes with Mutual Information feature selection outperforms even the SVMs. Be prepared to see lots of weird results. Particularly in the case of Twitter, avoid using lexicon based techniques, because users are known to use idioms, jargon and Twitter slang that heavily affect the polarity of the tweet.
8. Don’t expect every technique to work well for you
The best source of information for Sentiment Analysis is obviously the academic literature. Nevertheless don’t expect every suggested technique to work well for you. While the papers can usually point you in the right direction, some techniques work only in specific domains. Also keep in mind that not all papers are of the same quality and that some authors overstate or “optimize” their results. Don’t make the mistake of using a particular technique just because you found it in a paper. Ask yourself whether it delivers the results that you expect, or whether it makes your algorithm unnecessarily complicated and its results difficult to explain.
9. Garbage in – Garbage out
Be careful what datasets you use when you train your classifiers. Simply reading a few examples from the most commonly used Sentiment Analysis datasets will make you understand that they contain a lot of garbage. Some of the examples are too ambiguous, contain mixed sentiments or make comparisons, and thus they are not ideal for training.
Try to use human annotated datasets as much as possible rather than automatically extracted examples. Scraping structured reviews from various websites is also a problematic approach, so be extra careful what examples you use. Finally remember that, unless you know otherwise, the probability of classifying a document as positive, negative or neutral is equal. Thus the number of examples in each category of the dataset should be equal.
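One simple way to equalize the categories is to downsample every class to the size of the smallest one. The sketch below assumes the dataset is a list of `(text, label)` pairs; the function name and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict

def balance_by_downsampling(examples, seed=0):
    """Return a shuffled list with the same number of examples per label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced
```

Downsampling throws examples away, so with small datasets you may prefer to collect more examples for the rare classes instead.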
10. Ensemble learning might not be as powerful
One of the most powerful techniques for building highly accurate classifiers is ensemble learning, which combines the results of different classifiers. Ensemble learning has great applications in fields such as computer vision, where the same object can be represented in 3D, 2D, infrared etc. There, using several different weak classifiers that focus on different aspects of the data can help us build strong high-accuracy classifiers. Unfortunately in text analysis this is not as effective. The options for looking at the problem from a different angle are limited, and the results of the classifiers are usually highly correlated. This makes ensemble learning less practical and less useful.
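Before investing in an ensemble it is worth checking how often your classifiers agree. The sketch below shows a pairwise agreement measure and a plain majority vote. The function names are my own, and the vote is only one of several ways to combine classifiers.

```python
from collections import Counter

def agreement(preds_a, preds_b):
    """Fraction of documents on which two classifiers give the same label."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def majority_vote(*prediction_lists):
    """Combine classifiers by taking the most common label per document.

    When every classifier disagrees, the label of the first classifier wins.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]

def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)
```

If the agreement between your classifiers is high, they mostly make the same mistakes, and the vote has little room to correct any of them. In that case the ensemble can end up no better than the best single classifier.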
Got a tip that I missed? Leave your comment below!