We discuss sentiment data models, significance of linguistic features, handling the noise in social conversations, industry challenges, important use cases and the appropriateness of over-simplified binary classification.
Vita Markman is currently employed as a Staff Software Engineer at LinkedIn, where she works on various natural language processing applications such as performing sentiment analysis of customer feedback and extracting relevant information from job postings. Before joining LinkedIn, she was a Staff Research Engineer at Samsung Research America. Prior to Samsung Research, she was employed as a Computational Linguist at Disney Interactive.
In addition, she conducts independent research on mining the language of social media. Her primary interests are focused on extracting topics and sentiment from micro-text – the short, snippet-like pieces of text found on Twitter, Facebook, and in various other social media sources.
Her educational background is in theoretical and computational linguistics (Rutgers, 2005). In addition to computational linguistics, she has a publication record in theoretical syntax and morphology, which was her primary area of research between 2002 and 2008.
I had the pleasure of attending her talk “Integrating Linguistic Features into Sentiment Models: Sentiment Mining in Social Media within Industry Setting” at the Sentiment Analysis Innovation Summit 2014 in San Francisco, CA. Here is the first part of my interview with her:
Anmol Rajpurohit Q1. How do the sentiment data models benefit from the integration of linguistic features? Why do we need to go beyond the word (i.e. its dictionary meaning)? Can you share a few examples?
Vita Markman: To answer this question let me first offer a working definition of “linguistic features” that I used throughout the talk. I use the term “linguistic features” to refer to “features beyond the word-level”; in other words, phrase-level features. For example, a classic word-level feature commonly used in sentiment analysis models is an adjective that bears positive or negative semantic orientation, such as “great” or “terrible”. In contrast, a phrase-level feature is a linguistic unit composed of several words, a phrase, that conveys positive or negative orientation only in its entirety. For example, in a domain related to deliveries of products, “on time” is a phrase-level feature, where neither the meaning of “on” nor the meaning of “time” alone conveys the sentiment; rather, “on time” is the phrasal unit that suggests (in this case) positive orientation, as in “delivery arrived on time”.
Other similar examples include phrases such as “as described”, “as advertised”, “met expectations”, “arrived in pieces”. When reviews are very short, using features beyond the word level becomes critical because many data points do not contain any adjectives or other obvious sentiment-bearing words. For example, in the dataset of feedback postings for merchants at Amazon.com (at http://economining.stern.nyu.edu/datasets.html), which was used in the experiment described in the talk, roughly 20% of the sampled transaction reviews do not contain any obvious sentiment-bearing vocabulary. The feedback postings are for the most part micro-reviews, i.e. reviews that consist of a single short sentence such as “transaction met expectations” or “arrived as promised”, making it hard or even impossible to single out a specific word that contributes to the sentiment of the review.
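To make the distinction between word-level and phrase-level features concrete, here is a minimal sketch in Python. It is not the feature extractor used in the talk; the function name `extract_features` and the choice of plain whitespace tokenization with contiguous word n-grams are assumptions made for illustration.

```python
from collections import Counter


def extract_features(text, max_n=2):
    """Count word-level (unigram) and phrase-level (n-gram) features.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, and phrases are contiguous word sequences up to max_n words.
    """
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


feats = extract_features("delivery arrived on time")
# The unigrams "on" and "time" carry no sentiment on their own,
# but the bigram "on time" is now available to the model as one feature.
print(sorted(feats))
```

With `max_n=2` the review contributes four single-word features and three two-word features, so a classifier can learn a weight for “on time” as a unit.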
AR: Q2. Noise is one of the biggest deterrents when dealing with the massive volume of data from social conversations. What are your recommendations for identification and removal of noise while text-mining social data?
VM: First, there is the rather obvious recommendation of normalizing misspellings as much as possible, removing uninformative punctuation, and normalizing important information-bearing emoticons such as “:-)” and “:-(”. After that, the subsequent pre-processing steps will largely depend on the application and on how the data will be used. For example, if the social data is to be used in a bag of words model, then removing frequent stop words such as articles (‘the’, ‘a’) and some prepositions (‘of’, ‘to’) is an important next step. In addition to the basic stop word removal (if one chooses to go with a bag of words model), I would also recommend doing more data-specific text pre-processing. Namely, each data set comes from some domain that contains words that are very frequent for that domain. They may not be your regular stop words, but they contribute little to no information and may actually hurt the model. Specifically, in the case of the Amazon feedback data, words such as “amazon” are actually not informative because they appear in too many reviews, i.e. have too high a document count. Removing or at least discounting these words may be of use.
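The idea of domain-specific stop words can be sketched as a document-count filter. The snippet below is only an illustration: the helper name `high_document_count_tokens`, the cutoff parameter `max_doc_ratio`, and the four toy reviews are all assumptions, not part of the experiment described in the talk.

```python
from collections import Counter


def high_document_count_tokens(documents, max_doc_ratio=0.25):
    """Return tokens that occur in more than max_doc_ratio of the documents.

    Hypothetical helper: each token is counted once per document
    (document count, not raw frequency), using a whitespace split.
    """
    doc_counts = Counter()
    for doc in documents:
        doc_counts.update(set(doc.lower().split()))
    limit = max_doc_ratio * len(documents)
    return {tok for tok, count in doc_counts.items() if count > limit}


# Toy data, invented for illustration only.
reviews = [
    "amazon order arrived on time",
    "amazon seller was great",
    "terrible packaging from amazon",
    "item as described",
]
# With only four documents, a higher cutoff is used so the example is readable.
domain_stop_words = high_document_count_tokens(reviews, max_doc_ratio=0.5)
print(domain_stop_words)
```

Whether such tokens are removed outright or merely down-weighted is a modeling choice; the sketch only identifies the candidates.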
In the past, I have used the rule of thumb where if a token appears in more than 20%-25% of all documents, it can be treated as a stop word. That said, if one plans to use a model that relies on phrase-level features such as “on time” or “as described”, then removing stop words such as the preposition “on” will be detrimental. For models that use phrase-level features, I would recommend leaving stop words as they are, but performing other noise-reduction steps such as recognizing and then normalizing negation words such as “not”, “no”, “don’t”, “doesn’t”, etc. into a single form “NOT”. Negation contributes very important information, especially to sentiment models, and should be retained and ideally normalized. That way, phrases such as “package not arrived” and “package never arrived” are mapped to a single form “Package NOT arrived”. Performing some simple and very careful lemmatization such that “arrive” and “arrived” are mapped to the same form “arrive” is also helpful.
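Negation normalization of this kind can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the pipeline from the talk: the `NEGATIONS` set is a small hand-picked sample, the function name `normalize_negation` is hypothetical, and the output is lowercased apart from the “NOT” marker.

```python
import re

# A small illustrative sample; a real list would be longer and domain-tuned.
NEGATIONS = {"not", "no", "never", "don't", "doesn't"}


def normalize_negation(text):
    """Map every negation word to the single token "NOT".

    Hypothetical helper: stop words are deliberately left in place so that
    phrase-level features such as "on time" survive.
    """
    text = text.replace("\u2019", "'")  # unify curly and straight apostrophes
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join("NOT" if tok in NEGATIONS else tok for tok in tokens)


print(normalize_negation("Package not arrived"))
print(normalize_negation("package never arrived"))
```

Both calls produce the same string, so downstream phrase-level features see a single negated form instead of several surface variants.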
In sum, normalizing misspellings, removing basic stop words/words with high document count, and normalizing negation are some of the key preprocessing steps for text-mining social data.
AR: Q3. What are the unique challenges of sentiment mining in social media within “industry setting”? Currently, what kind of sentiment mining use cases are of the most interest to the industry?
VM: There are several challenges posed by industry setting. I will focus on two.
First, industry often demands a quick turnaround time. A simple model that works and gets shipped quickly is oftentimes preferable to a more nuanced and complex model that takes longer to build, especially if the simpler model is amenable to iterative improvements. Practitioners and researchers in industry are driven to create models that work well, yet are developed/trained quickly, which is challenging.
The second constraint of industry is obtaining good labeled data, where “labeled” and “good” are the key operative words. While in today’s world raw data is prolific, labeled data is not. Obtaining “good” labeled data requires many hours spent by senior team members on annotation code-book design as well as on implementation of inter-annotator agreement metrics such as the pairwise Kappa statistic. Creating a proper annotation code-book is critical for the development of a solid machine learning model. Annotation code-books are particularly difficult and labor-intensive to design for tasks that are vague and subjective, such as sentiment labeling. In addition, there may be legal issues surrounding the release of raw data for crowd-sourcing: the data may be proprietary, so using third-party annotators may not be possible.
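For readers unfamiliar with agreement metrics, here is a minimal sketch of a pairwise Kappa computation. It assumes the statistic meant is Cohen's kappa for two annotators, which the interview does not specify; the function name `cohen_kappa` and the toy labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration only.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Low values on a pilot batch are usually a sign that the code-book needs revision before more data is labeled.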
In sum, some of the challenges posed by industry setting relate to developing good models in a short period of time and to obtaining solid labeled training data for the task.
Currently, some of the more interesting use cases in industry involve aspect-based sentiment modeling, where sentiment is attributed not to the product as a whole but to each component (aspect) of the product. For example, a hotel review is more informative if each aspect, such as room quality, location, and amenities, is given a separate sentiment score rather than the entire hotel receiving a single sentiment score.
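A small data-structure sketch shows why per-aspect scores carry more information than one overall score. Everything in it is hypothetical: the class name `AspectReview`, the score range of -1.0 to 1.0, and the example values are assumptions for illustration, not output from any real model.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AspectReview:
    """A review with one sentiment score per aspect (hypothetical layout)."""
    text: str
    aspect_scores: dict = field(default_factory=dict)  # aspect -> score in [-1, 1]

    def overall(self):
        """Collapse the aspect scores into a single number by averaging."""
        return mean(self.aspect_scores.values())


# Illustrative values only, not produced by a real classifier.
review = AspectReview(
    text="Great location, but the room was dirty.",
    aspect_scores={"location": 1.0, "room quality": -1.0, "amenities": 0.0},
)
print(review.aspect_scores)
print(review.overall())
```

The single averaged score comes out neutral and hides the fact that the reviewer liked the location and disliked the room, which is exactly the detail an aspect-based model is meant to keep.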
AR: Q4. Throughout your talk at Sentiment Analysis Innovation Summit you were referring to the goal of identifying a text snippet from customer’s review comments as positive or negative. This binary classification of sentiment seems to be an over-simplification. Do you agree? Would it be better to measure sentiment as a percentage or other more comprehensive metric rather than as binary?
VM: Completely agree. Binary classification for sentiment is a great over-simplification and is done solely for the purpose of getting some preliminary phrasal features that indicate positive or negative sentiment. At least three-way classification such as “good”, “neutral”, and “bad” would be better to start with. That said, “neutral” is a very hard class to define, as it is often “neutral-good” or “neutral-bad”. Given the brevity of the reviews I was working with, inferring these distinctions from the text alone was near impossible. However, if there were a way to obtain a reliable set of three-way classifications, or a reliable finer grained classification, it would be preferred.