Twitrratr: how to make a fuss over nothing
By Marguerite on Tuesday, 21 October 2008, 20:37 - In my WebOpinion
Today twitrratr, yet another instance of the so-called "semantic apps" flooding
the web these days, made quite a buzz.
I cannot resist quoting the presentation given on the "about" page of twitrratr:
" We wanted to keep things as simple as possible. We built a list of positive keywords and a list of negative keywords. We search Twitter for a keyword and the results we get back are crossreferenced against our adjective lists, then displayed accordingly. There are obvious issues with this, so if you have any ideas on how we could do this better let us know."
There is no need to demonstrate the weakness of this Twitter-based
application, since you can reach that conclusion yourself just by trying it.
Instead, I would like to offer a very basic linguistic point of view, in order to
avoid pointless amazement.
Let's do a simple little exercise with words taken from the
"positive" list and the "negative" list that twitrratr uses to classify tweets
automatically. Keep in mind that the "neutral" category is the catch-all
one, where tweets containing no positive or negative cluster are
classified.
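To make the approach concrete, here is a minimal sketch of the kind of keyword matching the "about" page describes. It is not twitrratr's actual code: the function name, the tiny lists (limited to the clusters quoted in this post) and the choice to check the positive list first are all assumptions made for illustration.

```python
# Hypothetical, shortened keyword lists; only clusters quoted in this post.
POSITIVE = ["awesome", "thank you", "hilarious"]
NEGATIVE = ["completely wrong", "nothing is"]


def classify(tweet):
    """Label a tweet by plain substring matching against the two lists.

    Assumption: the positive list is checked first; how the real
    application handles tweets matching both lists is not documented here.
    """
    text = tweet.lower()
    if any(cluster in text for cluster in POSITIVE):
        return "positive"
    if any(cluster in text for cluster in NEGATIVE):
        return "negative"
    # No cluster found: the tweet lands in the catch-all category.
    return "neutral"
```

Nothing in such a classifier looks at the words surrounding a cluster, which is exactly what the examples in this post exploit.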
1) semantic ambiguity, even in a 140-character message
Let's begin with two clusters taken from the "negative list" :
"completely wrong" and "nothing is".
What if you were to say (a) "Obama wasn't completely wrong" and (b) "I guess
nothing is better than that"? In (a), the negation that precedes the cluster
reverses its semantic orientation. In (b), the comparative
adjective "better (than)" does the same. These simple cases show how much
the context in which negative clusters appear matters.
The same goes for positive clusters, such as (c) "awesome" or (d) "thank
you": "Let's try this awesome shit" or "I thank you for letting me down".
(c) is a case of ironic utterance; so far, the automatic identification of
irony remains an unsolved problem, even for the best researchers in natural language
processing. (d) is an example of a sarcastic opinion expressed by the
speaker; sarcasm is as tough to process automatically as irony, because these
turns of phrase need context to be interpreted
properly.
One could think that short text messages tend to be easier to process, but
determining the semantic orientation of sentences is a difficult task to
accomplish without taking into account the grammatical relations between the
words.
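As a rough illustration of why surface tricks are not enough, here is a sketch of a common shortcut: looking for a negation word in a small window just before a negative cluster. This is a swapped-in heuristic, not real grammatical analysis and not anything twitrratr is known to do; the names, the negation list and the window size are all hypothetical.

```python
import re

# Hypothetical lists, for illustration only.
NEGATIONS = {"not", "no", "never"}
NEGATIVE_CLUSTERS = ["completely wrong", "nothing is"]


def tokenize(text):
    """Lowercase the text, split "n't" off as "not", and return word tokens."""
    text = text.lower().replace("n't", " not")
    return re.findall(r"[a-z']+", text)


def orientation(tweet, window=2):
    """Return 'negative', 'negated' or 'neutral' for a tweet.

    'negated' means a negative cluster was found with a negation word
    among the `window` tokens immediately before it.
    """
    tokens = tokenize(tweet)
    for cluster in NEGATIVE_CLUSTERS:
        words = cluster.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                preceding = tokens[max(0, i - window):i]
                if any(tok in NEGATIONS for tok in preceding):
                    return "negated"
                return "negative"
    return "neutral"
```

With these assumptions, example (a) comes out as "negated", but example (b) is still labelled "negative": nothing in a word window can tell that "better than" reverses "nothing is". Handling that case needs the relations between the words, not just their proximity.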
2) why natural language processing should be a priority for twitrratr developers
The best technologies for the automatic processing of subjective
content, such as those developed by CELI, can analyse the positive or negative orientation of
sentences. But this achievement requires several levels of linguistic
analysis, and the grammatical level, that is to say the relations between
words in a sentence, is not easy to represent. Why? Because this is natural
language, whose characteristics are ambiguity and semantic variation depending
on the context (textual context, social context, cultural context) in which
words occur.
That's why twitrratr developers need a linguist to avoid most of the mistakes in automatic classification of tweets, such as the following, found using the query "cartier" (classified in the positive tweets because of the positive cluster "hilarious"):
After reading this tweet, would you consider it a positive one?

Comments
Great post! We're workin on it
It's an interesting problem to solve, and
definitely not one that we could have solved in a weekend, but we think we have
a good start. If you have any references for us we'd love the info.