MyLinguistics - Marguerite Leenhardt's Web Log


Saturday, November 12, 2011

"Better filters will play a big part"... towards an "expert + system" model in content analysis solutions

During the last couple of years, there have been brief bursts of content issues here and there, impacting search as well as content analysis. In a recent column for Sparksheet.com, Karyn Campbell (The IdeaList) took an interesting stand: whatever 3.0 looks like, better filters will play a big part; professional, human filters will play an integral role in the next web after all. I bet she has a good nose for this!

Well, indeed, this makes sense and resonates with other clues out there.

Remember: two years ago, Yahoo! patented human intervention through a "human editor ranking system" in its engine. At the time, their point was that such a process yields more refined results. The idea that, for qualitative results with high expectations in terms of accuracy and precision, you need human experts in the game, well, that idea has made its way. Better filters.

About one year later, one of the Pew Internet studies emphasized that :

Information overload is here, which means anyone with an interest in making sure their news reaches people has to pay close attention to how news now flows and to the production and usage of better filters.

Better filters, again! In a March 2010 Researcher's column by Martin Hayward, some ideas bring grist to our mill:

The real stars will be those who can make sense of, and draw insight from, vast amounts of data quickly and reliably. We have to move from being an industry where value was derived from providing scarce information, to one where value is derived from connecting and interpreting the vast amounts of information available, to help clients make better business decisions faster.

What could this mean for content analysis now, which has one foot in search issues and the other in qualitative content analysis and curation issues? More specifically, what would this mean for the business applications of content analysis, such as trend monitoring solutions, sentiment analysis and other types of applications dealing with one of the biggest amounts of information available, namely User Generated Content from the social media areas of the web?

Back in 2009, Asi Sharabi painted a realistic but critical portrait of social media monitoring solutions. The systems may have improved since then, but several of the issues he raised are more relevant than ever:
  • "Unreliable data" : where do the most part of your brand's mentions come from ? is there any feature allowing you to make a distinction between spam messages, deceptive reviews and the spontaneous conversational material you'd like to meaningfully draw insights from ? Rhetoric question, of course there's not such a feature.
  • "Sentiment analysis is flawed" : even if there is progress on the subject, the idea that fully-automated systems are costly to set up, train and adapt from a domain to another has also made its way, which benefits to a different approach : defining a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis.
  • "Time consuming" : Asi Sharabi put it well, saying it may take "hours and days" to accurately configure a dashboard. Is this time-consuming step a proper and adequate one to put on any end-user working in a social media, communication or marketing department ?  As suggested by the author, at some point, it would be more profitable for the client to pay an analyst to do the job.
No, unfortunately, the situation has not evolved tremendously since then. Just ask some social media analysts who deal with dashboards and have to provide qualitative insight; well, maybe I just attract the bad-tempered ones. So, what can be said after that?
A few more words. Making faster but accurate and congruent business decisions and recommendations using content analysis solutions is not the core of the problem. The core of the problem more likely lies in setting up an appropriate workflow, with a single main idea: expert systems need experts, and they need them upstream and downstream of the data analysis process. Data scientists' skills are without any doubt one of the keys to a "better filtering" of content, in order to provide, curate and analyse truly qualitative content.

Tuesday, September 27, 2011

Linguistic Resources for French: does size really matter?

As a web worker and a qualitative data analyst, most of my time is spent analyzing written French online, whether published in newspapers, forums, blogs or SNSs. As a linguist, it can be fun, stimulating or painful to see how far the linguistic norm and actual usage differ from each other. Hence the idea to take a closer look at it. For this little test, I chose to run a basic lexical evaluation, comparing the coverage of standard linguistic resources for French.

  • The data

Let's try to be representative with the data to be tested ! Well, that might be easier said than done, but let's give it a try with the following datasets :

- around 800 000 words from an online newspaper corpus covering the 2009 political news, including both articles and users' comments ;
- over 400 000 words from a popular French forum.

In the end, the dataset contains about 1 200 000 words: this is pure raw text from the Internet, without any pre-processing such as lemmatization, orthographic correction, case harmonization or any other mutilation of the raw text material.

  • The reference Linguistic Resources

Among other questionable choices, mine was to take the DELA and the Morphalou linguistic resources as a gold standard for a raw lexical comparison.

  1. The DELA dictionary is a dictionary of lemmas and their inflected forms for French (and also English, but obviously not used here), designed by the LIGM team from MLV University, France. This resource contains around 700 000 entries.
  2. The Morphalou lexicon is developed and maintained by the CNRTL and deeply linked to the research work of the UMR ATILF of Nancy University, France. This resource is another reference dataset for inflected forms of the French language, and contains over 500 000 entries.

Here's how they are used by a nice little Perl script :

  1. first, the words of the raw dataset are brutally segmented to obtain one word per line in the input file ;
  2. the result is stored in a hashtable for subsequent use ;
  3. in the same way, the entries of each resource are stored in a specific hashtable, so we have all the entries from DELA in one hashtable and all the entries from Morphalou in another.

In case you're asking yourself "why hashtables": the point of this data structure is to store words as keys with no meaningful value, so that checks on keys, especially existence checks, are fast. It was the best way to do it in Perl, from my non-dev point of view.
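To make this a bit more concrete, here is a simplified Perl sketch of the approach described above. The file names are placeholders and this is not the actual script, just the general idea:

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder file names: one token / one entry per line in each file.
my $corpus_file = 'corpus_tokens.txt';      # raw dataset, brutally segmented
my $dela_file   = 'dela_forms.txt';         # DELA inflected forms
my $morph_file  = 'morphalou_forms.txt';    # Morphalou inflected forms

# Store a word list in a hash: words as keys, no meaningful value,
# so that exists() gives a fast membership check.
sub load_words {
    my ($path) = @_;
    my %set;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $set{$line} = undef if length $line;
    }
    close $fh;
    return \%set;
}

my $corpus    = load_words($corpus_file);
my $dela      = load_words($dela_file);
my $morphalou = load_words($morph_file);

# Corpus words unknown to each resource.
my (%unk_dela, %unk_morph);
for my $word (keys %$corpus) {
    $unk_dela{$word}  = 1 unless exists $dela->{$word};
    $unk_morph{$word} = 1 unless exists $morphalou->{$word};
}

# One possible reading of the "shared" percentage: the size of the
# intersection relative to the larger of the two unknown-word sets.
my $shared = grep { exists $unk_morph{$_} } keys %unk_dela;

print "unknown words for DELA : ",      scalar(keys %unk_dela),  "\n";
print "unknown words for Morphalou : ", scalar(keys %unk_morph), "\n";
printf "percentage of shared unknown words : %s %%\n",
    100 * $shared / scalar(keys %unk_morph);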

  • The rough encounter

Yes, the script aimed at comparing the raw material with each of the resources considered. So... Tadaa! The results!

Three objectives here :

  1. Determine the number of words that were not recognized by DELA
  2. Then, the same evaluation for Morphalou
  3. Give the percentage of words that are recognized by neither of the two reference linguistic resources

So, here are my little Perl script results :

unknown words for DELA : 42719
unknown words for Morphalou : 43906
percentage of shared unknown words : 97.296497061905 %
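A quick side note on that last figure: 100 × 42719 / 43906 ≈ 97.2965 %, so the shared-unknown percentage appears to be computed against the larger of the two unknown-word sets; in other words, nearly every word form that DELA fails to recognize is also missed by Morphalou.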

From those results, let's highlight three points :

- first, French web writers seem to be not that bad at spelling French online, as fewer than 50 000 occurrences are not recognized by the linguistic resources ;
- second, this means that those resources are very close in terms of coverage, even though DELA has 200 000 more entries than Morphalou: the difference on the shared unknown words is a bit less than 3% ; so the size of the resource is not that important (at least for the purpose of this little test) ;
- third, as some linguist colleagues told me at the latest AFLS conference I attended, the linguistic performance problems in current French are more salient on the (micro and macro) syntactic level (e.g. inflectional agreement in gender and number): this partly explains why the results of this basic lexicon-based evaluation are quite good. However, this is far from sufficient to determine what the current level of orthographic performance is.

Friday, July 22, 2011

Putting forth the benefits of Textometry : Vegas baby !

In my life as a PhD student, publishing as a single author is a satisfying way to give visibility to my research work. But I recently discovered that collaborating with lab mates you share interests with can be a lot of fun! Well, as long as you're as lucky as I am to find lab mates who work as hard as you do and share your drive to get things done. This is the EMM + ML combo!

So, lucky me to have had such a great first long-standing research collaboration: our first co-authored paper got the warmest of welcomes at the ILINTEC'11 Workshop on Intelligent Linguistic Technologies, which took place in Las Vegas earlier this week. The ILINTEC'11 Workshop was one of the events of the 2011 World Congress in Computer Science, Computer Engineering, and Applied Computing, within the ICAI'11 International Conference on Artificial Intelligence. To provide more context on the event, here are a few descriptive lines:

The core idea of ILINTEC’11 is to bring together researchers who explore different paradigms of language and speech processing; special emphasis is laid on interaction of stochastic techniques and logical methods. ILINTEC 11 is a unique opportunity to discuss the problems of natural language processing in immediate contact with the leading research and development teams from universities and industry engaged in information technology projects and various fields of Computer Science.

Our aim was to show how interesting textometric methods can be for information discovery and web mining tasks, from an academic point of view but also from an industrial one, as both EMM and I are completing our PhDs in an industrial context. So, below is the link to the Slideshare version of the presentation given at ILINTEC: hope you'll enjoy it! PostScriptum: Slideshare embedding in DotClear is a bit capricious right now, I'll update as soon as possible to provide you with the embedded presentation here.

Monday, June 6, 2011

3D motion + speech2text + translation memory = towards innovative broadcast services

Just found this info tweeted by @TheNextWeb: Japanese researchers invent automatic animated sign language system, and I just had to blog about it!

As you may not know, apart from my research work on text analytics methodologies, I studied speech processing until, a few years ago, the rigid curricula of the French university system forced me to choose between specializing in Natural Language Processing applied to textual material or applied to speech material.

I still have a strong interest in what goes on in the field of speech processing and its applications (conversational agents, lip-sync systems, vocal search engines) even though I work on textual material for now. And I particularly enjoy applications that merge text and speech processing. So I could not help but be drawn into writing these lines on the latest innovative development by the NHK Science & Technology Research Laboratories, which is, imho, just an awesome example of what can be done by merging text and speech processing. Let's take a closer look:

The NHK Science & Technology Research Laboratories is coming up with technology that automatically generates animated sign language in order to expand sign language in news broadcasts.

Simply put, it is almost like a lip-sync system, but for the hands :) The system is actually built on a text-to-text correspondence module that converts Japanese text into signed text; another correspondence module then maps text spans to "hand-codes" (I don't know the exact term, and suggest this one by analogy with "mouth-codes", used in animation for lip-sync system development).

The cherry-on-top idea? Incorporating a translation memory to enhance the system's outputs with expert knowledge: this materializes as a user interface through which a human can enrich the lexicon or refine the combination rules for hand gestures.

Oh yes! I teased with "speech2text" but wait... there is no speech-to-text module in this system! Let's think about it: only one brick is missing! Indeed, once the speech signal's complexity is reduced to text material (words, phrases or any other accurate text span), the whole system would be able to deal with speech material as input. Developing this kind of phonetization process is not an issue in itself nowadays.

And if we think a bit further, I'd say it is reasonable to hope that this kind of system will handle "text2speech" outputs too, even if "text2speech" is not as easy to handle for now, if one expects a natural, non-robotic output. That would be very useful for blind people (of course, they can hear broadcast news, but hey, what if they want to refresh their experience of accessing written info on the web?), social gaming applications (texting messages to your animated and talking avatar while being temporarily or permanently speechless, so that it can talk in-game) or home automation applications (texting messages to your home that are displayed with your avatar and voice in the end, for example), to mention just a few. #I skip the 3D motion part, as I am completely inexperienced in this domain#

I am quietly but eagerly waiting for this kind of initiative to develop and reach the mainstream audience. Startup founders with NLProc backgrounds in text AND speech processing should begin to combine their skills and think about the next opportunities to come up with an innovative solution: multimodal NLProc is on its way :)

Wednesday, May 4, 2011

Opinion Mining & Sentiment Analysis, or what sets up a hot topic

The Sentiment Analysis Symposium was a great experience for me! Back in Paris, I first thought of updating my last post on Opinion Mining and Sentiment Analysis. But the update grew heavier and heavier, so here's an enhanced one.

Context

For more than a decade now, researchers from Text and Data Analytics, Computer Science, Computational Linguistics and Natural Language Processing, among others, have been working on technologies that could lead to analyzing how people feel or what people think about something. In the current period, a great number of commercial offerings have been built on what is still to be taken as a Research Program. Here are some basic clues to get an idea of how this kind of content analysis technology works.

One of the major issues in dealing with huge amounts of User-Generated Content published online (also referred to as UGC) is mining opinions, which means detecting their polarity, knowing the target(s) they aim at and the arguments they rely on. Opinion Mining/Sentiment Analysis tools are, simply put, derived from Information Extraction (such as Named Entity detection) and Natural Language Processing technologies (such as syntactic parsing). Given this, they basically work like an enhanced search engine with complex data calculation abilities and knowledge bases.

Applications with pieces of linguistics inside

Four types of applications are put forth in Pang & Lee's (2008) reference survey:

  1. those seeking customer insight, on movie or product review websites or in social networks ;
  2. the specific integrations within CRM (Customer Relationship Management) or e-commerce systems ;
  3. the strategic foresight and e-reputation applications ;
  4. and last but not least, political discourse analysis.

Automated text summarization also stands as a very promising subtask, as it is currently deeply linked to data visualization for information summarization.

Among the numerous problems related to Opinion Mining and Sentiment Analysis systems addressed in Pang & Lee (2008), I would pinpoint two of particular interest from a linguistic point of view:

  1. linguistic features (e.g. syntactic properties and negation modeling) and statistical features (e.g. the type/token distribution within large amounts of text) as an important issue for system improvement ;
  2. current processes for adapting Linguistic Resources (such as lexicons or dictionaries) to various domains as an impediment to cost-cutting and reusability.

Not as easy as it seems

Indeed, the Social Media industry expresses a growing interest in, and need for, NLP technologies to overcome issues such as accuracy, robustness and multilingualism. Sentiment Analysis & Opinion Mining became a promising business field a couple of years ago, as a very well documented post by Doug Henschen for Information Week explains.

But quick recipes are easily found on the web, as shown by a glance at Quora's « How does Sentiment Analysis Work ? » thread. Also, a Manichean way of viewing things, which implies an insuperable dichotomy between ''Linguistic Resources'' and ''Machine Learning'', is widespread in the industry right now. Writing about the insights from the latest Sentiment Analysis Symposium, Neil Glassman puts forth that there is a way

« Between those on one side who feel the accuracy of automated sentiment analysis is sufficient and those on the other side who feel we can only rely on human analysis », adding that « most in the field concur with [the idea that] we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. »

So the word is spread !

Putting forth the benefits of Textometry

Textometry is one of the major steps towards the new methodologies needed to achieve such a goal. Simply put, it is a branch of the statistical study of linguistic data in which a text is considered as possessing its own internal structure. Textometric methods and tools make it possible to bypass the information extraction step (qualitative coding), by:

  • applying statistical and probabilistic calculations to the units that make up comparable texts in a corpus (see the little sketch after this list) ;
  • providing robust methods for processing data without external resource constraints (lexicons, dictionaries, ontologies, for example) ;
  • analyzing the distribution of objects within the corpus framework ;
  • improving the workflow of building corpus-driven Linguistic Resources that can be projected onto the data and incrementally enhanced for various purposes, such as Named Entity Recognition and paraphrase matching, resources for deep thematic analysis, and resources for opinion analysis.
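To make the first and third bullets above a little more concrete, here is a toy Perl sketch of the underlying idea: comparing how units are distributed across the parts of a corpus. It is deliberately over-simplified (a made-up two-part corpus and a crude over-representation ratio); real textometric tools rely on proper specificity scores rather than this shortcut.

#!/usr/bin/perl
use strict;
use warnings;

# A toy corpus split into two "parts" (e.g. two discussion threads).
my %parts = (
    part_A => "the film was great great fun",
    part_B => "the service was slow and the food was cold",
);

# Count token frequencies per part and over the whole corpus.
my (%freq_by_part, %freq_total, %part_size);
for my $part (keys %parts) {
    for my $token (split /\s+/, lc $parts{$part}) {
        $freq_by_part{$part}{$token}++;
        $freq_total{$token}++;
        $part_size{$part}++;
    }
}
my $corpus_size = 0;
$corpus_size += $_ for values %part_size;

# Flag tokens whose relative frequency in a part clearly exceeds their
# relative frequency in the whole corpus (a crude over-representation cue).
for my $part (sort keys %parts) {
    for my $token (sort keys %{ $freq_by_part{$part} }) {
        my $rel_part   = $freq_by_part{$part}{$token} / $part_size{$part};
        my $rel_corpus = $freq_total{$token} / $corpus_size;
        printf "%s over-represented in %s (%.2f vs %.2f overall)\n",
            $token, $part, $rel_part, $rel_corpus
            if $rel_part > 2 * $rel_corpus;
    }
}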

Kurt Williams, Mindshare Technologies CTO, accurately wraps it up as follows :

« Using Textometry to leverage opinion analysis: it can be used to cluster together authors who share similar opinions. One approach for improving opinion mining is, rather than starting with the individual-level phrases, to start with the context of the conversation first. In other words, many approaches skip the step of analyzing the context of the text. »

Please find out more in the following presentation, given at the Sentiment Analysis Symposium.

So this must be what makes a hot topic: an emerging market, industrial R&D and academics chasing better solutions and improved systems, and a multidisciplinary field of interest!

Post scriptum

Special thanks to Seth Grimes who chaired the Sentiment Analysis Symposium and Neil Glassman who nicely quoted me in his post.

Post Update: just to let you know that Seth Grimes kindly provides videos of the SAS'11 Talks and Lightning Talks. You can find my French-accented talk here :)

Monday, March 14, 2011

Sentiment Analysis, Opinion Mining & neophyte basics

For more than a decade now, researchers from Text and Data Analytics, Computer Science, Computational Linguistics and Natural Language Processing, among others, have been working on technologies that could lead to analyzing how people feel or what people think about something. In the current period, lots and lots of commercial offerings have been built on what I think one should still call a Research Program. Here are some basic clues to get an idea of how this kind of content analysis technology works.

One of the major issues in dealing with huge amounts of User-Generated Content published online (also referred to as UGC) is mining opinions, which means detecting their polarity, knowing the target(s) they aim at and the arguments they rely on. Opinion Mining/Sentiment Analysis tools are, simply put, derived from Information Extraction (such as Named Entity detection) and Natural Language Processing technologies (such as syntactic parsing). Given this, they basically work like an enhanced search engine with complex data calculation abilities and knowledge bases.

But dealing with the data emphasizes the fact that understanding "how sentiment analysis works" is more a linguistic modeling problem than a computational one. The "keywords" or "bag-of-words" approach is the most commonly used because it relies on a simplistic representation of how opinions and sentiments can be expressed. In its most simplistic form, it consists in detecting, in UGC, words from a set labeled as "positive" or "negative": this method remains unable to solve most "simple" ambiguity problems (here is an example that illustrates this quite well, I guess).
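To illustrate, here is a toy Perl sketch of such a "bag-of-words" scorer, with made-up word lists of mine (not an actual resource). It gives a clearly negative sentence the same score as a positive one, simply because it has no notion of negation scope:

#!/usr/bin/perl
use strict;
use warnings;

# Toy polarity lexicons: words as hash keys for fast existence checks.
my (%positive, %negative);
@positive{qw(good great excellent love like)}  = ();
@negative{qw(bad awful terrible hate dislike)} = ();

# Naive bag-of-words scoring: +1 per positive word, -1 per negative word.
sub naive_polarity {
    my ($text) = @_;
    my $score = 0;
    for my $token (split /\W+/, lc $text) {
        $score++ if exists $positive{$token};
        $score-- if exists $negative{$token};
    }
    return $score;
}

# The second sentence is clearly negative, but the scorer only sees "good"
# and ignores the negation: this is the ambiguity problem in a nutshell.
print naive_polarity("This camera is really good"), "\n";      # prints 1
print naive_polarity("This camera is not good at all"), "\n";  # prints 1 too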

Most Opinion Mining tasks focus on the local linguistic determination of opinion expression, which is partly constrained by external resources and thus often runs into problems such as limited dictionary coverage and, at a higher level, domain dependence. Contextual analysis is still a challenge, as you will find in the following reference book: Bo Pang, Lillian Lee, Opinion Mining and Sentiment Analysis, Now Publishers Inc., 2008, 135 pages, ISSN 1554-0669.

As a temporary conclusion, I would say that accuracy remains the major challenge for this industry's development. In fact, in such analysis systems, some "simple" linguistic phenomena are still difficult to model and implement, for example the negation scope problem, that is, how to deal with negative turns of phrase. Another problem for system accuracy is the analysis methodology itself. Fully human methods are costly, but fully automated ones are inaccurate: you need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis.

Sunday, March 6, 2011

Human-machine communication [link update]

The culmination of a piece of work formalized in my first paper. It dates back to the very beginning of my Master's, in 2007-2008: it has been refreshed and somewhat proofread since. The link has been updated, so I am taking the opportunity to let you (re)discover this work.

It is an analysis of a corpus of interactions made available by the SNCF (for the record, a corpus from 1985: it is thus the same age as me ^^), in which users call a telephone switchboard to ask for information about train traffic, make a reservation or confirm a timetable, for example.

The results are obtained with textometric computation tools (correspondence analysis and specificities, in particular), using the reference software Lexico3. I approach the whole thing from the point of view of conversation analysis: so it is a genuinely interdisciplinary approach.

The objective? To get leads for studying the adjustment between participants (or how the human adjusts their speech to the voice server). And above all, to illustrate my concluding claim:

Textometric analysis can be used to carry out comparisons at varying levels of granularity, making it possible not to dissociate the analysis of the local and global dimensions of the corpus.

Feel free to discover the journal Lexicometrica, in which this paper has been accepted for publication :)

Tuesday, February 15, 2011

Reprises, textual interactions, asynchronous exchanges

These are the three keywords that sum up the presentation of one of my ongoing research projects. The objective? To describe and model the linguistic phenomena related to conversational coherence in asynchronous exchanges on the Internet.

I had the chance to attend a somewhat special study day: the one marking the official launch of the CLESTHIA research federation.

It was extremely interesting to see another side of "research in motion", the side that steps out of its laboratory to build relationships of emulation and exchange between researchers from related fields. So there we were: translation scholars, specialists in discourse analysis (political, literary, press), linguists with a fine knowledge of spoken French, and a few NLP people in the middle. All of us there to exchange views on a very interesting topic, that of "reported speech", of "the discourse of the other", each presentation being an opportunity to better understand how the others apprehend our object of study.

So I was very lucky to be able to present my modest work in front of such an audience! And yes, it is always stressful, when you come back from industry (and client presentations), to go and talk about a "hard-core" linguistics problem in front of dozens of seasoned linguists... I was also very lucky that this work was well received, so I am sharing it with you :)

Brief intro: I am currently working on forums (textual interactions, asynchronous exchanges), from an Opinion Mining perspective (of which e-reputation is an ersatz, if we take the meaning our philosopher friends give to that term).

Happy reading, and feel free to contact me if this interests you :)

Tuesday, January 18, 2011

Glozz, a tool for glossing freely

The art of glossing is first of all the art of annotation, and it is somewhat the vehicle of the everyday hermeneutics deployed by the <insert-random-word>-analysts who produce studies and other analysis reports. Or by linguists who work on corpora. For all of them, then, Glozz should be of interest.

Glozz is a freely downloadable platform dedicated to the annotation and exploration of text corpora. The tool is developed by French NLP (Natural Language Processing) researchers, within the ANR ANNODIS project, by the GREYC in collaboration with the ERSS and the IRIT.

Let me walk you through a first facet of this tool, corpus annotation. We should still keep in mind the enormous value of GlozzQL, the query language that makes it possible to query the annotations produced, an aspect I am setting aside for now. I am also setting aside all the considerations related to the methodology of setting up an annotation campaign (adapting the model, defining the annotation scheme, trial runs, ...). The goal is really to share my first user experience of this tool, and since for now that experience consists of annotating... :D yes, I love it!

The nice facet of corpus annotation, as I was saying... it starts with a Java application, so no platform problems: it runs on Windows, Linux and Mac OS X, no worries. From the point of view of the end user and of running an annotation campaign, four key points:

  • getting started is accessible, but not yet user-friendly; you have to go through the command line to launch the .jar;

Launching Glozz from the terminal

  • the interface provides two simultaneous views of the text thread (global and local at the same time);

Views of the text thread

  • the power of the annotation scheme you can set up (ah, recursion, once it gets hold of you);

Recursion in the annotation scheme

  • synchronizing several annotation layers projected onto the corpus frame makes the annotation system comfortable and very flexible (e.g. an annotation layer can be modified in real time)

Loading annotation models in real time


USAGE STEPS

  • convert your .txt corpus to the required format

When the corpus is first imported through the interface, two output files are created:
(i) one in .ac format, which is the coordinate frame of the corpus, to which is associated
(ii) a file in .aa format, in which the annotation layers instantiated later on are stored.

Loading and converting your corpus to the required format

  • import your "ready-to-annotate" corpus

You then have to load, at the same time, the frame and the (for now empty) annotation layer of the corpus.

Loading the corpus with its frame (.ac and .aa files)

  • annotate... but with what?

You now have to import the file containing the annotation scheme you want to project onto the corpus; I am of course skipping over the preliminary thinking step, which consists in defining the annotation scheme itself. This file is an "annotation model" and is identified by the .aam extension: the annotation model can be imported directly from the interface.

Loading the .aam file (annotation model)

Worth noting:
(i) the flexibility of the system, since an annotation model can be modified and re-imported directly at any time;
(ii) the power of the system, which makes it very easy to work on the corpus with different description schemes without altering it. By "altering", I mean that in most annotation campaigns (at least the few I have had the chance to take part in) the annotations are embedded in the text thread, rather than associated with it as is the case in Glozz.

  • and now, annotate!

Getting the hang of it is fairly simple once you have absorbed a few interface conventions (telling the buttons that create an annotation apart from those that modify one); the simultaneous "local + global" navigation of the corpus quite simply makes the task infinitely more comfortable, especially when you have the good idea of working on conversation threads of several hundred messages (and that is an understatement) :)

I have not tested whether it is possible to work collaboratively on an annotation project, but it is certainly feasible and would give this platform strong usage potential. In later versions, the application could easily find its place in the workflow of analysts in industry, or foster the development of research projects on collaborative corpora.

Worth noting! A little tip if you ever have a somewhat large corpus: launch the application with more memory:
java -Xmx1024m -jar <path-to-the-glozz-jar>   # to allocate 1 GB
java -Xmx2048m -jar <path-to-the-glozz-jar>   # to allocate 2 GB

Wednesday, December 22, 2010

Modeling in linguistics and perspectives for NLP

It is not always easy to explain, to linguists lacking computational tools just as much as to engineers lacking a linguistics background, the importance of modeling linguistic phenomena: its implementation allows experimental validation for the former, and a qualitative gain in the analysis system for the latter.

During a seminar given at Paris X Nanterre, several Master's students presented their specialties based on bibliographic references suggested by the organizer. I had chosen to work on articles by Bernard Victorri, Research Director at LaTTiCe, which are rich and very well documented. The objective? To try to provide a synthesis allowing:

  • linguists to get an idea of NLP applications and of how much the quality of the modeling matters for system performance ;
  • engineers to get an overview of the works and problems where engineering gains in quality thanks to the contribution of linguistics.
These slides were made two years ago, but they still seem relevant to me, especially if they are consumed as an introduction!
#add-on: apparently, Slideshare embeds are not great friends with the DotClear platform, and the display of the Slideshare above may suffer from this bad relationship. If that is the case, I invite you to view the slides here :)
