As a web worker and a qualitative data analyst, I spend most of my time analyzing written French online, whether published in newspapers, forums, blogs or SNSs. As a linguist, I find it fun, stimulating or painful to see how the linguistic norm and actual usage differ from each other. Hence the idea of taking a closer look. For this little test, I chose a basic lexical evaluation that compares the coverage of standard linguistic resources for French.
- The data
Let's try to be representative with the data to be tested! Well, that
might be easier said than done, but let's give it a try with the following
datasets:
- around 800 000 words from an online newspaper corpus on political
news in 2009, including both articles and users' comments;
- over 400 000 words from a popular French forum.
In the end, the dataset contains about 1 200 000 words: this is raw text from the Internet, without any preprocessing such as lemmatization, spelling correction, case normalization or any other mutilation of the raw text material.
- The reference Linguistic Resources
One of my questionable choices was to take the DELA and Morphalou linguistic resources as the gold standard for a raw lexical comparison.
- The DELA dictionary is a dictionary of lemmas and their inflected forms for French (an English version also exists, but is obviously not used here), designed by the LIGM team from MLV University, France. This resource contains around 700 000 entries.
- The Morphalou lexicon is developed and maintained by the CNRTL and closely linked to the research work of the UMR ATILF at Nancy University, France. This resource is another reference dataset for inflected forms of the French language, and contains over 500 000 entries.
Here's how they are used by a nice little Perl script:
- first, the raw dataset is crudely segmented to obtain one word per line in the input file;
- the result is stored in a hashtable for subsequent use;
- in the same way, the entries of each resource are stored in their own hashtable, so we have all the entries from DELA in one hashtable and all the entries from Morphalou in another.
In case you're wondering "why hashtables?": the point of this data structure is to store the words as keys, with nothing as values, so that checking whether a given key exists is fast. It was the best way to do it in Perl, from my non-dev point of view.
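For readers who don't use Perl, the same idea can be sketched in Python, where a set plays the role of a hash with words as keys and no values. This is only an illustration, not the original script: the whitespace-only segmentation, the function names and the tiny sample data are all assumptions.

```python
def segment(raw_text):
    """Crude segmentation: split on whitespace only, keeping case and punctuation."""
    return raw_text.split()


def load_entries(lines):
    """Store one entry per line in a set (the analogue of a Perl hash with empty values)."""
    return {line.strip() for line in lines if line.strip()}


# Hypothetical sample data standing in for the corpus and for a resource file.
corpus_words = set(segment("je kiffe trop ce forum"))
resource = load_entries(["je\n", "ce\n", "forum\n", "trop\n"])

# Membership tests on a set take constant time on average,
# which is what makes the lookup fast even with hundreds of thousands of entries.
print("kiffe" in resource)  # False
print("forum" in resource)  # True
```

Storing the corpus in a set also removes duplicates, so each distinct word form is looked up only once.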
- The rough encounter
Yes, the script aimed at comparing the raw material with each of the resources considered. So... Tadaa! The results!
Three objectives here:
- Determine the number of words that are not recognized by DELA
- Then, run the same evaluation for Morphalou
- Give the percentage of words that are recognized by neither of the two reference linguistic resources
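Once the words are held in sets, these three objectives boil down to two set differences and an intersection. The sketch below is in Python rather than Perl, and the data is made up. In particular, the denominator chosen for the shared percentage (the union of both unknown sets) is an assumption, since the formula used by the original script isn't shown.

```python
def unknown_words(corpus, resource):
    """Words of the corpus that have no entry in the resource."""
    return corpus - resource


def shared_percentage(unknown_a, unknown_b):
    """Percentage of unknown words that are rejected by both resources.

    Assumption: the denominator is the union of both unknown sets.
    """
    union = unknown_a | unknown_b
    if not union:
        return 0.0
    return 100.0 * len(unknown_a & unknown_b) / len(union)


# Made-up data, for illustration only.
corpus = {"je", "kiffe", "trop", "ce", "forum", "lol", "mdr"}
dela = {"je", "trop", "ce", "forum", "mdr"}
morphalou = {"je", "trop", "ce", "forum"}

unknown_dela = unknown_words(corpus, dela)            # kiffe, lol
unknown_morphalou = unknown_words(corpus, morphalou)  # kiffe, lol, mdr

print(len(unknown_dela), len(unknown_morphalou))
print(shared_percentage(unknown_dela, unknown_morphalou))
```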
So, here are the results from my little Perl script:
unknown words for DELA : 42719
unknown words for Morphalou : 43906
percentage of shared unknown words : 97.296497061905 %
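As a quick sanity check on these figures (a Python sketch, not part of the original script): the reported percentage matches the ratio of the two counts, 42719 / 43906, to the printed precision. One possible reading, which is an assumption since the script isn't shown, is that the words unknown to DELA are essentially a subset of those unknown to Morphalou.

```python
unknown_dela = 42719
unknown_morphalou = 43906

# Ratio of the two counts, as a percentage.
ratio = 100 * unknown_dela / unknown_morphalou
print(round(ratio, 4))        # 97.2965
print(round(100 - ratio, 2))  # 2.7

# Absolute gap between the two counts.
print(unknown_morphalou - unknown_dela)  # 1187
```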
From those results, let's highlight three points:
- first, French web writers seem to be not that bad at spelling
French online, as fewer than 50 000 occurrences are not recognized by the
linguistic resources;
- second, the two resources are very close in terms of
coverage, even though DELA has 200 000 more entries than Morphalou: their
unknown words differ by a bit less than 3%, so the size of
the resource is not that important (at least, for the purpose of this little
test);
- third, as some linguist colleagues told me at the latest AFLS conference I attended,
the linguistic performance problems in current French are more salient at
the (micro and macro) syntactic level (e.g. gender and number agreement of
inflections): this partly explains why the results of
this basic lexicon-based evaluation are quite good. However, this is far
from sufficient to determine what the current level of orthographic performance
is.
