As a web worker and a qualitative data analyst, I spend most of my time analyzing written French online, whether published in newspapers, forums, blogs or SNSs. As a linguist, it can be fun, stimulating or painful to see how far the linguistic norm and actual usage differ from each other. So came the idea to take a closer look at it. For this little test, I chose to run a basic lexical evaluation, comparing the coverage of standard linguistic resources for French.

  • The data

Let's try to be representative with the data to be tested! Well, that might be easier said than done, but let's give it a try with the following datasets:

- around 800 000 words from an online newspaper corpus covering 2009 political news, including both articles and users' comments;
- over 400 000 words from a popular French forum.

In the end, the dataset contains about 1 200 000 words: this is pure raw text from the Internet, without any preprocessing such as lemmatization, orthographic correction, case harmonization or any other mutilation of the raw text material.

  • The reference linguistic resources

Among other questionable choices, mine was to take the DELA and the Morphalou linguistic resources as a gold standard for a raw lexical comparison.

  1. The DELA dictionary is a dictionary of lemmas and their inflected forms for French (and also English, but obviously not used here), designed by the LIGM team at MLV University, France. This resource contains around 700 000 entries.
  2. The Morphalou lexicon is developed and maintained by the CNRTL and deeply linked to the research work of the UMR ATILF at Nancy University, France. This resource is another reference dataset for inflected forms of the French language, and contains over 500 000 entries.

Here's how they are used by a nice little Perl script:

  1. first, the words of the raw dataset are brutally segmented to obtain one word per line in the input file;
  2. the result is stored in a hashtable for subsequent use;
  3. in the same way, the entries of each resource are stored in their own hashtable, so we have all the entries from DELA in one hashtable and all the entries from Morphalou in another.

In case you're wondering "why hashtables", the point of this data structure is to use the words as keys with nothing as values, which makes checks on keys, especially existence checks, very fast. It was the best way to do it in Perl, from my non-dev point of view.
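Since the original script is not reproduced here, the following is only a minimal sketch of that loading step, with hypothetical file names (corpus_raw.txt, dela_entries.txt, morphalou_entries.txt) and the assumption that each resource can be read as one entry per line:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file names; the original script is not reproduced here.
my $corpus_file    = 'corpus_raw.txt';
my $dela_file      = 'dela_entries.txt';      # assumed: one entry per line
my $morphalou_file = 'morphalou_entries.txt'; # assumed: one entry per line

# 1. Brutal segmentation: split the raw text on anything that is not a letter,
#    an apostrophe or a hyphen, and keep each word as a hashtable key.
#    The value does not matter, only the existence of the key.
my %corpus_words;
open my $fh, '<:encoding(UTF-8)', $corpus_file or die "Cannot open $corpus_file: $!";
while (my $line = <$fh>) {
    for my $word (split /[^\p{L}'-]+/, $line) {
        $corpus_words{$word} = undef if length $word;
    }
}
close $fh;

# 2. and 3. Load each resource into its own hashtable (as a hash reference).
my $dela      = load_lexicon($dela_file);
my $morphalou = load_lexicon($morphalou_file);

sub load_lexicon {
    my ($file) = @_;
    my %lex;
    open my $in, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
    while (my $entry = <$in>) {
        chomp $entry;
        $lex{$entry} = undef if length $entry;
    }
    close $in;
    return \%lex;
}

# Fast existence check on a key:
print "known to DELA\n" if exists $dela->{'maison'};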

  • The rough encounter

Yes, the script aimed at comparing the raw material with each of the considered resources. So... ta-da! The results!

Three objectives here:

  1. Determine the number of words that were not recognized by DELA
  2. Then, the same evaluation for Morphalou
  3. Give the percentage of words that are recognized by neither of the two reference linguistic resources (see the sketch right after this list)
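As with the loading step, here is a minimal sketch of how these three numbers could be computed, reusing the hypothetical structures built above; the denominator for the shared percentage is an assumption, since the original script is not shown:

# Counting sketch, reusing the hypothetical %corpus_words, $dela and
# $morphalou structures from the loading step above.
my (@unknown_dela, @unknown_morphalou);

for my $word (keys %corpus_words) {
    push @unknown_dela,      $word unless exists $dela->{$word};
    push @unknown_morphalou, $word unless exists $morphalou->{$word};
}

# Words recognized by neither resource.
my %unknown_morphalou_set = map { $_ => 1 } @unknown_morphalou;
my @shared_unknown = grep { $unknown_morphalou_set{$_} } @unknown_dela;

print 'unknown words for DELA: ',      scalar @unknown_dela,      "\n";
print 'unknown words for Morphalou: ', scalar @unknown_morphalou, "\n";

# The exact denominator used by the original script is not shown; here the
# shared unknown words are reported relative to the Morphalou unknown list,
# which is one plausible reading.
my $pct = 100 * scalar(@shared_unknown) / scalar(@unknown_morphalou);
print "percentage of shared unknown words: $pct %\n";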

So, here are my little Perl script's results:

unknown words for DELA: 42719
unknown words for Morphalou: 43906
percentage of shared unknown words: 97.296497061905 %

From those results, let's highlight three points:

- first, French web-writers seem to be not that bad at spelling French online, as fewer than 50 000 occurrences are not recognized by the linguistic resources;
- second, this means that those resources are very close in terms of coverage, even though DELA has 200 000 more entries than Morphalou: a bit less than a 3% difference on the shared unknown words; so the size of the resource is not that important (at least, for the purpose of this little test);
- third, as some linguist colleagues told me at the latest AFLS conference I attended, the linguistic performance problems in current French are more salient at the (micro and macro) syntactic level (e.g. agreement in gender and number of inflected forms): this partly explains why the results of this basic lexicon-based evaluation are quite good. However, this is far from sufficient to determine what the current level of orthographic performance is.