jExSLI - java Extremely Simple Language Identifier
- The approach used
- How to use the library inside your code
- How to extend it for other languages
This tool is a simple text language identifier. It can be used, for example, to determine the language of a text input given to your application. It is written in Java (compatible with any application written in Java 1.5 or later) and is distributed as a single jar file.
The initial list contains the 20 most commonly used languages and can be easily extended.
This tool applies a very simple text categorization approach based on the similarity of documents represented as vectors of terms weighted by their tf*idf values. To exploit this idea, each language must be represented as such a vector; for this we use the most frequent words of the language and their frequencies. According to our evaluation, this is a reasonable approach that performs well not only on long texts but also on phrases longer than 5 words.
Alternative software includes TextCat (the best-known Perl library, covering 69 languages) and lc4j (a Java library), which are based on n-gram document classification. These tools are more sophisticated, while jExSLI lets the user do language identification in the simplest way.
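The vector comparison described above can be sketched in a few lines of Java. This is a minimal illustration and not the jExSLI implementation: the class name TfIdfLanguageSketch, its methods, the tokenization rule, the idf formula (natural logarithm of the number of languages divided by the number of language profiles containing the word) and the choice of cosine similarity are all assumptions made for the example. It assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of tf*idf language matching; not the jExSLI code.
class TfIdfLanguageSketch {

    // idf(term) = ln(N / number of language profiles containing the term).
    static Map<String, Double> idf(Map<String, Map<String, Double>> profiles) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> profile : profiles.values()) {
            for (String term : profile.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) profiles.size() / e.getValue()));
        }
        return idf;
    }

    // Multiplies each term frequency by its idf; unknown or zero-idf terms are dropped.
    static Map<String, Double> weight(Map<String, Double> tf, Map<String, Double> idf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            Double w = idf.get(e.getKey());
            if (w != null && w > 0.0) {
                out.put(e.getKey(), e.getValue() * w);
            }
        }
        return out;
    }

    // Counts lowercase word tokens, splitting on anything that is not a letter.
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of two sparse vectors; 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the best-matching language name, or null if nothing overlaps.
    static String identify(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> idf = idf(profiles);
        Map<String, Double> query = weight(termFrequencies(text), idf);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = cosine(query, weight(e.getValue(), idf));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Here each profile maps the frequent words of one language to their frequencies. With this idf formula, a word that appears in every profile gets weight zero, so only words that discriminate between languages contribute to the score.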
To use language identification inside your Java code, add jExSLI.jar as an external jar library to your build path, create an instance of the class
LanguageIdentifier and call its method
identify(String text) to identify the language of your text. The method returns the name of a language from the list of available languages, or
null if the language is not recognized.
LanguageIdentifier languageIdentifier = new LanguageIdentifier();
System.out.println(languageIdentifier.identify("c'est la vie"));
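Because identify returns null for unrecognized input, calling code usually needs a fallback. The following is a small sketch of one way to handle that; the class IdentifyWithFallback and its method languageOrDefault are hypothetical helpers, not part of jExSLI, and the identifier is passed in as a function so that the example stays self-contained (Java 8 or later assumed).

```java
import java.util.function.Function;

// Hypothetical helper, not part of jExSLI.
class IdentifyWithFallback {

    // 'identifier' stands in for a method reference such as languageIdentifier::identify.
    // Returns the identified language name, or 'fallback' when the identifier returns null.
    static String languageOrDefault(Function<String, String> identifier,
                                    String text, String fallback) {
        String language = identifier.apply(text);
        return language != null ? language : fallback;
    }
}
```

With the library on the build path, a call could look like IdentifyWithFallback.languageOrDefault(languageIdentifier::identify, input, "unknown").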
Additionally, you can tune the language identifier's parameters to get a more robust classification when the input language is unknown. This is done by setting a threshold and removing features that occur often in different languages. Simply use the
LanguageIdentifier(boolean setDefaultThreshold) constructor with parameter
The list of languages that the tool can recognize is specified in the file
languages.txt included in the jar file. To add a new language, first add it to this file under the name you want to use afterwards. The second step is to create a file that contains the most frequent words of this language and their frequencies. The file should be named
LanguageNameFreq.txt and be placed in the directory
freqs/ of the jar file.
To add the modified list of languages to the jar file, run
$ jar uf jExSLI.jar languages.txt
and to add the frequency file, run
$ jar uf jExSLI.jar freqs/LanguageNameFreq.txt
The frequency file can be created for any common language from Wikipedia pages using the
LanguageIdentifierUtils class and its methods
createMostFreqWords. Note that these methods require an additional HTML parser library (specifically
htmlparser.jar) and an internet connection to load the Wikipedia pages.
From the command line you can do it as follows:
$ java -cp jExSLI.jar eu.fbk.hlt.LanguageIdentifierUtils -1 languageWikiAbbr englishTopPagesFile
to create a list of pages from the wiki, using a list of pages for the English wiki. The latter is included in the jar file or can be obtained from some other source. The first parameter is the abbreviation used in the wiki page address, e.g., 'en' for English, 'de' for German, etc.
$ java -cp jExSLI.jar eu.fbk.hlt.LanguageIdentifierUtils -2 languageWikiAbbr topPagesFile
to produce a frequency file for the specified language. The file 'topPagesFile' should contain a list of wiki pages for the given language (created in the previous step), and 'languageWikiAbbr' is the language abbreviation as before. The output file is called 'languageWikiAbbrFreq.txt' and should be included in the jar file as discussed above (note that the file must be renamed to match the language name in the list).
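The idea behind a frequency file, counting words in a body of text and keeping the most frequent ones with their relative frequencies, can be sketched as follows. This is only an illustration of the counting step on plain text: the class WordFrequencySketch, the method mostFrequentWords and the tokenization rule are assumptions, the sketch does not download or parse Wikipedia pages, and it does not reproduce the file format that jExSLI expects (Java 8 or later assumed).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the word-counting step; not LanguageIdentifierUtils.
class WordFrequencySketch {

    // Returns up to 'limit' most frequent words with their relative frequencies,
    // ordered from most to least frequent (ties broken alphabetically).
    static LinkedHashMap<String, Double> mostFrequentWords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            total++;
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        LinkedHashMap<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            Map.Entry<String, Integer> e = entries.get(i);
            result.put(e.getKey(), (double) e.getValue() / total);
        }
        return result;
    }
}
```

The returned map keeps insertion order, so iterating over it yields the words from most to least frequent.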
The jar file containing all necessary functionality can be downloaded here. By default, the set of languages is English, Italian, Spanish and German. To specify your own set of languages available to the application, change the
languages.txt file (see instructions above).
The full list of currently supported languages includes: Arabic, Catalan, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Ukrainian.
jExSLI is licensed under the Apache License 2.0.
It was developed as part of a summer internship project at the FBK HLT group by Kristina Gulordava.
Contact person: Claudio Giuliano.