AUTOMATIC LANGUAGE IDENTIFICATION


Original Posting

Dear Content | Corpora subscribers,

In order to extend the functionality of a prototype that analyzes the textual content of Web-based information systems (see the preceding publication alert on "Evolutionary Web Development"), we are currently working on a component that automatically detects various languages. We would therefore be interested in

  • general papers or books on automatic language detection (based on words, n-grams,...).
  • lists of the most common or typical words in certain languages.

Please reply to me personally and I'll post a summary of the responses to the list.
Thank you
& best regards, ~ Arno Scharl

Summary of Responses

--- Raphael Salkie <R.M.Salkie@bton.ac.uk> ---

This page has a section of "language identifiers": http://www.yourdictionary.com/morph.html#guessers  

Stochastic Language Identifier: http://www.dougb.com/ident.html 

--- Jochen Leidner <leidner@linguit.com> ---

CPAN has a free Perl implementation of such a module by Michael Piotrowski. Sun (Java) and Linguit (C, Java) also have commercial implementations of such tools.

--- Tom Emerson <tree@basistech.com> ---

Here is the classic paper: 

@InProceedings{Cavnar:1994:NBT, author = {Cavnar, William B. and Trenkle, John M.}, title = {{N}-Gram-Based Text Categorization}, booktitle = {Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval}, year = {1994}, address = {Las Vegas, NV USA} }

This is also very good: 

@TechReport{Dunning:1994:SIL, author = {Dunning, Ted}, title = {Statistical Identification of Language}, institution = {Computing Research Lab, New Mexico State University}, year = {1994}, type = {Technical Report}, number = {{CRL} {MCCS}-94-273}, month = mar }

Here are some others in my collection: 

@InProceedings{Sibun:1996:LIE, author = {Sibun, Penelope and Reynar, Jeffrey C.}, title = {Language Identification: Examining the Issues}, booktitle = {Proceedings of the 1996 Symposium on Document Analysis and Information Retrieval}, pages = {125--135}, year = {1996}, address = {Las Vegas, NV USA} }

@TechReport{Hazen:1993:ALI, author = {Hazen, Timothy J.}, title = {Automatic Language Identification Using a Segment-Based Approach}, institution = {Massachusetts Institute of Technology}, year = {1993}, type = {Technical Report}, number = {MIT/LCS/TR} }

@InProceedings{Combrinck:1995:TAL, author = {Combrinck, H.P. and Botha, E.C.}, title = {Text-Based Automatic Language Identification}, booktitle = {Proceedings of the Sixth Annual South African Workshop on Pattern Recognition}, year = {1995}, address = {Rand Afrikaans University}, month = nov } 

@InProceedings{Combrink:1997:ALI, author = {Combrinck, H.P. and Botha, E.C.}, title = {Automatic Language Identification: Performance vs. Complexity}, booktitle = {Proceedings of the Eighth Annual South African Workshop on Pattern Recognition}, year = {1997}, address = {Rhodes University, Grahamstad}, month = nov }

--- E S Atwell <eric@comp.leeds.ac.uk> ---

Automatic language detection is the PhD research topic of Leeds student John Elliott, and several papers can be downloaded from his website http://www.comp.leeds.ac.uk/jre/ including:

Elliott J, Atwell E & Whyte, W. 2001. Visualisation of Long Distance Grammatical Collocation Patterns in Language in: IV2001: 5th International Conference on Information Visualisation, London, UK. 

Elliott J, Atwell E & Whyte, W. 2001. First Stage Identification of Syntactic Elements in an Extraterrestrial Signal in: Proceedings of IAC'2001: the 52nd International Astronautical Congress, paper IAA-01-IAA.9.2.07, Toulouse, France.

Elliott, John & Atwell, Eric & Whyte, Bill. 2000. Increasing our ignorance of language: identifying language structure in an unknown signal in: Proceedings of 4th International Conference on Computational Natural Language Learning (CoNLL-2000, Lisbon, Portugal), pages 25-30. Association for Computational Linguistics (ACL), New Brunswick, NJ 08901, USA.

Elliott, John, Atwell, Eric & Whyte, Bill. 2000. Language identification in unknown signals in: Proceedings of COLING'2000, 18th International Conference on Computational Linguistics, pages 1021-1026, Association for Computational Linguistics (ACL) and Morgan Kaufmann Publishers, San Francisco. ISBN: 1-55860-717-X (2 volumes).

Elliott, John & Atwell, Eric. 2000. Is anybody out there?: the detection of intelligent and generic language-like features in: Journal of the British Interplanetary Society, vol.53 no.1/2 pages 13-22, British Interplanetary Society, London. ISSN: 0007-084X. 

Elliott, J & Atwell, E. 1999. Language in signals: the detection of generic species-independent intelligent language features in symbolic and oral communications in: Proceedings of the 50th International Astronautical Congress, paper IAA-99-IAA.9.1.08, Amsterdam. International Astronautical Federation, Paris.

--- Susana Sotelo Docio <sdocio@usc.es> ---

I recommend that you start with Gertjan van Noord's web page on language identification: http://odur.let.rug.nl/~vannoord/TextCat/

It contains a lot of information and links to other language identification software. His program (TextCat) is based on the paper Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", http://www.nonlineardynamics.com/trenkle/papers/sdair-94-bc.ps.gz
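
For readers who have not seen the Cavnar and Trenkle paper, the following is a minimal Python sketch in the spirit of its rank-order ("out-of-place") comparison of n-gram profiles. It is not TextCat's code. The function names, the underscore used as word-boundary padding, the default parameters, and the two toy training samples are all assumptions made for illustration; a usable identifier would need much larger training texts per language.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscore marks word boundaries (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result
    # is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    profile get the maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )

def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training samples (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(identify("the dog and the cat", PROFILES))
```

The only language-specific resource in this sketch is the training text, which is why the approach extends easily to further languages.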

Here you will find some bibliographical entries: http://speech.inesc.pt/~dcaseiro/html/bibliografia.html 

Other papers and pages about language identification:

Grefenstette, G. (1995). Comparing two language identification schemes. In Bolasco S., Lebart L., Salem A. (eds.) JADT 1995. Rome : CISU, I, 263-268, available at: http://www.xrce.xerox.com/publis/mltt/jadt.pdf

Xerox MLTT Language Identifier: http://www.xrce.xerox.com/research/mltt/tools/guesser/ 

You will find a word frequency list for modern Spanish at Lluis Padro's home page: http://www.lsi.upc.es/~padro/

--- William H. Fletcher <fletcher@usna.edu> ---

A related resource which may be of interest to you or others pursuing research in this area is the MLANG.DLL component that is installed on Windows systems along with Internet Explorer. It was described by Michael Kaplan in the Oct 00 issue of Visual Basic Programmer's Journal (http://www.devx.com/free/codelib/view.asp?id=632086) and in his book "Internationalization with Visual Basic" (http://www.i18nwithvb.com/). It includes the language guessing engine for IE.

--- Larry Spitz <spitz@docrec.com> ---

Attached to this email is my paper on the subject of language identification published about four years ago. This work was done starting with scanned documents rather than character coded representations, but that just makes the problem a bit harder. Removing the "noise" of scanning from the process only makes language identification more accurate.

--- Harald Klein <intext@gmx.de> ---

Exactly. Conjunctions, pronouns, prepositions, and articles that do not occur in other languages can help here. The word "in" is of course unsuitable, but "nein" and "und" would be good candidates for German. Much the same holds for English, but because English has both Germanic and Romance roots it is considerably harder, especially since its words are also shorter than, for example, German ones. A systematic approach: write down the prepositions of the languages that are to be recognized automatically, then check whether they are valid words in the other languages. I have built lists of such words for German, English, and French into my text analysis software (http://www.textquest.de). As far as I know, a company in Berlin has considerable interest in this application. Beyond that, the topic interests me.
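
The systematic approach described in this reply can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the tiny word lists, the language codes, and the function names are hypothetical and are not taken from TextQuest; real lists would be far longer and would be checked against full dictionaries of the other languages.

```python
import re

# Tiny illustrative word lists (assumption; not the lists used in TextQuest).
FUNCTION_WORDS = {
    "de": {"und", "nein", "in", "der", "die", "mit", "nicht"},
    "en": {"and", "no", "in", "the", "with", "not", "die"},
    "fr": {"et", "non", "dans", "le", "avec", "pas", "de"},
}

def distinctive_words(word_lists):
    """Keep only the words that occur in exactly one language's list."""
    result = {}
    for lang, words in word_lists.items():
        others = set().union(
            *(w for other, w in word_lists.items() if other != lang)
        )
        result[lang] = words - others
    return result

def guess_language(text, markers):
    """Count marker-word hits per language; return None if nothing matches."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

MARKERS = distinctive_words(FUNCTION_WORDS)
print(guess_language("Nein, und nicht mit mir!", MARKERS))
```

In this toy data, "in" and "die" are dropped from the German markers because they also appear in the English list, which mirrors the point made above about the word "in".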

--- Arne Fitschen <fitschen@ims.uni-stuttgart.de> ---

Based on 200 million tokens (words + punctuation marks) of German-language newspaper text that we have available here at the Institut für maschinelle Sprachverarbeitung, I can give you the percentages of the most frequent 50 tokens. The 50 most frequent word forms in newspaper text comprising 204,813,118 tokens:

der     2.80 %    die     2.58 %    und     1.84 %    in      1.45 %
den     1.00 %    von     0.80 %    zu      0.75 %    das     0.68 %
mit     0.67 %    sich    0.62 %    für     0.61 %    des     0.61 %
im      0.60 %    nicht   0.58 %    auf     0.58 %    ist     0.55 %
Die     0.54 %    dem     0.54 %    ein     0.48 %    eine    0.45 %
auch    0.38 %    als     0.38 %    es      0.37 %    an      0.37 %
daß     0.34 %    aus     0.33 %    werden  0.33 %    sie     0.30 %
nach    0.29 %    hat     0.29 %    am      0.28 %    Der     0.27 %
er      0.26 %    einer   0.26 %    um      0.25 %    noch    0.25 %
bei     0.25 %    wird    0.24 %    sind    0.24 %    vor     0.23 %
wie     0.23 %    über    0.22 %    einem   0.21 %    zum     0.21 %
nur     0.20 %    bis     0.20 %    Das     0.20 %    einen   0.20 %
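
A list of this kind can be derived from any tokenized corpus with a few lines of Python. The sketch below is only an assumption about the general procedure, not the tooling used at the institute; the function name and the toy token sequence are hypothetical. Counting is case-sensitive, matching the list above, where "Die" and "die" appear as separate word forms.

```python
from collections import Counter

def top_token_shares(tokens, k=50):
    """Return the k most frequent tokens with their share of all tokens in percent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(tok, 100.0 * n / total) for tok, n in counts.most_common(k)]

# Toy token sequence (hypothetical); a real corpus would be read from files.
tokens = "der Hund und die Katze und der Vogel".split()
for token, share in top_token_shares(tokens, k=3):
    print(f"{token:<8}{share:.2f} %")
```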

--- Gregor Erbach <gor@acm.org> ---

The best results in language identification are obtained by combining frequent words and n-grams. Some links on the topic can be found right at the top of my bookmark list: http://www.ge.f2s.com/bookmark.htm. As for word lists, I would recommend searching for stopword lists for information retrieval. Gertjan van Noord's website also has data for various languages.