|
--- Raphael Salkie <R.M.Salkie@bton.ac.uk>
---
This page has a section on "language identifiers":
http://www.yourdictionary.com/morph.html#guessers
Stochastic Language Identifier: http://www.dougb.com/ident.html
--- Jochen Leidner <leidner@linguit.com>
---
CPAN has a free Perl implementation of such a module by
Michael Piotrowski. Sun (Java) and Linguit (C, Java) also have
commercial implementations of such tools.
--- Tom Emerson <tree@basistech.com>
---
Here is the classic paper:
@InProceedings{Cavnar:1994:NBT,
  author    = {Cavnar, William B. and Trenkle, John M.},
  title     = {{N}-Gram-Based Text Categorization},
  booktitle = {Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval},
  year      = {1994},
  address   = {Las Vegas, NV USA}
}
This is also very good:
@TechReport{Dunning:1994:SIL,
  author      = {Dunning, Ted},
  title       = {Statistical Identification of Language},
  institution = {Computing Research Lab, New Mexico State University},
  year        = {1994},
  type        = {Technical Report},
  number      = {{CRL} {MCCS}-94-273},
  month       = mar
}
Here are some others in my collection:
@InProceedings{Sibun:1996:LIE,
  author    = {Sibun, Penelope and Reynar, Jeffrey C.},
  title     = {Language Identification: Examining the Issues},
  booktitle = {Proceedings of the 1996 Symposium on Document Analysis and Information Retrieval},
  pages     = {125--135},
  year      = {1996},
  address   = {Las Vegas, NV USA}
}
@TechReport{Hazen:1993:ALI,
  author      = {Hazen, Timothy J.},
  title       = {Automatic Language Identification Using a Segment-Based Approach},
  institution = {Massachusetts Institute of Technology},
  year        = {1993},
  type        = {Technical Report},
  number      = {MIT/LCS/TR}
}
@InProceedings{Combrinck:1995:TAL,
  author    = {Combrinck, H.P. and Botha, E.C.},
  title     = {Text-Based Automatic Language Identification},
  booktitle = {Proceedings of the Sixth Annual South African Workshop on Pattern Recognition},
  year      = {1995},
  address   = {Rand Afrikaans University},
  month     = nov
}
@InProceedings{Combrink:1997:ALI,
  author    = {Combrinck, H.P. and Botha, E.C.},
  title     = {Automatic Language Identification: Performance vs. Complexity},
  booktitle = {Proceedings of the Eighth Annual South African Workshop on Pattern Recognition},
  year      = {1997},
  address   = {Rhodes University, Grahamstad},
  month     = nov
}
--- E S Atwell <eric@comp.leeds.ac.uk>
---
Automatic language detection is the PhD research topic of
Leeds student John Elliott, and several papers can be downloaded from his
website http://www.comp.leeds.ac.uk/jre/
including:
Elliott, J., Atwell, E. & Whyte, W. 2001. Visualisation of Long Distance
Grammatical Collocation Patterns in Language. In: IV2001: 5th
International Conference on Information Visualisation, London, UK.
Elliott, J., Atwell, E. & Whyte, W. 2001. First Stage Identification of
Syntactic Elements in an Extraterrestrial Signal. In: Proceedings of
IAC'2001: the 52nd International Astronautical Congress, paper
IAA-01-IAA.9.2.07, Toulouse, France.
Elliott, John, Atwell, Eric & Whyte, Bill. 2000. Increasing our ignorance
of language: identifying language structure in an unknown signal. In:
Proceedings of the 4th International Conference on Computational Natural
Language Learning (CoNLL-2000, Lisbon, Portugal), pages 25-30.
Association for Computational Linguistics (ACL), New Brunswick, NJ 08901,
USA.
Elliott, John, Atwell, Eric & Whyte, Bill. 2000. Language identification
in unknown signals. In: Proceedings of COLING'2000, 18th International
Conference on Computational Linguistics, pages 1021-1026. Association for
Computational Linguistics (ACL) and Morgan Kaufmann Publishers, San
Francisco. ISBN: 1-55860-717-X (2 volumes).
Elliott, John & Atwell, Eric. 2000. Is anybody out there?: the detection
of intelligent and generic language-like features. In: Journal of the
British Interplanetary Society, vol. 53 no. 1/2, pages 13-22. British
Interplanetary Society, London. ISSN: 0007-084X.
Elliott, J. & Atwell, E. 1999. Language in signals: the detection of
generic species-independent intelligent language features in symbolic and
oral communications. In: Proceedings of the 50th International
Astronautical Congress, paper IAA-99-IAA.9.1.08, Amsterdam. International
Astronautical Federation, Paris.
--- Susana Sotelo Docio <sdocio@usc.es>
---
I recommend that you start with Gertjan van Noord's web page about
language identification: http://odur.let.rug.nl/~vannoord/TextCat/
It contains a lot of information and links to other language
identification software. His program (TextCat) is based on the paper
Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization",
http://www.nonlineardynamics.com/trenkle/papers/sdair-94-bc.ps.gz
Here you will find some bibliographical entries:
http://speech.inesc.pt/~dcaseiro/html/bibliografia.html
Other papers and pages about language identification:
Grefenstette, G. (1995). Comparing two language identification schemes.
In Bolasco S., Lebart L., Salem A. (eds.) JADT 1995. Rome: CISU, I,
263-268, available at: http://www.xrce.xerox.com/publis/mltt/jadt.pdf
Xerox MLTT Language Identifier:
http://www.xrce.xerox.com/research/mltt/tools/guesser/
You will find a word frequency list for modern Spanish at Lluis Padro's
home page: http://www.lsi.upc.es/~padro/
--- William H. Fletcher <fletcher@usna.edu>
---
A related resource which may be of interest to you or others pursuing
research in this area is the MLANG.DLL component that is installed on
Windows systems along with Internet Explorer. It includes the language
guessing engine for IE. Michael Kaplan described it in the October 2000
issue of Visual Basic Programmer's Journal
(http://www.devx.com/free/codelib/view.asp?id=632086) and in his book
"Internationalization with Visual Basic" (http://www.i18nwithvb.com/).
--- Larry Spitz <spitz@docrec.com>
---
Attached to this email is my paper on the subject of
language identification published about four years ago. This work was done
starting with scanned documents rather than character coded
representations, but that just makes the problem a bit harder. Removing
the "noise" of scanning from the process only makes language
identification more accurate.
--- Harald Klein <intext@gmx.de>
---
Exactly. Conjunctions, pronouns, prepositions, and articles that do not
occur in other languages can help here. The word "in" obviously will not
work, but "nein" and "und" would be suitable candidates for German.
Something similar holds for English, but because English has both
Germanic and Romance roots it is far more difficult, especially as its
words are also shorter than, for example, German ones. A systematic
approach: write down the prepositions of the languages that are to be
recognized automatically, then check whether they are valid words in
other languages. I have built lists of such words for German, English,
and French into my text analysis software (http://www.textquest.de).
As far as I know, a company in Berlin has a considerable interest in
this application. Apart from that, the topic interests me.
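The systematic approach described in this reply could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the tiny word lists `FUNCTION_WORDS` and `EXTRA_VOCAB`, and the helper names `distinctive_words` and `guess_language`, are hypothetical and are not taken from the TextQuest software.

```python
import re

# Hypothetical, very short function-word lists; real lists would be longer.
FUNCTION_WORDS = {
    "de": {"und", "nein", "in", "die", "nicht", "mit"},
    "en": {"and", "the", "in", "not", "with", "of"},
    "fr": {"et", "le", "dans", "pas", "avec", "des"},
}

# Hypothetical extra vocabulary: words that are valid in a language
# without being on its function-word list (e.g. "die" is an English verb).
EXTRA_VOCAB = {
    "de": {"des"},
    "en": {"die"},
    "fr": set(),
}

def distinctive_words(function_words, extra_vocab):
    """Keep only function words that are not valid words in another language."""
    markers = {}
    for lang, words in function_words.items():
        foreign = set()
        for other in function_words:
            if other != lang:
                foreign |= function_words[other] | extra_vocab.get(other, set())
        markers[lang] = words - foreign
    return markers

def guess_language(text, markers):
    """Return the language with the most marker-word hits, or None if no hit."""
    tokens = re.findall(r"\w+", text.lower())
    hits = {lang: sum(tok in words for tok in tokens)
            for lang, words in markers.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

MARKERS = distinctive_words(FUNCTION_WORDS, EXTRA_VOCAB)
```

With these sample lists, "in" is dropped for both German and English, while "und" and "nein" remain as German markers, which matches the examples given in the reply.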
--- Arne Fitschen <fitschen@ims.uni-stuttgart.de>
---
Based on 200 million tokens (words + punctuation marks) of
German-language newspaper text that we have available here at the
Institut für maschinelle Sprachverarbeitung, I can give you the
percentages of the 50 most frequent tokens. The 50 most frequent word
forms from newspaper texts totalling 204,813,118 tokens (in descending
order, reading across):
der     2.80 %    die     2.58 %    und     1.84 %    in      1.45 %
den     1.00 %    von     0.80 %    zu      0.75 %    das     0.68 %
mit     0.67 %    sich    0.62 %    für     0.61 %    des     0.61 %
im      0.60 %    nicht   0.58 %    auf     0.58 %    ist     0.55 %
Die     0.54 %    dem     0.54 %    ein     0.48 %    eine    0.45 %
auch    0.38 %    als     0.38 %    es      0.37 %    an      0.37 %
daß     0.34 %    aus     0.33 %    werden  0.33 %    sie     0.30 %
nach    0.29 %    hat     0.29 %    am      0.28 %    Der     0.27 %
er      0.26 %    einer   0.26 %    um      0.25 %    noch    0.25 %
bei     0.25 %    wird    0.24 %    sind    0.24 %    vor     0.23 %
wie     0.23 %    über    0.22 %    einem   0.21 %    zum     0.21 %
nur     0.20 %    bis     0.20 %    Das     0.20 %    einen   0.20 %
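A list of this kind can be produced from any corpus with a few lines of Python. The sketch below is only an illustration: the function name `top_token_percentages` and the tokenization rule (runs of word characters, or single punctuation marks) are assumptions, not the procedure actually used at the institute. Case is preserved, so "Die" and "die" are counted separately, as in the list above.

```python
from collections import Counter
import re

def top_token_percentages(text, n=50):
    """Return the n most frequent tokens with their share of all tokens (in %)."""
    # Assumed tokenization: a token is a run of word characters
    # or a single non-space punctuation character.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    return [(tok, 100.0 * c / total) for tok, c in counts.most_common(n)]

if __name__ == "__main__":
    sample = "Die Katze und der Hund und die Maus ."
    for token, pct in top_token_percentages(sample, n=5):
        print(f"{token:8s}{pct:5.2f} %")
```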
--- Gregor Erbach <gor@acm.org>
---
The best results in language identification are obtained by combining
frequent words and n-grams. Some links on the topic can be found right
at the top of my bookmark list: http://www.ge.f2s.com/bookmark.htm.
As for word lists, I would recommend searching for stop word lists for
information retrieval. Gertjan van Noord's website also has data for
various languages.
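One possible way to combine the two signals is sketched below in Python. It is a minimal illustration, not the method of any of the tools mentioned in this digest: the character trigram size, the equal weighting (`word_weight=0.5`), the profile sizes, and all function names are assumptions made for the example.

```python
from collections import Counter
import re

def char_ngrams(text, n=3):
    """Count character n-grams, with '_' marking word boundaries."""
    padded = "_" + re.sub(r"\s+", "_", text.lower().strip()) + "_"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def train(samples, top_words=20, top_ngrams=200):
    """Build one profile per language from sample text."""
    profiles = {}
    for lang, text in samples.items():
        words = Counter(re.findall(r"\w+", text.lower()))
        profiles[lang] = {
            "words": {w for w, _ in words.most_common(top_words)},
            "ngrams": {g for g, _ in char_ngrams(text).most_common(top_ngrams)},
        }
    return profiles

def score(text, profile, word_weight=0.5):
    """Weighted mix of frequent-word coverage and n-gram coverage (0..1)."""
    tokens = re.findall(r"\w+", text.lower())
    word_share = (sum(t in profile["words"] for t in tokens) / len(tokens)
                  if tokens else 0.0)
    grams = char_ngrams(text)
    total = sum(grams.values())
    gram_share = (sum(c for g, c in grams.items() if g in profile["ngrams"]) / total
                  if total else 0.0)
    return word_weight * word_share + (1 - word_weight) * gram_share

def identify(text, profiles):
    """Return the language whose profile gives the highest combined score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))
```

In practice the profiles would be trained on much larger samples, and the stop word lists mentioned above could stand in for the frequent-word part of each profile.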
|