Oct 15, 2010

Language Detection Library for Java

The language-detection library is an open-source Java library that detects the language in which a text is written.
(This task is also known as 'language identification', 'language guessing' and 'language recognition'.)

Features:
Over 99% precision for 40+ languages
Detects the language of a text using a naive Bayesian filter
Generates language profiles from a Wikipedia abstract database file
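
To give a rough idea of what a naive Bayesian filter over spelling features looks like, here is a minimal, self-contained sketch. It is not the library's code or API: the class name NgramBayesSketch, the choice of character 1- to 3-grams, the add-one smoothing, the uniform prior and the tiny training sentences are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a naive Bayes language guesser over character n-grams.
// Not the language-detection library's implementation.
class NgramBayesSketch {
    // Per-language n-gram counts (a tiny stand-in for a "language profile").
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // Total number of n-grams seen per language.
    private final Map<String, Integer> totals = new HashMap<>();
    // All distinct n-grams seen in any language, used for smoothing.
    private final Set<String> vocabulary = new HashSet<>();

    // Split text into character 1-, 2- and 3-grams, padded with spaces at both ends.
    static List<String> ngrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                out.add(s.substring(i, i + n));
            }
        }
        return out;
    }

    // Add the n-grams of a sample text to the profile of the given language.
    void train(String lang, String text) {
        Map<String, Integer> profile = counts.computeIfAbsent(lang, k -> new HashMap<>());
        for (String gram : ngrams(text)) {
            profile.merge(gram, 1, Integer::sum);
            totals.merge(lang, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    // Return the language with the highest log-likelihood (uniform prior assumed).
    String detect(String text) {
        List<String> grams = ngrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String lang : counts.keySet()) {
            Map<String, Integer> profile = counts.get(lang);
            double denom = totals.get(lang) + vocabulary.size();
            double score = 0.0;
            for (String gram : grams) {
                // Add-one smoothing so unseen n-grams do not zero out the product.
                score += Math.log((profile.getOrDefault(gram, 0) + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramBayesSketch sketch = new NgramBayesSketch();
        sketch.train("en", "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house");
        sketch.train("de", "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus");
        sketch.train("fr", "le renard brun rapide saute par-dessus le chien paresseux et le chat dort dans la maison");
        System.out.println(sketch.detect("the dog is in the house"));
    }
}
```

A real profile would be built from far more text than one sentence per language; the sketch only shows how per-language n-gram frequencies can be combined into a single score.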

Supported languages (47 bundled profiles):

Afrikaans, Arabic, Bulgarian, Bengali, Czech, German, Greek, English, Spanish, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Macedonian, Malayalam, Marathi, Nepali, Dutch, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Somali, Albanian, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Simplified/Traditional Chinese.

Project Homepage:

http://code.google.com/p/language-detection/

License:

Apache License 2.0

Author:

Shuyo Nakatani (Twitter: @shuyo) / Cybozu Labs, Inc.

Presentation


Comments

nice article, keep the posts coming

I’ll have to go back and read all your previous posts now.

How do I use this? Where can I find the jar file? The website has only jsonic.jar, but it doesn't contain the language detection files.

Thank you for your comment.
You can get the jar file and the language profiles from the svn repository at http://code.google.com/p/language-detection/source/checkout or download them from here:

http://code.google.com/p/language-detection/source/browse/#svn/trunk/lib
http://code.google.com/p/language-detection/source/browse/#svn/trunk/profiles

I am preparing a downloadable package and plan to release it soon.

Hi,

Thanks. For sentences with only a few words, say 2-3 words like "ipad is great" or "awesome ipad", it is not able to detect the proper language.

Thanks

Thanks for trying it out.

It is difficult to detect the language of short sentences because the detection relies on spelling characteristics.
This library can detect the language correctly for sentences of more than 10-20 words.
One developer who is trying to detect the languages of tweets uses this library together with his own filter.
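
The filter mentioned above is not described here, so the following is only a hypothetical sketch of what such a pre-filter for tweets might do: strip URLs, @mentions and #hashtags, then skip texts that are too short to judge. The class name ShortTextFilterSketch, the regular expressions and the minimum word count are all assumptions, not that developer's code.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter for short texts such as tweets; all rules are assumptions.
class ShortTextFilterSketch {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern MENTION_OR_TAG = Pattern.compile("[@#]\\w+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Remove URLs, @mentions and #hashtags, then collapse whitespace.
    static String clean(String text) {
        String s = URL.matcher(text).replaceAll(" ");
        s = MENTION_OR_TAG.matcher(s).replaceAll(" ");
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    // True if the cleaned text has at least minWords words.
    static boolean longEnough(String text, int minWords) {
        String cleaned = clean(text);
        return !cleaned.isEmpty() && cleaned.split(" ").length >= minWords;
    }

    public static void main(String[] args) {
        String tweet = "@shuyo awesome ipad http://example.com #apple";
        System.out.println(clean(tweet));
        System.out.println(longEnough(tweet, 10));
    }
}
```

Only the text that passes such a check would then be handed to the detector; shorter inputs could be reported as undetermined rather than guessed.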

It is also difficult to handle proper nouns, e.g. iPad, iPhone. These are not "English"...
