Python: How to determine the language?

1. TextBlob.

Requires the NLTK package and uses Google.

    from textblob import TextBlob
    b = TextBlob("bonjour")
    b.detect_language()

pip install textblob

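Note: This solution requires internet access, because TextBlob detects the language by calling the Google Translate API. Recent TextBlob releases deprecated and later removed `detect_language()`, so this may only work with older versions.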

2. Polyglot.

Requires numpy and some arcane libraries, ~~unlikely to get it to work on Windows~~. (On Windows, get the appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) It can detect texts with mixed languages.

    from polyglot.detect import Detector
    mixed_text = u"""
    China (simplified Chinese: 中国; traditional Chinese: 中國),
    officially the People's Republic of China (PRC), is a sovereign state
    located in East Asia.
    """
    for language in Detector(mixed_text).languages:
        print(language)
    # name: English     code: en       confidence:  87.0 read bytes:  1154
    # name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
    # name: un          code: un       confidence:   0.0 read bytes:     0

pip install polyglot

To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

Note: Polyglot uses pycld2 under the hood; see https://github.com/aboSamoor/polyglot/blob/master/polyglot/detect/base.py#L72 for details.

3. chardet

Chardet can also detect the language when the input contains bytes in the range (127-255]:

    >>> import chardet
    >>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
    {'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}

pip install chardet
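Since the language field depends on the input containing bytes above 127, it can help to check for them before calling the detector. A minimal sketch using only the standard library follows; `has_high_bytes` is a hypothetical helper name, not part of chardet.

```python
def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte value falls in the range (127, 255]."""
    return any(byte > 127 for byte in data)


# The cp1251-encoded Russian sample from above contains such bytes.
print(has_high_bytes("Я люблю вкусные пампушки".encode("cp1251")))
# Pure ASCII input does not.
print(has_high_bytes(b"plain ASCII text"))
```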

4. langdetect

Requires large portions of text. It uses a non-deterministic approach under the hood, so the same text sample can yield different results. According to the docs, setting the seed as follows makes the results deterministic:

    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    detect('今一はお前さん')

pip install langdetect

5. guess_language

Can detect the language of very short samples by using this spell checker with dictionaries.

pip install guess_language-spirit

6. langid

langid.py provides both a module

    import langid
    langid.classify("This is a test")
    # ('en', -54.41310358047485)

and a command-line tool:

    $ langid < README.md

pip install langid

7. FastText

FastText is a text classifier that can recognize 176 languages when used with a suitable language-classification model. Download this model, then:

    import fasttext
    model = fasttext.load_model('lid.176.ftz')
    print(model.predict('الشمس تشرق', k=2))  # top 2 matching languages
    # (('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))

pip install fasttext
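The labels returned by `predict` carry a `__label__` prefix, as the output above shows. A minimal sketch of turning that output into plain `(code, probability)` pairs follows. `parse_prediction` is a hypothetical helper name, not part of fastText, and it assumes only the tuple shape shown above.

```python
PREFIX = "__label__"


def parse_prediction(prediction):
    """Convert a (labels, probabilities) pair into (code, probability) tuples."""
    labels, probabilities = prediction
    return [
        # Strip the prefix when present; leave other labels untouched.
        (label[len(PREFIX):] if label.startswith(PREFIX) else label, float(p))
        for label, p in zip(labels, probabilities)
    ]


# Shape copied from the example output above; a plain list stands in for the array.
example = (("__label__ar", "__label__fa"), [0.98124713, 0.01265871])
print(parse_prediction(example))
```

With a loaded model, the same helper could be applied directly to the return value of `model.predict(text, k=2)`.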

8. pyCLD3

pycld3 is a neural network model for language identification. This package contains the inference code and a trained model.

    import cld3
    cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
    # LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

pip install pycld3


Have you had a look at langdetect?

    from langdetect import detect
    lang = detect("Ein, zwei, drei, vier")
    print(lang)
    # output: de


If you are looking for a library that is fast with long texts, polyglot and fasttext do the best job here.

I sampled 10000 documents from a collection of dirty, random HTML pages, and here are the results:

    +------------+----------+
    | Library    | Time     |
    +------------+----------+
    | polyglot   | 3.67 s   |
    +------------+----------+
    | fasttext   | 6.41 s   |
    +------------+----------+
    | cld3       | 14 s     |
    +------------+----------+
    | langid     | 1min 8s  |
    +------------+----------+
    | langdetect | 2min 53s |
    +------------+----------+
    | chardet    | 4min 36s |
    +------------+----------+
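The answer does not show how these timings were taken. A minimal sketch of one way to time a detector over a batch of documents follows, assuming each library is wrapped in a function that takes a string. `time_detector` and the stand-in `dummy_detector` are hypothetical names used only for illustration.

```python
import time


def time_detector(detect, documents):
    """Run detect() on every document and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = []
    for doc in documents:
        try:
            results.append(detect(doc))
        except Exception:
            # Some detectors raise on empty or unusable input; record a miss.
            results.append(None)
    elapsed = time.perf_counter() - start
    return results, elapsed


def dummy_detector(text):
    """Stand-in for a real library call; raises on empty input."""
    if not text.strip():
        raise ValueError("no text")
    return "en"


docs = ["hello world", "", "some more text"]
results, elapsed = time_detector(dummy_detector, docs)
print(results, f"{elapsed:.4f} s")
```

Swapping `dummy_detector` for a thin wrapper around each library keeps the measurement loop identical across candidates.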

I have noticed that many of the methods focus on short texts, probably because that is the hard problem to solve: with a lot of text, detecting the language is really easy (e.g. one could just use a dictionary!). However, this makes it difficult to find an easy and suitable method for long texts.
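The dictionary idea can be sketched as follows: count how many words of the text appear in a small per-language word list and pick the best match. The word lists here are tiny samples chosen for this example, not a real dictionary, and `guess_by_dictionary` is a hypothetical name.

```python
import re

# Tiny illustrative word lists; a real dictionary would be far larger.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "est", "des", "un", "que"},
}


def guess_by_dictionary(text):
    """Return the language whose word list matches the most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No match in any list means no guess at all.
    return best if scores[best] > 0 else None


print(guess_by_dictionary("Das ist nicht der Weg und das ist ein Problem"))
```

This only works when the text is long enough to contain several common words, which is why it suits long documents better than short samples.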
