
Looking for a good ISO language tag list by number of native speakers


I don't know if the thing you want is readily available. You may need to create this yourself, starting with the biggest languages and gradually moving to the smaller ones.

The question poses several difficulties:

  • There are 6000-7000 languages in the world, but not all of them have a language tag.
  • Estimates for the number of speakers are always somewhat dated, but some are more dated than others. When I consulted Wikipedia to create my list of language tags, the estimates I found dated from anywhere between the early 1990s and 2010, so the figures are not perfectly comparable.
  • Estimates for smaller languages and for languages without official status are often very rough, and sometimes nonexistent.
  • Some language tags, especially in ISO 639-3, are "inclusive codes" (macrolanguages, in ISO 639-3 terms), i.e. they identify language groups (e.g. Chinese) instead of individual languages.
  • For some languages, it is useful to distinguish between variants spoken in different countries, e.g. when you need separate speech synthesis for Belgian Dutch and for Dutch from the Netherlands.
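To make these difficulties concrete, here is a minimal Python sketch of how one entry in such a list might be recorded and ranked. The `LanguageEntry` class, its field names, and the `x-demo` tag are my own assumptions for illustration, and the speaker counts are dummy values, not real estimates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LanguageEntry:
    tag: str                         # language tag, e.g. "nl" or "nl-BE"
    name: str
    native_speakers: Optional[int]   # None when no estimate exists
    estimate_year: Optional[int]     # year the estimate dates from
    is_inclusive: bool = False       # True for codes covering a language group


def rank_by_native_speakers(entries: List[LanguageEntry]) -> List[LanguageEntry]:
    """Largest first; entries without an estimate go last, ordered by tag."""
    return sorted(
        entries,
        key=lambda e: (e.native_speakers is None, -(e.native_speakers or 0), e.tag),
    )


# Dummy figures only: they illustrate the ordering, not real speaker counts.
entries = [
    LanguageEntry("nl-BE", "Dutch (Belgium)", 200, 2005),
    LanguageEntry("zh", "Chinese", 300, 1995, is_inclusive=True),
    LanguageEntry("x-demo", "Language without an estimate", None, None),
    LanguageEntry("nl", "Dutch", 250, 2010),
]

ranked_tags = [entry.tag for entry in rank_by_native_speakers(entries)]
print(ranked_tags)
```

Keeping the estimate year next to the figure makes it visible when two numbers are not really comparable, and the explicit `None` keeps "no estimate" apart from "zero speakers".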

Initially you only need the list of ISO 639-1 language tags (two-letter codes), since the biggest languages are all represented there. For smaller languages you will eventually need the ISO 639-3 tags (three-letter codes). IETF BCP 47 recommends that you use the shortest code that is available for a specific language. (So, in your example, 'cmn' for Chinese would be replaced by 'zh', 'zh-CN', 'zh-TW' or something else, depending on how specific you want to be.)
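The "shortest code" rule can be sketched in Python roughly as follows. This is only an illustration: the mapping holds a handful of code pairs rather than the full ISO 639 tables, and the function names `shortest_language_code` and `make_tag` are made up for this example.

```python
from typing import Optional

# Tiny excerpt of three-letter to two-letter correspondences.
# A real list would be generated from the full ISO 639 code tables.
ISO_639_3_TO_1 = {
    "afr": "af",
    "deu": "de",
    "eng": "en",
    "fra": "fr",
    "nld": "nl",
    "zho": "zh",
}


def shortest_language_code(code: str) -> str:
    """Return the two-letter code if one is known, else the code unchanged."""
    code = code.strip().lower()
    return ISO_639_3_TO_1.get(code, code)


def make_tag(language_code: str, region: Optional[str] = None) -> str:
    """Build a simple language or language-region tag such as "nl-BE"."""
    language = shortest_language_code(language_code)
    if region:
        return f"{language}-{region.strip().upper()}"
    return language


print(make_tag("nld", "be"))  # Dutch as used in Belgium
print(make_tag("zho", "TW"))  # Chinese as used in Taiwan
print(make_tag("yue"))        # Cantonese has no two-letter code
```

Codes that are missing from the mapping pass through unchanged, which is what you want for languages that only have a three-letter code.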

Anyway, I now have a JSON file with over 400 languages in one of my GitHub repositories. See http://cstrobbe.github.io/languagelearning/misc/languagetags.json.

PS: For a JSON list of ISO 639-1 tags in alphabetical order, see languages.js on GitHub. These tags are not ordered by the number of native speakers of the corresponding languages. (And many languages covered by ISO 639-3 are not in ISO 639-1.)


I'll address the "number of native speakers" part:

Another option would be to scrape the data:

  • SIL maintains a page for each ISO 639-3 code (e.g. https://iso639-3.sil.org/code/afr for Afrikaans), which points to resources about the language. In particular, these pages point to MultiTree and Wikipedia pages, which feature estimates of the number of speakers (again, the figures come from Ethnologue/SIL). So you could write a scraper to fetch what you need.

(Any decent language-related resource will provide an ISO 639 language code on which to base your lookup.)
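As a starting point for such a scraper, here is a hedged Python sketch that only builds the SIL URL for a three-letter code and downloads the page. The function names and the User-Agent string are my own inventions, and I deliberately leave out the extraction of speaker figures, since that depends on the layout of the pages you end up following. Check the site's terms of use before fetching pages in bulk.

```python
import re
import urllib.request

# URL pattern taken from the Afrikaans example above.
SIL_CODE_URL = "https://iso639-3.sil.org/code/{code}"


def sil_code_url(code: str) -> str:
    """Return the SIL page URL for a three-letter ISO 639-3 code."""
    code = code.strip().lower()
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"not a three-letter ISO 639-3 code: {code!r}")
    return SIL_CODE_URL.format(code=code)


def fetch_code_page(code: str, timeout: float = 10.0) -> str:
    """Download the SIL page for a code and return its HTML as text."""
    request = urllib.request.Request(
        sil_code_url(code),
        headers={"User-Agent": "language-list-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


print(sil_code_url("afr"))
```

From the downloaded HTML you would then follow the outgoing links and pick out the speaker estimates with an HTML parser of your choice.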

Yet another option might be to answer a slightly different question, e.g. the number of Internet users per language, or the number of credit card users, depending on your objective.