MAIN PROJECT

Indicators for the Presence of Languages in the Internet

Project Summary

Until recently, the most consulted source of statistics on the use of languages online relied on algorithms that analyse the most visited websites. While these statistics offer some interesting insights, they may not accurately reflect the presence of languages online: they do not account for the fact that websites are often highly multilingual, which introduces significant biases.

In 2017, the Observatory of Linguistic and Cultural Diversity in the Internet devised a new approach that could help to better follow the progress and prevalence of languages online. Using this approach, we were able to identify meaningful indicators outlining the presence of 343 languages on the Internet.

Highlights from Latest Results

ISO  Language          % Internauts  % L1+L2 speakers  % Connected speakers  % Contents  Virtual presence  Content productivity
eng  English           15,79%        14,13%            70,86%                20,42%      1,45              1,29
zho  Chinese Macro     17,41%        14,48%            76,27%                18,88%      1,30              1,08
spa  Spanish           6,62%         5,22%             80,46%                7,70%       1,48              1,16
hin  Hindi             4,34%         5,68%             48,48%                3,82%       0,67              0,88
rus  Russian           3,28%         2,38%             87,42%                3,73%       1,57              1,14
ara  Arabic Macro      4,37%         4,08%             67,81%                3,65%       0,89              0,84
fra  French            3,05%         2,91%             66,58%                3,41%       1,18              1,12
por  Portuguese        2,89%         2,46%             74,42%                3,09%       1,25              1,07
jpn  Japanese          1,54%         1,15%             84,98%                2,20%       1,91              1,42
deu  German, Standard  1,80%         1,25%             91,21%                2,15%       1,72              1,20
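
The last two columns of the table appear consistent, within rounding, with simple quotients of the percentage columns: virtual presence as % contents divided by % L1+L2 speakers, and content productivity as % contents divided by % internauts. This reading is inferred from the published figures and is not stated on this page, so the sketch below is only a consistency check. The function names are illustrative.

```python
def virtual_presence(pct_contents: float, pct_speakers: float) -> float:
    """Share of content relative to share of L1+L2 speakers (assumed definition)."""
    return pct_contents / pct_speakers


def content_productivity(pct_contents: float, pct_internauts: float) -> float:
    """Share of content relative to share of internauts (assumed definition)."""
    return pct_contents / pct_internauts


# Figures for English, copied from the table above (percent values).
eng_internauts, eng_speakers, eng_contents = 15.79, 14.13, 20.42

vp = virtual_presence(eng_contents, eng_speakers)
cp = content_productivity(eng_contents, eng_internauts)
print(f"virtual presence = {vp:.2f}, content productivity = {cp:.2f}")
# prints: virtual presence = 1.45, content productivity = 1.29
```

For a few rows the quotient of the published percentages differs from the published ratio by 0,01 (French gives 1,17 against a published 1,18), presumably because the percentages shown are themselves rounded.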

Methodology

The observatory’s new approach consists in indirectly approximating the relative amount of Web content per language. In doing so, it also considers crucial factors that are often ignored when describing a language’s Internet presence, but that should be considered to prevent errors or biases.

Firstly, the team considers the likely existence of an ‘economic law’ relating to online communication, which links the offer (i.e., Web content available in a language) with the demand (i.e., the number of speakers of that language who are connected to the Internet). Past findings suggest that the more speakers of a given language are connected to the Internet, the more webpages in that language tend to exist.

In addition, past research suggests that Internet users often prefer to communicate in their mother tongue when content is available in that language, yet they are happy to use their second language or languages in the absence of this content. In some cases, Internet users might also create content in their second language for economic reasons and could use translation services to do so.

A language’s presence online is also linked to the amount of Internet traffic in different places, the number of subscriptions to social networks, and the progress of different countries in terms of Internet-related services for citizens. The indicators of Internet presence created by the researchers collectively consider all these factors, thus painting a more detailed picture of how much and in what ways different languages exist online.

Cybergeography of Language Families

An analysis of the linguistic evolution of the Internet from a geographic perspective.

Cyber-Globalization Index (CGI)

The Cyber-Globalization Index is a strategic indicator of the future of a language in the Internet. It is defined as:

CGI(L) = (L1 + L2)/L1 (L) × S(L) × C(L), where

(L1+L2)/L1 (L) is the rate of multilingualism of language L

S(L) is the percentage of countries having speakers of language L

C(L) is the percentage of speakers of language L connected to the Internet
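
The definition above can be sketched as a small function. This is a minimal illustration, not the Observatory's implementation: the function name, the parameter names, the choice to express S(L) and C(L) as fractions between 0 and 1 rather than as percentages, and the sample figures are all assumptions made for the example.

```python
def cyber_globalization_index(
    l1_speakers: float,
    l2_speakers: float,
    country_share: float,
    connected_share: float,
) -> float:
    """CGI(L) = (L1 + L2)/L1 x S(L) x C(L).

    country_share (S) and connected_share (C) are assumed to be
    fractions in [0, 1]; speaker counts may be in any common unit.
    """
    if l1_speakers <= 0:
        raise ValueError("l1_speakers must be positive")
    multilingualism_rate = (l1_speakers + l2_speakers) / l1_speakers
    return multilingualism_rate * country_share * connected_share


# Hypothetical language: 80M L1 speakers, 40M L2 speakers,
# spoken in 10% of countries, 75% of its speakers connected.
cgi = cyber_globalization_index(80.0, 40.0, 0.10, 0.75)
print(f"{cgi:.4f}")  # prints 0.1125
```

With these made-up figures the rate of multilingualism is 1.5, so the index is 1.5 × 0.10 × 0.75. A language with no L2 speakers has a rate of exactly 1 and its index reduces to S(L) × C(L).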

Release Notes & Previous Versions

For a clearer picture of the method without reading the published articles, and independently of the latest figures, see Version 3.0, where particular effort went into explaining the method and presenting the results.

Version 5.1 (April 2024)

Updated in this Version

1) Ethnologue Dataset #27 of March 2024 has been used for demolinguistic figures. The Digital Support Indicator provided by Ethnologue as part of this database has also been updated, as have the ITU figures for the percentage of individuals connected to the Internet per country.

2) 19 new languages reaching the threshold of 1M L1 speakers have been added to the model, for a new total of 361 languages:

Malay, Ambonese abs
Bulu bum
Bangala bxg
Efik efi
Basque eus
Gbaya Macro gba
Irish gle
Ghanaian Pidgin English gpe
Iban iba
Krio kri
Liberian English lir
Indonesian, Makassar mfp
Saxon, Low nds
Malay, Papuan pmy
Guinea-Bissau Creole pov
Rakhine rki
Sango sag
Scots sco
Tok Pisin tpi

3) The changes in the results of the model are few.

  • In terms of contents, English slightly consolidates its first position over Chinese.
  • Hindi takes the lead among the languages contending for 4th position, leaving Arabic behind Russian and ahead of French and Portuguese.

Version 4.0 (May 2023)

Updates on Methodology in this Version

1) In the integration of Ethnologue data, Arabic Standard (arb) is no longer counted as a second L1 in any of the countries concerned, except Saudi Arabia. The rationale is that a founding principle of the model is that there is only one L1, and that the macro-language ara cannot count the same population twice as L2.

2) Concerning the inclusion of the Digital Support Indicator (DLS) from the source Assessing Digital Language Support on a Global Scale, the indicator is set for each language. This raises the issue of how to manage the macro-languages. The decision made was to attribute to each macro-language the highest indicator among the languages that belong to that macro-language.

3) The interface indicator of the model is now computed as half the sum of the previous indicator and the DLS (which has a value between 0 and 1), with the results then renormalized to 100%. This addition reduces the bias of that indicator by potentially raising the weight of many languages that were absent from application interfaces or translation programs and were therefore given zero weight. For the remaining languages it does not induce notable changes.
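
Points 2) and 3) can be sketched as follows. This is an illustration under stated assumptions, not the model's actual code: the function names and the language keys are hypothetical, the sample values are invented, and the previous interface indicator is assumed to be on the same 0 to 1 scale as the DLS before the two are averaged.

```python
def macro_language_dls(member_dls: dict[str, float]) -> float:
    """DLS of a macro-language: the highest DLS among its member languages."""
    return max(member_dls.values())


def updated_interface_indicator(
    previous: dict[str, float], dls: dict[str, float]
) -> dict[str, float]:
    """Average the previous indicator with the DLS, then renormalize to 100%."""
    raw = {lang: (previous[lang] + dls[lang]) / 2 for lang in previous}
    total = sum(raw.values())
    return {lang: 100 * value / total for lang, value in raw.items()}


# Hypothetical values: lang_a was absent from interfaces (weight 0)
# but has some digital support according to the DLS.
previous = {"lang_a": 0.0, "lang_b": 0.5, "lang_c": 1.0}
dls = {"lang_a": 0.4, "lang_b": 0.5, "lang_c": 1.0}

shares = updated_interface_indicator(previous, dls)
for lang, share in shares.items():
    print(f"{lang}: {share:.2f}%")
```

In this made-up example lang_a, which previously had zero weight, ends up with a non-zero share, while the ordering of the other two languages is unchanged.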

Version 3.2 (April 2023)

Updated ITU % of Connected Persons per Country

Summary

  • The percentage of connected persons worldwide has grown from 64% to 67% in one year
  • ITU has resumed providing estimates for countries where the government publishes no official data
  • Many important changes in connectivity figures per country, with some strong growth or drops
  • Practically no changes for the first languages
  • Strong growth of connectivity in Africa drives an increase of more than 10% for African languages
  • Signs of progress among the less connected are starting to appear: French progresses thanks to Africa, together with African languages; Asian languages keep progressing, except Chinese
  • Arabic growth has stopped

Version 3.1 (August 2022)

Updated World Bank % of persons connected per country; includes comparison with V3.c

Version 3.c (August 2022)

Correction of a bug in V3, with marginal impact

Version 3.0 (March 2022)

Model redesign, reaching final version, all biases controlled.

Summary

More than a new version, this marks the method reaching maturity: all the biases are now controlled to an acceptable threshold, and the indicators produced are reliable within a ±20% confidence interval.

The Observatory is pleased to share the results of version 3 of its model for computing indicators of the presence of languages on the Internet. As with version 2, announced in 2021, it processes the 329 languages with over one million native speakers.

A confidence interval of ±20% may seem wide by the standards of other statistical work, but for data on the place of languages on the Internet, a subject that has always been very difficult to measure and is prone to chronic misinformation, this is a feat.

All the results are available under the CC-BY-SA 4.0 license.

Version 2.0 (2021)

Model improvement in bias control, extending coverage to 329 languages

Summary

February 2021 sees the start of a project to measure Portuguese on the Internet and compare it with other languages. It is coordinated by the UNESCO Chair on public policies for multilingualism and carried out by the Observatory of Linguistic and Cultural Diversity in the Internet, within the framework of the International Institute of the Portuguese Language and with the support of the Cultural and Educational Department of the Brazilian Ministry of Foreign Affairs. First results will be produced by May 2021 and the full products by August 2021.

The study will bring some notable improvements:
– Use of the latest Ethnologue Global Dataset for demo-linguistic data
– Processing of L2 speakers by country instead of globally
– Update and extension of the language and country indicators
– Extension of language coverage

Version 1.2 (2019)

Offers a comparison of the results for 2015, 2016 and 2017, using the 2017 version of the model with ITU data from the earlier years.

Methodological Notes

1 – Only the ITU data has been updated in 2016 and 2017
2 – A full comparison would require updating the demo-linguistic data and the various micro-indicators of the presence of languages or countries
3 – However, the updated data are those with the greatest impact within the model and therefore offer a credible indication of trends
4 – It is important to understand that the percentages of increase or decrease are not absolute but relative to the rest of the languages
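
The last note can be illustrated with a small calculation. The figures and language names below are invented for the example and do not come from the model: they only show how a language can grow in absolute terms while its share, measured against all other languages, falls.

```python
def shares(counts: dict[str, float]) -> dict[str, float]:
    """Convert absolute counts into percentage shares of the total."""
    total = sum(counts.values())
    return {lang: 100 * value / total for lang, value in counts.items()}


# Hypothetical connected speakers (millions) in two successive years.
year_1 = {"lang_a": 100.0, "lang_b": 50.0}
year_2 = {"lang_a": 110.0, "lang_b": 90.0}

share_1 = shares(year_1)
share_2 = shares(year_2)

for lang in year_1:
    absolute_change = year_2[lang] - year_1[lang]
    relative_change = share_2[lang] - share_1[lang]
    print(f"{lang}: {absolute_change:+.0f}M speakers, {relative_change:+.2f} points of share")
```

Here lang_a gains 10 million connected speakers yet loses share, because lang_b grows faster over the same period.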

Summary of Results

Among the most powerful languages, developments are slow, although there is a clear difference between:
– languages that are progressing very strongly: Hindi and Malay
– languages that are progressing strongly: Korean, Urdu, Arabic and Portuguese
– languages that continue their steady progression: Spanish and Polish
– languages in steady decline: Japanese, Russian and Chinese
– languages in strong decline: German, French, Italian and, to a lesser extent, English.
Note that Arabic moves ahead of Japanese, and Urdu moves ahead of Polish and Korean.

The strongest progressions are those of African and Asian languages, followed by Kabyle, Arabic, Turkish and Armenian, also in strong progression.
Some European languages follow, such as Romanian, Ukrainian, Portuguese, Albanian and Spanish, in the middle of the ranking with stable progression.
Polish is the last language showing weak progression; after it come the first slight declines, of Russian and Chinese, followed by Hebrew and Swedish.
Most Western languages logically show a relative decline as a consequence of the saturation of connections (90% of people connected).
English continues a steady decline, and French even more so, a sign that Francophone Africa is slow in its fight against the digital divide.
At the end of the ranking comes a strong decline of local languages from Asian or African countries (often French-speaking) that remain stuck in the digital divide.

Version 1.0 (2017)

Start of new method for 129 languages.

Summary

The Observatory measured the space of Latin languages, English and German in the Internet between 1997 and 2007.

After 10 years of eclipse, caused by the evolution of search engines, we are back, thanks to the support of the International Organization of Francophonie and with MAAYA, with a new method to produce indicators for the 140 languages with more than 5 million speakers.

Projects by OBDILCI

  • Indicators for the Presence of Language in the Internet
  • The Languages of France in the Internet
  • French in the Internet
  • Portuguese in the Internet
  • Spanish in the Internet
  • AI and Multilingualism
  • DILINET
  • Pre-historic Projects…