The English language is a lot more French than we thought, here’s why
- by 7wData
DISCLAIMER: I am not a linguist, and I personally do not have an opinion on the classification of English. This article only examines the statistics behind English vocabulary, as no such data is currently available.
The English language and its origins have been a topic of fierce debate among linguists. English is classified as a (West) Germanic language, meaning that it is related to other Germanic languages such as Swedish, Dutch and German. The other dominant language family in Western Europe is the group of Romance languages: French, Italian, Spanish… all languages that sprouted from Latin at some point in history.
Unlike other Germanic languages, English shares a large portion of its vocabulary with French and Latin, which is often attributed to the period of Norman French dominance in England after 1066. The size of this Romance influence on English, along with some other technical aspects such as pronunciation and syntax, has led some radical linguists to believe that English should in fact not be seen as a Germanic language, but rather as a Romance-Germanic hybrid. However, the general consensus is that about a third of the English vocabulary as a whole is of Old English (so, Germanic) origin, but that the core vocabulary is entirely Old English. The keyword here is core: most linguists claim that French and Latin influence accounts for only a handful of basic words but for the vast majority of academic terms. For many, this seems to be the most important criterion for classifying English as a Germanic language.
I personally don’t care much about these classifications, but I was very surprised to discover that no one has recently bothered to research the origins of English vocabulary, let alone its core! The latest research was done in 1975 by Joseph M. Williams, who examined the 10,000 most frequently used words in English, based on a rather small sample of corporate letters. Here are my issues with his research:
And core vocabulary is precisely what this whole debate is all about, so I decided to do my own little research using Python to see how I could provide some statistics behind these claims!
The Oxford Dictionary claims that there are roughly 250,000 distinct words in the English vocabulary. But what share of that represents the core vocabulary? What does that even mean? The Oxford Dictionary uses the following table to relate the most common words in English to how often they appear in English sources:
This table shows us a rather large problem: the actual occurrence of words in applied English does not reflect the (core) vocabulary or even the language as a whole. About 50% of any given text in English consists of the exact same 100 linkers and pronouns, even though those 100 words represent only 0.04% of the distinct English vocabulary (100 out of roughly 250,000). A word such as “the” alone makes up 6% of any given source in English. This disproportionate use of extremely basic structural words deceives the reader into thinking that the English vocabulary has an entirely different etymological composition than it actually does.
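To make the gap between word occurrences and distinct vocabulary concrete, here is a minimal sketch in Python. It is not the script behind this article: the `coverage` function, its simple tokenizer and the sample sentence are made up purely for illustration, and only the 100 and 250,000 figures come from the text above.

```python
import re
from collections import Counter


def coverage(text, top_n):
    """Return (token_share, type_share) for the top_n most frequent words.

    token_share: fraction of all word occurrences covered by those words.
    type_share:  fraction of the distinct vocabulary those words represent.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    counts = Counter(tokens)
    top = counts.most_common(top_n)
    token_share = sum(count for _, count in top) / len(tokens)
    type_share = len(top) / len(counts)
    return token_share, type_share


# Figures quoted above: 100 words out of roughly 250,000 distinct words.
share_of_vocabulary = 100 / 250_000
print(f"{share_of_vocabulary:.2%} of distinct vocabulary")

# Toy sample (hypothetical), just to show the two shares diverging.
sample = "the cat sat on the mat and the dog sat on the rug"
token_share, type_share = coverage(sample, top_n=2)
print(f"top 2 words: {token_share:.0%} of occurrences, "
      f"{type_share:.0%} of distinct words")
```

Even in this tiny sample, the two most frequent words cover a much larger share of the occurrences than of the distinct words, which is the same effect the table shows for English as a whole.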