Paper published in the AI & Society academic journal. Vector space models (VSMs) are currently the most popular machine-learning technology for natural language processing. While current vector space models are far from being able to capture all nuances of a language or culture, their geometry could represent early attempts to grasp a universal pattern in the human mind. The extensive use of VSMs in knowledge management raises some important questions. What exactly is the nature of their geometry, and how universal is it?
The Overall Issue: Cultural Homogenization
Is the world going through a cultural homogenization? Opinions diverge on this question, but here are a few indicators of homogenization on some cultural levels:
- About half of the world’s spoken languages will disappear by the end of the century. (1)
- One-fifth of total worldwide film revenues in 2016 were generated by 13 movies, with the revenue margins of the top 1% of movies growing ever since 2007. (2)
- Global academic performance measures such as the h-index and Shanghai ranking are gaining popularity, with institutions in English-speaking countries monopolizing the top of the rankings. Their influence over institutional behaviour and public policy is growing, “with many universities devoting considerable attention to trying to increase their position in the rankings.” (3)
While cultural homogenization has its benefits, such as facilitating international cooperation and achieving so-called higher international standards, it also makes it harder to think differently and to put prevailing norms and customs into context. In its most extreme form, cultural homogenization leads to ideological alienation and the loss of personal agency.
With cultural homogenization comes the myth of a “natural” and universal culture, “the loss of the historical quality of things: in it things lose the memory that they were once made.” (4)
A Risk to Address: The Hegemony of a Few Algorithms
Digital technologies play a central role in knowledge management and its dissemination.
According to an international Reuters Institute study, more than half of its respondents (54%) prefer paths that use algorithms to select news stories rather than editors or journalists (44%). 25% of respondents have a preference for searches (Google etc.), and 23% for social media (Facebook and other platforms) to access news.
According to another study from Pew Research Center, 94% of U.S. students use a search engine for a typical research assignment.
Digital technologies are increasingly complex and concentrated in a few hands
Both searches and social media are provided by a few competing businesses and algorithms. Those lacking sufficient scale barely get noticed and can’t sustain investments in new technologies such as A.I. and machine learning. Technology constitutes an increasingly high barrier to entry in knowledge management.
Of the roughly 6 billion daily search queries on Google (5), at least 15% are using A.I. (6). News feeds provided to the 2 billion Facebook users (7) rely on machine-learning technologies such as DeepText (8).
All major tech companies invest massively in machine-learning technologies, and by doing so secure their leadership position. Efforts are being made to make their technology accessible through open-source projects such as fastText and TensorFlow, even though this is arguably a mere strategy to increase their influence among developers.
The Internet long tail promise was naive
The initial hope was that digital technologies would encourage cultural diversification by making it cheaper to produce, publish, and distribute cultural assets to niche markets (Internet long tail).
In some industries, however, it led to the polarization of cultural assets, either in the category of niche and barely profitable cultural productions, or of even bigger blockbusters at the expense of medium-sized cultural projects. “Consumers generally favour whatever they find on their mobile screens or at the top of their search results. The tail is indeed long, but it is very skinny.” (2)
Whether the use of digital technologies in knowledge management (search engines, news feeds, etc.) is responsible for cultural homogenization or not, the fact that the vast majority of people only use a few technological solutions to access knowledge creates a major risk.
If the knowledge management business remains an oligopoly, particular attention should at least be paid to guaranteeing that the few tools in use don’t make any assumptions about how people ought to think, based on a predominant culture or supposedly universal logic governing the human mind.
Machine Learning and its Potential Bias
New machine-learning algorithms are being developed to analyze large corpora of text. They do so by representing each word with a vector in spaces that can have hundreds of dimensions: Vector Space Models.
The more often two words appear together in texts, the closer their vectors are brought to each other. The resulting vector space models (VSMs) are not only statistically significant, but also have, according to a well-established hypothesis, a semantic value.
Distributional hypothesis in machine learning: words appearing in the same context share semantic meaning.
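The distributional hypothesis can be illustrated with a minimal count-based sketch. It is not the method used in the paper or in production systems, which typically learn dense vectors with neural models; the toy corpus, the window size, and the function names below are assumptions made for illustration. Each word is represented by the counts of the words that surround it, and two words are compared with cosine similarity.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for this example.
corpus = [
    "the senator defended the budget in parliament",
    "the minister defended the budget in parliament",
    "the painter exhibited the canvas in a gallery",
]

vectors = cooccurrence_vectors(corpus)
sim_politics = cosine(vectors["senator"], vectors["minister"])
sim_mixed = cosine(vectors["senator"], vectors["painter"])
print(sim_politics > sim_mixed)  # words sharing more context are closer
```

In this toy corpus, "senator" and "minister" occur in identical contexts and therefore end up closer to each other than "senator" and "painter". Nothing in the computation refers to meaning; the semantic reading of the distance is exactly what the hypothesis asserts.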
Thanks to their geometry, vector space models allow machines to classify documents by meaning and to understand natural language better than ever before. They are used in search engines and automated news feeds to curate information based on a user query or profile, and are thus in a position to influence people’s perception of reality.
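How a vector space can curate information for a query can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any real search engine or news feed: the three-dimensional word vectors are hand-written and hypothetical, and documents are represented by the plain average of their word vectors.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-written for illustration only.
WORD_VECTORS = {
    "election": (0.9, 0.1, 0.0),
    "vote": (0.8, 0.2, 0.1),
    "parliament": (0.9, 0.0, 0.1),
    "painting": (0.1, 0.9, 0.0),
    "gallery": (0.0, 0.8, 0.2),
    "museum": (0.1, 0.8, 0.1),
}


def embed(text):
    """Average the vectors of the known words in `text`; unknown words are skipped."""
    known = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not known:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Order documents from most to least similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)


documents = [
    "museum opens new painting gallery",
    "parliament schedules election vote",
]
ranked = rank("vote", documents)
print(ranked[0])
```

The ordering depends entirely on the vectors supplied. Swapping in a vector space trained on a different corpus could rank the same documents differently, which is why the choice of model matters for what a reader ends up seeing.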
To illustrate the power of vector space models, we have developed a simple application: political footprints, a class of vector space models that deal more specifically with political discourse.
Based on our investigation, published in the AI & Society academic journal, we conclude that while current vector space models are far from being able to capture all nuances of a language or culture, their geometry could represent early attempts to grasp a universal pattern in the human mind, one that would come with its own structure and logic. This leads us to state that an additional hypothesis is at play in machine-learning technologies: the structuralist hypothesis.
Structuralist hypothesis in machine learning: all languages and human thoughts share the same hidden structure.
Based on this definition, we identify three possible positions in machine learning and more broadly knowledge management:
Structuralism: the belief, embedded in algorithms and knowledge-management tools, that languages, cultures, and the human mind share a unique underlying structure. Not surprisingly, all tech giants are in the race to discover what could be an invariant structure of the human mind, to “capture the deeper semantic meaning of words.” (8)
Hybrid structuralism: despite recent progress, the discovery of a hypothetical invariant structure of the human mind is still a distant prospect. Hybrid structuralism is the belief that a finite set of geometries is at least discoverable at some level (domain-, language-, or time-specific corpora). In this scenario, machine-learning technologies would first have to identify which structure a system of thoughts belongs to, and then apply the corresponding model to make sense of it.
Post-structuralism: the belief that no universal structure of the human mind exists. Under this hypothesis, attempts to find a limited set of geometric patterns in statistically derived VSMs would prove to be futile, and research should focus instead on understanding how semantic models not sharing any composition rules can interact with one another.
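One way to make the question of a shared geometry concrete is to compare how two vector spaces arrange the same words. The sketch below is an assumption of ours for illustration and not the method of the paper: it computes all pairwise cosine similarities in each space and correlates the two lists, so that a score near 1 suggests the same arrangement up to rotation. The two-dimensional spaces and the helper names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(space, words):
    """Cosine similarity for every unordered pair of words, in a fixed order."""
    return [cosine(space[a], space[b]) for a, b in combinations(words, 2)]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rotate(v, angle):
    """Rotate a 2-dimensional vector by `angle` radians."""
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


words = ["state", "law", "art", "music"]

# Hypothetical 2-dimensional spaces, invented for this example.
space_a = {"state": (1.0, 0.0), "law": (0.9, 0.3), "art": (0.1, 1.0), "music": (0.0, 0.9)}
# Same arrangement as space_a, expressed on rotated axes.
space_b = {w: rotate(v, math.pi / 3) for w, v in space_a.items()}
# A different arrangement: "state" sits near "art", "law" near "music".
space_c = {"state": (1.0, 0.0), "law": (0.0, 1.0), "art": (0.9, 0.2), "music": (0.1, 0.9)}

sims_a = pairwise_similarities(space_a, words)
score_ab = pearson(sims_a, pairwise_similarities(space_b, words))
score_ac = pearson(sims_a, pairwise_similarities(space_c, words))
print(score_ab > score_ac)
```

Under a strictly structuralist reading, spaces derived from different languages or cultures would be expected to score high against one another; under a post-structuralist reading, low scores would be the norm rather than an error to correct. The score is only a proxy, and a high value on a small vocabulary would not by itself validate the hypothesis.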
Take a Clear Stand on the Distributional and Structuralist Hypotheses
We believe that there is a significant risk that cultural agents and knowledge operators will not wait for a formal validation that a shared structure of the human mind exists, and will instead use commercial success as the only criterion to establish the validity of the distributional and structuralist hypotheses.
This would in effect make structuralism a self-fulfilling hypothesis, transform a false universal logic into a myth, and lead to cultural homogenization wherever these technologies are in use: academic research, journalism, art, and entertainment to list just a few disciplines.
The purpose of this report is not to demonstrate whether the structuralist hypothesis is true, nor to argue whether the risk of assuming its validity is worth taking, but to acknowledge that this risk exists.
In our view, the fact that the technology and its popular applications are owned by only a handful of knowledge operators – a reality that won’t change anytime soon – should bring with it additional responsibilities: a very careful and inclusive approach to knowledge management, what we call a conciliation of a monopoly (or in this case, oligopoly).
Our recommendation is for all involved parties, including knowledge operators and cultural and government institutions, to engage actively in debating whether the distributional and structuralist hypotheses should be formulated, what their scope should be, and how to mitigate the inherent risks of cultural homogenization.
Christophe Bruchansky, Plural’s founder, author.
Niel Chah, University of Toronto, code reviewer.
Araz Taeihagh, Assistant professor of public policy (Singapore Management University), paper reviewer.
Thanks to the AI & Society editors for their in-depth paper review and feedback.
machine learning, structuralism, post-structuralism, natural language processing, culture, political discourse, linguistics, knowledge, semantics, human mind, artificial intelligence
Disclaimer – please use the following reference when using this report: Artificial intelligence – Dealing with Diversity of Meaning, Sept. 2017.
(1) The world’s languages, in 7 maps and charts (Washington Post, April 2015). https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts
(2) Mass entertainment in the digital age is still about blockbusters, not endless choice (The Economist, Feb. 2017). https://www.economist.com/news/special-report/21716467-technology-has-given-billions-people-access-vast-range-entertainment-gady
(3) University rankings gain influence, despite obvious drawbacks (University Affairs, Sep. 2013). http://www.universityaffairs.ca/news/news-article/university-rankings-gain-influence-despite-obvious-drawbacks/
(4) Mythologies, Roland Barthes, p.142 (Hill and Wang, 1972).
(5) Search Engine Statistics 2017 (Smart Insights, April 2017). http://www.smartinsights.com/search-engine-marketing/search-engine-statistics/
(6) Google Turning Its Lucrative Web Search Over to AI Machines (Bloomberg, Oct. 2015). https://www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines
(7) Facebook now has 2 billion monthly users… and responsibility (TechCrunch, June 2017). https://techcrunch.com/2017/06/27/facebook-2-billion-users/
(8) Introducing DeepText: Facebook’s text understanding engine (Facebook code blog, June 2016). https://code.facebook.com/posts/181565595577955/introducing-deeptext-facebook-s-text-understanding-engine/