EDITOR’S NOTE: Tomorrow, we’ll release a very special edition of This Week In Voice VIP, which will be made available to free and paid subscribers alike.
I've been looking so long, at these pictures of you,
that I almost believe that they're real.
I've been living so long, with my pictures of you,
that I almost believe that the pictures…
are all I can feel.
The world’s AI is only as good as its training models, and these training models are only as good as the sources of information used to train them.
Just two days ago, I documented a problem highlighting this (made available on This Week In Voice TV) where I asked Alexa and Google Assistant what should’ve been a simple question related to the US Presidential Election:
Many training models, as well as all manner of other language technology, rely on Wikipedia for information.
And Wikipedia has its limitations.
A discovery in Scotland, uncovered just last week, perfectly highlights the problems (and potential opportunities) present with trying to train AI.
The Scots (language spoken by 1.5 million people living in Scotland) version of Wikipedia was written almost entirely by one single person!
Want to know what’s worse?
That person doesn’t speak Scots at all!
Want to know why?
That one lone Wikipedia author…was an American teenager!
All of the errors and inaccuracies that are present in the Scots version of Wikipedia have multiplied exponentially throughout language models, translation technology, and other computing applications that depend on Wikipedia.
This one American teenager spent NEARLY AN ENTIRE DECADE writing these articles on the Scots version of Wikipedia, making the worst kind of errors and mistakes: consistent, repeated errors and mistakes.
These persistent mistakes and errors created a vicious cycle, causing the Reddit author who made this discovery to call this “the most damage done to the Scots language in history.”
A group in Scotland has now formed to begin the tall task of editing all of these Wikipedia articles, which will take years. They’ve already held their first “editathon.”
This episode has underscored the importance of having local communities which write and produce a lot of content online. By doing so, not only will the nuances of each language be preserved in proper fashion, but AI training models - which have far-reaching implications - will be fed accurate and useful data.
This also underscores the importance of groups like SIL International, a large non-profit which exists to support and preserve languages around the world, including what they call minority languages which have relatively few people speaking them, and are in danger of potentially vanishing altogether.
We’re in a whole new era, where language has an intertwined and deeply symbiotic relationship with technology.
Consequently, many of the most important new recruits at major technology companies are those with liberal arts degrees, such as history, linguistics, psychology, sociology, philosophy, and more.
If we want AI to get anywhere close to its full potential:
we need to watch our language.
If only I'd thought of the right words
I could have held on to your heart
If only I'd thought of the right words
I wouldn't be breaking apart
All my pictures of you
Subscribe to This Week In Voice VIP - the daily letter to the voice technology and AI communities.
Have a friend? Gift ‘em This Week In Voice VIP.
Discounts for This Week In Voice VIP subscribers:
DBW Global is the (virtual) gathering of the wide world of publishing, taking place September 14-16. Produced by Bradley Metrock. Use the promo code to save when you register here.
Project Voice 2021 is the #1 event for voice tech and AI in America. In person, in April 2021, in Chattanooga, Tennessee. Use the promo code to save when you register here.
Subscribe to This Week In Voice TV on YouTube!