- Uncontrolled digital inbreeding will destroy current AI models
- Training data used by large language models increasingly corrupted by “synthetic” data
- New AI training models needed – now
- Keep it real!
As we know from human history, when a gene pool becomes depleted, inbreeding occurs and congenital recessive defects, including blindness, schizophrenia and widespread imbecility, start to appear and then multiply. It seems it’s the same with AI.
Researchers in the US are warning that if AI training continues to eat itself by gorging and regorging on immense quantities of already-regurgitated data scraped, legally and increasingly illegally, from freely or cheaply accessible sources, it will get “progressively dumber” to the point that it inflicts “irreversible harm” on itself and “breaks the AI model” by generating gibberish, rendering itself useless.
To have any value, generative AI of the sort provided by the likes of ChatGPT must be up to date, accurate, reliable and true. Increasingly it isn’t, as training continues on data that is known to be dated, corrupted and incestuous rather than fresh and new. But because such data is basically free, quick and easy to access, AI is likely to keep feeding on it and degrade itself to the point of incoherence. To make matters worse, more and more inaccurate AI-generated content is being consumed as AI cannibalises itself and coughs up yet more meaningless rubbish.
As this “synthetic” training data (content generated or manipulated by AI) is compiled, along with everything else, into a set that is then scraped up by a different AI model, the contagion infects the new host, mutates and spreads onwards. The contagion is made worse because AI outfits, and the companies and agencies that use their products, are barely regulated: they are not required to disclose that their content includes synthetic input, nor are they compelled to watermark the AI content they pump out. This makes it all but impossible to keep synthetic content out of AI training sets. An article on Futurism.com, the US science and technology website, estimates that “as much as 90% of online content may be synthetically generated by 2026”.
In another very recent Futurism article, “AI Appears to Be Slowly Killing Itself” (published on August 27), journalist Maggie Harrison Dupré quotes Sina Alemohammad, a graduate student at Rice University in Houston, Texas, as saying, “The web is becoming an increasingly dangerous place to look for your data”, and it is. Alemohammad was a co-author of the paper that originated the term “model autophagy disorder” (or MAD, for short) as a descriptor of AI as Ouroboros, the snake that eternally eats its own tail.
Scientists at Rice University, and at Stanford University in California, are researching MAD and posit that AI models decline in the quality and diversity of their output without a constant stream of new, high-quality data. This develops into full-blown terminal autophagy when a model is trained solely on its own responses and, to all intents and purposes, degenerates into a state of imbecility. AI engines can also be trained on data produced by other AI models, which leads to the concept of “digital inbreeding”, whereby models train on their own output as well as on the output of others until what they generate inevitably becomes useless. At this point there is no way for users to reliably distinguish what is real from what is not.
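To make the idea concrete, here is a minimal, illustrative sketch in Python (an assumption-laden toy, not the Rice or Stanford teams’ actual methodology): a simple Gaussian “model” is fitted to a small batch of data and then repeatedly re-fitted to samples drawn from its own previous output. Because no fresh data ever enters the loop, the spread of its outputs collapses generation after generation, a toy analogue of MAD.

```python
import numpy as np

# Toy demonstration of "model autophagy": re-fit a Gaussian "model" to its own
# samples, generation after generation, with no fresh data entering the loop.
rng = np.random.default_rng(42)

SAMPLE_SIZE = 20  # deliberately small, so each generation loses information
real_data = rng.normal(loc=0.0, scale=1.0, size=SAMPLE_SIZE)  # the only "human" data
mean, std = real_data.mean(), real_data.std()

for generation in range(1, 41):
    # pure self-consumption: train only on the previous generation's output
    synthetic = rng.normal(loc=mean, scale=std, size=SAMPLE_SIZE)
    mean, std = synthetic.mean(), synthetic.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: spread (std) of outputs = {std:.3f}")
```

On a typical run, the printed spread shrinks from roughly 1.0 to a small fraction of that value: the model’s outputs become ever narrower and less diverse. Mixing fresh, human-generated data into each generation slows or prevents the collapse.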
Jathan Sadowski, a senior lecturer in the Faculty of Information Technology at Monash University in Melbourne, Australia, has been researching and critiquing the rise of AI. In an article in the journal Nature, he wrote that training large language models on data created by other models results, inevitably, in “irreversible defects in the resulting models”. Calling it “Hapsburg AI”, he compared the phenomenon to the inbreeding that afflicted multiple generations of one of Europe’s infamously inbred royal families. The over-developed lower lips and enlarged tongues of the Hapsburg dynasty became so pronounced that not only was it almost impossible to understand what a Hapsburg emperor was saying, but he also sprayed his fellow diners with half-masticated food every time he tried to speak. The royal chins and noses were also distorted by inbreeding. Ultimately, it put an end to the family line. Maybe the same will happen to Hapsburg AI?
Such a scenario may be possible, and could even become likely, unless action is taken now to strengthen the AI-model gene pool. Evidently, the solution will lie in how AI models are trained, and that means providing access to high-quality, verifiable data that is generated by actual human beings – and that will cost money that companies won’t want to pay. Therefore, regulation and guidelines need to be drawn up now, before it’s too late. Indeed, it may already be too late, but we have to at least try.
Furthermore, AI companies and their users need to be much more transparent about what they are doing. One way would be to disclose the sources of training data, the details of the training systems used and, above all, where the content comes from in the first place. Methods should be put in place to ensure AI-generated data isn’t recycled and reused, and no AI system will have integrity unless it can be proven that its output is not being passed on to other models as training data.
And, most important of all, some way must be found whereby AI-trained models and systems can periodically be stopped, stripped of old, corrupted and duplicated datasets, and completely reset with clean data. It will be a very complex set of tasks, but it will be the only way to ensure that, in the end, AI staves off its own collapse, as well as the chaos that would follow such an eventuality.
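As a very rough illustration of the kind of hygiene step that might be involved (a hypothetical sketch, not any vendor’s actual pipeline), the snippet below drops records flagged as AI-generated and removes exact duplicates from a toy corpus. The record format and the “ai_generated” label are assumptions made purely for the example.

```python
import hashlib

def clean_corpus(records):
    """Drop records declared AI-generated and remove exact duplicate texts."""
    seen_hashes = set()
    cleaned = []
    for record in records:
        if record.get("source") == "ai_generated":
            continue  # discard content flagged as synthetic
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # discard an exact duplicate of text already kept
        seen_hashes.add(digest)
        cleaned.append(record)
    return cleaned

corpus = [
    {"text": "An original, human-written sentence.", "source": "human"},
    {"text": "An original, human-written sentence.", "source": "human"},   # exact duplicate
    {"text": "Regurgitated model output.", "source": "ai_generated"},      # declared synthetic
]

print(clean_corpus(corpus))  # only the first record survives
```

Real pipelines would, of course, need trustworthy provenance metadata and far more sophisticated detection of synthetic and near-duplicate content than this, which is precisely why regulation and disclosure matter.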
– Martyn Warwick, Editor in Chief, TelecomTV