Artificial intelligence (AI) experts are predicting the end of the generative AI hype and warning of a potential “model collapse.” But how likely are these predictions to come true, and what exactly is model collapse?
First discussed in 2023 and gaining traction more recently, “model collapse” refers to a hypothetical scenario where future AI systems become progressively less intelligent due to the growing abundance of AI-generated data on the internet.
Modern AI systems rely on machine learning, where programmers establish the mathematical framework, but the actual intelligence comes from training the system to recognize patterns in data. However, not just any data will do. The current generation of generative AI systems requires high-quality data in large quantities.
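To make that distinction concrete, here is a minimal sketch in Python (the numbers are toy data invented for illustration): the programmer supplies only the mathematical form, and everything the system “knows” about the pattern is extracted from the data.

```python
# A minimal sketch of the point above: the programmer fixes the mathematical
# form (here, a straight line), and the parameters that make the system
# useful are learned entirely from the data. Toy numbers, for illustration.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 2x + 1, with noise

# "Training" here is a least-squares fit that chooses the slope and
# intercept; no one programs the pattern in directly.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"learned pattern: y = {slope:.2f}x + {intercept:.2f}")
```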
To obtain this data, tech giants like OpenAI, Google, Meta, and Nvidia continuously scour the internet, collecting terabytes of content to feed their machines. Since the emergence of widely available and useful generative AI systems in 2022, people have been increasingly uploading and sharing content created in part or entirely by AI.
In 2023, researchers began exploring whether AI systems could be trained solely on AI-generated data, rather than human-generated data. There are significant incentives to make this work: AI-made content is cheaper to source than human data, and collecting it in large quantities raises fewer ethical and legal concerns.
However, researchers discovered that AI systems trained solely on AI-generated data become less intelligent with each successive generation, resembling a digital version of inbreeding. This “regurgitive training” leads to a decline in both the quality and the diversity of the AI’s behavior. Quality here means being helpful, harmless, and honest; diversity means the range of responses and the breadth of cultural and social perspectives represented in AI outputs.
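A toy simulation can make this dynamic concrete. The sketch below is a deliberately simplified stand-in, not a model of any real AI system: each “generation” fits a simple statistical model to its training data, and the next generation is trained only on samples drawn from that fit. With finite samples the estimation errors compound, and the spread of the data, standing in for output diversity, tends to collapse.

```python
# A toy illustration of regurgitive training: each generation is "trained"
# (a Gaussian fit) on data sampled purely from the previous generation's
# model. Estimation error compounds, and the spread of the distribution,
# a stand-in for output diversity, tends to shrink toward zero. The exact
# path is random, but with small samples the collapse is typically visible
# within tens of generations.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human data" from a wide distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(51):
    mu, sigma = data.mean(), data.std()  # "train" on the current data
    if generation % 10 == 0:
        print(f"generation {generation:2d}: diversity (std) = {sigma:.3f}")
    # "Publish, re-scrape, retrain": the next generation sees only
    # synthetic samples drawn from the fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=20)
```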
In essence, by relying heavily on AI systems, we risk polluting the very data source necessary to make them useful in the first place.
Some may wonder whether tech companies can simply filter out AI-generated content, but this is not a viable solution. These companies already invest significant time and money in cleaning and filtering the data they collect, discarding up to 90% of it before training. As the proportion of AI-generated content online grows, so will the cost and difficulty of filtering it out, and telling AI-generated content apart from human-generated content is likely to get harder, not easier, over time.
Research indicates that we cannot completely eliminate the need for human data. After all, human data is where the “I” in AI ultimately comes from.
There are already signs that developers are having to work harder to source high-quality data. Even so, the possibility of catastrophic model collapse may be overstated: most research to date examines cases where synthetic data wholly replaces human data, whereas in practice human and AI data are likely to accumulate in parallel, which reduces the risk of collapse.
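The toy simulation from earlier changes character under this accumulate-rather-than-replace assumption: if each generation keeps the old data and keeps receiving fresh human data alongside the synthetic additions, the pool stays anchored instead of collapsing. Again, this is a sketch under stated assumptions, not a claim about any production system.

```python
# The same toy setup as before, but data accumulates: each generation adds
# synthetic samples AND fresh human data to the existing pool instead of
# replacing it. The human data anchors the distribution, so its spread
# stays near the original value rather than collapsing.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=20)  # initial human data

for generation in range(50):
    mu, sigma = data.mean(), data.std()          # fit the current pool
    synthetic = rng.normal(mu, sigma, size=20)   # AI-generated additions
    fresh_human = rng.normal(0.0, 1.0, size=20)  # ongoing human additions
    data = np.concatenate([data, synthetic, fresh_human])

print(f"diversity (std) after 50 generations: {data.std():.3f}")  # near 1, not 0
```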
The future is more likely to involve an ecosystem of diverse generative AI platforms used for content creation and publication, rather than relying on a single monolithic model. This approach increases resilience against collapse and highlights the importance of healthy competition and funding for public interest technology development.
However, excessive AI-generated content brings other concerns. A flood of synthetic content may not pose an existential threat to AI development, but it does threaten the digital public good of the internet. For example, the release of ChatGPT led to a 16% decrease in activity on the question-and-answer site StackOverflow, suggesting that AI assistance is already reducing person-to-person interaction in some online communities. Hyperproduction from AI-powered content farms is also making it harder to find content that isn’t clickbait stuffed with advertisements.
There is also a risk of losing socio-cultural diversity as AI-generated content becomes more homogeneous, potentially contributing to cultural erasure; cross-disciplinary research is urgently needed to address these social and cultural challenges. Meanwhile, reliably distinguishing human-generated from AI-generated content is becoming increasingly difficult. One frequently suggested remedy is to watermark or label AI-generated output; a toy sketch of how a statistical watermark can work follows below.
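Here is that toy watermark sketch, loosely inspired by the “green list” schemes discussed in the research literature. Everything in it, the tiny vocabulary, the secret key, and the stand-in “model” that picks words at random, is a hypothetical simplification: the generator statistically biases its word choices using a key, and a detector holding the same key can measure that bias later.

```python
# A toy statistical watermark. The vocabulary, key, and random "model"
# below are hypothetical simplifications for illustration, not a
# production watermarking scheme.
import hashlib
import random

VOCAB = ("the a cat dog bird runs jumps sleeps over under "
         "quick lazy brown red green river cloud").split()
KEY = "secret-key"  # assumed to be shared by generator and detector

def green_list(prev_word: str) -> set:
    # Deterministically split the vocabulary in half, keyed on the
    # secret key and the previous word.
    seed = hashlib.sha256((KEY + prev_word).encode()).hexdigest()
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])

def generate(n_words: int, watermark: bool = True) -> list:
    # Stand-in "model": picks words at random, but when watermarking,
    # prefers words from the current green list 90% of the time.
    rng = random.Random(42)
    words = ["the"]
    for _ in range(n_words):
        use_green = watermark and rng.random() < 0.9
        pool = sorted(green_list(words[-1])) if use_green else VOCAB
        words.append(rng.choice(pool))
    return words

def green_fraction(words: list) -> float:
    # Detector: unwatermarked text lands near 0.5; watermarked text sits
    # well above, which is the detectable statistical signal.
    hits = sum(w in green_list(prev) for prev, w in zip(words, words[1:]))
    return hits / (len(words) - 1)

print(f"watermarked:   {green_fraction(generate(300)):.2f}")         # about 0.95
print(f"unwatermarked: {green_fraction(generate(300, False)):.2f}")  # about 0.5
```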
Protecting human interactions and human data is crucial, both for our own benefit and to mitigate the potential risk of future model collapse.