Researchers have raised concerns about a potential shortage of training data for artificial intelligence (AI) systems, which could slow the growth of AI models and the wider AI revolution. Although a vast amount of data is available on the web, training accurate, high-quality AI algorithms depends on both the quantity and the quality of that data. Too little data can lead to inaccurate or low-quality outputs. Low-quality data from sources such as social media posts may be biased or may contain disinformation or illegal content, all of which AI models can replicate. To build high-performing AI models, developers therefore seek out high-quality content from books, online articles, scientific papers, and filtered web content.
However, even as demand for training data increases, research shows that online data stocks are growing more slowly than the datasets used to train AI. A group of researchers predicted that high-quality text data could run out before 2026 if current training trends continue. This could have implications for the development of AI, which is projected to contribute trillions of dollars to the global economy by 2030.
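The reasoning behind such a prediction can be illustrated with a rough sketch: if training datasets grow faster each year than the stock of available data, the two eventually meet. The function name, the growth rates, and the starting sizes below are all hypothetical placeholders chosen for illustration; they are not the researchers' figures or their method.

```python
def years_until_exhaustion(stock, stock_growth, demand, demand_growth, max_years=100):
    """Return the first year (0 = now) in which demand reaches the data stock.

    All arguments are hypothetical inputs:
      stock          current amount of usable data (arbitrary units)
      stock_growth   yearly growth rate of the stock (0.07 means 7%)
      demand         current training dataset size (same units as stock)
      demand_growth  yearly growth rate of training dataset size
    Returns None if demand never reaches the stock within max_years.
    """
    for year in range(max_years + 1):
        if demand >= stock:
            return year
        # Compound both quantities forward by one year.
        stock *= 1 + stock_growth
        demand *= 1 + demand_growth
    return None


if __name__ == "__main__":
    # Placeholder values only: stock is 10x demand today,
    # but demand grows 50% a year against 7% for the stock.
    print(years_until_exhaustion(100.0, 0.07, 10.0, 0.50))
```

The point of the sketch is only that a large head start in data stock is used up quickly when demand compounds at a much higher rate; the actual timing depends entirely on the real growth rates.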
While the situation may seem concerning, there are potential ways to reduce the risk of a data shortage. One approach is for developers to improve algorithms so that they use existing data more efficiently. This could make it possible to train high-performing AI systems with less data and possibly less computational power, which would also reduce AI's carbon footprint. Another option is to generate synthetic data specifically tailored to training AI models. Some projects are already using synthetic content from data-generating services.
Developers are also exploring content sources beyond free online platforms, such as large publishers and offline repositories. Digitizing the millions of texts published before the internet could provide a new source of data for AI projects. News Corp, a major owner of news content, has said it is negotiating content deals with AI developers; such deals could require AI companies to pay for training data.
Addressing data shortages in this way may also help redress the power imbalance between content creators and AI companies. Some creators have protested against the unauthorized use of their content to train AI models, and some have taken legal action. Paying content creators for their work could be a step towards correcting this imbalance.
Overall, while the potential lack of training data for AI is a concern, there are opportunities to improve algorithms, generate synthetic data, and explore alternative content sources to mitigate the risk.