Online data has long been a valuable commodity. For years, Meta and Google have used data to better target online ads, Netflix and Spotify have used it to recommend movies and music, and political candidates have turned to it to decide which groups of voters to target.

Over the past 18 months, it has become increasingly clear that digital data is also important in the development of artificial intelligence. Here's what you need to know:

The success of AI depends on data: the more data a model is trained on, the more accurate and human-like it becomes.

Just as students learn by reading more books, essays, and other material, large language models (the systems underlying chatbots) become more accurate and more capable as they ingest more data.

Some large language models, such as OpenAI's GPT-3, released in 2020, were trained on hundreds of billions of “tokens”, essentially words or fragments of words. More recent models have been trained on more than 3 trillion tokens.
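As a rough illustration of what a “token” is, the sketch below splits text into words and punctuation marks. This is a deliberately simplified stand-in, not any company's actual tokenizer: real systems typically use byte-pair encoding, which also splits rare words into sub-word fragments.

```python
import re

def toy_tokenize(text):
    """Split text into rough word/punctuation tokens.

    A simplified stand-in for real tokenizers, which use schemes
    like byte-pair encoding and break rare words into fragments.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Tokenizers split text into words or word fragments.")
print(tokens)       # each word and the final period is one token
print(len(tokens))  # 9 tokens for this sentence
```

Counting a training corpus in tokens rather than documents is what makes figures like “hundreds of billions” or “3 trillion” comparable across models.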

Technology companies are using up publicly available online data to develop AI models faster than new data can be generated. By some predictions, the supply of high-quality digital data will be exhausted by 2026.

In the race for more data, OpenAI, Google, and Meta are turning to new tools, changing their terms of service, and engaging in internal discussions.

At OpenAI, researchers in 2021 created a program that converted the audio of YouTube videos to text and fed the transcripts into one of the company's AI models, a move that violated YouTube's terms of service, people familiar with the matter said.

(The New York Times has sued OpenAI and Microsoft for using copyrighted news articles without permission to develop AI. OpenAI and Microsoft have said they used the articles in innovative ways that do not violate copyright law.)

Google, which owns YouTube, also used YouTube data to develop AI models, stepping into a legal gray area of copyright, people familiar with the move said. And last year, Google revised its privacy policy to allow publicly available material to be used for more of its AI products.

At Meta last year, executives and lawyers discussed how to obtain more data for AI development, including acquiring the major publisher Simon & Schuster. In private meetings, they weighed incorporating copyrighted works into the company's AI models even if it meant being sued later, according to a transcript of a meeting obtained by The Times.

OpenAI, Google, and other companies are also looking to use AI to create more data, so-called “synthetic” data. The idea is that AI models can generate new text that is then used to train better AI.

Synthetic data carries risks: AI models make errors, and training new models on that flawed output can compound the mistakes.
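The compounding effect can be sketched with a toy simulation (an illustrative assumption, not a model of any real training pipeline): suppose each generation of synthetic data corrupts 10% of items, and corrupted items are never repaired. Errors then accumulate across generations.

```python
import random

def next_generation(data, error_rate, rng):
    """Produce a new 'generation' of synthetic data: each item is
    copied, but with probability error_rate it becomes corrupted.
    Already-corrupted items stay corrupted."""
    return [x if rng.random() > error_rate else "ERROR" for x in data]

rng = random.Random(0)          # fixed seed for repeatability
data = ["fact"] * 1000          # start from clean "real" data
for generation in range(1, 6):
    data = next_generation(data, 0.10, rng)
    bad = data.count("ERROR")
    print(f"generation {generation}: {bad} of {len(data)} items corrupted")
```

With a 10% per-generation error rate, roughly 1 − 0.9⁵ ≈ 41% of items are corrupted after five generations, which is the feedback-loop worry with training on model-generated text.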
