State-of-the-art artificial intelligence systems can help you escape a parking fine, write an academic essay or make you believe Pope Francis is a fashionista. But the virtual libraries behind this jaw-dropping technology are vast – and there are fears that they operate in violation of personal data and copyright laws.
The huge datasets used to train the latest generation of these AI systems, like those behind ChatGPT and Stable Diffusion, are likely to contain billions of images scraped from the internet, millions of pirated ebooks, 16 years of proceedings of the European Parliament and the whole of English-language Wikipedia.
But the industry’s voracious appetite for big data is starting to cause problems, as regulators and courts around the world crack down on firms that hoover up content without consent or notice. In response, AI labs are fighting to keep their datasets secret, or even brushing off regulators entirely and daring them to press the issue.
In Italy, ChatGPT has been banned from operating after the country’s data protection regulator said there appeared to be no legal basis to justify the collection and “massive storage” of personal data used to train the GPT models behind it. On Tuesday, Canada’s privacy commissioner followed suit with an investigation into OpenAI in response to a complaint alleging “the collection, use and disclosure of personal information without consent”.
Britain’s data watchdog has expressed its own concerns. “Data protection law still applies when the personal information you process comes from publicly available sources,” said Stephen Almond, director of technology and innovation at the Information Commissioner’s Office.
Michael Wooldridge, professor of computer science at the University of Oxford, says that “large language models” (LLMs) like those that underpin OpenAI’s ChatGPT and Google’s Bard suck up colossal amounts of data.
“That includes the entire world wide web – everything. Every link is followed on every page, and every link on those pages is followed… In that unimaginable amount of data, there is probably a lot of data about you and me,” he said, adding that comments about a person and their work could also be swept up by an LLM. “And it isn’t stored in a big database somewhere – we can’t look to see exactly what information it has about me. It is all buried away in enormous, opaque networks of neurons.”
Wooldridge adds that copyright will be a “coming storm” for AI companies. LLMs are likely to have accessed copyrighted material, such as news articles. Indeed, the GPT-4-powered chatbot built into Microsoft’s Bing search engine cites news sites in its responses. “I didn’t explicitly allow my work to be used as training data, but it almost certainly was, and now it’s contributing to what these models know,” he says.
“Many artists are seriously concerned that their livelihoods will be threatened by generative AI. Expect to see legal battles,” he adds.
Lawsuits have already been filed. The photo agency Getty Images is suing the UK startup Stability AI – the company behind the AI image generator Stable Diffusion – claiming that it breached copyright by using millions of unlicensed Getty photos to train its system. In the US, a group of artists is suing Midjourney and Stability AI in a lawsuit that claims the companies “infringed the rights of millions of artists” by developing their products using the artists’ work without their permission.
Awkwardly for Stability, Stable Diffusion will sometimes spit out images with an intact Getty Images watermark, examples of which the photography agency included in its lawsuit. In January, researchers at Google even managed to trick the Stable Diffusion system into almost perfectly recreating one of the unlicensed images it was trained on, a portrait of the American evangelist Anne Graham Lotz.
Copyright lawsuits and regulator actions against OpenAI are hampered by the company’s absolute secrecy about its training data. In response to the Italian ban, Sam Altman, the chief executive of OpenAI, the developer of ChatGPT, said: “We believe we comply with all privacy laws.” But the company has declined to share information about the data used to train GPT-4, the latest version of the underlying technology that powers ChatGPT.
Even in its technical report describing the AI, the company says only that it was trained “using both publicly available data (such as internet data) and data licensed from third-party providers”. Further information is withheld, it says, because of “both the competitive landscape and the safety implications of large-scale models like GPT-4”.
Others take the opposite view. EleutherAI describes itself as a “non-profit AI research lab” and was founded in 2020 with the goal of recreating GPT-3 and releasing it to the public. To that end, the group put together the Pile, an 825-gigabyte collection of datasets gathered from every corner of the internet. It includes 100GB of ebooks scraped from the pirate site Bibliotik, another 100GB of computer code scraped from GitHub, and a 228GB collection of websites gathered from across the internet since 2008 – all, the group acknowledges, without the consent of the authors involved.
Eleuther argues that the Pile’s datasets have all already been so widely shared that compiling them “does not constitute significantly increased harm”. But the group doesn’t take the legal risk of directly hosting the data, turning instead to a group of anonymous “data enthusiasts” called the Eye, whose copyright takedown policy is a video of a choir of fully clothed women pretending to masturbate while singing.
Some of the information produced by chatbots has also been false. ChatGPT falsely accused an American law professor, Jonathan Turley of George Washington University, of sexually harassing one of his students – citing a news article that didn’t even exist. The Italian regulator had also referred to the fact that ChatGPT’s answers do not “always correspond to the factual circumstances” and that “inaccurate personal data is processed”.
Concerns about how AI is being trained have emerged as an annual report on progress in AI showed that commercial players are dominating the industry, at the expense of academic institutions and governments.
According to the 2023 AI Index report, compiled by the California-based Stanford University, industry produced 32 significant machine learning models last year, compared with just three produced by academia. Until 2014, most significant models came from the academic sphere, but since then the cost of developing AI models – including personnel and computing power – has soared.
“Overall, large language and multimodal models are getting larger and more expensive,” the Index said. An early iteration of the LLM behind ChatGPT, known as GPT-2, had 1.5 billion parameters, analogous to neurons in a human brain, and cost around $50,000 to train. By comparison, Google’s PaLM had 540 billion parameters and cost around $8 million.
This has raised concerns that companies are taking a less measured approach to risk than universities or government-backed projects. Last week, a letter whose signatories included Elon Musk and the Apple co-founder Steve Wozniak called for an immediate pause of at least six months in the creation of “giant AI experiments”. The letter said there were concerns that tech companies were creating “ever more powerful digital minds” that no one could “reliably understand, predict or control”.
“Big AI means that these AIs are created only by big, for-profit companies, which unfortunately means that our interests as human beings are not necessarily well represented,” said Dr Andrew Rogoyski of the Institute for People-Centred AI at the University of Surrey.
He added: “We need to focus our efforts on making AI smaller, more efficient, requiring less data and less electricity, so that we can democratize access to AI.”