Artificial intelligence-backed chatbots like ChatGPT are creating “plagiarism stew” — responding to queries with “paraphrasing or outright repetition” of text that’s cribbed from copyrighted news articles, according to a prominent trade group.
News Media Alliance — a nonprofit that represents more than 2,200 publishers, including The Post — released a blistering, 77-page report on Tuesday that argued the most popular AI chatbots have been violating copyright law by reproducing entire sections of some articles in their responses.
The report singled out OpenAI’s ChatGPT, Google’s Bard, Microsoft’s Bing and a more recent Google tool called Search Generative Experience, which can craft responses to open-ended queries while retaining a recognizable list of links to the web.
NMA’s findings disputed the tech industry’s argument that these LLMs — a type of AI that understands and can respond to written text — “are just ‘learning’ unprotectable facts from copyrighted training materials.”
Since the technology is not actually “ever absorbing any underlying concepts,” that characterization is “technically inaccurate,” NMA said.
After analyzing a sample of datasets believed to be used by LLMs, NMA claimed that the AI chatbots produce “unauthorized derivative works by responding to user queries with close paraphrasing or outright repetition of copied and memorized portions of the works on which they were trained.”
The group found that curated training datasets draw on content from news, magazine and digital media publications as much as 100 times more heavily than generic web-scraped datasets.
As many as half of the top 10 sites in training sets used for Google’s Bard — which reportedly launched in March despite internal warnings that the techy tool was a “pathological liar” — are news outlets, NMA claimed.
Current and former employees told Bloomberg earlier this year that Google’s push to develop Bard reportedly ramped up in late 2022 after ChatGPT’s success prompted top brass to declare a “competitive code red.”
However, insiders worried that the chatbot was prone to spewing out responses riddled with false information that could “result in serious injury or death” — a concern that Google ignored in favor of a troubled launch that saw Bard producing erratic responses.
NMA said it submitted its white paper to the US Copyright Office, “acknowledging that an author’s expression may be implicated both in training … as well as at the output stage because of a similarity between her works and an output of an AI system.”
Robert Thomson, the CEO of News Corp — which owns The Post, as well as The Wall Street Journal and other publishers represented by NMA — has bashed inaccuracies spewed out by AI-generated content as “rubbish in, rubbish out” — even as he warned the technology threatens to kill thousands more jobs across the news industry.
“People have to understand that AI is essentially retrospective,” the media executive said during an appearance at the Goldman Sachs Communacopia and Technology Conference in San Francisco last week.
“It’s about permutations of pre-existing content.”
“And so instead of elevating and enhancing, what you might find is that you have this ever-shrinking circle of sanity surrounded by a reservoir of rubbish,” Thomson continued. “So instead of the insight that AI can potentially bring, what it will evolve into, essentially, is maggot-ridden mind mold.”
Journalists have fumed about the use of AI in news reports — including just last week, when USA Today staff writers suspected that parent company Gannett had used the tech to generate content for a product review site after bylines from unknown writers started appearing on articles.
Reviewed, the USA Today-owned shopping recommendation site, published several articles last week with names of reporters that other staffers did not recognize, according to The Washington Post.
Journalists called the prose “robotic” and seemingly “not even real” after one story reviewing scuba masks juxtaposed text with another article devoted to tumbler cups.
It’s not just reporters bashing AI’s intrusion on the industry. Well-known authors have also accused OpenAI of using their works without permission to train ChatGPT, which became the fastest-growing consumer application in history, reaching 100 million active users in January — just two months after it launched.
Stand-up comic Sarah Silverman has filed separate lawsuits against OpenAI and Meta, claiming copyright infringement after their AI models allegedly used content from her memoir, “The Bedwetter,” for training without her permission.
In the suits, Silverman and authors Christopher Golden and Richard Kadrey allege that OpenAI’s and Meta’s respective AI-backed language models were trained on illegally acquired datasets containing the authors’ works.
The complaints state that ChatGPT and Meta’s LLaMA honed their skills using “shadow library” websites such as Bibliotik, Library Genesis and Z-Library, among others — sites that operate illegally because most of the material uploaded to them is protected by copyright.
When asked to create a dataset, ChatGPT reportedly produced a list of titles from these illegal online libraries.
Massachusetts-based writers Paul Tremblay and Mona Awad made a similar argument in San Francisco federal court earlier this year — claiming ChatGPT mined data copied from thousands of books without permission, infringing the authors’ copyrights.
The Post has sought comment from OpenAI, Google and Microsoft.
This story originally appeared on NYPost.