Study: Generative AI content detection authority reveals clandestine content threat, cites impacts and implications
By Merilee Kern, MBA
With industries across the board becoming increasingly adept at using AI-driven Large Language Models (LLMs) like OpenAI’s ChatGPT, Anthropic’s Claude2 and Meta AI’s Llama2, one obvious question arises: How can we tell whether the content out there was generated by AI?
So severe is the concern over AI-induced mis- and disinformation that the World Economic Forum (WEF) highlighted it in its Global Risks Report 2024, which warns of a “global risks landscape in which progress in human development is being chipped away slowly, leaving states and individuals vulnerable to new and resurgent risks.” In the report, Carolina Klint, Chief Commercial Officer, Europe, Marsh McLennan, underscored that “Artificial intelligence breakthroughs will radically disrupt the risk outlook for organizations with many struggling to react to threats arising from misinformation, disintermediation and strategic miscalculation.”
Fortunately, being an “AI sleuth” is relatively easy thanks to some emerging smart SaaS tools. Such solutions will become crucial for those endeavoring to identify the use of AI in content creation, deal with false positives in AI content detection, discern whether AI-generated material has yet been detected, and even determine whether AI “distilled” content has been reconfigured to be more neutral in sentiment.
In fact, on that latter point of original content being watered down or repositioned into a more impartial form that does not accurately reflect the originator’s voice or purpose (whether that is an intended or unintended outcome of the spin), one study has revealed how pervasive this is, along with its implications. Originality.ai, a 99% accurate AI content, plagiarism and fact checker SaaS aimed at assuring publishing with integrity, released its findings that some of the most popular LLMs used to rewrite or paraphrase another text are making content more neutral in sentiment and, in doing so, are notably altering the nature and objective of the initial written work.
To better understand and consider the possible consequences of such AI-driven content transformation relative to tone, temperament and overall sentiment, I connected with Originality.ai Founder and CEO Jonathan Gillham for some insight into the study along with implications of the disconcerting findings it revealed.
MK: Your study found that popular LLMs like OpenAI’s ChatGPT, Anthropic’s Claude2 and Meta AI’s Llama2 are officially making content more neutral in sentiment. Why does this actually matter?
JG: Employing LLMs to rewrite or paraphrase another text can certainly offer speed and ease in content production, but it comes with caveats. For example, there might be a sound reason for coverage of a news event to have highly negative or positive sentiment. Dampening those qualities might prevent readers from perceiving how potentially troublesome or heartening an event might be. Outside of news content, publishers might desire to convey a particular kind of sentiment to evoke feelings in readers, and a neutral-scoring story might struggle to do so. On the other hand, there could be uses for making texts with more neutral sentiment that read more like “just the facts.” Publishers might want to consider the tone and purpose of a piece and know that LLMs might modify texts in ways that affect those goals.
MK: What exactly is sentiment analysis, which was the benchmark employed for your AI Paraphrased Content study?
JG: Sentiment analysis is the process of analyzing and categorizing texts as positive, neutral or negative, and to what degree. It’s often used to assess opinions and feelings expressed in reviews or open-ended questions in surveys. Many of the stories in this study had sentiment made more neutral after generative AI rewrote them. In the Sentiment Analysis scale, 1 is highly negative, 5 is highly positive and 3 is neutral. LLMs tended to move a story’s sentiment closer to 3, whether the original writing was more negative or positive. In the aggregate, the rewritten articles had their sentiment flattened.
MK: How was the study methodology undertaken and what were some of the specific key findings and data points?
JG: We analyzed 100 articles, each from popular websites, for their sentiment, or how positive or negative they were. Each was rated by Sapien.IO’s Sentiment Analysis for how positive, neutral or negative it was. We then had three different Large Language Models (OpenAI’s ChatGPT, Anthropic’s Claude2 and Meta AI’s Llama2) each paraphrase the article, and the sentiment of each new text was analyzed. These ratings were then compared to the original article’s sentiment rating. The scores, along with each rewritten article’s word count, were analyzed for any relationships.
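For readers who want to picture the workflow described above, here is a minimal Python sketch of the compare-before-and-after loop. It is an illustration under stated assumptions, not the study's actual code: `compare_rewrites`, `RewriteResult`, `toy_score` and `toy_paraphrase` are hypothetical names, and the toy scorer and toy paraphraser merely stand in for a real sentiment service and real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RewriteResult:
    model: str
    original_score: float
    rewritten_score: float
    original_words: int
    rewritten_words: int


def compare_rewrites(
    articles: List[str],
    score_sentiment: Callable[[str], float],
    paraphrasers: Dict[str, Callable[[str], str]],
) -> List[RewriteResult]:
    """Score each article, have each model rewrite it, then score the rewrite."""
    results = []
    for text in articles:
        original_score = score_sentiment(text)
        for model, paraphrase in paraphrasers.items():
            rewritten = paraphrase(text)
            results.append(
                RewriteResult(
                    model=model,
                    original_score=original_score,
                    rewritten_score=score_sentiment(rewritten),
                    original_words=len(text.split()),
                    rewritten_words=len(rewritten.split()),
                )
            )
    return results


# Hypothetical stand-ins (assumptions for illustration only).
POSITIVE = {"great", "wonderful"}
NEGATIVE = {"terrible", "awful"}


def toy_score(text: str) -> float:
    """Toy 1-to-5 scale: start at neutral 3, add or subtract per loaded word."""
    words = text.lower().split()
    raw = 3 + sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return float(max(1, min(5, raw)))


def toy_paraphrase(text: str) -> str:
    """Toy 'rewrite' that simply drops emotionally loaded words."""
    return " ".join(w for w in text.split() if w.lower() not in POSITIVE | NEGATIVE)


demo = compare_rewrites(
    ["The launch was great great", "The outage was terrible awful"],
    toy_score,
    {"toy-model": toy_paraphrase},
)
for row in demo:
    print(row)
```

In this toy run both the strongly positive and the strongly negative text land on a neutral 3 once the loaded words are dropped, which mirrors the kind of flattening the study describes.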
The key findings are that LLM rewrites moved the sentiment scores closer to the middle, neutral part of the scale and that the resulting sentiment scores differed by LLM. Llama2 produced the most positive scores and Claude2 the most negative. It’s also important to note that the rewritten articles were shorter than the originals, which could be part of the reason sentiment scores changed.
Overall, the analysis showed roughly half a point of difference at most between the original articles’ average Sentiment Analysis score of 2.54 (slightly more negative than neutral) and the LLMs’ rewrite averages of 2.72 (Claude2), 2.95 (ChatGPT) and 3.08 (Llama2). However, those differences became pronounced when looking at articles that originally held sentiment scores of 1 or 5. In those cases, the rewrites differed by more than a point and up to 1.5 points on average, pulling toward a neutral 3. If the original scored a 1, the rewrites averaged 2.35. When the original was a 5, the rewrites averaged 3.56.
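The arithmetic behind those averages can be checked in a few lines of Python. This is only a sketch built on the figures quoted above; the names `shift` and `distance_to_neutral` are illustrative helpers, not part of the study.

```python
NEUTRAL = 3.0
ORIGINAL_AVG = 2.54
REWRITE_AVGS = {"Claude2": 2.72, "ChatGPT": 2.95, "Llama2": 3.08}


def shift(original: float, rewritten: float) -> float:
    """Signed change in sentiment score after a rewrite."""
    return round(rewritten - original, 2)


def distance_to_neutral(score: float) -> float:
    """How far a score sits from the neutral midpoint of the 1-to-5 scale."""
    return round(abs(score - NEUTRAL), 2)


# Change in the overall average, per LLM.
overall_shifts = {name: shift(ORIGINAL_AVG, avg) for name, avg in REWRITE_AVGS.items()}

# Change for articles that started at the extremes of the scale.
extreme_shifts = {1: shift(1, 2.35), 5: shift(5, 3.56)}

print(overall_shifts)   # {'Claude2': 0.18, 'ChatGPT': 0.41, 'Llama2': 0.54}
print(extreme_shifts)   # {1: 1.35, 5: -1.44}
print(distance_to_neutral(ORIGINAL_AVG))  # 0.46
```

Every rewrite average sits closer to 3 than the original average of 2.54 does, and the articles that started at 1 or 5 moved by well over a point in the direction of neutral.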
MK: You mentioned that the LLM rewrites often resulted in fewer words and that this could have impacted the results. Can you elaborate a bit on that?
JG: Yes, a possible explanation for the neutralization in sentiment could be that all three LLMs reduced the number of words when they rewrote articles. Claude2 reduced words by a notable 43.5%, compared to 13.5% for ChatGPT and 15.6% for Llama2. While shortening an article can be desirable for some purposes, the reduction might eliminate details or potent phrases that indicate how negative or positive the sentiment of the story is. Losing those details or descriptive words could explain part of the movement toward a rating of 3, neutral, for stories with either the most positive or negative sentiment.
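To put those percentages in concrete terms, the short Python sketch below applies each reported average reduction to a hypothetical 1,000-word article. The 1,000-word length and the helper name `words_after_rewrite` are assumptions chosen purely for illustration.

```python
# Average word-count reductions reported in the study, in percent.
REDUCTION_PCT = {"Claude2": 43.5, "ChatGPT": 13.5, "Llama2": 15.6}


def words_after_rewrite(original_words: int, reduction_pct: float) -> int:
    """Expected length of a rewrite given an average percentage reduction."""
    return round(original_words * (1 - reduction_pct / 100))


# Hypothetical 1,000-word source article (an assumption, not study data).
expected_lengths = {
    name: words_after_rewrite(1000, pct) for name, pct in REDUCTION_PCT.items()
}
print(expected_lengths)  # {'Claude2': 565, 'ChatGPT': 865, 'Llama2': 844}
```

At these rates, a Claude2 rewrite would cut several hundred words from such an article, which leaves far less room for the descriptive language that signals strong sentiment.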
This study was small, but the data displayed suggests a slightly positive correlation between sentiment scores and word counts, with longer texts receiving higher scores. The trend was highlighted by comparing the three LLMs to each other. Across all levels of sentiment in the original articles, Claude2 consistently had both the lowest sentiment scores and the lowest word count, while Llama2 had the highest sentiment scores and highest word count.
MK: How pervasive do you feel the issue of AI-generated content is becoming?
JG: Some of our other research has looked at large companies that dominate Google search results. These include Valnet, Arena Group, Condé Nast, Red Ventures and Dotdash Meredith, among others. Our analysis shows that 46% of major publishers are using AI for content creation. This includes seven out of the 16 companies we’ve studied, with significant AI content detected on their sites. On one particular globally popular sports magazine site, an article was identified by Futurism as being penned by a “fake” author, which our algorithm substantiated. This is a tactic being increasingly used for a content marketing strategy called “parasite SEO,” where reputable domains host content, even wholly AI-created content, primarily for search engine ranking purposes. Of course, for readers of such content expecting a human was at the helm, this kind of gratuitous and unimpassioned information propagation, especially that with extreme reach, can be disheartening.
According to Axios, “By some estimates, AI-generated content could soon account for 99% or more of all information on the internet, further straining already overwhelmed content moderation systems…Dozens of ‘news sites’ filled with machine generated content of dubious quality have already cropped up, with far more likely to follow—and some media sites are helping blur the lines.”
University of Washington professor Kate Starbird, an expert in the field, told Axios that generative AI will deepen the misinformation problem in numerous ways. “Generative AI has the potential to accelerate the spread of both mis- and disinformation, and exacerbate the ongoing challenge of finding information we can trust online,” she said.
With other reports indicating that a vast majority of consumers, over 75%, are concerned about AI-driven misinformation, content sentiment degradation is emerging as an insidious facet of that overarching threat. Although it is a stealthier and less considered offense, the restructuring of content in a way that mitigates, or outright eliminates, the intended emotional tonality (the humanity) of a text is exacerbating the Misinformation Age of AI.