Scienmag

Phantom data could show copyright holders if their work is in AI training data

July 29, 2024
in Policy

The technique was presented at the International Conference on Machine Learning in Vienna this week, and is detailed in this preprint:   

 

Generative AI is taking the world by storm, already transforming the day-to-day lives of millions of people.

 

Yet today, AI is often built on “shaky” legal grounds when it comes to training data. Modern AI models, such as large language models (LLMs), require vast amounts of text, images and other content from the internet to achieve their impressive capabilities.

 

In a new paper, researchers from Imperial College London propose a mechanism to detect the use of data for AI training.

 

They hope that their proposed method will serve as a step towards greater openness and transparency in the rapidly evolving field of generative AI, and will help authors better understand how their texts are used.

 

Lead researcher Dr Yves-Alexandre de Montjoye, from Imperial’s Department of Computing, said: “Taking inspiration from the map makers of the early 20th century, who put phantom towns on their maps to detect illicit copies, we study how injection of ‘copyright traps’ – unique fictitious sentences – into the original text enables content detectability in a trained LLM.”

 

First, the content owner would repeat a copyright trap multiple times across their collection of documents (e.g. news articles). Then, if an LLM developer scrapes the data and trains a model on it, the data owner would be able to confidently prove training by observing irregularities in the model’s outputs.
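The two steps above can be illustrated with a minimal sketch. It assumes a toy word-bigram model as a stand-in for an LLM, and it assumes perplexity (how "surprised" the model is by a sentence) as the irregularity being observed. The function names, the sample articles and the trap sentence are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def inject_trap(documents, trap, repeats=1):
    """Append the trap sentence `repeats` times to every document (hypothetical helper)."""
    return [doc + (" " + trap) * repeats for doc in documents]


def train_bigram(documents):
    """Count word unigrams and bigrams; a toy stand-in for LLM training."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def perplexity(model, sentence):
    """Perplexity of `sentence` under the bigram model, with add-one smoothing."""
    unigrams, bigrams = model
    vocab = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    words = sentence.lower().split()
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(words) - 1, 1))


# Hypothetical data: three short "articles" and one fictitious trap sentence.
articles = [
    "the council approved the new budget on monday",
    "local team wins the regional final after extra time",
    "heavy rain is expected across the region this weekend",
]
TRAP = "purple lanterns negotiate quietly with forgotten submarines"

# One model never saw the trap; the other was trained on the trapped collection.
clean_model = train_bigram(articles)
trapped_model = train_bigram(inject_trap(articles, TRAP, repeats=3))

ppl_clean = perplexity(clean_model, TRAP)
ppl_trapped = perplexity(trapped_model, TRAP)

# A markedly lower perplexity suggests the trap was part of the training data.
print(f"clean: {ppl_clean:.2f}  trapped: {ppl_trapped:.2f}")
```

In this toy setting, the model trained on the trapped collection finds the trap sentence far less surprising than the model that never saw it. A real test against an LLM would need a statistically sound comparison against reference sentences that were never published.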

 

The proposal is best suited to online publishers, who could hide the copyright trap sentence across news articles so that it stays invisible to the reader, yet is likely to be picked up by a data scraper.

 

However, Dr de Montjoye cautions that LLM developers could develop techniques to remove traps and avoid detection. Because traps can be embedded in several different ways across news articles, successfully removing all of them is likely to require significant engineering resources to stay ahead of new ways to embed them.

 

To validate the approach, the researchers partnered with a team in France training a “truly bilingual” English-French 1.3B-parameter LLM and injected various copyright traps into the training set of this real-world, state-of-the-art, parameter-efficient language model. The researchers believe the success of their experiments enables better transparency tools for the field of LLM training.

 

Co-author Igor Shilov, also from Imperial College London’s Department of Computing, added: “AI companies are increasingly reluctant to share any information about their training data. While the training data composition for GPT-3 and LLaMA (older models released by OpenAI and Meta AI respectively) is publicly known, it is no longer the case for the more recent models GPT-4 and LLaMA-2. LLM developers have little incentive to be open about their training procedure, leading to a concerning lack of transparency (and thus fair profit sharing), making it more important than ever to have tools to inspect what went into the training process.”

 

Co-author Matthieu Meeus, also from Imperial College London’s Department of Computing, said: “We believe the issue of AI training transparency and discussions on fair compensation for content creators to be very important for the future where AI is built in a responsible way. Our hope is that this work on copyright traps contributes towards a sustainable solution.”


