Fin Street News
There’s a Looming AI Data Shortage. Google Researchers Have a New Fix.
Finance

By News Room, September 15, 2025

Google DeepMind researchers have an idea for how to solve the AI data drought, and it might involve your Social Security number.

The large language models powering AI require vast amounts of training data pulled from webpages, books, and other sources. When it comes to text specifically, the amount of data on the web considered fair game for training AI models is being scraped faster than new data is being created.

However, a large portion of the data isn’t used because it is deemed toxic or inaccurate, or because it contains personally identifiable information.

In a newly published paper, a group of Google DeepMind researchers say they have found a way to clean up this data and make it usable for training, which they claim could be a “powerful tool” for scaling up frontier models.

They refer to the idea as Generative Data Refinement, or GDR. The method uses pretrained generative models to rewrite the unusable data, effectively purifying it so it can be safely trained on. It’s not clear if this is a technique Google is using for its Gemini models.

Minqi Jiang, one of the paper’s researchers, who has since left the company for Meta, told Business Insider that a lot of AI labs are leaving usable training data on the table because it’s intermingled with bad data. For example, if there’s a document on the web that contains something considered unusable, such as someone’s phone number or an incorrect fact, labs will often discard the entire thing.

“So you essentially lose all those tokens inside of that document, even if it was a small single line that contained some personally identifying information,” said Jiang. Tokens are the units of data processed by AI models; they make up the words within text.

The authors give an example of raw data that included someone’s Social Security number or information that may soon be out of date (“the incoming CEO is…”). In these instances, GDR would swap or remove the numbers, ignore the information that risks becoming obsolete, and retain the remainder of the usable data.
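To make the contrast concrete, here is a minimal Python sketch of discarding a whole document versus rewriting only the offending span. It is an illustration under stated assumptions, not the authors’ method: GDR uses a pretrained generative model to do the rewriting, while this sketch substitutes a simple regular expression as a stand-in. The function names, the `[REDACTED]` placeholder, the sample documents, and the whitespace token count are all hypothetical.

```python
import re

# Assumed pattern for US Social Security numbers written as 123-45-6789.
# A regex is only a stand-in here; GDR itself relies on a generative model.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def filter_documents(docs):
    """Baseline: discard any document that contains an SSN-like string."""
    return [d for d in docs if not SSN_PATTERN.search(d)]


def refine_documents(docs, placeholder="[REDACTED]"):
    """Stand-in for refinement: rewrite only the offending span, keep the rest."""
    return [SSN_PATTERN.sub(placeholder, d) for d in docs]


def count_tokens(docs):
    """Crude whitespace token count, used only to compare the two strategies."""
    return sum(len(d.split()) for d in docs)


# Made-up sample documents for illustration.
docs = [
    "The quarterly report shows revenue grew in every region.",
    "Contact Jane at the office. Her SSN is 123-45-6789. The meeting is at noon.",
]

kept_by_filter = filter_documents(docs)   # second document is dropped entirely
kept_by_refine = refine_documents(docs)   # second document survives, minus the SSN

print(count_tokens(kept_by_filter), count_tokens(kept_by_refine))
```

In this toy case the filtering baseline loses every token in the second document, while the rewriting approach keeps the surrounding sentences and replaces only the sensitive number.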

The paper was written more than a year ago and was only published this month. A Google DeepMind spokesperson did not respond to a request for comment about whether the researchers’ work was being applied to the company’s AI models.

The authors’ findings could prove helpful for labs as the usable well of data runs dry. They cite a research paper from 2022 that predicted AI models could soak up all the human-generated text between 2026 and 2032. This prediction was based upon the amount of indexed web data, using statistics from Common Crawl, a project that continuously scrapes web pages and makes them openly available for AI labs to use.

For the GDR paper, the researchers performed a proof of concept by taking over one million lines of code and having human expert labelers annotate the data line by line. They then compared the results with the GDR method.

“It completely crushes the existing industry solutions being used for this kind of stuff,” said Jiang.

The authors also said their method is better than the use of synthetic data (data generated by AI models for the purpose of training themselves or other models). Synthetic data has been a topic of exploration among AI labs, but using it can degrade the quality of model output and, in some cases, lead to “model collapse.”

The authors compared the GDR data against synthetic data created by an LLM and discovered that their approach created a better dataset for training AI models.

They also said further testing could be conducted on other complicated types of data considered a no-go, such as copyrighted materials and personal data that is inferred across multiple documents rather than explicitly spelled out.

The paper has not been peer reviewed, said Jiang, adding that this is common in the tech industry and that all papers are reviewed internally.

The researchers only tested GDR on text and code. Jiang said that it could also be tested on other modalities, such as video and audio. However, given the rate at which new videos are generated each day, video is still providing a firehose of data for AI to train on.

“With video, you’re just going to have a lot more of it, just because there’s a constant stream of millions of hours of video generated each day,” said Jiang. “So I do think, going across new modalities beyond text, video, and images, we’re going to unlock a lot more data.”

Have something to share? Contact this reporter via email at hlangley@businessinsider.com or Signal at 628-228-1836. Use a personal email address and a non-work device; here’s our guide to sharing information securely.



2025 © Prices.com LLC. All Rights Reserved.