Thousands of live API keys and passwords found exposed in training data

What Happened?

On February 27, 2025, security researchers at Truffle Security revealed that large language models (LLMs), including DeepSeek, were trained on datasets containing approximately 12,000 live API keys and passwords. Researchers scanned Common Crawl, a publicly available dataset widely used to train AI coding assistants, and discovered extensive hardcoded secrets across millions of web pages.

Why This Issue Matters

AI models trained on insecure data risk inadvertently suggesting unsafe coding practices, such as embedding sensitive credentials directly in source code. The repeated exposure of live secrets in widely used training datasets significantly increases the risk of compromised API keys and passwords.
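As a minimal sketch of the safer alternative to embedding credentials in source code, the snippet below reads a key from the environment at runtime. The variable name `EXAMPLE_API_KEY` and the helper `load_api_key` are hypothetical and chosen only for illustration; a real project might use a secrets manager instead.

```python
import os

# Unsafe pattern (shown only as a comment): the secret ships with the code
# and ends up in any crawl or repository snapshot.
# API_KEY = "hardcoded-secret-value"  # hypothetical placeholder, not a real key


def load_api_key(env_var: str = "EXAMPLE_API_KEY") -> str:
    """Read the key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # Refuse to continue rather than fall back to a value baked into the code.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

Failing when the variable is missing is a deliberate choice here: it avoids the temptation to add a hardcoded default "just for testing".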

How the Secrets Were Exposed

Websites inadvertently published live API keys, passwords, and other sensitive credentials in front-end HTML and JavaScript.
The Common Crawl dataset captured snapshots of these insecure web pages.
LLMs such as DeepSeek were subsequently trained on this publicly available dataset.
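To illustrate how a secret sitting in front-end code can be spotted in a captured page, here is a rough sketch of a pattern-based scan over HTML or JavaScript source. The two patterns and the function name `find_candidate_secrets` are assumptions made for this example; they are not the detectors used by the researchers, and production scanners use many more rules plus live verification.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted value of 8+ characters assigned to a suspicious-looking name.
    "generic_assignment": re.compile(
        r"(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*[\"']([^\"']{8,})[\"']"
    ),
}


def find_candidate_secrets(page_source: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in page source."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(page_source):
            findings.append((name, match.group(0)))
    return findings


if __name__ == "__main__":
    sample = '<script>const apiKey = "not-a-real-key-1234";</script>'
    for name, text in find_candidate_secrets(sample):
        print(name, text)
```

Matches from a scan like this are only candidates. Whether a key is still live has to be confirmed separately before it is counted as an exposure.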

Implications

Increased risk of credential misuse in phishing campaigns, data breaches, and brand impersonation.
Higher likelihood of insecure code recommendations from AI coding assistants.

Recommended Actions

Review API and Password Management: Immediately audit and rotate exposed API keys and passwords.
Enhance Secret Scanning: Extend scanning to cover public internet datasets such as Common Crawl and archive.org.
Educate Developers: Incorporate secure coding guidelines explicitly into AI coding assistant instructions.
Engage AI Providers: Advocate for stricter data alignment and additional safeguards in AI model training processes.
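One possible way to connect the scanning and rotation steps is sketched below: compare fingerprints of secrets found in public datasets against an internal credential inventory and list what needs rotating. The `Credential` type, the inventory labels, and both function names are hypothetical; the sketch assumes scan findings are stored as SHA-256 hashes so that raw secrets are not copied into audit records.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    name: str   # hypothetical inventory label, e.g. "billing-api"
    value: str  # the secret itself, held only in memory in this sketch


def fingerprint(secret: str) -> str:
    """SHA-256 hex digest, so findings can be compared without storing raw secrets."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def credentials_to_rotate(inventory, exposed_fingerprints):
    """Return sorted names of credentials whose fingerprint appears in scan findings."""
    exposed = set(exposed_fingerprints)
    return sorted(c.name for c in inventory if fingerprint(c.value) in exposed)


if __name__ == "__main__":
    inventory = [
        Credential("billing-api", "placeholder-secret-one"),
        Credential("mail-service", "placeholder-secret-two"),
    ]
    findings = [fingerprint("placeholder-secret-two")]
    print(credentials_to_rotate(inventory, findings))
```

Anything the helper returns would be a candidate for immediate rotation. Credentials that do not appear in the findings still warrant the regular audit described above.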

The Nudge Security Team
