On February 27, 2025, security researchers at Truffle Security revealed that large language models (LLMs), including DeepSeek, were trained on datasets containing approximately 12,000 live API keys and passwords. Researchers scanned Common Crawl, a publicly available dataset widely used to train AI coding assistants, and discovered extensive hardcoded secrets across millions of web pages.
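To make the idea of scanning text for hardcoded secrets concrete, here is a minimal Python sketch. It is not the researchers' actual tooling: the two patterns, the `find_candidate_secrets` helper, and the sample string are illustrative assumptions. It only flags candidates and does not check whether a key is live, which would require testing each one against the issuing service.

```python
import re

# Illustrative detectors only (assumed for this sketch). Production scanners
# ship many more patterns and verify each candidate before reporting it.
PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted value of 8+ characters assigned to a secret-sounding name.
    "generic_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|password)\b\s*[:=]\s*['\"]([^'\"]{8,})['\"]"
    ),
}


def find_candidate_secrets(text):
    """Return (detector_name, matched_text) pairs for every pattern hit in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


# Hypothetical page content; the AWS value is the placeholder key from AWS docs.
sample = '''
aws_key = "AKIAIOSFODNN7EXAMPLE"
const client = new Client({ apiKey: "abcd1234efgh5678" });
'''

for detector, text in find_candidate_secrets(sample):
    print(detector, "->", text)
```

Pattern matching alone produces false positives such as placeholders and revoked keys, so a count of live secrets depends on a separate verification step that this sketch leaves out.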
AI models trained on insecure data risk inadvertently suggesting unsafe coding practices, such as embedding sensitive credentials directly in source code. The repeated exposure of live secrets in widely used training datasets significantly increases the risk of compromised API keys and passwords.
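One common alternative to embedding a credential in source code is to read it from the environment at runtime. The Python sketch below illustrates that pattern; the variable name `EXAMPLE_API_KEY` and the `load_api_key` helper are hypothetical names chosen for illustration, not part of any particular service or library.

```python
import os

# Unsafe pattern an assistant might suggest (shown only as a comment):
#   API_KEY = "sk-live-..."   # hardcoded secret ends up in version control


def load_api_key(var_name="EXAMPLE_API_KEY"):
    """Read an API key from an environment variable instead of source code.

    `var_name` is a hypothetical name; use whatever your deployment defines.
    """
    key = os.environ.get(var_name)
    if not key:
        # Fail loudly rather than falling back to a default baked into the code.
        raise RuntimeError(f"Set the {var_name} environment variable")
    return key
```

With this approach the key never appears in the repository, so it cannot be scraped from a public page that mirrors the code. A dedicated secrets manager achieves the same goal with rotation and access control on top.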