According to TechCrunch, Chinese AI lab DeepSeek's recently updated R1-0528 reasoning model has sparked controversy after developers identified similarities suggesting it may have been trained on outputs from Google's Gemini AI. While DeepSeek has not disclosed its training data sources, observers such as developer Sam Paech have noted the model's preference for words and expressions strikingly similar to those favored by Google's Gemini 2.5 Pro.
The evidence pointing to DeepSeek's potential use of Google Gemini outputs centers on distinctive terminology patterns. Analysts have observed that DeepSeek's R1-0528 model exhibits a notable preference for Gemini-associated terms and expressions such as "context window," "foundation model," and "function calling," technical vocabulary that appears frequently in Google's Gemini documentation.[1][2] The model also reproduces response structures and stylistic elements characteristic of Gemini's outputs, including its approach to explaining AI concepts and its particular phrasing when discussing generative capabilities.[3][4]
These linguistic fingerprints are especially telling because AI models tend to adopt the terminology patterns of their training data. When an LLM is trained on outputs from another model such as Gemini, it inherits not just knowledge but also distinctive vocabulary and phrasing, much as human language learners pick up the speech patterns of their teachers.[5][6] This phenomenon, sometimes compared to "prompt chaining," shows how text produced in earlier interactions shapes later outputs, potentially revealing a model's training lineage.[6]
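To make the fingerprinting idea concrete, here is a minimal sketch that compares how often a set of marker phrases appears in text from two sources. The phrase list and sample strings below are illustrative placeholders, not Paech's actual methodology or real model output.

```python
# Sketch: compare marker-phrase frequencies between two text samples.
# A high cosine similarity over the phrase-count vectors suggests the
# samples share a terminology "fingerprint".
from collections import Counter
import math

MARKER_PHRASES = ["context window", "foundation model", "function calling"]

def phrase_counts(text: str) -> Counter:
    """Count occurrences of each marker phrase in lowercased text."""
    lowered = text.lower()
    return Counter({p: lowered.count(p) for p in MARKER_PHRASES})

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two phrase-count vectors."""
    dot = sum(a[p] * b[p] for p in MARKER_PHRASES)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sample_a = "The context window of a foundation model limits function calling."
sample_b = "A foundation model with a long context window enables function calling."
print(cosine_similarity(phrase_counts(sample_a), phrase_counts(sample_b)))
```

Real lineage analyses use far larger phrase inventories and corpora, but the underlying comparison is the same.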
Sam Paech's analysis of DeepSeek's R1-0528 model relied on AI detection techniques that can identify model lineage through linguistic patterns. While specific AI detectors vary in effectiveness (some commercial solutions, such as Pangram, report up to 99.3% accuracy in identifying AI-generated content[1]), detection methods generally fall into several categories:
Statistical detection analyzes word frequencies, n-gram patterns, and syntactic structures to identify machine-generated text[2]
Neural network approaches, such as BERT- and RoBERTa-based detectors, examine deeper linguistic features, with some achieving over 97% accuracy in controlled studies[3][4]
Zero-shot detection techniques examine the probability distribution of text without requiring additional training, reportedly reaching 99% accuracy for certain models[5][3] (see the sketch after this list)
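As a rough illustration of the zero-shot family, the sketch below scores a passage by its average per-token log-likelihood under an open reference model (GPT-2 via Hugging Face transformers); unusually "predictable" text is one weak signal of machine generation. This is a simplified demonstration of the general technique, not a reimplementation of any detector cited above.

```python
# Toy zero-shot scoring: mean per-token log-likelihood under GPT-2.
# Machine-generated text often scores as more predictable (higher
# log-likelihood / lower perplexity) than human-written text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood; negate it.
    return -out.loss.item()

print(avg_log_likelihood("The context window determines how much text fits."))
```

Production detectors combine many such signals and calibrate a decision threshold; a single likelihood score on its own is not reliable evidence.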
The detection of training data lineage represents a particularly challenging frontier in AI forensics: models trained on outputs from other AI systems inherit distinctive vocabulary and response patterns that serve as "fingerprints" of their training sources, which is precisely the kind of signal Paech identified in DeepSeek's model behavior.
DeepSeek-R1-0528 demonstrates substantial performance improvements across multiple benchmarks, positioning it as a serious competitor to closed-source models such as OpenAI's o3 and Google's Gemini 2.5 Pro. In mathematics, the model achieved 87.5% accuracy on the AIME 2025 test, up from the previous version's 70%, and 91.4% on AIME 2024.[1][2] The enhanced reasoning stems from deeper computational processing: the model now uses an average of 23,000 tokens per question, nearly double the previous 12,000.[2][3]
The model shows significant gains in other critical areas as well:
Programming: LiveCodeBench scores increased from 63.5% to 73.3%, while the SWE-bench Verified score rose from 49.2% to 57.6%[2][3]
General reasoning: GPQA-Diamond test scores improved from 71.5% to 81.0%[2]
Complex reasoning: Performance on "Humanity's Last Exam" more than doubled, from 8.5% to 17.7%[1][2]
Reduced hallucinations: The update significantly decreases factually inaccurate responses[2][3]
Enhanced integration: New support for JSON output generation and expanded function calling capabilities, as sketched below[2][4]
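For illustration, here is a minimal sketch of exercising the new JSON-output mode through DeepSeek's OpenAI-compatible API. The base URL, model name, and response_format support reflect DeepSeek's public documentation at the time of writing; treat them as assumptions to verify before use.

```python
# Sketch: request structured JSON output from the R1 endpoint via the
# OpenAI-compatible API. Endpoint and model name per DeepSeek's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # placeholder credential
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 reasoning model endpoint
    messages=[
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "Summarize the AIME 2025 result as JSON."},
    ],
    response_format={"type": "json_object"},  # new structured-output mode
)
print(response.choices[0].message.content)
```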
While DeepSeek has not directly addressed the specific allegations about using Google's Gemini outputs to train its R1-0528 model, the company has faced similar accusations before. In December 2024, developers observed that DeepSeek's V3 model frequently identified itself as ChatGPT, suggesting it may have been trained on OpenAI's chat logs.[1] Earlier in 2025, OpenAI reported finding evidence linking DeepSeek to distillation, a technique that extracts outputs from larger, more capable models to train smaller ones.[1]
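For context, the distillation pattern these allegations describe looks roughly like the sketch below: harvest a teacher model's answers to a prompt set and store them as training pairs for supervised fine-tuning of a student model. This illustrates the general technique only; the teacher model name and file layout are placeholders, and nothing here reflects DeepSeek's actual pipeline.

```python
# Sketch: build a synthetic supervised fine-tuning dataset from a
# teacher model's outputs (the core of API-based distillation).
import json
from openai import OpenAI

client = OpenAI(api_key="TEACHER_API_KEY")  # placeholder credential

prompts = [
    "Explain what a context window is.",
    "What is function calling in an LLM API?",
]

with open("synthetic_sft_data.jsonl", "w") as f:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="gpt-4o",  # stand-in teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        pair = {"prompt": prompt, "response": reply.choices[0].message.content}
        f.write(json.dumps(pair) + "\n")  # one training example per line
```

At scale this is attractive precisely because API calls substitute for training compute, which is the trade-off Lambert describes below.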
This pattern of suspected AI model distillation has prompted increased security measures across the industry. OpenAI implemented ID verification requirements in April 2025 for access to advanced models, notably excluding China from its list of supported countries.[1] AI experts such as Nathan Lambert of AI2 consider it plausible that DeepSeek would leverage synthetic data from leading models, noting the company is "short on GPUs and flush with cash," making this approach "effectively more compute for them."[1] DeepSeek's silence on these latest allegations follows its established pattern of not directly addressing training methodology controversies.