DeepSeek for Dummies: Understand How They Did It (Without Being Techie)
How DeepSeek's New Model Works and How It Moved SFT (Supervised Fine-Tuning) to the Backseat, Trump's Privately Funded $500B Stargate Project, and OpenAI's New Agent Operator: This Week in AI
Today, we’ve got a packed episode covering some of the most exciting—and let’s be honest, controversial—developments in AI.
Advancement #1: How Chinese Startup DeepSeek Matched OpenAI o1 Performance at 95% Less Cost by Doing the Opposite of Everyone Else
This week, Chinese AI startup DeepSeek released DeepSeek-R1, an open-source model that achieved the same performance as OpenAI’s o1... all while using 95% less compute power and cost.
With a training budget of only $6 million, DeepSeek didn’t rely on billions of dollars like OpenAI and Google but instead focused on highly efficient strategies like synthetic data and pure reinforcement learning. The result is a model that’s faster, cheaper, and open for anyone to build on.
And here’s the best part: DeepSeek-R1 is fully open-weight, with a transparent technical report detailing exactly how it was built. This isn’t just innovation; it’s a gift to the AI community and a direct challenge to the industry status quo.
So let’s look at how it works and how they trained this model to pull this off.
HOW IT WORKS
One of the most expensive steps right now when building large AI models is supervised fine-tuning (SFT). It’s the process that teaches models to follow instructions by training them on carefully labeled examples, relying on huge, curated datasets and human oversight to ensure accuracy. OpenAI and other industry leaders rely heavily on this step, making it the backbone of their development pipeline. But SFT is costly and labor-intensive because it requires labeled data, human annotators, and repeated iterations.
DeepSeek flipped the script by reversing the typical training process. Instead of building its model around supervised fine-tuning (SFT), DeepSeek-R1 put pure reinforcement learning (RL) in the driver’s seat.
Reinforcement learning (RL) is a trial-and-error training approach where a model interacts with an environment, generates outputs, and adjusts its behavior based on rewards or penalties. A typical scenario: generate multiple answers to a problem, score each one based on how "correct" or "useful" it is, and refine the model to produce better answers over time. Ultimately, it’s a self-driven training method that allows models to figure out what the right answer is by experimenting and self-correcting.
(Yes, I know what you’re thinking: it does indeed sound like a human brain’s way of handling an issue... but we’re not going to go there... yet.)
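To make that concrete, here’s a tiny toy sketch of the trial-and-error loop. This is my own illustration, not DeepSeek’s code: a "model" guesses answers, a reward function scores them, and guesses that earn a reward get reinforced.

```python
import random

# Toy illustration of the RL loop described above (hypothetical example, not DeepSeek's code).
# The "model" guesses the answer to 7 * 8 and shifts probability toward guesses that score well.

def reward(answer: int) -> float:
    return 1.0 if answer == 56 else 0.0   # rule-based check: is the answer correct?

# The "policy": a weight (unnormalized probability) for each candidate answer.
candidates = list(range(50, 60))
weights = {c: 1.0 for c in candidates}

for step in range(200):
    guess = random.choices(candidates, weights=[weights[c] for c in candidates])[0]
    r = reward(guess)
    weights[guess] += 0.5 * r   # trial and error: reinforce answers that earned a reward

best = max(weights, key=weights.get)
print("Model converges toward:", best)   # 56, learned purely from the reward signal
```

No labeled dataset, no human grader sitting in the loop: the model just keeps trying and keeps whatever gets rewarded.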
So here’s how to think about it: DeepSeek let reinforcement learning (RL) handle the heavy lifting—reasoning, logic, and problem-solving—while supervised fine-tuning (SFT) acted as the cleanup crew, stepping in at the end to polish everything up. This makes a lot of sense when you think about it, but it’s completely the opposite of what everyone else does.
DeepSeek broke up its training process into four phases.
Step 1: Create Base Model with Supervised Fine-Tuning (SFT)
The process started with supervised fine-tuning, but only as the foundation. DeepSeek fine-tuned its base model on long chain-of-thought (CoT) data (examples on the order of 10k tokens) sourced from earlier models, including R1-Zero, and cleaned up by human annotators. This initial step focused on readability, setting up a base model ready for RL.
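For the curious, here’s roughly what one SFT step on a chain-of-thought example looks like in code. This is a minimal sketch using Hugging Face transformers, with "gpt2" and a single made-up example standing in for DeepSeek’s actual base model and curated cold-start dataset.

```python
# Minimal SFT sketch: teach the model to imitate a labeled chain-of-thought example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in for the real base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One labeled example: a prompt plus the step-by-step reasoning we want the model to copy.
example = (
    "Q: If a train travels 60 km in 1.5 hours, what is its speed?\n"
    "Reasoning: speed = distance / time = 60 / 1.5 = 40.\n"
    "A: 40 km/h"
)

inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])       # standard next-token prediction loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print("SFT loss on this example:", outputs.loss.item())
```

This is exactly the expensive part of the pipeline everyone else leans on: every example has to be written, checked, and curated by humans before the model ever sees it.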
Step 2: Pure Reinforcement Learning With New Technique GRPO
With the foundation in place, DeepSeek applied pure reinforcement learning to develop reasoning capabilities using a new technique called Group Relative Policy Optimization (GRPO), which teaches the model how to reason through tasks like math or coding.
GRPO takes a simple but effective approach: instead of judging each response on its own, it looks at a group of responses the model generates for the same question and compares them to each other.
So here’s how GRPO works:
The model generates several responses to a single prompt.
Each response gets a score based on specific rules—like how accurate or well-formatted it is.
GRPO calculates the average score across all the responses in the group to set a baseline.
If a response scores better than the baseline, it’s rewarded; if it scores worse, it’s penalized.
This process pushes the model to consistently outperform its own average, which forces it to improve.
Unlike older methods that rely on value functions (which predict how good a response might be before testing it), GRPO skips this step entirely. It focuses only on how each response compares to the others, simplifying the process and making it more efficient, which means money saved. You see where I’m going here?
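Here’s a little sketch of that group-relative scoring idea. This is my simplification with a made-up grading rule, not DeepSeek’s actual implementation.

```python
# Sketch of GRPO's group-relative scoring: score a group of responses to one prompt,
# use the group mean as the baseline, and measure each response against that baseline.
import statistics

def rule_based_reward(response: str) -> float:
    # Hypothetical scoring rule: 1 point for the right answer, 0.1 for clean formatting.
    score = 1.0 if "56" in response else 0.0
    score += 0.1 if response.startswith("Answer:") else 0.0
    return score

group = [
    "Answer: 56",          # correct and well formatted
    "it is 56 i think",    # correct but sloppy
    "Answer: 54",          # wrong but well formatted
    "no idea",             # wrong and sloppy
]

rewards = [rule_based_reward(r) for r in group]
baseline = statistics.mean(rewards)
spread = statistics.pstdev(rewards) or 1.0

# Advantage > 0 means "better than the group average" -> reinforced;
# advantage < 0 means "worse than average" -> penalized. No value function needed.
for response, r in zip(group, rewards):
    advantage = (r - baseline) / spread
    print(f"{response!r:25} reward={r:.2f} advantage={advantage:+.2f}")
```

The key point: the baseline comes from the group itself, so there’s no separate value model to train, which is a big chunk of where the efficiency comes from.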
Step 3: Polishing Phase, Rejection Sampling and SFT Refinement
In the third phase, DeepSeek focused on polishing the model. The team generated a large synthetic dataset of reasoning tasks, writing, and role-playing scenarios and then used rejection sampling, where only the best responses generated by the model were selected for further refinement.
Once the top-quality outputs were curated, DeepSeek applied a second round of supervised fine-tuning (SFT) to make sure the model was skilled at broader tasks, fluent, and precise. Phase 3 basically gave the model a final coat of polish to enhance its flexibility and versatility.
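In code, rejection sampling is conceptually as simple as the sketch below. The generate() and score() helpers here are hypothetical stand-ins for the real model and grader.

```python
# Rejection sampling sketch: sample many responses, keep only the best one for the SFT dataset.
import random

def generate(prompt: str) -> str:
    # Stand-in for sampling a response from the model.
    return random.choice(["Answer: 40 km/h", "maybe 40?", "Answer: 45 km/h", "dunno"])

def score(response: str) -> float:
    # Stand-in for the quality check (correctness, formatting, etc.).
    return (1.0 if "40" in response else 0.0) + (0.1 if response.startswith("Answer:") else 0.0)

prompt = "If a train travels 60 km in 1.5 hours, what is its speed?"
candidates = [generate(prompt) for _ in range(16)]   # sample a batch of responses
best = max(candidates, key=score)                    # reject everything but the best one
sft_dataset = [(prompt, best)]                       # only the winners feed the second SFT pass
print(sft_dataset)
```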
Step 4: Reinforcement Learning for Helpfulness
The final phase focused on making the model more accurate, harmless, and user-friendly. DeepSeek brought back its reinforcement learning technique, GRPO, this time with two types of rewards:
Rule-based rewards ensured responses were technically correct, like solving math problems accurately or using proper grammar and structure.
Outcome-based rewards evaluated how helpful or practical the response was for the user, focusing on real-world usefulness.
They combined these rewards to ensure the answers were not just correct but also meaningful in everyday use.
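A toy way to picture the combined reward (the weights and scoring rules below are invented for illustration, not DeepSeek’s published numbers):

```python
# Toy illustration of blending rule-based and outcome-based rewards in phase 4.
def rule_based_reward(response: str) -> float:
    # Technically correct? e.g. the math checks out.
    return 1.0 if "40 km/h" in response else 0.0

def outcome_based_reward(response: str) -> float:
    # Actually helpful? e.g. it shows its work instead of just stating a number.
    return 0.5 if "because" in response or "=" in response else 0.0

def total_reward(response: str) -> float:
    return 0.7 * rule_based_reward(response) + 0.3 * outcome_based_reward(response)

print(total_reward("40 km/h"))                          # correct but terse
print(total_reward("40 km/h, because 60 / 1.5 = 40"))   # correct and helpful -> higher reward
```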
So, DeepSeek’s approach flipped the traditional training process. SFT was put in the backseat and sprinkled in only when necessary to polish things up. Because of this decision, they largely bypassed the most expensive step of AI model training and were able to reduce their costs by roughly 95%.
My Initial Thoughts: OpenAI, Google, and Anthropic had every resource to lead with transparency the way DeepSeek did and become part of AI history, but instead everything they built remains locked behind paywalls and tied to tech giants like Microsoft and Amazon. By choosing transparency, DeepSeek may be remembered as the team that rewrote the rules of AI history, because they handed everyone their exact playbook for how they did it.
But what’s even more... intriguing is how their method hints at something bigger—models that can teach themselves by experimenting and self-correcting, inching closer to what some might call human-like problem-solving cough AGI.
This could very well be the next generation of AI development, or at the very least, the moment smaller, scrappier teams start outpacing the giants. Watch out Sam Altman. This is just the beginning.
One caveat, though, that we shouldn’t forget: this is a China-based company. So please, for the love of God and all that is good... please remember to protect your data.
Advancement #2: Project Stargate: A Private AI Gamble, Not a Government-Funded Trump Initiative
On his second day in office, President Donald Trump announced Project Stargate, a private-sector partnership aiming to build the largest AI infrastructure project in history. The venture, led by four companies (OpenAI, SoftBank, Oracle, and MGX), comes with a $500 billion price tag. This is really important to understand: despite the high-profile announcement, no federal funding is involved—the entire project relies on private investment.
The partnership between these four companies aims to construct massive AI development hubs across the country, starting with an 875-acre site in Abilene, Texas. Stargate’s ultimate goal is to build and maintain cutting-edge AI technology while fending off international competition, particularly from China.
Despite the hype, the project has raised more questions than answers. Its initial $100 billion investment has been criticized, with Elon Musk claiming the funding isn’t fully secured and Sam Altman pushing back at him on X. And what is Trump’s actual involvement in what seems like a private-sector bromance? His contribution is a promise to expedite construction through emergency declarations and by revoking Biden-era AI regulations.
My Initial Thoughts: When I first saw Project Stargate and Trump’s involvement all over the news, it sounded like this massive federal government project—a big, flashy AI push from Washington. But here’s the thing which is so so so important to understand: it’s not. It’s entirely private. OpenAI, SoftBank, Oracle, and MGX are the ones funding this, not the government. Trump just announced it and his only promise is to cut red tape. That’s a huge difference. So if Stargate works, it’s a win for private companies proving they can pull off something this massive. But if it crashes and burns? It’s all on them, not taxpayers, so in theory it sounds good.
Here’s the catch, though. At first glance, Project Stargate might look like a bold step forward for the United States in the AI race, but dig a little deeper and it’s clear this isn’t about national leadership. This is my speculation, but I think moves like this consolidate control into the hands of a wealthy elite. OpenAI, xAI’s Grok, and Meta’s Llama might operate out of the U.S., but they’re not truly American entities in the sense of public benefit; they’re privately owned, profit-driven enterprises.
So I’m not quite sure how I feel about this. The first country to crack AGI will set the stage for everything that follows, including ASI (Artificial Superintelligence). In theory, that’s good for me as an American if we’re hypothetically leading the charge. But if this progress is actually being fueled by private corporations, not the federal government, with so much power concentrated in the hands of a select few, it’s hard not to worry about how this plays out in the long run.
Advancement #3: OpenAI Launches Operator: The AI Agent Era Begins
OpenAI CEO Sam Altman started the year by declaring 2025 as a turning point for AI agents, and now we’re seeing their first major move. On Thursday, OpenAI announced the launch of Operator, a general-purpose AI agent designed to take control of web browsers and perform tasks autonomously. Whether it’s ordering groceries, booking flights, or filing expense reports, Operator aims to handle real-world tasks with minimal user input.
At the heart of Operator is OpenAI’s new Computer-Using Agent (CUA) model, which combines the vision capabilities of GPT-4o with the reasoning skills of OpenAI’s more advanced systems. The result is an AI agent designed to mimic human actions on a web page, from clicking buttons to navigating menus.
Operator has a lot of built-in safeguards, though. The CUA model is trained to seek user confirmation and double-check its actions before completing tasks that could have external consequences, such as submitting an order with payment information or sending an email. OpenAI also said it struggles with more complex tasks, such as managing intricate calendar systems or interacting with highly customized web interfaces.
(Intricate calendar systems. Yes, calendars are super... intricate. Lol.)
OpenAI also emphasizes privacy, stating that Operator doesn’t collect or screenshot data. Currently available as a research preview for U.S. users on ChatGPT’s $200 Pro plan, OpenAI plans to eventually roll Operator out to its Plus, Team, and Enterprise tiers. While it’s exclusive to the U.S. for now, Altman confirmed during a livestream that it would expand internationally.
My Initial Thoughts: Operator is a fascinating step toward autonomous AI agents, but it’s clear we are in the wee little stages of what this will become. The limitations of what it can do and the general need for human oversight signal that it’s still a huge work in progress.
Advancement #4: Speed Round
LinkedIn Being Sued for Using Data for AI Model Training Without Consent
LinkedIn is being sued for allegedly sharing Premium customers’ private messages with third parties to train AI models without consent. The lawsuit claims LinkedIn enabled data sharing by default last August through a new privacy feature, which wasn’t communicated in its terms of service or privacy policy.
Anthropic Gets $1B in Funding from Google
Anthropic has secured an additional $1 billion in funding from Google to further its development. Anthropic has traditionally had a strong partnership with Google’s rival, Amazon.
Perplexity AI Launches Sonar API
Perplexity has launched an API service called Sonar, enabling developers and enterprises to integrate its generative AI search tools into their own apps. Perplexity says the API optimizes for factuality and authority with real-time internet connections and is already being used by companies like Zoom for AI-powered assistants.
Microsoft No Longer OpenAI’s Exclusive Infrastructure Provider
Microsoft is no longer the exclusive provider of data center infrastructure for OpenAI. This change comes as OpenAI signs a massive new infrastructure deal with SoftBank, Oracle, and others, citing a lack of compute capacity as a key challenge.