• 0 Posts
  • 5 Comments
Joined 1 year ago
cake
Cake day: January 25th, 2024

help-circle

  • I doubt that will be the case, and I’ll explain why.

    As mentioned in this article,

    SFT (supervised fine-tuning), a standard step in AI development, involves training models on curated datasets to teach step-by-step reasoning, often referred to as chain-of-thought (CoT). It is considered essential for improving reasoning capabilities. DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning (RL) to train the model. This bold move forced DeepSeek-R1 to develop independent reasoning abilities, avoiding the brittleness often introduced by prescriptive datasets.

    This totally changes the way we think about AI training, which is why while OpenAI spent $100m on training GPT-4, running an expected 500,000 GPUs, DeepSeek used about 50,000, and likely spent that same roughly 10% of the cost.

    So while operation, and even training, is now cheaper, it’s also substantially less compute intensive to train models.

    And not only is there less data than ever to train models on that won’t cause them to get worse by regurgitating other worse quality AI-generated content, but even if additional datasets were scrapped entirely in favor of this new RL method, there’s a point at which an LLM is simply good enough.

    If you need to auto generate a corpo-speak email, you can already do that without many issues. Reformat notes or user input? Already possible. Classify tickets by type? Done. Write a silly poem? That’s been possible since pre-ChatGPT. Summarize a webpage? The newest version of ChatGPT will probably do just as well as the last at that.

    At a certain point, spending millions of dollars for a 1% performance improvement doesn’t make sense when the existing model just already does what you need it to do.

    I’m sure we’ll see development, but I doubt we’ll see a massive increase in training just because the cost to run and train the model has gone down.


  • That set of tokens/s is the performance, or response time if you’d like to call it that. GPT-o1 tends to get anywhere from 33-60, whereas in the example I showed previously, a Raspberry Pi can do 200 on a distilled model.

    Now, granted, a distilled model will produce worse performance than the full one, as seen in a benchmark comparison done by DeepSeek here (I’ve outlined the most distilled version of the newest DeepSeek model, which is likely the kind that is being run on the Raspberry Pi, albeit likely with some changes made by the author of that post, as well as OpenAI’s two most high-end models of a comparable distillation)

    The gap in quality is relatively small for a model that is likely distilled far past what OpenAI’s “mini” model is, when you consider that even regular laptop/PC hardware is orders of magnitudes more powerful than a Raspberry Pi, or that an external AI accelerator can be bought for as little as $60, the quality in practice could be very comparable with even slightly less distillation, especially with fine-tuning for a given use case (e.g. a local version of DeepSeek in a code development platform would be fine-tuned specifically just to produce code-related results)

    If you get into the region of only cloud-hosted instances of DeepSeek that are running at-scale on GPUs like OpenAI’s models are, the performance is only 1-2 percentage points off from OpenAI’s model, at about 3-6% of the cost, which effectively means 3-6% of the total amount of GPU power being paid for compared to the amount of GPU power OpenAI is paying for.