I have tried them, and to be honest I was not surprised. The hosted service was better at longer code snippets and in particular, I found that it was consistently better at producing valid chain of thought reasoning chains (I’ve found that a lot of simpler models, including the distills, tend to produce shallow reasoning chains, even when they get the answer to a question right).
I’m aware of how these models work; I work in this field and have been developing a benchmark for reasoning capabilities in LLMs. The distills are certainly still technically impressive and it’s nice that they exist, but the gap between them and the hosted version is unfortunately nontrivial.
TBH the paper is a bit light on the details, at least compared to the standards of top ML conferences. A lot of DeepSeek’s innovations on the engineering front aren’t super well documented (at least well enough that I could confidently reproduce them) in their papers.