Daily Productive Sharing 1067 - DeepSeek FAQ
One helpful tip per day :)
Following the release of DeepSeek’s R1 model, numerous analysis articles have appeared. Ben Thompson, based in Taipei, wrote a particularly insightful and accessible one:
- Technically, DeepSeek’s biggest breakthrough was reducing training costs by two orders of magnitude. This was mainly achieved through DeepSeekMoE and DeepSeekMLA, which were already used in the V2 model a year ago but have been further refined this time.
- MoE refers to the mixture-of-experts approach, which splits the model into multiple experts and activates only the ones needed for a given input, significantly reducing inference costs. DeepSeekMoE in V2 made key innovations on this idea, including splitting experts into more finely specialized experts alongside shared experts with more general capabilities (a toy sketch appears after this list).
- DeepSeekMoE also introduced new methods for load balancing and routing during training, making the process even more efficient.
- DeepSeekMLA, or multi-head latent attention, compresses the key-value cache, greatly reducing memory usage during inference (also sketched after this list).
- The training cost for V3 is shockingly low. DeepSeek claims the full training run took 2.788 million H800 GPU hours, which at $2 per GPU hour comes to only $5.576 million (the arithmetic is checked in a snippet after this list). Although parameters are stored at BF16 or FP32 precision, calculations were performed at FP8 precision, and the 2048 H800 GPUs deliver a combined 3.97 exaFLOPS, i.e. 3.97 quintillion floating-point operations per second.
- DeepSeek dedicated 20 of the 132 streaming multiprocessors on each H800 specifically to managing cross-chip communication.
- To work around the H800’s restricted interconnect bandwidth, DeepSeek engineers had to drop down to PTX, Nvidia GPUs’ low-level instruction set, roughly equivalent to assembly language. GPUs with higher bandwidth would not need such optimizations; working in CUDA would be enough.
- If DeepSeek had had access to H100s, they might have chosen a larger training cluster instead of making so many optimizations aimed at limited bandwidth.
- Distillation is easiest for a company working with its own models, since it has full access to them, but it can also be done through APIs, or even more creatively through chat clients, albeit more clumsily.
- If Microsoft can serve inference to its customers at extremely low cost, it can cut its data center and GPU spending significantly. More likely, lower inference costs will simply drive usage up dramatically.
- Another big winner is Amazon. While AWS has largely struggled to develop its own high-quality models, high-quality open-source models let it offer services at much lower costs than expected.
- The reduced memory requirements for inference make edge computing more feasible, an area where Apple has the best hardware.
- Meanwhile, Google’s position may become more difficult, as reduced hardware needs weaken the relative advantage of its reliance on TPUs.
- AI doesn’t need to be explicitly taught how to reason—given enough computing power and data, it can learn on its own.
- DeepSeek leads in efficiency, but that is different from overall leadership.
- While there are major loopholes in the chip ban, it is likely that DeepSeek achieved this legally using permitted chips.
- Nvidia still has two key advantages: CUDA is the preferred language for training models and only runs on Nvidia chips. Additionally, Nvidia excels at integrating multiple chips into a large virtual GPU.
- DeepSeek has just proven another path—massive optimization can yield incredible results even with weaker hardware and lower memory bandwidth. Simply paying Nvidia more isn’t the only way to build a better model.
- Just because DeepSeek found a more efficient way to use compute doesn’t mean increasing compute resources is worthless.
- In the long run, lower inference costs should significantly boost AI adoption.
- Models like R1 and o1 perform exceptionally well because they are backed by more computing power. AI’s performance and capabilities heavily rely on increased computation, which benefits Nvidia.
- Software and technical know-how cannot be embargoed, as has been widely discussed. Chips, however, are physical products, and the US has a justifiable reason for keeping them out of China.
- What concerns me is the mindset behind the chip ban: the US isn’t competing through future innovation but by negating past innovation.
- Six years on, the whole world now has access to the weights of a dramatically superior model, while the control strategy that OpenAI relied on the US government to enforce has completely failed.
- More importantly, this highlights why openness is crucial—we need more AI, not an unaccountable board controlling everything.
- In reality, open-sourcing and publishing research come at no cost. For technical talent, having others follow your innovations brings immense fulfillment.
- On the other hand, Anthropic might be this week’s biggest loser. DeepSeek topping the App Store charts underscores how Claude has little influence outside of San Francisco.
- DeepSeek has provided a massive gift to nearly everyone. The biggest winners are consumers and businesses that foresee a future where AI products and services are nearly free.
- Alternatively, we can recognize that real competition has arrived and finally give ourselves permission to compete.
- If we choose to compete, we can still win. And if we do win, we will have a Chinese company to thank.
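
To make the mixture-of-experts bullet above more concrete, here is a minimal PyTorch sketch of the general idea: a router activates only the top-k specialized experts for each token while a small set of shared experts always runs. All names, sizes, and expert counts are invented for illustration; this is not DeepSeek’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks the top-k specialized
    experts per token, and a small set of shared experts is always active.
    Shapes and counts are illustrative, not DeepSeek's configuration."""

    def __init__(self, d_model=64, n_experts=8, n_shared=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.experts = nn.ModuleList([make_expert() for _ in range(n_experts)])
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick k experts per token
        out = sum(e(x) for e in self.shared)                 # shared experts always run
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

x = torch.randn(16, 64)
print(MoELayer()(x).shape)   # torch.Size([16, 64]); only k + shared experts ran per token
```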
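The memory saving behind multi-head latent attention is easiest to see with a toy example. The sketch below illustrates only the general idea of caching one small latent vector per token and expanding it into keys and values on demand; the dimensions are made up, and the real DeepSeekMLA design is considerably more involved (for instance, it absorbs the up-projections into the attention computation).

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy illustration of latent KV compression: cache a small latent per
    token instead of full per-head keys and values. Dimensions are invented."""

    def __init__(self, d_model=1024, n_heads=16, d_head=64, d_latent=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)            # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head)   # expand latent -> keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head)   # expand latent -> values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                                   # h: (seq, d_model)
        latent = self.down(h)                               # (seq, d_latent) is all we cache
        k = self.up_k(latent).view(-1, self.n_heads, self.d_head)
        v = self.up_v(latent).view(-1, self.n_heads, self.d_head)
        return latent, k, v

m = LatentKVCache()
latent, k, v = m(torch.randn(32, 1024))
full = k.numel() + v.numel()                       # what a standard KV cache would store
print(latent.numel(), "cached floats vs", full)    # 4096 vs 65536: a 16x reduction
```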
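Finally, the headline V3 training-cost claim is simple arithmetic, checked below using the figures quoted in the post; the $2-per-GPU-hour rental price is the post’s own assumption.

```python
# Back-of-the-envelope check of the reported V3 training cost.
gpu_hours = 2_788_000          # 2.788M H800 GPU hours claimed for the full run
price_per_hour = 2.0           # assumed rental price in USD, as quoted in the post
cost = gpu_hours * price_per_hour
print(f"${cost:,.0f}")         # $5,576,000 -> the quoted $5.576 million

n_gpus = 2048
days = gpu_hours / n_gpus / 24
print(f"{days:.0f} days on a 2048-GPU cluster")   # roughly 57 days of wall-clock time
```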
If you enjoy today's sharing, why not subscribe?
Need a superb CV? Please try our CV Consultation.