Daily Productive Sharing 928 - Infrastructure of OpenAI


One helpful tip per day:)

With the popularity of ChatGPT, OpenAI faces a growing number of engineering challenges. Gergely Orosz invited OpenAI's applied engineering lead, Evan Morikawa, to share how they work, and many of their practices depart completely from traditional best practices:

  1. After you send a prompt to ChatGPT, the system breaks it down into tokens, maps those tokens to embeddings, multiplies them through the model's weights, and produces a prediction for the next token (a minimal sketch of this pipeline appears after the list);
  2. At its core, ChatGPT uses the transformer architecture with a self-attention mechanism, which has a critical drawback: the cost of self-attention grows quadratically with sequence length (illustrated in the second sketch below);
  3. They treat prediction as a QKV (Query, Key, Value) problem, where Q represents the user's current input, K is the input used to produce predictions, and V is the predicted value; K and V can be cached (the KV cache), whereas Q cannot (see the third sketch below);
  4. The primary hardware bottleneck is VRAM capacity: even the most advanced GPU, the H100, is constrained by its memory, and because its architecture was fixed years ago, that will not change in the short term (the back-of-envelope arithmetic after the list shows why memory fills up so quickly);
  5. The GPU shortage is, of course, a broader industry problem, but with Microsoft Azure behind them they can draw on whatever GPU capacity is available. This means that from day one, their server scheduling was designed to be globally coordinated;
  6. Because the GPU is the dominant computational bottleneck, the physical location of a server matters much less, which makes edge computing largely irrelevant in this context;
  7. Monitoring GPU utilization is of limited use, because GPUs compute in a completely different way from CPUs: the utilization figure only tells you whether the GPU is doing something, not how efficiently it is doing it.
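
To make point 1 concrete, here is a minimal numpy sketch of the token → embedding → weights → prediction pipeline. Every size and weight in it is invented for illustration; a real model pushes the embeddings through many transformer layers rather than the single averaging step used here.

```python
import numpy as np

# Point 1, end to end: tokens -> embeddings -> weights -> next-token prediction.
# All dimensions and weights below are made up for illustration.

rng = np.random.default_rng(0)

vocab_size, d_model = 100, 16                       # tiny toy dimensions
embedding = rng.normal(size=(vocab_size, d_model))  # token id -> vector
w_out = rng.normal(size=(d_model, vocab_size))      # hidden state -> logits

token_ids = np.array([5, 42, 7])     # "the prompt, already tokenized"

x = embedding[token_ids]             # 1. look up one embedding per token
h = x.mean(axis=0)                   # 2. stand-in for the transformer stack
logits = h @ w_out                   # 3. multiply by the output weights
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax over the vocabulary

print("predicted next token id:", int(probs.argmax()))
```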
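
The quadratic growth in point 2 comes from self-attention scoring every token against every other token, so the score matrix has shape (seq_len, seq_len). A short sketch, again with made-up shapes:

```python
import numpy as np

# Point 2: self-attention compares every token with every other token,
# so the score matrix is (seq_len, seq_len) -- quadratic in sequence length.

def self_attention(x):
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)       # shape (n, n): the n^2 term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
for n in (128, 256, 512):
    _ = self_attention(rng.normal(size=(n, 64)))
    # doubling the sequence length quadruples the number of scores:
    print(f"{n} tokens -> {n * n:,} pairwise attention scores")
```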
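
Point 3 is the idea behind the KV cache. In this sketch the projection matrices and dimensions are invented; the point is only that K and V rows are appended once and then reused, while Q is recomputed for each new token:

```python
import numpy as np

# Point 3: during generation, K and V for already-processed tokens never
# change, so they are cached; only the newest token's Q is computed fresh.

rng = np.random.default_rng(0)
d = 64
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []   # grows by one row per generated token

def decode_step(x_new):
    """Attend from the newest token over all cached tokens."""
    q = x_new @ w_q                    # recomputed every step
    k_cache.append(x_new @ w_k)        # appended once, then reused
    v_cache.append(x_new @ w_v)
    k = np.stack(k_cache)              # (t, d)
    v = np.stack(v_cache)              # (t, d)
    scores = k @ q / np.sqrt(d)        # (t,): one score per past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                 # attention output for the new token

for _ in range(5):                     # generate five steps
    out = decode_step(rng.normal(size=d))
print(len(k_cache), "cached K/V rows after 5 steps")
```

Note that the cache persists for the whole sequence, which is one reason long conversations keep consuming GPU memory.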
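
Some back-of-envelope arithmetic for point 4. The model shape below is the publicly documented GPT-3 configuration, used purely as an example of scale rather than as a claim about OpenAI's production models; the H100's 80 GB of HBM is its published memory capacity:

```python
# Point 4: why VRAM is the bottleneck. GPT-3's published configuration,
# used only to illustrate the orders of magnitude involved.

n_params   = 175e9    # parameters
n_layers   = 96
d_model    = 12288
bytes_fp16 = 2        # bytes per value at 16-bit precision

weights_gb = n_params * bytes_fp16 / 1e9
# each token stores one K and one V vector per layer:
kv_per_token = 2 * n_layers * d_model * bytes_fp16
kv_gb_2048   = kv_per_token * 2048 / 1e9

print(f"weights:            {weights_gb:.0f} GB")        # ~350 GB
print(f"KV cache per token: {kv_per_token / 1e6:.1f} MB")  # ~4.7 MB
print(f"KV cache @ 2048 t:  {kv_gb_2048:.1f} GB")        # ~9.7 GB
# An H100 has 80 GB of HBM, so weights alone must be sharded across many
# GPUs, and every concurrent request's KV cache competes for what remains.
```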

If you enjoy today's sharing, why not subscribe?

Need a superb CV? Please try our CV Consultation.


Click below to read the full article ⬇️