PyTorchやPythonなしの純粋なC言語を使用した大規模言語モデルトレーニングツール「llm.c」がリリースされる

2024年4月11日 23時0分

AIの本体と言える大規模言語モデル(LLM)のトレーニングはほとんどの場合PyTorchやPythonを使用して行われていますが、そうしたトレーニングを純粋なC言語のみで実装したツール「llm.c」が登場しました。まだ最適化が行われておらず従来の手法に速度面では敗北していますが、GPT-2のトレーニングを行う実装を約1000行のクリーンなコードで行えています。

GitHub - karpathy/llm.c: LLM training in simple, raw C/CUDA

https://github.com/karpathy/llm.c

作者のアンドレイ・カルパシー氏はOpenAIの創設グループの一員で、テスラのAIディレクターだった事もある人物です。

llm.cを使用することで、245MBの容量を持つPyTorchや107MBの容量を持つcPythonを使用せずに大規模言語モデルをトレーニングすることが可能です。実際にカルパシー氏が現在の大規模言語モデルの祖先と言える「GPT-2」をCPUでトレーニングするコードを実装したところ、依存関係を減らしつつ約1000行という少ないコード量で実装できたとのこと。

Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of cPython? No? Well now you can! With llm.c:https://t.co/w2wkY0Ho5m

To start, implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and exactly…— Andrej Karpathy (@karpathy) 2024年4月8日

実際のコードはGitHubで公開されています。最初に必要な量のメモリをまとめて取得し、トレーニング中はメモリの使用量が変動しないとのこと。このコードではPythonのライブラリを使用しないため、すべての個別のレイヤーの順方向パスと逆方向パスが手動で実装されています。

You can look at the raw training implementation here:https://t.co/ZiiCwYurMP

You'll see that we allocate all the required memory a single time in the beginning in one large block of 1D memory. From there on during training, no memory gets created or destroyed, so we stay at… pic.twitter.com/S92d5dPcJZ— Andrej Karpathy (@karpathy) 2024年4月8日

レイヤーの接続においては全てのポインタとテンソルオフセットが正しく配置されていることを確認しつつコードを書く必要があり、非常に面倒でマゾヒスティックな作業だったそうです。

Once you have all the layers, you just string all it all together. Not gonna lie, this was quite tedious and masochistic to write because you have to make sure all the pointers and tensor offsets are correctly arranged.

Left: we allocate a single 1D array of memory and then… pic.twitter.com/KLPz7udGni— Andrej Karpathy (@karpathy) 2024年4月8日

記事作成時点ではCPUでのトレーニングコードしか公開されていませんが、カルパシー氏はCUDAを使用したトレーニングのためのコードも作成中とのこと。CUDAへの移植や効率化を施すと、重い依存関係なしでPyTorchと同等レベルの速度でトレーニングできるはずだとカルパシー氏は期待を述べました。

Once you have the forward/backward, the rest of it (data loader, Adam update, etc) are mostly trivial.

The real fun starts now though: I am now porting this to CUDA layer by layer so that it can be made efficient, perhaps even coming within reasonable fraction of PyTorch, but…— Andrej Karpathy (@karpathy) 2024年4月8日

今後、精度をfp32からfp16へと下げたり、llama 2・mistral・gemmaのような現代的なアーキテクチャをサポートしたりする予定とのこと。また、もう少し安定した状態になったらこうしたコードを詳細にゼロから構築するムービーを公開するつもりだとカルパシー氏は述べています。