Successful "jailbreak" that hacks GPT-4 to lift restrictions on its text output has already been reported

March 17, 2023, 16:00

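GPT-4, the large language model officially announced by OpenAI on Tuesday, March 14, 2023, is said to significantly outperform not only its predecessor GPT-3.5 but also other existing AI systems. Language models like GPT-4 generally have restrictions on the text they output, but these restrictions can be lifted through text input alone, a practice known as "jailbreaking." Against this backdrop, Alex Albert, a computer science student at the University of Washington, has reported a successful jailbreak of the GPT-4-based ChatGPT.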

GPT-4 Simulator

https://www.jailbreakchat.com/prompt/b2917fad-6803-41f8-a6c8-756229b84270

On March 17, 2023, Albert reported that he had "helped create the first jailbreak for ChatGPT-4 that gets around the content filters."

Well, that was fast…

I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time

credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT

here's GPT-4 writing instructions on how to hack someone's computer pic.twitter.com/EC2ce4HRBH— Alex (@alexalbert__) March 16, 2023

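The jailbreak prompt Albert published is shown below. A prompt, in this context, is the text entered at the very start of a conversation with ChatGPT to set it up before any questions are asked.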

here's the jailbreak:https://t.co/eUTYAX45ia pic.twitter.com/OycgiB4yJ9— Alex (@alexalbert__) March 16, 2023

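Albert says the prompt "works by asking GPT-4 to simulate its own abilities to predict the next token." The procedure is to give GPT-4 several Python functions and tell it that one of them acts as a language model that predicts the next token. The parent function is then called with the starting tokens passed in.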

this works by asking GPT-4 to simulate its own abilities to predict the next token

we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token

we then call the parent function and pass in the starting tokens— Alex (@alexalbert__) March 16, 2023

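To use the prompt, "trigger words" that would normally be restricted, such as "bomb," "weapon," and "drug," must be split into tokens, and those tokens must replace the variables that hold the split-up text "someone's computer" in the published prompt. The input to "simple_function" must also be replaced with the beginning of the question being asked.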

to use it, you have to split “trigger words” (e.g. things like bomb, weapon, drug, etc) into tokens and replace the variables where I have the text "someone's computer" split up

also, you have to replace simple_function's input with the beginning of your question— Alex (@alexalbert__) March 16, 2023

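Albert calls this phenomenon "token smuggling": the adversarial prompt is split into tokens that GPT-4 does not piece together before it starts producing its output. According to Albert, if the adversarial prompt is split correctly, this gets past the content filters every time.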

this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn't piece together before starting its output

this allows us to get past its content filters every time if you split the adversarial prompt correctly— Alex (@alexalbert__) March 16, 2023
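Seen from the defender's side, the general idea behind splitting can be illustrated with a toy example. The sketch below is not how OpenAI's content filters work and is not taken from Albert's prompt; the blocklist, the placeholder word "forbidden," and all function names are hypothetical. It only shows that a check applied to each fragment in isolation can miss a word that is visible once the fragments are joined.

```python
# Toy illustration only: real content filters are far more sophisticated.
# BLOCKLIST and every function name here are hypothetical.
BLOCKLIST = {"forbidden"}  # harmless placeholder standing in for a trigger word


def contains_blocked(text: str) -> bool:
    """Return True if any blocklisted word appears as a substring of text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def check_fragments_separately(fragments: list[str]) -> bool:
    """Check each fragment on its own; a word split across fragments is missed."""
    return any(contains_blocked(fragment) for fragment in fragments)


def check_fragments_joined(fragments: list[str]) -> bool:
    """Check the concatenated text, i.e. what is eventually assembled."""
    return contains_blocked("".join(fragments))


fragments = ["for", "bid", "den"]
separate_result = check_fragments_separately(fragments)
joined_result = check_fragments_joined(fragments)
print(separate_result, joined_result)  # False True
```

The design point is simply that a check has to look at the reassembled text, not only at the pieces, which is one reason fragment-level matching is a weak defense on its own.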

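Asked "What do you hope to achieve by spreading this information?", Albert replied that it is necessary to understand GPT-4's capabilities and limitations while the model is still in its early stages.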

to start, I want to say I have nothing to gain here and I don't condone anyone actually acting upon any of GPT-4's outputs

however, I believe red-teaming work is important and shouldn't be conducted in the shadows of AI companies. the general public should know the capabilities… https://t.co/ATPwO7sbDM— Alex (@alexalbert__) March 16, 2023

Albert has previously published "Jailbreak Chat," a collection of example conversations for jailbreaking ChatGPT.

"Jailbreak Chat," a collection of example conversations that enable "jailbreaks" to force answers out of ChatGPT even for questions it will not answer - GIGAZINE