Apple AI researchers publish findings that "today's AI language models reason about arithmetic word problems at a level below that of elementary school students"
AI built on large language models (LLMs) such as OpenAI's GPT-4 offers advanced and wide-ranging capabilities, from generating natural text to completing a variety of tasks. Even so, with arithmetic at an elementary school level, there are still cases where a word problem leads the AI into mistakes a human would not make, and it fails to answer correctly. A paper published by artificial intelligence scientists at Apple presents findings that AI based on large language models from Meta, OpenAI, and others "lacks basic reasoning ability."
https://arxiv.org/abs/2410.05229
Researchers question AI's 'reasoning' ability as models stumble on math problems with trivial changes | TechCrunch
https://techcrunch.com/2024/10/11/researchers-question-ais-reasoning-ability-as-models-stumble-on-math-problems-with-trivial-changes/?guccounter=1
Reasoning failures highlighted by Apple research on LLMs
https://appleinsider.com/articles/24/10/12/apples-study-proves-that-llm-based-ai-models-are-flawed-because-they-cannot-reason
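To examine the reasoning ability of AI, a group of artificial intelligence scientists at Apple proposed a new benchmark called "GSM-Symbolic." GSM-Symbolic is a framework for measuring the reasoning ability of AI; it probes weaknesses in mathematical reasoning by adding "contextual information" to a question that has no effect on the underlying math.
A task the research team developed, called "GSM-NoOp," looks like the following. In terms of difficulty, it is an arithmetic word problem at an upper elementary school level.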
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he picked on Friday. How many kiwis did he harvest in total over the three days?
When the research team tested AI models from OpenAI and Meta, the models occasionally made calculation errors, but they reliably answered this simple problem: 44 (Friday) + 58 (Saturday) + 44 × 2 (Sunday is double Friday) = 190.
Next, a statement unrelated to the question is appended toward the end of the problem. In the version below, the added sentence is the one immediately before the final question.
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he picked on Friday. Of the kiwis harvested on Sunday, five were a bit smaller than average. How many kiwis did he harvest in total over the three days?
Once the detail that "five kiwis are small" is added, one AI model after another subtracts the "five smaller-than-average kiwis" from the total and answers "185."
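The arithmetic in this example can be checked with a short sketch. The variable names are illustrative and are not taken from the paper; the "distracted" value simply reproduces the mistaken subtraction described above.

```python
# Kiwi counts stated in the problem.
friday = 44
saturday = 58
sunday = 2 * friday  # Sunday is double Friday's count

# Irrelevant detail: five of Sunday's kiwis were smaller than average.
# They are still kiwis, so this number must not change the total.
smaller_than_average = 5

correct_total = friday + saturday + sunday
# The mistaken answer subtracts the irrelevant number from the total.
distracted_total = correct_total - smaller_than_average

print(correct_total)     # 190
print(distracted_total)  # 185
```

The size of the kiwis is a no-op with respect to the count, which is presumably where the name GSM-NoOp comes from.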
Cases where AI proves vulnerable to tricks that look foolish and trite to a human have been pointed out before. "AlphaGo," from DeepMind, which Google acquired in 2014, first defeated a professional Go player in January 2016 and went on to dominate, beating even the world's strongest player. However, an amateur player who declared that he had "discovered a weakness in AI" won decisively, taking 14 of 15 games against a Go AI comparable in strength to AlphaGo. He did so with a tactic that rarely works against human players: "slowly forming a large loop of stones to surround one of the opponent's groups while distracting the AI with moves in other corners of the board."
A player who crushes the strongest Go AI appears, drawing attention as a human victory that exploits an AI weakness - GIGAZINE
Mehrdad Farajtabar, a co-author of the paper, posted about the results on X and explained the analysis. According to Farajtabar, when OpenAI released "GSM8K," a dataset of elementary school level math word problems it created in 2021, the GPT-3 of the time scored only 35%. With subsequent progress, models with about 3 billion parameters now score above 85% and larger models exceed 95%, yet the question "has the models' reasoning ability really improved?" remained open.
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning? vs.… pic.twitter.com/PaWYedlj9D— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
Farajtabar and colleagues therefore developed GSM-Symbolic as a new LLM testing tool to replace GSM8K, whose accuracy as a measure had been called into question. GSM-Symbolic creates templates from the GSM8K test set and generates instances focused on the points to be tested, which makes it possible to design controllable experiments. According to Farajtabar, most AI models score lower on GSM-Symbolic than on GSM8K.
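As a rough illustration of the template idea, the sketch below fills a symbolic template with different names, items, and numbers and computes the ground-truth answer for each instance. The template wording, the placeholder names, the value ranges, and the `make_instance` helper are all assumptions made for this example; they are not the paper's actual templates or code.

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic (an assumption,
# not a template taken from the paper).
TEMPLATE = (
    "{name} picks {x} {fruit} on Friday. Then {name} picks {y} {fruit} "
    "on Saturday. On Sunday, {name} picks double the number of {fruit} "
    "picked on Friday. How many {fruit} does {name} have in total?"
)
NAMES = ["Oliver", "Sophie", "Liam"]
FRUITS = ["kiwis", "apples", "plums"]


def make_instance(rng):
    """Fill the template with random values and return (question, answer)."""
    x = rng.randint(10, 99)
    y = rng.randint(10, 99)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), fruit=rng.choice(FRUITS), x=x, y=y
    )
    # Friday + Saturday + double Friday. The name and fruit do not
    # appear in this formula, so they cannot affect the correct answer.
    answer = x + y + 2 * x
    return question, answer


rng = random.Random(0)  # fixed seed so the generated set is reproducible
instances = [make_instance(rng) for _ in range(5)]
for question, answer in instances:
    print(question, "->", answer)
```

Because the correct answer depends only on the numbers, any change in a model's accuracy across instances that differ only in name or fruit points to sensitivity to surface wording rather than to the math.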
3/ Introducing GSM-Symbolic-our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic… pic.twitter.com/6lqH0tbYmX— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
LLMs are sensitive to changes in details such as the names of people or the kinds of food in a problem. Because the numbers are unchanged, the result of the calculation should be the same, yet changing only the names affects the answers. The researchers concluded: "Changing a word or two in an irrelevant way, or adding a bit of irrelevant information, can produce a different answer. It is impossible to build reliable agents on this kind of foundation."
In response to the paper and Farajtabar's explanation, OpenAI researcher Boaz Barak objected: "This is a very interesting paper, but I disagree with the hypothesis that 'current LLMs are not capable of genuine logical reasoning.'" According to Barak, many of the LLMs currently released are "chat models" that were not built for math exams and that focus on conversation with users, which is why they are sensitive to changes in the input text. In his view, mistakes on elementary school level arithmetic do not occur because LLMs cannot reason; they are the behavior to be expected from models that were trained correctly. "If you want them to solve arithmetic, I conjecture that with slightly better prompting the performance drops would be mostly or entirely recovered in all of these failure cases," Barak said.
This is very interesting paper, but disagree with hypothesis that it shows that "current LLMs are not capable of genuine logical reasoning."
There is a confounder here:
Many top LLMs are *chat models*. Chat is very different from math exams. Chats are messy, and to do a good… https://t.co/EvkbM7iFTe— Boaz Barak (@boazbaraktcs) October 11, 2024
In fact, to overcome the reasoning weaknesses of AI, OpenAI announced "Strawberry" in September 2024, an AI model focused on reasoning for handling complex math and programming tasks.
OpenAI may release "Strawberry," a new AI model focused on reasoning, within two weeks - GIGAZINE