[2024/06/17 ~ 06/23] Top ML Papers of the Week

(discuss.pytorch.kr)

1 point by ninebow 2024-06-24

This post is a machine translation of the weekly roundup of ML papers published by DAIR.AI.
Two main trends stand out among the papers selected this week. First, most of them deal with natural language processing (NLP), with particular attention to improving long-context language models (LMs), information retrieval, and question answering (QA). For example, 'Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?' explores what long-context language models can do, 'PlanRAG' proposes a new RAG technique for decision making, and 'From RAGs to rich parameters' probes how language models actually use retrieved context.
The other notable trend is work on mitigating memorization of training data in language models and on improving performance through self-refinement. 'Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs' and the Monte Carlo Tree Self-refine paper are worth watching here. Mitigating memorization matters because it pushes language models to learn more general knowledge and produce more original answers rather than simply copying their training data, which is one key to making language models more useful in practice.
Several factors are probably driving these trends. First, natural language processing keeps growing in importance within AI, and technical progress in the area is rapid. Second, as the volume of data grows enormously, so does the demand for technology that can process it efficiently and give users useful information. Finally, although modern language models keep getting more complex and more powerful, new approaches are still needed to solve the problems these models face, so researchers continue to look for ideas and methods that go beyond existing frameworks.

Claude 3.5 Sonnet

Paper introduction

A new model that achieves state-of-the-art performance on several common benchmarks such as MMLU and HumanEval. It outperforms Claude 3 Opus and GPT-4o on several benchmarks, with the exception of math word problem-solving tasks. It also performs strongly on vision tasks, which helps power several new features such as image-text transcription and the generation of artifacts.

Paper link

https://www.anthropic.com/news/claude-3-5-sonnet

Further reading

https://discuss.pytorch.kr/t/gn-claude-3-5-sonnet-gpt4o/4665

https://x.com/AnthropicAI/status/1803790676988920098

DeepSeek-Coder-V2

Paper introduction

Competes with closed-source models on code and math generation tasks. It achieves 90.2% on HumanEval and 75.7% on MATH, which according to the authors' report is higher than GPT-4-Turbo-0409. The release includes 16B and 236B parameter models with a 128K context length.

Abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Paper link

https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf

Further reading

https://github.com/deepseek-ai/DeepSeek-Coder-V2

https://x.com/omarsar0/status/1803078095219417475

TextGrad: Automatic "Differentiation" via Text

Paper introduction

A new framework for automatic differentiation via text: it backpropagates textual feedback provided by an LLM to improve the individual components of a system, using natural-language suggestions to optimize the computation graph. The user supplies only an objective function, without tuning prompts or components. The authors claim the best scores on LeetCode-Hard and SoTA performance on GPQA when combined with GPT-4o.

Abstract

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
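The loop the abstract describes (evaluate an objective, receive textual feedback as a "gradient", then rewrite the variable) can be illustrated with a toy sketch. This is not the TextGrad library API: `TextVariable`, `critic`, `apply_feedback`, and `optimize` are hypothetical names, and the two LLM calls are replaced by deterministic rules so that the block runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class TextVariable:
    """A piece of text being optimized, plus the feedback collected for it."""
    value: str
    feedback: list[str] = field(default_factory=list)


def critic(answer: str) -> str:
    """Hypothetical stand-in for an LLM that evaluates the objective.

    Returns natural-language feedback, or an empty string when satisfied.
    """
    if "because" not in answer:
        return "Add a justification introduced by 'because'."
    if not answer.endswith("."):
        return "End the answer with a period."
    return ""


def apply_feedback(variable: TextVariable) -> None:
    """Hypothetical stand-in for the LLM-driven update step."""
    for note in variable.feedback:
        if "because" in note:
            variable.value += " because 6 * 7 = 42"
        elif "period" in note:
            variable.value += "."
    variable.feedback.clear()


def optimize(variable: TextVariable, max_steps: int = 5) -> TextVariable:
    """Repeat: evaluate the objective, attach textual feedback, update."""
    for _ in range(max_steps):
        note = critic(variable.value)    # the "loss", expressed as text
        if not note:
            break                        # objective satisfied
        variable.feedback.append(note)   # "backward": attach the feedback
        apply_feedback(variable)         # "step": rewrite the variable
    return variable


result = optimize(TextVariable("The answer is 42"))
print(result.value)
```

In a real system both stand-ins would be LLM calls, and the feedback would flow through several chained components rather than a single variable.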

Paper link

https://arxiv.org/abs/2406.07496v1

Further reading

https://x.com/james_y_zou/status/1800917174124740667

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Paper introduction

Conducts a deep performance analysis of long-context LLMs on in-context retrieval and reasoning. The authors first present a benchmark of real-world tasks requiring a 1M-token context. They report that long-context LLMs can rival state-of-the-art retrieval and RAG systems without any explicit training on those tasks, while compositional reasoning (required in SQL-like tasks) remains challenging for these LLMs. They also call for continued research on advanced prompting strategies, having observed significant performance boosts when applying them to long-context problems.

Abstract

Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
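To picture what "natively ingesting an entire corpus" means for retrieval, here is a minimal sketch that places every document in the prompt, tagged with an ID, and then asks for the ID of the relevant one. The prompt layout and the helper name `build_corpus_prompt` are illustrative assumptions, not LOFT's actual prompt format.

```python
def build_corpus_prompt(corpus: dict[str, str], query: str) -> str:
    """Put the whole corpus in the context, one tagged document per line."""
    lines = [
        "You are given a corpus. Reply with the ID of the most relevant document.",
        "",
    ]
    for doc_id, text in corpus.items():
        lines.append(f"ID: {doc_id} | TEXT: {text}")
    lines += ["", f"QUERY: {query}", "ANSWER (document ID):"]
    return "\n".join(lines)


corpus = {
    "d1": "LOFT evaluates in-context retrieval and reasoning.",
    "d2": "Goldfish loss excludes some tokens from the loss.",
    "d3": "PlanRAG plans before it retrieves.",
}
prompt = build_corpus_prompt(corpus, "Which method drops tokens during training?")
print(prompt)
```

Unlike a retrieval pipeline there is no index or retriever here: the model sees every document, which is why the context length requirement grows with the size of the corpus.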

Paper link

https://arxiv.org/abs/2406.13121

Further reading

https://github.com/google-deepmind/loft

https://x.com/omarsar0/status/1804184820806766875

PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

Paper introduction

Enhances decision making with a new RAG technique called iterative plan-then-RAG (PlanRAG). It involves two steps: 1) an LM generates a plan for decision making by examining the data schema and the question, and 2) the retriever generates queries for data analysis. A final step checks whether a new plan is needed for further analysis, then either iterates on the previous steps or makes a decision based on the data. PlanRAG is found to be more effective than iterative RAG on the proposed Decision QA tasks.

Abstract

In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.
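The plan, retrieve, and re-plan cycle described above can be sketched with stubs. Everything here is a simplifying assumption rather than the authors' implementation: the planner and retriever are plain functions instead of LM calls, the database is an in-memory dictionary, the question is left implicit, and the business rule is assumed to be "maximize revenue minus cost".

```python
Database = dict[str, dict[str, float]]


def make_plan(schema: list[str], analyzed: set[str]) -> list[str]:
    """Hypothetical planner (an LM in the paper): list columns still to analyze."""
    return [column for column in schema if column not in analyzed]


def retrieve(db: Database, column: str) -> dict[str, float]:
    """Hypothetical retriever: run one query, here a single-column lookup."""
    return {row: values[column] for row, values in db.items()}


def plan_rag(db: Database, max_rounds: int = 5) -> str:
    """Iterate plan -> retrieve until no further analysis is planned, then decide."""
    schema = sorted(next(iter(db.values())))
    evidence: dict[str, dict[str, float]] = {}
    for _ in range(max_rounds):
        plan = make_plan(schema, set(evidence))      # step 1: plan (or re-plan)
        if not plan:                                 # no further analysis needed
            break
        evidence[plan[0]] = retrieve(db, plan[0])    # step 2: query the data
    # Assumed business rule: choose the option with the highest profit.
    return max(db, key=lambda row: evidence["revenue"][row] - evidence["cost"][row])


db = {
    "Lisbon": {"revenue": 12.0, "cost": 5.0},
    "Venice": {"revenue": 15.0, "cost": 11.0},
    "Bruges": {"revenue": 9.0, "cost": 1.0},
}
decision = plan_rag(db)
print(decision)
```

The point of the loop is that the plan is revisited after each retrieval, so the analysis can be extended before a decision is made.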

Paper link

https://arxiv.org/abs/2406.12430

Further reading

https://github.com/myeon9h/PlanRAG

https://x.com/omarsar0/status/1803262374574448757

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Paper introduction

Presents a modification of the next-token prediction objective, called goldfish loss, that helps mitigate the verbatim generation of memorized training data. It uses a simple technique: a pseudorandom subset of training tokens is excluded from the loss at training time. The authors show that goldfish loss resists memorization while keeping the model useful; however, the model may need to train for longer to learn as effectively from the training data.

Abstract

Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
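A minimal sketch of the idea: compute the usual next-token negative log-likelihood, but only over positions that a pseudorandom mask keeps. The masking rule below (hash the preceding tokens and drop roughly one position in `k`) is an assumption chosen for illustration, as are the function names and the defaults `k=4` and `context=2`; the abstract only states that a randomly sampled subset of tokens is excluded.

```python
import hashlib
import math


def goldfish_mask(tokens: list[int], k: int = 4, context: int = 2) -> list[bool]:
    """Return True for positions that contribute to the loss.

    Assumed rule: hash the `context` preceding tokens and drop the position
    when the hash is divisible by k, so the same passage is always masked
    the same way.
    """
    keep = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - context):i]
        digest = hashlib.sha256(repr(window).encode()).digest()
        keep.append(int.from_bytes(digest[:8], "big") % k != 0)
    return keep


def masked_next_token_loss(token_probs: list[float], keep: list[bool]) -> float:
    """Mean negative log-likelihood over kept positions only.

    token_probs[i] is the model's probability for the correct token at i.
    """
    kept = [-math.log(p) for p, m in zip(token_probs, keep) if m]
    return sum(kept) / len(kept) if kept else 0.0


tokens = list(range(100, 140))
keep = goldfish_mask(tokens)
loss = masked_next_token_loss([0.5] * len(tokens), keep)
print(sum(keep), "of", len(tokens), "positions kept; loss =", round(loss, 4))
```

Because dropped positions never receive a training signal, the model cannot learn to reproduce them from this passage, which is what breaks a verbatim chain of tokens.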

Paper link

https://arxiv.org/abs/2406.10209

Further reading

https://github.com/ahans30/goldfish-loss

https://x.com/omarsar0/status/1802729440163647754

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

Paper introduction

Reports GPT-4-level mathematical olympiad solutions using an approach that integrates LLMs with Monte Carlo Tree Search. The approach aims to improve the system's mathematical reasoning through systematic exploration, self-refinement, and self-evaluation.

Abstract

This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks. Addressing the challenges of accuracy and reliability in LLMs, particularly in strategic and mathematical reasoning, MCTSr leverages systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks within LLMs. The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance. Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks, including Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
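The Selection step relies on an Upper Confidence Bound to balance exploration and exploitation. The abstract mentions an improved UCB formula without stating it, so the sketch below uses the textbook UCB1 score instead; the function names, the exploration constant `c=1.4`, and the example scores are assumptions for illustration.

```python
import math


def ucb1(mean_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Textbook UCB1: favor high mean reward, but boost rarely visited nodes."""
    if visits == 0:
        return math.inf  # always try unvisited candidates first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def select(children: dict[str, tuple[float, int]]) -> str:
    """Pick the candidate answer with the highest UCB1 score.

    `children` maps a candidate to (mean self-evaluation score, visit count).
    """
    parent_visits = sum(visits for _, visits in children.values())
    return max(children, key=lambda k: ucb1(children[k][0], children[k][1], parent_visits))


candidates = {"draft": (0.60, 10), "refined": (0.55, 2)}
print(select(candidates))
```

Here the less-visited "refined" answer is selected even though its mean score is slightly lower, because its exploration bonus is larger.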

Paper link

https://arxiv.org/abs/2406.07394v2

Further reading

https://x.com/rohanpaul_ai/status/1801259208341373013

From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Paper introduction

Investigates more closely how LLMs utilize external knowledge over parametric information for factual queries. It finds that, in a RAG pipeline, LLMs take a "shortcut" and display a strong bias towards using only the context information to answer the question, while relying minimally on their parametric memory.

Abstract

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical uses in various applications of language models in search, question/answering, and chat-bots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find this pronounced shortcut behaviour true across both LLaMa and Phi family of models.

Paper link

https://arxiv.org/abs/2406.12824

Further reading

https://x.com/omarsar0/status/1803254134289895555

Open-Sora

Paper introduction

An open-source video generation model that can generate 16-second 720p videos. It is a 1.1B parameter model trained on more than 30M data samples and now supports image-to-video. The release presents an enhanced diffusion model and a video compression network for spatial and temporal compression. It also increases the controllability of generations and reduces training costs.

Paper link

Open-Sora 1.2 Report

Further reading

https://discuss.pytorch.kr/t/open-sora-feat-hpc-ai/3794

https://x.com/omarsar0/status/1803176105010171957

Tree Search for Language Model Agents

Paper introduction

Proposes an inference-time tree search algorithm for LM agents that performs exploration and enables multi-step reasoning. It is tested on interactive web environments and applied to GPT-4o, where it significantly improves performance, and the authors demonstrate that performance scales with increased test-time compute.

Abstract

Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a fundamental challenge remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work.
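Best-first tree search can be sketched with a priority queue: always expand the most promising state found so far, and stop when the goal is reached or the budget runs out. The toy environment below (integers, with "add one" and "double" as actions) and the names `best_first_search`, `expand`, `value`, and `is_goal` are assumptions for illustration; in the paper the states are web pages in a real environment, and the value estimates would come from a model rather than a distance formula.

```python
import heapq
from typing import Callable, Iterable


def best_first_search(
    start: int,
    expand: Callable[[int], Iterable[int]],
    value: Callable[[int], float],
    is_goal: Callable[[int], bool],
    budget: int,
) -> tuple[int, int]:
    """Expand states in order of estimated value.

    Returns (best state seen, number of expansions used).
    """
    frontier = [(-value(start), start)]   # max-heap via negated scores
    seen = {start}
    best, used = start, 0
    while frontier and used < budget:
        _, state = heapq.heappop(frontier)
        used += 1
        if value(state) > value(best):
            best = state
        if is_goal(state):
            break
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), nxt))
    return best, used


TARGET = 37


def run(budget: int) -> tuple[int, int]:
    """Search from 1 towards TARGET with a given expansion budget."""
    return best_first_search(
        start=1,
        expand=lambda s: (s + 1, s * 2),
        value=lambda s: -abs(TARGET - s),
        is_goal=lambda s: s == TARGET,
        budget=budget,
    )


print(run(3))    # small budget: stops early, far from the target
print(run(50))   # larger budget: reaches the target
```

The `budget` argument plays the role of test-time compute: with a small budget the search returns a partial result, and with a larger one it reaches the goal.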

This post was summarized with a GPT model, so some parts may be inaccurate. Please read it alongside the original. If you notice anything odd or incorrect while reading, please let us know in the comments. 🤗


[2024/06/17 ~ 06/23] Top ML Papers of the Week
