26] Top ML Papers of the Week

(discuss.pytorch.kr)

5 points by ninebow | 2023-11-27

Overview

This post is a machine translation of the weekly roundup of ML papers published by DAIR.AI.
Most of the papers selected this week relate to 'Large Language Models (LLMs)', 'Reasoning and Attention in AI Systems', and 'Artificial Intelligence in the Medical Domain'.
The 'Reasoning and Attention in AI Systems' theme stands out in particular, reflecting efforts to move AI beyond simple pattern recognition toward human-like reasoning and problem solving.
Research applying AI to medicine is also worth watching. 'LLMs as Collaborators for Medical Reasoning' is a key example of this direction, exploring how LLMs (large language models) can be applied to handling medical information.
In short, this week's selection shows that much current work focuses on complex reasoning abilities, human-like attention mechanisms, and medical applications of AI, which can be read as an important indicator of where AI technology is heading.

System 2 Attention (is something you might need too)

Paper introduction

Leverages the reasoning and instruction-following capabilities of LLMs to decide what to attend to; it regenerates the input context to include only the relevant portions before attending to the regenerated context to elicit the final response from the model; increases factuality and outperforms standard attention-based LLMs on tasks such as QA and math word problems.

Paper abstract

Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next token generations. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinion or irrelevant information, QA, math word problems and longform generation, where S2A increases factuality and objectivity, and decreases sycophancy.
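
The two-pass flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical text-in, text-out callable, and the wording of `REWRITE_PROMPT` is an assumption rather than the prompt used in the paper.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual prompts are not reproduced here.
REWRITE_PROMPT = (
    "Rewrite the following text, keeping only the parts that are relevant "
    "and unbiased for answering the question it contains.\n\n"
    "Text: {context}\n\nRewritten text:"
)


def s2a_answer(llm: Callable[[str], str], context: str) -> str:
    """Two-pass generation: regenerate the context, then answer from it."""
    # Pass 1: ask the model to regenerate the context without distractors.
    cleaned = llm(REWRITE_PROMPT.format(context=context))
    # Pass 2: answer using only the regenerated context, so the original
    # (possibly opinionated or irrelevant) text never reaches this call.
    return llm(f"{cleaned}\n\nAnswer:")
```

The point of the design is that the second call sees only the regenerated text, so irrelevant or opinionated material in the original context cannot be attended to when producing the final answer.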

Paper link

https://arxiv.org/abs/2311.11829

Read more

https://x.com/jaseweston/status/1726784511357157618

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Paper introduction

An overview of the methodologies for enhancing Transformer architecture modules to optimize long-context capabilities across all stages from pre-training to inference.

Paper abstract

With the bomb ignited by ChatGPT, Transformer-based Large Language Models (LLMs) have paved a revolutionary path toward Artificial General Intelligence (AGI) and have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents. However, a prevailing limitation exists: many current LLMs, constrained by resources, are primarily pre-trained on shorter texts, rendering them less effective for longer-context prompts, commonly encountered in real-world settings. In this paper, we present a comprehensive survey focusing on the advancement of model architecture in Transformer-based LLMs to optimize long-context capabilities across all stages from pre-training to inference. We first delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. Then, we offer a holistic taxonomy to navigate the landscape of Transformer upgrades on architecture to solve these problems. Afterward, we investigate widely used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, systems, and compilers that augment LLMs' efficiency and efficacy across different stages. Finally, we discuss the predominant challenges and potential avenues for future research in this domain. Additionally, we have established a repository where we curate relevant literature with real-time updates at https://github.com/Strivin0311/long-llms-learning.

Paper link

https://arxiv.org/abs/2311.12351

Read more

https://x.com/omarsar0/status/1727358484360945750

PaSS: Parallel Speculative Sampling

Paper introduction

An approach to reduce the inference time of LLMs based on a variant of speculative sampling and parallel decoding; achieves significant speed-ups (up to 30%) by learning as few as $O(d_{emb})$ additional parameters.

Paper abstract

Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations lead to the development of speculative sampling, where a second smaller model is used to draft a few tokens, which are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer and thus limits its adoption. As an alternative, we propose to use parallel decoding as a way to draft multiple tokens from a single model with no additional computational cost and no need for a second model. Our approach only requires an additional input token that marks the words that will be generated simultaneously. We show promising performance (up to 30% speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
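
The draft-then-verify idea behind speculative sampling can be sketched as follows. This is a simplified greedy variant, not the PaSS method itself: the `draft` and `target` next-token functions, the name `speculative_step`, and the parameter `k` are all hypothetical, and the sampling-based accept/reject rule is replaced here by an exact-match check. In PaSS the drafts would come from the same model through the additional input token described in the abstract, rather than from a separate draft function.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: maps a token sequence to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """Draft k tokens, keep the longest run the target agrees with,
    then append one token produced by the target itself."""
    # Draft phase: propose k tokens cheaply.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real system scores all k positions in a single
    # forward pass of the large model; this loop only simulates that check.
    accepted: List[Token] = []
    for tok in proposed:
        if target(prefix + accepted) != tok:
            break
        accepted.append(tok)

    # The target always contributes one token, so every step makes progress
    # even when the first drafted token is rejected.
    accepted.append(target(prefix + accepted))
    return prefix + accepted
```

With this greedy check the output matches what the target alone would have generated; the potential gain is that one verification pass can accept several tokens at once.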

Paper link

https://arxiv.org/abs/2311.13581

Read more

https://x.com/omarsar0/status/1728066181796418009

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Paper introduction

A multimodal model for learning across audio, video, and text that decouples multimodal modeling into separate, focused autoregressive models; the inputs are processed according to their modalities; this approach can handle longer videos than other models and outperforms state-of-the-art approaches on video QA, long video QA, and audio-video-text benchmarks.

Paper abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
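
The snippet partitioning mentioned in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: features are plain floats, `snippet_len` is a hypothetical parameter, and `combine` is a simple mean standing in for the learned Combiner, which in the paper is a trained module rather than a fixed average.

```python
from typing import List, Sequence


def partition(frames: Sequence[float], snippet_len: int) -> List[List[float]]:
    """Split a sequence into consecutive, non-overlapping snippets."""
    return [list(frames[i:i + snippet_len])
            for i in range(0, len(frames), snippet_len)]


def combine(video: List[float], audio: List[float]) -> float:
    """Stand-in for the Combiner: fuse one video snippet and one audio
    snippet into a single compact value (here, the mean of all features)."""
    merged = video + audio
    return sum(merged) / len(merged)


def snippet_representations(video: Sequence[float], audio: Sequence[float],
                            snippet_len: int) -> List[float]:
    """One fused representation per time-aligned pair of snippets."""
    v_snips = partition(video, snippet_len)
    a_snips = partition(audio, snippet_len)
    return [combine(v, a) for v, a in zip(v_snips, a_snips)]
```

The resulting list has one entry per snippet instead of one per frame, which is the property that keeps the sequence seen by the autoregressive component short as the video grows.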

Paper link

https://arxiv.org/abs/2311.05698

Read more

https://x.com/GoogleAI/status/1724553024088191211

Orca 2: Teaching Small Language Models How to Reason

Paper introduction

Proposes an approach to teach smaller language models to reason; specifically, the LM is taught to use reasoning techniques such as step-by-step processing, recall-then-generate, recall-reason-generate, extract-generate, and direct-answer methods; outperforms models of similar size and attains performance similar to or better than that of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings.

Paper abstract

Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict the potential of smaller models. We seek to teach small LMs to employ different solution strategies for different tasks, potentially different from the one used by the larger model. For example, while larger models might provide a direct answer to a complex task, smaller models may not have the same capacity. In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse benchmarks (corresponding to approximately 100 tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings. We make Orca 2 weights publicly available at aka.ms/orca-lm to support research on the development, evaluation, and alignment of smaller LMs.

Paper link

https://arxiv.org/abs/2311.11045

Read more

https://x.com/omarsar0/status/1726990087399915995

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Paper introduction

Proposes a graduate-level, Google-proof QA benchmark consisting of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry; the strongest GPT-4-based baseline achieves 39% accuracy; the benchmark enables scalable oversight experiments that can help obtain reliable and truthful information from modern AI systems that surpass human capabilities.

Paper abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

Paper link

https://arxiv.org/abs/2311.12022

Read more

https://x.com/idavidrein/status/1727033002234909060

Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

Paper introduction

A summary of chain-of-thought (CoT) reasoning, the foundational mechanics underpinning CoT techniques, and their application to language agent frameworks.

Paper abstract

Large language models (LLMs) have dramatically enhanced the field of language intelligence, as demonstrably evidenced by their formidable empirical performance across a spectrum of complex reasoning tasks. Additionally, theoretical proofs have illuminated their emergent reasoning capabilities, providing a compelling showcase of their advanced cognitive abilities in linguistic contexts. Critical to their remarkable efficacy in handling complex reasoning tasks, LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer. The CoT reasoning approach has not only exhibited proficiency in amplifying reasoning performance but also in enhancing interpretability, controllability, and flexibility. In light of these merits, recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents, which adeptly adhere to language instructions and execute actions within varied environments. This survey paper orchestrates a thorough discourse, penetrating vital research dimensions, encompassing: (i) the foundational mechanics of CoT techniques, with a focus on elucidating the circumstances and justification behind its efficacy; (ii) the paradigm shift in CoT; and (iii) the burgeoning of language agents fortified by CoT approaches. Prospective research avenues envelop explorations into generalization, efficiency, customization, scaling, and safety. This paper caters to a wide audience, including beginners seeking comprehensive knowledge of CoT reasoning and language agents, as well as experienced researchers interested in foundational mechanics and engaging in cutting-edge discussions on these topics. A repository for the related papers is available at https://github.com/Zoeyyao27/CoT-Igniting-Agent.

Paper link

https://arxiv.org/abs/2311.11797

Read more

https://x.com/omarsar0/status/1726803725220487277

GAIA: a benchmark for General AI Assistants

Paper introduction

A benchmark for general AI assistants consisting of real-world questions that require a set of fundamental abilities such as reasoning, multimodal handling, web browsing, and general tool-use proficiency; shows that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins.

Paper abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Paper link

https://arxiv.org/abs/2311.12983

Read more

https://x.com/ThomasScialom/status/1727683993045201339

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

Paper introduction

Proposes a collaborative multi-round framework for the medical domain that leverages role-playing LLM-based agents to enhance LLM proficiency and reasoning capabilities.

Paper abstract

Large Language Models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these obstinate issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents who participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work particularly focuses on the zero-shot scenario. Our results on nine data sets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending its reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance. Our code can be found at https://github.com/gersteinlab/MedAgents.
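
The five steps listed in the abstract can be sketched as a simple control loop. This is a rough illustration under stated assumptions, not the code from the MedAgents repository: `llm` is a hypothetical text-in, text-out callable, every prompt string is made up for the example, and `max_rounds` is an assumed cap on the discussion that the abstract does not specify.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-in, text-out model call


def medagents_answer(llm: LLM, question: str, max_rounds: int = 3) -> str:
    # Step 1: gather domain experts relevant to the question.
    raw = llm(f"List medical specialties relevant to: {question}")
    experts = [e.strip() for e in raw.split(",") if e.strip()]

    # Step 2: each role-playing expert proposes an individual analysis.
    analyses = [llm(f"As a {e} expert, analyse: {question}") for e in experts]

    # Step 3: summarise the analyses into a single report.
    report = llm("Summarise into a report:\n" + "\n".join(analyses))

    # Step 4: iterate until every expert approves the report (consensus),
    # or until the assumed round limit is reached.
    for _ in range(max_rounds):
        votes = [
            llm(f"As a {e} expert, reply 'yes' if you agree, "
                f"else propose a revision:\n{report}")
            for e in experts
        ]
        objections = [v for v in votes if v.strip().lower() != "yes"]
        if not objections:
            break
        report = llm("Revise the report using this feedback:\n"
                     + "\n".join(objections) + "\n\nReport:\n" + report)

    # Step 5: make the final decision from the agreed report.
    return llm(f"Report:\n{report}\n\nQuestion: {question}\nFinal answer:")
```

Because every step is a plain model call, the loop needs no additional training, and the intermediate report gives a readable record of how the answer was reached.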

Paper link

https://arxiv.org/abs/2311.10537

Read more

https://x.com/omarsar0/status/1726627951582511135

Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2

Paper introduction

Presents a suite of improved Tülu models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. The Tülu 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks.

Abstract

Since the release of Tülu [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into Tülu, resulting in Tülu 2, a suite of improved Tülu models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user preferences. Concretely, we release: (1) Tülu-V2-mix, an improved collection of high-quality instruction datasets; (2) Tülu 2, LLAMA-2 models finetuned on the V2 mixture; (3) Tülu 2+DPO, Tülu 2 models trained with direct preference optimization (DPO), including the largest DPO-trained model to date (Tülu 2+DPO 70B); (4) CODE Tülu 2, CODE LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple perspectives shows that the Tülu 2 suite achieves state-of-the-art performance among open models and matches or exceeds the performance of GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data, training and evaluation code to facilitate future open efforts on adapting large language models.
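
For readers unfamiliar with direct preference optimization (DPO), the loss for one preference pair can be written in a few lines. This is a minimal sketch of the standard DPO objective, not the Tülu 2 training code: the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are taken to be summed log-probabilities of each full response.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """DPO loss for a single (chosen, rejected) response pair."""
    # How much more (in log space) the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(beta * (chosen_margin - rejected_margin))).
    logits = beta * (chosen_margin - rejected_margin)

    # Numerically stable evaluation of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss shrinks as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, so no separate reward model is needed.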
