[Translation] Road to Sora: An Introduction to the Foundational Research for Understanding OpenAI's Sora (feat. Oxen.AI)

(discuss.pytorch.kr)

6 points by ninebow, 2024-03-26 | 1 comment

Oxen.AI, which builds tools for high-quality AI datasets, runs ArXiv Dives, a weekly Friday session for reading AI papers and sharing insights.
This post is a translation of the Road to Sora article presented at ArXiv Dives in early March, published with permission.
Road to Sora surveys the background knowledge needed to understand the Sora model, based on the technical report for Sora, the video generation model released by OpenAI.

"Road to Sora" Paper Reading List

by Greg Schoeninger, Mar 5, 2024

This post is an effort to put together a reading list for our Friday paper club, ArXiv Dives. Since no official paper has been released for Sora yet, the goal is to follow the breadcrumbs from OpenAI's technical report on Sora. We plan to go over a few of the fundamental papers in the coming weeks during our Friday paper club, to help paint a better picture of what is going on behind the curtain of Sora.

What is Sora?

Sora has taken the Generative AI space by storm with its ability to generate high-fidelity videos from natural language prompts. If you haven't seen an example yet, here's a generated video of a turtle swimming in a coral reef for your enjoyment.

While the team at OpenAI has not released an official research paper on the technical details of the model itself, they did release a technical report that covers some high-level details of the techniques they used and some qualitative results.

https://openai.com/research/video-generation-models-as-world-simulators

Sora Architecture Overview

After reading the papers below, the architecture here should start to make sense. The technical report is a 10,000-foot view, and my hope is that each paper will zoom into a different aspect and together paint the full picture. There is a nice literature review called "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models" that gives a high-level diagram of a reverse-engineered architecture.

The team at OpenAI states that Sora is a "Diffusion Transformer", which combines many of the concepts from the papers in the reading list below, but applied to latent spacetime patches generated from video.

This is a combination of the style of patches used in the Vision Transformer (ViT) paper with latent spaces similar to those in the Latent Diffusion paper, combined in the style of the Diffusion Transformer. The patches span not only the width and height of the image but also extend along the time dimension of the video.

It's hard to say exactly how they collected the training data for all of this, but it seems to be a combination of the techniques in the DALL·E 3 paper and using GPT-4 to elaborate on textual descriptions of images, which they then turn into videos. Training data is likely the main secret sauce here, which is probably why it gets the least detail in the technical report.

Use Cases

There are many interesting use cases and applications for video generation technologies like Sora. Whether it be movies, education, gaming, healthcare, or robotics, there is no doubt that generating realistic videos from natural language prompts is going to shake up multiple industries.

The note at the bottom of this diagram rings true for us at Oxen.ai. If you are not familiar with Oxen.ai, we are building open source tools to help you collaborate on and evaluate the data that comes in and out of machine learning models. We believe that many people need visibility into this data, and that it should be a collaborative effort. AI is touching many different fields and industries, and the more eyes on the data that trains and evaluates these models, the better.

Check us out here: https://oxen.ai

Paper Reading List

There are many papers linked in the references section of the OpenAI technical report, but it is a bit hard to know which ones to read first or which are important background knowledge. We've sifted through them, selected what we think are the most impactful and interesting ones to read, and organized them by type.

Background Papers

The quality of generated images and video has been steadily increasing since 2015. The biggest gains that caught the general public's eye began in 2022 with Midjourney, Stable Diffusion, and DALL·E. This section contains some foundational papers and model architectures that are referenced over and over again in the literature. While not all of these papers are directly involved in the Sora architecture, they are all important context for how the state of the art has improved over time.

We have covered many of the papers below in previous ArXiv Dives, so if you want to catch up, all the content is available on the Oxen.ai blog.

https://www.oxen.ai/community/arxiv-dives

U-Net

"U-Net: Convolutional Networks for Biomedical Image Segmentation" is a great example of a paper that started out on a task in one domain (biomedical imaging) and was then applied across a wide range of use cases. Most notably, it is the backbone of many diffusion models, such as Stable Diffusion, where it helps the model learn to predict and remove noise at each step. While not directly used in the Sora architecture, it is important background knowledge for the previous state of the art.

https://arxiv.org/abs/1505.04597

Language Transformers

"Attention Is All You Need" is another paper that proved itself on a machine translation task but ended up being a landmark paper for all of natural language processing research. Transformers are now the backbone of many LLM applications, such as ChatGPT. They also turned out to extend to many modalities, and are used as a component of the Sora architecture.

https://arxiv.org/abs/1706.03762
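
As a rough illustration of the core operation in that paper, here is a minimal sketch of scaled dot-product attention in plain Python. The function names and the toy inputs are our own for illustration; real implementations use batched matrix operations, multiple heads, and learned projections, none of which are shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, and values are lists of equal-length float vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example (hypothetical numbers): two identical keys get equal weight,
# so the output is the mean of the two value vectors.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

Each output row is a weighted average of the value vectors, where the weights say how much that query "attends" to each key.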

Vision Transformer (ViT)

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was one of the first papers to apply Transformers to image recognition, showing that they can outperform ResNets and other Convolutional Neural Networks if trained on large enough datasets. It takes the architecture from the "Attention Is All You Need" paper and makes it work for computer vision tasks. Instead of text tokens, ViT takes 16x16 image patches as input.

https://arxiv.org/abs/2010.11929
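
To make the "patches as tokens" idea concrete, here is a minimal sketch that cuts a single-channel image into non-overlapping 16x16 patches and flattens each one into a vector. The helper name `image_to_patches` and the toy 32x32 image are assumptions for illustration; the actual model also applies a learned linear projection and adds position embeddings, which are omitted here.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W image (a list of rows) into flattened square patches.

    Patches are returned in row-major order: left to right, top to bottom.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size window into one vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 32x32 "image" whose pixel value encodes its own position.
image = [[row * 32 + col for col in range(32)] for row in range(32)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # 4 patches, each with 256 values
```

With the same patch size, a 224x224 image would yield 14 x 14 = 196 patches, which then play the role that word tokens play in a language Transformer.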

Latent Diffusion Models

"High-Resolution Image Synthesis with Latent Diffusion Models" is the technique behind many image generation models, such as Stable Diffusion. It shows how you can reformulate image generation as a sequence of denoising auto-encoders operating on a latent representation. These models use the U-Net architecture referenced above as the backbone of the generative process, and can generate photorealistic images given a text input.

https://arxiv.org/abs/2112.10752
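
The noising and denoising arithmetic behind diffusion can be sketched in a few lines. The example below assumes the standard DDPM-style forward process, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, with a linear beta schedule whose endpoints are illustrative values rather than those of any particular model. The three-element "latent" stands in for the output of an autoencoder, and the true noise stands in for what a trained network would predict.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are illustrative assumptions.
    """
    alpha_bars = []
    product = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(latent, noise, alpha_bar):
    """Forward process: blend the clean latent with Gaussian noise."""
    return [
        math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
        for z, e in zip(latent, noise)
    ]


def estimate_clean_latent(noisy, predicted_noise, alpha_bar):
    """Invert the forward process given a prediction of the noise."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for x, e in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
latent = [0.5, -1.0, 2.0]                # stand-in for an encoded image
noise = [rng.gauss(0.0, 1.0) for _ in latent]

noisy = add_noise(latent, noise, alpha_bars[500])
# A trained denoiser would predict the noise; here we cheat and reuse it.
recovered = estimate_clean_latent(noisy, noise, alpha_bars[500])
```

The point of working in a latent space is that these vectors are much smaller than raw pixels, so each denoising step is cheaper.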

CLIP

"Learning Transferable Visual Models From Natural Language Supervision", often referred to as Contrastive Language-Image Pre-training (CLIP), is a technique for embedding text data and image data in the same latent space. It helps connect the language-understanding half of generative models to the visual-understanding half by making sure that the cosine similarity between the text and image representations is high for matching text-image pairs.

https://arxiv.org/abs/2103.00020
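
Here is a minimal sketch of the cosine-similarity comparison described above. The three-dimensional embeddings, file names, and captions are made up for illustration; in a real system they would come from trained image and text encoders, and the vectors would be much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_vector, text_embeddings):
    """Return the caption whose embedding is most similar to the image."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(
            image_vector, text_embeddings[caption]
        ),
    )


# Hypothetical embeddings in a shared 3-dimensional latent space.
text_embeddings = {
    "a turtle swimming": [0.9, 0.1, 0.0],
    "a city at night": [0.0, 0.2, 0.9],
}
image_embeddings = {
    "turtle.jpg": [0.8, 0.2, 0.1],
    "city.jpg": [0.1, 0.1, 0.95],
}

for name, vector in image_embeddings.items():
    print(name, "->", best_caption(vector, text_embeddings))
```

Contrastive training pushes matching text-image pairs toward high similarity and mismatched pairs toward low similarity, which is what makes this kind of lookup meaningful.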

VQ-VAE

According to the technical report, they reduce the dimensionality of the raw video with a Vector Quantized Variational Autoencoder (VQ-VAE). VAEs have been shown to be a powerful unsupervised pre-training method for learning latent representations.

https://arxiv.org/abs/1711.00937
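
The "vector quantized" part of VQ-VAE can be sketched as a nearest-neighbour lookup: each continuous encoder output is replaced by the closest entry in a learned codebook. The codebook and encoder outputs below are toy values chosen for illustration, and the training procedure that learns the codebook is not shown. This is a sketch of the general mechanism, not a claim about how Sora's compression network works.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to vector."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Toy codebook with three 2-dimensional codes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Pretend these came out of the encoder for three positions.
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

indices = [quantize(v, codebook)[0] for v in encoder_outputs]
print(indices)  # [1, 2, 0]
```

After quantization, the input is represented by a short list of integer indices, which is a far more compact representation than the raw pixels.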

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The Sora technical report talks about how they take in videos of any aspect ratio, and how this allows them to train on a much larger set of data. The more data they can feed the model without having to crop it, the better the results they get. This paper uses the same technique for images, and Sora extends it to video.

https://arxiv.org/abs/2307.06304
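
One way to picture "patch and pack" is as a bin-packing problem: images of different shapes produce different numbers of patch tokens, and several images are packed into one fixed-length sequence instead of being cropped or resized to a common shape. The sketch below uses a greedy first-fit strategy, which is our assumption for illustration, and the function names and example resolutions are made up. A real implementation would also need to keep tokens from different images from attending to each other, which is not shown.

```python
def num_patches(height, width, patch_size=16):
    """Number of whole patches that fit in an image of the given size."""
    return (height // patch_size) * (width // patch_size)


def pack_examples(token_counts, max_len):
    """Greedy first-fit packing of examples into sequences of max_len tokens.

    Returns a list of sequences, each a list of example indices.
    """
    sequences = []   # example indices placed in each packed sequence
    free_space = []  # remaining token budget of each packed sequence
    for index, count in enumerate(token_counts):
        if count > max_len:
            raise ValueError("example does not fit in a single sequence")
        for slot, free in enumerate(free_space):
            if count <= free:
                sequences[slot].append(index)
                free_space[slot] -= count
                break
        else:
            # No existing sequence had room, so start a new one.
            sequences.append([index])
            free_space.append(max_len - count)
    return sequences


# Four images with different aspect ratios (hypothetical sizes).
resolutions = [(64, 64), (32, 128), (128, 128), (48, 32)]
token_counts = [num_patches(h, w) for h, w in resolutions]
packed = pack_examples(token_counts, max_len=64)
print(token_counts)  # [16, 16, 64, 6]
print(packed)        # [[0, 1, 3], [2]]
```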

Video Generation Papers

They reference several video generation papers that inspired Sora and that take the generative models above to the next level by applying them to video.

ViViT: A Video Vision Transformer

This paper goes into detail about how you can chop a video into the "spatio-temporal tokens" needed for video tasks. The paper focuses on video classification, but the same tokenization can be applied to generating video.

https://arxiv.org/abs/2103.15691
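
A minimal sketch of spatio-temporal tokenization: the video is treated as a T x H x W grid and cut into small non-overlapping blocks ("tubelets") that each span a few frames as well as a few pixels in height and width. The function name, the single-channel toy video, and the 2x2x2 tubelet size are assumptions for illustration; the learned projection that turns each block into an embedding is omitted.

```python
def video_to_tubelets(video, t_size, h_size, w_size):
    """Split a T x H x W video (nested lists) into flattened tubelets.

    Tubelets are ordered by time first, then top to bottom, left to right.
    """
    frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    if frames % t_size or height % h_size or width % w_size:
        raise ValueError("video dimensions must be divisible by tubelet size")

    tokens = []
    for t0 in range(0, frames, t_size):
        for y0 in range(0, height, h_size):
            for x0 in range(0, width, w_size):
                # Flatten one t_size x h_size x w_size block into a vector.
                token = [
                    video[t][y][x]
                    for t in range(t0, t0 + t_size)
                    for y in range(y0, y0 + h_size)
                    for x in range(x0, x0 + w_size)
                ]
                tokens.append(token)
    return tokens


# Toy 4-frame, 4x4 video whose pixel value encodes (frame, row, column).
video = [
    [[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
    for t in range(4)
]
tokens = video_to_tubelets(video, t_size=2, h_size=2, w_size=2)
print(len(tokens))  # 8 tokens
print(tokens[0])    # [0, 1, 10, 11, 100, 101, 110, 111]
```

Compared with the image patches used by ViT, the only change is the extra loop over time, which is the same idea the Sora report describes as spacetime patches.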

Imagen Video: High Definition Video Generation with Diffusion Models

Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. It uses convolutions in the temporal direction and super-resolution to generate high-quality videos from text.

https://arxiv.org/abs/2210.02303

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

This paper takes the latent diffusion models from the image generation papers above and introduces a temporal dimension to the latent space. They apply some interesting techniques in the temporal dimension by aligning the latent spaces, but the results do not quite have the temporal consistency of Sora yet.

https://arxiv.org/abs/2304.08818

Photorealistic video generation with diffusion models

They introduce W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. As far as I can tell, this feels like the closest technique to Sora in the reference list. It was released in December 2023 by teams at Google, Stanford, and Georgia Tech.

https://arxiv.org/abs/2312.06662

Vision-Language Understanding

To generate videos from text prompts, they need to collect a large dataset. It is not feasible to have humans label that many videos, so it seems they use synthetic data techniques similar to those described in the DALL·E 3 paper.

DALL·E 3

Training text-to-video generation systems requires a large number of videos with corresponding text captions. They apply the re-captioning technique introduced in DALL·E 3 to videos. Similar to DALL·E 3, they also leverage GPT to turn short user prompts into longer, detailed captions that are sent to the video model.

https://openai.com/dall-e-3

LLaVA

For the model to be able to follow user instructions, they likely did some instruction fine-tuning similar to the LLaVA paper. This paper also shows some synthetic data techniques for creating a large instruction dataset, which could be interesting in combination with the DALL·E methods above.

https://arxiv.org/abs/2304.08485

Make-A-Video & Tune-A-Video

Papers like Make-A-Video and Tune-A-Video have shown how prompt engineering leverages a model's natural language understanding to decode complex instructions and render them into cohesive, lively, high-quality video narratives. For example, you can take a simple user prompt and extend it with adjectives and verbs to more fully flesh out the scene.

https://arxiv.org/abs/2209.14792

https://arxiv.org/abs/2212.11565

Conclusion

We hope this gives you a jumping-off point for all the important components that could make up a system like Sora! If you think we missed anything, feel free to email us at hello@oxen.ai.

This is by no means a light set of reading. That is why on Fridays we take one paper at a time, slow down, and break down the topics in plain language so anyone can understand. We believe anyone can contribute to building AI systems, and the more you understand the fundamentals, the more patterns you will spot and the better products you will build.

https://www.oxen.ai/community

Join us on a learning journey, either by signing up for ArXiv Dives or by simply joining the Oxen.ai Discord community.

https://discord.com/invite/s3tBEn7Ptg

Original article

https://www.oxen.ai/blog/road-to-sora-reading-list

