[Translation] A Visual Guide to Vision Transformers

(discuss.pytorch.kr)

13 points by ninebow, 2024-04-22 | 1 comment

ℹ️ After reading the illustrated guide to Visual Transformers that xguru recommended, I translated A Visual Guide to Vision Transformers, an article on the Vision Transformer (ViT) written by Dennis Turp, a data scientist and software engineer. The translation is published with the author's permission.
Vision Transformer (ViT) is a model that applies the Transformer to computer vision (CV) and performs very well on tasks such as object detection and image classification. In particular, it is often used as a visual encoder that extracts features from images.
Because the explanations in the original are fairly brief and can be hard to follow in places, I have added a few annotations to make them easier to understand.

A Visual Guide to Vision Transformers (ViT)

This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art (SotA) performance on image classification tasks. Vision Transformers apply the Transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to show how these models work and how data flows through them. (Translator's note: the scroll-story format is hard to reproduce here, so screenshots were used instead. Reading the original alongside this post is recommended.)

0. Let's start with the data

Like conventional convolutional neural networks (CNNs), Vision Transformers are trained in a supervised manner. This means the model is trained on a dataset of images and their corresponding labels.

1. Focus on one data point

To get a better understanding of what happens inside a Vision Transformer, let's focus on a single data point (a batch size of 1) and ask the question: how is this data point prepared so that a Transformer can consume it?

2. Forget the label for the moment

The label will become more relevant later. For now, the only thing we are left with is a single image.

3. Create patches of the image

To prepare the image for use inside the Transformer, we divide the whole image into equally sized patches of size p x p.
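The patch split can be sketched with a NumPy reshape. This is only an illustration: the helper name `split_into_patches`, the 8 x 8 toy image, and the requirement that height and width are divisible by p are assumptions of this example, not details from the guide.

```python
import numpy as np

def split_into_patches(image: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches.

    Returns an array of shape (n, p, p, C) with n = (H // p) * (W // p).
    Assumes H and W are divisible by p.
    """
    h, w, c = image.shape
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by p")
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    # Collapse the patch grid into a single patch axis (row-major order).
    return grid.reshape(-1, p, p, c)

# Toy 8 x 8 RGB image, split into four 4 x 4 patches.
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
patches = split_into_patches(image, p=4)
print(patches.shape)
```

Patches are ordered left to right, then top to bottom, so `patches[1]` is the top-right quarter of the toy image.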

4. Flattening the image patches

Each patch is now flattened into a vector of dimension p' = p² * c, where p is the side length of the patch and c is the number of channels. (Translator's note: for an RGB image, for example, the number of channels is 3.)
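As a worked example of the formula, the sketch below assumes a 224 x 224 RGB image and p = 16. These sizes are chosen for illustration and do not come from the guide.

```python
import numpy as np

# Assumed example sizes: 224 x 224 RGB image, 16 x 16 patches.
height = width = 224
p, c = 16, 3

n_patches = (height // p) * (width // p)   # number of patches n
p_flat = p * p * c                         # flattened length p' = p^2 * c

# Stand-in for the patches produced in the previous step.
patches = np.zeros((n_patches, p, p, c), dtype=np.float32)

# Flatten every patch into a single vector of length p'.
flat_patches = patches.reshape(n_patches, p_flat)
print(n_patches, p_flat, flat_patches.shape)
```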

5. Creating patch embeddings

The flattened image patch vectors are now encoded using a linear transformation. The resulting patch embedding vector has a fixed size d.

6. Embedding all patches

Now that every image patch has been embedded into a vector of fixed size, we are left with an array of size n x d, where n is the number of image patches and d is the size of the patch embedding.
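A minimal sketch of this linear embedding follows. The sizes, the random initialisation, and the names `W_embed` and `b_embed` are hypothetical. In a real model these weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_flat, d = 4, 48, 8            # illustrative sizes only

# Stand-in for the n flattened patches of length p'.
flat_patches = rng.normal(size=(n, p_flat))

# Learnable parameters of the linear map (randomly initialised here).
W_embed = rng.normal(scale=0.02, size=(p_flat, d))
b_embed = np.zeros(d)

# One matrix product embeds all n patches at once: (n, p') @ (p', d) -> (n, d).
patch_embeddings = flat_patches @ W_embed + b_embed
print(patch_embeddings.shape)
```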

7. Appending a classification token

To train the model effectively, we extend the array of patch embeddings by one additional vector called the classification token (CLS token). This vector is a learnable parameter of the network and is randomly initialized. Note that there is only one CLS token, and the same vector is added to every data point. (Translator's note: at this point, adding the CLS token to the n patch embeddings gives n+1 vectors, each of size d, so the array has size (n+1) x d.)
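The following sketch shows the array growing from n x d to (n+1) x d. Placing the token at index 0 is an assumption of this example (many implementations do so). The guide itself only says that the array is extended by one vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                               # illustrative sizes only
patch_embeddings = rng.normal(size=(n, d))

# A single learnable vector, shared by every image in the dataset.
cls_token = rng.normal(size=(1, d))

# Assumption: the token goes in front of the patch embeddings (index 0).
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)
print(tokens.shape)
```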

8. Adding positional embedding vectors

So far our patch embeddings carry no positional information. We remedy that by adding a learnable, randomly initialized positional embedding vector to every patch embedding. We also add such a positional embedding vector to the classification token. (Translator's note: in a Transformer the positional encoding is added element-wise, so the size of the vectors does not change.)
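Because the positional embeddings are added rather than concatenated, the shape stays (n+1) x d. A small sketch, with hypothetical names and sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                   # illustrative sizes only

# Stand-in for the CLS token plus n patch embeddings.
tokens = rng.normal(size=(n + 1, d))

# One learnable position vector per token, including the CLS token.
pos_embedding = rng.normal(scale=0.02, size=(n + 1, d))

# Element-wise addition keeps the shape unchanged.
tokens_with_pos = tokens + pos_embedding
print(tokens_with_pos.shape)
```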

9. Transformer input

After the positional embedding vectors have been added, we are left with an array of size (n+1) x d. This is the input to the Transformer, which is explained in greater detail in the next steps.

10.1. Transformer: QKV creation

The Transformer's input embedding vectors are linearly mapped to larger vectors. Each of these new vectors is then split into three equally sized parts: Q, the query vector; K, the key vector; and V, the value vector. We end up with (n+1) of each of these vectors.
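One way to picture this step is a single matrix product followed by a three-way split. The head size `d_head` and the weight name `W_qkv` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, d_head = 5, 8, 4          # n_tokens = n + 1, illustrative sizes

x = rng.normal(size=(n_tokens, d))     # Transformer input

# One linear map produces a larger vector per token ...
W_qkv = rng.normal(scale=0.02, size=(d, 3 * d_head))
qkv = x @ W_qkv                        # shape (n+1, 3 * d_head)

# ... which is split into three equally sized parts.
Q, K, V = np.split(qkv, 3, axis=-1)
print(Q.shape, K.shape, V.shape)
```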

10.2. Transformer: Attention score calculation

To calculate the attention scores A, we multiply every query vector Q with every key vector K (a dot product for each query-key pair).

10.3. Transformer: Attention score matrix

Now that we have the attention score matrix A, we apply a softmax function to every row so that each row sums to 1.
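These two steps can be sketched as a matrix product followed by a row-wise softmax. The division by the square root of the head size comes from the standard scaled dot-product attention formulation and is not mentioned in the guide, so it is an assumption here, as are the sizes and names.

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax applied to each row."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_head = 5, 4                 # illustrative sizes only
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))

# Dot product of every query with every key: (n+1, n+1) raw scores.
scores = Q @ K.T / np.sqrt(d_head)      # scaling is an assumption, see above

# After the softmax, every row is a set of weights that sums to 1.
A = softmax_rows(scores)
print(A.shape, A.sum(axis=-1))
```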

10.4. Transformer: Aggregated contextual information calculation

To calculate the aggregated contextual information for the first patch embedding vector, we focus on the first row of the attention matrix and use its entries as weights for the value vectors V. The weighted sum is the aggregated contextual information vector for the first image patch embedding.

10.5. Transformer: Aggregated contextual information for every patch

We now repeat this process for every row of the attention score matrix. The result is n+1 aggregated contextual information vectors: one for every patch (n in total) plus one for the classification token. This step concludes our first attention head.
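The weighted sum for the first row, and its repetition over all rows, can be written out as follows. The random attention matrix is only a stand-in for the softmax output of the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head = 5, 4                         # illustrative sizes only

# Stand-in attention matrix whose rows sum to 1.
A = rng.random((n_tokens, n_tokens))
A = A / A.sum(axis=-1, keepdims=True)
V = rng.normal(size=(n_tokens, d_head))

# First row only: weight every value vector by the matching attention entry.
first_context = sum(A[0, j] * V[j] for j in range(n_tokens))

# All rows at once: the same weighted sums expressed as one matrix product.
context = A @ V                                 # shape (n+1, d_head)
print(np.allclose(context[0], first_context))
```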

10.6. Transformer: Multi-head attention

Because we are dealing with multi-head attention, we repeat the entire process from step 10.1 to step 10.5 with a different QKV mapping. For this explanatory setup we assume 2 heads, but a ViT typically has many more. In the end this results in multiple sets of aggregated contextual information vectors.

10.7. Transformer: Last attention layer step

The outputs of these heads are stacked together and mapped to vectors of size d, the same size as our patch embeddings.
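A compact sketch of two heads being stacked and projected back to size d. The helper `attention_head`, the weight names, and the scaling inside the head are assumptions of this example rather than the guide's own code.

```python
import numpy as np

def attention_head(x: np.ndarray, W_qkv: np.ndarray) -> np.ndarray:
    """One attention head: QKV creation, row-wise softmax, weighted sum."""
    Q, K, V = np.split(x @ W_qkv, 3, axis=-1)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n_tokens, d, d_head, n_heads = 5, 8, 4, 2       # illustrative sizes only
x = rng.normal(size=(n_tokens, d))

# Each head has its own QKV mapping.
head_weights = [rng.normal(scale=0.02, size=(d, 3 * d_head)) for _ in range(n_heads)]
heads = [attention_head(x, W) for W in head_weights]

# Stack the heads side by side, then map back to size d.
stacked = np.concatenate(heads, axis=-1)        # (n+1, n_heads * d_head)
W_out = rng.normal(scale=0.02, size=(n_heads * d_head, d))
attention_output = stacked @ W_out              # (n+1, d)
print(attention_output.shape)
```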

10.8. Transformer: Attention layer result

The previous step concludes the attention layer. We are left with the same number of embeddings, each of exactly the same size, as we used as input.

10.9. Transformer: Residual connections

Transformers make heavy use of residual connections, which simply means adding a layer's input to that layer's output. We do the same here.

10.10. Transformer: Residual connection result

The addition results in vectors of the same size, because both operands are vectors of size d.
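A residual connection needs very little code. In this sketch the helper `with_residual` and the toy sublayer are hypothetical. Any sublayer works as long as it preserves the input shape.

```python
import numpy as np

def with_residual(sublayer, x: np.ndarray) -> np.ndarray:
    """Return x + sublayer(x); the sublayer must preserve the shape of x."""
    return x + sublayer(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # (n+1, d), illustrative sizes

# Toy stand-in for the attention layer: any map from (n+1, d) to (n+1, d).
W = rng.normal(scale=0.02, size=(8, 8))
residual_output = with_residual(lambda t: t @ W, x)
print(residual_output.shape)
```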

10.11. Transformer: Feed-forward network

These outputs are now fed through a feed-forward neural network with non-linear activation functions.
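A possible shape for such a feed-forward network is two linear maps with a non-linearity in between, applied to each token independently. The GELU activation and the hidden size of 4 * d are common choices assumed here; the guide does not specify them.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Expand to the hidden size, apply the non-linearity, project back to d."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d, d_hidden = 5, 8, 32                # d_hidden = 4 * d is an assumption
x = rng.normal(size=(n_tokens, d))

W1 = rng.normal(scale=0.02, size=(d, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.02, size=(d_hidden, d))
b2 = np.zeros(d)

ffn_output = feed_forward(x, W1, b1, W2, b2)    # same shape as x
print(ffn_output.shape)
```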

10.12. Transformer: Final result

After the feed-forward step there is another residual connection, which we skip here for brevity. With that, the Transformer layer is complete. In the end, the Transformer layer produces outputs of the same size as its input.

11. Repeat Transformers

The entire Transformer calculation, steps 10.1 to 10.12, is repeated several times, for example 6 times.

12. Identify the classification token output

The last step is to identify the output of the classification token (CLS token). This vector is used in the final step of our Vision Transformer journey.

13. Final step: Predicting classification probabilities

In the final step, we pass this classification output token through another fully connected neural network to predict the classification probabilities of our input image.
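The sketch below reduces the classification head to a single linear layer followed by a softmax. Reading the CLS output at index 0, the three classes, and the weight names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, num_classes = 5, 8, 3              # illustrative sizes only

# Stand-in for the output of the last Transformer layer.
transformer_output = rng.normal(size=(n_tokens, d))

# Assumption: the CLS token sits at index 0.
cls_output = transformer_output[0]

# Fully connected layer mapping size d to one score (logit) per class.
W_head = rng.normal(scale=0.02, size=(d, num_classes))
b_head = np.zeros(num_classes)
logits = cls_output @ W_head + b_head

# Softmax turns the logits into class probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(probs.shape, probs.sum())
```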

14. Training of the Vision Transformer

We train the Vision Transformer using a standard cross-entropy loss function, which compares the predicted class probabilities with the true class labels. The model is trained using backpropagation and gradient descent, updating the model parameters to minimize the loss function.
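To make the loss concrete, here is a toy example with three classes. For illustration only, the gradient step is applied directly to the logits. In real training the gradient is backpropagated to every model parameter, and all numbers below are made up.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -np.log(probs[label])

logits = np.array([2.0, 1.0, 0.0])    # made-up model outputs
label = 2                             # made-up true class

probs = softmax(logits)
loss = cross_entropy(probs, label)

# Gradient of the cross-entropy loss with respect to the logits.
one_hot = np.eye(len(logits))[label]
grad_logits = probs - one_hot

# One gradient descent step (on the logits, as a simplification).
learning_rate = 0.5
new_logits = logits - learning_rate * grad_logits
new_loss = cross_entropy(softmax(new_logits), label)
print(loss, new_loss)
```

The step raises the probability of the true class, so the loss after the update is lower than before.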

Conclusion

In this visual guide, we have walked through the key components of Vision Transformers, from data preparation to training the model. We hope this guide has helped you understand how Vision Transformers work and how they can be used to classify images.

I prepared a small Colab Notebook to help you understand the Vision Transformer even better. Please look for the 'Blogpost' comments in it. The code was taken from @lucidrains' great ViT PyTorch implementation, so be sure to check out his work.

If you have any questions or feedback, please feel free to reach out to me. Thank you for reading! (The author can be found on GitHub, X (Twitter), Threads, and LinkedIn.)

Acknowledgements

ViT PyTorch implementation by @lucidrains
All images have been taken from Wikipedia and are licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Further reading

Papers, code, and other resources on the Vision Transformer collected by PapersWithCode:

https://paperswithcode.com/method/vision-transformer


1 comment

gcback 2024-04-22

Thank you for the effort you put into sharing such useful information.