Transformer
Sequence to Sequence (Seq2Seq)
- Transformer is a type of Seq2Seq model.
Application
- Seq2Seq for multi-label classification: an object can belong to multiple classes.
- Seq2Seq for Syntactic Parsing
- Deep Learning for Human Language Processing
- Seq2Seq for Object Detection
Architecture
Encoder
- RNN, CNN, and Self-Attention are all viable choices for model encoders.
- Input: sequence of vectors
- Output: sequence of vectors
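As a rough illustration of the "sequence of vectors in, sequence of vectors out" idea, here is a minimal single-head self-attention sketch; all weights, shapes, and names are made up for the example and are not from the lecture.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) -> (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project inputs to queries / keys / values
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise attention scores
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                       # weighted sum of values per position

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8]) — same sequence length as the input
```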
Transformer Encoder Architecture
- Simplified Transformer Encoder Architecture
Residual
Problem: Vanishing gradients in deep networks. Solution: Add skip connections to allow gradients to flow through the network.
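A minimal sketch of a residual (skip) connection around a sub-layer, roughly as it appears in a Transformer encoder block; the module and names here are illustrative, not the exact architecture from the lecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # Stand-in sub-layer; in the real encoder this is self-attention or a feed-forward network.
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Skip connection: the input is added back to the sub-layer output,
        # so gradients always have an identity path to flow through.
        return self.norm(x + self.sublayer(x))

block = ResidualBlock(d_model=8)
out = block(torch.randn(4, 8))  # same shape in and out: (seq_len, d_model)
```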
Decoder
Decoders can be broadly categorized into two types: Autoregressive and Non-Autoregressive.
- Autoregressive (AT) decoder: generates the output one token at a time, feeding each generated token back in as input for the next step.
- Non-Autoregressive (NAT) decoder: generates all output tokens at once, in a single pass.
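A toy sketch contrasting the two decoding styles; the decoder here is a random stand-in, and the BOS/EOS token ids are assumptions for illustration.

```python
import torch

BOS, EOS, VOCAB = 0, 1, 100

def toy_decoder(prefix):
    """Stand-in for a real decoder: returns logits over the vocabulary for the next token."""
    return torch.randn(VOCAB)

# Autoregressive: generate one token at a time, feeding each output back as input.
tokens = [BOS]
for _ in range(10):
    next_token = int(torch.argmax(toy_decoder(tokens)))
    tokens.append(next_token)
    if next_token == EOS:
        break

# Non-autoregressive: predict every output position in parallel.
target_len = 10
logits = torch.randn(target_len, VOCAB)           # one distribution per output position
nat_tokens = torch.argmax(logits, dim=-1).tolist()
```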
"Masked" Self-Attention
Each position only attends to the positions before it, because the later tokens have not been generated yet.
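A minimal sketch of the causal mask that implements this; the shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 16
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

scores = q @ k.T / (d ** 0.5)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # lower triangle: allowed pairs
scores = scores.masked_fill(~causal_mask, float("-inf"))       # block attention to future tokens
weights = F.softmax(scores, dim=-1)                            # future positions get weight 0
out = weights @ v
```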
Encoder-Decoder Connection
Cross-Attention
info
To be completed.
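Since this part is still to be filled in, here is only an assumed sketch of how cross-attention is usually wired: queries come from the decoder, while keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(1, 12, d_model)  # (batch, source length, d_model)
decoder_in = torch.randn(1, 7, d_model)    # (batch, target length so far, d_model)

out, attn_weights = cross_attn(query=decoder_in, key=encoder_out, value=encoder_out)
print(out.shape)           # torch.Size([1, 7, 64]) — one vector per decoder position
print(attn_weights.shape)  # torch.Size([1, 7, 12]) — decoder positions attending over the source
```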
Training Tips
Teacher Forcing
- Label the target first, then compute the cross-entropy loss between the ground-truth one-hot vector and the predicted probability distribution.
- During training, the decoder is directly fed the correct answer (the ground truth) as input, rather than its own previous output.
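A hedged sketch of teacher forcing; the decoder below is a toy stand-in and only the data flow matters: the decoder input is the ground truth shifted right, and cross-entropy is computed against the next ground-truth token.

```python
import torch
import torch.nn as nn

vocab, d_model, tgt_len = 100, 32, 6
decoder = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))  # toy decoder

target = torch.randint(0, vocab, (1, tgt_len))  # ground-truth output sequence
decoder_input = target[:, :-1]                  # feed the correct previous tokens ...
logits = decoder(decoder_input)                 # ... instead of the model's own outputs
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab),                  # (batch * length, vocab)
    target[:, 1:].reshape(-1),                  # next-token targets
)
loss.backward()
```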
Copy Mechanism
- Pointer Network
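A rough sketch of the copy idea behind pointer-style networks; the gate p_gen and all tensors here are assumptions for illustration. The output distribution mixes normal generation with copying tokens directly from the input.

```python
import torch
import torch.nn.functional as F

vocab, src_len = 100, 8
gen_logits = torch.randn(vocab)                  # decoder's usual vocabulary logits
copy_attn = F.softmax(torch.randn(src_len), 0)   # attention over the source tokens
src_ids = torch.randint(0, vocab, (src_len,))    # vocab id of each source token
p_gen = torch.sigmoid(torch.randn(()))           # gate: generate vs. copy

final = p_gen * F.softmax(gen_logits, 0)
final = final.scatter_add(0, src_ids, (1 - p_gen) * copy_attn)  # add copy mass to source ids
```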
Guided Attention
- Monotonic Attention, Location-aware Attention
- This can be important for speech recognition and TTS.
- In some tasks, input and output are monotonically aligned.
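One common way to encode this prior is a guided-attention-style loss that penalizes attention mass far from the diagonal; the formula below is one variant used in TTS work and is included here only as a sketch.

```python
import torch

T_enc, T_dec, g = 8, 10, 0.2
n = torch.arange(T_enc).float().unsqueeze(1) / T_enc      # encoder position ratio
t = torch.arange(T_dec).float().unsqueeze(0) / T_dec      # decoder position ratio
penalty = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g * g))  # small near the diagonal

attn = torch.softmax(torch.randn(T_enc, T_dec), dim=0)    # toy attention weights
guided_attention_loss = (attn * penalty).mean()           # pushes attention toward the diagonal
```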