Deep Thinking
What is Deep Thinking?
"When depth is not enough, make up for it with length."
Question → LLM → <think> ... </think> → Answer
- Inside <think>: verification, exploration, planning
Example: AlphaGo
- AlphaGo's thinking process is MCTS (Monte Carlo Tree Search)
Test Time Scaling
- The more thinking, the better the result!
- Paper: Scaling Scaling Laws with Board Games
Build Reasoning LLM Method
These methods can be mixed and matched
Chain of Thought (CoT)
- Don't need to change the model
- Few-shot CoT
- Give the LLM examples (with worked-out reasoning) and ask it to answer
- Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Zero-shot CoT
- Simply prompt the LLM with "Let's think step by step" before it answers
- Paper: Large Language Models are Zero-Shot Reasoners
- Long CoT
- Paper: Towards Reasoning Era
- Supervised CoT
- Guide the LLM to answer with better, more specific prompts
- Paper: Supervised Chain of Thought
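The zero-shot CoT trick above can be sketched as a two-stage prompt, following the wording in the Zero-Shot Reasoners paper (the helper names here are my own, and no actual API call is made):

```python
# Zero-shot CoT as two stages: stage 1 elicits reasoning with the trigger
# phrase, stage 2 extracts the final answer from the generated reasoning.
def reasoning_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

def answer_extraction_prompt(question: str, reasoning: str) -> str:
    return (f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
            f"Therefore, the answer is")

print(reasoning_prompt("A juggler has 16 balls; half are golf balls. How many golf balls?"))
```

Each prompt would be sent to the LLM in turn; the second stage feeds the model's own reasoning back to it.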
Give the Model a Reasoning Workflow
- Don't need to change the model
How to explore?
Ask the LLM the same question many times; sampling will produce different answers
- Paper: Large Language Monkeys
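A bit of arithmetic shows why repeated sampling works: if one sample solves the problem with probability p, and samples are assumed independent (a simplification), then coverage with k samples grows as pass@k = 1 − (1 − p)^k:

```python
# Repeated-sampling coverage under an independence assumption: even with a
# 1% per-sample solve rate, enough samples make a solution very likely.
def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

for k in (1, 10, 100, 1000):
    print(k, round(pass_at_k(0.01, k), 3))
```

The remaining problem is picking the right answer out of the k samples, which is what the next section is about.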
How to choose the right answer?
- Majority Vote (Self-consistency)
- Confidence (used in CoT decoding)
- Add verification (score candidate answers with a verifier)
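Majority voting over sampled answers is a few lines of code. This is a minimal sketch: `vote_confidence` is just the agreement fraction, a rough proxy for the token-probability-based confidence actually used in CoT decoding:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over the final answers of multiple sampled CoT paths."""
    return Counter(answers).most_common(1)[0][0]

def vote_confidence(answers: list[str]) -> float:
    """Fraction of samples agreeing with the majority answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

samples = ["18", "18", "24", "18", "12"]
print(self_consistency(samples))   # "18"
print(vote_confidence(samples))    # 0.6
```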
Parallel vs. Sequential vs. Parallel + Sequential
How to verify the step?
- Train a process verifier to predict the correctness of each reasoning step
- Paper: Let's Verify Step by Step
- Paper: Math-Shepherd
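Once a process verifier scores each step, per-step scores must be aggregated into a solution-level score. A minimal sketch (the step scores below are made up; minimum and product are both used in the process-reward literature):

```python
from math import prod

def solution_score(step_scores: list[float], how: str = "min") -> float:
    """Aggregate per-step correctness probabilities from a process verifier
    into one score for the whole solution."""
    if how == "min":
        return min(step_scores)      # a chain is only as strong as its weakest step
    return prod(step_scores)         # "prod": probability all steps are correct

steps = [0.95, 0.90, 0.40, 0.99]    # the third step looks suspicious
print(solution_score(steps, "min"))  # 0.4
print(solution_score(steps, "prod"))
```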
- Beam Search
- Paper: Self-Evaluation Guided Beam Search for Reasoning
- Paper: Deductive Beam Search
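Verifier-guided beam search keeps only the top-B partial reasoning chains at each depth. A sketch with hypothetical stand-ins: `propose_steps` plays the role of the LLM proposing next steps, `score_chain` the role of the verifier:

```python
# Beam search over reasoning steps: expand every surviving chain, score the
# candidates with the verifier, and keep the best `beam_width` chains.
def beam_search(question, propose_steps, score_chain, beam_width=2, depth=3):
    beams = [[]]                                  # each beam is a list of steps
    for _ in range(depth):
        candidates = [chain + [step]
                      for chain in beams
                      for step in propose_steps(question, chain)]
        candidates.sort(key=lambda c: score_chain(question, c), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy usage: two candidate steps per level; the "verifier" prefers step "a".
def propose(question, chain): return ["a", "b"]
def score(question, chain): return chain.count("a")
print(beam_search("demo", propose, score))   # ['a', 'a', 'a']
```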
- Monte Carlo Tree Search
- Paper: Monte Carlo Tree Search Boosts Reasoning
- ReST-MCTS
- Paper: ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- Paper: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
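The heart of MCTS is the UCT selection rule, which decides which reasoning branch to expand next by trading off average reward against visit counts. A sketch of just that rule (node bookkeeping and rollouts omitted):

```python
from math import log, sqrt

def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = 1.41) -> float:
    """UCB1: exploitation (mean reward) + exploration (visit-count bonus)."""
    if visits == 0:
        return float("inf")          # always try unvisited children first
    return total_reward / visits + c * sqrt(log(parent_visits) / visits)

def select_child(children: list[tuple[float, int]]) -> int:
    """children: list of (total_reward, visits); returns index to expand."""
    n = sum(v for _, v in children)
    return max(range(len(children)), key=lambda i: uct_score(*children[i], n))
```

In reasoning-LLM settings the "reward" at a leaf typically comes from a verifier or from whether the final answer checks out.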
Note
Imitation learning and reinforcement learning here are special cases of post-training
Imitation Learning
- Need to change the model
- Teach the model to reason
Where does reasoning-process data come from?
Use an LLM to generate reasoning-process data: when the final answer is correct, treat the accompanying reasoning process as correct too, and use it for training
- Paper: rStar-Math
- Stream of Search (SoS)
- Erroneous reasoning steps can also be included in the training data (e.g. search traces with mistakes and backtracking)
- Paper: Stream of Search (SoS)
- Paper: O1 Replication Journey
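The "keep traces whose answer is correct" recipe above is rejection sampling. A minimal sketch, where `sample_trace` is a hypothetical LLM call returning a (reasoning, answer) pair:

```python
# Rejection sampling for imitation-learning data: sample several reasoning
# traces per question and keep only those whose final answer matches the
# reference answer.
def build_dataset(problems, sample_trace, n_samples=8):
    data = []
    for question, gold_answer in problems:
        for _ in range(n_samples):
            reasoning, answer = sample_trace(question)
            if answer == gold_answer:        # correct answer → trust the trace
                data.append((question, reasoning))
    return data
```

The surviving (question, reasoning) pairs become supervised fine-tuning data; the risk is that a trace can reach the right answer through wrong reasoning.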
Reinforcement Learning
- Need to change the model
- Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
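DeepSeek-R1's RL stage uses rule-based rewards (format + answer accuracy) rather than a learned reward model. A simplified sketch of that idea; the answer extraction and the reward values here are illustrative, not the paper's exact scheme:

```python
import re

def reasoning_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward: the response must wrap its reasoning in <think>
    tags (format reward) and end with the correct answer (accuracy reward)."""
    m = re.fullmatch(r"<think>.*</think>\s*(.+)", response, re.DOTALL)
    if not m:
        return 0.0                   # wrong format: no reward at all
    answer = m.group(1).strip()
    return 1.0 if answer == gold_answer else 0.1   # small format-only reward
```

Because the reward is computable from rules alone, the model can be trained at scale on problems with checkable answers (math, code) without a reward model to hack.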