Deep Thinking

What is Deep Thinking?

When depth is not enough, make up for it with length.

Question -----> LLM -----> thinking process + answer

<think> ....... </think> -> Verification, Exploration, Planning

Example: AlphaGo

AlphaGo's thinking process uses MCTS (Monte Carlo Tree Search).

Test Time Scaling

The more the model thinks at test time, the better the results tend to be!
Paper: Scaling Scaling Laws with Board Games

Methods for Building a Reasoning LLM

These methods can be combined.

Chain of Thought (CoT)

No changes to the model are needed

Few-shot CoT
- Give the LLM worked examples, then ask it to answer
- Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Zero-shot CoT
- Simply tell the LLM "Let's think step by step" before it answers the question
- Paper: Large Language Models are Zero-Shot Reasoners
Long CoT
- Paper: Towards Reasoning Era
Supervised CoT
- Use a better prompt to guide the LLM in answering the question
- Paper: Supervised Chain of Thought
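
As a rough illustration of the few-shot and zero-shot variants above, the sketch below only builds prompt strings. The helper names and the "Q:/A:" layout are assumptions made for this example, and no real LLM API is called.

```python
# Hypothetical helpers: they only format text, they do not call any model.

def zero_shot_cot_prompt(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to the question."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(examples, question: str) -> str:
    """Build a few-shot CoT prompt from (question, reasoning, answer) examples."""
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        # Each demonstration shows the reasoning before the final answer.
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    # The new question is left open so the model continues in the same style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


examples = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. 3 + 2 = 5.",
        "5",
    ),
]
zero_shot = zero_shot_cot_prompt("What is 17 + 25?")
few_shot = few_shot_cot_prompt(examples, "What is 17 + 25?")
print(zero_shot)
print(few_shot)
```

Either string would then be sent to the unchanged model; only the prompt differs.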

Give the Model a Reasoning Workflow

No changes to the model are needed

How to explore?

Ask the LLM the same question many times, and it will give different answers.

Paper: Large Language Monkeys
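
To see why repeated sampling helps, here is a back-of-the-envelope sketch. It assumes every sample is independent and correct with the same probability p, which is a simplification made for this example and not a claim about any particular model.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct.

    Assumes each sample is correct with the same probability p (a simplification).
    """
    return 1.0 - (1.0 - p) ** k


# Even a low per-sample success rate adds up when many samples are drawn.
for k in (1, 10, 100):
    print(k, round(coverage(0.05, k), 3))
```

Coverage only says that a correct answer is somewhere among the samples; picking it out is a separate problem.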

How to choose the right answer?

Majority Vote (Self-consistency)
- Paper: Self-Consistency Improves Chain of Thought Reasoning in Language Models
Confidence (used in CoT decoding)
- Paper: Chain-of-Thought Reasoning Without Prompting
Add Verification
- Paper: Training Verifiers to Solve Math Word Problems
- Paper: Training Verifiers to Solve Math Word Problems
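
A minimal sketch of two of these selection strategies, majority vote and picking the answer a verifier scores highest. The sampled answers and the verifier scores are made-up toy values, and `toy_verifier` is a hypothetical stand-in for a trained verifier.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, score):
    """Return the answer that the scoring function rates highest."""
    return max(answers, key=score)


# Toy data: final answers extracted from five sampled responses.
sampled_answers = ["42", "41", "42", "42", "40"]

# Hypothetical verifier scores; a real verifier would be a trained model.
toy_scores = {"42": 0.8, "41": 0.3, "40": 0.1}


def toy_verifier(answer):
    return toy_scores[answer]


print(majority_vote(sampled_answers))
print(best_of_n(sampled_answers, toy_verifier))
```

Majority vote needs no extra model, while the verifier route can pick a minority answer if the verifier rates it highest.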

Parallel vs. Sequential vs. Parallel + Sequential

How to verify the step?

Create a process verifier that predicts the correctness of each step
- Paper: Let's Verify Step by Step
- Paper: Math-Shepherd
Beam Search
- Paper: Self-Evaluation Guided Beam Search for Reasoning
- Paper: Deductive Beam Search
Monte Carlo Tree Search
- Paper: Monte Carlo Tree Search Boosts Reasoning
ReST-MCTS
- Paper: ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- Paper: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
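
The idea of scoring every step and keeping only the best partial solutions can be sketched as a generic beam search. This is a toy under stated assumptions, not the method of any paper listed here: a small number puzzle stands in for the LLM proposing next steps, and `score_path` stands in for a process verifier.

```python
def beam_search(start, propose_steps, score_path, is_finished,
                beam_width=2, max_depth=5):
    """Keep the beam_width highest-scoring partial paths at every depth."""
    beams = [[start]]
    for _ in range(max_depth):
        candidates = []
        for path in beams:
            if is_finished(path):
                candidates.append(path)  # finished paths are carried over unchanged
                continue
            for step in propose_steps(path):
                candidates.append(path + [step])
        # The verifier score decides which partial paths survive.
        beams = sorted(candidates, key=score_path, reverse=True)[:beam_width]
        if all(is_finished(p) for p in beams):
            break
    return beams[0]


# Toy problem (an assumption for illustration): reach 14 from 1 using "+3" or "*2".
TARGET = 14


def propose_steps(path):
    last = path[-1]
    return [last + 3, last * 2]


def score_path(path):
    # Stand-in for a process verifier: closer to the target scores higher.
    return -abs(TARGET - path[-1])


def is_finished(path):
    return path[-1] == TARGET


best_path = beam_search(1, propose_steps, score_path, is_finished)
print(best_path)
```

With an LLM, `propose_steps` would sample candidate next reasoning steps and `score_path` would call the process verifier on the partial solution.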

Note

Imitation Learning and Reinforcement Learning are special cases of post-training.

Imitation Learning

Requires changing the model
Teach the model to reason

Where does the reasoning process data come from?

Use an LLM to generate reasoning process data. When the final answer is correct, treat the reasoning process as correct too, and use it for training.

Paper: rStar-Math
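
The answer-based filtering idea can be sketched as follows. This is only a minimal illustration, not the full pipeline of any paper listed here. The sample records, the field names, and the `<think>` training format are assumptions for the example.

```python
def filter_correct_traces(samples, ground_truth):
    """Keep only traces whose final answer matches the known correct answer."""
    kept = []
    for sample in samples:
        expected = ground_truth[sample["question"]]
        if sample["answer"].strip() == expected.strip():
            # The reasoning is assumed correct because the answer is correct.
            kept.append({
                "prompt": sample["question"],
                "completion": f"<think>{sample['reasoning']}</think>{sample['answer']}",
            })
    return kept


# Toy LLM outputs (made up for illustration).
samples = [
    {"question": "What is 2 + 3?", "reasoning": "2 + 3 = 5.", "answer": "5"},
    {"question": "What is 2 + 3?", "reasoning": "2 + 3 = 6.", "answer": "6"},
]
ground_truth = {"What is 2 + 3?": "5"}

training_data = filter_correct_traces(samples, ground_truth)
print(training_data)
```

A trace can reach the right answer through flawed reasoning, so this filter is a heuristic and not a guarantee of correct steps.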
Stream of Search (SoS)
Incorrect reasoning paths can also be included in the training data
Paper: Stream of Search (SoS)
Paper: O1 Replication Journey

Reinforcement Learning

Requires changing the model
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
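
The DeepSeek-R1 paper describes rule-based rewards that check answer accuracy and output format. The sketch below is a loose illustration of that idea only: the regular expression, the reward values, and the exact-match answer check are assumptions for this example, not the paper's implementation.

```python
import re

# Expect the reasoning inside <think>...</think>, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy reward: 0.5 for following the format, plus 1.0 for a correct answer.

    The reward values are arbitrary choices for illustration.
    """
    match = THINK_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # no recognizable thinking block, so no reward at all
    format_reward = 0.5
    final_answer = match.group(2).strip()
    accuracy_reward = 1.0 if final_answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward


print(rule_based_reward("<think>2 + 2 = 4</think> 4", "4"))
print(rule_based_reward("<think>2 + 2 = 5</think> 5", "4"))
print(rule_based_reward("4", "4"))
```

During RL training, a reward like this would score each sampled response, and the policy would be updated to make high-reward responses more likely.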
