Training LLM to play chess using Deepseek GRPO reinforcement learning

Efficient NLP • March 1, 2025

@efficientnlp

About

My name is Bai Li. I'm a machine learning engineer and hold a PhD in natural language processing. Reach me at: Email: [email protected] LinkedIn: https://www.linkedin.com/in/libai/

Video Description

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

In this video, we see that although popular LLMs like GPT-4o, o1 Reasoning, and DeepSeek R1 show some understanding of chess, they often fail to play legal moves. To address this, we train our own reasoning-focused chess LLM using the Group Relative Policy Optimization (GRPO) method introduced in DeepSeek R1. We walk through how GRPO differs from traditional PPO (Proximal Policy Optimization) and fine-tune LLaMA 8B and Qwen 7B using the TRL (Transformer Reinforcement Learning) and Unsloth libraries. The results are surprising! Finally, we review some other chess-playing neural networks, such as DeepMind's Grandmaster-Level Chess Without Search and ChessGPT.

0:00 - Introduction
1:18 - Chess RL Strategy
3:51 - How well do the best LLMs understand chess?
6:41 - Picking a base model
8:31 - Unsloth and TRL libraries for RL with LLMs
9:38 - LoRA (Low Rank Adaptation)
10:55 - GSM8K reasoning example
12:06 - PPO (Proximal Policy Optimization)
14:12 - GRPO (Group Relative Policy Optimization)
17:15 - GRPO training results
18:11 - Analysis of results for LLaMA and Qwen
20:52 - Limitations of GRPO on small models
23:29 - Grandmaster-level chess without search
27:10 - ChessGPT and other LLMs that play chess
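
For readers who want a concrete picture of the "group relative" part of GRPO, here is a minimal sketch. It is not the code from the video. It assumes the commonly described formulation in which several completions are sampled for one prompt and each reward is normalized by the group's mean and standard deviation, which takes the place of the learned value model that PPO uses as a baseline. The function name, the 1.0/0.0 legal-move reward values, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the other completions in its group.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), where the
    group is all completions sampled for the same prompt. Real libraries may
    differ in details (for example, sample vs. population standard deviation).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one chess position:
# 1.0 if the proposed move was legal, 0.0 if it was illegal.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Completions that beat their group's average get a positive advantage and are reinforced, while those below average are discouraged. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.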