Training vision-language models (VLMs) for complex reasoning remains a challenging task, in particular due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but how to leverage them for VLM reasoning is still an open question. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models.
To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase in this scenario: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance.
We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art by a clear margin. Our ablations show the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
VOLD features a two-stage post-training pipeline designed to transfer reasoning capabilities from text-only teacher LLMs to student VLMs without requiring vision-based reasoning data. The pipeline consists of two sequential stages: Stage 1 performs supervised fine-tuning (SFT) to align the student's output distribution with the teacher's reasoning patterns, while Stage 2 applies a unified objective combining reinforcement learning and on-policy knowledge distillation to enhance reasoning capabilities.
Stage 1: SFT for Policy Alignment. The goal of Stage 1 is to reduce the initial policy divergence between the student VLM and teacher LLM, creating a foundation that enables the student to effectively follow the teacher's reasoning process during the on-policy phase. This initial alignment is crucial to ensure that the student model's output distribution sufficiently overlaps with the teacher's, allowing for meaningful guidance during subsequent training.
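As a rough illustration, the sketch below shows what such a cold-start alignment step can look like: plain next-token cross-entropy on teacher-generated reasoning traces, with prompt tokens masked out. The `student` interface, batch fields, and label-masking convention are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a Stage-1 SFT step (assumed interface, not the authors' code):
# the student VLM is supervised with next-token cross-entropy on teacher-generated
# reasoning traces so that its output distribution overlaps with the teacher's.
import torch.nn.functional as F

def sft_alignment_step(student, batch, optimizer):
    # batch["input_ids"] holds prompt + teacher trace; batch["labels"] copies the
    # input ids with prompt positions set to -100 so only the trace is supervised.
    outputs = student(input_ids=batch["input_ids"],
                      attention_mask=batch["attention_mask"])
    logits = outputs.logits[:, :-1, :]          # predict token t+1 from position t
    labels = batch["labels"][:, 1:]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1),
                           ignore_index=-100)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```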
Stage 2: Unified RL and On-Policy Distillation. Building on the aligned model from Stage 1, our core contribution is a unified objective that seamlessly combines reinforcement learning with teacher distillation. This combined signal enhances reasoning without requiring any vision-based reasoning data. The teacher provides token-level guidance on the student's own rollout prefixes, while the GRPO component drives the student towards high-reward solutions through trajectory-level binary rewards on verifiable text-only reasoning tasks. We also introduce Reward-Guided KL Masking to selectively apply distillation only to incorrect responses, allowing the model to freely explore novel correct paths without teacher interference.
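To make the combined objective concrete, the following is a minimal sketch of one loss computation under our reading of Stage 2: a GRPO-style group-normalized policy-gradient term plus a token-level reverse-KL distillation term on the student's own rollouts, with the KL applied only to rollouts that received zero reward (reward-guided KL masking). Tensor names and shapes, the exact masking rule, and `kl_coeff` are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch (not the authors' code): one update combining a GRPO-style
# policy-gradient term with on-policy distillation from a text-only teacher.
import torch
import torch.nn.functional as F

def vold_style_loss(student_logits,   # (G, T, V) student logits on G rollouts
                    teacher_logits,   # (G, T, V) teacher logits on the same rollout prefixes
                    token_ids,        # (G, T)    sampled tokens of each rollout (long)
                    token_mask,       # (G, T)    1 for generated tokens, 0 for padding
                    rewards,          # (G,)      binary trajectory rewards (verifiable tasks)
                    kl_coeff=0.1):
    log_probs = F.log_softmax(student_logits, dim=-1)
    tok_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)   # (G, T)

    # GRPO-style advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)              # (G,)
    pg_loss = -(adv.unsqueeze(-1) * tok_logp * token_mask).sum() / token_mask.sum()

    # On-policy distillation: token-level reverse KL(student || teacher)
    # evaluated on the student's own rollouts.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = (log_probs.exp() * (log_probs - teacher_logp)).sum(-1)  # (G, T)

    # Reward-guided KL masking (assumed form): distill only on incorrect rollouts,
    # leaving correct ones free to explore beyond the teacher.
    incorrect = (rewards == 0).float().unsqueeze(-1)                       # (G, 1)
    kl_loss = (kl_per_token * token_mask * incorrect).sum() / token_mask.sum().clamp(min=1)

    return pg_loss + kl_coeff * kl_loss
```

Since the teacher is a text-only LLM, it would score only the text portion of each rollout prefix; that detail is abstracted into `teacher_logits` here.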
VOLD achieves state-of-the-art performance despite training exclusively on text data, outperforming baselines that use images during fine-tuning.
Benchmarks are grouped into multimodal general tasks (MMMU-Pro Vision, MMStar), multimodal math (MathVision, MathVista, MathVerse, DynaMath, WeMath), and a visual IQ test (LogicVista).

| Model | Images in FT | MMMU-Pro (Vision) | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | - | 27.1 | 55.9 | 21.9 | 61.2 | 31.2 | 42.7 | 22.9 | 40.3 |
| XReasoner-3B (repl.) | ✗ | 31.0 | 55.2 | 24.4 | 61.1 | 35.7 | 47.2 | 30.6 | 41.1 |
| VLM-R1 3B-Math | ✓ | 28.6 | 56.7 | 21.9 | 62.7‡ | 32.2‡ | 42.7 | 30.0 | 40.5 |
| VLAA-Thinker 3B | ✓ | 24.6 | 55.6 | 24.4 | 61.0‡ | 36.4 | 47.5 | 31.5 | 38.5 |
| VOLD (Ours) | ✗ | 32.0 | 55.2 | 28.0 | 61.9 | 37.9 | 50.7 | 31.8 | 45.0 |
Table 1: VOLD achieves state-of-the-art performance despite training exclusively on text data, outperforming baselines that use images during fine-tuning. Baselines marked with ‡ were trained on portions of the evaluation set.
VOLD consistently outperforms vanilla GRPO on both validation accuracy (Geo3K dataset) and training reward (orz-57k dataset), demonstrating successful text-to-vision knowledge transfer and the benefits of on-policy distillation.
Figure 3: Validation accuracy on Geo3K (left) and training reward on orz-57k (right) over training steps.
This ablation demonstrates the critical role of aligning the student with the teacher's output distribution. We compare our full method, which uses teacher-generated SFT data for alignment, against variants trained on the original MoT dataset, which creates a mismatch between the student's and teacher's policies. The results show that without proper alignment, on-policy distillation provides no additional benefit.
| SFT (MoT) | RL | On-Policy Dist. | MMMU-Pro | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 27.1 | 55.9 | 21.9 | 61.2 | 31.2 | 42.7 | 22.9 | 40.3 |
| ✓ | ✗ | ✗ | 27.3 | 54.1 | 22.0 | 59.1 | 31.3 | 42.4 | 21.4 | 38.0 |
| ✗ | ✓ | ✗ | 27.5 | 55.2 | 23.8 | 61.2 | 31.2 | 46.7 | 24.6 | 40.1 |
| ✓ | ✓ | ✗ | 31.0 | 55.2 | 24.4 | 61.1 | 35.7 | 47.2 | 30.6 | 41.1 |
| ✓ | ✓ | ✓ | 30.8 | 55.1 | 24.5 | 61.0 | 35.9 | 47.4 | 30.6 | 41.2 |
| VOLD (ours) | | | 32.0 | 55.2 | 28.0 | 61.9 | 37.9 | 50.7 | 31.8 | 45.0 |
Table 2: Without proper policy alignment, on-policy distillation yields no benefit.
This table isolates the contribution of each component in our two-stage framework. We show performance after SFT-only, after adding RL (GRPO), and with our full unified objective. While Stage 1 SFT aligns the policy, it temporarily degrades performance due to unfiltered teacher traces. Stage 2, which combines RL with on-policy distillation, provides the largest performance gains, demonstrating that both components are essential for optimal reasoning.
| SFT (Teacher-MoT) | RL | On-Policy Dist. | MMMU-Pro | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 25.8 | 49.7 | 18.6 | 55.1 | 27.8 | 42.1 | 21.4 | 28.9 |
| ✓ | ✓ | ✗ | 29.7 | 50.5 | 24.0 | 58.4 | 34.1 | 47.6 | 30.4 | 38.3 |
| ✓ | ✓ | ✓ | 32.0 | 55.2 | 28.0 | 61.9 | 38.0 | 50.7 | 31.8 | 45.0 |
Table 3: The full VOLD pipeline consistently achieves the best performance.
@article{bousselham2025vold,
author = {Bousselham, Walid and Kuehne, Hilde and Schmid, Cordelia},
title = {VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation},
journal = {arXiv preprint arXiv:2510.23497},
year = {2025},
}