Training vision-language models (VLMs) for complex reasoning remains a challenging task, in particular due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but how to leverage them for VLM reasoning is still an open question. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models.
To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase in this scenario: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance.
We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art by a clear margin. Our ablations show the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
VOLD features a two-stage post-training pipeline designed to transfer reasoning capabilities from text-only teacher LLMs to student VLMs without requiring vision-based reasoning data. The pipeline consists of two sequential stages: Stage 1 performs supervised fine-tuning (SFT) to align the student's output distribution with the teacher's reasoning patterns, while Stage 2 applies a unified objective combining reinforcement learning and on-policy knowledge distillation to enhance reasoning capabilities.
Stage 1: SFT for Policy Alignment. The goal of Stage 1 is to reduce the initial policy divergence between the student VLM and teacher LLM, creating a foundation that enables the student to effectively follow the teacher's reasoning process during the on-policy phase. This initial alignment is crucial to ensure that the student model's output distribution sufficiently overlaps with the teacher's, allowing for meaningful guidance during subsequent training.
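As a rough illustration, the sketch below shows what such a cold-start alignment step can look like: plain next-token cross-entropy on teacher-generated reasoning traces, with prompt tokens masked out. The `student` interface, batch fields, and label-masking convention are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a Stage-1 SFT step (assumed interface, not the authors' code):
# the student VLM is supervised with next-token cross-entropy on teacher-generated
# reasoning traces so that its output distribution overlaps with the teacher's.
import torch.nn.functional as F

def sft_alignment_step(student, batch, optimizer):
    # batch["input_ids"] holds prompt + teacher trace; batch["labels"] copies the
    # input ids with prompt positions set to -100 so only the trace is supervised.
    outputs = student(input_ids=batch["input_ids"],
                      attention_mask=batch["attention_mask"])
    logits = outputs.logits[:, :-1, :]          # predict token t+1 from position t
    labels = batch["labels"][:, 1:]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1),
                           ignore_index=-100)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```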
Stage 2: Unified RL and On-Policy Distillation. Building on the aligned model from Stage 1, our core contribution is a unified objective that seamlessly combines reinforcement learning with teacher distillation. This combined signal enhances reasoning without requiring any vision-based reasoning data. The teacher provides token-level guidance on the student's own rollout prefixes, while the GRPO component drives the student towards high-reward solutions through trajectory-level binary rewards on verifiable text-only reasoning tasks. We also introduce Reward-Guided KL Masking to selectively apply distillation only to incorrect responses, allowing the model to freely explore novel correct paths without teacher interference.
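To make the combined objective concrete, the following is a minimal sketch of one loss computation under our reading of Stage 2: a GRPO-style group-normalized policy-gradient term plus a token-level reverse-KL distillation term on the student's own rollouts, with the KL applied only to rollouts that received zero reward (reward-guided KL masking). Tensor names and shapes, the exact masking rule, and `kl_coeff` are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch (not the authors' code): one update combining a GRPO-style
# policy-gradient term with on-policy distillation from a text-only teacher.
import torch
import torch.nn.functional as F

def vold_style_loss(student_logits,   # (G, T, V) student logits on G rollouts
                    teacher_logits,   # (G, T, V) teacher logits on the same rollout prefixes
                    token_ids,        # (G, T)    sampled tokens of each rollout (long)
                    token_mask,       # (G, T)    1 for generated tokens, 0 for padding
                    rewards,          # (G,)      binary trajectory rewards (verifiable tasks)
                    kl_coeff=0.1):
    log_probs = F.log_softmax(student_logits, dim=-1)
    tok_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)   # (G, T)

    # GRPO-style advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)              # (G,)
    pg_loss = -(adv.unsqueeze(-1) * tok_logp * token_mask).sum() / token_mask.sum()

    # On-policy distillation: token-level reverse KL(student || teacher)
    # evaluated on the student's own rollouts.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = (log_probs.exp() * (log_probs - teacher_logp)).sum(-1)  # (G, T)

    # Reward-guided KL masking (assumed form): distill only on incorrect rollouts,
    # leaving correct ones free to explore beyond the teacher.
    incorrect = (rewards == 0).float().unsqueeze(-1)                       # (G, 1)
    kl_loss = (kl_per_token * token_mask * incorrect).sum() / token_mask.sum().clamp(min=1)

    return pg_loss + kl_coeff * kl_loss
```

Since the teacher is a text-only LLM, it would score only the text portion of each rollout prefix; that detail is abstracted into `teacher_logits` here.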
VOLD achieves state-of-the-art performance despite training exclusively on text data, outperforming baselines that use images during fine-tuning.
Benchmarks are grouped into multimodal general tasks (MMMU-Pro Vision, MMStar), multimodal math (MathVision, MathVista, MathVerse, DynaMath, WeMath), and a visual IQ test (LogicVista).

| Model | Images in FT | MMMU-Pro (Vision) | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | - | 27.1 | 55.9 | 21.9 | 61.2 | 31.2 | 42.7 | 22.9 | 40.3 |
| XReasoner-3B (repl.) | ✗ | 31.0 | 55.2 | 24.4 | 61.1 | 35.7 | 47.2 | 30.6 | 41.1 |
| VLM-R1 3B-Math | ✓ | 28.6 | 56.7 | 21.9 | 62.7‡ | 32.2‡ | 42.7 | 30.0 | 40.5 |
| VLAA-Thinker 3B | ✓ | 24.6 | 55.6 | 24.4 | 61.0‡ | 36.4 | 47.5 | 31.5 | 38.5 |
| VOLD (Ours) | ✗ | 32.0 | 55.2 | 28.0 | 61.9 | 37.9 | 50.7 | 31.8 | 45.0 |
Table 1: VOLD achieves state-of-the-art performance despite training exclusively on text data, outperforming baselines that use images during fine-tuning. Baselines marked with ‡ were trained on portions of the evaluation set.
VOLD consistently outperforms vanilla GRPO on both validation accuracy (Geo3K dataset) and training reward (orz-57k dataset), demonstrating successful text-to-vision knowledge transfer and the benefits of on-policy distillation.
Figure 3: Validation accuracy on Geo3K (left) and training reward on orz-57k (right) over training steps.
This ablation demonstrates the critical role of aligning the student with the teacher's output distribution. We compare our full method, which uses teacher-generated SFT data for alignment, against variants trained on the original MoT dataset, which creates a mismatch between the student's and teacher's policies. The results show that without proper alignment, on-policy distillation provides no additional benefit.
| SFT (MoT) | RL | On-Policy Dist. | MMMU-Pro | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 27.1 | 55.9 | 21.9 | 61.2 | 31.2 | 42.7 | 22.9 | 40.3 |
| ✓ | ✗ | ✗ | 27.3 | 54.1 | 22.0 | 59.1 | 31.3 | 42.4 | 21.4 | 38.0 |
| ✗ | ✓ | ✗ | 27.5 | 55.2 | 23.8 | 61.2 | 31.2 | 46.7 | 24.6 | 40.1 |
| ✓ | ✓ | ✗ | 31.0 | 55.2 | 24.4 | 61.1 | 35.7 | 47.2 | 30.6 | 41.1 |
| ✓ | ✓ | ✓ | 30.8 | 55.1 | 24.5 | 61.0 | 35.9 | 47.4 | 30.6 | 41.2 |
| VOLD (ours) | | | 32.0 | 55.2 | 28.0 | 61.9 | 37.9 | 50.7 | 31.8 | 45.0 |
Table 2: Without proper policy alignment, on-policy distillation yields no benefit.
This table isolates the contribution of each component in our two-stage framework. We show performance after SFT-only, after adding RL (GRPO), and with our full unified objective. While Stage 1 SFT aligns the policy, it temporarily degrades performance due to unfiltered teacher traces. Stage 2, which combines RL with on-policy distillation, provides the largest performance gains, demonstrating that both components are essential for optimal reasoning.
| SFT (Teacher-MoT) | RL | On-Policy Dist. | MMMU-Pro | MMStar | MathVision | MathVista | MathVerse | DynaMath (Avg.) | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 25.8 | 49.7 | 18.6 | 55.1 | 27.8 | 42.1 | 21.4 | 28.9 |
| ✓ | ✓ | ✗ | 29.7 | 50.5 | 24.0 | 58.4 | 34.1 | 47.6 | 30.4 | 38.3 |
| ✓ | ✓ | ✓ | 32.0 | 55.2 | 28.0 | 61.9 | 38.0 | 50.7 | 31.8 | 45.0 |
Table 3: The full VOLD pipeline consistently achieves the best performance.
@article{bousselham2025vold,
author = {Bousselham, Walid and Kuehne, Hilde and Schmid, Cordelia},
title = {VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation},
journal = {arXiv preprint arXiv:2510.23497},
year = {2025},
}