Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks

Page Website: https://zgca-ai4edu.github.io/Euclids_Gift/

Github: https://github.com/LiamLian0727/Euclids_Gift

📢 News

  • [10/24/2025] We trained Qwen3VL (4B, 8B, and 30B) using Euclid30K, and the results show that the models also achieve significant gains across various spatial intelligence tasks. The weights of the fine-tuned models are available here.
| Model | Super-CLEVR | Omni3DBench | VSI-Bench | MindCube |
|---|---|---|---|---|
| Qwen3VL-4B | 55.36 | 27.74 | 35.51 | 26.11 |
| Qwen3VL-Euclid-4B | 61.24 (+5.88) | 31.74 (+4.00) | 42.26 (+6.75) | 32.98 (+6.87) |
| Qwen3VL-8B | 48.30 | 34.01 | 33.25 | 34.16 |
| Qwen3VL-Euclid-8B | 48.96 (+0.66) | 35.03 (+1.02) | 35.54 (+2.29) | 41.02 (+6.86) |
| Qwen3VL-30B | 64.12 | 36.71 | 40.00 | 39.75 |
| Qwen3VL-Euclid-30B | 70.18 (+6.06) | 38.90 (+2.19) | 45.80 (+5.80) | 40.68 (+0.93) |

Qwen3VL and Qwen3VL-Euclid are evaluated using the same prompting template defined in test/eval_qwen.sh to ensure a fair comparison.
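For reference, the sketch below shows one minimal way to load a released checkpoint with Hugging Face transformers, using LiamLian0727/Qwen3VL_Euclid_8B as the example repository id. It assumes a transformers release with Qwen3-VL support and is not specific to this project; the exact inference workflow follows the standard Qwen-VL chat-template usage.

```python
# Minimal loading sketch (assumes a transformers version with Qwen3-VL support
# and `accelerate` installed for device_map="auto").
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiamLian0727/Qwen3VL_Euclid_8B"  # the 4B / 30B variants load the same way

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the weights in their stored precision
    device_map="auto",
)
# Inference then follows the usual Qwen-VL workflow:
# processor.apply_chat_template(...) followed by model.generate(...).
```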

Abstract

Spatial intelligence spans abilities such as visualizing and transforming shapes, mental rotation, reasoning about relative positions and containment, and counting/estimation. These remain challenging for modern Multimodal Large Language Models (MLLMs). We propose solving Euclidean geometry problems as a surrogate task and construct Euclid30K, a dataset of roughly 30K 2D and 3D geometry questions. We then fine‑tune Qwen2.5‑VL and RoboBrain2.0 models with Group Relative Policy Optimization (GRPO), enabling the models to internalize and apply Euclidean principles for shape recognition, counting, relation extraction, and multi‑step deductive reasoning. Without task‑specific adaptations, our models achieve significant zero‑shot gains on four spatial‑reasoning benchmarks: Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube. For example, on VSI‑Bench, average accuracy improves from 34.5% to 40.5%, a gain of 6.0 percentage points; RoboBrain2.0‑Euclid‑7B reaches 49.6%, surpassing the previous SOTA (Spatial‑MLLM).


Quick Start

1) Environment Setup

Training

Evaluation

2) Training

Below is an example training command; the configuration shown uses 8 GPUs per node across 2 nodes (trainer.n_gpus_per_node=8, trainer.nnodes=2). For a ready-made multi-node, multi-GPU launch script, see train/dist_train.sh.

python3 -m verl.trainer.main \
    config=examples/config.yaml \
    data.train_files=/mnt/datasets/Euclid30K/Euclid30K_train.parquet \
    data.val_files=/mnt/datasets/Euclid30K/Euclid30K_val.parquet \
    worker.actor.model.model_path=/mnt/models/Qwen2.5-VL-7B-Instruct \
    trainer.experiment_name=EXPERIMENT_NAME \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=8 \
    worker.actor.clip_ratio_low=0.2 \
    worker.actor.clip_ratio_high=0.28 \
    worker.reward.reward_function=/mnt/code/Euclids_Gift/train/euclid.py:compute_score \
    trainer.total_epochs=10 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_checkpoint_path=/mnt/models/Qwen2.5-VL-7B-Euclid
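
The worker.reward.reward_function entry points the trainer at a rule-based reward, compute_score in train/euclid.py. The repository's actual implementation (and the exact signature the verl trainer expects) should be taken from that file; the sketch below is only a hypothetical illustration of what a verifiable reward for geometry answers could look like, combining an answer-match reward with a small format bonus.

```python
import re


def compute_score(predict_str: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward sketch (not the repository's euclid.py):
    reward exact or numerically close final answers, plus a small format bonus
    when the response states its answer in \\boxed{...}."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", predict_str)
    if boxed:
        answer, format_reward = boxed[-1].strip(), 0.1
    else:
        lines = [l.strip() for l in predict_str.strip().splitlines() if l.strip()]
        answer, format_reward = (lines[-1] if lines else ""), 0.0

    # Compare numerically when both sides parse as numbers, else as strings.
    try:
        correct = abs(float(answer) - float(ground_truth)) < 1e-3
    except ValueError:
        correct = answer.lower() == ground_truth.strip().lower()

    return (1.0 if correct else 0.0) + format_reward
```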

3) Evaluation

Use test/eval_qwen.sh, test/eval_robo.sh, and test/eval_euclid.sh to evaluate the Qwen2.5‑VL series, the RoboBrain 2.0 series, and Euclid models trained on Euclid30K, respectively.

Before running these scripts, set model_path in each script to the path of the model you want to evaluate.

As noted in the VSI-Bench paper, spatial reasoning ability is the primary bottleneck limiting MLLM performance on VSI-Bench. To better observe how models perceive scenes and perform spatial reasoning, and to verify whether they genuinely acquire spatial intelligence from geometric knowledge, we deviate from the original VSI-Bench setup, which uses prompts such as "Answer with the option's letter from the given choices directly" or "Please answer the question using a single word or phrase" and caps responses at 16 tokens. Instead, we follow the prompt configuration described in RoboBrain2.0 (Sec. B), which encourages the model to reason about the problem before giving an answer, and we set the maximum response length to 1024 tokens. This setup lets us inspect the model's intermediate reasoning and assess whether it has internalized transferable spatial priors from Euclid30K training.
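
Because responses produced under this configuration contain free-form reasoning rather than a bare option letter, the final answer must be parsed out before scoring. The helper below is only an illustrative sketch of such parsing; the scripts in test/ define the actual procedure used for evaluation.

```python
import re


def extract_final_answer(response: str) -> str:
    """Illustrative sketch: pull the final answer out of a long, reasoning-style
    response so it can be compared against the benchmark answer key."""
    # Prefer an explicit marker such as "Answer: B" or "final answer = 3.5".
    marked = re.findall(r"(?:final answer|answer)\s*[:=]\s*(.+)",
                        response, flags=re.IGNORECASE)
    if marked:
        return marked[-1].strip().rstrip(".")
    # Otherwise fall back to the last non-empty line of the response.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


# Example: extract_final_answer("The sofa is left of the desk ... Final answer: B") -> "B"
```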

Citation

If you find this project or the dataset helpful, please cite:

@misc{Euclids_Gift,
    title={Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks},
    author={Shijie Lian and Changti Wu and Laurence Tianruo Yang and Hang Yuan and Bin Yu and Lei Zhang and Kai Chen},
    year={2025},
    eprint={2509.24473},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.24473}
}