Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks

Page Website: https://zgca-ai4edu.github.io/Euclids_Gift/

Github: https://github.com/LiamLian0727/Euclids_Gift

📢 News

  • [10/24/2025] We trained Qwen3VL (4B, 8B, and 30B) using Euclid30K, and the results show that the models also achieve significant gains across various spatial intelligence tasks. The weights of the fine-tuned models are available here.
| Model | Super-CLEVR | Omni3DBench | VSI-Bench | MindCube |
|---|---|---|---|---|
| Qwen3VL-4B | 55.36 | 27.74 | 35.51 | 26.11 |
| Qwen3VL-Euclid-4B | 61.24 (+5.88) | 31.74 (+4.00) | 42.26 (+6.75) | 32.98 (+6.87) |
| Qwen3VL-8B | 48.30 | 34.01 | 33.25 | 34.16 |
| Qwen3VL-Euclid-8B | 48.96 (+0.66) | 35.03 (+1.02) | 35.54 (+2.29) | 41.02 (+6.86) |
| Qwen3VL-30B | 64.12 | 36.71 | 40.00 | 39.75 |
| Qwen3VL-Euclid-30B | 70.18 (+6.06) | 38.90 (+2.19) | 45.80 (+5.80) | 40.68 (+0.93) |

Qwen3VL and Qwen3VL-Euclid are evaluated using the same prompting template defined in test/eval_qwen.sh to ensure a fair comparison.
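For reference, the sketch below shows one minimal way to load a released checkpoint with Hugging Face transformers, using LiamLian0727/Qwen3VL_Euclid_8B as the example repository id. It assumes a transformers release with Qwen3-VL support and is not specific to this project; the exact inference workflow follows the standard Qwen-VL chat-template usage.

```python
# Minimal loading sketch (assumes a transformers version with Qwen3-VL support
# and `accelerate` installed for device_map="auto").
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiamLian0727/Qwen3VL_Euclid_8B"  # the 4B / 30B variants load the same way

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the weights in their stored precision
    device_map="auto",
)
# Inference then follows the usual Qwen-VL workflow:
# processor.apply_chat_template(...) followed by model.generate(...).
```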

Abstract

Spatial intelligence spans abilities such as visualizing and transforming shapes, mental rotation, reasoning about relative positions and containment, and counting/estimation. These remain challenging for modern Multimodal Large Language Models (MLLMs). We propose solving Euclidean geometry problems as a surrogate task and construct Euclid30K, a dataset of roughly 30K 2D and 3D geometry questions. We then fine‑tune Qwen2.5‑VL and RoboBrain2.0 models with Group Relative Policy Optimization (GRPO), enabling the models to internalize and apply Euclidean principles for shape recognition, counting, relation extraction, and multi‑step deductive reasoning. Without task‑specific adaptations, our models achieve significant zero‑shot gains on four spatial‑reasoning benchmarks: Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube. For example, on VSI‑Bench, average accuracy improves from 34.5% to 40.5%, a gain of 6.0 percentage points; RoboBrain2.0‑Euclid‑7B reaches 49.6%, surpassing the previous SOTA (Spatial‑MLLM).


Quick Start

1) Environment Setup

Training

Evaluation

2) Training

Below is an example training command; the configuration shown uses 8 GPUs per node across 2 nodes (trainer.n_gpus_per_node=8, trainer.nnodes=2). For a ready-made multi-node, multi-GPU launch script, see train/dist_train.sh.

python3 -m verl.trainer.main \
    config=examples/config.yaml \
    data.train_files=/mnt/datasets/Euclid30K/Euclid30K_train.parquet \
    data.val_files=/mnt/datasets/Euclid30K/Euclid30K_val.parquet \
    worker.actor.model.model_path=/mnt/models/Qwen2.5-VL-7B-Instruct \
    trainer.experiment_name=EXPERIMENT_NAME \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=8 \
    worker.actor.clip_ratio_low=0.2 \
    worker.actor.clip_ratio_high=0.28 \
    worker.reward.reward_function=/mnt/code/Euclids_Gift/train/euclid.py:compute_score \
    trainer.total_epochs=10 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_checkpoint_path=/mnt/models/Qwen2.5-VL-7B-Euclid
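
The worker.reward.reward_function entry points the trainer at a rule-based reward, compute_score in train/euclid.py. The repository's actual implementation (and the exact signature the verl trainer expects) should be taken from that file; the sketch below is only a hypothetical illustration of what a verifiable reward for geometry answers could look like, combining an answer-match reward with a small format bonus.

```python
import re


def compute_score(predict_str: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward sketch (not the repository's euclid.py):
    reward exact or numerically close final answers, plus a small format bonus
    when the response states its answer in \\boxed{...}."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", predict_str)
    if boxed:
        answer, format_reward = boxed[-1].strip(), 0.1
    else:
        lines = [l.strip() for l in predict_str.strip().splitlines() if l.strip()]
        answer, format_reward = (lines[-1] if lines else ""), 0.0

    # Compare numerically when both sides parse as numbers, else as strings.
    try:
        correct = abs(float(answer) - float(ground_truth)) < 1e-3
    except ValueError:
        correct = answer.lower() == ground_truth.strip().lower()

    return (1.0 if correct else 0.0) + format_reward
```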

3) Evaluation

Use test/eval_qwen.sh, test/eval_robo.sh, and test/eval_euclid.sh to evaluate the Qwen2.5‑VL series, the RoboBrain 2.0 series, and Euclid models trained on Euclid30K, respectively.

Before running these scripts, set model_path in each script to the path of the model you want to evaluate.

As noted in the VSI-Bench paper, spatial reasoning ability is the primary bottleneck limiting MLLM performance on VSI-Bench. To better observe how models perceive scenes and perform spatial reasoning, and to verify whether they genuinely acquire spatial intelligence from geometric knowledge, we deviate from the original VSI-Bench setup, which uses prompts such as "Answer with the option's letter from the given choices directly" or "Please answer the question using a single word or phrase" and caps responses at 16 tokens. Instead, we follow the prompt configuration described in RoboBrain2.0 (Sec. B), which encourages the model to reason about the problem before giving an answer, and we set the maximum response length to 1024 tokens. This setup lets us inspect the model's intermediate reasoning and assess whether it has internalized transferable spatial priors from Euclid30K training.
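
Because responses produced under this configuration contain free-form reasoning rather than a bare option letter, the final answer must be parsed out before scoring. The helper below is only an illustrative sketch of such parsing; the scripts in test/ define the actual procedure used for evaluation.

```python
import re


def extract_final_answer(response: str) -> str:
    """Illustrative sketch: pull the final answer out of a long, reasoning-style
    response so it can be compared against the benchmark answer key."""
    # Prefer an explicit marker such as "Answer: B" or "final answer = 3.5".
    marked = re.findall(r"(?:final answer|answer)\s*[:=]\s*(.+)",
                        response, flags=re.IGNORECASE)
    if marked:
        return marked[-1].strip().rstrip(".")
    # Otherwise fall back to the last non-empty line of the response.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


# Example: extract_final_answer("The sofa is left of the desk ... Final answer: B") -> "B"
```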

Citation

If you find this project or the dataset helpful, please cite:

@misc{Euclids_Gift,
    title={Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks},
    author={Shijie Lian and Changti Wu and Laurence Tianruo Yang and Hang Yuan and Bin Yu and Lei Zhang and Kai Chen},
    year={2025},
    eprint={2509.24473},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.24473}
}