STELAR-VISION
Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides
Carnegie Mellon University
Introduction
Higher Accuracy, Faster Inference, Shorter Reasoning, Greener
Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM_S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms it across all OOD benchmarks. The data and code will be made publicly available.
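To make the pipeline concrete, the following is a minimal sketch of how a TopoAug-style annotation loop could be organized, assuming a teacher VLM call and an answer checker are available as plain callables. The function names and the tie-breaking rule are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of a TopoAug-style annotation loop (hypothetical names,
# not the released implementation). The teacher VLM call and the answer
# checker are passed in as plain callables, so nothing depends on a specific API.

TOPOLOGIES = ("chain", "tree", "graph")

def annotate_example(image, question, gold_answer, generate_trace, is_correct):
    """Generate one reasoning trace per topology and record the best one."""
    traces = {}
    for topo in TOPOLOGIES:
        # Ask the teacher model to reason in the given topological structure.
        trace, answer = generate_trace(image, question, topology=topo)
        traces[topo] = {
            "trace": trace,
            "correct": is_correct(answer, gold_answer),
            "length": len(trace.split()),
        }
    # Prefer topologies that reach the right answer; break ties by brevity,
    # in the spirit of Frugal Learning (the tie-break is an assumption).
    correct = [t for t in TOPOLOGIES if traces[t]["correct"]]
    best = min(correct, key=lambda t: traces[t]["length"]) if correct else None
    return {"question": question, "traces": traces, "best_topology": best}
```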

An overview of the STELAR-Vision framework.
Motivation

Limitations of the popular chain-of-thought reasoning structure. The widely adopted Chain-of-Thought (CoT) reasoning paradigm (in green) often produces unnecessarily verbose reasoning, as shown in the top-row example: under CoT, the model redundantly counts each cube, whereas with the Graph topology (in blue) it quickly identifies the key point of the question. In the bottom-row example, CoT reasoning begins with a detailed examination of each subplot but ultimately arrives at an incorrect answer, while the Tree topology (in red) starts with a high-level overview before delving into specific features. In both cases, CoT-style reasoning is suboptimal.
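For intuition, the three topologies can be viewed as different graph shapes over reasoning steps. The minimal encoding below is purely illustrative; the `Step` class and the step texts are hypothetical stand-ins, not the paper's internal format.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    text: str
    next_steps: list = field(default_factory=list)  # indices of successor steps

# Chain: strictly linear CoT; each step feeds exactly one successor.
chain = [Step("Count the cubes in the first layer", [1]),
         Step("Count the cubes in the second layer", [2]),
         Step("Add the two counts", [])]

# Tree: a high-level overview step branches into parallel sub-analyses.
tree = [Step("Skim all subplots for the overall trend", [1, 2]),
        Step("Check the axis ranges of subplot A", []),
        Step("Check the axis ranges of subplot B", [])]

# Graph: independent observations can converge on a shared conclusion.
graph = [Step("Identify the key structural property", [2]),
         Step("Extract the constraint stated in the question", [2]),
         Step("Combine both to answer directly", [])]
```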

Comparison of topology accuracy across subjects: accuracy of the Chain, Tree, and Graph reasoning structures per subject of the MATH-V dataset. Chain remains the best overall reasoning structure, while Tree and Graph perform better on subjects such as "graph theory" and "statistics".

Distribution of generated reasoning token lengths for the Chain, Tree, and Graph topological structures in the TopoAug dataset. The box within each violin plot marks the median and the 25th and 75th percentiles.
Contribution
-
We propose STELAR-Vision, a training framework that explicitly trains VLMs for topology-aware reasoning. It leverages diverse reasoning topologies such as chains, trees, and graphs, aligns reasoning paths with question characteristics, and enables adaptive and efficient multimodal inference.
-
We introduce TopoAug, a data generation pipeline that automatically produces reasoning traces with diverse topologies and annotates the optimal structure for each question. We also integrate Frugal Learning into the training framework, reducing output length with minimal accuracy tradeoff (an illustrative reward sketch follows this list).
-
Through post-training with supervised fine-tuning and reinforcement learning, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%. The Frugal Learning variant reduces output length by 18.1% while maintaining comparable accuracy.
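As one hypothetical reading of how the frugality objective could be combined with accuracy during reinforcement-learning post-training, the reward below gives full credit for a correct answer and applies a mild penalty only to tokens beyond a reference budget. The coefficients and functional form are assumptions for illustration, not the paper's exact formulation.

```python
def frugal_reward(answer_correct: bool, num_tokens: int,
                  ref_tokens: int = 512, length_weight: float = 0.2) -> float:
    """Illustrative accuracy-plus-brevity reward.

    `ref_tokens` and `length_weight` are made-up defaults; the actual Frugal
    Learning objective used in STELAR-Vision may differ.
    """
    accuracy_term = 1.0 if answer_correct else 0.0
    # Penalize only the tokens that exceed the reference budget, so short
    # correct answers are never punished for being concise.
    overflow = max(0, num_tokens - ref_tokens) / ref_tokens
    return accuracy_term - length_weight * overflow
```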
Experiments




Conclusion
In this work, we propose STELAR-Vision, a training framework that enables topology-aware reasoning in VLMs through synthetically generated, topology-diverse training responses. STELAR-Vision enhances vision-language reasoning by leveraging diverse topological structures, achieving a 9.7% accuracy improvement over its base model and outperforming the larger Qwen2VL-72B-Instruct by 7.3%. The Frugal Learning variant reduces output length by 18.1% while maintaining comparable accuracy, surpassing Chain-Only baselines in both efficiency and task effectiveness. STELAR-Vision generalizes strongly across five diverse OOD datasets and achieves 4.3% higher overall accuracy on in-distribution tasks, consistently outperforming Chain-Only training.