VisOnlyQA

Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Penn State University

Introduction

Large Vision Language Models (LVLMs) have achieved remarkable progress on a range of vision-language tasks, including visual math reasoning and academic exams. However, the capability of LVLMs to perceive visual information in images has received little direct analysis. In particular, it remains unclear how accurately LVLMs can perceive geometric information, such as shape, angle, and size, even though perceiving these properties is crucial for tasks that require detailed visual understanding.

To bridge this gap, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, consisting of 12 tasks that ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments on VisOnlyQA reveal that LVLMs still often cannot accurately perceive basic geometric information in images.

Our experiments highlight the following findings:

  1. Even large LVLMs struggle with perceiving geometric information: the 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on VisOnlyQA, while human performance is nearly perfect.
  2. Additional training data does not fully solve this issue: fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks.
  3. Insights for future improvement: stronger language models improve the capability of LVLMs to perceive geometric information, suggesting that the way LVLMs process information from their visual encoders is a bottleneck.

VisOnlyQA Dataset


VisOnlyQA includes questions about geometric information in four types of figures: geometric shapes, chemical structures, charts, and 3D shapes. The figures come from two types of sources: Real and Synthetic.

VisOnlyQA includes the following splits.

  • VisOnlyQA-Eval-Real includes Real figures, sourced from existing popular datasets, paired with new questions annotated by us.
    • Geometric shapes from datasets such as MathVista, Geometry3K, GeoQA+, GEOS, and UniGeo.
    • Chemistry figures from the MMMU dataset.
    • Charts from ChartQA and CharXiv datasets.
  • VisOnlyQA-Eval-Synthetic and VisOnlyQA-Train include Synthetic figures and questions, which are automatically generated by our Python scripts.
    • Geometric shapes come from SyntheticGeometry, a new dataset generated by Python scripts built on an open-source reproduction of AlphaGeometry (a sketch of this style of generation follows this list).
    • 3D shapes are sourced from CLEVR and SuperCLEVR datasets.
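
To make the synthetic pipeline concrete, below is a minimal, illustrative sketch of how such an example could be generated. It is not the authors' actual generation code; the drawing and option-sampling choices are assumptions in the style of the Geometry-Angle task.

```python
import math
import random
import matplotlib.pyplot as plt

def make_angle_example(path="angle.png"):
    """Draw two rays from a shared vertex and build a 5-option angle question."""
    angle_deg = random.choice(range(25, 156, 5))  # gold angle, exact by construction
    theta = math.radians(angle_deg)

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot([0, 1], [0, 0], color="black")                              # first ray
    ax.plot([0, math.cos(theta)], [0, math.sin(theta)], color="black")  # second ray
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

    # Distractors are kept at least 15 degrees away from the gold angle.
    candidates = [a for a in range(25, 156, 5) if abs(a - angle_deg) >= 15]
    options = sorted(random.sample(candidates, 4) + [angle_deg])
    question = ("What is the angle formed by the two line segments? "
                + " ".join(f"({i + 1}) {a} degrees" for i, a in enumerate(options)))
    return {"image": path, "question": question,
            "answer": options.index(angle_deg) + 1}
```

Because the gold label is exact by construction, no human annotation is needed, which is what makes scaling the Train split to 70k examples cheap.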

Examples from VisOnlyQA

Examples from VisOnlyQA and answers from LVLMs. VisOnlyQA includes 12 tasks for evaluating the capability of LVLMs to perceive basic geometric information, such as angle, shape, and size. State-of-the-art LVLMs still often cannot accurately perceive geometric information. Questions in this figure are abbreviated.

Dataset Statistics of VisOnlyQA

| Task | Eval-Real | Eval-Synthetic | Train | Answer Format |
|---|---|---|---|---|
| Geometry-Triangle | 100 | 100 | 10k | True/False |
| Geometry-Quadrilateral | 100 | 100 | 10k | True/False |
| Geometry-Length | 100 | 100 | 10k | True/False |
| Geometry-Angle | 100 | 100 | 10k | 5 options |
| Geometry-Area | 100 | 100 | 10k | 5 options |
| Geometry-Diameter | 100 | -- | -- | 5 options |
| Chemistry-Shape (s) | 50 | -- | -- | True/False |
| Chemistry-Shape (m) | 50 | -- | -- | Select Multiple |
| Charts-Extraction | 100 | -- | -- | 5 options |
| Charts-Intersection | 100 | -- | -- | True/False |
| 3D-Size | -- | 100 | 10k | 3 options |
| 3D-Angle | -- | 100 | 10k | 5 options |
| Total | 900 | 700 | 70k | |
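
For programmatic access, a natural pattern is loading the splits with the Hugging Face datasets library. The sketch below is hypothetical: the dataset identifier and field names are assumptions, so check the project repository for the released names.

```python
# Hypothetical loading sketch: the dataset id and field names below are
# assumptions, not confirmed identifiers; see the VisOnlyQA repository.
from datasets import load_dataset

eval_real = load_dataset("ryokamoi/VisOnlyQA_Eval_Real")  # hypothetical id
split = next(iter(eval_real.values()))  # take whichever split is released
print(len(split))         # the table above lists 900 Eval-Real examples
print(split[0].keys())    # e.g., image / question / answer fields
```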

Experiments

Example Responses from LVLMs on VisOnlyQA-Eval-Real

Geometry-Triangle (Eval-Real)

[Figure: geometric diagram for this Geometry-Triangle example]

Question:

There is a triangle DBC in the figure. True or False?
A triangle is a polygon with three edges and three vertices, which are explicitly connected in the figure.

Gold Answer: False

Model Responses

GPT-4o (correct)
  • Answer w/ CoT: In the figure, there is a triangle labeled \( \triangle ABC \). The line segment \( DE \) is parallel to \( BC \) and divides \( \triangle ABC \) into two smaller triangles, \( \triangle ADE \) and \( \triangle BDE \). However, there is no triangle labeled \( \triangle DBC \) in the figure. Final answer: False.
  • Answer w/o CoT: False.

Gemini 1.5 Pro (incorrect)
  • Answer w/ CoT: The figure shows points D, B, and C connected by line segments DB, BC, and CD. These three segments and the points D, B, and C form a triangle. Final Answer: True
  • Answer w/o CoT: True
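
Responses like the ones above can be collected with any chat-with-images API. Here is a minimal sketch using the OpenAI Python SDK; it assumes an `OPENAI_API_KEY` is set, and the prompt wrapper and decoding defaults are illustrative, not necessarily the exact evaluation setup used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_with_image(image_path: str, question: str) -> str:
    """Send one image-plus-text question and return the model's reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical file name):
# print(ask_with_image("triangle.png",
#                      "There is a triangle DBC in the figure. True or False?"))
```

For the "w/o CoT" setting, one would additionally instruct the model to output only the final answer, while the "w/ CoT" setting lets it reason before answering.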

Results of Existing LVLMs on VisOnlyQA

We evaluate 20 LVLMs, including GPT-4o and Gemini 1.5 Pro, on VisOnlyQA. Our results show that even state-of-the-art LVLMs still often cannot accurately perceive basic geometric information in images.

Full results are provided at the end of this page.

[Figure: radar chart of LVLM and human accuracy on VisOnlyQA]

Accuracy scores of LVLMs and humans on VisOnlyQA. Even state-of-the-art LVLMs, such as GPT-4o and Gemini 1.5 Pro, still struggle with visual perception of geometric information, and their performance falls far short of human performance.

Results of LVLMs fine-tuned on the Training Set of VisOnlyQA

We evaluate LVLMs fine-tuned on the training set of VisOnlyQA.

Positive results: All models achieve near-perfect accuracy on 3D-Size after fine-tuning, and models larger than 7B improve substantially even on the out-of-distribution Real figures in Geometry-Length and Geometry-Area. This partially supports our hypothesis that the training data of existing LVLMs is insufficient, and indicates that our approach of using synthetic training data has the potential to improve the capability of LVLMs to perceive geometric information.

Negative results: However, the fine-tuned models still fall well short of human performance, even on in-distribution figures. Specifically, fine-tuning barely improves performance on 3D-Angle, and we observe only small improvements on Geometry-Triangle, Quadrilateral, and Angle, even on in-distribution figures. This indicates that fine-tuning on datasets that require accurate perception of geometric information is not always effective, and that its effectiveness depends on the properties of the target tasks.


Accuracy scores of LVLMs fine-tuned on the training set of VisOnlyQA.
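
For readers who want to attempt similar adaptation, the sketch below shows one common fine-tuning recipe: LoRA adapters attached via the `peft` library. This is an assumed setup for illustration, not the paper's training configuration; the base model, target modules, and rank are all placeholder choices.

```python
# Illustrative LoRA fine-tuning sketch; NOT the paper's actual recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # one of the open LVLMs evaluated above
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Attach low-rank adapters to the attention projections; the module names and
# rank below are common defaults, not the authors' settings.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then minimize the usual next-token loss on the gold answers,
# e.g., with transformers.Trainer over processor-prepared (image, text) batches.
```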

LLMs of LVLMs Influence Their Geometric Perception

InternVL2-4B and InternVL2-8B share the same vision transformer (ViT) as their visual encoder, as do Qwen2-VL-2B and Qwen2-VL-7B, while the two models in each pair use different language models. We expected the visual encoders to play the major role in geometric perception, so that models sharing a ViT would perform similarly on VisOnlyQA, particularly after fine-tuning, since fine-tuning helps models understand the tasks and should further reduce the influence of the language models' reasoning capability.

However, there are performance gaps between LVLMs that share a ViT but use different language models, and the gaps grow after fine-tuning. This observation indicates that the language models of LVLMs affect their capability to perceive geometric information, and that the influence of the LLMs is not limited to reasoning or knowledge.

This result suggests that language models play a crucial role in processing the visual information encoded by the ViT, and that strong language models are needed even for tasks that do not require challenging reasoning or knowledge.

Larger language models improve the performance of LVLMs on VisOnlyQA-Eval when the visual encoder is held fixed.

| Model | ViT | LLM | Original (Real) | Original (Synthetic) | Fine-tuned (Real) | Fine-tuned (Synthetic) |
|---|---|---|---|---|---|---|
| InternVL2-4B | 304M | 3.8B | 38.4 | 34.1 | 46.0 | 57.7 |
| InternVL2-8B | 304M | 7.7B | 40.7 | 35.0 | 52.4 | 64.6 |
| Qwen2-VL-2B | 675M | 1.5B | 32.3 | 33.6 | 43.8 | 54.6 |
| Qwen2-VL-7B | 675M | 7.6B | 38.9 | 37.1 | 48.2 | 65.0 |
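
The scaling effect is easy to recompute from the table. The snippet below copies the averages verbatim and prints the 4B-to-8B and 2B-to-7B gaps; it introduces no new results, just arithmetic on the numbers above.

```python
# Recompute LLM-scaling gaps from the table above (values copied verbatim).
scores = {
    # model: (orig_real, orig_synth, finetuned_real, finetuned_synth)
    "InternVL2-4B": (38.4, 34.1, 46.0, 57.7),
    "InternVL2-8B": (40.7, 35.0, 52.4, 64.6),
    "Qwen2-VL-2B":  (32.3, 33.6, 43.8, 54.6),
    "Qwen2-VL-7B":  (38.9, 37.1, 48.2, 65.0),
}

pairs = [("InternVL2-4B", "InternVL2-8B"), ("Qwen2-VL-2B", "Qwen2-VL-7B")]
for small, large in pairs:
    gaps = [round(l - s, 1) for s, l in zip(scores[small], scores[large])]
    print(f"{small} -> {large}: original +{gaps[0]} (Real) / +{gaps[1]} (Synthetic), "
          f"fine-tuned +{gaps[2]} (Real) / +{gaps[3]} (Synthetic)")
```

On the Synthetic split, which is in-distribution for fine-tuning, both gaps widen after fine-tuning: from +0.9 to +6.9 points for InternVL2, and from +3.5 to +10.4 points for Qwen2-VL.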

Full Performance Tables of Existing LVLMs on VisOnlyQA

Accuracy scores on the Eval-Real subset (900 examples in total) of VisOnlyQA.

| Model | Triangle | Quadrilateral | Diameter | Length | Angle | Area | Shape (s) | Shape (m) | Extraction | Intersection | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 20.0 | 20.0 | 20.0 | 50.0 | 50.0 | 6.2 | 20.0 | 50.0 | 34.2 |
| Phi-3.5-vision | 48.0 | 50.0 | 17.0 | 17.0 | 27.0 | 50.0 | 54.0 | 10.0 | 29.0 | 50.0 | 35.6 |
| LLaVA-Next 8B | 50.0 | 50.0 | 16.0 | 15.0 | 26.0 | 49.0 | 42.0 | 4.0 | 22.0 | 49.0 | 33.3 |
| LLaVA-Next 34B | 49.0 | 50.0 | 30.0 | 15.0 | 22.0 | 44.0 | 34.0 | 10.0 | 35.0 | 50.0 | 35.2 |
| Llama 3.2 11B | 50.0 | 47.0 | 17.0 | 15.0 | 26.0 | 43.0 | 34.0 | 8.0 | 32.0 | 50.0 | 33.4 |
| Llama 3.2 90B | 51.0 | 46.0 | 14.0 | 28.0 | 27.0 | 48.0 | 60.0 | 20.0 | 35.0 | 45.0 | 37.1 |
| MolMo 7B-D | 49.0 | 45.0 | 20.0 | 11.0 | 23.0 | 56.0 | 40.0 | 12.0 | 31.0 | 48.0 | 34.3 |
| MolMo 72B | 44.0 | 47.0 | 22.0 | 25.0 | 33.0 | 50.0 | 48.0 | 30.0 | 46.0 | 52.0 | 39.8 |
| Qwen2-VL-2B | 43.0 | 44.0 | 15.0 | 19.0 | 26.0 | 47.0 | 38.0 | 12.0 | 27.0 | 45.0 | 32.3 |
| Qwen2-VL-7B | 50.0 | 50.0 | 23.0 | 19.0 | 34.0 | 46.0 | 46.0 | 16.0 | 45.0 | 52.0 | 38.9 |
| Qwen2-VL-72B | 44.0 | 52.0 | 27.0 | 27.0 | 37.0 | 61.0 | 56.0 | 36.0 | 53.0 | 53.0 | 44.4 |
| InternVL2-4B | 50.0 | 56.0 | 30.0 | 17.0 | 18.0 | 49.0 | 54.0 | 16.0 | 38.0 | 53.0 | 38.4 |
| InternVL2-8B | 44.0 | 36.0 | 29.0 | 30.0 | 27.0 | 56.0 | 50.0 | 22.0 | 52.0 | 56.0 | 40.7 |
| InternVL2-26B | 44.0 | 47.0 | 24.0 | 22.0 | 26.0 | 55.0 | 58.0 | 28.0 | 47.0 | 46.0 | 39.3 |
| InternVL2-40B | 43.0 | 45.0 | 32.0 | 23.0 | 31.0 | 57.0 | 28.0 | 30.0 | 61.0 | 58.0 | 42.1 |
| InternVL2-76B | 44.0 | 42.0 | 28.0 | 34.0 | 45.0 | 56.0 | 60.0 | 36.0 | 63.0 | 54.0 | 46.0 |
| Claude 3.5 Sonnet | 50.0 | 47.0 | 23.0 | 20.0 | 33.0 | 59.0 | 52.0 | 40.0 | 61.0 | 52.0 | 43.4 |
| GPT-4o-mini | 45.0 | 66.0 | 26.0 | 19.0 | 30.0 | 58.0 | 58.0 | 32.0 | 40.0 | 53.0 | 42.4 |
| GPT-4o | 58.0 | 48.0 | 27.0 | 34.0 | 38.0 | 69.0 | 72.0 | 50.0 | 46.0 | 58.0 | 48.8 |
| Gemini 1.5 Flash | 47.0 | 51.0 | 25.0 | 24.0 | 39.0 | 60.0 | 68.0 | 42.0 | 58.0 | 58.0 | 49.2 |
| Gemini 1.5 Pro | 47.0 | 53.0 | 33.0 | 40.0 | 53.0 | 70.0 | 62.0 | 52.0 | 67.0 | 53.0 | 52.6 |
| Human | 96.7 | 90.0 | 93.3 | 93.3 | 86.7 | 100.0 | 93.3 | 93.0 | 93.3 | 95.0 | 93.5 |

Accuracy scores on the Eval-Synthetic subset (700 examples in total) of VisOnlyQA.

| Model | Triangle | Quadrilateral | Length | Angle | Area | Size | Angle (3D) | Average |
|---|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 20.0 | 20.0 | 20.0 | 33.3 | 20.0 | 30.5 |
| Phi-3.5-vision | 54.0 | 55.0 | 15.0 | 22.0 | 21.0 | 39.0 | 20.0 | 32.3 |
| LLaVA-Next 8B | 50.0 | 50.0 | 17.0 | 21.0 | 19.0 | 26.0 | 19.0 | 28.9 |
| LLaVA-Next 34B | 51.0 | 50.0 | 25.0 | 24.0 | 20.0 | 48.0 | 32.0 | 35.7 |
| Llama 3.2 11B | 54.0 | 52.0 | 31.0 | 21.0 | 21.0 | 32.0 | 21.0 | 33.1 |
| Llama 3.2 90B | 61.0 | 56.0 | 12.0 | 16.0 | 20.0 | 45.0 | 26.0 | 33.7 |
| MolMo 7B-D | 49.0 | 56.0 | 22.0 | 20.0 | 14.0 | 29.0 | 27.0 | 31.0 |
| MolMo 72B | 51.0 | 55.0 | 23.0 | 22.0 | 18.0 | 50.0 | 27.0 | 35.1 |
| Qwen2-VL-2B | 50.0 | 50.0 | 31.0 | 23.0 | 20.0 | 38.0 | 23.0 | 33.6 |
| Qwen2-VL-7B | 58.0 | 59.0 | 24.0 | 18.0 | 22.0 | 58.0 | 21.0 | 37.1 |
| Qwen2-VL-72B | 51.0 | 56.0 | 33.0 | 21.0 | 26.0 | 76.0 | 27.0 | 41.4 |
| InternVL2-4B | 50.0 | 51.0 | 21.0 | 24.0 | 18.0 | 57.0 | 18.0 | 34.1 |
| InternVL2-8B | 51.0 | 57.0 | 21.0 | 17.0 | 23.0 | 46.0 | 30.0 | 35.0 |
| InternVL2-26B | 51.0 | 53.0 | 30.0 | 23.0 | 21.0 | 72.0 | 25.0 | 39.3 |
| InternVL2-40B | 51.0 | 54.0 | 30.0 | 23.0 | 21.0 | 69.0 | 25.0 | 39.0 |
| InternVL2-76B | 52.0 | 51.0 | 29.0 | 18.0 | 22.0 | 84.0 | 27.0 | 40.4 |
| Claude 3.5 Sonnet | 61.0 | 63.0 | 33.0 | 20.0 | 34.0 | 62.0 | 22.0 | 42.1 |
| GPT-4o-mini | 60.0 | 51.0 | 21.0 | 20.0 | 18.0 | 27.0 | 23.0 | 31.4 |
| GPT-4o | 66.0 | 56.0 | 25.0 | 17.0 | 26.0 | 60.0 | 23.0 | 39.0 |
| Gemini 1.5 Flash | 54.0 | 51.0 | 29.0 | 21.0 | 19.0 | 60.0 | 21.0 | 36.4 |
| Gemini 1.5 Pro | 54.0 | 57.0 | 34.0 | 21.0 | 40.0 | 69.0 | 22.0 | 42.4 |
| Human | 95.0 | 95.0 | 95.0 | 90.0 | 95.0 | 100.0 | 95.0 | 95.0 |

BibTeX

@article{kamoi2024visonlyqa,
  author = {Ryo Kamoi and Yusen Zhang and Sarkar Snigdha Sarathi Das and Ranran Haoran Zhang and Rui Zhang},
  journal = {arXiv preprint arXiv:2412.00947},
  title = {VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information},
  year = {2024}
}