VisOnlyQA

Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Penn State University

Introduction

Large Vision Language Models (LVLMs) have achieved remarkable progress on a range of vision-language tasks, including visual math reasoning and academic exams. However, the capability of LVLMs to perceive visual information in images has received little direct analysis. In particular, it remains unclear how accurately LVLMs can perceive geometric information, such as shape, angle, and size, even though perceiving these properties is crucial for tasks that require detailed visual understanding.

To bridge this gap, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, consisting of 12 tasks that ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments on VisOnlyQA reveal that LVLMs still often cannot accurately perceive basic geometric information in images.

Our experiments highlight the following findings:

  1. Even large LVLMs struggle with perceiving geometric information: the 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on VisOnlyQA, while human performance is nearly perfect.
  2. Additional training data does not fully solve this issue: fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks.
  3. Insights for future improvement: stronger language models improve the capability of LVLMs to perceive geometric information, suggesting that the way LVLMs process information from their visual encoders is a bottleneck.

VisOnlyQA Dataset


VisOnlyQA includes questions about geometric information in four types of figures: geometric shapes, chemical structures, charts, and 3D shapes. The figures come from two types of sources: Real and Synthetic.

VisOnlyQA includes the following splits.

  • VisOnlyQA-Eval-Real includes Real figures, sourced from existing popular datasets, paired with new questions annotated by us.
    • Geometric shapes from datasets such as MathVista, Geometry3K, GeoQA+, GEOS, and UniGeo.
    • Chemistry figures from the MMMU dataset.
    • Charts from ChartQA and CharXiv datasets.
  • VisOnlyQA-Eval-Synthetic and VisOnlyQA-Train include Synthetic figures and questions, which are automatically generated by our Python scripts.
    • Geometric shapes come from SyntheticGeometry, a new dataset generated by Python scripts built on an open-source reproduction of AlphaGeometry (a sketch of this style of generation follows this list).
    • 3D shapes are sourced from CLEVR and SuperCLEVR datasets.
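
To make the synthetic pipeline concrete, below is a minimal, illustrative sketch of how such an example could be generated. It is not the authors' actual generation code; the drawing and option-sampling choices are assumptions in the style of the Geometry-Angle task.

```python
import math
import random
import matplotlib.pyplot as plt

def make_angle_example(path="angle.png"):
    """Draw two rays from a shared vertex and build a 5-option angle question."""
    angle_deg = random.choice(range(25, 156, 5))  # gold angle, exact by construction
    theta = math.radians(angle_deg)

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot([0, 1], [0, 0], color="black")                              # first ray
    ax.plot([0, math.cos(theta)], [0, math.sin(theta)], color="black")  # second ray
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

    # Distractors are kept at least 15 degrees away from the gold angle.
    candidates = [a for a in range(25, 156, 5) if abs(a - angle_deg) >= 15]
    options = sorted(random.sample(candidates, 4) + [angle_deg])
    question = ("What is the angle formed by the two line segments? "
                + " ".join(f"({i + 1}) {a} degrees" for i, a in enumerate(options)))
    return {"image": path, "question": question,
            "answer": options.index(angle_deg) + 1}
```

Because the gold label is exact by construction, no human annotation is needed, which is what makes scaling the Train split to 70k examples cheap.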

Examples from VisOnlyQA

Examples from VisOnlyQA and answers from LVLMs. VisOnlyQA includes 12 tasks for evaluating the capability of LVLMs to perceive basic geometric information, such as angle, shape, and size. State-of-the-art LVLMs still often cannot accurately perceive geometric information. Questions in this figure are abbreviated.

Dataset Statistics of VisOnlyQA

| Task | Eval-Real | Eval-Synthetic | Train | Answer Format |
|---|---|---|---|---|
| Geometry-Triangle | 100 | 100 | 10k | True/False |
| Geometry-Quadrilateral | 100 | 100 | 10k | True/False |
| Geometry-Length | 100 | 100 | 10k | True/False |
| Geometry-Angle | 100 | 100 | 10k | 5 options |
| Geometry-Area | 100 | 100 | 10k | 5 options |
| Geometry-Diameter | 100 | -- | -- | 5 options |
| Chemistry-Shape (s) | 50 | -- | -- | True/False |
| Chemistry-Shape (m) | 50 | -- | -- | Select Multiple |
| Charts-Extraction | 100 | -- | -- | 5 options |
| Charts-Intersection | 100 | -- | -- | True/False |
| 3D-Size | -- | 100 | 10k | 3 options |
| 3D-Angle | -- | 100 | 10k | 5 options |
| Total | 900 | 700 | 70k | |
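
For programmatic access, a natural pattern is loading the splits with the Hugging Face datasets library. The sketch below is hypothetical: the dataset identifier and field names are assumptions, so check the project repository for the released names.

```python
# Hypothetical loading sketch: the dataset id and field names below are
# assumptions, not confirmed identifiers; see the VisOnlyQA repository.
from datasets import load_dataset

eval_real = load_dataset("ryokamoi/VisOnlyQA_Eval_Real")  # hypothetical id
split = next(iter(eval_real.values()))  # take whichever split is released
print(len(split))         # the table above lists 900 Eval-Real examples
print(split[0].keys())    # e.g., image / question / answer fields
```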

Experiments

Example Responses from LVLMs on VisOnlyQA-Eval-Real

Geometry-Triangle (Eval-Real)

[Figure: geometric diagram for this Geometry-Triangle example]

Question:

There is a triangle DBC in the figure. True or False?
A triangle is a polygon with three edges and three vertices, which are explicitly connected in the figure.

Gold Answer: False

Model Responses

GPT-4o (correct)
  • Answer w/ CoT: In the figure, there is a triangle labeled \( \triangle ABC \). The line segment \( DE \) is parallel to \( BC \) and divides \( \triangle ABC \) into two smaller triangles, \( \triangle ADE \) and \( \triangle BDE \). However, there is no triangle labeled \( \triangle DBC \) in the figure. Final answer: False.
  • Answer w/o CoT: False.

Gemini 1.5 Pro (incorrect)
  • Answer w/ CoT: The figure shows points D, B, and C connected by line segments DB, BC, and CD. These three segments and the points D, B, and C form a triangle. Final Answer: True
  • Answer w/o CoT: True
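
Responses like the ones above can be collected with any chat-with-images API. Here is a minimal sketch using the OpenAI Python SDK; it assumes an `OPENAI_API_KEY` is set, and the prompt wrapper and decoding defaults are illustrative, not necessarily the exact evaluation setup used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_with_image(image_path: str, question: str) -> str:
    """Send one image-plus-text question and return the model's reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (hypothetical file name):
# print(ask_with_image("triangle.png",
#                      "There is a triangle DBC in the figure. True or False?"))
```

For the "w/o CoT" setting, one would additionally instruct the model to output only the final answer, while the "w/ CoT" setting lets it reason before answering.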

Results of Existing LVLMs on VisOnlyQA

We evaluate 20 LVLMs, including GPT-4o and Gemini 1.5 Pro, on VisOnlyQA. Our results show that even state-of-the-art LVLMs still often cannot accurately perceive basic geometric information in images.

Full results are provided at the end of this page.

[Figure: radar chart of LVLM and human accuracy on VisOnlyQA]

Accuracy scores of LVLMs and humans on VisOnlyQA. Even state-of-the-art LVLMs, such as GPT-4o and Gemini 1.5 Pro, still struggle with visual perception of geometric information, and their performance falls far short of human performance.

Results of LVLMs fine-tuned on the Training Set of VisOnlyQA

We evaluate LVLMs fine-tuned on the training set of VisOnlyQA.

Positive results: All models achieve near-perfect accuracy on 3D-Size after fine-tuning, and models larger than 7B improve substantially even on the out-of-distribution Real figures in Geometry-Length and Geometry-Area. This partially supports our hypothesis that the training data of existing LVLMs is insufficient, and indicates that our approach of using synthetic training data has the potential to improve the capability of LVLMs to perceive geometric information.

Negative results: However, the fine-tuned models still fall well short of human performance, even on in-distribution figures. Specifically, fine-tuning barely improves performance on 3D-Angle, and we observe only small improvements on Geometry-Triangle, Quadrilateral, and Angle, even on in-distribution figures. This indicates that fine-tuning on datasets that require accurate perception of geometric information is not always effective, and that its effectiveness depends on the properties of the target tasks.


Accuracy scores of LVLMs fine-tuned on the training set of VisOnlyQA.
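
For readers who want to attempt similar adaptation, the sketch below shows one common fine-tuning recipe: LoRA adapters attached via the `peft` library. This is an assumed setup for illustration, not the paper's training configuration; the base model, target modules, and rank are all placeholder choices.

```python
# Illustrative LoRA fine-tuning sketch; NOT the paper's actual recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # one of the open LVLMs evaluated above
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Attach low-rank adapters to the attention projections; the module names and
# rank below are common defaults, not the authors' settings.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then minimize the usual next-token loss on the gold answers,
# e.g., with transformers.Trainer over processor-prepared (image, text) batches.
```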

LLMs of LVLMs Influence Their Geometric Perception

InternVL2-4B and InternVL2-8B share the same vision transformer (ViT) as their visual encoder, as do Qwen2-VL-2B and Qwen2-VL-7B, while the two models in each pair use different language models. We expected the visual encoders to play the major role in geometric perception, so that models sharing a ViT would perform similarly on VisOnlyQA, particularly after fine-tuning, since fine-tuning helps models understand the tasks and should further reduce the influence of the language models' reasoning capability.

However, there are performance gaps between LVLMs that share a ViT but use different language models, and the gaps grow after fine-tuning. This observation indicates that the language models of LVLMs affect their capability to perceive geometric information, and that the influence of the LLMs is not limited to reasoning or knowledge.

This result suggests that language models play a crucial role in processing the visual information encoded by the ViT, and that strong language models are needed even for tasks that do not require challenging reasoning or knowledge.

Larger language models improve the performance of LVLMs on VisOnlyQA-Eval when the visual encoder is held fixed.

| Model | ViT | LLM | Original (Real) | Original (Synthetic) | Fine-tuned (Real) | Fine-tuned (Synthetic) |
|---|---|---|---|---|---|---|
| InternVL2-4B | 304M | 3.8B | 38.4 | 34.1 | 46.0 | 57.7 |
| InternVL2-8B | 304M | 7.7B | 40.7 | 35.0 | 52.4 | 64.6 |
| Qwen2-VL-2B | 675M | 1.5B | 32.3 | 33.6 | 43.8 | 54.6 |
| Qwen2-VL-7B | 675M | 7.6B | 38.9 | 37.1 | 48.2 | 65.0 |
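
The scaling effect is easy to recompute from the table. The snippet below copies the averages verbatim and prints the 4B-to-8B and 2B-to-7B gaps; it introduces no new results, just arithmetic on the numbers above.

```python
# Recompute LLM-scaling gaps from the table above (values copied verbatim).
scores = {
    # model: (orig_real, orig_synth, finetuned_real, finetuned_synth)
    "InternVL2-4B": (38.4, 34.1, 46.0, 57.7),
    "InternVL2-8B": (40.7, 35.0, 52.4, 64.6),
    "Qwen2-VL-2B":  (32.3, 33.6, 43.8, 54.6),
    "Qwen2-VL-7B":  (38.9, 37.1, 48.2, 65.0),
}

pairs = [("InternVL2-4B", "InternVL2-8B"), ("Qwen2-VL-2B", "Qwen2-VL-7B")]
for small, large in pairs:
    gaps = [round(l - s, 1) for s, l in zip(scores[small], scores[large])]
    print(f"{small} -> {large}: original +{gaps[0]} (Real) / +{gaps[1]} (Synthetic), "
          f"fine-tuned +{gaps[2]} (Real) / +{gaps[3]} (Synthetic)")
```

On the Synthetic split, which is in-distribution for fine-tuning, both gaps widen after fine-tuning: from +0.9 to +6.9 points for InternVL2, and from +3.5 to +10.4 points for Qwen2-VL.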

Full Performance Tables of Existing LVLMs on VisOnlyQA

Accuracy scores on the Eval-Real subset (900 examples in total) of VisOnlyQA.

| Model | Triangle | Quadrilateral | Diameter | Length | Angle | Area | Shape (s) | Shape (m) | Extraction | Intersection | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 20.0 | 20.0 | 20.0 | 50.0 | 50.0 | 6.2 | 20.0 | 50.0 | 34.2 |
| Phi-3.5-vision | 48.0 | 50.0 | 17.0 | 17.0 | 27.0 | 50.0 | 54.0 | 10.0 | 29.0 | 50.0 | 35.6 |
| LLaVA-Next 8B | 50.0 | 50.0 | 16.0 | 15.0 | 26.0 | 49.0 | 42.0 | 4.0 | 22.0 | 49.0 | 33.3 |
| LLaVA-Next 34B | 49.0 | 50.0 | 30.0 | 15.0 | 22.0 | 44.0 | 34.0 | 10.0 | 35.0 | 50.0 | 35.2 |
| Llama 3.2 11B | 50.0 | 47.0 | 17.0 | 15.0 | 26.0 | 43.0 | 34.0 | 8.0 | 32.0 | 50.0 | 33.4 |
| Llama 3.2 90B | 51.0 | 46.0 | 14.0 | 28.0 | 27.0 | 48.0 | 60.0 | 20.0 | 35.0 | 45.0 | 37.1 |
| MolMo 7B-D | 49.0 | 45.0 | 20.0 | 11.0 | 23.0 | 56.0 | 40.0 | 12.0 | 31.0 | 48.0 | 34.3 |
| MolMo 72B | 44.0 | 47.0 | 22.0 | 25.0 | 33.0 | 50.0 | 48.0 | 30.0 | 46.0 | 52.0 | 39.8 |
| Qwen2-VL-2B | 43.0 | 44.0 | 15.0 | 19.0 | 26.0 | 47.0 | 38.0 | 12.0 | 27.0 | 45.0 | 32.3 |
| Qwen2-VL-7B | 50.0 | 50.0 | 23.0 | 19.0 | 34.0 | 46.0 | 46.0 | 16.0 | 45.0 | 52.0 | 38.9 |
| Qwen2-VL-72B | 44.0 | 52.0 | 27.0 | 27.0 | 37.0 | 61.0 | 56.0 | 36.0 | 53.0 | 53.0 | 44.4 |
| InternVL2-4B | 50.0 | 56.0 | 30.0 | 17.0 | 18.0 | 49.0 | 54.0 | 16.0 | 38.0 | 53.0 | 38.4 |
| InternVL2-8B | 44.0 | 36.0 | 29.0 | 30.0 | 27.0 | 56.0 | 50.0 | 22.0 | 52.0 | 56.0 | 40.7 |
| InternVL2-26B | 44.0 | 47.0 | 24.0 | 22.0 | 26.0 | 55.0 | 58.0 | 28.0 | 47.0 | 46.0 | 39.3 |
| InternVL2-40B | 43.0 | 45.0 | 32.0 | 23.0 | 31.0 | 57.0 | 28.0 | 30.0 | 61.0 | 58.0 | 42.1 |
| InternVL2-76B | 44.0 | 42.0 | 28.0 | 34.0 | 45.0 | 56.0 | 60.0 | 36.0 | 63.0 | 54.0 | 46.0 |
| Claude 3.5 Sonnet | 50.0 | 47.0 | 23.0 | 20.0 | 33.0 | 59.0 | 52.0 | 40.0 | 61.0 | 52.0 | 43.4 |
| GPT-4o-mini | 45.0 | 66.0 | 26.0 | 19.0 | 30.0 | 58.0 | 58.0 | 32.0 | 40.0 | 53.0 | 42.4 |
| GPT-4o | 58.0 | 48.0 | 27.0 | 34.0 | 38.0 | 69.0 | 72.0 | 50.0 | 46.0 | 58.0 | 48.8 |
| Gemini 1.5 Flash | 47.0 | 51.0 | 25.0 | 24.0 | 39.0 | 60.0 | 68.0 | 42.0 | 58.0 | 58.0 | 49.2 |
| Gemini 1.5 Pro | 47.0 | 53.0 | 33.0 | 40.0 | 53.0 | 70.0 | 62.0 | 52.0 | 67.0 | 53.0 | 52.6 |
| Human | 96.7 | 90.0 | 93.3 | 93.3 | 86.7 | 100.0 | 93.3 | 93.0 | 93.3 | 95.0 | 93.5 |

Accuracy scores on the Eval-Synthetic subset (700 examples in total) of VisOnlyQA.

| Model | Triangle | Quadrilateral | Length | Angle | Area | Size | Angle (3D) | Average |
|---|---|---|---|---|---|---|---|---|
| Random | 50.0 | 50.0 | 20.0 | 20.0 | 20.0 | 33.3 | 20.0 | 30.5 |
| Phi-3.5-vision | 54.0 | 55.0 | 15.0 | 22.0 | 21.0 | 39.0 | 20.0 | 32.3 |
| LLaVA-Next 8B | 50.0 | 50.0 | 17.0 | 21.0 | 19.0 | 26.0 | 19.0 | 28.9 |
| LLaVA-Next 34B | 51.0 | 50.0 | 25.0 | 24.0 | 20.0 | 48.0 | 32.0 | 35.7 |
| Llama 3.2 11B | 54.0 | 52.0 | 31.0 | 21.0 | 21.0 | 32.0 | 21.0 | 33.1 |
| Llama 3.2 90B | 61.0 | 56.0 | 12.0 | 16.0 | 20.0 | 45.0 | 26.0 | 33.7 |
| MolMo 7B-D | 49.0 | 56.0 | 22.0 | 20.0 | 14.0 | 29.0 | 27.0 | 31.0 |
| MolMo 72B | 51.0 | 55.0 | 23.0 | 22.0 | 18.0 | 50.0 | 27.0 | 35.1 |
| Qwen2-VL-2B | 50.0 | 50.0 | 31.0 | 23.0 | 20.0 | 38.0 | 23.0 | 33.6 |
| Qwen2-VL-7B | 58.0 | 59.0 | 24.0 | 18.0 | 22.0 | 58.0 | 21.0 | 37.1 |
| Qwen2-VL-72B | 51.0 | 56.0 | 33.0 | 21.0 | 26.0 | 76.0 | 27.0 | 41.4 |
| InternVL2-4B | 50.0 | 51.0 | 21.0 | 24.0 | 18.0 | 57.0 | 18.0 | 34.1 |
| InternVL2-8B | 51.0 | 57.0 | 21.0 | 17.0 | 23.0 | 46.0 | 30.0 | 35.0 |
| InternVL2-26B | 51.0 | 53.0 | 30.0 | 23.0 | 21.0 | 72.0 | 25.0 | 39.3 |
| InternVL2-40B | 51.0 | 54.0 | 30.0 | 23.0 | 21.0 | 69.0 | 25.0 | 39.0 |
| InternVL2-76B | 52.0 | 51.0 | 29.0 | 18.0 | 22.0 | 84.0 | 27.0 | 40.4 |
| Claude 3.5 Sonnet | 61.0 | 63.0 | 33.0 | 20.0 | 34.0 | 62.0 | 22.0 | 42.1 |
| GPT-4o-mini | 60.0 | 51.0 | 21.0 | 20.0 | 18.0 | 27.0 | 23.0 | 31.4 |
| GPT-4o | 66.0 | 56.0 | 25.0 | 17.0 | 26.0 | 60.0 | 23.0 | 39.0 |
| Gemini 1.5 Flash | 54.0 | 51.0 | 29.0 | 21.0 | 19.0 | 60.0 | 21.0 | 36.4 |
| Gemini 1.5 Pro | 54.0 | 57.0 | 34.0 | 21.0 | 40.0 | 69.0 | 22.0 | 42.4 |
| Human | 95.0 | 95.0 | 95.0 | 90.0 | 95.0 | 100.0 | 95.0 | 95.0 |

BibTeX

@article{kamoi2024visonlyqa,
  author = {Ryo Kamoi and Yusen Zhang and Sarkar Snigdha Sarathi Das and Ranran Haoran Zhang and Rui Zhang},
  journal = {arXiv preprint arXiv:2412.00947},
  title = {VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information},
  year = {2024}
}