6 Vision API Models Tested: Gemini 2.5, GPT-4.1, and Qwen3 VL for Image Understanding

A practical benchmark of Gemini 2.5 Flash, Gemini 2.5 Flash Lite, GPT-4.1 Mini, GPT-4.1 Nano, Qwen3 VL Flash, and Qwen3 VL Plus for image understanding APIs, covering accuracy, latency, cost per successful image, usage signals, failure modes, and production routing advice.

Crazyrouter Team

June 22, 2026

If you are building image understanding into a product, the phrase “supports images” is not enough.

A model page can say that a model supports vision input, but your production route still has to answer harder questions:

Does the OpenAI-compatible image_url payload actually reach the model?
Does HTTP 200 mean the model really saw the image?
Which route is fast enough for user-facing image uploads?
Which route is cheap enough for bulk image classification?
Which route should be used as fallback when the first model fails?
Can usage metadata reveal a broken media path?

To make the comparison more practical, I tested six vision-capable models through the same OpenAI-compatible API shape:

gemini-2.5-flash
gemini-2.5-flash-lite
gpt-4.1-mini
gpt-4.1-nano
qwen3-vl-flash
qwen3-vl-plus

The goal is not to crown a universal winner. The useful question is: which model should you route to for a specific user workflow?

Test setup#

All tests used the Crazyrouter OpenAI-compatible Base URL:

text

https://cn.crazyrouter.com/v1

The request format was chat/completions, with the image passed through messages[].content[] as an image_url object.

Each model was tested on two stable public images:

Python logo
GitHub logo

Each image was run three times per model, so every model had six requests in total.
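The test matrix described above can be sketched as follows. This is a minimal illustration, not the harness used for this article: the image labels and the `build_test_matrix` helper are assumptions introduced here, and the real image URLs are whatever stable public images you choose.

```python
from itertools import product

MODELS = [
    "gemini-2.5-flash",
    "gemini-2.5-flash-lite",
    "gpt-4.1-mini",
    "gpt-4.1-nano",
    "qwen3-vl-flash",
    "qwen3-vl-plus",
]
# Hypothetical labels; map them to your own image URLs.
IMAGES = ["python-logo", "github-logo"]
RUNS_PER_IMAGE = 3


def build_test_matrix(models=MODELS, images=IMAGES, runs=RUNS_PER_IMAGE):
    """Return one (model, image, run_index) tuple per request to send."""
    return list(product(models, images, range(1, runs + 1)))


matrix = build_test_matrix()
requests_per_model = sum(1 for model, _, _ in matrix if model == "gpt-4.1-mini")
print(len(matrix), requests_per_model)
```

Enumerating the cases up front makes it easy to confirm that every model receives the same number of requests before any API call is made.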

Test time: 2026-06-21T13:36:32Z.

This is a vision API smoke test. It is useful for checking whether the image_url route works and whether the model can perform simple visual recognition. It is not a full OCR, chart reasoning, document extraction, handwriting, or medical-image benchmark.

Quick recommendation#

Based on this run:

Real-time user uploads / lowest latency: use gpt-4.1-mini
Bulk logo, icon, or simple image classification: use qwen3-vl-flash
Low-cost Gemini route: consider gemini-2.5-flash-lite
Low-cost OpenAI-family route: consider gpt-4.1-nano
Quality-oriented Qwen VL route: use qwen3-vl-plus as an upgrade path
Not recommended as the default image_url vision route based on this run: gemini-2.5-flash

The most important finding is simple:

HTTP 200 does not prove image understanding succeeded.

In this test, gemini-2.5-flash returned HTTP success for all six requests, but its visual recognition score was 0/6. Its outputs included a “no image provided” reply, a misidentification as the CBC logo, and descriptions of unrelated objects.

That is the dangerous failure mode: the API call appears successful, but the model did not correctly process the image.
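One way to catch this failure mode is to grade the reply text instead of trusting the status code. The sketch below is an assumption about how such a grader could look, not the scoring code used for this article: the `grade_reply` function and the `NO_IMAGE_PHRASES` list are hypothetical, and simple keyword matching will misjudge some replies.

```python
# Hypothetical phrases that suggest the model never received the image.
NO_IMAGE_PHRASES = ("no image", "cannot see", "can't see", "unable to view")


def grade_reply(status_code, reply_text, expected_keywords):
    """Classify one smoke-test reply as http_error, no_image, correct, or wrong."""
    if status_code != 200:
        return "http_error"
    text = (reply_text or "").lower()
    # Check for no-image replies first: they indicate a transport problem.
    if any(phrase in text for phrase in NO_IMAGE_PHRASES):
        return "no_image"
    if any(keyword.lower() in text for keyword in expected_keywords):
        return "correct"
    return "wrong"


print(grade_reply(200, "This is the Python logo.", ["python"]))
print(grade_reply(200, "No image provided.", ["python"]))
```

A route whose requests all return 200 but grade as `no_image` or `wrong` should be treated as unhealthy, which is the situation gemini-2.5-flash was in during this run.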

Overall results#

Model	HTTP success	Correct recognition	No-image replies	Avg latency	Median latency	Slowest request	Input price / 1M tokens	Output price / 1M tokens	Estimated cost / 10k test-style calls	Positioning
`qwen3-vl-flash`	6/6	6/6	0	3.819s	3.493s	5.975s	$0.05	$0.40	$0.0915	Best low-cost route for bulk recognition
`gpt-4.1-mini`	6/6	6/6	0	1.491s	1.292s	2.189s	$0.26	$1.04	$0.5226	Best low-latency route for user-facing features
`gpt-4.1-nano`	6/6	6/6	0	2.863s	2.562s	4.213s	$0.065	$0.26	$0.1666	Low-cost OpenAI-family route
`qwen3-vl-plus`	6/6	6/6	0	3.859s	3.729s	4.821s	$0.1429	$1.4286	$0.3848	Quality-oriented Qwen VL upgrade route
`gemini-2.5-flash`	6/6	0/6	1	4.965s	4.333s	9.507s	$0.17	$0.68	$0.6168	image_url path failed in this run
`gemini-2.5-flash-lite`	6/6	6/6	0	2.618s	2.627s	4.195s	$0.055	$0.22	$0.5466	Low-cost Gemini lightweight route

The estimated 10k-call cost is based on observed usage in this simple logo recognition test. It should not be treated as universal pricing for all image workloads. Larger images, OCR tasks, long descriptions, and multi-image prompts can change token usage significantly.

The useful production metric is not just model price. It is cost per successful image task.

A cheap route that frequently needs retries or fallback can be more expensive than a more reliable route.
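The arithmetic behind cost per successful image can be sketched as below. The token counts and the 90% success rate are hypothetical values chosen for illustration; only the per-million-token prices come from the results table, and both helper functions are names introduced here.

```python
import math


def estimate_call_cost(prompt_tokens, completion_tokens,
                       input_price_per_m, output_price_per_m):
    """Cost in dollars of one call, given prices per 1M tokens."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000


def cost_per_successful_task(total_cost, successful_tasks):
    """Total spend divided by successes; infinite if nothing succeeded."""
    if successful_tasks == 0:
        return math.inf
    return total_cost / successful_tasks


# Hypothetical usage: 100 prompt tokens and 10 completion tokens per call,
# priced at the qwen3-vl-flash rates from the table ($0.05 in, $0.40 out).
per_call = estimate_call_cost(100, 10, 0.05, 0.40)

# Hypothetical batch: 10,000 calls are paid for, but only 9,000 succeed.
print(cost_per_successful_task(per_call * 10_000, 9_000))

# A route with zero correct results has no meaningful cost per success.
print(cost_per_successful_task(0.6168, 0))
```

Returning infinity for a route with no successes keeps it at the bottom of any cost ranking, regardless of how low its list price is.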

Accuracy: five models passed, one model failed#

Correct recognition in this smoke test:

qwen3-vl-flash: 6/6
gpt-4.1-mini: 6/6
gpt-4.1-nano: 6/6
qwen3-vl-plus: 6/6
gemini-2.5-flash-lite: 6/6
gemini-2.5-flash: 0/6

For simple logo and icon recognition, five of the six routes worked correctly. This suggests that lightweight models are often sufficient for basic logo and icon recognition, although two images and six requests per model is a small sample.

But gemini-2.5-flash is the important cautionary example: HTTP success does not mean the image path is healthy.

Latency: GPT-4.1 Mini was the fastest#

Average latency from lowest to highest:

gpt-4.1-mini: avg 1.491s, median 1.292s, slowest 2.189s
gemini-2.5-flash-lite: avg 2.618s, median 2.627s, slowest 4.195s
gpt-4.1-nano: avg 2.863s, median 2.562s, slowest 4.213s
qwen3-vl-flash: avg 3.819s, median 3.493s, slowest 5.975s
qwen3-vl-plus: avg 3.859s, median 3.729s, slowest 4.821s
gemini-2.5-flash: avg 4.965s, median 4.333s, slowest 9.507s

For user-facing features, latency is part of product quality. If the user uploads an image and waits for a response, one or two seconds can matter.

For those workflows, gpt-4.1-mini is the strongest default route in this run.
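If you want to collect the same three latency figures for your own routes, a minimal sketch looks like this. The `timed_call` and `summarize_latencies` helpers are assumptions introduced here, and the sample values are made up for illustration rather than taken from the test run.

```python
import statistics
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize_latencies(samples):
    """Return average, median, and slowest latency for a list of seconds."""
    if not samples:
        raise ValueError("need at least one latency sample")
    return {
        "avg": statistics.mean(samples),
        "median": statistics.median(samples),
        "slowest": max(samples),
    }


# Made-up samples, in seconds.
print(summarize_latencies([1.0, 2.0, 6.0]))
```

Reporting the median and the slowest request alongside the average matters with only six samples per model, because a single slow request can pull the average up noticeably.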

Cost: Qwen3 VL Flash was the cheapest successful route#

Estimated cost for 10,000 test-style calls:

qwen3-vl-flash: about $0.0915
gpt-4.1-nano: about $0.1666
qwen3-vl-plus: about $0.3848
gpt-4.1-mini: about $0.5226
gemini-2.5-flash-lite: about $0.5466
gemini-2.5-flash: about $0.6168

For high-volume tasks such as logo detection, icon classification, screenshot pre-filtering, or dataset tagging, qwen3-vl-flash is the strongest low-cost candidate.

The key is that it was not only cheap. It also passed the visual recognition smoke test.
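The selection logic (filter to routes that passed, then rank by cost or latency) can be sketched as follows. The figures are copied from the results table above; the dictionary layout and the helper names are assumptions made for this example.

```python
# Figures copied from the overall results table for this run.
RESULTS = {
    "qwen3-vl-flash": {"correct": 6, "total": 6, "avg_latency_s": 3.819, "cost_per_10k": 0.0915},
    "gpt-4.1-mini": {"correct": 6, "total": 6, "avg_latency_s": 1.491, "cost_per_10k": 0.5226},
    "gpt-4.1-nano": {"correct": 6, "total": 6, "avg_latency_s": 2.863, "cost_per_10k": 0.1666},
    "qwen3-vl-plus": {"correct": 6, "total": 6, "avg_latency_s": 3.859, "cost_per_10k": 0.3848},
    "gemini-2.5-flash": {"correct": 0, "total": 6, "avg_latency_s": 4.965, "cost_per_10k": 0.6168},
    "gemini-2.5-flash-lite": {"correct": 6, "total": 6, "avg_latency_s": 2.618, "cost_per_10k": 0.5466},
}


def passing_routes(results):
    """Keep only routes that recognized every test image correctly."""
    return {name: r for name, r in results.items() if r["correct"] == r["total"]}


def best_by(results, field):
    """Return the route name with the lowest value for the given field."""
    return min(results, key=lambda name: results[name][field])


passed = passing_routes(RESULTS)
print(best_by(passed, "cost_per_10k"))   # qwen3-vl-flash
print(best_by(passed, "avg_latency_s"))  # gpt-4.1-mini
```

Filtering on correctness before ranking is the important step: it prevents a route that failed the smoke test from being selected on price or speed alone.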

Model-by-model notes#

GPT-4.1 Mini: best for real-time interactions#

gpt-4.1-mini had the lowest average latency and passed 6/6 recognition.

Use it for:

user image uploads
support screenshot analysis
chat apps with image input
latency-sensitive agent workflows

The tradeoff is cost. It is not the cheapest route, so it should not automatically be used for every bulk image task.

Qwen3 VL Flash: best for bulk low-cost recognition#

qwen3-vl-flash passed 6/6 recognition and had the lowest estimated cost in this test.

Use it for:

bulk logo recognition
icon detection
simple image classification
screenshot pre-classification
high-volume visual tagging

It is slower than gpt-4.1-mini, but for batch workloads that may be acceptable.

Gemini 2.5 Flash Lite: usable low-cost Gemini route#

gemini-2.5-flash-lite passed 6/6 recognition and had acceptable latency.

It is a reasonable candidate if you want a Gemini-family backup route. However, its usage metadata was less straightforward to interpret than on the Qwen route, so keep a visual smoke test in production.

GPT-4.1 Nano: low-cost OpenAI-family backup#

gpt-4.1-nano passed 6/6 recognition and is much cheaper than gpt-4.1-mini.

Use it for simple visual tags and lightweight classification. Do not assume it is the right route for complex document understanding, OCR, or deep visual reasoning.

Qwen3 VL Plus: quality-oriented Qwen upgrade route#

qwen3-vl-plus passed the test, but its latency and output price are higher than those of qwen3-vl-flash.

It is better treated as an upgrade route when qwen3-vl-flash is not enough, not as the default for every simple logo recognition task.

Gemini 2.5 Flash: do not use as default in this image_url route#

This was the problematic route in the test.

Results:

HTTP success: 6/6
Correct recognition: 0/6
No-image reply: 1
Incorrect unrelated outputs
Suspicious usage/image-token signals

This does not necessarily prove the model itself is incapable. It may indicate an adapter, media-fetch, payload-conversion, or upstream routing issue in this specific image_url path.

But for production, the conclusion is the same: do not use it as the default vision route until your own smoke test confirms that image handling is fixed.

Scenario-based routing advice#

Scenario	Default route	Fallback	Why
Real-time user image uploads	`gpt-4.1-mini`	`qwen3-vl-flash` or `gemini-2.5-flash-lite`	prioritize latency and reliability
Bulk logo or icon recognition	`qwen3-vl-flash`	`gpt-4.1-nano`	lowest cost among successful routes
Simple screenshot classification	`qwen3-vl-flash` or `gpt-4.1-nano`	`gpt-4.1-mini`	start cheap, upgrade hard cases
Support screenshot analysis	`gpt-4.1-mini`	`qwen3-vl-plus`	user-facing latency matters
OCR or document pre-filtering	separate benchmark required	stronger OCR/document model	logo tests do not prove OCR quality
Agent visual input	`gpt-4.1-mini` or `qwen3-vl-flash`	forced smoke test plus fallback	agents amplify bad visual inputs
Gemini backup route	`gemini-2.5-flash-lite`	`gpt-4.1-nano`	Flash Lite worked; Flash failed in this run
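Part of the table above can be expressed as an ordered fallback chain per scenario. This is a sketch under stated assumptions: the scenario keys, the `call_model` callable, and the `is_valid` check are all hypothetical placeholders for your own client code and reply validation.

```python
# Ordered routes per scenario, taken from the table; keys are hypothetical names.
ROUTES = {
    "realtime_upload": ["gpt-4.1-mini", "qwen3-vl-flash", "gemini-2.5-flash-lite"],
    "bulk_logo": ["qwen3-vl-flash", "gpt-4.1-nano"],
    "support_screenshot": ["gpt-4.1-mini", "qwen3-vl-plus"],
    "gemini_backup": ["gemini-2.5-flash-lite", "gpt-4.1-nano"],
}


def call_with_fallback(scenario, call_model, is_valid):
    """Try each route in order; return (model, reply) for the first valid reply."""
    failures = []
    for model in ROUTES[scenario]:
        try:
            reply = call_model(model)
        except Exception as exc:  # network or API error: move to the next route
            failures.append((model, repr(exc)))
            continue
        if is_valid(reply):
            return model, reply
        failures.append((model, "reply failed validation"))
    raise RuntimeError(f"all routes failed for {scenario!r}: {failures}")


# Stand-in for a real API call: the first route behaves as if it saw no image.
def fake_call(model):
    return "No image provided." if model == "gpt-4.1-mini" else "Python logo"


print(call_with_fallback("realtime_upload", fake_call, lambda r: "python" in r.lower()))
```

Validating the reply content, not just catching exceptions, is what lets the chain fall back when a route returns HTTP 200 without having processed the image.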

Why usage signals matter#

Many image benchmarks only check the output text. In production, usage metadata can also be a health signal.

If a request returns HTTP 200 but:

prompt tokens look like only the text prompt arrived
image token fields are zero or missing
the model says “no image provided”
the answer is unrelated to the image

then the issue may be in the image transport path rather than the model itself.
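These checks can be combined into a small helper, sketched below. The field names are assumptions: `prompt_tokens` follows the OpenAI-compatible usage object, while `prompt_tokens_details.image_tokens` is a hypothetical field that not every route reports. The text-only baseline is the prompt token count you measure by sending the same prompt without an image.

```python
def usage_warnings(usage, text_only_prompt_tokens, reply_text=""):
    """Return a list of warning strings for one response's usage metadata."""
    warnings = []

    prompt_tokens = usage.get("prompt_tokens")
    if prompt_tokens is None:
        warnings.append("prompt_tokens missing")
    elif prompt_tokens <= text_only_prompt_tokens:
        # The image appears to have added no tokens to the prompt.
        warnings.append("prompt_tokens no higher than text-only baseline")

    # Hypothetical field name; only checked when the route reports it.
    details = usage.get("prompt_tokens_details") or {}
    if "image_tokens" in details and not details["image_tokens"]:
        warnings.append("image_tokens is zero")

    if "no image" in (reply_text or "").lower():
        warnings.append("model reported no image")

    return warnings


print(usage_warnings({"prompt_tokens": 15}, 15, "No image provided."))
```

An empty list does not prove the image was processed correctly; it only means none of these particular signals fired, so the check complements a visual smoke test instead of replacing it.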

Possible causes include:

image_url not forwarded correctly
gateway media fetch failure
base64 or inline conversion failure
OpenAI-compatible payload converted incorrectly
upstream accepting the request but ignoring image content
token accounting not matching actual media processing

For vision routes, text-only health checks are not enough. Use visual smoke tests.

API example#

python

from openai import OpenAI

# Point the official OpenAI SDK at the OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://cn.crazyrouter.com/v1"
)

response = client.chat.completions.create(
    model="qwen3-vl-flash",
    messages=[{
        "role": "user",
        # Mixed content: one text part and one image part.
        "content": [
            {"type": "text", "text": "Identify the main logo or object in this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://raw.githubusercontent.com/github/explore/main/topics/python/python.png",
                    "detail": "low"
                }
            }
        ]
    }],
    max_tokens=40,   # short answers are enough for simple recognition
    temperature=0,
)

# Inspect the reply text; HTTP success alone does not prove the image was seen.
print(response.choices[0].message.content)

Do not add UTM parameters to API endpoints. UTM parameters belong in human-facing links, not in SDK base_url values.
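If base URLs in your configuration are sometimes copied from marketing links, a small guard can strip them before they reach the SDK. This is a minimal sketch using the standard library; the `clean_base_url` helper is an assumption introduced here, and it removes the entire query string and fragment, not only UTM parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_base_url(url):
    """Drop any query string and fragment, and any trailing slash, from a base URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


print(clean_base_url("https://cn.crazyrouter.com/v1?utm_source=blog"))
```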

Final takeaway#

Vision API selection should be based on the user workflow, not just the model name.

For real-time interactions, optimize for correct recognition plus low latency.
For bulk classification, optimize for cost per successful image.
For agents, optimize for reliability, monitoring, and fallback behavior.
For OCR and document understanding, run a separate benchmark with real documents.

My practical ranking from this run: