EnglishAI Model Comparisons

youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text

We ran a live OCR benchmark for youtu-vita on eight image-understanding tasks, including documents, receipts, UI screenshots, rotated pages, scene text, and low-resolution small text. Here are the actual results, latency numbers, weak spots, and what they mean for production OCR workflows.

Crazyrouter Team

June 24, 2026 / 6 views

Crazyrouter

Read the docs Check live pricing Open image tool Create account

youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text#

If you are evaluating OCR-capable multimodal models for production work, the question is not just whether a model can read text in an image. The real question is whether it can do it consistently, with structured outputs, across the kinds of inputs teams actually send in production: screenshots, receipts, tables, rotated documents, scene text, and low-resolution UI captures.

We ran a live benchmark for youtu-vita through our OpenAI-compatible API path and scored it on a controlled OCR test set. This article shares the actual test data, what the model passed, where it struggled, and what kind of OCR workloads it looks good at right now.

Test setup#

This benchmark was run on June 24, 2026.

We used a local generated OCR benchmark set with these eight cases:

document_basic
receipt_total
ui_settings
table_statement
scene_text_signboard
rotated_document
low_res_small_text
chart_with_legend

The benchmark images were generated locally so the test would be stable and reproducible, rather than depending on external image hosts. Each request used the same OpenAI-compatible image input shape and the same JSON-only output instruction. The model was asked to:

transcribe visible text
answer structured OCR questions
return machine-readable JSON

Scoring method#

Each case was scored on four dimensions:

Format stability Did the model return valid structured JSON?
OCR text match Did it correctly capture the required visible text?
Regex-sensitive fields Did it preserve exact formats for fields like IDs or totals?
Structured answers Did it answer the requested key fields correctly?

The final score per case is a weighted total. A score of 1.000 means the model fully passed that case under this benchmark.

youtu-vita live benchmark results#

The full 8-case run completed successfully.

Headline metrics#

Success rate: 100%
Average total score: 0.875
p50 latency: 3794 ms
p90 latency: 4088 ms
Slowest case: 15284 ms

Case-by-case results#

Case ID	Category	Status	Score	Latency
`document_basic`	Document OCR	200	`1.000`	`3852 ms`
`receipt_total`	Receipt OCR	200	`1.000`	`3616 ms`
`ui_settings`	UI screenshot OCR	200	`1.000`	`3794 ms`
`table_statement`	Table OCR	200	`1.000`	`3434 ms`
`scene_text_signboard`	Scene text	200	`1.000`	`1661 ms`
`rotated_document`	Rotated document	200	`1.000`	`2908 ms`
`low_res_small_text`	Small text / low resolution	200	`1.000`	`4088 ms`
`chart_with_legend`	Chart reasoning	200	`0.000`	`15284 ms`

What youtu-vita did well#

For this run, youtu-vita was strong on the OCR tasks most teams care about first:

1. Clean document OCR#

It correctly extracted:

title
date
document ID
paragraph text

On document_basic, it returned a full structured transcription and correctly answered:

title = Quarterly Operations Summary
date = 2026-06-24
document_id = AB-123456

2. Receipt reading#

It handled receipt-style layout correctly on receipt_total, including:

merchant name
total amount
receipt number

That matters because receipt OCR often breaks on spacing, alignment, or repeated numeric fields. In this run, youtu-vita passed the receipt case with a full score.

3. UI screenshot OCR#

On ui_settings, it correctly captured:

page title
button labels
error code
supporting text

It also returned the structured answers we asked for:

primary_cta = Continue
error_code = E102

That makes it promising for support automation, QA workflows, screen parsing, and screenshot-based extraction tasks.

4. Table OCR#

On table_statement, the model passed the table case with a full 1.000 score in this run.

That is important because table OCR is often where vision models look good at plain text but fail at row-column alignment. In this benchmark, youtu-vita handled the table extraction cleanly enough to pass both the visible text requirements and the structured answer checks.

5. Rotated documents#

On rotated_document, the model also scored 1.000.

That suggests it is not limited to perfectly upright scanned pages. If your OCR workflow includes phone photos, skewed uploads, or documents captured in the wild, this is a meaningful result.

6. Low-resolution small text#

One of the most practically useful passes in this run was low_res_small_text, which also scored 1.000.

That case is closer to real dashboard and UI OCR than a clean printed PDF. If you need to read release notes, settings screens, logs, or admin panels from screenshots, this is a positive signal.

Where youtu-vita was weak#

The weak spot in this run was not standard OCR. It was chart reasoning.

On chart_with_legend, youtu-vita returned HTTP 200 but scored 0.000. It also took much longer than the rest of the test set at 15284 ms.

That tells us two things:

The model can complete the request, but this benchmark did not show reliable performance on chart interpretation.
OCR and chart understanding should be treated as separate capabilities.

This matters because many teams group all “image understanding” into one bucket. That is too coarse. A model can be strong at:

OCR
receipt parsing
screenshot reading
text extraction

and still be weak at:

chart reasoning
visual analytics
higher-order graph interpretation

Practical interpretation#

Based on this live run, youtu-vita looks strongest for these workloads:

Good fit#

document OCR
receipt OCR
UI screenshot extraction
rotated page reading
small-text screenshot parsing
sign and scene text extraction

Not yet a strong conclusion#

chart understanding
graph interpretation
analytics-style visual reasoning

If your workload is mostly “read the text, extract the fields, give me clean JSON,” this benchmark suggests youtu-vita is already useful.

If your workload is “understand a chart, infer trends, compare the latest month, and reason visually,” this benchmark does not support calling it strong there yet.

Why this matters for production teams#

Many OCR evaluations are too soft. They say a model is “good at image understanding” after a single logo or document test. That does not help if you need to decide whether to route:

support screenshots
invoices
receipts
internal ops dashboards
phone photos of documents

into a production OCR pipeline.

This benchmark is more useful because it separates:

OCR stability
structured extraction
small-text robustness
rotated-input handling
chart reasoning

For this run, youtu-vita was clearly stable across the OCR-heavy categories.

Final verdict#

For this live benchmark, youtu-vita was the most stable OCR model we tested in this run.

The actual results were:

100% request success across the full 8-case OCR set
0.875 average total score
1.000 on 7 of 8 OCR-oriented cases
failure only on the chart reasoning case

That makes youtu-vita a strong candidate if your main need is text extraction from images, especially for:

business documents
receipts
UI screenshots
low-resolution text
rotated pages

It does not mean it is the best choice for every kind of vision workload. But if your problem is OCR, not visual analytics, this is one of the clearest positive live results we have seen in this benchmark.