youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text
We ran a live OCR benchmark for youtu-vita on eight image-understanding tasks, including documents, receipts, UI screenshots, rotated pages, scene text, and low-resolution small text. Here are the actual results, latency numbers, weak spots, and what they mean for production OCR workflows.
youtu-vita OCR Benchmark 2026: Live Test Results on Documents, Receipts, UI Screens, and Small Text#
If you are evaluating OCR-capable multimodal models for production work, the question is not just whether a model can read text in an image. The real question is whether it can do it consistently, with structured outputs, across the kinds of inputs teams actually send in production: screenshots, receipts, tables, rotated documents, scene text, and low-resolution UI captures.
We ran a live benchmark for youtu-vita through our OpenAI-compatible API path and scored it on a controlled OCR test set. This article shares the actual test data, what the model passed, where it struggled, and what kind of OCR workloads it looks good at right now.
Test setup#
This benchmark was run on June 24, 2026.
We used a local generated OCR benchmark set with these eight cases:
document_basicreceipt_totalui_settingstable_statementscene_text_signboardrotated_documentlow_res_small_textchart_with_legend
The benchmark images were generated locally so the test would be stable and reproducible, rather than depending on external image hosts. Each request used the same OpenAI-compatible image input shape and the same JSON-only output instruction. The model was asked to:
- transcribe visible text
- answer structured OCR questions
- return machine-readable JSON
Scoring method#
Each case was scored on four dimensions:
- Format stability Did the model return valid structured JSON?
- OCR text match Did it correctly capture the required visible text?
- Regex-sensitive fields Did it preserve exact formats for fields like IDs or totals?
- Structured answers Did it answer the requested key fields correctly?
The final score per case is a weighted total. A score of 1.000 means the model fully passed that case under this benchmark.
youtu-vita live benchmark results#
The full 8-case run completed successfully.
Headline metrics#
- Success rate:
100% - Average total score:
0.875 - p50 latency:
3794 ms - p90 latency:
4088 ms - Slowest case:
15284 ms
Case-by-case results#
| Case ID | Category | Status | Score | Latency |
|---|---|---|---|---|
document_basic | Document OCR | 200 | 1.000 | 3852 ms |
receipt_total | Receipt OCR | 200 | 1.000 | 3616 ms |
ui_settings | UI screenshot OCR | 200 | 1.000 | 3794 ms |
table_statement | Table OCR | 200 | 1.000 | 3434 ms |
scene_text_signboard | Scene text | 200 | 1.000 | 1661 ms |
rotated_document | Rotated document | 200 | 1.000 | 2908 ms |
low_res_small_text | Small text / low resolution | 200 | 1.000 | 4088 ms |
chart_with_legend | Chart reasoning | 200 | 0.000 | 15284 ms |
What youtu-vita did well#
For this run, youtu-vita was strong on the OCR tasks most teams care about first:
1. Clean document OCR#
It correctly extracted:
- title
- date
- document ID
- paragraph text
On document_basic, it returned a full structured transcription and correctly answered:
title = Quarterly Operations Summarydate = 2026-06-24document_id = AB-123456
2. Receipt reading#
It handled receipt-style layout correctly on receipt_total, including:
- merchant name
- total amount
- receipt number
That matters because receipt OCR often breaks on spacing, alignment, or repeated numeric fields. In this run, youtu-vita passed the receipt case with a full score.
3. UI screenshot OCR#
On ui_settings, it correctly captured:
- page title
- button labels
- error code
- supporting text
It also returned the structured answers we asked for:
primary_cta = Continueerror_code = E102
That makes it promising for support automation, QA workflows, screen parsing, and screenshot-based extraction tasks.
4. Table OCR#
On table_statement, the model passed the table case with a full 1.000 score in this run.
That is important because table OCR is often where vision models look good at plain text but fail at row-column alignment. In this benchmark, youtu-vita handled the table extraction cleanly enough to pass both the visible text requirements and the structured answer checks.
5. Rotated documents#
On rotated_document, the model also scored 1.000.
That suggests it is not limited to perfectly upright scanned pages. If your OCR workflow includes phone photos, skewed uploads, or documents captured in the wild, this is a meaningful result.
6. Low-resolution small text#
One of the most practically useful passes in this run was low_res_small_text, which also scored 1.000.
That case is closer to real dashboard and UI OCR than a clean printed PDF. If you need to read release notes, settings screens, logs, or admin panels from screenshots, this is a positive signal.
Where youtu-vita was weak#
The weak spot in this run was not standard OCR. It was chart reasoning.
On chart_with_legend, youtu-vita returned HTTP 200 but scored 0.000. It also took much longer than the rest of the test set at 15284 ms.
That tells us two things:
- The model can complete the request, but this benchmark did not show reliable performance on chart interpretation.
- OCR and chart understanding should be treated as separate capabilities.
This matters because many teams group all “image understanding” into one bucket. That is too coarse. A model can be strong at:
- OCR
- receipt parsing
- screenshot reading
- text extraction
and still be weak at:
- chart reasoning
- visual analytics
- higher-order graph interpretation
Practical interpretation#
Based on this live run, youtu-vita looks strongest for these workloads:
Good fit#
- document OCR
- receipt OCR
- UI screenshot extraction
- rotated page reading
- small-text screenshot parsing
- sign and scene text extraction
Not yet a strong conclusion#
- chart understanding
- graph interpretation
- analytics-style visual reasoning
If your workload is mostly “read the text, extract the fields, give me clean JSON,” this benchmark suggests youtu-vita is already useful.
If your workload is “understand a chart, infer trends, compare the latest month, and reason visually,” this benchmark does not support calling it strong there yet.
Why this matters for production teams#
Many OCR evaluations are too soft. They say a model is “good at image understanding” after a single logo or document test. That does not help if you need to decide whether to route:
- support screenshots
- invoices
- receipts
- internal ops dashboards
- phone photos of documents
into a production OCR pipeline.
This benchmark is more useful because it separates:
- OCR stability
- structured extraction
- small-text robustness
- rotated-input handling
- chart reasoning
For this run, youtu-vita was clearly stable across the OCR-heavy categories.
Final verdict#
For this live benchmark, youtu-vita was the most stable OCR model we tested in this run.
The actual results were:
100%request success across the full 8-case OCR set0.875average total score1.000on 7 of 8 OCR-oriented cases- failure only on the chart reasoning case
That makes youtu-vita a strong candidate if your main need is text extraction from images, especially for:
- business documents
- receipts
- UI screenshots
- low-resolution text
- rotated pages
It does not mean it is the best choice for every kind of vision workload. But if your problem is OCR, not visual analytics, this is one of the clearest positive live results we have seen in this benchmark.





