From AI Research to Production: Optimizing Multimodal Retrieval, OCR, and System Design

May 5, 2026
17 min read

Competition leaderboards reward correctness on a fixed benchmark; production rewards behavior under inputs you never designed for. The three systems below were all built for academic deadlines — a national AI challenge, an ACM Multimedia shared task, and an NLP coursework project — yet each one forced an engineering decision that mattered far beyond the contest scoreboard.

This post unpacks those decisions, grounded strictly in what we wrote down in the papers and report, not in retrospective storytelling. The goal is to make visible the gap between what wins a competition and what would survive a production traffic spike — and to show, paper by paper, the techniques that close it.

Summary (What you will read)
  • KPTER — A dual-layer retrieval system whose K-pointer algorithm replaces O(N^K) nested loops with O(K·N) for temporal multi-event search. SOICT 2025.
  • ZSE-Cap — A zero-shot ensemble (CLIP + SigLIP + DINOv2) reaching mAP 0.966 on a 400K-image corpus and a 130× CIDEr lift via prompt-engineered Gemma 3. Top-4 @ EVENTA / ACM Multimedia 2025.
  • Sino-Nom OCR — Fine-tuning PP-OCRv5 on degraded historical manuscripts: H-mean 0.731 → 0.952 with knowledge-distilled PP-HGNetV2_B4 and OHEM 3:1.
  • Lessons Learned — Three rules I now apply before training any new model.

1. KPTER — K-Pointer for Temporal Event Retrieval

The Ho Chi Minh AI Challenge (AIC) 2025 introduced the TRAKE task — Temporal Retrieval and Alignment of Key Events. Given an ordered chain of K text queries, find the video segment (t_s, t_e) that contains all K events in that exact temporal order. Standard keyframe retrieval cannot solve this. A single frame is a snapshot; TRAKE wants a story.

Our team WATLERE scored 82/88 in the preliminary round and advanced to the final, where the jury board’s tentative evaluation rated the overall system Excellent. Per-task: Outstanding on Textual Known-Item Search, Excellent on Visual KIS and VQA, Good on TRAKE.

Three-stage keyframe pipeline

Before any retrieval can happen, you have to extract a useful set of frames from a long video without drowning your vector index. The pipeline chains three filters:

  1. Shot-boundary detection with TransNetV2, threshold lowered from 0.5 to 0.1 to preserve subtle scene transitions.
  2. Low-information filtering — drop frames whose grayscale standard deviation falls below a fixed threshold (black screens, fades, static content).
  3. Feature-based selection — encode each remaining frame with DINOv2 and keep a candidate f_current only if its relative feature change from the last selected keyframe f_prev exceeds T = 0.5:
\text{feature\_diff} = \frac{\lVert f_{\text{current}} - f_{\text{prev}} \rVert}{\lVert f_{\text{prev}} \rVert}

This is much stronger than fixed temporal sampling: a 30-second product unboxing keeps far more frames than a 30-second talking-head shot, exactly as it should.
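
To make the third filter concrete, here is a minimal sketch, assuming the DINOv2 embeddings have already been computed (one vector per candidate frame) and that the first frame of each shot is always kept:

import numpy as np

def select_keyframes(features: list[np.ndarray], threshold: float = 0.5) -> list[int]:
    """Keep a frame only when its relative feature change from the last
    selected keyframe exceeds the threshold T (sketch; `features` are assumed
    to be precomputed DINOv2 embeddings for one shot)."""
    if not features:
        return []
    kept = [0]                      # assumption: always keep the shot's first frame
    f_prev = features[0]
    for i, f_cur in enumerate(features[1:], start=1):
        feature_diff = np.linalg.norm(f_cur - f_prev) / (np.linalg.norm(f_prev) + 1e-8)
        if feature_diff > threshold:
            kept.append(i)
            f_prev = f_cur          # the newly kept frame becomes the reference
    return kept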

Multi-modal Search — what dual encoders actually buy you

For discrete queries (KIS, VQA), four parallel retrieval engines run concurrently and a final ranker fuses them:

  • CLIP H-14 (DFN5B-CLIP-ViT-H-14) — strong zero-shot, open-domain visual-semantic alignment.
  • BEiT-3 (beit3_large_patch16_384_coco_retrieval) — Multiway Transformer fine-tuned on COCO; better at compositional, object-rich queries.
  • OCR index in Elasticsearch — PaddleOCR + VietOCR for standard text, Qwen-2.5-VL for stylized/curved text.
  • ASR index in Elasticsearch — Whisper JAX (v3.8) on TPU, gated by Voice Activity Detection.

Why two visual encoders? CLIP H-14 is great for open-domain queries — “a girl being interviewed.” BEiT-3 (fine-tuned on COCO) is much better at compositional queries — “two strips of watermelon, one strip of pineapple, on a sandwich.” Either one alone underperforms on the queries the other was designed for.

Their ranked outputs, plus OCR and ASR results from Elasticsearch, are unified by Weighted Reciprocal Rank Fusion:

\text{Score}(d) = \sum_{i \in \{\text{vector}, \text{OCR}, \text{ASR}\}} w_i \cdot \frac{1}{k + \text{rank}_i(d)}

rank_i(d) is the position of document d in the i-th source’s ranking, and k is a smoothing constant. The weights are tuned per-task: for KIS we weight vector search higher; for queries that mention text on screen (“formula CuSO4”), the OCR weight wins. After WRRF, an object/color filter (managed by Polars) prunes the result set to user-specified attributes.
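
A minimal sketch of the fusion step, with illustrative keyframe IDs and weights, and k = 60 used only as a conventional placeholder (the constant we actually used is not stated here):

from collections import defaultdict

def weighted_rrf(rankings: dict[str, list[str]], weights: dict[str, float], k: float = 60.0) -> list[str]:
    """Weighted Reciprocal Rank Fusion over per-source ranked lists of keyframe IDs."""
    scores = defaultdict(float)
    for source, ranked in rankings.items():
        w = weights.get(source, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical per-source results; OCR is up-weighted for a query that mentions on-screen text
fused = weighted_rrf(
    {"vector": ["kf_12", "kf_7", "kf_3"], "ocr": ["kf_3", "kf_12"], "asr": ["kf_7"]},
    weights={"vector": 1.0, "ocr": 1.5, "asr": 0.5},
)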

For storage: keyframe embeddings are indexed in Milvus for low-latency ANN search; OCR/ASR text in Elasticsearch with BM25 + dense hybrid search; structured object/color metadata in Polars dataframes for fast in-memory filtering.
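
For the attribute filter, a toy Polars example with a hypothetical metadata schema (the real column names are not in the paper):

import polars as pl

# hypothetical per-keyframe metadata produced by object/colour detectors at indexing time
meta = pl.DataFrame({
    "keyframe_id": ["kf_3", "kf_7", "kf_12"],
    "objects": [["person", "blackboard"], ["car"], ["person", "sandwich"]],
    "dominant_color": ["white", "red", "yellow"],
})

candidates = ["kf_12", "kf_7", "kf_3"]          # e.g. the output of WRRF
filtered = meta.filter(
    pl.col("keyframe_id").is_in(candidates)
    & pl.col("objects").list.contains("person")
    & (pl.col("dominant_color") == "yellow")
)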

K-pointer sequential re-ranking

The TRAKE task asks for a sequence of K events in order. The naive approach is a K-deep nested loop over candidate lists: O(N^K) — quadratic at K = 2, cubic at K = 3, dead at K = 4. This blows past the AIC time budget on the very first query.

The K-pointer algorithm replaces nested loops with a single sweep using K − 1 pointers:

k_pointer.py
# Each candidate list C_i is pre-sorted by (video_id, time_sec).
# All pointers only move forward across the sweep.
for pivot_idx in range(K):
    pivot = C[pivot_idx]
    pos = [0] * K  # positions in each non-pivot list
    for cur in pivot:
        # advance preceding-stage pointers (j < pivot_idx)
        for j in range(pivot_idx):
            while pos[j] + 1 < len(C[j]) and C[j][pos[j] + 1].time_sec < cur.time_sec:
                pos[j] += 1
        # advance succeeding-stage pointers (j > pivot_idx)
        for j in range(pivot_idx + 1, K):
            while pos[j] < len(C[j]) and C[j][pos[j]].time_sec < cur.time_sec:
                pos[j] += 1
        # validate same-video and temporal-window constraints, then aggregate
        if all_valid(cur, pos, window_sec):
            cur.score = aggregate_rank_scores(cur, pos)

Sweeping the K − 1 non-pivot lists across the N_i pivot items has amortized cost O(∑_{j≠i} N_j) = O(N); repeating for K pivots gives O(K·N) on top of an initial O(N log N_max) sort. For K ∈ {2, 3} — the realistic case — total complexity is effectively O(N log N), with great cache locality because pointers never go backward.
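
The paper leaves the validation and scoring helpers unspecified. A hypothetical sketch, written to take the candidate lists, K, and the pivot index explicitly rather than closing over them as the shorthand above does, and assuming a simple candidate record with a per-source rank field:

from dataclasses import dataclass

@dataclass
class Candidate:
    video_id: str
    time_sec: float
    rank: int          # position in its source ranking (hypothetical field)
    score: float = 0.0

def all_valid(cur, pos, window_sec, C, K, pivot_idx):
    """Same video, correct temporal order, and the whole K-event chain inside window_sec."""
    if any(pos[j] >= len(C[j]) for j in range(K) if j != pivot_idx):
        return False   # some stage has no remaining candidate at or after cur
    chain = [cur if j == pivot_idx else C[j][pos[j]] for j in range(K)]
    if any(e.video_id != cur.video_id for e in chain):
        return False
    times = [e.time_sec for e in chain]
    in_order = all(a <= b for a, b in zip(times, times[1:]))
    return in_order and (times[-1] - times[0]) <= window_sec

def aggregate_rank_scores(cur, pos, C, K, pivot_idx):
    """Illustrative fusion: reciprocal-rank sum over the K matched events."""
    chain = [cur if j == pivot_idx else C[j][pos[j]] for j in range(K)]
    return sum(1.0 / (1 + e.rank) for e in chain)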

Intuition (Why K-pointer beats nested loops here)

TRAKE candidate lists are already sorted by time. Nested loops re-discover that order at every level; the K-pointer algorithm exploits it once, globally. It is the classic “merge K sorted lists” trick, repurposed for temporal validation. We did not invent the idea — we matched it to the right problem.

A real query that the system answered well

“The video shows instructions on solving a multiple-choice exercise … the correct answer is the metal with atomic mass 23. This is the reaction of an alkali metal with CuSO4.”

The frame is a generic blackboard exercise — visually indistinguishable from thousands of other tutorial videos. Pure CLIP embeddings cannot disambiguate it. What identifies this exact clip is the literal string CuSO4 written on the board.

The fix needed no new code once the architecture existed: bump OCR weight, type CUSO4 into the query, let Elasticsearch hit the OCR index, and the correct keyframe surfaces in the top 3 of U-mode results. This is the dual-layer architecture earning its keep — text on screen carrying a signal that vision encoders alone simply cannot expose.

Honest engineering retrospective

The K-pointer algorithm worked — TRAKE was rated Good, not Excellent, but the bottleneck was elsewhere. Our Extend Keyframe UI feature, designed to give judges sub-second temporal precision by loading a ±2.5-second frame strip on demand, was the actual latency problem. As the paper documents: “the time taken for processing, loading, and interactively selecting the precise keyframe for submission was not sufficiently optimized for real-time interaction.” The user identified the right video instantly; the UI cost us minutes per submission across the round.

This is the difference between “ML is the bottleneck” and “the ML output handoff is the bottleneck” — and it is exactly the kind of mistake you only catch with end-to-end latency budgets, not isolated model benchmarks.


2. ZSE-Cap — Zero-Shot Ensemble for Article-Grounded Captioning

The EVENTA Track-1 task at ACM Multimedia 2025 combines two coupled sub-tasks: given a query image, retrieve its source news article from a corpus of 400K+ images / 200K+ articles (the OpenEvents V1 dataset), then generate a caption that grounds the image in that article’s narrative — named entities, locations, causes, consequences. It is image captioning where “a man stands at a podium” is a failure: you need to know which man, what speech, what consequences.

Our system, ZSE-Cap, finished Top-4 on the private test set with a final score of 0.42002 and no task-specific fine-tuning at all. The whole pipeline is foundation models, glued together with weights and prompts. (arXiv:2507.20564)

Stage 1 — Ensemble retrieval

We pre-compute embeddings for every database image with three models, query against each, and fuse:

| Model | Training signal | What it brings to the ensemble |
|---|---|---|
| CLIP | Contrastive image-text | Strong open-domain visual-semantic alignment |
| SigLIP | Sigmoid loss image-text | Better stability than softmax CLIP; complementary to it |
| DINOv2 | Self-distillation, no text | Fine-grained visual patterns; sees what text-supervised models miss |

We compared two fusion strategies on the public test set:

  • Weighted Ensemble (WE) — sum of L2 distances per model, with normalised weights:
S_{\text{WE}}(I_c) = \sum_{m \in \{\text{CLIP}, \text{SigLIP}, \text{DINOv2}\}} w_m \cdot d_m(I_q, I_c)

After grid search, raw weights of 0.5, 0.3, 0.3 (DINOv2/SigLIP/CLIP) normalised to w_DINOv2 = 0.4545 and w_CLIP = w_SigLIP = 0.2727. DINOv2 is weighted highest because its purely visual signal disambiguates near-duplicates that text-aligned models collapse together.

  • Reciprocal Rank Fusion (RRF) — parameter-free, rank-based, k = 0.

WE outperformed RRF on every metric, so we shipped WE.
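
A small sketch of the WE scoring, with toy distance arrays standing in for the real query-to-index L2 distances:

import numpy as np

def weighted_ensemble(distances: dict[str, np.ndarray], raw_weights: dict[str, float]) -> np.ndarray:
    """Weighted sum of per-model L2 distances; lower combined score = better candidate."""
    total = sum(raw_weights.values())
    weights = {m: w / total for m, w in raw_weights.items()}   # 0.5/1.1 ≈ 0.4545, 0.3/1.1 ≈ 0.2727
    return sum(weights[m] * distances[m] for m in distances)

rng = np.random.default_rng(0)
dists = {m: rng.random(5) for m in ("dinov2", "siglip", "clip")}   # 5 toy candidates

scores = weighted_ensemble(dists, {"dinov2": 0.5, "siglip": 0.3, "clip": 0.3})
ranking = np.argsort(scores)                                       # ascending: best candidate first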

Intuition (Why three models beat the best single model)

CLIP and SigLIP look at images through the lens of language — they capture what humans would say about an image. DINOv2 has never read a caption in its life — it captures what an image visually is. On stock photos with high text overlap (CLIP/SigLIP get fooled), DINOv2 is the tiebreaker. The reverse is true for abstract conceptual queries.

Public-test retrieval results

| Method | mAP | R@1 | R@10 |
|---|---|---|---|
| BLIP | 0.589 | 0.475 | 0.746 |
| CLIP (single) | 0.981 | 0.969 | 0.997 |
| SigLIP (single) | 0.973 | — | 0.998 |
| DINOv2 (single) | 0.971 | — | 0.998 |
| Ensemble — Reciprocal Rank Fusion | 0.991 | 0.985 | 0.999 |
| Ensemble — Weighted L2 (final) | 0.994 | 0.990 | 0.999 |

Going from CLIP-only to the weighted ensemble adds 1.3 mAP points and 2.1 R@1 points — modest in absolute terms, decisive in a leaderboard race. On the private test set the same configuration scored mAP 0.966, R@1 0.955, R@10 0.983 — a small public-private gap that confirms the approach generalises.

Stage 2 — Prompt-guided captioning

Once we know the article, we feed Gemma-3-27b-it a triplet: (query image, full retrieved article, structured prompt). The prompt is what does the work.

Solution (The 4-step prompt that drove the 130× CIDEr jump)

The captioning prompt forces the LLM through an explicit reasoning chain instead of describing the image directly:

  1. Contextualise the image through the article first — read the article, identify the central event, figures, and narrative; understand how the image illustrates these.
  2. Describe in service of the article’s narrative — describe only image elements that matter to the article; mention named entities visible in the frame.
  3. Articulate the connection — make the caption explicitly say why this image accompanies this article.
  4. Professional, journalistic style — precise, informative; no preamble like “Here is the caption:”.

Without this scaffold, Gemma defaults to either pure image description (no event grounding) or pure article summary (no visual anchor). The prompt is the cognitive orchestrator that fuses the two streams.
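
A paraphrased reconstruction of how such a prompt could be assembled; the exact wording used in the paper is not reproduced here:

def build_caption_prompt(article_text: str) -> str:
    # Paraphrase of the 4-step scaffold; the query image is passed to the model separately.
    return (
        "You are writing a caption for a news photograph.\n\n"
        f"Article:\n{article_text}\n\n"
        "Follow these steps:\n"
        "1. Read the article first: identify the central event, the key figures, and the "
        "narrative, and work out how the image illustrates them.\n"
        "2. Describe only the image elements that matter to the article, naming any "
        "entities visible in the frame.\n"
        "3. State explicitly why this image accompanies this article.\n"
        "4. Write in a precise, journalistic style. Output the caption only, with no "
        "preamble such as 'Here is the caption:'."
    )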

Captioning results on the public set

| Configuration | CLIPScore | CIDEr |
|---|---|---|
| Gemma-3-4b-it + image only (baseline) | 0.820 | 0.001 |
| Gemma-3-4b-it + article + structured prompt | 0.817 | 0.133 |
| Gemma-3-27b-it + article + structured prompt | 0.842 | 0.151 |

Image-only captioning hits CIDEr 0.001 — basically zero overlap with ground-truth captions. Add the article and the structured prompt, and CIDEr jumps to 0.133 — a 130-fold increase. Scaling the model from 4B to 27B adds another ~14% on top. The headline result is the prompt, not the model size.

On the private set the same final configuration scored CLIPScore 0.828 / CIDEr 0.133 — a slight moderation typical of blind-test transitions, but enough to anchor the Top-4 finish.

Qualitative evidence — Michelle Payne at the Melbourne Cup

The same query image, captioned two ways:

Image-only (Gemma-3-4b-it): “The image captures a jubilant, rain-slicked street scene in Melbourne, Australia, likely during a daytime parade, judging by the overcast lighting and the presence of a large crowd. The dominant color palette is a muted, cool gray-blue, punctuated by the dark sheen of the vintage car and the vibrant red of the crowd’s attire…”

ZSE-Cap (Gemma-3-27b-it + article + prompt): “Following her historic Melbourne Cup win aboard Prince of Penzance, jockey Michelle Payne and connections are driven in a vintage car along Collins Street in Melbourne on November 3, 2015. Payne’s victory marked the first time a female jockey had won the prestigious race… Visible in the car with Payne are, from left, Robert Doyle (Lord Mayor of Melbourne) and Michael Burn (Victoria Racing Club Chairman)… The win, highlighted by Payne wearing the colors of the Suffragettes, resonated as a triumph of determination and equality.”

The image is the same. The difference is the article, marshalled by the prompt.

Where ZSE-Cap fails — and what that means for production

The paper’s error analysis logged two clean failure modes:

  1. Contextual ambiguity from near-duplicate images — a single press photo (or near-duplicate) is associated with multiple articles. Visual L2 distance cannot disambiguate; the system picks one of the candidate articles, and if the wrong one is chosen, the caption is factually wrong (right photo, wrong context). This is a retrieval-side limit a pure image-to-image pipeline cannot solve. What I would add for production: a text-image cross-encoder re-rank on top-K, and/or article-side priors (freshness, publisher reputation, cluster-deduplication of duplicate photos at indexing time).

  2. Sensitivity to severe visual perturbations — heavy crops, JPEG compression, or drastic colour shifts move embeddings far enough that the correct image drops below visually-cleaner-but-wrong candidates. What I would add for production: query-time augmentation matching the indexing pipeline, or a small invariance head trained to absorb the augmentation distribution.

These two failure modes are also the most realistic threats in a deployed news-captioning service: stock photography is the natural enemy of any image-only retriever, and user-uploaded query images are routinely mangled by frontends.


3. Sino-Nom OCR — Fine-Tuning PP-OCRv5 on Historical Manuscripts

A full project write-up lives at /project/sino-nom-chinese-ocr; this section covers the engineering decisions that mattered most. The work was the final project for our NLP coursework at HCMUS — graded, but also a real OCR pipeline against real degraded manuscripts.

The dataset problem

Sino-Nom (chữ Hán-Nôm) manuscripts have wormholes, faded ink, bled-through pages, and woodblock-warped columns. Off-the-shelf PP-OCRv5 trained on Han Chinese reaches H-mean 0.731 out of the box — non-trivial because Sino-Nom shares many radicals with Han Chinese. The remaining 22 points are the long tail you only buy with domain data.

We unified two corpora:

  • NomNaOCR (Đặng et al., RIVF 2022) — 2,953 Vietnamese Sino-Nom manuscript pages from Lục Vân Tiên, Truyện Kiều, and Đại Việt Sử Ký Toàn Thư, sourced from the Vietnamese Nôm Preservation Foundation.
  • CWKB (Complete Works of Korean Buddhism) — Korean Buddhist canon spanning Silla through Joseon dynasties, scraped from the Dongguk University archive and labelled manually after cleaning.

NomNaOCR’s original validation set was kept untouched as our test split. Its training set plus all of CWKB was shuffled and split 80/20 into our train / validation. Final counts: 2,253 train / 564 val / 594 test. All filenames are normalised to <book_name>_<image_name>.jpg for easy provenance tracking.

Architecture — what’s actually inside PP-OCRv5_server_det

| Component | Choice | Why |
|---|---|---|
| Backbone | PP-HGNetV2_B4, distilled from GOT-OCR2.0 | Strong document-OCR prior already baked in by the teacher |
| Neck | LKPAN (Large Kernel PAN), 256 output channels | Multi-scale fusion across radical-level and column-level features |
| Head | PFHeadLocal (Parallel Fusion), k = 50 | Differentiable Binarisation with sharp probability-map output |
| Loss | DBLoss | Probability map (Dice, α = 5) + threshold map (BCE, β = 10) |

The full DB loss:

\mathcal{L}_{\text{DB}} = 5 \cdot \mathcal{L}_{\text{Dice}}(\hat{P}, P^{*}) + 10 \cdot \mathcal{L}_{\text{BCE}}(\hat{T}, T^{*})

Plus OHEM at 3:1 negative-to-positive sampling — Sino-Nom pages are mostly background, and without OHEM the gradient gets dominated by trivial whitespace.
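
For intuition, a NumPy sketch of 3:1 hard-negative mining on the probability map (not the actual PaddleOCR implementation):

import numpy as np

def ohem_bce(prob_map: np.ndarray, gt_map: np.ndarray, neg_pos_ratio: int = 3) -> float:
    """Keep every positive pixel but only the hardest negatives, at 3:1 negatives-to-positives,
    so trivial background pixels do not dominate the gradient."""
    eps = 1e-6
    bce = -(gt_map * np.log(prob_map + eps) + (1 - gt_map) * np.log(1 - prob_map + eps))
    pos = gt_map > 0.5
    n_pos = max(int(pos.sum()), 1)
    n_neg = min(int((~pos).sum()), neg_pos_ratio * n_pos)
    hardest_neg = np.sort(bce[~pos])[::-1][:n_neg]       # largest-loss background pixels
    return float((bce[pos].sum() + hardest_neg.sum()) / (n_pos + n_neg))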

Optimiser: Adam (β₁ = 0.9, β₂ = 0.999), Cosine Annealing with a 2-epoch warm-up, base LR 10⁻³, L2 regularisation 10⁻⁶. Augmentations: random crop to 640 × 640, rotation in [−10°, +10°], scale [0.5, 3.0], horizontal flip p = 0.5. Post-processing: binary threshold τ = 0.3, box filter 0.6, expand ratio 1.5.
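
A generic sketch of that schedule, cosine annealing with a linear warm-up, not PaddleOCR's exact implementation:

import math

def lr_at(step: int, total_steps: int, warmup_steps: int,
          base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Linear warm-up to base_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))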

Training dynamics — how I knew it was working

The first epoch tells you whether fine-tuning was even the right move:

  • LR ramps from 1.6 × 10⁻⁵ to 4.6 × 10⁻⁴ during the 2-epoch warm-up.
  • Training loss drops from 4.12 to 1.38 inside epoch 1.
  • Validation H-mean stabilises within ~1 point of best by epoch 20.
  • Peak validation H-mean at epoch 64 — the final exported checkpoint.

That a 100-epoch budget peaks at 64 is the loud signal that fine-tuning, not retraining, was the correct call. The pretrained weights already know that columns of black ink on yellowed paper are characters; we are mostly teaching them what these specific characters look like.

Hardware: 2× NVIDIA Tesla T4 (15,360 MiB VRAM each), CUDA 12.8 / cuDNN 9.2, distributed data-parallel via PaddlePaddle GPU + PaddleOCR. Logging every 10 steps, validation every 500 steps, full checkpoint every 10 epochs plus best-H-mean snapshot.

Final results

| Metric | Baseline | Fine-tuned | Δ |
|---|---|---|---|
| Precision | 0.713 | 0.966 | +0.253 |
| Recall | 0.750 | 0.937 | +0.187 |
| H-mean | 0.731 | 0.952 | +0.221 |

Where it still fails

Per-source breakdown matters: Truyện Kiều and Lục Vân Tiên (clean carved blocks) hit per-page H-mean = 1.0 routinely. Đại Việt Sử Ký Toàn Thư (faded, bled-through ink, time-degraded text) falls off — and the fix is not more training data. The fix is the missing PP-LCNet orientation classifier + UVDoc dewarp preprocessing chain that PP-OCRv5 ships with by default but we did not wire up. The detector was being asked to compensate for distortions a 2-MB preprocessing model is purpose-built to remove.

Warning (The lesson for production OCR)

A single H-mean number averages over easy and hard pages. Real digitisation services need a triage layer: clean pages bypass preprocessing for speed; degraded pages route through orientation + dewarp first. Without that, you ship one model that works “on average” — and your worst-case users are the ones who scream loudest.


4. Lessons Learned — Three Rules I Now Apply Before Touching a Model

A year of competition systems and academic projects boils down to three sentences I write on a sticky note before any new project.

Rule 1 — Zero-shot ensemble before fine-tune

Fine-tuning costs GPU hours, risks overfitting, and locks the system into one data distribution. Ensembling foundation models costs only inference time and almost always beats a single model out of the box. ZSE-Cap reached Top-4 with no fine-tuning at all — three pretrained encoders, three weights tuned on a public split, one prompt. Reach for fine-tuning only when ensembling has plateaued and you have measured the gap.

Rule 2 — Algorithmic complexity is a product feature

The K-pointer algorithm is not a research contribution; it is a refusal to pay O(N^K) when O(K·N) is sitting right there. Latency budgets are the product feature for interactive AI. The first thing I now ask of any architecture is: what is the asymptotic cost on the worst-case input my product allows? If I cannot answer that in one sentence, the architecture is not ready.

Rule 3 — Better data preprocessing beats a bigger model, almost always

The Sino-Nom project’s residual error is not “the detector needs more layers.” It is the missing UVDoc dewarp and PP-LCNet orientation step. Most failure modes I have seen in production-style review came down to the input pipeline lying to the model — wrong orientation, unnormalised crops, near-duplicate inputs collapsed by L2. Spend the first day staring at your worst-case inputs, not your training loss.


Note

The intersection of deep learning research and software engineering is the only place where AI systems actually pay rent. Papers without production are demos; production without research-grade thinking is technical debt that compounds with every model release.

References

  • KPTER: K-Pointer for Temporal Event Retrieval — D. A. K. Dinh, D.-T. Dinh, L. H. T. Nguyen, T. N. Nguyen. SOICT 2025 (CCIS, Springer).
  • ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning — D.-T. Dinh, D. A. K. Dinh. EVENTA Challenge @ ACM Multimedia 2025. arXiv:2507.20564.
  • Fine-tune PaddleOCR for NomNaOCR — Đ. Đ. A. Khoa, N. L. H. Trung, Đ. Đ. Tài. NLP coursework report, HCMUS, 2025–2026. Companion project: /projects/sino-nom-chinese-ocr.
  • PaddleOCR 3.0 Technical Report — Cui et al., arXiv:2507.05595.
  • Source code: github.com/ductai05/NLP-ChineseOCR, github.com/ductai05/ZSE-Cap.