Sino-Nom/Chinese OCR

High-concurrency Goroutine-based scraping pipeline plus PaddleOCR fine-tuning that lifted Sino-Nom character detection H-mean from 0.731 to 0.952 on the NomNaOCR benchmark.

November 1, 2025 → December 1, 2025

Summary (Headline numbers)
  • +22.1 points H-mean (a 30.2% relative gain) on the NomNaOCR test set: 0.731 → 0.952.
  • Precision 0.966, Recall 0.937 after fine-tuning PP-OCRv5 (PP-OCRv5_server_det).
  • 3,411 manuscript pages curated from NomNaOCR + the CWKB Korean Buddhist canon, scraped by a 16-worker Goroutine pipeline.
  • 2× NVIDIA T4 distributed training on Kaggle, 100 epochs, peak validation H-mean at epoch 64.

Introduction

Sino-Nom (chữ Hán-Nôm) is the script that recorded over a thousand years of Vietnamese history, literature, and Buddhist commentary before the Latin-alphabet quốc ngữ took over. Today, almost no native readers remain, and the surviving manuscripts are degraded — bleed-through ink, wormholes, faded strokes, and slanted columns that defeat off-the-shelf OCR.

This project tackled the detection sub-problem (find every character bounding box) on two large-scale, real-world Sino-Nom corpora.

Note (Scope of this report)

We focus on the text detection stage of the OCR pipeline (input: a page image; output: oriented bounding boxes around every character or character cluster). Recognition (image → Unicode) is the next module and is not covered here.
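Concretely, the detector's output contract looks like this. A minimal Go sketch; the type names are ours, not PaddleOCR's:

// Quad is an oriented bounding box: four corner points in
// page-pixel coordinates, ordered clockwise from the top-left.
type Quad [4][2]float32

// Detection is one detected character or character cluster.
type Detection struct {
	Box   Quad    // oriented box around the glyph(s)
	Score float32 // detector confidence for this region
}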


Datasets

We unified two heterogeneous corpora into a single PaddleOCR-compatible split.

NomNaOCR — Vietnamese Sino-Nom canon

NomNaOCR is the largest open Sino-Nom OCR dataset for Vietnam, built from three landmark works: Truyện Kiều, Lục Vân Tiên, and Đại Việt Sử Ký Toàn Thư (see the per-source breakdown in Results).

Total: 2,953 manuscript pages sourced from the Vietnamese Nôm Preservation Foundation.

CWKB — Complete Works of Korean Buddhism

The CWKB archive covers Korean Buddhist literature from the Silla through Joseon dynasties. The site exposes pages through paginated viewers, not bulk downloads, so we built a custom scraper.

scraper/main.go
package main

import (
	"context"
	"sync"
)

const workerCount = 16

// Page and fetchAndParse are defined elsewhere in the scraper
// (collapsed in the original listing).

// crawl fans a stream of page URLs out to a fixed pool of workers and
// funnels parsed pages into out. It closes out once every worker has
// finished, so the consumer can simply range over it.
func crawl(ctx context.Context, urls <-chan string, out chan<- Page) {
	var wg sync.WaitGroup
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				page, err := fetchAndParse(ctx, u)
				if err != nil {
					continue // skip unfetchable pages; the pool keeps going
				}
				out <- page
			}
		}()
	}
	wg.Wait()
	close(out)
}

The Goroutine pool gave us ~4× wall-clock speedup over a sequential Python requests baseline at the same TCP connection budget — I/O-bound scraping is exactly where Go’s lightweight concurrency shines.
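For completeness, a minimal driver for the pool; seedURLs and savePage are hypothetical stand-ins for the CWKB-specific pieces:

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	urls := make(chan string)
	out := make(chan Page, workerCount)

	// Producer: enumerate the paginated viewer URLs, then close the
	// channel so the workers can drain it and exit.
	go func() {
		defer close(urls)
		for _, u := range seedURLs() { // hypothetical URL enumerator
			urls <- u
		}
	}()

	go crawl(ctx, urls, out)

	// crawl closes out once every worker is done, so a plain range
	// loop is all the consumer needs.
	for page := range out {
		savePage(page) // hypothetical: write image + metadata to disk
	}
}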

Final split

After standardizing filenames to <book>_<page>.jpg and writing PaddleOCR-style det_gt.txt labels, we kept NomNaOCR’s original validation set as the held-out test set, then mixed the rest with all CWKB images and re-split 80/20.

| Split | Images | Source |
| --- | --- | --- |
| Train | 2,253 | NomNaOCR (train) ∪ CWKB, mixed and shuffled |
| Validation | 564 | 20% holdout of the same pool |
| Test | 594 | NomNaOCR original validation set, untouched |

Keeping the test split untouched ensures all reported metrics are comparable to other NomNaOCR baselines.
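The labeling and split step is simple enough to sketch. The Ann shape below mirrors PaddleOCR's detection label format (one line per image: a relative path, a tab, then a JSON list of boxes); the helper names are ours:

package labels

import (
	"encoding/json"
	"fmt"
	"math/rand"
	"os"
)

// Ann is one ground-truth box in PaddleOCR's detection label format.
type Ann struct {
	Transcription string   `json:"transcription"`
	Points        [][2]int `json:"points"` // corner points in page-pixel coords
}

// writeDetLabels emits a det_gt.txt-style file: <image path>\t<JSON boxes>.
func writeDetLabels(path string, pages map[string][]Ann) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	for file, anns := range pages {
		blob, err := json.Marshal(anns)
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintf(f, "%s\t%s\n", file, blob); err != nil {
			return err
		}
	}
	return nil
}

// splitTrainVal shuffles the pooled pages and carves off 20% for validation.
func splitTrainVal(files []string, seed int64) (train, val []string) {
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(files), func(i, j int) { files[i], files[j] = files[j], files[i] })
	cut := len(files) * 4 / 5
	return files[:cut], files[cut:]
}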


Architecture

We fine-tuned PP-OCRv5_server_det — the higher-capacity variant of PaddleOCR 3.0’s text-detection family — built around the Differentiable Binarization (DB) algorithm.

Components

  • Backbone – convolutional feature extraction over the page image
  • Neck – multi-scale feature fusion
  • Head – DB head predicting the per-pixel probability and threshold maps

All three parts are updated during fine-tuning (see Init in the hyperparameter table).

DBLoss formulation

Detection is supervised by a weighted combination of a probability map loss and a threshold map loss:

\mathcal{L}_{\text{DB}} = \alpha \cdot \mathcal{L}_{\text{prob}}^{\text{Dice}} + \beta \cdot \mathcal{L}_{\text{thresh}}^{\text{BCE}}

with α = 5 and β = 10, as in the original PP-OCRv5 recipe.
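For reference, the standard forms of the two components, with pᵢ the predicted map value and gᵢ the ground truth at pixel i, and S the OHEM-sampled pixel set (see the callout below):

\mathcal{L}^{\text{Dice}} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i},
\qquad
\mathcal{L}^{\text{BCE}} = -\frac{1}{|S|} \sum_{i \in S} \left[ g_i \log p_i + (1 - g_i) \log(1 - p_i) \right]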

Solution (Why OHEM 3:1 matters here)

Sino-Nom pages have a high background-to-character pixel ratio. We kept Online Hard Example Mining at the default 3:1 negative-to-positive sampling, which forces the model to spend its gradient budget on the difficult faded strokes instead of trivial whitespace. Removing OHEM in an ablation knocked H-mean down by ~3 points.
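A minimal sketch of the 3:1 rule itself, operating on flattened per-pixel losses (a simplification of what PaddleOCR's balanced loss does internally; names are ours):

package ohem

import "sort"

// hardExampleMask keeps every positive pixel but only the ratio*numPos
// hardest negatives, so easy background contributes no gradient.
func hardExampleMask(losses []float64, isPositive []bool, ratio int) []bool {
	keep := make([]bool, len(losses))
	var negIdx []int
	numPos := 0
	for i, pos := range isPositive {
		if pos {
			keep[i] = true
			numPos++
		} else {
			negIdx = append(negIdx, i)
		}
	}
	// Hardest negatives first: sort by descending loss.
	sort.Slice(negIdx, func(a, b int) bool {
		return losses[negIdx[a]] > losses[negIdx[b]]
	})
	limit := ratio * numPos
	if limit > len(negIdx) {
		limit = len(negIdx)
	}
	for _, i := range negIdx[:limit] {
		keep[i] = true
	}
	return keep
}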


Hyperparameter Setup

The default PP-OCRv5 recipe transferred surprisingly well — only epochs needed lowering. Full configuration:

| Group | Setting |
| --- | --- |
| Optimizer | Adam (β₁ = 0.9, β₂ = 0.999); LR 0.001, cosine annealing + 2 warm-up epochs |
| Regularizer | L2, factor 10⁻⁶ |
| Loss | DBLoss (α = 5 Dice + β = 10 BCE), OHEM ratio 3:1 |
| Augmentation | Random crop to 640 × 640, rotation [−10°, +10°], scale [0.5, 3.0], horizontal flip p = 0.5 |
| Post-processing | Binary threshold τ = 0.3, box score threshold 0.6, unclip (region expand) ratio 1.5 |
| Epochs | 100 (down from the default 500); enough thanks to a strong pretrained init |
| Init | PP-OCRv5_server_det_pretrained.pdparams, full fine-tune (backbone + neck + head all updated) |
train_config.yml
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 1.0e-06

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  unclip_ratio: 1.5
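One post-processing knob is worth unpacking. DB is trained on shrunk text regions, so each detected polygon of area A and perimeter L is re-expanded after binarization by the offset from the DB paper [4]:

D = \frac{A \times r'}{L}, \qquad r' = 1.5

Because A/L grows with region size, larger boxes receive a proportionally larger expansion: a scale-aware margin rather than a fixed one.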

Training

Training ran on Kaggle's free tier:

  • 2× NVIDIA T4 GPUs, distributed data-parallel training
  • 100 epochs end to end

Convergence behavior

The model converged fast and stably, with validation H-mean peaking at epoch 64 of the 100 scheduled.

That a 500-epoch recipe peaks at 64 epochs on this domain is the strongest signal that fine-tuning, not training-from-scratch, is the right move.


Results

Headline metrics on the held-out NomNaOCR test set

| Metric | Baseline (PP-OCRv5_server_det) | Fine-tuned (ours) | Δ |
| --- | --- | --- | --- |
| Precision | 0.713 | 0.966 | +0.253 |
| Recall | 0.750 | 0.937 | +0.187 |
| H-mean | 0.731 | 0.952 | +0.221 |
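H-mean is the harmonic mean of precision and recall, the standard detection metric for DB-style models. Plugging in the fine-tuned numbers:

H = \frac{2PR}{P + R} = \frac{2 \times 0.966 \times 0.937}{0.966 + 0.937} \approx 0.951

which matches the reported 0.952 up to rounding of the precision and recall figures.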

A baseline trained on Chinese/English/Japanese data already gets you to 0.731 H-mean — Sino-Nom characters share many radicals with Han Chinese, so the prior is non-trivial. The remaining 0.22 gap is the slope you only climb with domain-specific data.

Per-source breakdown

The test set is heterogeneous; performance is not uniform across the three NomNaOCR sub-corpora:

| Sub-corpus | Page condition | Detection quality |
| --- | --- | --- |
| Truyện Kiều | Clean carved woodblock | Excellent |
| Lục Vân Tiên | Clean print | Excellent |
| Đại Việt Sử Ký Toàn Thư | Faded, bled-through ink, time-warped paper | Noticeably lower |

Best- and worst-case predictions illustrate the gap clearly: on Truyện Kiều the model routinely hits a per-page H-mean of 1.0, while on the worst Đại Việt Sử Ký Toàn Thư pages it can drop to 0.0, with entire columns missed where the ink has faded into the substrate.


Limitations

The single biggest residual error source is not the detector itself but the image preprocessing pipeline that PP-OCRv5 ships with by default and that we did not wire up:

  • document orientation classification, to correct rotated pages
  • text image unwarping, to rectify curved or warped paper

Without these, the detector sees rotated and warped pages as out-of-distribution. Adding the preprocessing chain is the next obvious step for closing the Đại Việt Sử Ký Toàn Thư gap.

Warning (Honest assessment)

A 0.952 H-mean is a strong number, but it averages over easy and hard sub-corpora. A production OCR service for cultural-heritage digitization needs a triage layer: clean pages route to the fast detector, degraded pages route through preprocessing first. We did not build this layer — yet.


References

  1. Dang, H.-Q. et al. NomNaOCR: The First Dataset for Optical Character Recognition on Han-Nom Script. RIVF 2022.
  2. Dongguk University. The Archive of the Cultural Heritage of Buddhist Records (CWKB). https://kabc.dongguk.edu/
  3. Cui, C. et al. PaddleOCR 3.0 Technical Report. arXiv:2507.05595, 2025.
  4. Liao, M. et al. Real-time Scene Text Detection with Differentiable Binarization (DB). AAAI 2020.