
Hugging Face OCRs 30,000 arXiv papers using Codex, Chandra-OCR 2, and Hugging Face Jobs

About 27,000 papers indexed on Hugging Face lacked HTML versions on arXiv, making it impossible for the HuggingChat 'chat with paper' feature to work for those papers.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Identify papers missing HTML

About 27,000 papers indexed on Hugging Face lack HTML pages on arXiv, making chat with those papers impossible.
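A minimal sketch of how this filtering stage could look, assuming arXiv serves HTML renderings at `https://arxiv.org/html/<id>` and that a paper without one returns a non-200 status. The helper names (`html_url`, `http_status`, `find_missing_html`) and the injectable `get_status` callable are hypothetical and are not taken from the source writeup.

```python
from typing import Callable, Iterable, List
import urllib.error
import urllib.request

# Assumed URL pattern for arXiv HTML renderings.
ARXIV_HTML_BASE = "https://arxiv.org/html/"


def html_url(arxiv_id: str) -> str:
    """Build the assumed HTML URL for an arXiv identifier."""
    return ARXIV_HTML_BASE + arxiv_id


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, TimeoutError):
        # No usable answer; this sketch reports 0, a real run would retry.
        return 0


def find_missing_html(
    arxiv_ids: Iterable[str],
    get_status: Callable[[str], int] = http_status,
) -> List[str]:
    """Return the IDs whose HTML page does not answer with status 200."""
    return [i for i in arxiv_ids if get_status(html_url(i)) != 200]
```

Passing `get_status` as a parameter keeps the filter testable without network access. The resulting list would be the input to the OCR stage.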

Tools used

Codex, Chandra-OCR 2, vLLM, Jobs, HuggingChat, hf-mount, Xet

Outcome

All parallel OCR jobs completed in about a day, and the resulting Markdown versions were integrated into Paper Pages, enabling chat with any paper on the hub.
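The source does not say how papers were divided among the parallel jobs. One common approach, sketched here under that caveat, is to split the list of paper IDs into round-robin shards and hand one shard to each job. The `shard` helper and the shard count are hypothetical.

```python
from typing import List, Sequence


def shard(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Split items into num_shards round-robin groups of near-equal size."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Shard k takes items k, k + num_shards, k + 2 * num_shards, ...
    return [list(items[k::num_shards]) for k in range(num_shards)]


# Example: five placeholder paper IDs split across two jobs.
paper_ids = ["p1", "p2", "p3", "p4", "p5"]
shards = shard(paper_ids, 2)
```

Round-robin assignment keeps shard sizes within one item of each other, so no single job ends up with a much longer queue than the rest.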

Results

Time saved: about 29-30 hours

Volume: 30,000

Cost replaced: about $850

Source

https://huggingface.co/blog/nielsr/ocr-papers-jobs


Grounding & classification

Source type: technical build writeup

34 fields verified against source quotes, 1 dropped as unverifiable.

agentic workflow, code generation, document ai, ocr, knowledge base, builder submitted, metric backed, named customer, production runtime claimed, workflow described, software, cost reduction, throughput increase, technical build writeup, back office ops, agentic task execution, document to record