
Hugging Face OCRs 30,000 arXiv papers using Codex, Chandra-OCR 2, and Hugging Face Jobs

About 27,000 papers indexed on Hugging Face lacked HTML versions on arXiv, making it impossible for the HuggingChat 'chat with paper' feature to work for those papers.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.
Stage 1 · Identify papers missing HTML
About 27,000 papers indexed on Hugging Face lack HTML pages on arXiv, making chat with those papers impossible.
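The source does not say how the missing-HTML check was actually performed. As a minimal sketch only, the stage could be approximated by probing each paper's HTML URL and keeping the IDs that do not resolve. The URL pattern `https://arxiv.org/html/<id>` and the helper names `arxiv_html_url`, `has_html`, and `papers_missing_html` are assumptions made for illustration, not details taken from the writeup.

```python
import urllib.request
from typing import Callable, Iterable, List


def arxiv_html_url(arxiv_id: str) -> str:
    """Build the URL assumed to serve a paper's HTML rendering (assumption)."""
    return f"https://arxiv.org/html/{arxiv_id}"


def has_html(arxiv_id: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the assumed HTML URL returns 200.

    This performs a network call. Any HTTP or connection error is treated
    as "no HTML available", which is a simplification: a real run would
    want retries so that transient failures are not counted as missing.
    """
    request = urllib.request.Request(arxiv_html_url(arxiv_id), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


def papers_missing_html(
    arxiv_ids: Iterable[str],
    check: Callable[[str], bool] = has_html,
) -> List[str]:
    """Return the IDs for which `check` reports no HTML version, in order."""
    return [arxiv_id for arxiv_id in arxiv_ids if not check(arxiv_id)]
```

The `check` parameter is injectable so the filter can be exercised offline with a stub, and so a cached or rate-limited checker can be swapped in for a large batch.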
Tools used
Codex, Chandra-OCR 2, vLLM, Jobs, HuggingChat, hf-mount, Xet
Outcome

All parallel OCR jobs completed in about a day, and the resulting Markdown versions were integrated into Paper Pages, enabling chat with any paper on the hub.

Results
Time saved: about 29-30 hours
Volume: 30,000
Cost replaced: about $850
Source

https://huggingface.co/blog/nielsr/ocr-papers-jobs


Grounding & classification
Source type: technical build writeup
34 fields verified against source quotes, 1 dropped as unverifiable.
agentic workflow, code generation, document ai, ocr, knowledge base, builder submitted, metric backed, named customer, production runtime claimed, workflow described, software, cost reduction, throughput increase, technical build writeup, back office ops, agentic task execution, document to record