Dropbox uses machine learning and OCR to make text in billions of images searchable
Images and PDFs consisting of embedded images stored by Dropbox users were invisible to search indexing because they contain only pixels, not extractable text. This left billions of files, including receipts, whiteboard photos, and scanned documents, unsearchable.
Dropbox deployed automatic image text recognition for users on Professional and Business Advanced/Enterprise plans. TensorFlow tuning improved throughput by about 3x, PDF metadata extraction failures fell by 88%, and almost 90% of documents were indexed completely.
The initially deployed version of the pipeline was computationally prohibitive, requiring an enormous cluster, and actual traffic was roughly twice the projected load. TensorFlow's default multicore behavior also caused severe context-switching overhead, which degraded throughput further.
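To make the capacity arithmetic concrete, the sketch below shows how a doubling of traffic and a roughly 3x throughput gain combine when sizing a cluster. The load and per-machine throughput figures are hypothetical placeholders chosen for illustration, not numbers from Dropbox, and `machines_needed` is an illustrative helper rather than anything from the original pipeline.

```python
import math


def machines_needed(load_per_sec: float, per_machine_throughput: float) -> int:
    """Smallest whole number of machines able to absorb the given load."""
    return math.ceil(load_per_sec / per_machine_throughput)


# Hypothetical figures, assumed purely for illustration.
projected_load = 1000.0      # images per second expected at planning time
baseline_throughput = 5.0    # images per second per machine before tuning

# Cluster size the original projection would have called for.
planned = machines_needed(projected_load, baseline_throughput)

# Actual traffic at roughly twice the projection, with no tuning applied.
actual_untuned = machines_needed(2 * projected_load, baseline_throughput)

# Same doubled traffic after a roughly 3x per-machine throughput gain.
actual_tuned = machines_needed(2 * projected_load, 3 * baseline_throughput)

print(planned, actual_untuned, actual_tuned)
```

Under these assumed inputs, cluster size scales with load divided by per-machine throughput, so a 3x throughput gain more than offsets a 2x traffic surprise: the tuned cluster ends up at about two thirds of the originally planned size.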
https://dropbox.tech/machine-learning/using-machine-learning-to-index-text-from-billions-of-images