GitHub builds OctoLingua: an ANN-based machine learning classifier for programming language detection
Language detection at GitHub is non-trivial because file extensions can be ambiguous (shared across several languages) or absent entirely. The existing tool, Linguist, achieved 84% file-level accuracy, but its performance declined considerably when extensions were missing or incorrect, making it unsuitable for Gists and for inline code snippets in READMEs, issues, and pull requests.
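To make the extension problem concrete, here is a minimal Python sketch. The mapping is a small hand-picked illustration and the names `EXTENSION_CANDIDATES` and `candidates_from_extension` are hypothetical; none of it is taken from Linguist's actual data or code.

```python
import os

# Illustrative, hand-picked mapping (an assumption for this example,
# not Linguist's real extension table).
EXTENSION_CANDIDATES = {
    ".h": ["C", "C++", "Objective-C"],
    ".m": ["Objective-C", "MATLAB"],
    ".pl": ["Perl", "Prolog"],
    ".py": ["Python"],
}


def candidates_from_extension(filename):
    """Return the languages a file extension alone could indicate."""
    _, ext = os.path.splitext(filename)
    # An unknown or missing extension yields no candidates at all.
    return EXTENSION_CANDIDATES.get(ext.lower(), [])


print(candidates_from_extension("vector.h"))  # several candidates: ambiguous
print(candidates_from_extension("snippet"))   # no extension: nothing to go on
```

An extension-only lookup either returns several candidates or none for inputs like these, which is the situation a Gist or an inline snippet creates.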
OctoLingua, an artificial neural network (ANN) built with Python, Keras, and TensorFlow, surpasses Linguist in accuracy and performance and remains reliable under varied conditions, learning primarily from code vocabulary rather than from file extension metadata.
Linguist relies on heuristics and a Naive Bayes classifier trained on a small sample of data; it fails as soon as file extension information is altered or removed, revealing that it does not robustly learn from code vocabulary.
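For readers unfamiliar with the technique, the following is a generic token-level Naive Bayes sketch in Python. It is not Linguist's implementation (Linguist is a separate Ruby project), and the tokenizer, the `TokenNaiveBayes` class, and the four training samples are all assumptions made up for illustration. It shows what classifying from code vocabulary alone looks like: no filename is ever consulted.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifier-like words plus single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code):
    return TOKEN_RE.findall(code)


class TokenNaiveBayes:
    """Multinomial Naive Bayes over code tokens with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> sample count
        self.vocab = set()

    def fit(self, samples):
        for language, code in samples:
            tokens = tokenize(code)
            self.token_counts[language].update(tokens)
            self.doc_counts[language] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, code):
        tokens = tokenize(code)
        total_docs = sum(self.doc_counts.values())
        best_language, best_score = None, -math.inf
        for language, counts in self.token_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[language] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data, invented for this example.
TRAINING = [
    ("Python", "def add(a, b):\n    return a + b\n"),
    ("Python", "import os\nfor name in os.listdir('.'):\n    print(name)\n"),
    ("JavaScript", "function add(a, b) {\n  return a + b;\n}\n"),
    ("JavaScript", "const fs = require('fs');\nconsole.log(fs.readdirSync('.'));\n"),
]

model = TokenNaiveBayes().fit(TRAINING)
print(model.predict("def square(x):\n    return x * x\n"))
print(model.predict("function square(x) {\n  return x * x;\n}\n"))
```

With only a handful of training samples, a model like this depends heavily on a few distinctive tokens, which is consistent with the point above that a classifier trained on a small sample may not learn code vocabulary robustly.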