
GitHub builds OctoLingua: an ANN-based machine learning classifier for programming language detection

Language detection at GitHub is non-trivial because file extensions are ambiguous, shared across languages, or absent entirely. The existing tool, Linguist, achieved 84% file-level accuracy, but its performance declined considerably when extensions were missing or incorrect. That made it unsuitable for Gists and for inline code snippets in READMEs, issues, and pull requests.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Code pushed to repository

When code is pushed to a repository, the language detection workflow is initiated.

Tools used

OctoLingua, Linguist, Python, Keras, TensorFlow, ANN

Outcome

OctoLingua, built on an ANN architecture using Python, Keras, and TensorFlow, surpasses Linguist in accuracy and holds up under various conditions, learning primarily from code vocabulary rather than file extension metadata.
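To make "learning from code vocabulary" concrete, here is a minimal sketch of turning a snippet into a numeric feature vector that a neural network could take as input. It is an illustration under stated assumptions, not OctoLingua's actual feature pipeline: the tokenizer, the `build_vocabulary` and `vectorize` helpers, and the two-snippet corpus are all hypothetical, and the network itself is left out.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, plus single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def build_vocabulary(snippets, max_size=1000):
    """Map the most frequent tokens in the corpus to vector positions."""
    counts = Counter()
    for code in snippets:
        counts.update(TOKEN_RE.findall(code))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(code, vocabulary):
    """Return normalized token frequencies; no filename or extension is used."""
    vec = [0.0] * len(vocabulary)
    for tok in TOKEN_RE.findall(code):
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vec[idx] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


# Hypothetical two-snippet corpus, just to have a vocabulary to index into.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "function add(a, b) {\n  return a + b;\n}\n",
]
vocabulary = build_vocabulary(corpus)
features = vectorize("def sub(a, b):\n    return a - b\n", vocabulary)
```

Because the vector is built only from the snippet's contents, the same features are available for a Gist or an inline snippet that has no filename at all.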

What failed first

Linguist relies on heuristics and a Naive Bayes classifier trained on a small sample of data; it fails as soon as file extension information is altered or removed, revealing that it does not robustly learn from code vocabulary.
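For readers unfamiliar with the technique, the following is a minimal sketch of a multinomial Naive Bayes classifier over code tokens with add-one smoothing. It is not Linguist's implementation: the `TokenNaiveBayes` class, the tokenizer, and the four training samples are hypothetical and chosen only to keep the example small.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers/keywords, plus single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code):
    return TOKEN_RE.findall(code)


class TokenNaiveBayes:
    """Multinomial Naive Bayes over code tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> sample count
        self.vocab = set()

    def fit(self, samples):
        for lang, code in samples:
            tokens = tokenize(code)
            self.token_counts[lang].update(tokens)
            self.doc_counts[lang] += 1
            self.vocab.update(tokens)

    def predict(self, code):
        tokens = tokenize(code)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[lang] / total_docs)
            denom = sum(self.token_counts[lang].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[lang][tok] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Hypothetical, deliberately tiny training set.
SAMPLES = [
    ("Python", "def add(a, b):\n    return a + b\n"),
    ("Python", "import os\nfor name in os.listdir('.'):\n    print(name)\n"),
    ("JavaScript", "function add(a, b) {\n  return a + b;\n}\n"),
    ("JavaScript", "const fs = require('fs');\nconsole.log(fs.readdirSync('.'));\n"),
]

model = TokenNaiveBayes()
model.fit(SAMPLES)
prediction = model.predict("def greet(name):\n    print(name)\n")
```

With so few samples the model only knows a handful of tokens per language, which hints at why a classifier trained on a small sample can end up leaning on other signals such as the file extension.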

Results

Volume: 84% (the file-level accuracy reported for Linguist above)

Source

https://github.blog/ai-and-ml/machine-learning/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/


Grounding & classification

Source type: technical build writeup

20 fields verified against source quotes.

document classification, code diff pr, failure mode described, metric backed, named customer, tools described, workflow described, software, accuracy improvement, technical build writeup, quality assurance, extract classify route