
GitHub research: deep learning approach to natural language semantic code search

Code search on GitHub was limited to keyword matching: users had to know the exact syntax or anticipate keywords in surrounding comments, and could not search by natural-language intent.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Code summarization model training

A sequence-to-sequence model is trained to summarize code using (code, docstring) pairs as training data.
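One way such (code, docstring) pairs could be collected from Python source is sketched below. This is a minimal illustration using only the standard-library `ast` module (Python 3.9+ for `ast.unparse`); the function name `extract_pairs` and the sample input are hypothetical, and the source does not confirm that the pipeline prepares its data this way.

```python
import ast


def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Return (code, docstring) pairs for every documented function in `source`.

    Hypothetical helper for illustration; not taken from the source writeup.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if not doc:
            # Functions without a docstring give no summarization target.
            continue
        # Remove the docstring statement from the body so the model input
        # does not already contain the text it is asked to produce.
        node.body = node.body[1:] or [ast.Pass()]
        pairs.append((ast.unparse(node), doc))
    return pairs


SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

for code, doc in extract_pairs(SAMPLE):
    print(doc)
    print(code)
```

Each pair then serves as one training example: the code is the model input and the docstring is the target summary.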

Tools used

fairseq-py, Universal Sentence Encoder, TensorFlow Hub, fast.ai, Kubeflow

Outcome

The research system achieves a BLEU score of 13.5 on a holdout set of Python code and demonstrates semantic search returning relevant results even when no keywords are shared between the query and the code.
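Search without shared keywords is typically possible when queries and code are compared as vectors rather than as strings. The sketch below assumes that both have already been embedded into the same vector space and ranks code by cosine similarity. The three-dimensional vectors, the function names, and the `search` helper are made-up placeholders, not outputs or APIs of the research system.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict[str, list[float]], top_k: int = 2) -> list[str]:
    """Return the names of the `top_k` code items closest to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy embeddings (assumed values, for illustration only).
INDEX = {
    "read_csv_rows": [0.9, 0.1, 0.0],
    "send_http_request": [0.0, 0.2, 0.9],
    "parse_json_file": [0.7, 0.3, 0.1],
}

# Pretend this vector encodes a query such as "load a table from disk",
# which shares no keywords with any of the function names above.
QUERY = [0.8, 0.2, 0.0]

print(search(QUERY, INDEX))
```

The ranking depends only on vector proximity, so the quality of the results rests entirely on how well the embedding models place related code and text near each other.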

What failed first

An initial attempt using the Universal Sentence Encoder produced embeddings that worked reasonably but lacked specificity to software development vocabulary and semantics.

Results

BLEU score: 13.5

Source

https://github.blog/ai-and-ml/machine-learning/towards-natural-language-semantic-code-search/


Grounding & classification

Source type: technical build writeup

17 fields verified against source quotes.

enterprise search, knowledge search, summarization, knowledge base, failure mode described, metric backed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, back office ops, rag answering