Dosu uses LangSmith to scale evaluation-driven development for their AI GitHub assistant
As Dosu's installation base grew, the team's manual approach of reviewing logs with grep and print statements stopped scaling. Monitoring responses and identifying failure modes in production, a step critical to their evaluation-driven development workflow, became nearly impossible. The broader problem Dosu was built to address is that up to 85% of developers' time is spent on non-coding tasks such as answering questions and triaging issues.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User submits GitHub issue
Users submit requests to Dosu via GitHub issues, ranging from simple codebase questions to error traces.
Tools used
LangSmith, LangChain, GitHub (partner), OpenAI
Outcome
LangSmith gave Dosu out-of-the-box visibility into all of its activity, enabling the team to identify unforeseen failure modes at scale and integrate production monitoring into its evaluation-driven development (EDD) workflow. The team is now building automated evaluation dataset collection from production traffic.
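The source does not describe how Dosu selects production traffic for its evaluation datasets. As a rough illustration only, the idea can be sketched in plain Python: keep the requests that errored or drew negative feedback, deduplicate them, and queue them as candidate examples. The `Trace` record, its fields, and `collect_eval_candidates` are hypothetical names introduced for this sketch; they are not LangSmith types or APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical record of one production request (not a LangSmith type)."""
    request: str
    response: str
    error: Optional[str] = None          # exception text if the run failed
    user_feedback: Optional[int] = None  # assumed convention: +1 good, -1 bad


def collect_eval_candidates(traces):
    """Return one candidate example per distinct failing request.

    A trace counts as failing if it raised an error or received
    negative user feedback. Only the first failure per request is kept.
    """
    candidates = []
    seen_requests = set()
    for trace in traces:
        failed = trace.error is not None or trace.user_feedback == -1
        if not failed or trace.request in seen_requests:
            continue
        seen_requests.add(trace.request)
        candidates.append({
            "input": trace.request,
            "observed_output": trace.response,
            "reason": "error" if trace.error is not None else "negative_feedback",
        })
    return candidates
```

Candidates gathered this way would still need a reviewed reference answer before they are useful as evaluation examples; the sketch only covers the selection step.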
What failed first
Manual log review with grep and print statements could not scale with Dosu's growth, and changes to LLM prompts frequently caused regressions in areas that had previously worked well.
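The prompt-regression problem above is the kind of failure an evaluation-driven workflow is meant to catch before release. A minimal sketch, assuming a fixed set of labeled examples grouped by category: score the old and new prompt versions per category, then flag any category whose pass rate dropped. All names here (`pass_rate_by_category`, `find_regressions`, the example fields `input`, `expected`, and `category`) are hypothetical and do not reflect Dosu's or LangSmith's actual implementation.

```python
def pass_rate_by_category(examples, answer_fn, grade_fn):
    """Run answer_fn on every example and return {category: pass rate}.

    answer_fn(input) stands in for one prompt version of the assistant.
    grade_fn(example, output) returns True when the output is acceptable.
    """
    totals, passes = {}, {}
    for example in examples:
        category = example["category"]
        totals[category] = totals.get(category, 0) + 1
        if grade_fn(example, answer_fn(example["input"])):
            passes[category] = passes.get(category, 0) + 1
    return {cat: passes.get(cat, 0) / totals[cat] for cat in totals}


def find_regressions(baseline, candidate, tolerance=0.0):
    """List categories whose pass rate fell by more than tolerance."""
    return sorted(
        cat for cat, base_rate in baseline.items()
        if candidate.get(cat, 0.0) < base_rate - tolerance
    )
```

In use, `answer_fn` would wrap the assistant with a given prompt, and a non-empty result from `find_regressions` would block the prompt change until the affected category is fixed.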