
AI / LLM Quality Engineer
abra • Wheaton, Illinois, United States
**Role & seniority:** Senior AI Evaluation & Reliability Engineer (also titled simply "Reliability Engineer") for an agentic analytics platform
**Location & work type:** Wheaton, Illinois, United States (per the listing header); remote/hybrid arrangement not stated
**Stack/tools:**
- Programming: Python
- LLM evaluation / observability tools: Google ADK, Opik, and similar frameworks (LangSmith-style)
- Conceptual tooling: LLM-as-a-judge pipelines, prompt engineering, agent frameworks, testing infrastructure
- Data: synthetic + real-world datasets
**Top 3 responsibilities:**
- Design and implement evaluation methodologies/frameworks for AI agents and multi-agent systems
- Build LLM-as-a-judge pipelines and agent-based evaluation systems to assess correctness and output quality
- Define metrics/benchmarks/scorecards, analyze failure modes/edge cases, and improve production reliability/robustness
**Must-have skills:**
- 4–8+ years in software engineering, AI systems, or evaluation/QA engineering
- Strong Python programming
- Hands-on LLM production experience
- Experience building evaluation/automation/testing infrastructure
- Strong understanding of prompting, tool use, and agent behavior
- Metrics- and reliability-oriented thinking
**Nice-to-haves:**
- Experience with LLM evaluation frameworks (e.g., Opik, LangSmith)
- Google ADK / agent framework experience
- LLM-as-a-judge or ranking systems implementation
Full Description
abra R&D is looking for a Reliability Engineer who will take part in building the next-generation agentic analytics platform: the first real-time database optimized for AI agents at scale. We're looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. The role sits at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure the correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.
What You’ll Do
- Design and implement evaluation frameworks for AI agents and multi-agent systems
- Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality (a minimal sketch follows this list)
- Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
- Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
- Build data-driven evaluation pipelines using synthetic and real-world datasets
- Identify and analyze failure modes, edge cases, and non-deterministic behaviors
- Improve agent robustness, consistency, and reliability in production environments
- Work with tools such as Google ADK, Opik, and related evaluation frameworks
- Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
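Since LLM-as-a-judge pipelines are the core technique in this role, here is a minimal sketch of the pattern. It is purely illustrative, not abra's actual pipeline: `call_llm` is a hypothetical stand-in for a real model client, and the prompt, JSON schema, and scorecard fields are assumptions; frameworks such as Opik or LangSmith provide hardened versions of this loop.

```python
# Illustrative LLM-as-a-judge pipeline (stdlib only). `call_llm`, the prompt,
# and the score schema are assumptions for the sketch, not a real API.
import json
from dataclasses import dataclass

JUDGE_PROMPT = """You are an evaluation judge. Score the answer to the question.

Question: {question}
Answer: {answer}

Reply with JSON only: {{"correctness": <0-1>, "reasoning_quality": <0-1>, "rationale": "<short explanation>"}}"""


@dataclass
class JudgeResult:
    correctness: float
    reasoning_quality: float
    rationale: str


def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real client (OpenAI, Vertex AI, etc.)."""
    raise NotImplementedError("plug in a real LLM client here")


def judge(question: str, answer: str) -> JudgeResult:
    """Ask the judge model for scores and parse its JSON verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    data = json.loads(raw)  # production code validates and retries on bad JSON
    return JudgeResult(
        correctness=float(data["correctness"]),
        reasoning_quality=float(data["reasoning_quality"]),
        rationale=str(data["rationale"]),
    )


def scorecard(cases: list[dict]) -> dict:
    """Aggregate per-case judge scores into a simple scorecard."""
    results = [judge(c["question"], c["answer"]) for c in cases]
    n = len(results)
    return {
        "n": n,
        "mean_correctness": sum(r.correctness for r in results) / n,
        "mean_reasoning_quality": sum(r.reasoning_quality for r in results) / n,
    }
```

In practice the judge prompt, the output schema, and the aggregation would all be versioned and validated (malformed judge output retried or discarded), which is exactly the testing-infrastructure work the responsibilities above describe.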
Requirements
Must have
- 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
- Strong programming skills in Python
- Hands-on experience working with LLMs in production environments
- Experience building evaluation systems, automation frameworks, or testing infrastructure
- Strong understanding of prompt engineering, tool use, and agent behavior
- Ability to think in terms of metrics, correctness, and system reliability
Nice to have
- Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
- Experience with Google ADK / agent frameworks
- Experience implementing LLM-as-a-judge or ranking systems
- Background in data systems, analytics, or real-time pipelines
- Experience with multi-agent systems
- Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)
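As a small illustration of the statistical-evaluation point in the last bullet: a common pattern is to compare two agent variants via pairwise judge verdicts and report the win rate with a confidence interval rather than a bare percentage. The verdicts below are made-up data and the function is my own sketch of the idea, not a method the posting prescribes.

```python
# Illustrative only: win rate of variant B over variant A from pairwise judge
# verdicts, with a 95% Wilson score confidence interval (stdlib only).
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)


# Hypothetical verdicts: True where the judge preferred variant B.
verdicts = [True, True, False, True, False, True, True, True, False, True]
wins, n = sum(verdicts), len(verdicts)
lo, hi = wilson_interval(wins, n)
print(f"B win rate: {wins / n:.2f} (95% CI {lo:.2f}-{hi:.2f}, n={n})")
```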