
AI / LLM Quality Engineer

abra · Wheaton, Illinois, United States

Onsite · Full-time
Posted Apr 20, 2026

**Role & seniority:** Senior AI Evaluation & Reliability Engineer for an agentic analytics platform

**Location & work type:** Wheaton, Illinois, United States; onsite, full-time

**Stack/tools:**

  • Programming: Python

  • LLM evaluation / observability tools: Google ADK, Opik; LangSmith and similar frameworks listed as nice-to-haves

  • Conceptual tooling: LLM-as-a-judge pipelines, prompt engineering, agent frameworks, testing infrastructure

  • Data: synthetic + real-world datasets

**Top 3 responsibilities:**

  1. Design and implement evaluation methodologies/frameworks for AI agents and multi-agent systems

  2. Build LLM-as-a-judge pipelines and agent-based evaluation systems to assess correctness and output quality

  3. Define metrics/benchmarks/scorecards, analyze failure modes/edge cases, and improve production reliability/robustness

**Must-have skills:**

  • 4–8+ years in software engineering, AI systems, or evaluation/QA engineering

  • Strong Python programming

  • Hands-on LLM production experience

  • Experience building evaluation/automation/testing infrastructure

  • Strong understanding of prompting, tool use, and agent behavior

  • Metrics- and reliability-oriented thinking

**Nice-to-haves:**

  • Experience with LLM evaluation frameworks (e.g., Opik, LangSmith)

  • Google ADK / agent framework experience

  • LLM-as-a-judge or ranking systems implementation

Full Description

abra R&D is looking for a Reliability Engineer who will take part in building the next-generation agentic analytics platform, the first real-time database optimized for AI agents at scale. We’re looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. This role sits at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.
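
As a rough illustration of the LLM-as-a-judge work described above, here is a minimal sketch in Python. The `call_llm` stub, the judge prompt, and the 1–5 scale are assumptions made for this example, not abra's actual implementation.

```python
import json

# Hypothetical judge prompt; a production pipeline would version-control
# and regression-test this prompt like any other code.
JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Respond with JSON only: {{"score": <integer 1-5>, "rationale": "<one sentence>"}}"""


def call_llm(prompt: str) -> str:
    """Placeholder for a real model client (OpenAI, Vertex AI, etc.)."""
    return '{"score": 4, "rationale": "Correct, but omits one detail."}'


def judge(question: str, reference: str, answer: str) -> dict:
    """Score one agent answer against a reference using a judge model."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    verdict = json.loads(raw)
    if not 1 <= verdict["score"] <= 5:
        raise ValueError("judge returned an out-of-range score")
    return verdict


if __name__ == "__main__":
    print(judge("What is 2 + 2?", "4", "The answer is 4."))
```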

What You’ll Do

  • Design and implement evaluation frameworks for AI agents and multi-agent systems
  • Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality
  • Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
  • Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance (see the sketch after this list)
  • Build data-driven evaluation pipelines using synthetic and real-world datasets
  • Identify and analyze failure modes, edge cases, and non-deterministic behaviors
  • Improve agent robustness, consistency, and reliability in production environments
  • Work with tools such as Google ADK, Opik, and related evaluation frameworks
  • Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
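
The scorecard and non-determinism items above can be made concrete with a toy aggregation over repeated runs of an agent; the test-case ids, the 1–5 judge scale, and the pass threshold below are hypothetical.

```python
from statistics import mean, pstdev


def scorecard(runs: dict[str, list[int]], pass_threshold: int = 4) -> dict:
    """Aggregate judge scores from repeated runs into a simple scorecard.

    `runs` maps a test-case id to the judge scores from N repeated runs,
    which makes non-deterministic (flaky) behavior visible.
    """
    per_case_pass = {
        case: mean(score >= pass_threshold for score in scores)
        for case, scores in runs.items()
    }
    return {
        "pass_rate": mean(per_case_pass.values()),
        "flaky_cases": sorted(c for c, p in per_case_pass.items() if 0 < p < 1),
        "score_spread": mean(pstdev(scores) for scores in runs.values()),
    }


if __name__ == "__main__":
    print(scorecard({"q1": [5, 5, 4], "q2": [5, 2, 5], "q3": [1, 1, 2]}))
```

Tracking per-case pass rates across repeated runs is one simple way to surface flaky behavior before it reaches production.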

Requirements

Must have

  • 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
  • Strong programming skills in Python
  • Hands-on experience working with LLMs in production environments
  • Experience building evaluation systems, automation frameworks, or testing infrastructure
  • Strong understanding of prompt engineering, tool use, and agent behavior
  • Ability to think in terms of metrics, correctness, and system reliability

Nice to have

  • Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
  • Experience with Google ADK / agent frameworks
  • Experience implementing LLM-as-a-judge or ranking systems
  • Background in data systems, analytics, or real-time pipelines
  • Experience with multi-agent systems
  • Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems); a minimal sketch follows this list
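
For the statistical-evaluation item, here is a minimal paired-bootstrap comparison of two agent variants; the scores and variant names are made up for illustration, and a real setup might use a sign test or a full A/B framework instead.

```python
import random

random.seed(0)  # deterministic demo


def bootstrap_win_prob(a: list[float], b: list[float], iters: int = 10_000) -> float:
    """Paired bootstrap: fraction of resampled eval sets where A's mean beats B's."""
    n, wins = len(a), 0
    for _ in range(iters):
        idx = [random.randrange(n) for _ in range(n)]  # resample cases with replacement
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iters


if __name__ == "__main__":
    scores_a = [4, 5, 3, 4, 5, 4, 4, 5]  # judge scores, agent variant A
    scores_b = [3, 4, 3, 4, 4, 4, 3, 5]  # judge scores, agent variant B
    print(f"P(mean A > mean B) ≈ {bootstrap_win_prob(scores_a, scores_b):.2f}")
```
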
Tags: Python, LLM, AI Evaluation, Reliability Engineering, Prompt Engineering, Agentic Analytics, Testing Frameworks, Data-driven Pipelines, Multi-agent Systems, Google ADK, Opik, LangSmith, Statistical Evaluation, System Reliability, Production Engineering, Multi-location
