
AI / LLM Quality Engineer
abra • Wheaton, Illinois, United States
**Role & seniority:** Senior AI Evaluation & Reliability Engineer (also titled simply "Reliability Engineer") for an agentic analytics platform
**Location & work type:** Wheaton, Illinois, United States (per the listing header); remote/hybrid arrangement not stated
**Stack/tools:**
- Programming: Python
- LLM evaluation / observability tools: Google ADK, Opik, and similar frameworks (LangSmith-style)
- Conceptual tooling: LLM-as-a-judge pipelines, prompt engineering, agent frameworks, testing infrastructure
- Data: synthetic + real-world datasets
**Top 3 responsibilities:**
- Design and implement evaluation methodologies/frameworks for AI agents and multi-agent systems
- Build LLM-as-a-judge pipelines and agent-based evaluation systems to assess correctness and output quality
- Define metrics/benchmarks/scorecards, analyze failure modes/edge cases, and improve production reliability/robustness
**Must-have skills:**
- 4–8+ years in software engineering, AI systems, or evaluation/QA engineering
- Strong Python programming
- Hands-on LLM production experience
- Experience building evaluation/automation/testing infrastructure
- Strong understanding of prompting, tool use, and agent behavior
- Metrics- and reliability-oriented thinking
**Nice-to-haves:**
- Experience with LLM evaluation frameworks (e.g., Opik, LangSmith)
- Google ADK / agent framework experience
- LLM-as-a-judge or ranking systems implementation
Full Description
abra R&D is looking for a Reliability Engineer who will take part in building the next-generation agentic analytics platform: the first real-time database optimized for AI agents at scale. We're looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. The role sits at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure the correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.
What You’ll Do
- Design and implement evaluation frameworks for AI agents and multi-agent systems
- Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality (a minimal sketch follows this list)
- Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
- Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
- Build data-driven evaluation pipelines using synthetic and real-world datasets
- Identify and analyze failure modes, edge cases, and non-deterministic behaviors
- Improve agent robustness, consistency, and reliability in production environments
- Work with tools such as Google ADK, Opik, and related evaluation frameworks
- Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
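Since LLM-as-a-judge pipelines are the core technique in this role, here is a minimal sketch of the pattern. It is purely illustrative, not abra's actual pipeline: `call_llm` is a hypothetical stand-in for a real model client, and the prompt, JSON schema, and scorecard fields are assumptions; frameworks such as Opik or LangSmith provide hardened versions of this loop.

```python
# Illustrative LLM-as-a-judge pipeline (stdlib only). `call_llm`, the prompt,
# and the score schema are assumptions for the sketch, not a real API.
import json
from dataclasses import dataclass

JUDGE_PROMPT = """You are an evaluation judge. Score the answer to the question.

Question: {question}
Answer: {answer}

Reply with JSON only: {{"correctness": <0-1>, "reasoning_quality": <0-1>, "rationale": "<short explanation>"}}"""


@dataclass
class JudgeResult:
    correctness: float
    reasoning_quality: float
    rationale: str


def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real client (OpenAI, Vertex AI, etc.)."""
    raise NotImplementedError("plug in a real LLM client here")


def judge(question: str, answer: str) -> JudgeResult:
    """Ask the judge model for scores and parse its JSON verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    data = json.loads(raw)  # production code validates and retries on bad JSON
    return JudgeResult(
        correctness=float(data["correctness"]),
        reasoning_quality=float(data["reasoning_quality"]),
        rationale=str(data["rationale"]),
    )


def scorecard(cases: list[dict]) -> dict:
    """Aggregate per-case judge scores into a simple scorecard."""
    results = [judge(c["question"], c["answer"]) for c in cases]
    n = len(results)
    return {
        "n": n,
        "mean_correctness": sum(r.correctness for r in results) / n,
        "mean_reasoning_quality": sum(r.reasoning_quality for r in results) / n,
    }
```

In practice the judge prompt, the output schema, and the aggregation would all be versioned and validated (malformed judge output retried or discarded), which is exactly the testing-infrastructure work the responsibilities above describe.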
Requirements
Must have
- 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
- Strong programming skills in Python
- Hands-on experience working with LLMs in production environments
- Experience building evaluation systems, automation frameworks, or testing infrastructure
- Strong understanding of prompt engineering, tool use, and agent behavior
- Ability to think in terms of metrics, correctness, and system reliability
Nice to have
- Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
- Experience with Google ADK / agent frameworks
- Experience implementing LLM-as-a-judge or ranking systems
- Background in data systems, analytics, or real-time pipelines
- Experience with multi-agent systems
- Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)
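As a small illustration of the statistical-evaluation point in the last bullet: a common pattern is to compare two agent variants via pairwise judge verdicts and report the win rate with a confidence interval rather than a bare percentage. The verdicts below are made-up data and the function is my own sketch of the idea, not a method the posting prescribes.

```python
# Illustrative only: win rate of variant B over variant A from pairwise judge
# verdicts, with a 95% Wilson score confidence interval (stdlib only).
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)


# Hypothetical verdicts: True where the judge preferred variant B.
verdicts = [True, True, False, True, False, True, True, True, False, True]
wins, n = sum(verdicts), len(verdicts)
lo, hi = wilson_interval(wins, n)
print(f"B win rate: {wins / n:.2f} (95% CI {lo:.2f}-{hi:.2f}, n={n})")
```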