
BharatGen • Pune, Maharashtra, India
Role & seniority: Senior AI Evaluation & Test Engineer
Stack/tools: Python; scripting; test automation (Pytest, Selenium, Robot Framework); observability/tracing (logs, spans, session tracking); AI concepts (RAG, prompt engineering, explainability, guard rails); CI/CD release gates; familiarity with AI evaluation frameworks (e.g., Arize, Braintrust, DeepEval, LangSmith, Ragas) is a plus
Build and maintain AI evaluation pipelines to test, measure, and evaluate AI system behavior and performance
Define AI quality metrics/KPIs (factuality, faithfulness, toxicity, grounding precision/recall, latency, cost) with clear acceptance bars; implement release gates in CI/CD (see the Pytest sketch after this list)
Implement automated evaluation/testing (end-to-end and regression) and assist with root-cause analysis; collaborate cross-functionally to shape user-facing AI behavior
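A minimal sketch of what such a release gate could look like, assuming a Pytest-based pipeline: the metric names follow the posting, while `score_response`, the golden prompts, and the acceptance bars are illustrative placeholders rather than the team's actual setup.

```python
# Hypothetical CI/CD release gate expressed as a Pytest suite.
# A failed assertion fails the CI job and blocks the release.
from dataclasses import dataclass

import pytest


@dataclass
class EvalResult:
    faithfulness: float         # 0..1, higher is better
    grounding_precision: float  # 0..1, higher is better
    latency_ms: float           # end-to-end latency in milliseconds


def score_response(prompt: str) -> EvalResult:
    """Hypothetical scorer: in practice this would call the AI system and an
    evaluation backend, then aggregate the metrics for this prompt."""
    return EvalResult(faithfulness=0.92, grounding_precision=0.88, latency_ms=850.0)


# Acceptance bars (illustrative values, not real thresholds).
BARS = {"faithfulness": 0.85, "grounding_precision": 0.80, "latency_ms": 2000.0}

GOLDEN_PROMPTS = [
    "What does the warranty cover?",
    "Summarize the attached incident report.",
]


@pytest.mark.parametrize("prompt", GOLDEN_PROMPTS)
def test_release_gate(prompt):
    result = score_response(prompt)
    assert result.faithfulness >= BARS["faithfulness"]
    assert result.grounding_precision >= BARS["grounding_precision"]
    assert result.latency_ms <= BARS["latency_ms"]
```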
BS/MS in CS/CE/IT/EE or related field; 5+ years in software testing with at least 2 years evaluating AI/ML products
Strong testing fundamentals: test plans, test cases, reports/dashboards; analytical debugging; attention to detail
Proficiency in Python and automation frameworks (Pytest, Selenium, Robot Framework)
Working knowledge of generative AI models and related concepts; understanding of the differences between traditional software testing and AI evaluation
Team player, good communication, able to work in fast-paced/startup environments
Strong software testing fundamentals and expertise in writing test plans, executing test cases, and generating detailed reports and dashboards.
Strong analytical and debugging skills, and attention to detail.
Proficiency in Python, scripting, and software testing automation frameworks and tools such as Pytest, Selenium, Robot Framework, etc.
Working knowledge of generative AI models, AI agents, and related concepts such as retrieval augmented generation (RAG), prompt engineering, context engineering, explainability, traceability, observability, guard rails, reasoning, specificity, etc.
Sound understanding of the fundamental differences between testing conventional software and evaluating generative AI systems (illustrated in the sketch after this list).
Team player with excellent interpersonal skills and the ability to collaborate effectively with remote and cross-functional team members.
Go-getter attitude and ability to flourish in a fast-paced, startup environment.
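To make that distinction concrete, here is a minimal, hypothetical sketch: a conventional test asserts exact equality, while a generative-AI check compares output to a reference with a tolerance threshold (the word-overlap similarity is only a stand-in for an embedding- or judge-based metric).

```python
def similarity(a: str, b: str) -> float:
    """Hypothetical semantic-similarity proxy (word-overlap Jaccard);
    a real pipeline would use embeddings or an LLM judge instead."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def test_conventional_software():
    # Deterministic code: an exact-match assertion is appropriate.
    assert sorted([3, 1, 2]) == [1, 2, 3]


def test_generative_ai_output():
    # Non-deterministic model output: compare against a reference answer
    # with a tolerance threshold instead of exact string equality.
    reference = "The warranty covers manufacturing defects for two years."
    candidate = "Manufacturing defects are covered by the warranty for two years."
    assert similarity(candidate, reference) >= 0.4
```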
Experience in any of the following would be a big plus:
AI evaluation frameworks such as Arize, Braintrust, DeepEval, LangSmith, Ragas
AI safety and red teaming experience, e.g., prompt injection, jailbreak, adversarial and stress testing.
Different types of AI evaluation methods, e.g., Human-in-the-loop, LLM-as-a-Judge (see the sketch below).
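As a rough illustration of the LLM-as-a-Judge pattern, the sketch below scores an answer for faithfulness against a context using a rubric prompt; `call_judge_model` is a hypothetical stand-in for a real LLM API call, and the JSON verdict schema is an assumption, not any specific framework's interface.

```python
# Minimal LLM-as-a-Judge sketch: prompt a judge model with a rubric and
# parse a structured verdict.
import json

RUBRIC = (
    "Score the ANSWER for faithfulness to the CONTEXT on a 1-5 scale. "
    'Respond with JSON: {"score": <int>, "reason": "<short justification>"}'
)


def call_judge_model(prompt: str) -> str:
    """Hypothetical judge call; replace with a real LLM client."""
    return json.dumps({"score": 5, "reason": "Answer is fully supported by the context."})


def judge_faithfulness(context: str, answer: str) -> dict:
    prompt = f"{RUBRIC}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
    return json.loads(call_judge_model(prompt))


if __name__ == "__main__":
    verdict = judge_faithfulness(
        context="The warranty covers manufacturing defects for two years.",
        answer="Defects from manufacturing are covered for two years.",
    )
    print(verdict)  # e.g. {"score": 5, "reason": "..."}
```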