Role & seniority

AI QA Engineer (mid to senior; 3–6+ years in QA/SDET or evaluation-focused ML/AI testing)

Stack / tools

Python for test automation and scenario evaluation
ML evaluation tools, LLM/RAG testing, model benchmarking suites
Vector databases, retrieval systems, multi-agent workflows, AI pipelines
CI/CD pipelines, DevOps tooling, observability platforms
Data quality validation, embeddings, retrieval accuracy, ranking/precision metrics

Top 3 responsibilities

Define and own end-to-end QA strategy across UI, backend, data, and AI components; design test plans and implement evaluation suites for LLMs, retrieval, and agent workflows
Conduct scenario-based, regression, and red-team tests; monitor AI behavior with metrics, logs, dashboards, and alerts; identify edge cases, bias, and reliability issues
Integrate automated tests into CI/CD; manage defect triage with clear reproduction docs; support delivery of safe, reliable AI for experimental and production features

Must-have skills

3–6+ years in QA, SDET, or evaluation-focused ML/AI testing; experience with nondeterministic/probabilistic systems
Strong Python scripting for automation, evaluation, and defect detection
Experience with ML evaluation tools, LLM/RAG testing, or model benchmarking
Familiarity with vector databases, retrieval systems, multi-agent workflows, and AI pipelines
Understanding of CI/CD, DevOps tooling, and observability; ability to validate data, embeddings, and ranking

Full Description

Our client is seeking an AI QA Engineer to ensure that AI and data-driven systems are reliable, safe, and high quality in fast-paced, experimental projects.

Roles and Responsibilities

Define and own the end-to-end QA strategy across UI, backend, data, and AI components.
Design and implement test plans covering functional, non-functional, and behavioral requirements.
Build automated and manual evaluation suites for LLMs, retrieval systems, and agent workflows.
Conduct scenario-based tests, regression tests, and red-team exercises to uncover edge cases and risks.
Validate data quality, embedding correctness, and retrieval accuracy, and monitor for model drift and hallucinations.
Define metrics, logs, dashboards, and alerts to monitor AI behavior, latency, cost, and errors.
Detect and escalate reliability, bias, and performance issues early in the delivery cycle.
Manage defect triage workflows, categorize failures across UI, API, data, and model layers, and ensure clear reproduction documentation.
Integrate automated tests into CI/CD pipelines to catch regressions early.
Support pods in delivering safe, reliable, and stable AI behavior for all experimental and production features.

Requirements

3–6+ years of experience in QA, SDET, or evaluation-focused ML/AI testing, preferably with nondeterministic or probabilistic systems.
Strong Python scripting skills for test automation, scenario evaluation, and defect detection.
Experience with ML evaluation tools, LLM/RAG testing, or model benchmarking suites.
Familiarity with vector databases, retrieval systems, multi-agent workflows, and AI pipelines.
Understanding of CI/CD pipelines, DevOps tooling, and observability platforms.
Ability to validate data correctness, embeddings, retrieval accuracy, and ranking/precision metrics.
Strong instincts for edge cases, risks, and adversarial failures, with skill in failure-mode analysis.
Curious, systematic, and detail-oriented mindset, with high ownership and discipline.
Excellent communication skills for clear, actionable defect reporting.

For more information – please apply for this job or send your CV directly and we will contact you to provide further details. Cavendish (Recruitment) Professionals Ltd are proud to be an equal opportunity employer and we believe that inclusivity begins with the candidate experience. All qualified applicants will receive consideration for employment regardless of gender, race, age, sexual orientation, religion, or belief.