Role & seniority: QA Manager at Scaled Cognition

Stack/tools: Python (intermediate), conversational AI/LLM systems, testing pipelines, evaluation benchmarks, production monitoring metrics; AI/LLM libraries and tooling

Top 3 responsibilities

Develop and implement scalable QA plans for evaluating AI agents; define KPIs to track progress over time
Collaborate with product and engineering to document findings, test fixes, and recommend improvements to models and conversational flows
Lead and mentor QA engineers; establish testing best practices and processes for conversational AI

Must-have skills

Intermediate Python
Experience building/testing conversational AI/LLM systems
Background in evaluation benchmarks and production monitoring metrics
Documentation precision for test plans, cases, and bug reports
Ability to work with AI tooling to enable rapid iteration

Nice-to-haves

Experience building automated testing pipelines for scalable QA
Familiarity with AI-powered assistants/tooling and rapid prototyping
History of cross-functional collaboration in product/engineering

Location & work type: Not specified in the provided text

Full Description

Scaled Cognition is the world’s only model lab dedicated exclusively to customer experience and pioneering agentic models purpose-built for reliable action-taking enterprise applications. Backed by Khosla Ventures, the company’s flagship Agentic Pretrained Transformer (APT) eliminates hallucinations, enforces enterprise policies and increases reliability in real-world CX workflows.

Founded by serial AI entrepreneurs, former Microsoft Corporate Vice President of Conversational AI Dan Roth, and UC Berkeley AI Professor Dan Klein, and built by a team of world-class PhD researchers and engineers, Scaled Cognition advances the science of agentic AI to deliver safe, policy-aligned automation that enterprises can trust.

As a QA Manager at Scaled Cognition, you will

Develop and implement scalable QA plans for evaluating AI agents, defining key performance metrics to measure progress over time.
Collaborate with product and engineering teams to document findings, test fixes, and recommend improvements to the underlying models and conversational flows.
Lead and mentor a team of QA engineers, establishing best practices and processes for testing conversational AI agents.

Example projects could include

Building test sets for regression tracking, agent robustness, and end-to-end testing.
Reviewing and analyzing voice and chat transcripts to quickly identify conversational gaps and provide data for faster iteration on customer deployments.
Designing and automating testing pipelines to scale QA capacity across a diverse portfolio of customers and to continuously evaluate the performance of our AI agents.

Preferred Qualifications

Intermediate-level proficiency in Python and experience building and testing conversational AI/LLM systems.
Background in implementing evaluation benchmarks and production monitoring metrics.
Experience working with libraries and tooling common in the AI/LLM ecosystem.
Demonstrated precision in documenting test plans, test cases, and bug reports, ensuring data is accurate and easily understandable by cross-functional teams.
Experience using AI-powered assistants/tooling to enable rapid iteration, prototyping, and accelerated delivery.