
Evaluation Scenario Writer - AI Agent Testing Specialist
Mindrift • Hyderabad, Telangana, India
Salary: $12 / hour
Role & seniority: Experienced software developer / test automation specialist; part-time, project-based (non-permanent)
Stack/tools
- Languages/QA: Python (pytest, async/await, subprocess, file operations); functional and integration testing
- Front-end: React-based interfaces
- DevOps: Docker, CI/CD (GitHub Actions)
- Environment: working with production codebases; containerized evaluations
Top 3 responsibilities
- Design and refine challenging coding test cases and comprehensive functional tests
- Craft “fair but hard” tasks with full context and complex reasoning requirements
- Analyze AI failures, identify gaps, and iterate based on expert QA feedback
Must-have skills
- 5+ years in software development
- Proficiency in Python (pytest, async/await, subprocess, file I/O)
- Experience with full-stack development (React front-end, robust back-end)
- Testing experience (functional and integration)
- Docker familiarity; running evaluations in containers
- CI/CD understanding (GitHub Actions)
- Degree in Computer Science, Software Engineering, or related field
- English proficiency at B2
Nice-to-haves
- Deeper back-end systems experience
- Production coding experience with AI/testing focus
- Prior experience with AI evaluation or QA review processes
Location & work type
- Remote/telework feasible; project-based, part-time, non-permanent
- Compensation details vary by project and role; rates up to $12/hour or projec
Full Description
Please submit your CV in English and indicate your level of English proficiency.
Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.
What this opportunity involves
You’ll create challenging coding test cases that push AI coding systems to their limits:
- Review and refine realistic coding tasks based on provided production codebases with realistic scope, requirements and information sources
- Write comprehensive functional tests that validate actual end-to-end behavior and edge cases, not just superficial checks
- Craft “fair but hard” challenges where the AI has all the context it needs, but has to work for it (information scattered across files and external sources, complex reasoning required)
- Analyze AI failures to understand what the model struggles with vs. what it masters
- Iterate based on feedback from expert QA reviewers who score your work on 7 quality criteria
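To illustrate the difference between a superficial check and a functional test that exercises end-to-end behavior and edge cases, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny command-line program under test, its error message, and the helper names (`make_script`, `run_cli`, `superficial_check`, `functional_check`) are hypothetical and do not come from any Mindrift project.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical program under test: a tiny CLI that sums integer arguments.
CLI_SOURCE = """
import sys

def main(argv):
    try:
        numbers = [int(a) for a in argv]
    except ValueError:
        print("error: arguments must be integers", file=sys.stderr)
        return 2
    print(sum(numbers))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
"""


def make_script() -> Path:
    """Write the hypothetical CLI to a fresh temporary directory."""
    script = Path(tempfile.mkdtemp()) / "sum_cli.py"
    script.write_text(CLI_SOURCE)
    return script


def run_cli(script: Path, *args: str) -> subprocess.CompletedProcess:
    """Run the CLI as a real subprocess, the way a user would."""
    return subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
        timeout=30,
    )


def superficial_check(script: Path) -> bool:
    """Only confirms the program exits cleanly; says nothing about correctness."""
    return run_cli(script, "1").returncode == 0


def functional_check(script: Path) -> bool:
    """Checks actual output plus two edge cases: no input and invalid input."""
    ok = run_cli(script, "2", "3", "-1")
    empty = run_cli(script)
    bad = run_cli(script, "2", "x")
    return (
        ok.returncode == 0 and ok.stdout.strip() == "4"
        and empty.returncode == 0 and empty.stdout.strip() == "0"
        and bad.returncode == 2 and "integers" in bad.stderr
        and bad.stdout == ""
    )
```

In a pytest suite, the same checks would typically become separate test functions, with the built-in `tmp_path` fixture replacing `make_script`'s temporary directory.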
What we look for
This opportunity is a good fit for experienced developers, software engineers, and/or test automation specialists open to part-time, non-permanent projects. Ideally, contributors will have:
- Degree in Computer Science, Software Engineering or related fields
- 5+ years in software development, primarily Python (pytest, async/await, subprocess, file operations)
- Background in Full-Stack development, with an equal focus on building React-based interfaces and robust Back-end systems
- Experience writing tests (functional, integration – not just running them)
- Docker containers (running evaluations locally in containers)
- CI/CD understanding (GitHub Actions as a user: triggers, labels, reading results)
- English proficiency: B2
How it works
Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid
Effort estimate
Tasks for this project are estimated to take 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.
Payment
- Paid contributions, with rates up to $12/hour*
- Fixed project rate or individual rates, depending on the project
- Some projects include incentive payments
*Note: Rates vary based on expertise, skills assessment, location, project needs, and other factors. Higher rates may be offered to highly specialized experts. Lower rates may apply during onboarding or non-core project phases. Payment details are shared per project.