
Evaluation Scenario Writer - AI Agent Testing Specialist

Mindrift Kuwait

Remote, part-time

Salary: $40 / hour

Posted Feb 22, 2026
  • Role & seniority

    • Project-based contributor (non-permanent); suited to experienced developers, software engineers, and/or test automation specialists

    • 5+ years in software development; degree in Computer Science, Software Engineering, or related field

  • Stack / tools

    • Python (pytest, async/await, subprocess, file operations)

    • Full-Stack: React-based front-end + robust back-end systems

    • Docker (running evaluations locally in containers)

    • CI/CD: GitHub Actions (as a user: triggers, labels, reading results)

  • Top 3 responsibilities

    • Create challenging coding test cases that stress AI coding systems

    • Review and refine realistic coding tasks based on production codebases; write comprehensive end-to-end functional tests

    • Analyze AI failures to distinguish what the model struggles with from what it masters; iterate on feedback from expert QA reviewers

  • Must-have skills

    • Degree in CS/Software Engineering or related field

    • 5+ years in software development; strong Python background

    • Full-Stack experience (React front-end + back-end)

    • Experience writing functional and integration tests

    • Docker proficiency; ability to run evaluations locally

    • CI/CD knowledge (GitHub Actions usage)

    • English proficiency at B2

  • Nice-to-haves

    • Prior experience with test automation and evaluating AI/ML systems

    • Advanced understanding of testing strategies beyond basic test execution

    • Availability for part-time engagement and project-based workloads

  • Location & work type

    • Remote, pr

Full Description

Please submit your CV in English and indicate your level of English proficiency. Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.

What this opportunity involves

You’ll create challenging coding test cases that push AI coding systems to their limits

  • Review and refine realistic coding tasks based on provided production codebases with realistic scope, requirements and information sources
  • Write comprehensive functional tests that validate actual end-to-end behavior and edge-cases, not just superficial checks
  • Craft “fair but hard” challenges where the AI has all the context it needs, but has to work for it (information scattered across files and external sources, complex reasoning required)
  • Analyze AI failures to understand what the model struggles with vs. what it masters
  • Iterate based on feedback from expert QA reviewers who score your work on 7 quality criteria
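
To illustrate the difference between a superficial check and a functional test that validates end-to-end behavior and edge cases, here is a minimal sketch. It is not project material: the program under test, the names `GOOD_SRC`, `BUGGY_SRC`, `write_script`, `run_cli`, `superficial_check`, and `functional_check`, and the chosen edge cases are all assumptions made for this example.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical program under test: sums integers read one per line from stdin.
GOOD_SRC = "import sys\nprint(sum(int(line) for line in sys.stdin if line.strip()))\n"
# Same hypothetical program with a bug: a blank input line makes int() raise.
BUGGY_SRC = "import sys\nprint(sum(int(line) for line in sys.stdin))\n"


def write_script(directory: Path, name: str, source: str) -> Path:
    """Write the program under test to disk so it can be run as a real process."""
    script = directory / name
    script.write_text(source)
    return script


def run_cli(script: Path, stdin_text: str) -> subprocess.CompletedProcess:
    """Run the script end to end, the way a user would invoke it."""
    return subprocess.run(
        [sys.executable, str(script)],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )


def superficial_check(script: Path) -> bool:
    """Weak check: only confirms the process exits cleanly on one easy input."""
    return run_cli(script, "1\n2\n").returncode == 0


def functional_check(script: Path) -> bool:
    """Stronger check: verifies the actual output, including edge cases."""
    cases = {
        "1\n2\n3\n": "6",   # typical input
        "": "0",            # empty input
        "\n5\n\n": "5",     # blank lines around a value
        "-4\n4\n": "0",     # negative numbers
    }
    for stdin_text, expected in cases.items():
        result = run_cli(script, stdin_text)
        if result.returncode != 0 or result.stdout.strip() != expected:
            return False
    return True
```

In this sketch both versions of the program pass `superficial_check`, but only the correct one passes `functional_check`, because the blank-line case exposes the bug.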

What we look for

This opportunity is a good fit for experienced developers, software engineers, and/or test automation specialists open to part-time, non-permanent projects. Ideally, contributors will have

  • Degree in Computer Science, Software Engineering or related fields
  • 5+ years in software development, primarily Python (pytest, async/await, subprocess, file operations)
  • Background in Full-Stack development, with an equal focus on building React-based interfaces and robust Back-end systems
  • Experience writing functional and integration tests, not just running them
  • Docker (running evaluations locally in containers)

  • CI/CD understanding (GitHub Actions as a user: triggers, labels, reading results)
  • English proficiency: B2

How it works

Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid

Effort estimate

Tasks for this project are estimated to take 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.

Payment

  • Paid contributions, with rates up to $40/hour*
  • Fixed project rate or individual rates, depending on the project
  • Some projects include incentive payments

*Note: Rates vary based on expertise, skills assessment, location, project needs, and other factors. Higher rates may be offered to highly specialized experts. Lower rates may apply during onboarding or non-core project phases. Payment details are shared per project.

Python, Pytest, Async/Await, Subprocess, File Operations, Full-Stack Development, React, Back-end Systems, Functional Tests, Integration Tests, Docker, CI/CD, GitHub Actions, AI Agent Testing, Coding Test Case Creation, Edge-Case Testing
