
【JAPAN AI】AI QA Specialist (LLM Evaluation) / English
株式会社ジーニー • Tokyo, Tokyo, Japan
Salary: JPY 500,000 - JPY 1,000,000 / month
Role & seniority
- AI QA Specialist (LLM Evaluation) (mid-level) — 3+ years practical experience

Stack/tools
- Languages: Python; TypeScript/React/Next.js (NX)
- Evaluation/QA: pytest, LangSmith, Weights & Biases, custom eval frameworks
- Data: BigQuery, Spark, Pandas
- Infra/DevOps: GCP (containers/K8s), Docker, Terraform
- CI/CD: GitHub Actions
- Other: Slack, Confluence, Linear, Google Workspace, GitHub, Notion
Top 3 responsibilities
- Design, build, and operate quality evaluation infrastructure for AI agents (metrics, datasets, pipelines)
- Execute red teaming and safety/policy verification pre-release; run regression tests
- Use statistical experimental design (A/B testing, significance tests) to quantify quality improvements and report results
Must-have skills
- Knowledge of LLM/genAI evaluation (factuality/hallucination detection, quantitative quality measurement)
- Foundations in statistics and experimental design
- Building evaluation pipelines in Python and integrating tests into CI/CD
- Designing prompt/tool regression tests
- Language: Japanese (fluent) or English (business level)
Nice-to-haves
- NLP/ML benchmark/eval design experience
- AI safety / Responsible AI knowledge
- Red teaming / penetration testing experience
- Multi-agent workflows, tool use, long-context evaluation experience
- Large-scale data processing experience (Spark/BigQuery)
Full Description
About JAPAN AI
JAPAN AI, Inc. was established in April 2023 as a group company of Geniee, Inc. (TSE Growth Market) with the mission of dramatically expanding human potential through AI technology. We drive cutting-edge AI R&D both domestically and internationally.
Our ambition goes far beyond building AI chatbots. We are building "the brain of the enterprise" — a next-generation core system where AI autonomously executes business operations by integrating all of a company's SaaS tools. With JAPAN AI STUDIO at the center, we are implementing a world where — given a database — no separate application is needed; AI performs the work and returns only the results.
Through the transformative power of AI, we aim to create new value and contribute to the advancement of society as a whole. Join us in leading AI innovation and shaping a future where technology empowers people to achieve more.
Related URLs
Our Website · Company Introduction Materials · Tech Blog · Careers
Why We're Hiring
The output quality of AI agents is directly tied to enterprise operations. "Sort of working" is not acceptable.
In a world where JAPAN AI STUDIO functions as "the brain of the enterprise" — autonomously executing tasks such as approval workflows, resource allocation, and prospect discovery — a wrong AI output means approvals that should have been rejected go through, incorrect staffing decisions are made, and inappropriate customers are approached. For "the brain of the enterprise" to be trusted, a system that scientifically evaluates and guarantees the accuracy, safety, and consistency of generated responses is essential.
JAPAN AI is hiring an AI QA Specialist to build a quality assurance framework — based on automated evaluation pipelines, red teaming, and statistical experimental design — that scientifically guarantees the quality of AI agents used in production by over 200 companies.
Mission
"Scientifically evaluate and guarantee the output quality of agents."
Evaluate and guarantee AI agent output quality through scientific methods. Build systems for automated evaluation, red teaming, safety verification, and regression detection. Ensure the quality of products used in production by approximately 200 companies through a "science of quality" approach.
Role & Expectations
As an AI QA Specialist, you will lead the design, construction, and operation of the quality evaluation infrastructure for AI agents.
- Own the entire process from evaluation metric selection and design to integrating automated evaluation pipelines into CI/CD
- Plan and execute red teaming to detect safety risks before release
- Quantitatively verify the effectiveness of quality improvements through A/B test analysis based on statistical experimental design
- Feed evaluation signals back to the research and development teams, creating a compound-interest loop for model improvement
- Ensure the quality of products used in production by ~200 companies through a "science of quality" approach
Why You'll Love This Role
- Quality determines product trust — In a production environment used by ~200 companies, the evaluation infrastructure you build becomes the last line of defense for release quality. You will feel the direct business impact of quality assurance.
- Greenfield position — Design and build the entirely new specialized domain of AI agent QA from scratch.
- Scientific approach — Unlike traditional QA, this role demands intellectual rigor using statistics, experimental design, and NLP evaluation methods.
- Guardian of product quality — Support quality across all products with a target of ≥95% pre-release quality degradation detection rate.
- Frontline of AI safety — Engage in Responsible AI practices including red teaming, adversarial testing, and policy compliance verification.
- Rapid-growth environment — In a startup that has grown to 200+ people and 9 products in just 3 years, you will have significant autonomy in technical decision-making.
Job Description
Evaluation Infrastructure Design & Development
- Design, build, and maintain evaluation sets (synthetic data + real logs)
- Select and design evaluation metrics (win rate, task success, factuality, harm detection)
- Build automated evaluation pipelines and integrate them into CI/CD
- Design agent harnesses (multi-turn, tool use, long-context support)

Safety & Quality Verification
- Plan and execute red teaming (adversarial testing)
- Build safety and policy compliance verification frameworks
- Design and run prompt/tool regression tests
- Analyze and improve issues related to hallucination, bias, and output quality

Statistical Analysis & Reporting
- Design and analyze statistical experiments (A/B tests, significance testing)
- Create quality reports and improvement proposals
- Visualize regression detection and quality trends
- Feed evaluation signals back to research and development teams
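To make the "automated evaluation pipeline in CI/CD" idea concrete, here is a minimal sketch of a regression gate that could run under pytest in a CI job. Everything in it (the toy eval set, the exact-match scorer standing in for a real factuality metric, and the 0.90 threshold) is an illustrative assumption, not JAPAN AI's actual pipeline:

```python
# Hypothetical sketch: a pytest-style quality gate for an eval pipeline.
# The eval set, scorer, and threshold below are invented for illustration.

FACTUALITY_THRESHOLD = 0.90  # gate: fail the CI build below this score

# Toy eval set: (question, reference answer, model answer).
EVAL_SET = [
    ("Capital of Japan?", "Tokyo", "Tokyo"),
    ("2 + 2?", "4", "4"),
    ("Largest planet?", "Jupiter", "Jupiter"),
]

def exact_match_score(reference: str, answer: str) -> float:
    """Crude stand-in for a real factuality scorer: exact string match."""
    return 1.0 if reference.strip().lower() == answer.strip().lower() else 0.0

def factuality(eval_set) -> float:
    """Mean factuality score over the eval set."""
    scores = [exact_match_score(ref, ans) for _, ref, ans in eval_set]
    return sum(scores) / len(scores)

def test_factuality_gate():
    """Collected by pytest in CI (e.g. a GitHub Actions job); a regression
    drops the score below the threshold and fails the build."""
    assert factuality(EVAL_SET) >= FACTUALITY_THRESHOLD
```

In a real pipeline the scorer would be an LLM-based or reference-based metric and the eval set would come from synthetic data plus production logs, but the gate structure stays the same.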
Example Scenarios
The following are illustrative scenarios for this role.
Scenario 1: Quality gate for new model adoption
An LLM provider releases a new model. You run regression tests against existing evaluation sets and detect a 3% drop in factuality scores. You analyze the root cause, adjust prompts, and complete the migration to the new model while maintaining quality.
Scenario 2: Safety verification for an enterprise customer
When deploying JAPAN AI AGENT for a financial institution, you design and execute industry-specific red-teaming scenarios (confidential information leakage, inappropriate financial advice, etc.). You achieve ≥99% policy compliance and pass the customer's security review.
Scenario 3: A/B test to validate prompt optimization
To improve agent response quality, you compare two prompt strategies via A/B testing. Through statistical significance testing, you demonstrate that the new prompt improves task success rate by 12%, and the decision is made to deploy it to production.
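The significance test in a scenario like this can be as simple as a two-proportion z-test on task success counts. The numbers below are invented for illustration (a baseline prompt at 68% success versus a candidate at 80%, i.e. a 12-point lift) and the function is a generic sketch, not a JAPAN AI tool:

```python
# Hypothetical example: two-proportion z-test comparing the task success
# rates of two prompt strategies. All counts are illustrative.
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: success rates are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Baseline prompt: 340/500 tasks succeed; candidate prompt: 400/500.
z, p = two_proportion_z_test(340, 500, 400, 500)
print(f"z={z:.2f}, p={p:.5f}")  # significant at alpha=0.05 if p < 0.05
```

With 500 tasks per arm, a 12-point lift is comfortably significant; with much smaller eval sets the same lift might not be, which is exactly why the role calls for experimental-design foundations.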
Key Results (KR/Metrics)
- Evaluation coverage rate (test case coverage)
- Regression detection rate (pre-release quality degradation detection ≥ 95%)
- Evaluation pipeline execution time (completed within CI/CD)
- False positive / false negative rate
- Safety incident rate (post-release)
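As a rough illustration of how the detection-rate and false-positive metrics above relate, here is a toy computation over labeled release candidates; the records and counts are invented, not real data:

```python
# Hypothetical illustration of regression-detection metrics.
# 'flagged' = the eval pipeline flagged the release as regressed;
# 'actual'  = ground truth (a real quality regression existed).
releases = [
    {"flagged": True,  "actual": True},   # true positive: regression caught
    {"flagged": True,  "actual": False},  # false positive: blocked needlessly
    {"flagged": False, "actual": True},   # false negative: slipped through
    {"flagged": True,  "actual": True},   # true positive
    {"flagged": False, "actual": False},  # true negative: clean release passed
]

tp = sum(r["flagged"] and r["actual"] for r in releases)
fp = sum(r["flagged"] and not r["actual"] for r in releases)
fn = sum(not r["flagged"] and r["actual"] for r in releases)
tn = sum(not r["flagged"] and not r["actual"] for r in releases)

detection_rate = tp / (tp + fn)       # recall; the role targets >= 0.95
false_positive_rate = fp / (fp + tn)  # share of clean releases wrongly flagged
print(detection_rate, false_positive_rate)
```

The tension between these two numbers is the core trade-off of a quality gate: a stricter gate raises the detection rate but also blocks more clean releases.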
Team Structure
Approximately 120 members are part of the development organization.
The AI QA Specialist operates as a dedicated quality assurance function, collaborating closely with:
- Agentic Product Engineer — Agent feature development
- Research Engineer — Research and development, model improvement
- Agent Harness Engineer / Software Engineer (AI Platform) — AI execution infrastructure development
- Product Manager — Product design and quality requirements definition
You May Be a Good Fit If You
- Bachelor's degree or equivalent practical experience in Computer Science, Software Engineering, Artificial Intelligence, Machine Learning, Mathematics, Physics, or related fields
- 3+ years of practical experience as a software engineer or QA engineer
- Knowledge of LLM / generative AI evaluation methods (prompt evaluation, quantitative output quality measurement, hallucination detection, etc.)
- Foundational knowledge of statistics and experimental design
- Experience building evaluation pipelines in Python
- Experience integrating tests into CI/CD pipelines
- Experience designing prompt / tool regression tests
Language requirement (at least one of the following)
- Japanese: fluent — able to discuss product development without friction
- English: business level

Strong Candidates May Also Have
- NLP / ML evaluation benchmark design experience
- Knowledge of AI safety / Responsible AI
- Red teaming / penetration testing experience
- Experience evaluating multi-agent workflows, tool use, and long-context scenarios
- Large-scale data processing experience (Spark / BigQuery, etc.)
- Ability to read, comprehend, and reproduce research papers
- Technical communication ability in English
Tech Stack
Languages: Python (evaluation pipelines & analysis), TypeScript / React / Next.js (frontend) / NX
Evaluation/QA: pytest, LangSmith, Weights & Biases, custom eval frameworks
Data: BigQuery, Spark, Pandas
Infrastructure: GCP (containers / K8s), Docker, Terraform
CI/CD: GitHub Actions
Tools: Slack, Confluence, Linear, Google Workspace, GitHub, Notion
AI Dev Support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
Work environment: Mac (Apple Silicon), dual monitors available
Learning & Development Support
- AI Tool Usage Support: the company covers the cost of AI tools such as JAPAN AI SaaS services, Cursor, ChatGPT, Claude, etc.
- Development Tool Support: if a desired development tool is paid, the cost is covered (up to ¥30,000 per year)
- Book Purchase Assistance: the company covers the cost of books for learning, such as technical books (up to ¥30,000 per half-year)
- Language Learning / Qualification Support: the company covers the cost of Japanese or English learning programs and qualification acquisition
- Refresh Allowance: the company covers the cost of services used for personal refreshment (up to ¥5,000 per month), e.g., gym, yoga, chiropractic, aquarium, movies, theme park tickets, etc.
- Housing Allowance
- Housing allowance provided for those living in designated areas (up to ¥30,000 per month)
Position: 【JAPAN AI】AI QA Specialist (LLM Evaluation) / English
Employment type: Full-time employee (正社員)
Salary
- Monthly: ¥500,000–¥1,000,000 (includes 45 hours of fixed overtime)
- Stock options available
- Reviews & bonuses: twice per year
- Overtime beyond 45 hours paid separately
- Negotiable based on experience and skills
Work location
〒163-6006 Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, Nishi-Shinjuku, Shinjuku-ku, Tokyo
Work Style
- Hybrid work: 3 days in office, 2 days remote
- Flexible working hours: core time is negotiable
- Flexibility: future consideration for more flexible work styles is possible
Hiring Process
1. Application Review
2. Coding Assessment
3. Interviews (4–5 rounds)
4. Offer
A reference check will be conducted prior to the final interview.
Company Information
Company name: 株式会社ジーニー (Geniee, Inc.)
Business: advertising platform business; marketing SaaS business; digital PR business
Founded: April 14, 2010
Representative: 工藤 智昭, President and CEO
Capital: ¥100 million (consolidated, as of end of March 2025)
Employees: 877 (consolidated, as of end of March 2025)
Head office: Sumitomo Fudosan Shinjuku Oak Tower 5F/6F, 6-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
Working hours: 10:00–19:00 (Saturdays, Sundays, and national holidays are non-working days; employees on secondment follow the host company's rules)
Benefits
- Book purchase assistance (up to ¥30,000 per half-year)
- Refresh allowance (up to ¥5,000 per month)
- Club activity allowance (up to ¥5,000 per month)
- Housing allowance (up to ¥30,000 per month, for residents near company-designated stations)
- Shuffle lunch/dinner (once per quarter; up to ¥1,000 for lunch, ¥5,000 for dinner)
- Qualification acquisition support and English learning support (only where required for work)
- Refresh leave (2 days of special leave granted annually to employees with 3 years of continuous service)
- Regular health checkups (once a year)
- Employee stock ownership plan
Insurance: full social insurance coverage
Other allowances: commuting expenses fully covered
Representative's profile: After completing graduate school at Waseda University, he joined Recruit Co., Ltd. (now Recruit Holdings). He founded 株式会社ジーニー in April 2010 and became its President and CEO. In April 2023 he established the strategic AI company JAPAN AI株式会社 and concurrently serves as its President and CEO.
Corporate growth ranking: Named to the Financial Times High-Growth Companies Asia-Pacific 2020 ranking. In a survey by the Financial Times and Statista covering more than 50 million companies across 12 countries in the Asia-Pacific region, the company was selected among the 500 companies that achieved remarkable growth.
Holidays and leave: Full two-day weekend system (scheduled days off: Saturdays, Sundays, and national holidays). Leave: annual paid leave, summer vacation (3 days), year-end and New Year holidays (December 31 to January 3), and congratulatory/condolence leave.
Group companies: CATS株式会社 (Japan), JAPAN AI株式会社 (Japan), ソーシャルワイヤー株式会社 (Japan), Geniee International Pte., Ltd. (Singapore), Geniee Vietnam Co., Ltd. (Vietnam), PT. Geniee Technology Indonesia (Indonesia), PT. Adstars Media Pariwara (Indonesia), Zelto, Inc. (US), AdPushup Software India Pvt Ltd. (India)
Notes
- Probation period: 1 month (full-time and contract employees)
- Passive smoking measures: no smoking on premises (outdoor smoking area provided)
- Scope of possible changes to duties: duties as determined by the company
- Scope of possible changes to work location: locations as determined by the company
- Criteria for renewing fixed-term contracts (including any cap on total contract period or number of renewals): no upper limit on renewals