// POSTED: May 3, 2026

Peach Pilot — Principal QA Engineer (AI Systems & Platform) | Remote — Latin America


Full-Time Contract | US Eastern Timezone Overlap Required (5+ hours daily)

The Mission: Trust Has to Be Earned — Every Release

95% of enterprise AI pilots fail — not because the technology is broken, but because users don't trust it. At Peach Pilot we are building an enterprise AI operating system where trust is the product. That means every feature we ship must work exactly as the user expects, every time. One broken interaction at the wrong moment can undo months of adoption. You are the last line of defense before our platform reaches a CFO's desk.

Peach Pilot is a funded US-based AI startup building an enterprise AI operating system for business leaders. We are closing the AI trust gap — making powerful AI feel effortless and reliable for the people who run companies, not just the engineers who build software.

We are an early-stage founding team moving fast and hiring remotely across Latin America.

The Role

This is a hands-on, high-ownership role. You will build and own the QA function at Peach Pilot — writing test code, designing eval pipelines, and setting the quality bar as we move from early-stage development into full production and enterprise deployment. We are not looking for someone who manages spreadsheets and delegates everything. We are looking for someone who can do the work, knows what good looks like, and raises the bar across the entire engineering team.

This is a fully remote contract role based in Latin America. As the company scales, there is a path to a larger leadership role. For now the focus is getting the product right.

You will work directly with the US-based founding engineering team and must be available during US Eastern business hours with a minimum of 5 hours of daily overlap.

The Challenge: QA for AI Is a Different Problem

Traditional QA assumes deterministic outputs. LLMs don't give you that. You will be building a quality function from scratch in an environment where the same input can produce a different output on every run.
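To give a flavor of the problem, here is a minimal sketch (in Python, with all names and thresholds hypothetical, not taken from the Peach Pilot codebase) of how testing a non-deterministic system differs from classic assertion-based QA: instead of asserting one exact output, you sample the system several times, score each sample against a rubric that checks for the key fact rather than exact wording, and gate the release on a pass rate.

```python
# Sketch: evaluating a non-deterministic system with a pass-rate
# threshold instead of an exact-match assertion.
import random


def model_answer(prompt: str) -> str:
    """Stand-in for an LLM call; real outputs vary run to run."""
    return random.choice([
        "Revenue grew 12% year over year.",
        "Year-over-year revenue growth was 12%.",
        "Revenue was up 12% YoY.",
    ])


def passes_rubric(output: str) -> bool:
    """Rubric check: the key fact must be present, wording may vary."""
    return "12%" in output and "revenue" in output.lower()


def eval_pass_rate(prompt: str, n_samples: int = 20) -> float:
    """Sample the system repeatedly and return the fraction passing."""
    results = [passes_rubric(model_answer(prompt)) for _ in range(n_samples)]
    return sum(results) / n_samples


rate = eval_pass_rate("Summarize Q3 revenue growth")
assert rate >= 0.95, f"pass rate {rate:.0%} is below the release bar"
```

The design choice is the point: the quality bar becomes a statistical property of the system ("at least 95% of sampled answers contain the correct figure") rather than a single deterministic assertion.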

What You Will Own & Build

Who You Are

Even Better If

The Stack You'll Test Against

AI/LLM: Anthropic Claude, OpenAI GPT-4o, xAI Grok, Gemini
Frontend: React/Next.js, TypeScript, Tailwind CSS
Backend: Python, Node.js/TypeScript (FastAPI/Express)
Data & Graph: Neo4j, Snowflake, Azure Cosmos DB, Azure AI Search
Infrastructure: Azure (Functions, Key Vault), CI/CD pipelines
Visualization: Plotly, D3, Recharts, Mermaid

Compensation

Competitive contractor rate commensurate with experience. Paid monthly via Deel in USD.

The Clincher

Tell us about a quality failure — one you caught before it shipped, or one that got through. What did you build or change after it, and how did you make sure your team could catch the next one without you?


Interested in this role? Apply on iHire