// POSTED: May 2, 2026

GenAI Site Reliability Engineer (SRE) - Senior Associate - AI Managed Services - Operate


Industry/Sector: Not Applicable

Specialism: Managed Services

Management Level: Senior Associate

Job Description & Summary

At PwC, our people in managed services focus on a variety of outsourced solutions and support clients across numerous functions. These individuals help organisations streamline their operations, reduce costs, and improve efficiency by managing key processes and functions on their behalf. They are skilled in project management, technology, and process optimization to deliver high-quality services to clients.

Those in managed service management and strategy at PwC focus on transitioning and running services, along with managing delivery teams, programmes, commercials, performance and delivery risk. Your work will involve continuously improving and optimising managed services processes, tools and services.

Focused on relationships, you are building meaningful client connections, and learning how to manage and inspire others. Navigating increasingly complex situations, you are growing your personal brand, deepening technical expertise and awareness of your strengths. You are expected to anticipate the needs of your teams and clients, and to deliver quality. Embracing increased ambiguity, you are comfortable when the path forward isn’t clear, you ask questions, and you use these moments as opportunities to grow.

Examples of the skills, knowledge, and experiences you need to lead and deliver value at this level include but are not limited to:

GenAI Site Reliability Engineer

Observability | Incident Response | Reliability Engineering | AWS and GenAI Operations

Purpose: Operate, monitor, and continuously improve the reliability of in-scope AI platforms and services.

Role: GenAI Site Reliability Engineer

Level: AC - Staff - Experienced

Tower: AI Operations & Platform Support (AI Managed Services)

Experience: 4+ years in SRE, production support, cloud operations, or a similar run-state engineering role

Work Location: Bangalore / Hyderabad, India (Remote)

Key Platforms: AWS / Amazon Bedrock, OpenAI / ChatGPT Enterprise, observability and ITSM tooling

Role profile: Hands-on reliability engineer focused on monitoring, incident response, service health, and operational stability for AI workloads.

Primary focus: Observability, alerting, incident investigation, RCA support, automation, and post-change validation.

Best fit: An engineer who likes messy production problems, can separate signal from noise, and is comfortable owning issues through restoration and follow-up.

Role Summary

As a GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope AI services, investigate incidents, restore service, and implement reliability improvements. The role is oriented around real run-state support rather than net-new build work, so we need people who can work from alerts, logs, traces, tickets, dashboards, and imperfect documentation to drive structured troubleshooting and better outcomes over time.

Key Responsibilities

1. Monitoring, alerting, and service health

2. Incident triage, restoration, and problem management

3. Reliability improvement and automation

4. Operational readiness and knowledge management

Preferred Skills and Experience

Skill area: Preferred background

SRE and production operations: Hands-on experience supporting production services in a cloud environment, including monitoring, troubleshooting, incident response, and restoration.

Observability: Experience building dashboards and alerts and using logs, metrics, and traces to diagnose issues. CloudWatch, Datadog, Splunk, New Relic, Grafana, or OpenTelemetry experience is relevant.

Cloud and GenAI platform operations: Working knowledge of AWS operations and familiarity with Bedrock, OpenAI, or adjacent AI platform services used in enterprise production environments.

Incident and problem management: Experience working within ITIL-aligned processes for incident, problem, request, and change management, including strong ticket hygiene and runbook discipline.

Automation and scripting: Ability to automate diagnostics or repetitive support activities using Python, shell scripting, or similar tools.

Critical thinking and collaboration: Ability to solve ambiguous production issues, work across teams, ask the right questions, and engage stakeholders to move investigations and actions forward.

Nice to Have

• Experience supporting Bedrock or OpenAI-powered workloads in production.

• Experience with service reliability metrics such as SLIs, SLOs, MTTA, MTTR, and error trends.

• Exposure to cost and usage monitoring, quota or throttling investigation, and post-change validation.

• AWS certifications or other cloud reliability certifications.

Working Style & Core Behaviors

What Good Looks Like

Team Context

You will join PwC’s AI Operations & Platform Support team, supporting a client’s run-state AI environment. The operating model is centered on Level 2 and Level 3 support, monitoring, incident response, service requests, minor enhancements, and continuous improvement across AWS/Bedrock, OpenAI, and related platform components.

This role will work in a managed-services model focused on incident management, service requests, monitoring, minor enhancements, knowledge management, and continuous improvement. Success depends not only on technical skill, but also on ownership, collaboration, and the ability to engage stakeholders to progress work.

Travel Requirements: 0%

Job Posting End Date
