Industry/Sector
Not Applicable
Specialism
Managed Services
Management Level
Senior Associate
Job Description & Summary
At PwC, our people in managed services focus on a variety of outsourced solutions and support clients across numerous functions. These individuals help organisations streamline their operations, reduce costs, and improve efficiency by managing key processes and functions on their behalf. They are skilled in project management, technology, and process optimization to deliver high-quality services to clients.
Focused on relationships, you are building meaningful client connections, and learning how to manage and inspire others. Navigating increasingly complex situations, you are growing your personal brand, deepening technical expertise and awareness of your strengths. You are expected to anticipate the needs of your teams and clients, and to deliver quality. Embracing increased ambiguity, you are comfortable when the path forward isn’t clear, you ask questions, and you use these moments as opportunities to grow.
Examples of the skills, knowledge, and experiences you need to lead and deliver value at this level include but are not limited to:
GenAI Site Reliability Engineer
Observability | Incident Response | Reliability Engineering | AWS and GenAI Operations
Purpose: Operate, monitor, and continuously improve the reliability of in-scope AI platforms and services.
Role
GenAI Site Reliability Engineer
Level
AC - Staff - Experienced
Tower
AI Operations & Platform Support (AI Managed Services)
Experience
4+ years in SRE, production support, cloud operations, or a similar run-state engineering role
Work Location
Bangalore / Hyderabad, India (Remote)
Key Platforms
AWS / Amazon Bedrock, OpenAI / ChatGPT Enterprise, observability and ITSM tooling
Role profile
Hands-on reliability engineer focused on monitoring, incident response, service health, and operational stability for AI workloads.
Primary focus
Observability, alerting, incident investigation, RCA support, automation, and post-change validation.
Best fit
An engineer who likes messy production problems, can separate signal from noise, and is comfortable owning issues through restoration and follow-up.
Role Summary
As a GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope AI services, investigate incidents, restore service, and implement reliability improvements. The role is oriented around real run-state support rather than net-new build work, so we need people who can work from alerts, logs, traces, tickets, dashboards, and imperfect documentation to drive structured troubleshooting and better outcomes over time.
Key Responsibilities
1. Monitoring, alerting, and service health
2. Incident triage, restoration, and problem management
3. Reliability improvement and automation
4. Operational readiness and knowledge management
Preferred Skills and Experience
Skill area
Preferred background
SRE and production operations
Hands-on experience supporting production services in a cloud environment, including monitoring, troubleshooting, incident response, and restoration.
Observability
Experience building dashboards and alerts and using logs, metrics, and traces to diagnose issues. CloudWatch, Datadog, Splunk, New Relic, Grafana, or OpenTelemetry experience is relevant.
Cloud and GenAI platform operations
Working knowledge of AWS operations and familiarity with Bedrock, OpenAI, or adjacent AI platform services used in enterprise production environments.
Incident and problem management
Experience working within ITIL-aligned processes for incident, problem, request, and change management, including strong ticket hygiene and runbook discipline.
Automation and scripting
Ability to automate diagnostics or repetitive support activities using Python, shell scripting, or similar tools.
Critical thinking and collaboration
Ability to solve ambiguous production issues, work across teams, ask the right questions, and engage stakeholders to move investigations and actions forward.
Nice to Have
• Experience supporting Bedrock or OpenAI-powered workloads in production.
• Experience with service reliability metrics such as SLIs, SLOs, MTTA, MTTR, and error trends.
• Exposure to cost and usage monitoring, quota or throttling investigation, and post-change validation.
• AWS certifications or other cloud reliability certifications.
Working Style & Core Behaviors
What Good Looks Like
Team Context
You will join PwC’s AI Operations & Platform Support team supporting a client’s run-state AI environment. The operating model is centered on Level 2 and Level 3 support, monitoring, incident response, service requests, minor enhancements, and continuous improvement across AWS/Bedrock, OpenAI, and related platform components.
This role operates in a managed-services model focused on incident management, service requests, monitoring, minor enhancements, knowledge management, and continuous improvement. Success depends not only on technical skill, but also on ownership, collaboration, and the ability to engage stakeholders to progress work.
Travel Requirements
0%
Job Posting End Date