TITLE: Remote SRE Jobs – Senior Site Reliability Engineer (Remote) – $130k‑$170k USD – Full‑Time – Escondido, California – Cloud/DevOps, Kubernetes, Terraform, Prometheus

---

**Who we are**

We are a mid‑stage SaaS company that grew from a garage prototype into a platform serving more than 200 enterprise customers worldwide. Our flagship product, an API‑driven data pipeline, processes ≈ 15 TB of events per day, and we guarantee customers 99.9 % uptime. Our engineering culture is built on blunt feedback, data‑driven post‑mortems, and a relentless focus on reliability. The code lives in the cloud, but our operational decisions are made by a small, tight‑knit crew spread across the globe.

**Why this role exists now**

In the last 12 months we expanded into three new cloud regions (AWS us‑east‑1, AWS us‑west‑2, and GCP europe‑west1) to shave latency for European clients. That expansion pushed our monthly alert volume from ≈ 2,800 to ≈ 5,200, and our MTTR climbed from 12 minutes to 18 minutes as the on‑call rotation stretched thin. Leadership decided it was time to double down on site reliability: we need a senior engineer who can own the reliability roadmap, coach junior team members, and reduce alert fatigue.

**Where you’ll sit (virtually)**

Although the job is remote, we have a legal entity in Escondido, California that handles payroll, benefits, and compliance. You’ll be part of a “virtual office” that meets daily in a Slack channel called #sre‑hub, holds a weekly video‑call huddle, and gathers for a quarterly in‑person meetup in Escondido when travel permits. Being anchored to Escondido helps us stay aligned with local tax regulations and gives you a community of other remote professionals who live in the same time zone.

**The team you’ll join**

- **Size & composition:** 12 engineers total: 5 senior SREs, 4 junior reliability engineers, 2 platform developers, and 1 manager.
- **Current metrics:** 99.92 % uptime over the past quarter, 5,200 alerts processed per month, 18‑minute average MTTR, 0.2 % alert‑fatigue rate (incidents that trigger more than 3 alerts).
- **SLA commitments:** 99.9 % availability for all customer‑facing APIs, 99.7 % for internal data‑processing pipelines.

**What you’ll do day‑to‑day**

1. **Own reliability initiatives** – Define and ship SLOs for new services, write error‑budget policies, and track them in Grafana dashboards.
2. **Incident ownership** – Lead the response during high‑severity incidents, drive the post‑mortem narrative, and ensure actionable remediation items are filed in JIRA within 24 hours.
3. **Automation & tooling** – Write Terraform modules to provision Kubernetes clusters, build Helm charts for microservices, and turn manual run‑books into reproducible Ansible playbooks.
4. **Capacity planning** – Run quarterly load tests using Locust, model growth with Python scripts, and present forecasts to product leadership.
5. **Mentorship** – Pair with junior SREs for “bug‑hunting” sessions, run monthly reliability workshops, and contribute to our internal “SRE Playbook”.

**Who we think will thrive**

- **5+ years** of production‑grade experience with Linux/Unix, networking, and cloud infrastructure (AWS or GCP).
- **Deep familiarity** with monitoring stacks: Prometheus, Grafana, Alertmanager, and log aggregation via Splunk or ELK.
- **Infrastructure‑as‑Code** fluency: Terraform ≥ 0.13, Helm ≥ 3, and Ansible.
- **Container orchestration**: Experience running production workloads on Kubernetes (EKS or GKE).
- **Programming**: Comfortable writing Python or Go for automation; Bash scripting is a given.
- **Incident mindset**: You stay calm under pressure, triage noisy alerts, and keep a clear incident timeline.
- **Communication**: Able to explain complex reliability concepts to product managers and non‑technical stakeholders in plain language.

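To put the SLA commitments listed above in concrete terms, here is a minimal sketch of how much downtime each target allows. The 30‑day window and the helper name `error_budget_minutes` are assumptions chosen for illustration; this is not a description of our internal tooling or error‑budget policy.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) allowed by an availability target.

    availability_pct: target such as 99.9 (percent).
    window_days: length of the measurement window; 30 days is an assumption.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Targets taken from the SLA commitments in this posting.
    targets = [("customer-facing APIs", 99.9), ("internal pipelines", 99.7)]
    for name, pct in targets:
        print(f"{name}: {error_budget_minutes(pct):.1f} min per 30 days")
    # customer-facing APIs: 43.2 min per 30 days
    # internal pipelines: 129.6 min per 30 days
```

Read this way, a 99.9 % target leaves about 43 minutes of downtime per 30 days, which is why an 18‑minute MTTR matters: two or three incidents of average length would use up the whole budget.
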
**Tools & tech stack (the ones we actually use)**

- **Cloud** – AWS (EC2, RDS, S3, Lambda) and GCP (Compute Engine, Cloud SQL, Pub/Sub).
- **Container** – Docker ≥ 20, Kubernetes ≥ 1.24, Helm ≥ 3.5.
- **IaC** – Terraform ≥ 1.0, Ansible ≥ 2.9.
- **CI/CD** – GitHub Actions, Jenkins, CircleCI (for legacy pipelines).
- **Monitoring** – Prometheus, Grafana, Alertmanager, Datadog (for some legacy services).
- **Logging** – Splunk, Elasticsearch‑Kibana stack, Loki.
- **Incident response** – PagerDuty, Opsgenie (we’re migrating fully to PagerDuty).
- **Version control** – GitHub (private repos, branch protection rules).
- **Collaboration** – Slack (primary chat), Confluence (knowledge base), JIRA (ticketing).

**On‑call rhythm & expectations**

Our on‑call schedule is a 7‑day rotation with a 48‑hour backup window. Each engineer handles roughly 350 alerts per month and averages about 2 incidents per week. We have a “no‑call‑out‑of‑hours” policy for holidays: the next engineer in the rotation covers the entire period, and the team shares the load. During an incident you’ll have a clear run‑book, but we also encourage “play‑by‑play