Site Reliability Engineering · Agentic AI

Ivan Padilla

I keep large-scale digital platforms fast, observable, and online. I also build agentic AI that makes the pager quieter.

Currently: Coding Away · 5+ yrs in SRE

Conway's Game of Life, seeded with my name. Entropy comes for every system; good ones recover.
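
For readers who have not met it, here is a minimal sketch of one Game of Life generation. It is an illustration only, not the code behind the animation on this page. How the name becomes a starting grid is not described here, so the sketch starts from a hand-written pattern instead, and the names `step` and `blinker` are hypothetical.

```python
from collections import Counter


def step(live):
    """Advance one generation. `live` is a set of (row, col) live cells."""
    # Count how many live neighbours every nearby cell has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 neighbours,
    # or with 2 neighbours if it is already alive.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


# Assumed starting pattern: a horizontal "blinker" of three cells.
blinker = {(1, 0), (1, 1), (1, 2)}
```

Calling `step(blinker)` flips the row into a column, and a second call flips it back, which is the smallest example of a pattern that keeps recovering its shape.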

Platform uptime: 99.9%+
Transactions / day: 700k+
SREs led & mentored: 8
Observability budget owned: $400k

About

I'm a player-coach SRE Manager with more than six years of experience designing, implementing, and supporting observability for complex systems at scale. At Taco Bell I lead the team that keeps a multi-million-dollar-a-day digital platform healthy across web, iOS, and Android in U.S. and international markets.

These days my favorite problems live where reliability meets AI: building agents that investigate incidents, automating away toil, and teaching engineers how to put agentic tools to work safely. Off the clock you'll find me under a car hood or deep in a synthesizer patch.

BS, Computer Science, Summa Cum Laude · Southern New Hampshire University

Experience

Six years, one mission: keep Taco Bell's digital platform online.

  1. Mar 2024 – Present

    Manager, Site Reliability Engineering · Digital

    Taco Bell · Irvine, CA

    • Lead and mentor a 24/7 on-call team of eight SREs supporting U.S. and international digital platforms: Ecommerce, Menu, Payments, Loyalty, and Delivery services.
    • Own observability strategy across Datadog, CloudWatch, and Lumigo, keeping telemetry actionable for every digital channel.
    • Serve as technical lead for org-wide AI initiatives, building agentic tooling on AWS AgentCore and the Claude Agent SDK to accelerate incident triage.
    • Modernized SRE practice with SLIs/SLOs, QBRs, and an on-call feedback loop that keeps alerting high-signal and burnout low.

    Leadership · Observability · Agentic AI · Incident Management

  2. Feb 2022 – Mar 2024

    Tech Lead, Sr. Site Reliability Engineer · Digital

    Taco Bell · Irvine, CA

    • Led a three-person SRE team standing up end-to-end observability for a next-gen serverless ecommerce architecture.
    • Built an AWS Bedrock + Claude Slack bot prototype that answered support questions and pulled live API data for incident triage.
    • Owned Datadog platform governance: the $400k budget, the vendor relationship, and utilization optimization.
    • Shipped self-service Retool apps that automated bulk refunds, tax-rate verification, and alert analysis.

    Serverless · Datadog · AWS Bedrock · Retool

  3. Oct 2020 – Feb 2022

    Site Reliability Engineer · Digital

    Taco Bell · Irvine, CA

    • Engineered a Datadog observability framework from the ground up for a large-scale Hybris ecommerce monolith.
    • Rolled out PagerDuty across all of Taco Bell's digital and technology teams, complete with training, guidance, and documentation.
    • Scaled EC2 capacity for major promotional events like National Taco Day and Steal a Base, Steal a Taco.

    Datadog · PagerDuty · CloudFormation · EC2

Projects

Agents, automation, and the occasional synthesizer.

2025 · Internal · Taco Bell

Tax-Rate Verification Automation

When incorrect tax rates surfaced at store locations, I volunteered to build a Retool workflow that verifies data integrity automatically, turning a 4-hour-a-week manual slog into a 10-minute check.

Retool · Automation · 4 hrs → 10 min

2024 · Internal · Taco Bell

AI Slack Bot for SRE Support

An early agentic prototype on AWS Bedrock and Claude: a Slack bot that answered from our knowledge base and called live APIs to help triage incidents. It was the proof of concept that seeded our org's AI adoption.

AWS Bedrock · Claude · Slack Bolt · Lambda

2023 · Featured on the Taco Bell blog

Restless Innovation Challenge

Formed a team to build an AI-powered proof-of-concept web app that assists our service desk, using React, Next.js, LangChain, and AWS. Our media team wrote up the journey.

React · Next.js · LangChain

Read the story

Open source

LiveStreamGamer

TikTok Live meets a PyBoy Game Boy emulator. Viewers control the game through live-stream comments in real time.

Python · PyBoy · TikTok Live API

View on GitHub
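
As a rough illustration of the comment-to-controller idea (not the repository's actual code), the sketch below turns a stream of chat comments into button names. The `COMMANDS` table and both function names are hypothetical. Reading comments from the live stream and pressing buttons in the emulator are deliberately left out, since those calls depend on the libraries' own APIs.

```python
# Hypothetical mapping from chat words to Game Boy buttons; the real project may differ.
COMMANDS = {
    "up": "up",
    "down": "down",
    "left": "left",
    "right": "right",
    "a": "a",
    "b": "b",
    "start": "start",
    "select": "select",
}


def parse_comment(text):
    """Return the button named by a comment, or None if it is not a command."""
    word = text.strip().lower()
    return COMMANDS.get(word)


def buttons_from_comments(comments):
    """Yield one button per recognized comment, in arrival order."""
    for text in comments:
        button = parse_comment(text)
        if button is not None:
            yield button
```

For example, `list(buttons_from_comments(["UP", "nice stream!", " a "]))` gives `["up", "a"]`: ordinary chatter is ignored and only exact command words reach the game.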

Open source

Deluge Tune Crafter

Converts audio files to MIDI tailored for the Synthstrom Deluge. Audio goes in; Deluge-compatible XML comes out.

Python · MIDI · Audio DSP

View on GitHub

Teaching

Yum! Brands · Tech Elevate · ~300 participants

Co-host of “AI-First Platform & Reliability Engineering”

I co-hosted a course in Tech Elevate, Yum! Brands' flagship six-to-nine-week technical learning program serving nearly 300 engineers across KFC, Pizza Hut, Taco Bell, and Habit Burger & Grill. Our track gave engineers hands-on experience using agentic AI to deploy services, investigate failures, and explore self-healing workflows. We combined SRE, platform engineering, and observability practice with emerging AI tools so engineers could fold agentic AI into their day-to-day work while designing systems that are more resilient, scalable, and easier to support.

Read about Tech Elevate at Yum! Brands

Toolbox

AI & Agents

  • Claude Code
  • Claude Agent SDK
  • AWS Bedrock
  • AWS AgentCore
  • MCP
  • ChromaDB
  • Pinecone

Observability

  • Datadog
  • CloudWatch
  • Lumigo
  • Grafana
  • PagerDuty
  • Incident.io
  • FullStory
  • Amplitude

Cloud & Automation

  • AWS Serverless
  • Terraform
  • Terragrunt
  • CloudFormation
  • Retool
  • CI/CD

Code & Data

  • Python
  • JavaScript
  • TypeScript
  • SQL
  • PostgreSQL
  • MongoDB
  • DynamoDB

Contact

Open to conversations about SRE leadership, observability, and agentic AI.