> A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
__How They SRE__ is a curated knowledge repository of SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.
Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. The content in this repository is curated from those sources.
_Note to readers: This list includes some articles, posts, videos, tools, and techniques published before 2015. Please use such material with caution, as more recent advances in technology and practice may offer better alternatives and perspectives._
* [Scaling Production Globally — The service mesh facelift (Part-1)](https://achievers.engineering/scaling-production-globally-service-mesh-face-lift-part-1-30ad6d393d04)
* [Load Testing Kubernetes: Building a Framework (Part-1)](https://achievers.engineering/load-testing-kubernetes-building-a-framework-part-1-bdc0af4ae7e2)
* [Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)](https://achievers.engineering/load-testing-kubernetes-resolving-bottlenecks-and-improving-performance-part-2-c4f08102f105)
* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
* [Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb](https://medium.com/airbnb-engineering/intelligent-automation-platform-empowering-conversational-ai-and-beyond-at-airbnb-869c44833ff2)
* [Production Secret Management at Airbnb](https://medium.com/airbnb-engineering/production-secret-management-at-airbnb-ad230e1bc0f6)
* [Automating Data Protection at Scale, Part 1](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08)
* [Automating Data Protection at Scale, Part 2](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-2-c2b8d2068216)
* [Automating Data Protection at Scale, Part 3](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-3-34e592c45d46)
* [Why Are the Top Internet Companies Choosing SRE over Traditional O&M?](https://www.alibabacloud.com/blog/why-are-the-top-internet-companies-choosing-sre-over-traditional-o%26m_596099)
* [Architecture and Practices of Bilibili's Real-time Platform](https://www.alibabacloud.com/blog/architecture-and-practices-of-bilibilis-real-time-platform_596676)
* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
* [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
* [Playing the blame-less game](https://medium.com/asos-techblog/playing-the-blame-less-game-3708f8195344)
* [A day in the life of… Cat S (Head of Reliability Engineering)](https://medium.com/asos-techblog/a-day-in-the-life-of-cat-smith-head-of-reliability-engineering-629e10a26590)
* [An AKS Performance Journey: Part 1 — Sizing Everything Up](https://medium.com/asos-techblog/an-aks-performance-journey-part-1-sizing-everything-up-ee6d2346ea99)
* [An AKS Performance Journey: Part 2 — Networking It Out](https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69)
* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
* [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)
* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
* [How Reliability and Product Teams Collaborate at Booking.com](https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)
* [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)
* [Troubleshooting: A journey into the unknown](https://medium.com/booking-com-infrastructure/troubleshooting-a-journey-into-the-unknown-e31b524fa86)
* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
* [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)
* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)
* [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7)
* [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)
* [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)
* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)
* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)
* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)
* [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94)
* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)
* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)
* [Debunking the seven most popular Site Reliability Engineering myths](https://medium.com/dbs-tech-blog/debunking-the-seven-most-popular-site-reliability-engineering-myths-a3be8d870ff2)
* [How To Use SRE To Cultivate A Blameless Culture In The Workplace](https://medium.com/dbs-tech-blog/how-to-use-sre-to-cultivate-a-blameless-culture-in-the-workplace-1981fd1c7871)
* [Lessons learned from running GraphQL at scale](https://blog.dream11engineering.com/lessons-learned-from-running-graphql-at-scale-2ad60b3cefeb)
* [Break circuits, save Kong 🦍](https://blog.dream11engineering.com/break-circuits-save-kong-3680d88a0639)
* [Finding Order in Chaos: How We Automated Performance Testing with Torque](https://blog.dream11engineering.com/finding-order-in-chaos-how-we-automated-performance-testing-with-torque-6eb63706fcea)
* [Maintaining hyper-sonic releases at Dream11](https://blog.dream11engineering.com/maintaining-hyper-sonic-releases-at-dream11-c26f2145fe28)
* [To Scale In Or Scale Out? Here’s How We Scale at Dream11](https://blog.dream11engineering.com/to-scale-in-or-scale-out-heres-how-we-scale-at-dream11-f88ef5e71cbc)
* [Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11](https://blog.dream11engineering.com/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33)
* [Atlas: Our journey from a Python monolith to a managed platform](https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform)
* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)
* [How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020](https://codeascraft.com/2021/02/25/how-etsy-prepared-for-historic-volumes-of-holiday-traffic-in-2020/)
* [Using Fault-Injection to Improve our new Runtime Platform’s Reliability](https://medium.com/expedia-group-tech/using-fault-injection-to-improve-our-new-platforms-reliability-656b1147b132)
* [Learning from Incidents at Expedia Group](https://medium.com/expedia-group-tech/learning-from-incidents-at-expedia-group-51a8c72a4286)
* [Getting Started with Elasticsearch](https://medium.com/expedia-group-tech/getting-started-with-elastic-search-6af62d7df8dd)
* [All about ISTIO-PROXY 5xx Issues](https://medium.com/expedia-group-tech/all-about-istio-proxy-5xx-issues-e0221b29e692)
* [Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?](https://medium.com/expedia-group-tech/autoscaling-in-kubernetes-why-doesnt-the-horizontal-pod-autoscaler-work-for-me-5f0094694054)
* [How to Keep Your Kubernetes Deployments Balanced Across Multiple zones](https://medium.com/expedia-group-tech/how-to-keep-your-kubernetes-deployments-balanced-across-multiple-zones-dfe719847b41)
* [Are Your Dropwizard Latency Metrics Misleading You?](https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374)
* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com](https://github.blog/2023-09-26-how-github-uses-github-actions-and-actions-larger-runners-to-build-and-test-github-com/)
* [The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects](https://github.blog/2023-09-21-the-github-security-labs-journey-to-disclosing-500-cves-in-open-source-projects/)
* [CodeQL team uses AI to power vulnerability detection in code](https://github.blog/2023-09-12-codeql-team-uses-ai-to-power-vulnerability-detection-in-code/)
* [Building organization-wide governance and re-use for CI/CD and automation with GitHub Actions](https://github.blog/2023-04-05-building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions/)
* [Enabling branch deployments through IssueOps with GitHub Actions](https://github.blog/2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions/)
* [This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...](https://about.gitlab.com/blog/2021/01/14/this-sre-attempted-to-roll-out-an-haproxy-change/)
* [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)
* [Update: Elasticsearch lessons learnt for Advanced Global Search](https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/)
* [Lessons in iteration from a new team in infrastructure](https://about.gitlab.com/blog/2020/11/09/lessons-in-iteration-from-new-infrastructure-team/)
* [How we optimized infrastructure spend at GitLab](https://about.gitlab.com/blog/2020/10/27/how-we-optimized-our-infrastructure-spend-at-gitlab/)
* [How we scaled async workload processing at GitLab.com using Sidekiq](https://about.gitlab.com/blog/2020/06/24/scaling-our-use-of-sidekiq/)
* [Inside GitLab: How we release software patches](https://about.gitlab.com/blog/2020/05/13/how-we-release-software-patches/)
* [What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab](https://about.gitlab.com/blog/2019/11/15/tracking-down-missing-tcp-keepalives/)
* [How we used delayed replication for disaster recovery with PostgreSQL](https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/)
* [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https://medium.com/gocardless-tech/deploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)
* [How we compress Pub/Sub messages and more, saving a load of money](https://medium.com/gocardless-tech/how-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)
* [Fear-free PostgreSQL migrations for Rails](https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/)
* [Observability at GoCardless: a tale of API performance improvement](https://gocardless.com/blog/observability-at-gocardless-a-tale-of-api-performance-improvement/)
* [Debugging the PostgreSQL query planner](https://gocardless.com/blog/debugging-the-postgres-query-planner/)
* [Zero-downtime Postgres migrations - the hard parts](https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/)
* [In search of performance - how we shaved 200ms off every POST request](https://gocardless.com/blog/in-search-of-performance-how-we-shaved-200ms-off-every-post-request/)
* [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/)
* [Incident review: API and Dashboard outage on 10 October 2017](https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/)
* [Kubernetes - A Practical Introduction for Application Developers](https://www.godaddy.com/engineering/2018/05/02/kubernetes-introduction-for-developers/)
* [An Intuitive Node.js Client for the Kubernetes API](https://www.godaddy.com/engineering/2018/04/10/an-intuitive-nodejs-client-for-the-kubernetes-api/)
* [Observability at Scale](https://developer.gs.com/blog/posts/observability-at-scale)
* [Enabling Highly Available Trino Clusters at Goldman Sachs](https://developer.gs.com/blog/posts/enabling-highly-available-trino-clusters-at-goldman-sachs)
* [Infrastructure and the Command Chain Pattern](https://developer.gs.com/blog/posts/infrastructure-and-command-chain-pattern)
* [Mobile CICD with EC2 macOS](https://developer.gs.com/blog/posts/mobile-cicd-with-ec2-macos)
* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
* [How SRE teams are organized, and how to get started](https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started)
* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [‘Pragmatic Automation’ with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
* [Must Watch! - Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)
* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke)
* [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin)
* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry)
* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski)
* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood)
* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai)
* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff)
* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason)
* [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe)
* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato)
* [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang)
* [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz)
* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing)
* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji)
* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz)
* [‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)
* [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https://blog.khanacademy.org/how-khan-academy-successfully-handled-2-5x-traffic-in-a-week/)
* [Rethinking site capacity projections with Capacity Analyzer](https://engineering.linkedin.com/blog/2021/rethinking-site-capacity-projections-with-capacity-analyzer)
* [Insights into a Product SRE team at LinkedIn](https://www.linkedin.com/pulse/insights-product-sre-team-linkedin-zaina-afoulki/?trackingId=mxKJgZ3kp8l2WI9D4UZv7Q%3D%3D)
* [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries)
* [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)
* [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)
* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)
* [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear)
* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
* [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen)
* [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading)
* [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring)
* [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe)
* [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness)
* [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal)
* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba)
* [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley)
* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe)
* [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)
* [Dynamic alert routing with Prometheus and Alertmanager](https://tech.loveholidays.com/dynamic-alert-routing-with-prometheus-and-alertmanager-f6a919edb5f8)
* [Making loveholidays 18% faster with HTTP/3](https://tech.loveholidays.com/making-loveholidays-18-faster-with-http-3-1860879528a7)
* [Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code](https://tech.loveholidays.com/enforcing-best-practice-on-self-serve-infrastructure-with-terraform-atlantis-and-policy-as-code-911f4f8c3e00)
* [The 5 principles that helped scale loveholidays](https://tech.loveholidays.com/the-5-principles-that-helped-scale-loveholidays-7ea0b0fd3df9)
* [Realtime Fastly logs with Grafana Loki for under $1 a day](https://tech.loveholidays.com/realtime-fastly-logs-with-grafana-loki-for-under-1-a-day-5b63ccf32d66)
* [Scaling a Kubernetes Platform across the Enterprise](https://medium.com/macquarie-engineering-blog/scaling-a-kubernetes-platform-across-the-enterprise-c07a53b6022e)
* [Monitoring Cloud Environments at Scale with Prometheus and Thanos](https://mattermost.com/blog/monitoring-cloud-environments-at-scale-with-prometheus-and-thanos/)
* [How We Use Sloth to do SLO Monitoring and Alerting with Prometheus](https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/)
* [Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems](https://engineering.mercari.com/en/blog/entry/20220805-who-watches-the-watchmen-keeping-an-eye-on-our-monitoring-systems/)
* [What the Microservices SRE Team are doing as SRE Evangelists](https://engineering.mercari.com/en/blog/entry/20220225-cdb2b6deff/)
* [What it’s like to work as an embedded microservices SRE](https://engineering.mercari.com/en/blog/entry/20220228-work-as-an-embedded-microservices-sre/)
* [The Merpay SRE Team: Past and future](https://engineering.mercari.com/en/blog/entry/20210831-a91c3dca9d/)
* [Embedded SRE at Mercari](https://engineering.mercari.com/en/blog/entry/20220221-embedded-sre-at-mercari/)
* [What the SRE team wants to achieve with the development team](https://engineering.mercari.com/en/blog/entry/20210129-embedded-sre/)
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)
* [Improving Meta’s SLO workflows with data annotations](https://engineering.fb.com/2022/08/29/developer-tools/improving-metas-slo-workflows-with-data-annotations/)
* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)
* [Lessons from Building Observability Tools at Netflix](https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17)
* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)
* [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)
* [Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems](https://www.infoq.com/podcasts/netflix-sre-sociotechnical-systems/)
* [Defining Modern Software Roles: SREs at New Relic](https://newrelic.com/blog/nerd-life/new-relic-sre)
* [10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)](https://newrelic.com/blog/best-practices/site-reliability-engineering-careers)
* [What Tools Do Site Reliability Engineers Use?](https://newrelic.com/blog/best-practices/best-sre-tools)
* [A Day in the Life of a New Relic SRE](https://newrelic.com/blog/nerd-life/what-does-an-sre-do)
* [7 Habits of Highly Successful Site Reliability Engineers](https://newrelic.com/blog/best-practices/site-reliability-engineer-sre-habits)
* [Triggered: Incident #1234 (incident process needs fixing)](https://medium.com/paypal-tech/triggered-incident-1234-incident-process-needs-fixing-2a09dbac9edd)
* [Implementing Observability in a Service Mesh](https://medium.com/paypal-tech/implementing-observability-in-a-service-mesh-273c7409283d)
* [PostgreSQL at Scale: Database Schema Changes Without Downtime](https://medium.com/paypal-tech/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680)
* [Scaling GraphQL at PayPal](https://medium.com/paypal-tech/scaling-graphql-at-paypal-b5b5ac098810)
* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
* [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel)
* [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)
* [Ensuring High Availability of Ads Realtime Streaming Services](https://medium.com/pinterest-engineering/ensuring-high-availability-of-ads-realtime-streaming-services-ea3889420490)
* [Improving efficiency and reducing runtime using S3 read optimization](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
* [Scaling Kubernetes with Assurance at Pinterest](https://medium.com/pinterest-engineering/scaling-kubernetes-with-assurance-at-pinterest-a23f821168da)
* [What we learned from an iOS app OOMs incident](https://medium.com/pinterest-engineering/what-we-learned-from-an-ios-app-ooms-incident-eb31eada251)
* [How we designed our Continuous Integration System to be more than 50% Faster](https://medium.com/pinterest-engineering/how-we-designed-our-continuous-integration-system-to-be-more-than-50-faster-b70a59342fe2)
* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
* [How to avoid global outage — Seamlessly migrating DaemonSet labels](https://engineering.prezi.com/intro-4727024fc2c1)
* [In search of speed — debugging Elasticsearch performance](https://engineering.prezi.com/in-search-of-speed-debugging-elasticsearch-performance-9ce8edf4af40)
* [Prometheus at Prezi: replacing 10 years of anti-patterns](https://engineering.prezi.com/prometheus-at-prezi-replacing-10-years-of-anti-patterns-e3c2317e6ca)
* [From Ops to SRE: Evolution of the OpenShift Dedicated Team](https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team)
* [5 Agile Practices Every SRE Team Should Adopt](https://www.openshift.com/blog/5-agile-practices-every-sre-team-should-adopt)
* [7 Best Practices for Writing Kubernetes Operators: An SRE Perspective](https://www.openshift.com/blog/7-best-practices-for-writing-kubernetes-operators-an-sre-perspective)
* [Profiling: Measurement and Analysis](https://technology.riotgames.com/news/profiling-measurement-and-analysis)
* [Running Online Services at Riot: Part I](https://technology.riotgames.com/news/running-online-services-riot-part-i)
* [Running Online Services at Riot: Part II](https://technology.riotgames.com/news/running-online-services-riot-part-ii)
* [Running Online Services at Riot: Part III](https://technology.riotgames.com/news/running-online-services-riot-part-iii)
* [Running Online Services at Riot: Part III: Part Deux](https://technology.riotgames.com/news/running-online-services-riot-part-iii-part-deux)
* [Running Online Services at Riot: Part IV](https://technology.riotgames.com/news/running-online-services-riot-part-iv)
* [Running Online Services at Riot: Part V](https://technology.riotgames.com/news/running-online-services-riot-part-v)
* [The Evolution of Security at Riot](https://technology.riotgames.com/news/evolution-security-riot)
* [Running an Automated Test Pipeline for the League Client Update](https://technology.riotgames.com/news/running-automated-test-pipeline-league-client-update)
* [Automated Testing for League of Legends](https://technology.riotgames.com/news/automated-testing-league-legends)
* [Looking at the Kubernetes Control Plane for Multi-Tenancy](https://engineering.salesforce.com/looking-at-the-kubernetes-control-plane-for-multi-tenancy-88914cd7aa89)
* [Optimizing EKS networking for scale](https://engineering.salesforce.com/optimizing-eks-networking-for-scale-1325706c8f6d)
* [Zero Downtime Node Patching in a Kubernetes Cluster](https://engineering.salesforce.com/zero-downtime-node-patching-in-a-kubernetes-cluster-cdceb21c8c8c)
* [How, Not Why: An Alternative to the Five Whys for Post-Mortems](https://engineering.salesforce.com/how-not-why-an-alternative-to-the-five-whys-for-post-mortems-4518098cca17)
* [A Generic Sidecar Injector for Kubernetes](https://engineering.salesforce.com/a-generic-sidecar-injector-for-kubernetes-c05eede1f6bb)
* [Implementation of a monitoring strategy for products based on microservices](https://engineering.salesforce.com/implementation-of-a-monitoring-strategy-for-products-based-on-microservices-24ad24c4c3e5)
* [10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use](https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c)
* [Our Journey to a Near Perfect Log Pipeline](https://engineering.salesforce.com/our-journey-to-a-near-perfect-log-pipeline-6ae2f80cf7a0)
* [Optimizing Performance with Web Workers](https://engineering.salesforce.com/optimizing-performance-with-web-workers-612b48621d8d)
* [Take A Moment To Refocus](https://engineering.salesforce.com/take-a-moment-to-refocus-86b6546c90c)
* [Reliability engineering for some of top 10 sites in Scandinavia](https://alexewerlof.medium.com/reliability-engineering-for-some-of-top-10-sites-in-scandinavia-91e388d8d13a)
* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
* [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)
* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)
* [It’s Just a Monitoring Change](https://sbg.technology/2020/12/09/its-just-a-monitoring-change/)
* [“What's the worst that could happen?”: A worked example of how we deal with live incidents](https://sbg.technology/2020/04/02/whats-the-worst-that-can-happen/)
* [Rising from the Ashes](https://sbg.technology/2020/02/07/rising-from-the-ashes/)
* [Crash! Bang! Wallop! Practice makes perfect](https://sbg.technology/2018/05/04/firedrills-in-core/)
* [Performance Left Right and Center](https://sbg.technology/2017/10/23/performance-left-right-and-center/)
* [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)
* [How to Implement Service Level Objectives in New Relic APM](https://medium.com/slalom-build/how-to-implement-service-level-objectives-in-new-relic-apm-f34f8746118b)
* [Beginners Guide to DevOps: How to Make It into the Industry](https://medium.com/slalom-build/beginners-guid-to-devops-how-to-make-it-into-the-industry-c1652d59807)
* [Why isn’t all test automation run on the pipeline?](https://medium.com/slalom-build/why-isnt-all-test-automation-run-on-the-pipeline-b2c57afbdf5a)
* [The Many Shapes of Site Reliability Engineering](https://medium.com/slalom-build/the-many-shapes-of-site-reliability-engineering-468359866517)
* [How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS](https://medium.com/slalom-build/how-to-build-a-secure-by-default-kubernetes-cluster-with-a-basic-ci-cd-pipeline-on-aws-ebfe0da1c7c9)
* [Secret Management Architectures: Finding the balance between security and complexity](https://medium.com/slalom-build/secret-management-architectures-finding-the-balance-between-security-and-complexity-d857ceaa2300)
* [Detecting Malicious Requests with Keras & Tensorflow](https://medium.com/slalom-build/detecting-malicious-requests-with-keras-tensorflow-5d5db06b4f28)
* [The Lego Monolith — A Monolith Microservice Proof of Concept](https://medium.com/slalom-build/the-lego-monolith-a-monolith-microservice-proof-of-concept-a402ca1654e4)
* [Managing Secrets Using Hashicorp Vault](https://medium.com/slalom-build/managing-secrets-using-hashicorp-vault-ed6b9e0375ac)
* [Packaging Spring Boot Applications for Deployment on Kubernetes](https://medium.com/slalom-build/packaging-spring-boot-applications-for-deployment-on-kubernetes-5fb64bc65406)
* [Immutable Infrastructure and Continuous Delivery in the Cloud](https://medium.com/slalom-build/immutable-infrastructure-and-continuous-delivery-in-the-cloud-56ee4b31b8d5)
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
* [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
* [Designing a Better Kubernetes Experience for Developers](https://engineering.atspotify.com/2021/03/01/designing-a-better-kubernetes-experience-for-developers/)
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)
* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)
* [“This should never happen. If it does, call the developers.”](https://stackoverflow.blog/2021/03/18/creating-a-good-feedback-loop-between-ops-and-devs-using-documentation/)
* [Infrastructure as code: Create and configure infrastructure elements in seconds](https://stackoverflow.blog/2021/03/08/infrastructure-as-code-create-and-configure-infrastructure-elements-in-seconds/)
* [Fulfilling the promise of CI/CD](https://stackoverflow.blog/2021/01/19/fulfilling-the-promise-of-ci-cd/)
* [Guest Post - Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)
* [Scaling Club Leaderboard Infrastructure for Millions of Users](https://medium.com/strava-engineering/scaling-club-leaderboard-infrastructure-for-millions-of-users-9ee857ce8cfe)
* [Distributed Tracing at Strava](https://medium.com/strava-engineering/distributed-tracing-at-strava-e9d784b9ddf2)
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 1](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-1-b0850a7e5224)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 2](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-2-b051da2ee85b)
* [Tinder’s move to Kubernetes](https://medium.com/tinder-engineering/tinders-move-to-kubernetes-cda2a6372f44)
* [Benefits of benchmarking with Go](https://medium.com/tokopedia-engineering/benefits-of-benchmarking-with-go-f8bfa177f7fa)
* [Simulating Customized Chaos in Golang using Toxiproxy](https://medium.com/tokopedia-engineering/simulating-customized-chaos-in-golang-using-toxiproxy-b913584d88a7)
* [How Tokopedia Rank Millions of Products in Search Page](https://medium.com/tokopedia-engineering/how-tokopedia-rank-millions-of-products-in-search-page-70e358ea2274)
* [Logging at Twitter: Updated](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/logging-at-twitter-updated)
* [Deleting data distributed throughout your microservices architecture](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/deleting-data-distributed-throughout-your-microservices-architecture)
* [Deterministic Aperture: A distributed, load balancing algorithm](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/daperture-load-balancer)
* [MetricsDB: TimeSeries Database for storing metrics at Twitter](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb)
* [Web Performance and Related Stories — upgrad.com](https://engineering.upgrad.com/web-performance-and-related-stories-upgrad-com-a9fb9c6bb766)
* [Beginner’s guide to web analytics](https://engineering.upgrad.com/beginners-guide-to-analytics-c8ce3e92fa42)
* [iOS Continuous Deployment with Bitbucket, Jenkins and Fastlane at UpGrad](https://engineering.upgrad.com/ios-continuous-deployment-with-bitbucket-jenkins-and-fastlane-at-upgrad-699b3b48acca)
* [How We Improved Website Performance by Evolving Our Infrastructure](https://www.wix.engineering/post/how-we-improved-website-performance-by-evolving-our-infrastructure)
* [Wix Inbox Journey: 3 Approaches for Zero Downtime Database Migration](https://www.wix.engineering/post/wix-inbox-journey-3-approaches-for-zero-downtime-database-migration)
* [Moving Velo to Multiple Container Sites: The Why, The How and The Lessons Learned](https://www.wix.engineering/post/moving-velo-to-multiple-container-sites-the-why-the-how-and-the-lessons-learned)
* [Making Order in CI/CD Mess](https://www.wix.engineering/post/making-order-in-ci-cd-mess)
* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
* [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus)
* [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald)
* [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas)
* [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali)
* [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke)
* [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne)
* [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas)
* [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)
* [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve)
* [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower)
* [MessageBird - Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne)
* [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal)
* [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger)
* [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins)
* [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable)
* [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan)
* [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin)
* [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek)
* [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec)
* [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan)
* [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin)
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)
* [Building Secure & Reliable Systems](https://www.oreilly.com/library/view/building-secure-and/9781492083115/) | [Read free online version hosted by Google](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/) | [Read free online version hosted by Google](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook from Google](https://www.oreilly.com/library/view/the-site-reliability/9781492029496/) | [Read free online version hosted by Google](https://sre.google/workbook/table-of-contents/)
* [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/) | [Read free online version hosted by Google](https://github.com/google/googlesre/blob/main/publications/Training_Site_Reliability_Engineers.pdf)
* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/) | [Complimentary Copy from Nginx](https://www.nginx.com/resources/library/97-things-every-sre-should-know/)
* [Incident Metrics in SRE](https://www.oreilly.com/library/view/incident-metrics-in/9781098103163/) | [Read free online version hosted by Google](https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/)
* [Monitoring the SRE Golden Signals](https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook)
* [Site Reliability Engineering: Philosophies, habits, and tools for SRE success](https://newrelic.com/resources/ebooks/site-reliability-engineering) | [Portable version](https://newrelic.com/sites/default/files/2021-08/site-reliability-engineering-handbook.pdf)
* [97 Things Every Cloud Engineer Should Know](https://www.redhat.com/rhdc/managed-files/cl-97-things-cloud-engineers-know-e-book-oreilly-f28602-202105-en.pdf)
If you use this list anywhere, please credit [@upgundecha](https://www.twitter.com/upgundecha) on Twitter. If you like my work, check out my other projects on GitHub.