# How they SRE

![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square) [![CI](https://github.com/upgundecha/howtheysre/actions/workflows/workflow.yml/badge.svg)](https://github.com/upgundecha/howtheysre/actions/workflows/workflow.yml) [![CodeQL](https://github.com/upgundecha/howtheysre/actions/workflows/codeql.yml/badge.svg)](https://github.com/upgundecha/howtheysre/actions/workflows/codeql.yml)

![How they SRE](headline.png)

> A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

## Introduction

__How They SRE__ is a curated knowledge repository of the SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.

Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. This repository curates content from those sources.

_Note to readers: This list includes some articles, posts, videos, tools, and techniques published before 2015. Please use such material with caution, as more recent advances in technology and practice may offer better alternatives and perspectives._

### Topics

* Site Reliability Engineering
* Hiring and Building SRE teams
* SRE Culture
* DevOps
* Monitoring & Observability
* Alerting
* Incident Response & Post-Mortem
* On-Call
* Testing in Production
* Chaos Engineering
* Automation
* Performance

## Organizations

<details>
  <summary>Achievers</summary>

### Blog Posts

* [Enter the Abattoir - Building 'à la carte' gitops tooling](https://achievers.engineering/enter-the-abattoir-ee5e2019f0b3)
* [Scaling Production Globally — The service mesh facelift (Part-1)](https://achievers.engineering/scaling-production-globally-service-mesh-face-lift-part-1-30ad6d393d04)
* [Scaling Production Globally - Solving observability problems for developers (Part-2)](https://achievers.engineering/scaling-production-globally-solving-observability-problems-for-developers-part-2-b5416ce5eb8a)
* [Load Testing Kubernetes: Building a Framework (Part-1)](https://achievers.engineering/load-testing-kubernetes-building-a-framework-part-1-bdc0af4ae7e2)
* [Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)](https://achievers.engineering/load-testing-kubernetes-resolving-bottlenecks-and-improving-performance-part-2-c4f08102f105)

</details>

<details>
  <summary>Airbnb</summary>

### Blog Posts

* [Automated Incident Management Through Slack](https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f)
* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
* [Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb](https://medium.com/airbnb-engineering/intelligent-automation-platform-empowering-conversational-ai-and-beyond-at-airbnb-869c44833ff2)
* [Production Secret Management at Airbnb](https://medium.com/airbnb-engineering/production-secret-management-at-airbnb-ad230e1bc0f6)
* [Automating Data Protection at Scale, Part 1](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08)
* [Automating Data Protection at Scale, Part 2](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-2-c2b8d2068216)
* [Automating Data Protection at Scale, Part 3](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-3-34e592c45d46)

</details>

<details>
  <summary>Algolia</summary>

### Blog Posts

* [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)

</details>

<details>
  <summary>Alibaba Cloud</summary>

### Blog Posts

* [Why Are the Top Internet Companies Choosing SRE over Traditional O&M?](https://www.alibabacloud.com/blog/why-are-the-top-internet-companies-choosing-sre-over-traditional-o%26m_596099)
* [Architecture and Practices of Bilibili's Real-time Platform](https://www.alibabacloud.com/blog/architecture-and-practices-of-bilibilis-real-time-platform_596676)

</details>

<details>
  <summary>Asana</summary>

### Blog Posts

* [How Asana uses Asana: Security incident response](https://blog.asana.com/2021/09/engineering-security-incident-response/#close)
* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
* [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)

</details>

<details>
  <summary>ASOS</summary>

### Blog Posts

* [Playing the blame-less game](https://medium.com/asos-techblog/playing-the-blame-less-game-3708f8195344)
* [A day in the life of… Cat S (Head of Reliability Engineering)](https://medium.com/asos-techblog/a-day-in-the-life-of-cat-smith-head-of-reliability-engineering-629e10a26590)
* [An AKS Performance Journey: Part 1 — Sizing Everything Up](https://medium.com/asos-techblog/an-aks-performance-journey-part-1-sizing-everything-up-ee6d2346ea99)
* [An AKS Performance Journey: Part 2 — Networking It Out](https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69)
* [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
* [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)

</details>

<details>
  <summary>Atlassian</summary>

### Blog Posts

* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
* [Incident Postmortem Template](https://www.atlassian.com/incident-management/postmortem/templates)

</details>

<details>
  <summary>Back Market</summary>

### Blog Posts

* [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)

</details>

<details>
  <summary>Baidu</summary>

### Videos

* [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
* [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
* [Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity](https://www.youtube.com/watch?v=x3c0PPkSf14)

</details>

<details>
  <summary>Basecamp</summary>

### Blog Posts

* [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
* [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
* [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
* [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)

### Books

* [Shape Up](https://basecamp.com/shapeup/webbook)

</details>

<details>
  <summary>Bloomberg</summary>

### Videos

* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
* [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
* [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)

</details>

<details>
  <summary>Booking.com</summary>

### Blog Posts

* [How Reliability and Product Teams Collaborate at Booking.com](https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)
* [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)
* [Troubleshooting: A journey into the unknown](https://medium.com/booking-com-infrastructure/troubleshooting-a-journey-into-the-unknown-e31b524fa86)

### Videos

* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)

</details>

<details>
  <summary>Capital One</summary>

### Blog Posts

* [Automate Application Monitoring with Slack](https://www.capitalone.com/tech/software-engineering/how-to-automate-application-monitoring-slack-bots/)
* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
* [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
* [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)
* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)
* [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7)
* [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)
* [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)
* [Secure Docker Containers Require Secure Applications](https://medium.com/capital-one-tech/secure-docker-containers-require-secure-applications-75eb358abef9)
* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)
* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)
* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)
* [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94)
* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)

### Major incidents & analysis reports

* [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
* [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)
  
### Videos

* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)

</details>

<details>
  <summary>Coinbase</summary>

### Blog Posts

* [Open Sourcing Coinbase’s Secure Deployment Pipeline](https://blog.coinbase.com/open-sourcing-coinbases-secure-deployment-pipeline-ae6c78e25517)
  
</details>

<details>
  <summary>DAZN</summary>

### Blog Posts

* [Site Reliability at DAZN](https://medium.com/dazn-tech/site-reliability-at-dazn-a3ba4af0638d)

</details>

<details>
  <summary>DBS</summary>

### Blog Posts

* [Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far](https://medium.com/dbs-tech-blog/presenting-at-ithomes-sre-conference-our-dbs-sre-transformation-journey-thus-far-9b6778ce53e8)
* [Debunking the seven most popular Site Reliability Engineering myths](https://medium.com/dbs-tech-blog/debunking-the-seven-most-popular-site-reliability-engineering-myths-a3be8d870ff2)
* [How To Use SRE To Cultivate A Blameless Culture In The Workplace](https://medium.com/dbs-tech-blog/how-to-use-sre-to-cultivate-a-blameless-culture-in-the-workplace-1981fd1c7871)
* [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)
* [Automating Configuration Management at Scale](https://medium.com/dbs-tech-blog/automating-configuration-management-at-scale-5c7927f83df3)
* [How DBS dispelled the myths of Chaos Engineering](https://medium.com/dbs-tech-blog/how-dbs-dispelled-the-myths-of-chaos-engineering-e5873ac78c9)
* [Double, Double Toil and Trouble](https://medium.com/dbs-tech-blog/double-double-toil-and-trouble-applying-sre-practices-to-alleviate-toil-for-devops-teams-259b958a10dd)

### Videos

* [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)

</details>

<details>
  <summary>DeepSource</summary>

### Blog Posts

* [Redis diskless replication: What, how, why and the caveats](https://deepsource.io/blog/redis-diskless-replication/)
* [How to setup Vault with Kubernetes](https://deepsource.io/blog/setup-vault-kubernetes/)
* [Breaking down zero downtime deployments in Kubernetes](https://deepsource.io/blog/zero-downtime-deployment/)

</details>

<details>
  <summary>Dream11</summary>

### Blog Posts

* [Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.](https://blog.dream11engineering.com/deployment-at-scale-story-behind-dream11s-in-house-blue-green-deployment-platform-oneclick-b2c761b12896)
* [Enhancing security and trust with AWS WAFv2](https://blog.dream11engineering.com/enhancing-security-and-trust-with-aws-wafv2-8b050b1cba37)
* [Lessons learned from running GraphQL at scale](https://blog.dream11engineering.com/lessons-learned-from-running-graphql-at-scale-2ad60b3cefeb)
* [Break circuits, save Kong 🦍](https://blog.dream11engineering.com/break-circuits-save-kong-3680d88a0639)
* [Finding Order in Chaos: How We Automated Performance Testing with Torque](https://blog.dream11engineering.com/finding-order-in-chaos-how-we-automated-performance-testing-with-torque-6eb63706fcea)
* [Maintaining hyper-sonic releases at Dream11](https://blog.dream11engineering.com/maintaining-hyper-sonic-releases-at-dream11-c26f2145fe28)
* [To Scale In Or Scale Out? Here’s How We Scale at Dream11](https://blog.dream11engineering.com/to-scale-in-or-scale-out-heres-how-we-scale-at-dream11-f88ef5e71cbc)
* [Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11](https://blog.dream11engineering.com/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33)

</details>

<details>
  <summary>Dropbox</summary>

### Blog Posts

* [Dropbox Engineering Career Framework - Reliability Engineer (SRE)](https://dropbox.github.io/dbx-career-framework/)
* [Atlas: Our journey from a Python monolith to a managed platform](https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform)
* [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
* [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)

### Videos

* [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)

</details>

<details>
  <summary>eBay</summary>

### Blog Posts

* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)

### Videos

* [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)

</details>

<details>
  <summary>Epic Games</summary>

### Videos

* [AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players](https://youtu.be/MCLrA401vHw)

</details>

<details>
  <summary>Etsy</summary>

### Blog Posts

* [Improving the Deployment Experience of a Ten-Year Old Application](https://codeascraft.com/)
* [How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020](https://codeascraft.com/2021/02/25/how-etsy-prepared-for-historic-volumes-of-holiday-traffic-in-2020/)
* [Your brain on progress](https://increment.com/reliability/brain-on-progress/)
* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
* [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
* [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
* [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)

### Videos

* [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
* [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)

</details>

<details>
  <summary>Expedia</summary>

### Blog Posts

* [Automating Performance Standards](https://medium.com/expedia-group-tech/automating-performance-standards-b51efc92d237)
* [Error Budget Policy - Part 1 - Adoption at Expedia Group](https://medium.com/expedia-group-tech/error-budget-policy-adoption-at-expedia-group-7d80d41c4a8b)
* [Error Budget Policy - Part 2 - Practices at Expedia Group](https://medium.com/expedia-group-tech/error-budget-policies-in-practice-4c98f56a28c1)
* [Using Fault-Injection to Improve our new Runtime Platform’s Reliability](https://medium.com/expedia-group-tech/using-fault-injection-to-improve-our-new-platforms-reliability-656b1147b132)
* [Learning from Incidents at Expedia Group](https://medium.com/expedia-group-tech/learning-from-incidents-at-expedia-group-51a8c72a4286)
* [Improving Vrbo Homepage Loading Experience](https://medium.com/expedia-group-tech/improving-vrbo-homepage-loading-experience-e4b2207535f4)
* [Troubleshooting 502 errors: ECS Checklist](https://medium.com/expedia-group-tech/troubleshooting-502-errors-ecs-checklist-9da383399d96)
* [Getting Started with Elasticsearch](https://medium.com/expedia-group-tech/getting-started-with-elastic-search-6af62d7df8dd)
* [All about ISTIO-PROXY 5xx Issues](https://medium.com/expedia-group-tech/all-about-istio-proxy-5xx-issues-e0221b29e692)
* [Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?](https://medium.com/expedia-group-tech/autoscaling-in-kubernetes-why-doesnt-the-horizontal-pod-autoscaler-work-for-me-5f0094694054)
* [How to Keep Your Kubernetes Deployments Balanced Across Multiple zones](https://medium.com/expedia-group-tech/how-to-keep-your-kubernetes-deployments-balanced-across-multiple-zones-dfe719847b41)
* [Are Your Dropwizard Latency Metrics Misleading You?](https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374)
* [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
* [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
* [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)

</details>

<details>
  <summary>Fastly</summary>

### Videos

* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)

</details>

<details>
  <summary>G-Research</summary>

### Blog Posts

* [Our SRE Journey at G-Research](https://www.gresearch.com/blog/article/our-sre-journey-at-g-research/)
* [The SRE Journey Continues](https://www.gresearch.com/blog/article/the-sre-journey-continues/)
* [OpenTSDB Meta Cache – trade-offs for performance](https://www.gresearch.com/blog/article/opentsdb-meta-cache-trade-offs-for-performance/)

</details>

<details>
  <summary>Getaround</summary>

### Blog Posts

* [How we handle incidents at Getaround](https://getaround.tech/incident-handling-at-getaround/)
* [Evolution Of Our Continuous Delivery Process](https://getaround.tech/continuous-integration/)

</details>

<details>
  <summary>GitHub</summary>

### Blog Posts

* [How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com](https://github.blog/2023-09-26-how-github-uses-github-actions-and-actions-larger-runners-to-build-and-test-github-com/)
* [The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects](https://github.blog/2023-09-21-the-github-security-labs-journey-to-disclosing-500-cves-in-open-source-projects/)
* [CodeQL team uses AI to power vulnerability detection in code](https://github.blog/2023-09-12-codeql-team-uses-ai-to-power-vulnerability-detection-in-code/)
* [Addressing GitHub’s recent availability issues](https://github.blog/2023-05-16-addressing-githubs-recent-availability-issues/)
* [Building organization-wide governance and re-use for CI/CD and automation with GitHub Actions](https://github.blog/2023-04-05-building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions/)
* [Enabling branch deployments through IssueOps with GitHub Actions](https://github.blog/2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions/)
* [Using ChatOps to help Actions on-call engineers](https://github.blog/2021-12-01-using-chatops-to-help-actions-on-call-engineers/)
* [Partitioning GitHub’s relational databases to handle scale](https://github.blog/2021-09-27-partitioning-githubs-relational-databases-scale/)
* [Increasing developer happiness with GitHub code scanning](https://github.blog/2021-09-07-increasing-developer-happiness-github-code-scanning/)
* [Why (and how) GitHub is adopting OpenTelemetry](https://github.blog/2021-05-26-why-and-how-github-is-adopting-opentelemetry/)
* [Improving large monorepo performance on GitHub](https://github.blog/2021-03-16-improving-large-monorepo-performance-on-github/)
* [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
* [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
* [Reducing flaky builds by 18x](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/)
* [The evolving role of operations in DevOps](https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/)
* [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
* [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)

### Major incidents & analysis reports

* [GitHub Availability Report: August 2023](https://github.blog/2023-09-13-github-availability-report-august-2023/)
* [GitHub Availability Report: July 2023](https://github.blog/2023-08-09-github-availability-report-july-2023/)
* [GitHub Availability Report: June 2023](https://github.blog/2023-07-12-github-availability-report-june-2023/)
* [GitHub Availability Report: May 2023](https://github.blog/2023-06-14-github-availability-report-may-2023/)
* [GitHub Availability Report: April 2023](https://github.blog/2023-05-03-github-availability-report-april-2023/)
* [GitHub Availability Report: March 2023](https://github.blog/2023-04-05-github-availability-report-march-2023/)
* [GitHub Availability Report: February 2023](https://github.blog/2023-03-01-github-availability-report-february-2023/)
* [GitHub Availability Report: January 2023](https://github.blog/2023-02-01-github-availability-report-january-2023/)
* [GitHub Availability Report: December 2022](https://github.blog/2023-01-04-github-availability-report-december-2022/)
* [GitHub Availability Report: November 2022](https://github.blog/2022-12-07-github-availability-report-november-2022/)
* [GitHub Availability Report: October 2022](https://github.blog/2022-11-02-github-availability-report-october-2022/)
* [GitHub Availability Report: September 2022](https://github.blog/2022-10-05-github-availability-report-september-2022/)
* [GitHub Availability Report: August 2022](https://github.blog/2022-09-07-github-availability-report-august-2022/)
* [GitHub Availability Report: July 2022](https://github.blog/2022-08-03-github-availability-report-july-2022/)
* [GitHub Availability Report: June 2022](https://github.blog/2022-07-06-github-availability-report-june-2022/)
* [GitHub Availability Report: May 2022](https://github.blog/2022-06-01-github-availability-report-may-2022/)
* [GitHub Availability Report: April 2022](https://github.blog/2022-05-04-github-availability-report-april-2022/)
* [GitHub Availability Report: March 2022](https://github.blog/2022-04-06-github-availability-report-march-2022/)
* [GitHub Availability Report: February 2022](https://github.blog/2022-03-02-github-availability-report-february-2022/)
* [GitHub Availability Report: January 2022](https://github.blog/2022-02-02-github-availability-report-january-2022/)
* [GitHub Availability Report: December 2021](https://github.blog/2022-01-05-github-availability-report-december-2021/)
* [GitHub Availability Report: November 2021](https://github.blog/2021-12-01-github-availability-report-november-2021/)
* [GitHub Availability Report: October 2021](https://github.blog/2021-11-04-github-availability-report-october-2021/)
* [GitHub Availability Report: September 2021](https://github.blog/2021-10-06-github-availability-report-september-2021/)
* [GitHub Availability Report: August 2021](https://github.blog/2021-09-01-github-availability-report-august-2021/)
* [GitHub Availability Report: July 2021](https://github.blog/2021-08-04-github-availability-report-july-2021/)
* [GitHub Availability Report: June 2021](https://github.blog/2021-07-07-github-availability-report-june-2021/)
* [GitHub Availability Report: May 2021](https://github.blog/2021-06-02-github-availability-report-may-2021/)
* [GitHub Availability Report: April 2021](https://github.blog/2021-05-05-github-availability-report-april-2021/)
* [GitHub Availability Report: March 2021](https://github.blog/2021-04-07-github-availability-report-march-2021/)
* [GitHub Availability Report: February 2021](https://github.blog/2021-03-03-github-availability-report-february-2021/)
* [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
* [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
* [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
* [GitHub Availability Report: August 2020](https://github.blog/2020-09-02-github-availability-report-august-2020/)
* [GitHub Availability Report: July 2020](https://github.blog/2020-08-05-github-availability-report-july-2020/)
* [Introducing the GitHub Availability Report](https://github.blog/2020-07-08-introducing-the-github-availability-report/)
* [February service disruptions post-incident analysis](https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/)
* [October 21 post-incident analysis](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
* [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
* [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)

### Videos

* [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)

</details>

<details>
  <summary>GitLab</summary>

### Blog Posts

* [This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...](https://about.gitlab.com/blog/2021/01/14/this-sre-attempted-to-roll-out-an-haproxy-change/)
* [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)
* [Update: Elasticsearch lessons learnt for Advanced Global Search](https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/)
* [Lessons in iteration from a new team in infrastructure](https://about.gitlab.com/blog/2020/11/09/lessons-in-iteration-from-new-infrastructure-team/)
* [How we optimized infrastructure spend at GitLab](https://about.gitlab.com/blog/2020/10/27/how-we-optimized-our-infrastructure-spend-at-gitlab/)
* [How we scaled async workload processing at GitLab.com using Sidekiq](https://about.gitlab.com/blog/2020/06/24/scaling-our-use-of-sidekiq/)
* [Inside GitLab: How we release software patches](https://about.gitlab.com/blog/2020/05/13/how-we-release-software-patches/)
* [What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab](https://about.gitlab.com/blog/2019/11/15/tracking-down-missing-tcp-keepalives/)
* [How we used delayed replication for disaster recovery with PostgreSQL](https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/)

</details>

<details>
  <summary>GoCardless</summary>

### Blog Posts

* [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https://medium.com/gocardless-tech/deploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)
* [How we compress Pub/Sub messages and more, saving a load of money](https://medium.com/gocardless-tech/how-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)
* [Fear-free PostgreSQL migrations for Rails](https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/)
* [Observability at GoCardless: a tale of API performance improvement](https://gocardless.com/blog/observability-at-gocardless-a-tale-of-api-performance-improvement/)
* [Debugging the PostgreSQL query planner](https://gocardless.com/blog/debugging-the-postgres-query-planner/)
* [Zero-downtime Postgres migrations - the hard parts](https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/)
* [In search of performance - how we shaved 200ms off every POST request](https://gocardless.com/blog/in-search-of-performance-how-we-shaved-200ms-off-every-post-request/)

### Major incidents & analysis reports

* [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/)
* [Incident review: API and Dashboard outage on 10 October 2017](https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/)

</details>

<details>
  <summary>GoDaddy</summary>

### Blog Posts

* [Kubernetes Gated Deployments](https://www.godaddy.com/engineering/2019/08/13/kubernetes-gated-deployments/)
* [Kubernetes External Secrets](https://www.godaddy.com/engineering/2019/04/16/kubernetes-external-secrets/)
* [Kubernetes - A Practical Introduction for Application Developers](https://www.godaddy.com/engineering/2018/05/02/kubernetes-introduction-for-developers/)
* [An Intuitive Node.js Client for the Kubernetes API](https://www.godaddy.com/engineering/2018/04/10/an-intuitive-nodejs-client-for-the-kubernetes-api/)

</details>

<details>
  <summary>Gojek</summary>

### Blog Posts

* [Introducing Skynet: Infrastructure as Code for Gojek](https://www.gojek.io/blog/introducing-skynet/)
* [Scaling Our Geo-Search Service For 10x Load](https://www.gojek.io/blog/scaling-our-geo-search-service-for-10x-load/)
* [Why We Swear by the RCA](https://www.gojek.io/blog/why-we-swear-by-the-rca)
* [How We Upgrade Kubernetes on GKE](https://blog.gojek.io/how-we-upgrade-kubernetes-on-gke/)
* [How We Monitor Apache Airflow in Production](https://blog.gojek.io/how-we-monitor-apache-airflow-in-production/)

</details>

<details>
  <summary>Goldman Sachs</summary>

### Blog Posts

* [Observability at Scale](https://developer.gs.com/blog/posts/observability-at-scale)
* [Enabling Highly Available Trino Clusters at Goldman Sachs](https://developer.gs.com/blog/posts/enabling-highly-available-trino-clusters-at-goldman-sachs)
* [Infrastructure and the Command Chain Pattern](https://developer.gs.com/blog/posts/infrastructure-and-command-chain-pattern)
* [Mobile CICD with EC2 macOS](https://developer.gs.com/blog/posts/mobile-cicd-with-ec2-macos)
* [Announcing CatchIT - Source Code Secret Scanner](https://developer.gs.com/blog/posts/catchit-source-code-secret-scanner)
* [Building Platforms for Data Engineering](https://developer.gs.com/blog/posts/legend_data_engineering_platforms)

</details>

<details>
  <summary>Google</summary>

### Blog Posts

* [Pitfalls and Patterns in Microservice Dependency Management](https://www.infoq.com/articles/pitfalls-patterns-microservice-dependency-management/)
* [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
* [Google site reliability using Go](https://go.dev/solutions/google/sitereliability)
* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
* [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)
* [How SRE teams are organized, and how to get started](https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started)

### Videos

* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [‘Pragmatic Automation’ with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
* [Must Watch! - Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)
* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke)
* [Implementing Distributed Consensus](https://www.usenix.org/conference/srecon20americas/presentation/ludtke)
* [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin)
* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry)
* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski)
* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood)
* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai)
* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff)
* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason)
* [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe)
* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato)
* [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang)
* [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz)
* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing)
* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji)
* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz)
* [Pragmatic Automation](https://www.usenix.org/conference/srecon19americas/presentation/luebbe)
* [Sublinear Scaling in Practice: The 1k SRE Project](https://www.usenix.org/conference/srecon19americas/presentation/rath)
* [Strategies to Edit Production Data](https://www.usenix.org/conference/srecon19americas/presentation/qiu)
* [The Curse of SRE Autonomy and How to Manage It](https://www.usenix.org/conference/srecon19americas/presentation/bondi)
* [Scaling SRE Organizations: The Journey from 1 to Many Teams](https://www.usenix.org/conference/srecon19americas/presentation/franco)
* [SRE Classroom - How to Design a Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19americas/presentation/thomas)
* [Using PRDs and User Journeys to Design User-Friendly Tools](https://www.usenix.org/conference/srecon19americas/presentation/stockman)
* [How Google SRE and Developers Work Together](https://www.youtube.com/watch?v=DOQqOrHs3VY)
* [SREcon21 - Experiments for SRE](https://www.youtube.com/watch?v=yjusNjAFxFg)

</details>

<details>
  <summary>Grab</summary>

### Blog Posts

* [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
* [Our Journey to Continuous Delivery at Grab (Part 2)](https://engineering.grab.com/blog/2/)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)](https://engineering.grab.com/designing-resilient-systems-part-1)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)](https://engineering.grab.com/designing-resilient-systems-part-2)
* [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https://engineering.grab.com/beyond-retries-part-3)
* [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)
* [How We Designed the Quotas Microservice to Prevent Resource Abuse](https://engineering.grab.com/quotas-service)
* [How We Scaled Our Cache and Got a Good Night's Sleep](https://engineering.grab.com/how-we-scaled-our-cache-and-got-a-good-nights-sleep)

</details>

<details>
  <summary>Grammarly</summary>

### Blog Posts

* [Scaling AWS Infrastructure to Support Multiple Regions](https://www.grammarly.com/blog/engineering/scaling-aws-infrastructure/)
* [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)

</details>

<details>
  <summary>Gusto</summary>

### Blog Posts

* [Service Level Objectives for On-call Peace of Mind](https://engineering.gusto.com/slos-for-peace-of-mind/)
* [Debugging Sidekiq Poison Pills](https://engineering.gusto.com/debugging-sidekiq-poison-pills/)

</details>

<details>
  <summary>Halodoc</summary>

### Blog Posts

* [Site Reliability Engineering for Native mobile apps](https://www.infoq.com/articles/site-reliability-engineering-mobile-apps/)

</details>

<details>
  <summary>Heroku</summary>

### Blog Posts

* [The Adventures of Rendezvous in Heroku’s New Architecture](https://blog.heroku.com/engineering)
* [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)

</details>

<details>
  <summary>IBM</summary>

### Blog Posts

* [What is Site Reliability Engineering (SRE)?](https://www.ibm.com/cloud/learn/site-reliability-engineering)
* [AIOps tools and solutions](https://www.ibm.com/cloud/aiops)

</details>

<details>
  <summary>Indeed</summary>

### Blog Posts

* [Indeed SRE: An Inside Look](https://engineering.indeedblog.com/blog/2022/04/sre/)
* [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
* [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
* [‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)

### Videos

* [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)

</details>

<details>
  <summary>Khan Academy</summary>

### Blog Posts

* [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https://blog.khanacademy.org/how-khan-academy-successfully-handled-2-5x-traffic-in-a-week/)
* [Evolving our content infrastructure](https://blog.khanacademy.org/evolving-our-content-infrastructure/)

</details>

<details>
  <summary>LinkedIn</summary>

### Blog Posts

* [Rethinking site capacity projections with Capacity Analyzer](https://engineering.linkedin.com/blog/2021/rethinking-site-capacity-projections-with-capacity-analyzer)
* [Insights into a Product SRE team at LinkedIn](https://www.linkedin.com/pulse/insights-product-sre-team-linkedin-zaina-afoulki/)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)
* [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
* [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
* [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries)
* [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)
* [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)
* [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)
* [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear)
* [Hiring SREs at LinkedIn, 2017](https://engineering.linkedin.com/blog/2017/07/hiring-sres-at-linkedin)
* [Open Sourcing Iris and Oncall](https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall)
* [Building the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin)
* [Failure is Not an Option](https://engineering.linkedin.com/blog/2017/01/failure-is-not-an-option)
* [MTTD and MTTR Are Key](https://engineering.linkedin.com/blog/2016/12/mttd-and-mttr-are-key)
* [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)

### Videos

* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
* [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen)
* [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading)
* [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring)
* [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe)
* [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness)
* [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal)
* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba)
* [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley)
* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe)
* [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)

### Tools

* [On-Call](https://github.com/linkedin/oncall)

</details>

<details>
  <summary>Loggi</summary>

### Blog Posts

* [The Release Manager model](https://partiu.loggi.com/the-release-manager-model-7af93f9f499f)
* [SRE Teams #8: Loggi](https://sreteams.substack.com/p/loggi)

</details>

<details>
  <summary>Loveholidays</summary>

### Blog Posts

* [Dynamic alert routing with Prometheus and Alertmanager](https://tech.loveholidays.com/dynamic-alert-routing-with-prometheus-and-alertmanager-f6a919edb5f8)
* [Making loveholidays 18% faster with HTTP/3](https://tech.loveholidays.com/making-loveholidays-18-faster-with-http-3-1860879528a7)
* [Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code](https://tech.loveholidays.com/enforcing-best-practice-on-self-serve-infrastructure-with-terraform-atlantis-and-policy-as-code-911f4f8c3e00)
* [The 5 principles that helped scale loveholidays](https://tech.loveholidays.com/the-5-principles-that-helped-scale-loveholidays-7ea0b0fd3df9)
* [Realtime Fastly logs with Grafana Loki for under $1 a day](https://tech.loveholidays.com/realtime-fastly-logs-with-grafana-loki-for-under-1-a-day-5b63ccf32d66)

</details>

<details>
  <summary>Macquarie</summary>

### Blog Posts

* [Our DevSecOps journey with Golang](https://medium.com/macquarie-engineering-blog/our-devsecops-journey-with-golang-a1af38328c36)
* [Pipeline Configuration as Code with Kotlin](https://medium.com/macquarie-engineering-blog/pipeline-configuration-as-code-with-kotlin-dec9ab9ee6fa)
* [DevOps and Segregation of Duties](https://medium.com/macquarie-engineering-blog/devops-and-segregation-of-duties-ea4a7dcc7217)
* [Macquarie embraces DevOps](https://medium.com/macquarie-engineering-blog/macquarie-embraces-devops-30f0fe62496a)
* [Scaling a Kubernetes Platform across the Enterprise](https://medium.com/macquarie-engineering-blog/scaling-a-kubernetes-platform-across-the-enterprise-c07a53b6022e)

</details>

<details>
  <summary>Mattermost</summary>

### Blog Posts

* [Monitoring Cloud Environments at Scale with Prometheus and Thanos](https://mattermost.com/blog/monitoring-cloud-environments-at-scale-with-prometheus-and-thanos/)
* [How We Use Sloth to do SLO Monitoring and Alerting with Prometheus](https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/)

</details>

<details>
  <summary>Meituan (美团)</summary>

### Blog Posts

* [The development and practice of SRE in the cloud (云端的SRE发展与实践)](https://tech.meituan.com/2017/08/03/meituanyun-sre.html)

</details>

<details>
  <summary>Mercari</summary>

### Blog Posts

* [Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems](https://engineering.mercari.com/en/blog/entry/20220805-who-watches-the-watchmen-keeping-an-eye-on-our-monitoring-systems/)
* [What the Microservices SRE Team are doing as SRE Evangelists](https://engineering.mercari.com/en/blog/entry/20220225-cdb2b6deff/)
* [What it’s like to work as an embedded microservices SRE](https://engineering.mercari.com/en/blog/entry/20220228-work-as-an-embedded-microservices-sre/)
* [The Merpay SRE Team: Past and future](https://engineering.mercari.com/en/blog/entry/20210831-a91c3dca9d/)
* [Embedded SRE at Mercari](https://engineering.mercari.com/en/blog/entry/20220221-embedded-sre-at-mercari/)
* [What the SRE team wants to achieve with the development team](https://engineering.mercari.com/en/blog/entry/20210129-embedded-sre/)
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w/ Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)

</details>

<details>
  <summary>Meta</summary>

### Blog Posts

* [Improving Meta’s SLO workflows with data annotations](https://engineering.fb.com/2022/08/29/developer-tools/improving-metas-slo-workflows-with-data-annotations/)
* [SLICK: Adopting SLOs for improved reliability](https://engineering.fb.com/2021/12/13/production-engineering/slick/)
* [More details about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
* [Update about the October 4th outage](https://engineering.fb.com/2021/10/04/networking-traffic/outage/)

### Videos

* [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
* [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
* [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
* [Using ML to Automate Dynamic Error Categorization](https://www.usenix.org/conference/srecon19asia/presentation/davoli)

</details>

<details>
  <summary>Microsoft</summary>

### Videos

* [‘SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
* [‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
* [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
* [Study on Human Factors and Team Culture to Improve Pager Fatigue](https://www.usenix.org/conference/srecon20americas/presentation/barteneva)
* [Prioritizing Trust While Creating Applications](https://www.usenix.org/conference/srecon19emea/presentation/davis)
* [Building Resilience: How to Learn More from Incidents](https://www.usenix.org/conference/srecon19emea/presentation/stenning)
* [A Tale of Two Postmortems: A Human Factors View](https://www.usenix.org/conference/srecon19asia/presentation/lund-postmortem)
* [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
* [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
* [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)

</details>

<details>
  <summary>Miro</summary>

### Blog Posts

* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)

</details>

<details>
  <summary>Monzo</summary>

### Blog Posts

* [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
* [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
* [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
* [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)

### Videos

* [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)

### Tools

* [Response](https://github.com/monzo/response)

</details>

<details>
  <summary>Netflix</summary>

### Blog Posts

* [Achieving observability in async workflows](https://netflixtechblog.com/achieving-observability-in-async-workflows-cd89b923c784)
* [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
* [Lessons from Building Observability Tools at Netflix](https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17)
* [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
* [Telltale: Netflix Application Monitoring Simplified](https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba)
* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)
* [Introducing Dispatch](https://netflixtechblog.com/introducing-dispatch-da4b8a2a8072)
* [Applying Netflix DevOps Patterns to Windows](https://netflixtechblog.com/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79)
* [ChAP: Chaos Automation Platform](https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f)
* [Starting the Avalanche](https://netflixtechblog.com/starting-the-avalanche-640e69b14a06)
* [Netflix Chaos Monkey Upgraded](https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d)
* [Chaos Engineering Upgraded](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa)
* [Automated Failure Testing](https://netflixtechblog.com/automated-failure-testing-86c1b8bc841f)
* [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)
* [Introducing Atlas: Netflix’s Primary Telemetry Platform](https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a)
* [FIT: Failure Injection Testing](https://netflixtechblog.com/fit-failure-injection-testing-35d8e2a9bb2)
* [Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis](https://netflixtechblog.com/announcing-security-monkey-aws-security-configuration-monitoring-and-analysis-1f2bfb001708)
* [Lessons Netflix Learned from the AWS Outage](https://netflixtechblog.com/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04)
* [Scryer: Netflix’s Predictive Auto Scaling Engine](https://netflixtechblog.com/scryer-netflixs-predictive-auto-scaling-engine-a3f8fc922270)

### Major incidents & analysis reports

* [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
  
### Videos

* [AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)](https://www.youtube.com/watch?v=0QS1TWLooo0)
* [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
* [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
* [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https://www.youtube.com/watch?v=DlWYNoLmma8)
* [Real user performance monitoring at Netflix scale - Martin Spier](https://www.youtube.com/watch?v=4RG2DUK03_0)
* [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://www.youtube.com/watch?v=rgfww8tLM0A)
* [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https://www.youtube.com/watch?v=LaKGx0dAUlo)
* [Netflix: Multi-Regional Resiliency and Amazon Route 53](https://www.youtube.com/watch?v=WDDkLOT8SCk)
* [Designing Services for Resilience: Netflix Lessons](https://www.youtube.com/watch?v=RWyZkNzvC-c)
* [South Bay SRE Meetup - Netflix Cloud Performance Team](https://www.youtube.com/watch?v=uQ0flQOtQEA)
* [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https://www.youtube.com/watch?v=T_D1G42G0dE)
* [How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows](https://www.youtube.com/watch?v=8tsIqfvizpU)
* [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM)
* [AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)](https://www.youtube.com/watch?v=leqUbSY55hY)
* [SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs](https://www.youtube.com/watch?v=koGaH4ffXaU)
* [From Sys Admin to Netflix SRE](https://www.youtube.com/watch?v=lZI51YzIgVE)
* [Application Resilience Engineering and Operations at Netflix with Hystrix](https://www.youtube.com/watch?v=RzlluokGi1w)
* [Injecting Failure at Netflix](https://www.youtube.com/watch?v=ioXV28GtXeo)
* [LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability](https://www.youtube.com/watch?v=3D0zS3kPNUU)
* [Incident Management at Netflix Velocity](https://www.infoq.com/presentations/netflix-incident-management/)

### Podcasts

* [Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems](https://www.infoq.com/podcasts/netflix-sre-sociotechnical-systems/)

### Tools

* [Dispatch](https://github.com/Netflix/dispatch)

</details>

<details>
  <summary>New Relic</summary>

### Blog Posts

* [Defining Modern Software Roles: SREs at New Relic](https://newrelic.com/blog/nerd-life/new-relic-sre)
* [10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)](https://newrelic.com/blog/best-practices/site-reliability-engineering-careers)
* [What Tools Do Site Reliability Engineers Use?](https://newrelic.com/blog/best-practices/best-sre-tools)
* [A Day in the Life of a New Relic SRE](https://newrelic.com/blog/nerd-life/what-does-an-sre-do)
* [7 Habits of Highly Successful Site Reliability Engineers](https://newrelic.com/blog/best-practices/site-reliability-engineer-sre-habits)
* [Adopting the practice of SRE](https://newrelic.com/blog/best-practices/adopting-sre-practices)
* [Using modern observability to establish a data-driven culture](https://newrelic.com/blog/best-practices/observability-data-driven-culture)

</details>

<details>
  <summary>Nubank</summary>

### Blog Posts

* [How we deal with technical incidents](https://building.nubank.com.br/how-we-deal-with-incidents/)
* [How we do On-Call Rotations at Nubank](https://building.nubank.com.br/how-we-do-on-call-rotations-at-nubank/)
* [How we scale our data platform efficiently and reliably](https://building.nubank.com.br/distributing-the-data-team-to-boost-innovation-reliably/)
* [Why We Killed Our End-to-End Test Suite](https://building.nubank.com.br/why-we-killed-our-end-to-end-test-suite/)
* [Automatic retraining for machine learning models: tips and lessons learned](https://building.nubank.com.br/automatic-retraining-for-machine-learning-models/)

</details>

<details>
  <summary>OpenAI</summary>

### Blog Posts

* [March 20 ChatGPT outage: Here’s what happened](https://openai.com/blog/march-20-chatgpt-outage)
* [OpenAI SRE and scaling explained easy.](https://medium.com/@Pran-Ker/openai-sre-miracle-19a33bdd3145)
* [Scaling Kubernetes to 2,500 nodes](https://openai.com/research/scaling-kubernetes-to-2500-nodes)
* [Scaling Kubernetes to 7,500 nodes](https://openai.com/research/scaling-kubernetes-to-7500-nodes)
* [Scaling AI Infrastructure at OpenAI](https://www.youtube.com/watch?v=cK7qFZ9J6k0)

</details>

<details>
  <summary>PayPal</summary>

### Blog Posts

* [Triggered: Incident #1234 (incident process needs fixing)](https://medium.com/paypal-tech/triggered-incident-1234-incident-process-needs-fixing-2a09dbac9edd)
* [Implementing Observability in a Service Mesh](https://medium.com/paypal-tech/implementing-observability-in-a-service-mesh-273c7409283d)
* [PostgreSQL at Scale: Database Schema Changes Without Downtime](https://medium.com/paypal-tech/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680)
* [Scaling GraphQL at PayPal](https://medium.com/paypal-tech/scaling-graphql-at-paypal-b5b5ac098810)

### Videos

* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
* [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel)
* [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)

</details>

<details>
  <summary>Picnic</summary>

### Blog Posts

* [Micrometer and the Modern Observability Stack](https://blog.picnic.nl/micrometer-and-the-modern-observability-stack-ebf72283bd8e)
* [Monitoring and Observability at Picnic](https://blog.picnic.nl/monitoring-and-observability-at-picnic-684cefd845c4)

</details>

<details>
  <summary>Pinterest</summary>

### Blog Posts

* [Ensuring High Availability of Ads Realtime Streaming Services](https://medium.com/pinterest-engineering/ensuring-high-availability-of-ads-realtime-streaming-services-ea3889420490)
* [Improving efficiency and reducing runtime using S3 read optimization](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
* [Scaling Kubernetes with Assurance at Pinterest](https://medium.com/pinterest-engineering/scaling-kubernetes-with-assurance-at-pinterest-a23f821168da)
* [What we learned from an iOS app OOMs incident](https://medium.com/pinterest-engineering/what-we-learned-from-an-ios-app-ooms-incident-eb31eada251)
* [How we designed our Continuous Integration System to be more than 50% Faster](https://medium.com/pinterest-engineering/how-we-designed-our-continuous-integration-system-to-be-more-than-50-faster-b70a59342fe2)
* [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
* [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
* [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)

### Videos

* [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
* [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
* [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)

</details>

<details>
  <summary>Postman</summary>

### Blog Posts

* [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)

</details>

<details>
  <summary>Prezi</summary>

### Blog Posts

* [How to avoid global outage — Seamlessly migrating DaemonSet labels](https://engineering.prezi.com/intro-4727024fc2c1)
* [In search of speed — debugging Elasticsearch performance](https://engineering.prezi.com/in-search-of-speed-debugging-elasticsearch-performance-9ce8edf4af40)
* [Prometheus at Prezi: replacing 10 years of anti-patterns](https://engineering.prezi.com/prometheus-at-prezi-replacing-10-years-of-anti-patterns-e3c2317e6ca)

</details>

<details>
  <summary>Red Hat</summary>

### Blog Posts

* [From Ops to SRE: Evolution of the OpenShift Dedicated Team](https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team)
* [5 Agile Practices Every SRE Team Should Adopt](https://www.openshift.com/blog/5-agile-practices-every-sre-team-should-adopt)
* [7 Best Practices for Writing Kubernetes Operators: An SRE Perspective](https://www.openshift.com/blog/7-best-practices-for-writing-kubernetes-operators-an-sre-perspective)

</details>

<details>
  <summary>Riot Games</summary>

### Blog Posts

* [THE LEGENDS OF RUNETERRA CI/CD PIPELINE](https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline)
* [STRATEGIES FOR WORKING IN UNCERTAIN SYSTEMS](https://technology.riotgames.com/news/strategies-working-uncertain-systems)
* [IMPROVING THE DEVELOPER EXPERIENCE FOR OPERATING SERVICES](https://technology.riotgames.com/news/improving-developer-experience-operating-services)
* [SCALABILITY AND LOAD TESTING FOR VALORANT](https://technology.riotgames.com/news/scalability-and-load-testing-valorant)
* [LEVERAGING GOLANG FOR GAME DEVELOPMENT AND OPERATIONS](https://technology.riotgames.com/news/leveraging-golang-game-development-and-operations)
* [CONTROLLED CHAOS WITH FAULT INJECTION TESTING](https://technology.riotgames.com/)
* [DOWN THE RABBIT HOLE OF PERFORMANCE MONITORING](https://technology.riotgames.com/news/down-rabbit-hole-performance-monitoring)
* [PROFILING: THE CASE OF THE MISSING MILLISECONDS](https://technology.riotgames.com/news/profiling-case-missing-milliseconds)
* [PROFILING: REAL WORLD PERFORMANCE IN LEAGUE](https://technology.riotgames.com/news/profiling-real-world-performance-league)
* [PROFILING: OPTIMISATION](https://technology.riotgames.com/news/profiling-optimisation)
* [PROFILING: MEASUREMENT AND ANALYSIS](https://technology.riotgames.com/news/profiling-measurement-and-analysis)
* [RUNNING ONLINE SERVICES AT RIOT: PART I](https://technology.riotgames.com/news/running-online-services-riot-part-i)
* [RUNNING ONLINE SERVICES AT RIOT: PART II](https://technology.riotgames.com/news/running-online-services-riot-part-ii)
* [RUNNING ONLINE SERVICES AT RIOT: PART III](https://technology.riotgames.com/news/running-online-services-riot-part-iii)
* [RUNNING ONLINE SERVICES AT RIOT: PART III: PART DEUX](https://technology.riotgames.com/news/running-online-services-riot-part-iii-part-deux)
* [RUNNING ONLINE SERVICES AT RIOT: PART IV](https://technology.riotgames.com/news/running-online-services-riot-part-iv)
* [RUNNING ONLINE SERVICES AT RIOT: PART V](https://technology.riotgames.com/news/running-online-services-riot-part-v)
* [THE EVOLUTION OF SECURITY AT RIOT](https://technology.riotgames.com/news/evolution-security-riot)
* [RUNNING AN AUTOMATED TEST PIPELINE FOR THE LEAGUE CLIENT UPDATE](https://technology.riotgames.com/news/running-automated-test-pipeline-league-client-update)
* [AUTOMATED TESTING FOR LEAGUE OF LEGENDS](https://technology.riotgames.com/news/automated-testing-league-legends)

</details>

<details>
  <summary>Salesforce</summary>

### Blog Posts

* [Looking at the Kubernetes Control Plane for Multi-Tenancy](https://engineering.salesforce.com/looking-at-the-kubernetes-control-plane-for-multi-tenancy-88914cd7aa89)
* [Optimizing EKS networking for scale](https://engineering.salesforce.com/optimizing-eks-networking-for-scale-1325706c8f6d)
* [Zero Downtime Node Patching in a Kubernetes Cluster](https://engineering.salesforce.com/zero-downtime-node-patching-in-a-kubernetes-cluster-cdceb21c8c8c)
* [How, Not Why: An Alternative to the Five Whys for Post-Mortems](https://engineering.salesforce.com/how-not-why-an-alternative-to-the-five-whys-for-post-mortems-4518098cca17)
* [A Generic Sidecar Injector for Kubernetes](https://engineering.salesforce.com/a-generic-sidecar-injector-for-kubernetes-c05eede1f6bb)
* [Implementation of a monitoring strategy for products based on microservices](https://engineering.salesforce.com/implementation-of-a-monitoring-strategy-for-products-based-on-microservices-24ad24c4c3e5)
* [10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use](https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c)
* [Our Journey to a Near Perfect Log Pipeline](https://engineering.salesforce.com/our-journey-to-a-near-perfect-log-pipeline-6ae2f80cf7a0)
* [Optimizing Performance with Web Workers](https://engineering.salesforce.com/optimizing-performance-with-web-workers-612b48621d8d)
* [Take A Moment To Refocus](https://engineering.salesforce.com/take-a-moment-to-refocus-86b6546c90c)

</details>

<details>
  <summary>Schibsted Media</summary>

### Blog Posts

* [Reliability engineering for some of top 10 sites in Scandinavia](https://alexewerlof.medium.com/reliability-engineering-for-some-of-top-10-sites-in-scandinavia-91e388d8d13a)

</details>

<details>
  <summary>Scribd</summary>

### Blog Posts

* [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html)
* [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html)
* [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)

</details>

<details>
  <summary>Shopify</summary>

### Blog Posts

* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
* [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)
* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)

### Videos

* [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge)
* [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne)
* [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)

</details>

<details>
  <summary>Sky Betting and Gaming</summary>

### Blog Posts

* [It’s Just a Monitoring Change](https://sbg.technology/2020/12/09/its-just-a-monitoring-change/)
* [“What's the worst that could happen?”: A worked example of how we deal with live incidents](https://sbg.technology/2020/04/02/whats-the-worst-that-can-happen/)
* [Rising from the Ashes](https://sbg.technology/2020/02/07/rising-from-the-ashes/)
* [Crash! Bang! Wallop! Practice makes perfect](https://sbg.technology/2018/05/04/firedrills-in-core/)
* [Performance Left Right and Center](https://sbg.technology/2017/10/23/performance-left-right-and-center/)

</details>

<details>
  <summary>Slack</summary>

### Blog Posts

* [Slack’s Incident on 2-22-22](https://slack.engineering/slacks-incident-on-2-22-22/)
* [Infrastructure Observability for Changing the Spend Curve](https://slack.engineering/infrastructure-observability-for-changing-the-spend-curve/)
* [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)

### Videos

* [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton)
* [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)

</details>

<details>
  <summary>Slalom Build</summary>

### Blog Posts

* [How to Implement Service Level Objectives in New Relic APM](https://medium.com/slalom-build/how-to-implement-service-level-objectives-in-new-relic-apm-f34f8746118b)
* [Beginners Guide to DevOps: How to Make It into the Industry](https://medium.com/slalom-build/beginners-guid-to-devops-how-to-make-it-into-the-industry-c1652d59807)
* [GitHub Actions: Beyond CI/CD](https://medium.com/slalom-build/github-actions-beyond-ci-cd-cb3ddc6abaa)
* [Why isn’t all test automation run on the pipeline?](https://medium.com/slalom-build/why-isnt-all-test-automation-run-on-the-pipeline-b2c57afbdf5a)
* [The Many Shapes of Site Reliability Engineering](https://medium.com/slalom-build/the-many-shapes-of-site-reliability-engineering-468359866517)
* [How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS](https://medium.com/slalom-build/how-to-build-a-secure-by-default-kubernetes-cluster-with-a-basic-ci-cd-pipeline-on-aws-ebfe0da1c7c9)
* [Secret Management Architectures: Finding the balance between security and complexity](https://medium.com/slalom-build/secret-management-architectures-finding-the-balance-between-security-and-complexity-d857ceaa2300)
* [Detecting Malicious Requests with Keras & Tensorflow](https://medium.com/slalom-build/detecting-malicious-requests-with-keras-tensorflow-5d5db06b4f28)
* [The Lego Monolith — A Monolith Microservice Proof of Concept](https://medium.com/slalom-build/the-lego-monolith-a-monolith-microservice-proof-of-concept-a402ca1654e4)
* [Managing Secrets Using Hashicorp Vault](https://medium.com/slalom-build/managing-secrets-using-hashicorp-vault-ed6b9e0375ac)
* [Packaging Spring Boot Applications for Deployment on Kubernetes](https://medium.com/slalom-build/packaging-spring-boot-applications-for-deployment-on-kubernetes-5fb64bc65406)
* [Immutable Infrastructure and Continuous Delivery in the Cloud](https://medium.com/slalom-build/immutable-infrastructure-and-continuous-delivery-in-the-cloud-56ee4b31b8d5)

</details>

<details>
  <summary>SoundCloud</summary>

### Blog Posts

* [How to Successfully Hand Over Systems](https://developers.soundcloud.com/blog/how-to-successfully-hand-over-systems)
* [Building a Healthy On-Call Culture](https://developers.soundcloud.com/blog/building-a-healthy-on-call-culture)
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
* [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
* [What I Learned in One Year as an SRE Trainee](https://developers.soundcloud.com/blog/sre-trainee)
* [Tests Under the Magnifying Lens](https://developers.soundcloud.com/blog/tests-under-the-magnifying-lens)

</details>

<details>
  <summary>Spotify</summary>

### Blog Posts

* [Matt Clarke: Senior Backend Infrastructure Engineer](https://engineering.atspotify.com/2021/03/09/my-beat-matt-clarke/)
* [Designing a Better Kubernetes Experience for Developers](https://engineering.atspotify.com/2021/03/01/designing-a-better-kubernetes-experience-for-developers/)
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)

### Videos

* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)

</details>

<details>
  <summary>Squarespace</summary>

### Blog Posts

* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)

### Videos

* [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
* [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
* [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
* [Creating a Code Review Culture](https://www.usenix.org/conference/srecon19americas/presentation/turner)

</details>

<details>
  <summary>Stack Overflow</summary>

### Blog Posts

* [“This should never happen. If it does, call the developers.”](https://stackoverflow.blog/2021/03/18/creating-a-good-feedback-loop-between-ops-and-devs-using-documentation/)
* [Infrastructure as code: Create and configure infrastructure elements in seconds](https://stackoverflow.blog/2021/03/08/infrastructure-as-code-create-and-configure-infrastructure-elements-in-seconds/)
* [Fulfilling the promise of CI/CD](https://stackoverflow.blog/2021/01/19/fulfilling-the-promise-of-ci-cd/)
* [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
* [Guest Post - Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
* [How We Built Our Blog](https://stackoverflow.blog/2015/07/02/how-we-built-our-blog/)
* [Stack Overflow Frees Up Engineering Time with Netlify](https://www.netlify.com/blog/stack-overflow-case-study/)

### Videos

* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)

</details>

<details>
  <summary>Strava</summary>

### Blog Posts

* [Scaling Club Leaderboard Infrastructure for Millions of Users](https://medium.com/strava-engineering/scaling-club-leaderboard-infrastructure-for-millions-of-users-9ee857ce8cfe)
* [Distributed Tracing at Strava](https://medium.com/strava-engineering/distributed-tracing-at-strava-e9d784b9ddf2)

</details>

<details>
  <summary>Stripe</summary>

### Blog Posts

* [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
* [Fast builds, secure builds. Choose two.](https://stripe.com/blog/fast-secure-builds-choose-two)
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog)

### Videos

* [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
* [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)

</details>

<details>
  <summary>Target</summary>

### Blog Posts

* [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
* [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
* [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)

</details>

<details>
  <summary>Teads</summary>

### Blog Posts

* [Scaling your on-duty team](https://medium.com/teads-engineering/scaling-your-on-duty-team-bc467c480747)

</details>

<details>
  <summary>Tinder</summary>

### Blog Posts

* [The Ultimate Load Test](https://medium.com/tinder-engineering/the-ultimate-load-test-c32b37adc11b)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 1](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-1-b0850a7e5224)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 2](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-2-b051da2ee85b)
* [Tinder’s move to Kubernetes](https://medium.com/tinder-engineering/tinders-move-to-kubernetes-cda2a6372f44)

</details>

<details>
  <summary>Tokopedia</summary>

### Blog Posts

* [Benefits of benchmarking with Go](https://medium.com/tokopedia-engineering/benefits-of-benchmarking-with-go-f8bfa177f7fa)
* [Simulating Customized Chaos in Golang using Toxiproxy](https://medium.com/tokopedia-engineering/simulating-customized-chaos-in-golang-using-toxiproxy-b913584d88a7)
* [How Tokopedia Rank Millions of Products in Search Page](https://medium.com/tokopedia-engineering/how-tokopedia-rank-millions-of-products-in-search-page-70e358ea2274)

</details>

<details>
  <summary>Trivago</summary>

### Blog Posts

* [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)

</details>

<details>
  <summary>Twilio</summary>

### Blog Posts

* [Twilio SRE Gameday Template](https://github.com/twilio/gameday/blob/main/get_to_know_your_systems.md)

</details>

<details>
  <summary>Twitter</summary>

### Blog Posts

* [Logging at Twitter: Updated](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/logging-at-twitter-updated)
* [Deleting data distributed throughout your microservices architecture](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/deleting-data-distributed-throughout-your-microservices-architecture)
* [Deterministic Aperture: A distributed, load balancing algorithm](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/daperture-load-balancer)
* [MetricsDB: TimeSeries Database for storing metrics at Twitter](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb)
* [The Infrastructure Behind Twitter: Scale](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale)
* [The infrastructure behind Twitter: efficiency and optimization](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/the-infrastructure-behind-twitter-efficiency-and-optimization)

</details>

<details>
  <summary>Uber</summary>

### Blog Posts

* [Founding Uber SRE](https://lethain.com/founding-uber-sre/)
* [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
* [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
* [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)

### Videos

* [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
* [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
* [‘A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)

</details>

<details>
  <summary>Udemy</summary>

### Blog Posts

* [Blameless Incident Reviews at Udemy](https://medium.com/udemy-engineering/blameless-incident-reviews-at-udemy-aa4773dbaf0b)
* [How Udemy does Build Engineering](https://medium.com/udemy-engineering/how-udemy-does-build-engineering-9722e98a4208)

</details>

<details>
  <summary>upGrad</summary>

### Blog Posts

* [Web Performance and Related Stories — upgrad.com](https://engineering.upgrad.com/web-performance-and-related-stories-upgrad-com-a9fb9c6bb766)
* [Beginner’s guide to web analytics](https://engineering.upgrad.com/beginners-guide-to-analytics-c8ce3e92fa42)
* [iOS Continuous Deployment with Bitbucket, Jenkins and Fastlane at UpGrad](https://engineering.upgrad.com/ios-continuous-deployment-with-bitbucket-jenkins-and-fastlane-at-upgrad-699b3b48acca)

</details>

<details>
  <summary>VGW</summary>

### Blog Posts

* [The SRE Incident Response game](https://medium.com/@bruce_25864/the-sre-incident-response-game-db242fff391c)

### Videos

* [Level Up Your Incident Response With Gameplay](https://youtu.be/c2-52EP8_7c)

</details>

<details>
  <summary>Wikimedia Foundation</summary>

### Videos

* [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
* [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)

</details>

<details>
  <summary>Wix</summary>

### Blog Posts

* [How We Improved Website Performance by Evolving Our Infrastructure](https://www.wix.engineering/post/how-we-improved-website-performance-by-evolving-our-infrastructure)
* [Wix Inbox Journey: 3 Approaches for Zero Downtime Database Migration](https://www.wix.engineering/post/wix-inbox-journey-3-approaches-for-zero-downtime-database-migration)
* [Moving Velo to Multiple Container Sites: The Why, The How and The Lessons Learned](https://www.wix.engineering/post/moving-velo-to-multiple-container-sites-the-why-the-how-and-the-lessons-learned)
* [Making Order in CI/CD Mess](https://www.wix.engineering/post/making-order-in-ci-cd-mess)

</details>

<details>
  <summary>Yelp</summary>

### Blog Posts

* [The process: Implementing Yelp’s failover strategy](https://increment.com/reliability/yelp-traffic-failover-strategy/)

### Videos

* [What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)

</details>

<details>
  <summary>Zalando</summary>

### Blog Posts

* [Tracing SRE’s journey in Zalando - Part I](https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html)
* [Tracing SRE’s journey in Zalando - Part II](https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html)
* [Tracing SRE’s journey in Zalando - Part III](https://engineering.zalando.com/posts/2021/10/sre-journey-part3.html)

</details>

<details>
  <summary>Zerodha</summary>

### Blog Posts

* [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)

</details>

<details>
  <summary>Zomato</summary>

### Blog Posts

* [Huddle Diaries – DevOps and Data Platform](https://www.zomato.com/blog/huddle-diaries-devops-and-data-platform)

</details>

## SRECon Mix Playlist

### Videos

* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
* [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus)
* [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald)
* [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas)
* [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali)
* [Cloudlock - My Life as a Solo SRE](https://www.usenix.org/conference/srecon19emea/presentation/murphy)
* [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke)
* [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne)
* [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas)
* [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)
* [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve)
* [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower)
* [MessageBird - Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne)
* [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal)
* [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger)
* [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins)
* [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable)
* [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan)
* [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin)
* [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek)
* [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec)
* [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan)
* [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin)
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)

---

## Resources

### Books

* [__New!__ Enterprise Roadmap to SRE](https://learning.oreilly.com/library/view/enterprise-roadmap-to/9781098117740/)
* [Building Secure & Reliable Systems](https://www.oreilly.com/library/view/building-secure-and/9781492083115/) | [Read free online version hosted by Google](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/) | [Read free online version hosted by Google](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook from Google](https://www.oreilly.com/library/view/the-site-reliability/9781492029496/) | [Read free online version hosted by Google](https://sre.google/workbook/table-of-contents/)
* [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/) | [Read free online version hosted by Google](https://github.com/google/googlesre/blob/main/publications/Training_Site_Reliability_Engineers.pdf)
* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/) | [Complimentary Copy from Nginx](https://www.nginx.com/resources/library/97-things-every-sre-should-know/)
* [SLO Adoption and Usage in Site Reliability Engineering](https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/)
* [Practical Site Reliability Engineering](https://www.oreilly.com/library/view/practical-site-reliability/9781788839563/)
* [Implementing Service Level Objectives](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/)
* [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
* [Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856/)
* [Security Chaos Engineering](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/)
* [Chaos Engineering Observability](https://www.oreilly.com/library/view/chaos-engineering-observability/9781492051046/)
* [Database Reliability Engineering](https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/)
* [What Is SRE?](https://www.oreilly.com/library/view/what-is-sre/9781492054429/)
* [Database Reliability Engineering: What, Why, and How?](https://www.oreilly.com/library/view/database-reliability-engineering/9781492030942/)
* [Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/)
* [Chaos Engineering: Site reliability through controlled disruption](https://www.manning.com/books/chaos-engineering)
* [Incident Metrics in SRE](https://www.oreilly.com/library/view/incident-metrics-in/9781098103163/) | [Read free online version hosted by Google](https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/)
* [Engineering Reliable Mobile Applications](https://www.oreilly.com/library/view/engineering-reliable-mobile/9781492057444/)
* [Monitoring the SRE Golden Signals](https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook)
* [Site Reliability Engineering: Philosophies, habits, and tools for SRE success](https://newrelic.com/resources/ebooks/site-reliability-engineering) | [Portable version](https://newrelic.com/sites/default/files/2021-08/site-reliability-engineering-handbook.pdf)
* [97 Things Every Cloud Engineer Should Know](https://www.redhat.com/rhdc/managed-files/cl-97-things-cloud-engineers-know-e-book-oreilly-f28602-202105-en.pdf)
* [Real-World SRE](https://www.packtpub.com/product/real-world-sre/9781788628884)
* [Hands-on Site Reliability Engineering](https://bpbonline.com/products/hands-on-site-reliability-engineering?_pos=1&_sid=839999550&_ss=r)

### Events

* [SRECon Past Events](https://www.usenix.org/srecon#past)
* [ChaosConf](https://www.chaosconf.io/)
* [SLOConf](https://www.sloconf.com/)
  * [SLOConf 2021 Playlist](https://www.youtube.com/watch?v=-lHPDx90Ppg&list=PLLNq9CBV7AFwyRzICyCRKdcsAPAlG5bPu)
* [cdCon](https://events.linuxfoundation.org/cdcon/)
  * [cdCon 2021 Playlist](https://www.youtube.com/watch?v=MQU4fKhau1w&list=PL2KXbZ9-EY9TWsV-Jz8ARSt1ko0Yd36ah)
  * [cdCon 2020 Playlist](https://www.youtube.com/watch?v=qLMrcEj-R9Y&list=PL2KXbZ9-EY9RbYURc1CDrOJpbrPMtc0P7)

### Other Resources

#### Awesome Lists

* [Awesome SRE](https://github.com/dastergon/awesome-sre)
* [Awesome Site Reliability Engineering Tools](https://github.com/SquadcastHub/awesome-sre-tools)
* [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
* [Awesome Monitoring](https://github.com/crazy-canux/awesome-monitoring)
* [Awesome Observability](https://github.com/adriannovegil/awesome-observability)
* [Awesome MLOps](https://github.com/visenger/awesome-mlops)
* [ML-Ops.org](https://ml-ops.org/)

#### SRE Resources from various organizations

* [Google SRE Page](https://sre.google/)
* [Google SRE Classroom](https://sre.google/classroom/)
* [Google Cloud SRE Page](https://cloud.google.com/sre)
* [Microsoft SRE Page](https://docs.microsoft.com/en-us/azure/site-reliability-engineering/)
* [School of SRE from LinkedIn](https://linkedin.github.io/school-of-sre/)
* [Stripe Increment Magazine Issue 16 on Reliability](https://increment.com/reliability/)
* [AWS Observability Recipes](https://aws-observability.github.io/aws-o11y-recipes/)
* [Awesome Sysadmin](https://github.com/awesome-foss/awesome-sysadmin)

#### Incidents & postmortems

* [The Verica Open Incident Database](https://www.thevoid.community/)
* [Postmortem Templates](https://github.com/dastergon/postmortem-templates)
* [Incident Review and Postmortem Best Practices](https://blog.pragmaticengineer.com/postmortem-best-practices/)

#### Newsletters

* [SRE Weekly Newsletter](https://sreweekly.com/)
* [Chaos Engineering Newsletter](https://chaosengineering.news/)
* [DevOps Weekly Newsletter](http://devopsweekly.com)

## Credits

* Inspired by [Howtheytest](https://github.com/abhivaikar/howtheytest) from [Abhijeet Vaikar](https://github.com/abhivaikar)
* The list of organizations is drawn from my other repo, [awesome-engineering](https://github.com/upgundecha/awesome-engineering)
* Banner image [Cartoon vector created by vectorjuice - www.freepik.com](https://www.freepik.com/vectors/cartoon)

## Other How They... repos

* [Howtheytest](https://github.com/abhivaikar/howtheytest)
* [Howtheydevops](https://github.com/bregman-arie/howtheydevops)
* [Howtheyaws](https://github.com/upgundecha/howtheyaws)

## Contribute

Contributions welcome! Read the [contribution guidelines](contributing.md) first.

## License

[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0)

To the extent possible under law, Unmesh Gundecha has waived all copyright and
related or neighboring rights to this work.

---

If you use this anywhere, please credit [@upgundecha](https://www.twitter.com/upgundecha) on Twitter. If you like my work, check out my other projects on GitHub.
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								# How they SRE
-												Update README.md

Added CodeQL badge
											
										
										
											2022-11-03 23:13:16 +05:30
+								![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square) [![CI](https://github.com/upgundecha/howtheysre/actions/workflows/workflow.yml/badge.svg)](https://github.com/upgundecha/howtheysre/actions/workflows/workflow.yml) [![CodeQL](https://github.com/upgundecha/howtheysre/actions/workflows/codeql.yml/badge.svg)](https://github.com/upgundecha/howtheysre/actions/workflows/codeql.yml)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
-												Fix README image

											
										
										
											2021-10-19 11:31:49 -04:00
+								![How they SRE](headline.png)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
-												Fix

											
										
										
											2021-02-15 21:42:27 +08:00
+								> A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
 								## Introduction
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								__How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations.
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
 								Many organizations regularly come forward and share their best practices, tools, techniques and offer an insight into engineering culture on various public platforms like engineering blogs, conferences & meetups. The content is curated from these avenues and shared in this repository.
-												Fix

											
										
										
											2021-02-15 21:42:27 +08:00
+								_Note to readers: This list refers to some of the articles, posts, videos, tools, and techniques published before 2015. Please use such material with caution as there may be recent advances in technology and practices which offer better alternatives and perspectives._
-												Added notes to reader

											
										
										
											2021-02-15 21:13:20 +08:00
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								### Topics
 								* Site Reliability Engineering
 								* Hiring and Building SRE teams
 								* SRE Culture
 								* DevOps
 								* Monitoring & Observability
 								* Alerting
-												Added notes to reader

											
										
										
											2021-02-15 21:13:20 +08:00
+								* Incident Response & Post-Mortem
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								* On-Call
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* Testing in Production
 								* Chaos Engineering
 								* Automation
 								* Performance
 								## Organizations
-												Adding Achievers to list of blogs

											
										
										
											2021-04-28 12:57:25 -04:00
+								<details>
 								  <summary>Achievers</summary>
 								### Blog Posts
-												Adding Achievers

											
										
										
											2021-04-28 12:59:18 -04:00
+								* [Enter the Abattoir - Building 'à la carte' gitops tooling](https://achievers.engineering/enter-the-abattoir-ee5e2019f0b3)
-												Adding Achievers Istio link

											
										
										
											2022-04-13 08:45:50 -04:00
+								* [Scaling Production Globally — The service mesh facelift (Part-1)](https://achievers.engineering/scaling-production-globally-service-mesh-face-lift-part-1-30ad6d393d04)
-												Achievers istio part 2

											
										
										
											2022-06-09 13:57:04 -04:00
+								* [Scaling Production Globally - Solving observability problems for developers (Part-2)](https://achievers.engineering/scaling-production-globally-solving-observability-problems-for-developers-part-2-b5416ce5eb8a)
-												Adding achievers load testing articles

											
										
										
											2023-11-24 09:26:01 -05:00
+								* [Load Testing Kubernetes: Building a Framework (Part-1)](https://achievers.engineering/load-testing-kubernetes-building-a-framework-part-1-bdc0af4ae7e2)
 								* [Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)](https://achievers.engineering/load-testing-kubernetes-resolving-bottlenecks-and-improving-performance-part-2-c4f08102f105)
-												Adding Achievers to list of blogs

											
										
										
											2021-04-28 12:57:25 -04:00
 								</details>
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								<details>
 								  <summary>Airbnb</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Added new posts

											
										
										
											2022-08-28 10:42:18 +08:00
+								* [Automated Incident Management Through Slack](https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
 								* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
 								* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
-												Include missing blog posts about Airbnb
											
										
										
											2023-10-05 12:01:11 -04:00
+								* [Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb](https://medium.com/airbnb-engineering/intelligent-automation-platform-empowering-conversational-ai-and-beyond-at-airbnb-869c44833ff2)
 								* [Production Secret Management at Airbnb](https://medium.com/airbnb-engineering/production-secret-management-at-airbnb-ad230e1bc0f6)
 								* [Automating Data Protection at Scale, Part 1](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08)
 								* [Automating Data Protection at Scale, Part 2](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-2-c2b8d2068216)
 								* [Automating Data Protection at Scale, Part 3](https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-3-34e592c45d46)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
 								</details>
 								<details>
 								  <summary>Algolia</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
 								* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)
 								</details>
-												Include Alibaba Cloud content

											
										
										
											2021-10-06 15:30:42 -04:00
+								<details>
 								  <summary>Alibaba Cloud</summary>
 								### Blog Posts
 								* [Why Are the Top Internet Companies Choosing SRE over Traditional O&M?](https://www.alibabacloud.com/blog/why-are-the-top-internet-companies-choosing-sre-over-traditional-o%26m_596099)
-												added alibaba, ibm, stack overflow posts

											
										
										
											2021-10-27 02:08:48 +08:00
+								* [Architecture and Practices of Bilibili's Real-time Platform](https://www.alibabacloud.com/blog/architecture-and-practices-of-bilibilis-real-time-platform_596676)
-												Include Alibaba Cloud content

											
										
										
											2021-10-06 15:30:42 -04:00
 								</details>
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								<details>
 								  <summary>Asana</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Added posts from Asana and Dream11
											
										
										
											2021-09-11 17:42:56 +08:00
+								* [How Asana uses Asana: Security incident response](https://blog.asana.com/2021/09/engineering-security-incident-response/#close)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
 								* [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
 								* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								</details>
 								<details>
 								  <summary>ASOS</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Posts from Macquarie, Expedia
											
										
										
											2021-09-11 17:58:54 +08:00
+								* [Playing the blame-less game](https://medium.com/asos-techblog/playing-the-blame-less-game-3708f8195344)
 								* [A day in the life of… Cat S (Head of Reliability Engineering)](https://medium.com/asos-techblog/a-day-in-the-life-of-cat-smith-head-of-reliability-engineering-629e10a26590)
 								* [An AKS Performance Journey: Part 1 — Sizing Everything Up](https://medium.com/asos-techblog/an-aks-performance-journey-part-1-sizing-everything-up-ee6d2346ea99)
 								* [An AKS Performance Journey: Part 2 — Networking It Out](https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
 								* [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
 								* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
 								</details>
 								<details>
 								  <summary>Atlassian</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
 								* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
 								* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
-												Added notes to reader

											
										
										
											2021-02-15 21:13:20 +08:00
+								* [Incident Postmortem Template](https://www.atlassian.com/incident-management/postmortem/templates)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
 								</details>
-												Added BackMarket

											
										
										
											2021-02-16 17:32:40 +01:00
+								<details>
 								  <summary>BackMarket</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Added BackMarket

											
										
										
											2021-02-16 17:32:40 +01:00
+								* [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)
 								</details>
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								<details>
 								  <summary>Baidu</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Videos
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
 								* [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
-												Update README.md

Added SREcon 2021 video
											
										
										
											2021-10-26 11:22:27 -04:00
+								* [Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity](https://www.youtube.com/watch?v=x3c0PPkSf14)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
 								</details>
 								<details>
 								  <summary>Basecamp</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
 								* [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
 								* [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
 								* [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Books
-												Disabled mk linting, some fixes and additions

											
										
										
											2021-02-17 09:21:47 +08:00
+								* [Shape Up](https://basecamp.com/shapeup/webbook)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								</details>
 								<details>
 								  <summary>Bloomberg</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Videos
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
 								* [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
 								* [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
 								* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
 								* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								</details>
 								<details>
 								  <summary>Booking.com</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Blog posts from Booking.com

											
										
										
											2021-02-14 22:22:05 +08:00
+								* [How Reliability and Product Teams Collaborate at Booking.com](https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)
 								* [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)
 								* [Troubleshooting: A journey into the unknown](https://medium.com/booking-com-infrastructure/troubleshooting-a-journey-into-the-unknown-e31b524fa86)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Videos
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
 								* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								</details>
 								<details>
 								  <summary>Capital One</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Fix markdown error

											
										
										
											2021-09-03 14:30:38 +08:00
+								* [Automate Application Monitoring with Slack](https://www.capitalone.com/tech/software-engineering/how-to-automate-application-monitoring-slack-bots/)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
 								* [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
 								* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
 								* [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)
 								* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)
 								* [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7)
 								* [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)
 								* [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)
 								* [Secure Docker Containers Require Secure Applications](https://medium.com/capital-one-tech/secure-docker-containers-require-secure-applications-75eb358abef9)
 								* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)
 								* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)
 								* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)
 								* [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94)
 								* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
 								* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Major incidents & analysis reports
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
 								* [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Videos
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
 								* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
 								* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
 								* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)
 								</details>
-												Added posts from Strava, Riot Games, Etsy

											
										
										
											2021-09-04 12:06:39 +08:00
+								<details>
 								  <summary>Coinbase</summary>
 								### Blog Posts
 								* [Open Sourcing Coinbase’s Secure Deployment Pipeline](https://blog.coinbase.com/open-sourcing-coinbases-secure-deployment-pipeline-ae6c78e25517)
 								</details>
-												Adding article - Site Reliability at DAZN
											
										
										
											2021-10-08 17:51:38 +05:30
+								<details>
 								  <summary>DAZN</summary>
 								### Blog Posts
 								* [Site Reliability at DAZN](https://medium.com/dazn-tech/site-reliability-at-dazn-a3ba4af0638d)
-												New posts!

											
										
										
											2022-01-22 11:45:16 +08:00
-												Adding article - Site Reliability at DAZN
											
										
										
											2021-10-08 17:51:38 +05:30
+								</details>
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								<details>
 								  <summary>DBS</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
-												Added post from DBS
											
										
										
											2022-10-16 12:24:23 +08:00
+								* [Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far](https://medium.com/dbs-tech-blog/presenting-at-ithomes-sre-conference-our-dbs-sre-transformation-journey-thus-far-9b6778ce53e8)
-												Added new posts from DBS and Goldman Sachs

											
										
										
											2022-03-13 11:47:49 +08:00
+								* [Debunking the seven most popular Site Reliability Engineering myths](https://medium.com/dbs-tech-blog/debunking-the-seven-most-popular-site-reliability-engineering-myths-a3be8d870ff2)
 								* [How To Use SRE To Cultivate A Blameless Culture In The Workplace](https://medium.com/dbs-tech-blog/how-to-use-sre-to-cultivate-a-blameless-culture-in-the-workplace-1981fd1c7871)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)
-												Added posts from Grab and DBS

											
										
										
											2021-02-16 22:52:08 +08:00
+								* [Automating Configuration Management at Scale](https://medium.com/dbs-tech-blog/automating-configuration-management-at-scale-5c7927f83df3)
-												New Posts

											
										
										
											2021-09-03 14:27:15 +08:00
+								* [How DBS dispelled the myths of Chaos Engineering](https://medium.com/dbs-tech-blog/how-dbs-dispelled-the-myths-of-chaos-engineering-e5873ac78c9)
 								* [Double, Double Toil and Trouble](https://medium.com/dbs-tech-blog/double-double-toil-and-trouble-applying-sre-practices-to-alleviate-toil-for-devops-teams-259b958a10dd)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Videos
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)
 								</details>
-												Add DeepSource

Thanks for putting this together! This change adds a few blogposts from DeepSource SRE team.
											
										
										
											2021-02-16 22:58:09 +05:30
+								<details>
 								  <summary>DeepSource</summary>
-												Fix markdownlint errors

											
										
										
											2021-02-21 00:05:37 +08:00
+								### Blog Posts
* [Redis diskless replication: What, how, why and the caveats](https://deepsource.io/blog/redis-diskless-replication/)
* [How to setup Vault with Kubernetes](https://deepsource.io/blog/setup-vault-kubernetes/)
* [Breaking down zero downtime deployments in Kubernetes](https://deepsource.io/blog/zero-downtime-deployment/)
</details>
<details>
  <summary>Dream11</summary>

### Blog Posts

* [Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.](https://blog.dream11engineering.com/deployment-at-scale-story-behind-dream11s-in-house-blue-green-deployment-platform-oneclick-b2c761b12896)
* [Enhancing security and trust with AWS WAFv2](https://blog.dream11engineering.com/enhancing-security-and-trust-with-aws-wafv2-8b050b1cba37)
* [Lessons learned from running GraphQL at scale](https://blog.dream11engineering.com/lessons-learned-from-running-graphql-at-scale-2ad60b3cefeb)
* [Break circuits, save Kong 🦍](https://blog.dream11engineering.com/break-circuits-save-kong-3680d88a0639)
* [Finding Order in Chaos: How We Automated Performance Testing with Torque](https://blog.dream11engineering.com/finding-order-in-chaos-how-we-automated-performance-testing-with-torque-6eb63706fcea)
* [Maintaining hyper-sonic releases at Dream11](https://blog.dream11engineering.com/maintaining-hyper-sonic-releases-at-dream11-c26f2145fe28)
* [To Scale In Or Scale Out? Here’s How We Scale at Dream11](https://blog.dream11engineering.com/to-scale-in-or-scale-out-heres-how-we-scale-at-dream11-f88ef5e71cbc)
* [Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11](https://blog.dream11engineering.com/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33)
</details>
<details>
  <summary>Dropbox</summary>

### Blog Posts

* [Dropbox Engineering Career Framework - Reliability Engineer (SRE)](https://dropbox.github.io/dbx-career-framework/)
* [Atlas: Our journey from a Python monolith to a managed platform](https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform)
* [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
* [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)

### Videos

* [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)
</details>
<details>
  <summary>eBay</summary>

### Blog Posts

* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)

### Video

* [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)
</details>
<details>
  <summary>Epic Games</summary>

### Video

* [AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players](https://youtu.be/MCLrA401vHw)
</details>
<details>
  <summary>Etsy</summary>

### Blog Posts

* [Improving the Deployment Experience of a Ten-Year Old Application](https://codeascraft.com/)
* [How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020](https://codeascraft.com/2021/02/25/how-etsy-prepared-for-historic-volumes-of-holiday-traffic-in-2020/)
* [Your brain on progress](https://increment.com/reliability/brain-on-progress/)
* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
* [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
* [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
* [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)

### Videos

* [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
* [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)
</details>
<details>
  <summary>Expedia</summary>

### Blog Posts

* [Automating Performance Standards](https://medium.com/expedia-group-tech/automating-performance-standards-b51efc92d237)
* [Error Budget Policy - Part 1 - Adoption at Expedia Group](https://medium.com/expedia-group-tech/error-budget-policy-adoption-at-expedia-group-7d80d41c4a8b)
* [Error Budget Policy - Part 2 - Practices at Expedia Group](https://medium.com/expedia-group-tech/error-budget-policies-in-practice-4c98f56a28c1)
* [Using Fault-Injection to Improve our new Runtime Platform’s Reliability](https://medium.com/expedia-group-tech/using-fault-injection-to-improve-our-new-platforms-reliability-656b1147b132)
* [Learning from Incidents at Expedia Group](https://medium.com/expedia-group-tech/learning-from-incidents-at-expedia-group-51a8c72a4286)
* [Improving Vrbo Homepage Loading Experience](https://medium.com/expedia-group-tech/improving-vrbo-homepage-loading-experience-e4b2207535f4)
* [Troubleshooting 502 errors: ECS Checklist](https://medium.com/expedia-group-tech/troubleshooting-502-errors-ecs-checklist-9da383399d96)
* [Getting Started with Elasticsearch](https://medium.com/expedia-group-tech/getting-started-with-elastic-search-6af62d7df8dd)
* [All about ISTIO-PROXY 5xx Issues](https://medium.com/expedia-group-tech/all-about-istio-proxy-5xx-issues-e0221b29e692)
* [Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?](https://medium.com/expedia-group-tech/autoscaling-in-kubernetes-why-doesnt-the-horizontal-pod-autoscaler-work-for-me-5f0094694054)
* [How to Keep Your Kubernetes Deployments Balanced Across Multiple zones](https://medium.com/expedia-group-tech/how-to-keep-your-kubernetes-deployments-balanced-across-multiple-zones-dfe719847b41)
* [Are Your Dropwizard Latency Metrics Misleading You?](https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374)
* [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
* [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
* [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)
</details>
<details>
  <summary>Fastly</summary>

### Videos

* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)
</details>
<details>
  <summary>G-Research</summary>

### Blog Posts

* [Our SRE Journey at G-Research](https://www.gresearch.com/blog/article/our-sre-journey-at-g-research/)
* [The SRE Journey Continues](https://www.gresearch.com/blog/article/the-sre-journey-continues/)
* [OpenTSDB Meta Cache – trade-offs for performance](https://www.gresearch.com/blog/article/opentsdb-meta-cache-trade-offs-for-performance/)
</details>
<details>
  <summary>Getaround</summary>

### Blog Posts

* [How we handle incidents at Getaround](https://getaround.tech/incident-handling-at-getaround/)
* [Evolution Of Our Continuous Delivery Process](https://getaround.tech/continuous-integration/)
</details>
<details>
  <summary>GitHub</summary>

### Blog Posts

* [How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com](https://github.blog/2023-09-26-how-github-uses-github-actions-and-actions-larger-runners-to-build-and-test-github-com/)
* [The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects](https://github.blog/2023-09-21-the-github-security-labs-journey-to-disclosing-500-cves-in-open-source-projects/)
* [CodeQL team uses AI to power vulnerability detection in code](https://github.blog/2023-09-12-codeql-team-uses-ai-to-power-vulnerability-detection-in-code/)
* [Addressing GitHub’s recent availability issues](https://github.blog/2023-05-16-addressing-githubs-recent-availability-issues/)
* [Building organization-wide governance and re-use for CI/CD and automation with GitHub Actions](https://github.blog/2023-04-05-building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions/)
* [Enabling branch deployments through IssueOps with GitHub Actions](https://github.blog/2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions/)
* [Using ChatOps to help Actions on-call engineers](https://github.blog/2021-12-01-using-chatops-to-help-actions-on-call-engineers/)
* [Partitioning GitHub’s relational databases to handle scale](https://github.blog/2021-09-27-partitioning-githubs-relational-databases-scale/)
* [Increasing developer happiness with GitHub code scanning](https://github.blog/2021-09-07-increasing-developer-happiness-github-code-scanning/)
* [Why (and how) GitHub is adopting OpenTelemetry](https://github.blog/2021-05-26-why-and-how-github-is-adopting-opentelemetry/)
* [Improving large monorepo performance on GitHub](https://github.blog/2021-03-16-improving-large-monorepo-performance-on-github/)
* [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
* [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
* [Reducing flaky builds by 18x](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/)
* [The evolving role of operations in DevOps](https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/)
* [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
* [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)

### Major incidents & analysis reports

* [GitHub Availability Report: August 2023](https://github.blog/2023-09-13-github-availability-report-august-2023/)
* [GitHub Availability Report: July 2023](https://github.blog/2023-08-09-github-availability-report-july-2023/)
* [GitHub Availability Report: June 2023](https://github.blog/2023-07-12-github-availability-report-june-2023/)
* [GitHub Availability Report: May 2023](https://github.blog/2023-06-14-github-availability-report-may-2023/)
* [GitHub Availability Report: April 2023](https://github.blog/2023-05-03-github-availability-report-april-2023/)
* [GitHub Availability Report: March 2023](https://github.blog/2023-04-05-github-availability-report-march-2023/)
* [GitHub Availability Report: February 2023](https://github.blog/2023-03-01-github-availability-report-february-2023/)
* [GitHub Availability Report: January 2023](https://github.blog/2023-02-01-github-availability-report-january-2023/)
* [GitHub Availability Report: December 2022](https://github.blog/2023-01-04-github-availability-report-december-2022/)
* [GitHub Availability Report: November 2022](https://github.blog/2022-12-07-github-availability-report-november-2022/)
* [GitHub Availability Report: October 2022](https://github.blog/2022-11-02-github-availability-report-october-2022/)
* [GitHub Availability Report: September 2022](https://github.blog/2022-10-05-github-availability-report-september-2022/)
* [GitHub Availability Report: August 2022](https://github.blog/2022-09-07-github-availability-report-august-2022/)
* [GitHub Availability Report: July 2022](https://github.blog/2022-08-03-github-availability-report-july-2022/)
* [GitHub Availability Report: June 2022](https://github.blog/2022-07-06-github-availability-report-june-2022/)
* [GitHub Availability Report: May 2022](https://github.blog/2022-06-01-github-availability-report-may-2022/)
* [GitHub Availability Report: April 2022](https://github.blog/2022-05-04-github-availability-report-april-2022/)
* [GitHub Availability Report: March 2022](https://github.blog/2022-04-06-github-availability-report-march-2022/)
* [GitHub Availability Report: February 2022](https://github.blog/2022-03-02-github-availability-report-february-2022/)
* [GitHub Availability Report: January 2022](https://github.blog/2022-02-02-github-availability-report-january-2022/)
* [GitHub Availability Report: December 2021](https://github.blog/2022-01-05-github-availability-report-december-2021/)
* [GitHub Availability Report: November 2021](https://github.blog/2021-12-01-github-availability-report-november-2021/)
* [GitHub Availability Report: October 2021](https://github.blog/2021-11-04-github-availability-report-october-2021/)
* [GitHub Availability Report: September 2021](https://github.blog/2021-10-06-github-availability-report-september-2021/)
* [GitHub Availability Report: August 2021](https://github.blog/2021-09-01-github-availability-report-august-2021/)
* [GitHub Availability Report: July 2021](https://github.blog/2021-08-04-github-availability-report-july-2021/)
* [GitHub Availability Report: June 2021](https://github.blog/2021-07-07-github-availability-report-june-2021/)
* [GitHub Availability Report: May 2021](https://github.blog/2021-06-02-github-availability-report-may-2021/)
* [GitHub Availability Report: April 2021](https://github.blog/2021-05-05-github-availability-report-april-2021/)
* [GitHub Availability Report: March 2021](https://github.blog/2021-04-07-github-availability-report-march-2021/)
* [GitHub Availability Report: February 2021](https://github.blog/2021-03-03-github-availability-report-february-2021/)
* [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
* [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
* [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
* [GitHub Availability Report: August 2020](https://github.blog/2020-09-02-github-availability-report-august-2020/)
* [GitHub Availability Report: July 2020](https://github.blog/2020-08-05-github-availability-report-july-2020/)
* [Introducing the GitHub Availability Report](https://github.blog/2020-07-08-introducing-the-github-availability-report/)
* [February service disruptions post-incident analysis](https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/)
* [October 21 post-incident analysis](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
* [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
* [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)

### Videos

* [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)
</details>
<details>
  <summary>GitLab</summary>

### Blog Posts

* [This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...](https://about.gitlab.com/blog/2021/01/14/this-sre-attempted-to-roll-out-an-haproxy-change/)
* [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)
* [Update: Elasticsearch lessons learnt for Advanced Global Search](https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/)
* [Lessons in iteration from a new team in infrastructure](https://about.gitlab.com/blog/2020/11/09/lessons-in-iteration-from-new-infrastructure-team/)
* [How we optimized infrastructure spend at GitLab](https://about.gitlab.com/blog/2020/10/27/how-we-optimized-our-infrastructure-spend-at-gitlab/)
* [How we scaled async workload processing at GitLab.com using Sidekiq](https://about.gitlab.com/blog/2020/06/24/scaling-our-use-of-sidekiq/)
* [Inside GitLab: How we release software patches](https://about.gitlab.com/blog/2020/05/13/how-we-release-software-patches/)
* [What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab](https://about.gitlab.com/blog/2019/11/15/tracking-down-missing-tcp-keepalives/)
* [How we used delayed replication for disaster recovery with PostgreSQL](https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/)
</details>
<details>
  <summary>GoCardless</summary>

### Blog Posts

* [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https://medium.com/gocardless-tech/deploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)
* [How we compress Pub/Sub messages and more, saving a load of money](https://medium.com/gocardless-tech/how-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)
* [Fear-free PostgreSQL migrations for Rails](https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/)
* [Observability at GoCardless: a tale of API performance improvement](https://gocardless.com/blog/observability-at-gocardless-a-tale-of-api-performance-improvement/)
* [Debugging the PostgreSQL query planner](https://gocardless.com/blog/debugging-the-postgres-query-planner/)
* [Zero-downtime Postgres migrations - the hard parts](https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/)
* [In search of performance - how we shaved 200ms off every POST request](https://gocardless.com/blog/in-search-of-performance-how-we-shaved-200ms-off-every-post-request/)

### Major incidents & analysis reports

* [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/)
* [Incident review: API and Dashboard outage on 10 October 2017](https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/)
</details>
<details>
  <summary>GoDaddy</summary>

### Blog Posts

* [Kubernetes Gated Deployments](https://www.godaddy.com/engineering/2019/08/13/kubernetes-gated-deployments/)
* [Kubernetes External Secrets](https://www.godaddy.com/engineering/2019/04/16/kubernetes-external-secrets/)
* [Kubernetes - A Practical Introduction for Application Developers](https://www.godaddy.com/engineering/2018/05/02/kubernetes-introduction-for-developers/)
* [An Intuitive Node.js Client for the Kubernetes API](https://www.godaddy.com/engineering/2018/04/10/an-intuitive-nodejs-client-for-the-kubernetes-api/)
</details>
<details>
  <summary>Gojek</summary>

### Blog Posts

* [Introducing Skynet: Infrastructure as Code for Gojek](https://www.gojek.io/blog/introducing-skynet/)
* [Scaling Our Geo-Search Service For 10x Load](https://www.gojek.io/blog/scaling-our-geo-search-service-for-10x-load/)
* [Why We Swear by the RCA](https://www.gojek.io/blog/why-we-swear-by-the-rca)
* [How We Upgrade Kubernetes on GKE](https://blog.gojek.io/how-we-upgrade-kubernetes-on-gke/)
* [How We Monitor Apache Airflow in Production](https://blog.gojek.io/how-we-monitor-apache-airflow-in-production/)
</details>
<details>
  <summary>Goldman Sachs</summary>

### Blog Posts

* [Observability at Scale](https://developer.gs.com/blog/posts/observability-at-scale)
* [Enabling Highly Available Trino Clusters at Goldman Sachs](https://developer.gs.com/blog/posts/enabling-highly-available-trino-clusters-at-goldman-sachs)
* [Infrastructure and the Command Chain Pattern](https://developer.gs.com/blog/posts/infrastructure-and-command-chain-pattern)
* [Mobile CICD with EC2 macOS](https://developer.gs.com/blog/posts/mobile-cicd-with-ec2-macos)
* [Announcing CatchIT - Source Code Secret Scanner](https://developer.gs.com/blog/posts/catchit-source-code-secret-scanner)
* [Building Platforms for Data Engineering](https://developer.gs.com/blog/posts/legend_data_engineering_platforms)
</details>
<details>
  <summary>Google</summary>

### Blog Posts

* [Pitfalls and Patterns in Microservice Dependency Management](https://www.infoq.com/articles/pitfalls-patterns-microservice-dependency-management/)
* [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
* [Google site reliability using Go](https://go.dev/solutions/google/sitereliability)
-												Initial Commit

											
										
										
											2021-02-14 22:03:29 +08:00
+								* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
 								* [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)
-												Update Google blog posts

Include missing entry on SRE teams organization, based on the
experience of Google.

											
										
										
											2021-10-10 14:48:38 -04:00
+								* [How SRE teams are organized, and how to get started](https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started)

### Videos

* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [Risk and Error Budgets with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [Pragmatic Automation with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
* [Must Watch! - Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)
* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke)
* [Implementing Distributed Consensus](https://www.usenix.org/conference/srecon20americas/presentation/ludtke)
* [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin)
* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry)
* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski)
* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood)
* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai)
* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff)
* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason)
* [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe)
* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato)
* [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang)
* [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz)
* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing)
* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji)
* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz)
* [Pragmatic Automation](https://www.usenix.org/conference/srecon19americas/presentation/luebbe)
* [Sublinear Scaling in Practice: The 1k SRE Project](https://www.usenix.org/conference/srecon19americas/presentation/rath)
* [Strategies to Edit Production Data](https://www.usenix.org/conference/srecon19americas/presentation/qiu)
* [The Curse of SRE Autonomy and How to Manage It](https://www.usenix.org/conference/srecon19americas/presentation/bondi)
* [Scaling SRE Organizations: The Journey from 1 to Many Teams](https://www.usenix.org/conference/srecon19americas/presentation/franco)
* [SRE Classroom - How to Design a Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19americas/presentation/thomas)
* [Using PRDs and User Journeys to Design User-Friendly Tools](https://www.usenix.org/conference/srecon19americas/presentation/stockman)
* [How Google SRE and Developers Work Together](https://www.youtube.com/watch?v=DOQqOrHs3VY)
* [SREcon21 - Experiments for SRE](https://www.youtube.com/watch?v=yjusNjAFxFg)

</details>
<details>
  <summary>Grab</summary>

### Blog Posts

* [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
* [Our Journey to Continuous Delivery at Grab (Part 2)](https://engineering.grab.com/blog/2/)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)](https://engineering.grab.com/designing-resilient-systems-part-1)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)](https://engineering.grab.com/designing-resilient-systems-part-2)
* [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https://engineering.grab.com/beyond-retries-part-3)
* [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)
* [How We Designed the Quotas Microservice to Prevent Resource Abuse](https://engineering.grab.com/quotas-service)
* [How We Scaled Our Cache and Got a Good Night's Sleep](https://engineering.grab.com/how-we-scaled-our-cache-and-got-a-good-nights-sleep)

</details>
<details>
  <summary>Grammarly</summary>

### Blog Posts

* [Scaling AWS Infrastructure to Support Multiple Regions](https://www.grammarly.com/blog/engineering/scaling-aws-infrastructure/)
* [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)

</details>
<details>
  <summary>Gusto</summary>

### Blog Posts

* [Service Level Objectives for On-call Peace of Mind](https://engineering.gusto.com/slos-for-peace-of-mind/)
* [Debugging Sidekiq Poison Pills](https://engineering.gusto.com/debugging-sidekiq-poison-pills/)

</details>
<details>
  <summary>Halodoc</summary>

### Blog Posts

* [Site Reliability Engineering for Native mobile apps](https://www.infoq.com/articles/site-reliability-engineering-mobile-apps/)

</details>
<details>
  <summary>Heroku</summary>

### Blog Posts

* [The Adventures of Rendezvous in Heroku’s New Architecture](https://blog.heroku.com/engineering)
* [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)

</details>
<details>
  <summary>IBM</summary>

### Blog Posts

* [What is Site Reliability Engineering (SRE)?](https://www.ibm.com/cloud/learn/site-reliability-engineering)
* [AIOps tools and solutions](https://www.ibm.com/cloud/aiops)

</details>
<details>
  <summary>Indeed</summary>

### Blog Posts

* [Indeed SRE: An Inside Look](https://engineering.indeedblog.com/blog/2022/04/sre/)
* [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
* [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
* [Sloth, a Tool for Inducing Network Failures with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)

### Videos

* [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)

</details>
<details>
  <summary>Khan Academy</summary>

### Blog Posts

* [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https://blog.khanacademy.org/how-khan-academy-successfully-handled-2-5x-traffic-in-a-week/)
* [Evolving our content infrastructure](https://blog.khanacademy.org/evolving-our-content-infrastructure/)

</details>
<details>
  <summary>LinkedIn</summary>

### Blog Posts

* [Rethinking site capacity projections with Capacity Analyzer](https://engineering.linkedin.com/blog/2021/rethinking-site-capacity-projections-with-capacity-analyzer)
* [Insights into a Product SRE team at LinkedIn](https://www.linkedin.com/pulse/insights-product-sre-team-linkedin-zaina-afoulki/?trackingId=mxKJgZ3kp8l2WI9D4UZv7Q%3D%3D)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)
* [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
* [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
* [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries)
* [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)
* [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)
* [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)
* [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear)
* [Hiring SREs at LinkedIn, 2017](https://engineering.linkedin.com/blog/2017/07/hiring-sres-at-linkedin)
* [Open Sourcing Iris and Oncall](https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall)
* [Building the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin)
* [Failure is Not an Option](https://engineering.linkedin.com/blog/2017/01/failure-is-not-an-option)
* [MTTD and MTTR Are Key](https://engineering.linkedin.com/blog/2016/12/mttd-and-mttr-are-key)
* [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)

### Videos

* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
* [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen)
* [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading)
* [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring)
* [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe)
* [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness)
* [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal)
* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba)
* [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley)
* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe)
* [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)

### Tools

* [On-Call](https://github.com/linkedin/oncall)

</details>
<details>
  <summary>Loggi</summary>

### Blog Posts

* [The Release Manager model](https://partiu.loggi.com/the-release-manager-model-7af93f9f499f)
* [SRE Teams #8: Loggi](https://sreteams.substack.com/p/loggi)

</details>
<details>
  <summary>Loveholidays</summary>

### Blog Posts

* [Dynamic alert routing with Prometheus and Alertmanager](https://tech.loveholidays.com/dynamic-alert-routing-with-prometheus-and-alertmanager-f6a919edb5f8)
* [Making loveholidays 18% faster with HTTP/3](https://tech.loveholidays.com/making-loveholidays-18-faster-with-http-3-1860879528a7)
* [Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code](https://tech.loveholidays.com/enforcing-best-practice-on-self-serve-infrastructure-with-terraform-atlantis-and-policy-as-code-911f4f8c3e00)
* [The 5 principles that helped scale loveholidays](https://tech.loveholidays.com/the-5-principles-that-helped-scale-loveholidays-7ea0b0fd3df9)
* [Realtime Fastly logs with Grafana Loki for under $1 a day](https://tech.loveholidays.com/realtime-fastly-logs-with-grafana-loki-for-under-1-a-day-5b63ccf32d66)

</details>
<details>
  <summary>Macquarie</summary>

### Blog Posts

* [Our DevSecOps journey with Golang](https://medium.com/macquarie-engineering-blog/our-devsecops-journey-with-golang-a1af38328c36)
* [Pipeline Configuration as Code with Kotlin](https://medium.com/macquarie-engineering-blog/pipeline-configuration-as-code-with-kotlin-dec9ab9ee6fa)
* [DevOps and Segregation of Duties](https://medium.com/macquarie-engineering-blog/devops-and-segregation-of-duties-ea4a7dcc7217)
* [Macquarie embraces DevOps](https://medium.com/macquarie-engineering-blog/macquarie-embraces-devops-30f0fe62496a)
* [Scaling a Kubernetes Platform across the Enterprise](https://medium.com/macquarie-engineering-blog/scaling-a-kubernetes-platform-across-the-enterprise-c07a53b6022e)

</details>
<details>
  <summary>Mattermost</summary>

### Blog Posts

* [Monitoring Cloud Environments at Scale with Prometheus and Thanos](https://mattermost.com/blog/monitoring-cloud-environments-at-scale-with-prometheus-and-thanos/)
* [How We Use Sloth to do SLO Monitoring and Alerting with Prometheus](https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/)

</details>
<details>
  <summary>Meituan (美团)</summary>

### Blog Posts

* [The development and practice of SRE in the cloud (云端的SRE发展与实践)](https://tech.meituan.com/2017/08/03/meituanyun-sre.html)

</details>
<details>
  <summary>Mercari</summary>

### Blog Posts

* [Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems](https://engineering.mercari.com/en/blog/entry/20220805-who-watches-the-watchmen-keeping-an-eye-on-our-monitoring-systems/)
* [What the Microservices SRE Team are doing as SRE Evangelists](https://engineering.mercari.com/en/blog/entry/20220225-cdb2b6deff/)
* [What it’s like to work as an embedded microservices SRE](https://engineering.mercari.com/en/blog/entry/20220228-work-as-an-embedded-microservices-sre/)
* [The Merpay SRE Team: Past and future](https://engineering.mercari.com/en/blog/entry/20210831-a91c3dca9d/)
* [Embedded SRE at Mercari](https://engineering.mercari.com/en/blog/entry/20220221-embedded-sre-at-mercari/)
* [What the SRE team wants to achieve with the development team](https://engineering.mercari.com/en/blog/entry/20210129-embedded-sre/)
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)

</details>
<details>
  <summary>Meta</summary>

### Blog Posts

* [Improving Meta’s SLO workflows with data annotations](https://engineering.fb.com/2022/08/29/developer-tools/improving-metas-slo-workflows-with-data-annotations/)
* [SLICK: Adopting SLOs for improved reliability](https://engineering.fb.com/2021/12/13/production-engineering/slick/)
* [More details about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
* [Update about the October 4th outage](https://engineering.fb.com/2021/10/04/networking-traffic/outage/)

### Videos

* [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
* [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
* [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
* [Using ML to Automate Dynamic Error Categorization](https://www.usenix.org/conference/srecon19asia/presentation/davoli)

</details>
<details>
  <summary>Microsoft</summary>

### Videos

* [SLI & Reliability Deep-Dive with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
* [Ironies of Automation: A Comedy in Three Parts with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
* [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
* [Study on Human Factors and Team Culture to Improve Pager Fatigue](https://www.usenix.org/conference/srecon20americas/presentation/barteneva)
* [Prioritizing Trust While Creating Applications](https://www.usenix.org/conference/srecon19emea/presentation/davis)
* [Building Resilience: How to Learn More from Incidents](https://www.usenix.org/conference/srecon19emea/presentation/stenning)
* [A Tale of Two Postmortems: A Human Factors View](https://www.usenix.org/conference/srecon19asia/presentation/lund-postmortem)
* [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
* [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
* [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)

</details>
<details>
  <summary>MIRO</summary>

### Blog Posts

* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)

</details>
<details>
  <summary>Monzo</summary>

### Blog Posts

* [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
* [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
* [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
* [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)

### Videos

* [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)

### Tools

* [Response](https://github.com/monzo/response)

</details>
 								<details>
 								  <summary>Netflix</summary>
+								### Blog Posts
+								* [Achieving observability in async workflows](https://netflixtechblog.com/achieving-observability-in-async-workflows-cd89b923c784)
+								* [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
+								* [Lessons from Building Observability Tools at Netflix](https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17)
+								* [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
 								* [Telltale: Netflix Application Monitoring Simplified](https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba)
 								* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)
 								* [Introducing Dispatch](https://netflixtechblog.com/introducing-dispatch-da4b8a2a8072)
 								* [Applying Netflix DevOps Patterns to Windows](https://netflixtechblog.com/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79)
 								* [ChAP: Chaos Automation Platform](https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f)
 								* [Starting the Avalanche](https://netflixtechblog.com/starting-the-avalanche-640e69b14a06)
 								* [Netflix Chaos Monkey Upgraded](https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d)
 								* [Chaos Engineering Upgraded](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa)
+								* [Automated Failure Testing](https://netflixtechblog.com/automated-failure-testing-86c1b8bc841f)
+								* [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)
+								* [Introducing Atlas: Netflix’s Primary Telemetry Platform](https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a)
 								* [FIT: Failure Injection Testing](https://netflixtechblog.com/fit-failure-injection-testing-35d8e2a9bb2)
 								* [Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis](https://netflixtechblog.com/announcing-security-monkey-aws-security-configuration-monitoring-and-analysis-1f2bfb001708)
 								* [Lessons Netflix Learned from the AWS Outage](https://netflixtechblog.com/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04)
+								* [Scryer: Netflix’s Predictive Auto Scaling Engine](https://netflixtechblog.com/scryer-netflixs-predictive-auto-scaling-engine-a3f8fc922270)
+								### Major incidents & analysis reports
+								* [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
+								### Videos
+								* [AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)](https://www.youtube.com/watch?v=0QS1TWLooo0)
+								* [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
 								* [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
 								* [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https://www.youtube.com/watch?v=DlWYNoLmma8)
 								* [Real user performance monitoring at Netflix scale ‐ Martin Spier](https://www.youtube.com/watch?v=4RG2DUK03_0)
 								* [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://www.youtube.com/watch?v=rgfww8tLM0A)
 								* [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https://www.youtube.com/watch?v=LaKGx0dAUlo)
 								* [Netflix: Multi-Regional Resiliency and Amazon Route 53](https://www.youtube.com/watch?v=WDDkLOT8SCk)
 								* [Designing Services for Resilience: Netflix Lessons](https://www.youtube.com/watch?v=RWyZkNzvC-c)
 								* [South Bay SRE Meetup - Netflix Cloud Performance Team](https://www.youtube.com/watch?v=uQ0flQOtQEA)
 								* [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https://www.youtube.com/watch?v=T_D1G42G0dE)
 								* [How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows](https://www.youtube.com/watch?v=8tsIqfvizpU)
 								* [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM)
 								* [AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)](https://www.youtube.com/watch?v=leqUbSY55hY)
 								* [SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs](https://www.youtube.com/watch?v=koGaH4ffXaU)
 								* [From Sys Admin to Netflix SRE](https://www.youtube.com/watch?v=lZI51YzIgVE)
 								* [Application Resilience Engineering and Operations at Netflix with Hystrix](https://www.youtube.com/watch?v=RzlluokGi1w)
 								* [Injecting Failure at Netflix](https://www.youtube.com/watch?v=ioXV28GtXeo)
 								* [LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability](https://www.youtube.com/watch?v=3D0zS3kPNUU)
+								* [Incident Management at Netflix Velocity](https://www.infoq.com/presentations/netflix-incident-management/)
 								### Podcasts
+								* [Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems](https://www.infoq.com/podcasts/netflix-sre-sociotechnical-systems/)
+								### Tools
 								* [Dispatch](https://github.com/Netflix/dispatch)
+								</details>
+								<details>
 								  <summary>New Relic</summary>
 								### Blog Posts
 								* [Defining Modern Software Roles: SREs at New Relic](https://newrelic.com/blog/nerd-life/new-relic-sre)
 								* [10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)](https://newrelic.com/blog/best-practices/site-reliability-engineering-careers)
 								* [What Tools Do Site Reliability Engineers Use?](https://newrelic.com/blog/best-practices/best-sre-tools)
 								* [A Day in the Life of a New Relic SRE](https://newrelic.com/blog/nerd-life/what-does-an-sre-do)
 								* [7 Habits of Highly Successful Site Reliability Engineers](https://newrelic.com/blog/best-practices/site-reliability-engineer-sre-habits)
+								* [Adopting the practice of SRE](https://newrelic.com/blog/best-practices/adopting-sre-practices)
 								* [Using modern observability to establish a data-driven culture](https://newrelic.com/blog/best-practices/observability-data-driven-culture)
 								</details>
+								<details>
 								  <summary>Nubank</summary>
+								### Blog Posts
 								* [How we deal with technical incidents](https://building.nubank.com.br/how-we-deal-with-incidents/)
+								* [How we do On-Call Rotations at Nubank](https://building.nubank.com.br/how-we-do-on-call-rotations-at-nubank/)
+								* [How we scale our data platform efficiently and reliably](https://building.nubank.com.br/distributing-the-data-team-to-boost-innovation-reliably/)
 								* [Why We Killed Our End-to-End Test Suite](https://building.nubank.com.br/why-we-killed-our-end-to-end-test-suite/)
 								* [Automatic retraining for machine learning models: tips and lessons learned](https://building.nubank.com.br/automatic-retraining-for-machine-learning-models/)
 								</details>
+								<details>
 								  <summary>OpenAI</summary>
 								### Blog Posts
+								* [March 20 ChatGPT outage: Here’s what happened](https://openai.com/blog/march-20-chatgpt-outage)
+								* [OpenAI SRE and scaling explained easy.](https://medium.com/@Pran-Ker/openai-sre-miracle-19a33bdd3145)
 								* [Scaling Kubernetes to 2,500 nodes](https://openai.com/research/scaling-kubernetes-to-2500-nodes)
 								* [Scaling Kubernetes to 7,500 nodes](https://openai.com/research/scaling-kubernetes-to-7500-nodes)
 								* [Scaling AI Infrastructure at OpenAI](https://www.youtube.com/watch?v=cK7qFZ9J6k0)
 								</details>
+								<details>
 								  <summary>PayPal</summary>
+								### Blog Posts
 								* [Triggered: Incident #1234 (incident process needs fixing)](https://medium.com/paypal-tech/triggered-incident-1234-incident-process-needs-fixing-2a09dbac9edd)
 								* [Implementing Observability in a Service Mesh](https://medium.com/paypal-tech/implementing-observability-in-a-service-mesh-273c7409283d)
 								* [PostgreSQL at Scale: Database Schema Changes Without Downtime](https://medium.com/paypal-tech/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680)
 								* [Scaling GraphQL at PayPal](https://medium.com/paypal-tech/scaling-graphql-at-paypal-b5b5ac098810)
+								### Videos
+								* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
 								* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
 								* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
 								* [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel)
 								* [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)
 								</details>
+								<details>
 								  <summary>Picnic</summary>
 								### Blog Posts
 								* [Micrometer and the Modern Observability Stack](https://blog.picnic.nl/micrometer-and-the-modern-observability-stack-ebf72283bd8e)
 								* [Monitoring and Observability at Picnic](https://blog.picnic.nl/monitoring-and-observability-at-picnic-684cefd845c4)
 								</details>
+								<details>
 								  <summary>Pinterest</summary>
+								### Blog Posts
+								* [Ensuring High Availability of Ads Realtime Streaming Services](https://medium.com/pinterest-engineering/ensuring-high-availability-of-ads-realtime-streaming-services-ea3889420490)
+								* [Improving efficiency and reducing runtime using S3 read optimization](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
 								* [Scaling Kubernetes with Assurance at Pinterest](https://medium.com/pinterest-engineering/scaling-kubernetes-with-assurance-at-pinterest-a23f821168da)
 								* [What we learned from an iOS app OOMs incident](https://medium.com/pinterest-engineering/what-we-learned-from-an-ios-app-ooms-incident-eb31eada251)
 								* [How we designed our Continuous Integration System to be more than 50% Faster](https://medium.com/pinterest-engineering/how-we-designed-our-continuous-integration-system-to-be-more-than-50-faster-b70a59342fe2)
+								* [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
 								* [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
 								* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
 								* [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)
+								### Videos
+								* [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
 								* [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
 								* [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)
 								</details>
 								<details>
 								  <summary>Postman</summary>
+								### Blog Posts
+								* [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)
 								</details>
+								<details>
 								  <summary>Prezi</summary>
 								### Blog Posts
 								* [How to avoid global outage — Seamlessly migrating DaemonSet labels](https://engineering.prezi.com/intro-4727024fc2c1)
 								* [In search of speed — debugging Elasticsearch performance](https://engineering.prezi.com/in-search-of-speed-debugging-elasticsearch-performance-9ce8edf4af40)
 								* [Prometheus at Prezi: replacing 10 years of anti-patterns](https://engineering.prezi.com/prometheus-at-prezi-replacing-10-years-of-anti-patterns-e3c2317e6ca)
 								</details>
+								<details>
 								  <summary>Red Hat</summary>
 								### Blog Posts
 								* [From Ops to SRE: Evolution of the OpenShift Dedicated Team](https://www.openshift.com/blog/from-ops-to-sre-evolution-of-the-openshift-dedicated-team)
 								* [5 Agile Practices Every SRE Team Should Adopt](https://www.openshift.com/blog/5-agile-practices-every-sre-team-should-adopt)
 								* [7 Best Practices for Writing Kubernetes Operators: An SRE Perspective](https://www.openshift.com/blog/7-best-practices-for-writing-kubernetes-operators-an-sre-perspective)
 								</details>
+								<details>
 								  <summary>Riot Games</summary>
 								### Blog Posts
 								* [THE LEGENDS OF RUNETERRA CI/CD PIPELINE](https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline)
 								* [STRATEGIES FOR WORKING IN UNCERTAIN SYSTEMS](https://technology.riotgames.com/news/strategies-working-uncertain-systems)
 								* [IMPROVING THE DEVELOPER EXPERIENCE FOR OPERATING SERVICES](https://technology.riotgames.com/news/improving-developer-experience-operating-services)
 								* [SCALABILITY AND LOAD TESTING FOR VALORANT](https://technology.riotgames.com/news/scalability-and-load-testing-valorant)
 								* [LEVERAGING GOLANG FOR GAME DEVELOPMENT AND OPERATIONS](https://technology.riotgames.com/news/leveraging-golang-game-development-and-operations)
 								* [CONTROLLED CHAOS WITH FAULT INJECTION TESTING](https://technology.riotgames.com/)
 								* [DOWN THE RABBIT HOLE OF PERFORMANCE MONITORING](https://technology.riotgames.com/news/down-rabbit-hole-performance-monitoring)
 								* [PROFILING: THE CASE OF THE MISSING MILLISECONDS](https://technology.riotgames.com/news/profiling-case-missing-milliseconds)
 								* [PROFILING: REAL WORLD PERFORMANCE IN LEAGUE](https://technology.riotgames.com/news/profiling-real-world-performance-league)
 								* [PROFILING: OPTIMISATION](https://technology.riotgames.com/news/profiling-optimisation)
 								* [PROFILING: MEASUREMENT AND ANALYSIS](https://technology.riotgames.com/news/profiling-measurement-and-analysis)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART I](https://technology.riotgames.com/news/running-online-services-riot-part-i)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART II](https://technology.riotgames.com/news/running-online-services-riot-part-ii)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART III](https://technology.riotgames.com/news/running-online-services-riot-part-iii)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART III: PART DEUX](https://technology.riotgames.com/news/running-online-services-riot-part-iii-part-deux)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART IV](https://technology.riotgames.com/news/running-online-services-riot-part-iv)
 								* [RUNNING ONLINE SERVICES AT RIOT: PART V](https://technology.riotgames.com/news/running-online-services-riot-part-v)
 								* [THE EVOLUTION OF SECURITY AT RIOT](https://technology.riotgames.com/news/evolution-security-riot)
 								* [RUNNING AN AUTOMATED TEST PIPELINE FOR THE LEAGUE CLIENT UPDATE](https://technology.riotgames.com/news/running-automated-test-pipeline-league-client-update)
 								* [AUTOMATED TESTING FOR LEAGUE OF LEGENDS](https://technology.riotgames.com/news/automated-testing-league-legends)
 								</details>
+								<details>
 								  <summary>Salesforce</summary>
 								### Blog Posts
 								* [Looking at the Kubernetes Control Plane for Multi-Tenancy](https://engineering.salesforce.com/looking-at-the-kubernetes-control-plane-for-multi-tenancy-88914cd7aa89)
 								* [Optimizing EKS networking for scale](https://engineering.salesforce.com/optimizing-eks-networking-for-scale-1325706c8f6d)
 								* [Zero Downtime Node Patching in a Kubernetes Cluster](https://engineering.salesforce.com/zero-downtime-node-patching-in-a-kubernetes-cluster-cdceb21c8c8c)
 								* [How, Not Why: An Alternative to the Five Whys for Post-Mortems](https://engineering.salesforce.com/how-not-why-an-alternative-to-the-five-whys-for-post-mortems-4518098cca17)
 								* [A Generic Sidecar Injector for Kubernetes](https://engineering.salesforce.com/a-generic-sidecar-injector-for-kubernetes-c05eede1f6bb)
 								* [Implementation of a monitoring strategy for products based on microservices](https://engineering.salesforce.com/implementation-of-a-monitoring-strategy-for-products-based-on-microservices-24ad24c4c3e5)
 								* [10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use](https://engineering.salesforce.com/10-steps-to-develop-an-incident-response-plan-youll-actually-use-6cc49d9bf94c)
 								* [Our Journey to a Near Perfect Log Pipeline](https://engineering.salesforce.com/our-journey-to-a-near-perfect-log-pipeline-6ae2f80cf7a0)
 								* [Optimizing Performance with Web Workers](https://engineering.salesforce.com/optimizing-performance-with-web-workers-612b48621d8d)
 								* [Take A Moment To Refocus](https://engineering.salesforce.com/take-a-moment-to-refocus-86b6546c90c)
 								</details>
+								<details>
 								  <summary>Schibsted Media</summary>
 								### Blog Posts
 								* [Reliability engineering for some of top 10 sites in Scandinavia](https://alexewerlof.medium.com/reliability-engineering-for-some-of-top-10-sites-in-scandinavia-91e388d8d13a)
 								</details>
+								<details>
 								  <summary>Scribd</summary>
+								### Blog Posts
+								* [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html)
 								* [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html)
 								* [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)
+								</details>
 								<details>
 								  <summary>Shopify</summary>
+								### Blog Posts
+								* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
 								* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
 								* [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
 								* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)
 								* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
 								* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)
+								### Videos
+								* [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge)
 								* [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne)
 								* [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)
 								</details>
+								<details>
 								  <summary>Sky Betting and Gaming</summary>
 								### Blog Posts
 								* [It’s Just a Monitoring Change](https://sbg.technology/2020/12/09/its-just-a-monitoring-change/)
 								* [“What's the worst that could happen?”: A worked example of how we deal with live incidents](https://sbg.technology/2020/04/02/whats-the-worst-that-can-happen/)
 								* [Rising from the Ashes](https://sbg.technology/2020/02/07/rising-from-the-ashes/)
 								* [Crash! Bang! Wallop! Practice makes perfect](https://sbg.technology/2018/05/04/firedrills-in-core/)
 								* [Performance Left Right and Center](https://sbg.technology/2017/10/23/performance-left-right-and-center/)
 								</details>
+								<details>
 								  <summary>Slack</summary>
+								### Blog Posts
+								* [Slack’s Incident on 2-22-22](https://slack.engineering/slacks-incident-on-2-22-22/)
+								* [Infrastructure Observability for Changing the Spend Curve](https://slack.engineering/infrastructure-observability-for-changing-the-spend-curve/)
+								* [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
 								* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
 								* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
 								* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)
 								### Videos
+								* [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton)
 								* [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)
 								</details>
+								<details>
 								  <summary>Slalom Build</summary>
 								### Blog Posts
+								* [How to Implement Service Level Objectives in New Relic APM](https://medium.com/slalom-build/how-to-implement-service-level-objectives-in-new-relic-apm-f34f8746118b)
+								* [Beginners Guide to DevOps: How to Make It into the Industry](https://medium.com/slalom-build/beginners-guid-to-devops-how-to-make-it-into-the-industry-c1652d59807)
 								* [GitHub Actions: Beyond CI/CD](https://medium.com/slalom-build/github-actions-beyond-ci-cd-cb3ddc6abaa)
 								* [Why isn’t all test automation run on the pipeline?](https://medium.com/slalom-build/why-isnt-all-test-automation-run-on-the-pipeline-b2c57afbdf5a)
 								* [The Many Shapes of Site Reliability Engineering](https://medium.com/slalom-build/the-many-shapes-of-site-reliability-engineering-468359866517)
 								* [How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS](https://medium.com/slalom-build/how-to-build-a-secure-by-default-kubernetes-cluster-with-a-basic-ci-cd-pipeline-on-aws-ebfe0da1c7c9)
 								* [Secret Management Architectures: Finding the balance between security and complexity](https://medium.com/slalom-build/secret-management-architectures-finding-the-balance-between-security-and-complexity-d857ceaa2300)
 								* [Detecting Malicious Requests with Keras & Tensorflow](https://medium.com/slalom-build/detecting-malicious-requests-with-keras-tensorflow-5d5db06b4f28)
 								* [The Lego Monolith — A Monolith Microservice Proof of Concept](https://medium.com/slalom-build/the-lego-monolith-a-monolith-microservice-proof-of-concept-a402ca1654e4)
 								* [Managing Secrets Using Hashicorp Vault](https://medium.com/slalom-build/managing-secrets-using-hashicorp-vault-ed6b9e0375ac)
 								* [Packaging Spring Boot Applications for Deployment on Kubernetes](https://medium.com/slalom-build/packaging-spring-boot-applications-for-deployment-on-kubernetes-5fb64bc65406)
 								* [Immutable Infrastructure and Continuous Delivery in the Cloud](https://medium.com/slalom-build/immutable-infrastructure-and-continuous-delivery-in-the-cloud-56ee4b31b8d5)
 								</details>
<details>
  <summary>SoundCloud</summary>

### Blog Posts

* [How to Successfully Hand Over Systems](https://developers.soundcloud.com/blog/how-to-successfully-hand-over-systems)
* [Building a Healthy On-Call Culture](https://developers.soundcloud.com/blog/building-a-healthy-on-call-culture)
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
* [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
* [What I Learned in One Year as an SRE Trainee](https://developers.soundcloud.com/blog/sre-trainee)
* [Tests Under the Magnifying Lens](https://developers.soundcloud.com/blog/tests-under-the-magnifying-lens)

</details>
<details>
  <summary>Spotify</summary>

### Blog Posts

* [Matt Clarke: Senior Backend Infrastructure Engineer](https://engineering.atspotify.com/2021/03/09/my-beat-matt-clarke/)
* [Designing a Better Kubernetes Experience for Developers](https://engineering.atspotify.com/2021/03/01/designing-a-better-kubernetes-experience-for-developers/)
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)

### Videos

* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)

</details>
<details>
  <summary>Squarespace</summary>

### Blog Posts

* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)

### Videos

* [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
* [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
* [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
* [Creating a Code Review Culture](https://www.usenix.org/conference/srecon19americas/presentation/turner)

</details>
<details>
  <summary>Stack Overflow</summary>

### Blog Posts

* [“This should never happen. If it does, call the developers.”](https://stackoverflow.blog/2021/03/18/creating-a-good-feedback-loop-between-ops-and-devs-using-documentation/)
* [Infrastructure as code: Create and configure infrastructure elements in seconds](https://stackoverflow.blog/2021/03/08/infrastructure-as-code-create-and-configure-infrastructure-elements-in-seconds/)
* [Fulfilling the promise of CI/CD](https://stackoverflow.blog/2021/01/19/fulfilling-the-promise-of-ci-cd/)
* [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
* [Guest Post - Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
* [How We Built Our Blog](https://stackoverflow.blog/2015/07/02/how-we-built-our-blog/)
* [Stack Overflow Frees Up Engineering Time with Netlify](https://www.netlify.com/blog/stack-overflow-case-study/)

### Videos

* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)

</details>
<details>
  <summary>Strava</summary>

### Blog Posts

* [Scaling Club Leaderboard Infrastructure for Millions of Users](https://medium.com/strava-engineering/scaling-club-leaderboard-infrastructure-for-millions-of-users-9ee857ce8cfe)
* [Distributed Tracing at Strava](https://medium.com/strava-engineering/distributed-tracing-at-strava-e9d784b9ddf2)

</details>
<details>
  <summary>Stripe</summary>

### Blog Posts

* [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
* [Fast builds, secure builds. Choose two.](https://stripe.com/blog/fast-secure-builds-choose-two)
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog)

### Videos

* [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
* [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)

</details>
<details>
  <summary>Target</summary>

### Blog Posts

* [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
* [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
* [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)

</details>
<details>
  <summary>Teads</summary>

### Blog Posts

* [Scaling your on-duty team](https://medium.com/teads-engineering/scaling-your-on-duty-team-bc467c480747)

</details>
<details>
  <summary>Tinder</summary>

### Blog Posts

* [The Ultimate Load Test](https://medium.com/tinder-engineering/the-ultimate-load-test-c32b37adc11b)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 1](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-1-b0850a7e5224)
* [How We Improved Our Performance Using ElasticSearch Plugins: Part 2](https://medium.com/tinder-engineering/how-we-improved-our-performance-using-elasticsearch-plugins-part-2-b051da2ee85b)
* [Tinder’s move to Kubernetes](https://medium.com/tinder-engineering/tinders-move-to-kubernetes-cda2a6372f44)

</details>
<details>
  <summary>Tokopedia</summary>

### Blog Posts

* [Benefits of benchmarking with Go](https://medium.com/tokopedia-engineering/benefits-of-benchmarking-with-go-f8bfa177f7fa)
* [Simulating Customized Chaos in Golang using Toxiproxy](https://medium.com/tokopedia-engineering/simulating-customized-chaos-in-golang-using-toxiproxy-b913584d88a7)
* [How Tokopedia Rank Millions of Products in Search Page](https://medium.com/tokopedia-engineering/how-tokopedia-rank-millions-of-products-in-search-page-70e358ea2274)

</details>
<details>
  <summary>Trivago</summary>

### Blog Posts

* [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)

</details>
<details>
  <summary>Twilio</summary>

### Blog Posts

* [Twilio SRE Gameday Template](https://github.com/twilio/gameday/blob/main/get_to_know_your_systems.md)

</details>
<details>
  <summary>Twitter</summary>

### Blog Posts

* [Logging at Twitter: Updated](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/logging-at-twitter-updated)
* [Deleting data distributed throughout your microservices architecture](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/deleting-data-distributed-throughout-your-microservices-architecture)
* [Deterministic Aperture: A distributed, load balancing algorithm](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/daperture-load-balancer)
* [MetricsDB: TimeSeries Database for storing metrics at Twitter](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb)
* [The Infrastructure Behind Twitter: Scale](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale)
* [The infrastructure behind Twitter: efficiency and optimization](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2016/the-infrastructure-behind-twitter-efficiency-and-optimization)

</details>
<details>
  <summary>Uber</summary>

### Blog Posts

* [Founding Uber SRE](https://lethain.com/founding-uber-sre/)
* [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
* [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
* [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)

### Videos

* [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
* [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
* [‘A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)

</details>
<details>
  <summary>Udemy</summary>

### Blog Posts

* [Blameless Incident Reviews at Udemy](https://medium.com/udemy-engineering/blameless-incident-reviews-at-udemy-aa4773dbaf0b)
* [How Udemy does Build Engineering](https://medium.com/udemy-engineering/how-udemy-does-build-engineering-9722e98a4208)

</details>
<details>
  <summary>upGrad</summary>

### Blog Posts

* [Web Performance and Related Stories — upgrad.com](https://engineering.upgrad.com/web-performance-and-related-stories-upgrad-com-a9fb9c6bb766)
* [Beginner’s guide to web analytics](https://engineering.upgrad.com/beginners-guide-to-analytics-c8ce3e92fa42)
* [iOS Continuous Deployment with Bitbucket, Jenkins and Fastlane at UpGrad](https://engineering.upgrad.com/ios-continuous-deployment-with-bitbucket-jenkins-and-fastlane-at-upgrad-699b3b48acca)

</details>
<details>
  <summary>VGW</summary>

### Blog Posts

* [The SRE Incident Response game](https://medium.com/@bruce_25864/the-sre-incident-response-game-db242fff391c)

### Videos

* [Level Up Your Incident Response With Gameplay](https://youtu.be/c2-52EP8_7c)

</details>
<details>
  <summary>Wikimedia Foundation</summary>

### Videos

* [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
* [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)

</details>
<details>
  <summary>Wix</summary>

### Blog Posts

* [How We Improved Website Performance by Evolving Our Infrastructure](https://www.wix.engineering/post/how-we-improved-website-performance-by-evolving-our-infrastructure)
* [Wix Inbox Journey: 3 Approaches for Zero Downtime Database Migration](https://www.wix.engineering/post/wix-inbox-journey-3-approaches-for-zero-downtime-database-migration)
* [Moving Velo to Multiple Container Sites: The Why, The How and The Lessons Learned](https://www.wix.engineering/post/moving-velo-to-multiple-container-sites-the-why-the-how-and-the-lessons-learned)
* [Making Order in CI/CD Mess](https://www.wix.engineering/post/making-order-in-ci-cd-mess)

</details>
<details>
  <summary>Yelp</summary>

### Blog Posts

* [The process: Implementing Yelp’s failover strategy](https://increment.com/reliability/yelp-traffic-failover-strategy/)

### Videos

* [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)

</details>
<details>
  <summary>Zalando</summary>

### Blog Posts

* [Tracing SRE’s journey in Zalando - Part I](https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html)
* [Tracing SRE’s journey in Zalando - Part II](https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html)
* [Tracing SRE’s journey in Zalando - Part III](https://engineering.zalando.com/posts/2021/10/sre-journey-part3.html)

</details>
<details>
  <summary>Zerodha</summary>

### Blog Posts

* [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)

</details>
<details>
  <summary>Zomato</summary>

### Blog Posts

* [Huddle Diaries – DevOps and Data Platform](https://www.zomato.com/blog/huddle-diaries-devops-and-data-platform)

</details>

## SRECon Mix Playlist

### Videos

* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
* [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus)
* [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald)
* [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas)
* [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali)
* [Cloudlock - My Life as a Solo SRE](https://www.usenix.org/conference/srecon19emea/presentation/murphy)
* [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke)
* [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne)
* [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas)
* [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)
* [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve)
* [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower)
* [MessageBird - Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne)
* [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal)
* [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger)
* [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins)
* [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable)
* [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan)
* [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin)
* [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek)
* [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec)
* [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan)
* [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin)
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)

---

## Resources

### Books

* [__New!__ Enterprise Roadmap to SRE](https://learning.oreilly.com/library/view/enterprise-roadmap-to/9781098117740/)
* [Building Secure & Reliable Systems](https://www.oreilly.com/library/view/building-secure-and/9781492083115/) | [Read free online version hosted by Google](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/) | [Read free online version hosted by Google](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook from Google](https://www.oreilly.com/library/view/the-site-reliability/9781492029496/) | [Read free online version hosted by Google](https://sre.google/workbook/table-of-contents/)
* [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/) | [Read free online version hosted by Google](https://github.com/google/googlesre/blob/main/publications/Training_Site_Reliability_Engineers.pdf)
* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/) | [Complimentary Copy from Nginx](https://www.nginx.com/resources/library/97-things-every-sre-should-know/)
* [SLO Adoption and Usage in Site Reliability Engineering](https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/)
* [Practical Site Reliability Engineering](https://www.oreilly.com/library/view/practical-site-reliability/9781788839563/)
* [Implementing Service Level Objectives](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/)
* [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
* [Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856/)
* [Security Chaos Engineering](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/)
* [Chaos Engineering Observability](https://www.oreilly.com/library/view/chaos-engineering-observability/9781492051046/)
* [Database Reliability Engineering](https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/)
* [What Is SRE?](https://www.oreilly.com/library/view/what-is-sre/9781492054429/)
* [Database Reliability Engineering: What, Why, and How?](https://www.oreilly.com/library/view/database-reliability-engineering/9781492030942/)
* [Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/)
* [Chaos Engineering: Site reliability through controlled disruption](https://www.manning.com/books/chaos-engineering)
* [Incident Metrics in SRE](https://www.oreilly.com/library/view/incident-metrics-in/9781098103163/) | [Read free online version hosted by Google](https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/)
-												Update README.md
											
										
										
											2021-10-07 13:55:51 -04:00
+								* [Engineering Reliable Mobile Applications](https://www.oreilly.com/library/view/engineering-reliable-mobile/9781492057444/)
-												Added new books

											
										
										
											2022-01-22 11:18:58 +08:00
+								* [Monitoring the SRE Golden Signals](https://www.slideshare.net/OpsStack/how-to-monitoring-the-sre-golden-signals-ebook)
 								* [Site Reliability Engineering: Philosophies, habits, and tools for SRE success](https://newrelic.com/resources/ebooks/site-reliability-engineering) | [Portable version](https://newrelic.com/sites/default/files/2021-08/site-reliability-engineering-handbook.pdf)
 								* [97 Things Every Cloud Engineer Should Know](https://www.redhat.com/rhdc/managed-files/cl-97-things-cloud-engineers-know-e-book-oreilly-f28602-202105-en.pdf)
 								* [Real-World SRE](https://www.packtpub.com/product/real-world-sre/9781788628884)
 								* [Hands-on Site Reliability Engineering](https://bpbonline.com/products/hands-on-site-reliability-engineering?_pos=1&_sid=839999550&_ss=r)
### Events

* [SRECon Past Events](https://www.usenix.org/srecon#past)
* [ChaosConf](https://www.chaosconf.io/)
* [SLOConf](https://www.sloconf.com/)
  * [SLOConf 2021 Playlist](https://www.youtube.com/watch?v=-lHPDx90Ppg&list=PLLNq9CBV7AFwyRzICyCRKdcsAPAlG5bPu)
* [cdCon](https://events.linuxfoundation.org/cdcon/)
  * [cdCon 2021 Playlist](https://www.youtube.com/watch?v=MQU4fKhau1w&list=PL2KXbZ9-EY9TWsV-Jz8ARSt1ko0Yd36ah)
  * [cdCon 2020 Playlist](https://www.youtube.com/watch?v=qLMrcEj-R9Y&list=PL2KXbZ9-EY9RbYURc1CDrOJpbrPMtc0P7)

### Other Resources

#### Awesome Lists

* [Awesome SRE](https://github.com/dastergon/awesome-sre)
* [Awesome Site Reliability Engineering Tools](https://github.com/SquadcastHub/awesome-sre-tools)
* [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
* [Awesome Monitoring](https://github.com/crazy-canux/awesome-monitoring)
* [Awesome Observability](https://github.com/adriannovegil/awesome-observability)
* [Awesome MLOps](https://github.com/visenger/awesome-mlops)
* [ML-Ops.org](https://ml-ops.org/)

#### SRE Resources from various organizations

* [Google SRE Page](https://sre.google/)
* [Google SRE Classroom](https://sre.google/classroom/)
* [Google Cloud SRE Page](https://cloud.google.com/sre)
* [Microsoft SRE Page](https://docs.microsoft.com/en-us/azure/site-reliability-engineering/)
* [School of SRE from LinkedIn](https://linkedin.github.io/school-of-sre/)
* [Stripe Increment Magazine Issue 16 on Reliability](https://increment.com/reliability/)
* [AWS Observability Recipes](https://aws-observability.github.io/aws-o11y-recipes/)
* [Awesome Sysadmin](https://github.com/awesome-foss/awesome-sysadmin)

#### Incidents & postmortems

* [The Verica Open Incident Database](https://www.thevoid.community/)
* [Postmortem Templates](https://github.com/dastergon/postmortem-templates)
* [Incident Review and Postmortem Best Practices](https://blog.pragmaticengineer.com/postmortem-best-practices/)

#### Newsletters

* [SRE Weekly Newsletter](https://sreweekly.com/)
* [Chaos Engineering Newsletter](https://chaosengineering.news/)
* [DevOps Weekly Newsletter](http://devopsweekly.com)

## Credits

* Inspired by [Howtheytest](https://github.com/abhivaikar/howtheytest) from [Abhijeet Vaikar](https://github.com/abhivaikar)
* The list of organizations is drawn from my other repo, [awesome-engineering](https://github.com/upgundecha/awesome-engineering)
* Banner image [Cartoon vector created by vectorjuice - www.freepik.com](https://www.freepik.com/vectors/cartoon)

## Other How They... repos

* [Howtheytest](https://github.com/abhivaikar/howtheytest)
* [Howtheydevops](https://github.com/bregman-arie/howtheydevops)
* [Howtheyaws](https://github.com/upgundecha/howtheyaws)

## Contribute

Contributions welcome! Read the [contribution guidelines](contributing.md) first.

## License

[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0)

To the extent possible under law, Unmesh Gundecha has waived all copyright and
related or neighboring rights to this work.

---

If you decide to use this anywhere, please give credit to [@upgundecha](https://www.twitter.com/upgundecha) on Twitter. Also, if you like my work, check out my other projects on GitHub.