First, let's understand: what is SRE, AKA Site Reliability Engineering?
Now, let's understand: what is the difference between DevOps and SRE?
Monitoring in Site Reliability Engineering (SRE)
Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
Ben Treynor, VP of Engineering at Google and founder of Google SRE, pinpointed the essence of the SRE role in this interview:
“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
There are three kinds of valid monitoring output:
Alerts
Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Tickets
Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
Logging
No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
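The three outputs above can be sketched as a small routing function, so that software rather than a human reading email decides what each event becomes. This is a minimal sketch: the `Event` fields, the `route` function, and the sample events are hypothetical names chosen for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only; nobody is notified


@dataclass
class Event:
    name: str
    needs_human: bool  # False when the system can handle it on its own
    urgent: bool       # True when waiting a few days would cause damage


def route(event: Event) -> Output:
    """Decide in software which of the three outputs an event becomes."""
    if not event.needs_human:
        return Output.LOG
    return Output.ALERT if event.urgent else Output.TICKET


if __name__ == "__main__":
    # Hypothetical sample events, one per output type.
    for e in [
        Event("checkout error rate above threshold", needs_human=True, urgent=True),
        Event("disk filling slowly", needs_human=True, urgent=False),
        Event("replica restarted and recovered", needs_human=False, urgent=False),
    ]:
        print(f"{e.name}: {route(e).value}")
```

In a real system the two boolean fields would themselves be computed from metrics; the point of the sketch is only that the decision lives in code.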
List of Infrastructure Monitoring tools in Site Reliability Engineering
- Nagios
- Zabbix
- New Relic Infrastructure
- Datadog
- ELK
List of Log Monitoring tools in Site Reliability Engineering
- Splunk
- ELK
- Datadog
Other Site Reliability Engineering (SRE) Runtime Application Tools
Chef InSpec
Chef InSpec is for scanning your applications and infrastructure. It is an open source (OSS) automated testing tool for integration, compliance, security, and other policy requirements.
ELK with Kibana
ELK with Kibana is for log analysis of security threats. “ELK” is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. … Kibana lets users visualize data with charts and graphs in Elasticsearch. The Elastic Stack is the next evolution of the ELK Stack.
HashiCorp Vault
HashiCorp Vault is a security tool for certificates, API keys, and passwords. It lets you secure, store, and tightly control access to tokens, passwords, certificates, and encryption keys, protecting secrets and other sensitive data through a UI, CLI, or HTTP API.
Fortify WebInspect
Fortify WebInspect is dynamic application security testing (DAST) software that finds and prioritizes exploitable vulnerabilities in web applications.
Fortify Application Defender
Fortify Application Defender is a runtime application self-protection (RASP) product that protects production applications from common attacks and vulnerabilities.
AppScan on Cloud
AppScan on Cloud delivers a suite of security testing tools, including static, dynamic and interactive testing for web, mobile and open source software. It detects pervasive security vulnerabilities and facilitates remediation.
Twistlock
Twistlock is for understanding and implementing the security aspects of Docker: securing hosts, containers, and serverless workloads across the DevSecOps lifecycle. Trusted by more than 35% of the Fortune 100, Twistlock describes itself as the world’s first truly comprehensive cloud native security platform, providing holistic coverage across hosts, containers, and serverless computing in a single platform.
Notary
Understanding and implementing the security aspects of Kubernetes. Notary is a core piece of plumbing in Docker’s approach to the secure supply chain, whereby security is seamlessly and uniformly embedded into a workflow from development all the way through to operations. Notary is an implementation of The Update Framework (TUF) written in Go.
New Relic
Understanding and implementing the security aspects of the Java Virtual Machine. Leaders in the cloud security community, speaking at the Cloud Security Alliance, OWASP, RSA, and IAPP.
AWS Security service
Understanding and implementing the security aspects of the AWS cloud. Security, Identity, and Compliance on AWS. … AWS data protection services provide encryption, key management, and threat detection that continuously monitors and protects your accounts and workloads. AWS Identity Services enable you to securely manage identities, resources, and permissions at scale.
ThreatModeler
ThreatModeler Cloud Edition automatically builds threat models for cloud infrastructures, managing potential threats for AWS and Azure environments. Its out-of-the-box cloud security solution provides an understanding of an organization’s entire attack surface and empowers enterprises to manage their risks more effectively.
Checkmarx
Flexible and accurate security solution capable of identifying hundreds of vulnerabilities. Supports over 22 coding and scripting languages and frameworks.
Trend Micro Cloud One
Detection and protection for modern applications and APIs built on your container, serverless, and other computing platforms.
Aqua Security
Full dev-to-prod container security solution on Kubernetes, Docker, OpenShift, Fargate, Lambda, AWS & other container platforms.
Synthetic Monitoring in Site Reliability Engineering (SRE)
The answer to such business problems is monitoring. Everyone knows that, and your business likely has a few monitoring systems in place already.
But the challenge with real-world applications is that pings and API uptime checks barely scratch the surface of the application. Modern applications are built on transactions, funnels, logins, and several third-party services, and all of these need to operate together rather than merely work perfectly in isolation. With traditional monitoring systems, you may be confident that your email server is working and your payment server is working, but how do you know whether the payment server can send transaction emails through the email server?
Synthetic monitoring answers this by running scripted tests that exercise these flows end to end. Having a suite of such tests set up and run regularly allows you to answer the following critical questions at all times:
- Is the system up?
- Are all the important sub-systems up?
- Are customers able to log in?
- Are customers able to locate what they were expecting, and in the right place?
- Has any recent code change broken some part of customer experience?
- Are customers able to filter results, download reports, etc.?
- Are customers able to make payments?
- Are customers able to reach the support team from within the app?
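A suite like the one these questions describe can be sketched as a list of named checks that are run on a schedule. This is a minimal illustration, not how any of the commercial tools work internally: `run_suite`, `failed`, and the two placeholder checks are hypothetical, and a real check would drive the actual login or payment flow instead of returning a canned result.

```python
import time
from typing import Callable, Dict, List, Tuple

# A check is a named function that returns True when the user journey works.
Check = Tuple[str, Callable[[], bool]]


def run_suite(checks: List[Check]) -> Dict[str, dict]:
    """Run every check, recording pass/fail and duration in seconds."""
    results: Dict[str, dict] = {}
    for name, check in checks:
        start = time.monotonic()
        try:
            ok = bool(check())
        except Exception:
            # A check that crashes counts as a failed journey.
            ok = False
        results[name] = {"ok": ok, "seconds": time.monotonic() - start}
    return results


def failed(results: Dict[str, dict]) -> List[str]:
    """Names of the journeys that did not succeed, in run order."""
    return [name for name, r in results.items() if not r["ok"]]


def can_log_in() -> bool:
    # Placeholder: a real check would submit the login form.
    return True


def payment_sends_email() -> bool:
    # Placeholder that simulates the cross-service failure described above.
    raise TimeoutError("no receipt email arrived")


if __name__ == "__main__":
    suite = [("login", can_log_in), ("payment email", payment_sends_email)]
    print("failed journeys:", failed(run_suite(suite)))
```

Treating an exception as a failure keeps one broken journey from hiding the results of the others.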
The following Synthetic Monitoring tools are used in Site Reliability Engineering (SRE):
- Dynatrace
- AppDynamics
- New Relic
- Pingdom
Real User Monitoring in Site Reliability Engineering (SRE)
Real User Monitoring is a type of performance monitoring that captures and analyzes each transaction by users of a website or application. It’s also known as real user measurement, real user metrics, end-user experience monitoring, or simply RUM. It’s used to gauge user experience, including key metrics like load time and transaction paths, and it’s an important component of application performance management (APM).
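To make "key metrics like load time" concrete, here is a small sketch of how load-time samples collected from real users might be summarized. RUM products compute figures like these for you; the nearest-rank percentile method and the sample values below are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical page load times, in seconds, from ten user sessions.
    load_times = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2, 1.0, 6.5]
    print("median load time:", percentile(load_times, 50))
    print("p90 load time:", percentile(load_times, 90))
```

High percentiles matter because an average can look healthy while a meaningful share of users still wait several seconds.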
The following Real User Monitoring tools are used in Site Reliability Engineering (SRE):
- Dynatrace RUM
- AppDynamics Browser RUM
- New Relic Browser
Emergency Response in Site Reliability Engineering (SRE)
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR). The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the MTTR.
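One common way to express that relationship is steady-state availability, MTTF / (MTTF + MTTR). The sketch below uses that textbook formula with made-up numbers to show how much a shorter repair time helps; the function name and the sample values are illustrative assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical service that fails about once every 30 days (720 hours).
    slow_repair = availability(720, 3.0)  # three-hour repairs
    fast_repair = availability(720, 1.0)  # repairs three times faster
    print(f"3h MTTR: {slow_repair:.4%}")
    print(f"1h MTTR: {fast_repair:.4%}")
```

With MTTF held constant, every reduction in MTTR translates directly into higher availability, which is why response speed is the metric to watch.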
Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.” The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better. While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page. Thus, Google SRE relies on on-call playbooks, in addition to exercises such as the “Wheel of Misfortune,” to prepare engineers to react to on-call events.
- PagerDuty – PagerDuty has developed and refined our internal incident response practices over the course of several years. Read more – https://landing.google.com/sre/workbook/chapters/incident-response/
- Opsgenie – Opsgenie integrates with over 200 of the best monitoring, ITSM, ChatOps, and collaboration tools. Paired with a flexible rules engine, Opsgenie notifies the right people on-call, enabling them to take rapid action.
- Slack – Slack helps you maintain a dedicated channel as a gathering place for all subject-matter experts and Incident Commanders. The channel can be used mostly as an information ledger for the scribe, who captures actions, owners, and timestamps.
Conference calls
When asked to join any incident response, on-call engineers are required to dial in to a static conference call number. We prefer that all coordination decisions are made in the conference call, and that decision outcomes are recorded in Slack. We found this was the fastest way to make decisions. We also record every call to make sure that we can recreate any timeline in case the scribe misses important details. Some of the conference call tools used in Site Reliability Engineering (SRE) are as follows:
- GoToMeeting
- Google Hangouts
- Webex
Change Management in Site Reliability Engineering (SRE)
SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.
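The trio can be sketched as one loop: widen exposure stage by stage, check health after each step, and roll back on the first failure. This is a simplified illustration rather than how Ansible, Terraform, or Puppet implement rollouts; `deploy`, `healthy`, and `rollback` are hypothetical callbacks you would wire to your own tooling.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Expose a change to a growing percentage of traffic.

    Rolls back and stops at the first failed health check.
    Returns True only if every stage passed.
    """
    for percent in stages:
        deploy(percent)
        if not healthy():
            rollback()
            return False
    return True


if __name__ == "__main__":
    history: List[object] = []

    # Simulated environment: the change only misbehaves at 50% of traffic.
    ok = progressive_rollout(
        stages=[1, 10, 50, 100],
        deploy=history.append,
        healthy=lambda: history[-1] != 50,
        rollback=lambda: history.append("rollback"),
    )
    print("completed:", ok, "history:", history)
```

Because the bad change is caught at an intermediate stage, the final stage is never deployed and only part of the traffic ever sees the problem.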
A few very good tools for Change Management in Site Reliability Engineering (SRE) are:
- Ansible
- Terraform
- Puppet
Demand Forecasting and Capacity Planning in Site Reliability Engineering (SRE)
Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability. There’s nothing particularly special about these concepts, except that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity is in place by the time it is needed. Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.
Several steps are mandatory in capacity planning:
- An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
- An accurate incorporation of inorganic demand sources into the demand forecast
- Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity
Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.
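The capacity planning steps above can be tied together in a back-of-the-envelope calculation. The sketch below assumes compound monthly organic growth, a flat inorganic bump, and a per-server throughput figure obtained from load testing; all function names, parameters, and numbers are hypothetical.

```python
import math


def forecast_demand(
    current_qps: float,
    monthly_growth: float,
    months_ahead: int,
    inorganic_qps: float = 0.0,
) -> float:
    """Organic demand (compound growth) plus a flat inorganic addition.

    months_ahead should extend beyond the lead time for acquiring capacity.
    """
    organic = current_qps * (1 + monthly_growth) ** months_ahead
    return organic + inorganic_qps


def servers_needed(
    demand_qps: float, qps_per_server: float, spare_servers: int = 0
) -> int:
    """Convert service demand into raw capacity.

    qps_per_server is the figure measured by load testing;
    spare_servers adds redundancy on top of the minimum.
    """
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    return math.ceil(demand_qps / qps_per_server) + spare_servers


if __name__ == "__main__":
    # Hypothetical: 1000 QPS today, 10% monthly growth, 6 months out,
    # plus 500 QPS expected from a planned feature launch.
    demand = forecast_demand(1000, 0.10, 6, inorganic_qps=500)
    print("servers:", servers_needed(demand, qps_per_server=200, spare_servers=2))
```

The load-testing step matters because `qps_per_server` is the only link between the demand forecast and the hardware you actually order.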
Provisioning in Site Reliability Engineering (SRE)
Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive. This exercise must also be done correctly or capacity doesn’t work when needed. Adding new capacity often involves spinning up a new instance or location, making significant modification to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. Thus, it is a riskier operation than load shifting, which is often done multiple times per hour, and must be treated with a corresponding degree of extra caution.
The following tools are used for Provisioning in Site Reliability Engineering (SRE):
- Ansible
- Terraform
- Puppet
Efficiency and Performance in Site Reliability Engineering (SRE)
Efficient use of resources is important any time a service cares about money. Because SRE ultimately controls provisioning, it must also be involved in any work on utilization, as utilization is a function of how a given service works and how it is provisioned. It follows that paying close attention to the provisioning strategy for a service, and therefore its utilization, provides a very, very big lever on the service’s total costs.
Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.
Software systems become slower as load is added to them. A slowdown in a service equates to a loss of capacity. At some point, a slowing system stops serving, which corresponds to infinite slowness. SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service’s performance. SREs and product developers will (and should) monitor and modify a service to improve its performance, thus adding capacity and improving efficiency.
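The link between load, slowdown, and capacity can be illustrated with the textbook M/M/1 queueing model, where mean response time is 1 / (service rate − arrival rate). Real services are more complicated, so treat this as a simplifying assumption rather than the model SRE teams actually use; the function names and numbers are hypothetical.

```python
import math


def mean_response_time(service_rate: float, arrival_rate: float) -> float:
    """M/M/1 mean time in system, in seconds.

    Returns infinity at or beyond saturation: the system stops serving.
    """
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


def max_load_for_target(service_rate: float, target_seconds: float) -> float:
    """Highest arrival rate that keeps mean response time within the target."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return max(0.0, service_rate - 1.0 / target_seconds)


if __name__ == "__main__":
    rate = 100.0  # hypothetical: one server completes 100 requests/second
    for load in (50, 90, 99, 100):
        print(f"load {load}/s -> {mean_response_time(rate, load)} s")
    print("capacity at a 50 ms target:", max_load_for_target(rate, 0.05))
```

Under this model, usable capacity at a given response-speed target is lower than the raw service rate, and making the software faster raises both.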
The following tools are used for Efficiency and Performance Monitoring in Site Reliability Engineering (SRE):
- New Relic
- AppDynamics
- Datadog
- ELK
How to get SRE certification?
SREs are engineers who have software engineering experience as well as Unix systems administration, operations, and production environment experience. That’s because SREs routinely use automation to reduce human labor and increase reliability.
DevOpsSchool’s Site Reliability Engineer (SRE) Certification is a roadmap to the principles and practices that allow an organization to reliably and economically scale Development to Ops and Production.
Reference
Resources
- https://landing.google.com/sre/resources/
- https://nordicapis.com/what-is-site-reliability-engineering-sre/
- https://medium.com/@wansatriaandanu/how-to-make-youre-daily-activity-as-sre-devops-better-using-slack-bot-a4e379b7730a
- https://landing.google.com/sre/workbook/chapters/organizational-change/
- https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
- https://landing.google.com/sre/sre-book/chapters/software-engineering-in-sre/
- https://landing.google.com/sre/sre-book/chapters/introduction/