Swiss Digital Network

Reducing IT Operations Costs with an SRE-Driven Operating Model in the Digital Age

Key Takeaways

Maintaining a sysadmin approach for cloud-based IT operations results in hidden costs that can be mitigated by adopting Site Reliability Engineering (SRE).

Leveraging an SRE-driven model will reduce costs by:

  • reducing the headcount required to provision and operate applications in the cloud,
  • improving customer experience through better reliability,
  • optimizing resource consumption,
  • increasing automation,
  • and ensuring quality continuously and effectively.

Introduction: The Cost Problem of Cloud Operations

Many organizations have succeeded, partially or fully, in moving their business-critical applications and infrastructure to the cloud. This transition has often been combined with the adoption of DevOps to empower development teams and increase delivery velocity.

However, many of the same organizations are struggling to manage the costs of their new cloud- or hybrid-based operations departments and are looking for practical solutions.

In this article, we present Site Reliability Engineering (SRE) as a strategic initiative to drastically reduce the cost of cloud-based IT operations.

The Hidden Cost of the Sysadmin Approach in Cloud-Based IT Operations

“Sysadmins are tasked with running the service and responding to events and updates as they occur. [...] Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system” (Google SRE Book, 2016).

As early as 2003, Google recognized that the sysadmin approach to running distributed and dynamic systems can become very expensive. Moving to the cloud adds to this cost because it introduces an additional infrastructure silo that has to be managed. Furthermore, increasing the velocity of delivery creates demand for additional sysadmin work, because releases become more frequent.

Cloud-based architectures often involve multiple services, microservices, and distributed systems, which inherently increase the complexity of the infrastructure. This complexity arises from the need to manage various components, their interactions, and dependencies. Additionally, the dynamic nature of cloud environments, where resources can be scaled up or down based on demand, leads to a higher volume of events that need to be monitored and managed. This includes events related to resource provisioning, scaling, failures, and performance metrics.

In other words, the demands on the sysadmin team can grow exponentially, because the complexity, the delivery velocity, and the number of managed elements and events generated by cloud and DevOps all increase at the same time.
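To make the compounding effect concrete, here is a minimal sketch of a toy workload model. It is not taken from the article: the `OpsLoad` class, its three drivers, and every number in it are illustrative assumptions. It only shows that when several drivers grow at once, the manual workload (and therefore the headcount of a purely manual team) grows with their product rather than their sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpsLoad:
    """Hypothetical drivers of operational work (illustrative numbers only)."""
    services: int              # managed services or components
    releases_per_service: int  # releases per service per month
    events_per_release: int    # events needing attention per release

    def monthly_events(self) -> int:
        # The drivers multiply: each service ships releases,
        # and each release generates events to handle.
        return self.services * self.releases_per_service * self.events_per_release


def sysadmins_needed(load: OpsLoad, events_per_admin: int) -> int:
    """Headcount if every event is handled manually (ceiling division)."""
    return -(-load.monthly_events() // events_per_admin)


# Assumed scenario: each driver doubles after the move to cloud and DevOps.
before = OpsLoad(services=10, releases_per_service=2, events_per_release=50)
after = OpsLoad(services=20, releases_per_service=4, events_per_release=100)

growth = after.monthly_events() / before.monthly_events()
print(before.monthly_events(), after.monthly_events(), growth)  # 1000 8000 8.0
print(sysadmins_needed(before, 500), sysadmins_needed(after, 500))  # 2 16
```

In this assumed scenario, doubling each of the three drivers multiplies the manual workload by eight, which is the kind of scaling a team relying on manual intervention cannot absorb without growing.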

Our Effective SRE and Operations Efficiency Approach

At Digital Architects Zurich, a cell of the Swiss Digital Network, we started considering this impact back in 2020 (see blog posts referenced below). Especially if you target both high velocity and high reliability, which is the main promise of DevOps, keeping a sysadmin approach to running and operating the systems will not scale.

Therefore, one of our main contributions in the Swiss market has been to democratize SRE by building and deploying Effective SRE as a practical framework for a new operating model (see, for example, our blog posts on the base Effective SRE methodology and on specific capabilities for continuous verification and observability, as well as our public talk at the Swiss Testing Day / DevOps Fusion in 2021).

The SRE is then a key role in a cloud operating model, taking on essential responsibilities in driving specification, design, testing, observability, and operations toward proactive and cost-efficient assurance of service levels, such as availability and performance, both through the pipeline and in operations.

The Effective SRE is responsible for assuring SLOs by:

  • Co-building, maintaining, and operating the AI-driven CD/DevOps pipeline (jointly with DevOps teams)
  • Co-building, maintaining, and operating AI-driven IT operations management (observability/monitoring, AIOps, alerting, ChatOps, …)
  • Co-building, maintaining, and operating the SRE cockpit and dashboards (incl. SLO monitoring, CD, and emergency status dashboards)
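As an illustration of what SLO monitoring on such a dashboard might compute, the following is a minimal sketch of an availability error budget, a common SRE practice that the article itself does not spell out. The function names, the 30-day window, and the example figures are assumptions chosen for illustration, not part of the Effective SRE framework.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# Assumed example: a 99.9% availability SLO over a 30-day window.
print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes of allowed downtime
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75 of the budget is left
```

A dashboard built on a calculation like this could show at a glance how much room remains for risky changes before the SLO is at risk, which is one way to make service-level assurance proactive rather than reactive.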
