What is Site Reliability Engineering? “In Simple English”

Yann Mulonda
Published in Dev Genius
8 min read · Nov 4, 2021

Intro to Site Reliability Engineering — SRE

So before we go ahead and give definitions of terms and explain stuff, I want you to take a moment and do the following:

Think of IT as construction: think of an IT team building an app, software, an IT platform, a service, etc., the same way a construction team builds a house, a building, a neighborhood, a skyscraper, or a whole new city.

Keep this analogy in mind for the rest of this article, as we are going to explore different aspects of it. When I thought about it, I found it pretty fascinating how similar construction is to IT.

image source: pschwabe.com

So when thinking of a construction project, it’s safe to say that the complexity of the infrastructure and architecture depends on what we’re building. How much needs to go into planning, financing, staffing, budgeting, timing, resources, maintenance, etc. will vary depending on the project. Think of how much goes into building a two-bedroom single-family house vs. a mansion vs. condos vs. a housing project in a new neighborhood vs. a 10-story building vs. the Burj Khalifa vs. a whole new city.

Now, from a team perspective, different aspects of construction require different skill sets or domains of expertise. Some of the many roles on a construction site include:

  • Construction Manager, Architect, Interior & Exterior Designer, Engineer, Painter, Plumber, Construction Worker, Electrician, etc.
image source: dreamstime.com

An IT team has a setup very similar to a construction team’s and even uses similar terms for its roles, such as:

  • IT Manager, Architect, Graphic Designer, Software Engineer, Front-End, Back-End, or Full-Stack Developer, Programmer, System Admin & Engineer, Database Admin, Cloud Architect, DevOps Engineer, etc.

IT Infrastructure

IT infrastructure can simply be defined as all hardware systems, software, facilities, network resources, and services shared across an organization to support the delivery of business systems and other IT-enabled processes.

Think of IT infrastructure and architecture in the same way as construction infrastructure and architecture.

  • As we started to build complex IT infrastructures and software applications, we needed a way to build those infrastructures quickly, so we came up with various programming languages, frameworks, and new technologies like virtualization, which led to data centers.
  • Cloud is simply having on-demand access to someone else’s data center over the internet, where the actual physical hardware running all the virtual resources lives.
  • Virtualization has its own limitations as well, so we came up with containerization, orchestration, configuration management, provisioning, monitoring, infrastructure as code, etc.
  • and tools, lots of them!!
image source: levanture.com/Infrastructure-Services

In brief, IT infrastructure has evolved through five stages that you can read more about: the mainframe era, the personal computer era, the client/server era, the enterprise computing era, and the cloud and mobile computing era. (source: wps.pearsoned.ca)

Roles & Teams

To properly manage and maintain an IT infrastructure, an IT organization needs different teams and roles, each with its own domain of expertise and skill set.

Development Team

This is the team responsible for building applications and taking care of the software aspect of the IT infrastructure. There are many roles in this team, such as Software Developers, Software Engineers, Front-End, Back-End, and Full-Stack Developers, UX & UI Designers, etc. Most, if not all, dev teams probably work in an Agile software development setting.

Operations Team

The Operations team is responsible for designing and building the other aspects of the infrastructure, such as architecture, networking, hardware setup, security, engineering, IT support, etc. However, some organizations will have security fall under engineering, or vice versa, or make it its own team. The same applies to IT support and architecture, which in most organizations can also be their own teams.

Some of the common roles in this team are Network Administrator, System Engineer, Architect, Cloud Architect, Application & System Analyst, Database Administrator, DevOps Engineer, etc.

Engineering Team

In software development, the engineering team is the group of developers and managers responsible for the actual production and building of the given product or service. They are the ones carrying out all of the sprints and working on new or necessary features, updates, and fixes. There are several different types and levels of team lead and management roles within an engineering team depending on its structure and specific needs. (source: pagerduty.com)

All these roles require excellent knowledge of computer science, IT, and engineering, as well as managerial and other highly specialized skills. For illustration, the following are the skills and domains of expertise expected from an IT professional in the role of DevOps Engineer:

For more info about skills roadmaps for other IT professionals, check out roadmap.sh

Intro to Site Reliability Engineering (SRE)

Now, using our analogy, building a website for a restaurant or flower shop is like building a house, while building an application like Netflix or Facebook is like building the Burj Khalifa or the Signature Tower Jakarta: it’s a very complex system with many parts that connect to each other and have to work exactly as expected for the application to work properly and provide a good user experience.

In brief, a lot goes into enterprise-level software, the same way a lot goes into building skyscrapers. Let’s take something that most people use on a daily basis, Netflix:

Image source: nextbridge.com/technology-stack

Netflix has migrated all of its back end to cloud services provided by Amazon Web Services (AWS) and uses:

  • S3 for content storage
  • IAM for internal authentication/authorization
  • Kinesis and Kafka for data streaming
  • CloudFront for content caching/delivery
  • AWS Elastic Transcoder for video transcoding
  • EC2 for hosting
  • Lambda for serverless functions and state machines
  • several types of NoSQL databases for data storage
  • Hadoop for data aggregation and warehousing
  • Jira for task and project management
  • Confluence for documentation
  • Jenkins for the build and deployment pipeline
  • other third-party technologies

Managing these highly complex IT infrastructures is both challenging and crucial.

An outage of the software application can have a massive negative impact on the company and the user experience. Maintaining the uptime of all those technology stacks is challenging, to say the least, and requires a wide variety of knowledge, skills, and domains of expertise. Thus, it’s imperative that we use a different approach to IT operations. This is where Site Reliability Engineering (SRE) comes in.
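To make the idea of uptime a bit more concrete, here is a minimal Python sketch that converts an availability target into the amount of downtime it leaves room for. The function names (allowed_downtime, measured_availability) and the example targets are made up for illustration; they are not Netflix’s figures or part of any real tool.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime fits in `period` at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


def measured_availability(downtime: timedelta, period: timedelta) -> float:
    """Return the availability (in percent) achieved over `period`."""
    return 100 * (1 - downtime.total_seconds() / period.total_seconds())


if __name__ == "__main__":
    year = timedelta(days=365)
    # Hypothetical targets, chosen only to show how fast the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime(target, year)
        print(f"{target}% uptime leaves about {budget} of downtime per year")
```

Each extra "nine" in the target cuts the allowed downtime by a factor of ten, which is one way to see why keeping a large stack up takes a dedicated approach.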

What is SRE?

Ben Treynor Sloss launched the field of site reliability engineering at Google. After originating at Google in 2003, the SRE concept spread and became popular in the broader software development and IT industry in the 2010s. Other companies subsequently began to employ site reliability engineers. By 2016, Google employed more than 1,000 site reliability engineers.

This diagram pretty much sums up SRE

SRE fits right at the crossroads of IT operations, software development, and engineering. It fills some of the gaps related to integration, system engineering, testing, and DevOps engineering.

Most people tend to mix up DevOps and SRE. To put it simply, they are similar but not the same:

DevOps Engineers are ops-focused engineers who solve development problems; Site Reliability Engineers are development-focused engineers who use software development, DevOps, and engineering practices to solve IT operations, scalability, and reliability problems. You can find a more detailed overview of the key differences between DevOps and SRE in this article published on spacelift.io

So one might say — but why SRE…tho?

Coming back to our construction analogy, think of how different things would be if all inspections and security checks were done during each construction step, phase, or stage rather than after all the work is completed. In software development, this is what we refer to as DevSecOps.

DevSecOps consists of integrating IT security practices into the full life cycle of your application. To put it simply, it means thinking about application and infrastructure security from the start instead of isolating the role of the security team in the final stage of development. Security is considered a shared responsibility to be integrated from end to end.

Now, using our construction analogy, imagine creating a team that has a solid overall understanding of how everything was built and how all the parts connect to each other. That team also has the knowledge to administer and maintain the infrastructure and to make sure all its features and all aspects of the architecture work as intended, aka stuff is reliable. In IT, this is what an SRE team does for IT infrastructure.

Site reliability engineering is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. (source: Wikipedia)

If anything, the big takeaway on why you need SRE is that SRE roles and responsibilities are crucial to the continuous improvement of people, processes, and technology in your organization. If you’re looking for more insight and context on the problems DevOps and SRE teams solve and the tools both groups use, you might be interested in checking out this post on SRE vs. DevOps: What’s the Difference Between Them.

If you enjoyed this article, you might also like “What is DevOps? In Simple English”.

Cheers!!

Co-Founder & CIO @ITOT | DevOps | Senior Site Reliability Engineer @ICF | “Learning is experience; everything else is just information!”