DevOps, SRE, DX: Getting clear on Terminology and Teams
- Start Date: 2020-07-24
- Author: Rufus Pollock and Irio Musskopf
- Status: Review
- Related: 0002-dx-2020
Summary
- Deprecate DevOps as a term
- Service Reliability Engineering (SRE) is the preferred term, especially for the crew/team (currently led by Irio)
- SRE covers 2 areas:
  - DX (Developer Experience): creating and improving the systems and processes for efficiently creating and running reliable solutions
  - Toil: day to day reactive support and maintenance
- Support is a project. Product Owner is Irakli at present.
- "You build it, you run it".
Basic example
"X project has been launched and is entering hosting / support. Going forward SRE will own it."
"We need to deploy a new app. I will do this myself and only contact SRE if I run into problems"
"How can we improve DX in the area of continuous deployment."
Motivation
Clarity and precision: DevOps is imprecise. It has stopped meaning developers doing operations and has become a mix of various things. SRE is more precise.
DX is also a useful new term.
Approach
Service Reliability Engineering (SRE) will be our term for the activity and crew that:
- Ensures our systems run reliably, especially our hosted solutions for clients.
- Ensures the Developer Experience (DX) for our developers building and maintaining these solutions is great.
We will deprecate the term DevOps other than as a label for a type of task (deployment, maintenance etc) done by developers.
Deploying and maintaining applications is a responsibility of developers, not of a separate team: you build it, you run it.
At the same time, we need a "crew" who:
- Are high-level experts (and so the escalation point for issues)
- Design the DX (the systems and processes) that enables our developers to "run it" (and, to an extent, "build it")
- Are responsible for the running of the SaaS-like solutions (or solutions that have entered pure hosting mode)
SRE work includes 2 areas: DX and “Toil”
- DX: Developer Experience = the experience of developers in carrying out their work, specifically in creating and managing DMSes and data-driven applications. Think of it as "Product" work for our internal systems/processes around development, deployment and management of services.
- Toil: day to day support. Contrasts with DX "Product" work as it is interrupt driven and responsive.
  - 9-5 or 24/7 cover
  - Immediate response
The aim, taken from the Google SRE book, is that SRE team members spend less than 50% of their time on toil.
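One way to check the 50% aim is to compute the share of logged hours that went to toil. The following is a minimal sketch only: the `TimeEntry` record, the `"toil"`/`"dx"` labels, and the function names are hypothetical and do not describe any existing time-tracking setup.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption: the "less than 50%" aim expressed as a fraction of total hours.
TOIL_CAP = 0.5


@dataclass
class TimeEntry:
    """Hypothetical record of hours logged by an SRE crew member."""
    hours: float
    kind: str  # assumed labels: "toil" or "dx"


def toil_fraction(entries: Iterable[TimeEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing is logged)."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.kind == "toil")
    return toil / total


def within_toil_aim(entries: Iterable[TimeEntry], cap: float = TOIL_CAP) -> bool:
    """True when the toil share is strictly below the cap."""
    return toil_fraction(entries) < cap
```

For example, a week with 12 hours of toil and 28 hours of DX work gives a toil share of 30%, which is within the aim; an even 20/20 split is not, because the aim is strictly less than 50%.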
Commentary
DevOps is something people (developers) do. It is not an area, team or role within the org. Literally it is Developers doing Operations.
SRE = Service Reliability Engineering. This is the term we will adopt going forward for a specific “team” and area of responsibility.
Support
We also have the concept of "support" (see http://playbook.datopian.com/support/). Support could usefully be divided into:
- SRE: a system is down, not working. This may evolve into technical support if the issue is traced to a bug in the underlying solution.
- Technical support: there is a bug in the solution (CKAN) that needs debugging and fixing.
Drawbacks
- We are changing terms. SRE is not as common (?)
Alternatives
- Sticking with DevOps and clarifying it
Adoption strategy
- Create sre@datopian.com and sre-team@datopian.com email addresses (the former is a catch-all that anyone can write to; the latter is for the team)
- Rename devops project to sre
- Send email to all-team
- Talk through on all hands (?)
Unresolved questions
None