What is Toil?
In everyday English, “toil” means exhausting physical labour: to work extremely hard or incessantly.
Time Off in Lieu
Time off in lieu, otherwise known as TOIL, is when an employer offers time off to workers who have worked beyond their contracted hours. It serves as an alternative to overtime pay: the extra hours an employee works can be taken later as additional leave.
What is toil from an SRE perspective?
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Toil is a term coined by Google to describe tedious, repetitive tasks associated with running a production environment. For Site Reliability Engineering (SRE) teams, the aim is to reduce or even eliminate toil in order to maximize the time spent on engineering and innovation.
If teams spend the majority of their time on these types of tasks, they have less time for high-value work. As a consequence, operational costs rise and the focus becomes more reactive than proactive. This inhibits innovation.
When software engineers write code, they want it to be simple, fast, and reliable, that is, free of bugs and cruft. SREs want the same thing for operations. In the realm of operations, “cruft and bugs” can be described by one word: toil. In the broadest sense, toil is engineering effort that adds no lasting value. The six traits below describe it in more detail.
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.
O(n) with service growth
If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.
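The linear-scaling trait can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name `manual_hours` and the figure of 15 hands-on minutes per service are assumptions, not measurements from any real system.

```python
def manual_hours(num_services: int, minutes_per_service: float) -> float:
    """Hands-on hours needed when a human repeats the task once per service."""
    return num_services * minutes_per_service / 60


# Assumed figure: 15 hands-on minutes per service for each round of the task.
small = manual_hours(10, 15)    # 2.5 hours
large = manual_hours(100, 15)   # 25.0 hours

# Ten times the services means ten times the human effort: O(n) toil.
growth_factor = large / small   # 10.0
```

If the same task were automated, the hands-on cost would stay roughly flat as the service grows, apart from the one-time effort to build the automation.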
What is an example of toil in SRE?
Some examples of toil may include:
- Handling quota requests
- Applying database schema changes
- Reviewing non-critical monitoring alerts
- Copying and pasting commands from a playbook
For the individual, high levels of toil lead to:
- Discontent and a lack of feeling of accomplishment
- Burnout
- More errors, which lead to time-consuming rework
- No time to learn new skills
- Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)
For the organization, high levels of toil lead to:
- Constant shortages of team capacity
- Excessive operational support costs
- Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
- Inability to retain top talent (and acquire top talent once word gets out about how the organization functions)
How to reduce toil
SRE teams can minimize the cost of toil in many ways. The following six techniques will help your IT organization reduce it.
Standardize
A lack of standardization leads to a more complex IT platform, which then increases toil. Minimize the variety of IT platforms in place, for example different types of Unix, different versions of Windows Server and multiple separate hardware suppliers. Also, look for duplicated functions. Multiple applications that carry out the same functions, such as overlapping customer relationship management and sales force automation applications, increase the complexity, and therefore the toil, of the environment. Standardization makes the platform easier to manage as the other steps are taken.
Reuse
Many toil tasks are repetitive. Therefore, once a fix is found for a task, engineers should reuse it whenever the same task recurs, even on a different part of the platform. A library of callable scripts will help reduce toil. Increasingly, tools used in SRE come with preexisting libraries that cover the most common areas.
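One way to picture a library of callable scripts is a small registry that maps a name to a reusable fix. This is a minimal sketch, not a real tool: the registry `REMEDIATIONS`, the decorator `remediation`, the two example fixes and the host name `web-1` are all hypothetical, and the fixes only return a message instead of touching a machine.

```python
from typing import Callable, Dict

# Hypothetical registry: fix name -> function that applies the fix to a host.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {}


def remediation(name: str):
    """Decorator that registers a fix under a name so it can be reused."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        REMEDIATIONS[name] = func
        return func
    return register


@remediation("clear_tmp")
def clear_tmp(host: str) -> str:
    # Placeholder: a real version would run the cleanup on the host.
    return f"cleared temporary files on {host}"


@remediation("restart_service")
def restart_service(host: str) -> str:
    # Placeholder: a real version would restart the affected service.
    return f"restarted service on {host}"


def run(name: str, host: str) -> str:
    """Look up a registered fix by name and apply it to the given host."""
    return REMEDIATIONS[name](host)


result = run("clear_tmp", "web-1")
```

Once a fix is registered, the same call works for any host, so the knowledge of how to solve the problem is written down once rather than rediscovered each time.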
Monitor
Triage, also called firefighting, is the worst thing that can happen to an IT platform. A problem that affects users harms the business, gives the business a negative perception of IT and encourages responders to cut corners. Operations teams must institute a solid procedure for monitoring the entire IT platform: a system that can identify possible problems before they become issues and can then initiate events to fix them.
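The idea of catching a problem early and triggering a fix can be sketched as a threshold check. Everything here is an assumption for illustration: the function `check_disk_usage`, the 80% warning level and the `remediate` callback are hypothetical, and a real monitoring system would read the metric from the platform rather than take it as an argument.

```python
from typing import Callable, List


def check_disk_usage(
    used_percent: float,
    warn_at: float,
    remediate: Callable[[float], None],
) -> str:
    """Trigger a fix once usage crosses a warning level, before the disk is full."""
    if used_percent >= warn_at:
        remediate(used_percent)
        return "remediated"
    return "ok"


# Stand-in for a real fix: record which readings triggered remediation.
triggered: List[float] = []

status_low = check_disk_usage(65.0, 80.0, triggered.append)
status_high = check_disk_usage(85.0, 80.0, triggered.append)
```

The warning level sits below the point of failure on purpose, so the fix runs while users are still unaffected.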
Automate
Humans are, unfortunately, often the root of problems in the IT environment. Unchecked changes can domino into catastrophic issues across the platform. Therefore, look to systems that check any change before implementation, automate that change and roll back if any problems are identified post-deployment.
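The check, apply and roll back sequence described above can be sketched as a single function. This is a simplified illustration under stated assumptions: configuration is modelled as a plain dictionary, and `validate` and `healthy` are hypothetical callbacks standing in for whatever pre-change checks and post-deployment health checks an organization actually uses.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_change(
    current: Config,
    change: Config,
    validate: Callable[[Config], bool],
    healthy: Callable[[Config], bool],
) -> Config:
    """Check a change, apply it, and roll back if the result is unhealthy."""
    # Step 1: check the change before implementation.
    if not validate(change):
        raise ValueError("change rejected by pre-deployment check")
    # Step 2: apply the change automatically, leaving the original untouched.
    candidate = {**current, **change}
    # Step 3: roll back to the previous configuration if problems appear.
    if not healthy(candidate):
        return current
    return candidate


# Hypothetical checks: values must be non-empty, and "replicas" must not be "0".
def non_empty(change: Config) -> bool:
    return all(value != "" for value in change.values())


def has_replicas(config: Config) -> bool:
    return config.get("replicas") != "0"


base = {"replicas": "3", "version": "1.0"}
upgraded = apply_change(base, {"version": "1.1"}, non_empty, has_replicas)
rolled_back = apply_change(base, {"replicas": "0"}, non_empty, has_replicas)
```

Because the previous configuration is never modified in place, rolling back is simply a matter of returning it.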
Improve
Poor code leads to more problems, which means more toil. Use an integrated DevOps approach with solid testing to improve initial code quality, with automated feedback loops between operations and development to raise any identified issues, along with indications of priority for fixing.
Embrace new technologies
Adopt new technologies, but not too fast, and don’t assume they will remedy all problems. Machine learning, deep learning and AI will increasingly improve SRE capabilities but are still at an early stage of maturation in the market. However, waiting until they are 100% proven means carrying avoidable toil in the meantime. Introduce such technologies in small, defined areas and judge their effectiveness. Organizations can then begin to roll them out across the whole platform as confidence in their capabilities grows.
Toil Identification Checklist in SRE
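A task is likely to be toil if most of the following are true:
- Manual: a human has to spend hands-on time on it, even if only to run a script.
- Repetitive: it is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
- Automatable: a machine could accomplish the task just as well as a human, or the need for the task could be designed away.
- Tactical: it is interrupt-driven and reactive, rather than strategy-driven and proactive.
- No enduring value: the service remains in the same state after the task is finished.
- O(n) with service growth: the work scales up linearly with service size, traffic volume, or user count.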
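The six traits defined in this article can also be written down as a small data structure, which makes it easier to review tasks consistently. This is only a sketch: the `Task` class and `matched_traits` function are hypothetical names, and the true/false answers for the two example tasks are illustrative judgments, not measurements.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Task:
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def matched_traits(task: Task) -> List[str]:
    """Return the names of the toil traits that are true for this task."""
    matched = []
    for field in fields(task):
        value = getattr(task, field.name)
        # Only the boolean fields are traits; the task name is skipped.
        if isinstance(value, bool) and value:
            matched.append(field.name)
    return matched


# Illustrative answers for two tasks mentioned in this article.
quota = Task("handle quota requests", True, True, True, True, True, True)
cleanup = Task("straighten out legacy configuration", True, False, False, False, False, False)

quota_traits = matched_traits(quota)      # all six traits match
cleanup_traits = matched_traits(cleanup)  # only "manual" matches
```

The more traits a task matches, the stronger the case for treating it as toil. No cut-off is given here, since where to draw the line is a judgment call for each team.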
Reference
- https://www.devopsschool.com/blog/toil-identification-checklist-survery-question-in-sre/
- https://sre.google/sre-book/eliminating-toil/
- https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
- https://www.rundeck.com/blog/toil-finally-a-name-for-a-problem
- https://www.devopsschool.com/blog/sre-site-reliability-engineering-summary/