In today’s digital economy, web services, cloud storage, streaming services, and game servers must be operational and reliable 24/7 for users. DevOps makes IT professionals such as software developers responsible for supporting their own software through on-call rotations. Instawork states: “The complexity of a platform increases over time, causing on-call duties to become more complicated, especially for a small team.” Improving alert thresholds, error handling, and issue ticketing can make duties easier, but that is only part of the solution. When alerts go off at night during a lengthy on-call shift, a developer’s productivity can be affected for days afterwards. Teams are working to improve productivity by replacing 24/7 rotation schedules with shift patterns that minimize sleep disruption.
The effect of on-call
Clinical research has shown that extended periods of work availability that include nights can increase stress on employees and reduce the recovery experiences necessary for work engagement, focus, and productivity.
A 2016 University of Hamburg study in the Journal of Occupational Health Psychology asked 132 employees across 13 organizations in a variety of industries (including IT) to record cortisol levels each morning after an on-call period and record how many calls from work they received in the last 24 hours. Researchers found that cortisol was significantly higher during an extended on-call period “relative to days without availability requirements. However, this effect was not completely attributable to the occurrence of job contacts and resulting work demands; extended work availability had an independent effect after controlling for job contacts.” In other words, just anticipating being called over a 24-hour period causes stress.
A 2021 research paper surveyed participants including both on-site and off-site on-call workers. Researchers reported that 70 per cent of participants “reported difficulties in returning to sleep” after being called, possibly due to anxiety about having to attend “pending work shifts” the following day. Additionally, out of all participants, 56 per cent “reported that being on-call impacted their sleep, even when no calls were received.” Researchers write: “this finding is concerning, especially given that some workplace regulations classify on-call time as ‘rest’ if no calls are received.”
While these first two studies are limited in that they focus broadly across multiple occupations, a general theme emerges that employee performance may be inhibited more than previously thought. A third study, published in December 2017 in Chronobiology International, backs up this assertion. Study participants were moved into a temperature-controlled sleep laboratory and were either told they definitely would be called overnight or that they may be called. Surprisingly, participants who were told that they may be called had worse “sleep and performance outcomes” than those who knew they definitely would be called. Researchers concluded that “uncertainty during on-call periods should be taken into consideration when planning and managing on-call rostering systems.” Further, they write that traditionally, “on-call periods where calls are less likely may have been viewed as having less of an impact on sleep and performance because less actual work is likely to occur. In fact, however, the opposite may be true.”
Teams will need to implement strategies to mitigate the uncertainty that is inherent in on-call work.
Productivity and recovery strategies
Server Density CEO and founder David Mytton states: “when being on-call, the worst thing is when you’re woken up in the middle of the night, deal with an issue, and then the next night the same thing breaks – or maybe something completely different – and the fatigue builds up as you’re being woken up, night after night while on-call.” One solution he proposes is a rule that if you are “woken up or you deal with an out-of-hours call incident, you are off-call for 24 hours to give people a chance to recover.” Sleep deprivation over a sustained period of time results in sleep debt. Just like technical debt, it takes days or weeks to pay down, and productivity is affected in the interim.
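The 24-hour recovery rule Mytton describes can be expressed as a small scheduling check. The following is a minimal sketch, not Server Density's implementation: the function name `pick_on_call`, the engineer names, and the way incidents are recorded are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# The 24-hour recovery window comes from the rule quoted above.
RECOVERY_WINDOW = timedelta(hours=24)


def pick_on_call(rotation, last_incident_end, now):
    """Return the first engineer in rotation order who is not recovering.

    rotation: list of engineer names in preferred order (hypothetical).
    last_incident_end: dict mapping a name to the time that engineer's
        most recent out-of-hours incident ended (missing if none).
    now: the time at which on-call coverage is needed.
    """
    for engineer in rotation:
        ended = last_incident_end.get(engineer)
        if ended is None or now >= ended + RECOVERY_WINDOW:
            return engineer
    return None  # everyone is still inside a recovery window


# Hypothetical example: "ana" handled an incident that ended at 03:00.
incidents = {"ana": datetime(2024, 1, 2, 3, 0)}
print(pick_on_call(["ana", "ben"], incidents, datetime(2024, 1, 2, 9, 0)))
```

In this example "ben" is selected, because "ana" is still inside her recovery window six hours after the incident. A real roster would also need to decide what happens when nobody is eligible, which this sketch only signals by returning `None`.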
A survey of sleep research by Victoria University found that “a reduction in sleep of just 30 minutes per night may result in acute performance decrements, such as a slower reaction time and reduced vigilance over time.” If an overnight alert takes just ten minutes to handle, it may take twice as long to fall back asleep afterwards, for a total of about 30 minutes of lost sleep. The research suggests that even in this simple case, employee performance may be affected.
Follow-the-sun scheduling
Several companies have reported their findings on tailoring their on-call systems in blog posts. GitLab uses the follow-the-sun system, in which global teams are on-call during their time zone’s daytime hours, to minimize the impact of on-call shifts on productivity. The company handbook says it has around 157 developers, which creates enough coverage that most “on-call shifts will take place within an engineer’s normal working hours.” Developers have eight “shifts per year”, or two “shifts per quarter”, and shifts are scheduled in “four-hour blocks”. Even with this much staff covering the on-call shift, GitLab still advises engineers to take time off after being on-call. According to GitLab, just “being available for emergencies and outages causes stress, even if there are no pages. Resting is critical for proper functioning.”
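The idea behind follow-the-sun routing can be sketched in a few lines. This is an illustration only and not GitLab's tooling: the three regions, their fixed UTC offsets, and the 09:00 to 17:00 working day are assumptions chosen so that the regions tile the full 24 hours, and daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical regions with fixed UTC offsets in hours (no daylight saving).
REGIONS = {"APAC": 8, "EMEA": 0, "AMER": -8}

# Assumed local working hours: 09:00 inclusive to 17:00 exclusive.
WORK_START, WORK_END = 9, 17


def region_on_call(now_utc):
    """Return the region whose local working day contains now_utc."""
    for name, offset in REGIONS.items():
        local = now_utc + timedelta(hours=offset)
        if WORK_START <= local.hour < WORK_END:
            return name
    return None  # a gap in coverage; does not occur with the offsets above


# 20:00 UTC is 12:00 local time in the assumed AMER region.
print(region_on_call(datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)))
```

With these offsets, every UTC hour falls inside exactly one region's working day, so no engineer is paged at night. Teams whose locations do not tile the clock this neatly would see `None` for the uncovered hours, which is where overnight shifts remain.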
One-day rotation scheduling
If a company with as robust an on-call system as GitLab is still concerned about employee effectiveness being impacted by on-call, smaller teams that may not have global coverage will need to be creative to improve their systems. A solution that smaller organizations have been implementing is replacing week-long shifts with one-day shifts, especially if overnight alerts are expected. PagerDuty recommends one-day shifts for “medium size teams where everyone is going to be responsible for one day”, but suggests that in smaller teams, if shift length isn’t right, it can seem like people “toss responsibility back and forth to each other.” An important consideration when creating these schedules for smaller teams is to provide adequate coverage between on-call shifts.
Shift coverage in practice
On-call teams at SoundCloud consist of between eight and 12 engineers, and boast a “positive atmosphere engendered by a voluntary on-call policy.” SoundCloud states: “every engineering organization is different, but through trial and error, SoundCloud has found the following practices work well: different rotations have different shift cadences, but most shifts last only one or two days; the optimal frequency for being on call is about three days a month, as more than that and people risk burning out over time.” Implementing one-day shifts rather than 24/7 shifts is a wise way to acknowledge a heavier on-call load and minimize the impact of extended shifts on team productivity.
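The arithmetic behind one-day shifts is easy to check. The sketch below assumes a single round-robin rotation of ten engineers over a 30-day month; the team size is picked from the eight-to-12 range above, and the names and start date are hypothetical. It is not a description of SoundCloud's actual tooling.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle


def one_day_rotation(engineers, start, days):
    """Assign one engineer per calendar day, cycling through the team."""
    rota = cycle(engineers)
    return [(start + timedelta(days=i), next(rota)) for i in range(days)]


# Hypothetical team of ten and a 30-day month.
team = [f"eng{i}" for i in range(10)]
schedule = one_day_rotation(team, date(2024, 6, 1), 30)

# Count how many on-call days each engineer receives.
load = Counter(name for _, name in schedule)
print(load["eng0"])  # 3
```

Under these assumptions each engineer is on call three days in the month, in line with the frequency SoundCloud describes. A team of eight on the same rotation would see some engineers reach four days, which suggests why team size matters when choosing shift length.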
Diversity within on-call shift schedules
Individuals respond physiologically to irregular shift schedules in different ways. The authors of the 2021 Behavioral Sleep Medicine research paper found that the effects of on-call work can vary by demographic group. Younger workers in the study were “more likely to think about the likelihood of being called.” The authors emphasize that young “adults have a higher sleep need than older adults.” In the same study, the authors found that “female participants were almost twice as likely to think about the likelihood of being called, were more likely to report frequent thoughts about what they would have to do if they were called in to work, and were more likely to think about potential interruptions to family/leisure time as a result of being called.”
A variety of individual factors influence tolerance of extended shift conditions. Multiple STEM initiatives across North America have been created to attract young people and gender-diverse workers to the IT community. This talent pool would benefit greatly from time spent reviewing on-call practices. On-call teams must expect that individuals will differ, and should fine-tune these systems with the diversity of individual experience in mind.
Conclusion
In addition to the more traditional approaches to reducing the burden of on-call rotations, companies are considering replacing 24/7 shift schedules with new schedules that reduce the cost to team productivity and developer wellbeing. While individuals and individual teams will have different needs, trying out follow-the-sun schedules, 24-hour shifts with sufficient coverage, time off after being woken up by a call, or other strategies that limit overnight on-call may lead to more productive developers, better economics for the team, and improved morale.