Liz Fong-Jones, principal developer advocate at California-based software debugging tool vendor Honeycomb.io, and Jean Clermont, program manager at international construction service provider Flatiron, sat down in a recent webinar to talk about technical debt with site reliability engineering (SRE) advocate Matt Davis and SRE architect Kurt Andersen from SRE platform Blameless.
OutSystems.com describes technical debt as the price businesses have to pay in time, money, and resources for choosing speed over quality when writing code. The resulting software bugs and defects pile up because no one seems to have the time to fix them, hindering a company’s ability to update, innovate and grow.
Here are the key insights from Blameless’ webinar:
Why the term ‘technical debt’?
Jones said that the term ‘debt’ aptly captures the effort and investment an organization needs to keep its system operating up to par, and that the debt imposes ‘an ongoing tax on your efforts’ until it is paid off. The longer it takes to catch up and pay that debt, the more work or ‘toil’ (the word the panelists use to describe the tediousness of working with technical debt) it takes, logistically and financially. Although the term is often used negatively, debt “is an opportunity to make the right investments as long as there is a plan to pay your debt off”. It is tricky, however, to quantify technical debt the way we would a traditional debt. Andersen added that the term ‘debt’ is itself questionable because, for many organizations, technical debt stems from a choice made by default (for example, relying on human labour instead of automation) rather than from a failure to fix or update technical issues.
How visible is technical debt?
The participants in the webinar agreed that any measurement of how much technical debt an organization has is imprecise, and often captures only the tip of the iceberg. Jones argued that visibility of technical debt can also be tricky, as it may lull organizations into a false sense of security. However, Clermont said that some visibility of technical debt remains important so that engineers can categorize and prioritize the issues to be addressed, and make decisions and trade-offs accordingly. Davis said that pinpointing all technical debt is implausible. ‘Dark debt’, for instance, a term that came out of a STELLA report, describes debt that exists but surfaces only during a snafu or outage.
What is the difference between technical debt and a bug?
Jones described a bug as a manifestation of technical debt, which is a systemic problem that makes code particularly prone to bugs. Clermont said that every organization has its own measuring sticks for escalating bugs that affect the overall functioning of the system, and that bugs left unaddressed become technical debt of a higher degree. Davis discussed the ‘haunted graveyard’, a term coined by former Google SRE John Reese for a system that is ‘scary’ to step in and remediate because it has gone through so many outages, faces a constant swamp of problems, or has lost developers who quit the company without leaving adequate documentation. Preventing such a situation requires collective knowledge and ownership of how to safely operate a system, as well as a methodical and incremental reconstruction of the system, Jones concluded.
Are ‘hack weeks’ effective for addressing technical debt?
Davis described hack weeks as a temporary halt of normal operations during which an organization focuses on squashing bugs and tackling technical debt. Jones said that these never work, as organizations need to address technical debt incrementally, including by constantly improving documentation and runbooks. She argued that the most effective solution is to prevent technical debt from building up, encourage people to think before committing code, and preemptively develop a plan for handling future incidents. Clermont said that incremental work on technical debt is a good habit he pushes on his engineers, such as writing the appropriate unit tests and building documentation in parallel with the code.
How to categorize technical debt?
Jones said that technical debt ranges from a defect that makes it harder to write new software to the manual toil required to keep the system running. She recommended a simple trend analysis based on metadata tagged on incidents. For instance, incidents related to database failure or high Central Processing Unit (CPU) utilization should prompt you to look at other things within your infrastructure to help you better understand what is trending in your environment and whether there is lingering technical debt.
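The trend analysis Jones describes could be sketched roughly as follows. This is a minimal illustration rather than a method presented in the webinar: the incident records, the tag names, the `tag_counts_by_month` and `rising_tags` helpers, and the `min_total` threshold are all hypothetical, and a real analysis would read incidents from whatever incident tracker the organization uses.

```python
from collections import Counter, defaultdict

# Hypothetical incident records as (month, tags). Illustrative data only,
# not figures from the webinar.
INCIDENTS = [
    ("2023-01", ["database-failure"]),
    ("2023-01", ["high-cpu"]),
    ("2023-02", ["database-failure", "high-cpu"]),
    ("2023-02", ["database-failure"]),
    ("2023-03", ["database-failure"]),
    ("2023-03", ["database-failure", "deploy"]),
    ("2023-03", ["database-failure"]),
]


def tag_counts_by_month(incidents):
    """Count how many incidents carried each tag, grouped by month."""
    counts = defaultdict(Counter)
    for month, tags in incidents:
        # set() so a tag repeated on one incident is counted once.
        counts[month].update(set(tags))
    return dict(counts)


def rising_tags(incidents, min_total=3):
    """Return tags whose monthly count never drops, ends higher than it
    started, and totals at least min_total incidents (an assumed cutoff)."""
    counts = tag_counts_by_month(incidents)
    months = sorted(counts)  # "YYYY-MM" strings sort chronologically
    all_tags = {tag for month_counts in counts.values() for tag in month_counts}
    rising = []
    for tag in sorted(all_tags):
        series = [counts[month][tag] for month in months]
        never_drops = all(a <= b for a, b in zip(series, series[1:]))
        if never_drops and series[-1] > series[0] and sum(series) >= min_total:
            rising.append(tag)
    return rising


if __name__ == "__main__":
    print(rising_tags(INCIDENTS))
```

A tag that keeps climbing, such as the database failures in this made-up data, would be a prompt to look deeper at the related infrastructure for lingering technical debt.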
How to know you are writing technical debt when writing code?
Jones recommended developers look for the ‘smells’ and have the observability and critical skills to assess whether the use of the code would be seamless. Lacking those skills is akin to creating technical debt because it implies that the programmer will be unable to understand the issues when they arise. In any case, writing appropriate unit tests is key, Jones said.
Davis said that documentation can also prevent technical debt, as it alleviates the cognitive stress of having to fix issues reactively or at a critical time for the organization when the problem is more consequential and costlier.
Are there other ways to get around technical debt than fixing it?
Jones argued that if you are dealing with a bad system, writing a new one to replace it can be the better option, although it is often hard to tell whether that is the right decision. Andersen said that choices often make sense within an organization in the moment but later need to be remediated because of unknowns, unexpected failures, and new features in the market, however robust you think your system is. There always needs to be room in your error budget for unknowns, Jones said.
How to balance deadlines for feature releases versus technical debt?
Davis said organizations often have feature releases coming when unexpected outages happen before the release. Not many organizations can afford to halt feature development to deal with outages, especially not startups. Clermont said that ensuring service is available to the customer is a business decision, while tackling the issues related to technical debt is an engineering decision. He recommended reaching a happy medium, whereby you mitigate the issues to provide the customer with the service, while allowing the infrastructure to rebound for a prolonged period of time without causing too much pain in the interim. Jones said companies need to focus on mitigating the risk of outage as well as having a plan to de-risk.
What is socio-technical debt?
Davis said that doing things the old way means accepting the debt of people doing things the old way, and fixing technical debt implies getting employees to adopt new ways of doing things, leading to conflicting priorities for management. How humans feel about a process, or during an outage, is central to making these choices. Clermont said that organizations need to have frameworks in place to institutionalize documentation and make sure information is easier for people to collaborate on and edit accordingly as systems evolve.
Words of advice to engineers?
Jones said that engineers are never going to make a dent in technical debt if they are going it alone. They need to gather data showing how much that debt is slowing down both the organization and the team, and have a clear plan for what can be done to fix it.
You can access the full webinar here.