Eliminating normalized deviance from software teams
In my experience, normalized deviance is the most common problem plaguing dysfunctional software teams. Its symptoms include technical debt, heroism, alert spam, and frequent outages. Too many leaders look for technical solutions to what is actually a cultural problem.
Normalized deviance occurs when people become accustomed to disregarding best practices. For example, MRSA bacteria spreads in hospitals when clinicians do not wash their hands properly. Normalized deviance has been cited in the Challenger space shuttle disaster, in airplane crashes where co-pilots defer to pilots in dangerous situations, and in many other failures.
When confronted with poor business outcomes, tech leads (TLs) will often propose technical solutions such as a deprecation or re-architecture. However, a year later, the team finds itself right where it started, often half-finished and now maintaining an even bigger mess.
When you really dig into it, the team usually has developed a culture that normalizes deviance. Maybe the team is overcommitted because they don’t feel empowered to say no, leading to technical debt and heroism. Maybe the team doesn’t understand how their system works, so they create alerts for too many things. Maybe the team just wants to get things done or get promoted, so they focus on building new things without regard for whether the existing things are still working.
No matter how much technical expertise you throw at the problem, it won’t get better until you fix the culture. As a senior individual contributor (IC) tasked with fixing the technical problem, you have no choice but to wade into the organizational morass and correct the culture. This post describes how I tackle such problems.
The rest of these instructions assume that you are a senior IC. However, you can apply many of these same techniques if you are a service TL or manager.
Step 1: Admit you have a problem
You will never solve the problem until the team accepts that they have a problem. This is hard because the team has normalized the current state. The team will always have a justification for their behavior. We have so many half-finished migrations! We need three oncalls because our P0 bug volume is so high! Leadership just won’t listen!
You need to spend time with the team to identify what abnormal behaviors can be measured. Wherever possible, these should be aligned with business outcomes. A few recommendations include:
- All published service level objectives (SLOs). A service should be able to meet its existing promises. This includes bug closure SLOs.
- Alert closure latency. A service should be able to keep up with its alert volume.
- Data accuracy. Any service publishing data should have a measure for its quality. There is little value (if not negative value) in publishing inaccurate data.
- Successful releases. A release is successful if it meets quantifiable canary criteria and runs in production for at least a week or until the next release, whichever comes sooner (a sketch of this calculation follows the list).
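To make the last item concrete, here is a minimal sketch of how release success might be scored against that definition. The record fields and the week-long window are taken from the definition above, but the schema itself is an illustrative assumption, not any particular release system’s API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical release record; field names are illustrative, not from a real tool.
@dataclass
class Release:
    started: datetime
    canary_passed: bool                        # met the quantifiable canary criteria
    rolled_back_at: Optional[datetime] = None
    next_release_started: Optional[datetime] = None

def window_end(r: Release) -> datetime:
    """The release must survive for a week or until the next release,
    whichever comes sooner."""
    end = r.started + timedelta(days=7)
    if r.next_release_started is not None:
        end = min(end, r.next_release_started)
    return end

def release_success_rate(releases: list[Release], now: datetime) -> float:
    """Fraction of evaluable releases meeting the success definition.
    (Simplification: releases whose window has not closed yet are skipped.)"""
    evaluable = [r for r in releases if now >= window_end(r)]
    if not evaluable:
        return 1.0
    successes = [
        r for r in evaluable
        if r.canary_passed
        and (r.rolled_back_at is None or r.rolled_back_at >= window_end(r))
    ]
    return len(successes) / len(evaluable)
```

Even a rough metric like this gives the team and leadership a shared number to watch, which is the point of Step 1.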
Chances are, the team is struggling in many, if not all, of these metrics. In fact, you will probably find that the team will claim an inability to even measure some of them. You should sit down with the TL and manager to review what they have and identify the gaps.
This next moment is critical. You need to make it clear that you are going to help them but they need to trust you because the next steps will be uncomfortable for them. You must be willing to advocate for them with leadership to provide the support they need, and you must follow through on your promise until the team reaches steady state. If you can’t commit to that, find a new project because you’re sacrificing your credibility on future projects if you ask them to step out onto a limb and leave them there.
Step 2: Figure out what the team really needs
The team likely has an idea of what to do. Thing is, it probably won’t work.
Teams in this situation are accustomed to not getting what they need to be successful. They address superficial problems because the hard problems are too difficult to be funded. In essence, they normalize the deviance so much that they can’t even imagine solving the real problem causing their misery. You must work with the team to identify the root cause that will actually bring the metrics back to a healthy state.
Since the team has normalized deviance, they’re likely to push back against your plan as overly broad and unnecessary. Keep pressing on the issue by asking tough (but constructive) questions focused on whether alternatives will actually address the metrics. Compare the efficacy of their approach to your approach and selectively decide to defend or modify your plan. Repeat this process until you have a concrete, actionable plan to bring the metrics to a healthy state.
This part is where the engineering comes in. It can’t be made formulaic because happy teams are all alike, but unhappy teams are all unhappy in their own way. I’ve provided a case study below to illustrate how I did this for capacity forecasting.
Step 3: Create accountability
The metrics in Step 1 are great because leadership simply can’t argue against them without looking clownish. We must meet SLOs because we will be lying to our users if we don’t. We should respond to all alerts because we could be ignoring a user-facing outage if we don’t. We can’t publish inaccurate data because the business won’t work. If releases aren’t successful, we don’t have a service.
As the senior IC, you need to have a private conversation with leadership about the problem and everyone’s role in addressing it. Your role is to ensure that the team has a credible technical plan to address the problem. The team’s role is to execute that plan. Leadership’s role is to ensure the technical plan is prioritized and the team’s manager held accountable to the results. This conversation needs to happen before meeting with the team because a smooth initial meeting will demonstrate your ability to deliver leadership support, which enables the team to trust you.
Arrange a meeting between the team and leadership to review the metrics. The team should explain that they cannot meet the targets unless something changes. Since you’ve prepped them in advance, leadership should state that we must meet these metrics and ask how they can help. The team should present the plan you have reviewed for addressing the problem and ask for it to be prioritized over other commitments. If you’ve prepared them in advance, leadership should agree, freeing up time to execute the plan.
In the end, the meeting itself is mostly performative. It brings formality to the plan, but you did 95% of the work before anybody set foot in the room. The team is taking accountability for fixing the metrics. Leadership is taking accountability to back the team’s plan. Doing this in a formal manner creates a commitment device for all parties that changes the default option in future situations.
After this first meeting, maintain accountability through periodic metrics reviews. I prefer bi-weekly reviews in the beginning to keep the momentum going. Teams hate these reviews, but they’re necessary to reinforce accountability for both the team and leadership. Once the team has demonstrated an ability to make progress without being prodded, the review can switch to email updates with monthly or quarterly reviews.
Step 4: Set proximal objectives
Deviant teams are often a long way from meeting these metrics’ targets. You can drive progress and accountability by creating a proximal objective. A proximal objective is an objective that can be met in the short term. These are useful to create momentum and learn critical skills that are often missing in deviant teams: treating every alert seriously, writing good postmortems for every outage or rollback to understand how to prevent them, and so on.
In the beginning, you will probably choose a proximal objective arbitrarily because the team won’t have enough information until it has taken some initial steps. In the absence of evidence, I like to choose targets at 50%, 75%, 95%, and 99%. The numbers are round enough that people will nod their heads and agree, and they require progressively more rigor to achieve. Using alert response latency as an example (a sketch of the underlying calculation follows the list):
- We can usually respond to 50% of alerts within SLO by deleting low-precision alerts. This requires understanding why each alert exists and deleting the ones that are not specific enough to be useful. These alerts are often replaced by alerts measuring SLO burn directly.
- We can usually reach 75% by building more oncall rigor. This means clear rotations, acknowledgement and escalation procedures, and simple playbooks.
- 95% often requires eliminating the problems causing the alerts. The team has likely been working on this all along, but now it’s gating progress because some alerts are simply hard to debug and address. The 5% that fall out of SLO likely warrant blameless postmortems to understand how to prevent recurrence.
- 99% requires operational rigor where action is taken on every alert, every alert has a playbook, and the team has close relationships with adjacent services’ oncalls to address complex interactions between services quickly.
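Here is the underlying calculation for the alert example, as a minimal sketch: the fraction of alerts acknowledged within their response SLO, compared against the 50/75/95/99 ladder. The numbers and data layout are illustrative, not any real paging system’s export format:

```python
from datetime import timedelta

# Illustrative data only: (time_to_acknowledge, response_slo) for each alert.
alerts = [
    (timedelta(minutes=4),  timedelta(minutes=15)),
    (timedelta(minutes=42), timedelta(minutes=15)),
    (timedelta(minutes=9),  timedelta(minutes=30)),
]

PROXIMAL_TARGETS = [0.50, 0.75, 0.95, 0.99]

def within_slo_fraction(alerts) -> float:
    """Fraction of alerts acknowledged within their response SLO."""
    if not alerts:
        return 1.0
    return sum(ack <= slo for ack, slo in alerts) / len(alerts)

def next_target(fraction, targets=PROXIMAL_TARGETS):
    """The next proximal objective to aim for, or None if all are met."""
    for t in targets:
        if fraction < t:
            return t
    return None

frac = within_slo_fraction(alerts)
print(f"{frac:.0%} of alerts within SLO; next proximal target: {next_target(frac)}")
```

The periodic review from Step 3 then becomes a conversation about one number and the next target, rather than a debate about whether the team is trying hard enough.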
As you progress through the proximal targets, the user experience and the team’s morale will visibly improve. This will build credibility with both the team and leadership, giving the team the leadership support necessary to proceed to the next proximal objective.
Step 5: Do the little things right every day
Now for the hardest part. The team must do the little things right every day to de-normalize deviance and create a culture that normalizes best practices. Failure to reset the culture will result in the team falling back into their old patterns and losing their progress.
The single most important thing to do is communicate the consequences of leadership’s choices. There’s always more work to do and more customers to please. Someday, leadership will tell the team to change priorities. The team must politely but firmly state, “If we do X, then we will not be able to do Y, meaning we will miss goal Z. Do you still want to do X?” Since technical excellence is the easiest thing to deprioritize when deviance is normalized, a firm reminder that “we will not be able to complete our canary process, meaning that we will have more customer-facing outages” tends to turn conversation back towards customer impacts and existential risks.
Here are a few tips for reinforcing best practices:
- Always act like keeping the lights on is the top priority. This means responding to every alert, writing every postmortem, supervising every release, and inspecting every failure. The team is daring leadership to maintain their commitment from that formal meeting you had in Step 3. If leadership asks the team to change behavior, you remind leadership of the consequences of their choices.
- Be consistent. Inconsistency creates a variable reward schedule for external partners where they will keep asking until the team changes its mind. They may start at the manager or ping the oncall to work around the manager. The entire team must redirect new commitments back to the manager or TL. If the team is consistent long enough, partners will learn to stop asking.
- There are two answers to an external request: yes with consequences, or no. Deviant teams tend to self-sabotage because they have learned helplessness. It is emotionally easier to say yes and have them go away than to firmly say no, fight it, and be told to do it anyway. This leads to oversubscribed teams and priorities that shuffle whenever a stakeholder shows up insisting on the deliverable the team promised. If the request comes from inside the team’s reporting chain, the team must tell leadership the consequences of their choices. If the request comes from outside the reporting chain, they should escalate the priority and inform leadership of the consequences of their choices.
- Write a blameless postmortem for every service outage, no matter how small. Some people may consider this overkill. Writing a postmortem forces people to think deeply about why outages are happening in the first place. The team should add postmortem action items that detect and prevent recurrence to the plan and communicate these new tasks to leadership.
Over time, the team will come to understand that the organization values doing things right and start believing in it themselves. Then, all those decisions you don’t have time to observe start getting made the right way.
Failure modes
This isn’t a foolproof strategy. I have seen it fail in a few different ways:
- Leadership changes their mind. If leadership loses their nerve, you should remind them about the long-term business consequences. A minor staffing correction when the plan is well underway is one thing, but if they systemically starve the plan, you might want to start looking for a new team because they’re knowingly incurring an existential business risk.
- The problem is bigger than the initial plan. Lots of teams will lose faith because they believe leadership will never support a change to the plan that dramatically increases the scope. Proximal objectives can mitigate this failure mode. If the jump from 75% to 95% requires a full refactor or something, the team can insert a new proximal objective and communicate why to leadership at the next review. The problem’s importance hasn’t changed, only the difficulty of fixing it. You should support the team by explaining to leadership why they need that huge refactor.
- Manager self-sabotages the process. Old habits die hard. Once the commitment has been made in Step 3, leadership must measure the manager’s performance on meeting those goals and on escalating newly discovered work needed to meet them, to ensure the manager is doing their part. You should point out the problem to the manager but escalate if it continues. In my experience, this is the most common failure mode.
- Selfish individuals sabotage the process for personal gain. Somebody on the team may need to pivot to working on the plan even though their promotion project is almost done. These people are most susceptible to side-channel requests because they want to say “yes” but somebody else is telling them to denormalize deviance. Ideally, managers should assign people in a way that is minimally disruptive to the team, but if that’s not possible, leaders and managers should try to find a way to recognize the engineer’s selflessness and impact on the team. You should contribute to the promotion case, if possible. If that still doesn’t work, then it comes down to accountability — the team needs X, and the engineer is accountable for delivering X even if they don’t like it.
Case study: Capacity forecasting
This whole process is probably best illustrated with a real-life example.
Back at Google, I parachuted into a capacity forecasting team that was widely considered to be struggling. Their manager had a plan to fix it, if only leadership would fund it.
I asked how the team measured success and was pointed towards a dashboard showing forecast accuracy with a nice horizontal line indicating their SLO. The existing metrics were within SLO but nobody was happy. Either the metric was wrong or the business was wrong. This led me to doubt the team’s proposed plan.
No matter who or how I asked, the team couldn’t provide me with a convincing accuracy metric that illustrated consumers’ problems. Forecasts are always wrong, so accuracy is really a question of understanding how much error is acceptable. This led to a deep exploration of the accuracy metric and developing operational processes to manage outlier forecasts. We would use the outlier investigations to figure out what steps were necessary to fix the problem.
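As a rough illustration of where that exploration leads, here is a sketch of the kind of outlier flagging it can produce: compute each forecast’s relative error and queue anything outside the acceptable band for investigation. The 20% threshold and the data are placeholders, not the team’s actual metric:

```python
# Placeholder threshold: how much relative error the business can absorb.
ACCEPTABLE_RELATIVE_ERROR = 0.20

def relative_error(forecast: float, actual: float) -> float:
    """Absolute relative error of a single forecast."""
    if actual == 0:
        return 0.0 if forecast == 0 else float("inf")
    return abs(forecast - actual) / abs(actual)

def outliers(forecasts):
    """Yield (name, error) for every forecast outside the acceptable band."""
    for name, forecast, actual in forecasts:
        err = relative_error(forecast, actual)
        if err > ACCEPTABLE_RELATIVE_ERROR:
            yield name, err

data = [
    ("cluster-a", 1200, 1000),  # 20% over: at the edge of acceptable
    ("cluster-b", 400, 1000),   # 60% under: queue for investigation
    ("cluster-c", 980, 1000),   # 2% under: fine
]
for name, err in outliers(data):
    print(f"{name}: {err:.0%} error -> investigate and write up the root cause")
```

Each investigation then tells you whether the model, the data, or the way the business consumes the forecast is at fault.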
The TL named this program “quantitative excellence.” We created a bi-weekly review to maintain accountability for the outlier root cause investigations. This review demonstrated several problems with the forecast and how it was used by the business. The team figured out how to qualify new models to avoid regressions. Notably, many of the team’s original ideas were irrelevant to fixing the problems.
Along the way, we needed to correct several cultural issues:
- I advocated for quantitative excellence among senior leadership: our senior director, distinguished and principal engineers, planners, etc. They enthusiastically supported it when they understood that the program would bridge the gap between the team’s metrics and their experiences. I worked with our organization’s leadership to ring-fence headcount to work on quantitative excellence, free from outside interference.
- We couldn’t get TL bandwidth from one sub-team because the sub-team was significantly oversubscribed, leading to lots of priority shuffling. I addressed this by escalating the concern to leadership, which put pressure on the manager to correct the problem by reinforcing accountability towards the goals.
- One sub-team needed to decommit from a significant project and kept putting up bureaucratic roadblocks to our progress, which we dealt with by reinforcing accountability to meet the goals. I met with leadership to emphasize how they needed to reward the work prior to the shift to eliminate these behaviors in the future.
- Nobody wanted to do the investigations, so we created a rotation where people were held accountable for reviewing outliers. Reviews took longer than expected, so we communicated how the plan would change. Leadership was happy to see the investigations would be happening and accepted the changes without drama.
- Sub-teams treated each other as external partners, requiring business requirements, design documents, and so on. We reinforced that everyone on the program was one team with sufficient representation from sub-teams to translate program needs to their own domains. I brought the problem to the manager’s attention to take care of it and leadership’s attention to create accountability for the manager.
In the end, capacity forecasting’s problems were relatively simple from a technical perspective and rooted in normalizing bad behaviors. TLs were able to execute once we addressed the culture.