You only get one chance to ask everyone to do something
Most horizontal software programs (programs that require every team in an organization to take action) die because they are organizationally difficult to prioritize and technically hard to execute.
This guide describes how I have learned to run horizontals. So far, I have an 80% success rate using this technique. To briefly summarize:
- Know exactly what you will do with it.
- Generate a sample output to verify it is useful.
- Convince leadership that it is really important.
- Create user-centric instructions.
- Pilot the instructions with a trusted group.
- Support everyone as they do it.
Context: The Reliability Reset disruption
Google Cloud leadership sent out an email in July 2019 requiring all services to immediately stop feature development and resolve multi-zonal reliability issues. In the end, I think this was one of the best things to ever happen to Cloud, but describing it as “bumpy” would be an understatement.
The Reliability Reset was announced days after I started my role as Uber Tech Lead for Data Center Software (DCSW), so I volunteered to lead DCSW’s response. I immediately faced a mountain of problems, including:
- What exactly is a reliability risk?
- Should every small concern be filed as a risk or should we bundle them into larger risks?
- “Zone” is a Cloud-specific term. What does a “zone” mean for non-Cloud services?
- What about global services that aren’t on the serving path?
Every team in Cloud was left to make these decisions themselves. It was abject chaos. Teams that diligently identified risks were vilified on leadership dashboards, whereas teams that phoned it in had fewer hassles because leadership didn’t account for differently sized risks. The horizontal’s leadership thrashed service owners by changing its guidance as it learned more about the edge cases.
I made quite a few unilateral decisions inside DCSW to help people make progress, but none of them were ratified by the people running the Reliability Reset. It took months to answer these questions, leading to confusion and wasted time. Once things settled down, my technical program manager and I spent some time writing a postmortem to determine what we should have done instead. Since then, I have run multiple horizontal programs using the lessons from that postmortem. This post describes them.
You only get one chance to ask everyone to do something
If you take nothing else away from this post, remember this principle: you only get one chance to ask everyone to do something.
If you don’t get it right the first time, teams will be universally angry and skeptical that it will work the second time. It is highly unlikely you will get a second chance, and in the unlikely event you do, it will be excruciatingly painful to execute.
Even if you get it right the first time, you will still have challenges. It may be your most important task, but every team out there has its own goals and priorities. You are asking them to preempt their normal priorities, which is a high bar to clear. You need to convince their leadership to prioritize your request over their own and provide them with clear enough instructions that they don’t regret it.
You need to do a lot of homework before even making the request to ensure the request is actionable. A horizontal request will hit lots of edge cases. You need to understand most of those edge cases before reaching out to keep teams from weaseling out of your project because it doesn’t make sense for them.
Running example: Capacity reassignment latency
I will use an example throughout all these steps to illustrate the process.
I once worked on a capacity management project that needed to know how long various capacity operations took to complete so we could build a roadmap to improve them. The operations’ durations depended on the amount of capacity being reassigned, and each step took a highly variable amount of time.
Step 1: Know exactly what you will do with it
Imagine what the desired outcome looks like. What specific thing can you do with it? How does that fit into the broader ecosystem? This outcome should be impactful enough to justify asking everyone to do the work required to achieve it. If you can’t clearly articulate the outcome, do not proceed because you will never get everyone to do it anyway.
Example
We wanted the capacity management roadmap to be data-driven. Since capacity management was really important, it was pretty easy to get people on board with it.
Step 2: Generate a sample output to verify it is useful
Fake the outcome you expect from each component to determine if it is sufficient to generate the desired overall outcome. You should write a paragraph or two explaining how to interpret the output to ensure it makes sense.
Example
Capacity operation latency is complicated to model and visualize because the latency depends on the size of the operation and the time required in each step varies randomly. Each step can generate parallel sub-steps, each of which also varies randomly and can generate sub-steps of its own.
The output needs to visualize percentiles to account for each step’s variability. This is complicated because percentiles don’t compose: the sum of the individual steps’ percentiles exceeds the same percentile of the whole operation, since the steps rarely hit their worst cases in the same run. We came up with a visualization that looked like the following picture.
I’ll spare you the (many) details. This visualization enables us to identify what steps contribute the most latency at each percentile, empowering us to create the improvement roadmap for the program. Each figure illustrates a different operation size.
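To make the percentile subtlety concrete, here is a minimal simulation, using made-up lognormal step latencies rather than our real data, showing why per-step p99s cannot simply be stacked to estimate the end-to-end p99:

```python
import random

random.seed(42)

def percentile(samples, p):
    """Return (approximately) the p-th percentile of the samples."""
    ordered = sorted(samples)
    return ordered[min(int(p / 100 * len(ordered)), len(ordered) - 1)]

# Three sequential steps with made-up lognormal latencies (in hours).
trials = [
    [random.lognormvariate(mu, 0.5) for mu in (0.0, 1.0, 0.5)]
    for _ in range(100_000)
]

stacked_p99 = sum(percentile([t[i] for t in trials], 99) for i in range(3))
true_p99 = percentile([sum(t) for t in trials], 99)

# Stacking per-step p99s overstates the end-to-end p99 because the
# steps rarely hit their worst cases in the same operation.
print(f"sum of per-step p99s:  {stacked_p99:.1f}h")
print(f"actual end-to-end p99: {true_p99:.1f}h")
```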
The sample data required to generate this image defines each service’s required output. Each step needs to report its total execution time and its parent-child relationships to its subtasks so we can separate the time spent in the parent from the time spent waiting for the children to complete.
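A minimal sketch of what such a per-step record and the parent-versus-children split might look like (the field names and types are my own illustration, not the schema we actually used):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StepRecord:
    """One timed step of a capacity operation, reported by a service."""
    step_id: str
    parent_id: Optional[str]  # None for the operation's root step
    operation_id: str         # joins all steps of one capacity operation
    start_seconds: float
    end_seconds: float

def self_time(step: StepRecord, children: List[StepRecord]) -> float:
    """Time spent in the step itself, excluding waiting on its children.

    Children can run in parallel, so we subtract the union of their
    time intervals rather than the sum of their durations. Assumes
    children start and finish within the parent's interval.
    """
    covered = 0.0
    cursor = step.start_seconds
    for start, end in sorted((c.start_seconds, c.end_seconds) for c in children):
        start = max(start, cursor)
        if end > start:
            covered += end - start
            cursor = end
    return (step.end_seconds - step.start_seconds) - covered
```

The split matters because a step that spends 90% of its wall time waiting on sub-steps is not itself the bottleneck.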
Step 3: Convince leadership that this is really important
You don’t need to convince every team to do something. You need to convince their leadership to have their teams do something. Then, you need to ensure there’s no credible excuse for their teams weaseling out of it (see steps 4–6).
You accomplish this by converting the output into an impact metric and gaining leadership support for actualizing that impact by disrupting every team’s roadmap. Ideally, you can convert this into financial impact. This step may require some quick and dirty hacks to measure a proxy metric.
Example
We wanted to build a roadmap, but we couldn’t estimate the opportunity without the result. Instead, we measured the end-to-end latency without the detailed breakdown using the low-fidelity data available, planted a flag in the ground with a credible target, and measured the financial impact of moving from the current state to the target state. We would then revise the target once we had the data and the roadmap.
Step 4: Create user-centric instructions
You need to clearly explain what you want users to do in a way that minimizes errors and friction. You also want to make the work so easy that teams can’t convince their leadership to de-commit from the request.
This process is somewhat unintuitive. The key is to remember that users just want this request to go away so they get back to their work. Some tips include:
- Design documents are not instructions. Users don’t need the background, alternative implementations, or system designs. They just want this request to go away, so create instructions that are exclusively instructions.
- Push all extraneous information to the appendix or links. This is not the time to show off. Your readers do not need to understand why. They don’t need to know what you’re going to do with it. They just want this request to go away.
- Put the instructions first. People don’t want to read eight pages of flavor text before the instructions. They just want this request to go away.
- Link to sample implementations. Ideally, you should provide examples that are easily copied and pasted. The faster the user can make this request go away, the more likely it is to be completed.
- Provide an escalation process if users encounter problems. Email and Slack channels work moderately well. Office hours are amazing. You can use the escalations as an indicator that your instructions may need to be improved.
Example
We defined a structured schema for logging operations, wrote instructions to apply it, and provided examples of existing implementations. Questions were directed to a mailing list; most of them were root-caused to users not reading the instructions.
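The copy-pasteable examples were the key to adoption. A sketch of what one might have looked like (the wrapper, names, and plain-print emission are hypothetical stand-ins for our actual logging library):

```python
import json
import time
import uuid
from contextlib import contextmanager
from typing import Optional

@contextmanager
def timed_step(step: str, operation_id: str, parent_id: Optional[str] = None):
    """Wrap one step of a capacity operation and emit a structured record."""
    step_id = uuid.uuid4().hex
    start = time.time()
    try:
        yield step_id  # sub-steps use this as their parent_id
    finally:
        print(json.dumps({
            "step": step,
            "step_id": step_id,
            "parent_id": parent_id,
            "operation_id": operation_id,
            "start_seconds": start,
            "end_seconds": time.time(),
        }))

# Usage: one step with a nested sub-step.
with timed_step("drain_machines", operation_id="op-123") as parent:
    with timed_step("migrate_vms", operation_id="op-123", parent_id=parent):
        time.sleep(0.1)  # stand-in for real work
```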
Step 5: Pilot the instructions with a trusted group
You only get one chance to ask everyone to do something. However, you may have some allies that enable you to iterate. Use these trusted allies to test the instructions, highlight edge cases, and iterate until the instructions are crisp.
Example
We applied the logging to our machine disruption manager first. This gave us the ability to measure some important operations without other services’ help and refine our roadmap. We identified interesting edge cases that we incorporated into the schema. This pilot also revealed that our visualization needed to account for cases where capacity assignments were blocked by two parallel services.
Step 6: Support everyone as they do it
Finally, you ask everyone to do it.
In this step, you vigorously address every escalation your users identify to create a positive experience. Over time, you may discover that the instructions can be simplified further based on user feedback. You may elect to broadcast improvements to an announcement email alias as they are discovered.
You also need to track completion using some form of coverage metric to identify stragglers. These stragglers should be escalated to their leadership.
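The coverage metric can be as simple as comparing the services that have emitted records against a service registry. A minimal sketch (the service names and data sources are hypothetical):

```python
from typing import List, Set, Tuple

def coverage_report(all_services: Set[str],
                    reporting_services: Set[str]) -> Tuple[float, List[str]]:
    """Compute completion and list stragglers for leadership escalation."""
    stragglers = sorted(all_services - reporting_services)
    coverage = 1 - len(stragglers) / len(all_services)
    return coverage, stragglers

coverage, stragglers = coverage_report(
    all_services={"scheduler", "migrator", "drainer", "validator"},
    reporting_services={"scheduler", "migrator"},
)
print(f"coverage: {coverage:.0%}, stragglers: {stragglers}")
```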
Example
Thanks to our pilot, we knew how long the coarse steps took, allowing us to determine what fraction of the time was unaccounted for. Then, we escalated the stragglers to leadership.
Why hasn’t this been 100% effective?
I mentioned earlier that I have an 80% success rate with this process. We failed to execute a technical debt horizontal using this process because the difficulty increased to the point where leadership support dropped.
DCSW developed a process for measuring and managing new technical debt that was working quite well. Since we were having success with it, we decided to give an engineering review describing our approach. Our second-level VP loved the success story and wanted to see it across his entire organization. He nominated some collaborators, and we were off to the races.
Then, his nominees wanted to go even further by cataloguing existing technical debt.
DCSW only cared about new technical debt because it was easy to track as we created it. Existing technical debt requires surveying all the existing code, which is extremely time-consuming and useful solely for establishing a baseline metric. I brought this up very early. However, the nominees insisted, which increased the horizontal’s difficulty and enabled their teams to weasel out of the request after 1.5 calendar years of effort.