How to build a large system diagram
You have a problematic ecosystem and decide to diagram the entire thing. Now you have two problems. These suggestions can help.
Stop me if you’ve heard this story before. A new principal engineer shows up and asks how everything works. Since nobody can explain it, the principal engineer recommends that we all draw a system diagram so future hires can reference it.
Odds are, this effort will silently fade into the background after about 2–3 months because it is way harder than everyone thinks. Believe it or not, it is actually possible to draw a system diagram successfully, but you need to set some expectations up front. This post describes how to avoid common pitfalls of naïve system diagramming based on my involvement in several failed attempts.
Focus on a single important user journey
Most people want to draw the entire system diagram. This will fail because the scope is too large. There are just too many edges coming out of each service for every possible use case, leading to too many nodes and an effectively unbounded scope. Coordinating the sheer number of people involved will become too expensive, and the effort will die.
You can justify the effort by focusing on a single important user journey. This diagram helps optimize latency, debugging, complexity, etc. for something valuable. It’s even more effective if you can approximate that value. For example, we drew a system diagram for cluster deployments because each day of delay was worth a ton of money. Understanding that process was critical to optimizing latency.
Decide if you want a system or sequence diagram
Everyone defaults to a system diagram. I think this is a mistake.
If you want to know how something works, you almost certainly want a sequence diagram because it shows how data propagates through the ecosystem to accomplish the task. You start at the top and read to the bottom, illustrating how information flows from start to finish.
If you want to understand dependencies, you almost certainly want a system diagram because it shows the relationships between services. You can point at a single service and know exactly what can break it.
I point out this distinction because a system diagram rarely explains how things work. System diagrams often have back edges. They are often unclear about where the initial request begins or ends. A system diagram requires additional description that is self-explanatory in a sequence diagram.
Furthermore, sequence diagrams can often be delegated by asking a service tech lead, “What happens after they call your RPC?” It was easy for me to create a detailed end-to-end sequence diagram for capacity assignment operations by starting at the beginning and talking to the tech lead associated with each RPC called by the previous person I spoke with. The problems became clear, demonstrating what was needed to accelerate the process.
Prefer high-level nodes over low-level nodes
I used to run a service called the Eye of Sauron that collected and archived power and cooling data [1]. As a software team, we distinguished the Eye from our visualizations, health pipeline, and control systems. The operations teams referred to everything our team produced as “The Eye”.
Now imagine that we’re both drawing a system diagram. The Eye of Sauron team produces a detailed diagram that the operations team considers a single node. Each team has a valid view of the world, but they can’t be merged easily because someone needs to understand the detailed view well enough to understand what the edge on the high-level view means. This gets worse when the nodes do not neatly overlap.
Another problem occurs when one diagram defines a node by service boundaries and another diagram uses team boundaries. We need to reconcile the team boundaries to the service boundaries. This gets worse when the team is a virtual team managing a workflow involving several services.
You can only draw a system diagram if everyone agrees on a node’s granularity. My personal recommendation is to prefer higher-level nodes to lower-level nodes to hide complexity. When in doubt, group the nodes as external users view them, rather than as service owners view them.
This may lead to conflict if external users consider two different services from two different teams to be part of the same node. If you’re senior enough to engage with it, this should lead to a conversation about why they’re two teams and services in the first place.
Define an edge’s direction and label very carefully
Suppose service A initiates an RPC to get data from service B. Which direction does the arrow point? What label do you place on the edge?
You have at least two choices for each. If you define the direction based on the RPC’s initiation, it points from A to B. If you define the direction based on the data flow, it points from B to A. The edge may be labeled by the RPC being called or the data being returned.
I argue that the selections are dependent. If you define the direction based on RPC initiation, the edge label should be the RPC name. If you define the direction based on the data flow, the edge label should be the data being returned. If you swap labels, it gets confusing because it looks like service A provides the data to service B or that service B calls service A’s RPC.
My recommendation differs based on whether you’re doing it by hand or through automation. If you’re drawing the diagram by hand, define the edges based on data flow because it is more natural. If you’re using automation, it’s too hard to determine the data flow’s direction, so use RPC initiation because the caller and callee are always known.
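Here is a minimal Python sketch of the two conventions. The Call record, the to_edge and to_dot_edge helpers, and the GetCapacity RPC are hypothetical names I made up for illustration. The point is only that direction and label are chosen together, never independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller: str  # service that initiates the RPC
    callee: str  # service that serves the RPC
    rpc: str     # name of the RPC being called
    data: str    # short description of what the RPC returns


def to_edge(call: Call, convention: str) -> tuple[str, str, str]:
    """Return (tail, head, label) under one of the two conventions."""
    if convention == "rpc":
        # Arrow follows RPC initiation, so the label is the RPC name.
        return (call.caller, call.callee, call.rpc)
    if convention == "data":
        # Arrow follows the data, so the label is the data being returned.
        return (call.callee, call.caller, call.data)
    raise ValueError(f"unknown convention: {convention!r}")


def to_dot_edge(call: Call, convention: str) -> str:
    """Format the edge as a single Dot statement."""
    tail, head, label = to_edge(call, convention)
    return f'"{tail}" -> "{head}" [label="{label}"];'


call = Call(caller="A", callee="B", rpc="GetCapacity", data="capacity report")
print(to_dot_edge(call, "rpc"))
print(to_dot_edge(call, "data"))
```

Because one argument selects both the direction and the label, the confusing mixed cases cannot be produced by accident.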
Create norms for joining subgraphs
I have alluded to this a few times already. Individual tech leads will provide their view of how their service works. These drawings will differ from adjacent drawings. They will include more or fewer edges. The edges will have different labels. The edges will connect different nodes. Someone needs to reconcile the merges.
You can generally reduce the merge conflicts if you are using higher-level nodes. When two drawings disagree, I generally ask the tech leads on both sides to clarify the relationship with each other. This process requires you to set rules for whoever does the merge.
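One possible set of merge rules, sketched in Python. This is an illustration and not the tool I used: the merge_views function, the service names, and the three rules themselves (matching labels merge, differing labels are conflicts, edges drawn by only one side are kept but flagged) are assumptions you should replace with your own norms.

```python
Edge = tuple[str, str]  # (tail, head)


def merge_views(a: dict[Edge, str], b: dict[Edge, str]):
    """Merge two edge->label maps drawn by different tech leads.

    Returns (merged, conflicts, one_sided):
      merged    - edges both sides agree on, plus edges only one side drew
      conflicts - edges both sides drew with different labels
      one_sided - edges only one side drew, flagged for a follow-up chat
    """
    merged: dict[Edge, str] = {}
    conflicts: list[tuple[Edge, str, str]] = []
    one_sided: list[Edge] = []
    for edge in sorted(a.keys() | b.keys()):
        label_a, label_b = a.get(edge), b.get(edge)
        if label_a is not None and label_b is not None:
            if label_a == label_b:
                merged[edge] = label_a
            else:
                # Do not guess: a human has to reconcile the labels.
                conflicts.append((edge, label_a, label_b))
        else:
            merged[edge] = label_a if label_a is not None else label_b
            one_sided.append(edge)
    return merged, conflicts, one_sided


inventory_view = {("planner", "inventory"): "ListMachines",
                  ("inventory", "repairs"): "FileTicket"}
repairs_view = {("inventory", "repairs"): "CreateRepair",
                ("repairs", "vendor"): "OrderPart"}
merged, conflicts, one_sided = merge_views(inventory_view, repairs_view)
```

In this example the two leads disagree on the label for the inventory-to-repairs edge, so it lands in conflicts instead of being silently overwritten.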
Use Graphviz until late in the process
You will drive yourself insane trying to draw the graph by hand. Please trust me on this.
Use the Dot graph description language to draw the graph. Graphviz will recompute the layout every time you add a node or an edge. Representing the drawing in this language will reveal the merge conflicts at data entry. Sadly, no open source tool exists to merge two Dot digraphs automatically, so you’ll need to do some copy-paste or write the tool.
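If you keep the edges in a small data structure, generating the Dot text is a few lines. The sketch below is a hypothetical helper (render_dot and quote are my names, and the left-to-right rankdir is just a layout choice), not part of Graphviz itself.

```python
def quote(name: str) -> str:
    """Quote a Dot ID. Inside double quotes, only the quote needs escaping."""
    return '"' + name.replace('"', '\\"') + '"'


def render_dot(edges: dict[tuple[str, str], str], name: str = "system") -> str:
    """Render an edge->label map as a Dot digraph, in a stable order."""
    lines = [f"digraph {quote(name)} {{", "  rankdir=LR;"]
    for (tail, head), label in sorted(edges.items()):
        lines.append(f"  {quote(tail)} -> {quote(head)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)


edges = {("A", "B"): "GetCapacity"}
print(render_dot(edges))
```

Save the output to a file such as system.dot and run `dot -Tsvg system.dot -o system.svg` to lay it out. Sorting the edges keeps the text stable, which makes diffs between two versions of the drawing readable.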
Once you have a really good idea of what the drawing looks like, you can either adjust the Graphviz layout until you’re happy with it or create a final hand-optimized drawing using the Graphviz layout as a starting point.
Automate discovery or accept obsolescence
If you build a diagram by hand, it will be obsolete before you complete it because services are constantly evolving. That may be okay if your drawing is extremely high-level, but it will never work at lower levels. The only way to maintain a lower-level diagram is to automate its creation.
When you try this, you will immediately discover that all automated diagrams are terrible because the granularity is too fine to be useful. For example, a service replicated in each location should be one node, not hundreds of nodes. Similarly, everything behind the externally-facing API should be hidden inside the node. Also, it’s not terribly useful to know that a centralized monitoring platform is scraping your metrics or how your release automation calls your service.
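A rough Python sketch of that cleanup follows. It assumes job names look like "service.location", which is a naming convention I invented for the example, and the collapse and node_of helpers are hypothetical. Your own mapping from jobs to nodes will depend on whatever grouping data you have.

```python
def node_of(job: str) -> str:
    """Map a replicated job to its high-level node.

    Assumes a hypothetical "service.location" naming convention.
    """
    return job.split(".", 1)[0]


def collapse(raw_edges, ignored):
    """Collapse job-level call edges into high-level node edges.

    raw_edges: iterable of (caller_job, callee_job) pairs
    ignored:   high-level nodes to drop, e.g. monitoring or release tooling
    """
    edges = set()
    for caller, callee in raw_edges:
        tail, head = node_of(caller), node_of(callee)
        if tail == head:
            continue  # traffic inside a node stays hidden
        if tail in ignored or head in ignored:
            continue  # infrastructure noise that touches every service
        edges.add((tail, head))
    return sorted(edges)


raw = [
    ("inventory.us-east1", "scheduler.us-east1"),
    ("inventory.eu-west1", "scheduler.eu-west1"),
    ("inventory.us-east1", "inventory.eu-west1"),
    ("monitoring.global", "inventory.us-east1"),
    ("release.global", "scheduler.eu-west1"),
]
print(collapse(raw, ignored={"monitoring", "release"}))
```

Five job-level edges shrink to one edge from inventory to scheduler, which is much closer to what a reader of the diagram actually wants to see.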
At Google, another team wrote a service that grouped jobs and RPCs by their accounting users. I used the accounting user as the node granularity and wrote automation that aggregated all their data into hierarchical SVG drawings with hyperlinks between them [2]. The big drawing showed only the big nodes by accounting users. Clicking on the big nodes would show you a subgraph with all the internals for that service connected to the big nodes from adjacent services. This drawing could be generated automatically to keep everything up to date.
It was still too detailed out of the box, but you could make it useful by spending some time reviewing the graph. I limited the initial scope to a small set of services known to be part of the user journey. Then, I reviewed the output and expanded or pruned the scope based on chats with tech leads. It was pretty good for understanding how machine management operations worked until I switched to a sequence diagram.
You can use structured logging as another approach. If your logs include consistent entity IDs, you can produce a topological sort that approximates a sequence diagram and go from there. Retries are annoying to deal with, but you can use the transition frequency as a heuristic. I never verified this approach, but it seemed reasonable.
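To make the idea concrete, here is an unverified Python sketch. It assumes each log record can be reduced to an (entity_id, timestamp, service) tuple, and the min_count threshold is an arbitrary stand-in for the transition frequency heuristic. The function names are hypothetical. It uses graphlib from the standard library (Python 3.9 or later).

```python
from collections import Counter, defaultdict
from graphlib import TopologicalSorter


def transition_counts(events):
    """Count service-to-service transitions across all entities."""
    by_entity = defaultdict(list)
    for entity_id, timestamp, service in events:
        by_entity[entity_id].append((timestamp, service))
    counts = Counter()
    for steps in by_entity.values():
        steps.sort()  # order each entity's events by timestamp
        for (_, prev), (_, nxt) in zip(steps, steps[1:]):
            if prev != nxt:  # repeated hits on one service look like retries
                counts[(prev, nxt)] += 1
    return counts


def approximate_sequence(events, min_count=2):
    """Topologically sort the transitions seen at least min_count times."""
    sorter = TopologicalSorter()
    for (prev, nxt), n in transition_counts(events).items():
        if n >= min_count:  # rare transitions are treated as noise
            sorter.add(nxt, prev)  # nxt comes after prev
    return list(sorter.static_order())


events = [
    ("e1", 1, "frontend"), ("e1", 2, "scheduler"), ("e1", 3, "allocator"),
    ("e2", 1, "frontend"), ("e2", 2, "scheduler"), ("e2", 3, "allocator"),
    # e3 bounces back to the scheduler once before finishing.
    ("e3", 1, "frontend"), ("e3", 2, "scheduler"), ("e3", 3, "allocator"),
    ("e3", 4, "scheduler"), ("e3", 5, "allocator"),
]
print(approximate_sequence(events))
```

The single allocator-to-scheduler bounce falls below the threshold, so it is dropped and the remaining transitions sort cleanly. If frequent back edges survive the threshold, static_order raises graphlib.CycleError, which is a hint that the flow is not a simple sequence.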
Accept usefulness over perfection
Your diagram will not be perfect. The sooner you accept this, the sooner you can admit that usefulness is the goal. Define your use case early and solve only that problem.
The most useful system diagram we ever built had about 10 blocks in it for a team of 2,000. It was cartoonish and manually generated, but it explained to new hires the rough process for deploying clusters and who to ask about the steps. The drawing was useful three years later because the big blocks didn’t change enough to justify updates.
Another useful diagram was a cartoonish sequence diagram I used as a crash course for how deployments, repairs, upgrades, and decommissioning worked. It had only 15 blocks but was useful enough to train new team members about how our entire organization worked.
My capacity assignment sequence diagram was useful because it answered questions on the spot. I was the only one who used it directly, but I could say with certainty how things worked because each service’s tech lead had code reviewed my sequence diagram for accuracy. There was no need for another meeting to discuss questions because I already had the answer.
Each of these diagrams was successful because it was useful. I hope these lessons will help you find similar success. Good luck.
Footnotes
[1] The Eye of Sauron saw everything in the data center, and its gaze produced a focus ray of hatred on Bigtable tablets during site turn-ups. The name was funnier at the time, and I really wish we had randomized the row keys.
[2] Fun fact: I started writing this tool to merge Dot digraphs because no open source tool existed. Unfortunately, it used a ton of internal Google libraries, so I couldn’t open source my solution. Sorry.