Advancing the State of the Art for Engineering Leadership

Data literacy is one of the more underrated parts of the software engineering skillset. When you’re dealing with a complex, dynamic, evolving system, being able to reason about data is at times more important than institutional knowledge, which tends to become outdated. Understanding a single library or subsystem really well often isn’t good enough. And when you transition to engineering leadership, grow your team, and focus more on the big picture, keeping up with every technology change isn’t feasible.

In this post, I’ll share several patterns I look for and what they tell you about how a feature or subsystem is performing.

Foo Service

Let’s use FooService as an example. When you look at the source code for FooService for the first time, you’ll probably be very confused. There’s a config object being passed in with a mysterious flag. It’s difficult to reason about exactly what each code path is doing and how often it’s followed. How can we even begin to reason about performance characteristics in production? We have some profiling information available, so let’s find the appropriate graph and see what it tells us.
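To make this concrete, here’s a minimal sketch of what such a service might look like. Everything here — `FooConfig`, the `use_fast_path` flag, and both code paths — is hypothetical, invented purely to illustrate the “mysterious flag” problem:

```python
from dataclasses import dataclass


@dataclass
class FooConfig:
    # The "mysterious flag": nothing in the code tells you how often
    # it's set in production, or what it costs when it isn't.
    use_fast_path: bool = False


class FooService:
    def __init__(self, config: FooConfig):
        self.config = config

    def handle(self, request: str) -> str:
        # Two code paths with very different performance profiles.
        # Reading the source alone won't tell you which one dominates
        # in production -- that's what the profiling data is for.
        if self.config.use_fast_path:
            return request.upper()           # cheap path
        return "".join(sorted(request))      # expensive path
```

The point isn’t this particular code; it’s that static reading stops here, and only production data can tell you how the two branches actually behave.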

The Blip

You see a perf regression that looks like noise at first. The week-over-week graph shows that it’s actually a periodic regression. It doesn’t correlate with a periodic increase in app usage. What’s going on? This is often an indication that you have a warm/cold dynamic somewhere in your codebase. When some part of your application is updated, the initial session for every client experiences degraded performance followed by completely normal performance. When you see a blip, try to find a pattern rather than treating it as random noise.

Examples

Seeing a blip is not necessarily a bad thing, but it’s important not to dismiss blips as random noise, since they represent bottlenecks and optimization opportunities.
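A crude way to tell a periodic blip from a one-off regression is a week-over-week comparison. This is a sketch, not real instrumentation — `find_blips`, the threshold, and the sample series are all assumptions for illustration:

```python
def find_blips(daily_p95, window=7, threshold=1.5):
    """Flag days whose p95 latency exceeds `threshold` times the value
    on the same weekday one week earlier (a week-over-week check)."""
    blips = []
    for i in range(window, len(daily_p95)):
        if daily_p95[i] > threshold * daily_p95[i - window]:
            blips.append(i)
    return blips


# A spike every 7th day lines up with itself week over week, so the
# week-over-week check stays quiet -- the regression is periodic.
periodic = [100.0] * 28
for day in (6, 13, 20, 27):
    periodic[day] = 300.0
print(find_blips(periodic))   # periodic spikes cancel out

# A single spike does NOT line up with the previous week, so it
# stands out against the baseline.
one_off = [100.0] * 28
one_off[10] = 300.0
print(find_blips(one_off))    # the lone spike is flagged
```

If the week-over-week view is quiet but the raw graph shows recurring spikes, that’s your hint to go looking for a warm/cold dynamic rather than writing the spikes off as noise.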

Bimodal Distribution

The histogram reveals multiple modes. In other words, there isn’t a single most common value but multiple values that are far apart. There’s no discernible pattern when you slice by demographic. What’s going on? This is often a sign of one or more really expensive code paths being executed part of the time. It doesn’t necessarily mean that something is wrong but, unless the user experience is significantly different for each mode, there’s probably an optimization opportunity here.
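A rough way to spot multiple modes programmatically is to bucket the samples and look for local maxima in the histogram. This is a sketch, not a rigorous mode-detection method; the bucket size and the sample latencies are invented for illustration:

```python
from collections import Counter


def histogram_modes(samples, bucket_size=10):
    """Bucket samples into fixed-width bins and return the bins that
    are local maxima. More than one result suggests a bimodal
    (or multimodal) distribution."""
    counts = Counter((s // bucket_size) * bucket_size for s in samples)
    modes = []
    for b in sorted(counts):
        left = counts.get(b - bucket_size, 0)
        right = counts.get(b + bucket_size, 0)
        if counts[b] > left and counts[b] > right:
            modes.append(b)
    return modes


# Hypothetical latencies clustering around ~20ms (e.g. a cache hit)
# and ~200ms (e.g. a cache miss that takes the expensive code path).
samples = [18, 19, 21, 22, 23, 25, 198, 201, 203, 207]
print(histogram_modes(samples))  # -> [20, 200]: two modes, far apart
```

Two widely separated modes like this are exactly the signature described above: some fraction of requests is silently taking a much more expensive path.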

CDF Outliers

The CDF levels off sharply at p99, indicating a big increase in page load times for that percentile. It increases even more sharply when you zoom in at p99.9. Is your instrumentation broken somehow? Is this the result of a runaway query or zombie process somewhere in the system? What’s going on? This is often an indication that you have big fish or celebrities in your system with massively degraded performance.
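The p99-versus-p99.9 jump is easy to reproduce over synthetic data with a simple nearest-rank percentile. The latency values here are invented; in a real system these would come from your metrics pipeline:

```python
def percentile(sorted_samples, p):
    """Nearest-rank percentile over a pre-sorted list of samples."""
    idx = min(len(sorted_samples) - 1, int(len(sorted_samples) * p / 100))
    return sorted_samples[idx]


# 997 "normal" page loads around 100ms, plus three big-fish sessions
# that take 30x longer -- invisible at p99, dominant at p99.9.
loads = sorted([100.0] * 997 + [3000.0] * 3)
print(percentile(loads, 50))    # -> 100.0
print(percentile(loads, 99))    # -> 100.0
print(percentile(loads, 99.9))  # -> 3000.0
```

This is why zooming in past p99 matters: the median and even p99 can look perfectly healthy while a small set of outsized users has a dramatically worse experience.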

Apps are often designed for “normal” people or use cases. If you’re writing a feature or service that assumes normal usage, you’re going to have a bad time. Or at the very least, your p99.9 use case is going to have a bad time. So what’s the big deal? That’s not that many users, right? Well, p99.9 problems often affect your most important customers, since they have the resources or influence to stretch your infra to the max in the first place. In other words, sometimes these outliers are actually your most important customers and deserve more attention.