Just-in-time optimization | Marshall Shen

Just-in-time optimization

In any startup, we have too many things that need to get done. Whenever we take on a task, we do the best we can to accomplish it, and we embed repeatable best practices while doing so. These best practices accumulate, and gradually we see significant improvements in our workflow.

When we face too many urgent tasks, our instinct is to solve them as fast as possible. When we finish, we feel tired and frustrated, and we want to move on. But if we slow down and resist that fight-then-flight response, we can analyze the issue more deeply and come up with repeatable, scalable solutions.

For example, let’s say we receive a call from an important customer complaining about a problem they encountered in our mobile application. We discover that, because of a software bug, their data sets don’t match what they should receive. One easy path is to update the data records directly in SQL to what we think is correct, so that the customer quickly gets what they want and we can quickly move on. A better practice is to write a script based on API calls to the backend system, along with tests that verify the script does exactly what we intend. API-based scripts are usually the better choice here because they go through the backend system’s code paths and make data changes that are consistent with what the backend system itself does. In contrast, SQL scripts make one-off data changes that may or may not be consistent with those code paths. The automated tests tell us exactly which API calls a script triggers and what responses it expects.
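As a rough illustration of the API-based approach described above, here is a minimal Python sketch. Everything named in it is an assumption made for the example: the `BackendApi` interface, its `get_record` and `update_record` methods, and the record shape are hypothetical stand-ins for whatever the real backend exposes. The test uses an in-memory fake so it can check exactly which calls the script makes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class BackendApi(Protocol):
    """Hypothetical backend API; the method names are assumptions."""

    def get_record(self, record_id: str) -> dict: ...
    def update_record(self, record_id: str, fields: dict) -> dict: ...


def repair_records(api: BackendApi, corrections: dict[str, dict]) -> list[str]:
    """Apply corrections through the API and return the ids that changed."""
    changed = []
    for record_id, expected in corrections.items():
        current = api.get_record(record_id)
        # Send only the fields that differ, so re-running the script is safe.
        diff = {k: v for k, v in expected.items() if current.get(k) != v}
        if diff:
            api.update_record(record_id, diff)
            changed.append(record_id)
    return changed


@dataclass
class FakeApi:
    """In-memory stand-in for the backend that records every call."""

    records: dict[str, dict]
    calls: list[tuple] = field(default_factory=list)

    def get_record(self, record_id: str) -> dict:
        self.calls.append(("get", record_id))
        return dict(self.records[record_id])

    def update_record(self, record_id: str, fields: dict) -> dict:
        self.calls.append(("update", record_id, fields))
        self.records[record_id].update(fields)
        return dict(self.records[record_id])


def test_repair_records() -> None:
    api = FakeApi({"r1": {"total": 10}, "r2": {"total": 7}})
    changed = repair_records(api, {"r1": {"total": 12}, "r2": {"total": 7}})
    # Only the record with wrong data is updated.
    assert changed == ["r1"]
    assert ("update", "r1", {"total": 12}) in api.calls
    assert not any(c[0] == "update" and c[1] == "r2" for c in api.calls)
    # A second run finds nothing left to fix.
    assert repair_records(api, {"r1": {"total": 12}}) == []


if __name__ == "__main__":
    test_repair_records()
    print("ok")
```

In a real system the fake would be replaced by a client that talks to the backend, while the test keeps documenting which calls the script is expected to make.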

In the example above, taking more time to write testable, API-based scripts is a just-in-time optimization. We optimize the solution and design a best practice while we are solving a new problem, and later we repeat that best practice on other similar, urgent tasks.

Another just-in-time optimization practice is good prioritization, and one way to prioritize is to group urgent tasks by their root causes. A fast-growing startup has a growing customer base yet an immature system and infrastructure. When a major defect occurs in the system, its impact tends to magnify and result in an influx of complaints. When we face such an influx, it’s important to step back and look for patterns: do these complaints share a root cause? An immature system generally has an 80/20 challenge: roughly 20% of system defects cause 80% of customer complaints. A good thought experiment for identifying a 20% defect is to ask: “If this defect were not in the system, what would our customer complaints look like?”
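The grouping and the thought experiment above can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes each complaint has already been triaged and tagged with a `root_cause` field, and the field name, the sample data, and both function names are made up for the example.

```python
from collections import Counter


def top_root_causes(complaints, coverage=0.8):
    """Return (root_cause, count) pairs, most frequent first, that together
    account for at least `coverage` of all complaints."""
    counts = Counter(c["root_cause"] for c in complaints)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, count in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append((cause, count))
        covered += count
    return selected


def complaints_without(complaints, root_cause):
    """Thought experiment: which complaints remain if this defect is gone?"""
    return [c for c in complaints if c["root_cause"] != root_cause]


if __name__ == "__main__":
    # Hypothetical, hand-tagged complaints.
    complaints = (
        [{"root_cause": "db_connections"}] * 6
        + [{"root_cause": "load_balancer"}] * 2
        + [{"root_cause": "login_bug"}, {"root_cause": "typo"}]
    )
    print(top_root_causes(complaints))
    print(len(complaints_without(complaints, "db_connections")))
```

With this sample data, two of the four root causes account for eight of the ten complaints, which makes them the natural candidates to prioritize.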

For example, suppose many customers complain around the same time that they have trouble connecting to our website. In that case, the complaints likely stem from the same scalability issue, such as a shortage of available database connections or a problem with traffic load balancing. In the moment, most of us will focus on fixing the urgent problem, for instance by manually increasing database throughput. Once the urgent issue is fixed, it’s also important to track these symptoms and solutions in a larger context: why do we have database scaling issues, and do they reflect a general lack of engineering design in our system? After we identify the root cause, we should document it in detail and prioritize engineering capacity to fix it.

As a group of engineers optimizes for urgent tasks together, we need a common understanding of how we want to divide and conquer. When the team is very small, we can just pop into a group chat and ask for help. As the business and organization grow, solving problems together becomes more complicated, because we have different ways of thinking and more systems to support. When a startup is small and has a monolithic codebase, an engineer can usually track down problems in that single codebase and fix them. As more engineers join and more services are built, individuals start to specialize in parts of a big ecosystem, and a production issue might touch multiple services.

A great engineering culture promotes collaboration, and that holds true regardless of a company’s size. What changes is how we collaborate as the system grows and the number of engineers increases. Practices that work for a small number of engineers might not work for a larger group. To keep promoting collaboration, we need to keep reinventing ourselves: experiment with new ways to work together, and not be afraid to drop existing ways if they don’t work for us.