We have all been there. Your phone starts buzzing before you have even made it into the office in the morning. Slack is pinging like crazy with your on-call engineer giving updates or asking for help. It is going to be all hands on deck today, and you brace yourself, knowing that you’re about to walk into a crisis.

The first moments of any production fire are vital: they set the tone for the whole process. The team comes together to identify the issue, gathers more information and starts assessing possible fixes. It’s here that you have the opportunity to really change the way the situation is addressed. Returning your system to stability is the goal, of course, and sometimes a simple, clean solution presents itself. When that is not the case, though, it is better to recognize the opportunity at hand and let it push you to improve your system. Oftentimes the technology can be in a better place once stability is restored than it was before the fire. Instead of hacking a solution and waiting until tomorrow (or next month) to fix it properly, break out that list of proposals and system improvements and identify what can be of use. All of this must be done, of course, with patient safety as the first priority.

Our data platform had just such a production fire recently, when a change broke our ingestion process. This process, called Change Data Capture, is a feature of Microsoft SQL Server that allows for exact replication of a database. When it works, it is an excellent way to keep systems in sync and pick up changes quickly. Unfortunately, it can have a few negative repercussions. It is very CPU intensive, must process all changes to tables and requires keeping log files of any changes on the originating server until a change is fully propagated. On a large system, these costs together can cause serious stability issues and even crash primary databases. That is what happened here: while patient care was not impacted, we were unable to ingest most of the data we required.

We already had a new method for extracting data which was being rolled out slowly, one which gave us greater control of the process and was less prone to stability issues. This extract process essentially just queries all rows in a table within a timeframe and streams them back into our database. We can chunk request timeframes and tables however we see fit, monitor carefully to ensure we are not putting any undue pressure on client systems and effectively rebuild a database from scratch over a few days, should it be necessary. Now, that’s not to say this approach was perfect: several use cases still relied on some of the older components and, as it stood, the new method was not able to handle the volume of data being transferred.
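
A minimal sketch of that chunked, time-windowed approach might look like the following. It is an illustration only: the `updated_at` column, the `time_windows` and `extract_table` helpers, and the use of SQLite as the source database are all assumptions made for the example, not details of the actual pipeline.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def time_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lo, hi) windows covering [start, end) with no gaps or overlaps."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter than `step`
        yield lo, hi
        lo = hi


def extract_table(
    conn: sqlite3.Connection,
    table: str,
    start: datetime,
    end: datetime,
    step: timedelta,
) -> Iterator[tuple]:
    """Stream rows whose (hypothetical) updated_at falls in [start, end), one window at a time."""
    # Table names cannot be bound as query parameters, so `table` is assumed
    # to come from a trusted allow-list rather than from user input.
    query = (
        f"SELECT id, updated_at FROM {table} "
        "WHERE updated_at >= ? AND updated_at < ? ORDER BY updated_at"
    )
    for lo, hi in time_windows(start, end, step):
        # Assumes timestamps are stored as ISO-8601 strings in one consistent
        # format, which compare chronologically as plain text.
        yield from conn.execute(query, (lo.isoformat(), hi.isoformat()))
```

Because each window is an independent query, the window size becomes the knob for how much load any single request places on the source system, and a failed window can be retried on its own without restarting the whole extract.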

Patching the existing system would take time and leave us with the same instability, setting us up for this to eventually recur. Installing the new process would require a few long days to determine how to handle the greater volume and to hack together a way to temporarily keep the legacy system up to date.

One major factor in our favor was that our new method did not need to be proven from scratch. Although we had not yet moved to it completely, it was well vetted and had been tested in other scenarios. This was not an unproven idea adopted on a lark. Jumping headfirst into something truly unknown could easily have left us in a worse position as new and unexpected issues arose. The team had never taken its eyes off how to improve the system, continuing to slowly push forward, plan and develop better approaches, even if they could not be deployed wholesale. That unlocked the opportunity to take a major leap forward in the wake of this significant issue.

In the weeks that followed, we monitored the new extract process closely. Because the old system was a continuous pull and the new one is a batch, it’s difficult to compare speed directly, but the full extract process levelled out at just one hour. It has had fewer bugs and hiccups than the previous deployment, and on the occasions we have had issues (once, the extract did not start due to a problem in our orchestration framework), we need only that one hour to catch up. Previously, an issue affecting our ability to pull data could take days to recover from completely, as changes were streamed into our data processing and storage system.

So how can “Crisis Driven Development” (CDD) be deployed on your team? It’s less a difficult process and more of a way of thinking, a consistent approach.

First, never stop examining your system, designing and brainstorming ways to make it better. When ideas come up during conversation, write them down and keep lists so that they don’t get lost. Make time for the team to brainstorm together. It’s fun, it’s a great chance to develop rapport within your team and it can be an excellent learning experience.

Second, make time to explore, prototype and test. There are always more features to be developed and deployed, but amidst that push it’s vitally important not to let the team’s great ideas wither, but rather to test, develop and prove the keepers. Some approaches will not pan out. Others sound great on paper, but have significant technical hurdles for implementation. Still others wind up scaling less easily than had been expected. Even if a full deployment is not in the near-term plan, having proven concepts will make it possible to make an educated decision down the road. If the extract process we adopted had not been proven previously on a smaller scale and in less risky situations, we would not have been able to use it here.

Third, always take a deep breath during a production crisis. After the initial issue is identified, it is tempting to leap straight into trying various fixes. If the issue is deep enough and the fixes expensive enough (if restarting a server fixes everything, then it would NOT be a good candidate for CDD!), consider what the best approach is to rectify the problem for the long term.

Fourth, assess the risks and rewards of your options. This cannot be a long and complex process, but it must happen. It is greatly aided by extensive monitoring and system testing. Had we not been able to quickly establish the speed of the new extract system, we might have deemed it too risky to implement.

And last, don’t be afraid to jump into that long-term solution once you believe it’s the best approach. System evolution sometimes moves in fits and starts, but by treating every crisis as an opportunity too, you can help your team always be at its best.

Author
Engineering Manager
Zach Drillings is an engineering manager at Flatiron Health, where he leads a data platform team in the development of data pipeline systems and data warehousing to support the company mission. He is also a passionate and active member of…