Jun 12

Azure and a Hungry Hungry Hippo Problem

A recent project I worked on had an Azure Worker Role (background processing) and a Web Role (web sites).  The worker pulled custom job messages from a queue at a regular interval and did the expected work for each message type - pretty standard.  One day we were pushing out a change to the worker logic and started seeing extremely odd behavior in our test environment.  Sometimes the new code ran as expected, but other times the old code appeared to be running.  Four of us reviewed the code, dumbfounded, as we all felt pretty confident we could interpret what a few “if” statements should do.  So how was this happening, and why was it intermittent?

After racking our brains, we figured it out.  Our problem was something I compared to the classic board game Hungry Hungry Hippos: two players gobbling from the same pile.  As part of our deployment procedures, we take advantage of Azure’s deployment features (Production and Staging slots and the VIP Swap capability).  This allows you to stage a new release, let it warm up, do a little sanity testing against the staging URLs for web sites, and then cut over with no downtime.  Our problem was that we never Stopped the old (now Staging) release after the swap.  So both the new code AND the old code were running against the same queue, and it was simply a matter of timing which deployment's worker would read a given message.  It makes sense to keep the old deployment available for a short period of time in case you need to roll back for some reason.  But if you have a similar worker role/queuing pattern and wish to avoid a few moments doubting your sanity, remember to Stop the old deployment.  You’ll want to delete it eventually because you get charged for it, even if it is stopped.
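To make the race concrete, here is a minimal simulation in Python. It is not Azure code: the queue is an in-memory deque, the two "deployments" are plain functions, and names such as handle_old, handle_new, and simulate are made up for illustration. A seeded random pick stands in for "whichever worker happens to poll first".

```python
import random
from collections import deque


def handle_old(message):
    # Stand-in for the worker logic in the old (now Staging) deployment.
    return f"old logic handled {message}"


def handle_new(message):
    # Stand-in for the worker logic in the new (now Production) deployment.
    return f"new logic handled {message}"


def simulate(messages, workers, seed=0):
    """Hand each queued message to exactly one of the running workers.

    Which worker wins a message is chosen at random, mimicking the
    timing-dependent behavior of several consumers polling one queue.
    """
    rng = random.Random(seed)
    queue = deque(messages)
    results = []
    while queue:
        message = queue.popleft()
        name, handler = rng.choice(workers)
        results.append((name, handler(message)))
    return results


messages = [f"job-{i}" for i in range(10)]
workers = [
    ("staging (old build)", handle_old),
    ("production (new build)", handle_new),
]

# Both deployments running: any message may be handled by either build.
results = simulate(messages, workers)
for name, outcome in results:
    print(f"{name}: {outcome}")

# Old deployment stopped: only the new build is left to read the queue.
after_stop = simulate(messages, workers[1:])
```

With both entries in workers, nothing guarantees which build handles a given job, which is exactly the intermittent behavior we saw. Once only the new build is polling, every message goes through the new logic.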

One other tip related to this type of problem is to check your logs.  You should of course have some sort of tracing/logging in your system, and if you are using Azure’s framework you can see that the Deployment ID is logged with your messages.  This matches the Deployment ID visible in the management portal and can help you definitively determine which deployment is executing a given piece of work.
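If your own logging does not already carry a deployment identifier, stamping one on every line is cheap. The sketch below uses Python's standard logging module purely for illustration. It assumes the ID is available in an environment variable called DEPLOYMENT_ID, which is a hypothetical name and not something Azure sets for you; DeploymentIdFilter and build_logger are likewise made-up helpers.

```python
import io
import logging
import os

# Assumption: the deployment ID is exposed through an environment variable
# named DEPLOYMENT_ID. The variable name is hypothetical.
deployment_id = os.environ.get("DEPLOYMENT_ID", "unknown-deployment")


class DeploymentIdFilter(logging.Filter):
    """Attach the deployment ID to every log record that passes through."""

    def __init__(self, deployment_id):
        super().__init__()
        self.deployment_id = deployment_id

    def filter(self, record):
        record.deployment_id = self.deployment_id
        return True  # never drop the record, only annotate it


def build_logger(deployment_id, stream):
    """Create a logger whose lines are prefixed with the deployment ID."""
    logger = logging.getLogger(f"worker.{deployment_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [deployment=%(deployment_id)s] %(message)s")
    )
    handler.addFilter(DeploymentIdFilter(deployment_id))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = build_logger(deployment_id, stream)
logger.info("processed job-1")
print(stream.getvalue(), end="")
```

With the ID on every line, a message handled by the wrong build shows up immediately as a log entry carrying the old deployment's ID.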
