Development Horror Story – Release Nightmare

posted by Roberto Cortez on

Everyone has good stories about releases that went wrong, right? I’m no exception, and I have collected a few good ones over my development career. These are usually very stressful at the time, but now my teammates and I can’t talk about these stories without laughing.

Never Happened QA

History

I think this happened around 2009. My team and I had to maintain a medium to large legacy web application with around 500k lines of code. This application was developed by another company, so we didn’t have the code. Since we were in charge now and needed the code to maintain it, they handed us the code in a zip file (the first sign that something was wrong)!

Their release process was peculiar, to say the least. I’m pretty sure there are worse release procedures out there. This one consisted of copying the changed files (*.class, *.jsp, *.html, etc.) to an exploded war folder on a Tomcat server. We also had three environments (QA, PRE, PROD) with different application versions and no idea which files were deployed on each. They also had a ticket management application with compiled files attached to the tickets, ready to be deployed, and no trace of the original sources. What could possibly go wrong here?

The Problem

Our team was able to make the changes required by the customer and push them to the PROD servers. We had done it a few times successfully, even with all the handicaps. Everything was looking good until we got another request for additional changes. These changes were only a few improvements to the log messages of a batch process. The purpose of the batch was to copy files with financial data sent to the application and insert their contents into a database. I guess that I don’t have to state the obvious: this data was critical for calculating financial movements, with a direct impact on the amounts paid by the application users.

After our team made the changes and performed the release, all hell broke loose. Files were not being copied to the correct locations. Data was duplicated in the database and the file system. Financial transactions had incorrect amounts. You name it. A complete nightmare. But why? The only change was a few improvements to the log messages.

The Cause

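The problem was not exactly related to the changed code. It involved two files: BatchConfiguration, which holds a variable naming the expected operating system, and BatchProcess, which reads that variable.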

This is not the real code, but for the purposes of the problem it was laid out like this. Don’t ask me why it was like this. We got it in the zip file, remember?
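A minimal sketch of that layout, reconstructed from the description in this post, could look like the following. The class names BatchConfiguration and BatchProcess and the OS field come from the text; the detectOs() helper and the method bodies are assumptions made for illustration.

```java
// BatchConfiguration.java (hypothetical reconstruction)
class BatchConfiguration {
    // static final + String literal makes this a compile-time constant.
    // "Windows" is the developer's local test value; the server expected "Unix".
    public static final String OS = "Windows";
}

// BatchProcess.java (hypothetical reconstruction)
class BatchProcess {
    // Hypothetical helper standing in for the OS-dependent file copy logic.
    static String detectOs() {
        if (BatchConfiguration.OS.equals("Windows")) {
            return "Windows";
        } else if (BatchConfiguration.OS.equals("Unix")) {
            return "Unix";
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        System.out.println(detectOs());
    }
}
```

Running java BatchProcess after compiling both classes prints Windows.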

So we have here a variable which sets the expected operating system, and the logic to copy the file is dependent on it. The server was running on a Unix box, so the variable value was Unix. Unfortunately, all the developers were working on Windows boxes. I say unfortunately because, if the developer who implemented the changes had been using Unix, everything would have been fine.

Anyway, the developer changed the variable to Windows so he could proceed with some tests. Everything was fine, so he performed the release. He copied the resulting BatchProcess.class to the server. He didn’t bother with BatchConfiguration, since the one on the server was configured for Unix, right?

Maybe you already spotted the problem. If you haven’t, try the following:

  • Copy and build the code.
  • Execute it. Check the output: you should get Windows.
  • Copy the resulting BatchProcess.class to an empty directory.
  • Execute it again from that directory, using the command line java BatchProcess.

What happened? You got the output Windows, right? Wait! We didn’t have the BatchConfiguration.class file in the executing directory. How is that possible? Shouldn’t we need this file there? Shouldn’t we get an error?

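When you build the code, the Java compiler inlines the BatchConfiguration.OS variable, because a static final String initialized with a literal is a compile-time constant. This means that the compiler replaces the variable expression in the if statement with the actual value, so BatchProcess.class no longer needs BatchConfiguration.class at runtime. It’s like having if ("Windows".equals("Windows")).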

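Try executing javap -c BatchProcess. This will show you a bytecode representation of the class file. You can confirm that the variable references are replaced with their constant values: the string Windows is loaded directly with an ldc instruction, and there is no getstatic on BatchConfiguration.OS.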

Now, returning to our problem. The .class file that was copied to the PROD servers had the Windows value baked in. This broke the runtime logic that handled the input files with the financial data, and it was the cause of the problems I’ve described earlier.
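For comparison, here is a sketch of a variant that the compiler cannot inline. It is not what we had in the project. It uses the intern() trick mentioned in the comments below: because the initializer is a method call, it is not a constant expression, so the value is read from BatchConfiguration at runtime. The detectOs() helper is again a hypothetical name used only for illustration.

```java
// BatchConfiguration.java (hypothetical variant)
class BatchConfiguration {
    // intern() is a method call, so this initializer is not a constant
    // expression and javac does not copy the value into other classes.
    public static final String OS = "Unix".intern();
}

// BatchProcess.java (hypothetical variant)
class BatchProcess {
    // Hypothetical helper standing in for the OS-dependent file copy logic.
    static String detectOs() {
        // Each access compiles to a getstatic on BatchConfiguration.OS,
        // so the value comes from whichever BatchConfiguration.class is deployed.
        if (BatchConfiguration.OS.equals("Windows")) {
            return "Windows";
        } else if (BatchConfiguration.OS.equals("Unix")) {
            return "Unix";
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        System.out.println(detectOs());
    }
}
```

With this layout, running BatchProcess without BatchConfiguration.class on the classpath fails with a NoClassDefFoundError instead of silently using a stale value.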

Aftermath

Fixing the original problem was easy. Fixing the problems caused by the release was painful. It involved many people, many hours, pizza, loads of SQL queries, shell scripts and so on. Even our CEO came to help us. We called this the mUtils problem, since mUtils was the name of the original Java class that contained the code.

Yes, we migrated the code to something manageable. It’s now on a VCS with a tag for every release and version.

Comments ( 4 )

  1. dolzhenko

    I can’t recall the name, but one of the libraries under the Apache wing has String literal inlining protection like

    public static final String VER = "2.0".intern();

    • Roberto Cortez

      Hi dolzhenko,

      Thank you for your comment. Yes, that’s a trick that you can use to prevent inlining. Unfortunately, we didn’t think of it at the time.

  2. Patrick Phelan

    I actually had the same issue lately! Atlassian in their bizarre wisdom decided to introduce a filter on svn checkins that would be ignored in the build change detection code on their bamboo build server. Their default string was a regular expression containing something like “[maven-release-plugin]”. So every time we did a release the next maven snapshot would not get deployed.

    The only solution was to download their source, change the string to something obscure and drop the class file into the runtime jar.

    This string was similar to what you had above, a static string in an interface which was used in their service. The fix for the issue in the previous version just involved changing the interface string and dropping that class in. In the later versions though (perhaps I was using a later compiler too) I was stumped over why this wasn’t working. Until eventually I realised the issue was the same one as you showed above – I needed to change the service code and replace the static string. 🙂

    • Roberto Cortez

      Hi Patrick!

      Thank you for reading and for your comment. In fact, I thought that this kind of problem might happen to other people and that was the motivation behind the post 🙂
