Everyone has good stories about releases that went wrong, right? I’m no exception, and I have collected a few good ones over my development career. These are usually very stressful at the time, but now my teammates and I can’t talk about these stories without laughing.
History
I think this happened around 2009. My team and I had to maintain a medium to large legacy web application with around 500 k lines of code. This application was developed by another company, so we didn’t have the code. Since we were in charge now and needed the code to maintain it, they handed us the code in a zip file (the first sign that something was wrong)!
Their release process was peculiar, to say the least. I’m pretty sure there are worse release procedures out there. This one consisted of copying the changed files (*.class, *.jsp, *.html, etc.) to an exploded war folder on a Tomcat server. We also had three environments (QA, PRE, PROD) with different application versions and no idea which files were deployed on each. They also had a ticket management application with compiled files attached, ready to be deployed, and no trace of the original sources. What could possibly go wrong here?
The Problem
Our team was able to make the changes required by the customer and push them to the PROD servers. We had done it a few times successfully, even with all the handicaps. Everything was looking good until we got another request for additional changes. These changes were only a few improvements to the log messages of a batch process. The purpose of the batch was to copy files with financial data that were sent to the application and insert their contents into a database. I guess that I don’t have to state the obvious: this data was critical to calculate financial movements with a direct impact on the amounts paid by the application users.
After our team made the changes and performed the release, all hell broke loose. Files were not being copied to the correct locations. Data was duplicated in the database and the file system. Financial transactions had incorrect amounts. You name it. A complete nightmare. But why? The only change was a few improvements to the log messages.
The Cause
The problem was not exactly related to the changed code. Look at the following files:
BatchConfiguration

    public class BatchConfiguration {
        public static final String OS = "Windows";
    }

And:
BatchProcess

    public class BatchProcess {
        public void copyFile() {
            if (BatchConfiguration.OS.equals("Windows")) {
                System.out.println("Windows");
            } else if (BatchConfiguration.OS.equals("Unix")) {
                System.out.println("Unix");
            }
        }

        public static void main(String[] args) {
            new BatchProcess().copyFile();
        }
    }

This is not the real code, but for the purposes of the problem it was laid out like this. Don’t ask me why it was like this. We got it in the zip file, remember?
So we have here a variable which sets the expected operating system, and the logic to copy the file depends on it. The server was running on a Unix box, so the variable value was Unix. Unfortunately, all the developers were working on Windows boxes. I say unfortunately, because if the developer who implemented the changes had been using Unix, everything would have been fine.
Anyway, the developer changed the variable to Windows so he could proceed with some tests. Everything was fine, so he performed the release. He copied the resulting BatchProcess.class to the server. He didn’t bother with BatchConfiguration, since the one on the server was configured for Unix, right?
Maybe you already spotted the problem. If you haven’t, try the following:
1. Copy and build the code.
2. Execute it. Check the output; you should get Windows.
3. Copy the resulting BatchProcess.class to an empty directory.
4. Execute it again from that directory, using the command line: java BatchProcess
What happened? You got the output Windows, right? Wait! We didn’t have the BatchConfiguration.class file in the executing directory. How is that possible? Shouldn’t we need this file there? Shouldn’t we get an error?
When you build the code, the Java compiler inlines the BatchConfiguration.OS variable, because it is a static final String initialized with a constant value (a compile-time constant). This means that the compiler replaces the variable expression in the if statement with the actual variable value. It’s like having if ("Windows".equals("Windows")).
Try executing javap -c BatchProcess. This prints a bytecode representation of the class file, where you can confirm that every reference to the variable has been replaced with its constant value.
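A minimal sketch of one way to sidestep this kind of inlining, under assumptions of my own: the class names RuntimeBatchConfiguration and RuntimeBatchProcess and the detectOs helper are hypothetical and were not part of the original application. The idea is that a static final field whose initializer is a method call is not a compile-time constant, so the compiler leaves a real field access in the calling class instead of copying the value into it. Reading os.name at run time is an extra simplification here; anything that is not Windows is treated as Unix.

```java
import java.util.Locale;

// Hypothetical stand-in for BatchConfiguration; names are illustrative only.
class RuntimeBatchConfiguration {

    // Still static final, but the initializer is a method call, so this is NOT
    // a compile-time constant and its value is not copied into other classes.
    static final String OS = detectOs(System.getProperty("os.name", ""));

    // Maps an os.name value to the two labels used in the example above.
    // Assumption: everything that is not Windows is treated as Unix.
    static String detectOs(String osName) {
        return osName.toLowerCase(Locale.ROOT).contains("windows") ? "Windows" : "Unix";
    }
}

// Hypothetical stand-in for BatchProcess.
class RuntimeBatchProcess {

    String copyFile() {
        // This read is a real field access on RuntimeBatchConfiguration,
        // resolved when the code runs rather than when it was compiled.
        if (RuntimeBatchConfiguration.OS.equals("Windows")) {
            return "Windows";
        }
        return "Unix";
    }
}
```

With this layout, copying only RuntimeBatchProcess.class to an empty directory and running it should fail with a NoClassDefFoundError rather than silently carrying a stale value, which is the kind of error the original release would have benefited from.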
Now, returning to our problem. The .class file that was copied to the PROD servers had the Windows value baked in. This broke the runtime logic that handled the input files with the financial data, and it was the cause of the problems I described earlier.
Aftermath
Fixing the original problem was easy. Fixing the problems caused by the release was painful. It involved many people, many hours, pizza, loads of SQL queries, shell scripts and so on. Even our CEO came to help us. We called this the mUtils problem, since that was the name of the original Java class containing the code.
Yes, we migrated the code to something manageable. It’s now on a VCS with a tag for every release and version.
Based on my session idea at JavaOne about things that went terribly wrong in our development careers, I thought about writing up a few of these stories. I’ll start with one of my favourites: crashing a customer’s mail server after generating more than 23 million emails! Yes, that’s right, 23 million!
History
A few years ago, I joined a project that had been in development for several months, but had no production release yet. Actually, the project was scheduled to replace an existing application in the upcoming weeks. My first task in the project was to figure out what was needed to deploy the application in a production environment and replace the old application.
This application had a considerable number of users (around 50 k), but not all of them were active. The new application had a new feature to exclude the users who hadn’t logged into the application in the last few months. This was implemented as a timer (executed daily), and an email notification was sent to each user warning him that he had been excluded from the application.
The Problem
The release was installed on a Friday (yes, Friday!), and everyone went for a rest. Monday morning, all hell broke loose! The customer mail server was down, and nobody had any idea why.
The first reports indicated that the mail server was out of disk space, because it had around 2 Million emails pending delivery and a lot more incoming. What the hell happened?
The Cause
Even with the server down, support was able to show us a copy of an email stuck in the server. It was consistent with the email sent when a user was excluded. It didn’t make any sense, because we counted the number of users to be excluded and they were around 28 k, so only 28 k emails should have been sent. Even if all users were excluded the number could not be higher than 50 k (the total number of users).
Invalid Email
Looking into the code, we found a bug that caused a user not to be excluded if he had an invalid email. As a consequence, these users were caught every time the timer executed. Of the 28 k users to be excluded, around 26 k had invalid emails. From Friday to Monday, we counted 3 executions * 26 k users, so 78 k emails. OK, so now we have an increase in emails, but still nowhere near the reported numbers.
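To illustrate the kind of fix this bug calls for, here is a minimal sketch and not the project’s actual code. The names ExclusionJob, User, Mailer and the message text are all hypothetical. It assumes the exclusion is recorded before the notification is attempted, so a failed email cannot leave the user eligible for the next timer run.

```java
import java.util.List;

// Hypothetical exclusion job; all names here are assumptions for illustration.
class ExclusionJob {

    // Hypothetical user record.
    static final class User {
        final String email;
        boolean excluded;

        User(String email) {
            this.email = email;
        }
    }

    // Hypothetical mail gateway; send may throw for an invalid address.
    interface Mailer {
        void send(String to, String body);
    }

    // Excludes every inactive user and returns how many notifications were sent.
    static int run(List<User> inactiveUsers, Mailer mailer) {
        int sent = 0;
        for (User user : inactiveUsers) {
            if (user.excluded) {
                continue; // already handled by an earlier run, do not notify again
            }
            user.excluded = true; // record the exclusion first
            try {
                mailer.send(user.email, "Your account was excluded.");
                sent++;
            } catch (RuntimeException e) {
                // A failed notification must not undo the exclusion.
            }
        }
        return sent;
    }
}
```

Running the job twice over the same users sends nothing the second time, whether or not the first notification succeeded.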
Timer Bug
Actually, the timer also had a bug. It was not scheduled to be executed daily, but every 8 hours. Let’s adjust the numbers: 3 days * 3 executions a day * 26 k users brings the total to 234 k emails. A considerable increase, but still far from the reported numbers.
Additional Node
The operations team installed the application on a second node, and the timer was executed on both. So the numbers doubled. Let’s update: 2 * 234 k emails brings the total to 468 k emails.
No-reply Address
Since the emails were automated, you usually set up a no-reply address as the email sender. Now the problem was that the domain for the no-reply address was invalid. Combined with the users’ invalid emails, this put the mail server into a loop. Each invalid user email generated an error email sent to the no-reply address, which was invalid as well, and this caused yet another returned email to the server. The loop ends when the maximum hop count is exceeded. In this case it was 50. Now everything starts to make sense! Let’s update the numbers:
26 k users * 3 days * 3 executions * 2 servers * 50 hops for a grand total of 23.4 Million emails!
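The running totals from each step above can be checked with a few lines of Java. This is only a worked version of the arithmetic in this post; the class and method names are made up for the illustration.

```java
// Worked arithmetic for the email totals; names are illustrative only.
class EmailCount {

    // Multiplies out the factors discussed in the post.
    static long totalEmails(long users, long days, long runsPerDay, long nodes, long hops) {
        return users * days * runsPerDay * nodes * hops;
    }

    public static void main(String[] args) {
        long users = 26_000; // users with invalid emails

        System.out.println(totalEmails(users, 3, 1, 1, 1));  // daily timer: 78000
        System.out.println(totalEmails(users, 3, 3, 1, 1));  // every 8 hours: 234000
        System.out.println(totalEmails(users, 3, 3, 2, 1));  // two nodes: 468000
        System.out.println(totalEmails(users, 3, 3, 2, 50)); // 50 hops: 23400000
    }
}
```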
Aftermath
The customer lost all their email from Friday to Monday, but it was possible to recover the mail server. The problems were fixed and it never happened again. I remember those days as very stressful, but today all of us who were involved laugh about it!
I spent the last week in San Francisco to attend JavaOne 2014. This was my third time attending JavaOne, so I was already familiarized with the conference. Anyway, this year was different since I was going as a speaker for the first time.
Create the Future
“Create the Future” was the theme of JavaOne this year. The last few years have been very exciting for the Java community. After many years without evolution, we now see Java 8 with lambdas and streams, Java EE 7 (with new specifications and simplifications) and a huge effort to unify and support Java for embedded devices. Java 9 is already in the pipeline and promises modular Java (project Jigsaw). Java EE 8 is going to improve a lot of specifications and bring new ones like MVC, JSON-B and the much awaited JCache. Now is the time to contribute by Adopting a JSR.
During the last few years we heard a lot of voices claiming that Java is dead. Looking at what’s happening now, it doesn’t seem that way. The platform is evolving, a lot of new developers are joining the JVM ecosystem, and the conference was buzzing with energy. By the way, Java is turning 20 in 2015. Let’s see what is going to happen 20 years from now. Let’s hope that this blog is still around!
Keynote
The opening Keynote was a recap of what has happened in the last few years. You can find all the videos here. Just a few notes:
The Technical Keynote was cut short because of lack of time. This also happened to me in one of my sessions. I understand that there is a time frame, but this was not the best way to kick off the conference. I’m pretty sure that most attendees would have preferred a shorter Strategy Keynote in favour of the Technical one.
I was referenced in the Community Keynote, because of my work at the Java EE 7 Hackergarten. Thank you Heather VanCura. Count me in with future contributions!
Venue
The event was split between the Moscone Center, the Hilton Hotel and the Parc 55 Hotel. I wasn’t around when JavaOne was held entirely in the Moscone Center, so I can’t compare. Because of the layout of the hotels, you sometimes need to run from session to session, and the corridors are not the best place to have groups of people chatting. A few of the rooms also have columns in the middle, which makes it difficult for the attendees and the speaker to be aware of everything.
For my session Development Horror Stories [BOF4223], I had to run with Simon Maple to get there on time. The problem was that the sessions in the previous slot were held at the Hilton and ours was at the Moscone, which is a 15-minute walk. By the way, no taxi wanted to take us because it was too close.
Food
Not even going to comment about it. Yeah the lunch sucked, and yeah I’m weird with the food.
Sessions
There is so much stuff going on that it’s impossible to attend every session that you want to go to. I probably only attended half of the sessions that I had signed up for. I had to split my time between the sessions, the Demogrounds, the Hackergarten and also a bit of personal time for the last details of my own sessions. Not all sessions had video recording, but all of them should have audio and be available via Parleys.
These are my top 3 sessions (from the ones I have attended):
I’m relatively happy with my performance delivering the sessions, but I can improve much more. I do have to say that I didn’t feel any nervousness. I guess that I’m feeling more comfortable with public speaking, plus preparing everything a few weeks in advance also helped. Moving forward!
Development Horror Stories [BOF4223]
with Simon Maple
We had 150+ people signed up, but only 50 or so showed up. I think this was related to the venue switch problem I described earlier. At the same time there was also an Oracle Tech Party with food, drinks and music. I guess that didn’t help either.
Anyway, Simon and I kicked off the BOF with a few of our own stories where things went terribly wrong. The crowd was really into it, so our plan to ask people in the audience to share their own stories worked perfectly. We probably had 10+ people stepping up on stage. In the end we gave away a copy of the Java 8 In Action book, signed by the author, for the best story as voted by the audience. The winning story belonged to Jan, who wrote a few scripts to clear and insert data into a database for tests. Unfortunately, he executed them in a production environment by accident!
I think people enjoyed the BOF, and this can work pretty much everywhere. I’ll submit it to other conferences in the future. BOFs don’t really need slides, but we did some anyway:
Java EE 7 Batch Processing in the Real World [CON2818]
with Ivan Ivanov
This session was the first one of the day, at 8.30 in the morning, and was packed with people. It was surprising to see so many so early. Ivan and I started the session with an introduction to Batch, its origins, applications and so on. Next we went through the JSR-352 API to prepare for our demo at the end. The demo is based around World of Warcraft, and we used the Batch API to download, process and extract metrics from the game’s Auction Houses (they are like eBay in the game). Stay tuned for a future post describing the entire sample.
Unfortunately we ran out of time and couldn’t show everything that we wanted, or at least go into more detail about the demo. We allowed people to ask questions at any time, and we had a lot of them. I’m not complaining about it. I prefer doing it this way, since it makes the session more interactive. On the other hand, you end up using more time and it is not very predictable. We will reorganize the session to perform the demo in the middle, and everything should be fine like that.
CON4255 – The 5 people in your organization that grow legacy code
I’m pretty happy with how this session went. Considering that it was the last day of the conference and also one of the last sessions of the day, I had probably 80+ people. I’m also happy because it was video recorded, so I can review it properly later.
I’m not going to spoil the content, but I think the attendees really enjoyed the session and had many moments to laugh about the content. I’ll just leave you with the slides:
Final Words
The event was huge, so I’ll probably write another post about it, since I don’t want to write a very long, boring post. The next one is going to focus a little more on other sessions, activities and community!
I would like to thank everyone who attended my sessions and send a few special thanks: to Reza Rahman for helping me in the submission process, to Heather VanCura for the Hackergarten invite and to my co-speakers Ivan Ivanov and Simon Maple. Thanks everyone!
Last week, JavaOne 2014 published the session schedules plus the Schedule Builder for attendees to enrol in the sessions. I’m going to be speaking in the following sessions:
If you’re going, please sign up for these sessions. I’m going to do my best to make sure that your time is well spent there. Check my previous post for some additional information about the sessions: Speaking at JavaOne 2014.
Yesterday I got really great news. I was selected to present 3 out of the 4 sessions that I submitted to JavaOne 2014! After attending my first JavaOne in 2012, and going again in 2013, this was my first time submitting something to JavaOne.
I have to confess that I had high hopes of being selected, but I was not expecting to have 3 sessions right in the first year of submissions, since there are a lot of submissions and it’s really hard to get selected. A special thanks to Reza Rahman for helping me out during the submission process and for providing valuable tips. Thanks Reza! I would also like to thank Ivan Ivanov and Simon Maple, my co-speakers in two of the sessions.
Have a look below at the session abstracts and videos. I don’t have the schedules yet, but look for them in the JavaOne Schedule Builder (when available) and sign up 🙂
What am I going to speak about?
CON2818 – Java EE 7 Batch Processing in the Real World
Abstract
This talk will explore one of the newest APIs for Java EE 7, the JSR 352, Batch Applications for the Java Platform. Batch processing is found in nearly every industry when you need to execute a non-interactive, bulk-oriented and long-running task. A few examples are: financial transactions, billing, inventory management, report generation and so on. The JSR 352 specifies a common set of requirements that every batch application usually needs, like checkpointing, parallelization, splitting and logging. It also provides you with a job specification language and several interfaces that allow you to implement your business logic and interact with the batch container. We are going to live code a real-life example batch application, starting with a simple task and then evolving it using the advanced APIs until we have a fully parallel and checkpointing reader-processor-writer batch. By the end of the session, attendees should be able to understand the use cases of the JSR 352, when to apply it and how to develop a full Java EE Batch Application.
BOF4223 – Development Horror Stories
Abstract
We all enjoy hearing a good success story, but in the software development industry the life of a developer is also made up of disasters, disappointments and frustrations. Have you ever deleted all the data in production? Or maybe you just ran out of disk space and your software failed miserably! How about crashing your server with a bug that you introduced in the latest release? We can learn from each other’s mistakes. Come to this BOF and share with us your most horrific development story and what you did to fix it.
CON4255 – The 5 people in your organization that grow legacy code
Abstract
Have you ever looked at a random piece of code and wanted to rewrite it so badly? It’s natural to have legacy code in your application at some point. It’s something that you need to accept and learn to live with. So is this a lost cause? Should we just throw in the towel and give up? Hell no! Over the years, I learned to identify 5 main creators/enablers of legacy code on the engineering side, which I’m sharing here with you using real development stories (with a little humour in the mix). Learn to keep them in line and your code will live longer!