Development Horror Story – Mail Bomb
Based on my session idea at JavaOne about things that went terrible wrong in our development careers, I thought about writing a few of these stories. I’ll start with one of my favourites ones: Crashing a customer’s mail server after generating more than 23 Million emails! Yes, that’s right, 23 Millions!
A few years ago, I’ve joined a project that was being developed for several months, but had no production release yet. Actually, the project was scheduled to replace an existing application in the upcoming weeks. My first task in the project was to figure out what was needed to deploy the application in a production environment and replace the old application.
This application had a considerable amount of users (around 50 k), but not all of them were active. The new application had a new feature to exclude the users that didn’t log into the application for the last few months. This was implemented as a timer (executed daily) and a email notification was sent to that user warning him that he was excluded from the application.
The release was installed on a Friday (yes, Friday!), and everyone went for a rest. Monday morning, all hell broke loose! The customer mail server was down, and nobody had any idea why.
The first reports indicated that the mail server was out of disk space, because it had around 2 Million emails pending delivery and a lot more incoming. What the hell happened?
Even with the server down, support was able to show us a copy of an email stuck in the server. It was consistent with the email sent when a user was excluded. It didn’t make any sense, because we counted the number of users to be excluded and they were around 28 k, so only 28 k emails should have been sent. Even if all users were excluded the number could not be higher than 50 k (the total number of users).
Looking into the code, we found out a bug that would cause the user to not be excluded if he had an invalid email. As a consequence these users were caught every time that the timer executed. From the total 28 k users to be excluded, around 26 k had invalid emails. From Friday to Monday, we count 3 executions * 26 k users, so 78k k emails. Ok, so now we have an email increase, but not close enough to the reported numbers.
Actually the timer also had a bug. It was not scheduled to be executed daily, but every 8 hours. Let’s adjust the numbers: 3 days * 3 executions a day * 26 k users, brings the total to 234 k emails. A considerable increase but still far from a big number.
The operations installed the application in a second node, and the timer was executed in both. So a double increase. Let’s update: 2 * 234 k emails, brings the total to 468 k emails.
Since the emails were automated, you usually set up a no-reply email as the email sender. Now the problem was that the domain for the no-reply address was invalid. Combining this with the users invalid emails, the mail server entered in a loop state. Each invalid user email generated an error email sent to the no-reply address, which was invalid as well and this caused a returned email again to the server. The loop end when the Maximum hop count is exceeded. In this case it was 50. Now everything starts to make sense! Let’s update the numbers:
26 k users * 3 days * 3 executions * 2 servers * 50 hops for a grand total of 23.4 Million emails!
The customer lost all their email from Friday to Monday, but it was possible to recover the mail server. The problems were fixed and it never happened again. I remember those days, to be very stressful, but today all of us involved, laugh about it!
Remember: always check the no-reply address!