There is nothing more tricky and fraught with potential problems than DNS upgrades.
This week we migrated from BIND to PowerDNS. Prior to the migration we dutifully tested PowerDNS on different servers, in different configurations, consulted other sysadmins who were running PowerDNS, and found all tests to be working flawlessly.
So we went ahead and upgraded all three of our DNS servers from BIND to PowerDNS, and watched…
After the first day we saw no issues. None.
On day two I answered a support call from a hosting client who said that when he entered a few email addresses into the bcc field in webmail, none of the recipients received his email. He also said he didn’t receive any bounce messages and that bcc in webmail must be broken. I suspected – since nothing had changed with our webmail in a few weeks – a PBKAC and told him I’d look into it, expecting to call him back and say something like “you can’t put spaces in your email addresses.”
When I investigated I noticed that nearly all of the bcc addresses in the client’s email were to yahoo.com email addresses and that nothing he appeared to have done was wrong. I viewed the queues on our clustered mail servers and noticed that the yahoo.com addresses that he tried to send to had not left the servers. And there were other outgoing yahoo.com emails besides his that hadn’t left the servers either.
No PBKAC here. This smelled awfully like a DNS issue. Sysadmin powers activate!
After combing through logs, running many digs, and googling like a madman (Which reminds me, how the hell did I fix servers before Google? Did I go to the freakin’ library?) all signs pointed to the new PowerDNS install triggering a qmail bug that has to do with large DNS responses. For whatever reason, PowerDNS must have returned a bit more information than BIND did, so this qmail bug popped up after the migration.
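As a rough illustration of what “large DNS responses” means in practice (this is not the diagnostic actually used here, just a sketch): a DNS message over plain UDP is classically limited to 512 bytes, and a server signals that it had to cut a reply short by setting the TC (truncated) bit in the header. The Python below builds a bare MX query and flags any raw reply that is truncated or over that limit. The function names `build_query` and `response_needs_attention` are made up for this example, and sending the packet to a real resolver is left out.

```python
import struct

CLASSIC_UDP_LIMIT = 512  # classic maximum size of a DNS message over plain UDP


def build_query(name, qtype=15, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 15 is MX, class IN)."""
    # Header: id, flags (only RD set), 1 question, 0 answer/authority/additional
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label itself
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def response_needs_attention(response):
    """Return True if a raw reply is truncated (TC bit) or larger than 512 bytes."""
    if len(response) < 12:
        raise ValueError("shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    truncated = bool(flags & 0x0200)  # TC bit in the flags word
    return truncated or len(response) > CLASSIC_UDP_LIMIT
```

A reply that trips this check is the kind of response that software with a fixed 512-byte buffer may not handle well.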
Like all other things with qmail, there are about 10 different ways to attack this bug. Because of our many years of hosting email and battling spammers our qmail configuration is quite customized, which means that I had to navigate the most difficult route and patch all of the qmail binaries and recompile. Fortunately it all went fine, problem solved – all yahoo.com emails went blazing out of our servers on to their destination.
On days three and four our monitoring servers alerted us to our mail queues having unusual bounced messages in them. Not many bounces, but enough emails to trigger an alert. I investigated and didn’t see anything terribly out of the ordinary and told myself and the other sysadmins to keep a close eye on it.
On day five we were alerted to the fact that our mail servers had been placed on two spam blacklists. There are hundreds of blacklists on the Internet, and these two aren’t widely used, but a few of our email users had already had a couple emails blocked as a result. Ugh.
One of these blacklists was a backscatter blacklist. Backscatter is a wickedly perverse form of spam attack. It works like this: Spammers will fake the return address of the sender and will send a spam email to another address that the spammer knows will bounce, in the hopes that the bounced message will be read by the faked sender address. You follow that? Spammers are hoping that if they bounce a message to you that you’ll read it and buy their Viagra.
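To make the mechanics concrete, here is a small Python sketch contrasting the two behaviours. It assumes the usual mitigation, which is to refuse unknown recipients while the sending server is still connected rather than accepting the mail and bouncing it later; that may or may not be what any particular qmail patch does. The names `accept_then_bounce`, `rcpt_reply` and the sample addresses are hypothetical.

```python
KNOWN_RECIPIENTS = {"alice@example.com", "bob@example.com"}  # hypothetical mailboxes


def accept_then_bounce(envelope_sender, recipient, known=KNOWN_RECIPIENTS):
    """Backscatter-prone: accept everything, then bounce to the claimed sender.

    Returns the address that would receive a bounce, or None if the mail
    is deliverable. The envelope sender is unverified and may be forged.
    """
    if recipient.lower() in known:
        return None
    return envelope_sender


def rcpt_reply(recipient, known=KNOWN_RECIPIENTS):
    """Safer: answer RCPT TO during the SMTP session, so no bounce email is created."""
    if recipient.lower() in known:
        return 250, "OK"
    return 550, "No such user here"
```

With the first function, a spam message from a forged `victim@example.org` to a nonexistent mailbox produces a bounce addressed to the victim. With the second, the spammer's own connection simply gets a 550 and nothing is sent to anyone.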
Qmail normally is known to be vulnerable to backscatter, but I’d patched all of our servers to fix this issue in 2008. However, after many hours of discussion and research I realized that when I patched qmail with the new DNS size patch – five days ago – I forgot to re-apply the backscatter qmail patch. We document all of these procedures but I somehow didn’t read my own patch documentation.
Nice work, window-licker.
So I re-read our documentation and re-patched everything and it all worked. And on day six we were removed from the blacklists.
Ripples are totally stressful.
Nice post. Hadn’t seen the term PBKAC before — great! Author is admin — which is whom, exactly?
You have all my understanding. Good sleuthing!
This one is my favorite:
PICNIC (“Problem In Chair Not In Computer”).
The mystery poster was me, the Worse Lambie.
We used to have an Error Code 100 — which along with Error 12: File not found, and Error 20 Btrieve Not Loaded — 100 stood for User Confused. I like PICNIC too! Thanks for the post, Oban.