Last Friday, one of the workers here in the office came over to me and said that he got an error in his inbox about a message that had been delayed. Not permanently, just delayed. I said OK, leave it, it’ll retry again for the next 48 hours and looked into it.
I connected to the Exchange 2010 server and opened Exchange Management Console and went straight to the Toolbox and clicked on Queue Viewer. There they were, pretty ducks all in a row all with DNS FAILURE errors. Huh. Interesting. I saw this happen once before when we were setting the server up. The DNS server it was set to use was offline, so no DNS resolution meant it didn’t know where to send the mail. Thinking this was the case this time, I checked the Network Adapter settings and saw that the preferred DNS server was the other VM “next to” the Exchange 2010 VM and the secondary was set to “my” DNS server here in my office.
I checked my DNS server first, just to make sure the service was running, and it was. I then checked the DNS server that was it’s primary and it, too, was running. Mystery. Nslookup queries failed and timed out even for common domain names. Not good. This was happening on both DNS servers.
I called in a support ticket (this was Friday at 4:00) and found out that the Exchange SysAdmin was on vacation and not back until Monday, and he was being covered by another Exchange SysAdmin on East Coast time. She called me back about 20 minutes later and we worked on it for a good 40 minutes with no resolution. She figured that since the DNS server was rebooted, it had been unable to contact the
PDC role holder and authorize/activate itself and that there must be a problem with the VPN between my network and hers.
This seemed like a valid diagnosis, as the other Administrator here at work told me that our router had been failing every 30-40 minutes, but recovering after a minute or two and was obviously dying. Yikes. This caused a little panic as ALL my sites use the same router/firewall and they’re discontinued and I hadn’t yet created a contingency plan to replace them.
She escalated the ticket up to tier 3 networking support, who tested the VPN and said that everything was up on their end, but they couldn’t ping my side of the VPN, therefore there was a problem with the VPN and it was on my end. (naturally). I don’t know too much about the router/firewalls we use here, I’ve been slowly learning as I went, but diagnostics and troubleshooting was beyond the scope of my knowledge beyond “well the blinky light is green, not red, so it’s up”.
Further compounding the matter was that this VPN was temporary, because we were switching it on Monday from an Internet VPN to a private, routed DSL connection into their MPLS network. That ADSL modem was plugged in to power and phone, but not into the LAN as it was just for testing.
At some point over the weekend, one of the emails from their networking people said that they could ping as far as 192.168.0.252 but no further. This was when the light bulb went off in my head. .252 is the address of the new ADSL router, NOT the VPN endpoint! Their network techs were trying to reach my network via a device that was physically unplugged! I thought it was odd, since I was connecting from home via VPN through the same device and it was up.
Monday came and I plugged the DSL modem into the LAN and disabled the Internet VPN connection from my network to theirs, created a new route for all traffic destined for their network to use this new gateway and everything seemed to be working. Outlook clients in my LAN segment were connecting via the MPLS network, verified by the IP addresses on a traceroute… I could Remote Desktop the virtual servers in their network… everything seemed to be working, but their network guys could still not ping my LAN from the MPLS gateway, even though I could ping back to my network from the Virtual servers (which was the important part anyway) so that left me with the DNS problem, which was still ongoing and some people were now starting to get NDRs because the 48 hours had timed out.
I started with my own laptop, and did an nslookup query. request timed out. Damnit! I checked the DNS server, the service was running, I restarted it, it still failed. I looked at the event log and there were a bunch of “DNS server encountered an invalid domain name” errors, but the errors were coming from all these weird IP addresses that were not in my network. I then thought that perhaps it was the forwarding that wasn’t working, based upon a few results that came up when I searched that error message online. I checked the forwarders on my DNS server and found that they were set to use two Shawcable.net servers, one of which resolved to a hostname and both of which did not respond to an nslookup query. How on earth did I end up with two (seemingly) random Shaw Cable DNS servers for my forwarders when I have a Telus ADSL connection in this office? that could explain why they didn’t respond; my IP address wasn’t in the Shaw Cable network!
I changed the two forwarders to 188.8.131.52 and 184.108.40.206 which is OpenDNS. I then restarted the DNS Server service and BAM! nslookups all worked. I then went back to the Exchange server and tried again. Still failed. OK, I have an idea of what’s going on now, so I connected to the DNS server there and checked it’s event logs. Similar messages, different addresses. I opened the DNS snap-in and went right to the forwarders. The two forwarders on this server were two Telus servers! This was a co-located (sort of) Virtual Server within an ISP, so how did I end up with Telus servers there?! I changed those two forwarders to OpenDNS and restarted the DNS Server service and as I was opening a command prompt window on the Exchange 2010 server to try an nslookup again, I could see the emails in the retry queue (which was still open) begin to flow out. I tried nslookup queries on a couple domain names that I knew were in the retry queue and they all answered lightning fast as non-authoritative responses.
SO in the end, I figured it out myself, but the million-dollar question that I can’t answer is HOW did my local DNS server get a Shaw DNS server as a forwarder, and how did the VM DNS server in the datacenter get a Telus one??