Curious tale of broken VPNs, the Year 2038, and certs that expired 100 years ago
It’s not NTP. There’s no way it’s NTP. It was NTP
Interview: Back in late 2010, "Zimmie" was working in IT support for a vendor that made VPN devices and an associated operating system. He got a call on a Monday from a customer – a large specialty retailer in the US – about its VPN hardware that had stopped working over the weekend.
After looking into the report, the problem appeared to be the result of a certificate validation failure, as he described in a recent post on Mastodon.
"My then-employer's VPN devices are managed by a central server," explained Zimmie to The Register. "This server runs its own certificate authority (CA) which it uses to sign certificates for all the devices. The VPN endpoints then use these to authenticate to the management server (eg, when sending logs) and to each other (mostly for VPNs). The CA is a core part of the management software."
Zimmie told us he preferred not to publicly identify himself, the vendor, or its customer so as not to embarrass anyone. But he was okay with The Register recounting his tale.
Zimmie noted that the validation check had failed because the system couldn't check the certificate revocation list (CRL) – a list of digital certificates that have been revoked by the issuing authority.
"These devices authenticated to each other with certificates much like the ones used for HTTPS, but signed by a private certificate authority (CA)," he explained.
"Each customer had their own CA for this. Part of the process of validating the certificates is asking the CA if the certificate has been revoked. The validation was failing because the VPN box couldn't download the certificate revocation list (CRL) to confirm the peer's certificate wasn't on it. Why couldn't it download the CRL?"
The VPN device's certificate operation log showed that the CRL was too big to be downloaded.
"The software which handles the validation allocates 512 KiB of space to store the CRL, and it's too big? This CA is very lightly used, so it shouldn't be possible to hit the limit within a human lifespan. What's going on with the CRL that it hit the limit so early?"
A 2015 paper [PDF] on certificate revocation published by University of Maryland researchers notes that the CRL size for the median certificate is 51 KB and that half of all CRLs are under 900 B.
This one was nearly 1 MB, and Zimmie discovered the size was due to the CA revoking and reissuing every certificate it had signed, once per second. The CRL had grown by more than 250 KiB per day over the weekend, which he estimated amounted to a century of growth in a single day.
"Fortunately, the certificate authority had a certificate operation log," recounted Zimmie. "This log records every time the CA revokes a certificate, signs a new one, or performs a few other operations. Looking at the most recent entries, I see the CA process woke up, decided the CA's own certificate is expired, then revoked and reissued every certificate it had signed. I look further back, and I see it did the same process one second earlier. And one second before that."
These digital certificates, according to Zimmie, have two dates: notBefore, a date before which they're not valid; and notAfter, a date after which they're expired.
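A validity check built on these two fields can be sketched as follows. This is an illustrative Python example, not the vendor's implementation: the function name `is_valid` and all the dates are hypothetical. It also shows what happens if the two dates end up inverted, with notAfter earlier than notBefore: no moment in time can pass the check.

```python
from datetime import datetime, timezone


def is_valid(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """A certificate is usable only while now lies inside its validity window."""
    return not_before <= now <= not_after


def utc(year: int, month: int = 1, day: int = 1) -> datetime:
    """Convenience helper for building UTC dates in this example."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# A normal window: starts in 2010, expires ten years later (hypothetical dates).
normal_ok = is_valid(utc(2012), utc(2010), utc(2020))

# An inverted window: notBefore in 2037, notAfter in 1910 (hypothetical dates).
# Whatever the current date is, at least one of the two comparisons fails.
inverted_results = [
    is_valid(utc(year), utc(2037), utc(1910)) for year in (1905, 2010, 2040)
]

print(normal_ok)          # True
print(inverted_results)   # [False, False, False]
```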
"I look at the CA's certificate, and it says it's not valid after 1910," he wrote. "Weird, but that at least explains why the CA is telling us its certificate is expired. But when it issues a replacement certificate for itself, the replacement expires in 1910, too. That's extremely weird. It should reissue the certificate with new dates."
Zimmie took note of the notBefore date. "People rarely care about this date, because it's extremely rare to get certificates which aren't valid yet," he explained.
"But this one isn't valid before some time in 2037. Wait, what? It's not valid before 2037, and it's not valid after 1910? That means it's never valid. How did that happen?"
Readers who deal with networking may already have some idea: Something is amiss with the time calculation.
"Quite a while ago, the developers of UNIX decided to track time as a signed 32-bit counter of the number of seconds since midnight, January 1, 1970," explained Zimmie.
"Linux kept this decision for compatibility. The 'signed' part means the number can be negative, to allow time values before 1970 to be represented. Thirty-two bits is enough for a little under 4.3 billion seconds. The sign means you get about 2.1 billion seconds before 1970 and about 2.1 billion after. Two point one billion seconds is a little under 25,000 days, or a little over 68 years."
Sixty-eight years after 1970, Zimmie said, is 2038. Anyone who deals with UNIX time should recognize the year 2038, which has its own website warning of troubles ahead.
"The latest time which can be represented like this is 03:14:07 UTC on January 19, 2038," said Zimmie. "Once the timer is incremented from this second, the value 'overflows' and goes from being a large positive number to being a large negative number. The next second this counter can represent is 20:45:52 UTC on December 13, 1901. This is called the Year 2038 Problem."
Noting that the certificate authority signs its own certificate to be valid for a ten-year period, Zimmie concluded it ran into the 2038 problem when calculating expiration dates. That suggested a bug in the date math library, or the code implementing that library. But that still left the question of why it was renewing certificates with the same dates.
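The arithmetic behind both the 2038 limit and the impossible expiry date can be checked in a few lines. The following is a minimal sketch in Python, not the vendor's code: the helper names, the January 2037 start date, and the 3,650-day validity period are assumptions chosen for illustration, since the account only says the certificate started some time in 2037 and was signed for ten years.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit counter can hold
INT32_MIN = -2**31      # smallest value a signed 32-bit counter can hold


def wrap_int32(seconds: int) -> int:
    """Reduce an integer to the signed 32-bit range, mimicking overflow."""
    return (seconds + 2**31) % 2**32 - 2**31


def from_epoch(seconds: int) -> datetime:
    """Convert seconds since the UNIX epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


# The last representable second, and the one the counter wraps to next.
last_second = from_epoch(INT32_MAX)
next_second = from_epoch(wrap_int32(INT32_MAX + 1))
print(last_second.isoformat())  # 2038-01-19T03:14:07+00:00
print(next_second.isoformat())  # 1901-12-13T20:45:52+00:00

# Assumed certificate start date and validity period (not from the source).
not_before = datetime(2037, 1, 10, tzinfo=timezone.utc)
validity = timedelta(days=3650)

# Adding ten years to a 2037 timestamp exceeds INT32_MAX and wraps around.
start_seconds = int((not_before - EPOCH).total_seconds())
end_seconds = wrap_int32(start_seconds + int(validity.total_seconds()))
not_after = from_epoch(end_seconds)

print(not_before.year, not_after.year)  # 2037 1910
```

Under these assumptions the wrapped expiry date lands in 1910, more than a century before the start date, which is the kind of result Zimmie describes.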
Zimmie checked the CA's automatic renewal code and confirmed that it won't reissue the certificate with an earlier start date when it expires.
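The renewal code is not public, so the following Python sketch only illustrates one rule that would be consistent with Zimmie's description: a renewed certificate never starts earlier than the one it replaces. The function name `choose_not_before` and the dates are hypothetical.

```python
from datetime import datetime, timezone


def choose_not_before(previous_not_before: datetime, now: datetime) -> datetime:
    """Pick the start date for a renewed certificate.

    Assumed rule: never move the start date earlier than the previous one.
    """
    return max(previous_not_before, now)


# Normal renewal: the old certificate started in the past, so "now" wins.
normal = choose_not_before(
    datetime(2000, 6, 1, tzinfo=timezone.utc),
    datetime(2010, 6, 1, tzinfo=timezone.utc),
)

# After the clock jump: the old certificate claims to start in 2037, so every
# renewal performed in 2010 keeps the same 2037 start date.
stuck = choose_not_before(
    datetime(2037, 1, 10, tzinfo=timezone.utc),
    datetime(2010, 11, 15, tzinfo=timezone.utc),
)

print(normal.year)  # 2010
print(stuck.year)   # 2037
```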
"That makes sense under normal operation, but makes it impossible to recover from odd situations like the one my customer found himself in," he explained. "Another bug. But why did it sign a certificate not valid before 2037 in the first place?"
After checking the lengthy CA operation log, he found an entry with a system date stamp in 2037 that recorded the CA renewing its own certificate.
"Now the certificate starting in 2037 makes sense, except for the not-so-minor fact it's 2010, not 2037," he said. "So why does the CA operation log have entries from 2037?"
The customer, Zimmie recalled, reported that he didn't set the time to 2037. "He's using NTP, though, and it's synchronizing with the time server every 120 seconds," he elaborated, referring to the Network Time Protocol, which is used to sync computer clocks to a network time signal.
"I started looking through the NTP client logs. Most of the entries are from the client adjusting the clock a small fraction of a second forward or back. I filter out all the entries with adjustments starting with '-0.' or '0.', and I find some much larger adjustments. Tens of thousands of seconds. Hundreds of thousands. Millions. I find one adjustment of around -4 billion seconds, then it's back to millions or hundreds of thousands of seconds at a time."
According to Zimmie, the system believed that the NTP server had set the time to 2010, 2019, 2028, 2037, 1907, 1918, and so on until it cycled back to the present.
"The NTP client in the OS is really bad, and it's clearly allowing time adjustments much larger than are reasonable," he lamented. "The CA was running on an appliance OS provided by my employer, so we own the NTP client. Third bug of the day. At least now we know why the certificate authority thought it was 2037!"
That still left the question of why the NTP client raced forward through the available timestamp space.
"It happened again a few days later, and we got a packet capture proving it actually handed out the bogus timestamps," he explained. "Unfortunately, the NTP server was an appliance from another vendor, so I can only speculate. I think it was a value type mistake in C."
Zimmie told us: "C has the concept of signed integers and unsigned integers, but the separation between them is extremely weak. You can tell the system to take this signed integer value and store it somewhere, then you can accidentally interpret it as an unsigned value later.
"For positive integers, this doesn't matter. It becomes a problem with negative integers, though. For several reasons, they are stored in memory in what is called 'two's complement' form. When you store the value -1 as a 32-bit signed integer, the representation in memory is 0xFFFFFFFF. If you then interpret that as an unsigned integer, you get the value 4,294,967,295."
According to Zimmie, the NTP server used a radio receiver to listen to an upstream time source.
"These radios are commonly GPS receivers, but CDMA cell networks at the time also provided precise enough time information. I suspect the NTP server had a badly faulty internal clock which ran very fast. I also suspect it lost the radio signal for long enough for the internal clock to get at least one whole second ahead of the time acquired from the radio.
"When it reacquired the radio signal, I think it subtracted the internal clock time from the radio time to find out how many seconds off it was. If I'm right, it got a small negative number. I think it then treated this as an unsigned integer when figuring out how many seconds behind it is and the small negative number turned into a HUGE positive number. With that, it raced ahead 4.29 billion seconds as fast as it could."
"In my experience, the most baffling behavior is almost always caused by very small mistakes," quipped Zimmie. "This small mistake would explain the behavior."
Zimmie wrote in his post that recovery was difficult. "Without the VPN connections working, they needed someone to physically go to around a hundred sites and get each one to trust the new CA," he explained. "It also ended up happening a few times before everything could be fixed."
The Register asked Zimmie why he believes this story has resonated with people who have seen it on social media.
"People enjoy mysteries," he replied. "We want to understand. I like telling that story because it illustrates how a simple problem can actually be bigger than you think, but big, weird problems can still be broken down and understood. I think that resonates with a lot of people."
It's also a bit of fun when you're actually solving the mystery.
"I generally start troubleshooting an issue by asking the system what it is doing," explained Zimmie. "Packet captures, poking through logs, and so on. After a few rounds of this, I start hypothesizing a reason, and testing my hypothesis. Basic scientific method stuff. Most are simple to check. When they're wrong, I just move on. As I start narrowing down the possibilities, and the hypotheses are proven, it's electric. Making an educated guess and proving it's right is incredibly satisfying."
Zimmie told The Register that NTP implementation problems like this are uncommon.
"I think the closest thing I've seen in the last ten years was that gpsd bug you wrote about in late 2021," he recalled. "Of course, the nastiest bugs and implementation issues seem to wait to strike until we least expect it. I'm definitely going to be on the lookout for NTP issues as we close in on 2038." ®