
Poor Meta. Technical debt and user training made its exabyte-scale data migration tricky

Welcome to the real world, kids. And for the rest of us, a future in which Meta is – gulp! – better at large-scale analytics

Here’s one from the “welcome to the real world, kids, we have no sympathy for your plight” files: social media giant Meta’s engineering team has bemoaned the complexity of migrating from legacy technology.

In a Thursday post detailing the migration of exabyte-scale data stores to new schemas, a quartet of Meta software engineers offered the following insight into their work.

Migrations are hard. Moreover, they become much harder at Meta because of:
  • Technical debt: Systems have been built over years and have various levels of dependencies and deep integrations with other systems.
  • Nontechnical (soft) aspects: Walking users through the migration process with minimum friction is a fine art that needs to be honed over time and is unique to every migration.

Fellas, we’re going to let you in on a secret: everyone gets technical debt, and everyone has trouble educating users about new systems.

Meta is not special. It is not a beautiful and unique snowflake. It is the same sort of decaying collection of cobbled-together tech that every other organisation accrues over time.

In this case, the decrepit tech was “numerous heterogeneous services, such as warehouse data storage and various real-time systems, that make up Meta’s data platform — all exchanging large amounts of data among themselves as they communicate via service APIs.”

As Meta detailed in December 2022, those systems struggled to scale as the data-harvesting giant built more AI workloads that needed to access data from diverse sources.

Improved data logging and serialization were the answer, so that data could describe itself more effectively and therefore be more easily ingested by diverse applications.
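For the curious, here's a minimal sketch of the self-describing idea: each record carries a schema ID that consumers resolve against a registry before decoding, so no reader needs out-of-band knowledge of the layout. The registry, field names, and JSON payload below are our own assumptions for readability; Tulip's actual wire format is a far more compact binary encoding.

```python
import json
import struct

# Hypothetical schema registry. In a real platform this would be a
# versioned service, not an in-memory dict.
SCHEMA_REGISTRY = {
    1: ["user_id", "event", "timestamp"],
}

def encode(schema_id: int, values: list) -> bytes:
    """Prefix the payload with its schema ID so any reader can decode it."""
    payload = json.dumps(values).encode("utf-8")
    # 4-byte big-endian schema ID header, then the payload.
    return struct.pack(">I", schema_id) + payload

def decode(blob: bytes) -> dict:
    """Read the schema ID from the header, look up the field names, decode."""
    (schema_id,) = struct.unpack(">I", blob[:4])
    fields = SCHEMA_REGISTRY[schema_id]
    values = json.loads(blob[4:])
    return dict(zip(fields, values))

record = encode(1, [42, "page_view", 1690000000])
print(decode(record))  # {'user_id': 42, 'event': 'page_view', 'timestamp': 1690000000}
```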

Meta built a system called “Tulip” to sort that out. And was chuffed that the formats it used required 40 percent to 85 percent fewer bytes and consumed 50 percent to 90 percent fewer CPU cycles.

As Meta’s Thursday post explains, Tulip may have been top tech, but making it work was hard, not least because the social media giant employed over 30,000 logging schemas.

Across the four-year effort to adopt Tulip, Meta engineers found that some data couldn’t easily be ingested or converted, or that doing so was computationally expensive. Some tools designed to ease the migration created problems as they ran, so engineers built rate limiters to stop issues snowballing.
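The rate-limiting idea, at least, travels well beyond Meta. Below is a minimal token-bucket sketch of the kind of throttle a migration tool might wrap around its conversion work; the rates, names, and placeholder job are our own illustration, not Meta's code.

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical usage: cap a schema-conversion backfill at 100 batches per second,
# backing off when the bucket empties rather than letting load snowball.
limiter = TokenBucket(rate=100, capacity=200)
for batch in range(1000):
    while not limiter.try_acquire():
        time.sleep(0.01)
    # convert_batch(batch)  # stand-in for the real migration work
```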

And then there were those pesky users, whose role in planting Tulip in Meta’s tech garden necessitated a migration guide, an instructional video, and a support team.

“Making huge bets such as the transformation of serialization formats across the entire data platform is challenging in the short term, but it offers long-term benefits and leads to evolution over time,” the post winds up.

“Designing and architecting solutions that are cognizant of both the technical as well as nontechnical aspects of performing a migration at this scale are important for success,” the post adds. “We hope that we have been able to provide a glimpse of the challenges we faced and solutions we used during this process.”

Meta’s four engineers have probably offered useful insights for those facing similar data-wrangling challenges. The rest of you who have lived through legacy migrations? Maybe less so.

And for everyone else, the insight here is that Meta has become more efficient at wielding exabytes of data. Much of it gathered from, and about, you. ®
