Walk with me... through a billion files. Slow down – admire the subset

Qumulo and the tree-walking problem


Analysis If you ask your notebook's filesystem how many MP3 files it is storing that haven’t been opened in 30 days, you can find the answer reasonably quickly. But ask an enterprise’s file system when it holds a million files and you have a big problem.

Ask this question of a file system that holds a billion files and your day just got a whole lot worse.

Here's a filesystem 101 to explain why this happens.

A file system is like an upside-down tree structure of files in folders, with the folders forming a directory tree descending from a single root. Each node in this structure lists the files it contains, plus data about the files, and the sub-folders it contains. There is no central directory in one place listing all this stuff for the entire filesystem.

Upside-down file system tree structure

So, to answer the initial question, the system has to traverse, or walk, the file system tree. At each node (F1-1, F1-2) it looks for files with the .MP3 extension, checks their last-opened date, and adds them to a list if they match the filter criteria. If there are nested sub-folders (F2-1, F2-2), it has to walk down the tree to the first one (F2-1) and repeat the process, and repeat again for any sub-folder of that, until it gets to the bottom of that series of nodes. Then it walks back up until it reaches a node where another sub-folder (F2-2) is listed, goes to that, and so on until it has visited every node in the file system.
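For the curious, the walk described above can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production code: the function name `find_stale_mp3s` and its parameters are our own invention, and it assumes the file system records last-access times at all (many are mounted with that switched off or relaxed).

```python
import os
import time


def find_stale_mp3s(root, days=30, now=None):
    """Walk every folder under root; list .mp3 files not opened in `days` days.

    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400          # seconds in a day
    matches = []
    stack = [root]                       # folders still to visit
    while stack:
        folder = stack.pop()
        with os.scandir(folder) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # A sub-folder: it must be visited too
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False) and entry.name.lower().endswith(".mp3"):
                    # One metadata lookup per candidate file
                    if entry.stat(follow_symlinks=False).st_atime < cutoff:
                        matches.append(entry.path)
    return matches
```

Note that every folder has to be opened and read, whether or not it holds a single MP3. That is the whole problem: the cost scales with the size of the tree, not with the size of the answer.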

Assume each node access requires a disk access and that this takes 10 milliseconds. A 10-node file system would then take roughly 100ms, plus the accesses needed to walk back up the tree; say 150ms, to be simplistic.

On the same basis, a 100-node file system would take 1,500ms, a thousand-node one 15,000ms, a million-node one 15,000,000ms and a billion-node one 15,000,000,000ms. Like we said, your day just got a whole lot worse, because the tree walk is going to take days: 173.6 of them, if our often suspect math is correct.
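Anyone wanting to check the sums can do so with the sketch below. It bakes in the same simplistic assumption as the text, an effective 15ms per node (10ms for the access plus a rough allowance for walking back up), and the names `walk_time_ms` and `ms_to_days` are ours.

```python
MS_PER_NODE = 15  # assumed: 10 ms disk access + ~5 ms allowance for walking back up


def walk_time_ms(nodes, ms_per_node=MS_PER_NODE):
    """Estimated time for a serial tree walk, in milliseconds."""
    return nodes * ms_per_node


def ms_to_days(ms):
    """Convert milliseconds to days (86,400 seconds per day)."""
    return ms / 1000 / 86_400


for nodes in (10, 100, 1_000, 1_000_000, 1_000_000_000):
    ms = walk_time_ms(nodes)
    print(f"{nodes:>13,} nodes: {ms:>14,} ms = {ms_to_days(ms):.1f} days")
```

The billion-node case works out to 15,000,000 seconds, which divided by 86,400 seconds in a day gives the 173.6 days quoted above.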

Qumulo CTO and co-founder Peter Godman, presenting at a press briefing, says these kinds of numbers aren’t imaginary. A major DreamWorks picture needs 500 million files, and legacy kit – meaning pre-2010 – can’t cope with this kind of filesystem request. It goes into a kind of tree-walk paralysis, which makes file systems of this size hard to manage and optimise.

Qumulo says tree walks make data management tasks days to weeks long, leading to data blindness.

Qumulo co-founder and CTO Peter Godman

Seemingly simple requests – such as how many MP3 files there are that haven’t been opened in 30 days – are practically impossible to accomplish, let alone telling the filesystem to move them off to cheap back-end cloud storage.

Qumulo’s marketing VP, Jay Wampold, says: “When you have a billion anything, humans can’t manage it.”

As Godman says, the metadata processing involved becomes a problem in its own right: “The metadata itself is a big data problem at scale,” and with QF2 (Qumulo File Fabric) “we have real-time control of files at scale.”

You can’t retrofit the necessary metadata generation, storage and access to an existing file system. It has to be designed in, which it has been in Qumulo’s scale-out filesystem (QFS), with its underlying scalable block store (SBS).

There is a Qumulo database component, an extension of traditional file system metadata, which puts virtual fields in file metadata, and has an analytics capability.

Qumulo invented it and built it and it is distributed across nodes. It is the firm's own metadata database, and a property of its file system tree, not a separate "box" containing metadata.

A Qumulo QF2 technical overview (PDF) declares:

When you have a large numbers of files, the directory structure and file attributes themselves become big data. As a result, sequential processes such as tree walks, which are fundamental to legacy storage, are no longer computationally feasible. Instead, querying a large file system and managing it requires a new approach that uses parallel and distributed algorithms.

The technical paper says: “There is an inode B-tree, which acts as an index of all the files. The inode list is a standard file-system implementation technique that makes checking the consistency of the file system independent of the directory hierarchy. Inodes also help to make update operations such as directory moves efficient.

Files and directories are represented as B-trees with their own key/value pairs, such as the file name, its size and its access control list (ACL) or POSIX permissions.

This reliance on B-trees that point to virtualized protected block storage in SBS is one of the reasons that in QF2, a file system with a trillion files is feasible.

QumuloDB analytics are built in and integrated with the file system itself. Because the QF2 file system relies on B-trees, the analytics can use a system of real-time aggregates and information is available for timely processing without costly file system tree walks.”
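To see why aggregates kept in the tree remove the need for a walk, consider the toy sketch below. It is emphatically not Qumulo's implementation, which the paper describes only at a high level; the `Dir` class and its fields are hypothetical. The idea is simply that each folder carries running totals for everything beneath it, updated whenever a file is added.

```python
class Dir:
    """Toy folder node that keeps running totals for its whole subtree.

    Hypothetical illustration of the aggregate idea, not QF2 code.
    """

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.file_count = 0    # files in this folder and all folders below it
        self.total_bytes = 0   # bytes in this folder and all folders below it

    def add_file(self, size):
        # Update this folder and every ancestor up to the root.
        # Cost is proportional to tree depth, not to tree size.
        node = self
        while node is not None:
            node.file_count += 1
            node.total_bytes += size
            node = node.parent


root = Dir("root")
f1 = Dir("F1-1", parent=root)
f2 = Dir("F2-1", parent=f1)
f2.add_file(100)
f1.add_file(50)
print(root.file_count, root.total_bytes)
```

Asking the root how many files it holds is now a single lookup, however big the tree gets. The price is a little extra work on every write, and a real system would also have to handle deletes, moves and concurrent updates across nodes.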

Read the technical overview to find out more. ®

Biting the hand that feeds IT © 1998–2022