The 4KB Lie: Why Stubbing Files Doesn't Fix Your Metadata Problem

In this article

Payload vs. Metadata: The Real-World Terminal Test
The File System Tax (Why a 12-byte file takes 4KB)
The Directory Traversal Death Trap
The Catalog Solution: Solving the ls -l Problem
The Two-Tier Archiving Strategy

Payload vs. Metadata: The Real-World Terminal Test

The marketing guy loves to talk about "0-byte files." The pitch is simple: software intercepts the file, moves it to an archive, and leaves behind a "ghost" or a "stub." The user still sees the file, but it takes up zero space on the SSD.

As engineers, we know that is a lie. A file cannot exist in a POSIX file system without taking up some space. To understand exactly what happens when HuskHoard stubs a file, look at this real-world output from our XFS test lab:

jo@jo-N-A:~/huskhoard/hot_tier$ ls -l
total 40
drwxr-xr-x 2 jo jo     4096 May 26 23:43 deep_project_folder
-rw-rw-r-- 1 jo jo 12582912 Jun  6 23:08 dummy_data5.bin
-rw-r--r-- 1 jo jo 36700190 May 26 23:43 streamgate_test.bin
-rw-r--r-- 1 jo jo       12 May 26 23:43 tiny_file.txt

jo@jo-N-A:~/huskhoard/hot_tier$ du -h
8.0K    ./deep_project_folder
52K     .
    

This output is the "smoking gun" of how archiving works at the kernel level.

Look at the Logical View (ls -l). The OS reports that streamgate_test.bin is 36.7 MB, and dummy_data5.bin is 12.6 MB (exactly 12 MiB). To Plex, Premiere Pro, or Windows Explorer, those files exist and are fully sized.

Now look at the Physical View (du -h). The entire folder occupies only 52 KB of disk space. Without HuskHoard, du would report roughly 50 MB. We have reclaimed about 99.9% of the space these files would otherwise consume on the SSD. The physical data blocks were freed by calling the Linux fallocate() system call with the FALLOC_FL_PUNCH_HOLE flag.
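The same gap between logical and physical size can be observed from a script. The sketch below is a minimal illustration, not HuskHoard's code: it creates a sparse file with os.truncate as a stand-in for a hole-punched stub (both keep their full logical size while holding no data blocks) and compares st_size with st_blocks. The helper names and the temporary file are assumptions made for the demo.

```python
import os
import tempfile


def logical_and_physical(path):
    """Return (logical_bytes, physical_bytes) for a file.

    st_size is the size `ls -l` shows. st_blocks counts allocated
    512-byte units on Linux, which is what `du` adds up.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file with a logical size but no written data.

    Stand-in for a hole-punched stub: on filesystems that support
    sparse files, extending with truncate allocates no data blocks.
    """
    with open(path, "wb"):
        pass
    os.truncate(path, size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stub = os.path.join(tmp, "dummy_data5.bin")
        make_sparse_file(stub, 12582912)
        logical, physical = logical_and_physical(stub)
        print(f"logical={logical} bytes, physical={physical} bytes")
```

On a filesystem with sparse-file support, the logical figure stays at 12582912 bytes while the physical figure is far smaller, mirroring the ls -l versus du split above.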

But wait... if the big files were punched down to 0 physical blocks, where is that 52 KB coming from?

The File System Tax (Why a 12-byte file takes 4KB)

Let's look deeper into a sub-folder to see exactly where our SSD space is leaking.

jo@jo-N-A:~/huskhoard/hot_tier$ cd node_modules
jo@jo-N-A:~/huskhoard/hot_tier/node_modules$ ls -l
total 4
-rw-r--r-- 1 jo jo 12 May 26 23:43 ignore_me.txt

jo@jo-N-A:~/huskhoard/hot_tier/node_modules$ du -h
8.0K    .
    

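Here lies the fundamental limitation of standard filesystems. The exact math varies: ext4 typically allocates data in 4 KB blocks, XFS uses 512-byte inodes by default and keeps small directories and symlinks inline (but still allocates a block for a regular file's data), and ZFS can embed very small payloads in the dnode's block pointer. Either way, every filesystem imposes a Metadata Tax. A 12-byte file isn't just 12 bytes; it brings an entourage of structural overhead. In the du output above, the 8.0K is consistent with one 4 KB block for the file's 12 bytes plus one 4 KB block for the directory itself.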

The File Tax: Even if you punch the payload to zero bytes, the file still requires an Inode (or dnode in ZFS) to store permissions, timestamps, and block pointers. For as long as the file exists, this occupies anywhere from 256 bytes to several kilobytes of disk space per file.
The Directory Tax: A folder is not just a visual container; it is a hidden database structure (using B-Trees, HTrees, or ZAPs) that maps filenames to inodes. The more files you have, the more these directory structures consume physical blocks.

This is why millions of tiny files, or millions of 0-byte stubs, suffocate an NVMe drive. If you leave 1,000,000 stubs on a drive, you aren't just spending anywhere from hundreds of megabytes to several gigabytes on metadata; you are actively degrading the kernel's ability to cache and traverse its own directory trees.
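To put rough numbers on the tax, here is a back-of-the-envelope sketch. The inode sizes are common defaults (256 bytes and 512 bytes), and the 32 bytes of directory-entry space per filename is a hypothetical figure chosen for illustration; real overhead depends on the filesystem, name lengths, and directory layout.

```python
def metadata_footprint(num_files, inode_bytes, dirent_bytes):
    """Rough lower bound on metadata bytes held by num_files stubs."""
    return num_files * (inode_bytes + dirent_bytes)


if __name__ == "__main__":
    stubs = 1_000_000
    # Assumed per-file costs: a default-sized inode plus a hypothetical
    # 32 bytes of directory-entry space for each filename.
    for label, inode_bytes in (("256-byte inodes", 256), ("512-byte inodes", 512)):
        total = metadata_footprint(stubs, inode_bytes, dirent_bytes=32)
        print(f"{label}: {total / 1e6:.0f} MB")
```

Even this optimistic estimate lands in the hundreds of megabytes for a million stubs, before counting B-tree interior nodes, journaling, or extended attributes.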

The Directory Traversal Death Trap

So, we run HuskHoard. We punch holes in the big files, saving Terabytes of capacity. We are safe, right?

Wrong. Stubbing files solves the Capacity problem. It does not solve the Metadata problem.

Even if a file is hole-punched to zero payload blocks, the Inode remains in the filesystem. The filename remains in the Directory tree. If you run ls -l in a directory with 1,000,000 hole-punched stubs, your terminal will still hang. Why? Because ls -l has to stat every entry, so the OS issues up to one million tiny metadata reads (chatter) against the SSD to pull the permissions, UID, and timestamps for every single stub, and ls sorts the whole list before it prints the first line.
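The per-entry cost is easy to see in a small sketch that mimics what a long listing has to do: read the directory, then fetch metadata for each name. This is an illustration using Python's os.scandir, not the actual implementation of ls; the function name and the demo directory are assumptions.

```python
import os
import tempfile


def long_listing(directory):
    """Mimic the work behind `ls -l`: read entries, then stat each one."""
    rows = []
    stat_calls = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # One metadata lookup per entry (a system call on Linux).
            st = entry.stat(follow_symlinks=False)
            stat_calls += 1
            rows.append((entry.name, st.st_mode, st.st_uid, st.st_size, st.st_mtime))
    rows.sort()  # like ls, nothing can be printed until everything is read
    return rows, stat_calls


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(1000):
            open(os.path.join(tmp, f"stub_{i:04d}.bin"), "wb").close()
        rows, stat_calls = long_listing(tmp)
        print(f"{len(rows)} entries required {stat_calls} stat lookups")
```

The number of lookups grows one-for-one with the number of stubs, regardless of how many payload blocks each stub holds.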

The Inode Reality

If you leave 1,000,000 stubs on your SSD, you are still forcing the filesystem to manage 1,000,000 inodes. Over time, the internal structures that manage these entries (XFS B-trees, ZFS ZAPs, or ext4 HTrees) become large, fragmented, and excruciatingly slow to traverse. "0 bytes of payload" still equals "A massive metadata headache for your kernel."

The Catalog Solution: Solving the `ls -l` Problem

This is where the marketing pitch ends and the engineering reality begins. If you want to build a Petabyte-scale archive with millions of files, you cannot leave the stubs on the SSD forever. You must transition from a Linear Search (Filesystem) to an Indexed Search (Database).

In HuskHoard, every archived file is indexed in a local husk_catalog.db (SQLite). SQLite stores tables and indexes as B-Trees, so an indexed lookup among a million records typically takes milliseconds, whereas crawling a million inodes with readdir() and stat() can take seconds or even minutes.
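As a rough illustration of why an indexed catalog answers quickly, the sketch below builds a small SQLite table and asks SQLite how it plans to run a lookup. The schema (table files with path, size_bytes, mtime, tape_label) is hypothetical; the real layout of husk_catalog.db is not shown in this article.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open a catalog with a hypothetical schema (not HuskHoard's real one)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            path       TEXT PRIMARY KEY,
            size_bytes INTEGER NOT NULL,
            mtime      INTEGER NOT NULL,
            tape_label TEXT NOT NULL
        )
        """
    )
    # Secondary index so age-based queries avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_mtime ON files (mtime)")
    return conn


def find_by_path(conn, path):
    """Look up one record through the primary-key B-tree."""
    return conn.execute(
        "SELECT path, size_bytes, mtime, tape_label FROM files WHERE path = ?",
        (path,),
    ).fetchone()


def query_plan(conn, sql, params=()):
    """Return SQLite's human-readable plan steps for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


if __name__ == "__main__":
    conn = open_catalog()
    rows = (
        (f"/hot_tier/project/file_{i:07d}.bin", 4096, 1700000000 + i, "TAPE001")
        for i in range(100_000)
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    print(find_by_path(conn, "/hot_tier/project/file_0042000.bin"))
    print(query_plan(conn, "SELECT path FROM files WHERE path = ?", ("x",)))
```

A plan step that begins with SEARCH means SQLite walks an index instead of scanning every row, which is the O(log N) behavior the catalog relies on.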

Filesystem Traversal (ls -l): Linear & Chattery

✗ Must load the directory block to find pointers
✗ Must seek to 1,000,000 distinct inodes on disk
✗ No caching for complex metadata queries

Catalog Traversal (SQLite): Indexed & Grouped

✓ Data is stored sequentially in contiguous pages
✓ Primary Key & Index lookups are O(log N)
✓ Entire index usually fits in system RAM

The Two-Tier Archiving Strategy

Because of this physical limitation, admins using HuskHoard don't treat all data the same. They employ a two-tiered Janitor strategy to keep their NVMe SSDs completely clean of metadata excess.

Tier 1: The Transparent Stub (Warm Data)

When a project finishes, HuskHoard archives the files and punches holes in the payload. The stubs remain on the SSD. Your editors and software can still see them in their standard folders. You save Terabytes of capacity, but you accept that the Inode metadata is still taking up space. This is perfect for files aged 30-90 days that you might need to "thaw" quickly.

Tier 2: The Deep Archive (Cold Data)

Once data ages past a year, you don't need it cluttering your SMB shares. This is where HuskHoard's real strength comes in: Metadata Offloading.

You delete the stub from the SSD entirely (e.g., rm -rf /hot_tier/old_project/), which drops the file from the filesystem. The payload is safe on your raw LTO Tape or SMR drive. The metadata is safe inside the SQLite Catalog. Your SSD reclaims its inodes, its directory trees shrink back down to lightning-fast sizes, and the data remains just one SQLite query away.
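The two tiers boil down to an age-based decision. The sketch below shows one way such a Janitor rule could be expressed; the function name, the action labels, and the default thresholds (30 days to stub, 365 days to remove the stub) are assumptions drawn from the ages mentioned above, not HuskHoard configuration options.

```python
from datetime import datetime, timedelta


def choose_action(last_modified, now, stub_after_days=30, purge_after_days=365):
    """Pick a tiering action for a file based on its age.

    Thresholds are hypothetical defaults, not HuskHoard settings.
    """
    age = now - last_modified
    if age >= timedelta(days=purge_after_days):
        return "remove_stub"   # Tier 2: catalog and tape only
    if age >= timedelta(days=stub_after_days):
        return "keep_stub"     # Tier 1: hole-punched stub stays on the SSD
    return "leave_hot"         # still active, payload stays on the SSD


if __name__ == "__main__":
    now = datetime(2025, 6, 1)
    for days in (5, 60, 400):
        print(days, choose_action(now - timedelta(days=days), now))
```

Keeping the thresholds as parameters makes it easy to tune how long stubs linger before their inodes are reclaimed.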

The Database is the Source of Truth

A common fear is that deleting a file from the Hot Tier means it will be considered "dead" and lost forever during the next Tape Repack (Garbage Collection). This is not how HuskHoard works.

By design, the HuskHoard background daemon ignores filesystem deletion events. If you delete a stub from the SSD, the SQLite Database doesn't blink. The file remains fully indexed. When you run a repack operation to reclaim tape space, the engine never looks at the SSD—it queries the database for the latest active versions, ensuring your deep-archived data is safely carried over to the new tape. To get the file back, you simply use the husk restore CLI command to query the catalog and seamlessly re-materialize the file onto the Hot Tier.
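The idea that the catalog, not the filesystem, decides what survives a repack can be sketched as follows. The file_versions table, its columns, and the repack_candidates helper are hypothetical names used only for illustration; they are not taken from HuskHoard's actual schema or engine.

```python
import os
import sqlite3
import tempfile


def open_catalog():
    """In-memory catalog with a hypothetical versioned schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE file_versions (
            path       TEXT NOT NULL,
            version    INTEGER NOT NULL,
            tape_label TEXT NOT NULL,
            active     INTEGER NOT NULL DEFAULT 1,
            PRIMARY KEY (path, version)
        )
        """
    )
    return conn


def repack_candidates(conn):
    """Latest active version of every path, read from the catalog only."""
    return conn.execute(
        """
        SELECT v.path, v.version, v.tape_label
        FROM file_versions AS v
        JOIN (
            SELECT path, MAX(version) AS version
            FROM file_versions
            WHERE active = 1
            GROUP BY path
        ) AS latest
          ON latest.path = v.path AND latest.version = v.version
        ORDER BY v.path
        """
    ).fetchall()


if __name__ == "__main__":
    conn = open_catalog()
    with tempfile.TemporaryDirectory() as hot_tier:
        stub = os.path.join(hot_tier, "old_project.bin")
        open(stub, "wb").close()
        conn.executemany(
            "INSERT INTO file_versions (path, version, tape_label) VALUES (?, ?, ?)",
            [(stub, 1, "TAPE001"), (stub, 2, "TAPE001")],
        )
        os.remove(stub)  # deep-archive step: the stub leaves the SSD
        print(os.path.exists(stub))     # the filesystem no longer has the file
        print(repack_candidates(conn))  # the catalog still lists version 2
```

Because the query never touches the hot tier, removing the stub changes nothing about which versions are carried to the new tape.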

End Game

Hole-punching is a fantastic tool for keeping SSD capacity in check, but do not let it fool you into thinking it solves filesystem scale. Filesystems are designed to manage active, live data. Relational databases are designed to index millions of records.

By using HuskHoard's MultiTapeWriter to pack tiny files sequentially on the backend, and its SQLite catalog to index them on the frontend, you finally fix the ls -l problem. You let the filesystem do what it does best, and you let the database handle the rest.