Why the File System Can't Be the Source of Truth
In the previous post we established that HuskHoard's stubbing system deliberately severs the normal relationship between a file path and the physical bytes it represents. A stub looks like a 47 GB file but occupies zero disk blocks. The file system has been turned into a presentation layer.
That raises an uncomfortable question: if the file system no longer holds the data, and the data is written sequentially onto a tape cartridge, a flat image file on disk, or an rclone cloud backend, how does HuskHoard know where on that volume any given file lives? Tape is not a random-access device. You can't ask it "give me the file at path /projects/alpha/final.mov"; you can only seek to a byte offset and read forward from there.
The answer is husk_catalog.db: a SQLite database that functions as HuskHoard's permanent index of everything it has ever archived. Every operation in the system — stub creation, transparent restore, point-in-time version rollback, integrity scrubbing, and garbage-collection repacking — ultimately begins by querying this file.
The catalog is not a backup of metadata. It is the authoritative map. Without it, you have a tape full of data with no way to look anything up. With it, you can reconstruct an entire presentation layer from zero in seconds.
The Catalog Schema: What Gets Recorded and Why
The heart of husk_catalog.db is the catalog table. Here is the exact schema as defined in database.rs:
Every column earns its place. Let's walk through the important ones.
tape_uuid + tape_offset: The Physical Address
These two columns together form a complete physical address for the archived payload. The tape_uuid identifies which volume the data lives on — not by a human-readable name or a device path like /dev/nst0 (which can change when you plug a drive into a different port), but by the 16-byte UUID that was burned into the volume's header block when it was first formatted. The tape_offset is the exact byte position on that volume where the object's 4 KB header begins.
This design means HuskHoard doesn't care what port your LTO drive is plugged into, or whether an image file has been renamed, or whether an rclone remote has been reconfigured. As long as the UUID matches, the restore succeeds.
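As a minimal sketch of how such an address might be used, the snippet below seeks to a recorded offset and reads one 4 KB header. The names PhysicalAddress and read_object_header are hypothetical, and the volume is modelled as anything that implements Read + Seek, which is an assumption rather than HuskHoard's actual volume abstraction.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Size of the per-object header described in the post (4 KB).
const HEADER_SIZE: usize = 4096;

/// Hypothetical in-memory form of the two catalog columns.
#[derive(Debug, Clone, PartialEq)]
pub struct PhysicalAddress {
    pub tape_uuid: [u8; 16], // which volume holds the object
    pub tape_offset: u64,    // byte position where the object's header begins
}

/// Seek to the recorded offset and read exactly one header's worth of bytes.
/// `volume` stands in for whatever handle was opened for `addr.tape_uuid`;
/// how that handle is found is outside this sketch.
pub fn read_object_header<V: Read + Seek>(
    volume: &mut V,
    addr: &PhysicalAddress,
) -> io::Result<Vec<u8>> {
    volume.seek(SeekFrom::Start(addr.tape_offset))?;
    let mut header = vec![0u8; HEADER_SIZE];
    // Fails with UnexpectedEof if fewer than 4096 bytes remain.
    volume.read_exact(&mut header)?;
    Ok(header)
}
```

Nothing in the function refers to a device path, which is the point: the caller resolves the UUID to a handle first, and the offset does the rest.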
version: Automatic Point-in-Time History
Every time a file is re-archived, HuskHoard doesn't overwrite the previous catalog entry. Instead, it increments the version counter and writes a new row. Because tape is an append-only medium — you can't punch a hole in the middle of a cartridge — old versions naturally accumulate. The catalog tracks all of them.
This gives you automatic point-in-time version history at essentially zero overhead. If you accidentally overwrite a critical video project and the Janitor archives the damaged version, you can roll back to any prior version with a single husk restore command by passing an explicit --version flag. The default behavior always selects the highest (most recent) version. It is also worth pointing out that HuskHoard writes sequentially to all media: every volume, whatever its backing store, is treated as though it were a tape. Data is appended in order, starting at a fixed LBA, rather than written to random locations on the volume.
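The selection rule can be sketched with an in-memory list of rows. CatalogRow, select_version, and append_version are hypothetical names used for illustration, and numbering versions from 1 is an assumption; the real lookup is a query against the catalog table.

```rust
/// Hypothetical subset of a catalog row; only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogRow {
    pub file_path: String,
    pub version: u32,
    pub tape_offset: u64,
}

/// Pick the row to restore: an explicit version if one is requested,
/// otherwise the highest version recorded for the path.
pub fn select_version<'a>(
    rows: &'a [CatalogRow],
    file_path: &str,
    requested: Option<u32>,
) -> Option<&'a CatalogRow> {
    let mut candidates = rows.iter().filter(|r| r.file_path == file_path);
    match requested {
        Some(v) => candidates.find(|r| r.version == v),
        None => candidates.max_by_key(|r| r.version),
    }
}

/// Re-archiving appends a new row with the next version number.
/// Existing rows are never modified. Starting at 1 is an assumption.
pub fn append_version(rows: &mut Vec<CatalogRow>, file_path: &str, tape_offset: u64) -> u32 {
    let next = rows
        .iter()
        .filter(|r| r.file_path == file_path)
        .map(|r| r.version)
        .max()
        .map_or(1, |v| v + 1);
    rows.push(CatalogRow {
        file_path: file_path.to_string(),
        version: next,
        tape_offset,
    });
    next
}
```

Passing None mirrors the default restore behavior, and Some(n) mirrors an explicit --version n.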
Version Management
--version N on any restore command
max_versions config
blake3_hash: The Integrity Anchor
The BLAKE3 hash of the uncompressed payload is recorded at archive time and stored permanently in the catalog. This is what the husk scrub command uses when it performs a deep integrity check: it reads every byte off the volume, decompresses on the fly into a hash sink (using std::io::sink() so no disk space is needed), and compares the resulting hash against the catalog's stored value. Any mismatch is flagged as corruption.
Critically, the hash is of the original content, not the compressed bytes. This means you can upgrade the compression algorithm in a future HuskHoard version without invalidating any existing integrity records.
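The "hash sink" idea can be sketched as a writer that hashes what it receives and stores nothing. To keep the example free of dependencies, std's DefaultHasher stands in for BLAKE3 and the decompression step is omitted; HashSink and verify are hypothetical names, not HuskHoard's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Write};

/// A writer that hashes every byte it is given and keeps none of them.
/// DefaultHasher is only a stand-in here; the real digest is BLAKE3.
pub struct HashSink {
    hasher: DefaultHasher,
    discard: io::Sink,
}

impl HashSink {
    pub fn new() -> Self {
        HashSink { hasher: DefaultHasher::new(), discard: io::sink() }
    }

    /// Digest of everything written so far.
    pub fn finish(&self) -> u64 {
        self.hasher.finish()
    }
}

impl Write for HashSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.hasher.write(buf); // feed the hash
        self.discard.write(buf) // then drop the bytes
    }

    fn flush(&mut self) -> io::Result<()> {
        self.discard.flush()
    }
}

/// Stream `payload` through the sink and compare with the stored digest.
/// In a real scrub the reader would be a decompressor over the volume.
pub fn verify<R: Read>(mut payload: R, expected: u64) -> io::Result<bool> {
    let mut sink = HashSink::new();
    io::copy(&mut payload, &mut sink)?;
    Ok(sink.finish() == expected)
}
```

Because the sink discards its input, memory and disk use stay constant regardless of the object's size.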
The Tapes Table: Mapping UUIDs to Hardware
The catalog table knows which UUID holds a given file, but it doesn't know where that UUID is physically located right now. That's the job of the tapes table.
The drive_serial column is what allows HuskHoard to survive hardware moves. On startup, the daemon calls rescan_tape_drives(), which walks /sys/block and /sys/class/scsi_tape looking for block and character devices whose hardware serial numbers match entries in this table. If your LTO drive was on /dev/nst0 yesterday and is on /dev/nst1 today because you rearranged your HBA, the catalog automatically heals the mapping before the first restore attempt.
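The healing step amounts to re-joining stored serial numbers against a fresh hardware scan. The sketch below takes the scan result as a plain map instead of walking sysfs; TapeRow, its device_path field, and heal_device_paths are hypothetical and are not taken from the actual tapes table or from rescan_tape_drives().

```rust
use std::collections::HashMap;

/// Hypothetical subset of a `tapes` row.
#[derive(Debug, Clone, PartialEq)]
pub struct TapeRow {
    pub tape_uuid: [u8; 16],
    pub drive_serial: String,
    /// Assumed column: current device node, None when the drive is absent.
    pub device_path: Option<String>,
}

/// Update each row's device path from a fresh scan (serial -> device path).
/// Returns how many rows were changed.
pub fn heal_device_paths(rows: &mut [TapeRow], scan: &HashMap<String, String>) -> usize {
    let mut changed = 0;
    for row in rows.iter_mut() {
        let current = scan.get(&row.drive_serial).cloned();
        if row.device_path != current {
            row.device_path = current;
            changed += 1;
        }
    }
    changed
}
```

Keying on the serial number means a drive that moves from /dev/nst0 to /dev/nst1 is simply re-pointed, while a drive that has disappeared is marked as absent rather than left with a stale path.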
The location_hint field is for human operators working with large offline libraries. When HuskHoard needs a volume that isn't inserted, it passes this hint to the Sidecar IPC bridge's WAKE_VOLUME event, which your automation layer can use to tell a robotic loader — or a human — exactly which shelf to fetch the cartridge from.
The object_frames Table: Range Queries and HTTP Streaming
Most people's mental model of an archive is: one file, one contiguous blob. HuskHoard has that, but it also stores a second index specifically to enable range-based access through the HTTP Streaming Gateway.
When HuskHoard archives a large file, it writes it through the Zstd compressor in chunks, recording the mapping between uncompressed byte positions and their corresponding positions in the compressed stream on the storage volume. This produces a jump table stored in object_frames.
Why does this matter? Consider what happens when Plex or VLC sends an HTTP Range: bytes=2147483648- request — asking to start playback from the two-gigabyte mark of a 12 GB video file. Without the frame index, HuskHoard would have to decompress from byte zero just to find that position. With the frame index, it can look up the nearest compressed frame boundary, seek directly to that offset on the volume, and begin streaming from there.
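The lookup itself is a binary search over the jump table. The sketch below assumes frames are sorted by their uncompressed start position; Frame, its field names, and locate are hypothetical and do not reflect the actual object_frames columns.

```rust
/// One jump-table entry: where a compressed frame starts in both
/// coordinate spaces (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Frame {
    pub uncompressed_start: u64, // position in the original file
    pub volume_offset: u64,      // position of the frame on the volume
}

/// Find the frame that contains uncompressed position `pos`.
/// `frames` must be sorted ascending by `uncompressed_start`.
/// Returns the frame plus the number of decompressed bytes to skip
/// before reaching `pos`, or None if no frame starts at or before it.
pub fn locate(frames: &[Frame], pos: u64) -> Option<(Frame, u64)> {
    // Index of the first frame that starts after `pos`.
    let idx = frames.partition_point(|f| f.uncompressed_start <= pos);
    if idx == 0 {
        return None;
    }
    let frame = frames[idx - 1];
    Some((frame, pos - frame.uncompressed_start))
}
```

For a Range request that starts mid-frame, the server would seek to volume_offset, decompress from there, and discard the returned number of bytes before sending data to the client.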
This is what allows HuskHoard to function as a transparent media server bridge. Your video player sends a standard HTTP Range request to localhost:8080, and HuskHoard handles the seek into the compressed stream entirely invisibly.
Why WAL Mode Is Non-Negotiable
HuskHoard runs two threads that both need to write to the catalog simultaneously: the foreground Interceptor (which handles live fanotify restore events from the OS) and the background Janitor Worker (which processes the archive queue). In standard SQLite journal mode, these two writers would collide constantly, producing "database is locked" errors at the worst possible moments — like when a user's video editor is waiting for a restore to complete.
The fix is the very first thing init_catalog() does: it switches the database into write-ahead logging (WAL) mode.
In WAL mode, readers never block writers and writers never block readers. SQLite still allows only one writer at a time, so if the Interceptor is updating a restored file's catalog entry at the same moment the Janitor is inserting new archive records, one of them briefly waits for the other to commit. A busy_timeout of 30 seconds is set so that this wait is absorbed silently instead of surfacing as a lock error, and in practice write contention is almost nonexistent.
The Nuclear Option: Rebuilding a Lost Catalog From Raw Storage Volumes
Inevitably, the question arises: what happens if I lose the catalog database? Maybe the SSD it lived on failed. Maybe someone ran rm in the wrong directory. The data is still physically on your storage volumes — but without the index, how do you find any of it?
This is what husk rebuild is for.
The rebuild command works because every archived object on a volume begins with a 4 KB ObjectHeader containing a fixed magic string (USTDHUSK), the original file path, the BLAKE3 hash, POSIX permissions, timestamps, and a CRC32 checksum of the header itself. The rebuilder simply walks the storage volume 4 KB at a time, checks for the magic bytes, validates the header CRC, and if both pass, inserts a new catalog row. It doesn't require any prior knowledge of what was on the volume.
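The scan loop can be sketched over an in-memory volume. The header layout used here is an assumption made for illustration (magic in the first 8 bytes, a little-endian CRC-32 of the preceding 4092 bytes in the last 4); the real ObjectHeader layout lives in HuskHoard's source, and crc32, is_valid_header, and scan_volume are hypothetical names.

```rust
const BLOCK: usize = 4096;
const MAGIC: &[u8; 8] = b"USTDHUSK";

/// Bitwise CRC-32 (IEEE polynomial, reflected), written out so the
/// sketch needs no external crate.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Assumed layout: magic at bytes 0..8, CRC-32 of bytes 0..4092 stored
/// little-endian in bytes 4092..4096.
pub fn is_valid_header(block: &[u8]) -> bool {
    if block.len() != BLOCK || block[..8] != MAGIC[..] {
        return false;
    }
    let stored = u32::from_le_bytes([block[4092], block[4093], block[4094], block[4095]]);
    crc32(&block[..BLOCK - 4]) == stored
}

/// Walk the volume 4 KB at a time and return the byte offset of every
/// block that passes both the magic check and the CRC check.
pub fn scan_volume(volume: &[u8]) -> Vec<u64> {
    volume
        .chunks_exact(BLOCK)
        .enumerate()
        .filter(|(_, block)| is_valid_header(block))
        .map(|(i, _)| (i * BLOCK) as u64)
        .collect()
}
```

Each returned offset is a candidate tape_offset for a rebuilt catalog row. The CRC check matters because payload data can contain the magic bytes by coincidence.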
This means your storage volumes, whether disk, tape, or S3, are entirely self-describing. The catalog is an optimization: it makes lookups instant instead of requiring a full scan. But the ground truth is redundantly encoded directly into the volume format itself. Lose the catalog and you can rebuild it. Lose the volumes and no software in the world will save you.
The (file_path, version) index makes finding any object a microsecond operation, regardless of how many millions of files are archived.
Every object header on a volume contains a full copy of its own metadata. The catalog can be completely rebuilt from scratch via a single raw scan.
Volume identity is based on cryptographic UUIDs, not device paths. Drives can be moved between machines and ports without breaking a single restore.
The Husk Catalog is deliberately boring infrastructure. It is a SQLite file you can open with any SQLite browser, inspect with standard SQL, and back up with a simple cp. That simplicity is the point. A critical piece of your archive pipeline should be something you can fully understand, audit, and recover from — not a proprietary black box. The catalog knows where everything is. The volumes have everything. As long as you have both, you have your archive.