The Catalog: HuskHoard's Ground Truth Engine

In this article

Why the file system can't be the source of truth
The catalog schema: what gets recorded and why
The tapes table: mapping UUIDs to hardware
The object_frames table: range queries and streaming
Why WAL mode is non-negotiable
The nuclear option: rebuilding a lost catalog from raw disk or tape

Why the File System Can't Be the Source of Truth

In the previous post we established that HuskHoard's stubbing system deliberately severs the normal relationship between a file path and the physical bytes it represents. A stub looks like a 47 GB file but occupies zero disk blocks. The file system has been turned into a presentation layer.

That raises an uncomfortable question: if the file system no longer holds the data, and the data is written sequentially onto a tape cartridge, a flat image file on disk, or an rclone cloud backend, how does HuskHoard know where on that volume any given file lives? Tape is not a random-access device. You can't ask it "give me the file at path /projects/alpha/final.mov"; you can only seek to a byte offset and read forward from there.

The answer is husk_catalog.db: a SQLite database that functions as HuskHoard's permanent index of everything it has ever archived. Every operation in the system — stub creation, transparent restore, point-in-time version rollback, integrity scrubbing, and garbage-collection repacking — ultimately begins by querying this file.

The key insight

The catalog is not a backup of metadata. It is the authoritative map. Without it, you have a tape full of data with no way to look anything up. With it, you can reconstruct an entire presentation layer from zero in seconds.

The Catalog Schema: What Gets Recorded and Why

The heart of husk_catalog.db is the catalog table. Here is the exact schema as defined in database.rs:

-- The main per-object archive index
CREATE TABLE IF NOT EXISTS catalog (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path        TEXT    NOT NULL,
    version          INTEGER NOT NULL,
    tape_uuid        TEXT    NOT NULL,
    tape_offset      INTEGER NOT NULL,
    payload_size     INTEGER NOT NULL,   -- uncompressed byte count
    compressed_size  INTEGER DEFAULT 0,
    compression_type INTEGER DEFAULT 0,  -- 0=None, 1=Zstd
    uid              INTEGER NOT NULL,
    gid              INTEGER NOT NULL,
    posix_mode       INTEGER NOT NULL,
    archived_at      DATETIME DEFAULT CURRENT_TIMESTAMP,
    original_mtime   INTEGER NOT NULL,
    blake3_hash      TEXT    NOT NULL,
    custom_metadata  TEXT
);

CREATE INDEX IF NOT EXISTS idx_path_version ON catalog (file_path, version);
    

Every column earns its place. Let's walk through the important ones.

tape_uuid + tape_offset: The Physical Address

These two columns together form a complete physical address for the archived payload. The tape_uuid identifies which volume the data lives on — not by a human-readable name or a device path like /dev/nst0 (which can change when you plug a drive into a different port), but by the 16-byte UUID that was burned into the volume's header block when it was first formatted. The tape_offset is the exact byte position on that volume where the object's 4 KB header begins.

This design means HuskHoard doesn't care what port your LTO drive is plugged into, or whether an image file has been renamed, or whether an rclone remote has been reconfigured. As long as the UUID matches, the restore succeeds.

version: Automatic Point-in-Time History

Every time a file is re-archived, HuskHoard doesn't overwrite the previous catalog entry. Instead, it increments the version counter and writes a new row. Because tape is an append-only medium — you can't punch a hole in the middle of a cartridge — old versions naturally accumulate. The catalog tracks all of them.

This gives you automatic point-in-time version history at essentially zero overhead. If you accidentally overwrite a critical video project and the Janitor archives the damaged version, you can roll back to any prior version with a single husk restore command by passing an explicit --version flag. By default, a restore selects the highest (most recent) version. Note that this append-only behavior is not limited to physical tape: HuskHoard treats every volume as though it were a tape, so data is always written sequentially, rather than by random writes to a volume starting at a fixed LBA.

Typical Backup Software: Version Management

✗ Versions stored in separate backup sets or snapshots
✗ Requires dedicated "restore from date X" UI
✗ Old versions purged on a separate, manual schedule
✗ Restoring a version moves data into a staging area

HuskHoard Catalog: Version Management

✓ Every archive event is a new version row in the same table
✓ Rollback via --version N on any restore command
✓ Old versions auto-pruned based on max_versions config
✓ Restored version is placed directly at the target path
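The version-selection rule described above (explicit --version N, otherwise the highest version) can be sketched in a few lines of Rust. This is an illustration only, not HuskHoard's actual code: the CatalogRow struct and the select_version helper are hypothetical names, and the rows are held in memory rather than fetched from SQLite. The real lookup is presumably a query against the (file_path, version) index.

```rust
/// One row of the `catalog` table, reduced to the columns this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogRow {
    pub file_path: String,
    pub version: i64,
    pub tape_uuid: String,
    pub tape_offset: i64,
}

/// Hypothetical helper: pick the catalog row a restore should use.
/// `requested` mirrors an explicit `--version N`; `None` means "latest".
pub fn select_version<'a>(
    rows: &'a [CatalogRow],
    file_path: &str,
    requested: Option<i64>,
) -> Option<&'a CatalogRow> {
    // Only rows for this path are candidates.
    let mut candidates = rows.iter().filter(|r| r.file_path == file_path);
    match requested {
        // Explicit rollback: the exact version must exist.
        Some(v) => candidates.find(|r| r.version == v),
        // Default: the highest version number wins.
        None => candidates.max_by_key(|r| r.version),
    }
}
```

Calling select_version(&rows, "/projects/alpha/final.mov", None) returns the newest row for that path, while passing Some(1) returns the first archived version if it is still in the catalog.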

blake3_hash: The Integrity Anchor

The BLAKE3 hash of the uncompressed payload is recorded at archive time and stored permanently in the catalog. This is what the husk scrub command uses when it performs a deep integrity check: it reads every byte off the volume, decompresses on the fly into a hash sink (using std::io::sink() so no disk space is needed), and compares the resulting hash against the catalog's stored value. Any mismatch is flagged as corruption.

Critically, the hash is of the original content, not the compressed bytes. This means you can upgrade the compression algorithm in a future HuskHoard version without invalidating any existing integrity records.
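The "hash sink" pattern used by the scrub can be illustrated with the standard library alone. The sketch below is not HuskHoard's implementation: HashingWriter and scrub_matches are hypothetical names, and because BLAKE3 is not part of std, a 64-bit FNV-1a hash stands in for it. What carries over is the structure: bytes are hashed as they stream past and then discarded into std::io::sink(), so the check needs no scratch space.

```rust
use std::io::{self, Read, Write};

/// Writer that hashes every byte it accepts and forwards it to `inner`.
/// FNV-1a (64-bit) is a stand-in here; the real scrub uses BLAKE3.
pub struct HashingWriter<W: Write> {
    inner: W,
    state: u64,
}

impl<W: Write> HashingWriter<W> {
    pub fn new(inner: W) -> Self {
        // FNV-1a 64-bit offset basis.
        Self { inner, state: 0xcbf2_9ce4_8422_2325 }
    }

    /// Digest of all bytes written so far.
    pub fn finish(&self) -> u64 {
        self.state
    }
}

impl<W: Write> Write for HashingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Hash only the bytes the inner writer actually accepted.
        for &b in &buf[..n] {
            self.state ^= u64::from(b);
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3);
        }
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Stream an (already decompressed) payload through the hasher into
/// `io::sink()` and compare the result with the recorded digest.
pub fn scrub_matches<R: Read>(mut payload: R, recorded: u64) -> io::Result<bool> {
    let mut hasher = HashingWriter::new(io::sink());
    io::copy(&mut payload, &mut hasher)?;
    Ok(hasher.finish() == recorded)
}
```

In the real pipeline the reader would be a Zstd decoder wrapped around the volume, and the recorded value would be the blake3_hash column.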

The Tapes Table: Mapping UUIDs to Hardware

The catalog table knows which UUID holds a given file, but it doesn't know where that UUID is physically located right now. That's the job of the tapes table.

-- Maps volume UUIDs to physical device paths and hardware identifiers
CREATE TABLE IF NOT EXISTS tapes (
    tape_uuid     TEXT PRIMARY KEY,
    device_path   TEXT NOT NULL,       -- e.g. /dev/nst0, my_archive.img, rclone:s3:bucket
    drive_serial  TEXT DEFAULT 'VIRTUAL_IMAGE',
    backend_type  TEXT DEFAULT 'local',
    location_hint TEXT DEFAULT NULL    -- e.g. "Shelf B, Row 3" for offline tapes
);
    

The drive_serial column is what allows HuskHoard to survive hardware moves. On startup, the daemon calls rescan_tape_drives(), which walks /sys/block and /sys/class/scsi_tape looking for block and character devices whose hardware serial numbers match entries in this table. If your LTO drive was on /dev/nst0 yesterday and is on /dev/nst1 today because you rearranged your HBA, the catalog automatically heals the mapping before the first restore attempt.
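The healing step amounts to a join between the tapes table and whatever serial numbers the hardware scan turns up. The following is a minimal sketch of that idea, not the body of rescan_tape_drives(): the TapeRow struct, the heal_device_paths function, and the serial-to-path map are hypothetical, and the sysfs walk that would populate the map is omitted.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a row of the `tapes` table.
#[derive(Debug, Clone, PartialEq)]
pub struct TapeRow {
    pub tape_uuid: String,
    pub device_path: String,
    pub drive_serial: String,
}

/// Hypothetical helper: `detected` maps hardware serial -> current device
/// path, as a hardware scan might report it. Rows whose serial shows up at
/// a different path are updated. Returns the number of rows changed.
pub fn heal_device_paths(tapes: &mut [TapeRow], detected: &HashMap<String, String>) -> usize {
    let mut healed = 0;
    for row in tapes.iter_mut() {
        if let Some(path) = detected.get(&row.drive_serial) {
            if *path != row.device_path {
                row.device_path = path.clone();
                healed += 1;
            }
        }
    }
    healed
}
```

Rows whose serial is not detected, such as image files carrying the default VIRTUAL_IMAGE serial, are left untouched in this sketch.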

The location_hint field is for human operators working with large offline libraries. When HuskHoard needs a volume that isn't inserted, it passes this hint to the Sidecar IPC bridge's WAKE_VOLUME event, which your automation layer can use to tell a robotic loader — or a human — exactly which shelf to fetch the cartridge from.

The object_frames Table: Range Queries and HTTP Streaming

Most people's mental model of an archive is: one file, one contiguous blob. HuskHoard has that, but it also stores a second index specifically to enable range-based access through the HTTP Streaming Gateway.

-- Frame index for byte-range streaming (powers the HTTP gateway)
CREATE TABLE IF NOT EXISTS object_frames (
    file_path            TEXT    NOT NULL,
    version              INTEGER NOT NULL,
    uncompressed_offset  INTEGER NOT NULL,
    compressed_offset    INTEGER NOT NULL,
    compressed_size      INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_frames ON object_frames (file_path, version);
    

When HuskHoard archives a large file, it writes it through the Zstd compressor in chunks, recording the mapping between uncompressed byte positions and their corresponding positions in the compressed stream on the storage volume. This produces a jump table stored in object_frames.

Why does this matter? Consider what happens when Plex or VLC sends an HTTP Range: bytes=2147483648- request — asking to start playback from the two-gigabyte mark of a 12 GB video file. Without the frame index, HuskHoard would have to decompress from byte zero just to find that position. With the frame index, it can look up the nearest compressed frame boundary, seek directly to that offset on the volume, and begin streaming from there.

HTTP Range request arrives → Gateway queries object_frames → Nearest frame boundary located → Seek to compressed offset on storage volume → Decompress forward to exact byte → Stream to VLC / Plex / browser

This is what allows HuskHoard to function as a transparent media server bridge. Your video player sends a standard HTTP Range request to localhost:8080, and HuskHoard handles the seek into the compressed stream entirely invisibly.
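The frame lookup itself is a "last entry at or before X" search over the jump table. Here is a minimal sketch, assuming the frames for one (file_path, version) have already been loaded and sorted by uncompressed_offset. The Frame struct mirrors the object_frames columns, but the locate_frame function is a hypothetical name, and how compressed_offset relates to the object's tape_offset is not specified here.

```rust
/// In-memory copy of one `object_frames` row.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Frame {
    pub uncompressed_offset: u64,
    pub compressed_offset: u64,
    pub compressed_size: u64,
}

/// Find the frame that contains `range_start` (the first byte of an HTTP
/// Range request) and how many decompressed bytes to discard before the
/// requested byte is reached.
/// `frames` must be sorted by `uncompressed_offset` in ascending order.
pub fn locate_frame(frames: &[Frame], range_start: u64) -> Option<(Frame, u64)> {
    // Number of frames that begin at or before the requested byte.
    let idx = frames.partition_point(|f| f.uncompressed_offset <= range_start);
    // The last of those frames is the one to start decompressing from.
    let frame = *frames.get(idx.checked_sub(1)?)?;
    Some((frame, range_start - frame.uncompressed_offset))
}
```

The second element of the returned tuple is the "decompress forward to exact byte" step: the gateway would decode the frame and drop that many bytes before streaming. The function returns None when the table is empty or when no frame starts at or before the requested byte.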

Why WAL Mode Is Non-Negotiable

HuskHoard runs two threads that both need to write to the catalog simultaneously: the foreground Interceptor (which handles live fanotify restore events from the OS) and the background Janitor Worker (which processes the archive queue). In standard SQLite journal mode, these two writers would collide constantly, producing "database is locked" errors at the worst possible moments — like when a user's video editor is waiting for a restore to complete.

The fix is the very first thing init_catalog() does:

// EXTREMELY IMPORTANT: Enable Write-Ahead Logging.
// This allows the Sweeper (Thread) and Interceptor (Main) to write simultaneously
// without throwing "database is locked" errors.
conn.execute_batch("PRAGMA journal_mode = WAL;")?;
    

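In WAL mode, readers never block writers and writers never block readers. The Interceptor can look up a file's catalog entry for a restore while the Janitor is inserting new archive records, and neither waits on the other. SQLite still admits only one writer at a time, even in WAL mode, so two truly simultaneous writes are serialized rather than run in parallel. That is where the busy_timeout of 30 seconds comes in: a writer that finds the write lock held waits for it instead of failing immediately with "database is locked". In practice, WAL mode plus that backstop makes write contention almost nonexistent.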

The Nuclear Option: Rebuilding a Lost Catalog From Raw Storage Volume

Inevitably, the question arises: what happens if I lose the catalog database? Maybe the SSD it lived on failed. Maybe someone ran rm in the wrong directory. The data is still physically on your storage volumes — but without the index, how do you find any of it?

This is what husk rebuild is for.

$ husk rebuild --tape-dev /dev/nst0 --output-db husk_recovered.db
 Starting deep scan recovery on /dev/nst0...
   Scanning at offset 4096...
   Found object: /projects/alpha/rushes/day01.mov (v1)
   Found object: /projects/alpha/rushes/day02.mov (v1)
   Found object: /projects/alpha/final.mov (v1)
   Found object: /projects/alpha/final.mov (v2)
   ... 14,847 objects found ...
 Recovery Complete! Successfully rebuilt 14,848 entries into husk_recovered.db.
    

The rebuild command works because every archived object on a volume begins with a 4 KB ObjectHeader containing a fixed magic string (USTDHUSK), the original file path, the BLAKE3 hash, POSIX permissions, timestamps, and a CRC32 checksum of the header itself. The rebuilder simply walks the storage volume 4 KB at a time, checks for the magic bytes, validates the header CRC, and if both pass, inserts a new catalog row. It doesn't require any prior knowledge of what was on the volume.
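The scan loop can be sketched with the standard library alone. The magic string and the 4 KB stride come from the description above; everything else is an assumption made for illustration. In particular, the exact ObjectHeader layout is not given here, so this sketch assumes the magic occupies the first 8 bytes and that a CRC-32 (IEEE) of the first 4092 bytes is stored little-endian in the last 4. The function names are hypothetical, and the volume is an in-memory byte slice rather than a tape device.

```rust
const BLOCK: usize = 4096;
const MAGIC: &[u8; 8] = b"USTDHUSK";

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// ASSUMED layout: magic in bytes 0..8, CRC-32 of bytes 0..4092 stored
/// little-endian in bytes 4092..4096.
pub fn is_valid_header(block: &[u8]) -> bool {
    if block.len() != BLOCK || &block[..8] != &MAGIC[..] {
        return false;
    }
    let stored = u32::from_le_bytes([block[4092], block[4093], block[4094], block[4095]]);
    crc32(&block[..BLOCK - 4]) == stored
}

/// Walk the volume 4 KB at a time and return the byte offset of every
/// block that passes both the magic check and the CRC check.
pub fn scan_headers(volume: &[u8]) -> Vec<u64> {
    volume
        .chunks_exact(BLOCK)
        .enumerate()
        .filter(|(_, block)| is_valid_header(block))
        .map(|(i, _)| (i * BLOCK) as u64)
        .collect()
}
```

Requiring the CRC as well as the magic is what keeps payload bytes that merely happen to contain USTDHUSK from being mistaken for a header. A real rebuilder would go on to parse the path, hash, and permissions out of each validated block and insert a catalog row with that offset as tape_offset.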

This means your storage volumes, whether disk, tape, or S3, are entirely self-describing. The catalog is an optimization: it makes lookups instant instead of requiring a full scan. But the ground truth is redundantly encoded directly into the volume format itself. Lose the catalog and you can rebuild it. Lose the volumes and no software in the world will save you.

Instant Lookups

The (file_path, version) index turns finding any object into a B-tree lookup whose cost grows only logarithmically with the number of rows, so it stays fast even with millions of archived files.

Self-Describing Storage Volumes

Every object header on a volume contains a full copy of its own metadata. The catalog can be completely rebuilt from scratch via a single raw scan.

Hardware-Portable UUIDs

Volume identity is based on the 16-byte UUID written into each volume's header, not on device paths. Drives can be moved between machines and ports without breaking a single restore.

The Husk Catalog is deliberately boring infrastructure. It is a SQLite file you can open with any SQLite browser, inspect with standard SQL, and back up with a simple cp. That simplicity is the point. A critical piece of your archive pipeline should be something you can fully understand, audit, and recover from — not a proprietary black box. The catalog knows where everything is. The volumes have everything. As long as you have both, you have your archive.
