The Danger of Querying a Live Filesystem
Every time a file is archived, moved, or deleted, a database record is updated. In HuskHoard, this production database is an ultra-fast SQLite catalog that tracks physical tape blocks, virtual offsets, and file state. It is highly optimized for OLTP (Online Transaction Processing).
The problem arises when the IT billing department or a data scientist wants to analyze storage trends. They want to run an aggregation query—scanning 50 million rows to sum up the payload_size grouping by action and filtering by date. This is an OLAP (Online Analytical Processing) workload.
Running a heavy OLAP query against a live OLTP database is dangerous. It locks tables, evicts cache, and starves CPU cycles. If your core database is busy running a 45-second GROUP BY query, it can't respond to the incoming video stream being written to tape. The write buffer fills up, and your archiving pipeline stalls.
The Bottlenecked DB
Production reads/writes share the exact same database engine as reporting/analytics. A heavy analytical query can lock the database and crash an active data ingest.
Compute Decoupling
Production transactions stay in SQLite. Analytics are pushed to an immutable Cloud Data Lake. You can run petabyte-scale queries on the cloud without the storage server ever noticing.
To solve this, we needed a way to bridge the gap between OLTP and OLAP efficiently. We built a native Parquet export engine directly into the Husk core using Apache Arrow.
The Native Arrow Export Command
Instead of relying on an external sidecar to buffer events or setting up complex ETL pipelines, Husk now ships with a built-in data engineering command. By running husk export --format parquet, the core daemon safely maps the SQLite catalog directly into Apache Arrow memory structures.
Because it operates within the main Rust engine, it uses zero-copy memory operations to build the columnar data builders. It reads the catalog, vectorizes the rows into columns, and writes out a highly compressed analytics file instantly.
This operation is incredibly fast and avoids keeping a persistent secondary daemon running. You just export your catalog exactly when your reporting tools need it.
Apache Parquet and the Columnar Advantage
When running a catalog export, Husk doesn't write out a CSV or JSON file. It serializes the database directly into Apache Parquet.
Parquet is a columnar storage format. In a row-based format like CSV, a single file's data (path, size, hash, date) is stored contiguously. In Parquet, all the file paths are stored together, all the sizes are stored together, and all the hashes are stored together.
This allows for incredible compression (often 80%+ smaller than JSON) and dramatically faster analytical queries. If you ask a Parquet file "What is the sum of all payload_size?", the query engine only reads the single column of integers, completely ignoring the heavy string columns like file paths and hashes.
.parquet file.Cloud Syncing with Rclone
Exporting a Parquet file to the local disk is only half the battle. To truly decouple analytics from production, the data needs to leave the server.
Because Husk acts as a standard CLI, you can easily orchestrate cloud uploads via a cron job using the industry-standard `rclone` utility. Once the export completes, simply push the file to an AWS S3 bucket, Google Cloud Storage, or a central NAS.
There is no complex database replication or active-active consistency checking required. Your cloud bucket simply serves the latest .parquet snapshot, building a massive, easily queryable Data Lake of your storage infrastructure.
Querying the Lake with DuckDB
Now that your metadata is safely living in a cloud bucket as Parquet files, how do you analyze it? You don't need to spin up a Hadoop cluster or pay for Snowflake. You can use DuckDB, an incredibly fast in-process analytical SQL engine.
A data scientist can open a terminal on their MacBook in a coffee shop, connect DuckDB to the S3 bucket, and query billions of Husk storage events in milliseconds.
Notice the run time: 0.4 seconds to aggregate gigabytes of logs across an entire cluster. And most importantly, during those 0.4 seconds, the production Husk server experienced exactly 0% CPU utilization. The analytical compute was entirely localized to the data scientist's laptop, pulling optimized column data from standard S3 storage.
StreamGate: Zero-Download Peeking for ML
This analytics-first mindset extends to how data engineers and ML pipelines actually interact with the archived files. If you want to query a specific row group in a 500GB Parquet dataset, or an ML training script needs to fetch a localized batch of video frames, traditional cold storage forces you to download the entire file first. This creates massive I/O bottlenecks and brutal cloud egress fees.
Husk solves this with a built-in HTTP Streaming Gateway. Modern data tools like DuckDB, PyArrow, and Python's fsspec are incredibly smart—they read the footer of a file first, calculate exactly which byte offsets contain the data they need, and send an HTTP Range request (e.g., bytes=50000000-60000000).
When the Husk Gateway receives this request, it doesn't restore the file to disk. It acts as a zero-copy bridge, streaming only the requested bytes directly from the archive straight into your data pipeline's memory.
Zstd Jump Tables & Partial Extraction
There is a major technical hurdle to executing these partial reads: Compression. If a massive dataset is compressed as a single continuous Zstd stream to save space, you cannot mathematically seek to the middle of it. You have to decompress the file from byte zero just to find out what byte 50,000,000 looks like.
Husk circumvents this using Jump Tables. During the archiving process, Husk chunks the file into independent 16MB frames. As each frame is compressed via Zstd, Husk records its exact uncompressed and compressed sizes into an index. This table is saved in the SQLite catalog (the object_frames table) and packed directly into the binary file's metadata header.
The Continuous Scroll
The entire dataset is compressed as one solid block. To read a single column or video frame in the middle, the CPU must download and decompress all the preceding gigabytes of data first.
The Indexed Jump Table
The file is a sequence of independent 16MB chapters. Husk looks at the database, jumps instantly to the correct chapter, and only downloads and decompresses the bytes for that specific block.
When DuckDB or a PyTorch dataloader requests a partial read, the StreamGate engine queries the SQLite database to find the exact compressed offset needed. It then uses rclone cat --offset X --count Y to pull only that chunk from your cloud bucket (or seeks physically on a tape drive using Linux MTIO commands), and decodes just the requested bytes.
Zero wasted bandwidth. Zero unnecessary decompression. You get O(1) seek times on cloud and cold storage, allowing your ML models and BI tools to query petabytes of compressed archives directly.
Why Native Arrow Integrations Win
Embedding Arrow and Parquet exports directly into the core engine changes the paradigm of enterprise storage management.
Husk proves that you don't need complex ETL pipelines, Kafka queues, or secondary database replicas to gain deep insights into your storage infrastructure. By leveraging Rust and Apache Arrow to write the world's most efficient columnar format, your archive effortlessly becomes a world-class Data Lake.