The Archive is a Database: Building a Zero-Impact Data Lake

In this article

The danger of querying a live filesystem
The Native Arrow Export Command
Apache Parquet and the Columnar Advantage
Cloud Syncing with Rclone
Querying the Lake with DuckDB
StreamGate: Zero-Download Peeking for ML
Zstd Jump Tables & Partial Extraction
Why Native Arrow Integrations Win

The Danger of Querying a Live Filesystem

Every time a file is archived, moved, or deleted, a database record is updated. In HuskHoard, this production database is an ultra-fast SQLite catalog that tracks physical tape blocks, virtual offsets, and file state. It is highly optimized for OLTP (Online Transaction Processing).

The problem arises when the IT billing department or a data scientist wants to analyze storage trends. They want to run an aggregation query—scanning 50 million rows to sum up the payload_size grouping by action and filtering by date. This is an OLAP (Online Analytical Processing) workload.

Running a heavy OLAP query against a live OLTP database is dangerous. It locks tables, evicts cache, and starves CPU cycles. If your core database is busy running a 45-second GROUP BY query, it can't respond to the incoming video stream being written to tape. The write buffer fills up, and your archiving pipeline stalls.

Traditional Storage Archiving

The Bottlenecked DB

Production reads/writes share the exact same database engine as reporting/analytics. A heavy analytical query can lock the database and crash an active data ingest.

Husk Data Lake

Compute Decoupling

Production transactions stay in SQLite. Analytics are pushed to an immutable Cloud Data Lake. You can run petabyte-scale queries on the cloud without the storage server ever noticing.

To solve this, we needed a way to bridge the gap between OLTP and OLAP efficiently. We built a native Parquet export engine directly into the Husk core using Apache Arrow.

The Native Arrow Export Command

Instead of relying on an external sidecar to buffer events or setting up complex ETL pipelines, Husk now ships with a built-in data engineering command. By running husk export --format parquet, the core daemon safely maps the SQLite catalog directly into Apache Arrow memory structures.

Because it operates within the main Rust engine, it uses zero-copy memory operations to build the columnar data builders. It reads the catalog, vectorizes the rows into columns, and writes out a highly compressed analytics file instantly.

// Inside Husk Core: Exporting the catalog natively
let mut stmt = conn.prepare("SELECT id, file_path, hash, archived_at, action, payload_size, ... FROM catalog")?;

// Initialize Arrow Column Builders
let mut path_b = StringBuilder::new();
let mut payload_b = Int64Builder::new();

// Vectorize rows into contiguous memory columns
for row in rows {
    path_b.append_value(row.get::<_, String>(1)?);
    payload_b.append_value(row.get::<_, i64>(5)?);
}

// Write to Disk with Snappy Compression
let batch = RecordBatch::try_new(Arc::new(schema), vec![...])?;
let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
writer.write(&batch)?;

This operation is incredibly fast and avoids keeping a persistent secondary daemon running. You just export your catalog exactly when your reporting tools need it.

Apache Parquet and the Columnar Advantage

When running a catalog export, Husk doesn't write out a CSV or JSON file. It serializes the database directly into Apache Parquet.

Parquet is a columnar storage format. In a row-based format like CSV, a single file's data (path, size, hash, date) is stored contiguously. In Parquet, all the file paths are stored together, all the sizes are stored together, and all the hashes are stored together.

This allows for incredible compression (often 80%+ smaller than JSON) and dramatically faster analytical queries. If you ask a Parquet file "What is the sum of all payload_size?", the query engine only reads the single column of integers, completely ignoring the heavy string columns like file paths and hashes.

Arrow Vectorization

The Rust structs are converted into Arrow arrays in RAM. This provides a zero-copy memory format optimized for modern CPU architectures.

Parquet Serialization

The Arrow vectors are encoded, compressed using standard Big Data algorithms (like Snappy), and written out as a single .parquet file.

Zero-Impact Analytics

The exported Parquet file is instantly ready for query engines, leaving your production SQLite database completely untouched by heavy BI dashboards.

Cloud Syncing with Rclone

Exporting a Parquet file to the local disk is only half the battle. To truly decouple analytics from production, the data needs to leave the server.

Because Husk acts as a standard CLI, you can easily orchestrate cloud uploads via a cron job using the industry-standard `rclone` utility. Once the export completes, simply push the file to an AWS S3 bucket, Google Cloud Storage, or a central NAS.

# A simple cron script to update the data lake nightly
#!/bin/bash
export_path="/tmp/husk_catalog_$(date +%F).parquet"

# 1. Export the catalog natively
husk export --format parquet --output "$export_path"

# 2. Sync to the cloud lake
rclone copy "$export_path" aws:enterprise-bucket-name/husk_lake/metadata/

# 3. Clean up
rm "$export_path"
echo "☁️ Successfully updated Global Lake!"

There is no complex database replication or active-active consistency checking required. Your cloud bucket simply serves the latest .parquet snapshot, building a massive, easily queryable Data Lake of your storage infrastructure.

Querying the Lake with DuckDB

Now that your metadata is safely living in a cloud bucket as Parquet files, how do you analyze it? You don't need to spin up a Hadoop cluster or pay for Snowflake. You can use DuckDB, an incredibly fast in-process analytical SQL engine.

A data scientist can open a terminal on their MacBook in a coffee shop, connect DuckDB to the S3 bucket, and query billions of Husk storage events in milliseconds.

-- Analyzing 5 years of storage growth directly from S3 using DuckDB
v0.10.1 > SELECT 
    substring(archived_at, 1, 7) AS month,
    COUNT(file_path) AS total_files_archived,
    SUM(payload_size) / 1024 / 1024 / 1024 / 1024 AS tb_written
FROM 's3://enterprise-bucket-name/husk_lake/metadata/*.parquet'
GROUP BY month
ORDER BY month DESC
LIMIT 5;

┌─────────┬──────────────────────┬─────────────┐
│  month  │ total_files_archived │ tb_written  │
│ varchar │        int64         │   double    │
├─────────┼──────────────────────┼─────────────┤
│ 2026-08 │               142091 │  184.221    │
│ 2026-07 │               105330 │  156.904    │
│ 2026-06 │                98112 │  112.551    │
│ 2026-05 │               210443 │  301.889    │
│ 2026-04 │                85002 │   94.112    │
└─────────┴──────────────────────┴─────────────┘
Run Time (s): real 0.418 user 1.290204 sys 0.082302

Notice the run time: 0.4 seconds to aggregate gigabytes of logs across an entire cluster. And most importantly, during those 0.4 seconds, the production Husk server experienced exactly 0% CPU utilization. The analytical compute was entirely localized to the data scientist's laptop, pulling optimized column data from standard S3 storage.

StreamGate: Zero-Download Peeking for ML

This analytics-first mindset extends to how data engineers and ML pipelines actually interact with the archived files. If you want to query a specific row group in a 500GB Parquet dataset, or an ML training script needs to fetch a localized batch of video frames, traditional cold storage forces you to download the entire file first. This creates massive I/O bottlenecks and brutal cloud egress fees.

Husk solves this with a built-in HTTP Streaming Gateway. Modern data tools like DuckDB, PyArrow, and Python's fsspec are incredibly smart—they read the footer of a file first, calculate exactly which byte offsets contain the data they need, and send an HTTP Range request (e.g., bytes=50000000-60000000).

When the Husk Gateway receives this request, it doesn't restore the file to disk. It acts as a zero-copy bridge, streaming only the requested bytes directly from the archive straight into your data pipeline's memory.

Zstd Jump Tables & Partial Extraction

There is a major technical hurdle to executing these partial reads: Compression. If a massive dataset is compressed as a single continuous Zstd stream to save space, you cannot mathematically seek to the middle of it. You have to decompress the file from byte zero just to find out what byte 50,000,000 looks like.

Husk circumvents this using Jump Tables. During the archiving process, Husk chunks the file into independent 16MB frames. As each frame is compressed via Zstd, Husk records its exact uncompressed and compressed sizes into an index. This table is saved in the SQLite catalog (the object_frames table) and packed directly into the binary file's metadata header.

Standard Tar/Zip Archive

The Continuous Scroll

The entire dataset is compressed as one solid block. To read a single column or video frame in the middle, the CPU must download and decompress all the preceding gigabytes of data first.

Husk StreamGate

The Indexed Jump Table

The file is a sequence of independent 16MB chapters. Husk looks at the database, jumps instantly to the correct chapter, and only downloads and decompresses the bytes for that specific block.

When DuckDB or a PyTorch dataloader requests a partial read, the StreamGate engine queries the SQLite database to find the exact compressed offset needed. It then uses rclone cat --offset X --count Y to pull only that chunk from your cloud bucket (or seeks physically on a tape drive using Linux MTIO commands), and decodes just the requested bytes.

// StreamGate intercepting a DuckDB Parquet footer read
LOG: [StreamGate] Requested file: training_data.parquet, offset: 104857600
LOG: [StreamGate] Frames found: 64. Selected range 6 to 6
LOG: [StreamGate] Compressed Target Offset: 88200451, Target Length: 15400230
LOG: [StreamGate] Spawning Rclone -> cat remote:bucket/husk_0.bin --offset 88200451 --count 15400230
LOG: [StreamGate] Launching Zstd Read-Decoder on Target Frames... Complete!

Zero wasted bandwidth. Zero unnecessary decompression. You get O(1) seek times on cloud and cold storage, allowing your ML models and BI tools to query petabytes of compressed archives directly.

Why Native Arrow Integrations Win

Embedding Arrow and Parquet exports directly into the core engine changes the paradigm of enterprise storage management.

01 — Zero Impact

Compute Isolation

Your storage arrays and tape robots exist to move data quickly. By pushing analytics to a cloud bucket and querying with DuckDB or Pandas, you guarantee that billing and auditing will never slow down a video render or backup job.

02 — Infinite History

The Immutable Ledger

Because Parquet fragments are append-only, you have a perfect, unalterable timeline of every file modification, hash calculation, and movement. This acts as a cryptographic audit log for compliance without expanding an active database.

03 — Standard Tooling

No Vendor Lock-In

We didn't invent a proprietary reporting UI. We emit open-source Apache Parquet files. You can plug them into Tableau, PowerBI, Grafana, Jupyter Notebooks, or write custom Python scripts. Your data belongs to you.

04 — Simplicity

Less Infrastructure

By removing the sidecar daemon, you have one less moving part to monitor. The Husk binary is fully self-contained. Run one command, get a cloud-ready dataset.

Husk proves that you don't need complex ETL pipelines, Kafka queues, or secondary database replicas to gain deep insights into your storage infrastructure. By leveraging Rust and Apache Arrow to write the world's most efficient columnar format, your archive effortlessly becomes a world-class Data Lake.