More

terrelln · 2025-10-30T20:11:56 1761855116

Its slang that has made its way from queer culture into mainstream. It is not meant to satisfy anyones ego, other than it just has vague positive connotations. It is used in a pretty equivalent way as "take that money, dude".

terrelln · 2025-10-07T16:17:19 1759853839

Out of curiosity, what was the input file format?

We actually worked on a demo WAV compressor a while back. We are currently missing codecs to run the types of predictors that FLAC runs. We expect to add this kind of functionality in the future, in a generic way that isn't specific to audio, and can be used across a variety of domains.

But, generally we wouldn't expect to generally beat FLAC. But, be able to offer specialized compressors for many types of data that previously weren't important enough to spawn a whole field of specialized compressors, by significantly lowering the bar for entry.

p1mrx · 2025-10-07T20:30:33 1759869033

The input was just CD audio, "One More Time" by Daft Punk.

test.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

terrelln · 2025-10-06T22:34:30 1759790070

You could have an LLM generate the SDDL description [0] for you, or even have it write a C++ or Python tokenizer. If compression succeeds, then it is guaranteed to round trip, as the LLM-generated logic lives only on the compression side, and the decompressor is agnostic to it.

It could be a problem that is well-suited to machine learning, as there is a clear objective function: Did compression succeed, and if so what is the compressed size.

[0] https://openzl.org/api/c/graphs/sddl/

terrelln · 2025-10-06T22:31:15 1759789875

The charts in the "Results With OpenZL" section compare against all levels of zstd, xz, and zlib.

On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.

terrelln · 2025-10-06T22:28:55 1759789735

We left it out of the paper because it is an implementation detail that is absolutely going to change as we evolve the format. This is the function that actually does it [0], but there really isn't anything special here. There are some bit-packing tricks to save some bits, but nothing crazy.

Down the line, we expect to improve this representation to shrink it further, which is important for small data. And to allow to move this representation, or parts of it, into a dictionary, for tiny data.

[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...

yubblegum · 2025-10-06T22:33:46 1759790026

Thanks! (Super cool idea btw.)

terrelln · 2025-10-06T22:19:22 1759789162

Do you happen to have a pointer to a good open source dataset to look at?

Naively and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.

We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.

bede · 2025-10-06T22:55:36 1759791336

For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further

terrelln · 2025-10-06T23:04:08 1759791848

Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.

fwip · 2025-10-07T03:45:55 1759808755

Another format that might be worth looking at in the bioinformatics world is hdf5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the hdf5 format with the self-describing decompression routines of openZL.

felixhandte · 2025-10-07T18:06:57 1759860417

Wanna hop over to https://github.com/facebook/openzl/issues/76?

terrelln · 2025-10-06T21:52:23 1759787543

You'd have to tell OpenZL what your format looks like by writing a tokenizer for it, and annotating which parts are which. We aim to make this easier with SDDL [0], but today is not powerful enough to parse JSON. However, you can do that in C++ or Python.

Additionally, it works well on numeric data in native format. But JSON stores it in ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.

However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.

[0] https://openzl.org/api/c/graphs/sddl/

kstenerud · 2025-10-07T01:08:47 1759799327

I've done a binary representation of JSON-structured data that uses unary coding for variable length length fields: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

This tends to confuse generic compressors, even though the sub-byte data itself usually clusters around the smaller lengths for most data and thus can be quite repetitive (plus it's super efficient to encode/decode). Could this be described such that OpenZL can capitalize on it?

terrelln · 2025-10-06T21:40:30 1759786830

> Is openzl indexable

Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.

terrelln · 2025-10-06T21:05:47 1759784747

Exactly! SDDL [0] provides a toolkit to do this all with no-code, but today is pretty limited. We will be expanding its feature set, but in the meantime you can also write code in C++ or Python to parse your format. And this code is compression side only, so the decompressor is agnostic to your format.

[0] https://openzl.org/api/c/graphs/sddl/

maeln · 2025-10-06T21:10:09 1759785009

Now I cannot stop thinking about how I can fit this somewhere in my work hehe. ZStandard already blew me away when it was released, and this is just another crazy work. And being able to access this kind of state-of-the-art algo' for free and open-source is the oh so sweet cherry on top

touisteur · 2025-10-07T15:59:29 1759852769

How happy I am to have all written/read data going through a DSL. On to generating the code to make OpenZL happy...

terrelln · 2025-10-06T20:07:31 1759781251

There's a Quick Start guide here:

https://openzl.org/getting-started/quick-start/

However, OpenZL is different in that you need to tell the compressor how to compress your data. The CLI tool has a few builtin "profiles" which you can specify with the `--profile` argument. E.g. csv, parquet, or le-u64. They can be listed with `./zli list-profiles`.

You can always use the `serial` profile, but because you haven't told OpenZL anything about your data, it will just use Zstandard under the hood. Training can learn a compressor, but it won't be able to learn a format like `.tar` today.

If you have raw numeric data you want to throw at it, or Parquets or large CSV files, thats where I would expect OpenZL to perform really well.