Hacker News | bassp's comments

> Some problems are straightforward to specify. A file system is a good example.

I’ve got to disagree with this - if only specifying a file system were easy!

From the horse’s mouth, the authors of the first “properly” verified FS (that I’m aware of), FSCQ, note that:

> we wrote specifications for a subset of the POSIX system calls using CHL, implemented those calls inside of Coq, and proved that the implementation of each call meets its specification. We devoted substantial effort to building reusable proof automation for CHL. However, writing specifications and proofs still took a significant amount of time, compared to the time spent writing the implementation

(Reference: https://dspace.mit.edu/bitstream/handle/1721.1/122622/cacm%2...)

And that’s for a file system that only implements a subset of POSIX system calls!


Ah, they're trying to specify file system recovery, which exposes the internals.

I did that once, as part of a very old secure operating system project, KSOS.[1] The specs were in SPECIAL, a forgotten language from SRI International.[2] The file system had two key invariants. The weak invariant was true at the end of every write. The strong invariant was true when the file system was clean and unmounted. The job of file recovery was to take a file system for which the weak invariant was true and make the strong invariant true.

We had a spec, but nobody could prove anything about it at the time. The tools didn't exist in the early 1980s. And trying to cram KSOS into a PDP-11 didn't really fit. It ran, but was too slow. All of this was just too early. Today you can do it.

[1] https://seclab.cs.ucdavis.edu/projects/history/papers/ford78...

[2] https://www.sciencedirect.com/science/article/abs/pii/016412...


Yes! There’s a canonical algorithm called the “Blelloch scan” for prefix sum (aka prefix scan, because you can generalize “sum” to “any binary associative function”) that’s very GPU-friendly. I have… fond is the wrong word, but “strong” memories of implementing it in a parallel programming class :)

Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
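For a feel of the algorithm, here's a sequential Python sketch of the two phases (the function name and the power-of-two restriction are simplifications; on a GPU, every iteration of each inner loop runs across threads in parallel):

```python
# Sequential sketch of the Blelloch (work-efficient) exclusive scan.
# A GPU implementation runs each inner loop in parallel across threads;
# here we just simulate the data movement of the two phases.

def blelloch_exclusive_scan(data, op=lambda a, b: a + b, identity=0):
    """Exclusive prefix scan; len(data) must be a power of two."""
    a = list(data)
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"

    # Up-sweep (reduce): build partial reductions up a balanced binary tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] = op(a[i + d - 1], a[i + 2 * d - 1])
        d *= 2

    # Down-sweep: push partial results back down, producing an exclusive scan.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            left = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] = op(left, a[i + 2 * d - 1])
        d //= 2
    return a
```

Because the only requirement on `op` is associativity, the same skeleton works for max, min, bitwise-or, etc.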


Mm, I used that exact writeup as a reference to implement this algorithm in WebGL 3 years ago: https://github.com/ashtonsix/webglc/blob/main/src/kernel/sca...

It even inspired the alternative "transpose" method I describe in the OP README.


I agree with the author 100% (the TanTan anecdote is great, super clever work!), but.... sometimes you do need Redis, because Redis is the only production-ready "data structure server" I'm aware of

If you want to access a bloom filter, cuckoo filter, list, set, bitmap, etc... from multiple instances of the same service, Redis (slash valkey, memorydb, etc...) is really your only option


Postgres has Bloom filters: https://www.postgresql.org/docs/current/bloom.html

It also has arrays, sets, and bitstrings, though for the latter you can just as easily (and with less space consumed) map it in your app, and store an integer.


Yes, while the default idea of Redis might be to consider it a key/value cache, the view of the project itself is definitely about being a "data structure server" - it's right at the top of the https://github.com/redis/redis/blob/unstable/README.md and antirez has focused on that (I can't find one quote I am looking for specifically but it's evident for example in discussion on streams https://antirez.com/news/114). Although I've definitely seen it be used just as a key/value store in the deployments I'm familiar with ¯\_(ツ)_/¯


All of those can be serialized and stored in an RDBMS. You don't need Redis for that.


They can (and that's probably the right choice for a lot of use cases, especially for small data structures and infrequently updated ones), but serializing and storing them in a database requires you to (in your application code) implement synchronization logic and pay the performance cost for said logic; for instance, if you want to `append` to a shared list, you need to deserialize the list, append to the end of it in your application code, and write it back to the DB. You'd need to use some form of locking to prevent appends from overwriting each other, incurring a pretty hefty perf penalty for hot lists. Also, reading an entire list/tree/set/whatever back just to add/delete one element is very wasteful (bandwidth/[de]serialization cost-wise)
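A minimal sketch of that read-modify-write cycle, using SQLite and a version column for optimistic locking (table and column names are invented for illustration):

```python
# Sketch: a "shared list" serialized as JSON in one row, with a version
# column so concurrent appends can't silently overwrite each other.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lists (key TEXT PRIMARY KEY, value TEXT, version INTEGER)")
db.execute("INSERT INTO lists VALUES ('hot_list', '[]', 0)")

def append_item(key, item, retries=10):
    for _ in range(retries):
        value, version = db.execute(
            "SELECT value, version FROM lists WHERE key = ?", (key,)
        ).fetchone()
        items = json.loads(value)   # 1. read + deserialize the whole list
        items.append(item)          # 2. append in application code
        cur = db.execute(           # 3. write back, only if nobody else raced us
            "UPDATE lists SET value = ?, version = ? WHERE key = ? AND version = ?",
            (json.dumps(items), version + 1, key, version),
        )
        if cur.rowcount == 1:       # compare-and-swap succeeded
            return
    raise RuntimeError("too much contention on " + key)

append_item("hot_list", "a")
append_item("hot_list", "b")
```

Every append pays a full read, a full deserialize/serialize, and a retry loop under contention, which is the overhead a single Redis `RPUSH` avoids.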


> for instance, if you want to `append` to a shared list, you need to deserialize the list, append to the end of it in your application code, and write it back to the DB.

this seems like a classic case of impedance mismatch, trying to implement a Redis-ism using an RDBMS.

for a shared list in a relational database, you could implement it like you've said, using an array type or a jsonb column or whatever, and simulate how it works in Redis.

but to implement a "shared list" in a way that meshes well with the relational model...you could just have a table, and insert a row into the table. there's no need for a read-modify-write cycle like you've described.

or, if you really need it to be a column in an existing table for whatever reason, it's still possible to push the modification to the database without the heavy overhead. for example [0]:

> The concatenation operator allows a single element to be pushed onto the beginning or end of a one-dimensional array. It also accepts two N-dimensional arrays, or an N-dimensional and an N+1-dimensional array.

0: https://www.postgresql.org/docs/current/arrays.html#ARRAYS-M...
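The row-per-element approach can be sketched like this (SQLite here so it's runnable as-is; names are invented, and the same shape works in Postgres):

```python
# Sketch: a "shared list" as a plain table, one row per element, so an
# append is a single INSERT with no read-modify-write cycle in the app.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE list_items (
        pos      INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves append order
        list_key TEXT,
        item     TEXT
    )
""")

def append_item(key, item):
    # No deserialization, no locking in application code.
    db.execute("INSERT INTO list_items (list_key, item) VALUES (?, ?)", (key, item))

def read_list(key):
    return [row[0] for row in db.execute(
        "SELECT item FROM list_items WHERE list_key = ? ORDER BY pos", (key,))]

append_item("shared", "a")
append_item("shared", "b")
```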


Sure, but that’s not what the person responding to my original comment was suggesting :). They suggested that you serialize entire data structures (bloom filters, lists, sets, etc…) into a relational DB to get redis-like functionality out of it; I chose a list as an example to illustrate why that’s not a great option in many cases.

You’re right that managing lists in RDBMSes is easy-ish, if you don’t have too many of them, and they’re not too large. But, like I mentioned in my original comment, redis really shines as a complex data structure server. I wouldn’t want to implement my own cuckoo filter in Postgres!


Doesn't a Postgres table fulfill this? You don't shove the whole list into a single column, you make a separate row per entry. This also works as a map or set.


I use Kafka for a low-message-volume use case because it lets my downstream consumers replay messages… but yeah in most cases, it’s over kill


That was also a use case for me. However at some point I replaced Kafka with Redpanda.


Isn't redpanda built for the same scale requirements as Kafka?


Redpanda is much leaner and scales much better for low latency use cases. It does a bunch of kernel bypass and zero copy mechanisms to deliver low latency. Being in C++ means it can fit into much smaller footprints than Apache Kafka for a similar workload


Those are all good points and pros for redpanda vs Kafka, but my question still stands. Isn't redpanda designed for high-volume scale similar to the use cases for Kafka, rather than the low-volume workloads talked about in the article?


When the founder started it was designed to be two things:

* easy to use

* more efficient and lower latency than the big resources needed for Kafka

The efficiency really matters at scale and low latency yes but the simplicity of deployment and use is also a huge win.


In kafka, if you require the highest durability for messages, you configure multiple nodes on different hosts, and probably data centres, and you require acks=all. I'd say this is the thing that pushes latency up, rather than the code execution of kafka itself.

How does redpanda compare under those constraints?


Oh if you care about durability on Kafka vs Redpanda, see https://www.redpanda.com/blog/why-fsync-is-needed-for-data-s... - acks=all does not fsync before acknowledging the write (by default), so it's still not safe. We use Raft for the data path, a proven replication protocol (not the custom ISR protocol), and fsync by default for safety (although if you're good with relaxed durability like in Kafka, you can enable that too: https://www.redpanda.com/blog/write-caching-performance-benc...).

As for Redpanda vs Kafka in multi AZ setups and latency, the big win in Redpanda is tail latencies are kept low (we have a variety of techniques to do this). Here's some numbers here: https://www.redpanda.com/blog/kafka-kraft-vs-redpanda-perfor...

Multi AZ latency is mostly single digit millisecond (ref: https://www.bitsand.cloud/posts/cross-az-latencies/) and the JVM can easily take just as long during GC, which can drive up those tail latencies.


It's pretty safe. Kafka replicates to 3 nodes (no fsync) before the request is completed. What are the odds of all 3 nodes (running in different data centers) failing at the same time?


Generally if you care about safety then “pretty safe” doesn’t cut it.


It's just my polite way of saying it's safe enough for most use cases and that you're wrong.

The fsync thing is complete FUD by RedPanda. They later introduce write caching[1] and call it an innovation[2]. I notice you also work for them.

Nevertheless, those that are super concerned with safety usually run with an RF of 5 (e.g. banks). And you can configure Kafka to fsync as often as you want[3]

1 - https://www.redpanda.com/blog/write-caching-performance-benc...

2 - https://www.linkedin.com/posts/timfox_oh-you-cant-make-this-...

3 - https://kafka.apache.org/documentation/#brokerconfigs_log.fl...


Disclaimer: I currently work for Redpanda.

> It's just my polite way of saying it's safe enough for most use cases and that you're wrong.

Low volume data can be some of the most valuable data on the planet. Think SEC reporting (EDGAR), law changes (Federal Register), court judgements (PACER), new cybersecurity vulnerabilities (CVEs), etc. Missing one record can be detrimental if it's the one record that matters.

Does everyone need durability by default? Probably not, but Redpanda users get it for free because there is a product philosophy of default-safe behavior that aligns with user expectations - most folks don't even know how this stuff works, why not protect them when possible?

> The fsync thing is complete FUD by RedPanda.

You want durability? Pay the `fsync()` cost. Otherwise recognize that acknowledgement and durability are decoupled and that the data is sitting in unsafe volatile memory for a bit.

> They later introduce write caching[1] and call it an innovation[2].

There are legitimate cases where customers don't care about durability and want the fastest possible system. We heard from these folks and responded with a feature they can selectively opt-in for that behavior _knowing the risks_. Again the idea is to be safer by default, and allow folks to opt-in to more risky behaviors.

> those that are super concerned with safety usually run with an RF of 5 (e.g banks)

Going above RF=3 does not guarantee "more nines" since you need more independent server racks, independent power supplies or UPSs, etc, otherwise you're just pigeonholing yourself. This greatly drives up costs. Disks and durability is just cheaper and simpler. Worst case you pull the drives and pull the data off them, not fun and not easy, but possible unlike in-memory copies.

> And you can configure Kafka to fsync as often as you want[3]

Absolutely! But nobody changes the default which is the issue - expectations of new users are not aligned with actual behavior. Same thing happened during the early MongoDB days. Either there needs to be better documentation/education to have people understand what the durability guarantees actually are, or change the defaults.
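For reference, the broker-side flush knobs in question look like this (a config sketch; forcing a flush per message is the safest setting but costs throughput):

```properties
# Flush (fsync) the log after every message; the default effectively
# leaves flushing to the OS page cache:
log.flush.interval.messages=1
# Or bound how long a message can sit unflushed, in milliseconds:
log.flush.interval.ms=1000
```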


I agree that data can be valuable and even one record loss can be catastrophic.

I agree that there needs to be better documentation.

I just don't agree that losing 3 replicas, each living in a different DC, at once is a realistic concern. The ones that would truly be concerned about this issue would do one of two things - run RF>3 (yes, it costs more) or set up some disaster recovery strategy (e.g. run in multiple regions; yes, that costs more).

Because truth be told - losing 3 AZs at once is a disaster. And even if you durably persisted to disk - all 3 disks may have become corrupt anyway.


It is not FUD. It is deterministic. Reproducible on your laptop. Out of all the banks I work with, only a handful of use cases use RF=5. Defaults matter, because most people do not change them.


Defaults do matter, in principle. But I think this particular risk is overblown, see my other reply for my thoughts


Your network programming guide really saved my bacon back when I was taking a networking class, I appreciate all your hard work!


IME, yes.

Here are a couple of problems I've run into using GQL for backend to backend communication:

* Auth. Good GQL APIs think carefully about permission management on a per-field basis (bad GQL apis slap some auth on an entire query or type and call it a day). Back-end services, obviously, are not front end clients, and want auth that grants their service access to an entire query, object, or set of queries/mutations. This leads to tension, and (often) hacky workarounds, like back-end services pretending to be "admin users" to get the access they need to a GQL API.

* Nested federation. Federation is super powerful, and, to be fair, data loaders do a great job of solving the N+1 query problem when a query only has one "layer" of federation. But, IME, GQL routers are not smart enough to handle nested federation; i.e. querying for a list of object `A`s, then federating object `B` on to each `A`, then federating object `C` on to each `B`. The latency for these kinds of queries is, usually, absolutely terrible, and I'd rather make these kinds of queries over gRPC (e.g. hit one endpoint for all the As, then use the result to get all the Bs, then use all the Bs to get all the Cs)


We wrote a breadth-first algorithm to handle the second problem you're describing. I'm curious to hear your thoughts on it: https://wundergraph.com/blog/dataloader_3_0_breadth_first_da...


That's really clever! Kudos. I'm gonna set aside some time this week to dive into the implementation


I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).


> I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

You were taught wrong...

First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can issue up to 4 instructions per cycle, one for each of 4 warps (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".

Sometimes, more than 4 * 32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

* Amount of shared memory used per warp

* Makeup of the instructions to be executed

* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).


Sorry if I was sloppy with my wording, instruction issuance is what I meant :)

I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?


> warps weren't issued instructions unless they were ready to execute

This is true, but after they've been issued, it still takes a while for the execution to conclude.

> it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once

Just replace "most" with "some". It really depends on what kind of kernel you're writing.


The GPU Glossary mentions that a warp scheduler can context switch (https://modal.com/gpu-glossary/device-hardware/warp-schedule...), but you said there is no such thing as an SM "context switch between threads". Is there some ambiguity in the term "context switch"?


Pretty sure CUDA will limit your thread count to hardware constraints? You can’t just request a million threads.


You can request up to 1024-2048 threads per block depending on the GPU; each SM can execute between 32 and 128 threads at a time! So you can have a lot more threads assigned to an SM than the SM can run at once


Right, ok. So you mean a handful of warps and not like a plethora of them for no reason.


Thread counts per block are limited to 1024 (unless I’ve missed a change and wikipedia is wrong), but total threads per kernel is 1024 * (2^32 - 1) * 65535 * 65535 ~= 2^74 threads

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
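A quick Python check of that arithmetic, using the per-dimension limits quoted above:

```python
# Max threads per kernel = threads per block * the three grid-dimension
# limits quoted above (x up to 2^32 - 1, y and z up to 65535).
threads_per_block = 1024
grid_limits = (2**32 - 1, 65535, 65535)

total = threads_per_block
for dim in grid_limits:
    total *= dim
# total comes out within 0.01% of 2**74
```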


Yeah I’m talking about the limit per-block.


Not the OP, but Hillel Wayne’s course/tutorial (https://www.learntla.com/) is fantastic. It’s focused on building practical skills, and helped me build enough competence to write a few (simple, but useful!) specs for some of the systems I work on.


It’s not all or nothing!

I work on a very “product-y” back end that isn’t fully specified, but I have formally specified parts of it.

For instance, I property-based-tested a particularly nasty state machine I wrote to ensure that, no matter what kind of crazy input I called an endpoint with, the underlying state machine never made any invalid transitions. None of the code around the state machine has a formal spec, but because the state machine does, I was able to specify it.

In the process, I found some very subtle bugs I’d have never caught via traditional unit testing.
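The approach can be sketched in a few lines, using the stdlib `random` module in place of a proper property-based testing library like Hypothesis; the state machine and its transition table here are invented for illustration:

```python
# Sketch: hammer a toy state machine with random (including invalid)
# events and assert it never leaves the set of known states.
import random

TRANSITIONS = {
    ("idle",    "start"): "running",
    ("running", "pause"): "paused",
    ("paused",  "start"): "running",
    ("running", "stop"):  "idle",
    ("paused",  "stop"):  "idle",
}
EVENTS = ["start", "pause", "stop", "bogus"]  # "bogus" is never valid

def step(state, event):
    # Invalid events must leave the state unchanged, never corrupt it.
    return TRANSITIONS.get((state, event), state)

def check_invariants(trials=1000, steps=50):
    rng = random.Random(0)  # seeded so any failure is reproducible
    for _ in range(trials):
        state = "idle"
        for _ in range(steps):
            state = step(state, rng.choice(EVENTS))
            # Invariant: we are always in a known state.
            assert state in {"idle", "running", "paused"}
    return True
```

A real property-based framework adds shrinking (minimizing failing inputs) and smarter input generation, but the core idea is the same: random inputs plus asserted invariants.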


Completely agree, but obviously you relied on the state machine being sufficiently separate and having a formal specification for it.

State machines are quite common, in aerospace software, where the requirements even specify the states and transitions. If you can formally verify them, I believe, you absolutely should, as often there can be a lot of subtlety going on there, especially if you have a distributed system of state machines sending messages to one another.


Property-based testing is also effective in finding bugs even in the absence of any formal model. Basically you just need to identify informal invariants ("properties"), encode them as asserts, and then run tests with enough coverage to likely find any invariant violations.


I’m surprised that this piece mentioned Microsoft, but didn’t touch on Microsoft’s solution to this problem: project silica (https://www.microsoft.com/en-us/research/project/project-sil...), which stores data on etched pieces of quartz glass that are supposed to be able to reliably store data for thousands of years. Of course, you still need to solve the dispersal problem, and need to make sure that the knowledge of how to read the glass tablets is passed down, but hey, nothing’s perfect!


It would be neat if they had a process for interactively "zooming in" on a given artifact with a suitably equipped human. For example, on the slab print human legible instructions for building a microscope. Then, when you look at the slab again with the microscope you built, there are more instructions on building a better microscope, a computer, and a decoding algorithm. This can continue, such that the N-1 level instructions allow the human read the artifact at Nth level density, unlocking further instructions, until you reach "bottom". I assume there are people expert in theory-crafting a message that self-teaches the language in which it is written. This also assumes there is no meaningful interaction of representations at different levels, or ones that can easily be accounted for.

