Author of the blog here. I had a great time writing this. By far the most complex article I've ever put together, with literally thousands of lines of js to build out these interactive visuals. I hope everyone enjoys.
The visuals are awesome; the bouncing-box is probably the best illustration of relative latency I've seen.
Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right? I would think if your recovery is 10 minutes for example, even if each of three servers is guaranteed to fail once in the month, I think it's already like 1 in two million? and if it's a 1% chance of failure in the month failure of all three overlapping becomes extremely unlikely.
Thought I would note this because one-in-a-million is not great if you have a million customers ;)
> Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right?
Absolutely. Our actual durability is far, far, far higher than this. We believe that nobody should ever worry about losing their data, and that's the peace of mind we provide.
> Instead of relying on a single server to store all data, we can replicate it onto several computers. One common way of doing this is to have one server act as the primary, which will receive all write requests. Then 2 or more additional servers get all the data replicated to them. With the data in three places, the likelihood of losing data becomes very small.
Is my understanding correct, that this means you propagate writes asynchronously from the primary to the secondary servers (without waiting for an "ACK" from them for writes)?
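To make sure I'm asking about the right distinction, here's roughly what I mean, as a toy sketch in Python (illustrative names only, obviously not your actual implementation):

```python
# Toy contrast between async and semi-sync replication ack timing (illustrative only).

class Replica:
    def __init__(self):
        self.log = []

    def receive(self, record):
        self.log.append(record)
        return True  # acknowledgement back to the primary

def write_async(primary_log, replicas, record):
    primary_log.append(record)
    ack = "ACK (replicas not yet confirmed)"  # client would be answered at this point
    for r in replicas:                        # replication trails behind the ack, so a
        r.receive(record)                     # primary crash in that window can lose the write
    return ack

def write_semi_sync(primary_log, replicas, record, quorum=1):
    primary_log.append(record)
    acks = sum(1 for r in replicas if r.receive(record))
    if acks < quorum:
        raise RuntimeError("not enough replica acknowledgements")
    return "ACK (at least one replica has the write)"  # client answered only after replica ack

primary, replicas = [], [Replica(), Replica()]
print(write_async(primary, replicas, "row 1"))
print(write_semi_sync(primary, replicas, "row 2"))
```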
Kudos to whoever patiently & passionately built these. On an off-topic note: this is a great perspective for building realistic coursework for middle and high school students. I'm sure they learn faster & better with visuals like these.
1 in a million is the probability that all three servers die in one month, without swapping out the broken ones.
So at some point in the month all the data is gone.
If you replace the failed (or failing) node right away, the failure probability goes down greatly.
You would likely need the probability of a node going down within a 30-minute time span, assuming the migration can be done in 30 minutes.
(I hope this calculation is correct.)
If the probability is 1% per month, then per node:
0.01 / (43800/30) ≈ 0.0000068 probability per 30 min.
For three instances failing in the same window:
0.0000068^3 ≈ 3.2e-16 probability per 30 min that all go down.
Calculated for one month (1460 windows):
3.2e-16 * 1460 ≈ 4.7e-13.
So roughly one in two trillion that all three servers go down within the same 30-minute window somewhere in one month.
After the 30 minutes another replica will already be available, making the data safe.
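The same arithmetic as a quick Python sanity check (same assumptions: independent failures, 1% chance per node per month, 30-minute replacement window):

```python
# Back-of-the-envelope durability estimate, assuming independent failures.
p_month = 0.01                  # chance a single node fails in a month
windows = 43800 / 30            # 30-minute windows in an average month (~1460)
p_window = p_month / windows    # chance a single node fails in a given window
p_all_three = p_window ** 3     # all three fail in the same window
p_any_window = p_all_three * windows

print(f"per node, per window: {p_window:.1e}")      # ~6.8e-06
print(f"all three, per month: {p_any_window:.1e}")  # ~4.7e-13, roughly 1 in 2 trillion
```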
I'm happy to be corrected.
The probability course was some years back :)
One thing I will suggest: you’re assuming failures are uncorrelated and have an equally weighted chance per unit of time.
Neither is a good assumption, in my experience. Failures being correlated to any degree greatly increases the chances of what the aviation world refers to as “the holes in the Swiss cheese lining up”.
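To put a number on it (the shared-cause probability below is purely illustrative), even a tiny correlated failure mode swamps the independent-failure estimate from upthread:

```python
# Independent vs. correlated failure of all three replicas in one 30-minute window.
# Both numbers below are illustrative, not measured failure rates.
p_indep = 6.8e-6    # per-node failure probability in a window (from the estimate above)
p_shared = 1e-7     # a shared cause (power, rack, bad deploy) takes out all three at once

p_all_independent = p_indep ** 3
p_all_correlated = p_shared + (1 - p_shared) * p_indep ** 3

print(f"independent only : {p_all_independent:.1e}")  # ~3.1e-16
print(f"with shared cause: {p_all_correlated:.1e}")   # ~1.0e-07, many orders of magnitude worse
```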
The animations are fantastic, and awesome job with the interactivity. I often find myself having to explain latency to folks in my work, and being able to see the extreme difference in latencies for something like an HDD vs an SSD makes it much easier for some people to understand.
Edit: And for real, fantastic work, this is awesome.
The level of your effort really shows through. If you had to make a ballpark guess, how much time do you think you put in? (And I realize keyboard time vs. kicking-it-around-in-your-head time are quite different.)
Thank you! I started this back in October, but of course have worked on plenty of other things in the meantime. But this was easily 200+ hours of work spread out over that time.
If this helps as context, the git diff for merging this into our website was: +5,820 −1
Half on topic: what libs/etc did you use for the animations? Not immediately obvious from the source page.
(It's a topic I'm deeply familiar with, so I don't have a comment on the content; it looks great on a skim!) But I've been sketching animations for my own blog and haven't liked the last few libs I tried.
Interesting. Running any chrome extensions that might be messing with things? Alternatively, if you can share any errors you're getting in the console lmk.
This is beautiful and brilliant, and also is a great visual tool to explain how some of the fundamental algorithms and data structures originate from the physical characteristics of storage mediums.
I wonder if anyone remembers the old days when you programmed your own custom defrag util to place your boot libs and frequently used apps on the outer tracks of the hard drive, so they loaded faster due to the higher linear velocity of the outermost tracks :)
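For anyone who never did this: the win comes from the platter spinning at a constant RPM while outer tracks hold more sectors, so more data passes under the head per revolution. A rough back-of-the-envelope (the per-track capacities are made up for illustration):

```python
# Rough sequential-throughput comparison, outer vs. inner track of a 7200 RPM drive.
# Track capacities are illustrative; real drives use many recording zones in between.
rpm = 7200
revs_per_second = rpm / 60        # 120 revolutions per second
outer_track_bytes = 2_000_000     # outer tracks pack more sectors per revolution
inner_track_bytes = 1_000_000

print(f"outer: {outer_track_bytes * revs_per_second / 1e6:.0f} MB/s")  # ~240 MB/s
print(f"inner: {inner_track_bytes * revs_per_second / 1e6:.0f} MB/s")  # ~120 MB/s
```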
Were you at all inspired by the work of Bartosz Ciechanowski? My first thought was that you all might have hired him to do the visuals for this post :)
I was delighted to see your models of tape operations as I used it a lot in the COBOL days.
For reasons discussed in your article we would arrange tape processing as much as possible in sequential scans, something at which COBOL was quite excellent. One common performance problem was when the COBOL processing speed could not keep up with the flow of blocks coming off the drive head.
In this case you would see the drive start to overshoot as it read more blocks than the COBOL program could handle. The drive would begin a painful jump-forward/spool-backward motion, which made the performance issue quite visible. You would then eyeball the code to understand why the program was not keeping up, correct it, and resubmit until the motion disappeared.
Amazing presentation. It really helps to understand the concepts.
The only thing I'd add is that it understates the impact of SSD parallelism. 8-channel controllers are typical for high-end devices, and 4K random IOPS continue to scale with queue depth, but for an introduction the example is probably complex enough.
It is great to see PlanetScale moving in this direction and sharing the knowledge.
Just going off spec sheets from manufacturers and reviews (mostly consumer products, so enterprise should be the same or better).
There are only a few major NAND manufacturers: Samsung, Micron, Kioxia / Western Digital, SK Hynix, and their branded products are usually the best.
There are also several 3rd party controller developers: Phison, Marvell, Silicon Motion, which I think are the largest, and then a bunch of others.
I hadn't looked at this in a couple of years; it seems 16-channel controllers are more common now, but only on high-end enterprise devices.
4KB random read/write specs are definitely not trustworthy without testing. They are usually measured at max queue depth and, at least for consumer devices, based on writing to a buffer in SLC mode, so they will be a lot lower once the buffer is exhausted. Enterprise specs might be more realistic, but there isn't as much public testing data available.
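If you want to check a drive yourself, a sweep along these lines shows how 4K random-read IOPS scale with queue depth (a rough sketch: assumes fio is installed, and the device path and run time are placeholders, so point it at something disposable):

```python
# Sketch: sweep 4K random-read queue depth with fio and report IOPS per depth.
import json
import subprocess

TARGET = "/dev/nvme0n1"  # placeholder: use a scratch device or test file you can hammer

for depth in (1, 4, 16, 64, 256):
    out = subprocess.run(
        ["fio", "--name=qd-sweep", f"--filename={TARGET}",
         "--rw=randread", "--bs=4k", "--direct=1", "--ioengine=libaio",
         f"--iodepth={depth}", "--runtime=30", "--time_based",
         "--output-format=json"],
        capture_output=True, text=True, check=True,
    )
    iops = json.loads(out.stdout)["jobs"][0]["read"]["iops"]
    print(f"iodepth={depth:>3}: {iops:,.0f} IOPS")
```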
Hi, what actually are the _metal_ instances being used when you're on EC2 that have local NVMe attached? Last time I looked, apart from the smallest/slowest Graviton, you have to spend circa 2.3k USD/mo to get a bare-metal instance from AWS - https://blog.alexellis.io/how-to-run-firecracker-without-kvm...
Hi there, PS employee here. In AWS, the instance types backing our Metal class are currently in the following families: r6id, i4i, i3en and i7ie. We're deploying across multiple clouds, and our "Metal" product designation has no direct link to Amazon's bare-metal offerings.
The visualizations are excellent, very fun to look at and play with, and they go along with the article extremely well. You should be proud of this, I really enjoyed it.
I don’t see any animations on Safari. Also, I’d much prefer a variable-width font; monospace prose is hard to read. While I can use Reader Mode, that removes the text coloring, and it would likely also hide the visuals (if they were visible in the first place).
The visuals add a lot to this article. A big theme throughout is latency, and the visuals help the reader see why tape is slower than an HDD, which is slower than an SSD, etc. Also, it's just plain fun!
I'm curious, what do you do on the internet without js these days?
> I'm curious, what do you do on the internet without js these days?
Browse the web, send/receive email, read stories, play games, the usual. I primarily use native apps and selectively choose what sites are permitted to use javascript, instead of letting websites visited on a whim run javascript willy nilly.