Which company deployed a chaos monkey daemon on their systems? Seemed to improve...

theolivenbaum · on Nov 18, 2024

Netflix did that many years ago, interesting idea even if a bit disruptive in the beginning https://netflix.github.io/chaosmonkey/

e28eta · on Nov 19, 2024

See also, the rest of the simian army: https://netflixtechblog.com/the-netflix-simian-army-16e57fba...

philsnow · on Nov 19, 2024

At Google, the global Chubby cell had gone so long without any downtime that people were starting to assume that it was just always available, leading to some kind of outage or other when the global cell finally did have some organic downtime.

Chubby-SRE added quarterly synthetic downtime of the global cell (iff the downtime SLA had not already been exceeded).
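
As a rough sketch of that rule (not Chubby-SRE's actual tooling; the function names, the 90-day quarter, and the 99.99% target are all assumptions for illustration), the check amounts to comparing the quarter's observed downtime against the budget the SLA allows:

```python
def downtime_budget_seconds(availability_target: float, days: int = 90) -> float:
    """Downtime the SLA permits over the period, in seconds."""
    return days * 86_400 * (1.0 - availability_target)


def synthetic_downtime_seconds(budget_s: float, observed_s: float) -> float:
    """Unused downtime budget available for a planned outage; 0 if already spent.

    Assumption: the planned outage uses at most the unused budget. The comment
    above only says synthetic downtime happens iff the SLA is not yet exceeded.
    """
    return max(0.0, budget_s - observed_s)


if __name__ == "__main__":
    budget = downtime_budget_seconds(0.9999)  # hypothetical availability target
    print(synthetic_downtime_seconds(budget, observed_s=120.0))
```

If organic outages have already used up the budget, the function returns 0 and no synthetic downtime is scheduled for that quarter.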

kelnos · on Nov 19, 2024

For those of us who haven't worked at Google, what's "Chubby" and what's a "cell"?

philsnow · on Nov 19, 2024

Ah, chubby is a distributed lock service. Think “zookeeper” and you won’t be far off.

https://static.googleusercontent.com/media/research.google.c... [pdf]

Some random blog post: https://medium.com/coinmonks/chubby-a-centralized-lock-servi...

You can run multiple copies/instances of chubby at the same time (like you could run two separate zookeepers). You usually run an odd number of them, typically 5. A group of chubby processes all managing the same namespace is a “cell”.
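
The odd count follows from majority voting. Assuming the replicas agree by majority quorum (as Paxos-style systems do), a cell stays available as long as more than half of its replicas are up. A small sketch, with hypothetical helper names:

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum_size(replicas)


for n in (3, 4, 5, 6):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under that assumption, going from 5 to 6 replicas adds no failure tolerance (both survive 2 failures), which is why even sizes are rarely worth the extra machine.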

A while ago, nearly everything at Google had at least an indirect dependency on chubby being available (for service discovery etc), so part of the standard bringup for a datacenter was setting up a dc-specific chubby cell. You could have multiple SRE-managed chubby cells per datacenter/cluster if there was some reason for it. Anybody could run their own, but chubby-sre wasn’t responsible for anybody else’s, I think.

Finally, there was a global cell. It was both distributed across multiple datacenters and also contained endpoint information for the per-dc chubby cells, so if a brand new process woke up somewhere and all it knew how to access was the global chubby cell, it could bootstrap from that to talking to chubby in any datacenter and thence to any other process anywhere, more or less.

^ there’s a lot in there that I’m fuzzy about, maybe processes wake up and only know how to access local chubby, but that cell has endpoint info for the global one? I don’t think any part of this process used dns; service discovery (including how to discover the service discovery service) was done through chubby.

starspangled · on Nov 19, 2024

Not trying to "challenge" your story, and it's an interesting anecdote in context. But if you have time to indulge me (and I'm not a real expert at distributed systems, which might be obvious) -

Why would you have a distributed lock service that (if I read right) has multiple redundant processes that can tolerate failures... and then require clients to tolerate outages? Isn't the purpose of this kind of architecture so that each client doesn't have to deal with outages?

saalweachter · on Nov 19, 2024

Because you want the failure modes to be graceful and recovery to be automatic.

When the foundation of a technology stack has a failure, there are two different axes of failure.

1. How well do things keep working without the root service? Does every service that can be provided without it still keep going?

2. How automatically does the system recover when the root service is restored? Do you need to bring down the entire system and restore it in a precise order of dependencies?

It's nice if your system can tolerate the missing service and keep chugging along, but it is essential that your system not deadlock on the root service disappearing and stay deadlocked after the service is restored. At best, that turns a downtime of minutes into a downtime of hours, as you carefully turn down every service and bring them back up in a carefully prescribed order. At worst, you discover that your system that hasn't gone down in three years has acquired circular dependencies among its services, and you need to devise new fixes and work-arounds to allow it to be brought back up at all.
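
A minimal sketch of a client that covers both axes, assuming a hypothetical `fetch` callable that raises `ConnectionError` while the root service is down (the class and its names are illustrative, not any real library's API). It keeps answering from its last known values during the outage and picks up fresh data on the first call after the service returns, with no restart or ordering required:

```python
from typing import Callable, Optional


class CachedLookup:
    """Wrap lookups against a root service so callers degrade instead of blocking."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # hypothetical call into the root service
        self._cache = {}     # last known good value per key

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._fetch(key)
        except ConnectionError:
            # Root service is down: serve stale data (or None) rather than hang.
            return self._cache.get(key)
        # Root service is reachable: refresh the cache. Recovery is automatic
        # because every call tries the service again.
        self._cache[key] = value
        return value
```

The trade-off is staleness: during an outage callers see old values, so this only suits data where a stale answer is better than none.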

praptak · on Nov 19, 2024

First, a global system with no outages (say the gold standard of 99.999% availability) is a promise which is basically impossible to keep.

Second, a global system being always available definitely doesn't mean it is always available everywhere. A single datacenter or even a larger region will experience both outages and network splits. It means that whatever you design on top of the super-available global system will have to deal with the global system being unavailable anyway.

TLDR is that the clients will have to tolerate outages (or at least frequent cut offs from the "global" state) anyway so it's better not to give them false promises.

chili6426 · on Nov 19, 2024

I don't work at Google but I do live in a country where I have access to google.com. Chubby is a lock service that is used internally at Google and a cell is referring to an instance of Chubby. You can read more here: https://static.googleusercontent.com/media/research.google.c...

nine_k · on Nov 19, 2024

Replace this with "API gateway cluster", or basically any simple enough, very widely used service.

RcouF1uZ4gsC · on Nov 19, 2024

The same company that was in the news recently for screwing up a livestream of a boxing match.

chrisweekly · on Nov 19, 2024

True, but it's the exception that proves the rule; it's also the same company responsible for delivering a staggeringly high percentage of internet video, typically without a hitch.

tovej · on Nov 19, 2024

That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.

When used like this it just confuses a reader with rhetoric. In this case netflix is just bad at live streaming, they clearly haven't done the necessary engineering work on it.

nine_k · on Nov 19, 2024

The fact that Netflix surprised so many people by an exceptional technical issue implies that as a rule Netflix delivers video smoothly and at any scale necessary.

chrisweekly · on Nov 19, 2024

Yes! THIS is precisely what I meant in my comment.

jjk166 · on Nov 19, 2024

That's not what "an exception proving the rule" is either. The term comes from a now mostly obsolete* meaning of prove: "to test or trial" something. So the idiom properly means "the exception puts the rule to the test." If there is an exception, it means the rule was broken. The idiom has taken on the opposite meaning due to its frequent misuse, which may have started out tongue in cheek but now is used unironically. It's much like using literally to describe something which is figurative.

* This is also where we get terms like bulletproof - in the early days of firearms people wanted armor that would stop bullets from the relatively weak weapons, so armor smiths would shoot their work to prove them against bullets, and those that passed the test were bullet proof. Likewise alcohol proof rating comes from a test used to prove alcohol in the 1500s.

lelanthran · on Nov 19, 2024

> That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.

So the rule is "Free parking on Sundays", and the exception that proves it is "Free parking on Sundays"? That's a post-hoc (circular) argument that does not convince me at all.

I read a different explanation of this phrase on HN recently: the "prove" in "exception proves the rule" has the same meaning as the "prove" (or "proof") in "50% proof alcohol".

AIUI, in this context "proof" means "tests". The exception that tests the rule simply shows where the limits of the rules actually are.

Well, that's how I understood it, anyway. Made sense to me at the time I read the explanation, but I'm open to being convinced otherwise with sufficiently persuasive logic :-)

taejo · on Nov 19, 2024

The rule is non-free parking. The exception is Sundays.

lucianbr · on Nov 19, 2024

The meaning of a word or expression is not a matter of persuasive logic. It just means what people think it means. (Otherwise using it would not work to communicate.) That is why a dictionary is not a collection of theorems. Can you provide a persuasive logic for the meaning of the word "yes"?

https://en.wikipedia.org/wiki/Exception_that_proves_the_rule

Seems like both interpretations are used widely.

tsimionescu · on Nov 19, 2024

The origin of the phrase is the aphorism that "all rules have an exception". So, when someone claims something is a rule and you find an exception, that's just the exception that proves it's a real rule. It's a joke, essentially, based on the common-sense meaning of the word "rule" (which is much less strict than the mathematical word "rule").

seaal · on Nov 19, 2024

50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.

lelanthran · on Nov 19, 2024

> 50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.

50% proof wouldn't be 25% ABV?

oofabz · on Nov 19, 2024

Since 50% = 0.5, and proof doesn't take a percentage, I believe "50% proof" would be 0.25% ABV.
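
Spelled out as arithmetic, assuming the US convention used upthread (proof is twice the ABV percentage; the function name is made up for illustration):

```python
def abv_percent_from_us_proof(proof: float) -> float:
    """Convert US proof to alcohol by volume, in percent (proof = 2 x ABV%)."""
    return proof / 2.0


print(abv_percent_from_us_proof(100))  # 100 proof
print(abv_percent_from_us_proof(0.5))  # "50% proof", reading 50% as 0.5
```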

lelanthran · on Nov 19, 2024

Good catch :-)

eru · on Nov 19, 2024

Yes, though serving static files is easier than streaming live.

bitwize · on Nov 18, 2024

The chaos monkey is there to remind you to always mount a scratch monkey.

amelius · on Nov 18, 2024

"Your flight has been delayed due to Chaos Monkey."

nine_k · on Nov 19, 2024

This means a major system problem. The point of the Chaos Monkey is that the system should function without interruptions or problems despite the activity of the Chaos Monkey. That is, it is to keep the system in such a shape that it could swallow and overcome some rate of failure, higher than the "naturally occurring" rate.
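
A toy sketch of that idea, not Netflix's implementation: every name, the kill probability, and the minimum-healthy threshold here are assumptions for illustration. Each round terminates instances at random, and the check is whether the fleet can still serve afterwards:

```python
import random
from typing import List


def chaos_round(instances: List[str], kill_probability: float, rng: random.Random) -> List[str]:
    """Return the instances that survive one round of random termination."""
    return [name for name in instances if rng.random() >= kill_probability]


def can_serve(survivors: List[str], minimum_healthy: int = 1) -> bool:
    """The service stays up while at least `minimum_healthy` instances remain."""
    return len(survivors) >= minimum_healthy


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    fleet = [f"replica-{i}" for i in range(6)]
    survivors = chaos_round(fleet, kill_probability=0.2, rng=rng)
    print(survivors, can_serve(survivors, minimum_healthy=3))
```

Setting `kill_probability` above the failure rate you actually observe is the point: if `can_serve` holds under the inflated rate, the naturally occurring one should be survivable too.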

bigiain · on Nov 19, 2024

"My name is Susie and I'll be the purser on your flight today, and on behalf of the Captain Chaos Monkey and the First Officer Chaos Monkey... Oh. Shit..."