"The foundational tenet of the Zero Trust Model is that no actor, system, network, or service operating outside or within the security perimeter is trusted"
It's about time. That's how airliners are designed. The guiding principle is "no single failure will bring the airplane down."
What it is not is "guarantee critical components will not fail".
The airline design principle is applicable to all kinds of things, like electrical grid design, security design, nuclear plant design, oil drilling platform design, ship design, and on and on. But I see it rarely applied, which is frustrating.
It's a good sound bite, but it's not really true. Nobody designs their IT infrastructure so that nothing is trusted; in fact, universally, IT security teams talk about their "source of truth" (usually: their single "source of truth") for things like authentication.
What ZTN is mostly about is changing the SPOF we had before --- the network perimeter --- to something fuzzier and more flexible. It's a good plan; we've been talking for over 20 years, going back to the Jericho Forum, about "deperimeterization" and "segmentation", and segmentation hasn't really worked out.
But it'd be a mistake to look at all of this stuff and think that it's a shift away from single points of failure, or to fundamentally greater resilience. We're opting for better SPOFs, not the elimination of SPOFs.
I kind of want to challenge this "better SPOFs" idea. From what I understand of the ZTN concept, the SPOF is the factory line. If an adversary can compromise what's on the board, it's really hard to prevent that adversary from controlling what happens on that machine.
I have had people tell me in meetings at work that they have pictures of people swapping out parts on an assembly line for unknown parts. If that's happening, then unless the machines are verified by X-ray, from my point of view the ZT concept isn't worth a whole lot.
A pitfall of all these ZTN discussions is that they tend to propose a definition of "zero trust" as an axiom, and then use deduction to construct a whole system from those first principles. This is especially useful for security product vendors, who will pick the most congenial definition of "zero trust" in order to derive a mandate for their products.
We don't so much have to get lost in abstractions here. There's a blueprint: all this ZTN stuff was inspired by Google's BeyondCorp paper (ZTN is the marketing-neutral name for BeyondCorp; the Kubernetes to BC's Borg). So: just go read the BeyondCorp paper; it's great.
The paper's definitely worth reading, although it's worth noting that:
a) It's implementation-heavy
b) The implementation is tied to a specific company migrating to this approach incrementally
I would say it's a great case study for moving to a ZTN, but if I were to explain ZTN I'd probably just propose a definition. I'm technically a vendor but the definition serves me in no way.
1. Requests are mutually authenticated and always encrypted (e.g., mTLS)
2. Requests are authorized, with policy-oriented, granular, and contextual authorization
3. Requests are monitored and audited
The paper lists a bunch of ways to accomplish those goals. SSO, device inventories, an authorizing proxy, a monitor, etc.
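To make principle 1 concrete, here's a minimal sketch (mine, not the paper's) of a mutually authenticated, encrypted request using Python's standard ssl module; the CA bundle, client cert, and host name are placeholders.

```python
# Minimal sketch of principle 1: mutual TLS on every request.
# "internal-ca.pem", "client.pem", "client.key", and the host name are placeholders.
import http.client
import ssl

ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="internal-ca.pem")
ctx.load_cert_chain(certfile="client.pem", keyfile="client.key")  # client identity for mTLS
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

conn = http.client.HTTPSConnection("app.internal.example", context=ctx)
conn.request("GET", "/records/42")
print(conn.getresponse().status)
```

The server side would set verify_mode = ssl.CERT_REQUIRED so unauthenticated clients are rejected; principles 2 and 3 then sit behind that, with an authorizing proxy evaluating policy on each request and logging it.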
Of course, you're right. Vendors are horribly self-serving and have already significantly degraded the term and confused people.
This is the implementation that defined the movement. You can distill it down to principles, but if you're saying things that contradict it, you're talking about something other than ZTN.
Yeah, I'm totally with you on that. I was just trying to get across that distilling to principles is easy and effective, whereas the paper is going to describe a specific implementation.
If someone is interested in ZTN, they would be best presented with both, imo.
People land in weird places when they try to carry around just the principles. Like, "network controls are evil and should never be used". That's not at all what BeyondCorp says.
Thanks - my company is pushing a lot of ZTN stuff, and I wasn't totally sure where the principles are coming from. Having a root that I can anchor my debates in (which, as you can probably tell, is currently lacking) should be really helpful.
Is there a "now what?" if the board is compromised? I have... difficulty... understanding what you could possibly do to prevent full distributed system compromise if one hardware node is compromised. You seem very certain that something can be done, and I trust you, so I can trust that something can be done, but perhaps the solutions in this space elude me. I'll keep watching this space, but my pessimism may leak through from time to time.
I can't answer it in detail because I don't know how those systems work. But if I did, I bet I could find a way. After all, it's clear to me how Fukushima and Deepwater Horizon both had an easily fixed single point of failure that doomed them, and I ain't no expert on nuclear reactors nor drilling rigs.
When I got a job at Boeing, I thought it was impossible to design an airplane with full redundancy. They showed me how it was done.
BTW, in airplane avionics, faulty boards are detected by having more than one board, and comparing the results. Ideally the different boards are different designs, different CPUs, different code, etc.
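A toy illustration of that compare-the-results idea (nothing like real avionics code, just the voting concept):

```python
# Toy majority vote across redundant boards; a real avionics voter is far more
# involved, this just shows how a single faulty result gets outvoted and flagged.
from collections import Counter

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None  # None = no majority

print(vote([4012, 4012, 4013]))  # 4012; the dissenting board is flagged as suspect
print(vote([4012, 4013, 4014]))  # None; no majority, so fail over or alert the crew
```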
In a 'well designed system', any given node has limited access to the other nodes and user data.
If a node is compromised, that shouldn't turn into full access to everything. Specifics would vary depending on your application, but while most user-facing servers might need to be able to validate a login token, few would need to be able to issue a login token or access billing details.
If you compromise the node that has billing details or access to them, then you'll have them, but if you compromise a different node, you won't. That's better than the traditional setup, where compromising any host gets you access to the 'private' network with no auth or encryption.
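As a sketch of that split (my illustration, using the 'cryptography' package; names are made up): only the auth node holds the signing key, and user-facing nodes hold just the public key, so compromising a frontend lets you validate tokens but not mint them.

```python
# Only the auth service can issue tokens; frontends can merely verify them.
import base64
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# --- auth service (isolated node) ---
signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()            # this is all the frontends get

def issue_token(username):
    payload = username.encode()
    return payload + b"." + base64.urlsafe_b64encode(signing_key.sign(payload))

# --- user-facing node (has verify_key only) ---
def validate_token(token):
    payload, _, sig = token.rpartition(b".")
    try:
        verify_key.verify(base64.urlsafe_b64decode(sig), payload)
        return payload.decode()
    except (InvalidSignature, ValueError):
        return None

print(validate_token(issue_token("alice")))  # 'alice'
print(validate_token(b"alice.forged-sig"))   # None: a frontend compromise can't forge this
```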
A company wants to design a service with three operations: Signup(username, password), SetData(username, password, new_data), and GetData(username).
The client app is assumed to be trusted, but any one server (no more than one, though) could be compromised. Design a protocol so that the compromise allows an attacker neither to update someone else's data, nor to return wrong data, nor to cause a denial of service.
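One possible shape of an answer (hedged: this is my sketch under stated assumptions, not a vetted design): derive a per-user signing key from the password at Signup, have the client sign each write along with a version number, replicate to several servers, and have reads fetch from a majority of replicas and keep the highest-version record whose signature verifies against the public key registered at Signup. A single compromised server then can't forge updates, can't serve wrong data that verifies, and can't deny service because the other replicas still answer.

```python
# Sketch only; assumes the user's public key was registered with all replicas at Signup.
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def user_key(username, password):
    seed = hashlib.scrypt(password.encode(), salt=username.encode(),
                          n=2**14, r=8, p=1, dklen=32)
    return Ed25519PrivateKey.from_private_bytes(seed)

def make_record(username, password, version, data):
    body = json.dumps({"user": username, "version": version, "data": data}).encode()
    return {"body": body, "sig": user_key(username, password).sign(body)}

def read(replica_responses, public_key):
    """Client-side GetData: keep the highest-version record with a valid signature."""
    best = None
    for rec in replica_responses:
        try:
            public_key.verify(rec["sig"], rec["body"])
        except InvalidSignature:
            continue  # a compromised replica returned garbage; ignore it
        payload = json.loads(rec["body"])
        if best is None or payload["version"] > best["version"]:
            best = payload
    return best

key = user_key("alice", "hunter2")
replies = [make_record("alice", "hunter2", 1, "old"),
           make_record("alice", "hunter2", 2, "new"),
           {"body": b'{"user": "alice", "version": 9, "data": "evil"}', "sig": b"x" * 64}]
print(read(replies, key.public_key()))  # {'user': 'alice', 'version': 2, 'data': 'new'}
```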
Just on the last "60 Minutes" there was an episode about how a handful of substation failures would bring down the entire grid.
Yet another example of the concepts of airliner design being completely unknown outside of the aviation industry.
The "grid" doesn't appear to be a grid at all, but a tree. An actual grid would be able to route around failures, much like the internet protocol can do.
You decide in advance how to do it. You do it quickly because you've designed the system to do it automatically or to enable quick human decisions.
And yeah, you let a section fail if that prevents the entire system from failing.
The power in my neighborhood fails all the time, even though it is well within the city. Fed up, I finally installed a generator last year. Several failures since, including an outage for the night two days ago, continue to justify it. I was probably the last holdout; one by one, each of the neighbors had acquired a generator.
The power grid is in fact a graph, not a tree, and not a grid. It is automatically able to route around failures, and it does route around the failures. These failures happen every day, but people almost never notice. Almost.
The trouble is that the power grid just operates on the electrical principle of electricity following the path of least resistance. So if there's a low-resistance but low-capacity line between generation and usage, the electricity will flow through it, and there is no QoS on the line that can limit how much power goes through it, besides on/off.
The on/off nature of the interconnections is what makes the power grid fragile. If an interconnection has a rating of 100MW or whatever, and the grid attempts to pump 110MW through it, you either leave it on and hope it doesn't break, or shut it off and pump 0MW. The grid will then automatically route that 110MW through other interconnects. If there is a nearby 500MW interconnect that is currently pumping 350MW, you're fine, but if it's pumping 475MW, you're probably not. This can lead to cascading failures.
The grid does have the ability to shut down usage if there's a danger of a cascading failure. For instance, if your 100MW rated interconnect has 101MW coursing through it, you can shut down a neighborhood that's using 10MW or whatever, and keep the 100MW interconnect online with just 91MW running through it. This is what brownouts and rolling blackouts are.
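A toy version of that shedding decision, with made-up numbers (this just illustrates the arithmetic, not how grid operators actually dispatch):

```python
# Shed the smallest loads until the interconnect is back under its rating,
# instead of tripping the whole 100MW line off and dumping 110MW elsewhere.
RATING_MW = 100
flow_mw = 110
loads_mw = [3, 5, 7, 10, 85]   # made-up neighborhood loads served through this line

shed = []
for load in sorted(loads_mw):
    if flow_mw <= RATING_MW:
        break
    flow_mw -= load
    shed.append(load)

print(f"shed {shed} MW of load; line now carries {flow_mw} MW")  # shed [3, 5, 7]; 95 MW
```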
The grid can have major failures, but these major failures don't happen unless there are multiple overlapping minor failures. Similarly, if your 737 MAX has 2 MCAS sensors and 1 manual MCAS override button, and both your sensors and your manual override all fail, you're going to have a bad time. Look at the sequence of events of the 2003 Northeast blackout: https://en.wikipedia.org/wiki/Northeast_blackout_of_2003#Seq... There are at least half a dozen failures between the initial problem at 12:15pm and the point-of-no-return failure at 4:05pm that, if handled properly, would have averted the crisis and turned it into a minor inconvenience. At least half a dozen instances of human error plus multiple equipment failures. If an aircraft had that many things go wrong all at once, the plane's going down.
Note that the Texas power outage last year wasn't like this at all. The grid itself actually kept working fine; the problem was that lots of generators failed as a result of the cold, mostly due to impure natural gas freezing. The grid's generation capacity was cut to a fraction of its normal level, and as a result, they shut down a bunch of cities to reduce load. For whatever reason, they just left power off in several cities for days instead of doing rolling blackouts; I don't know enough about the circumstances to speculate on why.
The MAX failure illustrates what happens when one relies on one "cannot fail" component.
But you should know that there was a backup. Three crews experienced an MCAS failure. Two crashed. The third crew simply turned off the stab trim system and completed the flight safely. You almost never hear about crew #3.
I can't really offer any suggestions for the power grid, as I don't know enough about it, except that I'm sure there's a way :-)
The Ethiopian Airlines crew that crashed also turned off the stab trim[1]. The problem is that you couldn't turn off MCAS separately from the fly-by-wire trim control, so you could either have fly-by-wire with MCAS or manual trim control without it.
Unfortunately for the Ethiopian Airlines crew, they were too late. By the time they set the stabilizer trim to CUT OUT, MCAS had already pushed the plane's nose down too far, and they couldn't manage the manual controls, which were pretty hard to use under those extreme conditions[2].
I'm not an expert or even an amateur, but the engineering issue I see here is that Boeing really left no contingency plan for a catastrophically failing AoA sensor on take-off. You can't turn off MCAS directly, and the recommended indirect method for turning it off requires skills that many (most?) commercial airline pilots are not trained for.
The worst thing, of course, is that the Lion Air crew that crashed wasn't even aware of MCAS. The 737 MAX manuals made no mention of it.
I know they turned off the trim. They turned it off when the airplane was too far nose down.
They had successfully brought it back to normal trim with the electric trim switches, which had overridden MCAS. Then, the trim gets turned off.
The 3rd crew that survived had no idea MCAS existed. All they knew was misbehaving trim => turn it off. They also had repeatedly used the electric trim switches to return trim to normal. Then they turned it off.
This procedure was documented in an Emergency Airworthiness Directive sent to all MAX pilots, including the EA crew.
The only skill required was to use the electric trim switches to restore normal trim, which all three crews did, and then turn it off.
The point is, it is not necessary to know why the trim is misbehaving, it was only necessary to know how to turn it off. Emergency procedures are like this - they're not debugging procedures, they're about regaining control. Save the debugging for the mechanics on the ground.
P.S. the Lion Air crew restored normal trim 25 times, and never did turn it off.
I guess I don't understand what you're trying to say.
All it takes for a 737 MAX to crash is one piece of equipment failing and maybe two-thirds of an instance of human error. I submit that if three crews were presented with the same situation and two of them crashed, killing everyone on board, it's not really the sort of thing you can blame on human error.
All it takes for a catastrophic failure of the grid is literally over a dozen cascading equipment and human error instances.
I guess I don't understand -- you're saying that the grid, where over a dozen overlapping failures are required for catastrophic results, should learn from the airline industry, where it takes 2-5 failures to bring down an airliner.
I think it's possible you're misunderstanding what the point of the interview was: were they talking about a targeted attack on the power grid? An adversarial opponent, knowledgeable, determined, well funded, conducting a dedicated attack on the power grid? I think that yes, if a nation state or terrorist organization conducted a targeted attack on the power grid, it would likely do an outsized amount of damage to a regional portion of the nation compared to the resources expended. But in that case the comparison to airliners does not apply -- all the redundant safety systems in the world aren't going to save you from a Buk or SM-2MR surface-to-air missile.
And a key there is “never fix software or hardware by depending on humans” because we’re always the weak link.
And in cases like zero trust it may mean things like “make sure it works so well you don’t have people resetting access over the phone” or requiring an in-person replacement of a security token, etc.
I agree that the MAX design should have been fully redundant.
But the only reason there are pilots at all on an airliner is to deal with the unexpected.
I was told about one case where the elevator actuators froze in a slightly nose-up position. The pilot called the cabin and asked everyone to pack themselves into the front. With that he managed to regain enough pitch control to land it safely, but it surely was a terrifying ordeal for everyone on board.
(The actuators froze because of water buildup turning to ice at altitude. Actuators are now designed for, and heavily tested against, freezing water. The stab trim test for this was quite thorough - the motors were very powerful and the ice scrapers on the nut would just peel the ice off. I also enlarged the drain holes in my car doors to Boeing spec so water wouldn't build up in them and rust.)
With all due respect to the esteemed gentleman, I think sometimes an equally important question is "When this fails, should it be survived?" There may be some operations that _shouldn't_ be worked around. For example, if my auth service goes down, should accesses continue, or should we prevent all accesses?
I don't think I'm communicating clearly here, but I would like to thank the parent commenter for his insight into this area and more. Just to cover my bases.
EDIT: I want to make clear that there is no sarcasm here. I respect WalterBright probably only behind Knuth in terms of effective programmers/engineers in the field today, and I doubt I could add anything to his opinions.
If the auth service has failed, the correct response is to shut down all access to what that service was connected to.
In order to survive shutting down that access, the system must be compartmentalized, so the shutdown is limited in scope.
For example, the hydraulic system on an airplane has a redundant equivalent hydraulic system. But within a hydraulic system, it is designed to isolate failed sections of it from the rest, so the rest can still function. I.e. developing a leak in the elevator actuators won't disable the aileron actuators on the wings.
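A minimal sketch of that fail-closed-but-compartmentalized behavior (my illustration; check_with_auth_service is a stand-in for whatever the real check is):

```python
# If one compartment's auth service is unreachable, deny access to that
# compartment only; other compartments, with their own auth services, keep working.
class AuthUnavailable(Exception):
    pass

def check_with_auth_service(user, compartment):
    """Stand-in for the real call to the compartment's auth service."""
    raise AuthUnavailable("auth service for this compartment is down")

def is_allowed(user, compartment):
    try:
        return check_with_auth_service(user, compartment)
    except AuthUnavailable:
        return False  # fail closed: no auth service, no access to this compartment

print(is_allowed("alice", "billing"))  # False while billing's auth service is down
```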
The Fukushima plant and Deepwater Horizon rig both had a vulnerability to a single point of failure that produced cascading failures that unzipped the whole system. Both were easily and inexpensively preventable.
Even an auth system can be handled in various other ways - for example, auth tokens can be handed out as valid for 48 hours but normally cycled every 24 - in case of auth failure those who already have a token can keep working for up to two days.
But that may not fit the risk profile, and you may need additional considerations - and perhaps even different levels of authentication.
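As a minimal sketch of that 48-hour-validity / 24-hour-refresh idea (numbers illustrative):

```python
# Tokens are accepted for 48 hours but clients normally refresh after 24, so an
# auth-service outage of up to a day doesn't lock out anyone who already has one.
import time

VALIDITY_S = 48 * 3600
REFRESH_AFTER_S = 24 * 3600

def needs_refresh(issued_at, now=None):
    return ((now or time.time()) - issued_at) >= REFRESH_AFTER_S

def is_valid(issued_at, now=None):
    return ((now or time.time()) - issued_at) < VALIDITY_S

issued = time.time() - 30 * 3600                # token issued 30 hours ago
print(needs_refresh(issued), is_valid(issued))  # True True: should refresh, can still work
```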
A related principle is compartmentalization. For example, if your system is penetrated, it doesn't give access to all the information. There should never be "X was penetrated, 100 million records were stolen." It should be X only has access to 1 million records, and penetrating X should not compromise X2, X3, X4, etc.
And it can go further - things like “I can look at any record in the database” and “I can run any SQL query against it but only get a max of C results per hour” or similar to prevent exfiltration without preventing work against the dataset.
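A sketch of that result-budget idea as an application-level gateway (the cap, window, and class are mine, purely illustrative):

```python
# Allow arbitrary queries but cap the total rows returned per rolling hour,
# so a compromised credential can work with the data but not bulk-exfiltrate it.
import time
from collections import deque

class QueryGateway:
    def __init__(self, run_query, max_rows_per_hour=1000):
        self.run_query = run_query          # callable that actually hits the database
        self.max_rows = max_rows_per_hour
        self.history = deque()              # (timestamp, row_count) of recent answers

    def query(self, sql):
        cutoff = time.time() - 3600
        while self.history and self.history[0][0] < cutoff:
            self.history.popleft()
        used = sum(n for _, n in self.history)
        rows = self.run_query(sql)
        if used + len(rows) > self.max_rows:
            raise PermissionError("hourly result budget exhausted")
        self.history.append((time.time(), len(rows)))
        return rows

gw = QueryGateway(run_query=lambda sql: [("row",)] * 10)   # fake database for the example
print(len(gw.query("SELECT * FROM accounts LIMIT 10")))    # 10
```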
How do you do shared-secret authentication in this model? I know cert-based auth is far better, but today many apps rely on some kind of shared-secret auth. Don't you by definition trust e.g. LDAP/IPA servers then?
EDIT: In case it's not clear: I'm talking about employee facing software inside an organization where you might have some kind of single sign on system or a distributed account system (like IPA or LDAP.)
The SP never sees the credentials. The SP only sees a token which includes the username (NameID) and other attributes passed from the IdP through the client.
But with SAML you're trusting the cert/key pair on the signing end of the connection. If you say "well, we can use the cert provided by the server by getting it over HTTPS every time we need to auth with SAML," then you're trusting the Root CA Cert/Key pair for the TLS connection that underlies the HTTPS protocol. (Source: I've written two SAML SPs.)
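One way around fetching the cert over HTTPS each time is to pin the IdP's signing certificate in the SP's own config; here's a simplified stand-in for just that pinning step (real SAML verification is XML-DSig over the assertion, which this doesn't attempt, and the fingerprint is a placeholder):

```python
# Pin the IdP signing cert by fingerprint, recorded out of band at onboarding,
# so SAML signature trust doesn't depend on whatever root CA signed a TLS cert.
import hashlib

PINNED_SHA256 = "0000...placeholder..."   # hypothetical fingerprint noted at onboarding

def idp_cert_trusted(cert_der):
    return hashlib.sha256(cert_der).hexdigest() == PINNED_SHA256

print(idp_cert_trusted(b"some-der-bytes"))  # False unless it matches the pinned value
```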
With ZT, you basically have to bootstrap trust from the factory. I don't think of ZT as "don't trust anything" - it's more like "trust our supply lines".
Think about the failure modes of ZT: if the NIC, the CPU, the OS, and the bootloader are deemed secure at boot time, there has to be something that starts the bootloader and loads its keys. If you compromise _that_ piece, then you can compromise anything further up the stack and not worry too much about security alerts. The only way to make sure that all machines are secure/uncompromised is to X-ray all of the bootloader chips and verify them down to the ~100um level (got this figure by talking to a guy doing grad work @UofM when he was in SV around 2017-2018, I want to say).
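A hedged sketch of the boot-chain logic described above (names and the hash check are illustrative, not any particular secure-boot scheme); the point is that verify_chain itself, and the golden measurements it compares against, are the piece that has to be trusted from the factory:

```python
# Each boot stage is measured and compared against a known-good value; tamper
# with any stage and the check fails, unless the checker itself was compromised.
import hashlib

def measure(blob):
    return hashlib.sha256(blob).hexdigest()

def verify_chain(stages, golden):
    for name, blob in stages:
        if measure(blob) != golden[name]:
            raise RuntimeError(f"measurement mismatch at {name}; refusing to boot")
    return True

boot_chain = [("bootloader", b"bootloader image"),
              ("kernel", b"kernel image"),
              ("os", b"os image")]
golden = {name: measure(blob) for name, blob in boot_chain}  # provisioned at the factory
print(verify_chain(boot_chain, golden))  # True until something in the chain is tampered with
```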