This is a high-effort deep-dive into the details of how modern Intel systems deal with "memory disambiguation". The exploration here goes well beyond what's in Intel's detailed optimization guides (https://software.intel.com/en-us/articles/intel-sdm), and deeper than what's in Agner Fog's excellent manuals (http://www.agner.org/optimize/). The information here is new, and likely doesn't exist anywhere outside of an Intel NDA.
While most programmers can ignore this level of operation, research like this is the only thing that can predict otherwise inexplicable observed behavior like this https://news.ycombinator.com/item?id=15935283, and hopefully can allow future low-level optimizations that would not otherwise be possible. If you are writing an optimizing compiler that targets these processors, or optimizing an inner loop where every cycle counts, this is gold.
Since it's a pretty dense piece, here's a higher-level intro to what he's talking about. Modern desktop and server processors execute instructions "out-of-order". Future instructions along the "speculative" execution path are thrown into a reorder buffer capable of holding a couple hundred instructions, and then executed as soon as possible, often several instructions per cycle (i.e., they are "superscalar"). Since hundreds of instructions can be executed in the time it takes to access main memory, one of the main opportunities to make things faster is to execute loads from memory as early as possible.
But in the presence of both loads and stores, it can be difficult to determine whether it's safe to "hoist" a load above a store. Sometimes this is obviously unsafe, as when reading and storing from the same unchanged register, but what if two different registers happen to hold overlapping addresses? Then it can be hard to tell if a store happened to write to the same memory that a hoisted load was reading from, causing the load to retrieve the wrong data. This article is about how Skylake (a particular recent generation of Intel processors) actually handles these cases.