All of those things are solved with modern extensions. It's like comparing pre-MMX x86 code with modern x86. Misaligned loads and stores are Zicclsm, bit manipulation is Zb[abcs], atomic memory operations are made mandatory in Ziccamoa.
All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up-to-date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons.
But RISC-V is a _new_ ISA. Why did we start out with the wrong design that now needs a bunch of extensions? RISC-V should have taken the lessons from x86 and ARM, but instead they seem to be committing the same mistakes.
I was a bit shocked by the headline, given how poorly ARM and x86 compare to RISC-V in speed, cost, and efficiency ... in the MCU space where I near-exclusively live and where RISC-V has near-exclusively lived until quite recently. RISC-V has been great for RTOS systems, and Espressif in particular has pushed MCUs to a new level where it's become viable to run a designed-from-scratch web server (you better believe we're using vector graphics) on a $5 board that sits on your thumb. But using RISC-V in SBCs and beyond as the primary CPU is a very different ballgame.
It's not the wrong design; RISC-V is designed around extensions, and they left room in the instruction encoding for them. They don't have an 800-lb gorilla like Intel shoving the ISA down customers' throats (Canonical is the closest thing), so there is some debate about which combination of extensions is needed for desktop apps.
> They don't have an 800-lb gorilla like Intel shoving the ISA down customers' throats
Nobody really forces you to use x64 if you don't like it, just as nobody forced you to use Itanium — which Intel famously failed to "shove down the customers' throats" btw.
It is a reduced instruction set computing ISA, of course. It shouldn't really have instructions for every edge case.
I only use it for microcontrollers and it's really nice there. But yeah, I can imagine it doesn't perform well on bigger stuff. The idea of RISC was to put the intelligence in the compiler, though, not the silicon.
> It shouldn't really have instructions for every edge case.
Depends on what the instruction does. If it goes through a four-loads-four-stores chain like the VAXen could famously do (with pre- and post-increments), then sure, that makes it impossible to implement such an ISA in a superscalar, OoO manner (DEC tried really, really hard and couldn't do it). But anything that essentially bit-fiddles in funny ways with the 2 sets of 64 bits already available from the source registers, plus the immediate? Shove it in, why not? ARM has had bit-shifted operands available for almost every instruction since ARMv1. And RISC-V also finally gets the shNadd instructions, which are essentially x86/x64's SIB byte, except available as a separate instruction. It got "andn", which, arguably, is more useful than a pure NOT anyway (most uses of ~ in C are in expressions of the "var &= ~expr..." variety) and costs almost nothing to implement. Bit rotations, too, including rev8 and brev8. Heck, we even got max/min instructions in RISC-V because, again, why not? The usage is incredibly widespread, the implementation is trivial, and it makes life easier both for HW implementers (no need to try to macro-fuse common instruction sequences) and SW writers (no need to either invent those instruction sequences and hope they'll get accelerated, or read manufacturers' datasheets for "officially" blessed instruction sequences).
As proven by x86/x64 and ARM evolution, going all-in on pure RISC doesn't pay off, because there is only so much compilers can do in an AOT deployment scenario.
> The idea of risc was to put the intelligence in the compiler though, not the silicon.
Itanium made this mistake. Sure, compilers are much better now, but dynamic scheduling still beats static scheduling for real-world tasks. You can (almost perfectly) statically schedule matrix multiplication, but not a UI or a 3D game.
Even GPUs have some amount of dynamic scheduling now.
It was kind of an experiment from start. Some ideas turned out to be good, so we keep them. Some ideas turned out not to be good, so we fix them with extensions.
It's hard to imagine a student putting together a RVA23 core in a single semester. And you don't really want that in the embedded roles RISC-V has found a lot of success in either.
Now they want to introduce yet another (sic!) extension, Oilsm... It maaaaaay become part of RVA30, so in the best-case scenario it will be decades before we'll be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").
IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they had mandated that misaligned pointers always be accepted. Instead we have to deal with the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.
And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.
I have other minor performance-related nits, such as the `seed` CSR being allowed to produce poor-quality entropy, which means that we have to bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered microcontroller.
By no means do I consider myself a RISC-V expert; if anything, my familiarity with the ISA as a systems language programmer is quite shallow. But the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.
RISC-V truly is the RyanAir of processors: Oh, you want FP maths? That's an optional extra, did you check that when you booked? And was that single or double-precision, all optional extras at an extra charge. Atomic instructions, that's an extra too, have your credit card details handy. Multiply and divide? Yeah, extras. Now, let me tell you about our high-end customer options, packed SIMD and user-level interrupts, only for business class users. And then there's our first-class benefits, hypervisor extensions for big spenders, and even more, all optional extras.
So it's modular. This is normally considered a good thing. It means you don't have to pay for features you don't need.
The ISA is open so there's no greedy corporation trying to upsell you. I mean there's an implementation and die area cost for each extension but it's not being set at an artificial level by a monopolist.
There's a good chance you're actually paying more for the features you don't need. Preparing an EUV mask set costs something like 30 million dollars (that figure may be out of date, i.e. it could be more now). So instead of a single mask set with everything on the device, whether you need it or not, you're paying $30 million for each special-snowflake variant. This is why vendors do a one-size-fits-all version of many of their products and then disable the extra functionality for the cheaper market segments, because it's much, much cheaper than making separate reduced-functionality devices.
It's a good thing in many cases but not if you're going to be running applications distributed as binaries. Maybe if we go the Gentoo route of everybody always recompiling everything for their own system?
RVA23 is, finally, the belated admission that maybe we shouldn't have everything as optional extras. Hopefully it'll take off; I can't imagine what sort of headache it is for maintainers of repos who have to track a dozen different variants of binaries depending on which flavour of RISC-V apt-get is targeting.
The "G" extension for everything you want to run shrink-wrapped binaries on a standard OS has been there since the May 7 2014 "User Level ISA, Version 2.0", which is before RISC-V started to be promoted outside of Berkeley e.g. at Hot Chips 26 in August 2014, and the first RISC-V workshop in January 2015 in Monterey.
The name "G" has now morphed (along with the C extension) into "RVA20", which led to "RVA22" and "RVA23", but the principle is unchanged.
"An integer base plus these four standard extensions (“IMAFD”) is given the abbreviation “G” and provides a general-purpose scalar instruction set. RV32G and RV64G are currently the default target of our compiler toolchains."
"Making everything optional" is for the embedded space.
As for general purpose processors, RISC-V has always had the idea of profiles (mandatory set of extensions). Just look at the G extension, which mandated floating point, multiply/division, atomics, ... things that you expect to see on user-facing general-purpose processors.
> the belated admission that maybe we shouldn't have everything as optional extras
That's why I disagree with the above claim.
(1) The optionality is a feature of RISC-V and it allows RISC-V to shine on different ecosystems. The desktop isn't everything.
(2) RISC-V has always addressed the fear of fragmentation on the desktop by using profiles.
RVA23 (and RVA20 before it) aren't an admission that RISC-V got it wrong. It's a necessary step to make RISC-V competitive in the desktop space, as opposed to microcontrollers, where the flexibility is hugely valuable.
But that means a port of Linux can’t be to RISC-V, it has to be to a specific implementation of RISC-V, or if sufficient (which seems still debatable) to a specific common RISC-V profile.
In what way are RISC-V profiles debatable? Canonical is spearheading the RVA23-as-a-default movement and so far, it seems that there are no heavy objections to that effort (beyond the usual "Canonical sucks" shtick that you see in every discussion involving Canonical).
You can target the minimum instruction set and it'll run everywhere. Albeit very slowly. Perhaps you use a fat binary to get reasonable performance in most cases.
This isn't easy but it can be done (and it is being done on x86, despite constantly evolving variations of AVX).
Interestingly, RISC-V vector extensions are variable length.
So, you can compile your RISC-V software to require the equivalent of AVX and it will run on whatever size vectors the hardware supports.
So, on x86-64, if I write AVX2 software and run it on AVX512 capable hardware, I am leaving performance on the table. But if I write software that uses AVX512, it will not run on hardware that does not support those extensions (flags).
On RISC-V, the same binary that uses 256 bit vectors on hardware that only supports that will use 512 bit vectors on hardware that supports it, or even 1024 bit vectors on hardware like the A100 cores of the SpacemiT K3.
So, I guess x86-64 is the RyanAir of processors.
(Personal opinion)
I get the impression that RISC-V-related discussions often lack awareness of prior work/alternatives. A large amount of (x86) software actually uses our Highway library to run on whatever size vectors and instructions the CPU offers.
This works quite well in practice. As to leaving performance on the table, it seems RVV has some egregious performance differences/cliffs. For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?
> For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?
Use Zvzip; in the meantime:
zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented load/store when you are touching memory anyways
unzip: vnsrl
trn1/trn2: masked vslide1up/vslide1down with even/odd mask
The only thing base RVV does badly in those is register-to-register zip, which takes twice as many instructions as other ISAs. Zvzip gives you dedicated instructions for the above.
Looks like the ratification plan for Zvzip is November. So maybe 3 years until HW is actually usable?
That's a neat trick with vwmaccu, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(
Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?
RyanAir is about exploiting consumers, with bait-and-switch and shitty terms and conditions.
RISC-V's modularity is about giving choice to hardware designers, so they can pick and choose just those features that their solution needs, and even allow for custom extensions.
RISC-V's modularity is for academia. 1) for education, where students learn/use/work on simple processors, 2) for research in new types of hardware and extensions, where ease of implementation or ease of creating a custom extension is important.
Extensions are not just for academia. If I am building a microcontroller to control the storage media I am selling (e.g. hard drives), why do I need to implement a bunch of features I am not going to use? What about my flow-rate monitor? Or my pacemaker?
In some of these, less silicon means less power means more better. Like that last example.
Then x86_64 is the cable television service of processors. "Oh, you want channel 5? Then you have to buy this bundle with 40 other channels you will never watch, including 7 channels in languages you do not speak."
And where it actually mattered, they did not introduce a separate extension. Integer division is significantly more complex than multiplication, so it may make sense for low-end microcontrollers to implement only the latter in hardware.
RyanAir is the least expensive right? And it still gets you there?
I would be ok with that if it was a valid analogy.
It is valid in microcontroller land. There, the chip and the software are provided by the same party, so you can select exactly the RISC-V features you need and save yourself some silicon. That sounds like a win to me.
At the application level, like a server or a desktop, that would be a disaster, because I get my hardware and software from different people. How do the software guys know what hardware to target? Well, that is exactly why RVA23 exists.
What does RVA23 mean? It is the RISC-V "Application" profile. It allows you to build software for a single hardware target and trust that hardware makers will target the same profile. RVA23 is like saying x86-64v4. Both are simple names for a long list of extensions (flags) and assumptions that you expect the hardware to honour. So, when Ubuntu 26.04 says it requires RVA23, it means that all the software built on it can assume those features. No a la carte.
The reason RVA23 is getting so much attention is that it has essentially the same feature set as modern ARM64 or x86-64. Software will be able to target this profile for a long time. There may be a new profile in a few years' time, like RVA30, but hardware that implements that will still run RVA23 software (just as x86-64v4 hardware will run x86-64v1 software). Hardware built for profiles before RVA23 may be missing features modern applications expect.
I guess you could say that RVA23 is British Airways Business Class.
If you really want to support hardware designed before RVA23, almost everything you would want to run pre-built software on supports RVA20. And again, your RVA20 stuff will run fine on RVA23 hardware (but with fewer features--like no vectors). So maybe no in-flight meal, but it will get you there.
I think having separate unaligned load/store instructions would be a much worse design, not least because they use a lot of the opcode space. I don't understand why you don't just have an option to not generate misaligned loads for people that happen to be running on CPUs where it's really slow. You don't need to wait for a profile for that.
As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if its seed entropy is sufficient. By the time you get to CPUs where portable code is important, a CSPRNG is probably fine.
I agree about page size though. Svnapot seems overly complicated and gives only a fraction of the advantages of actually bigger pages.
>As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if its seed entropy is sufficient.
It's a terrible attitude to have towards programmers, but looking at misaligned ops, I guess we can see a pattern from RISC-V authors here.
Most programmers do not target a concrete microcontroller and develop every line of code from scratch. They either develop portable libraries (e.g. https://docs.rs/getrandom) or build their projects using those libraries.
The whole raison d'être of an ISA is to provide a portable contract between hardware vendors and programmers. RISC-V authors shirk this responsibility with a "just look at your micro's specs, lol" attitude.
The option to generate or not generate misaligned loads/stores does exist (-mno-strict-align / -mstrict-align). But of course that's a compile-time option, and of course the preferred state would be to have them on by default; but RVA23 doesn't sufficiently guarantee/encourage them not being unreasonably slow, leaving native misaligned loads/stores still effectively unusable (and off by default in clang/gcc with -march=rva23u64).
aka, Zicclsm / RVA23 are entirely-useless as far as actually getting to make use of native misaligned loads/stores goes.
Yeah, that is quite funky; and indeed gcc does that. Relatedly, it's super annoying that `vle64.v` & co could then also make use of that same hardware, but that's not guaranteed. (I suppose there could be awful hardware that does vle8.v via single-byte loads, which wouldn't translate to vle64.v?)
> RVA23 doesn't guarantee them not being unreasonably slow
Right, but it doesn't guarantee that anything isn't unreasonably slow, does it? I am free to make an RVA23-compliant CPU with a div instruction that takes 10k cycles. Does that mean LLVM won't output div? At some point you're left with either -mcpu=<specific cpu> or falling back to reasonable assumptions about the actual hardware landscape.
Do ARM or x86 make any guarantees about the performance of misaligned loads/stores? I couldn't find anything.
Exactly, I 100% agree, and IMO toolchains should default to assuming fast misaligned load/store for RISC-V.
However, the spec has the explicit note:
> Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.
Which was a mistake. As you said any instruction could be arbitrarily slow, and in other aspects where performance recommendations could actually be useful RVI usually says "we can't mandate implementation".
I don't think x86/ARM particularly guarantee fastness, but at least they effectively encourage making use of them via their contributions to compilers that do. They also don't really need to, given that they mostly control who can make hardware anyway. (At the very least, if general-purpose HW with horribly slow misaligned loads/stores came out from them, people would laugh at it, and assume/hope that that's because of some silicon defect requiring chicken-bitting it off, instead of just not bothering to implement it.)
Indeed one can make any instruction take basically forever, but I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation (on at least some inputs); otherwise having said instruction is strictly redundant.
And if any significant general-purpose hardware actually did a 10k-cycle div around the time the respective compiler defaults were decided, I think there's a good chance that software would have defaulted to calling division through a function such that an implementation can be picked depending on the running hardware. (Let's ignore whether 10k-cycle division and general-purpose hardware would ever go together... but misaligned mem ops and general-purpose hardware definitely do.)
> if general-purpose HW with horribly-slow misaligned loads/stores came out from them
How is that different for RISC-V?
> I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation
I agree! So just use misaligned loads if Zicclsm is supported. As you observed there's a feedback loop between what compilers output and what gets optimised in hardware. Since RVA23 hardware is basically non-existent at the moment you kind of have the opportunity to dictate to hardware "LLVM will use misaligned accesses on RVA23; if you make an RVA23 chip where this is horribly slow then people will laugh at you and assume it's some sort of silicon defect".
RISC-V hardware with slow misaligned mem ops does exist to a non-insignificant extent, and it seems not enough people have laughed at them; instead compilers just surrendered and defaulted to not using them.
> As you observed there's a feedback loop between what compilers output and what gets optimised in hardware.
Well, that loop needs to start somewhere, and it has already started, and started wrong. I suppose we'll see what happens with real RVA23 hardware; at the very least, even if it takes a decade for most hardware to support misaligned well, software could retroactively change its defaults while still remaining technically-RVA23-compatible, so I suppose that's good.
> RISC-V hardware with slow misaligned mem ops does exist to a non-insignificant extent
Only U74 and P550, old RV64GC CPUs.
SiFive's RVA23 cores have fast misaligned accesses, as do all THead and SpacemiT cores.
I can't imagine that all the Tenstorrent and Ventana and so forth people doing massively OoO 8-wide cores won't also have fast misaligned accesses.
As a previous poster said: if you're targeting RVA23 then just assume misaligned is fast and if someone one day makes one that isn't then sucks to be them.
P550 is, like, what, only a year old? I suppose there has been some laughing at it at least.
Also the Kendryte K230 / C908, but only on vector mem ops, which adds a whole other mess onto this.
I'd hope all the massive OoO cores will have fast misaligned mem ops; anything else would immediately cause infinite pain for decades.
But of course there'll be plenty of RVA23 hardware that's much smaller eventually too, once it becomes a general expectation instead of "cool thing for the very-top-end to have".
I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
It has taken a while for this core to appear in an SoC suitable for SBCs; Intel was originally announced as doing that and got as far as showing a working SoC/board at the Intel Innovation 2022 event in September 2022.
Someone who attended that event was able to download the source code for my primes benchmark and compile and run it, at the show, and was kind enough to send me the results. They were fine.
For reasons known only to Intel, they subsequently cancelled mass production of the chip.
ESWIN stepped up and made the EIC7700X, as used in the Milk-V Megrez and SiFive HiFive Premier P550, which did indeed ship just over a year ago.
But technically we could have had boards with the Intel chip three years ago.
Heck we should have had the far better/faster Milk-V Oasis with the P670 core (and 16 of them!) two years ago. Again, that was business/politics that prevented it, not technology.
> No, it was released to customers in June 2021, almost five years ago.
Ah, okay. (Still, like, at least a couple decades newer than the last x86-64 chip with slow unaligned mem ops, if such ever existed at all? Haven't heard of / can't find anything saying any AArch64 ever had problems with them either, so still much worse for the RISC-V side.)
Well, I suppose we can hope that business/politics messes will all never happen again and won't affect anything RVA23.
> I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
This very much has a "for now" on it. Once there is actually widespread hardware with the feature, I would be very surprised if the compilers don't update their heuristics (at least for RVA23 chips)
Indeed we shall hope heuristics update; but of course if no compilers emit it, hardware has no reason to actually bother making misaligned ops fast, so it's primed for going wrong.
Hardware devs have traditionally been pretty good at helping the compiler teams with things like this (because it's a lot cheaper to improve the compiler than your chip).
>So just use misaligned loads if Zicclsm is supported.
LLVM and GCC developers clearly disagree with you. In other words, reiterating the previously raised point: Zicclsm is effectively useless and we have to wait decades for a hypothetical Oilsm.
Most programmers will not know that the misaligned issue even exists, even less about options like -mno-strict-align. They'll just compile their project with default settings and blame RISC-V for being slow.
RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.
Well, we don't necessarily have to wait for Oilsm; software that wants to could just choose to be opinionated and run massively worse on suboptimal hardware. And, of course, once Oilsm hardware becomes the standard, it'd be fine to recompile RVA23-targeting software for it too.
> RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.
It's rather hard to mandate performance in an open ISA. Especially considering that there could actually be scenarios where it may be necessary to chicken-bit it off; and of course the fact that there's already some questionability around ops crossing pages, where even ARM/x86 are very slow.
I am not saying that RISC-V should mandate performance. If anything, we wouldn't have had the problem with Zicclsm if they had not bothered with the stupid performance note.
I would be fine with any of the following 3 approaches:
1) Mandate that store/loads do not support misaligned pointers and introduce separate misaligned instructions (good for correctness, so it's my personal preference).
2) Mandate that store/loads always support misaligned pointers.
3) Mandate that store/loads do not support misaligned pointers unless Zicclsm/Oilsm/whatever is available.
If hardware wants to implement slow handling of misaligned pointers for some reason, it's squarely the responsibility of the hardware vendor. And everyone would know whom to blame for poor performance on some workloads.
We are effectively going to end up with 3, but many years later and with a lot of unnecessary additional mess associated with it. Arguably, this issue should've been sorted out long ago, back when the I extension was ratified.
2 is basically infeasible with RISC-V being intended for a wide range of use-cases. 1 might be ok but introduces a bunch of opcode space waste.
Indeed extremely sad that Zicclsm wasn't a thing in the spec, from the very start (never mind that even now it only lives in the profiles spec); going through the git history, seems that the text around misaligned handling optionality goes all the way back to the very start of the riscv/riscv-isa-manual repo, before `Z*` extensions existed at all.
More broadly, it's rather sad that there aren't similar extensions for other forms of optional behavior (thing that was recently brought up is RVV vsetvli with e.g. `e64,mf2`, useful for massive-VLEN>DLEN hardware).
>1 might be ok but introduces a bunch of opcode space waste.
I wouldn't call it "waste". Moreover, it's fine for misaligned instructions to use a wider encoding or be less rich than their aligned counterparts. For example, they may not have the immediate offset, or have a shorter one. One fun potential possibility is to encode the misaligned variant into the aligned instructions using the immediate offset with all bits set to one; as a side effect it would also make the offset fully symmetric.
Of course that'd result in entirely-avoidable slowdown for the potentially-misaligned ops. Perhaps fine for a program that doesn't use them frequently, but quite bad for ones that need misaligned ops everywhere.
In terms of correctness, there's also the possibility of partially-misaligned ops (e.g. an 8B load with 4B alignment, loading two adjacent int32_t fields) so you're not handling everything with correct faults anyways.
I don't think it's too bad. The compressed extension was arguably a mistake (and shouldn't be in RVA23 IMO), but apart from that there aren't any major blunders. You're probably thinking about how JAL(R) basically always uses x1/x5 (or whatever it is), but I don't think that's a huge deal.
About 1/3 of the opcode space is used currently so there's a decent amount of space left.
Yep, RISC-V also has these megapages. 4k is the last-level page size; you get larger pages (4M on 32-bit and 2M/1G on 64-bit) by terminating the walk at higher levels of the page table.
Yes, and Linux, at least historically, has not used them without explicit program opt-in. Often the advice is to disable transparent huge pages for performance reasons. Not sure about other operating systems.
x86 has decades of knowhow and a zillion transistors to spend on making the memory pipeline, TLB caching & prefetching etc. etc. really really good. They work as well as they do despite the 4k base page size, not because of it.
If you'd start from a clean sheet today you'd probably end up with a somewhat bigger base page size. Not hugely larger though, as that wastes a lot of memory for most applications. Maybe 16k like some ARM chips use?
First, x86-64 also has “extensions” such as AVX, AVX2, and AVX-512. Not all “x86-64” CPUs support the same ones. And you get things like SVM on AMD and AVX on Intel. Remember 3DNow!?
x86-64 also has “profiles” which tell you what extensions should be available. There are x86-64v1 and x86-64v4, with v2 and v3 in the middle.
RVA23 offers a very similar feature-set to x86-64v4.
You do not end up with a mess of extensions. You get RVA23. Yes, RVA23 represents a set of mandatory extensions. The important thing is that two RVA23 compliant chips will implement the same ones.
But the most important point is that you cannot “just use x86-64”. Only Intel and AMD can do that. Anybody can build a RISC-V chip. You do not need permission.
>Anybody can build a RISC-V chip. You do not need permission.
No, anybody can’t build a RISC-V chip. That’s the same mistake OSS proponents make. Just because something is open source doesn’t mean bugs will be found. And just because bugs are found doesn’t mean they will be fixed. The vast majority of people can’t do either.
The number of people who can design a chip implementation of the RISC-V ISA is much, much smaller, and the number who can get or own a FAB to manufacture the chips smaller still. You don’t need permission to use the ISA, but that is not the only gate.
Yes, they can. My point is that nobody needs to give you permission. You can pretend that does not matter but China is about to educate us about what this means rather dramatically in the next few years.
And India is building RISC-V chips. And Europe is building RISC-V chips. Tenstorrent started in Canada (building RISC-V chips).
> the number who can get or own a FAB to manufacture the chips
Really? Almost nobody owns fabs and yet there are a multitude of chip makers. Getting access to a fab requires only money. It has nothing to do with the ISA or your skills. TSMC can make RISC-V chips just fine and already do. In some places, like China, RISC-V chips may be at the front of the line.
> The number of people who can design a chip implementation of the RISC-V ISA
Every electrical engineer is going to know how to design a RISC-V chip. But you could also be an intelligent garbage man and design a RISC-V chip in your spare time using only open source materials. You can even tape it out.
"But that is only a 32 bit microcontroller!", you might say. Sure. But the skills to build RISC-V are going to propogate. Of course, that does not mean that everybody in the world is going to figure out how to build chips. That is clearly not my point. They will still be built primarily by a select few. But that is not unique to RISC-V by any stretch. In fact, less so.
The hard part about building a chip from scratch is not the ISA. You think that a world-class engineer working with ARM64 or amd64 today cannot design a RISC-V chip? That is like saying a carpenter building oak cabinets lacks the skills to make them with maple.
And since it is the same amount of work to start fresh regardless of ISA, why not start with RISC-V?
Except you do not have to start fresh with RISC-V because there are many, and will be many, many more, open designs to study and start with. Here is a 64 bit chip that implements the very latest RISC-V vector extensions:
Which, by the way, means that although most won't, anybody can build a RISC-V chip.
The RISC-V world will look like ARM. Most chip makers will license the core design off somebody else. But there will be more of those "somebody elses" to choose from. And there will be more people who choose to design their own silicon. Meta just bought Rivos. What do you think that was for? And they did not have to talk to ARM about it.
> lots of commercial software is now compiled for newer x86 64 extensions.
Almost all software I've encountered - including Windows 10 and precompiled Debian 13 - needs only SSE4.2, essentially a mid-2000s ISA. Until very recently (the early 2020s), Intel produced Celeron CPUs which did not even support AVX.
Yet I still have regular conversations explaining "there is no way our customers are running on hardware that doesn't support this, where would they even be getting the hardware from, 2008?". I have a set of requirements in front of me requiring software to run on not only all Intel 64-bit chips, but also all Intel 32-bit chips.
No, you really can’t. For some OSS, on hardware that has an OS supported by that software, with a compiler that supports that target and the options you want, and in some cases where the OSS has been written to support those options, you can compile it. Otherwise you are just out of luck.
I don't really understand your position here. Compiler availability isn't really that big of a deal, even on obscure or proprietary platforms. Why would there be "some cases where the OSS has been written to support those options"?
Because the ISA is not encumbered the way other ISAs are legally, and there are use cases where the minimal profile is fine for the sake of embedded whatever vs the cost to implement the extensions
I'm not a lawyer but I would assume it's copyright. Kind of like APIs in software. In software somehow this does not apply most of the time, but it seems in hardware this is very real. I would appreciate a lawyer jumping in.
I know, for example, that pre-RISC-V Berkeley had a deal with Intel about using x86-64 for research. But they were not able to share the designs.
I don't know why there aren't independent x86-64 manufacturers. Patents on the extensions, maybe? But as I understand copyright, APIs can't be copyrighted, so it's not that.
I doubt there is any legal barrier, because there are a few existing projects with x86 cores on an FPGA, as well as some SoCs. Here's a 486: https://opencores.org/projects/ao486
Regarding misaligned reads, IIRC only x86 hides non-aligned memory access. It's still slower than aligned reads. Other processors just fault, so it would make sense to do the same on RISC-V.
The problem is decades of software being written on a chip that from the outside appears not to care.
ARM Cortex-A cores also allow unaligned access (MCU cores don't though, and older ARM is weird). There's perhaps a hint in the fact that the two most popular CPU architectures have both ended up with the forgiving approach to unaligned access, rather than the penalising approach of raising a fault.
Yes, unaligned loads/stores are a niche feature that has huge implications in processor design - loads across cache-lines with different residency, pages that fault etc.
This is the classic conundrum of legacy system redesign - if customers keep demanding every feature of the old system be present and work exactly the same, then the new system takes on the baggage it was designed to get rid of.
The new implementation will be slow and buggy by this standard and nobody will use it.
Unaligned load/store is crucial for zero-copy handling of mmaped data, network streams and all other kinds of space-optimized data structures.
If the CPU doesn't do it software must make many tiny conditional copies which is bad for branch prediction.
This sucks double when you have variable length vector operations... IMO fast unaligned memory accesses should have been mandatory without exceptions for all application-level profiles and everything with vector.
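A minimal Rust sketch of the zero-copy pattern described above (the packet layout and `parse_id` name are hypothetical): `from_le_bytes` lowers to a single, possibly unaligned, load on targets that permit it, and to byte loads plus shifts on targets that don't.

```rust
// Hypothetical wire format: a u16 length at byte 0, then a u32 id at byte 2.
// The u32 sits at offset 2 - unaligned for a 4-byte load - yet no copy of
// the packet and no branches are needed to read it.
fn parse_id(packet: &[u8]) -> u32 {
    let bytes: [u8; 4] = packet[2..6].try_into().unwrap();
    u32::from_le_bytes(bytes)
}
```

On hardware with fast unaligned loads this is one instruction plus a bounds check; without them, the compiler must expand it into the tiny-copy code the parent comment complains about.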
I think you can do this fairly efficiently with SSE for x86 - SSE/AVX has shift and shuffle. Encoding/Decoding packed data might even be faster this way.
I'm not familiar with RISC-V but from what I've seen here, they're also trying to solve this similarly with vector or bit extraction instructions.
Yes because unaligned load is no problem with SSE/AVX. On my RISC-V OrangePi unaligned vector loads beyond byte-granularity fault so you have to take extra care.
AVX shift and shuffle is mostly limited to 128 bits unfortunately for historical reasons (even for 256-bit instructions) and hardware support for AVX512/AVX10 where they fixed that is a complete mess so it's hard to rely on when you care about backwards compatibility for consumer devices, e.g. in game development.
For working with packed data structures where fields are irregular/non-predictable/dependent on previous fields etc. unaligned load/store is a godsend. Last time I worked on a custom DB engine that used these patterns the generated x86 code was so much nicer than the one for our embedded ARM cores.
PDP-11 was a major source of inspiration for m68k architecture designers. The influence can be seen in multiple places, starting from the orthogonal ISA design down to instruction mnemonics.
It is quite likely that not allowing the misaligned access was also influenced by PDP-11.
Also the bit manipulation extension wasn't part of the core, so things like bit rotation are slow for no good reason if you want portable code. Why? Who knows.
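For portable code the usual workaround is the rotate idiom, which compilers pattern-match into a single instruction where the target has one (x86 `rol`, or `rori` once Zbb is available) and into shift-or sequences elsewhere. A sketch (`rotl32` is just an illustrative name):

```rust
// Portable 32-bit rotate-left. The masking keeps both shift amounts in
// 0..=31, so the expression is well defined even for n == 0 or n >= 32,
// and compilers recognize the whole pattern as a rotate when one exists.
fn rotl32(x: u32, n: u32) -> u32 {
    let s = n & 31;
    (x << s) | (x >> ((32 - s) & 31))
}
```

(In Rust specifically you would just call `u32::rotate_left`, which expresses the same thing; the point is what the backend can emit on a core without bitmanip.)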
> Also the bit manipulation extension wasn't part of the core.
This is because the core is primarily a teaching ISA. One of the best parts about RISC-V is that you can teach a freshman-level architecture class or a senior-level chip-building project with an ISA that is actually used. Anything powerful enough to run Linux (other than one manually built from source) will support a profile that bundles all the commonly needed instructions to be fast.
Bit manipulation instructions are part and parcel of any curriculum that teaches CPU architecture. They are the basic building blocks for many more complex instructions.
I can see quite a few items on that list that imnsho should have been included in the core and for the life of me I can't see the rationale behind leaving them out. Even the most basic 8 bit CPU had various shifts and rolls baked in.
This is the reason behind the profiles like RVA23 which include bitmanip, vector and a large number of other extensions. Real chips coming very soon will all be RVA23.
The Milk-V Titan (https://milkv.io/titan) can take up to 64GB which is fine considering the number of cores and the cost of RAM. If you needed and could afford more RAM you'd be better off distributing the work across more than one board.
Unfortunately they found a bug and had to redesign the boards. I've had one of these on pre-order since last year. Latest is I think they're intending to ship them next month (April).
Ok! I will keep an eye out. It is one of the most interesting developments for me hardware wise in the last decade, and I definitely want to show my support by buying one or more of the boards. Respin is always really annoying this late in, the post mortem on that must make for interesting reading.
32-bit barrel shifters consume significant area and RISC-V was developed to support resource constrained low cost embedded hardware in a minimal ISA implementation.
The 32-bit ARM architecture included a barrel shifter as part of its basic design, as in every instruction had a shift field.
If a CPU built in 1985 with a grand total of 26 000 transistors could afford it, I am pretty sure that anything built in this century could afford it too.
To the best of my knowledge (and Google-fu), 26K really isn't a lot of transistors for an embedded MCU - at least not a fully-featured 32-bit one comparable to a minimal RISC-V core. An ARM Cortex M0, which is pretty much the smallest thing out there, is around 10K gates => around 40K transistors. This is also around the same size as a minimal RISC-V core AFAICT.
There's a reason RV32E and RV64E, with half the registers, are a thing. RV32I/RV64I isn't small enough.
There are many chips in the market that do embed 8051s for janitorial tasks, because it is small and not legally encumbered. Some chips have several non-exposed tiny embedded CPUs within.
RISC-V is replacing many of these, bringing modern tooling. There's even open source designs like SERV that fit in a corner of an already small FPGA, leaving room for other purposes.
Per https://en.wikipedia.org/wiki/Transistor_count, even an 8051 has 50K transistors, which reinforces my claim that 26K really doesn't seem like a big ask for an MCU core. Whether that means a barrel shifter is worth it or not is a totally orthogonal question, of course.
(Although I do have to eat my words here - I didn't check that Wikipedia page, and it does actually list a ~6K RISC-V core! It's an experimental academic prototype "made from a two-dimensional material [...] crafted from molybdenum disulfide"; I don't know if that construction might allow for a more efficient transistor count and it's totally impractical - 1KHz clock speed, 1-bit ALU, etc. - for almost any purpose, but it is technically a RISC-V implementation significantly smaller than 26K)
> I don't know if that construction might allow for a more efficient transistor count and it's totally impractical - 1KHz clock speed, 1-bit ALU, etc. - for almost any purpose, but it is technically a RISC-V implementation significantly smaller than 26K
That sounds like a microcoded RISC-V implementation, which can really be done for any ISA at the extreme expense of speed.
If I'm not mistaken, microcode is a thing at least on Intel CPUs, and that is how they patched Spectre, Meltdown, and other vulnerabilities – Intel released a microcode update that the BIOS applies at cold start to hot-patch the CPU.
Maybe other CPUs have it as well, though I do not have enough information on that.
> There's reason RV32E and RV64E, with half the registers, are a thing. RV32I/RV64I isn't small enough.
This is actually kind of counter to your point. The really tiny microcontrollers from the 80s only had 224 bits of registers. RV32E has at least twice that (16 registers * 32 bits = 512 bits), and modern MCUs generally use 2-4 KB of SRAM, so the overhead of a 32-bit barrel shifter is pretty minimal.
IIUC this is a lot less true in the modern era. Even at 28nm (the cheapest node per transistor last time I checked), modern microcontrollers have a fairly big transistor budget for the core (since 80+% of the transistors are going to SRAM anyway).
You can save a lot of silicon by doing 8 or 16 bit shifters and then doing the rest at the code generation level. Not having any seems really anemic to me.
It was the case even 15 years ago when Cortex M0/M3 really started to get traction, that the processor area of ARM cores was small enough to not make a difference in practice.
Yeah I don’t get it. Shifts and rotates by a fixed amount are among the simplest of all instructions to implement because they can be done with just wires, zero gates, and even a variable-amount barrel shifter is only a few levels of muxes. Hard to imagine a justification for leaving them out.
> One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used.
Same could be said of MIPS.
My understanding is the RISC-V raison d'etre is rather avoidance of patented/copyrighted designs.
As you indicate, MIPS was widely used in computer architecture courses and textbooks, including pre-RISC-V editions of Patterson & Hennessy (Computer Organization & Design) and Harris & Harris (Digital Design and Computer Architecture).
In spite of the currently mediocre RISC-V implementations, RISC-V seems to have more of a future and isn't clouded by ISA IP issues, as you note.
the avoidance of patent/copyright is critical for (legally) having students design their own chips. MIPS was pretty good (and widely used) for teaching assembly, but pretty bad for teaching a class where students design chips
This is largely contradicted by the (pre RISC-V) MIPS editions of Patterson & Hennessy, Harris & Harris, etc., which teach you how to design a MIPS datapath (at the gate level.)
Regarding silicon implementations, consider that 1) you can synthesize it from HDL/RTL designs using modern CAD tools, and 2) MIPS was originally designed to be simple enough for grad students to implement with the primitive CAD tools of the 1980s (basically semi-manual layout).
> This is primarily because core is primarily a teaching ISA.
That doesn't necessarily make it all that great for industrial use, does it?
> One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used.
You can also do that with Intel MCS-51 (aka 8051) or even i960. And again, having an ISA easily implementable "on a knee" by a fresh graduate doesn't say anything about its other technical merits beyond being "easily implementable (when done in the most primitive way possible)".
Huh? They have no idea what they are doing. If data is unaligned, the solution is memcpy, not compiler optimizations. Also, their hack of 17 loads is a buffer overflow. And it's not an ISA spec problem either.
Compilers also like to unnecessarily copy data to stack: https://github.com/llvm/llvm-project/issues/53348 Which can be particularly annoying in cryptographic code where you want to minimize number of copies of sensitive data.
For a system programming language the right solution is to properly track aliasing information in the type system as done in Rust.
Aliasing issues are just yet another instance of C/C++ inferiority holding the industry back. C could've learned from Fortran, but we ended up with the language we have...
LOL, nope. Those annotations must be part of the type system (e.g. `&mut T` in Rust) and must be checked by the compiler (the borrow checker). The language can provide escape hatches like `unsafe`, but they should be rarely used. Without it you get a fragile footgunny mess.
Just look at the utter failure of `restrict`. It was so rarely used in C that it took several years of constant nagging from Rust developers to iron out various bugs in compilers caused by it.
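The contrast can be seen in a small Rust sketch (the function `add_twice` is made up for illustration): two `&mut` references are statically guaranteed not to alias, which is the optimization-enabling fact `restrict` merely promises in C, except here the borrow checker enforces it instead of trusting the programmer.

```rust
// Because `a` and `b` are both `&mut`, they cannot alias, so the compiler
// may keep `*a` in a register across the writes to `*b` instead of
// reloading it. Passing the same variable for both is a compile error,
// whereas violating C's `restrict` is silent undefined behavior.
fn add_twice(a: &mut i32, b: &mut i32) {
    *b += *a;
    *b += *a; // `*a` cannot have changed here: no reload needed
}
```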
Does make me wonder what restrict-related bugs will be (have been?) uncovered in GCC, if any. Or whether the GCC devs saw what LLVM went through and decided to try to address any issues preemptively.
Possibly? LLVM had been around for a while as well but Rust still ended up running into aliasing-related optimizer bugs.
Now that I think about it some more, perhaps gfortran might be a differentiating factor? Not familiar enough with Fortran to guess as to how much it would exercise aliasing-related optimizations, though.
I generally agree with this opinion and would love to see a proper, well-documented low-level API for working with GPUs. But it would probably result in different "GPU ISAs" for different vendors and maybe even for different GPU generations from one vendor. The bloated firmwares and drivers operating at a higher abstraction level allow vendors to hide a lot of internal implementation details from end users.
In such a world most software would still probably use something like Vulkan/DX/WebGPU to abstract over such ISAs, like we use Java/JavaScript/Python today to "abstract" over CPU ISAs. And we would also likely have an NVIDIA monopoly similar to x86.
There’s a simple (but radical) solution that would force GPU vendors to settle on a common, stable ISA: forbid hardware vendors from distributing software. In practice, stop at the border any hardware that comes from vendors who still distribute software.
That simple. Now to sell a GPU, the only way is to make an ISA so simple even third parties can make good drivers for it. And the first successful ISA will then force everyone else to implement the same ISA, so the same drivers will work for everyone.
Oh, one other thing that has to go away: patents must no longer apply to ISAs. That way, anyone who wants to make and sell x86, ARM, or whatever GPU ISA that emerges, legally can. No more discussion about which instruction set is open or not, they all just are.
Not that the US would ever want to submit Intel to such a brutal competition.
I wouldn't be so sure. If we analogize to x86(_64), the ISA is stable and used by many vendors, but the underlying microarchitecture, caching model, etc., are free rein for impl-specific work.
For all my skepticism towards using LLM in programming (I think that the current trajectory leads to slow degradation of the IT industry and massive loss of expertise in the following decades), LLM-based advanced proof assistants is the only bright spot for which I have high hopes.
Generation of proofs requires a lot of complex pattern matching, so it's a very good fit for LLMs (assuming sufficiently big datasets are available). And we can automatically verify LLM output, so hallucinations are not the problem. You still need proper engineers to construct and understand specifications (with or without LLM help), but it can significantly reduce development cost of high assurance software. LLMs also could help with explaining why a proof can not be generated.
But I think it would require a Rust-like breakthrough, not in the sense of developing the fundamental technology (after all, Rust is fairly conservative from the PLT point of view), but in the sense of making it accessible for a wider audience of programmers.
I also hope that we will get LLM-guided compilers which generate equivalence proofs as part of the compilation process. Personally, I find it surprising that the industry is able to function as well as it does on top of software like LLVM, which feels like a giant with feet of clay with its numerous miscompilation bugs and human-written optimization heuristics applied to a somewhat vague abstract machine model. Just look how long it took to fix the god damn noalias attribute! If not for Rust, it probably would've still been a bug-ridden mess.
>we had about an inch of separation between our laser diodes and our photodiodes
Why can't you place them further away from each other using an additional optical system (i.e. a mirror) and adjusting for the additional distance in software?
You can, but customers like compact self-contained units. All trade offs.
Edit: There are basically three approaches to this problem that I'm aware of. Number one is to push the cross-talk below the noise floor -- your suggestion helps with this. Number two is to do noise cancellation by measuring your cross-talk and deleting it from the signal. Number three is to make the cross-talk signal distinct from a real reflection (e.g. by modulating the pulses so that there's low correlation between an in-flight pulse and a being-fired pulse). In practice, all three work nicely together; getting the cross-talk noise below saturation allows cancellation to leave the signal in place, and reduced correlation means that the imperfections of the cancellation still get cleaned up later in the pipeline.
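Approach three can be illustrated with a toy Rust sketch (the codes here are made up, and a real system would use much longer sequences and correlate analog samples, not clean ±1 values):

```rust
// Tag each emitter's pulse train with a short code; the receiver
// correlates against its own code, so a cross-talk train carrying a
// different code scores near zero while a genuine echo scores high.
fn correlate(received: &[i32], code: &[i32]) -> i32 {
    received.iter().zip(code).map(|(r, c)| r * c).sum()
}
```

With `code_a = [1, -1, 1, 1]` and `code_b = [1, 1, -1, 1]`, a clean echo of `code_a` correlates to 4 against itself, while `code_b` (the cross-talk) correlates to 0.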
In my (Rust-colored) opinion, the async keyword has two main problems:
1) It tracks a code property which is usually left unmarked in sync code (i.e. most languages do not mark functions with "does IO"). Why is IO more important than "may panic", "uses bounded stack", "may perform allocations", etc.?
2) It implements an ad-hoc problem-specific effect system with various warts. And working around those warts requires re-implementation of half of the language.
No, it's not. `async` is just syntax sugar, the "effect" gets emulated in the type system using `Future`. This is one of the reasons why the `async` system feels so foreign and requires so many language changes to make it remotely usable. `const` is much closer to a "true" effect (well, to be precise it's an anti-effect, but it's not important right now).
Also, I think it's useful to distinguish between effect and type systems, instead of lumping them into just "type system". The former applies to code and the latter to data.
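To make the "just sugar" point concrete, here is a minimal Rust sketch: an `async fn` next to a hand-written equivalent returning `impl Future`, both driven by a toy busy-polling `block_on` (not how real executors work, but enough to show the keyword adds no runtime magic of its own):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// The sugared form...
async fn add(a: u32, b: u32) -> u32 {
    a + b
}

// ...and roughly what it desugars to: an ordinary fn returning a Future.
fn add_desugared(a: u32, b: u32) -> impl Future<Output = u32> {
    std::future::ready(a + b)
}

// Toy executor: busy-poll the future with a no-op waker until it's ready.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    unsafe fn no_op(_: *const ()) {}
    unsafe fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, no_op, no_op, no_op);

    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // SAFETY: `fut` is a local that is never moved after being pinned.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

All the "effect" machinery lives in the `Future` trait and the poll/waker protocol; `async`/`await` only generate the state machine that implements it.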
>That feels dense.
Yes. But why is `async` more important than `alloc`? For some applications it's as important to know about potential allocations as it is for other applications to know whether code can yield.
Explicitly listing all effects would be the most straightforward approach, but I think a more practical approach would be to have a list of "default effects" which can be overridden at the crate (or maybe even module) level. And at the function level you would be able to opt in to or out of effects as needed.
>I think ideally it would be something readers could spot at first glance
Well, you can either have "dense" signatures or explicit "at first glance" ones.
While I agree that dependency tree size can sometimes be a problem in Rust, I think it often gets overblown. Sure, having hundreds of dependencies in a "simple" project can be scary, but:
1) No one forces you to use dependencies with a large number of transitive dependencies. For example, feel free to use `ureq` instead of `reqwest`, which pulls the async kitchen sink in with it. If you see an unnecessary dependency, you can also ask maintainers to remove it.
2) Are you sure that your project is as simple as you think?
3) What matters is not the number of dependencies, but the number of groups who maintain them.
On the last point, if your dependency tree has 20 dependencies maintained by the Rust project and its core contributors (such as `libc` or `serde`), your supply chain risks are not multiplied by 20; they stay close to one, almost the same as using just `std`.
On your last note, I wish they would get on that signed crate subset. Having the same dependency tree as cargo, clippy, and rustc isn't increasing my risk.
Rust has already had a supply chain attack propagate via build.rs some years ago. It was noticed quickly, so staying pinned to the oldest version that works and has no CVE pop up in `cargo audit` is a decent strategy. The remaining risk is that some more niche dependency you use is, and always has been, compromised.
Great comment! I agree about comptime, as a Rust programmer I consider it one of the areas where Zig is clearly better than Rust with its two macro systems and the declarative generics language. It's probably the biggest "killer feature" of the language.
> as a Rust programmer I consider it one of the areas where Zig is clearly better than Rust with its two macro systems and the declarative generics language
IMHO "clearly better" might be a matter of perspective; my impression is that this is one of those things where the different approaches buy you different tradeoffs. For example, by my understanding Rust's generics allows generic functions to be completely typechecked in isolation at the definition site, whereas Zig's comptime is more like C++ templates in that type checking can only be completed upon instantiation. I believe the capabilities of Rust's macros aren't quite the same as those for Zig's comptime - Rust's macros operate on syntax, so they can pull off transformations (e.g., #[derive], completely different syntax, etc.) that Zig's comptime can't (though that's not to say that Zig doesn't have its own solutions).
Of course, different people can and will disagree on which tradeoff is more worth it. There's certainly appeal on both sides here.
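As a small illustration of the definition-site checking tradeoff mentioned above, a Rust sketch (`sum3` is an invented example): the body of a generic function may only use capabilities its bounds declare, so mistakes surface without any instantiation, whereas the C++-template/Zig-comptime style defers that check to each use site.

```rust
use std::ops::Add;

// Checked once, at the definition: the body can only do what the
// `Add` + `Copy` bounds license. Calling, say, `a.len()` here would be
// rejected even if no caller ever instantiated the function.
fn sum3<T: Add<Output = T> + Copy>(a: T, b: T, c: T) -> T {
    a + b + c
}
```

The flip side is that everything the body needs must be spelled out in the bounds, which is exactly the verbosity comptime-style duck typing avoids.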
Consider that Python + C++ has proved to be a very strong combo: driver in Python, heavy lifting in C++.
It's possible that something similar might be the right path for metaprogramming. Rust's generics are simple and weaker than Zig's comptime, while proc macros are complicated and stronger than Zig's comptime.
So I think the jury's still out on whether Rust's metaprogramming is "better" than Zig's.
1) https://github.com/llvm/llvm-project/issues/150263
2) https://github.com/llvm/llvm-project/issues/141488
Another example is the hard-coded 4 KiB page size, which effectively kneecaps the ISA when compared against ARM.