I feel like that was much more true in the past, but the X925 was only spec'd 18 months ago(?) and you can buy it today (I've been using one since October). Intel and AMD also give lots of advance notice on new designs, well ahead of anything you can buy. ARM is also moving towards providing completely integrated solutions, so customers like Samsung don't have to take only the CPU core and fill in the blanks themselves. They'll probably only get better at shipping complete solutions faster.
Honestly, Apple is the strange one because they never discuss CPUs until they are available to buy in a product; they don't need to bother.
When you do a cache lookup, some bits of the address form an "index" that selects a bucket, and the remaining bits form a "tag". Once you've selected the bucket, you may need to walk a few entries in it, comparing tags, to find the matching cache line. The number of entries you walk is the associativity of the cache, e.g. 8-way or 12-way associativity means there are 8 or 12 entries in that bucket. Higher associativity means a larger cache, but it also worsens latency, as you have to walk through the bucket. These are the two things you can trade off: do you want more total buckets, or do you want each bucket to have more entries?
To do this lookup in the first place, you pull a number of bits from the virtual/physical address you're looking up, which tells you which bucket to start at. For a virtually indexed cache, those bits have to come from the page-offset part of the address (the part that's identical in the virtual and physical address), so the minimum page size determines how many bits you can use to refer to unique buckets. If you don't have a lot of bits, then you can't count very high (6 bits = 2^6 = 64 buckets) -- so to increase the size of the cache, you instead have to increase the associativity, which makes latency worse. For the L1 cache, you basically never want to make latency worse, so you're practically capped here.
Platforms like Apple Silicon instead set the minimum page size to 16KB, so you get more bits to count buckets (8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192KB, and L2 (for the same reasons) is 16MB+. x86 machines and software, for legacy reasons, are very much tied to a 4KB page size, which puts something of a practical limit on the size of their downstream caches.
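This arithmetic can be sketched in a few lines of Python (assuming a 64-byte cache line and the constraint that all index bits come from the page offset):

```python
def max_vipt_size(page_size, line_size, ways):
    # All index bits must come from the page-offset part of the address,
    # so the number of buckets (sets) is capped at page_size / line_size.
    sets = page_size // line_size
    return sets * line_size * ways  # total cache capacity in bytes

# 4 KiB pages (x86): 64 buckets, so an 8-way L1 tops out at 32 KiB.
print(max_vipt_size(4096, 64, 8))    # 32768
# 16 KiB pages (Apple Silicon): 256 buckets; 12 ways gives 192 KiB.
print(max_vipt_size(16384, 64, 12))  # 196608
```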
Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.
"PAYGO API access" vs "Monthly Tool Subscription" is just a matter of different unit economics; there's nothing particularly unusual or strange about the idea on its own, specific claims against Google notwithstanding.
Of course, Google is still in the wrong here for instantly nuking the account instead of just billing them for API usage (largely because an autoban or whatever is easier, I'm sure).
I'm afraid to use any Google service in an experimental way, for fear that my whole Google existence will be banned.
I think blocking access temporarily with a warning would be much more suitable. Unblocking could even be conditioned on paying for the abused tokens.
Knights Landing is a major outlier; the cores there were extremely small and had very few resources dedicated to them (e.g. 2-wide decode) relative to the vector units, so of course that will dominate. You aren't going to see 40% of the die dedicated to vector register files on anything looking like a modern, wide core. The entire vector unit (with SRAM) will be in the ballpark of like, cumulative L1/L2; a 512-bit register is only a single 64 byte cache line, after all.
True! But even if only 20% of the die area goes to AVX-512 in larger cores, that makes a big difference for high core count CPUs.
That would be like having a 50-core CPU instead of a 64-core CPU in the same space. For these cloud-native CPU designs, anything that takes significant die area translates to reduced core count.
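The back-of-envelope behind that comparison (the 20% figure is the assumption from the previous comment, not a measured number):

```python
# If AVX-512 hardware took an assumed 20% of each core's area, a die
# budgeted for 64 cores without it would hold only 64 * (1 - 0.20) ~= 51
# cores with it -- roughly the "50 vs 64" comparison above.
cores_without_avx512 = 64
avx512_area_fraction = 0.20  # assumed, for illustration
equivalent_cores = cores_without_avx512 * (1 - avx512_area_fraction)
print(round(equivalent_cores))  # 51
```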
You're still grossly overestimating the area required for AVX-512. For example, on AMD Zen4, the entire FPU has been estimated as 25% of the core+L2 area, and that's including AVX-512. If you look at the extra area required for AVX-512 vs 256-bit AVX2, as a fraction of total die area including L3 cache and interconnect between cores, it's definitely not going to be a double digit percentage.
FWIW I think the LLVM bitcode format has stronger compatibility guarantees than the text IR. But I agree it's a bit of a pain either way; plus, if you forgo linking to the library and just rely on whatever 'llc' the user has installed, figuring out bugs is not a fun time...
No, Windows has consistently been ahead of Linux for many years in terms of average-user desktop security, from binary hardening to designs like the secure desktop, because average Windows users do not typically have curated software selections, so you have to assume the worst. (When I wrote the original "binary hardening via compiler flags" RFC for NixOS over 10 years ago, almost everything in it was already done on Windows and had been for years.) It's still not ideal; macOS takes it even further and actually allows things like "storing secrets on disk in a way that can't be read by random programs," because it can e.g. make policy decisions based on code signatures, which are widely deployed.

None of this exists in pretty much any Linux distro: you can literally just impersonate password prompts, override 'sudo' in a user's shell to capture their password silently, copy every file in $HOME/.config to your evil server; setuid is, by its very definition, an absolute atrocity; etc. Linux distros make it easy for people to live in their own chosen curated software set, but the security calculus changes when people want to run arbitrary, non-curated software.
You can make a pretty reasonably secure Linux server by doing your homework; it's nowhere close to impossible. An extremely secure server also requires a bit of hardware homework. The Linux desktop, however, is behind macOS and Windows in terms of security by a pretty large margin, and most of it is by design.
(In theory you can probably bolt a macOS-like system onto Linux using tools like SCM_RIGHTS/pidfds/code signatures, along with delegated privilege escalation, no setuid, signature-based policy mechanisms, etc. But there are a lot of cultural and software challenges to overcome to make it all widely usable.)
llama.cpp is giving me ~35tok/sec with the unsloth quants (UD-Q4_K_XL, elsewhere in this thread) on my Spark. FWIW my understanding and experience is that llama.cpp seems to give slightly better performance for "single user" workloads, but I'm not sure why.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...
Yeah I got 35-39tok/sec for one shot prompts, but for real-world longer context interactions through opencode it seems to be averaging out to 20-30tok/sec. I tried both MXFP4 and Q4_K_XL, no big difference, unfortunately.
The --no-mmap and --fa on options seemed to help, but not dramatically.
As with everything Spark, memory bandwidth is the limitation.
I'd like to be impressed with 30tok/sec but it's sort of a "leave it overnight and come back to the results" kind of experience, wouldn't replace my normal agent use.
However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.
There are ongoing projects like Granule[1] that are exploring how to capture more precise resource usage in types, in this case by way of graded modalities. There is of course still a tension in exposing too much of the implementation details via intensional types. But it's definitely an ongoing avenue of research.
Can Granule let me specify the following constraints on a function?
- it will use O(n) space where n is some measure of one of the parameters (instead of n you could have some sort of function of multiple measures of multiple parameters)
- same but time use instead of space use
- same but number of copies
- the size of an output will be at most the size of an input
- the allocated memory after the function runs is less than allocated memory before the function runs
- given the body of a function, and given that all the functions used in the body have well defined complexities, the complexity of the function being defined with them is known or at least has a good upper bound that is provably true
No AVX-512; client SKUs are just going to go straight to APX/AVX10, and they are confirmed for Nova Lake, which is 2H 2026 (it will probably be "Core Ultra Series 4" or whatever, I guess).