Some more benchmarking, and with larger outputs (like writing an entire relatively complex TODO list app) it seems to go down to 4-6 tokens/s. Still impressive.
Decided to run an actual llama-bench run and let it go for the hour or two it needs. I'm posting my full results here (https://github.com/geerlingguy/ai-benchmarks/issues/47), but the short version is 8-10 t/s pp and 7.99 t/s tg128, on a Pi 5 with no overclocking. Could probably increase the numbers slightly with an overclock.
You need a fan/heatsink to get that speed, of course; it maxes out the CPU the entire time.
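For anyone wanting to script repeat runs, a rough sketch of driving llama-bench from Python; the model path is a placeholder, and it assumes llama-bench is on the PATH and prints its usual markdown-table output:

```python
# Rough sketch: drive llama-bench and pull out the t/s results.
# Assumes llama-bench is on the PATH; "model.gguf" is a placeholder.
import subprocess

result = subprocess.run(
    ["llama-bench", "-m", "model.gguf", "-p", "512", "-n", "128"],
    capture_output=True, text=True, check=True,
)

for line in result.stdout.splitlines():
    cells = [c.strip() for c in line.split("|") if c.strip()]
    if cells and "±" in cells[-1]:
        # assuming the usual table layout: last two columns are the
        # test name (pp512 / tg128) and "mean ± stddev" in t/s
        print(f"{cells[-2]}: {cells[-1]} t/s")
```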
This is what happens as inequality in buying power gets worse and worse. The most accurate prediction of how this all plays out, I think, is what Gary Stevenson calls "The Squeeze Out" -> https://www.youtube.com/watch?v=pUKaB4P5Qns
Currently we are still at the extraction stage: money from upper/middle-class retail investors and pension funds being sucked up by the major tech companies, which are focused only on their stock price. They have no incentive to compete, because if they do, it will ruin the game for everyone. This gets worse, and the theory (with some historical backing) says it can lead to war.
Agree with the analysis or not, I personally think it maps quite compellingly onto what is happening with AI; worth a watch.
Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a TimescaleDB + PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
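A minimal sketch of the kind of query that setup handles fine; the `sensor_readings` hypertable, its columns, and the connection string below are hypothetical placeholders, not the actual schema:

```python
# Minimal sketch: pull a time-bucketed, spatially filtered slice out of a
# TimescaleDB + PostGIS database to crunch locally. Table, column names,
# and connection details are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=me host=x99-box")
with conn.cursor() as cur:
    cur.execute("""
        SELECT time_bucket('1 hour', ts) AS bucket,
               avg(value)                AS avg_value
        FROM sensor_readings
        WHERE ts >= now() - interval '30 days'
          AND ST_DWithin(geom::geography, ST_MakePoint(%s, %s)::geography, %s)
        GROUP BY bucket
        ORDER BY bucket
    """, (151.2, -33.8, 50_000))  # lon, lat, radius in metres
    rows = cur.fetchall()
conn.close()
```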
> The second you turn your head though, your fellow teammates will conspire to replatform onto Go or Rust or NodeJS or GitHub Actions and make everything miserable again.
Curious how you would use Smalltalk in place of GitHub Actions, assuming you need a GitHub-integrated CI runner?
Any build toolkit is just automation over bash. You can make your own. The GitHub integration need not be any more than the most trivial thing that works. Your coworkers, naturally, won't be disciplined enough to keep the integration trivial and will build super complicated crap that's really hard to troubleshoot, because they can.
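One hedged sketch of "the most trivial thing that works": keep the GitHub-side workflow down to a single step that runs one script, and put the actual steps in that script so they can be run and debugged locally. The step names and commands here are placeholders:

```python
# build.py -- the CI workflow just runs `python build.py`; everything else
# lives here. The commands below are placeholders for your real steps.
import subprocess
import sys

STEPS = [
    ("lint",  "make lint"),
    ("test",  "make test"),
    ("image", "docker build -t myapp ."),
]

for name, cmd in STEPS:
    print(f"==> {name}: {cmd}")
    if subprocess.call(cmd, shell=True) != 0:
        sys.exit(f"step '{name}' failed")
```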
I have a hard time conceptualizing lossy text compression, but I've recently started to think about the "reasoning"/output as just a byproduct of lossy compression, with the weights tending towards an average of the information "around" the main topic of the prompt. What I've found easier is thinking about it like lossy image compression: generating more output tokens via "reasoning" is like subdividing nearby pixels and filling in the gaps with values the model has seen there before. Taking the analogy a bit too far, you can also think of the vocabulary as the pixel bit depth.
I definitely agree that replacing "AI" or "LLMs" with "X driven by compressed training data" starts to make a lot more sense, and it's a useful shortcut.
You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens. I find it easier to conceptualize this in three dimensions. 3blue1brown has a good video series which covers the overall concept of LLM vectors in machine learning: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
To give a concrete example, say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer? By adding more relevant tokens (honey, worker, hive, beeswax) we steer the token generation to the place in the "word cloud" where our next token is more likely to exist.
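You can see this directly by comparing next-token distributions with and without the steering context. A small sketch using GPT-2 via the Hugging Face transformers library as a stand-in (any causal LM would do):

```python
# Sketch: how added context shifts the next-token distribution after "queen".
# Uses GPT-2 as a small stand-in model; requires torch and transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt, k=5):
    """Return the k most likely next tokens after the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next position
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Ambiguous: "queen" alone could go toward the monarch, the card, the band...
print(top_next_tokens("The queen"))
# Steered: bee-related tokens pull the distribution toward the hive sense.
print(top_next_tokens("Honey, worker bees, the hive, and the queen"))
```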
I don't see LLMs as "lossy compression" of text. To me that implies retrieval, and Transformers are a prediction device, not a retrieval device. If one needs retrieval then use a database.
> You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens.
I like to frame it as a theater script cycling through the LLM. The "reasoning" difference is just changing the style so that each character has film noir monologues. The underlying process hasn't really changed, and the monologue text isn't fundamentally different from dialogue or stage direction... but more data still means more guidance for each improv cycle.
> say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer?
I'd like to point out that this scheme can result in things that look better to humans in the end... even when the "clarifying" choice is entirely arbitrary and irrational.
In other words, we should be alert to the difference between "explaining what you were thinking" versus "picking a firm direction so future improv makes nicer rationalizations."
It makes sense if you think of the LLM as building a data-aware model that compresses the noisy data by parsimony (the principle that the simplest explanation that fits is best). Typical text compression algorithms are not data-aware and not robust to noise.
In lossy compression the compression itself is the goal. In prediction, compression is the road that leads to parsimonious models.
The way I visualize it is imagining clipping the high frequency details of concepts and facts. These things operate on a different plane of abstraction than simple strings of characters or tokens. They operate on ideas and concepts. To compress, you take out all the deep details and leave only the broad strokes.
It is not a useful shortcut because you don't know what the training data is, nothing requires it to be an "average" of anything, and post-training arbitrarily re-weights all of its existing distributions anyway.
> In general once you start thinking about scaling data to larger capacities is when you start considering the cloud
What kind of capacities would you use as a rule of thumb? You can fit an awful lot of storage and compute in a single rack, and the cost for large DBs on AWS and others is extremely high, so the savings are larger as well.
Well, if you want proper DR you really need an off-site backup, disk failover/recovery, etc. And if you don't want to be manually maintaining individual drives, then you're looking at one of the big, expensive storage solutions with enterprise-grade hardware, and those will easily cost some large multiple of whatever 2U DB server you end up putting in front of it.
Same setup here. One game issue I've hit, though it will be a rare problem, is StarCraft Remastered. Wine has an issue with audio processing which I can't seem to configure my way out of; it pegs all 32 threads and still stutters. Thankfully this game can likely run on an actual potato, so I have a separate mini PC running Windows for when I want to get my ass kicked on battle.net.
Working at IT places in the late 2000s, it was still pretty commonplace for there to be a server room. Even for a large org with multiple sites hundreds of km apart, you could manage it with a pretty small team. And it is a lot easier to build resilient applications now than it was back then, from what I remember.
Cloud costs are getting large enough that I know I’ve got one foot out the door and a long term plan to move back to having our own servers and spend the money we save on people. I can only see cloud getting even more expensive, not less.
There is currently a bit of an early shift back to physical infra. Some of this is driven by costs(1), some by geopolitical concerns, and some by performance. However, dealing with physical equipment does introduce a different (old-fashioned, but somewhat atrophied) set of skills and costs that companies need to deal with.
(1) It is shocking how much of the move to the cloud was driven by accountants wanting opex instead of capex, who are now concerned with actual cash flow and are thinking of going back. The cloud is really good at serving web content and storing gobs of data, but once you start wanting to crunch numbers or move that data, it gets expensive fast.
In some orgs the move to the cloud was driven by accountants. In my org it was driven by lawyers. With GDPR on the horizon and murmurs of other data privacy laws that might (but didn't) require data to be stored in that customer's jurisdiction, we needed to host in additional regions.
We had a couple rather large datacenters, but both were in the US. The only infrastructure we had in the EU was one small server closet. We had no hosting capacity in Brazil, China, etc. Multi-region availability drove us to the cloud - just not in the "high availability" sense of the term.
> I can only see cloud getting even more expensive, not less.
When you have three major hyperscalers competing for your dollars, that's basically not true and not how markets work... unless they start colluding on prices.
We've already seen reductions in web services pricing across the three major providers due to this competition.
And it’ll be so good and cheap that you’ll figure “hell, I could sell our excess compute resources for a fraction of AWS.” And then I’ll buy them, you’ll be the new cloud. And then more people will, and eventually this server infrastructure business will dwarf your actual business. And then some person in 10 years will complain about your IOPS pricing, and start their own server room.
I discovered this project recently and used it for the Himawari Standard Data format, and it made things so much easier. Definitely recommend it if you need to create binary readers for uncommon formats.
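For a sense of what declarative format definitions save you from, this is the kind of hand-rolled reader you'd otherwise be writing with Python's struct module; the field layout below is invented for illustration and is not the real Himawari header:

```python
# Illustrative only: a hand-rolled fixed-layout header reader. The fields
# here are made up and are NOT the actual Himawari Standard Data layout.
import struct

def read_header(path):
    with open(path, "rb") as f:
        raw = f.read(5)
    # "<" = little-endian: uint8 block number, uint16 block length, uint16 total blocks
    block_no, block_len, total_blocks = struct.unpack("<BHH", raw)
    return {"block_no": block_no, "block_len": block_len, "total_blocks": total_blocks}
```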
Exactly, and the performance of consumer tech is wildly faster. E.g., a Ryzen 5825U mini PC with 16GB memory and a 512GB NVMe is ~$250 USD. That thing will outperform a 14-core Xeon from ~2016 on multicore workloads and absolutely thrash it in single thread. Yes, the lack of ECC is not good for any serious workload, but it's great for lower environments/testing/prototyping, and it sips power at ~50W full tilt.