Some more benchmarking, and with larger outputs (like writing an entire relatively complex TODO list app) it seems to go down to 4-6 tokens/s. Still impressive.
Decided to run an actual llama-bench run and let it go for the hour or two it needs. I'm posting my full results here (https://github.com/geerlingguy/ai-benchmarks/issues/47), but the short version is 8-10 t/s pp and 7.99 t/s tg128, on a Pi 5 with no overclocking. Could probably increase the numbers slightly with an overclock.
You need a fan/heatsink to get that speed, of course; it maxes out the CPU the entire time.
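For anyone wanting to script repeat runs, a rough sketch of driving llama-bench from Python; the model path is a placeholder, and it assumes llama-bench is on the PATH and prints its usual markdown-table output:

```python
# Rough sketch: drive llama-bench and pull out the t/s results.
# Assumes llama-bench is on the PATH; "model.gguf" is a placeholder.
import subprocess

result = subprocess.run(
    ["llama-bench", "-m", "model.gguf", "-p", "512", "-n", "128"],
    capture_output=True, text=True, check=True,
)

for line in result.stdout.splitlines():
    cells = [c.strip() for c in line.split("|") if c.strip()]
    if cells and "±" in cells[-1]:
        # assuming the usual table layout: last two columns are the
        # test name (pp512 / tg128) and "mean ± stddev" in t/s
        print(f"{cells[-2]}: {cells[-1]} t/s")
```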
This is what happens as inequality in buying power gets worse and worse. The most accurate prediction of how this all plays out, I think, is what Gary Stevenson calls "The Squeeze Out" -> https://www.youtube.com/watch?v=pUKaB4P5Qns
Currently we are still at the extraction stage: money from upper/middle-class retail investors and pension funds being sucked up by the major tech companies, which are focused only on their stock price. They have no incentive to compete, because if they do, it will ruin the game for everyone. This gets worse, and the theory (with some historical backing) says it can lead to war.
Agree with the analysis or not, I personally think it maps quite compellingly onto what is happening with AI; worth a watch.
Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a TimescaleDB + PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
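A minimal sketch of the kind of query that setup handles fine; the `sensor_readings` hypertable, its columns, and the connection string below are hypothetical placeholders, not the actual schema:

```python
# Minimal sketch: pull a time-bucketed, spatially filtered slice out of a
# TimescaleDB + PostGIS database to crunch locally. Table, column names,
# and connection details are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=me host=x99-box")
with conn.cursor() as cur:
    cur.execute("""
        SELECT time_bucket('1 hour', ts) AS bucket,
               avg(value)                AS avg_value
        FROM sensor_readings
        WHERE ts >= now() - interval '30 days'
          AND ST_DWithin(geom::geography, ST_MakePoint(%s, %s)::geography, %s)
        GROUP BY bucket
        ORDER BY bucket
    """, (151.2, -33.8, 50_000))  # lon, lat, radius in metres
    rows = cur.fetchall()
conn.close()
```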
> The second you turn your head though, your fellow teammates will conspire to replatform onto Go or Rust or NodeJS or GitHub Actions and make everything miserable again.
Curious how you would use Smalltalk in place of GitHub Actions, assuming you need a GitHub-integrated CI runner?
Any build toolkit is just automation over bash. You can make your own. The GitHub integration need not be any more than the most trivial thing that works. Your coworkers, naturally, won't be disciplined enough to keep the integration trivial and will build super complicated crap that's really hard to troubleshoot, because they can.
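One hedged sketch of "the most trivial thing that works": keep the GitHub-side workflow down to a single step that runs one script, and put the actual steps in that script so they can be run and debugged locally. The step names and commands here are placeholders:

```python
# build.py -- the CI workflow just runs `python build.py`; everything else
# lives here. The commands below are placeholders for your real steps.
import subprocess
import sys

STEPS = [
    ("lint",  "make lint"),
    ("test",  "make test"),
    ("image", "docker build -t myapp ."),
]

for name, cmd in STEPS:
    print(f"==> {name}: {cmd}")
    if subprocess.call(cmd, shell=True) != 0:
        sys.exit(f"step '{name}' failed")
```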
I have a hard time conceptualizing lossy text compression, but I've recently started to think about the "reasoning"/output as just a byproduct of lossy compression, with the weights tending towards an average of the information "around" the main topic of the prompt. What I've found easier is thinking about it like lossy image compression: generating more output tokens via "reasoning" is like subdividing nearby pixels and filling in the gaps with values the model has seen there before. Taking the analogy a bit too far, you can also think of the vocabulary as the pixel bit depth.
I definitely agree that replacing "AI" or "LLMs" with "X driven by compressed training data" starts to make a lot more sense, and it's a useful shortcut.
You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens. I find it easier to conceptualize this in three dimensions. 3blue1brown has a good video series which covers the overall concept of LLM vectors in machine learning: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
To give a concrete example, say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer? By adding more relevant tokens (honey, worker, hive, beeswax) we steer the token generation to the place in the "word cloud" where our next token is more likely to exist.
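You can see this directly by comparing next-token distributions with and without the steering context. A small sketch using GPT-2 via the Hugging Face transformers library as a stand-in (any causal LM would do):

```python
# Sketch: how added context shifts the next-token distribution after "queen".
# Uses GPT-2 as a small stand-in model; requires torch and transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt, k=5):
    """Return the k most likely next tokens after the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the next position
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), p.item()) for i, p in zip(top.indices, top.values)]

# Ambiguous: "queen" alone could go toward the monarch, the card, the band...
print(top_next_tokens("The queen"))
# Steered: bee-related tokens pull the distribution toward the hive sense.
print(top_next_tokens("Honey, worker bees, the hive, and the queen"))
```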
I don't see LLMs as "lossy compression" of text. To me that implies retrieval, and Transformers are a prediction device, not a retrieval device. If one needs retrieval then use a database.
> You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens.
I like to frame it as a theater script cycling through the LLM. The "reasoning" difference is just changing the style so that each character has film noir monologues. The underlying process hasn't really changed, and the monologue text isn't fundamentally different from dialogue or stage direction... but more data still means more guidance for each improv cycle.
> say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer?
I'd like to point out that this scheme can result in things that look better to humans in the end... even when the "clarifying" choice is entirely arbitrary and irrational.
In other words, we should be alert to the difference between "explaining what you were thinking" versus "picking a firm direction so future improv makes nicer rationalizations."
It makes sense if you think of the LLM as building a data-aware model that compresses the noisy data by parsimony (the principle that the simplest explanation that fits is best). Typical text compression algorithms are not data-aware and not robust to noise.
In lossy compression the compression itself is the goal. In prediction, compression is the road that leads to parsimonious models.
The way I visualize it is imagining clipping the high frequency details of concepts and facts. These things operate on a different plane of abstraction than simple strings of characters or tokens. They operate on ideas and concepts. To compress, you take out all the deep details and leave only the broad strokes.
It is not a useful shortcut because you don't know what the training data is, nothing requires it to be an "average" of anything, and post-training arbitrarily re-weights all of its existing distributions anyway.
> In general once you start thinking about scaling data to larger capacities is when you start considering the cloud
What kind of capacities would you use as a rule of thumb? You can fit an awful lot of storage and compute in a single rack, and the cost for large DBs on AWS and others is extremely high, so the savings are larger as well.
Well, if you want proper DR you really need an off-site backup, disk failover/recovery, etc. And if you don't want to be manually maintaining individual drives, then you're looking at one of the big, expensive storage solutions with enterprise-grade hardware, and those will easily cost some large multiple of whatever 2U DB server you end up putting in front of it.
Same setup here. One game issue I've hit, though it will be a rare problem, is StarCraft Remastered. Wine has an issue with audio processing which I can't seem to configure my way out of; it pegs all 32 threads and still stutters. Thankfully this game can likely run on an actual potato, so I have a separate mini PC running Windows for when I want to get my ass kicked on battle.net.
Working at IT places in the late 2000s, it was still pretty commonplace for there to be a server room. Even for a large org with multiple sites hundreds of km apart, you could manage it with a pretty small team. And it is a lot easier to build resilient applications now than it was back then, from what I remember.
Cloud costs are getting large enough that I know I’ve got one foot out the door and a long term plan to move back to having our own servers and spend the money we save on people. I can only see cloud getting even more expensive, not less.
There is currently a bit of an early shift back to physical infra. Some of this is driven by costs(1), some by geopolitical concerns, and some by performance. However, dealing with physical equipment does introduce a different (old-fashioned, but somewhat atrophied) set of skills and costs that companies need to deal with.
(1) It is shocking how much of the move to the cloud was driven by accountants wanting opex instead of capex, who are now concerned with actual cash flow and are thinking of going back. The cloud is really good at serving web content and storing gobs of data, but once you start wanting to crunch numbers or move that data, it gets expensive fast.
In some orgs the move to the cloud was driven by accountants. In my org it was driven by lawyers. With GDPR on the horizon and murmurs of other data privacy laws that might (but didn't) require data to be stored in that customer's jurisdiction, we needed to host in additional regions.
We had a couple rather large datacenters, but both were in the US. The only infrastructure we had in the EU was one small server closet. We had no hosting capacity in Brazil, China, etc. Multi-region availability drove us to the cloud - just not in the "high availability" sense of the term.
> I can only see cloud getting even more expensive, not less.
When you have three major hyperscalers competing for your dollars, that's basically not true and not how markets work... unless they start colluding on prices.
We've already seen reductions in web services pricing across the three major providers due to this competition.
And it’ll be so good and cheap that you’ll figure “hell, I could sell our excess compute resources for a fraction of AWS.” And then I’ll buy them, you’ll be the new cloud. And then more people will, and eventually this server infrastructure business will dwarf your actual business. And then some person in 10 years will complain about your IOPS pricing, and start their own server room.
I discovered this project recently and used it for the Himawari Standard Data format, and it made things so much easier. Definitely recommend it if you need to create binary readers for uncommon formats.
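For a sense of what declarative format definitions save you from, this is the kind of hand-rolled reader you'd otherwise be writing with Python's struct module; the field layout below is invented for illustration and is not the real Himawari header:

```python
# Illustrative only: a hand-rolled fixed-layout header reader. The fields
# here are made up and are NOT the actual Himawari Standard Data layout.
import struct

def read_header(path):
    with open(path, "rb") as f:
        raw = f.read(5)
    # "<" = little-endian: uint8 block number, uint16 block length, uint16 total blocks
    block_no, block_len, total_blocks = struct.unpack("<BHH", raw)
    return {"block_no": block_no, "block_len": block_len, "total_blocks": total_blocks}
```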
Exactly, and the performance of consumer tech is wildly faster. E.g., a Ryzen 5825U mini PC with 16GB memory and a 512GB NVMe is ~$250 USD. That thing will outperform a 14-core Xeon from ~2016 on multicore workloads and absolutely thrash it in single thread. Yes, the lack of ECC is not good for any serious workload, but it's great for lower environments/testing/prototyping, and it sips power at ~50W full tilt.