> The purpose here is not to responsibly warn us of a real threat. If that were the aim there would be a lot more shutting down of data centres and a lot less selling of nuclear-weapon-level-dangerous chatbots.
you're lumping together two very different groups of people and pointing out that their beliefs are incompatible. of course they are! the people who think there is a real threat are generally different people from the ones who want to push AI progress as fast as possible. the people who say both things generally do so out of a need to compromise, not because many individuals actually hold both views at once.
I feel this framing says more about our attitudes to nuclear weapons than it does about chatbots. The 'Peace Dividend' era, which is rapidly drawing to a close, has made people careless when they talk about the magnitude of the effects a nuclear war would have.
AI can be misused, but it can't be misused to the point that an enormously depopulated humanity is forced back into subsistence agriculture to survive, spending centuries if not millennia to get back to where we are now.
in the grand scheme of things, this is a very small amount of plastic waste, and as far as resources go, one of the less scarce ones. at some point, the cost of the hand wringing to avoid waste is more of a drag on society than the actual wasted material itself.
What are you going off? CPI? For thousands of years gold has been the benchmark of currencies. For example, you can read the Code of Hammurabi from ancient Babylon, where gold and silver were used as currency, and convert the figures mentioned in its laws. You'd be surprised by how invariant everything seems. https://justine.lol/inflation/ CPI isn't a trustworthy indicator. The government can't tell the truth about inflation: retirees all own TIPS, so if the official numbers went up it would have to pay them obscene amounts of money, which it can't afford, because the whole reason it is debasing the currency in the first place is to pay for all the other benefits it gives to retirees.
39% just doesn't pass basic muster. my rent hasn't gone up anything like that in the past year, and food, clothing, and transportation don't cost anywhere near 39% more than they did last year. 39% inflation over the past year would mean the economy is rapidly shrinking in real terms.
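back-of-the-envelope on that last point (the nominal growth number below is purely illustrative, not a real figure):

```python
# rough sanity check: real growth = (1 + nominal) / (1 + inflation) - 1
nominal_growth = 0.05   # illustrative ballpark, not a real figure
inflation = 0.39        # the claimed rate
real_growth = (1 + nominal_growth) / (1 + inflation) - 1
print(f"{real_growth:.1%}")   # about -24%, i.e. a severe contraction
```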
Inflation benefits people like your landlord: his property value increases while the real interest rate on his mortgage goes negative. The bank is basically paying him to lord over you. So maybe he's a nice guy and doesn't make life harder for you when he's doing so well. Equities have concomitantly appreciated in value, keeping the overall economy worth about the same, but the gains get redistributed to more modern companies while everyone else gets washed away in the rising tide. Everything else takes some time to trickle down and cause some pain before vendors wise up. That arbitrage opportunity is what incentivizes the folks who get the printed money to do it in the first place.
Grade inflation hurts the students who actually deserved 'A' grades.
But if you are making consequential decisions (like admission or hiring) based on a metric or signal that's been gamed to death, then you have to be a realist.
I'm also very excited about SAE/Transcoder based approaches! I think the big tradeoff is that our approach (circuit sparsity) is aiming for a full complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models, but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost - we think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, this seems like a worthwhile direction to explore.
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
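A toy numpy sketch of the difference (the sizes, the top-1 router, and the 10% density are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 1   # toy sizes

x = rng.standard_normal(d)

# Mixture of experts: every expert's weight matrix is dense (all nonzero),
# but only the top-k experts chosen by the router run on this input.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_logits = rng.standard_normal(n_experts)
active = np.argsort(router_logits)[-k:]           # experts actually used
moe_out = sum(experts[i] @ x for i in active)     # most weights sit idle this step

# Weight sparsity: a single matrix where most entries are exactly zero,
# but every remaining weight participates on every input.
W = rng.standard_normal((d, d))
keep = rng.random((d, d)) < 0.1                   # keep ~10% of the weights
sparse_out = (W * keep) @ x
```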
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specify which kind of sparsity.
For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.
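If I'm remembering it right, the trick looks roughly like this (a toy sketch, not the paper's actual kernel):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where every weight is -1, 0, or +1.

    Each output is just (sum of x where w == +1) minus (sum of x where w == -1).
    This toy version uses boolean masks; a real kernel could additionally skip
    the zero weights instead of touching them at all.
    """
    return ((W == 1) * x).sum(axis=1) - ((W == -1) * x).sum(axis=1)

W = np.array([[1, 0, -1],
              [0, 0,  1]])
x = np.array([0.5, 2.0, -1.0])
print(ternary_matvec(W, x))   # [ 1.5 -1. ]
```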
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to force more of the model's activations to zero.
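I don't recall the exact activation they used, so treat this as a generic illustration of the idea (a top-k variant of ReLU that zeroes most activations):

```python
import numpy as np

def topk_relu(h, k):
    """ReLU variant that keeps only the k largest positive activations
    and forces everything else to exactly zero."""
    h = np.maximum(h, 0.0)                 # ordinary ReLU first
    if k < h.size:
        cutoff = np.partition(h, -k)[-k]   # k-th largest value
        h = np.where(h >= cutoff, h, 0.0)
    return h

h = np.array([0.2, -1.3, 3.1, 0.05, 2.4, -0.7])
print(topk_relu(h, k=2))   # only the two largest activations survive
```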
“Useful” does not mean “better”. It just means “we could not do dense”. All modern state-of-the-art models use dense layers (dense weights and dense inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.
Based on all the examples I've seen so far in this thread, it's clear there's no evidence that sparse models actually work better than dense models.
Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.
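Concretely, something like this (toy sizes, router choice hard-coded for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3

# Concatenate the expert weight matrices into one big block matrix.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
W_big = np.concatenate(experts, axis=1)            # shape (d, n_experts * d)

# Replicate the input once per expert, zeroing the blocks belonging to
# experts the router didn't pick -> a block-structured sparse input vector.
x = rng.standard_normal(d)
active = [1, 0, 1]                                 # router picked experts 0 and 2
x_big = np.concatenate([a * x for a in active])

# Multiplying the block matrix by the block-sparse vector gives the same
# result as running only the active experts and summing their outputs.
y_blocked = W_big @ x_big
y_moe = sum(experts[i] @ x for i in range(n_experts) if active[i])
assert np.allclose(y_blocked, y_moe)
```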
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.
yeah it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute-force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.
Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.
I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:
- Discrete optimization is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation (see the toy sketch after this list), so even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Perhaps we can take some hope from the fact that MoE is also fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefit of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backward pass needs to be entirely sparsified computation (see appendix B).
- Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice, predictable data flows that are very local. A sparse matmul with the same number of flops nominally needs only (up to a multiplicative factor) the same memory bandwidth as the equivalent dense matmul, but it has to be able to route data from any memory unit to any vector compute unit. The locality of a dense matmul means the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory (similarly, because GPU-to-GPU transfers are way slower, when we op-shard matmuls we replicate the data that is needed). Sparse matmuls would need either more replication within each compute die or more all-to-all internal bandwidth, which means spending way more die space on huge crossbars and routing. Thankfully, the crossbars consume much less power than the actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.
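To make the first point concrete, here's a toy sketch of the mask-choice problem (purely illustrative; the top-k mask and the straight-through-style workaround are my own example, not necessarily what the paper does):

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_mask(W, density):
    """Keep only the largest-magnitude weights; zero out the rest.
    Choosing WHICH weights survive is the discrete part: the mask is not
    differentiable with respect to W."""
    k = max(1, int(density * W.size))
    cutoff = np.partition(np.abs(W).ravel(), -k)[-k]
    return (np.abs(W) >= cutoff).astype(W.dtype)

W = rng.standard_normal((64, 64))
mask = topk_mask(W, density=0.05)
W_sparse = W * mask                    # forward pass sees only ~5% of the weights

# One common workaround is a straight-through-style estimator: let gradients
# flow back to the dense W as if the mask weren't there, and re-pick the mask
# periodically, i.e. smuggle a discrete choice into a continuous optimizer.
grad_wrt_sparse = rng.standard_normal(W.shape)   # stand-in for a real gradient
grad_wrt_dense = grad_wrt_sparse                 # straight-through: ignore the mask
```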
It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).
To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.
Yes it would require completely new hardware and most likely ditching gradient descent for alternative optimization methods, though I'm not convinced that we'd need to turn to discrete optimization.
Some recent works that people might find interesting:
A note on the hardware part: it does not require NN-specific hardware akin to neuromorphic chips. Sparse-compute-oriented architectures have already been developed for other reasons, such as large-scale graph analysis or inference. It will still require significant effort to use them to train large models, but it would not be starting from scratch.
There are advantages to having the filesystem do the snapshots itself. For example, if you have a really big file that you keep deleting and restoring from a snapshot, you'll only pay the cost of the space once with Btrfs, but will pay it every time over with LVM.
On some of my zfs servers, the number of snapshots (mostly periodic, rotated — hour, day, month, updates, data maintenance work) is 10-12 thousand.
LVM can't do that.
btrfs has eaten my data within the last decade. (not even because of the broken erasure coding, which I was careful to avoid!) not sure I'm willing to give it another chance. I'd much rather use zfs.