
Oh the Unsloth dynamic ones are not 2bit at all - they're a mixture of 2, 3, 4, 5, 6 and sometimes 8bit.

Important layers are kept in 8bit or 6bit; less important ones are left in 2bit! I talk more about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
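
To give a feel for what "dynamic" means in practice, here's a tiny sketch of a per-layer bit map - the name patterns and bit widths below are made up for illustration, not our actual recipe:

    import re

    # Made-up example map: tensor-name patterns -> bit width
    BIT_MAP = [
        (r"token_embd|output\.weight", 8),  # embeddings / lm_head kept high precision
        (r"attn_(q|k|v|output)",       6),  # attention projections: fairly sensitive
        (r"ffn_down",                  4),  # often more sensitive than gate/up
        (r"ffn_(gate|up)",             3),  # huge MLP matrices: most of the size savings
    ]

    def bits_for(tensor_name, default=2):
        # Fall back to the lowest precision for anything not matched above.
        for pattern, bits in BIT_MAP:
            if re.search(pattern, tensor_name):
                return bits
        return default

    print(bits_for("blk.0.attn_q.weight"))    # -> 6
    print(bits_for("blk.0.ffn_norm.weight"))  # -> 2 (falls through to the default)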



Not an AI researcher here, so this is probably common knowledge for people in this field, but I saw a video about quantization recently and wondered exactly about that: whether it's possible to compress a net by using more precision where it counts and less precision where it doesn't. I also wondered how one would go about deciding which parts count and which don't.

Great to know that this is already a thing; I assume model "compression" is going to be the next hot topic.


Yes, you're thinking about it exactly right! We shouldn't quantize a model naively to 2bit or 4bit - we should do it smartly!


How do you pick which ones should be 2bit, which should be 4bit, etc.? Is this secret sauce, or something open?


Oh I wrote about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs We might provide some scripts for them in the future!


Thanks! But I can't find any details on that page about how you "intelligently adjust quantization for every possible layer". I assume this is a secret?

I am wondering about the possibility that different use cases might require different "intelligent quantization", e.g., quantization for an LLM for financial analysis might differ from one for code generation. I am currently doing a postdoc in this. Interested in doing research together?


Oh we haven't published about it yet! I talk about it in bits and pieces - we might do a larger blog post on it!

Yes, different use cases will be different - oh interesting! Sorry, I doubt I can be of much help in your research - I'm mainly an engineering guy, so less research focused!


How do you decide which layers are the important ones?


I wrote about it roughly in the blog and linked some papers! I also wrote about it here - https://unsloth.ai/blog/dynamic-4bit - one has to inspect the activation and weight quantization errors!
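
To make that concrete, here's a rough sketch of that kind of measurement in PyTorch, using a naive round-to-nearest quantizer - our actual criterion and quantizer are more involved, this only illustrates the idea:

    import torch

    def fake_quant(w, bits):
        # Symmetric per-tensor round-to-nearest quantize, then dequantize.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @torch.no_grad()
    def layer_error(weight, calib_acts, bits):
        # Measure error on the layer's *outputs* over calibration activations,
        # so weight error is weighted by how strongly real inputs excite it.
        ref = calib_acts @ weight.T
        quant = calib_acts @ fake_quant(weight, bits).T
        return ((ref - quant).norm() / ref.norm()).item()

    def rank_layers(layers, bits=2):
        # layers: dict of name -> (weight [out, in], calibration activations [n, in])
        errors = {name: layer_error(w, x, bits) for name, (w, x) in layers.items()}
        # Layers with the largest error at 2bit are the ones to keep in 4/6/8bit.
        return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)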


So you are basically looking at "fMRI" of the "brain" while it's doing a wide range of things and cutting out the things that stay dark the most?


Oh that's a good analogy! Yes that sounds right!


> The key reason to use Unsloth quants is because of our deep involvement in fixing critical bugs across major models

sounds convincing, eh ... /s

On a less cynical note, the approach does look interesting, but I'd also like to understand how and why it works, if it works at all.


Oh we actually fixed bugs! We fixed a few bugs in Gemma (see https://news.ycombinator.com/item?id=39671146), a gradient accumulation bug (see https://news.ycombinator.com/item?id=41859037), Phi bugs, Llama bugs and more! See https://unsloth.ai/blog/reintroducing for more details!


What does your approach with dynamic weights have to do with those bugs? All those bugs seem unrelated to the technique.


Oh apologies I got confused - it's because when we calculate our dynamic quants, we have to do it on the fixed model!

In Phi 3, for example, the end-of-sentence token was wrong - if we had used it as-is, our quants would have been calibrated incorrectly, since chatting with the model uses the actual correct token.

Another is Llama 4 - https://github.com/ggml-org/llama.cpp/pull/12889 - in which I fixed a RoPE issue. If we hadn't fixed that first, then again the calibration process would have been incorrect.
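
A tiny sketch of why this matters for calibration - the token ids below are made up, purely for illustration:

    CORRECT_EOS_ID = 32007   # hypothetical id
    WRONG_EOS_ID   = 32000   # hypothetical id

    def build_calibration_ids(token_ids, eos_id):
        # Append the end-of-sequence token, the same way a real chat turn ends.
        return token_ids + [eos_id]

    sample = [15043, 3186]   # hypothetical ids for "Hello world"
    good = build_calibration_ids(sample, CORRECT_EOS_ID)
    bad  = build_calibration_ids(sample, WRONG_EOS_ID)
    assert good != bad   # different sequences -> different activation statistics,
                         # so the importance data no longer matches real usage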


Ok, so that means your approach doesn't work without first applying those fixes to the vanilla models. What I'm trying to understand is the approach itself. How and why does it work?



If you don't mind divulging, what resources and time did it take to dynamically quantize Qwen3-Coder?


It takes a few hours to compute the imatrix on a calibration dataset, since we use more than 1-3 million tokens of high quality data. Then we have to decide which layers to quantize to higher bits, which takes more time. Creating the quantizations also takes some hours, and uploading takes time as well! Overall maybe 8 hours minimum?
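
If you want to reproduce something roughly similar, the standard llama.cpp flow looks like this - the paths and quant type are placeholders, and our real pipeline adds the per-layer bit-width decisions on top:

    import subprocess

    MODEL_F16 = "Qwen3-Coder-F16.gguf"   # placeholder paths
    CALIB     = "calibration.txt"        # 1-3M+ tokens of high quality text
    IMATRIX   = "imatrix.dat"
    OUTPUT    = "Qwen3-Coder-Q2_K.gguf"

    # 1) Compute the importance matrix over the calibration set (takes hours).
    subprocess.run(["llama-imatrix", "-m", MODEL_F16, "-f", CALIB, "-o", IMATRIX],
                   check=True)

    # 2) Quantize, using the importance matrix to decide which weights keep precision.
    subprocess.run(["llama-quantize", "--imatrix", IMATRIX, MODEL_F16, OUTPUT, "Q2_K"],
                   check=True)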


What cluster do you have to do the quantizing? I'm guessing you're not using a single machine with a 3090 in your garage.


Oh definitely not! I use some spot cloud instances!


But you can get one of these quantized models to run effectively on a 3090?

If so, I'd love detailed instructions.

The guide you posted earlier goes over my (and likely many others') head!


Oh yes definitely! Oh wait, is the guide too long / wordy? This section https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locall... shows how to run it on a 3090.
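
As a rough sketch (not the exact steps from the guide), the llama-cpp-python bindings look something like this - the model filename and n_gpu_layers are placeholders you'd tune for a 24GB 3090:

    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-UD-Q2_K_XL.gguf",  # placeholder filename
        n_ctx=8192,        # context length; raise it if you have spare VRAM/RAM
        n_gpu_layers=30,   # placeholder; raise until the 24GB card runs out of VRAM
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])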


Kind of you to respond! Thanks!

I have pretty bad ADHD, and I've only run models locally using Kobold; I'm a dilettante at DIY AI.

So, yeah, I'm a bit lost in it.


Oh sorry - for Kobold - I think it uses llama.cpp under the hood? I think Kobold has some guides on using custom GGUFs.


Thanks Daniel. Do you recommend any resources showing the differences between different quantizations?


Oh our blog https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs compares the accuracy differences between the different quantization methods for Llama 4 Scout and also Gemma 3 27B - the comparisons should carry over to other models' quants (like Qwen 3 Coder).



