As someone who has been writing harnesses for a year: the people at opencode etc. aren't stupid. When they decide to break the prefix cache (usually only partially), it's always because they've tested it and it gives better results overall.
If you think that dsv4 behaves differently enough from the aggregate of other models, submit a PR to your harness of choice that special-cases it, with evidence. Just blindly assuming "append only all the time because cache" is a waste of everyone's time.
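To make "partially" concrete, here is a toy sketch of how much of a previous request a prefix cache can reuse: only the tokens before the first point of divergence. The function name and the token lists are made up for illustration and are not taken from any harness.

```python
def cached_prefix_len(previous: list[int], current: list[int]) -> int:
    """Count the leading tokens two requests share (the reusable part)."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


# Toy token ids, purely illustrative.
prev_request = [1, 2, 3, 4, 5, 6]
append_only = prev_request + [7, 8]        # nothing earlier was rewritten
edited_middle = [1, 2, 3, 9, 5, 6, 7, 8]   # one earlier token was replaced

print(cached_prefix_len(prev_request, append_only))    # 6: full reuse
print(cached_prefix_len(prev_request, edited_middle))  # 3: partial reuse
```

Whether the lost reuse is worth it is exactly the thing that has to be measured, which is the point above.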
Generally, I would say: just start building it, and ask your favorite coding agent for advice when you get stuck. This is the first technology that can teach you how to use it! (But do ask a model with a recent knowledge cutoff, i.e. not Gemini.)
My agent wrote a pile of very interesting articles at wasnotwas.com. I have been a bit quiet there lately, but it covers lots of areas that are very interesting to harness builders (albeit less interesting to the general public).
Those are the only two options for finding quality candidates?
Try talking more about the meta of coding itself. Get into the developer's head by _talking_ to them and understanding how they would approach and attack different problems. You can show them code and ask them what they would do differently, or how they would go about implementing X-Y-Z. Just because you can write foobar doesn't mean you understand how to apply algorithms or whatever specific problems [your] team has. It's _far_ better to understand how they would solve a problem than to check their syntax anyway.
Think of this as another way of achieving that. In theory it has a higher ceiling on how many tokens it can predict at a time and, more importantly, it is a lot more memory-efficient during actual inference.
There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead, depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in Qwen's architecture.
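A rough way to see why a short draft can be the sweet spot: under the simplifying assumption that every draft token is accepted independently with the same probability `alpha`, the expected number of tokens produced per verification pass with a draft of length `k` is (1 - alpha^(k+1)) / (1 - alpha). Real acceptance rates are not independent or constant, and the values below are illustrative, not taken from the Unsloth chart.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a simplification), and that the verifier always
    contributes one token of its own.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus one
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative acceptance rate; the gain from each extra draft token shrinks.
for k in range(1, 7):
    print(k, round(expected_tokens_per_pass(0.7, k), 3))
```

Each extra draft token adds only alpha^(k+1) expected tokens while still costing draft compute, so the curve flattens quickly unless the acceptance rate is high.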
That sends a different signal, because you're asking someone to do something you could do yourself but simply choose not to, which is essentially what you described above as "taking advantage of others". However this is quite different from what I described in my comment.
If you see every request for help as someone taking advantage of others, I'd encourage you to reconsider why you view everyone that way. It might also be preventing you from seeking help yourself, out of fear of being seen as a leech.
> If you see every request for help as someone taking advantage of others
Let me rephrase, because there seems to be some kind of misunderstanding here:
To me this advice applied broadly would take the appearance of such a signal, even if weak. The framing of "do it because people like to help" is something which wouldn't even occur to me as motivation to ask for help.
Those examples aren't something a person needs help with; I think that's the difference. I can't spot my own lift. I can't teach myself what a certain machine does if I don't even know what it's called. I can't understand a new lift I haven't seen before without asking the person doing it what it is and a little about it.
Ask people for help where help is actually needed, not to act as your servant cleaning up behind you.
How should I update my simplistic understanding that decode is bw-bound with these results that show the B70 decoding faster than a 4090 (about 50% more bw)?
I doubt you'd get the same sort of result on a modern-ish MoE or dense model via a more standard inference engine like llama.cpp or vLLM. I don't think MLPerf is a reasonable benchmark at this point.
Edit: Here is a simple llama.cpp compare where the token gen results match the rule of thumb.
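For reference, the rule of thumb in question can be sketched like this: at batch size 1, each generated token has to read roughly all active weights once, so memory bandwidth divided by that size gives an upper bound on tokens per second. The function name and every number below are placeholders, not measured specs for either card.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed, in tokens/s.

    Ignores KV-cache reads, kernel overhead and compute limits, so real
    throughput should land below this number.
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Placeholder values: same model on two cards, one with 50% more bandwidth.
model_gb = 8.0
slower_card = decode_tps_upper_bound(1000.0, model_gb)
faster_card = decode_tps_upper_bound(1500.0, model_gb)

print(slower_card)                # 125.0
print(faster_card / slower_card)  # 1.5
```

Under this model the ratio of decode speeds simply tracks the ratio of bandwidths, so a result where the lower-bandwidth card decodes faster points at the benchmark setup (batching, engine, model) rather than at the rule itself.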