Hacker News | jbellis's comments

Because it gets you a minimal amount of abuse prevention for free.

As someone who has been writing harnesses for a year: the people at opencode etc. aren't stupid. When they decide to break the prefix cache (usually partially), it's always because they've tested it and it gives better results overall.

If you think that dsv4 behaves differently enough from the aggregate of other models, submit a PR to your harness of choice that special-cases it, with evidence. Just blindly assuming "append only all the time because cache" is a waste of everyone's time.
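For anyone who wants to quantify the trade-off rather than assume it, here is a minimal sketch of how much of a cached prefix survives a context edit. It assumes the provider caches on exact token-prefix match and models tokens as plain strings; `shared_prefix_len` and the sample messages are illustrative, not any harness's actual API.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], new: Sequence[str]) -> int:
    """Return how many leading tokens of `new` match `cached` exactly."""
    n = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        n += 1
    return n


# Hypothetical token streams; real harnesses work on model token IDs.
cached = ["sys", "user1", "tool_out_long", "assistant1"]

# Append-only: every cached token is reused.
appended = cached + ["user2"]

# Partial break: an old tool output is replaced by a summary, so only the
# tokens before the edit are still cache hits.
compacted = ["sys", "user1", "tool_out_summary", "assistant1", "user2"]

print(shared_prefix_len(cached, appended))   # 4
print(shared_prefix_len(cached, compacted))  # 2
```

The second case pays for re-processing everything after the edit, which is the cost a harness has to weigh against whatever quality it gains from the smaller context.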


Are there any learning resources you'd recommend on writing harnesses? I'm interested in doing a non-coding one, but not really sure where to start.

Generically, I would say, just start building it and ask your favorite coding agent for advice when you get stuck. This is the first technology that can teach you how to use it! (But do ask a model with a recent knowledge cutoff, i.e. not gemini.)

My agent wrote a pile of very interesting articles at wasnotwas.com. I have been quiet there for a bit, but it covers lots of areas that are very interesting to harness builders (albeit less interesting to the general public).

> As someone who has been writing harnesses for a year…

Your agent harness, brokk, looks great. I’m going to try it this morning.


Is "harness" in this context ~= "agent"?

I've understood harness to be the software that runs the agent (opencode, pi, Claude Code).

I think agent = harness + model.

Developers: stop doing whiteboard interviews, they don't measure anything relevant to the real job

Also devs: stop giving us real world problems to solve


Those are the only two options for finding quality candidates?

Try talking more about the meta of coding itself. Get into the developer's head by _talking_ to them and understanding how they would approach and attack different problems. You can show them code and ask them what they would do differently, or how they would go about implementing X-Y-Z. Just because you can write foobar doesn't mean you understand how to apply algorithms to whatever specific problems [your] team has. It's _far_ better to understand how they would solve a problem than to judge their syntax anyway.



Yes, but not diffusion based, it's still doing token-at-a-time speculation.


I thought it can do multiple tokens at a time


Think of this as another way of achieving that. This theoretically has a higher ceiling on how much it can predict at a time. And, more importantly, it is a lot more memory efficient during actual inference.


There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead, depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in Qwen's architecture.
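A rough way to see why a sweet spot exists at all: under the usual simplifying assumption from the speculative decoding literature that each drafted token is accepted independently with probability `alpha`, the expected tokens per verification step is (1 - alpha^(k+1)) / (1 - alpha) for draft length k. The sketch below adds a hypothetical relative draft cost `c` per drafted token; `alpha` and `c` here are made-up inputs, not measurements from the Unsloth chart.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify step with k drafted tokens,
    assuming each draft token is accepted independently with prob alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Estimated speedup over plain decoding, where c is the assumed cost
    of drafting one token relative to one target-model forward pass."""
    return expected_tokens(alpha, k) / (k * c + 1.0)


def best_draft_length(alpha: float, c: float, max_k: int = 8) -> int:
    """Draft length in 1..max_k with the highest estimated speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, c))


# Illustrative inputs only.
for k in range(1, 6):
    print(k, round(speedup(0.7, k, 0.2), 3))
```

With these made-up numbers the estimate peaks at a draft length of 3. A higher acceptance rate or a cheaper draft moves the peak further out, which is the sense in which a different drafting scheme could have a higher ceiling.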


BTW the paper says

> Since only (Qdiff,Kdiff,Vdiff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.

But the code defines q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff, and it only matches 16% when you include the O term.
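This kind of claim is easy to sanity-check by counting parameters. A sketch, assuming the `*_diff` projections have the same shapes as the base attention projections, a gated MLP, tied embeddings, and ignoring norms and biases; the config values below are hypothetical placeholders, not the paper's model.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical dimensions; substitute the real model config.
    hidden: int
    n_heads: int
    n_kv_heads: int
    head_dim: int
    ffn: int
    layers: int
    vocab: int


def attn_params(cfg: Config) -> dict:
    """Per-layer weight counts for each attention projection."""
    q = cfg.hidden * cfg.n_heads * cfg.head_dim
    kv = cfg.hidden * cfg.n_kv_heads * cfg.head_dim
    return {"q": q, "k": kv, "v": kv, "o": q}


def total_params(cfg: Config) -> int:
    """Approximate base model size: attention + gated MLP + tied embeddings."""
    per_layer = sum(attn_params(cfg).values()) + 3 * cfg.hidden * cfg.ffn
    return cfg.layers * per_layer + cfg.vocab * cfg.hidden


def trainable_fraction(cfg: Config, names: str) -> float:
    """Fraction of the model covered by the named projections, e.g. 'qkv'."""
    a = attn_params(cfg)
    return cfg.layers * sum(a[n] for n in names) / total_params(cfg)


cfg = Config(hidden=8, n_heads=2, n_kv_heads=1, head_dim=4,
             ffn=16, layers=1, vocab=10)
print(trainable_fraction(cfg, "qkv"), trainable_fraction(cfg, "qkvo"))
```

Plugging in the real config and comparing the `"qkv"` and `"qkvo"` fractions against the stated 16% shows which set of projections the paper's number actually corresponds to.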


Really cool work!

Does the training data budget scale with model size?

How would you compare this to the Gemma 4 draft model, which is also integrated with the base KV cache?


Your calibration is wildly off. Asking people for a spot is totally normal at any gym with free weights.


Spotting is a different thing, as you're communicating that you're entrusting your safety to that person.

Imagine someone instead asked you to wipe down the equipment for them or help put the weights back. Different signal altogether.


That sends a different signal, because you're asking someone to do something you could do yourself but simply choose not to, which is essentially what you described above as "taking advantage of others". However this is quite different from what I described in my comment.

If you see every request for help as someone taking advantage of others, I'd encourage you to reconsider why you view everyone that way. It might also be preventing you from seeking help yourself, out of fear of being seen as a leech.


> If you see every request for help as someone taking advantage of others

Let me rephrase, because there seems to be some kind of misunderstanding here:

To me, this advice, applied broadly, would send that kind of signal, even if weakly. The framing of "do it because people like to help" is something which wouldn't even occur to me as motivation to ask for help.


Those examples aren't something a person needs help with; I think that's the difference. I can't spot my own lift. I can't teach myself what a certain machine does if I don't even know what it's called. I can't understand a new lift I haven't seen before without asking the person doing it what it is and a little about it.

Ask people for help where help is actually needed, not to act as your servant cleaning up behind you.


The OP of this thread didn't specify the nature of the favours, just gave general advice which I think is not helpful.


How should I update my simplistic understanding that decode is bw-bound, given these results that show the B70 decoding faster than a 4090 (which has about 50% more bw)?


I doubt you'd get the same sort of result on a modern-ish MoE or dense model via a more standard inference engine like llama.cpp or vLLM. I don't think MLPerf is a reasonable benchmark at this point.

Edit: Here is a simple llama.cpp compare where the token gen results match the rule of thumb.

https://www.reddit.com/r/LocalLLaMA/comments/1st6lp6/nvidia_...
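For reference, the rule of thumb being discussed: at batch size 1, each decoded token has to read all active weights from memory once, so tokens/s is bounded by bandwidth divided by bytes of active weights. A minimal sketch, ignoring KV cache reads and compute; the bandwidth and model numbers are hypothetical, not the specs of either card.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float,
                           active_params_b: float,
                           bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed (tokens/s).

    active_params_b is in billions; for an MoE model, count only the
    parameters activated per token.
    """
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


# Hypothetical cards: B has 50% more bandwidth than A.
card_a = decode_tps_upper_bound(600.0, 8.0, 2.0)
card_b = decode_tps_upper_bound(900.0, 8.0, 2.0)
print(card_a, card_b, card_b / card_a)
```

Under this model the ratio of ceilings equals the ratio of bandwidths, so a benchmark where the lower-bandwidth card wins is probably not measuring bandwidth-bound single-stream decode.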


Probably the best single resource is https://github.com/pmcfadin/awesome-accord

