I legitimately cannot think of any hardware that will get you to that throughput over that many streams with any of the hardware I know of (I don't work in the server space so there may be some new stuff I am unaware of).
I don't think you can get 1k tokens/sec on a single stream using any consumer grade GPUs with a 20b model. Maybe you could with H100 or better, but I somewhat doubt that.
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer