I question your fundamental premise. For example, I used to use supercomputers to run molecular dynamics simulations. When I found that the supercomputer folks didn't want me to run my codes on their machines (because at the time, AMBER didn't scale to the largest supercomputers), I moved to cloud and grid computing and then built a system to use all of Google's idle cycles to run MD. We achieved our scientific mission without a supercomputer! In the meantime, AMBER was improved to scale on large machines, which "justifies" running it on supercomputers (the argument being that if you spend 15% of the machine's cost on interconnect, the code had better use that interconnect well to scale, rather than being embarrassingly parallel).
I've seen scientists who are captive to the supercomputer-industrial complex and it's not that they need this specialized tool to answer a question definitively. It's to run sims to write the next paper and then wait for the next supercomputer. Your cart is pushing the horse.
You know the term "embarrassingly parallel," but you seem to ignore that this term exists because there are other classes of problems which lack this characteristic.
Quite a few important problems are heavily dependent on interconnects, e.g. large-scale fluid dynamics and simulations coupled with such dynamics: aerodynamics, acoustics, combustion, weather and climate, oceanography, seismology, astrophysics, and nuclear modeling. A primary component of these simulations is fast wavefronts that propagate globally through the distributed scalar and/or vector fields.
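To make that dependence concrete, here is a minimal sketch in Python/NumPy (my toy, not code from any of these fields) of a 1D wave-equation step: every cell's update reads its immediate neighbors, so once the field is split across nodes, boundary values have to cross the interconnect on every time step, and the wavefront itself sweeps across the whole machine over the course of a run.

```python
import numpy as np

# Toy 1D wave-equation stencil (illustration only, not a production solver).
# The next value of u[i] depends on u[i-1], u[i], u[i+1] at the current step,
# so a domain split across nodes must exchange boundary cells every step.
n, c, dx, dt = 1000, 1.0, 1.0, 0.5      # grid size, wave speed, spacings (made up)
u_prev = np.zeros(n)
u = np.zeros(n)
u[n // 2] = 1.0                         # initial pulse in the middle

for step in range(500):
    u_next = np.empty_like(u)
    u_next[1:-1] = (2.0 * u[1:-1] - u_prev[1:-1]
                    + (c * dt / dx) ** 2 * (u[2:] - 2.0 * u[1:-1] + u[:-2]))
    u_next[0] = u_next[-1] = 0.0        # fixed boundaries
    u_prev, u = u, u_next

# In a distributed run, the neighbor slices u[2:] and u[:-2] at a subdomain edge
# live on another node, so every step needs a halo exchange over the interconnect.
```

The per-step messages are small, which is exactly the regime where interconnect latency, not raw bandwidth, sets the scaling limit.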
As long as machines keep growing to increase the scope, fidelity, and speed of these applications, there is also a need for infrastructure research to validate or develop new methods that target these new platforms. There are categories of grants that are written to a roadmap, with interlocking deliverables between contracts. These researchers do not have the luxury of proposing only work that can be done with COTS materials already in the marketplace.
And conversely, if your application just needs a lot of compute and doesn't need the other expensive communication and IO aspects of these new, leading-edge machines, it _does_ make sense that your work gets redirected to other, less expensive machines for high-throughput computing. This is evidence of the research funding apparatus working well to manage resources, not evidence of mismanagement or waste.
One thing I've learned is that even when folks think their problem can only be solved in a particular way (a fast interconnect to implement the underlying physics), there is almost always another way that is cheaper and still solves the problem, mainly by applying cleverer ideas.
I'll give (yet another) AMBER example. At some point in the past, AMBER really only scaled on fast interconnects. But then somebody realized the data being passed around could be compressed before transmission and then decompressed on the other end, all faster than it could be sent over the wire. Once the code was rewritten, the resulting engine scaled better on all platforms, including ones that had wimpy (switched gigabit) interconnects. It reduced the cost of doing the same experiments significantly, by making it possible to run identical problems on less/cheaper hardware.
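I don't know which codec the AMBER rewrite actually used, so take this as a generic sketch of the trade-off rather than the real change: compression pays off whenever compress + smaller transfer + decompress comes in under shipping the raw bytes on the slow link. The payload, link speed, and zlib settings here are all made up for illustration.

```python
import time
import zlib
import numpy as np

# Hypothetical payload standing in for per-step data exchanged between ranks.
payload = np.random.normal(size=250_000).astype(np.float32).round(2).tobytes()

link_bytes_per_sec = 1e9 / 8                # roughly a gigabit link

t0 = time.perf_counter()
packed = zlib.compress(payload, level=1)    # fast, low-ratio setting
unpacked = zlib.decompress(packed)
codec_time = time.perf_counter() - t0

raw_wire_time = len(payload) / link_bytes_per_sec
packed_wire_time = len(packed) / link_bytes_per_sec

# Compression wins whenever codec overhead plus the smaller transfer beats
# shipping the raw bytes; the answer depends entirely on link speed and data.
print(f"raw:        {raw_wire_time * 1e3:.1f} ms on the wire")
print(f"compressed: {(codec_time + packed_wire_time) * 1e3:.1f} ms "
      f"({len(packed) / len(payload):.0%} of original size)")
assert unpacked == payload
```

On a fast interconnect the codec overhead is pure loss; on a slow one it can be the difference between scaling and not.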
Second, I really do know a fair amount in this field, having worked on both AMBER on supercomputers (with strong scaling) and Folding@Home (which explicitly demonstrated that many protein folding problems never needed a "supercomputer").
I do not know much about your field of molecular dynamics. But it is my lay understanding that it tends to have aspects of sparse models in space, almost like a finite-element model in civil engineering. Upon this, you have higher-level equations and geometry to model forces or energy transfer between atoms. It may involve quadratic search for pairwise interactions, and possibly spatial search trees like k-d trees to find nearby objects. Is that about right? And protein folding is, as I understand it, high throughput because it is a vast search or optimization problem on very small models.
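To pin down what I mean by quadratic pairwise search versus a spatial tree, here is a small sketch using SciPy's k-d tree; the particle count and cutoff radius are invented, and I'm not claiming any real MD code looks like this.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1_000, 3))         # made-up particle positions in a unit box
cutoff = 0.1                            # made-up interaction cutoff

# Quadratic approach: look at every pair, O(N^2) in time and memory.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
i, j = np.nonzero(dists <= cutoff)
brute_pairs = {(int(a), int(b)) for a, b in zip(i, j) if a < b}

# Spatial-tree approach: build a k-d tree once, then query only nearby points,
# roughly O(N log N) for a short-range cutoff like this.
tree = cKDTree(points)
tree_pairs = tree.query_pairs(r=cutoff)

assert brute_pairs == tree_pairs        # same neighbor list, very different cost
print(len(tree_pairs), "pairs within the cutoff")
```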
Compared with fluid dynamics, I think your problem domain has much higher algorithmic complexity per stored byte of model data. Rather than representing a set of atoms or other particles, typical fluid simulations represent regions of space with a fixed set of per-location scalar or vector measurements. A region is updated based on a function that always views the same set of neighbor regions. Storage and compute size scale with the spatial volume and resolution, not with the amount of matter being simulated. These other problems are closer in spirit to convolution over a dense matrix, which often has so few compute cycles per byte that it is just bandwidth-limited in ripping through the matrix and updating values. But due to the multiple dimensions, the access pattern is also an ugly traversal rather than a simple linear streaming problem.
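As a sketch of the "convolution over a dense matrix" point (again just an illustration, not a real solver): a single sweep touches every stored value, does a handful of arithmetic operations on it, and moves on.

```python
import numpy as np

# Dense 2D field: every grid cell stores a value, regardless of what matter is there.
field = np.random.rand(2048, 2048)
new = np.empty_like(field)

# One 5-point stencil sweep: each interior cell becomes the average of itself and
# its four neighbors, i.e. a tiny fixed convolution applied over the whole grid.
new[1:-1, 1:-1] = 0.2 * (field[1:-1, 1:-1]
                         + field[:-2, 1:-1] + field[2:, 1:-1]
                         + field[1:-1, :-2] + field[1:-1, 2:])
new[0, :], new[-1, :] = field[0, :], field[-1, :]       # copy boundary rows
new[:, 0], new[:, -1] = field[:, 0], field[:, -1]       # copy boundary columns

# Rough arithmetic intensity: about 5 flops per cell against several 8-byte values
# read and written, i.e. well under 1 flop per byte. Memory bandwidth, not
# arithmetic, is the limit, and the 2D slicing is not a simple linear stream.
```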