I think it would be easy, but still not worth the transistors. Think of it: what programs contain lots of NOPs? Who, desiring to write a fast program, sprinkles their code with NOPs?
It’s not worth optimizing for situations that do not occur in practice.
The transistors used to detect register clearing using XOR foo,foo, on the other hand, are worth it, as lots of code has that instruction, and removing the data dependency (the instruction technically uses the contents of the foo register, but its result is independent of its value) can speed up code a lot.
On CPUs with variable instruction length, like the Intel/AMD CPUs, many programs have a lot of NOPs, which are inserted by the compiler for instruction alignment.
However those NOPs are seldom executed frequently, because most are outside of loop bodies. Nevertheless, there are cases when NOPs may be located inside big loops, in order to align some branch targets to cache line boundaries.
That is why many recent Intel/AMD CPUs have special hardware for accelerating NOP execution, which may eliminate the NOPs before reaching the execution units.
It’s not worth optimizing for situations that do not occur in practice.
The transistors used to detect register clearing using XOR foo,foo, on the other hand, are worth it, as lots of code has that instruction, and removing the data dependency (the instruction technically uses the contents of the foo register, but its result is independent of its value) can speed up code a lot.