The problem is almost always a limitation in the software's design, not a limitation in the hardware's capability. However, devising clever software that can efficiently do the calculation is a very difficult thing to do, and most developers are on such a deadline that they just can't afford to do it. But, regardless, allow me to give an example that demonstrates what I mean.
I once had to write some code that ran a simulation on 1 million point particles for 1 million iterations each. That's a total of 1 trillion iterations! Fortunately, certain conditions could occur that meant a particle never needed to be simulated again after a certain point. Now, the naive solution to this problem is to set up a double for-loop that iterates over each step of the simulation and then over each particle, with an if-statement inside the inner loop to check whether the particle still needs any calculation done on it. The strange thing is, this solution still takes an hour to run, because the computer is still doing 1 trillion iterations, even if some of those iterations don't do anything. This was indeed how I did the simulation, at first.
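In spirit, the naive version looked something like the sketch below (the Particle fields, advance(), and the drop-out condition are hypothetical stand-ins for the real simulation):

```cpp
#include <vector>

// Hypothetical particle: real code had more state than this.
struct Particle {
    double x, v;
    bool   active;   // false once the particle has dropped out for good
};

void advance(Particle& p) { p.x += p.v; /* real physics goes here */ }

void simulate_naive(std::vector<Particle>& particles, long steps) {
    for (long step = 0; step < steps; ++step) {   // 1 million steps
        for (Particle& p : particles) {           // 1 million particles
            if (!p.active) continue;  // skipped, but still an iteration
            advance(p);
            // ...some condition may set p.active = false permanently...
        }
    }
}
```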
Then a clever thing dawned on me. Since once a particle was taken out of the simulation it stayed out, why bother iterating over it at all? The easiest way to remove a particle from the iteration entirely, with no check needed, was to store my particles in a linked list. Then, instead of a for-loop, I used a while loop that iterated until it reached the end of the list. When a particle was forced out of the calculation, it was simply unlinked from the list. This meant that as the calculation proceeded, fewer and fewer iterations actually had to be done to achieve the equivalent of the full 1 trillion naive iterations. Better still, the linked list was allocated as a single contiguous array, so the particle data stayed packed together in memory, reducing cache thrashing (which can bring even the fastest processor to its knees).
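Here's a minimal sketch of that idea, assuming the same hypothetical particle as before: each particle stores the index of the next live particle, so dead particles are unlinked once and never visited again, while the data itself stays in one contiguous allocation:

```cpp
#include <vector>

struct Particle {
    double x, v;
    int    next;     // index of the next live particle, -1 at the end
};

void advance(Particle& p) { p.x += p.v; /* real physics goes here */ }
bool finished(const Particle& p) { return p.x > 1000.0; }  // stand-in drop-out condition

void simulate_linked(std::vector<Particle>& ps, long steps) {
    // Build the initial chain: 0 -> 1 -> 2 -> ... -> -1, all in one array.
    int head = ps.empty() ? -1 : 0;
    for (size_t i = 0; i < ps.size(); ++i)
        ps[i].next = (i + 1 < ps.size()) ? int(i + 1) : -1;

    for (long step = 0; step < steps && head != -1; ++step) {
        int prev = -1;
        for (int i = head; i != -1; i = ps[i].next) {
            advance(ps[i]);
            if (finished(ps[i])) {
                // Unlink: this particle is never iterated over again.
                if (prev == -1) head = ps[i].next;
                else            ps[prev].next = ps[i].next;
            } else {
                prev = i;
            }
        }
    }
}
```

Because the links are just indices into one contiguous vector, removing a particle costs a single pointer update and the remaining data stays cache-friendly.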
The combination of a linked list whose links are updated as particles are removed from the simulation, and particle data kept contiguous in memory, took the code from doing 1 trillion iterations in 1 hour to doing the equivalent calculation in only 3 minutes!! From 60 minutes of run-time to 3 minutes! A factor-of-20 improvement, simply from recognizing a better, more efficient way to do the same calculation (the linked list) while also playing to the strengths of the hardware (contiguous data to prevent cache thrashing).
As an example of the effect of cache thrashing, I had an earlier program that did some calculations on a grid of numbers. Following the equations for the calculation, I was initially iterating by column and then by row, that is, column iteration on the outer loop and row iteration on the inner loop. This was exactly the wrong order for how the grid was laid out in memory: every access jumped a whole row's width ahead, so the processor kept guessing wrong about which line of data it would need next. Consequently, it was constantly forced to evict and reload cache lines from main memory instead of having the data ready in the processor cache. This is cache thrashing. My code took 35 minutes to run on the entire grid.
Then, I figured out how to switch the iteration order without invalidating the calculation, iterating by row and then by column. Doing this meant the accesses walked through memory sequentially, so the processor could correctly predict the next line of data and have it preloaded while it was still working on the current one. Doing it this way, my code only took 5 minutes to run! A factor-of-7 improvement!
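The difference is just loop order. A rough sketch of the two orders over a row-major grid (the grid size and the summing operation here are illustrative, not my original calculation):

```cpp
// grid[r][c] is row-major: consecutive c values are adjacent in memory.
constexpr int ROWS = 2048, COLS = 2048;
static double grid[ROWS][COLS];

double sum_column_major() {             // the slow order from my first attempt
    double total = 0.0;
    for (int c = 0; c < COLS; ++c)      // outer loop over columns
        for (int r = 0; r < ROWS; ++r)  // each access jumps a whole row ahead
            total += grid[r][c];        // touches a different cache line almost every time
    return total;
}

double sum_row_major() {                // the fast order
    double total = 0.0;
    for (int r = 0; r < ROWS; ++r)      // outer loop over rows
        for (int c = 0; c < COLS; ++c)  // walks memory sequentially
            total += grid[r][c];        // prefetcher keeps the next line ready
    return total;
}
```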
Both these incidents happened a decade or more ago.
The point here is that the hardware is already hella, stupidly fast, and has been for a long time now. However, our approach to developing efficient code, which sometimes requires some clever tricks, is what significantly lags behind.