What you are asking the technology to do is just absolute magic. You are asking the system to render every possible end state (button presses, movements), timed to how a person may react, and stream it across the web in real time. Predicting the end state is different from predicting the route one may take to get there. Additionally, raw computing power on the back end only reduces the latency of the game's logic, rendering, execution, etc. You have little control over the latency in the other networks that your stream has to traverse to get the image back to you, 30-60 times per second, with no perceptible lag, artifacts, jitter, etc.
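To put rough numbers on that frame rate, here is a minimal Python sketch of the arithmetic. The function names and the per-stage delays are hypothetical, picked only for illustration and not measured from any real service; the only hard fact is that a frame interval is 1000 ms divided by the frame rate.

```python
def frame_interval_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def frames_of_delay(stage_latencies_ms, fps):
    """Express the summed end-to-end delay as a number of frame intervals."""
    total_ms = sum(stage_latencies_ms.values())
    return total_ms / frame_interval_ms(fps)


# Hypothetical per-stage delays in milliseconds, chosen only for illustration.
example_stages = {
    "server_render": 10.0,
    "encode": 5.0,
    "network": 40.0,
    "decode": 5.0,
}

for fps in (30, 60):
    print(fps, "fps:",
          round(frame_interval_ms(fps), 1), "ms per frame,",
          round(frames_of_delay(example_stages, fps), 1), "frames of delay")
```

Shrinking the server-side entries is the part more back-end hardware can help with; the network entry is the part you do not control.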
Pieces of what you suggest are already implemented. For example, there are certain implementations that stream only the pixels that change between frames rather than the whole image, but that introduces a whole different set of weirdness under the right conditions.
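A toy Python sketch of the changed-pixels idea, assuming frames are small grids of integers. Real codecs are far more sophisticated, and `encode_delta` and `apply_delta` are hypothetical names, not any product's API. It also shows one plausible source of the weirdness: a delta only makes sense against the frame it was computed from, so a client that missed an update ends up with stale pixels.

```python
def encode_delta(prev, curr):
    """Return (row, col, value) for every pixel that differs from prev."""
    return [(r, c, v)
            for r, row in enumerate(curr)
            for c, v in enumerate(row)
            if prev[r][c] != v]


def apply_delta(frame, delta):
    """Return a copy of frame with the changed pixels written in."""
    out = [row[:] for row in frame]
    for r, c, v in delta:
        out[r][c] = v
    return out


frame0 = [[0, 0], [0, 0]]
frame1 = [[0, 1], [0, 0]]
frame2 = [[0, 1], [2, 0]]

delta1 = encode_delta(frame0, frame1)   # one changed pixel
delta2 = encode_delta(frame1, frame2)   # one changed pixel

# Client that received both updates reconstructs frame2 exactly.
in_sync = apply_delta(apply_delta(frame0, delta1), delta2)

# Client that lost delta1 applies delta2 to a stale frame and drifts.
drifted = apply_delta(frame0, delta2)

print(in_sync == frame2, drifted == frame2)
```

The saving is that each update carries one pixel instead of four; the cost is that the client and server have to agree on the previous frame at all times.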
Processing web searches and streaming transcoded videos from repositories through YouTube is nothing at all like streaming a live, real-time video game. Even for simple games that can use standard compute units in a cloud service like AWS, it is sub-optimal. Oh, and local hardware has lag, absolutely. From wireless to frame processing in modern TVs, there is always lag to contend with.