The only real choke point for present CPUs is the on-chip cache bus width. Increase the size of all three, L1-L3, and add a few instructions to load some bigger words across a wider bus. Suddenly the CPU can handle it just fine; not max optimization, but like 80% fine. Hardware just moves slowly. Drawing board to consumer for the bleeding edge is 10 years. It is the most expensive commercial venture in all of human history.
I think the future is not going to be in the giant additional math coprocessor paradigm. It is kinda sad to see Intel pursuing this route again, but maybe I still lack the context to understand UALink’s intended scope. In the long term, integrating the changes necessary to run matrix math efficiently on the CPU will win on the consumer front, and I imagine such flexibility would win in the data center too. Why have dedicated hardware when that same hardware could be flexibly used in any application space?
Something like: CPUs are now too slow … so let’s bypass them and just connect the GPUs from different computers together … and let’s make one standard for this communication system so it works for many different manufacturers … and let’s use it to develop more AI … and beat NVIDIA.