Running bigger models makes a huge difference. I mostly run a quantized 70B or 8×7B. With these large models it is far easier to reach their real depth, and it takes much less prompting momentum to get there.
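For anyone curious what that looks like in practice, here is a minimal sketch of streaming a quantized 70B GGUF through llama-cpp-python. The model path and settings are placeholders for whatever quant and hardware you actually have, not a recommendation of a specific file.

```python
from llama_cpp import Llama

# Placeholder path: any Q4-ish 70B or 8x7B GGUF is loaded the same way.
llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as the GPU/unified memory allows
)

# Stream tokens as they are generated instead of waiting for the full reply.
for chunk in llm("Explain why larger models feel deeper:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```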
The issue is not memory bandwidth in general. The primary bottleneck is the L2-to-L1 cache bus width. That is the narrowest point, and it is sized for the typical pipeline of a sequential processor running at insanely fast clocks that are not possible if things get more spread out physically. The problems are more like RF design than typical digital electronics: trace lengths, capacitance, and inductance become super critical.
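To make the scale of the problem concrete, here is a rough back-of-envelope: every generated token has to stream the active weights through whatever link in the hierarchy is narrowest, so dividing per-token weight traffic by that link's bandwidth gives a hard ceiling on tokens per second. The bandwidth figures below are illustrative assumptions, not measurements of any specific part.

```python
# Per-token weight traffic for a 4-bit-ish quantized 70B model (assumptions).
model_params = 70e9
bytes_per_param = 0.56                  # ~4.5 bits/weight for a Q4_K-style quant
bytes_per_token = model_params * bytes_per_param

# Hypothetical sustained bandwidths (GB/s) for the narrowest link in a few setups.
narrowest_link = {
    "dual-channel DDR5 DRAM": 80,
    "wide on-package memory": 400,
    "narrow cache-to-core path": 50,
}

for name, gb_per_s in narrowest_link.items():
    ceiling = gb_per_s * 1e9 / bytes_per_token
    print(f"{name:>28}: ~{ceiling:.1f} tokens/s upper bound")
```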
It will make a major difference just to enable the full AVX-512 instruction set on consumer processors. Those units are often already present in silicon but fused off, or simply not exposed in the microcode. The full AVX-512 instructions are not used because you would need a process scheduler that is a good bit more complicated. Schedulers that complex already exist on some mobile ARM devices, but for simpler hardware and software stacks, and without the backwards compatibility that is the whole reason x86 is a thing. The more advanced AVX-512 instructions do things like loading a 512-bit-wide word in a single instruction.
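A quick illustration of why the wide loads matter, reusing the same hedged weight-traffic figure as above: the wider the load, the fewer instructions the front end and scheduler have to push through per token.

```python
# Illustrative only: how many load instructions it takes to walk the weights
# for one generated token at different load widths.
bytes_per_token = 70e9 * 0.56   # same ~39 GB/token assumption as above

for name, load_bytes in [("64-bit scalar load", 8),
                         ("256-bit AVX2 load", 32),
                         ("512-bit AVX-512 load", 64)]:
    loads = bytes_per_token / load_bytes
    print(f"{name:>22}: ~{loads / 1e9:.2f} billion loads per token")
```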
Hardware moves super slow, like 10 years for a total redesign. The modular nature of ARM makes it a little easier to make minor alterations. The market shifted substantially with AI a year and a half back, but everything we are seeing right now was already in the pipeline long before the AI demand. I expect the first real products to ship in 2 years, 3-4 before any are worth spending a few bucks on, and in ~8 years, hardware from right now will feel as archaic as stuff from 20-30 years ago.
Running the bigger model makes a huge difference. Saying it will run at those speeds is really a statement about agents and retrieval-augmented generation. I can run my largest models with streaming text barely faster than my reading pace. I can't do extra stuff with a model like that, because downstream code or other models would need the entire text before they could process it further. At the faster speed, I could do many things like text to speech, speech to text, retrieval-augmented citations, function calling in code, and running several models doing unique things while keeping a conversational pace.
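As a sketch of what that unlocks, here is a toy streaming pipeline: once tokens arrive faster than reading pace, each sentence can be handed to a downstream stage (text to speech here) the moment it closes, instead of waiting for the whole reply. The generator and the speak() step are stand-ins, not any particular library.

```python
import time

def generate_stream(prompt):
    # Stand-in for a streaming LLM call; yields words at a fixed rate to
    # simulate a model running well above reading pace (assumption).
    reply = "Sure. Here is the first point. Here is the second point."
    for word in reply.split():
        time.sleep(0.05)
        yield word + " "

def speak(text):
    # Stand-in for a TTS stage; a real pipeline would hand each finished
    # sentence to a speech engine or another model immediately.
    print(f"[TTS] {text.strip()}")

def run_turn(prompt):
    sentence, transcript = [], []
    for token in generate_stream(prompt):
        sentence.append(token)
        transcript.append(token)
        if token.strip().endswith((".", "?", "!")):
            speak("".join(sentence))   # downstream work starts mid-reply
            sentence.clear()
    return "".join(transcript)

full_reply = run_turn("hypothetical prompt")
```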