My question would be why do I need to run “30B parameter models at 100 tokens per second” on my PC?
I understand the benefits of running things locally, but why not just use Google’s or OpenAI’s LLM? You shouldn’t be sharing sensitive information with such tools in the first place, so that leaves low-impact business queries and random “lifestyle” queries. Why wouldn’t I use cloud infrastructure for such queries?
I understand the benefits of cutting down sugar, but why not just binge on cake and ice cream?
Sounds like you don’t understand the benefits of running things locally, specifically LLMs and other kinds of AI models.
So what are the benefits with respect to local LLMs in the context I described?
If you’re doing it locally, more sensitive queries become OK, because that data never leaves your computer…
Even when you’re not sending data that you consider sensitive, it’s helping train their models (and you’re paying for it!).
Also what’s not sensitive to one person might be extremely sensitive to another.
Also something you run locally, by definition, can be used with no Internet connection (like writing code on a plane or in a train tunnel).
For me as a consultant, it means I can generally use an assistant without worrying about privacy policies on the LLM provider or client policies related to AI and third parties in general.
For me as an individual, it means I can query the model freely without worrying that every word I send it will be used to build a profile of who I am that can later be exploited by ad companies and other adversaries.
Running bigger models makes a huge difference. I like to run a quantized 70B or 8×7B most. With these larger models it is far easier to access their true depth, with less momentum required to find it.
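As a rough back-of-envelope sketch of why quantization is what makes these sizes feasible at home (the bit widths and parameter counts are my own illustrative assumptions, not figures from any specific release):

```python
# Approximate weights-only memory footprint of an LLM.
# Ignores KV cache, activations, and runtime overhead.

def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in gigabytes: params * bits / 8 bits-per-byte / 1e9."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model quantized to 4 bits per weight:
print(weight_footprint_gb(70e9, 4))   # 35.0 GB -> fits in 64 GB of RAM
# The same model at fp16, for comparison:
print(weight_footprint_gb(70e9, 16))  # 140.0 GB -> out of reach for most PCs
```

The same arithmetic explains why an 8×7B mixture runs faster than a dense 70B despite comparable total size: only a couple of experts' weights are touched per token.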
The issue is not memory bandwidth in general. The primary bottleneck is the L2-to-L1 bus width. That is the narrowest point, and it is designed for the typical pipeline of a sequential processor running at insanely fast speeds that are not possible if things get more spread out. The issues are more like radio than typical electronics: route lengths, capacitance, and inductance become super critical.
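Wherever the narrowest link sits, the first-order math for single-stream generation is the same: each generated token has to stream essentially every weight through the memory hierarchy once, so decode speed is roughly bandwidth divided by model size. A sketch with assumed, illustrative bandwidth numbers:

```python
# First-order estimate of batch-1 decode speed for a memory-bound LLM:
# every token reads all weights once, so tokens/sec ~ bandwidth / weight bytes.

def decode_tokens_per_sec(bandwidth_gb_s: float, weight_gb: float) -> float:
    return bandwidth_gb_s / weight_gb

WEIGHTS_GB = 35.0  # e.g. a 70B model at 4-bit quantization

# Dual-channel desktop DDR5, ~80 GB/s (assumed round number):
print(decode_tokens_per_sec(80, WEIGHTS_GB))    # ~2.3 tokens/sec
# GPU-class HBM, ~1000 GB/s (assumed round number):
print(decode_tokens_per_sec(1000, WEIGHTS_GB))  # ~28.6 tokens/sec
```

This is an upper bound; cache-hierarchy and interconnect limits like the ones described above only push real numbers below it.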
It would make a major difference just to enable the full AVX-512 instruction set on consumer processors. Those units are already present in some instances but fused off, or simply not exposed in the microcode. The full AVX-512 instructions are not used because you would need a process scheduler that is a good bit more complicated. These kinds of complex schedulers already exist in some mobile ARM devices, but for simpler types of hardware and software systems, and without the backwards compatibility that is the whole reason x86 is a thing. The more advanced AVX-512 instructions do things like loading a 512-bit-wide word in a single instruction.
Hardware moves super slowly, like 10 years for a total redesign. The modular nature of ARM makes it a little easier to make minor alterations. The market shifted substantially with AI a year and a half back, and everything we are seeing right now was already in the pipeline long before the AI demand. I expect the first real products to ship in 2 years, 3-4 before any are worth spending a few bucks on, and in ~8 years, hardware from right now will feel as archaic as stuff from 20-30 years ago.
Running the bigger model makes a huge difference. Saying it will run at those speeds is really a statement about agents and augmented generation. I can run my largest models with streaming text barely faster than my reading pace, and I can’t do extra stuff with that kind of model because I would need the entire text before sending it to other code or models for further processing. That kind of speed means I could do many things: text to speech, speech to text, augmented data retrieval for citations, function calling in code, and running several models to do unique things, all while keeping a conversational pace.
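To put a number on "conversational pace": a reader consumes very roughly 5 tokens per second, so a model that only slightly beats that leaves no budget for chaining extra passes, while 100 tokens/sec leaves plenty. The reading speed and pipeline shape here are my own illustrative assumptions:

```python
# Hypothetical time budget for chaining model passes (e.g. retrieval,
# drafting, post-processing) while staying ahead of the reader.
# Assumption: the reader consumes ~5 tokens/sec.
READER_TOK_PER_SEC = 5.0

def pipeline_seconds(reply_tokens: int, passes: int, model_tok_per_sec: float) -> float:
    """Total time if each chained pass handles roughly the full reply length."""
    return passes * reply_tokens / model_tok_per_sec

reply = 200
budget = reply / READER_TOK_PER_SEC    # 40.0 s before the reader catches up
slow = pipeline_seconds(reply, 3, 8)   # 8 tok/s model, 3 passes: 75.0 s, too slow
fast = pipeline_seconds(reply, 3, 100) # 100 tok/s model, 3 passes: 6.0 s, fine
print(budget, slow, fast)
```

The slow model can only stream one answer at reading pace; the fast one can run a whole multi-stage pipeline and still feel live.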
Maybe not everyone wants to hand Google or OpenAI their data or custom?
And this hardware will no doubt be used for more than that anyway.
Sure, but I prefaced my statement by saying that I am only looking at a subset of “low-impact business queries and random “lifestyle” queries”.
There are no low impact queries. Wouldn’t you prefer to use Google without it data mining you for correlations to predict your life?
Does it not bother you that Google knows more about you than you know about yourself because it records every restaurant you eat in, everything you buy (tap pay), every question you have?
AMD doesn’t manufacture CPUs for you specifically.
The industry is not exactly built on efficient computing as a business model.
Unless you are a researcher you probably won’t.