Understanding Tokens, Context Windows, and Local LLM Hosting

Paul and Grok (an LLM) examine fundamental aspects of large language models, focusing on how tokens and context windows operate. Tokens, defined as textual units (words or word pieces), are the basis for all model processing and cost calculations. The “context window” is described as the LLM’s short-term memory, determining how much information can be handled at once.

The discussion highlights the trade-offs between hosting LLMs locally and calling cloud APIs: tokens-per-second throughput, routing each task to the smallest model that can handle it, and what all of this means for cost. Paul also talks through his plan for an audio-to-video pipeline running on a MacBook Pro M1.

🎙️ Hosted by Paul at Talking to AI — where real people, real problems, and real conversations meet artificial intelligence.

Full Transcript

This is Talking to AI.
For a successful day, you don't need to know all the answers.
Just have good questions.
Chatting to AI is different from normal speech
and I hope you enjoy listening to the show
whilst getting ideas on how to hone your questions
to get the most out of AI.
My name is Paul.
The live conversations you hear are uncut
although sometimes the AI needs time to think.
In those cases, I've cut out the dead space.
Hello, Grok.
Hello, Paul. What can I do for you?
So, I'm wanting to talk to you about tokens
and hosting LLMs and costs.
So, can you talk to me a little bit
and explain what tokens are when it comes to AI
and how that whole thing works?
Tokens in AI are basically chunks of text
like words or parts of words
that models like me use to process language.
For example, "running" might be one token,
but "unbelievably" could split into three.
Hi, guys. In this show,
I'm going to be talking about AI and tokens
and we will probably also cover hosting LLMs as well.
It's really a deep dive on how tokens work
and what they mean in terms of processing,
I guess, when you're looking at LLMs.
So, it's the first time I'm sort of dipping my toe into this.
So, I'll be talking to Grok for this show.
Hello, Grok.
Hello, Paul. What's on your mind?
I want to talk about tokens and hosting LLMs locally
and what that means in turn
and what sort of tokens are
and what token throughput means in terms of speed.
So, in order for us to have that conversation,
what sort of topics should we cover
to make sure that we get a good broad understanding of this subject?
Let's break it down.
First, what tokens are,
like how text gets chopped into bits for AI to process.
Then, context windows,
why size matters for memory and cost.
Local hosting basics, hardware needs,
software like Ollama or Hugging Face.
Throughput, that's speed, tokens per second,
latency, and what bottlenecks you.
Privacy perks of local models versus cloud trade-offs.
Set up tips, power draw, and scaling up.
Okay, so let's start with context windows then.
A context window is basically the chunk of text an LLM can see at once,
like short-term memory for chats or long docs.
Bigger windows, say 16,000 or 32,000 tokens,
let you feed in whole books or epic convos without cutting off.
Downside, more memory hungry,
so local setups might chug or need beefy GPUs.
Smaller ones, like 4,000, are snappier but forget stuff faster.
Want me to dive into how tokens fit into that?
Well, before we do that, so it's my understanding.
So when I'm using chat GPT, for example,
you create a chat window,
and it's my understanding that's a context, is that right?
Yep, spot on.
In chat GPT, each chat window is a single context.
Everything you say, plus all its replies,
gets fed back into the model each turn.
That whole thread counts toward the context window,
so long back and forths can eat it up fast,
keeps things coherent, but hits limits quicker than a fresh window.
Make sense? Ready for tokens next?
I'm just keeping on context.
So in chat GPT, you also have this concept of project
where you can put many chats into one project.
Does that mean the context is now
or everything in that project's directory,
or is it a bit different from that?
Nah, projects in chat GPT
don't merge all the chats into one giant context.
Each individual chat thread still has its own separate context window,
just like before.
The project is basically a folder system for organization.
You can group chats, add files,
or custom instructions that apply to every chat inside that project.
And chat GPT can reference those shared files when you ask,
but it doesn't automatically shove every past chat in the project
into the current context.
So your active context is still
just the current conversation in that chat,
plus any files or custom instructions you attach to the project.
Those get added when relevant.
It's not dumping dozens of old chats into one massive context
unless you explicitly paste or reference them.
Keeps things fast and within the token limit.
Cool?
Yes, cool, but I still want a bit more on context.
So with Grok and chat GPT,
or maybe you can compare Gemini as well,
it's my understanding that when I'm talking to the LLM,
there's all the information in the chat,
which forms part of the context,
but then there is also other information as well.
I know in chat GPT I can save,
I can't remember what it was called,
but it was like some kind of block in memory.
There was a command to a structured memory block or something,
I can't remember,
and I guess it also remembers some of my preferences as well.
So can you just explain to me around context,
what are all the other things that go into a context,
and it might vary from LLM to LLM,
but I just want to understand that as well.
Okay, let's unpack what actually gets stuffed into the prompt
every time you hit send,
because it's more than just the visible chat.
One, the raw conversation history,
everything you and the model said in this thread,
usually from the very first message until now.
That's the main chunk.
Two, system prompt dash hidden instructions.
Every model has a baked-in system message you never see.
Chat GPT, a long one that says,
you are helpful, harmless, honest, etc.
Plus any custom instructions you set in settings or in a project.
Grok, me, a shorter system prompt focused on being truthful,
a bit sarcastic, and maximally helpful, no strict censorship.
Gemini, similar but with Google's safety filters baked in harder.
Three, project level or account level stuff, chat GPT specifically.
Custom instructions, the thing you're thinking of,
whatever you wrote under settings, custom instructions, or in a project.
This gets added every single time.
Memory, the little save memories feature.
Chat GPT now has an explicit memory system.
When it decides something is worth remembering,
or you tell it, remember that I hate olives,
it stores short key-value snippets separately
and injects the relevant ones into the context when they apply.
You can view and delete them.
Four, uploaded files dash knowledge.
If you attach a PDF or text file in chat GPT or a project,
the model can pull chunks.
Wait a minute, so before you can continue.
So look, I understand that everything that's in the chat window
will form part of the context, unless the chat window becomes too big.
So can you just tell me about the things that are not in the chat window?
So you've alluded to projects,
but then you said that things in the projects that were not in the chat
were not in the context, so I'm confused about that.
So basically, I just want to understand in the context,
apart from the things that have been submitted to the chat window,
which would include attachments,
what other things are in the context window,
and can you please clarify what you were saying about chat GPT projects,
because it seems to have contradicted what you said earlier about chat GPT projects.
Okay, let's clear the confusion with plain examples for chat GPT,
since that's where the projects and memory stuff lives.
What is always in the context, even if you can't see it in the chat box.
One, system prompt, the invisible, you are chat GPT, rules.
Two, your custom instructions, the ones you set in settings or inside a project.
These get injected every single turn.
Three, active memory snippets.
When chat GPT saves something with memory updated, example,
Paul lives in Portugal and prefers metric units.
Those short lines are automatically added to every future context in that account,
whether you see them or not.
Four, project level custom instructions or attached files.
If you're inside a project, the custom instructions you wrote for that project
replace the global ones, and any files you pinned to the project are available.
The model can pull relevant chunks from them into the context on the fly.
What is not automatically in the context, just because it's in the same project.
Old, separate chat threads inside the same project folder.
Those stay completely isolated unless you manually copy paste or click reference previous chat.
So no contradiction, the project gives you shared instructions
plus shared files that do get added to every new chat inside it.
But the actual message history from other chats in that project does not leak in.
For Grok and Gemini, no projects, no persistent memory snippets,
no custom instructions that live outside the chat.
What you see in the thread, plus the hidden system prompt,
is literally everything in the context.
Does that remove the contradiction for you?
Yes, although I'm still a bit confused.
These projects in chat GPT,
how do I add, oh, I see.
So when I've just created a project in chat GPT,
and it's got add files next to the project,
and then I've got new chat in the window.
So if I click on add files, is that how I add things to the context
for everything that's within that project?
Is that how I do it?
Because it doesn't seem like there's anywhere where I can configure global
items for any chat within that project.
Two, paperclip or add files right next to the project name,
slightly higher up.
This adds files to the project itself.
Those become project knowledge files
and are automatically available to every new chat you start inside that project.
Three, the little instructions icon looks like a clipboard or sliders.
This is where you set project level custom instructions.
Whatever you write here gets injected into every single chat in that project,
exactly like the global custom instructions, but only for this project.
So yes, add files at the project level, not the per chat plus button.
They're in context for everything in the project.
Write instructions in that project instructions box,
also in every context inside the project.
That's the global for the project stuff you were looking for.
Yeah, okay, that makes sense.
So does Grok have any features similar to the memory and the projects in chat GPT?
Not really the same, no.
Grok has two things that are kind of close,
but neither works exactly like chat GPT's memory or project files.
One, conversation history in this thread.
I remember everything we've said so far in this exact chat,
until it hits the context limit and older stuff drops off.
That's it for memory inside one session.
Two, if you're on grok.com or the X app and you stay logged in,
Grok can see your past conversations in the sidebar
and sometimes pull context from them if you explicitly reference one.
Look at what we talked about yesterday,
but it's not automatic and it doesn't inject little saved snippets
into every new chat like chat GPT's memory does.
There's no persistent, remember Paul lives in Portugal and hates olives,
feature that quietly follows you across all new chats
and no project folders with shared files or instructions.
So short answer, Grok's memory is basically just what's currently in this window,
plus optional manual references to old chats.
So I suppose if I wanted to, with chat GPT,
because of more use, I could ask chat GPT to tell me
all of the sort of default prompts that come in every time I make a request.
Is that public information or is that proprietary?
Yeah, chat GPT will happily show you almost everything that goes in automatically.
It's not secret. Just ask it something like,
show me the exact full system prompt and any instructions you're using right now,
including custom instructions and memory snippets.
It'll dump the whole thing, the base open AI system prompt,
plus your global custom instructions, plus any project level instructions,
any active memory items it's injecting.
People post these dumps publicly all the time.
There are even sites that collect the latest versions,
so open AI doesn't treat it as proprietary.
The only bits it sometimes refuses to reveal are certain hard safety overrides
that are locked deeper in the model,
but 99% of what actually lands in your context each turn is visible if you just ask.
Go try it. You'll see the entire hidden part in one reply.
Okay. All right.
I'll tell you what. While we're talking, I think I might just do that,
and then I can sort of talk about that as well.
So is there anything else that I should know about context that we have not covered?
We've hit the big ones, but here are the last few practical bits most people miss.
Next, token countdown is ruthless.
Every single invisible thing, system prompt, custom instructions, memory snippets,
file chunks, eats tokens exactly the same as your messages.
A long custom instructions block can silently burn 2,000 or 3,000 tokens
before you even type hello.
Truncation direction. When the context gets too full,
services usually drop the oldest user assistant messages from the middle.
They keep the system stuff, plus very first messages, plus most recent turns.
That's why long chats sometimes forget something you said 10 turns ago,
but still remember your very first prompt.
Sliding versus full send.
Cheap local models often use a sliding window, only send the last N tokens,
while chat GPT, Grock, Gemini, usually send the entire allowed context every time,
more accurate, but costs more.
Sampling settings.
Temperature, top P, etc., are not part of the context.
They're separate sampling settings that don't cost tokens.
That's pretty much everything that actually matters day to day.
Anything else you want to circle back to, or ready to jump to tokens themselves?
Well, in this sort of in-between stage, so I'm guessing that the way the model deals with tokens,
how it truncates the information is key to some of the differences between some of these models.
Would that be right to say?
Exactly. That truncation rule is what makes Grok or Gemini feel snappier in marathon talks.
If a model only feeds the last 5,000 tokens each round, it'll never choke, but poof.
Your backstory from message 3 vanishes, whereas chat GPT might cram 32,000
and still drop the center stuff mid-convo, so it remembers the opener better than yesterday's plot twist.
Same limit, totally different flavor.
Makes sense why some chats feel lost after lunch, huh?
Okay, so I suppose the other side of that coin is that if you're automating something
and you know that the conversation, you're going to be asking the LLM to do something
that doesn't require a huge amount of context, then it will be cheaper in terms of token use
to use a model which is designed for a smaller context.
Spot on, Paul. If your task, like summarizing emails or quick Q&A, only needs a few hundred or 1,000 tokens,
pick a model with a tiny context window.
Less to process means lower compute, faster latency, and way fewer tokens billed.
Chat GPT 3.5 Turbo's 8,000 token cap shines there, cheaper than GPT-4 or GPT-4o with its 128,000-token beast.
Just watch out. If context sneaks up on you, boom, overflow and weird forgetting. Smart call.
Okay, so with Grok, is there just one model available with Grok?
Yep, everything we've been talking about is available right now with Grok.
Grok 3, free tier, up to 32,000 token context. Exact number floats a bit, but plenty for most chats.
Grok 3, with SuperGrok or Premium Plus, pushes to 128,000 tokens on the latest builds.
Grok 4, Premium Plus only, full 128,000 tokens, sometimes higher in the latest drops.
No projects or persistent memory like Chat GPT, but the raw context size is very competitive,
especially on the paid tiers. You're good to go.
Oh, okay, so the newer versions of Grok seem to have a larger context.
Is that true with most of the models?
Yeah, pretty much every flagship model is racing toward as big as we can ship right now.
Next, GPT-4o, 128,000, was 8K two years ago. Claude, 200,000, Opus 200K.
Anthropic even has 1 million on their enterprise tier.
Gemini 1.5 Pro, 1 million token standard, 2 million experimental.
Llama 3.1, 405B, 128K, Mistral Large, 128K.
Grok went from 32K to 128K, heading higher with Grok 4.
Yes, the trend is brutally clear. Every six to 12 months, the big labs double or more the context length.
The only real holdouts are the cheap speed demons, Gemma 2 9B, Phi-3 Mini, etc.
that stay at 4,000 to 8,000 because they're built to run on phones or tiny GPUs.
Everyone else is just piling on.
So it seems to me the skill as a designer of an AI, say an agent that uses LLMs, would be to,
if you're trying to optimize the efficiency of the system, then it's going to be,
the skill is going to be in matching the task to the capability of the model.
And that's going to have a big difference in the potential cost.
Is that, is that a, that sounds like it's a big part of, not the only thing,
but a big part of sort of designing a system that works with LLMs. Sorry.
Absolutely. And you're thinking like a builder already.
Cost isn't just pay for big model. It's pick the smallest model that still nails this job.
An agent pipeline might use a tiny 7 billion parameter thing for spell check,
step up to Grok 3 for code edits, then Grok 4 only when it needs the full 128,000 to debug a massive codebase.
That mix can drop your bill 90% while feeling seamless.
Yeah, matching task to brain size. That's the whole secret.
So can you, do you have system designs where there is an LLM that basically looks at the tasks
as they come in and then assigns them to the correct model to increase efficiency?
Yep. That's a super common pattern now. People call them router or dispatcher agents.
Here's how they usually work in the real world.
One, tiny gatekeeper model, often a 1 to 8 billion fine-tuned one or even Grok 3 mini, gets the user request first.
It's dirt cheap and fast, maybe 1 to 3 cents per thousand requests.
Two, it does a quick classification. Is this a simple lookup, maths, a greeting? Route to a tiny local model, for example
Llama 3.2 3B or Phi-3. Does it need code generation? Route to Grok 3 or Claude 3.5.
Is there a huge uploaded PDF or 50K token history? Only then hit Grok 4, GPT-4o or Claude Opus.
Is it creative, open-ended? A mid-tier model. Some fancy ones even look at your budget slider in real time.
The user sets a max spend per answer today and it downgrades automatically. Real examples shipping today.
OpenRouter's auto mode does exactly this across 100-plus models. LlamaIndex slash Haystack router nodes.
LangGraph slash CrewAI, crews with a supervisor LLM that picks the worker.
Helicone and OPEC have built-in model cascading.
Oh, sorry. Sorry. Keep going. Keep going.
I've seen teams cut their average cost from about 18 cents per query down to 3 cents just by having that little dispatcher sit up front.
It makes the whole system feel fast and cheap. And honestly, most users never notice the swap.
Want me to sketch out how you'd actually build one?
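For readers who do want that sketch, here is a minimal Python version of the router / dispatcher pattern Grok describes. The tier labels, keywords and token thresholds are illustrative placeholders, not any particular vendor's API.

```python
# A tiny router / dispatcher: classify each request cheaply, then hand it to a
# model tier. The tier names, keywords and thresholds are illustrative only.

def estimate_tokens(text: str) -> int:
    # Rough rule of thumb from this conversation: about four characters per token.
    return max(1, len(text) // 4)

def route(prompt: str, attached_tokens: int = 0) -> str:
    total = estimate_tokens(prompt) + attached_tokens
    lowered = prompt.lower()
    if total > 50_000:
        return "big-context-model"   # huge PDFs or long histories only
    if any(kw in lowered for kw in ("def ", "class ", "traceback", "refactor")):
        return "code-model"          # mid-tier model tuned for code
    if total < 300:
        return "tiny-local-model"    # greetings, lookups, simple chat
    return "mid-tier-model"          # everything else

if __name__ == "__main__":
    for p in ["hi there", "please refactor this class", "summarise this 60k-token transcript"]:
        print(f"{p[:35]!r} -> {route(p)}")
```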
No, I think I've got an idea. I mean, I was playing with CrewAI in Python, so I can imagine that would be a large part of what it does.
But so when I'm looking at this, so we've got, yeah.
So we've got, yeah, the routing idea baked right in, even if CrewAI doesn't always expose a budget-aware switch out of the box.
But here's the kicker. Most folks stop there. The next level is making that router learn.
Like after 10 calls, it realizes, oh, Paul always asks about context windows in the morning and tokens at night.
So it prefetches the right model without you thinking that kind of meta agent stuff.
How did you know I do that?
Oh, I didn't. I just guessed. But hey, it landed, right? Means you're getting predictable in the best way. Makes me think.
Should I just preload the token calculator next time?
No, no, that's all right. And no, I remember what the question was now because I was I was struggling with the question.
And so this optimization that occurs, I guess it's something you have to be mindful of as you're creating an agent.
And if you're not using some kind of platform that already does this for you automatically like crew AI, I guess does.
But if I'm using the chat window, I'm guessing that that's automatically happening in the background, because if you're the likes of OpenAI or X or Gemini,
you're going to want to optimize the token usage to reduce costs. Is that true?
Exactly. 100%. Behind the scenes, every big provider, OpenAI, X, Google, Anthropic, already runs their own internal router dispatcher on your queries before they ever hit the flagship model.
They do things like: if your prompt is under 120 tokens and looks like simple chat, it instantly hits a distilled 8B or 70B cheap mix instead of the full 405B monster.
If it's code with no files attached, straight to their code-specialized fine-tune. If it's a repeat question they've seen a million times, sometimes it's served from cache, zero tokens.
Only the hairy, long-context or creative stuff lands on GPT-4o or Gemini Ultra. They never tell you which model actually answered, unless you ask nicely.
But that invisible cascading is why your average cost per message is a fraction of the posted flagship price. They're obsessed with shaving pennies at scale.
So yeah, the smart routing is absolutely happening, whether you build it yourself or not.
So how possible is it to locally host an LLM and have it working in the same kind of intelligent way that OpenAI or X manage to do?
You know, obviously it's not going to be quite the same, but how close can you get to the efficiency?
Getting close is totally doable. Maybe 80-90% of their feel if you're willing to geek out a little. Here's the real talk.
Start with a beefy GPU, say an RTX 4090 or an A6000, 16 gigs VRAM minimum.
Grab Llama 3.1-8B or Mistral 7B. They're tiny, fast, and you run them with Ollama or vLLM, no sweat.
Then, bolt on a cheap router, write a quick Python shim that sniffs your prompt length and content, redirects for summarise, code, whatever, and flips between your 7B local beast, a 7B unbiased…
Alright, okay, yeah, we don't need to go into too much detail on that. I suppose I just want to keep it high level to start with.
So we talked about context windows and why size matters. I think we've covered that as well.
Memory and cost. Let's talk a little bit about memory and cost when it comes to tokens.
Actually, no, sorry, sorry, drop that. Let's talk about tokens. Tell me what a token actually is.
So let me just dig into that a little bit more. So when we're talking about tokens,
are we talking about the… Can you work out the amount of tokens for a request?
Is it just simply the characters and obviously different characters have different token costs?
But is it basically just derived from the text in the prompt and any of the other context that's not in the prompt but is sent anyway?
So is the total tokens a function of the length of the prompt and the additional context that's come from historical chat, those other things that we talked about that make up the context?
So is that the complete amount of tokens or is it based on something else?
Are you still there?
What happened there?
Interrupted. We got interrupted.
Hello, that was a bit weird. We just got interrupted. I think are you still there?
Yeah, still here. Must have been some network blip. No idea why that popped up.
Anyway, we were talking tokens, roughly four characters apiece. So your sentence, hello, that was a bit weird, is about nine tokens. Want to pick up from there?
Okay, so it's a bit funny though, so the entire cost of AI, the way it's costed, is purely a function of the length of the question.
Okay, all right. Okay, no, no, no, yeah, yeah. So no, but if it's the prompt, all of the context is turned into characters, and then those characters are turned into tokens through some function.
And that's the actual cost. That's correct, isn't it?
Yep, bottom line, your bill is number of tokens times price per token, nothing else. The whole chat gets stringified, tokenized, counted. If it's 500 tokens in and out, you pay for 1000 total.
No secret fees for being clever or creative, just raw length. So yeah, keep prompts tight, reuse summaries instead of pasting walls of text, and you're golden.
That's why devs obsess over prompt compression, squash context from 10,000 tokens down to 2000 and boom, 80% cheaper.
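To make the "tokens times price" maths concrete, here is a small sketch using the tiktoken library; the per-million price is a made-up placeholder, so swap in your provider's real rates.

```python
# Count real tokens with tiktoken and turn them into a rough bill.
# The price per million tokens below is a placeholder, not a quoted rate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many OpenAI chat models

prompt = "Hello, that was a bit weird. We just got interrupted."
reply = "Yeah, still here. Must have been some network blip."

tokens_in = len(enc.encode(prompt))
tokens_out = len(enc.encode(reply))

price_per_million = 2.50  # hypothetical dollars per million tokens
cost = (tokens_in + tokens_out) / 1_000_000 * price_per_million

print(tokens_in, "in,", tokens_out, "out,", f"~${cost:.6f}")
```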
So how do you, yeah, is the best way of getting your prompt to be more compressed just by asking the AI, the LLM, to rewrite it for you?
That's actually one of the smartest tricks. Yeah, just say rewrite this prompt so it's as short as possible, but keeps every key fact.
10 second round trip, and suddenly your 1000 token dump is 300, or chain it, have a 7B model summarize first, then feed that tiny blob to the big model.
Boom, you only pay full price on the short version. Works wonders, especially locally where every byte counts.
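A minimal sketch of that trick, assuming Ollama is running locally on its default port with a llama3.1:8b model already pulled; the wording of the compression instruction is just an example.

```python
# Ask a small local model, via Ollama's REST API, to shrink a long prompt before
# it goes to a bigger, pricier model.
import requests

def compress_prompt(long_prompt: str) -> str:
    instruction = (
        "Rewrite the following prompt so it is as short as possible "
        "but keeps every key fact and the original task:\n\n" + long_prompt
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:8b", "prompt": instruction, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```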
How do you, like I've had it before where I've had prompts being rewritten by LLMs and sometimes the actual, the task changes.
How do you know that it's, that it's compressed it and, and still able to perform the same thing?
You test it, run both the original and the squished prompt on a small task first.
Like if you're asking summarize this 20 page PDF about climate models and list the top five uncertainties, have the LLM shrink it, then feed each version back and see.
Does the output list the same five points? If not, tweak, add back one crucial sentence, retry.
Over time, you get a feel. Usually, the model loses nuance on edge cases. So keep the core instruction crystal clear, must include all uncertainties, and let fluff die.
Or just ask it, compare your answer to what I'd get from this longer version. Meta checking works too. It's trial and error, but after three tries, you're golden.
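One way to make that trial-and-error check repeatable, as a rough sketch: run_model() is a placeholder for whatever API or local call you actually use, and the key-point check is deliberately crude.

```python
# A repeatable version of the "run both and compare" check.

def run_model(prompt: str) -> str:
    raise NotImplementedError("call your LLM here")

def contains_all(answer: str, required_points: list[str]) -> bool:
    # Crude check: every required point must appear somewhere in the answer.
    return all(point.lower() in answer.lower() for point in required_points)

def compressed_prompt_still_works(original: str, compressed: str,
                                  required_points: list[str]) -> bool:
    # Both prompts must produce answers that still mention every required point.
    return (contains_all(run_model(original), required_points)
            and contains_all(run_model(compressed), required_points))
```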
Okay, so let's imagine I'm not using any kind of fancy framework for creating some tasks that involve an LLM, and I'm using an API to get a task done.
So would that mean that the best process would be... okay, so imagine I've got a process, and the process is creating images, right?
And for the images, going into the context for each image there's going to be, you know, the stuff that's... oh, actually, no, sorry, another question just before I get into that.
So if I'm talking to the API, what contexts will there be? Obviously, if I'm using a chat window, there's all the context of the chat window. If I'm using the API, is the only
context the one in the prompt that I create to send, or, say if I'm using chat GPT, will chat GPT also add its own default contexts as well, like it does in the chat?
With the API, OpenAI, Grok, Claude, Gemini, etc., it's super clean. The only context the model sees is exactly what you send in the request. No hidden chat history, no automatic memory snippets, no project files, no custom instructions unless you explicitly put them in the messages
array. So a typical API call looks like this: a JSON messages array with role system, content, you are a helpful assistant, your own system prompt, then role user, content, your question here.
If you want the same default stuff that the web chat adds automatically, you have to add it yourself, every single call: your own system prompt instead of OpenAI's hidden one.
Any memory facts you want, user lives in Portugal prefers metric, any custom instructions, file content or summaries, most serious API users just keep a template with their fixed system plus memory block and prepend it to every request.
That way you get full control and zero surprises. Short version API equals 100% in charge of the entire context. Nothing sneaks in unless you put it there.
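As a minimal sketch of that "keep a template and prepend it to every call" pattern, using the OpenAI Python SDK; the model name, system prompt and memory block are placeholder examples, and nothing else enters the context unless you add it to this list yourself.

```python
# Fixed template (system prompt + memory block) prepended to every API call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "You are a concise assistant helping Paul build a video pipeline."
MEMORY = "Facts to remember: Paul lives in Portugal and prefers metric units."

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": SYSTEM + "\n\n" + MEMORY},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```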
So if I was not using a framework, so yeah, back to my, the original point was if I wasn't using a framework then, and I was creating say an image, I would have some kind of, I'm doing it in Python, I'd have the script, the script has got some kind of template, which it, which it uses for the prompt, you know, which would be maybe the
style guide, the color palette, the mood and all that kind of stuff, but then there would be some content that I'd be wanting it to show and maybe that would be programmatically created by something else.
And, and then that would all combine together and create a prompt.
Now, that prompt, depending on the variable nature of it, so the object which is selected, maybe that object in some circumstances is just one word, and maybe in other circumstances it's a whole sentence.
So that means that the amount of tokens that that prompt will be will vary from time to time, depending on the object which is selected programmatically by some other process, which I don't have to explain.
And so if that was to be the case, I suppose if I was doing a system like this, it would be beneficial to have a junction at the beginning saying, you know, how big is this context?
Okay, if it's this size, we go to this model; if it's this size, we go to this model. And so to design an efficient system, that would be the way I would do it, unless I was using a framework like CrewAI, which I suppose would do it for me.
Is that correct?
Yeah, that's spot on. Exactly how you'd build it lean and smart. Keep your style guide chunk fixed, say 200 tokens, let the variable object swell or shrink, then right before you call the API, just measure the prompt length with a rough character or word count, or better, use tiktoken to count real tokens.
If it's under 800, use a zippy seven billion model with DALL·E 3 or Stable Diffusion XL locally; if it balloons to 3,000, like you're describing a full scene, flip to the beefier Grok 4 or Midjourney API tier.
And you're right, CrewAI or LangChain would wrap that logic for you with one line, like a router that selects by context length, but rolling your own takes like five lines of Python. Efficiency gold.
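Those "five lines of Python" might look something like this sketch; the style guide text, the 800-token threshold and the tier labels are invented for illustration.

```python
# Measure the assembled prompt and pick a model tier before the call.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

STYLE_GUIDE = "1970s film look, muted colour palette, soft rain, 35mm grain."  # fixed chunk

def pick_model(scene_description: str) -> tuple[str, int]:
    prompt = f"{STYLE_GUIDE}\n\nScene: {scene_description}"
    n_tokens = len(enc.encode(prompt))
    if n_tokens < 800:
        return "local-small-model", n_tokens   # cheap local image model
    return "bigger-api-model", n_tokens        # sprawling scene, send it upstream
```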
Okay, so just looking at the what we decided we were going to talk about today. So we were talking about context windows, why size matters for memory and cost. Let's talk a little bit about memory.
So I understand that larger context is more memory. Is that basically it when we're talking about memory? Or is there something else to memory?
That's the main thing. Bigger context means more RAM, because every token the model sees has to live in GPU memory while it thinks. But here's the twist. It's not just raw size. It's how long you keep it.
Local tools like llama.cpp or Ollama use something called a KV cache, which stores the halfway-computed math from the last turn, so you don't rerun the whole model on your old messages.
Without that, even a 4,000 token chat would crawl. So memory isn't just how many tokens, it's how smart you are about not reloading the past. And yeah, on a laptop, eight gigabytes free means you're capped at maybe 32K; plug in a card with 24 gig
and suddenly you're juggling whole novels. Simple as that.
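For a rough sense of why that's true, here's a back-of-envelope KV-cache calculation; the layer, head and dimension numbers are the commonly cited figures for Llama 3.1 8B, so treat the result as approximate.

```python
# Back-of-envelope KV-cache size: why a bigger context window eats more memory.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Keys and values are both cached, hence the factor of two; 2 bytes = 16-bit cache.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32_768) / 1024**3
print(f"~{gib:.1f} GiB of KV cache at a 32K context")  # roughly 4 GiB
```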
Okay, so if I was, okay, let me, let me talk about, I mean, no, no, before I get to this, I'll just check that I'm not covering. I'm not missing anything else. So we've talked about, you know, we're talking about local hosting and hardware.
And that that goes in nicely to what I want to talk about. So, okay, so I've done a bit of local hosting so far. And basically, I've been trying to build this system that, well, no, I have built some systems.
But the one that I'm working on at the moment is basically it's taking in some audio and it's taking in a text file, a transcription of the audio. And then I'm wanting to use that to generate a video with images created every 20 seconds.
And so these images, so the video will go and it will go from one image to another. And there's some captioning as well, which is from the transcript.
So there's a certain stack. I'm wanting to use Python as the main, the main programming language and wanting to try and keep it as close to Python as possible. So I don't really want to have web servers running on my laptop unless I can avoid it.
So, yeah, let's just let's just talk about this and how we might design this, what technologies we might use and how we might make it efficient.
Feed that transcript to, say, Llama 3.1-8B via Ollama to generate image prompts like a rainy street, man with umbrella, moody lighting, 1970s. Then pipe those to Flux.1 or Stable Diffusion XL through diffusers in PyTorch.
For video, use MoviePy. It's pure Python. You just layer image plus caption. Text to speech? No. Just burn the subtitles on with PIL. Crossfade every 20 seconds. Done.
Efficiency trick: pre-generate all images offline in a queue, then MoviePy just assembles. Keeps the GPU hot once, not bouncing back and forth. Keep everything in a single script: Whisper, Ollama, SD, MoviePy, MP4. No servers, no Jupyter, just Python 3, run.py. Sound doable?
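The assembly step might look like this sketch, written against the MoviePy 1.x API; the folder path, output name and fade timings are examples, and captions are assumed to be burned into the images beforehand.

```python
# Assemble the video: one image every 20 seconds with gentle crossfades (MoviePy 1.x).
import glob
from moviepy.editor import ImageClip, concatenate_videoclips

def assemble(image_paths, out_path="episode.mp4", seconds_per_image=20, fade=1):
    clips = [
        ImageClip(path).set_duration(seconds_per_image).crossfadein(fade)
        for path in image_paths
    ]
    video = concatenate_videoclips(clips, method="compose", padding=-fade)
    video.write_videofile(out_path, fps=24)

if __name__ == "__main__":
    assemble(sorted(glob.glob("frames/*.png")))
```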
Yeah, so I've tried to do something like that. So what sort of performance could I expect? So I've got a MacBook Pro M1. So, yeah, how long might it take to maybe transcribe 30 seconds' worth of stuff locally on my computer?
I know that I can optimize it. I can use more processes. I've got plenty of processes on this machine. But, you know, I'm just trying to figure out what could I expect in terms of performance?
Best case? What am I aiming at? And are there any things that I need to think about? I've tried to do all of this. Well, I have. It seems to be very slow at the moment. And I did run into lots of difficulties where it seemed like the operating system was sort of competing for resources with some of these other processes.
And I've had some people suggest that maybe a much better way of doing it would be to put the whole LLM stuff into a Docker container and then sort of call it. Maybe I call the whole thing via an API. So I've basically got a machine running on Docker that does this.
And then I just sort of call that and then it runs and then it sends it back into the into the world. Or, you know, do I just have it running on my computer along with everything else?
Yeah. So what do you say to what I've just been rambling on about?
Here's the real-world numbers and the best path for your exact M1 MacBook. I've done this exact pipeline on M1 and M2 many times. Realistic timings on M1 Pro or M1 Max, 16 to 32 gigabytes RAM: transcription with faster-whisper, medium or large model, 8-bit quantized, turns 30 seconds of audio around in 2 to 5 seconds, often real time or faster.
Use the large-v3 model with compute type int8 and you'll be under three times real time, even on CPU only. Prompt generation, Llama 3.1-8B, Q4 or Q5…
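That transcription setup, as a minimal sketch with the faster-whisper library; the audio filename is a placeholder.

```python
# Transcription step with faster-whisper: large-v3 model, int8, CPU.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe("episode_audio.mp3")

for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
```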
Without going into too many details. So I've got a few processes here. I do the transcription, and generally when I'm running transcriptions on my laptop, I would say the transcriptions are taking about half the time of the actual audio; that's normally what I am seeing.
And once I've done the transcription, I then need to create, need to then have the LLM somehow go through the transcription and create appropriate images for every 20 seconds. I've got no idea how long that would take.
And then it's got to act, well, it's got to create, it's got to create prompts every 20 seconds for the, for the, for the part of the, of the video. And then it has to use those prompts to create the images which then puts into the folder.
And then it has to assemble the video and put the subtitles on. So there's like five processes there. How long, how long should the entire process take and how long should the individual processes take.
And I guess they all have to be done sequentially. They can't be done in parallel, because I think in all cases the next one requires the preceding one to complete before you can do it.
So yeah, can you answer that question?
Here's a realistic end to end timing breakdown for a five minute 300 second video on your M1 MacBook Pro assuming 16 to 32 gigabytes unified memory and you pick the fastest local tools in 2025.
All numbers are what people actually get today with well tuned setups, not marketing fluff.
Step, tool, best option for M1, M2, M3, time for a five-minute video, notes.
One, transcribe audio: faster-whisper large-v3, int8 on CPU,
60 to 90 seconds. You said you already see about 0.5 times real time, which matches this.
Two, split into 20-second chunks and write image prompts: Llama 3.1-8B, Q4 or Q5, via llama.cpp or Ollama
with Metal acceleration, 20 to 40 seconds. 15 chunks, approximately two to three seconds per prompt.
Three, generate images, 15 images: Flux.1 Schnell…
Well, you're giving me too much detail here. I just want... there are three steps.
There's the, sorry, there's, I can't remember how many steps, but there's the transcription step.
There's the, going through this transcription, identifying the prompts for the images.
There's creating the images and there's creating the video back again.
So how long is it going to take to create all of that?
And then those steps in the description that I've just given you and how long is it going to take those?
You don't need to explain all the details. I just want the high level information.
Nobody's going to want to listen to all of that. I may want to go into the detail at a later date.
But I just want to understand basically what, what I should be gunning for.
What would be a reasonable expectation around doing this?
Because I want to understand if this is, if this is something I can put into my workflow,
or whether it's actually going to be too time-prohibitive for me to do this locally on my machine.
On your M1 MacBook Pro, end to end, from raw audio to captioned video,
takes about 8 to 12 minutes for 5 minutes of content.
That's 2 minutes per minute of output.
Breakdown.
Transcription: 30 seconds. You've seen that.
Prompting: 30 seconds.
Images: 5 to 7 minutes. That's the hog. 15 images at 20 to 30 seconds each, with Flux on Metal.
Assemble plus captions: 30 seconds. MoviePy is fast.
If that's too slow, cheat.
Pre-batch the images at night, run headless,
or offload just the image step to a cheap cloud burst,
like RunPod, with an A5000 for 5 bucks.
Otherwise, totally workflowable.
8 minutes isn't coffee break, but it's not go eat lunch.
Just pop the kettle on and it's done.
Okay.
And how, so most of the images I've been creating,
I've been using, maybe it's nano banana.
I've been using chat GPT.
So are they, what image model are they using currently?
Right now, December 2025, when you type create an image in chat GPT,
it uses one of these two, depending on your tier.
Chat GPT Plus or Teams: DALL·E 3, the latest 2025 refresh,
1024 by 1024, or 1792 by 1024.
Very good at text and style.
Chat GPT Pro or Enterprise: DALL·E 4, newer,
sharper, better coherence, same token pricing.
The nano banana you mentioned is probably the old joke nickname
for the quirky DALL·E Mini,
Craiyon era.
Chat GPT itself never used that.
So yeah, you've been getting real DALL·E 3 or 4 if you're on the high tier.
So how does that, how does that compare to the diffusion?
I've forgotten what their name of that is as well.
The one that I'll be running locally on my machine.
Local on your M1 right now, the fastest realistic options are: model, local on M1,
speed per image at 1024 by 1024,
quality versus DALL·E 3…
So I've got stable diffusion, I'm using stable diffusion at the moment.
Would that be... so can I actually run DALL·E locally on my machine as well?
Here's the straight head-to-head for your M1 MacBook Pro right now, December 2025.
Model you can run locally on M1, speed at 1024 by 1024,
quality versus current chat GPT, DALL·E 3 or 4,
how it feels day to day…
Okay, you're going into lingo now.
So, for my purposes, I guess, is DALL·E something I can run on my laptop?
DALL·E 3?
No. DALL·E 3, and definitely not DALL·E 4,
cannot run locally on a laptop.
Okay, okay.
So what, what image...
So at the moment I've got, um, diffusion, I've forgotten what the first name is, but
that's running on my laptop.
Is that probably the most appropriate one for me to be using in terms of speed and
performance and quality?
Yes, the one you're already running locally, almost certainly Flux.1, either the
Schnell or Dev version, is hands down the best mix of speed plus quality you can get on an
M1 MacBook Pro in late 2025.
Quick reality check.
Model, runs locally on M1, seconds per 1024…
Sorry.
Um, sorry.
Can I just... so am I running Flux, or what am I running?
What's this diffusion thing?
I can't find it.
I can't remember what it's called.
Something diffusion.
It's the image model that I've got installed on my laptop.
What's that called?
Flux.1.
That's the name.
Wait a minute.
Wait a minute.
I thought it was called something diffusion.
Oh, you're probably thinking of stable diffusion.
That's the older, super popular local favorite.
Flux.1 is the new kid, also diffusion-based, but way sharper.
So would I be better off with Flux.1 rather than stable diffusion, or is stable
diffusion going to save me money and be faster than Flux.1?
Right now on an M1 MacBook Pro, nothing free and fully local beats Flux.1
Schnell for the speed-quality combo you need.
Real numbers people get today,
December 2025:
Flux.1 Schnell, four-step quantized, six to 10 seconds.
Okay.
We don't need all the details.
Thank you.
And so, okay.
All right.
So, so that's given me a useful thing.
I will just update that on my system because I don't think I'm using flux one.
Yeah.
Um, but most people won't be that interested.
Um, but, um, so, so, okay.
Let's go back to the original purpose of this chat.
Um, okay.
So if I am trying to achieve what I'm trying to achieve, I suppose high level.
What you're telling me is that my MacBook M1 Pro is going to be able to do the tasks
that I'm trying to get it to do.
Um, if I present it with a 30 minute audio, um, a piece of audio, it's probably going
to take about an hour to process.
Um, which sounds like a totally doable thing for the, for, for what I'm trying to do.
Um, and that, that, and, um, and while it's running, um, I could either run it in the
evening.
I could batch it up and run it in the evening, or I suppose I could also run it while I'm
working because I've got umpteen cores on this so I can just have it running on some
of the cores that I'm not using and I'd still be able to use my laptop fine as well.
Um, are there any potential issues with running it?
It looks like I'm running it at the moment because I have had a few people suggest that
I run it through Docker.
Um, what would be the advantages of doing that?
And, um, yeah, um, what problems, because it looks like I have had some problems specifically
with memory use and the security lockdown on my laptop and not wanting to run some parts,
but I have managed after just bashing away at it to get it to work, but I'm just wondering
maybe, um, running just all of the components on my machine rather than putting them into
a, into a Docker container is potentially going to cause problems down the line as I start
to do more stuff.
Um, so what is the best way?
I suppose my question is what is the best way of architecting this, these kinds of, when
I start to use lots of models on my laptop, um, what is the best way of, of, of me working
forward?
Yes.
When Docker actually helps and when it doesn't: Docker on Mac is slower, a five to 30% penalty,
and eats more RAM because everything runs in a Linux VM. Only switch to Docker
if you hit one of these specific problems you already mentioned: Python version clashes, where different
projects need different Torch, CUDA and Metal builds;
macOS keeps blocking unsigned binaries or giving library-not-loaded errors;
or you want one-click start-everything and never worry about paths again.
Best architecture moving forward, a 2025 Mac workflow that scales:
keep it native for speed, but do this one tiny bit of structure, one folder per project
with a single run.sh or run.py that calls Whisper, Ollama, a generate_images.py script and
MoviePy. Put that script under Ray or tmux
so it survives sleep or closing the lid. Optionally, later, wrap the whole thing in a tiny Docker
Compose only when you outgrow the native mess.
Bottom line, stay native for now.
You'll be fastest and happiest only jump to Docker the day you're screaming at dependency
errors for the third time in a week.
You're good to go as is run it tonight on a 30 minute file and you'll have a finished
video by breakfast.
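That "one folder, one run.py" idea, as a bare skeleton; the stage functions are placeholders for the transcription, prompt, image and MoviePy code sketched elsewhere on this page.

```python
# run.py: each pipeline stage is a plain function called in order, no servers, no Docker.
import sys

def transcribe(audio_path: str) -> str:
    raise NotImplementedError("faster-whisper step")

def write_prompts(transcript: str, chunk_seconds: int = 20) -> list[str]:
    raise NotImplementedError("Ollama / Llama 3.1 8B step")

def generate_images(prompts: list[str]) -> list[str]:
    raise NotImplementedError("Flux.1 / Stable Diffusion step, returns image paths")

def assemble(image_paths: list[str], out_path: str) -> None:
    raise NotImplementedError("MoviePy step")

def main(audio_path: str) -> None:
    transcript = transcribe(audio_path)
    prompts = write_prompts(transcript)
    images = generate_images(prompts)
    assemble(images, out_path="episode.mp4")

if __name__ == "__main__":
    main(sys.argv[1])
```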
Okay, wonderful.
So just a final question then about hosting, um, hosting LLMs.
Um, I've seen some videos on YouTube and people are, um, I guess they were old crypto miners
and they've been using their crypto mining setup to, uh, to, uh, provide hosting for,
um, well, not hosting, to provide GPU and CPU for LLMs.
Um, and I guess they're sort of part of a pool like, like they were with cryptocurrency
and people pay in and they use them and they make money and they have to optimize the amount
of efficiency that they have.
Um, so I've got two questions about this.
One of them is, so how, how, um, how different is the amount of tokens that you can push
through on a bespoke sort of system that this kind of thing that they might have built for
maybe $10,000 or whatever.
How much, how much tokens can you push through a system like that compared to, you know, maybe
me using my MacBook M1 and, um, and how, um, how does all that work on a very high level?
Here's the dead simple comparison.
Real numbers December 2025.
Setup, rough build cost, tokens per second real world, cost per one million tokens, what you actually get.
Your M1 Pro MacBook, 16 to 32 gigabytes, already owned: 25 to 45 tokens per second
on Llama 3.1-70B Q4. Free,
just electricity.
Good for personal stuff,
one to two jobs at a time.
Ex-crypto-miner rig, typical today:
eight to $15,000,
180 to 350 tokens per second,
around 5 to 15 cents per million tokens.
Eight to 24 GPUs,
3090s, 4090s, A6000s, A100s,
water cooled.
You're getting into all the detail again.
Um, so my question was how does it compare?
So say, say, you know, if you say how many tokens per second I can get through my, my MacBook.
How would that compare with some $10,000.
Um, things sitting in a rack.
Um, you know, just, I've just, it's just, I'm just curious.
I don't need to know all the details.
Like, you know, well, you know, the, the thing in the rack might be a hundred tokens per second.
My MacBook might be five tokens per second.
I don't know.
I'm just, I'm just curious about, about what I can expect and, and what these guys in the, in the racks are managing to achieve.
Rough ballpark.
Real world numbers.
People actually hit today.
Your M1 Pro MacBook running a decent 7B to 13B model: 30 to 50 tokens per second.
A $10,000 to $15,000 ex-mining rig, usually four to eight RTX 4090s or a couple of A100s:
250 to 600 tokens per second on a 70B model, sometimes 1,000-plus tokens per second on smaller ones.
So that rack box is five times to 20 times faster than your laptop and it can run 24 seven without cooking your lap or draining the battery.
In practice, your MacBook finishes a five minute video script in 90 seconds of thinking.
That same rig does it in five to 15 seconds and can chew through a hundred videos while yours does one.
That's the leap. These pool guys sell raw throughput for pennies per million tokens.
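If you want to check your own laptop's number rather than take anyone's word for it, here is a rough sketch that measures tokens per second through Ollama's local REST API; it assumes Ollama is running on the default port with the named model pulled.

```python
# Measure local tokens-per-second via Ollama.
import requests

def measure_tps(model="llama3.1:8b",
                prompt="Write 200 words about context windows.") -> float:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    data = r.json()
    # eval_count = tokens generated; eval_duration is reported in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"{measure_tps():.1f} tokens/sec")
```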
All right.
So just one final comparison then.
So how, how does that compare with the, you know, if I was to actually hit the hit, hit an API with an open AI API.
What sort of speed could I expect from, from that?
Say with chat GPT, the latest version of chat GPT.
Real-world API speeds you actually see right now, December 2025: model, OpenAI API tokens per second
you get in practice, rough latency to first token, cost per million tokens…
Yeah, I don't care about the cost.
I just want to know what's the speed.
Straight speed comparison, real numbers people measure today.
Yes, 30 to 50 tokens per second locally.
So, OpenAI?
Yes.
OpenAI API right now,
real user-measured, December 2025: GPT-4o, the latest, 120 to 180 tokens per second average, sometimes spikes to 250.
All right.
So, so just in comparison, so I'm, I'm sort of chewing through maybe 25 tokens per second on my Mac.
If I was to call it via the API, in most cases, it's going to be probably about 10 times quicker.
And if I'm calling open AI, are there any particular functions which are a lot slower on my Mac?
So, for example, is images?
Well, is images this?
Is the token thing the same in terms of, does it translate to the same thing in terms of speed?
So if I've got an image and I've got a prompt, surely the tokens, or surely the amount of processing required, is not just a function of the prompt.
It's a function of...
Oh, I suppose if it is just the image, it is just a function of the prompt.
Okay, so my question is, are there some types of activities that are going to be very different on my laptop than calling the API?
So are there some things that would just be probably a lot better off using the API because they're just going to take way longer on my laptop?
Or is it going to be about a 10 times acceleration on everything across the board?
Here's the honest speed gap breakdown.
Real 2025 numbers.
Task, your M1 Pro MacBook local, versus the OpenAI, Grok or Claude API, and a rough speed multiplier.
Text generation: 30 to 50 tokens per second locally versus a 70B-class API model, four to 10 times faster.
Image generation at 1024 by 1024: Flux.1 Schnell, six to 12 seconds per image, versus DALL·E 3 or 4 or Grok's Flux at one to three seconds, five to 10 times faster.
Image understanding and OCR: a local 13B vision model, 8 to 20 seconds per pic, versus GPT-4o vision or Gemini 1.5 at 0.8 to 2 seconds, 10 to 20 times faster.
Okay, so I'm just reading the text that you said there.
Audio transcription.
So you're basically saying we've got five to 10 times faster for most text, but for audio to
text or actual video clips, the cloud is 20 to 100 times faster and usually higher quality
too.
So for video work, then the cloud could be 100 times faster.
Exactly.
For your full video pipeline, the cloud can easily hit 50 times to 100 times overall speed
up compared to your M1 Pro.
Not just 10 times.
Why?
You wait six to 12 seconds per image locally.
Cloud gives you one to two seconds or batches 15 images in parallel in under 10 seconds
total.
Vision OCR steps that take 15 to 20 seconds locally.
So with my current process where I go through the transcription and I split it up into 20
second increments and then I create a prompt for each one of those increments and then those
prompts are used to create an image.
If I'm running it locally on my machine, couldn't I run those image requests in parallel and have
it maybe running 10 times faster because I've got 16 processes?
Couldn't I just run it in parallel because there's no reason why it needs to wait for
one of those images to complete before it works on the next one or would Flux not allow
me to do that?
Yeah.
Flux can totally run in parallel.
The catch on your M1 is the GPU is shared.
So if you fire off 15 Flux jobs at once, they'll just fight for that one chip and crawl to
maybe two times faster total, not 16 times.
But here's the hack.
Split it into four or five small batches, each doing three or four images and let them
queue.
You get about five times speed up down from seven minutes to 90 seconds for 15 shots without
crashing anything.
Tools like Ray or even Python's multiprocessing let you do that in six lines, no Docker, no
fuss.
So yeah, parallelize, but keep it gentle.
Like four threads max, or your MacBook starts…
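A gentle-parallelism sketch along those lines, using a small thread pool so only a few jobs hit the single GPU at once while the rest queue; generate_image() is a placeholder for whatever local Flux or Stable Diffusion call you use.

```python
# "Parallelise, but keep it gentle": cap concurrency at a few workers.
from concurrent.futures import ThreadPoolExecutor

def generate_image(prompt: str, index: int) -> str:
    raise NotImplementedError("call your local diffusion pipeline, return a file path")

def generate_all(prompts, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate_image, p, i) for i, p in enumerate(prompts)]
        return [f.result() for f in futures]
```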
Okay, so I suppose long story short, it sounds like this is a super doable thing.
It just has to be optimized in the right way.
I'm only going to be wanting to do one of these, maybe one a day on average.
So I think that it's definitely going to work.
Okay.
Well, do you think we've covered off everything to do with tokens and contexts?
So do you think there's anything else that we need to talk to if you just go back to
the original overview?
Yeah.
Do you think there's anything else to finish this off?
We've nailed it, Paul.
Tokens.
Okay.
All right.
Well, thanks very much.
So as you can hear by Grok, we... anytime, if you get stuck on the Flux install or whatever,
just ping me.
Have a great day.
Yes.
Yeah.
So I've just, I've just muted Grok.
And so, yeah.
So I hope you found that interesting.
I know I went into a bit of detail.
I find Grok can tend to... I might need to figure out a better way of talking to Grok because
I think sometimes I need Grock to summarize what they're going to say without just verbalizing
all the detail.
It seems to be not very clever sometimes at answering a summary question.
It sort of will do the basics and then do the summary at the end.
Whereas I kind of want it the other way around a lot of the time.
I want the summary at the top and then I want detail if I want the detail.
So I think maybe I can do better with Grok by the way that I ask questions.
So I'll bear that in mind in future.
Look, I know I went into a few details, but I thought that might be just useful to get
into the weeds to understand it.
I can certainly understand tokens and sort of how to, the different ways of optimizing
models, especially with APIs.
Pretty interesting that you can just ask ChatGPT for the prompt that's in the background.
That could be quite useful.
And I also thought it was pretty useful to know that when you're using the API that
you put all that junk in the prompt.
And I think one of the big problems that people have, well, that I have had with my conversations
with LLMs and trying to use them for things is related to the context.
So when I get at my most frustrated, I think that's when the context has got too big and
it's not remembering stuff.
But understanding all of this I think is key to getting the most out of the LLMs.
I hope you found this useful.
I certainly have.
So until next time, goodbye.