What actually IS a large language model (LLM)? What does it do under the hood?
At its core, an LLM is a next-word predictor. You give it some text, and it predicts the most plausible next chunk of text, tacks it on, then predicts again, looping until it's done. That's the whole trick. There's no database of facts it looks up and no rules engine deciding things. During training it read a huge pile of text and learned the statistical patterns, and now it replays likely continuations. The surprising part is that predicting the next word well enough, across trillions of examples, is enough to produce working answers, code, and explanations. Think wildly sophisticated autocomplete, not a search engine.
#llm#prediction#fundamentals
Basicsconcept
As a web dev, how should I think about calling an LLM in my app — what's the mental model?
Treat it like a stateless REST endpoint you POST text to and get text back. You send a request (your instructions plus the user's text and some settings), and it returns a response. The key thing: it's stateless, just like an HTTP request — the model remembers nothing between calls. If you want a 'conversation,' YOU resend the whole history each time, the same way you'd rebuild state from a session store on every request. It's also non-deterministic by default (the same input can give different answers) and you're billed per chunk of text. So: an HTTP call, text in, text out, no built-in memory, pay-per-use. That's the shape you're integrating.
#llm#rest#stateless#integration
Basicsconcept
What is a 'token,' and why should I care about it as a builder?
A token is the unit an LLM reads and writes — a chunk of text, roughly 3 to 4 characters of English. 'Hamburger' might be one token; a long word like 'antidisestablishment' splits into several; spaces and punctuation count too. A loose rule of thumb: about 750 words is roughly 1,000 tokens. You care because tokens are two things at once: the billing meter and the size limit. Providers charge per million tokens (input and output counted separately), and every model has a maximum number of tokens it can handle in one call. So token count directly drives both what you pay and whether your request even fits.
#tokens#cost#limits#fundamentals
Basicsconcept
What is the 'context window,' and how is it like the model's short-term memory?
The context window is the maximum amount of text — measured in tokens — the model can consider in a single call: your instructions, the chat history, the user's question, AND the model's own reply, all combined. It's the model's short-term, in-the-moment memory; anything outside the window simply doesn't exist for that call. The web-dev analogy: it's like a function that can only see its arguments — whatever you didn't pass in, it can't use. As of 2026 many models offer around 200K tokens, and flagship models reach about 1M tokens (roughly a 750,000-word book). Go over the limit and the call either errors or silently drops older text.
#context-window#memory#limits#tokens
Basicsconcept
Why do LLMs 'hallucinate' — confidently make things up?
Because a hallucination isn't a bug in the usual sense — it's the model doing exactly its job: predicting plausible-sounding next words. It has no concept of 'true' versus 'false' and no fact lookup; it generates whatever statistically fits, even when no real fact exists. So it'll invent a citation, an API method, or a person with total confidence, because the made-up answer reads just as fluently as a real one. The misconception to drop: it isn't 'lying' or 'confused' — it never knew in the first place. That's why you never trust an LLM for facts without grounding it in real data you supply or verify yourself.
#hallucination#reliability#facts
Basicsconcept
What does it mean when a model is called '7B' or '70B' parameters?
Parameters are the internal numbers the model adjusted during training — the dials that encode what it 'learned.' '7B' means 7 billion of them, '70B' means 70 billion. Loosely, more parameters mean more capacity to capture patterns, so bigger models tend to be smarter and handle harder reasoning — but they cost more, run slower, and need more memory and GPU power to host. You don't tune these yourself; they're baked in at training time. The web-dev analogy: it's a bit like a compiled binary's size — a rough proxy for capability and resource cost, not something you edit. Note that the big closed models (the ones you call over an API) usually don't even publish their parameter counts.
#parameters#model-size#capability#cost
Basicsconcept
Who makes the major model families, and roughly how do they differ as of 2026?
On the closed side: Anthropic makes Claude (strong reasoning, long documents, and coding; tiers like Opus, Sonnet, and Haiku, plus a top frontier tier). OpenAI makes GPT (broad ecosystem, fast and multimodal; as of 2026 the latest is a GPT-5-generation family). Google makes Gemini (deep integration with Google Cloud and Search). On the open-weight side, Meta makes Llama, alongside Mistral and others you can self-host. As of 2026 they're broadly comparable for everyday work, so you usually choose based on cost, speed, privacy needs, which cloud you're on, and fit for the task — not on one being universally 'best.'
#claude#gpt#gemini#llama
Basicsgotcha
Does an LLM 'learn' from my prompts or remember users between sessions?
No — and this trips up almost every beginner. The model's knowledge is frozen at training time; your API calls don't teach it anything, and it forgets everything the instant the request ends. There's no per-user memory built in. If your app needs to 'remember' a user's preferences or past chats, YOU store that (in your DB) and inject it into future requests. There IS a separate process called 'fine-tuning' that can bake new behavior into a model, but that's a deliberate, offline training step on your own data — not something that happens just by chatting. Default assumption: every call starts from a blank slate, like a brand-new stateless request.
#memory#statelessness#fine-tuning
Basicsconcept
What is a 'token' in the context of an LLM, and why can't I just think in words?
A token is the unit an LLM (large language model — the AI that predicts text) actually reads and counts. It's a chunk of text, usually part of a word: roughly 3/4 of a word in English, so 100 tokens is about 75 words. Common words are often one token; rarer words, code, and punctuation split into several. Think of it like how a database stores bytes, not 'words' — the model has its own internal unit. You care because both what you pay and how much you can send are measured in tokens, not characters or words.
#tokens#llm#basics
Basicsconcept
What is the 'context window', and why is it like the model's short-term memory rather than a database?
The context window is the maximum number of tokens the model can look at in a single call — your prompt plus its reply must both fit inside it. As of 2026, typical windows run from ~128K tokens up to 1M+ for the biggest models. Think of it as RAM/short-term memory the model holds while answering one request, NOT a database it keeps. Crucially it's stateless: nothing carries over to the next call automatically. Like a REST endpoint that forgets everything between requests, each call starts blank unless you resend the history.
#context-window#tokens#statelessness
Core ideahow-to
How do I estimate cost for an LLM feature before I ship it?
Cost = (input tokens × input price) + (output tokens × output price), priced per million tokens, with input and output billed separately. Output is usually a few times pricier than input. As of 2026, e.g. a mid-tier model like Claude Sonnet 4.6 runs about $3 per million input tokens and $15 per million output; a cheaper tier like Haiku 4.5 is around $1 and $5. So a request with 2,000 input tokens and 500 output on the mid-tier costs roughly (2000/1,000,000 × $3) + (500/1,000,000 × $15), about $0.006 + $0.0075 = $0.014. Multiply by your expected request volume. The trap: long instructions and chat history you resend every call quietly dominate your input cost.
#cost#tokens#pricing#budgeting
Core ideahow-to
If the context window is the model's only memory, how do chat apps 'remember' the conversation?
They don't rely on the model — they replay the history. Each turn, your backend resends the prior messages as part of the request, exactly like rebuilding state from a session store on every call. So a 'conversation' is really an array you keep growing and re-POSTing: user message, assistant reply, user message, and so on. The model re-reads the whole thing each call and continues. Two consequences fall out of this: long chats cost more every turn (you're paying to resend the history), and once the history outgrows the context window you have to trim or summarize old turns. The memory lives in YOUR code, not the model.
#conversation#context-window#state#chat
Core ideahow-to
How do I reduce hallucinations in a feature I'm building?
Stop asking the model to recall facts; instead feed it the facts and tell it to use only those. The main pattern is grounding — often called RAG, Retrieval-Augmented Generation, which just means: you fetch the real data from your DB or docs, paste it into the request, and instruct 'answer only from the text below; if it's not there, say you don't know.' Also: validate the model's output in code (check that any product ID, SKU, or URL it returns actually exists in your system), keep each task narrow, and turn down the randomness. The mindset: treat model output like untrusted user input — recompute and verify anything that triggers a real action.
#rag#grounding#validation#reliability
Core ideadecision
What does 'temperature' do, and when would I turn it up or down?
Temperature is a setting (usually 0 to about 1) that controls how random the model is when picking each next chunk of text. At low temperature (0 to 0.2) it almost always grabs the single most likely option, so output is focused, consistent, and repeatable — what you want for classification, extracting data, code, or anything you'll validate. Turn it up (0.7 to 1.0) and it samples less-likely options too, giving more varied, 'creative' output — good for brainstorming or marketing copy. The trade-off: higher temperature also raises the chance of hallucination. Sensible defaults: temperature 0 for production logic, higher for creative writing.
#temperature#randomness#tuning
Core ideagotcha
Why can the exact same prompt give me different answers each time?
Two reasons. First, sampling: unless temperature is 0, the model deliberately rolls the dice when picking among likely next chunks, so runs diverge. Second, even at temperature 0 you can see small variation — providers run these models on huge banks of parallel hardware where tiny numerical differences make perfectly identical output rare. The mindset shift for a web dev: an LLM call is not a pure function that always returns the same value for the same input — it's closer to calling a flaky third-party service. So write code that tolerates variation: validate the output, retry on bad results, and never assume you'll get a byte-identical response.
#non-determinism#temperature#reliability#testing
Core ideadecision
Bigger isn't always better — when would I pick a small model on purpose?
Often, for the right job. Smaller models (as of 2026, e.g. a Claude Haiku tier or a small 'mini'/'nano' tier from other providers) are cheaper, faster, and plenty smart for narrow, well-defined tasks: classification, extracting fields, routing a request, short rewrites, yes/no decisions. Save the big expensive flagship models for genuinely hard reasoning, multi-step agents, or tricky code. A common production pattern is routing (sometimes called cascading): a cheap model handles the easy requests, and only the hard ones get escalated to the pricey one — which can cut cost substantially with little quality loss. Match the model size to the task difficulty, like picking the right instance size for a workload.
#model-selection#cost#routing
Core ideadecision
What's the difference between 'open' and 'closed' models (like Llama vs GPT/Claude/Gemini)?
Closed models (OpenAI's GPT, Anthropic's Claude, Google's Gemini) are available only as a hosted API — you send requests, they run the model on their servers, and you never get the underlying model files. Open-weight models (Meta's Llama, plus Mistral and others) publish those files, so you can download them and run them on your own hardware or a cloud GPU. The trade-off: closed is the easiest and usually most capable, but you're renting and your data leaves your walls; open-weight gives you control over hosting, privacy, and cost at scale, but YOU manage the infrastructure, scaling, and updates. It mirrors the SaaS-versus-self-hosting decision you already know.
#open-weights#closed-models#privacy
Core ideagotcha
What is 'training data' and a 'knowledge cutoff,' and how do they bite me?
Training data is the giant pile of text the model learned from; the knowledge cutoff is the date after which it saw nothing. So a model genuinely doesn't know events, library versions, or prices that are newer than its cutoff — ask about something recent and it'll either admit it doesn't know or, worse, confidently make up a plausible answer. The fix isn't to 'update the model' (you can't from the outside); it's to supply the current info in your request — fetch today's data from your API or DB and pass it in, the same grounding/RAG move. Treat the model as frozen knowledge plus whatever fresh context you hand it on each call.
#training-data#knowledge-cutoff#facts
Core ideacode
Show me the rough shape of an actual LLM API request and response.
It's a JSON POST to a chat-style endpoint. Conceptually: you POST a body with the model name, a couple of settings, and a messages array. For example, model 'claude-sonnet-4-6', temperature 0, max_tokens 500, and messages like [{role:'user', content:"What's my order status?"}]. Behavior-shaping instructions go in a separate top-level 'system' field, not as a message. The reply comes back as JSON with the generated text plus a 'usage' block reporting input and output token counts (your bill). max_tokens caps the reply length, and the messages array is the full history you resend each call. Same JSON-in, JSON-out rhythm as any REST integration you've done.
#api#json#request-response#integration
Core ideagotcha
What's the difference between the model and the chat product (Claude vs Claude.ai, GPT vs ChatGPT)?
The model is the raw next-word predictor you reach through the API — that's what you build on. The chat product is a whole application wrapped around that model: a UI, conversation memory, behind-the-scenes instructions, safety filters, tools like web search, file uploads, and login. So ChatGPT is an application; GPT is the engine inside it. Claude.ai is the app; the Claude models are the engine. When you build your own AI feature, you're calling the engine (the API) and writing your OWN wrapper — your UI, your instructions, your memory, your guardrails. Don't assume the API behaves like the polished consumer chat app; that polish is all the wrapper's doing.
#api#product#architecture
Core ideahow-to
If I'm building a chatbot, how does the model 'remember' the earlier turns of the conversation?
It doesn't — you remember for it. Because each API call is stateless (the model forgets everything between requests), your backend stores the conversation and resends the whole history every time. The request is just an array of messages: [{role:'user',...},{role:'assistant',...},{role:'user',...}], and you append the new turn each call. It's like a REST API with no session — you pass the state in yourself. The catch: that growing history eats into your context window and your per-call cost, so long chats eventually need trimming or summarizing.