I keep hearing 'calling an LLM API.' As a web dev, what actually is that — is it something exotic?
Nothing exotic. An LLM (Large Language Model: an AI trained on huge amounts of text that, given some words, predicts the next ones) is reached the same way you reach any third-party REST API: an HTTPS POST request to someone else's server. You send JSON containing your prompt; you get JSON back containing the model's text. A provider like OpenAI or Anthropic runs the GPU-heavy infrastructure; you just hit their endpoint with an API key. If you've called Stripe or SendGrid, you already know how to do this.
#llm #rest #http #basics
Basics · concept
What is the 'messages' array, and why isn't it just one big text string?
The messages array is the conversation you send — an ordered list of turns, each an object like {"role": "user", "content": "..."}. Think of it as a chat log you replay on every call. Splitting it into roles (instead of one blob) tells the model who said what, so it can answer the latest user turn in context. Crucially, the model is stateless — it remembers nothing between calls, like a REST endpoint that forgets you between requests — so YOU resend the whole array each time. That's how multi-turn chat works.
#messages #conversation #state #json
Basics · concept
What's the difference between the system, user, and assistant roles in messages?
Three roles, three speakers. 'system' is your standing instructions — tone, rules, persona ('You are a support bot, be concise'). You set it, not the end user — think of it like config or middleware that frames the whole conversation. 'user' is what the human typed. 'assistant' is what the model said on earlier turns. You build a multi-turn chat by appending the model's reply back as an assistant message, then the next user message, and resending. Put your guardrails in the system message, never in user text a visitor controls.
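The append-and-resend loop can be sketched in a few lines. This is only an illustration: `appendTurn` is a made-up helper, not part of any SDK, and the assistant reply is a hard-coded stand-in for real model output.

```javascript
// Hypothetical helper: returns a new history array with one more turn appended.
// Only "user" and "assistant" turns are stored here; the system prompt is kept separately.
function appendTurn(history, role, content) {
  if (role !== "user" && role !== "assistant") {
    throw new Error(`Unexpected role: ${role}`);
  }
  return [...history, { role, content }];
}

// Simulate two rounds of chat. The assistant text is a placeholder, not a real model reply.
let history = [];
history = appendTurn(history, "user", "What is an API key?");
history = appendTurn(history, "assistant", "A secret credential for an API.");
history = appendTurn(history, "user", "Where should I store it?");

// `history` is what you would send as `messages` on the next call.
```

Returning a new array instead of mutating keeps the helper easy to use with React state or a per-session store.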
#roles #system-prompt #messages
Basics · code
Show me the shape of a basic LLM request body. What do I actually put in the POST?
Minimally: which model, the messages, and a cap on output length. Roughly:
```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "Summarize this ticket in one line."}
  ]
}
```
You POST that as JSON with an auth header (Anthropic uses x-api-key plus an anthropic-version header; OpenAI uses Authorization: Bearer ...) and Content-Type: application/json. The model IDs shift over time (as of 2026, e.g. claude-sonnet-4-6 or gpt-5.5). Shapes differ slightly: OpenAI puts the system prompt inside the messages array, while Anthropic takes a separate top-level 'system' field. Either way it's always model + messages + a few knobs.
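To make the system-prompt difference concrete, here is a rough sketch that builds both request bodies from the same inputs. The function names are invented for illustration, and the OpenAI body deliberately leaves out an output-length cap because the parameter name for it has changed over time; check the current API reference before relying on either shape.

```javascript
// Anthropic-style body: system prompt is a separate top-level field.
function buildAnthropicBody({ model, system, messages, maxTokens }) {
  return { model, max_tokens: maxTokens, system, messages };
}

// OpenAI chat-style body: system prompt is the first entry in the messages array.
function buildOpenAIBody({ model, system, messages }) {
  return { model, messages: [{ role: "system", content: system }, ...messages] };
}

const system = "You are a support bot, be concise.";
const messages = [{ role: "user", content: "Summarize this ticket in one line." }];

const anthropicBody = buildAnthropicBody({
  model: "claude-sonnet-4-6", // model ID taken from the card above; IDs change over time
  system,
  messages,
  maxTokens: 1024,
});
const openaiBody = buildOpenAIBody({ model: "gpt-5.5", system, messages });
```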
#request #json #payload #code
Basics · code
What does the response JSON look like, and where's the actual text?
It's JSON with metadata wrapped around the text — you have to reach one level in, the way you'd read data.results[0] from any API. Anthropic returns 'content' as an array of blocks, so the text is at response.content[0].text. OpenAI nests it at response.choices[0].message.content. You also get a 'usage' object (token counts, for billing), an id, the model name, and a stop reason. Beginner gotcha: people expect a bare string and get tripped up by the envelope. Just pull the text out of the documented path.
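A small helper can hide the envelope difference. This is a sketch based only on the two paths named above; `extractText` is a hypothetical name, and the sample objects are hand-written and trimmed down, not real API output.

```javascript
// Pull the reply text out of either response shape.
function extractText(response) {
  // Anthropic-style: `content` is an array of blocks; join the text blocks.
  if (Array.isArray(response.content)) {
    return response.content
      .filter((block) => block.type === "text")
      .map((block) => block.text)
      .join("");
  }
  // OpenAI chat-style: the text sits under choices[0].message.content.
  if (Array.isArray(response.choices) && response.choices.length > 0) {
    return response.choices[0].message?.content ?? "";
  }
  throw new Error("Unrecognized response shape");
}

// Hand-written samples for illustration only.
const anthropicSample = { content: [{ type: "text", text: "Login fails on Safari." }] };
const openaiSample = { choices: [{ message: { role: "assistant", content: "Login fails on Safari." } }] };
```

Throwing on an unknown shape is a deliberate choice here: a loud failure is easier to debug than an `undefined` that leaks into your UI.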
#response #json #parsing #usage
Basics · concept
What is an API key here, and where should it absolutely never live?
The API key is your secret credential — it authenticates the call and gets your account billed, so treat it exactly like a database password. It goes in a request header (Authorization: Bearer ... or x-api-key). Where it must NEVER live: your frontend (React/browser) code, a public repo, or anything shipped to the client. Anyone who opens DevTools can copy a key sent to the browser and run up your bill. Keep it in a server-side environment variable (e.g. process.env.ANTHROPIC_API_KEY), kept out of git via .gitignore. Same discipline you already use for any secret.
#api-key #secrets #env-vars #security
Basics · decision
Why should the LLM call come from my backend and not directly from the React frontend?
Two reasons. First, secrets: a browser call would expose your API key to anyone, since frontend code is fully visible — same reason you'd never put your Stripe secret key in the browser. Second, control: from your backend you can validate input, rate-limit, log usage, cache, swap models, and require auth — none of which you can trust the client to do. So the pattern is: browser → your /api/chat endpoint (server holds the key) → LLM provider → back. Your backend is a proxy that keeps the key safe and stays in control.
#backend #architecture #security #proxy
Basics · concept
What is a 'token,' and why does the API talk about tokens instead of characters?
A token is a chunk of text the model reads and writes in — usually a word-piece, very roughly 3–4 characters of English (so ~750 words is about 1000 tokens). The model processes text token by token, so tokens are its natural unit of work. APIs price and limit by tokens — both what you send (input) and what comes back (output) — the way an API might bill per request rather than per byte. Practical impact: your bill and your length limits are counted in tokens, not characters. The response's 'usage' field tells you exactly how many you spent — watch it like an API quota.
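For a quick budget check before you send anything, the characters-per-token rule of thumb is enough. This sketch assumes 4 characters per token, the upper end of the rough range above; it is a ballpark for English text only, and the 'usage' field in the response remains the source of truth.

```javascript
// Very rough estimate: about 4 characters of English per token (an assumption, not a tokenizer).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const prompt = "Summarize this ticket in one line.";
const approxTokens = estimateTokens(prompt);
```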
#tokens #billing #usage #limits
Basics · concept
When I add an AI feature to my React app, why can't my browser code just call the LLM provider (OpenAI/Anthropic) directly, the way I'd call any other REST API?
Because calling an LLM requires your secret API key, and any code running in the browser is fully visible to users (View Source, the Network tab, devtools). A key shipped to the browser is a leaked key. Think of it exactly like your database password or a Stripe secret key: you would never put those in frontend JS. Anyone could grab it and run up your bill or abuse your account. So the call must happen on a server you control, where the key stays private. Your React app talks to your backend; your backend talks to the provider.
#api-key #security #backend #frontend
Basics · how-to
What does the actual wiring look like — what's the path a request takes from my React button click to the LLM and back?
Three hops, and you own the middle one. Your React app does a normal fetch('/api/chat', {...}) to YOUR backend (same as any feature you've built). Your backend route receives it, attaches the secret API key from an env var, and makes its own server-to-server call to the provider (e.g. https://api.openai.com/...). The provider replies to your backend; your backend forwards the answer back to React as plain JSON. Browser → your server → provider → your server → browser. The browser never sees the key or the provider's URL. This middle hop is often called a 'proxy' or 'BFF' (backend-for-frontend) — a thin pass-through endpoint.
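The browser-side hop might look roughly like this. It assumes your backend exposes POST /api/chat and answers with `{ reply: "..." }`; both are choices you make, not anything the provider dictates. Note what is absent: no API key and no provider URL.

```javascript
// Build the fetch options for our own backend. No secrets belong here.
function buildChatRequest(message) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  };
}

// Call our backend and return the reply text.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function sendChat(message, fetchImpl = fetch) {
  const res = await fetchImpl("/api/chat", buildChatRequest(message));
  if (!res.ok) {
    throw new Error(`Chat request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.reply;
}
```

In a component you would call `sendChat(input)` from a click handler and put the result into state.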
#proxy #wiring #backend #fetch
Core idea · gotcha
What does max_tokens control, and what happens if I set it too low?
max_tokens caps how many tokens the model may generate in its reply — a ceiling on output length, not a target. Set it too low and the reply gets cut off mid-sentence: you'll see a stop reason like 'length' or 'max_tokens' and a truncated answer. It does NOT make the model try to fill that many tokens; it just stops there. Beginner trap: people set 50, then wonder why answers are chopped. Size it to your worst-case useful reply. And note output tokens usually cost more per token than input.
#max_tokens #output #truncation #cost
Core idea · concept
Why do I get a different answer each time I send the exact same prompt? That feels broken.
It's expected, not a bug. Instead of behaving like a pure function (same input, same output), the model picks each next word by rolling weighted dice over its likely options — so the same prompt can produce different valid wording each call. A setting called 'temperature' controls how adventurous that dice-roll is: temperature 0 makes it almost always pick the single most likely next word, so outputs become much more consistent (close to identical, though not always byte-for-byte). For predictable behavior in production, set temperature to 0 and keep your prompts stable.
#temperature #determinism #sampling #randomness
Core idea · decision
What is 'temperature' and when would I turn it up vs down?
Temperature is a knob (usually 0 to 1, sometimes up to 2) for how much randomness the model uses when choosing each next word. Low (0–0.3): focused and repeatable — use it for classification, data extraction, lookups, anything where you want the same answer every time. High (0.7–1): more varied and creative — use it for brainstorming, marketing copy, or chat where sameness feels robotic. Default to low for app features whose output your code will parse: you don't want your JSON parser surprised because the model got poetic. It's a creativity-vs-consistency dial.
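If you want to see why a lower temperature is more repeatable, here is a toy version of the usual picture: divide each candidate's score by the temperature, then turn the scores into probabilities (a softmax). The scores are made up, and real providers layer other sampling settings on top, so treat this as intuition rather than a description of any specific model.

```javascript
// Convert raw scores for candidate next words into probabilities at a given temperature.
function nextTokenProbabilities(scores, temperature) {
  if (temperature === 0) {
    // Temperature 0 is treated as "always pick the top-scoring option".
    const best = scores.indexOf(Math.max(...scores));
    return scores.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const scores = [2, 1, 0]; // invented scores for three candidate words
const focused = nextTokenProbabilities(scores, 0.2); // top option gets almost all the weight
const varied = nextTokenProbabilities(scores, 1); // weight is spread more evenly
```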
#temperature #tuning #creativity #determinism
Core idea · gotcha
What is a 'stop reason' / 'finish reason' in the response, and why should I check it?
It's a field telling you WHY the model stopped generating — the difference between 'done' and 'silently broken.' Common values: 'stop'/'end_turn' (finished naturally — good), 'length'/'max_tokens' (hit your cap, so the answer is truncated — raise the limit), 'content_filter' (blocked by a safety filter), and 'tool_use'/'tool_calls' (it wants to call a function you defined). Always read it before trusting the text. A truncated answer that happens to look complete is a classic beginner bug — the stop reason is how you catch it.
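One way to make that check routine is to normalize the provider-specific values into a handful of outcomes. A sketch, assuming the field sits at `stop_reason` (Anthropic-style) or `choices[0].finish_reason` (OpenAI chat-style); the outcome labels are invented for this example.

```javascript
// Read the stop/finish reason from either response shape.
function getStopReason(response) {
  return response.stop_reason ?? response.choices?.[0]?.finish_reason ?? null;
}

// Map provider-specific values onto a few outcomes your code can branch on.
function classifyStop(response) {
  const reason = getStopReason(response);
  if (reason === "stop" || reason === "end_turn") return "complete";
  if (reason === "length" || reason === "max_tokens") return "truncated";
  if (reason === "content_filter") return "filtered";
  if (reason === "tool_use" || reason === "tool_calls") return "tool_call";
  return "unknown";
}
```

Treating anything unrecognized as "unknown" rather than "complete" keeps new stop reasons from slipping through as if they were successes.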
Should I use the official SDK or just call the endpoint with fetch/axios?
Either works — it's the same HTTP underneath. The official SDK (the openai or @anthropic-ai/sdk packages, for example) is usually worth it: it handles auth headers, retries with backoff, parsing of streamed responses, typed results, and keeps up with API changes — like using the Stripe SDK instead of hand-rolling requests. Raw fetch/axios is fine for a quick spike, a tiny serverless function where you want zero dependencies, or an unusual runtime. Start with the SDK; drop to raw HTTP only when you have a reason. Both read the key from an env var.
#sdk #fetch #axios #tooling
Core idea · code
How do I call the API with the official SDK — show me the minimal Node code.
Install the package, let it read the key from an env var, call one method:
```js
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const msg = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write a haiku about deploys." }],
});
console.log(msg.content[0].text);
```
OpenAI is the twin: new OpenAI(), then client.chat.completions.create({ model, messages }), read at choices[0].message.content. Model IDs change over time (as of 2026, e.g. claude-sonnet-4-6 or gpt-5.5). Notice you never type the key — the SDK pulls it from the environment.
#sdk #nodejs #code #example
Core idea · concept
What is 'streaming' a response, and why does it feel familiar from web dev?
Streaming sends the answer back piece by piece as it's generated, instead of making you wait for the whole thing — that's the typewriter effect you see in ChatGPT. It rides on tech you already know: Server-Sent Events (SSE) / HTTP chunked transfer. You set stream: true, then read an event stream where each chunk carries a bit of text you append to the UI. Why bother: a 10-second wait for a full answer feels broken, but streaming starts showing words in about a second, so the perceived wait drops sharply. It's the LLM version of progressive rendering.
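If you read the stream by hand, the fiddly part is that network chunks don't line up with events. Here is a minimal sketch of the buffering, assuming events are separated by a blank line and use plain \n line endings; `parseSseBuffer` is a made-up name, and the official SDKs do this for you.

```javascript
// Split a text buffer into complete SSE events plus whatever partial event is left over.
function parseSseBuffer(buffer) {
  const blocks = buffer.split("\n\n");
  const rest = blocks.pop(); // the last piece may be an incomplete event
  const events = blocks
    .map((block) =>
      block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
        .join("\n")
    )
    .filter((data) => data.length > 0);
  return { events, rest };
}

// Two complete events and the start of a third, as they might arrive in one chunk.
const { events, rest } = parseSseBuffer("data: Hel\n\ndata: lo\n\ndata: wor");
// Carry `rest` over and prepend it to the next chunk before parsing again.
```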
#streaming #sse #latency #ux
Core idea · decision
When should I stream vs just wait for the full response?
Stream when a human is watching text appear — chat UIs, assistants, long explanations — because it kills the perceived wait even though total time is the same. Use a plain single request/response (sometimes called 'sync') when no human is staring at it, or when you need the whole output before acting: classification, data extraction, anything you'll JSON.parse and feed to code, or batch/cron jobs. Streaming also complicates error handling and parsing, since you're reassembling chunks. Rule of thumb: stream for humans reading, single response for machine-to-machine.
#streaming #sync #ux #architecture
Core idea · gotcha
What are rate limits on an LLM API and how do they bite me?
Same idea as any API quota: the provider caps how much you can send per minute — usually both requests-per-minute (RPM) and tokens-per-minute (TPM). Go over and you get HTTP 429 Too Many Requests. The LLM twist is that TPM matters as much as RPM, so a few huge prompts can throttle you even at a low request count. Limits start low on new accounts and rise as you use the API more (tiers). Watch the Retry-After header the response sends back, and design for it: queue, batch, and back off rather than hammering.
#rate-limits #429 #tpm #quota
Core idea · how-to
How should I handle errors and retries when calling an LLM?
Treat it like any flaky network dependency, with one twist. Retry the transient failures, 429 (rate limited) and 5xx (server errors), using exponential backoff with jitter (wait 1s, then 2s, then 4s, plus a little randomness so retries don't all fire at once and stampede), honoring any Retry-After header. Do NOT blindly retry other 4xx errors like 400 (bad request) or 401 (bad key); those won't fix themselves. Cap total retries, set a timeout, and have a fallback (a cached answer, a smaller model, or a graceful 'try again' message). The official SDKs do much of this for you, which is one reason to use them.
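If you do roll your own, the policy fits in a few small functions. This is a sketch under stated assumptions: all names are invented, the 250 ms jitter ceiling is an arbitrary choice, Retry-After is only handled in its seconds form (it can also be a date), and thrown network errors are not retried.

```javascript
// Only rate limits and server errors are worth retrying.
function isRetryable(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s... plus jitter,
// unless the server told us how long to wait.
function backoffDelayMs(attempt, retryAfterSeconds, random = Math.random) {
  if (retryAfterSeconds != null) return retryAfterSeconds * 1000;
  const base = 1000 * 2 ** attempt;
  const jitter = random() * 250; // arbitrary small jitter so clients don't retry in lockstep
  return base + jitter;
}

// `doRequest` is any function returning a fetch-style Response.
async function callWithRetry(
  doRequest,
  maxRetries = 3,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.ok || !isRetryable(res.status) || attempt >= maxRetries) return res;
    const header = res.headers.get("retry-after");
    const seconds = header !== null && header !== "" ? Number(header) : NaN;
    await sleep(backoffDelayMs(attempt, Number.isFinite(seconds) ? seconds : null));
  }
}
```

The caller still has to inspect the returned response: after the last attempt it may be a 429 or 5xx, which is where your fallback kicks in.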
#retries #backoff #429 #error-handling
Core idea · decision
There are a dozen models — how do I actually pick one for my feature?
Think tiers, not brands. Each provider ships a big/smart/pricier model and smaller/faster/cheaper ones. As of 2026, for example, Anthropic has Opus (most capable, e.g. claude-opus-4-8), Sonnet (balanced, claude-sonnet-4-6), and Haiku (fast and cheap, claude-haiku-4-5); OpenAI mirrors this from gpt-5.5 down to gpt-5.4-mini. Start with a mid or large model to prove the feature works, then try dropping to the cheapest one that still passes your tests. Judge by quality on YOUR task, latency, cost per token, and how much text it can take in. Cheap models for classification/extraction; bigger ones for hard reasoning.
#model-selection #cost #latency #tiers
Core idea · concept
What's a 'context window' and why does it limit how much I can send?
The context window is the maximum total tokens the model can look at in one call — your entire messages array (system + history + the new input) PLUS the reply it generates all have to fit inside it. Think of it like a function that only accepts so many bytes of arguments. As of 2026 many flagship models offer roughly 1,000,000-token windows; smaller ones around 200,000. Go over and the API rejects the call. Practical impact: long chat histories and big documents eat the budget, so you trim old turns, summarize, or split documents into chunks. It's a hard limit, not a soft suggestion.
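Trimming old turns can be as simple as walking the history backwards until a budget is used up. This sketch uses a crude 4-characters-per-token estimate (an assumption, not a real tokenizer) and then drops any leading assistant turn, since some APIs expect the conversation to open with a user message. `trimHistory` and `roughTokenCount` are hypothetical names.

```javascript
// Crude size estimate; swap in a real token counter if you have one.
function roughTokenCount(text) {
  return Math.ceil(text.length / 4);
}

// Keep the most recent turns that fit inside `maxTokens`.
function trimHistory(messages, maxTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = roughTokenCount(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  // Make sure the trimmed history starts with a user turn.
  while (kept.length > 0 && kept[0].role !== "user") kept.shift();
  return kept;
}
```

Leave headroom in the budget for the system prompt and for the reply, since both count against the same window.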
#context-window #tokens #limits #memory
Core idea · concept
Is the conversation stored on the provider's side between my API calls?
No — by default each call is stateless, like a REST request with no session. The model has zero memory of your previous call; it only 'knows' what's in the messages array you send THIS time. Multi-turn chat works because YOU store the history (in your DB or session) and resend it each call, appending the latest turn. Beginner surprise: people expect it to recall earlier chats automatically — it doesn't, you own the memory. (Some newer 'threads'/stateful APIs can persist state server-side, but the default mental model is stateless: you carry the history.)
#state #memory #conversation #stateless
Core idea · gotcha
Where exactly does the API key go, and how do I keep it out of git and out of the bundle?
The key lives in an environment variable on your server only — same discipline as any DB password. Put it in a .env file locally (ANTHROPIC_API_KEY=sk-...) and read it via process.env.ANTHROPIC_API_KEY; in production set it in your host's config (Render, Vercel server env, Docker secret, etc.). Add .env to .gitignore so it never gets committed. One trap: in Vite/Next, only vars prefixed VITE_/NEXT_PUBLIC_ get bundled into the browser — never use those prefixes for a secret, or you've just shipped it to every visitor. Keep secrets unprefixed and read them server-side.
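A tiny guard at server startup can catch both mistakes early: a missing key and a secret that was given a browser-exposed prefix. `requireSecret` is a hypothetical helper; the two prefixes checked are the ones named above.

```javascript
// Read a secret from the environment, failing fast if it is missing
// or named in a way that a bundler would expose to the browser.
function requireSecret(name, env = process.env) {
  if (name.startsWith("VITE_") || name.startsWith("NEXT_PUBLIC_")) {
    throw new Error(`${name} uses a client-exposed prefix; rename it before storing a secret`);
  }
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Typical server-side usage (left commented out so this file runs without the variable set):
// const apiKey = requireSecret("ANTHROPIC_API_KEY");
```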
#env-var #gitignore #secrets #bundler
Hands-on · code
How do I stream from my backend to my React frontend without leaking the key?
Two hops, key stays server-side. Your backend calls the provider with stream: true, then re-streams the chunks to the browser over its own SSE response (or a ReadableStream). Sketch:
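```js
// Express route handler body (assumes `client` is an Anthropic SDK client created elsewhere)
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
const stream = await client.messages.create({
  /* model, max_tokens, messages ... */
  stream: true,
});
for await (const event of stream) {
  // Forward only text deltas; other event types carry metadata
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    // JSON-encode so newlines in the text cannot break SSE framing;
    // the client JSON.parses each data payload
    res.write(`data: ${JSON.stringify(event.delta.text)}\n\n`);
  }
}
res.end();
```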
The React side consumes it with EventSource (which only makes GET requests) or, for a POST endpoint, fetch plus a stream reader. The provider key never touches the client; your endpoint is the trusted proxy in the middle.
#streaming #sse #backend #react
Hands-on · gotcha
How do I keep an LLM feature from blowing up my bill?
You pay per token, split into input (what you send) and output (what comes back), and output usually costs several times more per token. As of 2026, a flagship model runs roughly $5 per million input tokens and $25–30 per million output tokens, with small models several times cheaper. Your levers: pick the smallest model that passes your tests, cap max_tokens, trim the history/context you resend every turn, and cache repeated answers (some providers also offer 'prompt caching' that discounts a big reused system prompt). Log the usage field on every request and set spend limits/alerts in the provider dashboard. Treat tokens like a metered API and measure before optimizing.
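Logging a dollar figure next to each request makes the bill much less mysterious. A sketch, assuming the usage object carries `input_tokens` and `output_tokens` (Anthropic-style; OpenAI's chat responses name them `prompt_tokens` and `completion_tokens`). The prices are the illustrative figures from above, passed in as parameters, not current rates.

```javascript
// Estimate the cost of one call from its usage object and per-million-token prices.
function estimateCostUsd(usage, pricePerMillion) {
  const inputCost = usage.input_tokens * pricePerMillion.input;
  const outputCost = usage.output_tokens * pricePerMillion.output;
  return (inputCost + outputCost) / 1_000_000;
}

// Example: 2,000 tokens in, 500 tokens out, at $5 / $25 per million tokens.
const cost = estimateCostUsd(
  { input_tokens: 2000, output_tokens: 500 },
  { input: 5, output: 25 }
);
// (2000 * 5 + 500 * 25) / 1,000,000 = 0.0225, so a bit over two cents
```

In this example the 500 output tokens cost more than the 2,000 input tokens, which is why capping max_tokens is such an effective lever.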
#cost #tokens #pricing #optimization
Hands-on · gotcha
The model returned JSON I asked for, but my JSON.parse sometimes throws. What's going on?
Classic trap: asking for JSON in the prompt doesn't guarantee valid JSON. The model is a text generator, so it may wrap the output in ```json code fences, add a 'Here you go:' preamble, or leave a trailing comma. Don't trust-and-parse blindly. Best fix: use the provider's structured-output or JSON mode (the model is constrained to emit valid JSON matching a schema you give it), or its tool/function-calling feature, which does the same. At minimum, strip fences and wrap the parse in try/catch with a retry. Then validate the parsed object against your own schema before using it — the golden rule is recompute/verify, don't blindly trust the model.
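The "at minimum" fallback might look like this: strip a code fence if there is one, parse inside try/catch, then run your own validator. The names and the result shape are invented for this example, and the validator is whatever schema check you already use.

```javascript
const FENCE = "`".repeat(3); // three backticks, built this way to keep the example readable

// If the text contains a fenced block, return its contents; otherwise return the text itself.
function stripCodeFences(text) {
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const match = text.match(pattern);
  return (match ? match[1] : text).trim();
}

// Parse and validate model output without ever throwing.
function parseModelJson(text, validate) {
  let parsed;
  try {
    parsed = JSON.parse(stripCodeFences(text));
  } catch {
    return { ok: false, error: "invalid_json" };
  }
  if (!validate(parsed)) {
    return { ok: false, error: "schema_mismatch" };
  }
  return { ok: true, value: parsed };
}

// Example validator for a hypothetical { priority: string } shape.
const isTicket = (obj) => obj !== null && typeof obj === "object" && typeof obj.priority === "string";
```

Returning a result object instead of throwing lets the caller decide whether to retry, fall back, or surface an error.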
#json #structured-output #validation #parsing
Hands-on · code
Can you show me a minimal backend proxy endpoint that safely calls an LLM — the smallest thing that works?
Here's the shape in Node/Express. The key comes from process.env, never hardcoded:
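```js
// Assumes an Express `app` with express.json() middleware already registered
app.post('/api/chat', async (req, res) => {
  const { message } = req.body;
  // Never forward unchecked client input
  if (typeof message !== 'string' || !message.trim()) {
    return res.status(400).json({ error: 'message is required' });
  }
  try {
    const r = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.ANTHROPIC_API_KEY,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json'
      },
      body: JSON.stringify({
        model: 'claude-sonnet-4-5', // as of 2026, e.g. a current Claude model
        max_tokens: 500,
        messages: [{ role: 'user', content: message }]
      })
    });
    if (!r.ok) {
      // Keep provider error details on the server
      return res.status(502).json({ error: 'LLM request failed' });
    }
    const data = await r.json();
    res.json({ reply: data.content[0].text });
  } catch (err) {
    // Network failure or unexpected response shape
    res.status(502).json({ error: 'LLM request failed' });
  }
});
```
It's just a POST forwarding a POST. Set ANTHROPIC_API_KEY in your .env, not in code.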