AI Test Stack
AI Foundations for QA Professionals/Level 4 — Large Language Models (LLMs)
Lesson

Understanding LLMs

Cover tokens, parameters, context windows, embeddings, transformers, and model families.

9 min read
A clear LLM pipeline diagram showing prompt text, tokenization, embeddings, transformer layers, and generated output.
A clear LLM pipeline diagram showing prompt text, tokenization, embeddings, transformer layers, and generated output.

Overview

Large language models, or LLMs, are the systems behind tools like ChatGPT, Claude, Gemini, and Copilot. They do not store finished answers like a database. Instead, they process input as tokens, represent those tokens as vectors, and generate responses one token at a time using patterns learned during training.

For QA professionals, an LLM is not just an AI buzzword. It is a probabilistic software component with input limits, formatting expectations, model-specific behavior, and very real failure modes. This lesson gives you the mental model you need before going deeper into transformer internals and ChatGPT-style workflows.

A Practical Note for QA Learners

If this lesson starts to feel overly technical, do not worry about memorizing every term. The most important mental model is that an LLM is a text-processing system with limits, probabilities, and context rules rather than a source of guaranteed truth.

For practical QA work, three things matter most:

  • The model only sees what fits into its current context window.
  • The same task can behave differently depending on prompt wording and decoding settings.
  • Good QA focuses on behavior, consistency, and failure patterns, not only whether one answer "looks smart."

Learning Goals

  • Explain what an LLM is in plain language.
  • Distinguish tokens, parameters, embeddings, and context windows.
  • Recognize the main model families and why they behave differently.
  • Understand why outputs are probabilistic rather than fully deterministic.
  • Translate LLM concepts into concrete QA and SDET testing strategy.

Core Concepts

1. What Makes a Model an LLM?

An LLM is a deep learning model trained on very large amounts of text so it can predict likely next tokens in a sequence. Modern LLMs are typically built on transformer architecture and are useful for many language tasks without needing separate models for each task.

In practical terms, that means one model may be able to:

  • summarize release notes
  • draft API tests
  • explain a stack trace
  • rewrite a defect report for clarity
  • generate structured JSON

That versatility is why LLMs matter so much in QA workflows.

2. Tokens: The Real Unit of Input and Output

Humans think in words and sentences. LLMs process tokens.

A token can be:

  • a whole word
  • part of a word
  • punctuation
  • whitespace pattern

Why this matters:

  • context limits are measured in tokens, not pages
  • prompt cost is often measured in tokens
  • long logs or requirements can overflow the available context window

Example:

  • A short sentence may become 10 to 20 tokens.
  • A large bug report with logs may become thousands of tokens.

QA implication:

  • If a prompt silently truncates important context, the model may still answer confidently but miss the real issue.

3. Parameters and Embeddings

Parameters are the learned numeric weights inside the model. More parameters often increase capacity, but parameter count alone does not guarantee better quality for your use case.

Embeddings are vector representations of tokens or text. They let the model treat language as geometry in a high-dimensional space, where similar meanings can land closer together.

Practical intuition:

  • tokens are the pieces
  • embeddings are the numeric representation of those pieces
  • parameters are the learned rules for transforming those representations

Manual QA angle:

  • similar prompts can still behave differently because the model is sensitive to wording, order, and hidden structure in the token sequence.

QA automation / SDET angle:

  • regression suites should track behavior, not just the published parameter count or marketing claims for a model.

4. Context Windows: What the Model Can See at Once

The context window is the maximum number of tokens the model can use at one time. That usually includes:

  • system or developer instructions
  • user messages
  • retrieved documents
  • tool outputs
  • the generated answer itself

This is one reason prompt engineering and retrieval design matter so much. If the relevant information is not in the current context, the model cannot reliably use it.

QA implication:

  • Large-context support does not guarantee good long-context reasoning.
  • You should test near the context boundary with realistic data sizes.

5. Base Models, Instruct Models, and Chat Models

Not all LLMs are trained for the same interaction style.

Model typeWhat it is optimized forQA-relevant behavior
Base modelRaw next-token continuationOften less safe, less instruction-following
Instruct modelFollowing direct instructionsBetter for task completion
Chat modelConversational multi-turn exchangeBetter dialogue flow, role handling, and assistant behavior

This distinction matters because a model can be technically strong and still perform poorly if you use the wrong prompt format for its training style.

6. Decoder Behavior: Why Outputs Vary

At inference time, the model predicts a probability distribution over possible next tokens. The application or inference engine then chooses one.

Common generation controls include:

  • temperature: higher values increase randomness
  • top-k: limit selection to the top-k candidates
  • top-p: limit selection to the smallest set whose cumulative probability reaches a threshold
  • max_new_tokens: limit output length

This is why the same prompt can produce different valid answers across runs.

QA implication:

  • Exact string matching is often the wrong oracle.
  • Property-based checks, schema validation, and rubric scoring are more robust.

7. Where LLMs Commonly Fail

LLMs are powerful but imperfect. Frequent failure patterns include:

  • hallucinated facts
  • dropped constraints from long prompts
  • brittle formatting under schema pressure
  • inconsistent behavior across paraphrases
  • latency or timeout issues with long inputs
  • bias or uneven quality across languages and user groups

8. What QA Teams Should Verify Before Trusting an LLM in Production

Before adopting an LLM for a real workflow, it helps to verify a few basic things deliberately:

AreaWhat to verifyWhy it matters
Context handlingImportant instructions are not ignored or truncatedLong prompts often fail silently
Output structureJSON, tables, or required formats are stableMany AI workflows break in post-processing
Domain accuracyProduct terminology and business rules are preservedFluent language can still hide wrong logic
Behavioral consistencySimilar prompts produce similarly acceptable resultsPrompt brittleness creates operational risk
Safety and boundariesThe model refuses unsafe requests without blocking valid onesOver-refusal and under-refusal both matter

QA/SDET Relevance

Manual QA should validate:

  • whether the model actually used the provided context
  • whether long prompts change answer quality
  • whether domain terminology is understood correctly
  • whether the answer remains useful across phrasing changes

QA automation and SDETs should validate:

  • token budget edge cases
  • schema compliance for structured outputs
  • regression behavior across model or prompt changes
  • latency, timeout, and fallback paths
  • safety and refusal handling for risky prompts

Practical Work

Exercise: Build a Small LLM Capability Test Matrix

Objective: Evaluate an LLM as a product component instead of treating it like a magic box.

Pick one QA-oriented task, such as:

  • generating API test cases
  • summarizing a production defect
  • converting acceptance criteria to BDD scenarios

Create 12 test cases across these categories:

CategoryExample
Short clean promptOne requirement paragraph
Long promptRequirement plus logs plus edge-case notes
Ambiguous promptMissing one key business rule
Structured outputRequire JSON or markdown table
Paraphrased promptSame task, different wording
Adversarial promptAdd conflicting or distracting instructions

For each test case, record:

  • input size in rough token terms
  • expected behavior
  • acceptable variation range
  • failure type if the model goes wrong

Reflection questions:

  1. Which failures came from missing context versus poor reasoning?
  2. Which checks could be automated reliably?
  3. Which output properties matter more than exact phrasing?

Key Takeaways

  • LLMs process tokens, not human-friendly pages or paragraphs.
  • Parameters, embeddings, and context windows affect behavior in different ways.
  • Chat quality depends on both model architecture and fine-tuning style.
  • Output is probabilistic, so evaluation must go beyond exact string comparison.
  • QA teams need coverage for context limits, formatting reliability, and behavioral drift.

YouTube Resources

Andrej Karpathy video thumbnail for an introduction to large language models.
Andrej Karpathy video thumbnail for an introduction to large language models.

What this helps with: One of the clearest high-level explanations of tokens, training, inference, and why LLMs behave like probabilistic text engines.

3Blue1Brown video thumbnail for a short explanation of large language models.
3Blue1Brown video thumbnail for a short explanation of large language models.

What this helps with: Reinforces the high-level mental model of what LLMs are doing without jumping too quickly into implementation detail.

Next Step

Next, move into Transformers and Attention to understand the architecture that made modern LLMs practical at scale.