AI Test Stack
AI Foundations for QA Professionals/Level 4 — Large Language Models (LLMs)
Lesson

How ChatGPT Works

Follow the prompt-to-response workflow including tokenization, embeddings, transformers, alignment, and RLHF.

7 min read
A ChatGPT workflow diagram showing prompt assembly, model processing, and response generation.
A ChatGPT workflow diagram showing prompt assembly, model processing, and response generation.

Overview

When people say "ChatGPT works like a chatbot," they are only describing the interface. Under the hood, a ChatGPT-style system is a layered workflow: role instructions, conversation state, tokenization, transformer inference, decoding, post-processing, and sometimes tools or memory.

For QA teams, this matters because bugs can come from many places besides the base model itself. A failure may be caused by prompt assembly, truncated context, wrong role ordering, unsafe tool output, or a decoding setting that changes consistency. This lesson breaks down that prompt-to-response path so you can test the full system, not just the model brand.

A Practical Note for QA Learners

This lesson is less about memorizing internal AI terminology and more about learning where failures can happen in a chat-style system. If you already use tools like ChatGPT or Claude, the practical goal is to stop treating a bad answer as "just an AI issue" and start locating the likely failure layer more precisely.

Focus on these three ideas:

  • a chat assistant is a system, not only a model
  • prompt assembly and context handling are often part of the bug
  • retrieval, tools, and post-processing can fail even when the base model is fine

Learning Goals

  • Trace the path from user prompt to assistant response.
  • Understand the role of instructions, messages, and conversation state.
  • Explain where tokenization, inference, and decoding happen.
  • Recognize where alignment and RLHF influence behavior.
  • Build QA checks for chat-style AI workflows.

Core Concepts

1. A Chat App Is More Than a Raw Model Call

A ChatGPT-style system usually includes:

  • application-level instructions or policies
  • user input and chat history
  • optional retrieved context or tool output
  • the LLM itself
  • decoding settings
  • output filtering or post-processing

That means "the model answered badly" is often an incomplete diagnosis.

2. Message Roles and Instruction Priority

Modern chat systems often structure input as messages with roles such as:

  • system or developer
  • user
  • assistant

These roles matter because they shape instruction priority. Model behavior can change significantly if:

  • a business rule is placed in the wrong message role
  • previous conversation turns are omitted
  • the prompt is formatted differently than the model expects

Practical QA implication:

  • Test the same request with changed role placement to uncover integration bugs.

3. Tokenization and Chat Formatting

Even chat models still continue a token sequence under the hood. Chat templates convert messages into the specific token pattern the model was fine-tuned on.

This is why wrong formatting can degrade output quality. A model may expect special markers or a certain role structure, and if the application builds the prompt incorrectly, quality can collapse without obvious errors.

4. Inference: Generating One Token at a Time

Once the input is assembled, the model processes the tokenized sequence and predicts the next token distribution. It then generates one token, adds it to context, and repeats.

This loop continues until:

  • a stop token is reached
  • a maximum token limit is reached
  • a tool call or structured output condition changes flow

Important implication:

  • the model does not plan the full final answer in one perfect step
  • early token choices influence later output quality

5. Decoding Controls Shape the Personality of the Output

The same model can feel very different depending on generation settings.

SettingEffectQA concern
TemperatureMore or less randomnessStability across runs
Max tokensLimits output lengthTruncation of important detail
Top-k / top-pRestricts candidate tokensCreativity vs consistency trade-off
Stop conditionsEnds output earlyBroken schemas or cut-off explanations

6. Alignment and RLHF

After pretraining, many chat systems are further shaped through methods such as:

  • instruction tuning
  • supervised fine-tuning
  • reinforcement learning from human feedback
  • safety filters and refusal policies

These help the system become more useful, safe, and consistent for chat use.

However, they can also introduce trade-offs:

  • over-refusal for legitimate prompts
  • bland or generic answers
  • inconsistent behavior near policy boundaries

7. Tool Use, Retrieval, and Hidden Failure Points

Many production chat assistants are not just text generators. They may call tools, retrieve documents, or compose answers from external data.

That creates additional failure surfaces:

  • wrong retrieval result
  • tool timeout
  • stale external data
  • unsafe tool output passed back into the prompt
  • poor answer synthesis after retrieval

8. A Useful QA Debugging Mindset for Chat Systems

When a chat assistant fails, ask these questions in order:

  1. Did the application send the right instructions and context?
  2. Was important information truncated or buried?
  3. Did the model misinterpret the prompt or ignore a constraint?
  4. Did decoding settings make the answer unstable?
  5. Did retrieval, tool calls, or post-processing introduce the real error?

This mindset helps QA teams separate model problems from integration problems more quickly.

QA/SDET Relevance

Manual QA should test:

  • message history handling
  • context truncation behavior
  • prompt-role mistakes
  • refusal boundary consistency
  • groundedness when external data is included

Automation and SDET workflows should test:

  • prompt assembly deterministically
  • schema validity for structured outputs
  • retries and fallback logic on timeout or tool failure
  • stable behavior on a regression prompt set
  • monitoring signals such as latency, refusal rate, and hallucination rate

Practical Work

Exercise: Map the Failure Surface of a Chat Assistant

Objective: Treat a chat AI feature like a distributed system with multiple failure points.

Pick a QA use case such as "Generate test cases from acceptance criteria."

Create a workflow map with these stages:

StageExample riskExample test
Prompt assemblyMissing rule in developer messageCompare raw prompt before send
Tokenization/contextLong requirement gets truncatedNear-limit prompt test
Model inferenceHallucinated constraintsGold-set evaluation
DecodingOutput becomes inconsistentRepeat-run variability check
Post-processingJSON parsing failureSchema validator
Retrieval or toolsWrong source usedGroundedness and citation check

Then write one test case per stage.

Reflection:

  1. Which failures belong to the model versus the application layer?
  2. Which of these checks should run in CI?
  3. Which failures would a user see first in production?

Key Takeaways

  • ChatGPT-style systems combine prompt assembly, model inference, and application logic.
  • Role formatting and conversation state strongly influence quality.
  • Responses are generated token by token, so decoding settings matter.
  • Alignment improves usefulness and safety but can introduce new edge cases.
  • QA should test the whole chat workflow, not only the underlying model.

YouTube Resources

Andrej Karpathy video thumbnail for a deep dive into LLMs like ChatGPT.
Andrej Karpathy video thumbnail for a deep dive into LLMs like ChatGPT.

What this helps with: Connects model internals to the actual user experience of modern chat systems and explains many practical failure modes.

Video thumbnail for a concise explanation of how ChatGPT works.
Video thumbnail for a concise explanation of how ChatGPT works.

What this helps with: Gives a cleaner prompt-to-response system view of ChatGPT, including where context, model inference, and safety layers show up in practice.

Next Step

Next, continue to Level 5 on Prompt Engineering Fundamentals, where you will turn these internal concepts into repeatable prompting and evaluation techniques.