Lesson

How AI Systems Learn

Understand pretraining, fine-tuning, and feedback loops so you can test AI systems with better failure awareness.

10 min read

A lifecycle diagram showing pretraining, fine-tuning, alignment, evaluation, and production monitoring.

Overview

Generative AI behavior is shaped by how models are trained and refined over time. To test these systems well, QA teams need a mental model of that learning lifecycle, from pretraining to post-deployment monitoring.

This lesson explains where model behavior comes from, why performance can drift, and how to map each training stage to practical QA checks.

A Practical Note for QA Learners

If this lesson starts to feel more technical than your day-to-day QA work, focus on the lifecycle rather than the terminology. The most important idea is simple: AI behavior is shaped in stages, and each stage can introduce different types of quality risk.

For practical QA work, three things matter most:

You need to know where a model behavior probably came from.
You need different test strategies for different failure types.
You should treat evaluation and monitoring as part of testing, not as separate activities.

If you prefer, skim the mathematical notation and focus on the lifecycle stages, the QA mapping table, and the practical exercise.

Learning Goals

Explain the difference between pretraining, fine-tuning, and alignment.
Describe why data quality controls are central to model quality.
Identify how evaluation differs from traditional software testing.
Connect training-stage risks to concrete QA and SDET test strategies.
Build a lightweight validation workflow for generated output quality.

Core Concepts

1. Pretraining: Learning General Language and Patterns

Pretraining is the large-scale phase where a model learns broad statistical patterns from huge datasets.

At this stage, the model learns:

Syntax and grammar patterns
Common reasoning structures
Domain vocabulary distributions
Typical formatting patterns in text and code

The model objective is often next-token prediction:

θ min - t = 1 \sum T lo g P_{θ} (x_{t} ∣ x_{< t})

Practical QA implication:

If pretraining data has noise, bias, or low-quality sources, those weaknesses can surface later as hallucinations or inconsistent output.

Manual QA example:

If an assistant gives different quality answers for similar requests across accents, geographies, or user types, the root cause may come from uneven data representation long before the prompt reaches the model.

2. Fine-tuning: Specializing for Tasks and Domains

Fine-tuning adapts a pretrained model to a narrower domain or behavior target.

Examples:

Support chatbot style and tone tuning
Code-generation tuning for a specific framework
Domain tuning for insurance, finance, or healthcare vocabulary

Manual QA angle:

Validate domain language correctness with representative scenarios.

Automation QA/SDET angle:

Build regression prompts and compare output snapshots across model versions.

Typical fine-tuning risks:

Catastrophic forgetting of general capability
Overfitting to narrow training style
Performance gains in one task with regressions in another

3. Alignment and Feedback Loops

After fine-tuning, teams often apply alignment methods so outputs are safer and more useful.

Common alignment mechanisms:

Human preference ranking
Rule-based safety filters
Refusal behavior tuning
Post-processing constraints

This introduces a tension:

Strong safety controls can reduce harmful outputs.
Overly strict controls can block valid, useful outputs.

QA needs to test both sides:

Under-refusal: unsafe content slips through.
Over-refusal: valid requests are rejected.

Manual QA example:

A valid request such as "Summarize the likely causes of a failed login workflow" should not be blocked just because it contains words like "failure" or "security."

Automation QA/SDET example:

Add prompt suites that intentionally probe refusal boundaries and compare behavior across prompt rewrites, user roles, and model versions.

4. Where Feedback Comes From in Real AI Products

In production systems, learning does not come only from the original training run. Teams usually create improvement loops from multiple sources:

Human reviewers score outputs for correctness, clarity, and safety.
Evaluation datasets reveal repeated failure patterns.
Support tickets and production incidents expose real user pain points.
Prompt logs show where users rephrase the same request because the first answer was weak.

For QA professionals, this is important because feedback loops shape later behavior. If a team reacts only to a narrow class of problems, the next model or prompt version may improve in one area while quietly regressing in another.

5. Evaluation Is Multi-Dimensional

Traditional tests often ask: pass or fail?

AI evaluation must ask multiple questions at once:

Is the output correct?
Is it complete?
Is it safe and policy-compliant?
Is it stable across prompt variants?
Does quality hold across user cohorts and locales?

A practical evaluation matrix:

Dimension	Example check	QA method
Correctness	Are facts and logic valid?	Gold-set comparison + human review
Robustness	Does prompt paraphrasing break quality?	Prompt mutation tests
Safety	Does output violate policy?	Red-team prompts
Format reliability	Does output match JSON schema?	Automated schema validators
Fairness	Is quality consistent across groups?	Stratified evaluation sets

6. Monitoring After Release

AI quality is not fixed at deployment. Input patterns, user behavior, and domain context change.

Post-release monitoring should track:

Hallucination rate
Refusal rate
Task completion quality
Latency and timeout behavior
Incident frequency by prompt category

For QA teams, production monitoring is part of testing, not a separate concern.

7. What QA Teams Should Verify at Each Stage

Lifecycle Stage	Manual QA Focus	Automation / SDET Focus	What to verify before release
Pretraining	Look for uneven quality across topics, locales, and user styles	Validate benchmark coverage and representative datasets	Known weak areas are documented and visible to the team
Fine-tuning	Check that domain language and expected behavior match product needs	Run regression prompts against current and previous model behavior	Improvements in one scenario do not break another scenario
Alignment	Test both unsafe acceptance and unnecessary refusal	Automate policy and refusal-boundary prompt packs	Safe behavior is consistent without blocking normal use cases
Evaluation	Review correctness, completeness, and clarity of outputs	Run schema, rubric, and mutation tests	Acceptance thresholds are defined and met
Monitoring	Review production failures and real-user complaints	Track drift, incident patterns, and alert thresholds	Rollback and re-evaluation paths are ready

QA/SDET Relevance

How this lifecycle maps to test work:

Pretraining stage: audit dataset assumptions and coverage claims.
Fine-tuning stage: run domain regression suites and failure clustering.
Alignment stage: test refusal boundaries and policy consistency.
Evaluation stage: maintain benchmark datasets and acceptance thresholds.
Monitoring stage: define alert thresholds and rollback criteria.

A practical team pattern:

Maintain a versioned prompt-and-expected-output test pack.
Run automated checks on every model or prompt-template change.
Add human review for high-risk scenarios such as legal, finance, and security advice.

One practical QA mindset shift:

In traditional testing, we often validate logic that developers intended to be deterministic.
In AI testing, we also validate behavior that emerges from data, tuning, and feedback loops.
That means reproducibility, evaluation sets, and ongoing monitoring become part of normal quality engineering, not optional extras.

Practical Work

Exercise: Build a Mini AI Evaluation Harness

Objective: Create a repeatable quality gate for generated outputs.

Use this sample Python script to score response quality from inline sample cases first. Once the script makes sense, you can replace those samples with cases from your own project or load them from a JSON file later.

python

48 lines

1from dataclasses import dataclass
2
3@dataclass
4class EvalCase:
5    prompt: str
6    output: str
7    expected_keywords: list[str]
8    forbidden_keywords: list[str]
9
10
11def score_case(case: EvalCase) -> dict:
12    text = case.output.lower()
13    coverage = sum(1 for k in case.expected_keywords if k.lower() in text)
14    forbidden_hits = sum(1 for k in case.forbidden_keywords if k.lower() in text)
15
16    return {
17        "coverage_score": coverage / max(1, len(case.expected_keywords)),
18        "safety_score": 1.0 if forbidden_hits == 0 else 0.0,
19        "forbidden_hits": forbidden_hits,
20    }
21
22
23def run_eval(cases: list[EvalCase]) -> None:
24    for i, case in enumerate(cases, start=1):
25        result = score_case(case)
26        print(f"Case {i}")
27        print(f"  Coverage: {result['coverage_score']:.2f}")
28        print(f"  Safety:   {result['safety_score']:.2f}")
29        print(f"  Violations: {result['forbidden_hits']}")
30
31
32if __name__ == "__main__":
33    sample_cases = [
34        EvalCase(
35            prompt="Generate API tests for password reset flow",
36            output="Include OTP expiry checks, invalid OTP retries, and lockout tests.",
37            expected_keywords=["otp", "expiry", "lockout"],
38            forbidden_keywords=["disable security", "bypass auth"],
39        ),
40        EvalCase(
41            prompt="Summarize production issue",
42            output="Likely timeout spike after deployment. Recommend rollback and incident timeline review.",
43            expected_keywords=["timeout", "deployment", "rollback"],
44            forbidden_keywords=["ignore users", "hide issue"],
45        ),
46    ]
47
48    run_eval(sample_cases)

Steps:

Add 10 real prompts from your QA workflow.
Define expected and forbidden terms per case.
Run scoring on current model outputs.
Improve prompt templates and compare scores.
Create acceptance thresholds for release gating.

Suggested QA scenarios to try:

Password reset assistant should mention OTP expiry, retry limits, and lockout behavior.
Bug triage summarizer should preserve severity and impact without inventing root causes.
Test-case generator should include both positive and negative cases for file upload validation.
Support assistant should refuse unsafe requests but still answer legitimate diagnostic questions.

Reflection:

Which failures were not visible from a quick manual read?
Did stricter safety checks reduce helpfulness?
Which cases require mandatory human review regardless of score?

YouTube Resources

3Blue1Brown video thumbnail for a short explanation of large language models.

What this helps with: A clear visual explanation of how large language models are trained to predict the next token, which supports the pretraining and model-behavior ideas in this lesson.

Google Cloud Tech video thumbnail about tuning foundation models.

What this helps with: A practical introduction to model tuning, which reinforces the fine-tuning stage and helps learners connect training adjustments to real product behavior.

Key Takeaways

AI systems learn in stages, and each stage introduces unique failure modes.
QA must test data quality assumptions, not only final outputs.
Evaluation is multi-dimensional: correctness, robustness, safety, format, and fairness.
Continuous monitoring is required because quality can drift after release.
Human judgment remains essential for high-impact and ambiguous scenarios.

Next Step

Next, continue to Understanding LLMs to build a clearer mental model of tokens, context windows, embeddings, and transformers. If that lesson is not available yet, pause here and make sure the lifecycle, evaluation, and monitoring ideas from this lesson feel comfortable first.