Create ingredients, not finished lessons
The fundamental shift in AI-native education: static content can't be personalized. If the lesson is pre-written, there's no room to adapt pace, examples, or remediation. So the system should produce structured ingredients — and let the AI teacher assemble them for each learner in real time.
Content ingestion is the system that creates those ingredients at scale: from expert knowledge and existing materials into structured, reviewable, runtime-ready teaching components.
[Diagram: expert knowledge + existing materials → THIS SYSTEM → structured ingredients per lesson → AI teacher adapts in real time]
Metaphor: don't bake the cake and hand it over. Prepare the ingredients, then bake it together with the learner.
Manual ingredient creation is the bottleneck
An AI teacher can only adapt if it has the right ingredients to work with. Right now, creating those ingredients — examples, exercises, likely misunderstandings, hint ladders, contextual variants — is mostly manual and takes weeks per lesson.
- Expert designs exercises and datasets — slow, expensive
- Likely misunderstandings are tribal knowledge, rarely written down
- Contextual variants (domain-specific examples) created ad hoc
- Delivery — AI teacher delivers adaptively from ingredients
- Instruction design — AI helps create the ingredients themselves ← the next big lever
- Curriculum design — AI determines which concepts to include for a given audience ← future
Before any scaffold system — the absolute minimum that already works:
That's it. One LLM, one DB table for profile, one loop. The DB state IS the memory — no chat history needed as context. Each interaction overwrites the relevant profile fields, and those fields become the prompt context for the next turn.
This already beats "chatbot beside static content" because the system tracks what the learner struggles with and adapts. But it's fragile — no structured scaffold, no separation of validation from generation, no observability. That's why we evolve toward the scaffold approach.
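The minimal loop above can be sketched in a few lines. This is a sketch under stated assumptions: `call_llm` is a stub standing in for any chat-completion API, and a plain dict stands in for the single profile row — all names are illustrative.

```python
# Minimal adaptive loop: one LLM, one profile "table", one loop.
# The dict `profile` plays the role of the single DB row that IS the memory.

def call_llm(prompt: str) -> str:
    # Stub: a real system would call a chat-completion endpoint here.
    return f"(tutor reply conditioned on: {prompt[:60]})"

profile = {"struggles_with": None, "pace": "normal"}  # the one DB row

def turn(learner_answer: str, answer_correct: bool) -> str:
    # 1. Overwrite the relevant profile fields based on this interaction.
    if not answer_correct:
        profile["struggles_with"] = learner_answer
        profile["pace"] = "slow"
    # 2. The profile fields — not chat history — are the prompt context
    #    for the next turn.
    prompt = (
        f"Learner profile: {profile}. "
        f"Their last answer: {learner_answer!r}. Teach the next step."
    )
    return call_llm(prompt)

reply = turn("SELECT price > 50", answer_correct=False)
```

Each turn overwrites the profile and feeds it back in, so adaptation survives without any conversation transcript.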
- SELECT · WHERE · ORDER BY · LIMIT
- COUNT / SUM / AVG · GROUP BY basics
- Each existing lesson → structured adaptive scaffold
- Out: joins, subqueries, window funcs, curriculum-wide planning, multi-agent
- Existing lesson assets + SME material
- Lesson decomposition → structured scaffold
- Critique + clarification + review + export
- Workflow metrics · release gates · minimal learner profile (success rate + pacing)
[Diagram: domain truth → scaffold builder → scaffold approval → current ops + runtime]
- Decompose existing lessons into structured scaffold elements
- Identify likely misunderstandings, common errors, define hint ladders
- Explicit scaffold approval → validate → export stable package
- Pedagogical adequacy — scaffold must be educationally meaningful, not just valid
- Trust / bounded inference — no unsupported additions for misconceptions or hints
- Reviewer efficiency — reduce manual decomposition, not create cleanup
- Structural correctness · rerun stability · observability · cost
Meaning: a scaffold can be schema-valid but still teach badly. Example: a WHERE lesson scaffold lists "likely misunderstanding: student confuses SELECT with WHERE" — that's too vague to be useful. An adequate scaffold says: "student writes SELECT price > 50 instead of WHERE price > 50 — confuses column selection with row filtering."
How to measure: expert review acceptance rate on inferred elements. If experts reject >30% of model-generated misunderstandings, the scaffold quality is too low.
Meaning: the system must not confidently fill scaffold fields it doesn't have evidence for. Example: model infers "students commonly confuse GROUP BY with DISTINCT" — but the source material never mentions this. If confidence metadata says "high" on an unsupported claim, downstream systems will treat it as ground truth.
How to measure: unsupported high-confidence addition rate — % of scaffold elements marked "high confidence" that have no source reference. Target: < 5%.
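That metric is a one-liner over scaffold metadata. A minimal sketch, assuming each element carries `confidence` and `source_ref` fields (names illustrative):

```python
def unsupported_high_conf_rate(elements: list[dict]) -> float:
    """% of elements marked "high confidence" that have no source reference."""
    high = [e for e in elements if e.get("confidence") == "high"]
    if not high:
        return 0.0
    unsupported = [e for e in high if not e.get("source_ref")]
    return 100.0 * len(unsupported) / len(high)

elements = [
    {"confidence": "high", "source_ref": "lesson_3.md#filtering"},
    {"confidence": "high", "source_ref": None},    # unsupported claim
    {"confidence": "medium", "source_ref": None},  # fine: hedged
]
rate = unsupported_high_conf_rate(elements)  # 50.0 — far above the < 5% target
```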
Meaning: the system should save reviewers time, not create new cleanup work. Example: if scaffold builder generates 20 hint variants but 15 are redundant, reviewer spends more time pruning than writing from scratch.
How to measure: reviewer touch time per lesson. Baseline: 4h manual decomposition. Target: < 1.5h with scaffold assist. Track time per resolved issue.
Meaning: no broken references, missing required fields, orphaned elements. Every exercise template must link to an objective. Every hint must link to an error type.
How to measure: schema validation pass rate on export. Target: 100% — this is a hard gate, not a metric to optimize.
"Pedagogically weak" ≠ "explanation is ugly." It means the learning path is weak as an instrument — learner can go through content without reliably gaining understanding.
Upstream rubric (5 axes, rated 1–5 by expert):
- Objective clarity — Bad: "understand SQL filtering" · Good: "use WHERE to filter rows by condition"
- Sequencing — Bad: GROUP BY before basic aggregation · Good: filter → aggregate → group
- Misconception anticipation — Does the scaffold anticipate filtering vs sorting confusion?
- Exercise alignment — Does the exercise test the objective, or just pattern completion?
- Transfer — One direct exercise ≠ mastery; need a transfer check
Downstream validation (runtime proves it):
If scaffold is pedagogically weak, runtime will show: repeated confusion at same objective · high direct-task pass but poor transfer · collapse after advancement · hints used too heavily · later revisits of "mastered" material.
Skill / concept
- Building block knowledge
- "Filtering rows"
- "Sorting rows"
- "Aggregation basics"
Learning objective
- Observable capability
- "Use WHERE to filter rows"
- "Distinguish filter from sort"
- "Use GROUP BY with aggregates"
Input side:
- Ingest SME notes, existing lesson scripts, examples, exercises, hints, glossary, legacy assets
- Normalize into stable source pack with provenance
Transformation:
- Identify lesson objective · extract explanation blocks · extract/map examples · extract exercise templates
- Identify likely misunderstandings · infer common error patterns · define hint ladders · define remediation patterns
- Add reinforcement tags · contextualization slots · preserve provenance + confidence
Review + export:
- Scaffold review · issue tracking · one clarification round · explicit approval · stable versioned export
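The elements named above suggest a scaffold shape. One possible sketch with dataclasses — these field names are an assumption consistent with the document, not the actual export contract:

```python
from dataclasses import dataclass, field

@dataclass
class HintStep:
    error_type: str            # every hint links to an error type
    base_hint: str
    clarification: str
    corrective_example: str

@dataclass
class ExerciseTemplate:
    objective_id: str          # every exercise template links to an objective
    prompt: str
    solution_pattern: str

@dataclass
class LessonScaffold:
    objective: str             # e.g. "Use WHERE to filter rows"
    explanation_blocks: list[str] = field(default_factory=list)
    exercise_templates: list[ExerciseTemplate] = field(default_factory=list)
    likely_misunderstandings: list[str] = field(default_factory=list)
    hint_ladder: list[HintStep] = field(default_factory=list)
    provenance: dict = field(default_factory=dict)  # element id → source + confidence
```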
Lesson shell stays — internals become structured
Don't rebuild the course. Don't throw away the lesson format. Enrich each lesson into an adaptive teaching unit.
Batch pipeline: trigger → queue → async workers → review gate → export.
One lesson node becomes a structured adaptive teaching unit:
tool_type_assets — each ingredient specifies its delivery format: slide_deck, chart, code_editor, interactive_figure, video_ref, diagram (mermaid). The AI teacher selects the right tool for each interaction.
Not all ingredients need to be authored from scratch. The pipeline supports three source modes:
Extract
- Decompose existing lesson text into explanation blocks
- Pull exercises from current course
- Extract code templates from existing examples
- Map existing video segments to objectives
Source: existing course content
Discover
- Find relevant public videos / tutorials
- Surface documentation pages for reference
- Identify real-world datasets for exercises
- Link to community examples (StackOverflow, GitHub)
Source: external knowledge
Generate
- Draft likely misunderstandings from error corpora
- Generate diagrams (mermaid) for concept visualization
- Adapt code templates to learner's domain context
- Create exercise variants at different difficulty levels
Source: LLM + scaffold context
Each ingredient carries a source_type tag (extracted / discovered / generated) + confidence. Generated ingredients always route through expert review. Extracted ones may auto-approve if source is trusted.
If likely misunderstandings or hint patterns aren't explicitly authored, the model drafts them. Reviewable, not auto-shipped.
Adapt examples and exercise surfaces to learner's context — shopping products, travel, phone models, employee data.
For each error type, scaffold holds base hint + clarification + corrective example. Model adapts to current task wording and learner context — few-shot, grounded in scaffold data.
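Escalation through the ladder can be sketched like this — error-type names and hint wording are illustrative: attempt 1 gets the base hint, attempt 2 the clarification, attempt 3+ the corrective example.

```python
# Per-error-type ladder: (base hint, clarification, corrective example).
LADDER = {
    "where_vs_select": (
        "Which clause filters rows?",                                  # base hint
        "SELECT picks columns; WHERE filters rows.",                   # clarification
        "WHERE price > 50 keeps rows; SELECT price shows a column.",   # corrective example
    ),
}

def next_hint(error_type: str, repeat_count: int) -> str:
    """Escalate as the same error type repeats; stay on the last step after that."""
    steps = LADDER[error_type]
    return steps[min(repeat_count, len(steps) - 1)]
```

At runtime the model would rewrap the selected step in the current task's wording; the ladder itself stays grounded in scaffold data.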
Shorter / slower / more example-driven / more formal versions. Runtime selects based on learner pace signals.
The pipeline doesn't only produce text — it finds and creates multi-modal ingredients:
- Video refs: link relevant existing course videos or external explainers as ingredients
- Diagrams: generate concept maps, flow diagrams (Mermaid/SVG) from scaffold structure
- Code templates: extract from existing courses or generate starter code, adapted to lesson context
- Interactive figures: step-through visualizations for complex concepts (state machines, data flows)
- Charts: data visualizations with realistic data that illustrate the concept
Each asset tagged with tool_type — the AI teacher knows which visual tool to use at delivery.
- Pre-generate at build: contextual variants + explanation variants generated during ingestion, not at runtime. Cached per lesson × domain
- Runtime model calls: only for truly dynamic responses (error adaptation to current wording). Smaller model with scaffold as context
- Token budget: per-lesson enrichment budget; overshoot → flag for manual completion
- Fallback: if enrichment fails → serve base scaffold elements. Never leave learner with nothing
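The cache-plus-fallback rule from the bullets above, sketched under assumed names — the point is that a cache miss degrades to base scaffold elements, never to nothing:

```python
# Build-time cache of pre-generated variants, keyed by lesson × domain.
_variant_cache: dict[tuple[str, str], list[str]] = {}

def get_contextual_variants(lesson_id: str, domain: str,
                            base_scaffold: list[str]) -> list[str]:
    """Serve cached variants; on miss, fall back to the base scaffold."""
    cached = _variant_cache.get((lesson_id, domain))
    if cached:
        return cached
    return base_scaffold  # enrichment missing or failed → base elements still ship

# Populated during ingestion, not at runtime:
_variant_cache[("sql_where", "travel")] = [
    "Filter flights cheaper than $200 with WHERE."
]
```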
All three work because the scaffold contains the right elements — misunderstandings, hint ladders, remediation patterns, contextualization slots.
Ingestion is a batch pipeline, not a real-time service. Author/reviewer triggers a run, system processes async.
Storage map:
Postgres
- Scaffold objects + run state
- Review decisions + issues
- Provenance + trace index
S3 / Object Store
- Raw uploads + source packs
- Exported packages
- Generated artifacts + traces
Queue (SQS / Celery)
- Build / critique / validation
- Export / eval jobs
- Dead letter → ops alert
Scaling:
- Horizontal workers: build/critique/validation independent per lesson — scale behind queue
- LLM cost tiering: builder = stronger model; critic = cheaper model + deterministic pre-filters. Token budget per lesson
- Read replicas: Postgres replicas for reviewer dashboards; single-primary write
- Cache: Redis for shared error pattern library + skill taxonomy — avoids re-inference across lessons
Reliability:
- Idempotent stages: retry-safe. Run ID + stage checkpointing
- DLQ: failed jobs → dead letter queue → ops alert, not silent loss
- Fallback: LLM fails → export base scaffold with explicit gaps. Never confident nonsense
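Idempotent stages with checkpointing reduce to a guard per (run ID, stage). A minimal in-memory sketch — a real system would persist checkpoints in Postgres:

```python
# Retry-safe stage execution: each (run_id, stage) checkpoints once,
# so a retried job skips work it has already completed.
_checkpoints: set[tuple[str, str]] = set()

def run_stage(run_id: str, stage: str, work) -> str:
    key = (run_id, stage)
    if key in _checkpoints:
        return "skipped"       # idempotent: rerun is a no-op
    result = work()            # if this raises, no checkpoint is written
    _checkpoints.add(key)      # checkpoint only after success
    return result
```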
Observability:
- Structured logs: per-stage JSON → Datadog / CloudWatch / ELK
- Metrics dashboard: stage latency, token cost, scaffold completeness → Prometheus + Grafana
- Run traces: source → inferred elements → confidence → reviewer actions → export. S3 artifact, Postgres index
- Alerts: stability drift / export failure spike → auto-hold releases → PagerDuty / Slack
- Audit log: every reviewer decision + model/prompt version — immutable, compliance-ready
Deterministic
- Schema validation
- Export completeness
- Provenance integrity
- Rerun stability · release blockers
Model-based
- Lesson decomposition
- Misunderstanding inference
- Hint + remediation drafting
- Contextual examples · clarification Qs · explanation variants
Human judgment
- Lesson depth / scope
- Misconception correctness
- Hint quality review
- Remediation quality · approval gates
- Schema: every scaffold element present and typed correctly
- Completeness: objective has ≥1 explanation, ≥1 exercise, ≥1 hint step
- Provenance: every inferred element links to source or explicit "model-generated" marker
- Export contract: runtime expects a specific scaffold shape — validate against it
- Config-driven: domain teams add checks without engineering
- Severity: blocker (stops export) vs warning (flags reviewer)
- Regression suite: golden scaffolds per domain; CI validates on schema change
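A config-driven check list with blocker/warning severity might look like this — the check names are examples, not the real config:

```python
CHECKS = [
    # (name, severity, predicate over scaffold dict) — domain teams edit this list
    ("objective_present", "blocker", lambda s: bool(s.get("objective"))),
    ("has_exercise", "blocker", lambda s: len(s.get("exercise_templates", [])) >= 1),
    ("has_transfer_check", "warning", lambda s: s.get("transfer_check") is not None),
]

def validate(scaffold: dict) -> tuple[bool, list[str]]:
    """Return (export_allowed, issues). Blockers stop export; warnings flag the reviewer."""
    issues, export_allowed = [], True
    for name, severity, check in CHECKS:
        if not check(scaffold):
            issues.append(f"{severity}: {name}")
            if severity == "blocker":
                export_allowed = False
    return export_allowed, issues
```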
Exportable scaffold that's pedagogically weak
Structurally valid, complete — but explanations miss the objective, errors are wrong, hints are shallow, contextual adaptation distracts rather than helps.
Schema: ✓ valid. All fields present. Exports clean. But:
hint_ladder says: "remember to use WHERE" — restates the problem, doesn't teach
exercise_template uses ORDER BY as the scenario — but lesson is about WHERE, exercise doesn't test the objective
context_adaptation picks "quantum physics data" — confuses a SQL beginner with unfamiliar domain
Each field is filled. Schema passes. But a learner using this scaffold will not actually learn WHERE correctly.
How to catch it: expert review + exercise-objective alignment check (does the exercise actually test the stated objective?) + misunderstanding specificity score (is the error description actionable enough to generate a useful hint?).
Adaptive-ready lesson structure > broad automated generation
- Reuse existing course shell
- Better adaptive inputs
- Faster time to believable next product version
- Give up: surface magic, full generation breadth, impression of autonomy
- Repeated manual interpretation — the core problem
- Weak lesson decomposition — existing lessons not broken into elements
- Poorly specified misunderstandings / hints — no one has written them
- Messy material · weak exercise alignment · weak instrumentation · reviewer bottleneck
- Decomposition quality: domain-specific decomposition templates per lesson type; model fine-tuned on best decompositions
- Missing misunderstandings: model drafts from common SQL error corpora; reviewer approves/rejects; approved ones seed future scaffolds
- Reviewer bottleneck: progressive review — reviewer sees scaffold at 50%, 100%, not just at end
- Instrumentation: per-stage latency + token cost + human touch time from day 1
- Reuse: hint ladders and error patterns shared across lessons for similar SQL concepts
Every scaffold element traces to a source. Example: hint_ladder[2] → "generated by model from exercise_template[1], confidence: medium, no source reference." Reviewer knows this is inferred, not authored.
Does the exercise actually test the stated objective? Example: lesson objective is "filter rows with WHERE." If the exercise asks to sort results (ORDER BY), alignment score = 0. Check: extract the SQL operation from the exercise, compare to the objective verb.
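A crude version of that check, assuming single-operation exercises (a real lesson would need a proper SQL parser, and the verb→operation map here is illustrative):

```python
import re

OBJECTIVE_OPS = {"filter": "WHERE", "sort": "ORDER BY", "group": "GROUP BY"}

def alignment_score(objective: str, exercise_sql: str) -> int:
    """1 if the exercise uses the operation named in the objective, else 0."""
    for verb, op in OBJECTIVE_OPS.items():
        if verb in objective.lower():
            return 1 if re.search(op, exercise_sql, re.IGNORECASE) else 0
    return 0  # objective names no known operation → can't confirm alignment

# "filter rows with WHERE" vs an ORDER BY exercise → misaligned
score = alignment_score("filter rows with WHERE",
                        "SELECT * FROM products ORDER BY price")
```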
For known common errors in this topic area, how many does the scaffold address? Example: WHERE has 3 known confusions (SELECT vs WHERE, ORDER BY vs WHERE, string quoting). Scaffold covers 2 → coverage = 67%.
% of clarification questions that actually changed the scaffold. Example: system asked SME 4 questions, 3 led to scaffold edits → yield = 75%. If yield < 40%, question planner is too noisy.
LangSmith / LangFuse
- LLM-specific traces
- Prompt → completion → latency
- Token cost per call
- Eval scoring per output
Use for: debugging model behavior, prompt iteration, quality scoring of scaffold outputs.
OpenTelemetry
- Cross-service request tracing
- Span-level latency
- Error propagation
- Correlation IDs
Use for: end-to-end pipeline tracing, finding bottleneck stages, SLA monitoring.
Prometheus + Grafana
- Aggregate metrics
- Dashboard for ops
- Alerting rules
- Historical trends
Use for: operational dashboards, release gate metrics, token cost tracking over time.
- One lesson-scaffold schema
- One decomposition workflow
- One clarification loop + one validator
- One stable export contract + instrumentation
- Repeated scaffold gaps: which elements are missing most? (misunderstandings, hints, transfer checks)
- Clarification yield: which questions actually improve the scaffold?
- Reviewer burden: where is human time going?
- Validation failures · rerun instability
- Postgres: scaffold objects, review state, run state, approval decisions, provenance. JSONB flexibility where needed
- S3: raw files, source packs, exported packages, generated artifacts, traces
- Queue: SQS/Celery for extraction, critique, validation, export jobs
- Reliability: if enrichment is weak → leave explicit gap, don't silently overfill. If export validation fails → no release
- Scaling hotspots: extraction cost, critique cost, clarification generation, reviewer queue throughput
- Traceability: source inputs → inferred elements → confidence → reviewer decisions → export lineage → rerun comparisons
"Preserve the existing lesson shell, transform each lesson into a structured adaptive scaffold, use the model to enrich and personalize that scaffold where useful, keep integrity and trust deterministic, and only then broaden into more autonomous and graph-driven adaptation."
Scaffold schema detail, decomposition service, critique engine, clarification planner, storage shape.
Orchestration, async boundaries, framework choices, LLM cost strategy, representative decomposition cases.
The adaptive runtime surfaces signals that make ingestion better over time:
[Diagram: runtime → new error patterns / unmatched errors → back to scaffold → next version]
- New error types: runtime logs errors that don't match any common_error_types → analytics flags these → ingestion adds them to the scaffold in the next revision
- Weak hints: if learners consistently ignore a hint (hint shown but error persists), that hint is ineffective → flag for re-authoring
- Missing misunderstandings: if runtime's remediation branch fires often for errors not in likely_misunderstandings, that's a scaffold gap
- Contextualization feedback: if learners in "travel" context perform better than "quantum physics" context, that's evidence for contextualization quality
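The weak-hint signal can be computed directly from runtime event logs — the event shape here is an assumption:

```python
def ineffective_hints(events: list[dict], threshold: float = 0.7) -> set[str]:
    """Flag hints where the same error persists after the hint was shown."""
    shown: dict[str, int] = {}
    persisted: dict[str, int] = {}
    for e in events:
        hint = e["hint_id"]
        shown[hint] = shown.get(hint, 0) + 1
        if e["error_repeated_after_hint"]:
            persisted[hint] = persisted.get(hint, 0) + 1
    return {h for h, n in shown.items()
            if persisted.get(h, 0) / n >= threshold}
```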
- Phase 1: one SQL domain, lesson scaffold, basic enrichment, prove quality + trust
- Phase 2: broader SQL → multi-domain, richer misconception taxonomies, stronger contextual adaptation, more explanation variants
- Phase 3: cross-lesson graph structure, deeper reinforcement (forgetting curve), tenant-aware scaffolds (corporate training), more autonomous authoring assistance
- Learner profile evolution: v1 = success rate + pacing → v2 = interest domain + error patterns → v3 = full learning trajectory + retention curves
- Fallback: if enrichment fails → serve base scaffold. Never leave learner empty
- Framework: start from loop, not framework; LangChain for utilities; LangGraph if staged loops + HITL needed
- Corporate training: same scaffold but contextualization slots filled with company-specific data (internal tools, domain terms)
Where this goes in 12–18 months:
Synthetic learner agents — representing different backgrounds, error patterns, learning speeds — walk through each lesson before real learners see it. They expose weak hints, missing misunderstandings, and broken remediation paths. Combined with human expert review, this creates a continuous auto-improvement loop: agent feedback → scaffold patch → re-test → ship.
The platform gradually becomes autonomously self-improving — each lesson gets better with every cohort.
Instead of explicit "go back and review" — weak spots are reinforced naturally in upcoming lessons. If a learner struggled with WHERE, the next lesson on GROUP BY includes a warm-up exercise that subtly tests filtering. The learner doesn't feel remediated — they feel like the course flows naturally.
Track per-topic retention decay. Schedule reinforcement not by calendar but by predicted forgetting point. A fast learner on WHERE might need review in 7 days. A struggling learner needs it in 2 days. The system knows.
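One way to schedule by predicted forgetting point is a simple exponential decay model — the constants below are illustrative, not empirically fitted:

```python
import math

def days_until_review(strength: float, threshold: float = 0.6) -> float:
    """
    Retention R(t) = exp(-t / strength); schedule review when R decays
    to `threshold`. `strength` (in days) is higher for learners who
    mastered the topic quickly.
    """
    return -strength * math.log(threshold)

fast_learner = days_until_review(strength=14.0)  # ~7 days
struggler = days_until_review(strength=4.0)      # ~2 days
```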
The ultimate metric: how fast can this specific learner genuinely master this specific concept? Not by skipping content, but by optimizing delivery:
- Break complex concepts into cognitively optimal chunks
- Choose examples that connect to existing mental models (from learner profile)
- Interleave practice with explanation at the right ratio for this learner's pace
- Use contrast examples at the exact point where confusion is most likely
Background in neuroscience and behavioral research is directly relevant here — understanding how people process information, what drives attention, and how to structure material for maximum retention across different cognitive profiles.
Not every learner needs the same sequence. Some learn GROUP BY better after seeing a real-world analytics example first. Others need the formal syntax first. The scaffold contains multiple valid paths, and the runtime selects based on learner signals.
The final frontier: given an audience and learning objectives, the system generates a complete course — syllabus, skeleton, ingredients, validation — with human experts only at approval gates.
Human expert stays at three gates:
- Syllabus approval: "Are these the right topics in the right order for this audience?"
- Skeleton approval: "Are the skill breakdowns and prerequisite assumptions correct?"
- Issue review: "Agent learners found these weak spots — are the fixes good?"
What this means in practice:
- New course from objectives to teachable: days, not months
- Expert time shifts from creation to validation — 80% less manual work
- Agent-tested before any real learner sees it — fewer surprises in production
- Same framework works for any domain — SQL today, leadership tomorrow