The AI doesn't assist — it teaches
A chatbot waits to be asked a question. An AI teacher actively delivers the learning — decides what to explain, which example to show, when to assess, when to remediate, and when to advance. It uses structured ingredients to adapt every interaction to this learner.
(diagram labels: ingredients · adapts in real time · tracks + adjusts · explain / assess / remediate)
Static content can't be personalized. The AI assembles each lesson interaction from ingredients, adapted to this learner's context, pace, and gaps.
- How to explain this concept (style, depth, examples from learner's domain)
- Which exercise to give and at what difficulty
- How much support to provide (hints, scaffolding)
- Whether to slow down or speed up · when to reinforce a topic later
The AI teacher doesn't just chat — it uses tools: slides it controls, code editors, interactive figures (step-through visualizations), charts, videos. Each ingredient specifies which tool type to use. The result feels like a live lesson, not a text conversation.
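The ingredient idea above can be sketched as a small schema. This is an illustrative assumption, not the actual data model: field names, the `kind`/`tool_type` vocabularies, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one "ingredient": the structured unit the AI
# assembles lessons from. Every field name here is illustrative.
@dataclass
class Ingredient:
    concept_id: str        # which concept this ingredient teaches
    kind: str              # e.g. "explanation" | "exercise" | "hint" | "remediation"
    tool_type: str         # e.g. "slide" | "code_editor" | "interactive_figure" | "video"
    base_content: str      # approved static content, doubles as the fallback
    difficulty: int = 1    # 1..5, chosen per learner
    domain_tags: list = field(default_factory=list)  # for contextualized examples

sql_where = Ingredient(
    concept_id="sql.where",
    kind="exercise",
    tool_type="code_editor",
    base_content="Filter the orders table to rows where total > 100.",
    difficulty=2,
    domain_tags=["e-commerce"],
)
```

Note that `base_content` serves double duty: it is both the authored default and the thing the fallback path serves when adaptation fails.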
One LLM, one DB table for profile, one loop. The DB state IS the memory. Each interaction overwrites relevant profile fields → those fields become context for the next turn.
- Mostly fixed lesson sequence (stable macro-path)
- Local adaptation within each lesson (explanation, exercises, pacing)
- Compact learner profile + reinforcement scheduling
- Out: graph-wide replanning, large retrieval, full learner modeling, product shell redesign
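The "one LLM, one DB table, one loop" shape can be sketched in a few lines. All helper logic below is simplified for illustration; the point is the data flow: each turn overwrites profile fields, and the overwritten row is the context for the next turn.

```python
# Minimal sketch of the single adaptive loop. The learner profile is one DB
# row (modeled here as a dict). Checker and update rules are toy versions.

def validate(answer, expected):
    # toy checker: the real version classifies syntax vs logic vs misconception
    return "strong" if answer.strip() == expected else "negative"

def update_profile(profile, topic, evidence):
    conf = profile["confidence"].get(topic, 0.0)
    delta = {"strong": 0.3, "medium": 0.15, "weak": 0.05, "negative": -0.2}[evidence]
    profile["confidence"][topic] = max(0.0, min(1.0, conf + delta))
    return profile

def lesson_turn(profile, topic, answer, expected):
    evidence = validate(answer, expected)
    profile = update_profile(profile, topic, evidence)
    # next-ingredient selection would read the freshly overwritten fields here;
    # persisting the row is what makes "DB state IS the memory" literal
    return profile

profile = {"learner_id": "L1", "confidence": {}}
profile = lesson_turn(profile, "sql.where",
                      "SELECT * FROM t WHERE x > 1",
                      "SELECT * FROM t WHERE x > 1")
```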
(diagram labels: main user · pedagogy review · traces, safety)
- Adapt explanation style, example choice, task difficulty per learner
- Validate responses → qualify evidence strength → update compact profile
- Schedule reinforcement based on topic stability + forgetting curve
- Pedagogical correctness — don't advance on weak evidence
- Graceful fallback — if adaptation fails, serve stable approved lesson block
- Observability — what branch was chosen, why, what evidence
- Runtime responsiveness · privacy/tenant safety · cost efficiency
Meaning: the system must not advance a learner who got the right answer only with heavy hints. Example: learner solves WHERE exercise but only after 3 hints → evidence = weak positive, NOT enough to advance. Require independent transfer check.
How to measure: premature advance rate — % of learners who advance but fail the next topic's prerequisite check. Target: < 10%.
Meaning: if LLM adaptation fails (timeout, poor output), learner still gets a coherent next step. Example: contextual example generation fails → serve the base example from ingredients. Never silence, never "something went wrong."
How to measure: fallback rate — % of interactions that use fallback path. Target: < 5%. If higher → investigate LLM health.
Meaning: for every learner interaction, the team can answer: what branch was chosen, why, what evidence was used, how the profile changed. Without this, "it feels adaptive" can't be turned into "it IS adaptive."
How to measure: trace completeness — % of interactions with full decision trace. Target: 100% (hard requirement, not metric to optimize).
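The three gate metrics above can be computed directly from decision traces. A minimal sketch, assuming each interaction is recorded as a dict; the field names (`passed_next_prereq`, `used_fallback`, etc.) are assumptions, not an actual trace schema.

```python
# Sketch: the three gate metrics from interaction traces (field names assumed).

def premature_advance_rate(advancements):
    # advanced learners who then failed the next topic's prerequisite check
    if not advancements:
        return 0.0
    failed = sum(1 for a in advancements if not a["passed_next_prereq"])
    return failed / len(advancements)          # target: < 0.10

def fallback_rate(interactions):
    if not interactions:
        return 0.0
    used = sum(1 for i in interactions if i["used_fallback"])
    return used / len(interactions)            # target: < 0.05

def trace_completeness(interactions):
    # hard requirement: every interaction carries the full decision trace
    required = {"branch", "reason", "evidence", "profile_delta"}
    full = sum(1 for i in interactions if required <= i.keys())
    return full / len(interactions)            # target: 1.0, always
```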
- Learner runtime: availability + low latency + fallback
- Profile updates / policy: correctness + consistency
- Analytics: eventual consistency OK
- Tenant boundaries: hard correctness
Different layers have different reliability contracts.
Three onboarding fields (context):
Performance signals (updated each interaction):
Explicit profile, NOT chat history as truth. Three context fields + performance signals = enough for meaningful adaptation.
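The source fixes the profile's shape (three onboarding context fields plus per-topic performance signals) but not the exact field names; the names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical compact learner profile: one row, explicitly typed fields,
# NOT chat history. All field names are assumptions.
@dataclass
class LearnerProfile:
    learner_id: str
    # onboarding context (three fields; names illustrative)
    domain: str = ""       # learner's field, drives example contextualization
    goal: str = ""
    experience: str = ""
    # performance signals, overwritten each interaction
    confidence: dict = field(default_factory=dict)    # topic -> 0..1
    weak_topics: set = field(default_factory=set)
    pace: str = "normal"                              # "slow" | "normal" | "fast"
    reinforce_at: dict = field(default_factory=dict)  # topic -> next review due
```

Keeping the profile this small is deliberate: every field is something the next turn's prompt can actually consume.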
1. Checker
- Correctness
- Syntax vs logic error
- Misconception-linked?
- Hint dependency
- Transfer success
2. Evidence qualifier
- Weak evidence
- Medium evidence
- Strong evidence
- Negative evidence
3. Profile update
- Increase confidence
- Flag weak topic
- Schedule reinforcement
- Slow pacing
- Advance / retry
Real-time learner path (left) + offline pipeline (right).
v1: ingredients tightly linked to each lesson → direct lesson-linked lookup. No retrieval layer needed yet.
Deterministic
- Validation logic + evidence thresholds
- State-transition guards
- Release blockers + tenant isolation
Model-based
- Explanation style adaptation
- Contextualization (learner's domain)
- Hint wording + error response
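The deterministic/model split can be sketched for the SQL example used elsewhere in this document: a real in-memory database settles syntax and result correctness exactly, and only valid-but-wrong answers escalate to a model-based classifier. The table, rows, and `classify_with_llm` placeholder are illustrative assumptions.

```python
import sqlite3

def check_sql(answer, expected_rows):
    # deterministic layer: execute against a tiny in-memory fixture
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 150.0)])
    try:
        rows = conn.execute(answer).fetchall()
    except sqlite3.Error as e:
        return {"kind": "syntax_error", "detail": str(e)}   # exact, no LLM
    if rows == expected_rows:
        return {"kind": "correct"}
    # valid syntax, wrong result -> model-based logic classification
    return {"kind": "logic_error", "label": classify_with_llm(answer, rows)}

def classify_with_llm(answer, rows):
    # placeholder for the model-based step (misconception labeling)
    return "unclassified"
```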
LangSmith / LangFuse
- LLM-specific traces
- Prompt → completion
- Token cost · eval scoring
Debug model behavior, prompt iteration.
OpenTelemetry
- Cross-service tracing
- Span-level latency
- Correlation IDs
E2E request tracing, SLA monitoring.
Prometheus + Grafana
- Aggregate metrics
- Dashboards + alerting
- Historical trends
Ops dashboards, cost tracking.
Coherent-looking but pedagogically wrong path
Learner appears to progress, system sounds adaptive — but mastery is overclaimed, misconceptions survive. Especially: learner passes direct tasks but fails transfer to new context.
Bounded inspectable adaptation > autonomy theater
- Stronger control + simpler debugging
- Safer progression + measurable evidence
- Give up: surface magic, fully dynamic paths, broad autonomy
- Weak learner profile — wrong model = wrong decisions
- Weak validation — can't validate = can't adapt
- Premature advancement — looks done but isn't
- Weak evidence thresholds · poor ingredients · over-remediation · latency/cost
- Validation: domain-specific checker configs; hybrid — deterministic for syntax, model for logic classification
- Premature advance: transfer gate mandatory; delayed retention checks; confidence decay
- Latency: pre-generate exercise variants; cache explanations; async trace writes
- Cost: smaller model for routine; larger for ambiguous; per-interaction token budget
- Fallback cascade: LLM timeout → cached variant → base ingredient → static content. Never silence
- Circuit breaker: on LLM adapter — error rate > threshold → deterministic-only mode
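The last two mitigations compose naturally: the circuit breaker decides whether the LLM is even attempted, and the cascade guarantees a non-empty answer either way. A sketch with illustrative thresholds and helper names:

```python
import time

# Simple circuit breaker on the LLM adapter. threshold/cooldown values
# are illustrative assumptions.
class LLMBreaker:
    def __init__(self, threshold=3, cooldown=60.0):
        self.failures, self.threshold, self.cooldown = 0, threshold, cooldown
        self.opened_at = 0.0

    def is_open(self):                      # True = deterministic-only mode
        if self.failures < self.threshold:
            return False
        if time.monotonic() - self.opened_at > self.cooldown:
            self.failures = 0               # half-open: allow one retry
            return False
        return True

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            self.opened_at = time.monotonic()

def next_content(breaker, llm_adapt, cached_variant, base_ingredient, static_content):
    # cascade: LLM -> cached variant -> base ingredient -> static. Never empty.
    if not breaker.is_open():
        try:
            out = llm_adapt()
            breaker.record(True)
            return out
        except Exception:
            breaker.record(False)
    return cached_variant or base_ingredient or static_content
```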
Learning quality and engagement can diverge. Track both, optimize learning.
- Branch decisions: what did runtime choose, why, what evidence?
- Validation outcomes: syntax vs logic vs misconception; hint dependency
- Profile transitions: when confidence changes, when advancement, when later collapse
- Reinforcement effectiveness · transfer failures = false mastery
- New error types: errors not matching any known pattern → flag for ingredient update
- Weak hints: hint shown but error persists → ineffective, needs re-authoring
- Missing misconceptions: remediation fires for errors not in ingredient set → gap
- Context performance: learners in "travel" domain outperform "quantum physics" → contextualization quality signal
- One stable lesson sequence · one adaptive loop · direct lookup
- Compact learner profile · bounded validation · reinforcement
- Strong traces · strict quality gates
"Start with one bounded adaptive loop: stable lesson sequence, structured ingredients, compact learner profile, explicit validation, strong traces. The AI teaches from ingredients, not from improvisation. Then evolve toward richer progression, retrieval, and more dynamic adaptation — once the bounded loop is proven and measurable."
Services, state, validators, storage, what stays simple in v1.
Orchestration, framework choices, representative flows, fallback, checker outputs.
- Phase 1: fixed lesson sequence, local adaptation, direct lookup, compact profile, bounded validators
- Phase 2: graph-driven progression, richer profile, more branch types, retrieval when asset pool grows
- Phase 3: dynamic path planning, cohort calibration, tenant-aware content, policy A/B
- Fallback: always serve pre-authored content if adaptation fails. Never leave learner empty
- Framework: start from loop; LangChain for utilities; LangGraph if staged loops + HITL
- Storage: Postgres profiles + ingredients; S3 traces; Redis session cache; queue for async
The ultimate metric: how fast can this learner genuinely master this concept? Not by skipping — by optimizing delivery: cognitively optimal chunks, examples connected to existing mental models, interleaved practice at the right ratio, contrast examples at confusion points.
Neuroscience and behavioral research background is directly relevant — understanding how people process information and how to structure material for maximum retention.
Track per-topic retention decay. Reinforce not by calendar but by predicted forgetting point. Fast learner on WHERE → review in 7 days. Struggling learner → 2 days.
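A forgetting-point scheduler can be sketched with a simple exponential-decay model. The model and constants below are illustrative assumptions, chosen only to reproduce the shape of the source's example (a stable topic reviewed around day 7, a fragile one around day 2).

```python
import math

def next_review_days(stability_days, recall_threshold=0.8):
    # assume recall(t) = exp(-t / stability); solve recall(t) = threshold
    # -> review just before predicted recall drops below the threshold
    return -stability_days * math.log(recall_threshold)

def updated_stability(stability_days, evidence):
    # stability grows with strong evidence, shrinks on failure (toy factors)
    growth = {"strong": 2.0, "medium": 1.3, "weak": 1.0, "negative": 0.5}
    return max(0.5, stability_days * growth[evidence])

# fast learner on WHERE (high stability) vs struggling learner (low stability)
fast_interval = next_review_days(31.0)   # ≈ 7 days
slow_interval = next_review_days(9.0)    # ≈ 2 days
```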
Instead of "go back and review" — weak spots reinforced naturally in upcoming lessons. Learner doesn't feel remediated — course flows naturally.
Synthetic learner agents walk through lessons before real learners. Expose weak hints, broken remediation, missing misconceptions. Combined with human review → continuous auto-improvement loop.