Session Architecture: Netflix for Learning Experiences
[CONVICTION]
Netflix's core innovation was not video but turning video into a standardized, packageable, deliverable unit. The session architecture applies the same pattern to interactive learning: the unit is not a video file but a session card -- a self-contained experience that gets packaged, stored on a CDN, and rendered by a universal runtime. The content is alive because the AI conversation within the session is generative. The same session card produces a different experience for every learner, every time. The session card is a score, not a recording.
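To make the "score, not a recording" idea concrete, here is a minimal sketch of what a session card manifest might contain. The type names and fields (SessionCard, MomentRef, AssetRef) are illustrative assumptions, not the actual package schema.

```typescript
// Hypothetical shape of a session card manifest -- field names are illustrative.
// The card is a score: it fixes structure and assets, while the AI conversation
// inside each moment stays generative.
interface SessionCard {
  id: string;              // stable identifier used in the CDN path
  title: string;
  version: string;         // packages are immutable; edits ship as new versions
  adapters: string[];      // which adapter surfaces this session mounts
  moments: MomentRef[];    // ordered score of moments the conductor walks through
  assets: AssetRef[];      // pre-generated audio, animations, curated clips on the CDN
}

interface MomentRef {
  id: string;
  adapter: string;         // e.g. "note-grid" or "visual-stage" (invented names)
  introAudio?: string;     // optional pre-generated narration asset
  promptRef: string;       // the moment prompt injected into the live voice AI
}

interface AssetRef {
  id: string;
  kind: "audio" | "video" | "image" | "animation" | "interactive-spec";
  url: string;             // CDN location
}
```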
The Experience-First Pipeline
[CONVICTION]
The critical design decision, arrived at through iterative stress-testing: experience design comes first, not the script. A pottery session and a Carnatic music session and a volcano exploration are fundamentally different experiences. The script cannot come first because it does not know what the experience is yet. You cannot write "now you try singing the arohanam" until you have decided that the experience surface is a note grid with pitch tracking and tala overlay -- because that surface shapes what the student can do, which shapes what the narration should say.
The production pipeline inverts the traditional film pipeline:
Phase 0 -- Experience Design (new, most consequential). Input: teacher brief. Output: which adapter surfaces this session uses, what the domain model looks like, what interaction modes exist. The creative director (Claude with skills) asks: what does a student DO in this session? Not what do they hear -- what do they do with their hands, their voice, their eyes. This phase picks adapters, defines domain schemas, and establishes the moment structure as a sequence of surfaces. No narration text yet; a sketch of what this phase's output might look like follows the phase list.
Phase 1 -- Blueprint. Locks layout contracts per moment -- which adapter, which layout template, which signals and events. Working from the experience design, not from a script.
Phase 2 -- Script. Now the scriptwriter knows exactly what surfaces exist and what the student can do at each moment. Narration refers to the actual experience: "tap the Sa to hear it ring" only appears because the script knows a note grid adapter with tap-to-hear is mounted at that moment.
Phases 3-6 -- Conductor (moment sequence with branch maps), Production (TTS, content generation, curation -- three parallel tracks), Assembly, and Package.
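As an illustration of the inversion, a hedged sketch of what a Phase 0 artifact could look like for the Carnatic example. The field names are invented, and note what is absent: no narration text anywhere.

```typescript
// Illustrative Phase 0 output for a Carnatic music session -- invented schema.
const experienceDesign = {
  sessionId: "carnatic-arohanam-intro",
  adapters: ["note-grid", "performance"],          // surfaces picked before any script
  domainModel: {
    raga: "Mayamalavagowla",
    swaras: ["Sa", "Ri", "Ga", "Ma", "Pa", "Dha", "Ni", "Sa'"],
    tala: "Adi",
  },
  interactionModes: ["tap-to-hear", "sing-with-pitch-tracking"],
  momentStructure: [
    { id: "m1", surface: "note-grid", studentDoes: "taps each swara to hear it" },
    { id: "m2", surface: "performance", studentDoes: "sings the arohanam with tala overlay" },
  ],
};
```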
The Three-Layer Content Model
[EVIDENCE]
Derived from first principles by asking two questions: what can a human do in a learning session (watch, listen, read, tap, drag, draw, speak, type, navigate, build, perform), and what can fill a screen (text, images, diagrams, animations, video, audio, 3D models, maps, timelines, code, simulations, games)?
Layer 1 -- Adapters (12 types). How the student interacts. Organized in four buckets: watch/listen (media player, visual stage, spatial navigator, conversation), respond/choose (challenge, discovery surface, iframe bridge), create/build (creative canvas, builder, simulation sandbox), perform/play (performance, game engine). Each adapter follows a standard interface contract: what it renders, what events it emits, what signals it accepts, what context it gives the voice AI.
Layer 2 -- Content Atoms (~42 types). What fills the surface. Visual static (photo, SVG, data viz, LaTeX, notation), visual motion (video, Manim, Lottie, Rive, 3D model), audio (narration TTS, music samples, ambient), interactive specs (quiz, ordering, matching, hotspot, drag-drop, game config, code exercise, applet ID), spatial (SVG parallax, WorldPlay video, panorama, map tiles).
Layer 3 -- Sources (6 types). Where it comes from. Pipeline-generated (TTS, Manim, Claude SVG), curated external (YouTube clips, CC images), shared library (reusable assets on CDN), AI runtime (live voice, hints), student-created (drawing, voice, code), teacher-authored (custom problems, uploads).
This separation is what makes real-world footage (a curated YouTube clip of Mt. St. Helens erupting) work alongside generated content (a Manim cross-section animation) in the same adapter.
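The standard interface contract in Layer 1, and how it relates to the other two layers, can be sketched in TypeScript. The identifiers below (AdapterContract, ContentAtom, SourceKind) are assumptions for illustration, not the real codebase names.

```typescript
// Hedged sketch of the three-layer separation -- names are illustrative.
// Layer 1: how the student interacts.
interface AdapterContract {
  mount(container: HTMLElement, atoms: ContentAtom[]): void;  // what it renders
  onEvent(handler: (event: AdapterEvent) => void): void;      // what events it emits
  receive(signal: AdapterSignal): void;                       // what signals it accepts
  describeForVoiceAI(): string;                               // context it gives the voice AI
}

// Layer 2: what fills the surface (a few of the ~42 atom types).
interface ContentAtom {
  type: "photo" | "manim-video" | "narration-tts" | "quiz-spec" | "panorama";
  source: SourceKind;        // Layer 3: where it comes from
  url?: string;              // CDN asset, if pre-produced
  spec?: unknown;            // inline spec for interactive atoms
}

// Layer 3: provenance of the atom.
type SourceKind =
  | "pipeline-generated"
  | "curated-external"
  | "shared-library"
  | "ai-runtime"
  | "student-created"
  | "teacher-authored";

interface AdapterEvent { name: string; payload?: unknown; }
interface AdapterSignal { name: string; payload?: unknown; }
```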
The Narrate/Interact Rhythm
[CONVICTION]
Voice AI platforms are optimized for conversation -- half-duplex turn-taking with 10-20 second response windows. A teacher narrating for 60 seconds while building up to a concept is fundamentally different from a customer service agent answering a question. The entire voice AI industry is optimized for the second case, not the first.
The solution: treat narration and interaction as two different media types. Forcing live voice AI to do narration is fighting the technology.
Pre-generated audio handles: scene-setting, storytelling, concept building, emotional pacing, transitions -- anything over 15 seconds of continuous teacher speech. Polished, rehearsed, no turn-detection glitches.
Live voice AI handles: reacting to what the student just did, confirming, correcting, encouraging, answering questions, personalizing recaps -- anything that depends on student input. Naturally short-form, reactive, and conversational.
The conductor alternates between them: narrate, interact, narrate, interact. That is the rhythm of a great lesson. The teacher tells you something (narration), then you try it (interaction), then the teacher reacts to what you did (interaction), then the teacher sets up the next thing (narration).
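One way to read the split is as a routing rule. The sketch below mirrors the 15-second threshold and the student-input test from the prose; the names and structure are invented for illustration.

```typescript
// Illustrative routing rule for the two media types -- not a real API.
type SpeechSegment = {
  text: string;
  dependsOnStudentInput: boolean;   // reacting, confirming, correcting, answering
  estimatedSeconds: number;
};

function routeSpeech(segment: SpeechSegment): "pre-generated-audio" | "live-voice-ai" {
  // Anything that depends on what the student just did must be live and reactive.
  if (segment.dependsOnStudentInput) return "live-voice-ai";
  // Long continuous teacher speech is narration: polished, rehearsed, pre-rendered TTS.
  if (segment.estimatedSeconds > 15) return "pre-generated-audio";
  // Short, student-independent lines could go either way; default to pre-generated here.
  return "pre-generated-audio";
}
```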
The moment prompt becomes the single most important artifact in the system. It tells the voice AI what to teach, what is on screen, what events to expect, and when to advance. The voice AI reads the prompt and teaches however it thinks is best. The conductor does not micromanage the voice AI's behavior within a moment. It waits for the signal that the moment is done.
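A hedged sketch of what a moment prompt might carry, inferred from the description above; the structure and the Carnatic content are illustrative, not the actual artifact.

```typescript
// Illustrative moment prompt -- section headers and signal names are assumptions.
const momentPrompt = `
You are the teacher for this moment.

TEACH: The arohanam of Mayamalavagowla -- ascending Sa Ri Ga Ma Pa Dha Ni Sa'.
ON SCREEN: A note grid adapter with tap-to-hear and live pitch tracking.
EXPECT EVENTS: note-tapped { swara }, pitch-match { swara, accuracy }.
ADVANCE WHEN: The student has sung the full ascent with reasonable accuracy;
emit the ADVANCE signal. Emit a BRANCH signal if pitch accuracy stays low.

Teach however you judge best within this moment.
`;
```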
The Simplification Principle
[REFRAME]
An early design iteration built a sophisticated 6-phase choreography system with 5 guided modes, 9 signal types, and per-phase state management. It was overengineered. The author's direct experience with voice AI revealed: when you give the voice AI good instructions and context, it does a good job on its own. It knows when to speak, when to wait, when to encourage, when to correct. Building an elaborate puppet rig for something that can walk by itself was the wrong direction.
The simplified conductor does three things: mount the adapter, play optional intro audio, inject the moment prompt. Then it waits for ADVANCE or BRANCH signals. Everything between "moment starts" and "voice AI says done" is the voice AI's territory. The six phases (pretext, context, silence, guided, post, bridge) survive as a description of what good teaching looks like in the voice AI's system prompt -- not as a state machine the conductor enforces.
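A minimal sketch of that loop, assuming invented helper and type names (mountAdapter, playAudio, injectPrompt, waitForSignal); the real conductor lives in claude-sessions.

```typescript
// Minimal sketch of the simplified conductor -- all names invented for illustration.
type Moment = { id: string; adapter: string; introAudio?: string; promptRef: string };
type Signal = { kind: "ADVANCE" } | { kind: "BRANCH"; target: string };

// Assumed runtime helpers (signatures invented for this sketch).
declare function mountAdapter(kind: string): Promise<{ unmount(): Promise<void> }>;
declare function playAudio(assetUrl: string): Promise<void>;
declare function injectPrompt(promptRef: string): Promise<void>;
declare function waitForSignal(kinds: Array<Signal["kind"]>): Promise<Signal>;

async function runMoment(moment: Moment): Promise<string> {
  const surface = await mountAdapter(moment.adapter);          // 1. mount the adapter
  if (moment.introAudio) await playAudio(moment.introAudio);   // 2. play optional intro audio
  await injectPrompt(moment.promptRef);                        // 3. inject the moment prompt

  // Everything from here until the signal arrives is the voice AI's territory.
  const signal = await waitForSignal(["ADVANCE", "BRANCH"]);
  await surface.unmount();
  return signal.kind === "BRANCH" ? signal.target : "next";
}

async function runSession(moments: Moment[]): Promise<void> {
  let index = 0;
  while (index >= 0 && index < moments.length) {
    const outcome = await runMoment(moments[index]);
    // "next" advances linearly; a branch target names the moment id to jump to.
    index = outcome === "next" ? index + 1 : moments.findIndex((m) => m.id === outcome);
  }
}
```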
Entertainment-Grade Production
[CONVICTION]
The reference class is 3Blue1Brown, Kurzgesagt, and Apple product pages -- not edtech. The bar is not "better than a Khan Academy exercise" but "as good as a top YouTube creator's production." When a child watches 3Blue1Brown or Kurzgesagt, that is the quality baseline. Edtech tools look dated and clinical next to that.
Every adapter uses entertainment tools: Framer Motion and GSAP for animation, Rive for interactive state-machine graphics, Tone.js for audio, Excalidraw for canvas, Three.js for 3D. Sound is a first-class citizen -- every interaction has audio feedback. Silence is intentional, not default. The visual stage adapter fires narration-synced animation cues at specific timestamps: at 3.2 seconds, the formula types itself; at 5.8 seconds, the equals sign highlights. This is what makes it feel like 3Blue1Brown -- the visual and the voice move together.
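One simple way to express those cues is a timestamped cue track driven by the narration audio's playback position. The sketch below uses the example timestamps from the prose, with invented animation helpers standing in for the Framer Motion or GSAP calls an adapter would actually make.

```typescript
// Illustrative narration-synced cue track -- cue names are invented,
// timestamps mirror the example above.
type Cue = { at: number; action: () => void };

const cues: Cue[] = [
  { at: 3.2, action: () => typeOnFormula() },       // formula types itself at 3.2s
  { at: 5.8, action: () => highlightEqualsSign() }, // equals sign highlights at 5.8s
];

// Fire each cue once as the narration audio plays past its timestamp.
function attachCues(audio: HTMLAudioElement, track: Cue[]): void {
  const pending = [...track].sort((a, b) => a.at - b.at);
  audio.addEventListener("timeupdate", () => {
    while (pending.length > 0 && audio.currentTime >= pending[0].at) {
      pending.shift()!.action();
    }
  });
}

// Assumed animation helpers (would be Framer Motion / GSAP animations in practice).
declare function typeOnFormula(): void;
declare function highlightEqualsSign(): void;
```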
The gap between architecture and production quality was identified through direct testing: generating static images with text fade-ins produces a glorified PowerPoint. The pipeline must actually produce rich content -- Manim-generated animations, curated real footage, Rive interactive graphics -- not just reference them.
The Two Codebases
[EVIDENCE]
claude-content is a plugin -- Claude acting as creative director with access to skills (research, generation, production, curation). It is not a separate pipeline runner application. Claude IS the orchestrator. The plugin gives Claude access to skills, and Claude decides what to call and when. The phases are mental scaffolding in the system prompt, not a rigid execution sequence.
claude-sessions is the runtime -- a React + Pipecat application running live per-student. The voice AI within Pipecat is Claude (or GPT-4o) with a moment prompt, reading adapter events and emitting signals. The browser side has the conductor, adapters, world renderer, and voice hook. The server side has the Pipecat node with voice AI and signal filtering.
The session package on S3 is the contract between them. claude-content produces it; claude-sessions consumes it. They never communicate directly.
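A hedged sketch of the consumer side of that contract, with an assumed bucket layout and a hypothetical CDN prefix: claude-content writes the package once, claude-sessions only ever reads it.

```typescript
// Illustrative package loading in the runtime -- paths and names are assumptions.
const PACKAGE_BASE = "https://cdn.example.com/sessions";   // hypothetical CDN prefix

async function loadSessionPackage(sessionId: string, version: string) {
  const base = `${PACKAGE_BASE}/${sessionId}/${version}`;
  const manifest = await fetch(`${base}/manifest.json`).then((r) => r.json());
  // Assets (pre-generated TTS, Manim clips, curated footage) are referenced by
  // relative paths inside the manifest and fetched lazily as moments mount.
  return { base, manifest };
}
```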
Related
- education -- domain overview
- sovereign-child -- the philosophy the session system serves
- ventures/brih/overview -- the AI-native school that uses session architecture
- technology-as-training-wheels -- scaffolding that graduates, applied to session design