🧠 Learner Bot RAG System
What this module covers: architecture, data pipeline, graph schema, multi-tenancy model, privacy rules, and open decisions for the per-child AI intelligence layer that powers the Learner Bot. This is a design document; no code has been written yet.
1. Overview
Every child on the Pickatale platform gets their own AI-driven Learner Bot. The bot tracks curriculum mastery, vocabulary gaps, and reading engagement, then generates nightly recommendations and teacher/parent reports.
The RAG system is the intelligence backbone: a shared infrastructure with strict per-child data isolation. Heavy processing runs on dedicated worker servers (no GPU on primary). Results are stored in a central knowledge graph queried by the Learner Bot.
Key Design Principles
- One shared RAG, per-child isolation internally: a separate RAG per child would not scale
- Workers separate from primary server: the primary server runs live projects; graph processing cannot share its resources
- Educational knowledge graph, not chat memory: the goal is curriculum mastery tracking, not conversation history
- Cloud embeddings, no local GPU: the primary server has no GPU; the OpenAI Embeddings API is reliable and cost-effective
- Data flows UP only, aggregated and anonymized. Never sideways between schools or down to children
2. Architecture
Primary Server (readingtester.com)
├── Telemetry DB (port 3110) → raw reading events
├── Per-child markdown logs → data/children/{child_id}/logs/
└── Existing projects → untouched, no resource competition
        ↓ nightly batch (worker servers only)
Worker Servers (dedicated, separate machines)
├── Pull raw telemetry per child cohort
├── Call OpenAI Embeddings API (text-embedding-3-small)
├── Mine against predefined educational goals
├── Generate mastery scores + knowledge relationships
└── Write results → Central RAG Server
        ↓ merge results
Central RAG Server (dedicated, separate machine)
├── Neo4j → knowledge graph (educational relationships)
├── PostgreSQL → identity, access control, raw profiling
└── pgvector → semantic similarity search
        ↓
Learner Bot (port 3120) queries Central RAG nightly
        ↓
Content recommendations + Teacher/Parent reports
Component Responsibilities
| Component | Server | Responsibility |
|---|---|---|
| Telemetry DB | Primary (port 3110) | Append-only event log: reading sessions, word taps, assessments |
| Per-child markdown logs | Primary | Human-readable activity summaries. Encrypted at rest (AES-256). |
| Worker Servers | Dedicated (separate machines) | Nightly batch: embeddings, mastery scoring, graph updates |
| Neo4j | Central RAG Server | Curriculum objectives, concept relationships, child mastery graph |
| PostgreSQL + RLS | Central RAG Server | Identity, access control, tenant isolation, raw profiling |
| pgvector | Central RAG Server | Semantic similarity: similar learner profiles, content matching |
| Learner Bot | Primary (port 3120) | Queries Central RAG, generates reports and recommendations |
| Analytics | Primary (port 3114) | Aggregated stats, class → school → nation rollup, k-anonymity enforced |
3. Data Pipeline (6 Stages)
| Stage | Technology | When | Server |
|---|---|---|---|
| 1. Collection | PostgreSQL (Telemetry DB) | Real-time | Primary |
| 2. Raw Profiling | Python + OpenAI Embeddings API | Nightly | Workers |
| 3. Knowledge Graph Building | Neo4j | Nightly | Central RAG |
| 4. Semantic Search Indexing | pgvector | Nightly | Central RAG |
| 5. Learner Bot Queries | Neo4j + pgvector combined | Nightly + real-time | Central RAG |
| 6. Aggregated Analytics | PostgreSQL aggregations | Weekly | Analytics server |
Nightly Batch Flow (per child cohort)
1. Worker pulls raw telemetry for child cohort from Telemetry DB
2. Worker reads per-child markdown log
3. Worker calls OpenAI Embeddings API → generates vectors
4. Worker mines against predefined educational goals
5. Worker writes to Central RAG:
   a. Updates mastery scores on Neo4j nodes
   b. Updates pgvector embeddings
   c. Appends to child profile in PostgreSQL
6. Learner Bot queries Central RAG
7. Learner Bot generates recommendations + reports
8. Reports delivered to Teacher Dashboard + Parent view
Batch window: 00:00–06:00 local time per region.
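A minimal sketch of how a worker might check the batch window for a region. It assumes each region maps to a timezone (`region_tz` is a hypothetical parameter; the design does not yet say how regions are defined) and treats the window as half-open.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Batch window from the design: 00:00 to 06:00 local time per region.
BATCH_START = time(0, 0)
BATCH_END = time(6, 0)


def in_batch_window(now_utc: datetime, region_tz: tzinfo) -> bool:
    """Return True if `now_utc` falls inside the region's local batch window.

    `region_tz` is a hypothetical per-region setting. The window is treated
    as half-open, so 06:00 local time is already outside it.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(region_tz).time()
    return BATCH_START <= local < BATCH_END


# Fixed offsets keep the example self-contained; real regions would likely
# use IANA zones (zoneinfo.ZoneInfo) so daylight saving time is handled.
utc_plus_8 = timezone(timedelta(hours=8))
utc_plus_1 = timezone(timedelta(hours=1))
now = datetime(2026, 4, 18, 20, 30, tzinfo=timezone.utc)
print(in_batch_window(now, utc_plus_8))  # 04:30 local time
print(in_batch_window(now, utc_plus_1))  # 21:30 local time
```

A scheduler could call this once per region and only dispatch cohorts whose region is currently inside the window.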
Real-time Flow
Child reads → Telemetry event logged → Telemetry DB
Teacher requests report → Learner Bot queries Central RAG (cached results)
Content recommendation needed → pgvector similarity query → ranked book list
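To illustrate the last step of the real-time flow, here is a plain-Python sketch of ranking by cosine distance, the metric pgvector exposes through its `<=>` operator. The vectors, book ids, and the `rank_books` helper are made up for illustration; in the real system the ranking would run inside PostgreSQL.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_books(profile: list[float], books: dict[str, list[float]], limit: int = 3) -> list[str]:
    """Return up to `limit` book ids, closest to the learner profile first."""
    ranked = sorted(books, key=lambda book_id: cosine_distance(profile, books[book_id]))
    return ranked[:limit]


# Toy 3-dimensional vectors; real embedding vectors are much longer.
books = {
    "book-a": [1.0, 0.0, 0.0],
    "book-b": [0.0, 1.0, 0.0],
    "book-c": [0.7, 0.7, 0.0],
}
print(rank_books([0.9, 0.1, 0.0], books))
```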
4. Graph Schema (Neo4j)
Nodes
| Node Label | Properties | Source |
|---|---|---|
| Child | child_id, school_id, district_id, nation_code, fk_level, created_at | Account Center |
| CurriculumObjective | id, title, nation_code, curriculum_version, year_group | CM API |
| Concept | id, title, description, difficulty | Extracted from CM objectives |
| Vocabulary | word, frequency_band, fk_grade | Telemetry word taps |
| Book | id, title, fk_level, nation_code | Content library |
Relationships
| Relationship | From → To | Properties |
|---|---|---|
| MASTERED | Child → Concept | confidence_score, date_achieved |
| STRUGGLING_WITH | Child → Concept | since_date, attempts |
| READ | Child → Book | completion_rate, date, engagement_score |
| CONTAINS | CurriculumObjective → Concept | (none) |
| PREREQUISITE_OF | Concept → Concept | (none) |
| TEACHES | Book → Concept | strength |
Example Query Patterns
// Find prerequisite concepts that may be blocking the concepts a child is struggling with
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:PREREQUISITE_OF]-(blocker:Concept)
RETURN blocker.title, blocker.difficulty ORDER BY blocker.difficulty

// Find books that teach concepts this child needs, within half an FK level of the child
// (Cypher has no BETWEEN operator; a chained comparison expresses the range)
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:TEACHES]-(book:Book)
WHERE $child_fk - 0.5 <= book.fk_level <= $child_fk + 0.5
RETURN book.title, book.id ORDER BY book.fk_level
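The book query above can be restated in plain Python on toy data to make the matching rule explicit: a book qualifies if it teaches at least one struggling concept and sits within half an FK level of the child. The `Book` class, the helper name, and the sample titles are illustrative assumptions, not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    id: str
    title: str
    fk_level: float
    teaches: frozenset[str]  # Concept ids, standing in for TEACHES edges


def books_for_struggles(
    struggling: set[str], child_fk: float, books: list[Book], band: float = 0.5
) -> list[Book]:
    """Books that teach a struggling concept, within +/- `band` of the
    child's FK level, ordered by FK level (mirrors the graph pattern)."""
    matches = [
        b for b in books
        if b.teaches & struggling and child_fk - band <= b.fk_level <= child_fk + band
    ]
    return sorted(matches, key=lambda b: b.fk_level)


# Made-up sample library.
library = [
    Book("b1", "Fractions at the Fair", 3.4, frozenset({"fractions"})),
    Book("b2", "Big Numbers", 5.0, frozenset({"fractions"})),
    Book("b3", "Pond Life", 3.0, frozenset({"habitats"})),
    Book("b4", "Sharing Pizza", 2.8, frozenset({"fractions", "division"})),
]
picks = books_for_struggles({"fractions"}, child_fk=3.0, books=library)
print([b.id for b in picks])
```

Unlike the graph query, this version returns each book once even when it teaches several struggling concepts; the Cypher pattern may need `DISTINCT` to match that behaviour.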
5. Multi-Tenancy & Access Control
Tenant Hierarchy
Nation
βββ District
βββ School
βββ Class
βββ Child
Data Isolation Rules
- Every node carries: child_id + school_id + district_id + nation_code
- All queries are scoped to the appropriate level; never cross-school or cross-district
- PostgreSQL RLS enforces tenant boundaries at the database level
- Neo4j RBAC enforces property-level access in the graph
- Data flows UP only (aggregated, anonymized); never DOWN or SIDEWAYS
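A sketch of how an application-level scope check might complement the database-level rules, using the four tenant tags listed above. `TenantKey`, `Scope`, and `in_scope` are hypothetical names. The class level is not among the listed tags, so it is not modelled here, and the sketch does not distinguish raw from aggregated access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantKey:
    """Tenant tags the design says every node carries."""
    nation_code: str
    district_id: str
    school_id: str
    child_id: str


@dataclass(frozen=True)
class Scope:
    """A caller's scope; None means 'not narrowed at this level'."""
    nation_code: str
    district_id: Optional[str] = None
    school_id: Optional[str] = None
    child_id: Optional[str] = None


def in_scope(scope: Scope, record: TenantKey) -> bool:
    """True only if the record sits inside the caller's branch of the hierarchy."""
    levels = ("nation_code", "district_id", "school_id", "child_id")
    return all(
        getattr(scope, level) is None or getattr(scope, level) == getattr(record, level)
        for level in levels
    )


record = TenantKey(nation_code="NO", district_id="d1", school_id="s1", child_id="c1")
print(in_scope(Scope("NO", "d1", "s1"), record))  # same school
print(in_scope(Scope("NO", "d1", "s2"), record))  # different school
```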
Access Control Model
| Role | Can Access |
|---|---|
| Learner Bot | Single child only (child_id scoped) |
| Teacher | Their class only |
| School Admin | Their school only |
| District Admin | Their district (aggregated + anonymized) |
| Pickatale Analytics | All nations (k-anonymity enforced above school level) |
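One way the k-anonymity rule for rollups could be approximated is threshold suppression: any group with fewer than k children is withheld. The sketch below assumes one row per child and uses the default minimum group size of 10 from the privacy section; `rollup_mastery` is a hypothetical helper, and suppression alone does not cover every re-identification risk.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional

DEFAULT_K = 10  # default minimum group size from the privacy rules


def rollup_mastery(
    rows: list[tuple[str, float]], k: int = DEFAULT_K
) -> dict[str, Optional[float]]:
    """Average mastery per group, suppressing groups smaller than `k`.

    `rows` holds (group_id, mastery_score) pairs, one row per child.
    Suppressed groups map to None instead of a value.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for group_id, score in rows:
        groups[group_id].append(score)
    return {
        group_id: (mean(scores) if len(scores) >= k else None)
        for group_id, scores in groups.items()
    }


# Made-up data: 12 children in one district, 3 in another.
rows = [("district-1", 0.5)] * 12 + [("district-2", 0.9)] * 3
print(rollup_mastery(rows))
```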
6. Privacy & GDPR
Children under 13 fall into the strictest GDPR + COPPA category. All child data must be treated with maximum protection. Voice audio deleted immediately (COPPA). No PII in telemetry events.
- Per-child folder isolation: data/children/{child_id}/
- AES-256 encryption at rest on all raw markdown files
- Deletion: remove child = delete folder + purge all graph nodes with that child_id
- No PII in telemetry events: child_id only, never name/email
- Embeddings tagged with tenant metadata, never raw text stored in embeddings layer
- k-anonymity enforced at district level and above (minimum group size: configurable, default 10)
- GDPR export: ZIP per-child folder + export graph node properties on request
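The file-system half of the export and deletion rules could look roughly like the sketch below. Function names and the Cypher string are assumptions; the graph purge needs a Neo4j driver and is shown only as a query constant, and handling of the AES-256 encryption at rest is left out.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical Cypher for the graph purge. DETACH DELETE also removes the
# relationships of every matched node. Not executed in this sketch.
PURGE_CHILD_CYPHER = "MATCH (n {child_id: $cid}) DETACH DELETE n"


def child_dir(data_root: Path, child_id: str) -> Path:
    """Resolve data/children/{child_id}/ and refuse ids that escape it."""
    children = (data_root / "children").resolve()
    target = (children / child_id).resolve()
    if target.parent != children:
        raise ValueError(f"unsafe child_id: {child_id!r}")
    return target


def export_child(data_root: Path, child_id: str, out_dir: Path) -> Path:
    """ZIP the per-child folder for a GDPR export and return the archive path."""
    folder = child_dir(data_root, child_id)
    base_name = str(out_dir / f"export-{child_id}")
    return Path(shutil.make_archive(base_name, "zip", root_dir=str(folder)))


def delete_child(data_root: Path, child_id: str) -> None:
    """Delete the per-child folder; the graph purge must run alongside it."""
    shutil.rmtree(child_dir(data_root, child_id))


# Demo in a throwaway directory with a made-up child id.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    logs = root / "children" / "child-123" / "logs"
    logs.mkdir(parents=True)
    (logs / "2026-04-18.md").write_text("# Reading log\n")
    archive = export_child(root, "child-123", Path(tmp))
    delete_child(root, "child-123")
    print(archive.name, (root / "children" / "child-123").exists())
```

The path check matters because `child_id` ends up in a file-system path; an id such as `../other` is rejected before anything is archived or removed.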
7. Decisions Log
| # | Date | Decision | Rationale |
|---|---|---|---|
| 1 | 2026-04-18 | One shared RAG, per-child isolation internally | Separate RAG per child would explode server resources |
| 2 | 2026-04-18 | Heavy processing on worker servers, not primary | Primary server serves live projects and cannot handle graph processing load |
| 3 | 2026-04-18 | Use OpenAI Embeddings API, not local model | No GPU on primary server; cloud API is more reliable and cost-effective |
| 4 | 2026-04-18 | Hybrid stack: Neo4j + PostgreSQL + pgvector | Neo4j for relationships, PostgreSQL for auth/control, pgvector for similarity; each does what it's best at |
| 5 | 2026-04-18 | Raw data as markdown files + telemetry DB | Cheap, readable, easy to audit; graph built from processed insights, not raw logs |
| 6 | 2026-04-18 | Educational knowledge graph, not chat memory | Goal is curriculum mastery tracking, not conversation history |
| 7 | 2026-04-18 | Central RAG on dedicated server, separate from primary | Isolation from live projects; can scale independently |
8. Assumptions
- Worker servers will be provisioned by Jixian (cloud VPS or cloud functions)
- Central RAG server is a separate dedicated machine, not the current primary server
- Curriculum objectives are sourced from CM (cm.readingtester.com) via API only (no direct DB access)
- OpenAI text-embedding-3-small is the embedding provider
- Nightly batch window: 00:00–06:00 local time per region
- Learner Bot is the ONLY service that queries the Central RAG directly
9. Open Questions
To be resolved in Sig + Jixian design call before build begins.
Q1. Where do worker servers live? Cloud functions (Lambda/Cloud Run) vs dedicated VPS vs Jixian's machine? → Jixian to decide
Q2. How many predefined educational mining goals to start with? What are the first 5? → Sig + Jixian to agree
Q3. Does the Learner Bot cache RAG query results, or query fresh each nightly run?
Q4. What is the initial set of Concept nodes: auto-extracted from CM objectives or manually curated?
Q5. How does the system handle a child switching schools or nations mid-year?
Q6. Real-time vs batch for vocabulary gap updates, given that word tap events are high frequency?
Q7. Which dedicated server hosts the Central RAG? → Jixian to provision
10. Next Steps
- Sig + Jixian phone call to finalize per-child use cases and sign off on this design
- Resolve Open Questions 1–7 above
- Write detailed dev plan (schema migrations, API contracts, worker job specs)
- Write infrastructure plan (server provisioning, network topology, monitoring)
- Only then: begin build
Module 15 · Learner Bot RAG System · v1.0 · 2026-04-18 · Design phase, not yet approved for build