🧠 Learner Bot RAG System

Module 15 · Design Phase v1.0 · 2026-04-18 · ⚠️ Pending Sig + Jixian sign-off before build
What this module covers: Architecture, data pipeline, graph schema, multi-tenancy model, privacy rules, and open decisions for the per-child AI intelligence layer that powers the Learner Bot. This is a design document; no code has been written yet.

1. Overview

Every child on the Pickatale platform gets their own AI-driven Learner Bot. The bot tracks curriculum mastery, vocabulary gaps, and reading engagement, then generates nightly recommendations and teacher/parent reports.

The RAG system is the intelligence backbone: a shared infrastructure with strict per-child data isolation. Heavy processing runs on dedicated worker servers (no GPU on primary). Results are stored in a central knowledge graph queried by the Learner Bot.

Key Design Principles

2. Architecture

Primary Server (readingtester.com)
├── Telemetry DB (port 3110)      - raw reading events
├── Per-child markdown logs        - data/children/{child_id}/logs/
└── Existing projects              - untouched, no resource competition

        ↓ nightly batch (worker servers only)

Worker Servers (dedicated, separate machines)
├── Pull raw telemetry per child cohort
├── Call OpenAI Embeddings API (text-embedding-3-small)
├── Mine against predefined educational goals
├── Generate mastery scores + knowledge relationships
└── Write results → Central RAG Server

        ↓ merge results

Central RAG Server (dedicated, separate machine)
├── Neo4j         - knowledge graph (educational relationships)
├── PostgreSQL    - identity, access control, raw profiling
└── pgvector      - semantic similarity search

        ↓
Learner Bot (port 3120) queries Central RAG nightly
        ↓
Content recommendations + Teacher/Parent reports

Component Responsibilities

| Component | Server | Responsibility |
| --- | --- | --- |
| Telemetry DB | Primary (port 3110) | Append-only event log: reading sessions, word taps, assessments |
| Per-child markdown logs | Primary | Human-readable activity summaries. Encrypted at rest (AES-256). |
| Worker Servers | Dedicated (separate machines) | Nightly batch: embeddings, mastery scoring, graph updates |
| Neo4j | Central RAG Server | Curriculum objectives, concept relationships, child mastery graph |
| PostgreSQL + RLS | Central RAG Server | Identity, access control, tenant isolation, raw profiling |
| pgvector | Central RAG Server | Semantic similarity: similar learner profiles, content matching |
| Learner Bot | Primary (port 3120) | Queries Central RAG, generates reports and recommendations |
| Analytics | Primary (port 3114) | Aggregated stats, class→school→nation rollup, k-anonymity enforced |

3. Data Pipeline (6 Stages)

| Stage | Technology | When | Server |
| --- | --- | --- | --- |
| 1. Collection | PostgreSQL (Telemetry DB) | Real-time | Primary |
| 2. Raw Profiling | Python + OpenAI Embeddings API | Nightly | Workers |
| 3. Knowledge Graph Building | Neo4j | Nightly | Central RAG |
| 4. Semantic Search Indexing | pgvector | Nightly | Central RAG |
| 5. Learner Bot Queries | Neo4j + pgvector combined | Nightly + real-time | Central RAG |
| 6. Aggregated Analytics | PostgreSQL aggregations | Weekly | Analytics server |
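
The aggregated analytics stage depends on k-anonymity, so a rollup should not publish a statistic for a group with too few children. The following is a minimal sketch of that idea, not the planned implementation: the value of k, the row shape, and the function name `rollup_with_k_anonymity` are all assumptions, since the design does not specify them.

```python
from collections import defaultdict

K_THRESHOLD = 5  # assumption: the design does not state a value for k


def rollup_with_k_anonymity(rows, group_key, k=K_THRESHOLD):
    """Average mastery per group, suppressing groups with fewer than k children.

    rows: iterable of dicts with 'child_id', 'mastery' and the group_key field
          (hypothetical row shape).
    Returns {group_value: mean mastery, or None when the group is suppressed}.
    """
    children = defaultdict(set)   # distinct child ids per group
    totals = defaultdict(float)   # sum of mastery values per group
    counts = defaultdict(int)     # number of rows per group
    for row in rows:
        group = row[group_key]
        children[group].add(row["child_id"])
        totals[group] += row["mastery"]
        counts[group] += 1

    result = {}
    for group, ids in children.items():
        if len(ids) < k:
            result[group] = None  # too few distinct children: suppress
        else:
            result[group] = totals[group] / counts[group]
    return result
```

Counting distinct `child_id` values rather than rows keeps one very active child from making a small group look large enough to publish.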

Nightly Batch Flow (per child cohort)

1. Worker pulls raw telemetry for child cohort from Telemetry DB
2. Worker reads per-child markdown log
3. Worker calls OpenAI Embeddings API → generates vectors
4. Worker mines against predefined educational goals
5. Worker writes to Central RAG:
   a. Updates mastery scores on Neo4j nodes
   b. Updates pgvector embeddings
   c. Appends to child profile in PostgreSQL
6. Learner Bot queries Central RAG
7. Learner Bot generates recommendations + reports
8. Reports delivered to Teacher Dashboard + Parent view

Batch window: 00:00–06:00 local time per region.
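
Because the batch window is defined in local time per region, a worker needs to convert the current time before deciding whether to start a cohort. A minimal sketch, assuming the scheduler passes a timezone-aware UTC timestamp and a `tzinfo` object for the region; the function name and the choice to treat 06:00 as exclusive are assumptions.

```python
from datetime import datetime, time, timedelta, timezone

BATCH_START = time(0, 0)  # 00:00 local time, from the design
BATCH_END = time(6, 0)    # 06:00 local time, treated as exclusive here (assumption)


def in_batch_window(now_utc, region_tz):
    """Return True if now_utc falls inside the region's local batch window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(region_tz)
    return BATCH_START <= local.time() < BATCH_END


# Example: 17:00 UTC is 01:00 the next day in a UTC+8 region.
utc_plus_8 = timezone(timedelta(hours=8))
print(in_batch_window(datetime(2026, 4, 18, 17, 0, tzinfo=timezone.utc), utc_plus_8))
```

A fixed offset is used here to keep the example self-contained. A real deployment would more likely pass a `zoneinfo.ZoneInfo` object so that daylight saving time is handled.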

Real-time Flow

Child reads → Telemetry event logged → Telemetry DB
Teacher requests report → Learner Bot queries Central RAG (cached results)
Content recommendation needed → pgvector similarity query → ranked book list
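
To illustrate what the similarity step in the last flow computes, here is a pure-Python sketch that ranks books by cosine similarity to a child's profile vector. It is an in-memory stand-in for the pgvector query, not the query itself. The function names, the dictionary of book vectors, and the use of cosine similarity as the metric are assumptions, since the design does not fix a distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_books(child_vector, book_vectors, top_n=3):
    """Return up to top_n book ids, most similar to child_vector first.

    book_vectors: {book_id: embedding vector} (hypothetical shape).
    """
    scored = [
        (cosine_similarity(child_vector, vec), book_id)
        for book_id, vec in book_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [book_id for _, book_id in scored[:top_n]]
```

In pgvector the same ordering would typically come from an `ORDER BY` on the cosine distance operator (`<=>`), with the heavy lifting done by the index rather than a Python loop.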

4. Graph Schema (Neo4j)

Nodes

| Node Label | Properties | Source |
| --- | --- | --- |
| Child | child_id, school_id, district_id, nation_code, fk_level, created_at | Account Center |
| CurriculumObjective | id, title, nation_code, curriculum_version, year_group | CM API |
| Concept | id, title, description, difficulty | Extracted from CM objectives |
| Vocabulary | word, frequency_band, fk_grade | Telemetry word taps |
| Book | id, title, fk_level, nation_code | Content library |

Relationships

| Relationship | From → To | Properties |
| --- | --- | --- |
| MASTERED | Child → Concept | confidence_score, date_achieved |
| STRUGGLING_WITH | Child → Concept | since_date, attempts |
| READ | Child → Book | completion_rate, date, engagement_score |
| CONTAINS | CurriculumObjective → Concept | (none) |
| PREREQUISITE_OF | Concept → Concept | (none) |
| TEACHES | Book → Concept | strength |

Example Query Patterns

// Find the prerequisite concepts ("blockers") behind the concepts a child is struggling with
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:PREREQUISITE_OF]-(blocker:Concept)
RETURN blocker.title, blocker.difficulty ORDER BY blocker.difficulty

// Find books that teach concepts this child needs
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:TEACHES]-(book:Book)
WHERE $child_fk - 0.5 <= book.fk_level <= $child_fk + 0.5
RETURN book.title, book.id ORDER BY book.fk_level
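
The book query above can be mirrored in memory, which may be useful for unit-testing recommendation logic without a running Neo4j instance. This is a sketch under stated assumptions: the input shapes (a set of concept ids, a list of `(book_id, concept_id)` pairs standing in for TEACHES edges, and a `{book_id: fk_level}` map) and the function name are hypothetical.

```python
def recommend_books(struggling, teaches, books, child_fk, band=0.5):
    """Books that teach a struggled-with concept and sit within +/- band of child_fk.

    struggling: set of concept ids the child is struggling with
    teaches:    list of (book_id, concept_id) pairs, one per TEACHES edge
    books:      {book_id: fk_level}
    Returns book ids ordered by fk_level, with ties broken by book id.
    """
    matches = {
        book_id
        for book_id, concept_id in teaches
        if concept_id in struggling
        and child_fk - band <= books[book_id] <= child_fk + band
    }
    return sorted(matches, key=lambda book_id: (books[book_id], book_id))
```

Unlike the Cypher pattern, which yields one row per matching concept and book pair, this sketch collects matches in a set so each book appears once.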

5. Multi-Tenancy & Access Control

Tenant Hierarchy

Nation
  └── District
        └── School
              └── Class
                    └── Child

Data Isolation Rules

Access Control Model

| Role | Can Access |
| --- | --- |
| Learner Bot | Single child only (child_id scoped) |
| Teacher | Their class only |
| School Admin | Their school only |
| District Admin | Their district (aggregated + anonymized) |
| Pickatale Analytics | All nations (k-anonymity enforced above school level) |
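
The row-level part of this model can be sketched as a single scope check. The example below is illustrative only: the `Principal` and `ChildRecord` types, the role strings, and the `class_id` field are assumptions (the Child node in the graph schema does not list a class property). District admins and Pickatale Analytics are deliberately denied here because the table grants them aggregated data only, not individual child records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildRecord:
    child_id: str
    class_id: str      # assumption: not listed among the Child node properties
    school_id: str
    district_id: str
    nation_code: str


@dataclass(frozen=True)
class Principal:
    role: str          # hypothetical role strings: "learner_bot", "teacher", "school_admin"
    scope_id: str      # child_id, class_id or school_id, depending on the role


# Which ChildRecord field each role's scope_id is compared against.
SCOPE_FIELD = {
    "learner_bot": "child_id",
    "teacher": "class_id",
    "school_admin": "school_id",
}


def can_read_child(principal, child):
    """Return True only if the principal's scope covers this individual child."""
    field = SCOPE_FIELD.get(principal.role)
    if field is None:
        # Aggregate-only roles and unknown roles get no row-level access.
        return False
    return getattr(child, field) == principal.scope_id
```

In the planned stack this rule would more plausibly live in PostgreSQL row-level security policies; the Python version only makes the intended semantics explicit.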

6. Privacy & GDPR

Children under 13: strictest GDPR + COPPA category. All child data must be treated with maximum protection. Voice audio is deleted immediately (COPPA). No PII in telemetry events.
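
One way to back the "no PII in telemetry events" rule is a guard that rejects events carrying PII-like keys before they reach the Telemetry DB. The sketch below is a rough illustration: the key list, the nested-dictionary event shape, and both function names are hypothetical, and a key-name check alone would not catch PII hidden inside free-text values.

```python
# Hypothetical deny-list; the design does not enumerate PII fields.
PII_FIELDS = {"name", "email", "date_of_birth", "address", "voice_audio"}


def find_pii_fields(event):
    """Return the sorted PII-like keys found anywhere in a nested event."""
    found = set()
    stack = [event]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for key, value in node.items():
                if str(key).lower() in PII_FIELDS:
                    found.add(str(key).lower())
                stack.append(value)  # keep walking nested payloads
        elif isinstance(node, list):
            stack.extend(node)
    return sorted(found)


def assert_no_pii(event):
    """Raise ValueError if the event contains PII-like keys, else return it."""
    leaked = find_pii_fields(event)
    if leaked:
        raise ValueError(f"telemetry event contains PII fields: {leaked}")
    return event
```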

7. Decisions Log

| # | Date | Decision | Rationale |
| --- | --- | --- | --- |
| 1 | 2026-04-18 | One shared RAG, per-child isolation internally | Separate RAG per child would explode server resources |
| 2 | 2026-04-18 | Heavy processing on worker servers, not primary | Primary server serves live projects and cannot handle graph processing load |
| 3 | 2026-04-18 | Use OpenAI Embeddings API, not local model | No GPU on primary server; cloud API is more reliable and cost-effective |
| 4 | 2026-04-18 | Hybrid stack: Neo4j + PostgreSQL + pgvector | Neo4j for relationships, PostgreSQL for auth/control, pgvector for similarity; each does what it's best at |
| 5 | 2026-04-18 | Raw data as markdown files + telemetry DB | Cheap, readable, easy to audit; graph built from processed insights, not raw logs |
| 6 | 2026-04-18 | Educational knowledge graph, not chat memory | Goal is curriculum mastery tracking, not conversation history |
| 7 | 2026-04-18 | Central RAG on dedicated server, separate from primary | Isolation from live projects; can scale independently |

8. Assumptions

9. Open Questions

To be resolved in Sig + Jixian design call before build begins.

Q1. Where do worker servers live? Cloud functions (Lambda/Cloud Run) vs dedicated VPS vs Jixian's machine? (Jixian to decide)
Q2. How many predefined educational mining goals to start with? What are the first 5? (Sig + Jixian to agree)
Q3. Does the Learner Bot cache RAG query results, or query fresh each nightly run?
Q4. What is the initial set of Concept nodes β€” auto-extracted from CM objectives or manually curated?
Q5. How does the system handle a child switching schools or nations mid-year?
Q6. Should vocabulary gap updates run in real time or in the nightly batch? Word tap events are high frequency.
Q7. Which dedicated server hosts the Central RAG? β€” Jixian to provision

10. Next Steps

  1. Sig + Jixian phone call to finalize per-child use cases and sign off on this design
  2. Resolve Open Questions 1–7 above
  3. Write detailed dev plan (schema migrations, API contracts, worker job specs)
  4. Write infrastructure plan (server provisioning, network topology, monitoring)
  5. Only then: begin build

Module 15 · Learner Bot RAG System · v1.0 · 2026-04-18 · Design phase, not yet approved for build