🧠 Learner Bot RAG System

Module 15 · Design Phase · v1.0 · 2026-04-18 · ⚠️ Pending Sig + Jixian sign-off before build

What this module covers: Architecture, data pipeline, graph schema, multi-tenancy model, privacy rules, and open decisions for the per-child AI intelligence layer that powers the Learner Bot. This is a design document — no code has been written yet.

1. Overview

Every child on the Pickatale platform gets their own AI-driven Learner Bot. The bot tracks curriculum mastery, vocabulary gaps, and reading engagement — then generates nightly recommendations and teacher/parent reports.

The RAG system is the intelligence backbone: a shared infrastructure with strict per-child data isolation. Heavy processing runs on dedicated worker servers so that it never competes with live projects on the primary server, and embeddings come from a cloud API because the primary server has no GPU. Results are stored in a central knowledge graph that the Learner Bot queries.

Key Design Principles

One shared RAG, per-child isolation internally — separate RAG per child would be unscalable
Workers separate from primary server — primary server runs live projects; graph processing cannot share its resources
Educational knowledge graph, not chat memory — the goal is curriculum mastery tracking, not conversation history
Cloud embeddings, no local GPU — primary server has no GPU; OpenAI Embeddings API is reliable and cost-effective
Data flows UP only — aggregated, anonymized. Never sideways between schools or down to children

2. Architecture

Primary Server (readingtester.com)
├── Telemetry DB (port 3110)      — raw reading events
├── Per-child markdown logs        — data/children/{child_id}/logs/
└── Existing projects              — untouched, no resource competition

        ↓ nightly batch (worker servers only)

Worker Servers (dedicated, separate machines)
├── Pull raw telemetry per child cohort
├── Call OpenAI Embeddings API (text-embedding-3-small)
├── Mine against predefined educational goals
├── Generate mastery scores + knowledge relationships
└── Write results → Central RAG Server

        ↓ merge results

Central RAG Server (dedicated, separate machine)
├── Neo4j         — knowledge graph (educational relationships)
├── PostgreSQL    — identity, access control, raw profiling
└── pgvector      — semantic similarity search

        ↓
Learner Bot (port 3120) queries Central RAG nightly
        ↓
Content recommendations + Teacher/Parent reports

Component Responsibilities

Component	Server	Responsibility
Telemetry DB	Primary (port 3110)	Append-only event log — reading sessions, word taps, assessments
Per-child markdown logs	Primary	Human-readable activity summaries. Encrypted at rest (AES-256).
Worker Servers	Dedicated (separate machines)	Nightly batch: embeddings, mastery scoring, graph updates
Neo4j	Central RAG Server	Curriculum objectives, concept relationships, child mastery graph
PostgreSQL + RLS	Central RAG Server	Identity, access control, tenant isolation, raw profiling
pgvector	Central RAG Server	Semantic similarity — similar learner profiles, content matching
Learner Bot	Primary (port 3120)	Queries Central RAG, generates reports and recommendations
Analytics	Primary (port 3114)	Aggregated stats, class→school→nation rollup, k-anonymity enforced

3. Data Pipeline (6 Stages)

Stage	Technology	When	Server
1. Collection	PostgreSQL (Telemetry DB)	Real-time	Primary
2. Raw Profiling	Python + OpenAI Embeddings API	Nightly	Workers
3. Knowledge Graph Building	Neo4j	Nightly	Central RAG
4. Semantic Search Indexing	pgvector	Nightly	Central RAG
5. Learner Bot Queries	Neo4j + pgvector combined	Nightly + real-time	Central RAG
6. Aggregated Analytics	PostgreSQL aggregations	Weekly	Primary (Analytics, port 3114)

Nightly Batch Flow (per child cohort)

1. Worker pulls raw telemetry for child cohort from Telemetry DB
2. Worker reads per-child markdown log
3. Worker calls OpenAI Embeddings API → generates vectors
4. Worker mines against predefined educational goals
5. Worker writes to Central RAG:
   a. Updates mastery scores on Neo4j nodes
   b. Updates pgvector embeddings
   c. Appends to child profile in PostgreSQL
6. Learner Bot queries Central RAG
7. Learner Bot generates recommendations + reports
8. Reports delivered to Teacher Dashboard + Parent view
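
A minimal sketch of steps 1 to 5 above, written so it runs offline. Everything named here is hypothetical: `CentralRag` is an in-memory stand-in for Neo4j, pgvector, and PostgreSQL, `fake_embed` replaces the OpenAI Embeddings call, and `fraction_correct` is an assumed scoring rule (the design does not yet define how mastery is scored). The event fields `concept` and `correct` are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CentralRag:
    """Hypothetical in-memory stand-in for the three Central RAG stores."""
    mastery: Dict[Tuple[str, str], float] = field(default_factory=dict)  # Neo4j in the design
    embeddings: Dict[str, List[float]] = field(default_factory=dict)     # pgvector in the design
    profiles: Dict[str, List[dict]] = field(default_factory=dict)        # PostgreSQL in the design


def run_nightly_batch(
    cohort: List[str],
    telemetry: Dict[str, List[dict]],
    logs: Dict[str, str],
    embed: Callable[[str], List[float]],
    score: Callable[[List[dict]], Dict[str, float]],
    rag: CentralRag,
) -> None:
    """Run steps 1-5 of the batch flow for one child cohort."""
    for child_id in cohort:
        events = telemetry.get(child_id, [])   # step 1: raw telemetry
        log_text = logs.get(child_id, "")      # step 2: per-child markdown log
        vector = embed(log_text)               # step 3: embedding call
        scores = score(events)                 # step 4: mine against goals
        for concept_id, value in scores.items():
            rag.mastery[(child_id, concept_id)] = value  # step 5a
        rag.embeddings[child_id] = vector                # step 5b
        rag.profiles.setdefault(child_id, []).append(    # step 5c: append-only
            {"events": len(events), "concepts": sorted(scores)}
        )


def fake_embed(text: str) -> List[float]:
    """Offline placeholder; the design calls text-embedding-3-small here."""
    return [float(len(text)), 0.0]


def fraction_correct(events: List[dict]) -> Dict[str, float]:
    """Assumed scoring rule: share of correct answers per concept."""
    totals: Dict[str, Tuple[int, int]] = {}
    for event in events:
        hits, seen = totals.get(event["concept"], (0, 0))
        totals[event["concept"]] = (hits + event["correct"], seen + 1)
    return {concept: hits / seen for concept, (hits, seen) in totals.items()}
```

Passing `embed` and `score` as callables keeps the worker loop independent of the embedding provider and of the scoring rule, both of which are still subject to the open questions.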

Batch window: 00:00–06:00 local time per region.
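
One way a worker could check whether a region is inside its batch window is sketched below. It assumes the window end is exclusive (06:00 itself is outside), which the design does not specify. The function name and constants are illustrative.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

BATCH_START = time(0, 0)  # 00:00 local time
BATCH_END = time(6, 0)    # 06:00 local time, treated as exclusive (assumption)


def in_batch_window(now: datetime, region_tz: tzinfo) -> bool:
    """Return True if `now` falls inside the region's local batch window."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local_time = now.astimezone(region_tz).time()
    return BATCH_START <= local_time < BATCH_END


# Example: 23:30 UTC is 01:30 in a UTC+2 region, so the batch may run there.
now = datetime(2026, 4, 18, 23, 30, tzinfo=timezone.utc)
print(in_batch_window(now, timezone(timedelta(hours=2))))
```

The example uses a fixed UTC offset to stay self-contained. A deployment would more likely pass a `zoneinfo.ZoneInfo` instance so that daylight saving changes are handled.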

Real-time Flow

Child reads → Telemetry event logged → Telemetry DB
Teacher requests report → Learner Bot queries Central RAG (cached results)
Content recommendation needed → pgvector similarity query → ranked book list
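
The similarity step in the last flow can be illustrated in plain Python. This sketch assumes cosine similarity as the ranking metric (pgvector offers cosine distance among its operators, but the design has not fixed a metric), and the function names and vectors are made up for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_books(profile: List[float], books: Dict[str, List[float]], top_k: int = 3) -> List[str]:
    """Return up to top_k book ids, most similar to the learner profile first."""
    scored = [(cosine_similarity(profile, vec), book_id) for book_id, vec in books.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by book id
    return [book_id for _, book_id in scored[:top_k]]
```

In the real system the database would do this ranking through a vector index. The sketch only shows what "ranked book list" means for a given profile vector.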

4. Graph Schema (Neo4j)

Nodes

Node Label	Properties	Source
`Child`	child_id, school_id, district_id, nation_code, fk_level, created_at	Account Center
`CurriculumObjective`	id, title, nation_code, curriculum_version, year_group	CM API
`Concept`	id, title, description, difficulty	Extracted from CM objectives
`Vocabulary`	word, frequency_band, fk_grade	Telemetry word taps
`Book`	id, title, fk_level, nation_code	Content library

Relationships

Relationship	From → To	Properties
`MASTERED`	Child → Concept	confidence_score, date_achieved
`STRUGGLING_WITH`	Child → Concept	since_date, attempts
`READ`	Child → Book	completion_rate, date, engagement_score
`CONTAINS`	CurriculumObjective → Concept	—
`PREREQUISITE_OF`	Concept → Concept	—
`TEACHES`	Book → Concept	strength
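
The relationship table above can be encoded as a small lookup that workers check before writing an edge. This is a sketch under the assumption that validation happens in worker code; `ALLOWED_RELATIONSHIPS` and `validate_edge` are hypothetical names.

```python
from typing import Dict, Tuple

# Relationship type -> (from label, to label), copied from the schema table.
ALLOWED_RELATIONSHIPS: Dict[str, Tuple[str, str]] = {
    "MASTERED": ("Child", "Concept"),
    "STRUGGLING_WITH": ("Child", "Concept"),
    "READ": ("Child", "Book"),
    "CONTAINS": ("CurriculumObjective", "Concept"),
    "PREREQUISITE_OF": ("Concept", "Concept"),
    "TEACHES": ("Book", "Concept"),
}


def validate_edge(rel_type: str, from_label: str, to_label: str) -> None:
    """Raise ValueError if the edge does not match the schema."""
    expected = ALLOWED_RELATIONSHIPS.get(rel_type)
    if expected is None:
        raise ValueError(f"unknown relationship type: {rel_type}")
    if (from_label, to_label) != expected:
        raise ValueError(
            f"{rel_type} must go {expected[0]} -> {expected[1]}, "
            f"got {from_label} -> {to_label}"
        )
```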

Example Query Patterns

// Find prerequisite concepts that may be blocking the concepts a child is struggling with
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:PREREQUISITE_OF]-(blocker:Concept)
RETURN blocker.title, blocker.difficulty ORDER BY blocker.difficulty

// Find books that teach concepts this child needs
MATCH (c:Child {child_id: $cid})-[:STRUGGLING_WITH]->(concept:Concept)
      <-[:TEACHES]-(book:Book)
WHERE $child_fk - 0.5 <= book.fk_level <= $child_fk + 0.5
RETURN book.title, book.id ORDER BY book.fk_level
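
To make the intent of the book query concrete, here is an in-memory Python equivalent. It is a sketch only: the data shapes are assumptions, and unlike the Cypher pattern it returns each book once even when the book teaches several of the child's struggling concepts.

```python
from typing import Dict, List, Set, Tuple


def recommend_books(
    struggling: Set[str],
    teaches: List[Tuple[str, str]],
    books: Dict[str, dict],
    child_fk: float,
    band: float = 0.5,
) -> List[str]:
    """Return ids of books that teach a struggling concept within the FK band.

    struggling: concept ids the child is struggling with
    teaches:    (book_id, concept_id) pairs, one per TEACHES edge
    books:      book_id -> {"fk_level": float}
    """
    candidates = {book_id for book_id, concept_id in teaches if concept_id in struggling}
    in_band = [
        book_id
        for book_id in candidates
        if child_fk - band <= books[book_id]["fk_level"] <= child_fk + band
    ]
    # Easiest books first, ties broken by id for a stable order.
    return sorted(in_band, key=lambda book_id: (books[book_id]["fk_level"], book_id))
```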

5. Multi-Tenancy & Access Control

Tenant Hierarchy

Nation
  └── District
        └── School
              └── Class
                    └── Child

Data Isolation Rules

Every child-scoped node carries: child_id + school_id + district_id + nation_code (shared nodes such as Concept and Book carry no child_id)
All queries are scoped to the appropriate level — never cross-school or cross-district
PostgreSQL RLS enforces tenant boundaries at the database level
Neo4j RBAC enforces property-level access in the graph
Data flows UP only (aggregated, anonymized) — never DOWN or SIDEWAYS

Access Control Model

Role	Can Access
Learner Bot	Single child only (child_id scoped)
Teacher	Their class only
School Admin	Their school only
District Admin	Their district (aggregated + anonymized)
Pickatale Analytics	All nations (k-anonymity enforced above school level)
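
The row-level part of this model can be sketched as a scope comparison. The sketch covers only the three roles that see individual children. District Admin and Pickatale Analytics see aggregated data only and are deliberately left out. `Scope`, `can_access`, and the `class_id` field are assumptions, since the schema above does not list a class identifier.

```python
from dataclasses import dataclass
from typing import Optional

HIERARCHY = ["nation_code", "district_id", "school_id", "class_id", "child_id"]

# Deepest hierarchy field that must match for each role.
ROLE_LEVEL = {
    "learner_bot": "child_id",
    "teacher": "class_id",
    "school_admin": "school_id",
}


@dataclass(frozen=True)
class Scope:
    nation_code: str
    district_id: Optional[str] = None
    school_id: Optional[str] = None
    class_id: Optional[str] = None
    child_id: Optional[str] = None


def can_access(role: str, actor: Scope, target: Scope) -> bool:
    """True if every field down to the role's level matches between actor and target."""
    level = ROLE_LEVEL.get(role)
    if level is None:
        return False  # unknown or aggregate-only roles get no row-level access
    depth = HIERARCHY.index(level) + 1
    return all(
        getattr(actor, name) is not None and getattr(actor, name) == getattr(target, name)
        for name in HIERARCHY[:depth]
    )
```

Denying by default for unknown roles matches the rule that data never flows sideways: access has to be granted explicitly by a matching scope.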

6. Privacy & GDPR

Children under 13 fall into the strictest GDPR + COPPA category, so all child data must be treated with maximum protection. Voice audio is deleted immediately (COPPA).

Per-child folder isolation: data/children/{child_id}/
AES-256 encryption at rest on all raw markdown files
Deletion: remove child = delete folder + purge all graph nodes and relationships with that child_id, plus the child's PostgreSQL profile rows and pgvector embeddings
No PII in telemetry events — child_id only, never name/email
Embeddings tagged with tenant metadata, never raw text stored in embeddings layer
k-anonymity enforced at district level and above (minimum group size: configurable, default 10)
GDPR export: ZIP per-child folder + export graph node properties on request
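
The k-anonymity rule above can be illustrated with a rollup that suppresses small groups. This is a sketch, assuming suppression (dropping the group entirely) is the chosen behaviour for groups below the minimum size. The function name and row fields are hypothetical.

```python
from typing import Dict, List


def k_anonymous_rollup(
    rows: List[dict], group_key: str, value_key: str, k: int = 10
) -> Dict[str, dict]:
    """Aggregate values per group and drop any group with fewer than k members."""
    groups: Dict[str, List[float]] = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {
        group: {"n": len(values), "mean": sum(values) / len(values)}
        for group, values in groups.items()
        if len(values) >= k  # smaller groups are suppressed, not reported
    }
```

Rows passed in should already be free of child identifiers. Only the group key and the measured value are needed for the aggregate.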

7. Decisions Log

#	Date	Decision	Rationale
1	2026-04-18	One shared RAG, per-child isolation internally	Separate RAG per child would explode server resources
2	2026-04-18	Heavy processing on worker servers, not primary	Primary server serves live projects — cannot handle graph processing load
3	2026-04-18	Use OpenAI Embeddings API, not local model	No GPU on primary server; cloud API is more reliable and cost-effective
4	2026-04-18	Hybrid stack: Neo4j + PostgreSQL + pgvector	Neo4j for relationships, PostgreSQL for auth/control, pgvector for similarity — each does what it's best at
5	2026-04-18	Raw data as markdown files + telemetry DB	Cheap, readable, easy to audit; graph built from processed insights, not raw logs
6	2026-04-18	Educational knowledge graph, not chat memory	Goal is curriculum mastery tracking, not conversation history
7	2026-04-18	Central RAG on dedicated server, separate from primary	Isolation from live projects; can scale independently

8. Assumptions

Worker servers will be provisioned by Jixian (cloud VPS or cloud functions)
Central RAG server is a separate dedicated machine — not the current primary server
Curriculum objectives are sourced from CM (cm.readingtester.com) via API only (no direct DB access)
OpenAI text-embedding-3-small is the embedding provider
Nightly batch window: 00:00–06:00 local time per region
Learner Bot is the ONLY service that queries the Central RAG directly

9. Open Questions

To be resolved in Sig + Jixian design call before build begins.

Q1. Where do worker servers live? Cloud functions (Lambda/Cloud Run) vs dedicated VPS vs Jixian's machine? — Jixian to decide

Q2. How many predefined educational mining goals to start with? What are the first 5? — Sig + Jixian to agree

Q3. Does the Learner Bot cache RAG query results, or query fresh each nightly run?

Q4. What is the initial set of Concept nodes — auto-extracted from CM objectives or manually curated?

Q5. How does the system handle a child switching schools or nations mid-year?

Q6. Real-time vs batch for vocabulary gap updates — word tap events are high frequency

Q7. Which dedicated server hosts the Central RAG? — Jixian to provision

10. Next Steps

Sig + Jixian phone call to finalize per-child use cases and sign off on this design
Resolve Open Questions 1–7 above
Write detailed dev plan (schema migrations, API contracts, worker job specs)
Write infrastructure plan (server provisioning, network topology, monitoring)
Only then: begin build

Module 15 · Learner Bot RAG System · v1.0 · 2026-04-18 · Design phase — not yet approved for build