SchoolsData — Methodology, Sources & Decision Rules
Updated: 2026-05-22
This document is the canonical reference for how SchoolsData makes decisions: what data we use, where it came from, how we score and tag, and what we deliberately do NOT do (defamation rules, privacy boundaries, methodology transparency requirements).
Every public page on the live site must link back to this document.
---
1. Data sources
| Layer | Source | URL | Coverage | Last refresh |
|---|---|---|---|---|
| Schools directory (identity, EQI, roll, lat/lng) | NZ Ministry of Education | data.govt.nz CKAN directory-of-educational-institutions | 2,576 schools | 2026-03-02 (then nightly) |
| Student roll history (10-year) | Ministry of Education bulk | educationcounts.govt.nz/statistics/school-rolls (raw CSV) | 24,973 rows | as ingested |
| Staff counts (teacher headcounts, FTTE, ratios) | Education Counts bulk | educationcounts.govt.nz/statistics/teacher-workforce | 38,005 rows | as ingested |
| ERO reports | Education Review Office via Wayback Machine | ero.govt.nz/institution/{moe_id}/{slug} → web.archive.org/web/{ts}/{url} | 636 reports across 175 schools (≥2 review cycles for 169 schools) | 2026-05-21 |
| ERO verbatim quotes | Extracted from ERO report markdown via Claude Haiku 4.5 | (same as above) | 3,459 quotes, all embedded (text-embedding-3-small 1536-dim) | 2026-05-21 |
| ERO trajectory diffs | Generated by pairing consecutive cycles per school | (computed) | 1,634 diffs across 169 schools | 2026-05-22 |
| ERO authorship + scoring | Re-extracted from stored markdown via Haiku | (computed) | 636 reports tagged with Director name, regional office, 0-100 per-dimension scores | 2026-05-22 |
| Audited annual financial statements | School websites (PDF) discovered via DataforSEO SERP | site:{school-domain.school.nz} (annual report OR annual financial OR financial statement) filetype:pdf → school's own CDN | 126 reports across 70+ schools | 2026-05-21 |
| Financial disclosures (verbatim) | Extracted from annual report PDFs via Haiku-PDF | (same) | 1,915 disclosures, all embedded | 2026-05-21 |
| Named board members + principals | Extracted from annual report PDFs | (same) | 1,329 person-tenure rows | 2026-05-21 |
| Feeder-school relationships | Mentioned in ERO report markdown | (same as ERO) | 22 relationships (most ERO reports don't name feeders by name) | 2026-05-22 |
| Suburb safety + demographics | Pangaea's SafeSuburbs/CrimeStats (don't re-ingest) | crimestats.co.nz/data/suburbs.json master + nz-crime-8kpe.onrender.com live API | 2,416 of 2,576 schools (93.8%) joined via geo-nearest matching | 2026-05-22 |
| School tags | Computed deterministically from above | (in-house) | 10,300 tag rows | 2026-05-22 |
Sources deliberately NOT yet used
| Source | Status | Why deferred |
|---|---|---|
| Education Counts per-school CSV (NCEA / attendance / discipline / funding) | Imperva-blocked — Phase 4.7 paid proxy required | Bypass costs $30-50/mo; defer until first revenue signal |
| Charities Register satellite trusts | Schema applied, ingest paused | Name-substring matcher had too many false positives (e.g. "Mt Albert Grammar" matched "Mt Albert Baptist Church"). Picks up after Phase 3 school-website crawl provides explicit proprietor names |
| Parent + student reviews (Google Places API) | Roadmap Phase 6.4 | One-off ~$45 for all 2,576 schools |
---
2. Provenance rules (non-negotiable)
Every fact in every table has a source_url. Migration 007 added source_url + source_doc_date + source_snapshot_ts + ingest_method + ingested_at columns to every fact table.
On the public site: every claim renders with a <SourceChip> next to it — clickable, opens the original document. No claim renders without one.
Build-time gate: audit-facts-gate.cjs (port of HIC's facts-gate) refuses to build if any rendered claim doesn't have a corresponding row with non-null source_url. Hard fail, blocks deploy.
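A minimal sketch of the kind of check such a gate performs, assuming each rendered claim carries a factId that points at a fact row. The claim shape, the function names, and the Map lookup are hypothetical and are not taken from audit-facts-gate.cjs; only the non-null source_url rule comes from the text above.

```javascript
// Hypothetical shapes: a claim is { factId }, a fact row is { source_url, ... }.
// Returns every claim that has no backing row or whose row lacks a source_url.
function findUnsourcedClaims(renderedClaims, factRowsById) {
  return renderedClaims.filter((claim) => {
    const row = factRowsById.get(claim.factId);
    return !row || row.source_url == null;
  });
}

// Hard fail: throwing here is what would block the build in this sketch.
function assertFactsGate(renderedClaims, factRowsById) {
  const failures = findUnsourcedClaims(renderedClaims, factRowsById);
  if (failures.length > 0) {
    const ids = failures.map((claim) => claim.factId).join(', ');
    throw new Error(`facts gate: ${failures.length} claim(s) without source_url: ${ids}`);
  }
}
```

In a build script, a caller would typically catch nothing and let the thrown error exit the process with a non-zero status.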
---
3. ERO scoring methodology
ERO publishes categorical ratings, not numeric scores. We derive 0-100 composite scores via:
1. Categorical → numeric mapping:
- well placed → 80-100
- developing → 60-79
- establishing → 40-59
- requires improvement → 20-39
- concerning → 0-19
- Topic not discussed in report → null (do NOT score)
2. Narrative-derived fill: When a dimension isn't categorically rated, a Haiku pass reads the narrative and assigns a score 0-100 based on tone + specifics — only if the dimension is meaningfully discussed. Never invented.
3. Composite: average of available per-dimension scores. Stored in school_ero_reports.composite_score. Methodology displayed on every school page.
4. Trajectory direction (across consecutive review cycles):
- score delta ≥ 8 → improvement
- score delta ≤ -8 → decline
- otherwise → stable
- if scores not available → derived from categorical rating order
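The mapping, composite, and trajectory rules above can be sketched as follows. The bands and the ±8 thresholds are taken from the list; the function names and the object shape for per-dimension scores are assumptions, and the sketch does not decide where inside a band a given report lands.

```javascript
// Score bands per categorical rating, copied from the mapping above.
const RATING_BANDS = {
  'well placed': [80, 100],
  'developing': [60, 79],
  'establishing': [40, 59],
  'requires improvement': [20, 39],
  'concerning': [0, 19],
};

// True when a numeric score sits inside the band for its categorical rating.
function scoreMatchesRating(rating, score) {
  const band = RATING_BANDS[rating];
  return Boolean(band) && score >= band[0] && score <= band[1];
}

// Average of available per-dimension scores; null dimensions are skipped, not treated as 0.
function compositeScore(dimensionScores) {
  const available = Object.values(dimensionScores).filter((s) => s != null);
  if (available.length === 0) return null;
  return available.reduce((sum, s) => sum + s, 0) / available.length;
}

// Direction between two consecutive review cycles.
// Returns null when a score is missing, so the caller can fall back to rating order.
function trajectoryDirection(previousScore, currentScore) {
  if (previousScore == null || currentScore == null) return null;
  const delta = currentScore - previousScore;
  if (delta >= 8) return 'improvement';
  if (delta <= -8) return 'decline';
  return 'stable';
}
```

For example, a report scored 80 and 60 on two dimensions with a third left null has a composite of 70, and a move from 60 to 67 between cycles counts as stable.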
Note: we explicitly do NOT copy SchoolRank.com.au's 4-axis formula. Their inputs are AU (ICSEA, NAPLAN, ATAR); ours are NZ (EQI, NCEA, ERO trajectory, audited financials). The schoolrank composite is a UX inspiration, not a methodology copy.
Planned NZ-specific 5-axis composite (Phase 4.4)
| Axis | Inputs |
|---|---|
| academic | NCEA L2/L3 pass rate + UE rate (Phase 4.7) |
| trajectory | ERO diff direction × magnitude + roll growth % |
| wellbeing | ERO wellbeing rating + attendance % + stand-down rate |
| equity | Māori/Pasifika achievement parity + EQI-adjusted outcomes |
| governance | Financial health (surplus consistency) + audit opinion + board stability |
Parent weight-shifting on the site: each axis renders as a slider, composite recomputes live. Methodology shown on every page.
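One way the live recompute could work is a weighted mean over the axes that have a score. This is an assumption: the document does not specify how slider weights combine, and the function name and object shapes below are hypothetical.

```javascript
// axisScores:  { academic: 82, trajectory: 64, ... } with null for unscored axes.
// axisWeights: slider positions keyed by the same axis names (any non-negative scale).
// Assumption: weighted mean, renormalised over the axes that are scored and weighted.
function weightedComposite(axisScores, axisWeights) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [axis, score] of Object.entries(axisScores)) {
    const weight = axisWeights[axis] ?? 0;
    if (score == null || weight <= 0) continue; // skip unscored axes and zeroed sliders
    weightedSum += score * weight;
    weightTotal += weight;
  }
  return weightTotal === 0 ? null : weightedSum / weightTotal;
}
```

Renormalising over the scored axes keeps a school with a missing axis from being penalised as if that axis were zero, which mirrors the "null, do NOT score" rule used for ERO dimensions.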
---
4. School↔suburb matching
Schools join to Crimestats suburb_id via a 4-tier strategy (see match-schools-to-suburbs.cjs):
1. A confidence — exact name + region: norm(school.suburb) + norm(school.region) == norm(suburb.name) + norm(suburb.region). 34 of 2,576 schools.
2. B confidence — name match disambiguated by geo: when school suburb name appears in multiple Crimestats suburbs, pick the geographically nearest (within 15km). 266 schools.
3. B confidence — geo nearest <3km: lat/lng nearest-neighbour to a suburb centroid. 1,737 schools.
4. C confidence — geo nearest 3-10km: same as above but wider radius. 379 schools.
5. Unmatched: rural schools without lat/lng or with no nearby suburb. 160 schools.
Confidence rendered on the public site — A and B are surfaced freely; C surfaced with a "best-fit" disclaimer; unmatched shows no suburb panel.
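The tiers above can be sketched as a single function. The distance thresholds and confidence letters come from the list; the norm() implementation, the method labels, the suburb centroid fields (lat, lng), and the simplification of tier 2 to "nearest same-name suburb within 15km" are assumptions and may differ from match-schools-to-suburbs.cjs.

```javascript
const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two { lat, lng } points, in kilometres.
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Assumed normaliser: trim + lowercase only.
const norm = (s) => (s || '').trim().toLowerCase();

function matchSchoolToSuburb(school, suburbs) {
  const unmatched = { suburb: null, confidence: null, method: 'unmatched' };
  const name = norm(school.suburb);
  // Tier 1 (A): exact name + region.
  const exact = name && suburbs.find(
    (s) => norm(s.name) === name && norm(s.region) === norm(school.region));
  if (exact) return { suburb: exact, confidence: 'A', method: 'exact_name_region' };
  if (school.lat == null || school.lng == null) return unmatched;
  const byDistance = suburbs
    .map((s) => ({ suburb: s, km: haversineKm(school, s) }))
    .sort((x, y) => x.km - y.km);
  // Tier 2 (B): same-name suburb, nearest one within 15km.
  const sameName = byDistance.find((c) => norm(c.suburb.name) === name && c.km <= 15);
  if (sameName) return { suburb: sameName.suburb, confidence: 'B', method: 'name_geo' };
  // Tiers 3 (B) and 4 (C): nearest centroid within 3km, then within 10km.
  const nearest = byDistance[0];
  if (nearest && nearest.km < 3) return { suburb: nearest.suburb, confidence: 'B', method: 'geo_lt_3km' };
  if (nearest && nearest.km <= 10) return { suburb: nearest.suburb, confidence: 'C', method: 'geo_3_10km' };
  return unmatched;
}
```

Returning the method alongside the confidence letter makes it possible to tell the two B tiers apart when auditing matches.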
Known limitation: Crimestats uses StatsNZ SA2 boundaries (small statistical areas), so commercial-fringe SA2s (e.g. "Mount Eden North East") can have inflated harm_per_1k metrics from non-resident offences. Refinement planned: prefer name-overlap match within 1km even if not absolute nearest.
---
5. School↔ERO institution matching
ERO uses the Ministry of Education school_id as its canonical institution_id. URL pattern: ero.govt.nz/institution/{school_id}/{slug}.
Wayback Machine bypass strategy (see proof-slice.cjs):
1. Try archive.org/wayback/available?url={canonical} with multiple slug variants
2. Fall back to CDX search if wayback/available returns empty (their two APIs sometimes disagree)
3. Pick the newest 200-status snapshot across all variants
Footguns documented in this pipeline:
- encodeURIComponent breaks wayback/available: pass the URL literally
- CDX matchType=prefix strips trailing slashes, so "54/" text-matches "5499/": use exact-URL lookup instead
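A sketch of the two pure steps in this strategy: building the availability lookup without encoding the target URL, and picking the newest 200-status snapshot from CDX results. It assumes the CDX response was requested with output=json, where the first row is a header naming the columns. The function names are hypothetical and the network calls themselves are left out; proof-slice.cjs may be organised differently.

```javascript
// Availability lookup URL. The target is appended literally, with no
// encodeURIComponent, per the first footgun above.
function waybackAvailableUrl(targetUrl) {
  return `https://archive.org/wayback/available?url=${targetUrl}`;
}

// cdxRows: parsed CDX JSON, e.g. merged across slug variants:
//   [["urlkey","timestamp","original","mimetype","statuscode","digest","length"], [...], ...]
// Returns the newest snapshot with status 200, or null if there is none.
function newestOkSnapshot(cdxRows) {
  if (!Array.isArray(cdxRows) || cdxRows.length < 2) return null;
  const [header, ...rows] = cdxRows;
  const tsIdx = header.indexOf('timestamp');
  const statusIdx = header.indexOf('statuscode');
  const originalIdx = header.indexOf('original');
  let best = null;
  for (const row of rows) {
    if (row[statusIdx] !== '200') continue;
    // 14-digit timestamps (YYYYMMDDhhmmss) compare correctly as strings.
    if (best === null || row[tsIdx] > best.timestamp) {
      best = { timestamp: row[tsIdx], original: row[originalIdx] };
    }
  }
  if (best === null) return null;
  return { ...best, snapshotUrl: `https://web.archive.org/web/${best.timestamp}/${best.original}` };
}
```

When merging rows from several slug variants, only one header row should be kept at the front of the array.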
---
6. Defamation + privacy rules
School-level tags — LOW risk
Tags about institution behavior (e.g. turnaround_governance, surplus_consecutive_3y, improving_maori_achievement). All deterministic from data; every tag links to its evidence row.
Person-level tags — HIGH risk, restricted
ALLOWED (factual + public-role only):
- prolific_ero_director (≥100 reports written): verifiable count, public role
- regional_specialist (≥80% of reports for one ERO regional office)
- long_tenure_principal (named principal ≥10 years at same school in school_leadership)
NOT ALLOWED (evaluative judgments on named individuals):
- strict_reviewer / lenient_reviewer, even if statistically supported
- underperforming_principal / consistently_negative_reviewer
- Any tag combining named individual + negative outcome
GREY ZONE (allowed only after legal review):
- "Director X averages 73/100 composite score (national mean 71)" — render as factual stat line, not pill
- "Principal X led the 2017-2019 turnaround period at School Y" — positive attribution to public role, date range tied to documented improvement
- Stored with safety_class='grey_zone' + legal_reviewed_at IS NULL: not surfaced publicly until reviewed
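The two ALLOWED Director tags are pure counts, so they can be derived without any evaluative input. A minimal sketch, assuming report counts per regional office are already aggregated for one Director; the function name and input shape are hypothetical, and the tag names are written in snake_case.

```javascript
// reportsByOffice: { [regionalOffice]: reportCount } for a single ERO Director.
// Applies only the count-based thresholds listed under ALLOWED above.
function directorTags(reportsByOffice) {
  const counts = Object.values(reportsByOffice);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const tags = [];
  if (total >= 100) tags.push('prolific_ero_director');
  if (total > 0 && Math.max(...counts) / total >= 0.8) tags.push('regional_specialist');
  return tags;
}
```

Nothing in this function reads scores or ratings, which is the property that keeps these tags on the factual side of the line.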
Defamation guard at extraction time
Every verbatim quote inserted into school_ero_quotes must be an exact substring of the source markdown. Quotes that fail the substring match are dropped by the extractor rather than stored, and the drop count is logged on the parent report row. Haiku occasionally smooths punctuation or connectives; those quotes are the ones caught and filtered.
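The guard amounts to an exact substring filter with a drop counter. A minimal sketch, with a hypothetical function name and return shape:

```javascript
// Keeps only quotes that appear character-for-character in the source markdown.
// droppedCount is what would be logged on the parent report row.
function filterVerbatimQuotes(candidateQuotes, sourceMarkdown) {
  const kept = candidateQuotes.filter(
    (quote) => quote.length > 0 && sourceMarkdown.includes(quote));
  return { kept, droppedCount: candidateQuotes.length - kept.length };
}
```

Because the comparison is exact, a quote with a single smoothed comma or a curly quote swapped for a straight one is dropped rather than repaired.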
Privacy Act
- Public roles + verifiable facts: OK
- Cross-aggregating named individuals with personal data from other Pangaea sites: requires Privacy Impact Assessment first
- Principal/board chair names from ERO reports: public roles, OK to surface
- Photos of named individuals: only school-published photos with explicit clearance
---
7. Year-stamping + freshness
- Every aggregation page (/schools/{state}, /suburb/{slug}, /tag/{slug}) carries the current year in title + H1 ("Best Schools in Auckland | 2026 Rankings")
- Tier-A source-doc freshness cap: 36 months. If a school's most recent ERO review is older than 36 months, the trajectory panel renders a "data may be stale" banner
- ERO reports from Wayback include source_snapshot_ts, visible on every quote citation
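The 36-month freshness cap can be checked with plain date arithmetic. A sketch, assuming both dates are JavaScript Date objects and that "older than 36 months" is measured in calendar months in UTC; the function name is hypothetical.

```javascript
// True when the most recent review predates (now minus capMonths calendar months).
function isStale(latestReviewDate, now, capMonths = 36) {
  const cutoff = new Date(now.getTime());
  // setUTCMonth accepts out-of-range values and rolls the year back as needed.
  cutoff.setUTCMonth(cutoff.getUTCMonth() - capMonths);
  return latestReviewDate.getTime() < cutoff.getTime();
}
```

Passing now explicitly, rather than calling new Date() inside, keeps the check deterministic at build time and easy to test.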
---
8. AEO + LLM citation surface
| Surface | Purpose |
|---|---|
| /api/school/{slug}/facts.json | Machine-readable per-school facts |
| /api/school/{slug}/markdown.md | Same data as markdown |
| /api/school/{slug}/ero-history.md | Full ERO report history with verbatim quotes |
| /api/clause-search?q=... | Semantic search over verbatim ERO quotes (and disclosures) |
| /.netlify/functions/mcp | MCP server with search_ero_clauses, find_schools_with_tags, compare_ero_reports, get_ero_trajectory |
| /llms.txt | Master URL index walking the API tree |
| /mcp-install | One-click install for Cursor / Claude Desktop / Goose |
Citation discipline: every fact in JSON / markdown carries source_url + source_doc_date. LLMs cite the source, not us.
---
9. Cross-Pangaea network
Suburb is the canonical join axis across the Pangaea verticals:
- schoolsdata.co.nz → suburb → safesuburbs.co.nz (safety + demographics) → crimestats.co.nz (deeper crime breakdown)
- No duplicate ingestion: each layer is sourced once in its canonical home and joined elsewhere
- school_suburb_links table holds the schools↔suburb join; equivalents for the other sites can be added (charities↔suburb is the next obvious one)
---
11. International student fees — two distinct metrics
International fees in our data come from two fundamentally different sources, each with a different audience and reliability. We capture them as separate metrics, never as the same field.
11.1 Published tuition (metric_type='published_tuition')
What it is: the headline annual tuition figure a school advertises to prospective international parents / agents. This is the rate a NEW international student pays.
Source: scraped from school websites' /international/fees pages and prospectus PDFs via SERP discovery (DataforSEO).
Used by: parents shopping for schools, agents quoting placements. Public-facing.
Stored in: school_international_fees_history.tuition_annual_nzd with metric_type='published_tuition'.
Coverage: ~127 of 401 SIEBA member schools (~32%). Ceiling is genuine — the long-tail of schools use price-on-application or only publish in printed prospectus.
11.2 Revenue per student (metric_type='revenue_per_student')
What it is: a derived figure, school_financials.international_student_fees ÷ school_international_demographics.total_international_count. It is the average revenue actually realised per international student in a given year.
Source: computed from audited annual report (revenue line) + extracted demographic count (same year, same school).
Used by: journalists, researchers, our own analytical surfaces. Should NOT be displayed as "the tuition fee" to parents because the gap from headline is real and informative:
- Sibling / loyalty discounts
- Partial-year enrolments (started/finished mid-year)
- Exchange programmes at reduced fees
- Refunds + agent commission already netted off
- Scholarships for international students
Stored in: school_international_fees_history.revenue_per_student_nzd (+ revenue_total_nzd numerator + intl_student_count denominator, both preserved for transparency) with metric_type='revenue_per_student'.
Coverage: 33 rows across ~23 schools that have BOTH an audited annual report on file AND a demographic count for the same year. Multi-year trails for ~5 schools (Rangitoto, Macleans, Western Springs, Horowhenua, St Bernard's).
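The derivation in 11.2 can be sketched as below. The output keys follow the column names given for this metric, written in snake_case; the function name and the choice to return null when either input is missing are assumptions.

```javascript
// Derives the revenue-per-student row from an audited revenue line and a
// same-year, same-school international student count.
// Returns null when either input is missing or the count is not positive.
function revenuePerStudent(revenueTotalNzd, intlStudentCount) {
  if (revenueTotalNzd == null || intlStudentCount == null || intlStudentCount <= 0) {
    return null;
  }
  return {
    metric_type: 'revenue_per_student',
    revenue_total_nzd: revenueTotalNzd,    // numerator, preserved for transparency
    intl_student_count: intlStudentCount,  // denominator, preserved for transparency
    revenue_per_student_nzd: revenueTotalNzd / intlStudentCount,
  };
}
```

For instance, a revenue line of 1,200,000 NZD across 80 students gives 15,000 NZD per student. That figure would not be shown to parents as the tuition fee.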
11.3 Display rules
- Free public profile pages: show published tuition only if available. If not, show "Not published — contact school for fee schedule."
- Pro tier API (/api/school/{slug}/international.json?api_key=...): show both metrics + the gap_pct from the school_international_fees_compare view.
- Enterprise tier: also surface the full revenue_total + count + year trail.
- Never present revenue-per-student as if it's the published tuition — it's almost always materially lower for high-international-cohort schools.
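These display rules can be expressed as one tier-aware projection. A sketch under stated assumptions: the tier strings ('free', 'pro', 'enterprise'), the record shape, and the output keys are hypothetical; gap_pct is passed through as read from the compare view rather than recomputed; the year trail is omitted; and the fallback message is paraphrased from the first rule.

```javascript
// record: one school's fee data with snake_case fields as named in this section.
// Unknown tiers fall back to the free view so revenue figures never leak by default.
function internationalFeesView(tier, record) {
  const published = record.tuition_annual_nzd ?? null;
  const paid = tier === 'pro' || tier === 'enterprise';
  if (!paid) {
    return published !== null
      ? { published_tuition_nzd: published }
      : { message: 'Not published: contact school for fee schedule.' };
  }
  const view = {
    published_tuition_nzd: published,
    revenue_per_student_nzd: record.revenue_per_student_nzd ?? null,
    gap_pct: record.gap_pct ?? null, // taken from the compare view, not derived here
  };
  if (tier === 'enterprise') {
    view.revenue_total_nzd = record.revenue_total_nzd ?? null;
    view.intl_student_count = record.intl_student_count ?? null;
  }
  return view;
}
```

Keeping the two metrics under distinct keys in every view is what prevents revenue per student from being rendered as if it were the published tuition.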
11.4 Provenance display
Both metric types render with their respective source URL:
- Published tuition → school's own /international/fees page (or fees PDF)
- Revenue per student → the specific annual report PDF + the demographic disclosure that contained the count
12. Tables (full schema as of 2026-05-22)
Tables loaded with provenance fields (source_url, source_doc_date, ingest_method, ingested_at):
| Table | Rows | Source |
|---|---|---|
| schools | 2,576 | MoE directory CSV |
| school_ero_reports | 636 | Wayback ERO HTML |
| school_ero_quotes (1536-dim embedded) | 3,459 | Haiku extract |
| school_ero_diffs | 1,634 | Haiku diff of consecutive cycles |
| school_financials | 135 | School-website audited PDFs |
| school_financial_disclosures (embedded) | 1,915 | Haiku extract |
| school_leadership | 1,329 | Annual report + ERO + planned Phase 3 site crawl |
| school_feeder_relationships | 20 | ERO mentions |
| school_suburb_links | 2,416 | Cross-Pangaea match to crimestats backend |
| school_international_program | 401 | SIEBA member list + per-school crawl |
| school_international_fees_history | ~50 | Site fees pages + annual reports |
| school_international_demographics | 56 | Annual report disclosure extraction |
| school_moe_interventions | 661 | NZ Gazette LSM/Commissioner/Adviser notices |
| school_staff_turnover | 9 | Annual report disclosure extraction |
| school_scholarships | 0 | Phase 5.5A, pending Phase 3 crawl |
| school_open_days | 0 | Pending Phase 3 crawl |
| school_media | 0 | Pending Phase 3 crawl |
| school_claim_requests | 0 | Pending public-site launch |
| school_tags | 11,100 | Deterministic from above |
| api_clients + api_usage_log | 0 | Pending first paid client |
10. Changelog
| Date | Change |
|---|---|
| 2026-05-21 | Initial methodology document |
| 2026-05-22 | Added Section 4 (school↔suburb matching), Section 9 (cross-Pangaea network), suburb integration in §1 |
| 2026-05-22 | Added Section 11 (intl fees two-metric distinction: published_tuition vs revenue_per_student) + Section 12 (full table inventory as at session end). Migrations 007-017 saved to supabase/migrations/ |