SchoolsData — Methodology, Sources & Decision Rules

Updated: 2026-05-22

This document is the canonical reference for how SchoolsData makes decisions: what data we use, where it came from, how we score and tag, and what we deliberately do NOT do (defamation rules, privacy boundaries, methodology transparency requirements).

Every public page on the live site must link back to this document.

---

1. Data sources

| Layer | Source | URL | Coverage | Last refresh |
|---|---|---|---|---|
| Schools directory (identity, EQI, roll, lat/lng) | NZ Ministry of Education | data.govt.nz CKAN directory-of-educational-institutions | 2,576 schools | 2026-03-02 (then nightly) |
| Student roll history (10-year) | Ministry of Education bulk | educationcounts.govt.nz/statistics/school-rolls (raw CSV) | 24,973 rows | as ingested |
| Staff counts (teacher headcounts, FTTE, ratios) | Education Counts bulk | educationcounts.govt.nz/statistics/teacher-workforce | 38,005 rows | as ingested |
| ERO reports | Education Review Office via Wayback Machine | ero.govt.nz/institution/{moe_id}/{slug}, web.archive.org/web/{ts}/{url} | 636 reports across 175 schools (≥2 review cycles for 169 schools) | 2026-05-21 |
| ERO verbatim quotes | Extracted from ERO report markdown via Claude Haiku 4.5 | (same as above) | 3,459 quotes, all embedded (text-embedding-3-small 1536-dim) | 2026-05-21 |
| ERO trajectory diffs | Generated by pairing consecutive cycles per school | (computed) | 1,634 diffs across 169 schools | 2026-05-22 |
| ERO authorship + scoring | Re-extracted from stored markdown via Haiku | (computed) | 636 reports tagged with Director name, regional office, 0-100 per-dimension scores | 2026-05-22 |
| Audited annual financial statements | School websites (PDF) discovered via DataforSEO SERP | site:{school-domain.school.nz} (annual report OR annual financial OR financial statement) filetype:pdf → school's own CDN | 126 reports across 70+ schools | 2026-05-21 |
| Financial disclosures (verbatim) | Extracted from annual report PDFs via Haiku-PDF | (same) | 1,915 disclosures, all embedded | 2026-05-21 |
| Named board members + principals | Extracted from annual report PDFs | (same) | 1,329 person-tenure rows | 2026-05-21 |
| Feeder-school relationships | Mentioned in ERO report markdown | (same as ERO) | 22 relationships (most ERO reports don't name feeders by name) | 2026-05-22 |
| Suburb safety + demographics | Pangaea's SafeSuburbs/CrimeStats (don't re-ingest) | crimestats.co.nz/data/suburbs.json master + nz-crime-8kpe.onrender.com live API | 2,416 of 2,576 schools (93.8%) joined via geo-nearest matching | 2026-05-22 |
| School tags | Computed deterministically from above | (in-house) | 10,300 tag rows | 2026-05-22 |

Sources deliberately NOT yet used

| Source | Status | Why deferred |
|---|---|---|
| Education Counts per-school CSV (NCEA / attendance / discipline / funding) | Imperva-blocked; Phase 4.7 paid proxy required | Bypass costs $30-50/mo; defer until first revenue signal |
| Charities Register satellite trusts | Schema applied, ingest paused | Name-substring matcher had too many false positives (e.g. "Mt Albert Grammar" matched "Mt Albert Baptist Church"). Picks up after Phase 3 school-website crawl provides explicit proprietor names |
| Parent + student reviews (Google Places API) | Roadmap Phase 6.4 | One-off ~$45 for all 2,576 schools |

---

2. Provenance rules (non-negotiable)

Every fact in every table has a source_url. Migration 007 added source_url + source_doc_date + source_snapshot_ts + ingest_method + ingested_at columns to every fact table.

On the public site: every claim renders with a <SourceChip> next to it — clickable, opens the original document. No claim renders without one.

Build-time gate: audit-facts-gate.cjs (port of HIC's facts-gate) refuses to build if any rendered claim doesn't have a corresponding row with non-null source_url. Hard fail, blocks deploy.
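A minimal sketch of the gate's core check, assuming each rendered claim arrives as a plain object carrying an `id` and a `source_url`. The claim shape and the `findUnsourcedClaims` / `runGate` helpers are hypothetical illustrations, not the contents of audit-facts-gate.cjs.

```javascript
// Hypothetical sketch of a build-time provenance gate.
// Assumption: a rendered claim looks like
// { id: "school-123:roll", source_url: "https://..." }.

// Return every claim whose source_url is missing, null, or blank.
function findUnsourcedClaims(claims) {
  return claims.filter((claim) => {
    const url = claim.source_url;
    return typeof url !== "string" || url.trim() === "";
  });
}

// Throw if any claim lacks provenance; return true otherwise.
function runGate(claims) {
  const missing = findUnsourcedClaims(claims);
  if (missing.length > 0) {
    const ids = missing.map((c) => c.id).join(", ");
    throw new Error(
      `facts-gate: ${missing.length} claim(s) without source_url: ${ids}`
    );
  }
  return true;
}
```

In a Node build script, letting the thrown error propagate ends the process with a non-zero exit code, which is one way a hard fail can block a deploy.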

---

3. ERO scoring methodology

ERO publishes categorical ratings, not numeric scores. We derive 0-100 composite scores via:

1. Categorical → numeric mapping:

- well placed → 80-100

- developing → 60-79

- establishing → 40-59

- requires improvement → 20-39

- concerning → 0-19

- Topic not discussed in report → null (do NOT score)

2. Narrative-derived fill: When a dimension isn't categorically rated, a Haiku pass reads the narrative and assigns a score 0-100 based on tone + specifics — only if the dimension is meaningfully discussed. Never invented.

3. Composite: average of available per-dimension scores. Stored in school_ero_reports.composite_score. Methodology displayed on every school page.

4. Trajectory direction (across consecutive review cycles):

- score delta ≥ 8 → improvement

- score delta ≤ -8 → decline

- otherwise → stable

- if scores not available → derived from categorical rating order
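
The four steps above can be sketched roughly as follows. The score bands and the ±8 thresholds come from the rules above; using the midpoint of each band as the numeric score is an assumption made for illustration (the document does not say how a point within the band is chosen), and the function names are hypothetical.

```javascript
// Score bands from the categorical mapping above.
const ERO_BANDS = new Map([
  ["well placed", [80, 100]],
  ["developing", [60, 79]],
  ["establishing", [40, 59]],
  ["requires improvement", [20, 39]],
  ["concerning", [0, 19]],
]);

// Assumption: represent a category by the midpoint of its band.
// Unknown or undiscussed topics return null and are never scored.
function categoryToScore(category) {
  const band = ERO_BANDS.get(category);
  if (!band) return null;
  return (band[0] + band[1]) / 2;
}

// Composite = plain average of the dimensions that have a score.
function compositeScore(dimensionScores) {
  const available = dimensionScores.filter((s) => s !== null && s !== undefined);
  if (available.length === 0) return null;
  return available.reduce((sum, s) => sum + s, 0) / available.length;
}

// Direction between two consecutive review cycles.
// Returns null when a score is missing, so the caller can fall back
// to the categorical rating order.
function trajectoryDirection(previousScore, currentScore) {
  if (previousScore === null || currentScore === null) return null;
  const delta = currentScore - previousScore;
  if (delta >= 8) return "improvement";
  if (delta <= -8) return "decline";
  return "stable";
}
```

Null dimensions are dropped before averaging rather than treated as zero, so a report that discusses fewer topics is not penalised for the omission.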

Note: we explicitly do NOT copy SchoolRank.com.au's 4-axis formula. Their inputs are AU (ICSEA, NAPLAN, ATAR); ours are NZ (EQI, NCEA, ERO trajectory, audited financials). The SchoolRank composite is a UX inspiration, not a methodology copy.

Planned NZ-specific 4-axis composite (Phase 4.4)

| Axis | Inputs |
|---|---|
| academic | NCEA L2/L3 pass rate + UE rate (Phase 4.7) |
| trajectory | ERO diff direction × magnitude + roll growth % |
| wellbeing | ERO wellbeing rating + attendance % + stand-down rate |
| equity | Māori/Pasifika achievement parity + EQI-adjusted outcomes |
| governance | Financial health (surplus consistency) + audit opinion + board stability |

Parent weight-shifting on the site: each axis renders as a slider, composite recomputes live. Methodology shown on every page.
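
One way the live recompute could work is a weighted mean over whichever axes have a score. This is only a sketch of a planned feature: the `weightedComposite` name, the object shapes, and the choice to renormalise weights when an axis has no data are all assumptions.

```javascript
// Hypothetical slider-driven composite.
// axisScores: { academic: 80, wellbeing: null, ... } (0-100 or null)
// weights:    { academic: 3, wellbeing: 1, ... }     (slider positions)
function weightedComposite(axisScores, weights) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [axis, score] of Object.entries(axisScores)) {
    const weight = weights[axis] ?? 0;
    // Skip axes with no data or a zero weight; the remaining weights
    // are renormalised by dividing by their own total.
    if (score === null || score === undefined || weight <= 0) continue;
    weightedSum += score * weight;
    weightTotal += weight;
  }
  return weightTotal === 0 ? null : weightedSum / weightTotal;
}
```

Calling this on every slider change keeps the displayed composite consistent with the weights the parent currently sees.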

---

4. School↔suburb matching

Schools join to Crimestats suburb_id via a 4-tier strategy (see match-schools-to-suburbs.cjs); schools that clear none of the four tiers are left unmatched:

1. A confidence — exact name + region: norm(school.suburb) + norm(school.region) == norm(suburb.name) + norm(suburb.region). 34 of 2,576 schools.

2. B confidence — name match disambiguated by geo: when school suburb name appears in multiple Crimestats suburbs, pick the geographically nearest (within 15km). 266 schools.

3. B confidence — geo nearest <3km: lat/lng nearest-neighbour to a suburb centroid. 1,737 schools.

4. C confidence — geo nearest 3-10km: same as above but wider radius. 379 schools.

5. Unmatched: rural schools without lat/lng or with no nearby suburb. 160 schools.

Confidence rendered on the public site — A and B are surfaced freely; C surfaced with a "best-fit" disclaimer; unmatched shows no suburb panel.
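
The tier order can be sketched as below. This is not the code in match-schools-to-suburbs.cjs: the record shapes, the simple `norm()` normaliser, and the use of haversine distance to suburb centroids are assumptions, and tier 2 is simplified to "one or more same-name suburbs, nearest within 15 km".

```javascript
// Hypothetical sketch of the tiered school-to-suburb match.
// Assumption: suburbs look like { id, name, region, lat, lng }.
const EARTH_RADIUS_KM = 6371;
const toRad = (deg) => (deg * Math.PI) / 180;
const norm = (s) => (s || "").toLowerCase().trim().replace(/\s+/g, " ");

// Great-circle distance between two { lat, lng } points (haversine).
function haversineKm(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Nearest candidate by centroid distance, or null for an empty list.
function nearestSuburb(school, candidates) {
  let best = null;
  for (const suburb of candidates) {
    const km = haversineKm(school, suburb);
    if (best === null || km < best.km) best = { suburb, km };
  }
  return best;
}

function matchSchoolToSuburb(school, suburbs) {
  const unmatched = { suburb_id: null, confidence: null, tier: null };
  const sameName = suburbs.filter((s) => norm(s.name) === norm(school.suburb));

  // Tier 1 (A): exact name + region. Needs no coordinates.
  const exact = sameName.find((s) => norm(s.region) === norm(school.region));
  if (exact) return { suburb_id: exact.id, confidence: "A", tier: 1 };

  const hasGeo = Number.isFinite(school.lat) && Number.isFinite(school.lng);
  if (!hasGeo) return unmatched;

  // Tier 2 (B): same-name suburbs, nearest one within 15 km.
  const named = nearestSuburb(school, sameName);
  if (named && named.km <= 15) {
    return { suburb_id: named.suburb.id, confidence: "B", tier: 2 };
  }

  // Tier 3 (B): nearest centroid under 3 km. Tier 4 (C): 3-10 km.
  const any = nearestSuburb(school, suburbs);
  if (any && any.km < 3) return { suburb_id: any.suburb.id, confidence: "B", tier: 3 };
  if (any && any.km <= 10) return { suburb_id: any.suburb.id, confidence: "C", tier: 4 };
  return unmatched;
}
```

Returning the tier alongside the confidence letter keeps the two B tiers distinguishable for auditing, even though the public site treats them the same.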

Known limitation: Crimestats uses StatsNZ SA2 boundaries (small statistical areas), so commercial-fringe SA2s (e.g. "Mount Eden North East") can have inflated harm_per_1k metrics from non-resident offences. Refinement planned: prefer name-overlap match within 1km even if not absolute nearest.

---

5. School↔ERO institution matching

ERO uses the Ministry of Education school_id as its canonical institution_id. URL pattern: ero.govt.nz/institution/{school_id}/{slug}.

Wayback Machine bypass strategy (see proof-slice.cjs):

1. Try archive.org/wayback/available?url={canonical} with multiple slug variants

2. Fall back to CDX search if wayback/available returns empty (the two APIs sometimes disagree)

3. Pick the newest 200-status snapshot across all variants

Footguns documented in this pipeline:

  • encodeURIComponent breaks wayback/available — pass URL literally
  • CDX matchType=prefix strips trailing slashes so "54/" text-matches "5499/" — use exact-URL lookup instead
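
A rough sketch of the lookup pieces, without any network calls. It is not the code in proof-slice.cjs: the function names are hypothetical, and passing the target URL literally to the CDX endpoint as well is an assumption (the footgun above is documented only for wayback/available). The selection step relies on CDX JSON output being an array of rows whose first row is the header.

```javascript
// Availability lookup URL. The target is appended literally, without
// encodeURIComponent, per the footgun noted above.
function availabilityUrl(targetUrl) {
  return "https://archive.org/wayback/available?url=" + targetUrl;
}

// Exact-URL CDX query (no matchType=prefix), JSON output, 200-status only.
// Assumption: the target is passed literally here too.
function cdxUrl(targetUrl) {
  return (
    "https://web.archive.org/cdx/search/cdx?url=" +
    targetUrl +
    "&output=json&filter=statuscode:200"
  );
}

// Accepts the parsed CDX JSON response for every slug variant and returns
// the newest 200-status capture across all of them, or null if none exists.
function newestOkSnapshot(cdxResponses) {
  let best = null;
  for (const rows of cdxResponses) {
    if (!Array.isArray(rows) || rows.length < 2) continue;
    const header = rows[0];
    const ts = header.indexOf("timestamp");
    const original = header.indexOf("original");
    const status = header.indexOf("statuscode");
    for (const row of rows.slice(1)) {
      if (row[status] !== "200") continue;
      // Timestamps are fixed-width YYYYMMDDhhmmss strings, so plain
      // string comparison orders them chronologically.
      if (best === null || row[ts] > best.timestamp) {
        best = { timestamp: row[ts], original: row[original] };
      }
    }
  }
  if (best === null) return null;
  return {
    ...best,
    snapshotUrl: `https://web.archive.org/web/${best.timestamp}/${best.original}`,
  };
}
```

Checking the status code again in `newestOkSnapshot` is deliberate: it keeps the selection correct even if a response was fetched without the status filter.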

---

6. Defamation + privacy rules

School-level tags — LOW risk

Tags about institution behavior (e.g. turnaround_governance, surplus_consecutive_3y, improving_maori_achievement). All deterministic from data; every tag links to its evidence row.

Person-level tags — HIGH risk, restricted

ALLOWED (factual + public-role only):

  • prolific_ero_director (≥100 reports written): verifiable count, public role
  • regional_specialist (≥80% of reports for one ERO regional office)
  • long_tenure_principal (named principal ≥10 years at same school in school_leadership)

NOT ALLOWED (evaluative judgments on named individuals):

  • strict_reviewer / lenient_reviewer, even if statistically supported
  • underperforming_principal / consistently_negative_reviewer
  • Any tag combining named individual + negative outcome

GREY ZONE (allowed only after legal review):

  • "Director X averages 73/100 composite score (national mean 71)" — render as factual stat line, not pill
  • "Principal X led the 2017-2019 turnaround period at School Y" — positive attribution to public role, date range tied to documented improvement
  • Stored with safety_class='grey_zone' + legal_reviewed_at IS NULL; not surfaced publicly until reviewed

Defamation guard at extraction time

Every verbatim quote inserted into school_ero_quotes must be an exact substring of the source markdown. Quotes that fail substring match are silently dropped by the extractor (Haiku occasionally smooths punctuation/connectives; those are caught and filtered, with the drop count logged on the parent report row).
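
The guard itself is a plain substring test. A minimal sketch, assuming the extractor hands back an array of candidate strings; the `filterVerbatimQuotes` name and return shape are hypothetical.

```javascript
// Keep only candidates that appear character-for-character in the source
// markdown; count the rest so the caller can log the drop count.
function filterVerbatimQuotes(sourceMarkdown, candidateQuotes) {
  const kept = [];
  let dropped = 0;
  for (const quote of candidateQuotes) {
    if (quote.length > 0 && sourceMarkdown.includes(quote)) {
      kept.push(quote);
    } else {
      dropped += 1;
    }
  }
  return { kept, dropped };
}
```

No normalisation is applied before the comparison, so a quote with a smoothed comma or a swapped connective fails the check, which is the point of the guard.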

Privacy Act

  • Public roles + verifiable facts: OK
  • Cross-aggregating named individuals with personal data from other Pangaea sites: requires Privacy Impact Assessment first
  • Principal/board chair names from ERO reports: public roles, OK to surface
  • Photos of named individuals: only school-published photos with explicit clearance

---

7. Year-stamping + freshness

  • Every aggregation page (/schools/{state}, /suburb/{slug}, /tag/{slug}) carries the current year in title + H1 ("Best Schools in Auckland | 2026 Rankings")
  • Tier-A source-doc freshness cap: 36 months. If a school's most recent ERO review is older than 36 months, the trajectory panel renders a "data may be stale" banner
  • ERO reports from Wayback include source_snapshot_ts, visible on every quote citation
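
The 36-month cap can be checked with calendar-month arithmetic. A small sketch; the `isReviewStale` name is hypothetical, and treating a review exactly 36 months old as still fresh is an assumption about the boundary.

```javascript
// True when the most recent review is more than capMonths old.
// Works in UTC so the result does not depend on the build machine's zone.
function isReviewStale(lastReviewDate, now, capMonths = 36) {
  const cutoff = new Date(lastReviewDate.getTime());
  // setUTCMonth rolls over into later years automatically; day-of-month
  // overflow (e.g. the 31st) rolls forward into the following month.
  cutoff.setUTCMonth(cutoff.getUTCMonth() + capMonths);
  return now.getTime() > cutoff.getTime();
}
```

When this returns true, the trajectory panel would render the "data may be stale" banner described above.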

---

8. AEO + LLM citation surface

| Surface | Purpose |
|---|---|
| /api/school/{slug}/facts.json | Machine-readable per-school facts |
| /api/school/{slug}/markdown.md | Same data as markdown |
| /api/school/{slug}/ero-history.md | Full ERO report history with verbatim quotes |
| /api/clause-search?q=... | Semantic search over verbatim ERO quotes (and disclosures) |
| /.netlify/functions/mcp | MCP server with search_ero_clauses, find_schools_with_tags, compare_ero_reports, get_ero_trajectory |
| /llms.txt | Master URL index walking the API tree |
| /mcp-install | One-click install for Cursor / Claude Desktop / Goose |

Citation discipline: every fact in JSON / markdown carries source_url + source_doc_date. LLMs cite the source, not us.

---

9. Cross-Pangaea network

Suburb is the canonical join axis across the Pangaea verticals:

  • schoolsdata.co.nz → suburb → safesuburbs.co.nz (safety + demographics) → crimestats.co.nz (deeper crime breakdown)
  • No duplicate ingestion: each layer is sourced once in its canonical home and joined elsewhere
  • school_suburb_links table holds the schools↔suburb join; equivalents for the other sites can be added (charities↔suburb is the next obvious one)

---

11. International student fees — two distinct metrics

International fees in our data come from two fundamentally different sources, with different audiences and different reliability. We capture them as separate metrics, never as the same field.

11.1 Published tuition (metric_type='published_tuition')

What it is: the headline annual tuition figure a school advertises to prospective international parents / agents. This is the rate a NEW international student pays.

Source: scraped from school websites' /international/fees pages and prospectus PDFs via SERP discovery (DataforSEO).

Used by: parents shopping for schools, agents quoting placements. Public-facing.

Stored in: school_international_fees_history.tuition_annual_nzd with metric_type='published_tuition'.

Coverage: ~127 of 401 SIEBA member schools (~32%). Ceiling is genuine — the long-tail of schools use price-on-application or only publish in printed prospectus.

11.2 Revenue per student (metric_type='revenue_per_student')

What it is: a derived figure, school_financials.international_student_fees ÷ school_international_demographics.total_international_count. It is the average revenue actually realised per international student in a given year.

Source: computed from audited annual report (revenue line) + extracted demographic count (same year, same school).

Used by: journalists, researchers, our own analytical surfaces. Should NOT be displayed as "the tuition fee" to parents because the gap from headline is real and informative:

  • Sibling / loyalty discounts
  • Partial-year enrolments (started/finished mid-year)
  • Exchange programmes at reduced fees
  • Refunds + agent commission already netted off
  • Scholarships for international students

Stored in: school_international_fees_history.revenue_per_student_nzd (+ revenue_total_nzd numerator + intl_student_count denominator, both preserved for transparency) with metric_type='revenue_per_student'.

Coverage: 33 rows across ~23 schools that have BOTH an audited annual report on file AND a demographic count for the same year. Multi-year trails for ~5 schools (Rangitoto, Macleans, Western Springs, Horowhenua, St Bernard's).

11.3 Display rules

  • Free public profile pages: show published tuition only if available. If not, show "Not published — contact school for fee schedule."
  • Pro tier API (/api/school/{slug}/international.json?api_key=...): show both metrics + the gap_pct from the school_international_fees_compare view.
  • Enterprise tier: also surface the full revenue_total + count + year trail.
  • Never present revenue-per-student as if it's the published tuition — it's almost always materially lower for high-international-cohort schools.
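
The derivation in 11.2 and the gap shown to Pro tier clients can be sketched as follows. The division is the formula given in 11.2; the gap definition (shortfall as a percentage of the published figure) is an assumption, since the view's formula is not stated here, and the function names are hypothetical.

```javascript
// Revenue per student = audited international fee revenue / student count.
// Returns null rather than dividing by zero or by a missing count.
function revenuePerStudent(revenueTotalNzd, intlStudentCount) {
  if (!Number.isFinite(revenueTotalNzd)) return null;
  if (!Number.isFinite(intlStudentCount) || intlStudentCount <= 0) return null;
  return revenueTotalNzd / intlStudentCount;
}

// Assumed gap definition: how far realised revenue per student falls
// below the published tuition, as a percentage of the published tuition.
function gapPct(publishedTuitionNzd, revenuePerStudentNzd) {
  if (!Number.isFinite(publishedTuitionNzd) || publishedTuitionNzd <= 0) return null;
  if (!Number.isFinite(revenuePerStudentNzd)) return null;
  return ((publishedTuitionNzd - revenuePerStudentNzd) / publishedTuitionNzd) * 100;
}
```

With illustrative numbers only: revenue of 1,500,000 NZD across 100 students gives 15,000 NZD per student, and against a published tuition of 20,000 NZD that is a 25% gap under the assumed definition.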

11.4 Provenance display

Both metric types render with their respective source URL:

  • Published tuition → school's own /international/fees page (or fees PDF)
  • Revenue per student → the specific annual report PDF + the demographic disclosure that contained the count

12. Tables (full schema as of 2026-05-22)

Tables loaded with provenance fields (source_url, source_doc_date, ingest_method, ingested_at):

| Table | Rows | Source |
|---|---|---|
| schools | 2,576 | MoE directory CSV |
| school_ero_reports | 636 | Wayback ERO HTML |
| school_ero_quotes (1536-dim embedded) | 3,459 | Haiku extract |
| school_ero_diffs | 1,634 | Haiku diff of consecutive cycles |
| school_financials | 135 | School-website audited PDFs |
| school_financial_disclosures (embedded) | 1,915 | Haiku extract |
| school_leadership | 1,329 | Annual report + ERO + planned Phase 3 site crawl |
| school_feeder_relationships | 20 | ERO mentions |
| school_suburb_links | 2,416 | Cross-Pangaea match to crimestats backend |
| school_international_program | 401 | SIEBA member list + per-school crawl |
| school_international_fees_history | ~50 | Site fees pages + annual reports |
| school_international_demographics | 56 | Annual report disclosure extraction |
| school_moe_interventions | 661 | NZ Gazette LSM/Commissioner/Adviser notices |
| school_staff_turnover | 9 | Annual report disclosure extraction |
| school_scholarships | 0 | Phase 5.5A; pending Phase 3 crawl |
| school_open_days | 0 | Pending Phase 3 crawl |
| school_media | 0 | Pending Phase 3 crawl |
| school_claim_requests | 0 | Pending public-site launch |
| school_tags | 11,100 | Deterministic from above |
| api_clients + api_usage_log | 0 | Pending first paid client |

10. Changelog

| Date | Change |
|---|---|
| 2026-05-21 | Initial methodology document |
| 2026-05-22 | Added Section 4 (school↔suburb matching), Section 9 (cross-Pangaea network), suburb integration in §1 |
| 2026-05-22 | Added Section 11 (intl fees two-metric distinction: published_tuition vs revenue_per_student) + Section 12 (full table inventory as at session end). Migrations 007-017 saved to supabase/migrations/ |