Methodology — How We Compile Hantavirus Surveillance Data

Pipeline overview

Every figure on this site flows through six deterministic stages, run end-to-end as a single ETL job on the VPS that ships the dashboard.

Ingest. A cadence-aware orchestrator (orchestrator.py) decides which of the 15 source scrapers to invoke based on each source's declared schedule — every 30 minutes for news + outbreak alerts, daily for surveillance pages, weekly/monthly/yearly for lower-cadence indices. Each scraper writes raw JSON to output/<name>.json.
Normalise. normalise.py maps every raw output to a uniform Signal shape (id, source, sourceCode, category, rank, title, summary, url, language, countryIso2, publishedAt, ingestedAt) and a uniform CountryOverride shape (cases, source, source_url).
Resolve. News URLs are de-redirected (Google News redirector → canonical publisher URL). Subject-country attribution from the article title overrides publication-host origin.
De-duplicate. Signals are keyed by canonical-URL hash. Duplicate signals coming from multiple sources (GDELT + Google News, for instance) collapse to a single record; the earliest publishedAt wins.
Classify. Every signal carries a rank (1 official / 2 expert / 3 news) and a category (official / surveillance / advisory / news / preprint). Country-level signals_30d counts are computed from this dedup'd corpus.
Snapshot & build. regenerate.py patches js/data.js and content/scraped-snapshot.json. node tools/build.mjs rebuilds every static page and the four API endpoints. The 82-test suite must pass before the rsync to webroot.
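
The de-duplication stage can be sketched as follows. This is a minimal illustration rather than the code in normalise.py: the canonical_key helper, the choice of SHA-256 and the trimmed-down Signal fields are assumptions made for the example. Only the rule itself (key by canonical-URL hash, earliest publishedAt wins) comes from the description above.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Signal:
    # Subset of the Signal shape listed above; the remaining fields are omitted.
    source: str
    title: str
    url: str  # assumed to already be the canonical publisher URL
    publishedAt: datetime


def canonical_key(url: str) -> str:
    """Hypothetical helper: hash of the canonical URL (SHA-256 is an assumption)."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def deduplicate(signals: list[Signal]) -> list[Signal]:
    """Collapse signals sharing a canonical URL; the earliest publishedAt wins."""
    kept: dict[str, Signal] = {}
    for sig in signals:
        key = canonical_key(sig.url)
        current = kept.get(key)
        if current is None or sig.publishedAt < current.publishedAt:
            kept[key] = sig
    return list(kept.values())
```

Keeping one dictionary entry per hash makes the pass linear in the number of signals and independent of the order in which the sources were scraped.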

Source tiers

Each of our 15 sources is classified into one of three tiers. The full live ledger with health status is at /sources/.

Tier 1 · Official — surveillance authorities (WHO DON, WHO Fact Sheet, CDC, ECDC, ECDC AER, CDC EID journal), peer-reviewed indices (PubMed, Europe PMC, Crossref, OpenAlex), and biodiversity records (GBIF). Case counts, outbreak alerts and citation-backing claims come from this tier.
Tier 2 · Reference — Wikipedia (with revision id provenance), bioRxiv and medRxiv preprints. Used for narrative summaries and early-science signals, never for case counts.
Tier 3 · News — GDELT 2.0 global event database and Google News RSS across 14 country-localised locales (en-US, es-AR, es-CL, es-MX, de-DE, fr-FR, pt-BR, tr-TR, ru-RU, zh-CN, ja-JP, ko-KR, vi-VN, pl-PL). Used for the live signal feed and the signal-volume chart; never quoted as case data.

Country attribution

A signal is attributed to a country when one of three deterministic conditions holds:

The source is country-scoped by construction. Example: GNEWS-ES-AR → Argentina.
The source explicitly tags the country in a structured field. Example: CDC NNDSS reporting_area column → US state; WHO DON location field → multi-country roster.
The article title contains a country name from our 50-entry dictionary in any of 13 languages (English, Spanish, Portuguese, French, German, Italian, Russian, Polish, Turkish, Chinese, Japanese, Korean, Vietnamese). We attribute only when exactly one country is named in the title — comparison pieces and multi-country dispatches stay unattributed to avoid false positives.

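We deliberately do not apply free-text named-entity recognition over article bodies or summaries. Body-text matching would widen coverage, but in our judgement the false attributions it introduces would outweigh that gain; the title-only constraint, combined with the single-match rule, keeps attribution conservative.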
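
The title-matching condition and its single-match rule can be sketched as below. The three-country COUNTRY_NAMES table is a toy stand-in for the real 50-entry, 13-language dictionary, and the case-insensitive substring match is an assumption chosen because it also works for scripts without word boundaries; the production matcher may well differ.

```python
from typing import Optional

# Toy stand-in for the 50-entry dictionary: ISO2 code -> names in several languages.
# Both the entries and the matching strategy are illustrative assumptions.
COUNTRY_NAMES: dict[str, tuple[str, ...]] = {
    "AR": ("Argentina", "Argentine", "Argentinien"),
    "CL": ("Chile",),
    "DE": ("Germany", "Alemania", "Deutschland", "Allemagne"),
}


def attribute_country(title: str) -> Optional[str]:
    """Return an ISO2 code only when exactly one country is named in the title."""
    haystack = title.casefold()
    matched = {
        iso2
        for iso2, names in COUNTRY_NAMES.items()
        if any(name.casefold() in haystack for name in names)
    }
    # Zero matches and multiple matches both leave the signal unattributed.
    return next(iter(matched)) if len(matched) == 1 else None
```

A title such as "Hantavirus: Argentina and Chile compare notes" matches two countries and therefore returns None, which is the behaviour described for comparison pieces. Plain substring matching can over-match when one name is contained in another, so a real dictionary would need to guard against that.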

Country-level classification

Every country row in the API and dashboard carries a status derived from the signals attached to it:

Sourced — at least one Tier-1 case count is available (CDC for the USA, ECDC AER for 29 EU/EEA countries).
Signal-active — Tier-3 signals (news / GDELT) in the last 30 days, but no Tier-1 case count.
Unsourced — country in the roster but neither case data nor news signals scraped. Rendered as "Data unavailable".
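
As a rough sketch, the three statuses reduce to a two-input decision. The function and parameter names are hypothetical, and treating a reported Tier-1 count of zero as "Sourced" is an assumption the text above does not settle.

```python
from typing import Optional


def country_status(tier1_cases: Optional[int], tier3_signals_30d: int) -> str:
    """Derive a country row's status from its case count and recent news signals."""
    if tier1_cases is not None:
        # A Tier-1 count exists; a count of zero is still treated as sourced (assumption).
        return "Sourced"
    if tier3_signals_30d > 0:
        return "Signal-active"
    # Neither case data nor news signals: rendered as "Data unavailable".
    return "Unsourced"
```

Checking the case count first means a country with both official data and news coverage is always reported as Sourced.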

Freshness semantics

The source ledger reports each scraper's freshness against its declared cadence:

Healthy — last successful fetch within 60 minutes of expected cadence.
Stale — 60 min to 6 h beyond expected cadence (transient network or rate-limit).
Blocked — rate-limited, 4xx/5xx, or access-policy change.
Unknown — first fetch not yet performed in the current deployment.
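
The freshness labels can be sketched as a function of the last successful fetch and the declared cadence. All names here are hypothetical. The text above does not say how a source more than 6 h overdue without a recorded error is labelled; the sketch falls back to "Blocked" for that case, which is purely an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

HEALTHY_GRACE = timedelta(minutes=60)  # "within 60 minutes of expected cadence"
STALE_LIMIT = timedelta(hours=6)       # "60 min to 6 h beyond expected cadence"


def freshness(
    last_success: Optional[datetime],
    cadence: timedelta,
    now: datetime,
    last_fetch_failed: bool = False,
) -> str:
    """Classify a scraper's freshness against its declared cadence."""
    if last_fetch_failed:
        return "Blocked"   # rate-limited, 4xx/5xx, or access-policy change
    if last_success is None:
        return "Unknown"   # first fetch not yet performed
    overdue = now - (last_success + cadence)
    if overdue <= HEALTHY_GRACE:
        return "Healthy"
    if overdue <= STALE_LIMIT:
        return "Stale"
    return "Blocked"       # assumption: the text leaves this case undefined
```

For a source on the 30-minute cadence, a last success 45 minutes ago is 15 minutes overdue and still Healthy, while one 3 hours ago is 2.5 hours overdue and Stale.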

Case definitions

We accept the case definition used by each primary source and do not attempt to harmonise upward. Where a country reports both suspected and confirmed cases, we publish only confirmed counts — laboratory-confirmed by RT-PCR, IgM ELISA or virus isolation. The ECDC AER, our main European source, reports notification rates per 100 000 alongside absolute counts; we publish the absolute counts and link the underlying AER PDF.

What we don't do

The shortest path to a fuller-looking dashboard is to fill the gaps with estimates. We refuse to do this. Explicitly:

No estimation or extrapolation of case counts. Numbers shown reflect either scraped surveillance figures or signal-mention counts — never modelled values. Confirmed cases come only from the official surveillance feeds and are clearly labelled with their year of publication.
No global daily case timeseries. Hantavirus case surveillance is reported weekly or monthly worldwide; no public source publishes a credible global daily case series. We refuse to synthesise one. The chart on the homepage labelled "Signal volume" plots news + GDELT mentions per day — a real measurement of media attention, never a stand-in for incidence.
No country numbers without a source. Countries with no scraper render "Data unavailable" everywhere their cases would appear. The country roster makes the gap explicit, not invisible.
No CFR averages from incomplete inputs. Aggregate case-fatality requires both cases and deaths from the same source. We lack death counts for most countries, so the aggregate CFR field stays blank rather than showing a misleading number.
No generative-AI summaries. We display publisher titles verbatim. The TL;DR boxes are hand-written by the maintainers, sourced inline, and never paraphrased from third-party text.
No article body republication. We store and serve titles, source names, timestamps and links — never article body text. Underlying publisher licences are unchanged.
No user tracking. No analytics, no cookies, no first- or third-party trackers on this site.
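
The case-fatality rule in the list above can be illustrated with a small guard. This is a per-row sketch under assumed names, not the site's aggregation code: it returns a ratio only when cases and deaths are both present and attributed to the same source, and otherwise leaves the field blank (None).

```python
from typing import Optional


def case_fatality_ratio(
    cases: Optional[int],
    deaths: Optional[int],
    cases_source: Optional[str],
    deaths_source: Optional[str],
) -> Optional[float]:
    """Return deaths / cases only when both figures share one source; else None."""
    if cases is None or deaths is None:
        return None  # an incomplete input leaves the field blank
    if cases_source is None or cases_source != deaths_source:
        return None  # figures from different sources are not combined
    if cases <= 0:
        return None  # no ratio is defined without at least one case
    return deaths / cases
```

Returning None instead of a default such as 0.0 keeps a missing value distinguishable from a genuine zero fatality ratio.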

Known limitations

Three structural caveats apply to every number on this site.

First, under-reporting. Hantavirus presents with non-specific flu-like symptoms in its early phase. Mild cases are routinely missed; the WHO estimates the true global infection burden at five to ten times the laboratory-confirmed total.

Second, retroactive revision. Ministries periodically revise historical counts as late laboratory results come in. ECDC AER values for previous years can change between publications. We always cite the most recent AER and never silently overwrite earlier values: the historical numbers remain in git history.

Third, nomenclature drift. The International Committee on Taxonomy of Viruses revised hantavirus taxonomy in 2018 and again in 2024, renaming several species. Our species profiles use the post-2024 names (e.g. Orthohantavirus andesense) but cite the older common names (Andes virus) prominently for searchability.

Open data

Every figure on this site is downloadable via the public API and refreshed alongside the dashboard:

/api/health.json — pipeline health, source counts, freshness.
/api/countries.json — country roster with ISO codes, case counts, sources, signals_30d, alerts.
/api/signals.json — most recent 1 000 dedup'd signals.
/feed.xml — RSS 2.0 of the latest 50 signals.
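
A minimal way to read the JSON endpoints from Python is sketched below, using only the standard library. The base URL is inferred from the citation line and is assumed to serve the API. The payload layout is not documented here, so rows_with_signals assumes a list of dict rows; signals_30d is the only key taken from the description above.

```python
import json
from urllib.request import urlopen

# Domain taken from the citation line; assumed (not confirmed) to host the API.
BASE_URL = "https://hantometer.com"


def fetch_json(path: str, timeout: float = 10.0):
    """Download and parse one of the JSON endpoints listed above."""
    with urlopen(BASE_URL + path, timeout=timeout) as response:
        return json.load(response)


def rows_with_signals(rows: list[dict]) -> list[dict]:
    """Keep rows whose signals_30d is positive (assumes a list of dict rows)."""
    return [row for row in rows if (row.get("signals_30d") or 0) > 0]
```

A call such as fetch_json("/api/countries.json") would return the parsed payload. Whether that payload is a bare list or a wrapper object is not stated, so it is worth inspecting it before passing anything to rows_with_signals.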

Cite as: Hantometer (2026) Global Hantavirus Surveillance Database, hantometer.com. Underlying source licences are unchanged — the source ledger lists each one. The Hantometer code base and aggregated dataset are not open-source; use of the public API is permitted for personal research and non-commercial reuse with attribution.

Corrections

If you spot an error, a missing source, or a country we should be tracking, write to [email protected]. Acknowledged corrections are credited in the changelog.
