69 Data Sources Across 6 Provinces: How We Built It
Building reliable data pipelines for 69 public data sources is a non-trivial engineering challenge. Each source has its own API format, authentication scheme, rate limits, and failure modes. Here's how we did it.
The Stack
Our pipeline architecture has four layers:
- Ingestion (dlt) — Each data source has a dlt pipeline that handles extraction, pagination, and incremental loading. We support CKAN, Socrata, ArcGIS REST, ArcGIS Hub, and direct CSV/JSON downloads.
- Orchestration (Dagster) — 77 Dagster jobs manage the scheduling and dependency graph. Each source follows a three-asset chain: raw → normalized → entities.
- Storage (Apache Iceberg) — Normalized data lands in Iceberg tables with full schema evolution and time travel. This gives us snapshot isolation and efficient upserts.
- Transforms (DuckDB) — Entity extraction and aggregation queries run in DuckDB, reading directly from Iceberg tables.
The Three-Asset Pattern
Every data source follows the same pattern:
Raw asset — Runs the dlt pipeline, writing raw JSON/CSV records to an Iceberg table in a source-specific namespace (e.g., aer_wells_raw.wells).
Normalized asset — Reads the raw table, applies schema mapping and type coercion, and writes to a normalized namespace (e.g., aer_wells_normalized.wells). This is where we standardize column names, handle nulls, and ensure consistent types.
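As a rough illustration of what the normalized asset does, here is a minimal sketch. The column names, mapping table, and `normalize_record` helper are invented for this example and are not the actual schema:

```python
import re

# Hypothetical raw -> normalized column mapping for a wells source.
COLUMN_MAP = {
    "WELL_NAME": "well_name",
    "Licence No.": "licence_number",
    "SpudDate": "spud_date",
}

def normalize_record(raw: dict) -> dict:
    """Map raw column names to standard snake_case names and
    treat empty strings as proper nulls."""
    out = {}
    for raw_key, value in raw.items():
        # Prefer the explicit mapping; otherwise derive a snake_case name.
        key = COLUMN_MAP.get(raw_key) or re.sub(r"\W+", "_", raw_key).strip("_").lower()
        if value == "":
            value = None  # empty strings become nulls
        out[key] = value
    return out

record = normalize_record({"WELL_NAME": "Alpha 1", "Licence No.": "12345", "SpudDate": ""})
# record == {"well_name": "Alpha 1", "licence_number": "12345", "spud_date": None}
```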
Entity asset — Extracts business entities from normalized data — operators, well types, commodities, permit categories. These feed the entity browser and enrichment engine.
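Stripped of the Dagster machinery, the three-asset chain amounts to three functions feeding each other. The source records and field names below are illustrative stand-ins, not the real schema:

```python
def raw_asset() -> list[dict]:
    """Stand-in for the dlt pipeline: returns raw records as fetched."""
    return [
        {"OPERATOR": "Acme Energy", "WELL_TYPE": "Gas"},
        {"OPERATOR": "Acme Energy", "WELL_TYPE": "Oil"},
    ]

def normalized_asset(raw: list[dict]) -> list[dict]:
    """Standardize column names (raw table -> normalized table)."""
    return [{"operator": r["OPERATOR"], "well_type": r["WELL_TYPE"]} for r in raw]

def entity_asset(normalized: list[dict]) -> dict:
    """Extract distinct business entities from normalized rows."""
    return {
        "operators": {r["operator"] for r in normalized},
        "well_types": {r["well_type"] for r in normalized},
    }

entities = entity_asset(normalized_asset(raw_asset()))
# entities == {"operators": {"Acme Energy"}, "well_types": {"Gas", "Oil"}}
```

In the real pipeline each function is a Dagster asset writing to an Iceberg table, so every stage is independently materializable and inspectable.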
Handling API Diversity
The biggest challenge is the diversity of API patterns across Canadian government portals:
- ArcGIS REST MapServer — Used by Ontario LIO, NL GeoAtlas, and MB GeoPortal. Requires pagination at 2,000 records with resultOffset and resultRecordCount. Some services return polygon geometries that need centroid calculation.
- CKAN — Used by Toronto, NRCan, and federal portals. Package metadata + datastore_search API with limit/offset pagination.
- Socrata — Used by Calgary and Winnipeg. Straightforward CSV export via dataset ID.
- ArcGIS Hub — Ottawa and some MB/BC sources. CSV download endpoint with REST API fallback.
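ArcGIS-style pagination boils down to advancing resultOffset by resultRecordCount until a short page comes back. A minimal sketch, with the fetch function stubbed out (a real one would hit the layer's /query endpoint over HTTP):

```python
from typing import Callable, Iterator

PAGE_SIZE = 2000  # typical maxRecordCount on the services we hit

def paginate(fetch_page: Callable[[int, int], list], page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every feature, advancing the offset one page at a time.
    The two arguments map to resultOffset / resultRecordCount."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # a short page means we've reached the end
            return
        offset += page_size

# Stub standing in for an HTTP request against <service>/query.
def fake_fetch(offset: int, count: int) -> list:
    total = 4500
    return [{"objectid": i} for i in range(offset, min(offset + count, total))]

features = list(paginate(fake_fetch))
# len(features) == 4500, fetched in pages of 2000, 2000, 500
```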
Each pattern has its own error modes. ArcGIS services sometimes return HTML error pages instead of JSON. CKAN datastores can have mismatched resource IDs. We built graceful fallback chains — try the preferred method, fall back to the alternative, log warnings, and continue.
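The fallback chains are ordinary try/except ladders: attempt each method in preference order, log a warning on failure, and only raise if everything fails. A simplified sketch, with placeholder method names:

```python
import logging
from typing import Callable

log = logging.getLogger("ingest")

def fetch_with_fallback(methods: list) -> list:
    """Try each (name, fetcher) pair in order; warn and continue on failure."""
    errors = []
    for name, fetcher in methods:
        try:
            return fetcher()
        except Exception as exc:  # e.g. an HTML error page where JSON was expected
            log.warning("method %s failed: %s; trying next", name, exc)
            errors.append((name, exc))
    raise RuntimeError(f"all fetch methods failed: {errors}")

# Placeholder fetchers simulating a CSV endpoint outage with a REST fallback.
def csv_download() -> list:
    raise ValueError("got HTML error page instead of CSV")

def rest_query() -> list:
    return [{"objectid": 1}]

rows = fetch_with_fallback([("csv", csv_download), ("rest", rest_query)])
# rows == [{"objectid": 1}] — the REST fallback supplied the data
```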
Testing
Every pipeline has end-to-end tests that verify the full chain from raw ingestion through normalization to entity extraction. We seed Iceberg tables with representative data and assert on schema, row counts, and key field values. 240+ tests run on every commit.
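The tests follow the shape below: seed the raw stage with representative rows, run the chain, and assert on schema, row counts, and key field values. The seed data and the inline normalization are illustrative; the real tests seed Iceberg tables and exercise the actual assets:

```python
def test_wells_chain_end_to_end():
    # Seed with representative raw rows (stand-in for seeding an Iceberg table).
    raw = [
        {"OPERATOR": "Acme Energy", "WELL_TYPE": "Gas"},
        {"OPERATOR": "Borealis Oil", "WELL_TYPE": "Oil"},
    ]
    normalized = [{"operator": r["OPERATOR"], "well_type": r["WELL_TYPE"]} for r in raw]

    # Schema: every normalized row has exactly the standard columns.
    assert all(set(r) == {"operator", "well_type"} for r in normalized)
    # Row counts survive normalization.
    assert len(normalized) == len(raw)
    # Key field values: entity extraction sees both operators.
    assert {r["operator"] for r in normalized} == {"Acme Energy", "Borealis Oil"}

test_wells_chain_end_to_end()  # pytest would collect this; run directly here
```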
What We Learned
- Government APIs are fragile — Assume every endpoint will go down. Build fallbacks.
- Schema consistency matters more than completeness — It's better to have 50 well-typed columns than 200 loosely typed ones.
- The three-asset pattern scales — Adding a new source means writing one pipeline file and one Dagster asset file. The pattern is the same whether it's your 10th source or your 69th.
Want to explore Canadian industrial data?
Start Free Trial