A self-updating personal geolocation intelligence dashboard built on a fully static architecture — no server, no database, no runtime. From a Foursquare check-in to a live deployed dashboard in under 5 minutes.
A personal analytics dashboard that ingests every Foursquare/Swarm check-in and renders it into ten pages: a main analytics dashboard; a trip journal with per-trip maps; a companions tracker; a full check-in feed with historical weather; a tips explorer with country/city tabs, closed/deleted-venue badges, view counts, and filter buttons; a venue loyalty explorer; a world cities map; a full-text search page; a photo gallery with 21 000+ images; a stats overview; plus this engineering write-up — all committed to git and served via the Cloudflare Pages CDN.
Beyond rendering, the system maintains a data integrity pipeline: a manual archive
workflow snapshots the full check-in history, diffs it against the previous snapshot to detect renamed
or moved venues, and propagates those changes into the tips dataset — all without extra API calls.
Duplicate rows and check-ins that the API silently stops returning are detected on every full re-fetch
and accumulated in an incremental anomaly log (checkins_anomalies.json), giving a
permanent auditable history of every data quality event.
The entire system runs on free-tier infrastructure: Cloudflare Pages, Cloudflare Workers, Cloudflare KV, Cloudflare D1 (SQLite at the edge), and GitHub Actions. There are no servers and no containers. The build pipeline is 100% Python 3.9+; the frontend is vanilla JS with Leaflet and Chart.js. Dynamic features (full-text search, the paginated check-in feed) are served by Cloudflare Pages Functions querying a D1 database — keeping runtime cost at zero while enabling millisecond-latency queries against the full 65 000-row dataset.
The system is designed around a push-on-change philosophy: nothing runs unless there is new data. A Cloudflare Worker acts as the real-time sensor, keeping the whole pipeline reactive while consuming negligible resources when idle.
The Worker runs on a 1-minute cron trigger, fetching the most recent check-in from
the Foursquare API and comparing its Unix timestamp against the last-seen value stored in
Cloudflare KV. If the timestamp is newer, the Worker writes the new value to KV
(idempotent — prevents double-triggering on retries) and fires a workflow_dispatch
event to GitHub Actions via the REST API.
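The Worker's decision logic — compare timestamps, write KV first for idempotency, then dispatch — can be sketched in Python (the Worker itself is JavaScript; `kv_put` and `dispatch` here are stand-ins for the KV binding and the GitHub REST call):

```python
def maybe_trigger(latest_ts: int, last_seen: int, kv_put, dispatch) -> bool:
    """Fire the GitHub Actions workflow only when a newer check-in exists.

    Writing KV *before* dispatching makes retries idempotent: a retried
    run sees the updated last-seen value, compares equal, and does nothing.
    """
    if latest_ts <= last_seen:
        return False                      # no new data: pipeline stays idle
    kv_put("last_seen_ts", latest_ts)     # record first (idempotency guard)
    dispatch()                            # then fire workflow_dispatch
    return True
```

On the real platform, `dispatch()` would POST to the repository's `workflow_dispatch` endpoint with a token stored as a Worker secret.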
GitHub Actions then fetches the updated check-in data from the private data repository,
rebuilds all ten HTML pages, commits and pushes to main. Cloudflare Pages
detects the push and deploys automatically — no build command needed, since the HTML is
already pre-built.
Coordinate-based timezone lookup (via timezonefinder) fails for
countries that don't observe Daylight Saving Time. For example, Belarus geographically
maps to UTC+2 in winter, but politically observes UTC+3 year-round. A purely geometric
lookup therefore puts check-in local times off by one hour for half the year.
metrics.py maintains a _COUNTRY_TZ dictionary mapping
country names to authoritative IANA timezone IDs. This takes precedence over the
coordinate-based lookup. Europe/Minsk is always UTC+3, regardless of what
the geometric timezone boundary says — because that's what the clocks actually show.
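A sketch of the precedence rule (the dictionary excerpt and function signature are illustrative; in the real pipeline the fallback is timezonefinder's `timezone_at`):

```python
# Excerpt of the authoritative country -> IANA timezone table.
_COUNTRY_TZ = {
    "Belarus": "Europe/Minsk",   # UTC+3 year-round, no DST
}

def resolve_tz(country: str, lat: float, lng: float, coord_lookup) -> str:
    """Political override first; geometric lookup only as a fallback."""
    tz = _COUNTRY_TZ.get(country)
    if tz is not None:
        return tz
    return coord_lookup(lat=lat, lng=lng)  # e.g. TimezoneFinder().timezone_at
```

The override wins even when the geometric boundary disagrees, matching what the clocks actually show.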
checkins.csv contains full location history: GPS coordinates,
venue IDs, timestamps, and companion names across years of travel. Committing this
to a public repo exposes structured personal data to anyone crawling GitHub.
The raw CSV therefore lives in a separate private repository (foursquare-data). A fine-grained
Personal Access Token scoped exclusively to that repository allows GitHub Actions to check it out at build time.
The public repository never sees the raw CSV — only the generated HTML output, which
contains only the aggregated, display-ready data already visible on the site.
The PAT has Contents: read/write and nothing else.
index.html.tmpl and gen_worldcities.py implement a
CTRY_CONT JavaScript dictionary mapping every country
to its continent. The matchVisited() function rejects a world-cities
database match unless the candidate city's country falls on the same continent as
the visited check-in's country. This guard is maintained in sync across both files —
a documented invariant noted in CLAUDE.md.
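In Python terms, the guard is essentially a same-continent check (the map excerpt and function name below are assumptions about the shape of the JS original):

```python
# Excerpt of the CTRY_CONT country -> continent map.
CTRY_CONT = {
    "Belarus": "Europe", "Poland": "Europe", "Morocco": "Africa",
}

def same_continent(candidate_country: str, visited_country: str) -> bool:
    """Accept a world-cities match only when both countries resolve to
    the same, known continent; unknown countries never match."""
    a = CTRY_CONT.get(candidate_country)
    b = CTRY_CONT.get(visited_country)
    return a is not None and a == b
```

This rejects database homonyms in distant countries without needing exact city-name disambiguation.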
metrics.py runs an 8-pass pipeline over each candidate trip window
to progressively widen its boundaries until they reflect the actual journey.
Airport rows are matched by a category substring test
("airport" in cat.lower()) rather than an exact string, catching all
variants emitted by Foursquare: International Airport, Airport
Terminal, Airport Gate, Airport Service. A
prev_end_idx guard prevents the backward scan from crossing into a
preceding trip's arrival rows. Where the heuristics fall short, three JSON config
files provide surgical overrides: trip_start_overrides
and trip_end_overrides pin exact boundary timestamps,
and trip_names.json / trip_tags.json
attach human-readable names and activity tags (bicycle, camping, etc.) keyed by the
final resolved start timestamp.
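The backward scan with its prev_end_idx guard might look like this (function name and row shape are illustrative):

```python
def widen_start(rows: list, start_idx: int, prev_end_idx: int) -> int:
    """Scan backwards from the detected trip start, absorbing
    departure-airport rows, but never crossing into the previous
    trip's arrival rows (indices <= prev_end_idx)."""
    i = start_idx
    while i - 1 > prev_end_idx and "airport" in rows[i - 1]["category"].lower():
        i -= 1
    return i
```

The substring test picks up every airport category variant; the index guard is what keeps back-to-back trips from bleeding into each other.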
The /users/self/tips endpoint silently omits any tip written on
a venue that has since been closed or deleted. With no error, no flag, and no indication
in the response, a full fetch of 1 782 tips appeared complete — until a per-venue sweep
revealed 25 additional tips that existed only on closed venues.
A secondary problem: the API returns country names in the local language
(Беларусь, Republica Moldova, المغرب…) rather than
a consistent English form, breaking grouping, flag lookup, and tab rendering.
A separate bulk data export surfaced a viewCount field entirely absent from the API response. The export also
revealed that presence in checkins.csv is not a reliable proxy for
venue activity: a venue can appear in historical check-ins and still be closed on Foursquare
today. Determining true closed/deleted status required fetching each venue page individually.
Tips on closed venues are now flagged closed=True in tips.json.
The 25 pre-existing sweep tips were identified retroactively by comparing the initial
1 782-tip commit in the data repo against HEAD (1 807 tips); the 25-ID delta was
patched directly.
Closed status was verified by fetching foursquare.com/v/{id} with browser session cookies
(the public page embeds "closed":true in its __NEXT_DATA__ JSON
or in the raw HTML for closed venues). 95 of 100 venues were confirmed closed on-page;
the remaining 5 loaded on the legacy app.foursquare.com renderer with no closed
marker, indicating they are still active — and their tips were found to have been
deleted by moderators rather than lost to venue closure.
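A minimal sketch of the closed-marker check, assuming a simple pattern match is sufficient (the real sweep may parse the __NEXT_DATA__ JSON properly):

```python
import re

# Tolerates optional whitespace around the colon in the embedded JSON.
_CLOSED_RE = re.compile(r'"closed"\s*:\s*true')

def page_marks_closed(html: str) -> bool:
    """True if the venue page embeds "closed":true anywhere —
    covers both the __NEXT_DATA__ blob and raw-HTML occurrences."""
    return bool(_CLOSED_RE.search(html))
```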
The export also supplied viewCount for all tips — a field the API never
returns. fetch_tips.py was updated to capture viewCount going
forward; historical counts from the export were backfilled into tips.json
in one pass. View counts are refreshed on each full re-fetch (--full);
incremental runs only touch tips newer than the latest known timestamp.
Country names are normalised through a CTRY_NORM dictionary
in gen_tips.py mapping every local-language variant to its English form.
City names reuse the existing city_merge.yaml pipeline. Both normalised
values are stored as nc (country) and nci (city) on each tip
record and propagated into the recent-30 tips slice embedded in index.html.
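The country half of that normalisation reduces to a lookup table plus a passthrough default (dictionary targets here are assumptions; city normalisation goes through the separate city_merge.yaml pipeline):

```python
# Excerpt: local-language variants -> canonical English form.
CTRY_NORM = {
    "Беларусь": "Belarus",
    "Republica Moldova": "Moldova",
    "المغرب": "Morocco",
}

def normalise_tip(tip: dict) -> dict:
    """Store the normalised country as `nc`; unknown values pass through
    unchanged so a missing mapping never breaks rendering."""
    tip["nc"] = CTRY_NORM.get(tip["country"], tip["country"])
    return tip
```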
Tip cards display a red CLOSED badge, a purple
DELETED badge, and a 👁 view count
in the footer. Three dedicated filter buttons — By Date, Closed only,
and Deleted only — let the reader surface each data quality category directly.
A venue ID recorded in checkins.csv three years ago may now resolve to
a different name, city, or coordinates — and tips.json, which duplicates
that venue metadata per tip, can drift out of sync independently.
Additionally, full re-fetches occasionally surface a second problem: some check-ins that
existed in the old CSV are simply absent from the API response (deleted or merged
venues), while other rows appear duplicated within the historical data with no indication of
the double-entry.
Both silent drift and silent data loss are impossible to detect without an explicit comparison.
sync_venue_changes.py compares the two snapshots on six fields per
venue_id: venue, city, country,
lat, lng, category. For each changed venue it
patches every matching tip in tips.json in-place — converting lat/lng to
float rounded to 5 dp to match the tips schema — and logs a numbered summary of every
updated tip. The patched tips.json is committed alongside the fresh CSV in
the same atomic commit, keeping both files permanently in sync without any additional
API quota.
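The core of sync_venue_changes.py can be sketched as two small functions (names and snapshot shape are illustrative; the six compared fields and the 5 dp rounding come from the text):

```python
FIELDS = ("venue", "city", "country", "lat", "lng", "category")

def diff_venues(old: dict, new: dict) -> dict:
    """Return {venue_id: {field: new_value}} for every changed field."""
    changes = {}
    for vid, row in new.items():
        prev = old.get(vid)
        if prev is None:
            continue
        delta = {f: row[f] for f in FIELDS if row[f] != prev[f]}
        if delta:
            changes[vid] = delta
    return changes

def patch_tips(tips: list, changes: dict) -> int:
    """Apply venue changes to tips in-place; coordinates are converted
    to float and rounded to 5 dp to match the tips schema."""
    patched = 0
    for tip in tips:
        delta = changes.get(tip["venue_id"])
        if not delta:
            continue
        for field, value in delta.items():
            tip[field] = round(float(value), 5) if field in ("lat", "lng") else value
        patched += 1
    return patched
```

Both outputs land in the same commit as the fresh CSV, so the two files can never diverge between deploys.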
Two classes of anomaly are recorded in checkins_anomalies.json.
Duplicate rows are check-ins whose (venue_id, date) key appears more
than once — identical double-entries from early Swarm usage. They are intentionally
preserved in the CSV rather than silently removed; the anomaly file provides
visibility without data loss. A duplicate_checkins.csv sidecar is also written
for direct inspection.
Missing rows are check-ins present in the existing CSV but absent from the
API response — venues that Foursquare deleted or merged. These too are preserved and recorded
so the count discrepancy is explained and auditable. Both lists accumulate across runs:
new entries are merged in, existing entries are never removed, giving a permanent history
of every data quality event the re-fetch has ever observed.
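The accumulate-never-remove merge might look like this (the checkin_id key field is a hypothetical stand-in for the real anomaly key):

```python
def merge_anomalies(existing: list, found: list, key=("checkin_id",)) -> list:
    """Accumulate anomalies across runs: newly found entries are added,
    previously recorded entries are never removed."""
    seen = {tuple(e[k] for k in key) for e in existing}
    merged = list(existing)
    for row in found:
        if tuple(row[k] for k in key) not in seen:
            merged.append(row)
    return merged
```

Because the merge is append-only, the JSON file doubles as a permanent audit trail of every data quality event the re-fetch has ever seen.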
Companion and overlap data is only available through the per-check-in /v2/checkins/{id} endpoint and is no longer
surfaced through any current API. With ~65 000 historical check-ins, a one-time
enrichment run was the only way to recover this data before it disappeared entirely.
An early version of the enrichment script treated rate-limit responses and genuine
HTTP 403 responses identically — marking the row as a permanent skip
("-") — so quota-exhausted rows were silently discarded alongside genuinely
inaccessible ones, with no way to tell them apart after the fact.
The overlap data itself also needed cleaning: overlaps_name frequently duplicated names
already present in with_name / created_by_name. 408 rows were affected.
A --only-ids-file flag accepts a plain text file of one
checkin_id per line. Before building the work queue it resets any row in
that set whose overlaps_id was incorrectly finalised (back to ""),
ensuring the IDs are always re-processed regardless of prior run state.
Sleep was increased from 0.35 s to 1.5 s per call to stay safely below
the quota ceiling for the full 65 000-row run.
A cleanup pass scrubbed overlaps_name / overlaps_id of any
name already present in with_name or created_by_name, reducing
entries to "-" where nothing genuine remained. The surviving
38 genuine overlaps — people who happened to be at the same place at the
same time, entirely independently — were committed to the data repo and rendered on the
companions page.
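The scrub rule reduces to a small set operation (the list-based signature and the creator parameter are assumptions about the row shape):

```python
def scrub_overlaps(overlaps: list, with_names: list, creator: str) -> list:
    """Keep only genuine overlap names: drop anyone already known as a
    companion or as the check-in's creator. Return ["-"] (the 'no
    overlap' sentinel) when nothing genuine remains."""
    known = set(with_names) | {creator}
    genuine = [name for name in overlaps if name not in known]
    return genuine or ["-"]
```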
The photo index (photos.json) grows over time as new check-ins are added.
A naive run would re-probe every un-indexed check-in on every CI run — including the
~51 000 that were already confirmed to have no photos — wasting quota and time.
Some indexed photos resolved under the /item/{id} path rather than /checkin/{id}
— their IDs matched tip IDs, not check-in IDs. They had been silently mis-classified.
photos.html gallery renders 21 000+ images lazily in batches of 300,
with a country/city accordion filter (countries collapsed by default, cities as pill
buttons), a separate tip photos section with its own lightbox mode, and a hero count
that includes both check-in and tip photos with an anchor link to the tip section.
The multi-photo badge on index.html recent-check-in cards shows the first
photo plus a +N overlay when a check-in has multiple images; clicking
navigates through all photos for that check-in in the inline lightbox.
aws s3 sync uploads only new files (those not yet
in the bucket), making each incremental deploy fast regardless of total gallery size.
The --pix-url flag keeps the build fully decoupled: local builds use a
file:/// URI; the deployed site uses the R2 public URL — no code changes
needed between environments.
The /users/self/venueokays and
/users/self/venuedislikes return HTTP 402 on every
request, regardless of authentication or token scope. There is no error message
explaining why, no workaround in the documentation, and no indication this will
change. A naive implementation would treat 402 as a transient failure and retry
indefinitely, blocking the CI run.
A second problem: some liked venues lack the createdAt timestamp field entirely — the field the ratings page uses
to assign check-ins to years. Without it, those venues sort to the bottom with no
year grouping.
fetch_ratings.py was updated to treat 402 as a permanent
skip, not a retry-able error. The script logs a one-time warning and moves
on — no hang, no false failure. Only the likes endpoint (/users/self/venuelikes)
reliably returns data and is actively fetched.
Missing createdAt timestamps on likes are handled by a backfill pass: it reads
checkins.csv and, for each liked venue, finds the earliest
check-in timestamp at that venue. This is stored as first_ts
in venueRatings.json and used as a proxy creation date when
createdAt is absent. The ratings page year-grouping logic falls through
to first_ts automatically, so all likes are assigned to a year with no
manual intervention.
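The backfill logic, sketched under the assumption of simple venue_id/ts row dicts:

```python
def backfill_first_ts(likes: list, checkins: list) -> None:
    """For each liked venue missing createdAt, store the earliest
    check-in timestamp at that venue as first_ts (proxy creation date)."""
    earliest = {}
    for c in checkins:
        vid, ts = c["venue_id"], c["ts"]
        if vid not in earliest or ts < earliest[vid]:
            earliest[vid] = ts
    for like in likes:
        if not like.get("createdAt") and like["venue_id"] in earliest:
            like["first_ts"] = earliest[like["venue_id"]]
```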
The original feed page embedded a const ALL=[...] JSON blob directly
inside the HTML — roughly 11 MB of inline JavaScript. This created
three compounding problems:
The full-text search index (search-index.json) was generated at build time, committed to the repo
(~1.5 MB), and served as a static file. Any search therefore reflected data as of the
last deploy, not the current D1 state. The static file also bloated the repo and the
CDN response for every search query.
The rebuilt feed fetches rows from /api/feed into a single client-side array (ALL); position
computation (buildPos) maps each row to a pixel offset using a gap-row
strategy. The first 200 items render immediately on first API response; the remainder
are fetched in the background while the user scrolls, appended to ALL
without disrupting the viewport.
Month boundaries come from a lightweight ?ym=1 prefetch,
so all month anchors are clickable before the background pagination
completes. An isFiltered flag suppresses viewport re-renders while a
search is active, preventing the background fetch from resetting the user's current
view. The feed page went from an 11 MB HTML file that blocked paint to a
26 KB shell that renders the first screen in under one second.
The D1 sync step accepts explicit change flags (--tips-changed,
--ratings-changed, --lists-changed) that default to
false — so a manual local sync never accidentally rewrites unchanged tables.
CI passes the actual fetch-step output for each flag.
The build system is intentionally minimal. build.py is the single orchestrator:
it loads YAML/JSON config, calls transform.py to normalise city and country names,
calls metrics.py to compute all aggregations and trip detection, then renders
two Jinja-free template files using simple {{PLACEHOLDER}} substitution.
The generator scripts (gen_companions.py, gen_venues.py,
gen_worldcities.py) each embed their complete HTML template as a
base64-encoded string (_TMPL_B64). This makes each
generator fully self-contained — no external template files, no path resolution issues
regardless of working directory. To modify a page's design, you base64-decode the
string, edit the HTML/CSS/JS, re-encode.
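The pattern in miniature (the real _TMPL_B64 literals are full pages; {{TITLE}} here is a hypothetical placeholder):

```python
import base64

# In the real generators _TMPL_B64 is a long pre-encoded literal;
# it is built inline here only so the sketch is self-contained.
_TMPL_B64 = base64.b64encode(b"<h1>{{TITLE}}</h1>").decode("ascii")

def render(title: str) -> str:
    """Decode the embedded template and fill its placeholder."""
    html = base64.b64decode(_TMPL_B64).decode("utf-8")
    return html.replace("{{TITLE}}", title)
```

Because the template travels inside the script, the generator works from any working directory with no template-path resolution at all.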
gen_feed.py is an exception: after the migration to D1-backed pagination
(see Challenge 13), it no longer embeds any data. It simply renders the template shell
with the {{SWARM_USER_ID}} placeholder substituted — the actual check-in
data is fetched at runtime from /api/feed. Similarly, the full-text search
index (search-index.json) is no longer generated or committed; search is
served live by functions/api/search.js querying D1 directly.
Trip detection in metrics.py runs a single-pass scan over the sorted
check-in sequence: any consecutive run of check-ins where city ≠ home_city
and the run length exceeds min_checkins (configurable) is declared a trip.
Trip names are auto-generated from the most-visited countries and cities in that sequence.
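A minimal sketch of that single-pass scan (row shape and the at-least-min_checkins threshold semantics are assumptions):

```python
def detect_trips(checkins: list, home_city: str, min_checkins: int = 3) -> list:
    """Single pass over time-sorted check-ins: every maximal run of
    non-home check-ins of length >= min_checkins becomes a trip,
    returned as (start_idx, end_idx) inclusive."""
    trips, run_start = [], None
    for i, c in enumerate(checkins):
        away = c["city"] != home_city
        if away and run_start is None:
            run_start = i                      # run begins
        elif not away and run_start is not None:
            if i - run_start >= min_checkins:  # run ends; long enough?
                trips.append((run_start, i - 1))
            run_start = None
    if run_start is not None and len(checkins) - run_start >= min_checkins:
        trips.append((run_start, len(checkins) - 1))  # run reaches the end
    return trips
```

The detected windows are then refined by the 8-pass boundary widening and the JSON override files described earlier.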
Every choice optimises for zero operational overhead and long-term maintainability — no framework churn, no node_modules in the build pipeline, no containers to patch.
To build locally, run python scripts/build.py and serve the output from any static host.
Where an invariant spans two files (index.html.tmpl and gen_worldcities.py),
the constraint is explicitly documented in CLAUDE.md — so future
AI-assisted edits know to update both files together.