City Atlas

The world's largest cities, embedded by name ("City, Country") and projected to 2D — shown side by side with their real latitude/longitude. The question: how much of the actual world map do name-embeddings alone recover? Switch the embedding model and the projection to compare.

Projection. Two layouts of the same vectors. PCA (linear) preserves the directions of greatest global variance, which makes it the honest "map": it Procrustes-aligns to the real world far better (lower map-match disparity, shown below). UMAP optimises local neighbourhoods and deliberately distorts global geometry, so it draws crisper clusters but a worse map; a prettier UMAP is not a better embedding. Each layout is Procrustes/Umeyama-aligned to real lng/lat (rotation + reflection + scale), so models and projections can be compared by eye.
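To make the alignment step concrete, here is a minimal sketch of a 2D Procrustes fit, assuming the layout has already been projected to 2D. It is not the pipeline's code: the function name procrustes_2d and the complex-number formulation are illustrative choices, and the disparity here is the residual sum of squares divided by the target's centred sum of squares, which may differ from the exact normalisation the page reports.

```python
def procrustes_2d(layout, target):
    """Fit 2D `layout` onto 2D `target` (lists of (x, y) pairs) using
    translation, rotation, optional reflection and uniform scale.
    Returns (aligned_points, disparity), disparity in [0, 1].
    Hypothetical helper; assumes neither point set is a single repeated point."""
    n = len(layout)
    # A 2D point as a complex number: rotation + scale is one complex multiply.
    x = [complex(px, py) for px, py in layout]
    y = [complex(qx, qy) for qx, qy in target]
    mx, my = sum(x) / n, sum(y) / n
    x = [p - mx for p in x]          # centre both sets (handles translation)
    y = [q - my for q in y]
    yy = sum(abs(q) ** 2 for q in y)

    best = None
    for reflect in (False, True):
        # Complex conjugation mirrors the layout across the x-axis.
        xs = [p.conjugate() for p in x] if reflect else x
        xx = sum(abs(p) ** 2 for p in xs)
        # Least-squares factor a minimising sum |a * xs_i - y_i|^2.
        a = sum(p.conjugate() * q for p, q in zip(xs, y)) / xx
        fitted = [a * p for p in xs]
        disparity = sum(abs(f - q) ** 2 for f, q in zip(fitted, y)) / yy
        if best is None or disparity < best[1]:
            aligned = [((f + my).real, (f + my).imag) for f in fitted]
            best = (aligned, disparity)
    return best


# Usage: a toy layout fitted against real (lng, lat) pairs.
real = [(-74.0, 40.7), (2.35, 48.85), (139.7, 35.7), (151.2, -33.9)]
toy_layout = [(0.1, 0.9), (0.4, 1.0), (1.0, 0.8), (1.1, 0.1)]
aligned, disparity = procrustes_2d(toy_layout, real)
print(f"map-match disparity: {disparity:.3f}")
```

Trying both the mirrored and unmirrored layout and keeping the better fit is what lets a projection with arbitrary orientation and handedness still be compared against the map.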

Scores measure geo-recovery on the raw embeddings, so they are independent of the projection. Global ρ is the Spearman correlation between embedding cosine distance and great-circle distance over random city pairs. Random pairs are mostly cross-continent, so it scores global structure and understates local fidelity. Within-continent ρ and kNN-continent (the share of a city's 10 nearest name-neighbours that lie on the same continent) capture local recovery. Separation (continent silhouette) is a tell: high separation makes UMAP draw isolated blobs that look less map-like. The picture can therefore reward smoothness while the metrics reward correlation, and the two can disagree.
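The two scores described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in scripts/city_atlas/: the function names, the 10,000-pair default, the haversine formula with a 6371 km Earth radius, and averaging the kNN share over all cities are choices made for the example.

```python
import math
import random

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance in km on a spherical Earth (assumed radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlng = math.radians(lng2 - lng1)
    h = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlng / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman rho = Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

def global_rho(embeddings, latlng, n_pairs=10000, seed=0):
    """Spearman of cosine distance vs great-circle distance over random pairs."""
    rng = random.Random(seed)
    emb_d, geo_d = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(embeddings)), 2)  # two distinct cities
        emb_d.append(cosine_distance(embeddings[i], embeddings[j]))
        geo_d.append(haversine_km(*latlng[i], *latlng[j]))
    return spearman(emb_d, geo_d)

def knn_continent_share(embeddings, continents, k=10):
    """Mean share of each city's k nearest name-neighbours on its continent.
    Brute force, O(n^2 log n): fine for a sketch, slow for the full dataset."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine_distance(embeddings[i], embeddings[j]))
        nearest = others[:k]
        total += sum(continents[j] == continents[i] for j in nearest) / len(nearest)
    return total / n
```

Within-continent ρ would follow the same pattern as global_rho, with the pair sampling restricted to cities that share a continent label.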

Setup. Each model runs in its recommended clustering configuration: document-side with no instruction (we never add a geographic instruction, since that would leak the very thing being tested). The exception is EmbeddingGemma, which uses its clustering prompt via the divepool API. Because every input is "City, Country", the country token does much of the geographic anchoring; the bare-name variants (which drop the country) show how much.

Data: GeoNames cities15000. Embed + projection pipeline: scripts/city_atlas/. API: divepool.


Panels: Name-embedding space | Real geography (lng / lat)