City Atlas
The world's largest cities, embedded by name ("City, Country")
and projected to 2D — shown side by side with their real latitude/longitude. The
question: how much of the actual world map do name-embeddings alone recover? Switch
the embedding model and the projection to compare.
Projection. Two layouts of the same vectors. PCA (linear) preserves the directions of greatest global variance, which makes it the honest "map": it Procrustes-aligns to the real world far better (lower map-match disparity below). UMAP optimises local neighbourhoods and deliberately distorts global geometry, so it draws crisper clusters but a worse map; a prettier UMAP is not a better embedding. Each layout is Procrustes/Umeyama-aligned to real lng/lat (rotation + reflection + scale) so that models and projections can be compared by eye.
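The projection-and-alignment step can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline's own code: the helper names `pca_2d` and `map_match_disparity` are hypothetical, and `scipy.spatial.procrustes` stands in for whatever Procrustes/Umeyama routine the pipeline actually uses. That SciPy function centres both point sets, scales them to unit norm, and solves for the best rotation or reflection, so its disparity is 0 for a perfect match and approaches 1 for unrelated layouts.

```python
import numpy as np
from scipy.spatial import procrustes


def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def map_match_disparity(layout_2d, lnglat):
    """Align a 2D layout to real coordinates; return (aligned layout, disparity).

    The aligned layout lives in SciPy's standardised frame (centred, unit
    norm), not in degrees.
    """
    _, aligned, disparity = procrustes(lnglat, layout_2d)
    return aligned, disparity


# Synthetic check (assumption: made-up coordinates, no real cities).
rng = np.random.default_rng(0)
lnglat = np.column_stack([rng.uniform(-180, 180, 50), rng.uniform(-60, 70, 50)])

# Hide the map inside 16 dimensions: rotate, scale, embed isometrically.
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
basis, _ = np.linalg.qr(rng.normal(size=(16, 2)))  # 16x2, orthonormal columns
hidden_map = (3.0 * lnglat @ rot.T) @ basis.T

aligned, disparity = map_match_disparity(pca_2d(hidden_map), lnglat)

# Unstructured vectors for contrast: PCA finds no map in them.
noise = rng.normal(size=(50, 16))
_, noise_disparity = map_match_disparity(pca_2d(noise), lnglat)

print(f"hidden map disparity: {disparity:.2e}")
print(f"random noise disparity: {noise_disparity:.2f}")
```

Because the disparity is computed after standardisation, it is comparable across models whose embeddings differ in dimension and scale.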
Scores measure geo-recovery on the raw embeddings, so they do not depend on the projection. Global ρ is the Spearman correlation between embedding cosine distance and great-circle distance over random city pairs; because random pairs are mostly cross-continent, it scores global structure and understates local fidelity. Within-continent ρ and kNN-continent (the share of a city's 10 nearest name-neighbours that lie on the same continent) capture local recovery. Separation (continent silhouette) is a tell: high separation makes UMAP draw isolated blobs that look less map-like, so the picture can reward smoothness while the metrics reward correlation, and the two can disagree.
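Two of these scores can be sketched in a few lines. The function names, the pair count, and the toy data below are assumptions for illustration only; the real pipeline may sample pairs or break ties differently. The toy "embedding" is simply each city's unit vector on the globe, a case where cosine distance is an exact monotone function of great-circle distance.

```python
import numpy as np
from scipy.stats import spearmanr

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lng1, lat2, lng2):
    """Haversine distance in kilometres; inputs in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def unit_rows(emb):
    """Scale each row to unit length so dot products are cosine similarities."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def global_rho(emb, lat, lng, n_pairs=2000, seed=0):
    """Spearman correlation of cosine distance vs great-circle distance."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(emb), n_pairs)
    j = rng.integers(0, len(emb), n_pairs)
    keep = i != j  # drop self-pairs
    i, j = i[keep], j[keep]
    unit = unit_rows(emb)
    cos_dist = 1.0 - np.sum(unit[i] * unit[j], axis=1)
    rho, _ = spearmanr(cos_dist, great_circle_km(lat[i], lng[i], lat[j], lng[j]))
    return rho


def knn_continent(emb, continents, k=10):
    """Mean share of each city's k nearest neighbours on its own continent."""
    unit = unit_rows(emb)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a city is not its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean(continents[nearest] == continents[:, None]))


# Toy data (assumption): two tight clusters of 30 made-up cities each.
rng = np.random.default_rng(1)
lat = np.concatenate([50 + rng.uniform(-5, 5, 30), -25 + rng.uniform(-5, 5, 30)])
lng = np.concatenate([10 + rng.uniform(-5, 5, 30), 135 + rng.uniform(-5, 5, 30)])
continents = np.array(["EU"] * 30 + ["OC"] * 30)

# Idealised embedding: the 3D position of each city on the unit sphere.
phi, lam = np.radians(lat), np.radians(lng)
emb = np.column_stack([np.cos(phi) * np.cos(lam),
                       np.cos(phi) * np.sin(lam),
                       np.sin(phi)])

rho = global_rho(emb, lat, lng)
knn_share = knn_continent(emb, continents)
print(f"global rho = {rho:.3f}, kNN-continent = {knn_share:.2f}")
```

Restricting the sampled pairs to cities that share a continent label would give a within-continent ρ along the same lines.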
Setup. Each model runs in its recommended clustering configuration: document-side with no instruction (we never add a geographic instruction, since that would leak the very thing being tested), except EmbeddingGemma, which uses its clustering prompt via the divepool API. Because the inputs are "City, Country", the country token does much of the geographic anchoring; the bare-name variants (country dropped) show how much.
Data: GeoNames cities15000. Embed + projection pipeline: scripts/city_atlas/. API: divepool.