Measuring Walkability with OpenStreetMap and Isochrone Maps

October 20, 2022

19 min read

Walkability

Walkability is the quality of a neighborhood that allows pedestrians to accomplish everyday tasks like collecting groceries, visiting restaurants, and attending doctor's appointments, all on foot.

There are many ways to measure walkability, including intersection density and distance to urban amenities. The most notable walkability metric is Walk Score, which is owned by Redfin. Their system is proprietary, but ultimately their methodology1 is based on distance to amenities, similar to the approach I will explore today.

Isochrone maps

Isochrone maps visualize the area that can be reached from a starting point within a given time. Travel times appear as contours originating from the starting point, forming bands along which the travel time is equal.

An 1881 isochrone map by Francis Galton showing world travel time starting in London.

Isochrone maps may account for different modes of travel. For example, travel by car is constrained to roads, while travel by foot is constrained (roughly) to sidewalks.

Motivation

The main idea of this project is to examine how walkability is affected not just by density, but also by the structure of roads and sidewalks. For instance, compare these two pedestrian isochrone maps showing the destinations reachable in 15 minutes of walking:

A rural residential area in West Virginia. Copyright OpenStreetMap
Manhattan, New York City. Copyright OpenStreetMap

Even though there are clearly fewer amenities near this rural area, the network structure of its roads also prevents covering as much distance, not to mention the lack of sidewalks to facilitate safe pedestrian travel. The city block structure in Manhattan provides a 30% increase in the walkable area from the starting point, compared to the rural area.

Valhalla

Valhalla isochrone map example. Copyright OpenStreetMap

To construct isochrone maps, we'll be using a tool called Valhalla, an open source routing engine. It operates on top of OpenStreetMap data, which is crowdsourced by volunteers but generally quite thorough, especially in urban areas.

Downloading the data

First I downloaded the OpenStreetMap data from Geofabrik,2 which graciously hosts up-to-date extracts of OpenStreetMap for download.

Initially I used an extract specifically for Virginia, which was a modest 342 MB. Once this code was tested, I imported the entire United States extract, which is 8.5 GB.

Then I copied the data to data/data.osm.pbf.
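For reference, here is a minimal sketch of the download step in Python. The URL follows Geofabrik's extract naming scheme, but double-check the exact path on their download page:

import shutil

import requests

# Geofabrik extract URL (assumed; verify on the download page)
URL = "https://download.geofabrik.de/north-america/us-latest.osm.pbf"

# stream the multi-gigabyte file straight to disk instead of
# buffering it all in memory
with requests.get(URL, stream=True) as resp:
    resp.raise_for_status()
    with open("data/data.osm.pbf", "wb") as f:
        shutil.copyfileobj(resp.raw, f)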

Setting up Valhalla and PostGIS

I used docker compose to manage Valhalla and PostGIS so that I didn't have to install them locally.

For Valhalla, this docker compose file just exposes port 8002 for the API and adds volume mounts for the run script and the OSM data.

version: "3.9"
services:
  valhalla:
    image: valhalla/valhalla:run-latest
    ports:
      - "8002:8002"
    volumes:
      - ./scripts:/scripts
      - ./data:/data
    entrypoint: /scripts/build_tiles.sh

Combined with a simple bash script to build the Valhalla tiles and run the server:

#!/bin/bash

# ensure the tile directory exists (-p avoids an error if it already does)
mkdir -p /data/valhalla_tiles

# generate the Valhalla config on first run
if ! test -f "/data/valhalla.json"; then
    echo "Valhalla JSON not found. Creating config."
    valhalla_build_config --mjolnir-tile-dir /data/valhalla_tiles --mjolnir-tile-extract /data/valhalla_tiles.tar --mjolnir-timezone /data/valhalla_tiles/timezones.sqlite --mjolnir-admin /data/valhalla_tiles/admins.sqlite > /data/valhalla.json
fi

# build the timezone database and routing tiles from the OSM extract
if ! test -f "/data/valhalla_tiles/timezones.sqlite"; then
    echo "Valhalla tiles not found. Building now."
    valhalla_build_timezones > /data/valhalla_tiles/timezones.sqlite
    valhalla_build_tiles -c /data/valhalla.json /data/data.osm.pbf
fi

# package the tiles into a tar extract for faster loading
if ! test -f "/data/valhalla_tiles.tar"; then
    echo "Tile extract not found. Extracting now."
    valhalla_build_extract -c /data/valhalla.json -v
fi

echo "Starting valhalla server."
valhalla_service /data/valhalla.json 1

The full code can be found here.
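Once the tiles are built and the server is running, a quick sanity check from Python confirms it is reachable. I believe recent Valhalla versions expose a /status endpoint; an isochrone request like the one later in this post works just as well:

import requests

# quick sanity check that the Valhalla server is reachable;
# the /status endpoint is assumed to exist in this Valhalla version
resp = requests.get("http://localhost:8002/status")
print(resp.status_code, resp.json())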

Setting up PostGIS

PostGIS is a geospatial extension on top of PostgreSQL, which does some very cool optimizations to support efficient geospatial queries.

Similarly, I set up PostGIS using my osm2pgsql-docker repo and imported the US dataset with osm2pgsql, a tool for ingesting the protobuf-formatted OSM data into PostGIS.

version: "3.9"  # optional since v1.27.0
services:
  postgis:
    container_name: postgis
    image: postgis/postgis
    ports:
      - "15432:5432"
    environment:
      - POSTGRES_PASSWORD=password
    volumes:
      - type: bind
        source: ./postgresql.conf
        target: /etc/postgresql.conf
    command: ["postgres", "-c", "config_file=/etc/postgresql.conf"]

After PostGIS was running, I imported the data. This was initially tested on the Virginia state osm.pbf data, but I later updated it to use the entire US dataset. After making this change, the import kept running out of memory, so I added the --slim --drop flags, which store intermediate data in PostgreSQL (on my SSD) rather than keeping everything in memory. Although I was running this on a 16 GB machine, I did not have enough free memory to process the whole file.

PGPASSWORD=password osm2pgsql \
  --create \
  --verbose \
  -P 15432 -U postgres -H localhost \
  -S osm2pgsql.style \
  --slim --drop \
  /home/wcedmisten/Downloads/us-latest.osm.pbf

I also used a simplified version of osm2pgsql.style to reduce the number of features imported into the database.

These three lines were retained; most other tags were marked for deletion rather than imported.

node,way   amenity      text         polygon
node,way   shop         text         polygon
node,way   leisure      text         polygon

I intentionally limited my analysis to amenity, shop, and leisure tags.

Due to using my SSD instead of memory for the import, the whole process took around 6 hours. I just let it run overnight, so this wasn't a huge issue.

Even with this reduced set of tags, the database still took up 25 GB of storage, much more than could fit in memory.
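After the import finished, a quick query can confirm the tagged points made it in. This is a sketch using the same psycopg2 connection settings as the scripts below:

import psycopg2

conn = psycopg2.connect(host="localhost", port="15432", dbname="postgres",
                        user="postgres", password="password")
cur = conn.cursor()

# count the points that carry any of the three retained tags
cur.execute("""
    SELECT count(*) FROM planet_osm_point
    WHERE amenity IS NOT NULL OR shop IS NOT NULL OR leisure IS NOT NULL
""")
print(cur.fetchone()[0])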

Writing the Code

Now that the backends were set up, I could begin writing the code to measure walkability. My code needed to do several things for each city:

  • Sample some points randomly inside a city
  • Find the isochrone map for each of these points within a 15 minute walk
  • Find the amenities within this isochrone boundary (e.g. restaurants, parks, grocery stores, etc.)

Getting the City Boundary Data

To find the boundaries of US cities, I downloaded a few different datasets:

  • GeoJSON files with the boundary of each major US city. These were based on the US Census TIGER dataset, which someone converted to GeoJSON
  • a dataset of the top 1,000 most populous US cities. I examined the top 100 cities in this dataset to compare walkability.

I then created a PostGIS table, city, to store the GeoJSON boundaries for these 100 cities using a Python script:

upload_cities.py:

import psycopg2
import json

from util import get_cities

conn = psycopg2.connect(host="localhost", port="15432", dbname="postgres", user="postgres", password="password")
cur = conn.cursor()

create_city_table = """CREATE TABLE IF NOT EXISTS city (
    name text,
    state text,
    geom geometry,
    PRIMARY KEY (name, state)
);
"""

cur.execute(create_city_table)
conn.commit()


def import_city(city, state):
    with open(f"geojson-us-city-boundaries/cities/{state}/{city}.json") as f:
        data = json.load(f)

    geom = data['features'][0]['geometry']

    # convert the GeoJSON geometry with ST_GeomFromGeoJSON;
    # geometry columns don't accept raw GeoJSON text
    cur.execute(
        "INSERT INTO city (name, state, geom) VALUES (%s, %s, ST_GeomFromGeoJSON(%s))",
        (city, state, json.dumps(geom)),
    )
    conn.commit()


# get_cities() munges the city population JSON data to get the top 100
for city, state in get_cities():
    import_city(city, state)

This allowed me to query whether a point is inside a city with the following SQL (the point is built with SRID 4326 to match the geometry imported from GeoJSON):

SELECT ST_Contains(geom, ST_GeomFromText('POINT(%s %s)', 4326))
    FROM ( SELECT geom FROM city WHERE name=%s AND state=%s ) as geom

Sampling

I wanted to generate an even distribution of points across a city and sample "walkability" at each of these points. Then all samples could be averaged.

To accomplish this, I calculated the bounding box for each city and generated a Poisson disk sampling3 of points within it. This distribution keeps neighboring samples from clustering, so their walksheds overlap as little as possible.

This bounding-box distribution for New York City looked like this:

sampling points for bounding box of NYC - map data copyright openstreetmap

After removing the points outside of the city's actual boundaries, the sample looked like this:

sampling points for NYC

Zoomed in further:

sampling points for NYC - zoomed in

Poisson disk distributions guarantee a minimum separation distance between all points. I picked a distance somewhat arbitrarily, so that neighboring samples would have some overlap of walkable amenities, but not too much.

The code to do this was also fairly simple, relying on the poisson_disc library4 to calculate the sample points and on PostGIS to determine whether each point was in the city.

import numpy as np
import poisson_disc
import shapely.wkt

get_bbox_query = """SELECT ST_AsText(ST_Envelope( geom ))
FROM ( SELECT geom FROM city WHERE name=%s AND state=%s ) as geom
"""

cur.execute(get_bbox_query, [city, state])
bbox = cur.fetchone()

minx, miny, maxx, maxy = shapely.wkt.loads(bbox[0]).bounds
height = (maxy - miny)
width = (maxx - minx)

# generate Poisson disk samples for the bounding box dimensions
points = poisson_disc.Bridson_sampling(dims=np.array([width, height]), radius=.005)

is_in_query = """
SELECT ST_Contains(geom, ST_GeomFromText('POINT(%s %s)', 4326))
FROM ( SELECT geom FROM city WHERE name=%s AND state=%s ) as geom
"""

# shift the samples from the origin to the city's bounding box
scaled = list(map(lambda p: [p[0] + minx, p[1] + miny], points))

def point_in_city(point):
    cur.execute(is_in_query, [point[0], point[1], city, state])
    return cur.fetchone()[0]

# keep only the points inside the city boundary
in_city = list(filter(point_in_city, scaled))

Creating a metric

Now that I had a sampling for the city, I needed a metric to calculate for each point.

I used a relatively simple approach: I examined a few categories of urban amenities and gave 1 point for each unique category present within a 15 minute walking distance. The walking distance polygon was determined by querying Valhalla for each sampled point.

For instance, a point with 3 restaurants, 2 bakeries, and 1 shoe store within walking distance would get a score of 3: 1 point for each category. I reasoned that most amenities are essentially binary: either the amenity you want is within walking distance or it's not. For future work, I would probably change this to a logarithmic or logistic function to reflect diminishing returns on amenities of the same category (see the sketch below).

The first bakery within walking distance provides much more value than the second one, which provides more value than the third, etc.
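As a rough illustration of that diminishing-returns idea (not part of the scoring used here), a logarithmic variant could look like this, where counts is a hypothetical mapping from category to the number of amenities found:

import math

def diminishing_score(counts):
    # log2(1 + n) awards 1 point for the first amenity in a category and
    # progressively smaller increments for each additional one
    return sum(math.log2(1 + n) for n in counts.values())

# 3 restaurants, 2 bakeries, 1 shoe store:
# 2.0 + 1.58 + 1.0 = 4.58, versus 3.0 under the binary scheme
print(diminishing_score({"restaurant": 3, "bakery": 2, "shoes": 1}))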

It was also simpler to examine these categories at a binary level because the OSM data model has many places where a single entity is represented as multiple nodes (e.g. a building polygon), and this approach avoided that complexity.

15 minutes was chosen somewhat arbitrarily, but it is about the maximum I would consider a "quick walk" to a destination. In the future, this could be explored using variable distances per amenity.

I created a function get_amenities(lat, lon) to retrieve the amenities within walking distance of a point.

import json

import requests

MAX_WALK_TIME = 15  # minutes

def get_amenities(lat, lon):
    # ask Valhalla for the polygon reachable in a 15 minute walk
    payload = {
        "locations": [
            {"lat": lat, "lon": lon}
        ],
        "costing": "pedestrian",
        "denoise": "0.11",
        "generalize": "0",
        "contours": [{"time": MAX_WALK_TIME}],
        "polygons": True
    }

    request = f"http://localhost:8002/isochrone?json={json.dumps(payload)}"
    isochrone = requests.get(request).json()

    geom = json.dumps(isochrone['features'][0]['geometry'])

    # planet_osm_point stores geometries in EPSG:3857, so the WGS84
    # isochrone polygon is transformed before the containment test
    query = """
    SELECT * FROM planet_osm_point
    WHERE ST_Contains(ST_Transform(ST_GeomFromGeoJSON(%s), 3857), way)
      AND (amenity IS NOT NULL OR shop IS NOT NULL OR leisure IS NOT NULL);
    """

    cur.execute(query, [geom])
    rows = cur.fetchall()

    # cur.description still holds the column metadata from the query above
    colnames = [desc[0] for desc in cur.description]

    # to_dict is a small helper (defined elsewhere) that zips
    # the column names with each row
    return to_dict(colnames, rows)

This function:

  1. makes a request to the Valhalla isochrone service to get the polygon representing a 15 minute walk around the point
  2. uses that polygon to find all tagged amenities within those bounds

The amenities considered here are represented by the following OSM tags, with their definitions from the OSM wiki.

  • amenity:cafe - a generally informal place with sit-down facilities selling beverages and light meals and/or snacks
  • amenity:bar - a purpose-built commercial establishment that sells alcoholic drinks to be consumed on the premises
  • amenity:pub - an establishment that sells alcoholic drinks that can be consumed on the premises
  • amenity:restaurant - formal eating places with sit-down facilities selling full meals served by waiters
  • shop:bakery - a shop selling bread
  • shop:convenience - a small local shop carrying a variety of everyday products, mostly including single-serving food items
  • shop:supermarket - a large shop for groceries and other goods, including meat and fresh produce
  • shop:department_store - a large store with multiple clothing and other general merchandise departments
  • shop:clothes - a shop which primarily sells clothing, such as underwear, shirts, jeans etc
  • shop:shoes - a shop that specialises in selling shoes
  • shop:books - A shop selling books (including antique books, chain bookstores, etc.)
  • shop:hairdresser - where you can get your hair cut
  • leisure:fitness_centre - a place with exercise machines and/or fitness/dance classes
  • leisure:park - an area of open space for recreational use, usually designed and in semi-natural state with grassy areas, trees and bushes
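With the rows returned by get_amenities, the score described above is just the number of unique categories present. Here is a minimal sketch of that step (amenity_score and AMENITY_TAGS are my names for illustration, not from the original code):

AMENITY_TAGS = {
    ("amenity", "cafe"), ("amenity", "bar"), ("amenity", "pub"),
    ("amenity", "restaurant"), ("shop", "bakery"), ("shop", "convenience"),
    ("shop", "supermarket"), ("shop", "department_store"), ("shop", "clothes"),
    ("shop", "shoes"), ("shop", "books"), ("shop", "hairdresser"),
    ("leisure", "fitness_centre"), ("leisure", "park"),
}

def amenity_score(rows):
    # 1 point per unique (tag, value) category found within walking distance
    categories = set()
    for row in rows:
        for tag in ("amenity", "shop", "leisure"):
            if (tag, row.get(tag)) in AMENITY_TAGS:
                categories.add((tag, row.get(tag)))
    return len(categories)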

Results

Top 10 most walkable cities

| Rank | City | Amenity Score |
|------|------|---------------|
| 1 | washington-dc dc | 5.54 |
| 2 | minneapolis mn | 4.31 |
| 3 | chicago il | 4.2 |
| 4 | new-york ny | 4.07 |
| 5 | seattle wa | 4.01 |
| 6 | cleveland oh | 3.84 |
| 7 | denver co | 3.37 |
| 8 | jersey-city nj | 3.31 |
| 9 | st-louis mo | 3.14 |
| 10 | pittsburgh pa | 3.07 |

The 10 least walkable cities:

| Rank | City | Amenity Score |
|------|------|---------------|
| 91 | louisville-jefferson ky | 0.39 |
| 92 | tucson az | 0.38 |
| 93 | north-las-vegas nv | 0.36 |
| 94 | jacksonville fl | 0.35 |
| 95 | lubbock tx | 0.32 |
| 96 | oklahoma-city ok | 0.31 |
| 97 | chesapeake va | 0.27 |
| 98 | bakersfield ca | 0.22 |
| 99 | corpus-christi tx | 0.17 |
| 100 | anchorage ak | 0.05 |

I also compared these rankings to Walk Score and found that mine were mostly in the same ballpark.

walk score comparison

This plot shows the difference between the Walk Score ranking and my ranking. For example, I ranked San Francisco #26, but Walk Score ranked it #1, so the first entry shows a bar at -25. The x-axis is ordered by the Walk Score ranking.5

Limitations

Some significant limitations:

  • I didn't filter out samples that aren't on land
  • I didn't filter for residential areas (although arguably this could be intentional, to count tourism and travel)
  • OpenStreetMap data is crowdsourced from volunteers, so amenity data may not be evenly distributed (probably favoring higher population cities)
  • City boundaries don't capture metropolitan sprawl, which likely biases results toward larger metropolitan areas

Performance

I profiled a run on a single city with flameprof6 and found that the majority of the time is split between querying Valhalla and querying PostGIS, which makes sense to me. Both Valhalla and PostGIS are most likely heavily optimized already, so the only way I could shave down runtime would be to sample fewer points.

flame graph
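For reference, here is roughly how such a profile can be produced with cProfile and flameprof (process_city is a hypothetical stand-in for the per-city driver function):

import cProfile

# profile one city's run and dump the stats to a file;
# process_city is a hypothetical name, not from the original code
cProfile.run("process_city('richmond', 'va')", "walkability.prof")

# render a flame graph SVG from the stats file with:
#   flameprof walkability.prof > flamegraph.svg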

Future Work

Custom walkability scores

Explore customized walkability scores. "Walkability" based on all amenities is painting with a broad brush; most people probably don't care about every single amenity category. Peter Johnson has an interesting article7 about reverse engineering Walk Score, which found that it's highly correlated with the number of nearby restaurants.

Personally, I love restaurants, but I cook most of my meals at home, so this score may be less useful to me. I like the idea of assigning weights to each amenity to get a more customized score. You could even optimize for something like "most walkable Chinese restaurants" or "most parks".

We could also codify other walkability factors into the isochrone calculations. For example, people with disabilities may have an especially hard time with poor sidewalk infrastructure. I don't think Valhalla supports custom sidewalk cost penalties related to width, sidewalk condition, or curb cuts, but I would be interested in exploring that more specifically.

One of the coolest things about openstreetmap is the level of detail available in the data. I don't know of anywhere else where curb cut data is available.

Amenity centered isochrones

Another idea: instead of sampling many points, which requires a lot of isochrone calculations, calculate isochrones starting from each amenity. Then we could merge all the isochrone polygons and calculate what portion of the city is within an X minute walk of that amenity type. This could also be useful for finding amenity "deserts", like food deserts, but more general.

This would also allow me to explore varying the walking distance for each amenity category. E.g. I would be more willing to walk further for infrequent and important errands, like electronics repair or a hospital visit. But for more common tasks like getting groceries, I would prefer a closer walk.
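A minimal sketch of that coverage calculation with shapely, assuming isochrones is a list of GeoJSON geometries (one per amenity of a given category) and city_boundary is a shapely polygon for the city:

from shapely.geometry import shape
from shapely.ops import unary_union

def coverage_fraction(isochrones, city_boundary):
    # merge the individual walkshed polygons into one, then measure
    # how much of the city it covers
    covered = unary_union([shape(geom) for geom in isochrones])
    return covered.intersection(city_boundary).area / city_boundary.area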

Appendix

Source Code:

https://github.com/wcedmisten/city-walkability

https://github.com/wcedmisten/osm2pgsql-docker

Final Rankings

| Rank | City | Amenity Score |
|------|------|---------------|
| 1 | washington-dc dc | 5.54 |
| 2 | minneapolis mn | 4.31 |
| 3 | chicago il | 4.2 |
| 4 | new-york ny | 4.07 |
| 5 | seattle wa | 4.01 |
| 6 | cleveland oh | 3.84 |
| 7 | denver co | 3.37 |
| 8 | jersey-city nj | 3.31 |
| 9 | st-louis mo | 3.14 |
| 10 | pittsburgh pa | 3.07 |
| 11 | philadelphia pa | 3.0 |
| 12 | portland or | 2.94 |
| 13 | st-paul mn | 2.87 |
| 14 | san-jose ca | 2.8 |
| 15 | buffalo ny | 2.67 |
| 16 | baltimore md | 2.63 |
| 17 | detroit mi | 2.57 |
| 18 | boston ma | 2.55 |
| 19 | oakland ca | 2.54 |
| 20 | miami fl | 2.44 |
| 21 | sacramento ca | 2.4 |
| 22 | omaha ne | 2.33 |
| 23 | chandler az | 2.26 |
| 24 | long-beach ca | 2.11 |
| 25 | santa-ana ca | 2.04 |
| 26 | madison wi | 2.01 |
| 27 | san-francisco ca | 2.0 |
| 28 | milwaukee wi | 1.91 |
| 29 | los-angeles ca | 1.9 |
| 30 | richmond va | 1.89 |
| 31 | cincinnati oh | 1.89 |
| 32 | san-diego ca | 1.83 |
| 33 | chula-vista ca | 1.8 |
| 34 | atlanta ga | 1.8 |
| 35 | gilbert az | 1.76 |
| 36 | honolulu hi | 1.76 |
| 37 | anaheim ca | 1.76 |
| 38 | mesa az | 1.74 |
| 39 | raleigh nc | 1.64 |
| 40 | newark nj | 1.64 |
| 41 | lincoln ne | 1.62 |
| 42 | albuquerque nm | 1.51 |
| 43 | las-vegas nv | 1.49 |
| 44 | austin tx | 1.43 |
| 45 | columbus oh | 1.4 |
| 46 | plano tx | 1.34 |
| 47 | colorado-springs co | 1.34 |
| 48 | aurora co | 1.33 |
| 49 | st.-petersburg fl | 1.3 |
| 50 | greensboro nc | 1.29 |
| 51 | riverside ca | 1.28 |
| 52 | houston tx | 1.25 |
| 53 | boise-city id | 1.24 |
| 54 | tampa fl | 1.24 |
| 55 | glendale az | 1.19 |
| 56 | fremont ca | 1.1 |
| 57 | toledo oh | 1.07 |
| 58 | irvine ca | 1.04 |
| 59 | stockton ca | 1.03 |
| 60 | orlando fl | 1.01 |
| 61 | charlotte nc | 1.0 |
| 62 | hialeah fl | 0.99 |
| 63 | indianapolis-city in | 0.99 |
| 64 | durham nc | 0.96 |
| 65 | norfolk va | 0.96 |
| 66 | phoenix az | 0.9 |
| 67 | baton-rouge la | 0.88 |
| 68 | dallas tx | 0.84 |
| 69 | reno nv | 0.83 |
| 70 | fort-wayne in | 0.82 |
| 71 | arlington tx | 0.8 |
| 72 | wichita ks | 0.77 |
| 73 | garland tx | 0.75 |
| 74 | irving tx | 0.75 |
| 75 | winston-salem nc | 0.71 |
| 76 | tulsa ok | 0.69 |
| 77 | fresno ca | 0.67 |
| 78 | kansas-city mo | 0.67 |
| 79 | new-orleans la | 0.66 |
| 80 | fort-worth tx | 0.65 |
| 81 | san-antonio tx | 0.58 |
| 82 | henderson nv | 0.58 |
| 83 | san-bernardino ca | 0.56 |
| 84 | nashville-davidson-metropolitan-government tn | 0.55 |
| 85 | memphis tn | 0.54 |
| 86 | scottsdale az | 0.53 |
| 87 | lexington-fayette ky | 0.51 |
| 88 | el-paso tx | 0.48 |
| 89 | laredo tx | 0.44 |
| 90 | virginia-beach va | 0.4 |
| 91 | louisville-jefferson ky | 0.39 |
| 92 | tucson az | 0.38 |
| 93 | north-las-vegas nv | 0.36 |
| 94 | jacksonville fl | 0.35 |
| 95 | lubbock tx | 0.32 |
| 96 | oklahoma-city ok | 0.31 |
| 97 | chesapeake va | 0.27 |
| 98 | bakersfield ca | 0.22 |
| 99 | corpus-christi tx | 0.17 |
| 100 | anchorage ak | 0.05 |

Footnotes

  1. https://www.walkscore.com/methodology.shtml

  2. https://www.geofabrik.de/data/download.html

  3. Fast Poisson Disk Sampling in Arbitrary Dimensions (PDF)

  4. https://pypi.org/project/poisson-disc

  5. https://www.walkscore.com/cities-and-neighborhoods

  6. https://pypi.org/project/flameprof

  7. Reverse Engineering the Walk Score Algorithm