Tuesday, October 25, 2011

Mapping Hotspots with R: The GAM

I've been getting a lot of questions about the method used to map the hotspots in the seasonal drunk-driving risk maps.  It uses the GAM (Geographical Analysis Machine), a way of detecting spatial clusters from two data inputs: the data of interest, and a control, or "underlying population at risk" (or at least your best substitute for that).

These four distinct hotspot maps were made in R (using a shorter radial distance than previously posted).  They indicate areas where drunk-driving fatalities are much higher than normal in winter, spring, summer, or autumn.

Four individual GAM hotspot maps made in R, each with a baseline mesh of 10,000 points and radii of 14 miles and 49 miles.

The Geographical Analysis Machine was whipped up by Stan Openshaw and his team in the late 1980s as a way of calculating relative geographic clusters or hotspots.  It requires a point dataset of interest (the known events) and a background point dataset representing candidates for those events (some examples at the bottom of the post).

The mesh backdrop
The study area is canvassed with a mesh of backdrop points.  These are the seeds from which your hotspot kernels may or may not grow (depending on what you consider significant).  A fine mesh will result in higher-resolution output, with cluster zones of greater precision, but it also takes longer to process.

Here's my study area in R with a mesh backdrop of 10,000 points.  The finer the mesh, the greater the resulting resolution will be, and the greater the amount of coverage overlap, depending on what you choose as a meaningful radial distance.
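
For a rough idea of what laying down that backdrop involves, here is a minimal sketch.  It is written in Python rather than the R used for the maps above, and the 300-by-300-mile bounding box and the make_mesh helper are made up for illustration; only the 10,000-point count comes from this post.

```python
def make_mesh(xmin, ymin, xmax, ymax, n_per_side):
    """Return a regular n_per_side x n_per_side grid of (x, y) mesh points
    covering the bounding box, corners included."""
    step_x = (xmax - xmin) / (n_per_side - 1)
    step_y = (ymax - ymin) / (n_per_side - 1)
    xs = [xmin + step_x * i for i in range(n_per_side)]
    ys = [ymin + step_y * j for j in range(n_per_side)]
    return [(x, y) for y in ys for x in xs]


# Hypothetical study area: a 300 x 300 mile box (not the real extent).
mesh = make_mesh(0.0, 0.0, 300.0, 300.0, 100)  # 100 x 100 = 10,000 points
spacing = 300.0 / (100 - 1)                    # roughly 3 miles between points
print(len(mesh))  # 10000
```

With mesh points about 3 miles apart and a 14-mile radius, neighboring circles overlap heavily, which is the coverage overlap mentioned above.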

Radial distance
From each point of the mesh backdrop a radial distance is swiped out.  The events and candidates falling inside that circle are counted up, and if the number of events is significantly (how significant is up to you) beyond what a Poisson distribution would expect given the candidates, then that circle is retained; it's nuked if not.  The significant circles can be merged together for a discrete vector output of hotspots, or they can feed a kernel heatmap, which results in a bitmap illustration of magnitude varying with distance (like in the above maps).
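
The test for a single circle can be sketched like this.  It is Python, not the R code behind the maps, and it leans on assumptions this post doesn't spell out: the expected event count in a circle is taken to be the overall events-per-candidate rate times the number of candidates inside the circle, the alpha=0.01 cutoff is arbitrary, and all of the function names are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the CDF up to k - 1."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def count_within(points, center, radius):
    """Count the points lying inside (or on) the circle."""
    cx, cy = center
    r2 = radius * radius
    return sum(1 for (x, y) in points if (x - cx) ** 2 + (y - cy) ** 2 <= r2)


def circle_is_hotspot(center, radius, events, candidates, alpha=0.01):
    """Retain the circle if its event count is improbably high under a
    Poisson model (assumption: expected = overall rate * local candidates)."""
    n_candidates = count_within(candidates, center, radius)
    if n_candidates == 0:
        return False           # nobody at risk here, nothing to test
    overall_rate = len(events) / len(candidates)
    expected = overall_rate * n_candidates
    n_events = count_within(events, center, radius)
    return poisson_upper_tail(n_events, expected) < alpha


# Toy data (made up): candidates on a 10 x 10 grid, 8 of 10 events in one corner.
candidates = [(float(x), float(y)) for x in range(10) for y in range(10)]
events = [(0.5, 0.5)] * 8 + [(5.0, 5.0), (9.0, 2.0)]
corner = circle_is_hotspot((0.5, 0.5), 1.0, events, candidates)
middle = circle_is_hotspot((5.0, 5.0), 1.0, events, candidates)
```

In the toy data the corner circle holds 8 events where about 0.4 would be expected, so it is retained, while the circle in the middle holds a single event and is dropped.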

Events are mapped along with 'candidates' in this illustration.

Overlay a mesh to serve as the starting points of your radii.
In real life you'd want a finer mesh than this, given the data density.

Swipe out a radial distance from around the mesh points.

Radii containing a significantly high event-to-candidate ratio are retained.
Wash, rinse, and repeat with varying radial distances, and you've got a bubbly indicator of clusters.  Additionally, you can use the clusters as inputs to a kernel density map for a smooth heatmap version.
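
As a rough illustration of that last step, here is one simple way to turn retained circles into a smooth surface.  This is a Python sketch, not the R code used for the maps; the quadratic (Epanechnikov-style, unnormalized) kernel, the per-circle weight, and the kernel_surface name are all assumptions picked for illustration.

```python
def kernel_surface(circles, grid_points):
    """Sum a quadratic kernel over the retained circles at each grid point.

    circles: list of (cx, cy, radius, weight) tuples, where weight is a
             hypothetical per-circle magnitude (for example, excess events).
    Returns one value per grid point, in the same order as grid_points.
    """
    surface = []
    for (gx, gy) in grid_points:
        total = 0.0
        for (cx, cy, radius, weight) in circles:
            # Squared distance from the circle center, scaled by the radius.
            d2 = ((gx - cx) ** 2 + (gy - cy) ** 2) / (radius * radius)
            if d2 < 1.0:
                # Peaks at the center, falls to zero at the circle's edge.
                total += weight * (1.0 - d2)
        surface.append(total)
    return surface


# Two overlapping retained circles (made-up values).
retained = [(0.0, 0.0, 2.0, 1.0), (1.0, 0.0, 2.0, 1.0)]
heat = kernel_surface(retained, [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)])
```

Where retained circles of different radii pile up, the summed values climb, which is what gives the heatmap its gradient; points outside every circle stay at zero.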

The GAM is just one way to map a ratio of events in order to find proportionally interesting areas and cut out the underlying phenomena clouding the info; there are lots of others.  A previous hotspot mapping post went into greater detail on why it's important to isolate event intensity from its underlying phenomena.  But it is such a cool and useful tool that I can't help providing examples again...

  • Cancer Hotspots.  Cases of prostate cancer vs. men of a certain age in order to see where case rates are actually elevated, not just where lots of older men live.
  • Anything Deserts.  Public playgrounds vs. block-level child counts to identify play "deserts."  Actually, anything deserts are a pretty hot topic right now in the social sciences, like food deserts.  In this case you'd be looking for exceptionally low ratios, rather than high.
  • Crime Risk.  Crime tends to happen where there are people, so a flat crime map will look a lot like a population map.  If you account for the underlying population at risk you can get a sense of localized areas of very specific risk.
  • Just about anything having to do with mapping epidemics.

We are really interested in what folks are up to in R and are doing our best to provide inroads to that work so it can be accessed by more folks in your organization.  Let us know if you have any ideas!


  1. All very pretty, but it's hard to ask meaningful statistical questions of a GAM. There are all sorts of issues with it. For pretty pictures, I'd just do a ratio of kernel-smoothings. For formal inference, some kind of underlying Gaussian process.

  2. Code, or this post didn't happen.