Monday, June 6, 2011

Strategies for Massive Data Sets

It happens once in a while that a client has a data set with a kajillion point locations in it and they'd like to see what that looks like on a map.  With data sets this large, each additional point adds less and less value: the picture becomes visually overwhelming, meaningful interaction with individual points becomes hopelessly unwieldy, and the cloud of points serves mainly to give a visual sense of relative geographic distribution.  It's tempting to poo-poo it as an unreasonable desire, but sometimes you really do need the whole world's worth of data to communicate a message of general whereness.  Visualizing massive data sets is a challenge, but the result can be quite useful.  So how do you go about cramming that message through a browser-based app?
One million points...

Two Bottlenecks
Slowness can occur at either (or both) ends of a data transaction: back-end/network overhead (the database query and the blob of data sent back) and client-side rendering (the drawing that takes place in the browser).
Asking any database for a million-plus records will take a bit of time, probably more than a web user is accustomed to waiting, or more than a browser itself is willing to wait.  I often imagine the gulp of data that comes down the network looking like one of those snakes that has eaten a moose or something.  By the way, a spatial index will save your users a ton of time and is a universally good idea for big data sets.

At the other end of the pipe is the browser client that has to actually draw all of that stuff on screen.  Rendering loads of data can cause even the most efficient clients to lock up or respond sluggishly, but you can be clever about how you represent that data so the experience isn't horrible.

Here are some strategies that can be used individually or in combination with each other for showing, or maybe just managing, enormous data sets in a Visual Fusion app...

Default Filtering
I have to at least throw this option out there.  Before the full set of data is requested in an application, a meaningful set of filtering options can be applied by default to restrict the potentially overwhelming set of points to a manageable subset.  Though, as mentioned above, there can be a perfectly good reason for wanting to see a visual indication of all the points, not just some of them.  In that case, consider the following options.

Some default filtering criteria can reduce the likelihood of an incidental data overload.
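
To make the idea concrete, here is a minimal sketch of a default filter applied before any points are requested. It is not how Visual Fusion configures filters; the field names (`date`, `category`) and the 30-day window are assumptions made up for illustration.

```python
from datetime import date, timedelta

def default_filter(today):
    """Hypothetical defaults: last 30 days, 'open' items only."""
    return {"since": today - timedelta(days=30), "category": "open"}

def apply_filter(points, criteria):
    """Return only the points that match every criterion."""
    return [
        p for p in points
        if p["date"] >= criteria["since"] and p["category"] == criteria["category"]
    ]

# Made-up sample records, purely for illustration.
points = [
    {"id": 1, "date": date(2011, 6, 1), "category": "open"},
    {"id": 2, "date": date(2011, 1, 15), "category": "open"},
    {"id": 3, "date": date(2011, 6, 3), "category": "closed"},
]
subset = apply_filter(points, default_filter(date(2011, 6, 6)))
```

In a real app the same criteria would ideally be pushed into the database query, so the records that get filtered out never cross the network at all.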

Aggregation to Regions
Lump up data points into area units for a wholesale aggregate view.  This method addresses the back-end and the front-end issues associated with large data sets.  Associating parent relationships in the database and pre-generating the aggregates results in a vastly smaller number of items to transact.  The reduced number of geographic elements rendered on the map also takes a lot of weight off the UI.  What's more, this method introduces a world of new and useful map visualization options.

Aggregating individual data points to areas has many benefits, including improved performance.
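
As a rough sketch of the pre-generated aggregate, assume each point already carries the ID of its parent region (the `region` and `value` field names are hypothetical). The roll-up then collapses any number of points into one row per region:

```python
from collections import defaultdict

def aggregate_by_region(points):
    """Collapse point records into one summary row per parent region."""
    totals = defaultdict(lambda: {"count": 0, "value": 0.0})
    for p in points:
        row = totals[p["region"]]
        row["count"] += 1
        row["value"] += p["value"]
    return dict(totals)

# Made-up sample records; 'region' is the pre-assigned parent area.
points = [
    {"lat": 42.7, "lon": -84.5, "region": "MI", "value": 10.0},
    {"lat": 42.3, "lon": -83.0, "region": "MI", "value": 5.0},
    {"lat": 41.9, "lon": -87.6, "region": "IL", "value": 2.5},
]
summary = aggregate_by_region(points)
```

In practice this roll-up would more likely live in the database as a grouped query or a summary table, so the map only ever asks for the per-region rows.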

Heatmapping
The rendering performance gains of heatmapping are huge.  Tens of thousands of points render quickly and elegantly in the map interface.  The heatmap provides a useful indication of dispersion and provides a visual breadcrumb that can guide a user in to closer scales where the data overload problem is mitigated.  Web usability folks have been using scent as a navigation metaphor for the visual scanning of browser content and the heatmap works excellently in concert with that model.
Plus, heatmaps look awesome.  Here are some out-of-the-box coloring options for VFX.

A heatmap gives a great visual indication of dispersion, providing a valuable service in itself but also as guidance for where to zoom in.
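
The core of a heatmap is turning many points into a few density values. The sketch below is a plain binned density grid, which is a simplification: a real heatmap renderer typically smooths the result, and this is not VFX's implementation. The cell size and function names are assumptions.

```python
import math
from collections import Counter

def density_grid(points, cell_deg=1.0):
    """Count (lat, lon) points per grid cell; keys are (row, col) indices."""
    grid = Counter()
    for lat, lon in points:
        row = math.floor((lat + 90.0) / cell_deg)
        col = math.floor((lon + 180.0) / cell_deg)
        grid[(row, col)] += 1
    return grid

def normalize(grid):
    """Scale counts to the range 0..1 so they can drive a color ramp."""
    if not grid:
        return {}
    peak = max(grid.values())
    return {cell: n / peak for cell, n in grid.items()}

# Two nearby points share a cell; the third lands in a cell of its own.
grid = density_grid([(42.7, -84.5), (42.2, -84.9), (10.0, 10.0)])
heat = normalize(grid)
```

However many points go in, the output is bounded by the number of grid cells, which is where the rendering savings come from.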

Zoom-Level Activation
You can disable feeds in VFX, and probably most other mapping platforms, by zoom level (aka, altitude or scale).  Oftentimes this makes sense for smartly showing and hiding nested area units like countries, states, counties, etc.  But it is also a useful, though comparatively blunt, tool to prevent meltdown at a zoom level that would result in a galaxy of points.  This method is best used in tandem with heatmapping or render-as-raster options.  For example, at broad scales a heatmap can show where items are concentrated and when one zooms in to a tighter scale the points themselves get called in.
VFX can be configured to set up an alias feed row which appears to control both incarnations of the feed (raster and vector) so that the transition appears seamless.

Scale-dependent triggers can restrict access to manageable altitudes.
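
Here is one way the heatmap-then-points handoff could be expressed, as a sketch only. The feed names and zoom thresholds are hypothetical, and VFX sets this up through configuration rather than code like this.

```python
from dataclasses import dataclass

@dataclass
class FeedRule:
    """A feed plus the zoom range in which it is allowed to draw."""
    name: str
    min_zoom: int
    max_zoom: int

    def active_at(self, zoom: int) -> bool:
        return self.min_zoom <= zoom <= self.max_zoom

# Hypothetical rules: heatmap at broad scales, discrete points up close.
RULES = [
    FeedRule("incidents_heatmap", 0, 9),
    FeedRule("incidents_points", 10, 20),
]

def active_feeds(zoom, rules=RULES):
    """Names of the feeds that should be requested at this zoom level."""
    return [r.name for r in rules if r.active_at(zoom)]
```

With these rules, `active_feeds(4)` returns only the heatmap feed and `active_feeds(12)` returns only the point feed, so the expensive vector request is never made at broad scales.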

Server-Side Clustering
Visual Fusion supports something called server-side clustering for individual point feeds.  Using a variable proximity tolerance, all point items are assessed by their relative proximity and potentially lumped into "clusters," which serve as a group proxy for the component items.  The server side of the transaction is not improved, but the rendering performance can get a tremendous bump, since only a fraction of the original points have to be rendered.  Any icon can be configured to serve as the cluster icon.  When the user mouses over a cluster, a tool tip shows the original number of points that it represents; clicking a cluster can tractor-beam you down into the footprint of the cluster, at which scale you may or may not discover some of the component points are discrete once more.  Note that clustering occurs at the feed level, so there are no cross-feed clusters.  VF Composer has a simple setting where you choose the clustering proclivity.

Composer lets you pick a cluster icon and a clusteriness level.
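
To show the general idea of a proximity tolerance, here is a sketch that uses simple grid bucketing: points falling in the same tolerance-sized cell become one cluster. This is a stand-in technique chosen for brevity, not the algorithm Visual Fusion actually uses, and the function and field names are assumptions.

```python
import math
from collections import defaultdict

def cluster_points(points, tolerance_deg):
    """Group (lat, lon) points by tolerance-sized grid cell.

    Each cluster reports the centroid of its members and how many
    original points it stands in for.
    """
    buckets = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat / tolerance_deg), math.floor(lon / tolerance_deg))
        buckets[key].append((lat, lon))
    clusters = []
    for members in buckets.values():
        n = len(members)
        clusters.append({
            "lat": sum(m[0] for m in members) / n,
            "lon": sum(m[1] for m in members) / n,
            "count": n,
        })
    return clusters

sample = [(10.1, 20.1), (10.2, 20.3), (50.0, 60.0)]
coarse = cluster_points(sample, 1.0)   # loose tolerance: nearby points merge
fine = cluster_points(sample, 0.01)    # tight tolerance: points stay discrete
```

A larger tolerance means fewer, bigger clusters. Grid bucketing has a known weakness: two close points on either side of a cell boundary won't merge, which a true distance-based method would handle.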

Render as Raster
When it comes to displaying thousands and thousands of data points, raster is just faster (more reading on the tradeoffs here). You don’t save any overhead in the query phase, but network transfer and rendering are waaaay faster. Delivery as a flattened bitmap (a simple configuration option in VFX) virtually eliminates the strain put on the web call and the UI, compared to discrete points. And, unlike heatmapping, you get to see the individual point locations. The downside to this is the loss of out-of-the-box direct interaction with the points, and they won't appear in your VFX charts or timeline. There are methods to mitigate this but really, interacting with individual items in data of this magnitude is almost meaningless and you're better off waiting for a more manageable scale to call in the discrete vector points.

A flattened bitmap is a lighter load than many thousands of lat-long coordinate pairs and their attributes.
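
The reason a bitmap is lighter can be seen in a toy rasterizer: the output size depends on the image dimensions, not on how many points were burned into it. This sketch assumes a whole-world image in a simple equirectangular layout (web maps generally use a Mercator projection instead), and it is not how VFX produces its raster.

```python
def rasterize(points, width, height):
    """Burn (lat, lon) points into a width x height bitmap, one byte per pixel.

    Assumes lat in [-90, 90] and lon in [-180, 180]; pixel (0, 0) is the
    top-left corner (north-west).
    """
    pixels = bytearray(width * height)
    for lat, lon in points:
        x = min(int((lon + 180.0) / 360.0 * width), width - 1)
        y = min(int((90.0 - lat) / 180.0 * height), height - 1)
        pixels[y * width + x] = 255  # mark the pixel as occupied
    return pixels

few = rasterize([(0.0, 0.0)], 256, 128)
many = rasterize([(0.0, 0.0)] * 100_000, 256, 128)
```

Both `few` and `many` are exactly 256 x 128 bytes, even though one was built from a single point and the other from a hundred thousand. The flip side is visible too: once flattened, the individual records and their attributes are gone from the payload.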

Pre-Processed Overlay
Truly massive data sets can optionally be pre-processed in a GIS or by Visual Fusion as their own canned map overlay, updated via a batch process when traffic is low or as needed.  The database effort of amassing all those points would then be done periodically, rather than every time a user hits the data.  The resulting map overlay (or tile set) is a relatively lightweight layer which would feel fast and responsive as users navigate.  A pre-processed overlay won't look any different than an on-demand overlay.  One drawback to this option (in addition to the render-as-raster caveats in the previous section) is that the data are refreshed only when the batch process re-creates the overlay.  Additionally, the user won’t be able to apply data filters or thematic visual changes at runtime, as the overlay has been baked ahead of time.  Plus, you'll have to have some map geek(s) set the batch process up for you, so weigh the desire against the cost.

Data that are voluminous and slow to change, like local demographics, are good candidates for pre-processing.
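
As a sketch of the first step such a batch job might take, the code below sorts points into map tiles using the common Web Mercator XYZ ("slippy map") tile numbering. Whether a given Visual Fusion or GIS workflow uses this tile scheme is an assumption, and the function names are made up for illustration.

```python
import math
from collections import defaultdict

def tile_for(lat, lon, zoom):
    """Return the (x, y) tile containing a point, in XYZ tile numbering."""
    n = 2 ** zoom  # tiles per side at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points at the map edge or near the poles stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def build_tile_index(points, zoom):
    """Group (lat, lon) points by the tile they will be drawn into."""
    tiles = defaultdict(list)
    for lat, lon in points:
        tiles[tile_for(lat, lon, zoom)].append((lat, lon))
    return dict(tiles)

index = build_tile_index([(0.0, 0.0), (0.0, 1.0), (0.0, -90.0)], zoom=1)
```

A scheduled job could then render each tile's points to an image once, and every user request afterward is just a lookup of an already-baked tile.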

So good luck visualizing all that data.  These are by no means the only strategies for presenting massive data sets, but they have been effective options for us!  Also, if you have a method you've used that wasn't covered here, I'd love for you to let us know about it!
