
Gathering moss: data gravity and context

Dave McCrory first proposed the idea of data gravity a few years ago. Since that time, he’s expanded and refined the concept, adding elements of Shannon’s law and the idea that data transmission is a form of “friction.” It can be complicated stuff. So I’m going to offer a couple of simple examples of what I believe to be one of the most important notions in computing today: why data wants to be together.

[Image: Dave McCrory’s data gravity well illustration]

In computing, the von Neumann bottleneck is a basic limitation on how fast computers can be. Essentially, it says that the rate at which information moves from memory (where data is stored) to the processor (where data is acted upon) is the limiting factor in computing speed.

This is the reason that when you buy a chip, there’s a local cache, and often a Level 2 cache, right on the chip. Even the time it takes for a signal to travel the tiny distance between the Central Processing Unit (CPU) and the computer’s main memory imposes a delay on each operation. And that delay adds up quickly when a computer is performing billions of operations a second.
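
To get a feel for the scale of that delay, here’s a rough back-of-the-envelope sketch in Python. The latency figures are commonly cited approximations, not measurements; they vary by hardware, but the ratio is the point:

```python
# Illustrative (approximate) latencies; real numbers vary by hardware,
# but the ratio between them is what matters.
CLOCK_HZ = 3e9        # a 3 GHz core: roughly 0.33 ns per cycle
L1_CACHE_NS = 1       # data already sitting in the on-chip cache
MAIN_MEMORY_NS = 100  # data fetched from RAM across the memory bus

cycle_ns = 1e9 / CLOCK_HZ
print(f"cache hit: ~{L1_CACHE_NS / cycle_ns:.0f} cycles")
print(f"RAM fetch: ~{MAIN_MEMORY_NS / cycle_ns:.0f} cycles")
# ~3 cycles versus ~300: every trip off the chip stalls the processor
# for hundreds of potential operations, which is why caches live on the die.
```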

The bottleneck doesn’t just exist on a computer’s motherboard. Microsoft researcher Jim Gray spent much of his career looking at the economics of data. He concluded that, compared to the cost of moving bytes around, everything else is free. Getting information from, say, your hard drive to a cloud server takes time—as anyone who’s ever uploaded a video will agree. And within a data center, moving bytes from a shared hard drive or storage service to a computing service has a cost.
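
To put rough numbers on Gray’s point, here’s a quick sketch; the file size and link speeds below are illustrative assumptions, not measurements:

```python
# How long do the same bytes take to move over different links?
# All figures here are illustrative assumptions.
VIDEO_BITS = 4 * 8e9  # a hypothetical 4 GB video, in bits

links_bps = {
    "home uplink (10 Mbit/s)": 10e6,
    "data-center link (10 Gbit/s)": 10e9,
}

for name, bps in links_bps.items():
    seconds = VIDEO_BITS / bps
    print(f"{name}: {seconds:,.0f} seconds")
# Roughly 3,200 seconds (close to an hour) over the home uplink versus
# about 3 seconds inside the data center: moving the bytes dominates the cost.
```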

Having all the data in one place would mean the least amount of moving it around, and as a result, the least cost. This is the fundamental principle of data gravity as Dave first explained it. Just as two planets might compete, gravitationally, for a third planet between them, so two data centers or cloud providers “compete” to pull data towards themselves. If all else were equal, we’d wind up with one big data center.

But all else is seldom equal.

There are plenty of reasons you may not want all your data in one place—privacy, legislation, and a service provider’s storage fees among them. Dave searched for a model that could explain these complexities, and the friction that stops data from centralizing. In 2012, during a fascinating phone call I wish I’d recorded, Dave announced that he’d found such a model in the way trade tariffs and balance-of-trade agreements are negotiated between countries and cities—a model that itself borrows from gravitational theory.

Since then, Dave has continued to refine his thinking on the subject, and he eventually realized that it was in fact a form of information theory. He spoke about this at Cloud Connect earlier this year.

Let me offer, as Dave often does in presentations, a piece of raw data: 32.

On its own, this isn’t very useful. It takes context to make it relevant. When I tell you that the number refers to degrees Fahrenheit, you can now put it into context. That piece of metadata has made the data more useful. It’s now informative.

You’ve brought your own context to this, too. As a reader, you know that water is common on your planet; and that 32 degrees Fahrenheit is the temperature at which water freezes. You’ve brought all your memories about snow, and ice, and winter along with you.

If I now tell you that it’s 32 degrees Fahrenheit in Montreal, that’s even more context—and it’s far more useful if you’re on a plane headed there right now. You’ve got useful knowledge. And if I also tell you it’s July, then the knowledge is surprising and unusual.

The more context you have about information, the more useful and relevant it is. In networking and information theory, Shannon’s law (technically, the Shannon-Hartley Theorem) governs how much information can be squeezed down a wire in the form of bits and bytes. The way that we squeeze more throughput out of a network is by adding context to each end. If I tell you that someone on the other end of a phone line will tell you the current temperature, in degrees Fahrenheit, in Montreal, then all that person needs to say is “32.” I’ve already given you the context to make it useful. Of course, I’ve also reduced the utility of that connection—I can’t use it to tell you, say, the current price of a stock.
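
Here’s a toy sketch of that idea in Python: both ends agree on the context ahead of time, so the wire only has to carry the bare value. The “protocol” here is entirely made up for illustration:

```python
# Context agreed on ahead of time by both ends of the connection.
# Because the receiver already knows what the value means, the sender
# only has to put the raw number on the wire.
SHARED_CONTEXT = {
    "quantity": "temperature",
    "units": "degrees Fahrenheit",
    "location": "Montreal",
}

def send() -> bytes:
    return b"32"  # two bytes on the wire

def receive(payload: bytes) -> str:
    c = SHARED_CONTEXT
    return f"The {c['quantity']} in {c['location']} is {int(payload)} {c['units']}."

print(receive(send()))
# The same two bytes are useless for anything else (a stock price, say):
# caching context at the ends buys efficiency at the cost of generality.
```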

[Image: the cover of the Voyager Golden Record]

(As a sidenote, the Voyager spacecraft carried a gold disc that an alien might play. The disc also included a ton of information about what the disc was, how to play it, and so on—because an alien race would have zero context. It was the ultimate in uncompressed, inefficient data that had maximum utility for a recipient.)

So sending data down a wire, at a simple and rarefied level, is often about the tradeoff between efficiency (which we get through compressing data and caching context at either end) and utility (which we get by letting that connection handle many things at once, without context at the end.) This is how wide-area-network acceleration works; it’s also how a ZIP file stores information.
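
A rough illustration of that tradeoff, using Python’s built-in zlib module, which can prime both the compressor and the decompressor with a preset dictionary: context cached at either end of the wire. The sample message and dictionary here are made up:

```python
import zlib

# Context both ends already share: the vocabulary the messages draw on.
shared_dictionary = b"temperature degrees Fahrenheit in Montreal is currently The"

message = b"The temperature in Montreal is currently 32 degrees Fahrenheit"

# Without shared context.
plain = zlib.compress(message, 9)

# With shared context: the same compressor, primed with the preset dictionary.
comp = zlib.compressobj(level=9, zdict=shared_dictionary)
with_context = comp.compress(message) + comp.flush()

print(len(message), len(plain), len(with_context))
# The version compressed with the dictionary is typically the smallest, but
# the receiver can only make sense of it with the same dictionary:
decomp = zlib.decompressobj(zdict=shared_dictionary)
assert decomp.decompress(with_context) == message
```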

Consider the following file.

[Image: a file icon with an uninformative name and no extension]

You don’t know anything about this file. There’s no file extension, and no useful information in the name. What if I change its name as follows?

[Image: the same file, renamed with a DSC-style camera filename]

Now you might have some insight. The letters DSC stand for Digital Still Camera, and it’s likely that somewhere on your hard drive, you have a folder that looks like this:

[Image: a search of my Mac turning up folders of DSC-numbered files]

If I open the file using the most generic tool I have on my computer—a text editor—and look at a random spot in the file, here’s what I see:

[Image: a screenful of unreadable binary characters in a text editor]

This isn’t very informative. Note that I’ve already skipped a number of steps here—I could view this as a series of ones and zeros; and it might not even be a file that works on my Mac OS computer. But the file can show us more. At the top, there’s some additional data I can read:

[Image: readable text near the top of the file, including the words “Canon PowerShot”]

From this, I can conjecture that the file is, in fact, an image—since it appears to have been taken by a Canon PowerShot. Knowing this, I can tell my computer it is a JPEG file (which is the most common format for images) and try to get a preview of it.

[Image: a Finder preview of the file, now recognizable as a photograph]
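
A script could make the same guess from the file’s first few bytes, its “magic number,” before any preview exists. Here’s a minimal sketch; the filename passed in is hypothetical:

```python
# Guess a file's type from its leading bytes, the same inference made
# above by eye. The filename below is hypothetical.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF8": "GIF image",
    b"%PDF": "PDF document",
}

def sniff(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return "unknown"

print(sniff("DSC_0021"))  # a camera JPEG would print "JPEG image"
```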

As it turns out, it’s a picture. I don’t know what, or where, or when, but it’s a picture. And now I can find more context. In this case, it’s not just a picture—it’s one I have published on Flickr. And Flickr has a considerable amount of metadata about the picture in question.

[Screenshot: the photo’s page on Flickr, in a set called Israel 2009]

That includes who took it (me.)

[Screenshot: my Flickr photostream]

It also includes what rights I claim to it (a Creative Commons Attribution-ShareAlike license.)

[Screenshot: the Creative Commons Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0) license page]

What’s more, the picture appears in a photostream, alongside other pictures from the same trip. It has metadata about date and time (and were it taken with a smartphone, it would have location information too.)

[Screenshot: the Israel 2009 photostream on Flickr]

Note that we’re moving beyond what’s in the file itself—I can see information about how many people have seen it as well, which is strictly Flickr’s context, and not mine.

[Screenshot: the photo’s views and favorites on Flickr]

Some of those people have commented, and even added it to their own lists.

[Screenshot: comments on the photo on Flickr]

Flickr also extracts the information the camera includes in the file, and mines it for its own purposes. Here’s what it knows about my picture:

Dates
  • Taken on December 2, 2009 at 12.45PM EDT
  • Posted to Flickr December 4, 2009 at 4.42PM EDT
Exif data
  • Camera Canon PowerShot SX10 IS
  • Exposure 0.001 sec (1/800)
  • Aperture f/5.0
  • Focal Length 53.8 mm
  • ISO Speed 80
  • Exposure Bias 0 EV
  • Flash Off, Did not fire
  • Orientation Horizontal (normal)
  • X-Resolution 72 dpi
  • Y-Resolution 72 dpi
  • Software QuickTime 7.6.3
  • Date and Time (Modified) 2009:12:04 10:36:54
  • Host Computer Mac OS X 10.6.1
  • YCbCr Positioning Centered
  • Date and Time (Original) 2009:12:02 12:45:07
  • Date and Time (Digitized) 2009:12:02 12:45:07
  • Max Aperture Value 5.0
  • Metering Mode Multi-segment
  • Color Space sRGB
  • Sensing Method One-chip color area
  • Compression JPEG (old-style)
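
All of that Exif metadata travels inside the file itself. Here’s a rough sketch of reading it locally, assuming the third-party Pillow imaging library is installed; the filename is hypothetical:

```python
# Read Exif tags straight out of a JPEG, roughly what Flickr does on upload.
# Requires the Pillow library (pip install Pillow); the filename is made up.
from PIL import Image
from PIL.ExifTags import TAGS

with Image.open("DSC_0021.JPG") as img:
    exif = img.getexif()  # mapping of numeric tag id -> value
    for tag_id, value in exif.items():
        name = TAGS.get(tag_id, tag_id)  # translate ids like 271 to "Make"
        print(f"{name}: {value}")
```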

That metadata, in turn, helps Flickr publish research on things like what kind of camera is most popular right now.

[Image: Flickr’s camera popularity rankings]

The journey my photo has taken, from the raw ones and zeros of the data itself through the layers of context it gained as it was labeled and moved into a shared environment, is what creates “friction” when I try to remove it. Were I to leave Flickr and switch to another photo-sharing site, that picture would lose much of its appeal. It wouldn’t be considered popular or “interesting,” a rating Flickr assigns to photographs that people like. I wouldn’t have a comment thread around it, or know who’d seen it. The picture would lose context.

Were I to repatriate the image to my hard drive, and then strip away filenames and metadata, I’d be making it less and less useful. This is the resistance Dave’s talking about when we tease apart data that has been centralized.

Consider, for example, a company whose sales pipeline resides in Salesforce.com. There, the data is wrapped in context—the software that’s used to edit and analyze it; a history of which employees have accessed which customers; which prospects are neglected; and so on.

If that company decides to leave Salesforce, they’re welcome to extract their data. But they’ll get it in the form of raw, comma-separated value (.csv) information. Much of the context is gone. The data, without the context, is far less useful. That’s an important lesson: software is context. And removing context is, from the point of view of information, like fighting gravity.
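
A sketch of the difference, with entirely made-up field names: the exported CSV row carries the bare facts, while the record as it lives inside the application is wrapped in the context the export leaves behind:

```python
import csv, io, json

# The raw export: just the fields (all column names here are illustrative).
raw_export = "Acme Corp,Renewal,48000,2013-09-30\n"
row = next(csv.reader(io.StringIO(raw_export)))
print(row)  # ['Acme Corp', 'Renewal', '48000', '2013-09-30']

# The same record inside the application, wrapped in context the CSV drops.
record = {
    "account": "Acme Corp",
    "stage": "Renewal",
    "amount": 48000,
    "close_date": "2013-09-30",
    "owner": "jdoe",
    "recently_viewed_by": ["jdoe", "asmith"],
    "activity_history": [
        {"date": "2013-05-14", "type": "call", "note": "left voicemail"},
    ],
    "days_since_last_contact": 20,
}
print(json.dumps(record, indent=2))
```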

There are things we can do to mitigate this loss of context. The smarter computers get at inferring context, the more friction they remove. Consider, for example, what happens when I ask Google’s image search what it thinks of my photograph:

[Screenshot: Google’s reverse image search results for the photo]

Not only does Google accurately identify the image and provide similar images and an abundance of additional information, it even shows me where my particular image has been used on the web:

[Screenshot: Google’s “Matching images” results]

With this data, I could (if I were so inclined) contact users and try to enforce my claim to the rights of the image.

So a theory of Data Gravity needs to consider several things:

  • Context comes from linking two pieces of data (such as the image contents, and the fact that it’s an image) together.
  • The more context we have, the more we turn raw bits into usable knowledge.
  • Often, context comes from somewhere else. On my computer, the fact that an image is an image is stored in the file system, not the file itself. With cloud computing, that file is likely to be far away.
  • As data is manipulated by software, it generates more data, which is a form of context. When the data is in a public place, it gathers more context as it interacts with other data. It “gathers moss.”
  • Data that is centralized can be compared, annotated, and tracked as others use it.
  • As software gets smarter, it can often infer context usefully.

All of these observations affect the tendency of data to be centralized (for cost, efficiency, proximity to other sources of context, and utility) or the ease with which it can be moved around and repurposed.

This is important for sovereignty, because it means that countries might need to legislate against such agglomeration. It also conveys a strong first-mover advantage similar to the network effects described in Metcalfe’s Law (the more nodes there are of something, the more useful it becomes.) Amazon’s S3 storage service, for example, has a huge lead over other storage providers; indeed, the reason Amazon’s East Coast data center is so favored—despite chronic overcrowding—is simply that it’s where everyone else is.

What Dave’s latest work does is incorporate these ideas into a workable explanation of Data Gravity. That name is sticky—unusually good branding, and a term that’s spread around the cloud community like wildfire. But it’s also a misleading term. Behind it all is the notion that data which is near other data is more useful, and the tendency of data to cling together comes from the usefulness of the resulting knowledge.