Data Mining 101: Finding Subversives with Amazon Wishlists

There is a classic census problem of how to provide reasonably detailed statistical and geographic information about a population while still protecting the identities of individuals. Consider this problem for a moment, and then turn it upside down:

How do you provide information publicly about individuals without exposing your database in its entirety to those who might mine it for unintended purposes?

Tom Owad just released an incredibly detailed howto for analysing the reading habits of Amazon.com wishlist users. Did you think this sort of profiling was only available to the likes of the NSA and FBI?

Think again:

It used to be you had to get a warrant to monitor a person or a group of people. Today, it is increasingly easy to monitor ideas. And then track them back to people. Most of us don't have access to the databases, software, or computing power of the NSA, FBI, and other government agencies. But an individual with access to the internet can still develop a fairly sophisticated profile of hundreds of thousands of U.S. citizens using free and publicly available resources.

Tom details how he was able to obtain reading preferences for thousands of citizens. Then, using a few simple scripts, a little geocoding, and Google Maps, he was able to filter and visualize the data based on "suspicious" books. You know, like 1984, The Catcher in the Rye, or the Bible.

Using a pair of 5-year-old computers, two home DSL connections, 42 hours of computer time, and 5 man hours, I now had documents describing the reading preferences of 260,000 U.S. citizens.
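To give a sense of how little code the filtering step needs, here is a minimal Python sketch. It is not Tom's actual script: the records, names, cities, and the flag_owners helper are all invented for illustration, and it assumes the wishlist data has already been collected into simple (owner, city, title) tuples.

```python
from collections import defaultdict

# Hypothetical records: (wishlist owner, city, book title).
# All names and places here are made up for illustration.
WISHLISTS = [
    ("A. Reader", "Springfield", "1984"),
    ("A. Reader", "Springfield", "Gardening Basics"),
    ("B. Browser", "Portland", "The Catcher in the Rye"),
    ("C. Shopper", "Austin", "Cooking for Two"),
]

# Titles treated as "suspicious", stored lowercase for case-insensitive matching.
FLAGGED_TITLES = {"1984", "the catcher in the rye"}


def flag_owners(records, flagged_titles):
    """Map each (owner, city) pair to the flagged titles on their list."""
    hits = defaultdict(list)
    for owner, city, title in records:
        if title.lower() in flagged_titles:
            hits[(owner, city)].append(title)
    return dict(hits)


flagged = flag_owners(WISHLISTS, FLAGGED_TITLES)
for (owner, city), titles in sorted(flagged.items()):
    print(f"{owner} ({city}): {', '.join(titles)}")
```

The city field is the piece a geocoder would turn into map coordinates. Exact title matching is the crudest possible filter; a real profile would more likely match on keywords or subjects.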

He notes that the Amazon robots.txt and acceptable use policies don't explicitly forbid this sort of database scraping.

Not that a robots.txt policy is any real barrier against nefarious activity. And even without the datamining aspect, the Amazon data makes it possible to selectively profile individuals who use the system.
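For readers who haven't looked at one, robots.txt is purely advisory: a well-behaved crawler reads it and chooses to comply, and nothing enforces that choice. The sketch below uses Python's standard urllib.robotparser to show what that voluntary check looks like. The robots.txt contents, the example.com URLs, the "example-bot" user agent, and the is_allowed helper are all assumptions made up for the example, not Amazon's actual policy.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, not Amazon's real one.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler would run this check before each request.
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/wishlist/123"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data"))
```

A scraper that simply skips this check meets no technical obstacle, which is why the file offers no real protection.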

Amazon wishlists let anyone bookmark books for later purchase. By default these lists are public and available to anybody who searches by name.

The problem doesn't end with Amazon, however. Amazon is an exemplary case study, arguably containing information on more people than most public sites, but there are undoubtedly other systems and public databases that are vulnerable to this sort of attack.

Think wedding and baby registries.

Think public forums and review databases.

Better yet, think carefully about what information you are publicly disclosing online.

Data Mining 101: Finding Subversives with Amazon Wishlists [via]
