Sunday, January 6, 2008

Owning my Metadata

Dear Lazyweb,

I'd like someone to invent a 'Metadata reclaimer': a program to screenscrape all my amazon ratings, flickr tags, facebook posts, etc.

I try, as far as possible, to only use apps that let me keep ownership of my metadata. As our friend pud has remarked, all successful internet enterprises share the same business model: either

  • People pay to Enter Data into your Database (eBay, Google AdWords, Flickr, Second Life, World of Warcraft, IMDB pro, Craigslist), or less defensible,
  • People Enter Data into Your Database For Free while Other People Pay to Get it Out (rapidshare, iTunes Music Store, Pud's Internal Memos; with youtube, myspace, epinions etc viewers pay with the tenuous currency of their ad brain).

There's nothing wrong with that; all these companies levelled their playing field in some fundamental and important way. (Well, nothing wrong unless you're the loathsome gracenote.com (formerly cddb), who turned an open community-generated resource into a closed database, without even the courtesy of a copy to fork from.)

But it's fair to ask that I be able to export my copy of the data I've added to their business asset, and to do so easily.

Sites that play well with others:
  • my del.icio.us tags and bookmarks
  • my bloglines/google reader feeds
  • my librarything.com everything
  • my last.fm history
  • my iTunes playcounts, tags and ratings: mostly, I think?
  • Firefox bookmarks and history

Sites with an 'I gave up my metadata and all I got was this stupid webpage' policy:

  • facebook posts, friends, photos, everything
  • flickr tags &c
  • amazon recommendations
  • Google calendar: mostly no (at least, the last time I tried to sync my address books it was a Giant Pain in the Ass: nothing was durably id'd and recurring events were semantically incorrect. Yes, I'd love to have 96 separate entries for my Grandmother's birthday!)
  • eBay bids, purchases, ratings
  • Blogger: Blogs, yes if you remote host your site. However, you can't even /list/ the blogger comments you've made, let alone export them.
  • I believe Myspace's engineers can't even spell XML
(I could be wrong about any of these except the last one).

I'm picturing something with a plugin architecture -- the main app handles the screenscraping, authentication, form submission, web crawling and file export details; the plugin supplies URL wildcards and regexp's the data back into semantic structure. With XML export, a motivated plugin author or well-itched user could supply a decent XSLT stylesheet to represent that metadata in a useful local fashion (and with helpful links back to the main site). It would be useful to have plugins (trivial) and stylesheets (no more or less so) even for sites like Last.fm and Library Thing that Do The Right Thing by granting transparent access to your metadata.
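To make the split concrete, here is a minimal Python sketch of what I have in mind. Everything in it is hypothetical: the `Plugin` class, the `export_xml` function, the `shop.example` site and its markup are made up for illustration, and the fetching and authentication half of the main app is left out entirely. The plugin only knows a URL wildcard and a regexp with named groups; the main app turns whatever comes back into XML.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Plugin:
    """Site-specific knowledge only: which URLs to read and how to parse them."""
    site: str
    url_pattern: str   # shell-style wildcard
    item_regex: str    # each named group becomes an XML attribute

    def matches(self, url: str) -> bool:
        return fnmatchcase(url, self.url_pattern)

    def extract(self, html: str) -> list[dict]:
        return [m.groupdict() for m in re.finditer(self.item_regex, html)]


def export_xml(plugin: Plugin, url: str, html: str) -> str:
    """Main-app side: wrap whatever the plugin extracted in a generic XML envelope."""
    root = ET.Element("metadata", site=plugin.site, source=url)
    for fields in plugin.extract(html):
        ET.SubElement(root, "item", fields)
    return ET.tostring(root, encoding="unicode")


# Hypothetical plugin for an imaginary ratings page.
ratings = Plugin(
    site="example-shop",
    url_pattern="https://shop.example/ratings*",
    item_regex=r'<li data-title="(?P<title>[^"]+)" data-stars="(?P<stars>\d)">',
)
page = '<li data-title="Dune" data-stars="5"><li data-title="Emma" data-stars="4">'
xml = export_xml(ratings, "https://shop.example/ratings?page=1", page)
print(xml)
```

Regexps are brittle against real-world HTML, so a serious plugin might want a proper parser, but the division of labour stays the same. An XSLT stylesheet would then only have to deal with `<metadata>` and `<item>` elements, whatever the site.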

Much of this may exist in some form or another; for example the Aperture/iPhoto plugin will apparently sync your flickr and iPhoto tags, and embed the result into the app database. But going from XML => app is more flexible -- and possibly easier -- than the other way 'round.

I one-offed this a while back for my Amazon ratings, but I just noticed I'd gone from ~350 to ~650 'things rated' since then. I'm hoping the LazyWeb has solved my problem, since I'm not sure where I put those scripts. (Ironic, considering my previous post.)


Thursday, December 13, 2007

Leveraging the Bittorrent Underground for semantic data and media

I just ran across a pretty interesting site called coverbrowser.com, which uses a variety of image APIs to pull in comic book, game, book, music, movie and other cover art. (Read the technical details here).

It reminded me of an idea I had a while back but which I will never get around to implementing --- maybe you will, or for all I know someone's already been doing it for years. (Sidenote: I've had some people express interest in this, and have worked out some parts of it, but just don't have the time to complete it right now. If you'd like to help develop it, get in touch).

Many of the movie and music torrents on the, ahem, "Unauthorized Evaluation Copy" bittorrent sites contain hi-res scans of their cover art, and all of the major bittorrent sites maintain topic-specific RSS feeds.

As long as the torrent indexes the files individually (and not as an opaque .zip or .rar) -- and most do index individually -- you can target specific files within the torrent. I don't know whether you could chop all the large-file-size copyright-problematic files that you don't want out of the torrent, or whether you'd have to hack Azureus or another bittorrent client (instructing it to get only *.{png,gif,jpg,jpeg,bmp,tiff,tif} or what have you). Either way, you would then only be spending the bandwidth required to grab the images and not the accompanying multi-megabyte file, and you would only be getting the information to which you presumably have fair use rights.
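The file-selection half is easy to sketch. A multi-file torrent's metadata carries a "files" list in which each entry has a "length" in bytes and a "path" given as a list of components. Assuming that list has already been decoded (the bencode parsing and the client integration are left out, and `wanted_files` and the sample album are invented for illustration), picking out the images looks roughly like this:

```python
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".gif", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}


def wanted_files(files):
    """Return (indices, wanted_bytes, total_bytes), keeping image files only.

    `files` is the already-decoded "files" list from a multi-file torrent:
    each entry has "length" (bytes) and "path" (list of path components).
    """
    indices, wanted, total = [], 0, 0
    for i, entry in enumerate(files):
        total += entry["length"]
        name = entry["path"][-1]
        if PurePosixPath(name).suffix.lower() in IMAGE_EXTS:
            indices.append(i)
            wanted += entry["length"]
    return indices, wanted, total


# Made-up example torrent contents.
files = [
    {"length": 734003200, "path": ["Some Album", "disc.flac"]},
    {"length": 2400000, "path": ["Some Album", "Scans", "front.JPG"]},
    {"length": 1800000, "path": ["Some Album", "Scans", "back.png"]},
    {"length": 512, "path": ["Some Album", "info.nfo"]},
]
indices, wanted, total = wanted_files(files)
print(indices, wanted, total)
```

One caveat: BitTorrent transfers fixed-size pieces rather than files, so a piece that straddles a file boundary will drag along a few bytes of the neighbouring file. The real bandwidth cost is therefore somewhat higher than `wanted` suggests.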

So you'd set up a daemon process that would

  • watch the Movies and the Music RSS feeds off whichever or all of the sites,
  • identify albums whose cover art you lack,
  • pull in the bittorrent,
  • but download only the cover art
  • and perhaps also process any of the accompanying semantic data

You might have to get yourself a seedbox to make this work, but they're not unaffordable.
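The first two steps of that loop can be sketched in Python with nothing but the standard library. This assumes the feed is plain RSS 2.0 with `<item>`, `<title>` and `<link>` elements, and that matching on lowercased titles is good enough to decide what you already have, which on real torrent feeds it probably isn't. `missing_cover_items`, the sample feed and the `tracker.example` URLs are all invented; fetching the feed and handing the links to a bittorrent client are left out.

```python
import xml.etree.ElementTree as ET


def missing_cover_items(rss_text, have_titles):
    """Return (title, link) pairs for feed items whose cover art we lack.

    `have_titles` is a set of lowercased titles we already hold art for.
    """
    root = ET.fromstring(rss_text)
    todo = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link and title.lower() not in have_titles:
            todo.append((title, link))
    return todo


# Made-up feed standing in for a site's Music RSS.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Music</title>
  <item><title>Artist A - First Album</title><link>http://tracker.example/1.torrent</link></item>
  <item><title>Artist B - Second Album</title><link>http://tracker.example/2.torrent</link></item>
</channel></rss>"""

have = {"artist a - first album"}
todo = missing_cover_items(SAMPLE_FEED, have)
print(todo)
```

A daemon would run this on a timer for each feed and pass every link in `todo` on to the download step.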

I think this would lead to a large stream of incoming cover art for music and other media files, complete with a reasonable amount of semantic information.

There's probably a lot of other crowdsourced semantic data flowing through the underground, if someone actually created such a torrenting robot. (And yes, I feel yucky using "crowdsourced" and "semantic data" in the same sentence).
