Homework 4: Recommender System for del.icio.us

Assigned: Tues May 12, 2009

Due: 5pm, Thur May 21, 2009

Email to: stats252.homework@gmail.com.

Note: This homework should be done individually, but feel free to discuss questions about Python programming.

In this assignment you will create a simple recommender system for del.icio.us. Here is the scenario: you have just bookmarked www.weigend.com, and the program will recommend URLs for you based on your own tagging history and that of others who, at some stage in the past, have also tagged www.weigend.com.

To design a recommendation system, let's start simple.
Find the set of all users who have bookmarked www.weigend.com or weigend.com (in a later version of del.icio.us these will be mapped together; in the current version you need to do this manually).
Looking at all of the URLs tagged by this set of users, how many URLs are tagged by only a single user, how many are tagged by 2 users, 3 users etc.?
When there are many such multiply bookmarked URLs, you might consider ranking the results with a "weighted" version that gives different users different weights before aggregation. Two examples of such weights are:

  • heavier weights for users whose tagging history is more similar to yours;
  • lower weights for users who have not used del.icio.us for some time.
Time adds another dimension to consider in the weighting:
  • relative to when the user tagged weigend.com (i.e., do you want to weight URLs tagged afterwards higher than those tagged before?), and
  • relative to the current time (should a URL tagged yesterday be weighted more heavily than one tagged a year ago?)
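These counting and weighting ideas can be sketched without touching the del.icio.us API at all. In the snippet below, `posts_by_user` (each co-tagger mapped to the set of URLs they bookmarked) and the weight dictionary are hypothetical inputs you would build from the data you collect; this is a minimal sketch of one possible scoring scheme, not the required solution.

```python
def rank_urls(posts_by_user, weights=None):
    """Score each URL by the (optionally weighted) number of
    co-taggers who bookmarked it; return (url, score) pairs, best first."""
    scores = {}
    for user, urls in posts_by_user.items():
        w = 1.0 if weights is None else weights.get(user, 0.0)
        for url in urls:
            scores[url] = scores.get(url, 0.0) + w
    return sorted(scores.items(), key=lambda item: -item[1])

# Toy data: three co-taggers and the URLs they bookmarked.
data = {'ann': {'a.com', 'b.com'},
        'bob': {'b.com', 'c.com'},
        'cat': {'b.com'}}
# Unweighted, a URL tagged by k users scores k, so b.com comes first with 3.0.
top = rank_urls(data)[0]
# Down-weighting 'cat' (say, for a stale tagging history) shrinks b.com's lead.
weighted = rank_urls(data, {'ann': 1.0, 'bob': 1.0, 'cat': 0.1})
```

The histogram question above falls out of the same structure: counting how many URLs receive each unweighted score tells you how many URLs were tagged by 1, 2, 3, ... users.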

By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular URLs) is biasing towards the common. When is this property desired? What are its potentially negative side effects?
To create an algorithm for bolder recommendations, biased towards uniqueness, how would you find the most "interesting" users who have tagged weigend.com? And how would you derive recommendations from their most "interesting" bookmarks?
Compare these two approaches and explain their differences. When would you use the first algorithm, biased towards more frequently tagged URLs, and when would you use the second, biased towards individuality?
So far, you have only considered the fact that others have tagged the URL, and ignored the specific tags as well as the comments. How would you use the number of tags (perhaps relative to how many the person usually uses, or relative to an overall distribution)? How would you use the uniqueness of the tags? And what about the comments they leave?
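One simple way to quantify tag "uniqueness" is an inverse-frequency score in the spirit of TF-IDF: tags used in almost every post carry little information, rare tags a lot. The sketch below assumes the space-separated `tags` strings as they appear in the post dictionaries, and the log-inverse-frequency score is just one reasonable choice, not the required one.

```python
import math

def tag_rarity(tag_strings):
    """Given the space-separated 'tags' field of many posts, score each
    tag by log(N / n_t), where n_t is the number of posts using that tag."""
    n = len(tag_strings)
    counts = {}
    for s in tag_strings:
        for tag in set(s.split()):          # count each tag once per post
            counts[tag] = counts.get(tag, 0) + 1
    return dict((tag, math.log(float(n) / c)) for tag, c in counts.items())

corpus = ['web2.0 statistics', 'statistics amazon', 'statistics weigend']
rarity = tag_rarity(corpus)
# 'statistics' appears in every post, so its score is log(3/3) = 0;
# 'weigend' appears once and scores log(3), about 1.10.
```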
Before you start working on this, first register an account on del.icio.us and tag some URLs of interest to you (this will be your history). Put your Python code and 25 ranked recommendations for the target URL on your webpage. Carefully examine and discuss how the results change as you vary your parameters. Post the description of the choices you made, the effects you investigated, and the set of 25 recommendations on your webpage, and submit the link to this webpage to the usual email address by the deadline.

Class from last year: http://stanford2008.wikispaces.com/6.+May+12%2C+2008
Speaker: **Toby Segaran** - Currently works at **Metaweb Technologies**
(Those on Stanford campus can read his book online here)

(I found the following two pages very helpful. - Brian Dumbacher)
Python v2.6.2 Documentation: http://docs.python.org/tutorial/
Section on Data Structures: http://docs.python.org/tutorial/datastructures.html

I could not access the Programming Collective Intelligence book in its entirety (as an off-campus, SCPD student) until I signed up for the free 10-day trial. E-mail me if you need a copy of Ch2. - Sylvie B. (sylviebryant@gmail.com)

This exercise uses Python and the pydelicious (del.icio.us Python API) package. To check out this package, you need to install SVN and the feedparser package. Here are some details.
(1) Install Python 2.5. (Note: you can use other Python 2 versions such as Python 2.6, but do not use Python 3.0, which is incompatible with the packages you will use below.) Python and its tutorials are available at www.python.org. On Windows, after you have run the Python installer, you may have to manually add the installation directory to your system path environment variable (right-click My Computer -> Properties -> Advanced -> Environment Variables, and edit the path variable).
(2) Install SVN. SVN is a source code version control system. Please download and install it (Windows binary: http://subversion.tigris.org/files/documents/15/36797/svn-1.4.3-setup.exe; Mac OS: http://metissian.com/projects/macosx/subversion/ ; Linux: you should know).
(3) Next, install a package named “feedparser”. Please download feedparser-4.1.zip from http://code.google.com/p/feedparser/downloads/list, unpack it, go to its directory, and type the following commands to install it:
python setup.py build
python setup.py install

(Just in case people are still at this point and are still stuck: you type these commands into the command line (cmd.exe in Windows, Terminal on Mac). Type "cd [name of directory/folder]" to go to a directory. You have to be in the feedparser or pydelicious folder for the given commands to work. Typing "python" then starts a Python interpreter (on Macs, at least), and from there you're on your own for the coding. Hope this helps some people! ~~Georgia A)

(4) Finally, check out “pydelicious”. Please use SVN: go to the bin subdirectory of SVN and type the following command:
svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious-0.5.0

(Note: there is a space between "trunk/" and "pydelicious-0.5.0" in the above command. You can also download pydelicious-0.5.0.zip from http://code.google.com/p/pydelicious/downloads/list and unpack it.) After checking out the package, go to its directory "pydelicious-0.5.0" and type the following commands to install it:

python setup.py build
python setup.py install

After completing the above four steps, you should be able to import the pydelicious package in Python.
If you encounter errors saying that you need more packages preinstalled, please install them in the same way as in step (3).

Windows note: if you get a "ValueError" message, replace this line in setup.py:

package_dir = { 'pydelicious': 'src/' },

with:

package_dir = { 'pydelicious': 'src' },

For example, one student (Bo) was getting errors with the elementTree library used in pydelicious. Here's what to do:

    # go to your homework directory
    > curl "http://effbot.org/media/downloads/elementtree-1.2.6-20050316.zip" > elementtree-1.2.6-20050316.zip
    > unzip elementtree-1.2.6-20050316.zip
    > cd elementtree-1.2.6-20050316/
    > python setup.py build;
    > python setup.py install;
Now go to your pydelicious directory and build/install again using python setup.py build and python setup.py install

I did it without having to worry about SVN, by downloading the source for the two packages directly from code.google.com and then installing with "python setup.py install" from the directory with the downloaded files. -Ryan


note: Instead of installing python and SVN on your own personal Linux/Windows/Mac machine, you can use Stanford's cluster. They already have Python and SVN installed. (However, there might be differences between different clusters, so in case you find it does not work, please use your own computer.)
First, ssh into one of Stanford's many compute servers and type:
svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious-0.5.0
cd pydelicious-0.5.0
python setup.py build
and now you are ready to run Python:
>>> import pydelicious
Because we cannot copy the pydelicious code into python directory due to access restriction, you must be inside the pydelicious directory created by svn (use pwd command in Linux to check path). --jack

An alternative:
You can install the packages to a local directory and then set PYTHONPATH to follow that directory structure:

svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious
cd pydelicious
python setup.py build
python setup.py install --prefix=${HOME}
setenv PYTHONPATH ${PYTHONPATH}:${HOME}/lib/python2.5/site-packages
"python2.5" might have to be "python2.6" depending on how it installed. Check your ~/lib/ directory to see which one it is.


Here are some code snippets to illustrate how to use pydelicious to get information from del.icio.us:
1) Get the list of posts for a given URL (each post corresponds to a user's bookmark entry for this URL). A post is a data object containing the URL, user, timestamp, etc. The function pydelicious.get_urlposts returns a list of dictionaries, each corresponding to a post.

import pydelicious
up = pydelicious.get_urlposts('http://www.weigend.com/')
To see what a post looks like, print up[1] (note that up[0] is the first post):
{'count': '',
'extended': u'stat252 teacher',
'hash': '',
'description': u'[from smile1599] Andreas S. Weigend, Ph.D.',
'tags': u'amazon weigend',
'href': u'http://del.icio.us/url/e78668880a3ac1f62edbd3449e095df3#smile1599',
'user': u'smile1599',
'dt': u'2009-05-09T20:48:48Z'}

You can get all the information you need from this list of posts. Please note that you have to write the full URL with a leading "http://" and a trailing slash: neither www.weigend.com nor http://www.weigend.com will work; only http://www.weigend.com/ works.
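Since only the exact canonical form works, it can save some head-scratching to normalize URLs before calling get_urlposts. The helper below is my own sketch, not part of pydelicious, and only handles the bare-hostname case described above.

```python
def canonical_url(url):
    """Coerce 'www.weigend.com' into the form the del.icio.us lookup
    expects: 'http://www.weigend.com/'."""
    if not url.startswith('http://') and not url.startswith('https://'):
        url = 'http://' + url
    # A bare host like 'http://www.weigend.com' has exactly two slashes;
    # only then append the trailing slash, leaving real paths alone.
    if url.count('/') == 2:
        url += '/'
    return url

# canonical_url('www.weigend.com') -> 'http://www.weigend.com/'
```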
2) get the list of recent posts for a given user

import pydelicious
up = pydelicious.get_userposts("smile1599")

Note that you can only get the most recent posts, up to a limited number.
3) get the list of recent posts for a given tag

import pydelicious
up = pydelicious.get_tagposts("weigend")

For more information, please refer to pydelicious.html in your pydelicious doc directory.
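The calls above are enough to assemble the skeleton of the whole recommender. The sketch below assumes each user post carries the bookmarked address in an 'href' field; the exact keys can differ between feed functions, so check your own output. `userposts_fn` is injected so the aggregation logic can be tested offline; in real use you would pass pydelicious.get_userposts (and mind the throttling discussed further down the page).

```python
def recommend(target_posts, userposts_fn, target_url, top_n=25):
    """Skeleton recommender: collect the users who bookmarked the target
    URL, pull each user's posts, and rank the other URLs by how many of
    those users bookmarked them."""
    users = set(post['user'] for post in target_posts)
    counts = {}
    for user in users:
        for post in userposts_fn(user):
            url = post['href']
            if url != target_url:          # don't recommend the target itself
                counts[url] = counts.get(url, 0) + 1
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return [url for url, score in ranked[:top_n]]
```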

Playing Nice with Delicious

So comrades, you may have seen the ugly side of delicious (i.e. the 503 error). If you don't know what I am talking about, don't worry about it. Below you will find some code that may help you deal with the delicious throttling mechanism. There are three decorators that I use to query delicious (or any web service):

  • memoized is a simple in-memory cache: a convenient way of caching functions that query web services, so you avoid hitting them multiple times for the same data.
  • rate_limited lets you set a minimum number of seconds between successive calls to the same resource. This is better than always sleeping for the maximum delay, it makes the rate limiting easy to change in one place, and it lets you share a rate-limited resource between different functions. (If you really want to have fun, you can easily modify it to use a pool of API keys / IP addresses and automatically balance the requests.)
  • cached, the one I cobbled together for this particular problem, is a database-backed cache that saves function results across executions or machines (if the cache file is on a shared file system). This makes it easier to debug your program without getting 503'ed, because you only pull the data once no matter how many times you run your program.

Hope this helps. - JДMES

# The first attempt:
#   Try to play nice with Delicious by using a memoizing decorator and a
#   rate limiting decorator to follow their rules.
# The hack:
#   While the Delicious API says that you should wait at least 1 second, it
#   will still throttle/ban you if you do a posts/all type call that often.
#   To handle the large amount of data that need to be gathered and possibly
#   from different hosts, we implement a SQLite based cache.
from datetime import datetime, timedelta
import time
import pydelicious as deli
import sqlite3 as sqlite
import pickle
class CacheMissError(Exception):
    pass
class DBCache(object):
    def __init__(self, dbfile, debug = False):
        self.__state = {}
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        try:
            self.__cursor.execute('CREATE TABLE `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob);')
        except sqlite.OperationalError:
            # the cache table already exists from a previous run
            pass
    def save(self, func, args, result):
        self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)', (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()  # persist so the cache survives across runs
        if self.__debug:
            print 'Saving %s(%s) -> %s!' % (func, args, result)
    def load(self, func, args):
        self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;', (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print 'Loaded %s(%s) -> %s!' % (func, args, value)
            return value
        raise CacheMissError
class memoized(object):
    """Decorator that caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned, and
    not re-evaluated."""
    def __init__(self, func):
        self.func = func
        self.cache = {}
    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            self.cache[args] = value = self.func(*args)
            return value
        except TypeError:
            # uncachable -- for instance, passing a list as an argument.
            # Better to not cache than to blow up entirely.
            return self.func(*args)
class cached(object):
    """Use DBCache to cache a function's results across executions."""
    def __init__(self, cache):
        self.cache = cache
    def __call__(self, f):
        def wrapper(*args):
            try:
                return self.cache.load(self.func.__name__, args)
            except CacheMissError:
                value = self.func(*args)
                self.cache.save(self.func.__name__, args, value)
                return value
            except TypeError:
                # uncachable -- for instance, passing a list as an argument.
                # Better to not cache than to blow up entirely.
                return self.func(*args)
        self.func = f
        wrapper.__name__ = f.__name__
        wrapper.__doc__ = f.__doc__
        return wrapper
class rate_limited(object):
    """Decorator that limits the rate for a given resource.  Wait at least
    min_delay in seconds before consecutive calls to resource"""
    def __init__(self, resource, min_delay, debug = False):
        self.resource = resource
        self.min_delay = timedelta(seconds=min_delay)
        self.last_run = None
        self.debug = debug
    def __call__(self, f):
        def wrapper(*args, **kw):
            if self.last_run:
                now = datetime.now()
                delta = self.min_delay - (now - self.last_run)
                if self.debug:
                    print 'Now %s : Last %s' % (now, self.last_run)
                if delta > timedelta():
                    if self.debug:
                        print 'Sleeping for %s' % (delta.seconds + float(delta.microseconds) / 1000000)
                    time.sleep(delta.seconds + float(delta.microseconds) / 1000000)
            self.last_run = datetime.now()
            return self.func(*args, **kw)
        self.func = f
        wrapper.__name__ = f.__name__
        wrapper.__doc__ = f.__doc__
        return wrapper
rate = 5
debug = False
cache = DBCache('tasty.db', debug)
@rate_limited('delicious', rate, debug)
def get_userposts(user):
    return deli.get_userposts(user)
@rate_limited('delicious', rate, debug)
def get_urlposts(url):
    return deli.get_urlposts(url)
def add_list_to_set(aset, alist):
    aset.update(alist)
def main():
    target_user = u'aweigend'
    user_set = set()
    url_set = set()
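To convince yourself the caching behaves as advertised, here is a self-contained check of the memoized idea (the class is condensed from the version above so the snippet runs on its own): the wrapped function's body executes once per distinct argument, and repeated calls hit the in-memory cache.

```python
class memoized(object):
    """Condensed version of the memoized decorator above."""
    def __init__(self, func):
        self.func, self.cache = func, {}
    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]

calls = []

@memoized
def square(x):
    calls.append(x)              # record every real evaluation
    return x * x

results = [square(4), square(4), square(4)]
# results == [16, 16, 16], but the body ran only once: len(calls) == 1
```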

An alternative to pydelicious and feedparser

As you are aware, pydelicious.get_urlposts() only returns 15 posts. The code below will let you retrieve up to 100 posts (the hard maximum allowed by delicious's API). You don't need pydelicious or feedparser, but you will need simplejson.

to get delicious feeds, run this script:

[tirto@dreamfresh hw4]$ less example.py

import simplejson
import hashlib
import urllib2
import time
DLCS_FEED = "http://feeds.delicious.com/v2/json/"
# http://feeds.delicious.com/v2/json/url/84e4c8e3cb58ebd74b049b170f2400a5?count=100
def get_urlposts(url,count=15):
  md5url = hashlib.md5(url).hexdigest()
  url = DLCS_FEED + 'url/' + md5url + '?count=%d' %count
  return simplejson.load(urllib2.urlopen(url))
def get_userposts(user,count=15):
  url = DLCS_FEED + user +'?count=%d' %count
  return simplejson.load(urllib2.urlopen(url))
def get_tagposts(tag,count=15):
  url = DLCS_FEED + 'tag/' + tag +'?count=%d' %count
  return simplejson.load(urllib2.urlopen(url))
#count = 100
count = 1
urlposts = get_urlposts('http://weigend.com/',count)
for urlpost in urlposts:
  user = urlpost['a']
  for i in range(3):
      try:
          userposts = get_userposts(user,count)
          print userposts
          break
      except Exception:
          print "Failed user " + user + ", retrying"

[tirto@dreamfresh hw4]$ python example.py
[{u'a': u'faeriestorm', u'd': u'Andreas S. Weigend, Ph.D.', u'n': u'', u'u': u'http://www.weigend.com/', u't': [u'web2.0', u'technology', u'statistics', u'stanford', u'datamining', u'ebusiness', u'weigend', u'amazon'], u'dt': u'2009-05-19T02:45:50Z'}]
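The feed entries use one-letter keys; judging from the sample output above, 'a' is the user, 'u' the bookmarked URL, 't' the list of tags, 'dt' the timestamp, 'd' the description, and 'n' the extended note. A tiny helper (names are my own) makes downstream code more readable:

```python
def parse_post(post):
    """Map the feed's one-letter keys onto readable names
    (keys inferred from the sample output above)."""
    return {'user': post['a'], 'url': post['u'],
            'tags': post['t'], 'time': post['dt'],
            'title': post['d'], 'note': post['n']}

sample = {u'a': u'faeriestorm', u'd': u'Andreas S. Weigend, Ph.D.',
          u'n': u'', u'u': u'http://www.weigend.com/',
          u't': [u'statistics', u'weigend'], u'dt': u'2009-05-19T02:45:50Z'}
parsed = parse_post(sample)
# parsed['user'] == u'faeriestorm'; parsed['tags'] lists the tags
```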



(If you decide to take this approach, http://delicious.com/help/feeds has some helpful information. -Carlin)