HW4_Delicious

Homework 4: Recommender System for del.icio.us

Note: This homework should be done individually, but feel free to discuss questions about Python programming.
In this assignment you will create a simple recommender system for del.icio.us. Here is the case: If you have just bookmarked [|www.weigend.com], the program will recommend URLs for you based on your own tagging history and that of others who, at some stage in the past, have also tagged [|www.weigend.com].

To design a recommendation system, let's start simple. Find the set of all users who have bookmarked [|www.weigend.com] or [|weigend.com] (in a later version of del.icio.us these will be mapped together, in the current version you need to do this manually). Looking at all of the URLs tagged by this set of users, how many URLs are tagged by only a single user, how many are tagged by 2 users, 3 users etc.? When there are many such multiply bookmarked URLs, you might consider to rank the results based on a "weighted" version that gives different users different weights before aggregation. Two examples for such weights are: Another time dimension to be considered in weighting is: How do you want to treat time
 * heavier weights for users who have more similar tagging history to yours.
 * lower weights for users who have not been using del.icio.us for some time;
 * relative to the user having tagged weigend.com (i.e., to you want to weight URLs that came afterwards higher than those that came before?), and
 * relative to the current time (should a URL tagged yesterday be weighted more than a URL tagged a year ago?)

By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular URLs) is biasing towards the common. When is this property desired? What are its potentially negative side effects? To create an algorithm for more bold recommendations, biased towards uniqueness, how would you find the most "interesting" users having tagged [|weigend.com]? And how would you derive recommendations from their most "interesting" bookmarks? Compare these two different approaches, explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, when would you use the second algorithm, biased towards individuality? So far, you have only considered the fact that others have tagged the // URL, // and ignored the // specific tags // as well as the // comments //. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), how the uniqueness of tags? And how about the comments they leave? Before you start working on this, first register an account on del.icio.us, and tag some URLs of interest to you (this will be your history). **Please put your Python code and 25 ranked recommendations for the target URL on your webpage.** **Carefully examine and //discuss// the change in results as you change your parameters. Post the description of the choices you made, the effects you investigated, and a set of 25 recommendations on your webpage,** and submit the link to this webpage to the usual email address by the deadline.

Class from last year: http://stanford2008.wikispaces.com/6.+May+12%2C+2008 Speaker: **[|**Toby Segaran**]** - Currently works at **[|**Metaweb Technologies**]** (Those on Stanford campus can read his book online [|here])
 * Resources**

(I found the following two pages very helpful. - Brian Dumbacher) Python v2.6.2 Documentation: [] Section on Data Structures: []

--
 * I could not access the Programming Collective Intelligence book in its entirety (as an off-campus, SCPD student) until I signed up for the free 10-day trial.** **E-mail me if you need a copy of Ch2. - Sylvie B. (sylviebryant@gmail.com)**

This exercise uses Python and the pydelicious (del.icio.us Python API) package. To check out this package, you need to install SVN and the feedparser package. Here are some details. (1) Install Python 2.5 (Note: You can use other Python Version, i.e Python 2.6. Don't use Python 3.0 that is not adapt to the packages you will use below ). Python and its tutorials are available on [|www.python.org]. On Windows, after you have run the Python installer, you may have to manually add the installation directory to your system path environment variable (Right click on My Computer -> Properties -> Advanced -> Environment Variables, and edit the path variable). (2) Install SVN. SVN is a source code version controlling system. Please download it (Windows binary: []; Mac OS: [] ; Linux: you should know) and install it. (3) After installing SVN, you need to check out a package named “feedparser”. Please download feedparser-4.1.zip from [], unpack it, go to its directory, and type the following code to install it. //python setup.py build// //python setup.py install//

(Just in case people are still at this point and are still stuck: You're typing these into the command line (cmd.exe(?) in Windows, Terminal in Mac). Type "cd [name of directory/folder]" to go to a directory. You have to be in the folder for feedburner or pydelicious for the given code to work. Then typing "python" starts a Python interpreter (in Macs, at least), and then you're on your own for the coding. Hope this helps some people! Georgia A)

(4) Finally, you check out “pydelicious”. Please use SVN to check out “pydelicious”: go to the bin subdirectory of SVN, type the following code: svn checkout [] pydelicious-0.5.0

(Note: There is a blank between " trunk/" and "pydelicious-0.5.0" in the above command. You also can download pydelicious-0.5.0.zip from on [], and unpack it). After checking out the package, you need go to its directory "pydelicious-0.5.0 ", and type the following code to install it. //python setup.py build python setup.py install// Only after the above four steps, you should be able to import the pydelicious package in Python. If you encounter errors saying that you need some more packages preinstalled, please install them by the similar way as (3).

code package_dir = { 'pydelicious': 'src/' }, code with code package_dir = { 'pydelicious': 'src' }, code -Chris
 * Windows note:** if you get a "ValueError" message, replace the line in setup.py:

For example, one student (Bo) was getting errors with the elementTree library used in pydelicious. Here's what to do:

code # go to your homework directory > curl "http://effbot.org/media/downloads/elementtree-1.2.6-20050316.zip" > elementtree-1.2.6-20050316.zip > unzip elementtree-1.2.6-20050316.zip > cd elementtree-1.2.6-20050316/ > python setup.py build; > python setup.py install; code Now go to your pydelicious directory and build/install again using // python setup.py build // and  // python setup.py install // //I did it without having to worry about SVN, by downloading the source for the two packages directly from code.google.com and then installing with ">>python setup.py install" from the directory with the downloaded files. -Ryan//

--

First: ssh into one of Stanford's many computer server and type this: svn checkout [] pydelicious-0.5.0 // cd // pydelicious-0.5.0 // python setup.py build // and now you are ready to run python // >python // // >>>import pydelicious // // Because we cannot copy the pydelicious code into python directory due to access restriction, you must be inside the pydelicious directory created by svn (use pwd command in Linux to check path). --jack //
 * note: ** Instead of installing python and SVN on your own personal Linux/Windows/Mac machine, you can use Stanford's cluster. They already have Python and SVN installed. (However, there might be differences between different clusters, so in case you find it does not work, please use your own computer.)

You can install the packages to a local directory and then set PYTHONPATH to follow that directory structure:
 * An alternative:**

code format="bash" svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious cd pydelicious python setup.py build python setup.py install --prefix=${HOME} setenv PYTHONPATH ${PYTHONPATH}:${HOME}/lib/python2.5/site-packages code "python2.5" might have to be "python2.6" depending on how it installed. Check your ~/lib/ directory to see which one it is. -Chris

--

Here are some code snippets to illustrate how to use pydelicious to get information from del.icio.us, 1) ** get the list of posts for a given URL ** (each post corresponds to a user's bookmark entry of this URL). The post is a data object containing URL, user, timestamp etc. function pydelicious.get_urlposts will return a list of dictionaries each correponding to a post. import pydelicious up = pydelicious.get_urlposts(' [|http://www.weigend.com/')]   To see what a post looks like, print up[1] (note up[0] is the first post) {'count': , 'extended': u'stat252 teacher', 'hash': , 'description': u'[from smile1599] Andreas S. Weigend, Ph.D.', 'tags': u'amazon weigend', 'href': u'[]', 'user': u'smile1599', 'dt': u'2009-05-09T20:48:48Z'} // You can get all the information you need from this list of posts. Please note you have to write the full URL with a leading "http: // " and a trailing slash, for example, [|www.weigend.com], [|http://www.weigend.com] will not work. Only [] works. // 2) // ** get the list of recent posts for a given user ** // import pydelicious up = pydelicious.get_userposts("smile1599") // You can only get recent posts up to a limited number. 3) ** get the list of recent posts for a given tag ** import pydelicious up = pydelicious.get_tagposts("weigend")
 * For more information, please refer to pydelicious.html in your pydelicious doc directory. **

=Playing Nice with Delicious= So comrades, you may have seen the ugly side of delicious (i.e. 503). If you don't know what I am talking about, then don't worry about it. Below, you will find some code that may help you to deal with the delicious throttling mechanism. There are 3 decorators that I use to query delicious (or any webservice). One of them is a simple in memory cache (memoized). It's a convenient way of caching functions that query webservices to avoid hitting them multiple times for the same data. Another, (rate_limited) lets you set a minimum number of seconds between successive calls to the same resource. This is better than always sleeping for the maximum amount of delay needed. Also, this makes it easy to change the rate limiting in one place and it lets you share a rate limited resource between different functions. (If you really want to have fun, you can easily modify it to use a pool of API Keys / IP addresses and automatically balance the requests.) The third and one that I cobbled together for this particular problem is a database based cache (cached) which saves function results across executions or machines (if the cache file is on a shared system). This makes it easier to debug your program without getting 503'ed because you only pull the data once no matter how many times you run your program.

Hope this helps. - JДMES

code format="python"
 * 1) The first attempt:
 * 2)   Try to play nice with Delicious by using a memoizing decorator and a
 * 3)   rate limiting decorator to follow their rules.
 * 4) The hack:
 * 5)   While the Delicious API says that you should wait at least 1 second, it
 * 6)   will still throttle/ban you if you do a posts/all type call that often.
 * 7)   To handle the large amount of data that need to be gathered and possibly
 * 8)   from different hosts, we implement a SQLite based cache.
 * 1)   from different hosts, we implement a SQLite based cache.

from datetime import datetime, timedelta import time import pydelicious as deli import sqlite3 as sqlite import pickle

class CacheMissError(Exception): pass

class DBCache(object): def __init__(self, dbfile, debug = False): self.__state = {} self.__connection = sqlite.connect(dbfile) self.__connection.text_factory = str self.__cursor = self.__connection.cursor self.__debug = debug try: self.__cursor.execute('CREATE TABLE `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob);'); except: pass

def save(self, func, args, result): self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)', (func, pickle.dumps(args), pickle.dumps(result))) self.__connection.commit if self.__debug: print 'Saving %s(%s) -> %s!' % (func, args, result)

def load(self, func, args): self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;', (func, pickle.dumps(args))) result = self.__cursor.fetchone if result: value = pickle.loads(result[0]) if self.__debug: print 'Loaded %s(%s) -> %s!' % (func, args, value) return value else: raise CacheMissError

class memoized(object): """Decorator that caches a function's return value each time it is called.   If called later with the same arguments, the cached value is returned, and    not re-evaluated.    """ def __init__(self, func): self.func = func self.cache = {} def __call__(self, *args): try: return self.cache[args] except KeyError: self.cache[args] = value = self.func(*args) return value except TypeError: # uncachable -- for instance, passing a list as an argument. # Better to not cache than to blow up entirely. return self.func(*args)

class cached(object): """Use DBCache to cache   """ def __init__(self, cache): self.cache = cache

def __call__(self, f): def wrapper(*args): try: return self.cache.load(self.func.__name__, args) except CacheMissError: value = self.func(*args) self.cache.save(self.func.__name__, args, value) return value except TypeError: # uncachable -- for instance, passing a list as an argument. # Better to not cache than to blow up entirely. return self.func(*args)

self.func = f       wrapper.__name__ = f.__name__ wrapper.__dict__.update(f.__dict__) wrapper.__doc__ = f.__doc__ return wrapper

class rate_limited(object): """Decorator that limits the rate for a given resource. Wait at least    min_delay in seconds before consecutive calls to resource""" def __init__(self, resource, min_delay, debug = False): self.resource = resource self.min_delay = timedelta(seconds=min_delay) self.last_run = None self.debug = debug

def __call__(self, f): def wrapper(*args, **kw): if self.last_run: now = datetime.now delta = self.min_delay - (now - self.last_run) if self.debug: print 'Now %s : Last %s' % (now, self.last_run) if delta > timedelta: if self.debug: print 'Sleeping for %s' % (delta.seconds + float(delta.microseconds) / 1000000) time.sleep(delta.seconds + float(delta.microseconds) / 1000000) self.last_run = datetime.now return self.func(*args, **kw)

self.func = f       wrapper.__name__ = f.__name__ wrapper.__dict__.update(f.__dict__) wrapper.__doc__ = f.__doc__ return wrapper

rate = 5 debug = False cache = DBCache('tasty.db', debug)

@cached(cache) @rate_limited('delicious', rate, debug) def get_userposts(user): return deli.get_userposts(user)

@cached(cache) @rate_limited('delicious', rate, debug) def get_urlposts(url): return deli.get_urlposts(url)

def add_list_to_set(aset, alist): [aset.add(x) for x in alist]

def main: target_user = u'aweigend' user_set = set url_set = set

get_userposts(target_user)

code

=alternative to pydelicious and feedparser=

as you are aware, pydelicious.get_urlposts only returns 15 users. below code will allow you to return up to 100 posts (hard max allowed by delicious's API). you don't need pydelicious and feedparser, but you'll need [|simplejson]

to get delicious feeds, run this script:

[tirto@dreamfresh hw4]$ less example.py

code format="python" import simplejson import hashlib import urllib2 import time

DLCS_FEED = "http://feeds.delicious.com/v2/json/"

def get_urlposts(url,count=15): md5url = hashlib.md5(url).hexdigest url = DLCS_FEED + 'url/' + md5url + '?count=%d' %count return simplejson.load(urllib2.urlopen(url))
 * 1) http://feeds.delicious.com/v2/json/url/84e4c8e3cb58ebd74b049b170f2400a5?count=100

def get_userposts(user,count=15): url = DLCS_FEED + user +'?count=%d' %count return simplejson.load(urllib2.urlopen(url))

def get_tagposts(tag,count=15): url = DLCS_FEED + 'tag/' + tag +'?count=%d' %count return simplejson.load(urllib2.urlopen(url))

count = 1 urlposts = get_urlposts('http://weigend.com/',count) for urlpost in urlposts: user = urlpost['a'] for i in range(3): try: userposts = get_userposts(user,count) print userposts time.sleep(2) break; except: print "Failed user " + user + ", retrying" time.sleep(4)
 * 1) count = 100

code

[tirto@dreamfresh hw4]$ python example.py [{u'a': u'faeriestorm', u'd': u'Andreas S. Weigend, Ph.D.', u'n': u'', u'u': u'http://www.weigend.com/', u't': [u'web2.0', u'technology', u'statistics', u'stanford', u'datamining', u'ebusiness', u'weigend', u'amazon'], u'dt': u'2009-05-19T02:45:50Z'}]

a|user d|description n|comments t|tag u|url dt|post_date
 * key|field**

hth, tirto

(If you decide to take this approach, http://delicious.com/help/feeds has some helpful information. -Carlin)