Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Spring 2009
Stanford University

Homework 6: Twitter

Assigned: Mon May 18, 2009 Due: Thu Jun 4, 2009, 5pm (this includes the extension)
Email to:
Note: This homework can be done by one person or in small groups of 2-3, and feel free to discuss questions about Python programming broadly.

Stats 252 extra "office hour" aka "Twitter Hack-A-Thon": Tuesday, May 26, 2009, Bldg 524, 7-9pm
Food and soft drinks provided, BYOB.

THANKS: Several great people helped significantly with defining this homework:
  • Yu-shan Fung, who graduated from Stanford just before this course was created, worked at A9, and is now the CTO of Discoverio, the company that brings you MrTweet now, and more in the future.
  • Nick Kallen, who helped us get things done smoothly at Twitter, and shared his deep insights with us.
  • Doug Williams, who helped people get whitelisted.
  • Hamilton Ulmer, who took the course last year and now also works at Discoverio.
  • And of course Mingyeow Ng, the CEO of Discoverio, one of the most amazing people in the world and a great discussion partner, always willing to help.

And based on Enrique's idea of offering a coding session to help students with less hands-on experience, Ryan Mason (who also took the course last year and now works at 23andMe) organized a very effective session with help from some students, Chris Anderson, Emile Chamoun, Carlin Eng, Mike Polcari (SCPD/23andMe), and Jeff Mellon (23andMe).
Thank you all!

This homework is likely to take longer than the Delicious homework. Because many of you have voiced concern, we are going to hold special office hours with food. Since our teaching team is understaffed, we are also offering extra credit to experienced Python developers who are willing to help their classmates. We are planning to host the hack-a-thon on Tuesday, May 26th from 7-9pm in Bldg 524. If you have friends outside our class willing to help, we can offer $20/hour + food.

If you are a developer interested in helping please email

This assignment has two main goals. The first is to introduce you to another API, where you will find a wealth of information in tweets and social graphs. The second is to get you to discover interesting people, in the hope of sparking long-term engagement.

The Assignment

For this assignment, please create 10 relevant friend recommendations for each of 10 friends (100 recommendations total). Before you send the recommendations to each friend, rank them yourself (1 worst - 10 best) based on how you think your friend will rank them (i.e., based on relevance). Keep this ranking to yourself. Ask each friend to rank them (1 worst - 10 best) and give a short response about why they are good and/or bad. Please submit the recommendations, ratings, and comments using this Google Form. Submit your script/program along with a short write-up that clearly describes your conceptual approach and briefly discusses the constraints you faced and the rationale for your approach.

What does a user really want? To discover information in certain domains? Business networking? Perhaps even dating?

Data sources and recommendation approaches:
  • Analyzing a user's social graph (who follows whom, who replies to whom, who retweets whom, etc.)
    • Collaborative filtering (You follow A, B and C. Other people who follow A, B and C also follow D, thus we recommend D to you)
    • Similarity (You are followed by A, B and C, all of whom also follow D)
    • Transitive attention (You follow A, B, and C, all of whom follow D)
  • Semantics (tweets)
    • Hashtags (who else uses a similar set of hashtags as I do? Are all hashtags created equal?)
    • Keywords (identify interesting keywords? bag of words?)
    • Location (how can geolocation be used, when is it relevant, e.g., for finding people with similar interest near me?)
These are just some hypotheses; we would love to see you use your creativity and think outside the box.
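As a sketch of the collaborative-filtering idea above, here is one way to score candidates on a toy follow graph. The graph, the usernames, and the scoring rule (weight each candidate by how much the recommender's followings overlap with yours) are all illustrative assumptions, not a prescribed solution:

```python
from collections import Counter

# Toy follow graph: user -> set of accounts they follow (all names invented).
follows = {
    'you':   {'a', 'b', 'c'},
    'user1': {'a', 'b', 'c', 'd'},   # very similar to you, also follows 'd'
    'user2': {'a', 'c', 'd'},
    'user3': {'b', 'e'},
}

def recommend(user, follows, k=10):
    """You follow A, B, C; others who follow A, B, C also follow D -> suggest D."""
    mine = follows[user]
    scores = Counter()
    for other, theirs in follows.items():
        if other == user:
            continue
        overlap = len(mine & theirs)     # shared followings = similarity weight
        for candidate in theirs - mine:  # accounts they follow that you don't
            scores[candidate] += overlap
    return [name for name, _ in scores.most_common(k)]

print(recommend('you', follows))  # -> ['d', 'e']
```

The same loop works unchanged on real data once `follows` is filled from GetFriends/GetFollowers calls; to produce your own 1-10 ranking you would keep the scores themselves, not just their order.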

Pick 10 of your friends who have been using Twitter. Randomly assign them to two groups; each group will see one version of your algorithm.

Ask each friend to rank your recommendations from 1-10 (1 lowest, 10 highest) and give qualitative feedback in free text about why each recommendation is good and/or bad.

Once you have collected this information fill out the Google Form once for each of the friends for whom you make recommendations.

  • Output for each friend:
    • username of friend
  • Output for each recommendation:
    • Please write all of this on one line, separated by commas: the username you're recommending, the rank you think your friend will give (not shared with the friend), the actual rank given by your friend, free text about what made the recommendation good, and free text about how the recommendation could have been better.
    • It should look like this: suggested_twitter_username, your_rank, friends_rank, "good free text", "bad free text"
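Since the free-text fields may themselves contain commas, it is safer to let the csv module do the quoting than to join strings by hand. A minimal sketch (every field value below is an invented placeholder):

```python
import csv
import io

# One tuple per recommendation: username, your rank, friend's rank,
# "good" free text, "bad" free text -- all values here are placeholders.
rows = [
    ('suggested_user', 7, 9,
     'shares three hashtags with me', 'tweets too rarely'),
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)  # quote the free text
for row in rows:
    writer.writerow(row)
print(buf.getvalue().strip())
# -> "suggested_user",7,9,"shares three hashtags with me","tweets too rarely"
```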

EVALUATION [quality is more important than quantity]
Create a short write-up comparing your two hypotheses/algorithms, the constraints you faced, and what you learned.

Getting Started

Twitter has a limit of 100 requests per hour. We have arranged to get you on the whitelist with an expedited turnaround of about 24 hours, plus up to 48 hours for the change to take effect. For now, please put down one Twitter username from your group (only required for authenticated connections) and your IP address.
Request Whitelisting here. Be sure to mention "Stanford Stats 252 Datamining Homework" in the request form.
  • You can find your IP by typing ifconfig (Mac/Linux) or ipconfig (Windows). Amongst the numbers you will find something like 128.*.*.* or 172.*.*.* if you're on campus at Stanford, or another number if you're elsewhere. You can also try What's my ip?
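Until your whitelisting takes effect you are stuck with the 100-requests-per-hour limit, so it is worth spacing calls out rather than hitting the limit and stalling. A small throttle sketch (the class and its parameters are our invention, not part of any Twitter library):

```python
import time

class Throttle(object):
    """Space out calls so a non-whitelisted client stays under the limit:
    100 requests per hour means at most one request every 36 seconds."""
    def __init__(self, per_hour=100, clock=time.time, sleep=time.sleep):
        self.interval = 3600.0 / per_hour  # seconds between requests
        self._clock = clock                # injectable, e.g. for testing
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block just long enough to respect the rate limit."""
        if self._last is not None:
            elapsed = self._clock() - self._last
            if elapsed < self.interval:
                self._sleep(self.interval - elapsed)
        self._last = self._clock()

throttle = Throttle(per_hour=100)
print(throttle.interval)  # -> 36.0
```

Call `throttle.wait()` immediately before each API request; with the default limit it blocks until at least 36 seconds have passed since the previous call.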

Your response from Twitter should look like this:

Hi <twitter_username>,

Thanks for requesting to be on Twitter's API whitelist. We've approved your request!
You should find any rate limits no longer apply to authenticated requests made by @twitter_username.
We've also whitelisted the following IP addresses:

  • 67.205.50.*
  • 128.12.155.*
This change should take effect within the next 48 hours.

  • You are welcome to use the programming language of your choice; however, you may find the files we've put together helpful
  • Twitter API:
  • Language-specific libraries:
  • How do you manage the necessary data (in-memory data structures? text files? database tables?)
  • Note that your system does not have to be interactive. Offline computation is good enough.
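Since offline computation is good enough, one simple data-management pattern is to fetch each user's data once and pickle it to disk, then iterate on your algorithm without touching the API again. Everything below (the stand-in `fetch_followers`, the file layout) is an illustrative assumption:

```python
import os
import pickle
import tempfile

def fetch_followers(username):
    """Stand-in for a real API call; returns a fixed set for illustration."""
    return {'alice', 'bob'}

def load_or_fetch(username, cache_dir):
    path = os.path.join(cache_dir, '%s.pickle' % username)
    if os.path.exists(path):
        with open(path, 'rb') as f:   # cached: no API request needed
            return pickle.load(f)
    data = fetch_followers(username)
    with open(path, 'wb') as f:       # persist for later offline runs
        pickle.dump(data, f)
    return data

cache_dir = tempfile.mkdtemp()
first = load_or_fetch('aweigend', cache_dir)   # fetches and saves
second = load_or_fetch('aweigend', cache_dir)  # reads from disk
print(first == second)  # -> True
```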

Working with Python

We have put together a few functions in Python that should make getting data a little easier. The fastest way to get started is to download the file and simplejson-2.0.9.tar. The two sample code files should help you get started.
Note that there are two ways to access the Twitter API detailed here.

Download at least these 4 files from
ReadMe.txt - instructions for downloading and installing the python-twitter package as well as simplejson
(sample code file) - functions for downloading following/followers and a function to search updates
simplejson-2.0.9.tar - needed for working with the JSON objects returned by the Twitter API

You may also want to use the python-twitter library, which has more features and objects; the sample code includes calls to the functions in the python-twitter package for examples.
python-twitter-0.5.tar - functions and objects for working with the API; the file is cached here as a convenience for you

If you download the .tar files from the link above, you won't need to use curl. If you have any trouble installing the packages, try moving the file and the simple-json directory to your working directory. The library can be accessed without installing it (this should work on shared machines where you don't have write permission on public directories).

From here you're on your own; good luck and be creative. If you find any helpful links, feel free to post them here. (There are some good hints below.)

Here's some code to help you cache requests, as in the last assignment:
from functools import wraps
import pickle
import sqlite3 as sqlite

import twitter  # python-twitter package

class CacheMissError(Exception):
    pass

class CachedAPIWrapper(object):
    """Delegate everything to the api, but cache the results in a DB."""

    @staticmethod
    def get_api():
        # Fill in your own credentials for USER and PASSWORD.
        tapi = twitter.Api(username=USER, password=PASSWORD)
        cache = DBCache('tweeting.db', False)
        return CachedAPIWrapper(tapi, cache)

    def __init__(self, api, cache):
        self._api = api
        self._cache = cache

    def __getattr__(self, name):
        fn = getattr(self._api, name)
        if not callable(fn):
            return fn
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return self._cache.load(name, args)
            except CacheMissError:
                value = fn(*args)
      , args, value)
                return value
            except TypeError:
                print('Warning: Uncachable!')
                # uncachable -- for instance, passing a list as an argument.
                # Better to not cache than to blow up entirely.
                return fn(*args)
        return wrapper

class DBCache(object):
    def __init__(self, dbfile, debug=False):
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        try:
            self.__cursor.execute('CREATE TABLE `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob);')
        except sqlite.OperationalError:
            pass  # table already exists from an earlier run

    def save(self, func, args, result):
        self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)',
                              (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print('Saving %s(%s) -> %s!' % (func, args, result))

    def load(self, func, args):
        self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;',
                              (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print('Loaded %s(%s) -> %s!' % (func, args, value))
            return value
        raise CacheMissError()

    def clear(self):
        self.__cursor.execute('DELETE FROM `cache`;')
        self.__connection.commit()
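To convince yourself that pickling arguments and results into sqlite round-trips cleanly, here is the same save/load idea in miniature, against a throwaway in-memory database (the function name and arguments are hypothetical):

```python
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway DB just for the round trip
cur = conn.cursor()
cur.execute('CREATE TABLE cache (id INTEGER PRIMARY KEY, func TEXT, args BLOB, result BLOB)')

def save(func, args, result):
    # Pickle both the argument tuple and the result so any picklable value fits.
    cur.execute('INSERT INTO cache (func, args, result) VALUES (?, ?, ?)',
                (func, pickle.dumps(args), pickle.dumps(result)))

def load(func, args):
    cur.execute('SELECT result FROM cache WHERE func = ? AND args = ?',
                (func, pickle.dumps(args)))
    row = cur.fetchone()
    return pickle.loads(row[0]) if row else None

save('GetFollowers', ('aweigend',), ['user1', 'user2'])
print(load('GetFollowers', ('aweigend',)))  # -> ['user1', 'user2']
```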

Download this patch if you are using simplejson 2.0.x and you get this error message when running the tests:

$ cd python-twitter-0.5
$ python test
FAIL: Test the twitter.Status AsJsonString method
Traceback (most recent call last):
  File "", line 121, in testAsJsonString
AssertionError: '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn
\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-
eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http:\\/
\\/\\/system\\/user\\/profile_image\\/718443\\/normal\\/kesuke.png", "screen_name": "kesuke",
"url": "http:\\/\\/\\/kesuke"}}' != '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id":
4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user":
{"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name":
"Kesuke Miyagi", "profile_image_url": "",
"screen_name": "kesuke", "url": ""}}'
FAIL: Test the twitter.User AsJsonString method
Traceback (most recent call last):
  File "", line 224, in testAsJsonString
AssertionError: '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA",
"name": "DeWitt", "profile_image_url": "http:\\/\\/\\/system\\/user\\/profile_image\\/673483
\\/normal\\/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id":
 4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url":
 "http:\\/\\/\\/"}' != '{"description": "Indeterminate things", "id": 673483, "location": "San
Francisco, CA", "name": "DeWitt", "profile_image_url": "
/normal/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id":
4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url":
Ran 36 tests in 0.178s
FAILED (failures=2)
## apply the patch
$ patch < python-twitter-0.5-fixjsontests.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
|diff -up python-twitter-0.5/ python-twitter-0.5/
|--- python-twitter-0.5/     2008-10-20 15:02:40.000000000 -0400
|+++ python-twitter-0.5/ 2008-10-20 15:04:53.000000000 -0400
Patching file using Plan A...
Hunk #1 succeeded at 17.
Hunk #2 succeeded at 146.
## run the test again
$ python test
running test
running egg_info
writing requirements to python_twitter.egg-info/requires.txt
writing python_twitter.egg-info/PKG-INFO
writing top-level names to python_twitter.egg-info/top_level.txt
writing dependency_links to python_twitter.egg-info/dependency_links.txt
reading manifest file 'python_twitter.egg-info/SOURCES.txt'
writing manifest file 'python_twitter.egg-info/SOURCES.txt'
running build_ext
/.amd_mnt/hut/vol/vol0/home/tirto/stanford/hw6/python-twitter-0.5/ DeprecationWarning: the md5
 module is deprecated; use hashlib instead
  import md5
Test the twitter._FileCache.Get method ... ok
Test the twitter._FileCache.GetCachedTime method ... ok
Test the twitter._FileCache constructor ... ok
Test the twitter._FileCache.Remove method ... ok
Test the twitter._FileCache.Set method ... ok
Test the twitter.Status AsDict method ... ok
Test the twitter.Status AsJsonString method ... ok
Test the twitter.Status __eq__ method ... ok
Test all of the twitter.Status getters and setters ... ok
Test the twitter.Status constructor ... ok
Test the twitter.Status NewFromJsonDict method ... ok
Test all of the twitter.Status properties ... ok
Test various permutations of Status relative_created_at ... ok
Test the twitter.User AsDict method ... ok
Test the twitter.User AsJsonString method ... ok
Test the twitter.User __eq__ method ... ok
Test all of the twitter.User getters and setters ... ok
Test the twitter.User constructor ... ok
Test the twitter.User NewFromJsonDict method ... ok
Test all of the twitter.User properties ... ok
Test the twitter.Api CreateFriendship method ... ok
Test the twitter.Api DestroyDirectMessage method ... ok
Test the twitter.Api DestroyFriendship method ... ok
Test the twitter.Api DestroyStatus method ... ok
Test the twitter.Api GetDirectMessages method ... ok
Test the twitter.Api GetFeatured method ... ok
Test the twitter.Api GetFollowers method ... ok
Test the twitter.Api GetFriends method ... ok
Test the twitter.Api GetFriendsTimeline method ... ok
Test the twitter.Api GetPublicTimeline method ... ok
Test the twitter.Api GetReplies method ... ok
Test the twitter.Api GetStatus method ... ok
Test the twitter.Api GetUser method ... ok
Test the twitter.Api GetUserTimeline method ... ok
Test the twitter.Api PostDirectMessage method ... ok
Test the twitter.Api PostUpdate method ... ok
Ran 36 tests in 0.176s


[Update 5/26] You may find Python's built-in set operations helpful as you traverse the social graph.
>>> ryans_friends = ['mike', 'jeff']
>>> mikes_friends = ['jeff', 'kelly']
>>> ryan_set = set(ryans_friends)
>>> mike_set = set(mikes_friends)
>>> mike_set - ryan_set          # difference
set(['kelly'])
>>> redundant_set = set(['mike', 'mike', 'mike'])
>>> redundant_set                # duplicates collapse
set(['mike'])
>>> ryan_set & mike_set          # intersection
set(['jeff'])
>>> ryan_set | mike_set          # union
set(['kelly', 'mike', 'jeff'])
# Also try:
fol = set(a.get_followers('aweigend'))
fri = set(a.get_friends('aweigend'))
# The intersection tells you who follows in both directions.
# Look at the online documentation for other set operations.
fol & fri