HW6_Twitter

[|Andreas Weigend] Spring 2009 STATS 252 Stanford University
**Data Mining and E-Business: The Social Data Revolution**

=Homework 6: Twitter=

Assigned: Mon May 18, 2009
Due: Thu Jun 4, 2009, 5pm (this includes the extension)
Email to: stats252.homework@gmail.com
Note: This homework can be done by one person or in small groups of 2-3; feel free to discuss questions about Python programming broadly.

**Stats 252 extra "office hour" aka "Twitter Hack-A-Thon": Tuesday May 26, 2009, d.school, Bldg 524, 7-9pm**
Food and soft drinks provided, BYOB.

//THANKS: Several great people helped significantly with defining this homework://
 * //Yu-shan Fung, who graduated from Stanford just before this course was created, worked at Amazon.com and A9, and is now the CTO of Discoverio, the company that brings you MrTweet now (and more in the future)//
 * //Nick Kallen, who helped us get things done smoothly at Twitter, and shared his deep insights with us.//
 * //Doug Williams, who helped people get whitelisted.//
 * //Hamilton Ulmer, who took the course last year and now also works at Discoverio.//
 * //And of course Mingyeow Ng, the CEO of Discoverio, one of the most amazing people in the world and a great discussion partner, always willing to help.//

//And based on Enrique's idea of offering a coding session to help students with less hands-on experience, Ryan Mason (who also took the course last year and now works at 23andMe) organized a very effective session with help from some students: Chris Anderson, Emile Chamoun, Carlin Eng, Mike Polcari (SCPD/23andMe), and Jeff Mellon (23andMe). Thank you all!//

This homework is likely to take longer than the Delicious homework. Because many of you have voiced concern, we are going to hold special office hours with food. However, since our teaching team is understaffed, we are also offering extra credit to experienced Python developers who are willing to help their classmates. We are planning to host the hack-a-thon Tuesday, May 26th from 7-9pm at the d.school, Bldg 524. If you have friends outside our class who are willing to help, we can offer $20/hour + food.

** If you are a developer interested in helping please email enrique.allen@stanford.edu. **
There are two main goals of this assignment. The first is to introduce you to another API, through which you will find a wealth of information in tweets and social graphs. The second is to get you to discover interesting people, in the hope of sparking long-term engagement.

=The Assignment=

For this assignment, create 10 relevant recommendations of friends for each of 10 friends (100 total recommendations). Before you send the recommendations to each friend, rank them on your own (1 worst - 10 best) based on how you think your friend will rank them (i.e., based on relevance). Keep this ranking to yourself. Ask each friend to rank them (1 worst - 10 best) and give a short response about why they are good and/or bad. Submit the recommendations, ratings, and comments using this [|Google Form]. Also email to stats252.homework@gmail.com your script/program along with a short write-up that clearly describes your conceptual approach, briefly discusses the constraints you faced, and gives the rationale for your approach.

**PROBLEM**
What does a user really want? To discover information for certain domains? Business networking? Perhaps even dating?

**HYPOTHESES**
Data sources and recommendation approaches: these are just some hypotheses; we would love to see you use your creativity and think outside the box.
 * Analyzing a user's social graph (who follows whom, who replies to whom, who retweets whom, etc.)
 * Collaborative filtering (You follow A, B and C. Other people who follow A, B and C also follow D, thus we recommend D to you)
 * Similarity (You are followed by A, B and C, all of whom also follow D)
 * Transitive attention (You follow A, B, and C, all of whom follow D)
 * Semantics (tweets)
 * Hashtags (who else uses a similar set of hashtags as I do? Are all hashtags created equal?)
 * Keywords (identify interesting keywords? bag of words?)
 * Location (how can geolocation be used, when is it relevant, e.g., for finding people with similar interest near me?)
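To make the collaborative-filtering and transitive-attention hypotheses concrete, here is a minimal sketch, assuming the follow graph has already been downloaded into a dictionary; the `follows` data below is invented sample data, not real Twitter output. The idea: count how often a candidate account is followed by the accounts a user already follows, and recommend the most-followed candidates.

```python
from collections import Counter

def recommend(user, follows, top_n=10):
    """Recommend accounts followed by many of the accounts `user` follows.

    `follows` maps a username to the set of usernames they follow.
    Accounts the user already follows (and the user themself) are excluded.
    """
    already = follows.get(user, set())
    counts = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return [name for name, _ in counts.most_common(top_n)]

# Hypothetical sample graph: you follow A, B, and C; all three follow D.
follows = {
    'you': {'A', 'B', 'C'},
    'A': {'D', 'E'},
    'B': {'D'},
    'C': {'D', 'you'},
}
print(recommend('you', follows))  # ['D', 'E'] -- D is followed by all three friends
```

The same counting loop covers the "transitive attention" bullet directly; for the "similarity" bullet you would seed `already` with the user's followers instead of their friends.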

**ACTION**
Pick 10 of your friends who have been using Twitter. Randomly assign them to two groups, each of which will see one version of your algorithm.
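Randomly splitting your 10 friends into the two groups takes a couple of lines with the standard library; the usernames below are placeholders, not real accounts:

```python
import random

# Placeholder usernames for your 10 Twitter-using friends
friends = ['friend%02d' % i for i in range(1, 11)]

random.seed(252)   # fix the seed so the assignment is reproducible
random.shuffle(friends)
group_a, group_b = friends[:5], friends[5:]

print(group_a)
print(group_b)
```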

**METRICS**
Ask each friend to rank your recommendations from 1-10 (1 lowest, 10 highest) and give qualitative free-text feedback about why each recommendation is good and/or bad.

Once you have collected this information, fill out the [|Google Form] once for each of the friends for whom you made recommendations.


 * Output for each friend:
 * username of friend
 * Output for each recommendation: write all of this on one line, separated by **commas**: the username you're recommending, the rank you think your friend will give (not shared with the friend), the actual rank given by your friend, free text about what made the recommendation good, and free text about how the recommendation could have been better.
 * It should look like this: suggested_twitter_username, your_rank, friends_rank, "good free text", "bad free text"
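A small helper keeps the comma-separated line consistent across all 100 recommendations; the field values here are invented examples:

```python
def format_recommendation(username, your_rank, friends_rank, good, bad):
    """Build one submission line: username, your rank, friend's rank, free text."""
    return '%s, %d, %d, "%s", "%s"' % (username, your_rank, friends_rank, good, bad)

line = format_recommendation('some_user', 9, 7,
                             'tweets about topics she cares about',
                             'posts too many replies')
print(line)
# some_user, 9, 7, "tweets about topics she cares about", "posts too many replies"
```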

**EVALUATION** [quality is more important than quantity]
Create a short write-up comparing your two hypotheses/algorithms, the constraints you faced, and what you learned.
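One convenient way to compare the ranks you predicted with the ranks a friend actually gave is Spearman's rank correlation. Here is a sketch without SciPy, assuming complete rankings with no ties; the example rankings are made up:

```python
def spearman(pred, actual):
    """Spearman rank correlation between two complete, tie-free rankings.

    `pred` and `actual` map each recommended username to its rank.
    Returns 1.0 for perfect agreement, -1.0 for a complete reversal.
    """
    n = len(pred)
    d2 = sum((pred[u] - actual[u]) ** 2 for u in pred)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

pred   = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
actual = {'a': 4, 'b': 3, 'c': 2, 'd': 1}
print(spearman(pred, pred))    # 1.0: you predicted your friend's ranking exactly
print(spearman(pred, actual))  # -1.0: your friend ranked them in reverse order
```

A single number per friend makes the write-up comparison between your two algorithms much easier to argue.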

=Getting Started=

Twitter has a limit of 100 requests per hour. We have arranged to get you on the whitelist with an expedited turnaround of about 24 hours, plus up to 48 hours for the change to take effect. For now, please put one Twitter username from your group (only required for authenticated connections) and your IP address. [|Request Whitelisting] here. Be sure to mention "__**Stanford Stats 252 Datamining Homework**__" in the request form.
 * You can find your IP by typing **ifconfig** (Mac/Linux) or **ipconfig** (Windows). Amongst the numbers you will find something like 128.*.*.* or 172.*.*.* if you're on campus at Stanford, or another number if you're elsewhere. You can also try [|What's my ip?]

Your response from Twitter should look like this:

|| **Hi**,
Thanks for requesting to be on Twitter's API whitelist. We've approved your request!
You should find any rate limits no longer apply to authenticated requests made by @twitter_username. We've also whitelisted the following IP addresses:
**67.205.50.***
**128.12.155.***
This change should take effect within the next 48 hours. ||


 * You are welcome to use the programming language of your choice; however, you may find the [|files] we've put together helpful.
 * Twitter API: []
 * Language-specific libraries: []
 * How do you manage the necessary data (in-memory data structures? text files? database tables?)
 * Note that your system does not have to be interactive. Offline computation is good enough.
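On the data-management question, flat text files are often enough at this scale. Here is a sketch that persists a follower list and reads it back; the filename and usernames are arbitrary examples:

```python
import os
import tempfile

def save_followers(path, followers):
    """Write one username per line, sorted for stable diffs."""
    with open(path, 'w') as f:
        for name in sorted(followers):
            f.write(name + '\n')

def load_followers(path):
    """Read the file back into a set of usernames."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

path = os.path.join(tempfile.gettempdir(), 'followers_example.txt')
save_followers(path, {'alice', 'bob', 'carol'})
print(load_followers(path))
```

If your follower sets grow past a few thousand names, a sqlite table (as in the caching code below) is the natural next step.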

=Working with Python=

We have put together a few functions in Python that should make getting data a little easier. The fastest way to get started is to download the TwitterSearch.py file and the simplejson tarball. The two sample code files should help you get started. //Note that there are two ways to access the Twitter API, detailed here.//

Download at least these **4** files from []:
 * ReadMe.txt - instructions for downloading and installing the python-twitter package as well as simplejson
 * TwitterSearch-sample-code.py
 * TwitterSearch.py - functions for downloading following/followers and a function to search updates
 * simplejson-2.0.9.tar - needed for working with JSON objects returned by the Twitter API

You may also want to use the python-twitter library, which has more features and objects than TwitterSearch.py:
 * python-twitter-sample-code.py - example calls to the functions in the python-twitter package
 * python-twitter-0.5.tar - functions and objects for working with the API; the file is cached here as a convenience for you

If you download the .tar files from the link above, you won't need to use curl. If you have any trouble installing the packages, try moving the **python-twitter.py** file and the **simplejson** directory to your working directory. The library can be accessed without installing (this should work on shared machines where you don't have write permissions on public directories).

From here you're on your own; good luck and be creative. If you find any helpful links, feel free to post them here. (There are some good hints below.)
Here's some code to help you cache requests like the last assignment:

code format="python"
from datetime import datetime, timedelta
import time
import sqlite3 as sqlite
import pickle
from functools import wraps

class CachedAPIWrapper(object):
    """Delegate everything to the api, but cache results in the DB.

    Usage:
        def get_api():
            tapi = twitter.Api(username=USER, password=PASSWORD)
            cache = DBCache('tweeting.db', False)
            return CachedAPIWrapper(tapi, cache)
    """
    def __init__(self, api, cache):
        self._api = api
        self._cache = cache

    def __getattr__(self, name):
        fn = getattr(self._api, name)
        if not callable(fn):
            return fn
        def wrapper(*args):
            try:
                return self._cache.load(name, args)
            except CacheMissError:
                value = fn(*args)
                self._cache.save(name, args, value)
                return value
            except TypeError:
                # Uncachable -- for instance, passing a list as an argument.
                # Better not to cache than to blow up entirely.
                print 'Warning: Uncachable!'
                return fn(*args)
        return wrapper

class CacheMissError(Exception):
    pass

class DBCache(object):
    def __init__(self, dbfile, debug=False):
        self.__state = {}
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        try:
            self.__cursor.execute('CREATE TABLE `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob);')
        except sqlite.OperationalError:
            pass  # table already exists

    def save(self, func, args, result):
        self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)',
                              (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print 'Saving %s(%s) -> %s!' % (func, args, result)

    def load(self, func, args):
        self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;',
                              (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print 'Loaded %s(%s) -> %s!' % (func, args, value)
            return value
        else:
            raise CacheMissError

    def clear(self):
        self.__cursor.execute('DELETE FROM `cache`;')
        self.__connection.commit()
code

---polcari
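The delegate-and-cache pattern above is easy to exercise in isolation. Here is a minimal version in modern Python, with an in-memory dict standing in for the sqlite table; the `FakeApi` class is invented for the demo and is not part of any real library:

```python
class FakeApi(object):
    """Stand-in for twitter.Api: counts how often the 'network' is hit."""
    def __init__(self):
        self.calls = 0
    def get_followers(self, user):
        self.calls += 1
        return ['alice', 'bob']

class CachedWrapper(object):
    """Delegate attribute access to the api, memoizing call results."""
    def __init__(self, api):
        self._api = api
        self._cache = {}
    def __getattr__(self, name):
        fn = getattr(self._api, name)
        if not callable(fn):
            return fn
        def wrapper(*args):
            key = (name, args)
            if key not in self._cache:
                self._cache[key] = fn(*args)
            return self._cache[key]
        return wrapper

api = FakeApi()
wrapped = CachedWrapper(api)
wrapped.get_followers('aweigend')
wrapped.get_followers('aweigend')   # second call served from the cache
print(api.calls)  # 1
```

With the 100-requests-per-hour limit, caching every call like this is what lets you iterate on your algorithm without re-fetching the graph each run.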

Download this [|patch] if you are using simplejson 2.0.x and you get this error message when running:

code format="bash"
$ cd python-twitter-0.5
$ python setup.py test

======================================================================
FAIL: Test the twitter.Status AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
  File "twitter_test.py", line 121, in testAsJsonString
    self._GetSampleStatus.AsJsonString)
AssertionError: '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http:\\/\\/twitter.com\\/system\\/user\\/profile_image\\/718443\\/normal\\/kesuke.png", "screen_name": "kesuke", "url": "http:\\/\\/twitter.com\\/kesuke"}}' != '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http://twitter.com/system/user/profile_image/718443/normal/kesuke.png", "screen_name": "kesuke", "url": "http://twitter.com/kesuke"}}'

======================================================================
FAIL: Test the twitter.User AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
  File "twitter_test.py", line 224, in testAsJsonString
    self._GetSampleUser.AsJsonString)
AssertionError: '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA", "name": "DeWitt", "profile_image_url": "http:\\/\\/twitter.com\\/system\\/user\\/profile_image\\/673483\\/normal\\/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id": 4212713, "text": "\\"Select all\\" and archive your Gmail inbox. The page loads so much faster!"}, "url": "http:\\/\\/unto.net\\/"}' != '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA", "name": "DeWitt", "profile_image_url": "http://twitter.com/system/user/profile_image/673483/normal/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id": 4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url": "http://unto.net/"}'

----------------------------------------------------------------------
Ran 36 tests in 0.178s

FAILED (failures=2)


1) apply the patch

$ patch < python-twitter-0.5-fixjsontests.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff -up python-twitter-0.5/twitter_test.py.BAD python-twitter-0.5/twitter_test.py
|--- python-twitter-0.5/twitter_test.py.BAD    2008-10-20 15:02:40.000000000 -0400
|+++ python-twitter-0.5/twitter_test.py 2008-10-20 15:04:53.000000000 -0400
--------------------------
Patching file twitter_test.py using Plan A...
Hunk #1 succeeded at 17.
Hunk #2 succeeded at 146.
done

2) run the test again

$ python setup.py test
running test
running egg_info
writing requirements to python_twitter.egg-info/requires.txt
writing python_twitter.egg-info/PKG-INFO
writing top-level names to python_twitter.egg-info/top_level.txt
writing dependency_links to python_twitter.egg-info/dependency_links.txt
reading manifest file 'python_twitter.egg-info/SOURCES.txt'
writing manifest file 'python_twitter.egg-info/SOURCES.txt'
running build_ext
/.amd_mnt/hut/vol/vol0/home/tirto/stanford/hw6/python-twitter-0.5/twitter.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Test the twitter._FileCache.Get method ... ok
Test the twitter._FileCache.GetCachedTime method ... ok
Test the twitter._FileCache constructor ... ok
Test the twitter._FileCache.Remove method ... ok
Test the twitter._FileCache.Set method ... ok
Test the twitter.Status AsDict method ... ok
Test the twitter.Status AsJsonString method ... ok
Test the twitter.Status __eq__ method ... ok
Test all of the twitter.Status getters and setters ... ok
Test the twitter.Status constructor ... ok
Test the twitter.Status NewFromJsonDict method ... ok
Test all of the twitter.Status properties ... ok
Test various permutations of Status relative_created_at ... ok
Test the twitter.User AsDict method ... ok
Test the twitter.User AsJsonString method ... ok
Test the twitter.User __eq__ method ... ok
Test all of the twitter.User getters and setters ... ok
Test the twitter.User constructor ... ok
Test the twitter.User NewFromJsonDict method ... ok
Test all of the twitter.User properties ... ok
Test the twitter.Api CreateFriendship method ... ok
Test the twitter.Api DestroyDirectMessage method ... ok
Test the twitter.Api DestroyFriendship method ... ok
Test the twitter.Api DestroyStatus method ... ok
Test the twitter.Api GetDirectMessages method ... ok
Test the twitter.Api GetFeatured method ... ok
Test the twitter.Api GetFollowers method ... ok
Test the twitter.Api GetFriends method ... ok
Test the twitter.Api GetFriendsTimeline method ... ok
Test the twitter.Api GetPublicTimeline method ... ok
Test the twitter.Api GetReplies method ... ok
Test the twitter.Api GetStatus method ... ok
Test the twitter.Api GetUser method ... ok
Test the twitter.Api GetUserTimeline method ... ok
Test the twitter.Api PostDirectMessage method ... ok
Test the twitter.Api PostUpdate method ... ok

----------------------------------------------------------------------
Ran 36 tests in 0.176s

OK

code

hth, tirto

[Update 5/26] You may find the built-in set operations in Python helpful as you traverse the social graph.

code format="python"
>>> ryans_friends = ['mike','jeff']
>>> mikes_friends = ['jeff','kelly']
>>>
>>> ryan_set = set(ryans_friends)
>>> mike_set = set(mikes_friends)
>>> mike_set - ryan_set
set(['kelly'])
>>> redundant_set = set(['mike','mike','mike'])
>>> redundant_set
set(['mike'])
>>> ryan_set | mike_set  # union
set(['kelly', 'mike', 'jeff'])
>>> ryan_set & mike_set  # intersection
set(['jeff'])
code

Also try:

code format="python"
fol = set( a.get_followers('aweigend') )
fri = set( a.get_friends('aweigend') )
fol & fri
code

 * 1) this gets the intersection of the two sets; in this case it tells you who is connected in both directions (mutual follows).
 * 2) look at the online documentation for other set operations