Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Spring 2009
STATS 252
Stanford University


Homework 7: Yobo (entirely voluntary!)


THANKS: I am grateful to Allen Guo, the CEO of Beijing-based YOBO, for preparing this extremely interesting, next-generation recommender system data set for our course, and for freely sharing it with us.
Furthermore, Shaun Maguire, currently working at YOBO in Beijing, offered to help out in any way he can -- he took the class last year and will be going to Caltech for grad school.


Due: Tues Jun 9, 2009, 5pm

Email to: stats252.homework@gmail.com.

Note: This homework can be done by one person or in small groups of 2-3.


This assignment is optional but can be used for extra credit towards your homework grade. Students who attempt this homework will also be able to present their findings to the class.

In this assignment you will create a recommender system for yobo.com. Yobo is a music discovery and sharing service. They have a music personality test, which is one of their innovations. Once a user finishes the 17 questions in the personality test, Yobo will start recommending music based on the test. When listening to the recommendation, the user can choose the following actions: like, dislike and skip. The recommendation system will then combine a user's personality test result with their action history to give the user more relevant recommendations.

Download the data set here:
http://weigend.com/files/teaching/stanford/2009/homeworks/yobo_data.zip

We will also have support from a former student, Shaun Maguire <shaunmaguire@gmail.com>. Feel free to email him questions and cc <stats252.homework@gmail.com>.


The real data sets in Yobo's recommender system have three sets of inputs:
a. User's DNA test (survey) results: 17 questions
1. We have the Music DNA test results:
(1) Each record has the following fields: id userid mi_06 a_01 re_05 a_06 do_07 a_03 re_08 a_05 do_05 a_04 re_09 a_07 mi_09 a_08 mi_01 a_02 do_02 music_a music_b music_c music_d created_date. (Ignore music_a, music_b, music_c and music_d.)
(2) Eight questions have a binary outcome (0, 1): a_01, a_02, ..., a_08.
(3) The other 9 questions have a three-valued outcome (-1, 0, 1).
2. The corresponding survey questions are listed in order in the appendix.

b. User's actions on songs: like, dislike, skip, complete
1. We have the like history of each user, with records of the form: (song_id user_id created_date)
2. Each record holds the most recent time the user clicked "Like" for the song. For example, if the song (song_id = 1) is liked by the user (user_id = 84) on 06/01/2008 and then liked again by the same user on 09/01/2008, the record is overwritten to:
song_id user_id created_date
1 84 20080901
We have randomly split all the data into a Training set and a Testing set.
Training Set files: DNASet_Traing.txt, LikedMusic_Training.txt
Testing Set files: DNASet_Tesing.txt, LikedMusic_Testing.txt
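
The record layouts above can be read with a few lines of Python. This is a minimal sketch, assuming the files are whitespace-delimited text with one record per line and no header row; the handout does not state the delimiter, so check the actual files first. The function names and the constant names are illustrative only.

```python
# Column order as listed in the handout; music_a..music_d are ignored.
DNA_COLUMNS = (
    "id userid mi_06 a_01 re_05 a_06 do_07 a_03 re_08 a_05 do_05 a_04 "
    "re_09 a_07 mi_09 a_08 mi_01 a_02 do_02 "
    "music_a music_b music_c music_d created_date"
).split()
QUESTION_COLUMNS = DNA_COLUMNS[2:19]  # the 17 survey answers, in file order


def parse_dna_line(line):
    """Return (userid, tuple of 17 integer answers) for one DNA record.

    Assumes whitespace-separated fields in the order of DNA_COLUMNS.
    """
    fields = dict(zip(DNA_COLUMNS, line.split()))
    answers = tuple(int(fields[c]) for c in QUESTION_COLUMNS)
    return fields["userid"], answers


def parse_like_line(line):
    """Return (song_id, user_id, created_date) for one like record."""
    song_id, user_id, created_date = line.split()[:3]
    return song_id, user_id, created_date
```

Ids and dates are kept as strings here so that nothing is lost if the real files use a different id or date format than the example record shows.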

Aim of this homework: When someone finishes the DNA test, you should be able to give that user a ranked list of 100 music recommendations. More generally, if you know a user's DNA test and liked-song history, you should be able to give that user a ranked list of 100 music recommendations.

To design a recommendation system, let's start step by step:

Step 1:
Before you begin designing your recommendation system, first consider the data set we have. Each test question result (binary or three-valued) is considered as one input, so we have 17 raw input dimensions representing the personality test result for each user. Each user's survey result falls into one cell of a finite grid with 2^8 * 3^9 = 5,038,848 cells. You need to analyze the distribution of the number of users across these grid cells. For example, we plotted the histogram of the number of users in the grid cells formed by the 8 binary questions (a_01 through a_08), which have 2^8 = 256 possible outcomes in total.

You should try to understand the relationships between the 17 questions, e.g., whether some questions are strongly correlated with others. Then you can decide on reasonable subsets (X1,X2…Xn) of the 17 inputs as the inputs for your recommendation system.
The 17 questions are designed to measure a user's personality in 11 dimensions. 8 of these dimensions take 1 question each; the other 3 dimensions take 3 questions each:
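
The grid-occupancy analysis described above can be sketched as follows. This is only an illustration: it assumes the survey answers have already been loaded into a dictionary from user id to answer tuple, and both function names are hypothetical.

```python
from collections import Counter


def cell_occupancy(answers_by_user, positions):
    """Count users in each grid cell defined by the chosen answer positions.

    answers_by_user: dict user_id -> tuple of survey answers
    positions: indices of the questions that define the grid
    """
    return Counter(
        tuple(answers[p] for p in positions)
        for answers in answers_by_user.values()
    )


def occupancy_histogram(cell_counts):
    """Map 'users per cell' -> 'number of cells with that many users'."""
    return Counter(cell_counts.values())
```

Cells with no users never appear in the counter, so the number of empty cells is the total number of possible cells minus len(cell_counts).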

“do” has 3 questions: do_07, do_05, do_02, corresponding to questions 5, 9, 17. These three questions are supposed to be more strongly correlated with each other than average.
“re” has 3 questions: re_05, re_08, re_09, corresponding to questions 3, 7, 11. These three questions are supposed to be more strongly correlated with each other than average.
“mi” has 3 questions: mi_06, mi_09, mi_01, corresponding to questions 1, 13, 15. These three questions are supposed to be more strongly correlated with each other than average.
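
One way to check whether the questions within a dimension really are more correlated than average is to compute all pairwise correlations. The sketch below uses plain Pearson correlation, which is only a rough measure for binary and three-valued answers; the choice of measure and the function names are assumptions, not part of the assignment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(answer_rows):
    """Pairwise correlations between question columns.

    answer_rows: list of equal-length answer tuples, one per user.
    """
    columns = list(zip(*answer_rows))
    return [[pearson(a, b) for b in columns] for a in columns]
```

Entry [i][j] of the result compares the i-th and j-th answer columns, in the same order as the answer tuples.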

Step 2:
Now turn to the output, i.e., the history of songs that users have liked. You need to aggregate these outputs into a predicted probability for each song when you give your recommendations to a user.
The basic method for aggregating this like history is:
When a new user B finishes the survey test, count all the likes for songs given by users who have an identical survey result (X1..Xn) from Step 1, and give your recommendations based on the like counts.
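
The basic counting method can be sketched as below. It assumes the training data is already held in two dictionaries (user id to answer tuple, user id to set of liked songs); these structures and the function names are illustrative choices.

```python
from collections import Counter, defaultdict


def build_cell_counts(answers_by_user, likes_by_user, positions):
    """For each survey cell, count likes per song among users in that cell."""
    counts = defaultdict(Counter)
    for user, answers in answers_by_user.items():
        cell = tuple(answers[p] for p in positions)
        counts[cell].update(likes_by_user.get(user, ()))
    return counts


def recommend(cell_counts, answers, positions, n=100):
    """Top-n songs by like count for the cell matching a new user's answers."""
    cell = tuple(answers[p] for p in positions)
    ranked = cell_counts.get(cell, Counter()).most_common(n)
    return [song for song, _ in ranked]
```

A new user whose cell contains no training users gets an empty list here, so a real system would need some fallback, for example a coarser grid built from fewer questions.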
To improve the basic method, you might base your recommendations on a "weighted" version that aggregates other users' liked-song information, i.e., give different users different weights before aggregating their output for a particular song into the predicted probability for that song. Example weights:
• Heavier weights for users who have liked more songs
• Lower weights for users who have not used YOBO for some time, or whose actions were made a long time ago.
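
The two example weights above could be combined as in the following sketch. The specific formulas (a logarithmic activity weight and an exponential recency decay with a 180-day half-life) are assumptions made for illustration, as is the YYYYMMDD date format, which is inferred from the sample record.

```python
import math
from collections import defaultdict
from datetime import date


def parse_yyyymmdd(text):
    """Convert a date string such as '20080901' to a date object."""
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))


def weighted_song_scores(likes, today, half_life_days=180.0):
    """Score songs from (song_id, user_id, created_date) like records.

    Each like contributes activity_weight(user) * recency_weight(like):
      activity_weight = log(1 + number of songs the user liked)
      recency_weight  = 0.5 ** (age_in_days / half_life_days)
    Both formulas are illustrative choices, not prescribed by the handout.
    """
    likes_per_user = defaultdict(int)
    for _, user_id, _ in likes:
        likes_per_user[user_id] += 1

    scores = defaultdict(float)
    for song_id, user_id, created in likes:
        age_days = (today - parse_yyyymmdd(created)).days
        recency = 0.5 ** (max(age_days, 0) / half_life_days)
        scores[song_id] += math.log1p(likes_per_user[user_id]) * recency
    return dict(scores)
```

To use this with the survey, pass in only the like records of users who fall in the same cell as the new user.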

There are many other factors you can build into the weights of your model. For example, you can rank a song lower if it is over-popular, i.e., if it is favored by many different (X1..Xn) outcome groups.
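
A possible way to penalize over-popular songs is an IDF-style factor, borrowed from text retrieval: a song liked in nearly every outcome group carries little information about any one group. The sketch below is one such scheme under that assumption; the formula and the function name are illustrative.

```python
import math
from collections import Counter


def popularity_adjusted_scores(cell_counts, cell):
    """Down-weight songs liked in many different survey cells.

    cell_counts: dict cell -> Counter of like counts per song
    Score = likes in this cell * log(1 + n_cells / cells containing the song),
    an IDF-style penalty chosen here purely for illustration.
    """
    n_cells = len(cell_counts)
    cells_with_song = Counter()
    for counts in cell_counts.values():
        cells_with_song.update(counts.keys())
    return {
        song: likes * math.log(1 + n_cells / cells_with_song[song])
        for song, likes in cell_counts.get(cell, {}).items()
    }
```

With equal like counts inside a cell, a song that appears in only that cell now outranks one that appears in every cell.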

Task: Now build a model that aggregates other users' like history to give recommendations to users. For each music DNA personality test outcome, give 100 recommendations, and test your model using the Testing set.
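
The handout does not prescribe an evaluation metric. One common choice, assumed here, is precision and recall of the top 100 recommendations against the songs each user liked in the Testing set. The function names are hypothetical.

```python
def precision_recall_at_n(recommended, liked_in_test, n=100):
    """Precision and recall of the top-n recommendations for one user."""
    top_n = recommended[:n]
    hits = sum(1 for song in top_n if song in liked_in_test)
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(liked_in_test) if liked_in_test else 0.0
    return precision, recall


def mean_metrics(recommendations_by_user, test_likes_by_user, n=100):
    """Average precision@n and recall@n over users present in the test set."""
    results = [
        precision_recall_at_n(recommendations_by_user.get(u, []), liked, n)
        for u, liked in test_likes_by_user.items()
    ]
    if not results:
        return 0.0, 0.0
    return (sum(p for p, _ in results) / len(results),
            sum(r for _, r in results) / len(results))
```

Because only likes are recorded, a recommended song that is absent from the test likes is not necessarily a bad recommendation, so these numbers are best used to compare models with each other.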

Step 3:
After implementing your initial model, you need to analyze it. Consider the input and output of your model: given this information, should some input dimensions be dropped from (X1..Xn)? E.g., if the recommendations are very similar for two inputs that differ only in variable Xk, should Xk be dropped from the input in this case? Or at least, if the recommendations are very similar for input1 (x1=1, x2=x3=..=xn=1) and input2 (x1=0, x2=x3=..=xn=1), should we locally combine the liked-song information of users whose test results match input1 or input2 to improve our recommendations? (Why: observing x1 has no impact on the prediction!) What are your findings? What are the most important inputs for your model?
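
To judge whether a variable Xk matters, one option is to measure how much the recommendation lists overlap between cells that differ only in Xk. The sketch below uses Jaccard overlap of the lists, which ignores ranking order; this measure and the function names are assumptions for illustration.

```python
def top_n_overlap(list_a, list_b):
    """Jaccard overlap of two recommendation lists (order ignored)."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_overlap_when_flipping(recs_by_cell, k):
    """Average overlap between cells that differ only in position k.

    recs_by_cell: dict cell (tuple of answers) -> ranked list of songs.
    A value near 1 suggests that question k barely changes the output.
    Returns None if no such pair of cells exists.
    """
    groups = {}
    for cell, recs in recs_by_cell.items():
        rest = cell[:k] + cell[k + 1:]
        groups.setdefault(rest, []).append(recs)
    overlaps = [
        top_n_overlap(a, b)
        for members in groups.values()
        for i, a in enumerate(members)
        for b in members[i + 1:]
    ]
    return sum(overlaps) / len(overlaps) if overlaps else None
```

Looking at the overlap per group, rather than only the mean, shows where a local combination of two cells is justified even if Xk cannot be dropped globally.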
Do you have some suggestions for the survey questions, e.g., reorder the questions / dynamic survey (corresponding to localized combinations)?

Step 4:
You may base your recommendations both on users who have identical survey results and on users who have a similar liked-song history. How will you improve your initial model so that it also draws on users with a similar liked-song history?
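
One possible way to combine the two sources is a weighted blend of a survey-based score and a like-history score. In this sketch, history similarity is Jaccard similarity between liked-song sets, and alpha is an arbitrary mixing weight; both are assumptions, and all names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blended_scores(target_likes, neighbour_likes, survey_scores, alpha=0.5):
    """Combine survey-based scores with like-history similarity.

    target_likes: set of songs the target user already liked
    neighbour_likes: dict user_id -> set of liked songs
    survey_scores: dict song -> score from the survey-based model
    alpha: weight on the survey score (an arbitrary illustrative default)
    """
    history = defaultdict(float)
    for likes in neighbour_likes.values():
        sim = jaccard(target_likes, likes)
        for song in likes - target_likes:
            history[song] += sim

    def normalise(scores):
        top = max(scores.values(), default=0.0)
        return {s: v / top for s, v in scores.items()} if top > 0 else {}

    survey_n, history_n = normalise(survey_scores), normalise(history)
    songs = (set(survey_n) | set(history_n)) - target_likes
    return {
        s: alpha * survey_n.get(s, 0.0) + (1 - alpha) * history_n.get(s, 0.0)
        for s in songs
    }
```

For a brand-new user with no likes, every similarity is zero and the blend reduces to the survey-based ranking scaled by alpha.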
If we define a multidimensional music space by identifying m characteristics that describe each song, so that each song is a point in the m-dimensional space, we can discover similar songs in the neighborhood of any given song or artist. How will you build this kind of information into your model?
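
A neighborhood lookup in such a music space can be sketched as a nearest-neighbor search. The example below uses cosine similarity; the similarity measure, the feature vectors, and the function names are all assumptions, since the handout leaves the m characteristics open.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_songs(song_features, song_id, k=5):
    """Return the k songs closest to song_id in the m-dimensional space.

    song_features: dict song_id -> sequence of m numeric characteristics
    (which characteristics to use is left open by the handout).
    """
    target = song_features[song_id]
    scored = [
        (cosine_similarity(target, features), other)
        for other, features in song_features.items()
        if other != song_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]
```

Neighbors found this way could be used to extend a recommendation list with songs that nobody in the user's survey cell has liked yet.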

Step 5:
Compare your framework of mapping survey questions --> song recommendations to "traditional" collaborative filtering:
(a) conceptually; (b) by implementing it (e.g., following Toby Segaran's book Programming Collective Intelligence); (c) by comparing performance on the testing set. What do you learn?
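
As a starting point for the comparison, a very simple item-based collaborative filtering baseline can be built from co-occurrence counts alone. This is a simplified sketch under stated assumptions, not the code from Segaran's book; the function names are hypothetical and the survey answers are deliberately not used.

```python
from collections import defaultdict
from itertools import combinations


def item_cooccurrence(likes_by_user):
    """Count how often each pair of songs is liked by the same user."""
    co = defaultdict(lambda: defaultdict(int))
    for likes in likes_by_user.values():
        for a, b in combinations(sorted(likes), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def cf_recommend(co, user_likes, n=100):
    """Rank unseen songs by total co-occurrence with the user's liked songs."""
    scores = defaultdict(int)
    for liked in user_likes:
        for other, count in co.get(liked, {}).items():
            if other not in user_likes:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:n]]
```

Unlike the survey-based model, this baseline cannot recommend anything to a user with no like history, which is one of the conceptual differences worth discussing in part (a).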

Appendix:
English Translation of the survey questions
1. I always try my best to be polite to everybody.
2. Which one of the following do you think is more attractive?
3. Sometimes, I get extremely excited.
4. Which of the following would you prefer to believe?
5. My friends always refer to me as a “busy bee.”
6. I am a talkative person.
7. I am an independent person and do not care about how other people think about me.
8. I prefer to:
9. I always act as a leader within a team.
10. I prefer to participate in:
11. I like dealing with abstract theory and ideas.
12. Which of the following will affect your choice more?
13. I am a humble person.
14. I am a well planned and organized person.
15. I believe that every child born has innate goodness.
16. I am an energetic person.
17. I would like to mingle with strangers at a party.

Answer type
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. My Appearance, B. My Intelligence. (0,1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Facts B. Intuition. (0,1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Agree B. Disagree. (0,1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Create Something Original B. Follow the Tradition. (0, 1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Recreational Activities, B. Competitive Activities. (0, 1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Emotional Feeling, B. Logical Thinking. (0, 1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Agree B. Disagree. (0,1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)
Answer: A. Agree B. Disagree. (0,1)
Answer:(Strongly Disagree, Neither Agree nor Disagree, Strongly Agree)(-1,0,1)

Yobo Questions in Chinese
1. 尽量对每个人都彬彬有礼(是,很难回答,不是)
2. 相比之下,我的(智慧,外表)更吸引人
3. 我容易体验别人的感受(是,很难回答,不是)
4. 我更相信(直觉,事实)
5. 朋友会说我是个爱忙活的人(是,很难回答,不是)
6. 我很健谈(是,不是)
7. 我不会在乎别人怎么看我(是,很难回答,不是)
8. 生活中我喜欢(坚持传统,标新立异)
9. 在集体中总是扮演领导角色(是,很难回答,不是)
10. 我喜欢(轻松悠闲,竞争激烈)的娱乐活动
11. 我(喜欢,很难回答,不喜欢)和抽象的理论观念打交道
12. 相比之下,(逻辑判断,情感)更容易主导我的选择
13. 我总是表现的谦虚(是,很难回答,不是)
14. 我做事生活总是(高度计划性,高度灵活性)
15. 我相信人性本善(是,很难回答,不是)
16. 我的精力十分旺盛(是,不是)
17. Party上,我会主动和陌生人攀谈(是,很难回答,不是)