Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
STATS 252, Stanford University, Spring 2009
Class time: Monday 2:15 - 5:05 pm
Class location: Gates B01

  • This URL is a wiki. If you see anything that can be improved, just do it. While everybody can view, only members can edit. All students in the class have been added as members based on their email address on the class list. If there is any access problem, please click on the top of this page, and generate a request to be added.
  • The page WISHES by students gives you a lightweight way to jot down what you would like to get out of the course. Please share your expectations with us and tell us how we can help. Two-way communication is what distinguishes Web 3.0 (architectures of interaction) from Web 2.0 (architectures of participation). Feedback is key to making this course truly worthwhile for you.
  • We will need some volunteers for some class-related tasks. Please check out the page HELP with class and add your name if you are interested. Thanks!

1. "Official" course page (non-wiki)

While all the information you need should be here on the pages of this wiki -- and if something is missing, you can just fix it for everybody else by putting it there -- the more traditional syllabus / high-level course description is on the official course page. It gives links, e.g., to the audio recordings and the transcripts of each class, as well as general references to books and papers.

2. Course content and goals

  1. Overview. The PHAME framework. Learn what it means to clearly define the problem you want to solve. Come up with a couple of hypotheses, suggesting different actions. Create a rich set of metrics, understanding the trade-offs between variables. And then run simple experiments that compare the different actions.
  2. Ecosystems and platforms.
  3. Data sources. Their value is their impact on decisions. Case: How Customer Lifetime Value is being redefined using social data.
  4. a) Basics of decision analysis, and why it is an important tool in decision making.
    b) Prediction markets: Choices in design, and their consequences. Case: Lessons from Google's internal prediction market.
  5. Social network analysis. Focus on decisions that are influenced by the outcome of the analysis. Case: The spread of information on Facebook, and its implications for traditional "influencer marketing". [Note added June 1, 2009: The paper Eric Sun (who took Stats 252 last year) presented received the Best Paper Award at the ICWSM09 conference. Congratulations!]
  6. Creating information products from data. Case: How LinkedIn's successful culture is based on data and models, experiments and metrics. Also, brief discussion of underlying infrastructure (Hadoop, Aster Data, Greenplum etc.)
  7. Recommendations, reputation, and relevance. Cases: Amazon, Music.
  8. Machine learning approaches for online advertising. Serendipitous discovery vs. interrupt marketing. Cases: MySpace, Fox Interactive Media
  9. Privacy, Dating, Mobile advertising. Cases: Skout. Orange-FranceTelecom.
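
As an illustration of the last step in item 1 (running a simple experiment that compares two actions on one metric), here is a minimal sketch using a two-proportion z-test. The choice of test, the function name, the click-through metric, and all counts are assumptions made for this example; they are not taken from the course material.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that both actions perform equally.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: two versions of a page, metric = click-through.
# The counts below are made up for illustration.
z, p = two_proportion_z_test(successes_a=120, trials_a=2400,
                             successes_b=156, trials_b=2400)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A single metric like this rarely tells the whole story, which is why item 1 asks for a rich set of metrics and an understanding of the trade-offs between them.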

The material above will be covered in 9 three-hour classes (8 regular classes and 1 class during the time slot for the final). I have reserved about an hour at the end of the last class. I would like to talk about the big shifts in medical personal data collection, sharing and mining. The impact on individuals, society and business will be significant. However, I am open to any suggestions of what you want to do in the last hour of class. Here are a couple of alternatives:
  • Geolocation is another fast growing source of social data right now. Most apps don't do much more than putting pins on Google Maps. After summarizing technology and devices, it would be good to develop scenarios for what "advertising" could look like on mobile platforms.
  • Quants on Wall Street always hope to get ahead by using data sources others don't have. I could tell some of the stories where using, cleaning, and understanding new data turned out to be lucrative. However, for each success story there are dozens of cases of wishful thinking that it should work, although it doesn’t. The discussion would center on where SDR data might be useful, and where it might not.

Several students have asked whether we will do a class on visualization. While I love displaying information, the hard part, which is almost impossible to teach in an hour, is to learn to have a "dialog with the data". Using good interactive tools for the problem at hand is more important than pretty visualizations.

3. Who is in the class?

I appreciate the rich diversity of the students in class, and am happy that people bring to bear both their academic perspective and their personal experiences. To facilitate the generation of unexpected ideas and discovery of new people, all students (registered or not) are on the same social network. Please upload your picture, and share some information about yourself. To give you a quick overview of the backgrounds of the students in class, here is the breakdown by department (as of April 14):
Mgmt Sci & Engineering (MS)
Business Administration (MBA)
Statistics (MS)
Computer Science (MS)
Electrical Engineering (MS)
Comput & Math Engr (MS)
Financial Mathematics (MS)
Biomedical Informatics (MS)
Communication (PhD)
Economics (PhD)
Psychology (MA)

Computer Science (BS)
Electrical Engineering (BS)
Math & Comp Science (BS)
Economics (BA)
Psychology (BA)
Computer Science (BS)/Mathematics (BS)
Mathematics (BS)/Economics (Min)
Psychology (BA)

Graduate Non-Degree-Option

4. Grading policy

My main goal is for you to appreciate how amazing the topics covered in this class are, to understand the material and to know how to apply it. Homework and other exercises are designed to help with this goal. However, since this is a graded course, it is important to be clear about the ingredients that will contribute to the final grade:
  • Homework [60%]: The assignments will be graded according to the deliverables explained on each assignment page. Some homework assignments are to be done individually, some in groups where all members of the group will get the same number of points. All of the homework makes up 60% of the grade. The more time consuming ones are weighted more heavily than the quick ones. The homework deliberately spans a wide spectrum of required times and skill sets. You can see that defining robust engagement metrics for a Facebook page or writing contracts for a prediction market are very different from designing an algorithm for discovering Twitter users to follow. These are very different from getting a simple content recommender system to work on Delicious or mining Yobo’s real-world music DNA data and figuring out its predictive power.
  • Course wiki [30%]: Class material from each session needs to be summarized and improved upon by a group of students to serve as a continuous resource. You must form a group and sign up to write a wiki page. Your group is responsible for creating the initial wiki page for the class that week and getting it up within 3 days. The page will be evaluated at 5pm on the Thursday following class. At that stage, the page must clearly emphasize the key learnings of the class, why they are relevant, and link to relevant materials elsewhere. At any time after that, every student is invited to improve the pages as the course develops with useful and relevant links that connect concepts together. You can look at past wikis (2007, 2008) to get an idea of what is expected.
  • Contributions during class [5%].
  • Contributions such as sharing some of your insights on, commenting on, or helping with the site. [5%]
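
To make the weighting concrete, here is a minimal sketch of how the four components could combine into a single score. The weights come from the list above; the function name, the 0-100 scale for each component, and the example scores are assumptions for illustration only, not the official grading procedure.

```python
# Weights taken from the grading policy above (they sum to 100%).
WEIGHTS = {
    "homework": 0.60,
    "wiki": 0.30,
    "class_contributions": 0.05,
    "online_contributions": 0.05,
}


def final_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: the scores below are made up for illustration.
example = {
    "homework": 90,
    "wiki": 80,
    "class_contributions": 100,
    "online_contributions": 60,
}
print(round(final_score(example), 2))
```

Because homework carries 60% of the weight, a 10-point change in the homework score moves the total by 6 points, while the same change in either contribution component moves it by only half a point.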

5. How to submit homework

Please email your homework assignments as text, doc or simple pdf, with your name in the filename, to the homework email address. This address is for submitting homework only.
Each homework assignment has specific requirements. For example, we want the dashboard sketch as hardcopy since it is less work for you to sketch this on paper, and we can give richer feedback by writing on it directly rather than electronically.

6. If you need help...

To make this simple for you, there is one single email address that is monitored by the teaching team. Please note that this address is different from the one for submitting your homework.

And here is the teaching team:
  • Andreas Weigend. I live in San Francisco, and am on campus one or two days a week. Try contacting me by email first, and if you think you should have received a response but didn’t, then text or call me, 650 906-5906. I'll be out of town for a few days to go to D7, but this does not impact class at all.
  • Enrique Allen is our “Social Media TA”. He took the course last year. We are grateful to Seth Goldstein, CEO of for partially sponsoring Enrique as the TA.
  • Ryan Mason is our grader. Ryan also took the course last year, and now works at 23andMe.
  • Ron Chung is a guest visitor helping out here and there. He's working on a bootstrapped stealth-mode startup building an innovative mobile social media application. If you want to get involved and help his team, contact him at

Note added May 25: Feng Zhang, the TA the Stats Department assigned to the class, has been "relieved of his duties" by the department chair. I genuinely apologize for the lack of adequate support you have received, and deeply thank you both for your understanding and for the work many of you have put in from the very beginning: Matt Jones built the Stanford-Berkeley dashboard for HW1, and Sampath Jinadasa and Emile Elie Chamoand implemented the Prediction Markets running through the end of the quarter. And the class would have been impossible without Enrique Allen, Ron Chung, and Ryan Mason.

7. ... and if you are able to help out the class

With almost 100 students (including SCPD students and auditors), and too little support from the department, any help you are willing to give to the class is welcome. We are still looking for someone to set up the prediction market (so we can all play and learn how it actually works) and to do some pre-checking of the Yobo data before we release it for the last homework. I personally would be extremely grateful to people willing to spend some time writing up one deep insight they got in class that is worth sharing, and I will work with them on it. (Thank you, Ray, for your creativity and clarity in the post on The Sorry State of Relevance.)

The world is changing so quickly. We use "old" technologies like this course wiki (simply because it worked well in the past years, but I am ready for new suggestions), or the page to understand both what data are being created there, and how important quick experiments and a thoughtful set of metrics are. We work with Twitter as a currently fast-growing distribution channel, Ning to get to the social graph of the class, Ustream to entice people to create metadata via annotations, Etherpad as a real-time collaboration tool... New channels are created constantly, and the usage patterns that emerge are often different from the intended ones.

In any case, if you know of anything that you think I should look into for class and the dissemination of the ideas, or have a better way of doing what I do, I would be grateful if you would let me know. And if it makes sense, let’s just try it out and learn together. I will waive assignments and/or the wiki requirement for comparable work. Furthermore, the school will pay $20 per hour as a grader for things we can frame that way.

So, please talk to me after class, send me email, call me... I deeply care for the subject matter (otherwise I would not be teaching this course, very different from, say, teaching algorithms). And I care that you, the students, get the most out of this class. So, if you have specific ideas, talk to me about them. It usually is good to work on something concrete you care about, plus we’ll both learn in the discussion.

8. Beyond the class: Social Data Revolution

Besides the course wiki, we use a few Web 2.0 tools and reflect on their emerging strengths and weaknesses. We discuss what incentives work for people to engage and share, what can be returned to them in exchange, and what appropriate metrics are in each case that reflect long-term goals.


  • Please follow @socialdata and include #socialdata @socialdata @aweigend in all tweets related to the class so there is a chance that your posts are actually seen.

Facebook Page

YouTube Channel

  • Subscribe to the channel. I am putting up a short video about once a week that gives an insight from class, often framed as a conversation with a guest. Please share your thoughts -- what do we learn about what works and what doesn't? What does "works" mean, and what is the purpose of such a channel?


  • We use the Ustream channel to understand how to enable people outside the classroom (including SCPD students) to engage. Anybody can view it in real time and participate; requiring people to use their Twitter name cuts down on spam. As with all the tools, we will reflect on where they help and where they are distracting or just a waste of time.


  • We use the Etherpad for real-time shared notes during class. The free version supports up to 8 concurrent users. Let me know if you run into problems and I will get the limit increased.

  • We got this site in the first week of the quarter as a central location for SDR-related material. Initially, we just seeded it with a few copied paragraphs (hopefully better than a 404 error). The first example of good original content is the HW1 Berkeley-Stanford dashboard, and the site will be the place for discussing insights gleaned from the survey. We are off to a good start, but I want to figure out what makes sense: What would you like to see there, what goal should it serve, and how can we measure progress?

9. Directions to class and the department

For guests, the following information might be useful: If you come by car, a convenient parking area is the street parking (you need to pay until 4pm) in front of the Cantor Center for the Visual Arts (328 Lomita Drive, Stanford CA 94305). After parking, walk for a few minutes (continuing in the same direction) towards the central part of campus. The first real street you will reach (not counting the street at the museum) is called Serra Mall.
  • If you want to go to the classroom directly, turn right on Serra Mall. The class is in the basement of Gates Computer Science (353 Serra Mall, Stanford CA 94305). The building is on your right after the second small street on your right, and you enter from that small street by going up a few stairs outside, and then down.
  • If you want to come to my office, then cross Serra Mall. My office is in Sequoia Hall, the next building in front of you, slightly to your right (390 Serra Mall, Stanford, CA 94305). If you do absolute directions, this is the South-West corner of the Serra Street-Lomita Mall intersection.
  • And if you want to grab a bite or a cup of coffee, turn right on Serra Mall and cross the street to get to Bytes Cafe, located on the ground floor of David Packard Electrical Engineering.
There is a searchable campus map (keywords: Gates or Statistics or Packard). If you have problems finding it, call my mobile, (650) 906-5906, although T-Mobile reception on campus is spotty. But any student on campus should be happy to help you with directions.