Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
STATS 252, Stanford University, Spring 2009

Class 1: Overview

April 6, 2009
Audio recording and transcript:
Part 1: mp3 transcript
Part 2: mp3 transcript

General Information

This wiki page summarizes and extends the first lecture of the Stanford STATS252 class, Data Mining and Electronic Business, given on April 6, 2009 by Andreas Weigend. A video recording of this class can be found at the following link after providing a Stanford SUNet ID and password:
https://myvideosu.stanford.edu/OCE/GradCourseInfo.aspx?coll=94869e4a-80cb-481a-8271-e8cc7e905c9d

Introduction

The exponential growth of information has led to an explosion in the amount of data shared in the world. In fact, studies (http://www.memebox.com/futureblogger/show/147-worldwide-information-growth-faster-than-previously-estimated) suggest that the amount of data shared in 2011 will be 10 times what it was in 2006.
Consequently, the question is no longer what interesting patterns we can extract from the data set we happen to have, since many types of information are now shared in large quantities. Instead, the approach has become: given a problem, what interesting information can we extract?

Evolution of Data

Passive Data

In the past, information was recorded in a transactional manner. The common way to use data was to look through the records and extract hidden patterns in order to make decisions for the future.

Examples

Beer and diapers story

Examining transaction histories revealed a correlation between buying diapers and buying beer. The story goes that men, after their wives asked them to buy diapers for their kids, needed to lift their mood and hence often decided to buy beer as well. Walmart decided to put beer and diapers in the same aisle in order to take advantage of this pattern.
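
A correlation of this kind is usually quantified with support, confidence and lift for a rule such as "diapers -> beer". The sketch below is only an illustration: the baskets are made up, and the function name association_stats is hypothetical, not something from the lecture.

```python
from typing import Dict, List, Set


def association_stats(transactions: List[Set[str]], a: str, b: str) -> Dict[str, float]:
    """Support, confidence and lift for the rule a -> b."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n                           # share of baskets containing both items
    confidence = n_ab / n_a if n_a else 0.0      # P(b in basket | a in basket)
    lift = confidence / (n_b / n) if n_b else 0.0  # > 1 means a and b co-occur more than by chance
    return {"support": support, "confidence": confidence, "lift": lift}


# Made-up baskets, for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]
stats = association_stats(baskets, "diapers", "beer")
print(stats)
```

With these toy baskets the lift is above 1, which is the kind of signal that would justify placing the two products near each other.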

Credit risk assessment

How: see http://www.gersteinlab.org/courses/545/07-spr/reading/hastie_moderndatamining.pdf , pages 3-4
(incidentally, a good paper about data mining techniques and applications)

Interaction Data

Instrumenting the world

Various problems couldn't be solved in the past without using complicated models.

Examples

  1. Traffic flow predictions used to be made with very complicated traffic flow models (http://en.wikipedia.org/wiki/Traffic_flow). Now, by simply measuring the positions of many cars simultaneously (through GPS-enabled devices), we can make accurate predictions.
  2. Available parking spaces. For instance, the city of Berkeley is planning to create a system of electronic signs providing real-time information on available spaces (http://findarticles.com/p/articles/mi_qn4176/is_20071018/ai_n21065758/).

In other words, many questions can today be answered by simple measurements from instruments providing real-time data, instead of by complex mathematical models.
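
As a minimal sketch of "measuring instead of modeling", the code below estimates travel time directly from speed readings reported by GPS-enabled devices. The road segments, readings and function names are all hypothetical; a real system would also need to map raw positions onto roads, which is left out here.

```python
from collections import defaultdict
from statistics import median


def segment_speeds(readings):
    """Group (segment_id, speed_kmh) readings and return the median speed per segment."""
    by_segment = defaultdict(list)
    for segment_id, speed_kmh in readings:
        by_segment[segment_id].append(speed_kmh)
    return {seg: median(speeds) for seg, speeds in by_segment.items()}


def travel_time_minutes(route, speeds):
    """route is a list of (segment_id, length_km); speeds maps segment_id to km/h."""
    return sum(60.0 * length_km / speeds[seg] for seg, length_km in route)


# Made-up readings: segment A is flowing, segment B is congested.
readings = [("A", 60.0), ("A", 50.0), ("A", 55.0), ("B", 20.0), ("B", 30.0), ("B", 10.0)]
speeds = segment_speeds(readings)
eta = travel_time_minutes([("A", 11.0), ("B", 2.0)], speeds)
print(speeds, eta)
```

The median is used rather than the mean so that a single parked or speeding car does not dominate the estimate for a segment.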

Instrumenting People

One of the current revolutionary trends comes from using interactive data shared by people in order to experiment with users, get to know them better and provide better solutions for them. Data shared among groups of users has been termed Social Data. Where has this trend led us?

Users' interests must always be the center of focus

One major remark is that a source of shared data can only be sustainable and successful if it is BENEFICIAL TO THE CONSUMERS in both the short and the long term. Otherwise users won't have any interest in sharing information. Long-term benefit implies that extreme and rare events must be taken into account too, and that they must also play in favor of users.

Illustrative examples where extreme events disadvantaged users
  1. Norwich Union (a car insurance company) developed a business model based on data sharing (recently adopted by Progressive), which proved to be a failure. In extreme situations, customers did not benefit from sharing data with the insurer. For example, each time a customer used her car, her insurance rate would be adjusted based on the specifics of her next trip (time, location, etc.). If the next trip was considered risky based on the acquired data, the rate was increased. Even worse, due to the very precise GPS information shared with the system, customers weren't covered for accidents that occurred when the data showed they were violating the law, e.g. when they were speeding. The insurance plan might have been beneficial for many "safe" trips, but the extreme cases discouraged enough customers for the plan to fail.
  2. Anorexic teenager medical fees. In this case, the insurance company chose not to pay for the medical expenses of an anorexic teenager and defended its case in court based on information gathered from the teenager's MySpace profile (http://elizabethnolanbrown.wordpress.com/2008/02/14/anorexic-teens-ordered-to-turn-myspace-writings-over-to-the-court/). In other words, MySpace was beneficial to the girl in the short term, but it clearly worked against her when she became sick.

Great, New Innovations

Using social data has allowed us to find simple and innovative ways to experiment with users and provide them with new solutions that change and improve the dynamics of the world around them.

Examples
  1. www.23andme.com - Personal Genome Service. Using this service, one can explore his or her DNA. Moreover, by having users answer simple questions, they all become part of "a social/DNA lab" where their answers can be associated with their DNA in useful and fun ways, e.g. what is the probability that a couple's children will get red in the face when they drink alcohol, what is the probability that we have a hidden genetic disease, etc.
  2. www.hitwise.com - Constructs a Consumer Confidence Index by looking at what users are querying for on the internet instead of doing surveys. Advantages: larger non-biased sample, more reliable inputs.
  3. www.fitbit.com - A device that automatically tracks fitness and sleep. No need to go to the doctor every 6 months wondering what's wrong: you know how you are likely to be doing every day.
  4. www.citysense.com - Uses the location of people's iPhones to identify tribe types and see where they tend to hang out at different times of the day. If you like a tribe, you can just look at what they're doing and go there, and the odds of you liking the place should be pretty high.
Figure: Citysense iPhone app


So much can still be done

Gather data more efficiently
Many devices could gather data more efficiently. Cellular phones, for example, can currently only ring and let the user choose whether she wants to answer or not. So much more could be done: monitoring the tone of a speaker to measure emotions, or analyzing calling patterns to infer relationships. User location is starting to be used in smartphones today.
More generally, social networks and sensor networks, such as iPhones with their GPS positioning, accelerometers, etc., could be integrated in order to create applications which understand what is relevant to the user. The paper at http://www.w3.org/2008/09/msnws/papers/sensors.html explains how semantic tools which understand natural language could be the glue for binding these social and sensor networks:

Figure: Interoperability between Social Networks and Sensor Networks using Semantic Web technologies

Relevance
Relevance is becoming critically important. People have so much information pushed to them, but how much of that information is relevant to who they are and what they are doing?
Examples
  1. Facebook and Twitter have created fabulous systems to find friends and keep in touch with them. However, information feeds like walls on Facebook and tweet histories on Twitter are both ordered purely chronologically. Do we really want to know everything all our friends and followers are doing or saying? No! True relevance-ranking systems have not been implemented for these feeds yet (even though they already exist for search).
  2. How do we go about finding out how to provide relevant information to users? What do users want? If one asks Twitter for the top 5 people to follow and then asks for the top 50 people to follow, should the first 5 of the top 50 be the same as the top 5? We don't really know, but we can ask people and measure their reactions, and the results should be a factor in our rankings.

The bottom line is that the world is being instrumented and measured, people and groups of people are being instrumented and measured, and these measurements give us the current state of what is going on. Ideally, a relevance system should take this state as an input and output the most relevant information for each user, i.e. some kind of "black box" that receives a flow of information about the current state of the world and of people, and outputs relevant bites of information to every user at the best time. For example, emails could be prioritized much more intelligently so that receivers get what is most useful to them first. The new priority is that the people at the receiving end should have an influence on the information being pushed to them.
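
To make the "black box" idea concrete, here is a toy sketch that orders a feed by a relevance score instead of chronologically. The scoring rule (a per-receiver affinity to the author, decayed by the age of the item) is an assumption chosen for illustration, and all names and numbers are hypothetical; it is not how Facebook or Twitter rank anything.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    author: str
    text: str
    age_hours: float


def relevance(item: Item, affinity: Dict[str, float], half_life_hours: float = 24.0) -> float:
    """Hypothetical score: the receiver's affinity to the author, halved every half_life_hours."""
    decay = 0.5 ** (item.age_hours / half_life_hours)
    return affinity.get(item.author, 0.0) * decay


def rank_feed(items: List[Item], affinity: Dict[str, float]) -> List[Item]:
    """Most relevant item first, for one particular receiver."""
    return sorted(items, key=lambda it: relevance(it, affinity), reverse=True)


# Made-up feed, listed newest first (the purely chronological order).
feed = [
    Item("acquaintance", "lunch photo", 1.0),
    Item("colleague", "project update", 12.0),
    Item("close_friend", "moving to a new city", 24.0),
]
# The receiver's own weights: this is where the receiving end influences what is pushed.
affinity = {"close_friend": 1.0, "colleague": 0.5, "acquaintance": 0.1}
ranked = rank_feed(feed, affinity)
print([it.author for it in ranked])
```

In this toy case the day-old post from a close friend outranks the hour-old lunch photo, which a chronological feed would have shown first.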

Sources of Data

Data can be collected implicitly and explicitly. Google, for example, sniffs the digital exhaust of the web implicitly with its browser toolbar, which monitors user behavior on the internet, and with many other tools, in order to better rank page quality and relevance in its search results. Reviews on Amazon, by contrast, ask users explicitly what they think about a book. It is important to have a strategy for acquiring data and to understand how or why the user will reveal information through the interaction, and how much the information gathered is worth. A general principle is that data is worth as much as it helps us make decisions.

Example: Google

Google leverages the information that users reveal through their interactions with its products.
  1. Query Sequence (used for spelling corrections)
  2. Choice Set (which link chosen from the search results)
  3. Queries and their refinements (How to make a query more precise for the user over time)
  4. Trends (trends of queries to predict trends in the population)
  5. Toolbar: knowing what people do on the web
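
The first item, using query sequences for spelling corrections, can be sketched as follows: when users often retype a query as a very similar one right away, the second query is a plausible correction of the first. This is an assumption-laden toy, not Google's actual method; the sessions are made up and the 0.8 similarity threshold is arbitrary.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_corrections(sessions, min_similarity=0.8):
    """Count how often a query is immediately followed by a similar but different query."""
    rewrites = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            similar = SequenceMatcher(None, first, second).ratio() >= min_similarity
            if first != second and similar:
                rewrites[first][second] += 1
    return rewrites


def suggest(rewrites, query):
    """Return the most frequent rewrite seen for this query, or None if there is none."""
    if query not in rewrites:
        return None
    return rewrites[query].most_common(1)[0][0]


# Made-up query sessions, each a list of queries in the order they were typed.
sessions = [
    ["standford university", "stanford university"],
    ["standford university", "stanford university", "stanford map"],
    ["standford university", "palo alto weather"],
    ["data mining", "weather"],
]
rewrites = learn_corrections(sessions)
print(suggest(rewrites, "standford university"))
```

Unrelated follow-up queries fall below the similarity threshold and are ignored, so only near-identical retypes count as evidence for a correction.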

Summary of Data Revolutions

Figure: Summary of data revolutions

PHAME Methodology

Problem, Hypotheses, Action, Metrics, Experiment

P: Problem

For example, my problem is that I want to get people to join a Facebook page. Or another problem could be that I want people to keep coming back to my page during the next year.

H: Hypotheses

If I want people to join my page, my hypothesis is that if I distribute flyers or send them email, they will come. Each problem can and should help us come up with a hypothesis. If my problem is rather to have people keep coming back to my Facebook page during the next year, then a hypothesis could be that I need to provide them with good, frequently updated content.

A: Action

To attract people to my page, one possible action is to hand out flyers, another is to send emails, and a last one is simply to do both. Hypotheses, too, can and should help us come up with actions. Another example of an action involves the Amazon shopping cart: should it be on the left or on the right of the screen?

M: Metrics

Metrics enable us to measure or quantify how good an action is. For our Facebook page, sending emails would be considered the best action if, for example, it scored highest on the metric of the number of people visiting the page in a week. For the Amazon shopping cart, a good metric for deciding which side of the screen is best could be the size of the order when users decide to purchase, or the conversion rate, i.e. the share of people who decide to order after seeing the cart.
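
The two shopping cart metrics mentioned above can be computed as in the sketch below. The data is invented and the function name cart_metrics is hypothetical; the point is only that the two metrics can rank the same actions differently, so the choice of metric matters.

```python
def cart_metrics(sessions):
    """sessions holds one entry per visitor who saw the cart: the order total, or None if no order."""
    orders = [total for total in sessions if total is not None]
    conversion_rate = len(orders) / len(sessions) if sessions else 0.0
    average_order = sum(orders) / len(orders) if orders else 0.0
    return conversion_rate, average_order


# Made-up outcomes for ten visitors per variant.
cart_left = [None, 30.0, None, None, 50.0, None, None, None, None, None]
cart_right = [None, 20.0, 25.0, None, None, 15.0, None, None, None, None]

left_rate, left_avg = cart_metrics(cart_left)
right_rate, right_avg = cart_metrics(cart_right)
print("left: ", left_rate, left_avg)
print("right:", right_rate, right_avg)
```

Here the right-hand cart wins on conversion rate while the left-hand cart wins on order size. With samples this small neither difference would be statistically meaningful; a real experiment needs far more visitors.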

E: Experiment

Try, try, try. What works? What doesn't? Find new actions and test them on your metrics, improve, iterate. We don't know what we should do, but we can find out.

Remarks

  1. Metrics are extremely important. Good metrics lead to good results. Even slightly bad metrics can be catastrophic. Smart users will always find ways of increasing the metric artificially if it is in their interest, which means that they can end up doing precisely what is worst for your problem.
  2. Metrics measuring short-term effects are easy to identify but are not that valuable. Metrics measuring long-term effects are much harder to come up with, but they are the ones with the real value.

Data

When studying data, always analyze the distribution rather than only the mean, the variance or some other summary measure. More often than not, a distribution will be highly asymmetric, with spikes and with intervals where it behaves completely differently. Taking only the mean or the variance makes little sense in these cases. A picture is worth a thousand words: always look at the actual distribution and try to understand where the irregularities come from in practice. For example, if the Amazon distribution of the total value of a purchased cart shows a kink at $24-$25, it was probably created by the $25 minimum purchase for free shipping. As another example, if you notice a spike at 200 clicks in the number of clicks per user session, it is probably a non-human web bot crawling your website. The point is that you could not even have begun to understand these things with just the mean or variance.
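
The web bot example can be reproduced with a few lines of code. The click counts below are made up: ten short human sessions plus three sessions of exactly 200 clicks. The text histogram makes the spike obvious, while the mean describes no session that actually happened.

```python
from collections import Counter
from statistics import mean


def text_histogram(values, bin_width):
    """Return {bin_start: count} for fixed-width bins, sorted by bin start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up clicks per session.
clicks = [3, 5, 8, 4, 6, 7, 5, 9, 4, 6, 200, 200, 200]
bin_width = 10
hist = text_histogram(clicks, bin_width)

# One row per non-empty bin, one '#' per session in that bin.
for start, count in hist.items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")

average = mean(clicks)
print("mean clicks per session:", round(average, 1))
```

The mean comes out at roughly 50 clicks, although every session has either fewer than 10 clicks or exactly 200.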

People are Social Beings

People are social beings. This has been ignored by catalogs, but if you market to the people who know people who have bought the item, you can do far better. Friends are typically 9 times as effective as advertising at converting bad or neutral feelings into good ones. The reason for this spectacular difference in performance is that friends choose context, content and recipient particularly well.

Think of the new service, Aardvark, which searches through your friends to find the right person to answer your question (and recommend a purchase)...

Get fast answers over instant messaging or email.
Aardvark finds the right friends (or friends-of-friends) for your questions.

Send Aardvark questions over IM or email


Nothing to download or install.
Send questions to Aardvark in your existing IM or email program.
Aardvark will also send questions to you based on your interests.

Aardvark chooses the right person to answer



Scarcity

If you want to help people, you have to increase their scarce resources. For example, you could save people time, help them manage priorities, etc.

Cost

Dollar costs are not necessarily the worst. Another, more important currency is social capital, i.e. the trust that someone will probably try to help you and not take advantage of you in the future (see homework 1 optional reading: http://www.stanford.edu/%7Efturner/Turner%20Tech%20&%20Culture%2046%203.pdf).
Another cost comes from being interrupted. When in the middle of doing something which requires attention, people feel a natural urge to answer calls, reply to new emails, tweet, etc. These interruptions take a heavy toll on a person's productivity. Evolutionary psychology explains this urge to react by noting that prehistoric humans needed to pay attention to interruptions in order to spot rare and good opportunities. But is it still useful to us, and can we voluntarily control it?

Behavioral Economics

It has been shown experimentally that people are not unboundedly rational. More options do not always lead to more consumption (Draeger's jam experiment). Nor do people always accept money: it depends on the context in which it is offered ($10, 1c experiment). In order to create the coolest things, we need to account for our biological history and adapt to it.

Initial Contributors

Alexis Pribula, Bilal Badaoui and Brad Griffith