Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
STATS 252, Stanford University, Spring 2009

Class 2: Ecosystems

April 13, 2009
Audio recording and transcript:
Part 1: mp3 transcript
Part 2: mp3 transcript

Recap of Class 1: PHAME


Unlike older approaches to data mining, the PHAME model starts from the central "problems" that arise when building a (possibly electronic) business. Neither the data nor the mining is the central element; the problem is.

Most of these problems can be stated as falling into one of the following categories:
  • what are we doing it for?
  • what is our business model?
  • what is our monetization model?

Monetization / business models

How do we monetize?
The most common business models are
  1. Ads
    Fueled the internet economy in the web 1.0 era; web site owners want people to stay on their sites as long as possible. We measure metrics such as: ad views, ad clicks, conversion rates (the higher the better).
  2. Sell goods
    Selling goods, as web retailers like Amazon and Zappos do. In this case, it is no longer clear whether we want people to stay on our pages as long as possible or whether we should just give them what they are looking for.
  3. Versioning / subscription
    A good example would be dating sites. Our goal is to get a lot of visitors/members, so we can get a lot of inventory. For monetization, one can e.g. set up two versions of the site, of which only one is free to the user. The free version serves to build up the inventory, the paid version is for differentiated services, like sending unlimited messages or unlimited photo uploads etc.
  4. One time payment
    Example: game downloads. A one time payment makes you owner of the software, i.e. provides you with a lifetime of service.
  5. Virtual Currency
    Users need some virtual currency to progress in a game or to perform desired actions. In order to get virtual currency, they have to pay actual money or go through offers/surveys that pay real money.
  6. Virtual goods
    Novel way to make money: virtual gifts consist of items existing (only) in a virtual world, i.e. bits are exchanged. The items may have social capital value, aesthetic or functional value. They may or may not be paired with an offline equivalent, e.g. a 99 cents virtual rose bouquet corresponding to $30 in reality.
  7. Lead gen
    Creation or generation of prospective consumer interest or inquiry into a business' products or services, aimed at increasing total sales of the business.
  8. Information products
    A way to leverage information or data we own in order to make money. In the most general case, this concept is called "freemium": a basic version is free, and a premium version is paid. Chris Anderson's LongTail article has an interesting story about the concept. A classic case is giving away the shaver in order to make people buy the blades (the Gillette model). Another case is Google: users are offered free services while the money comes from someone else (e.g. people buying ads). Understanding who the agents are, who is willing to pay, what the incentives are, and how to align the incentives are key aspects here.
  9. White labels
    Create a product that other companies can put their brand on. The company pays for the product or service provided.
  10. Access to data
    Example: API provider companies offering scalable on-demand infrastructure as building blocks for other companies. Clients include, for instance, LinkedIn, Trulia, and Netflix.
  11. Ecosystem
    Example: a platform, in this case called an app store, allowing users to buy and download applications that were developed by third-party developers (most notable: the iPhone's App Store). The constituents of this ecosystem are the software vendors, software developers, open source collectives, individuals, organizations and companies.


How do we get people to visit/join a site for the first time?
This can be achieved by giving away gift cards or promotions, or by organizing marketing campaigns to get people to know your product.
User acquisition cost: the cost associated with attracting or convincing consumers to buy your product or service, including research, marketing and advertising costs.
Return on Investment (ROI) helps determine the customer value, e.g. spending $3 to acquire a customer who, in turn, can add $10 in profit.
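The $3 / $10 example can be worked through in a few lines. This is a minimal sketch that assumes the common definition ROI = (profit - cost) / cost and reads the $10 as profit before the acquisition cost is subtracted; the lecture does not spell out either point, and the function name is made up for illustration.

```python
def roi(acquisition_cost: float, profit_per_customer: float) -> float:
    """Net return per dollar spent on acquiring one customer.

    Assumes profit_per_customer does not yet account for acquisition_cost.
    """
    if acquisition_cost <= 0:
        raise ValueError("acquisition_cost must be positive")
    return (profit_per_customer - acquisition_cost) / acquisition_cost


# Numbers from the notes: spend $3, customer adds $10 in profit.
example = roi(acquisition_cost=3.0, profit_per_customer=10.0)
print(f"ROI: {example:.2f}")  # prints "ROI: 2.33"
```

An ROI above 0 means the customer more than pays back the acquisition spend; at exactly 0 the campaign breaks even.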


How do we persuade the user to come back and buy more?
The product itself, not marketing, plays a significant role in user retention. Customers experience the product, hopefully decide to like it and to come back. This can be summarized by “Product is the message.”


Cognitive science and behavioral economics are interested in understanding how people make decisions. For instance, in assessing the effect of delayed rewards, one could ask the following question: Would you prefer $10 now or $100 a week from now? Empirically, the answers to this question split about 50/50 (experiment conducted during the lecture).
However, when the same question is asked with a different delay, e.g. one can choose to receive $10 in 51 weeks from now or $100 in 52 weeks from now, a large majority will pick the latter choice. Here is a simple example of a study on impulsiveness that explains probability discounting.
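One way to see how such a reversal can arise is a hyperbolic discounting model, in which the subjective value of a reward is amount / (1 + k * delay). The model and the value of k are assumptions made for this sketch, not something stated in the lecture; k = 9 per week is chosen only so that $10 now and $100 in one week come out equal, mirroring the 50/50 split in class.

```python
def discounted_value(amount: float, delay_weeks: float, k: float = 9.0) -> float:
    """Hyperbolic discounting: subjective value = amount / (1 + k * delay)."""
    return amount / (1.0 + k * delay_weeks)


def preferred(small: float, small_delay: float,
              large: float, large_delay: float, k: float = 9.0) -> str:
    """Return which of two delayed rewards has the higher subjective value."""
    v_small = discounted_value(small, small_delay, k)
    v_large = discounted_value(large, large_delay, k)
    if v_small > v_large:
        return "smaller-sooner"
    if v_large > v_small:
        return "larger-later"
    return "indifferent"


# $10 now vs. $100 in one week: equal value under the assumed k.
print(preferred(10, 0, 100, 1))    # indifferent
# Same one-week gap, pushed 51 weeks into the future.
print(preferred(10, 51, 100, 52))  # larger-later
```

With the same one-week gap between the two rewards, the preference flips once both rewards are far away, which is the pattern described above.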

Another example is Amazon's reasoning about the economics behind a co-branded credit card. When a customer signs up for a Chase credit card, Chase pays $100 to Amazon and $30 to the customer. It remains unclear how best to deliver this message to customers. One could, e.g., offer the $30 discount either on the current purchase or on the next purchase (Amazon went with the first option).

A good analogy to the above problem comes from dating, where one could choose between "Mr. Right" and "Mr. Right now". It turns out that people tend to discount the value of events taking place in the future, i.e. they prefer "Mr. Right now".


Actions are taken by you, not the consumer. Consumers react to your actions. These actions obviously depend on the problem you're trying to solve. An action should be differential (it could be done this way or that way), and we always start with the action.

Examples of actions:
  • vary text/layout of a web site (e.g. shopping cart being placed on the left or right side).
  • vary the time at which emails are sent to customers. If, e.g., a person checked her email at 10 am on previous occasions, an appropriate action to keep our next message from being buried under other messages might be to send it a little before 10 am, say at 9 am (this makes sense only if we assume that customers read their mail according to the LIFO (last in, first out) principle).


Inputs vs outputs

Do you care about real-time or about interactive? For most things, except performance metrics, we don't care whether they are real-time (whether a hit is a minute old or happening right now), but we do care about interactivity, because it is our own time that is at stake: how long does it take between having an idea and getting the answer back?

We need many metrics (for example user metrics, performance metrics, application metrics, relationship metrics, etc.) to make better decisions. We measure site behavior as well, not just user behavior (engagement): e.g. what is the effect of a fatal error due to a site crash while a user is trying to buy a product? This may have a long-term effect, yet it is hard to measure (short-term effects are easy to measure, long-term effects are hard).


Interaction between users is one of the keys to foster an online community. Metrics such as the number of discussion/wall posts, discussion thread depth, comments aim at providing an up-to-date impression of user engagement. Engagement affects the performance of the website both from short term (virality) and long term (loyalty) perspectives.


Experiments can be done through A/B testing (aka bucket testing), which is a method of testing different proposed algorithms, layouts, ad placements, etc. using live traffic in a production environment to measure which one performs better. Examples of metrics that A/B testing can measure are CTR (click-through rate), conversion rate, CPM (cost per thousand impressions), etc. Usually, A/B testing is performed on a sample of the entire user population. A/B testing is a misleading name because you can actually test more than two things at the same time, e.g. (A/B/C/D).
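To decide whether a difference in CTR between two versions is more than noise, one common choice is a two-proportion z-test. The lecture does not prescribe a specific test, so the sketch below is an assumption about how the comparison could be done; the function name and the counts in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: A gets 100 clicks, B gets 200 clicks, 10,000 views each.
z, p = two_proportion_z_test(100, 10_000, 200, 10_000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

A small p-value says the observed gap would be unlikely if both versions had the same true CTR; it says nothing about whether the gap is large enough to matter for the business.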

A/B testing is key to validating your hypotheses. It has to be simple enough for everyone in the company to use and understand!
Possible implementation, where a single line of code picks the branch:
if ( setup_experiment(...) == "control" ) {
    // do it the old way
} else {
    // do it the new way
}

How to get 2% of all users to visit version B of a website?
Steps of a possible implementation:
  1. Configure the Apache web server to handle page requests using a traffic splitter DSO, which calls an fnv_hash() function with the userid as parameter and takes the remainder of the result.
  2. Assign value = (fnv_hash($userid) % 1000), which lies in [0...999].
  3. Get the 2% sample by selecting the values 0..19 (these users visit version B; all other users are the "control").
  4. Define metrics: e.g. user time spent.
  5. Collect data.
  6. Plot the distributions of the metric for the old and the new way, compare the means, and estimate the probability that the difference is due to chance.
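Steps 1 to 3 can be sketched outside of Apache as follows. The notes do not say which FNV variant fnv_hash() uses, so this sketch assumes 32-bit FNV-1a. The helper names bucket() and variant() are made up for illustration. The 2% slice is treated here as the group that sees version B; swap the labels if the slice is meant to be the control.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (assumed variant; the notes only say fnv_hash)."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def bucket(user_id: str, buckets: int = 1000) -> int:
    """Map a user id to a stable bucket in [0, buckets - 1]."""
    return fnv1a_32(user_id.encode("utf-8")) % buckets


def variant(user_id: str, experiment_buckets: int = 20) -> str:
    """Buckets 0..19 (2% of 1000) get version B; everyone else is control."""
    return "experiment" if bucket(user_id) < experiment_buckets else "control"


# The same user always lands in the same bucket, so the experience is stable.
print(bucket("user-42"), variant("user-42"))
```

Hashing the user id instead of drawing a random number per request means no assignment table has to be stored, and a returning user keeps seeing the same version.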

A "soft launch" refers to the release of a minor revision of a web site or other product or service to a limited audience. Soft-launching is a method for gathering data on a product's usage and acceptance in the marketplace, before making it generally available as a "hard launch" or real product launch. If the sample is not well chosen, performing A/B testing during a "soft launch" can give misleading results because it might not reflect normal conditions.

There are other external effects that are important and should be taken into account: day of the week, interaction with other apps. Therefore, we would like to run the A/B test long enough, e.g. 4 weeks including weekends, to gather enough sample data to ensure that the change was not due to external factors.

Homework 1

Forest vs trees

It is not about the forest (aggregate phenomena), not about the trees (high resolution phenomena).
We are interested in deep structure rather than surface structure. It's also important to understand the underlying dynamics to interpret the data correctly.

Good metrics

What are good metrics or KPI (Key Performance Indicators)?

Characteristics of good metrics are:
  1. Actionable: when reading a report containing actionable metrics, you would be able to take actions in order to e.g. earn money or stop losing money.
  2. Accessible: easily obtained without overburdening the stakeholders.
  3. Auditable: can be counted and applied in an equal or non-discriminatory manner.

A good metric should contrast different design options by measuring outputs corresponding to those options. It should also be robust, e.g. if you remove data points, the conclusions drawn from the measurements should essentially remain the same.

Kaizen Analytic has a good blog entry about methodology on how to define these metrics.

It is important to look at the data distribution and monitor how well a metric is performing over a set of time periods. We should also validate whether the metrics actually measure what we want them to measure; this validation is performed against the data collected over time. It might be useful to look at the distribution of one metric across, say, 10 days and come up with a statistical description, then monitor how well that metric evaluated for day 11 matches the statistics.
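The 10-day / day-11 check could look like the sketch below. It assumes the "statistical description" is simply the mean and sample standard deviation, and that a value more than three standard deviations away counts as unusual. Both choices, as well as the sample numbers, are illustrative and not taken from the lecture.

```python
import statistics


def is_unusual(history: list[float], new_value: float,
               threshold: float = 3.0) -> bool:
    """Flag new_value if it lies more than `threshold` sample standard
    deviations away from the mean of the historical values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # needs at least two data points
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > threshold


# Made-up daily values of one metric for days 1..10.
ten_days = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
print(is_unusual(ten_days, 101))  # False
print(is_unusual(ten_days, 120))  # True
```

A flagged day is a prompt to look closer (site outage, tracking bug, real change in behavior), not a conclusion by itself.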

Misleading metrics can cause huge costs, since they tend to lead to wrong decisions.


If a metric goes down, we should be able to tell what this means. A metric should say something relevant about what we are measuring so that it can recommend a wiser course of action. For example: the CTR is 0.65% if we put banner ads together with quizzes about a subject (the Sweeney Todd example in Jia's talk), but only 0.01% if we show the banner ads by themselves.

Deep structure vs surface structure

The terms deep structure and surface structure come from Chomsky's transformational grammar.
Example : Fish / water

Model vs trends/Predictive vs descriptive

Predictive model -- linear operator, compare and predict the outcome.
Descriptive model -- trends, no predictive power.

If we change a parameter (e.g. X inside the viral loop) and predict that traffic will increase by 7%, and we end up getting that increase, then we know that we have a good model, which means we understand the underlying dynamics.

"natura non facit saltus" -- nature does not leap.


Statistics is a science that deals with noise and generalization.
Generalization requires prior knowledge: for any practical application, you have to know what the relevant inputs are. For example, if you want to forecast the price of a stock, a historical record of the stock's price is rarely sufficient input. You will need additional data about the company, as well as other economic data. Good generalization also requires an adequate sample size of the population.

Axes / dimensions of space vs points (instances) in that space

We often use graphs to get a picture of the relationships between metrics. A graph has two axes: the x (horizontal) axis and the y (vertical) axis. The axes of the graph are also known as the dimensions of the space. These axes correspond to the variables we are relating, and we need to understand the domain of the graph. Each point in the graph is one instance, and its position along the axes gives the values of the variables for that instance. Sometimes we don't know why two points in a graph are miles apart, or close together. It requires an understanding of the underlying dynamics to unravel the mystery behind it.

Tool vs Piece of art

Tool -- e.g. engineers like to build things people can use; the output of the tool depends on the input.
Art -- e.g. a movie: directors make a movie, and when it is done, they release it and that's it. It's up to people how to interpret it.


We are interested in deep structure, so we build models, make predictions, look at the axes of the space, and build tools to get the data.


From newspapers to mobile
Marshall McLuhan: "The Medium is the Message":
"The 'message' of any medium or technology is the change of scale or pace or pattern that it introduces into human affairs."
"In a culture like ours, long accustomed to splitting and dividing all things as a means of control, it is sometimes a bit of a shock to be reminded that, in operational and practical fact, the medium is the message. This is merely to say that the personal and social consequences of any medium - that is, of any extension of ourselves - result from the new scale that is introduced into our affairs by each extension of ourselves, or by any new technology."

  • Radio imitating ?? … and then [way before my time. McLuhan? Kevin Kelly talk at Web2.0 Summit?]
  • TV imitating Radio… and then? [before my time]
  • Web imitating TV… and then? Interaction, democratization
  • Mobile imitating Web… and then? Location, data collection

Ads evolution

Stage 0: Traditional

  • Unique users
  • Time on site

Stage 1: Individual

Stage 2: Social

  • Sharing
  • Forward to friend A
  • What does friend A do with forward
  • Wall post
  • I have this problem too
  • I like
  • I use this

Linus Liang's Presentation - App Development on Facebook + iPhone

Slides are available here.

Linus took the class 3 years ago. He created a Facebook app company and sold it. He also started an iPhone app company. He started creating Facebook applications around summer 2007, at which time the Facebook platform was still fairly new.

Linus saw the potential of building a business on top of this platform since for the first time one could have such easy access to ~two hundred million users, including their data, their social connections, etc.

Linus framed his talk in the PHAME framework as follows:

  1. How to get a lot of users quickly?
  2. How to scale the app by orders of magnitude?
  3. How to get an app compelling enough to get a huge amount of growth really fast?

He met a physicist friend who brought a model of viral spreading from physics and put it in the context of Facebook/e-business. The model assumes that users of a service invite their friends to use the same service. It produces an indicator, called the viral factor, which measures the virality of an app in the viral loop. A viral factor > 1 means the app is viral.
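The notes do not give the formula behind the viral factor. A common formulation, assumed here, is the average number of invites sent per new user times the fraction of invites that convert into new users. The sketch below uses that assumption, with hypothetical function names and numbers, to show why 1 is the threshold.

```python
def viral_factor(invites_per_user: float, conversion_rate: float) -> float:
    """Assumed definition: new users generated by each new user."""
    return invites_per_user * conversion_rate


def simulate_growth(seed_users: int, k: float, cycles: int) -> list[float]:
    """Total users after each invitation cycle, starting from seed_users.

    Each cycle, the users who joined in the previous cycle bring in
    k times their own number.
    """
    totals = []
    new_users = float(seed_users)
    total = float(seed_users)
    for _ in range(cycles):
        new_users *= k
        total += new_users
        totals.append(total)
    return totals


# Hypothetical app: 10 invites per user, 20% of invites convert.
k = viral_factor(10, 0.2)
print(simulate_growth(100, k, 3))    # [300.0, 700.0, 1500.0]
print(simulate_growth(100, 0.5, 3))  # [150.0, 175.0, 187.5]
```

With a factor above 1 every cycle brings in more users than the one before; below 1 each cycle is smaller than the last and the total levels off.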

The following image illustrates the viral loop theory.


Another problem was how to create something that is compelling enough to make your app grow really fast. This brought Linus to several hypotheses.
Most of the people on Facebook are younger users. They want something fun and colloquial, something that appeals to their egos. All the hypotheses revolved around this idea.

  1. People want to know about themselves
  2. People are concerned about what others think about them
  3. What application can force people to "annoy" their friends by sending invitations?

Linus came up with a lot of ideas revolving around the hypotheses: You're a Hottie, My Sexy Friends, Trendsetter, Naughty Friends, etc. The one that really did well was You're a Hottie, which is basically about knowing how hot your friends are. One important point: since technology is becoming increasingly easy to use, it often makes sense to throw many ideas/apps into the ring and see which ones do well.

Actions:
  1. Varying placement of buttons
  2. Varying text/reframing
  3. Focusing on performance
  4. Figuring out what the users want to get out of the app

Metrics:
  1. User metrics: time spent, returning users, new users
  2. Application metrics: number of notifications + invites, number of emails sent, most clicked links, most used features, best text strings
  3. Performance metrics: slowest queries; CPU, disk, RAM; time to send notifications + invites
  4. Relationship metrics: conversion rates of notifications + invites and of emails, avg. # of relationships

Experiments:
  1. Test often using A/B testing and focus on the short term
  2. Build a stickiness engine
  3. When test results show only marginal differences, you sometimes just have to go with your hunch

Linus continues the talk about the Facebook and iPhone app development process, and what he does with babies.

Hotties (Facebook)

You're a Hottie was simple, easy to code, super fast in terms of performance, and became one of the most popular applications on Facebook. It was successful because it had better technology than the others.

Lesson learned: if you want to be in this game, you need to annoy people. Facebook's request and notification system was overwhelmed because of the popularity of the app, and Facebook had to shut them down.

Mafia (iPhone)

The game has been around forever, but it's as popular as ever -- people don't change.
The whole purpose of the game revolves around recruiting other users. It provides incentives (in the form of points) for users to recruit other users in order to eventually become the biggest mobster.


After leaving the world of apps, Linus would like to do something else. Embrace Global is Linus's latest project, which uses his engineering and design skills to do something good.

Jia Shen's Presentation - Ecosystem and social applications

Slides are available here.

Jia Shen is the CTO and founder of RockYou, one of the leading app developer companies for social networking sites.

The Internet has changed the way software is used. In the old days, (desktop) applications were straightforward and closely tied to operating systems. More and more, application software is becoming operating system agnostic and people build applications over the net. These applications can be accessed from anywhere and users are no longer tied to a particular computer or operating system.

People started to create web applications, but unfortunately a good feature is not available on every web site. RockYou was founded on an idea similar to PayPal, which was built on top of eBay and can now work across any type of platform: games, finance, etc.

Social networks are now a huge market. In 2005, the biggest social network was MySpace, but at that time it was hard for third-party developers to create applications on top of it. By 2008, almost all social network platforms had started to open up and allow third parties to build applications on top of them. Social networks have become ridiculously large and allow for a multitude of business opportunities.

What makes social networks so attractive? The real reason is that it is so easy to acquire users. Users are already centralized and can be easily accessed. RockYou gained 1M users after posting its first application on the MySpace bulletin board. Facebook has a user base of 200 million (and counting) and is dominating the social apps field now. RockYou decided to develop applications for Facebook in part because Facebook users hardly ever uninstall apps (about 0.1% uninstall rate). As a consequence, apps that already have a lot of users who are not active can re-engage those users through Facebook's powerful notification system. Example: LivingSocial; when they created a new news channel, it became a top app by monthly active users within weeks.

Yelp is a prime example of an application that is solid and has its own login. If Yelp were to be created today, it would probably be on top of social networks.

Some early social networks:

These social networks were popular in the early days, but unfortunately, they did not allow third-party developers to develop applications on top of them.
When Facebook launched its platform around May 2007, the company differentiated itself from the other social networks by allowing third-party developers to access its users' social graph and build applications on top of Facebook's platform.

One specific thing that sets Facebook apart from MySpace is that Facebook has the ability to notify users.
The cool part about notifications is that you can go back and re-engage the users. This is very valuable, because it lets you tell the users an app has already acquired about new services/apps.

In general, Facebook makes sure that:
  1. Developers have access to the social graph
  2. Apps have the ability to message users in long-term and specific ways
  3. It is relatively easy for people to make money from their apps

The platform led to rapid Facebook growth. MySpace and Facebook launched at around the same time, and MySpace was dominant before the Facebook platform. Facebook overtook MySpace after launching its platform. The rapid growth of Facebook over MySpace can be seen in the following chart.


Apps can be viewed as Web Sites

In web 2.0, the app space, on Facebook specifically, is growing at a faster rate than the number of web sites did during the web 1.0 era. That means, from a monetization standpoint, it is a better opportunity. It is better to build a business on top of a platform that has already acquired a large user base than to build a destination web site by yourself.

Overview of Social Networks

The current landscape of top social networking sites:
  • Facebook: currently the biggest social network and it is dominating globally.
  • MySpace: previously the biggest social network in the US, but its market share is shrinking. Although it has a presence in Japan and China, the majority of MySpace's users are in the US. Users can be monetized easily on MySpace, but it is 4 times harder to grow.
  • Hi5: popular in countries/regions with Spanish-speaking users. It is rapidly declining in traffic. It has potential for international monetization, but it is not a place to go for US traffic.
  • Orkut: big in Brazil; used to be big in India before being overtaken by Facebook
  • Friendster: has a big market in South East Asia (Philippines, Malaysia, Singapore)
  • Bebo: big in the UK, but Facebook is starting to take over its market share too
  • Xiaonei: largest social network in China
  • Mixi: big in Japan; the first social platform with mobile integration, since 65% of the users only browse the site on the phone


Parasitic relationship
App traffic tends to grow linearly with that of the host social networking sites. Apps live on something already built and are thus easier to build and maintain (e.g., no need to have their own email or authentication systems). App developers instantly gain access to a large user base, e.g. on the Facebook platform. In a sense the relationship between apps and their hosts is also a symbiosis, where both benefit: app developers drive traffic to social networks because they increase the networks' attractiveness to users.

The following graphs illustrate how RockYou traffic growth is proportional to that of Facebook and MySpace.

Other example: PayPal / eBay
There is a strong relationship between the two, yet eBay is not the only place where you can use PayPal. PayPal was initially built on top of eBay and has ended up as a payment application used on different platforms all over the web.

Social Apps

Facebook and MySpace combined have more advertising inventory than Yahoo does, but their monetization is lower. If social networks could do targeting the way Google does with AdSense, Facebook would become gigantic. Google knows how to do targeting; unfortunately, targeting traffic the way AdSense does it does not work on social networks. CPMs on social networks are consistently poor. This is where social apps fit into the picture: they bring together the two worlds of CPM and social networks.

How so? If we take a look at Yahoo! Homepage, it has a bad CPM. You can not target the users because you don't know who they are, what they like. This is why Yahoo! created verticals like Yahoo! Auto, Yahoo! Finance to get more specific users to target for better monetization. A car dealer would rather buy ads on Yahoo! Auto to target those who are looking for cars than buying ads on Yahoo! Homepage. Yahoo! Auto has higher CPM than Yahoo! Homepage.

Social apps have the best of both worlds: a lot of users and the ability to target better within a social network. Advertisers can better target particular users based on their interests and the apps they are using. This gives social apps a higher CPM than the social network itself. The hard part is not the technology side, but figuring out what application to build.

"Platform" concept/idea

Software turned into a commodity, and value moved "up the stack" to services delivered over the web platform. Companies are opening up their sites and making it easy for app developers to acquire users.

OpenSocial as a platform

OpenSocial is a standard for creating apps for social networking sites, promoted by Google, Yahoo!, and MySpace as a competitor to Facebook.
Ning is a platform for users to create their own social websites and social networks.
OpenID eliminates the need for different usernames across different websites: you can use the same ID to log in to AOL, Blogger, Flickr, Yahoo, WordPress, etc.

Who owns the web? Facebook has scary potential to do this. The social graph war is going on. Companies strive to become "the authentication system of the web" and whoever wins it will rule the world. Imagine if Yahoo! Mail uses Facebook Connect to authenticate users. Facebook would control the flow of information.

Yahoo! Mail can be seen as a dormant social networking site: it has a high stickiness factor, and the company has opened up its "platform" to third-party developers, so it's possible to develop new apps on it, e.g. soccer mom apps.

Andreas's questions to Linus and Jia

  1. What has changed in the last 3 years?
    Linus : Apps (Facebook apps, Amazon API, etc.). People are more open with their information (Twitter)
    Jia : The evolution of social graph
  2. What has not changed in the last 3 years?
    Linus : People haven't changed. What was popular 20 years ago is still popular now (Mafia game)
    Jia : Monetization
  3. What will change 3 years from now?
    Linus : Mobile would be very big. Data would just become more freely shared
    Jia : MySpace is going to disappear, Facebook will dominate
  4. What would you have done differently?
    Linus : Less spammy
    Jia : Grow the company faster, land grab
  5. What advice can you give?
    Linus : Make sure you are always on the same page with your business partner(s)
    Jia : Don't wait on things; build things really fast, and build something that can count in real time, fast

Initial Contributor: Tirto Adji, Stephane Colas, Chris Cinelli