Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
STATS 252, Stanford University, Spring 2009
Class time: Monday 2:15 - 5:05 pm
Class location: Gates B01

Class Date: May 4, 2009
Audio: part 1, part 2
Transcript: TBD

Initial Contributors:
Todd Sullivan
Roshan Sumbaly [rsumbaly at stanford edu]
Shakti Sinha
Aravind Narayanan
Emile Chamoun [echamoun at stanford dot edu]
Maya Choksi [mpchoksi@]

Class 5

Part 1: Professor Weigend's Lecture

I) Segmentation:

In the past, people thought of segmentation as “clustering” data into smaller subsets.

What is Clustering?

Clustering data: A statistical method that partitions a large data set into smaller subsets such that points within the same cluster are more similar to each other than to points from different clusters.
One commonly used clustering algorithm (k-means clustering) is explained in more detail in Professor Ng’s CS229 lecture notes.
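
To make the definition above concrete, here is a minimal sketch of k-means on 2-D points. It is not taken from the lecture or from the CS229 notes; the function name `kmeans`, the use of the first k points as starting centroids, and the sample data are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups (Lloyd's algorithm).

    Assumption: the first k points serve as initial centroids; real
    implementations usually pick the starting centroids at random.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
sample = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, groups = kmeans(sample, k=2)
print(centers)
print(groups)
```

The algorithm only groups points that are close together; deciding whether the resulting groups are actionable is the separate question raised below.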

In the past, companies have hired people to conduct cluster analysis on their users’ datasets in an attempt to collect relevant information about the demographics of their user base.

Question: Suppose you know that 70% of people using your site are female. Is this actionable?

Two alternatives:
1) If you are 100% sure that you have no way of changing the constituencies of your site, then publishing more content targeted at females might seem like a good strategy.

2) However, a marketer confronted with this situation will face the following dilemma: If my goal is to maximize my earnings, should I spend more money trying to attract more females to the site or should I target those underrepresented males?

In the past, when confronted with the second alternative, people found it extremely difficult to answer this question since they could not conduct experiments to test their hypotheses, especially when the variables were more complex than the gender distribution of a website’s user base.

What is Segmentation?

Tip of the day: When someone asks you about segmentation, be careful and try to figure out what they have in mind since they might have a different idea about segmentation.

It is important to distinguish between two different types of segmentation:

1) The first type of segmentation consists of dividing users according to their common characteristics as measured by variables that do not change much. E.g. zip code, phone number, demographics, psychographics, etc.

This type of segmentation is becoming obsolete since it is weak compared to the other alternatives available to companies. For example, by finding out people’s interests based on their click behavior, we can target people much more precisely.

2) Conditioning on variables that make sense. Find out what the good variables are, form hypotheses and address them using the PHAME framework. This type of segmentation, which values experimentation over demographic data, is very powerful.

Google Trends: Online tool developed by Google that helps to visualize important characteristics of the domains where events happen. This tool can certainly help to form hypotheses, but experiments still need to be run to determine what works and what does not.

II) Who do the Social Data Revolution and Marketing 2.x matter to?

Three groups of people will be interested in the impact of the social data revolution on their work. Understanding what these people want and being able to deliver will ensure success. Getting it wrong the first time can greatly diminish your chances of success.

A) Top executives who want to set up a web/digital strategy for their company:

Concern: Why do executives resist the social data revolution?

1) Lack of interest in technical skills: Top executives do not share the enthusiasm technical people might have for data. They are more interested in the direct impact on their business.

How to address it:
  • Focus on the business, deemphasize the data! Tell the top executives the impact of the actions you are suggesting on the business.
  • Give them small, concrete tasks to carry out (on the order of zero dollars and a couple of hours!): people are much more likely to invest their time in low-cost approaches. E.g. searching a blog search engine such as Technorati to learn what customers are saying about the company.

2) Concern over capital expenditures: Executives have no idea how easy and cheap it is for them to deal with data. They often believe it is expensive to build simple things. E.g. Executives at Best Buy decided to set up an online wiki. They consulted their technology provider and were told that they could have five online stores in half a year for a cost exceeding one million dollars. Best Buy then contacted an IT company that promised the wiki would be created over a single weekend.
How to address it: Convince executives that online experimentation and storage are very easy and extremely cheap. For instance, a readily available online tool such as Google Analytics could provide some businesses with insights about their user base.

3) Have not realized the shift in the nature of businesses: Some executives still think of their business as a transaction-based business. They often rely on traditional offline methods to get feedback from their customers. Traditionally, businesses have employed consulting companies to gather information about their customers’ opinions. However, this method is very noisy: the consultants only communicate a fraction of their findings, the executives only understand part of what they were told, etc. As a result, a significant part of the information is lost. In addition, offline feedback loops take a very long time.

How to address it:
  • Executives should think of their business as a relationship business and try to account for future transactions as well. Businesses are no longer entirely based on transactions; there should be a relationship between companies and their customers. Companies should be attentive to their customers’ needs and try to address them; people will tell companies what they think if they are asked for their opinion. E.g. Suppose a customer is dissatisfied with airline X and complains on a social network about the service they received. The customer will be very impressed if an employee from another airline contacts them to discuss the problem and offers to help. Social networks are a great way to connect with customers and strengthen the company-customer relationship.

  • It is important for executives to realize that there is a core difference between online and offline methods for gathering customers’ feedback. Web-based methods tend to be much more cost effective and powerful than offline methods. Online experimentation is very cheap (basically free) and the data collected on the web helps establish a direct relationship with the clients, thus eliminating the risk of significant information loss. However, designing suitable metrics and determining the relevance of different metrics and their relation to each other is a critical step towards the success of the experiments.

4) Loss of Control: Executives often complain that using online methods might result in a loss of control over the company information that customers are sharing.

How to address it: Since the day they started their operations, companies have already lost control over “what is being said about them”. It is a fact that customers of the company are sharing information online. The company can choose to learn about that information and try to address its customers’ concerns, or simply bury its head in the sand and pretend that the information does not exist.

5) Traditional methods: Our education emphasizes experiments that were optimized for a world where data is expensive. Executives were taught to deal with data in traditional ways that demand large capital expenditures. They are afraid to switch from strategies that appeared to be successful over the years to new methods that have not been tested yet.

How to address it: It is not the nature of the experiments but their relevance that matters. The distinction between the two is hard to emphasize in a standard course. A typical statistics PhD is not what companies need to deal with large databases. Often, CTOs do not realize that most of their employees are technologically aware and could conduct simple web experiments in a very short amount of time at almost no cost to the company.

6) Unpleasant revelations: Executives might be concerned that conducting experiments might reveal to them that the policies they believed to work up to now, in fact, do not. They also realize they might lose their job as a consequence.

How to address it: Running experiments is fast and cheap. Provided the metrics were well designed, executives can quickly figure out the customers’ issues and try to address them in a timely manner. This will have a significant positive effect on the company as a whole and they might be rewarded for it.

B) Marketing executives who want a direction beyond social media.

Before engaging in the social data revolution, marketing executives need to understand how the essence of marketing changed:
  • The 4 P’s of marketing (Product, Place, Price, Promotion) no longer rely on the same old methods that seemed to work well in the past. Traditionally, product marketing relied on long feedback loops (e.g. hiring a consulting company to gather information about customers’ needs and opinions about the products) and the organization of focus groups (selecting a group of people to understand their attitude towards a service, product, etc.) to get data about customers’ satisfaction.
  • Marketers have not yet realized how technological advancements have impacted their work. They still think about optimizing one message and trying to reach a large number of people, as opposed to running fast, cost-effective experiments to figure out the customers’ needs and address them. E.g. trying to connect with people on social networks to understand their needs (such as reaching out to customers complaining about competitors’ products and services and offering help) works much better than coming up with nice brochures and sending them out.
  • Top marketing executives don’t know what is promising for their business. For instance, some executives want to go into Second Life but have no idea what impact it would have on their business.
  • Some don't realize the importance of the fifth P: Platform. Today there are over 2 billion searches performed daily on the internet and 81% of all users browse the internet to find information about products or services that they are considering buying. Marketing executives must understand that this fifth P is playing an increasingly dominant role in their work. Recommender systems are extremely important and a good portion of that comes from the Internet.
  • Marketing executives need to realize that they are not only instrumenting their site but they are instrumenting the world. Advances in technology that show this are:
    • Billboards can now monitor the sorts of people who stop and look at them, and can collect data on the viewer demographics, the average time spent viewing, etc.
    • In-store displays have infrared sensors to track how many people pass by their sign and how many people stop to look.
    • See The Economist's "Machines That Can See"

C) Visionaries who want to know where the company is headed

  • They want to know what sort of information is already available and what can be done with it. This group of people is probably the most interested in and most receptive to the at-large insights that you can draw from the data and that will help shape the long-term model for the company.
  • How can they give back to their users and improve their services? These sorts of questions provide an overview of the company's business model in terms of how they can maintain their customer base and improve their services to stay on top and attract additional customers.

Exercise: How would you answer this question? Should businesses become involved in Second Life?
Selected Answers:
  • Second Life is a fad; it did not replace going to stores!
  • Although the press loved Second Life when it first came out, it does not seem to be what people are looking for; people like to interact with real people (through social networks, for example), not avatars. However, money can be made on Second Life with carefully designed experiments.
  • Recommender Systems have shifted from product-based to social-based. Recommender Systems that emphasize the product as opposed to customers’ opinion no longer achieve successful results. 14 seconds is the average time a catalog gets looked at in the United States! Recommendation systems can make a big impact on the sales of a company and should be carefully designed. For instance, Amazon estimates that its recommender system is responsible for 20% of the company’s revenues.


Part 2: Pre-Facebook Presentation Questions with Itamar Rosenn

About Itamar Rosenn

Itamar Rosenn was born in Jerusalem and graduated from Stanford in 2005 in Symbolic Systems.

Facebook Lexicon

Similar to Google Trends, Facebook Lexicon allows you to follow language trends across Facebook, such as the usage of words and phrases on profile/group/event Walls. You can compare up to 5 phrases at a time, with each phrase being one or two words long. The current version, which allows users to input any words they are interested in, outputs trend graphs for those words.


The new Facebook Lexicon does not yet allow users to input their own topics, but provides additional functionality. The example topic used here is "dancing". For any topic, the new Facebook Lexicon provides the views Dashboard, Demographics, Associations, Sentiment, Pulse, and Maps.

The Dashboard shows an overview of the topic's popularity among users, and a breakdown by gender/age. The Demographics tab shows the number and percentage of posters by gender, age, and country. The Associations tab shows other phrases that frequently occur near the topic (in this case "dancing") and plots each phrase on a graph to show the average age and gender of users that use the particular word alongside "dancing". You can use the checkboxes next to individual words to include them in or remove them from the graph.

The Sentiment tab plots across time the percentage of mentions about the topic that are positive (versus negative). Thus a sentiment of 86% means that 86% of opinionated mentions are positive while 14% are negative. The Pulse tab shows words that frequently occur in users' profiles under interests, music, movies, TV shows, and books. Finally, the Maps tab shows where users that are talking about the topic are located, and currently covers the US, Canada, and England.


Facebook Pages: Coca Cola

Coca Cola is a great example of a company embracing these new social technologies and leveraging its fan base. In 2007, Professor Weigend was at a conference where Coca Cola's CMO, on seeing someone use the Coca Cola name on the web, reacted with "We will sue them!" More recently, Coca Cola found out that a successful Facebook Page for Coca Cola was run by two fans. Instead of being hostile, the company flew the two fans out to the Coca Cola headquarters, gave them a tour, etc.

What kind of Facebook Page metrics are companies and celebrities asking for?

Big companies are not sophisticated in the metrics that they want. They are interested in high level signals, broad distributions. Example: playing Yao Ming and Shaq against each other by reminding each of them when the other has more fans and suggesting that they should invest more money and publish more advertisements for their respective page. Example 2: pages for movies. They are most interested in the Lexicon-type feedback and seeing increased chatter about their movie. Are people actually talking about the movie?

A few are interested in surveys and experiments to see if users are able to recall a brand or fact about a brand some time after seeing an advertisement.

Facebook Invites Question

Suppose you do not use Facebook and you receive 3 invites from friends to join. Assume you know the friends equally well, like them equally as much, and all circumstances are identical except for one thing. In Scenario 1, all 3 friends are from the same network on Facebook. For example, you go to Stanford and all 3 of your friends are part of the Stanford network. In Scenario 2, all 3 friends are from different networks. For example, one friend is from Stanford, one is from Singapore and one is from Oxford where you spent a semester abroad. Would you be more likely to join Facebook in Scenario 1 or Scenario 2?

Student Responses

Around 66% of students were for Scenario 1. Opinions for Scenario 1 included:
  • All 3 friends on the same network provides a critical mass, or enough reason, for me to join Facebook. If they are all on the same network, they probably know each other. All 4 of us knowing each other and interacting is more valuable to me than 3 disjoint relationships where the friends only interact with me and not with each other.
  • 3 people in a network that I am in feels very relevant to me.
  • 3 people from the same network are probably from the geographical network that you belong to in the real world. The online + real-world interaction would be more beneficial to me than online-only interactions.
Opinions for Scenario 2 included:
  • If the friends are all from the network where I am in the real-world, then I can just interact with them there. I would be more interested in using Facebook to interact with those that are not close geographically, thus I would prefer Scenario 2.
  • The fact that you are receiving invites from 3 disjoint networks shows that Facebook is large and not only popular amongst one group. It shows that many of your friends are likely on Facebook across multiple disparate networks that you can interact with.
  • It is weird for a website to be recommended by people in distant geographical areas. 3 random networks is a better indication that the website has reached critical mass and is not just a local phenomenon.

Facebook's Answer

Facebook looked at a broader question. Instead of the same network vs. three different networks, they looked at similar interests/characteristics vs. dissimilar interests/characteristics, where a user's network was one characteristic. Other characteristics include friends, age, etc. Facebook ran a correlational (observational) study, not a controlled experiment. They found that shared characteristics had a significant positive effect.

Receiving three invitations from users that are part of the same network gives you a feeling that there is a cohesive social experience to participate in. This study motivated the creation of the Invitation Suggester.

Part 3a: Presentation on Data Science at Facebook

Itamar Rosenn from the Facebook data team talked about work being done on data collection and analysis by his team.

Facebook Data

Facebook has over 200 million active users, with several hundred thousand new users joining it every day. With this huge, rapidly growing user base, and with new forms of interactions with users, Facebook continuously generates a huge amount of data. With this data, Facebook is in a position to extract information that few other companies can match.

The data on Facebook can be categorized into three types: the first is the social graph formed by the users of Facebook, the second is the data generated through the behavior of users on the site, and the third is the social content posted by users, such as photos, videos, etc.

The scale of data on Facebook can be estimated from the following: Over 100 million users visit the site every day, the average user has over 120 friends, and the information per user has hundreds of dimensions. Users interact with hundreds of thousands of applications on Facebook and off the site. In addition, users can perform various interactions on the site. Considering the number of users, the data generated through wall posts, comments, 'like's, picture and video views etc. is tremendous.

The data collected each day adds up to over 2 terabytes, while dozens of terabytes of data are read and written. And these figures do not include the photo and video data. To store, analyze and extract information from this data, Facebook uses specialized systems.

Managing Data at Scale

For efficiently managing the data it collects, Facebook uses the following systems:

HDFS / Hadoop

Hadoop is a free framework written in Java and designed for data-intensive distributed applications. It has been developed primarily by Yahoo, and is inspired by the Google MapReduce framework. This framework utilizes distributed systems to process data in parallel on a large number of machines. The Hadoop Distributed File System (HDFS) is designed to scale to petabytes of storage and works on top of the existing filesystems of operating systems. The largest Hadoop / HDFS cluster at Facebook has more than 2 petabytes of raw capacity.
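
As a rough illustration of the MapReduce idea that inspired Hadoop, the sketch below runs the three phases (map, shuffle, reduce) of a word count inside a single Python process. It does not use Hadoop's API; all function names and the sample posts are hypothetical, and a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    # Each key could be reduced on a different machine.
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical wall posts.
print(word_count(["like this photo", "Like this"]))
```

Because each map call looks at one document and each reduce call looks at one key, the work can be split across machines without the pieces needing to communicate.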


MetaStore is a system for managing metadata. Instead of storing metadata as flat files, it stores it in database tables. With the large number of images, videos and other content, metadata management has a significant impact on the performance of systems.


The Hive Query Language is a SQL-like query language built on top of Hadoop and MetaStore. It has been developed by Facebook. One of the main advantages of using a system like Hive is that anyone with some background in SQL can easily access the data stored in the Hadoop clusters.

What does Facebook do with the data?

Facebook uses automated data analysis and machine learning techniques to extract information from the data it collects. It does behavioral analysis and also builds data-driven systems. This involves figuring out how to process the data and what kind of techniques to use, as well as building the tools and systems that make it possible to do all this. Some of these activities are listed below along with the motivations for them.

Behavioral Analysis

Behavioral analysis deals with determining the nature of user behavior, the nature of social norms, and how products affect users' preferences.

Product Health Metrics

Basic information like page visits is not sufficient to characterise how successful a page is. For example, a page which has fewer visits, but many more interactions can be generally considered to be more successful than a page with more visits, but where visitors do not interact much with the site. The data science team helps in the collection of such metrics and their analysis.
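
One simple way to express the comparison above is interactions per visit. The sketch below is only an illustration: the metric name, the page names, and the numbers are made up, and the notes do not say which metrics Facebook actually uses.

```python
def interactions_per_visit(visits, interactions):
    """Average number of interactions generated by one visit."""
    return interactions / visits if visits else 0.0


# Hypothetical pages: (visits, interactions such as comments or likes).
pages = {"page_a": (1000, 50), "page_b": (400, 120)}

rates = {name: interactions_per_visit(v, i) for name, (v, i) in pages.items()}
most_engaging = max(rates, key=rates.get)
print(rates, most_engaging)
```

Here the page with fewer visits comes out ahead because a much larger share of its visits lead to an interaction.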

Launch Evaluations

For a service like Facebook with millions of users, how can the site designers decide what their users will like? The answer to this question is almost always the same: experiment. The Facebook data team helps conduct such evaluations of new features to gauge how well they will be received by the users. Such experiments are generally performed on a small fraction of users chosen randomly. Many experiments of this kind were conducted by the data team, in collaboration with product management, during the redesign of the Facebook site. In addition to the pre-launch experiments, the data team also tracks the effect of redesigns and new features after launch. This can involve determining metrics, collecting them, analyzing them, etc.
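
A common way to pick "a small fraction of users chosen randomly" is to hash each user ID into a bucket, so that the same user always lands in the same group. The sketch below assumes this approach; the notes do not say how Facebook assigns users, and the function name, experiment name, and 5% fraction are hypothetical.

```python
import hashlib


def in_experiment(user_id, experiment_name, fraction):
    """Return True if this user falls into the treated fraction.

    Hashing the experiment name together with the user ID gives each
    experiment its own stable, pseudo-random split of the user base.
    """
    key = f"{experiment_name}:{user_id}".encode()
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < fraction


# Hypothetical experiment: show a redesigned page to about 5% of users.
treated = [uid for uid in range(10000) if in_experiment(uid, "new_home_page", 0.05)]
print(len(treated) / 10000)
```

The treated and untreated groups can then be compared on whatever metrics were chosen before the experiment started.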

Growth Modeling

This involves determining the parameters that affect the growth of Facebook and its communities. Finding which markets are good for Facebook to grow in, determining saturation and growth patterns, and identifying the markets where Facebook is having difficulty growing are some of the tasks carried out by the data team in this area.

User Churn Modeling

User churn modeling deals with finding the features that cause users to come back to the site or to leave it. Accurate user modeling helps the company determine what these features are, and then target different users based on them so that they visit the site more often. It helps in figuring out which actions taken by Facebook cause which types of users to visit the site, and how frequently. Such modeling also helps determine what turns users away, which is very important from the point of view of user growth and retention.

Production Incentives

There are some events which can cause users to post content more frequently. For example, if a user sees many of his friends comment on some photo, he is more likely to view the photo and comment on it. Similarly, comments and content targeted at the user are more likely to cause the user to take some action. Determining such events and the degree to which they cause a user to post content is what modeling the production incentives means. The usage data and site logs collected by Facebook can provide very good indicators of such incentives, and analyzing the data to find them is another thing that the data team does.

Content Diffusion

Content diffusion is concerned with modeling the way in which content is transmitted over the social graph. It helps in finding factors that influence the virality of content as well as factors that help create strong interacting networks. Content diffusion is covered in more detail in the later part of this writeup.

Data Driven Systems

Data driven systems try to model various concepts based on data collected by Facebook using machine learning and other techniques.

Ad CTR Prediction

Facebook generates revenue by running ads on different pages of its website. For any ad space, Facebook has to determine which ads to display out of the pool it has. One parameter that is important for optimizing the performance of the ad space is the Click Through Rate (CTR). The data collected by Facebook can be used to build models that predict the CTR for an ad based on parameters such as targeting, ad content, etc. The models consider things like user profile information, things users have talked about with their friends, characteristics of the other ads they have seen or clicked, etc.

PYMK - People You Might Know

People You Might Know is a section on the Facebook page that suggests users who can be added as friends. This section raises some interesting questions about which recommendations to show to users. For example, suppose there is a choice between two users to show as a recommendation. One has been on Facebook for some time and has a large network of friends, but most of his friends overlap with the friends of the user we are showing recommendations to. The other has just joined Facebook and has only a few friends in common with the user to whom the recommendation is to be shown. It might be more useful to show the second person as a recommendation, as it can have a much higher impact on the overall network, because it would connect the new user to the large network of the primary user. Doing this is also good for Facebook, because it helps connect and create new networks. The friend suggestion algorithm currently suffers from some problems. For example, it keeps showing a person as someone you might know even if you have not added him as a friend after many appearances. The algorithm is being improved to solve such issues.
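
A common baseline for this kind of suggestion is to rank non-friends by the number of mutual friends. The sketch below shows only that baseline; it is not Facebook's algorithm (which the notes do not describe), and the function name and the small friendship graph are hypothetical.

```python
def suggest_friends(graph, user, limit=3):
    """Rank people who are not yet friends of `user` by mutual-friend count.

    graph maps each person to the set of their friends.
    """
    friends = graph[user]
    counts = {}
    for friend in friends:
        for candidate in graph[friend]:
            if candidate != user and candidate not in friends:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Most mutual friends first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical friendship graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(suggest_friends(graph, "ann"))
```

A pure mutual-friend ranking favors well-connected users, which is exactly the trade-off discussed above: a newly joined user with few friends would rank low here even though connecting them may benefit the network more.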

Search Ranking

Facebook provides different types of searches – for users, pages, groups, etc. Ranking results in these searches so that the most relevant are delivered to the user is part of the work done by the Data science team. For computing the relevance, they try to determine what the user might be looking for based on his actions and other data associated with him.


With the new design of Facebook, there are two data streams on the main page. The first is the News Feed, which is similar to what the earlier version of Facebook had. The other, a new feature, is the Highlights section. It differs from the News Feed in that it does not present recommendations in strict chronological order. Elements of the News Feed do not stay on the user's screen for long as new stories arrive, while the Highlights can present information that stays on the page longer. Since the Highlights section occupies an important part of the user's home page, determining what goes there is important. Recommendations for this section are generated by finding the highest quality items from the News Feed based on relevance and other parameters.

Part 3b: Presentation on Maintained Relationships on Facebook

The question they tackled is whether Facebook helps people increase the size of their personal networks. They set themselves the task of studying the types of relationships people maintain on the site, as well as the size of these groups.

There are at least 3 different types of relationships:
1. People that you actually know, or you have met, in real life. This is a superset of the people who are your close friends.
2. People that you directly communicate with on a regular basis. This group is your inner circle of close friends and family, and can be as small as 5 people.
3. People that you "follow", or in whom you take an interest. These are the people whose updates you see in your News Feed, but with whom you don't have direct regular communication. This type of information consumption can lead to direct communication in the future.

The next step was to gauge the size of these different types of networks. Because it might be prohibitively expensive to measure the size of each individual user's networks, they selected a random subset of the users on the site, and observed their usage patterns over the course of a month. This technique is a common one, where many sites like Amazon test out new features or experiment on a random subset of users, just to measure the impact of a new idea.

The first measurement is the number of friends a person has on Facebook. Another important measurement is the size of the network of people with whom the user has regular reciprocated communication. This tells us that the user is close to these friends, and gives us an idea of the size of the user's core network. The third metric is "one way communication". This measures the number of friends with whom a user communicates once in a while, but to whom they are not really that close. The last metric is the number of people the user "tracks" on Facebook, that is, the friends whom the user wants to know about but does not communicate with on a regular basis. A loose definition for this kind of friend is a friend whose profile the user has visited at least twice.
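
The four measurements can be sketched as counts over a friend list, a message log, and profile-view counts. This is only an illustration of the definitions given above: the data layout, the function name, and the sample data are assumptions, and "one way" is interpreted here as any friend the user has written to.

```python
def relationship_counts(user, friends, messages, profile_views):
    """Count the four network sizes for one user.

    messages: list of (sender, recipient) pairs in the observed month.
    profile_views: {friend: number of times `user` viewed their profile}.
    """
    sent_to = {to for frm, to in messages if frm == user and to in friends}
    heard_from = {frm for frm, to in messages if to == user and frm in friends}
    tracked = {f for f in friends if profile_views.get(f, 0) >= 2}
    return {
        "all_friends": len(friends),
        "reciprocal": len(sent_to & heard_from),  # both sides wrote
        "one_way": len(sent_to),                  # user wrote at least once
        "tracked": len(tracked),                  # profile visited twice or more
    }


# Hypothetical data for one user.
friends = {"a", "b", "c", "d"}
messages = [("me", "a"), ("a", "me"), ("me", "b"), ("x", "me")]
views = {"a": 3, "c": 2, "d": 1}
print(relationship_counts("me", friends, messages, views))
```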

Interesting findings:

  • A typical user is passively associated with 2 to 2.5 times more people than with whom they directly communicate.
  • Social graphs are a great way to gauge the usage pattern of a user. This type of graph shows us in a graphical way the number and distribution of a user's friends. A point to note was that technologies like News Feed allow people to track many more people than they actively communicate with, since the graph of friends they follow is much denser than the graph of friends they actively talk with.

Another question Facebook wanted to find the answer to is "What motivates new users to share content on the site?" Some interesting stats are:
  • Less than half of new users upload photos in their first two weeks.
  • Less than a third send a private message.
  • Less than 30% write on either their own wall (status updates) or on their friends' walls.

We can see that by far the most common mode of content production is uploading photos, and hence they focused their study on this. As suggested by the PHAME framework, they made the following hypotheses:
1. Feedback makes new users post more content. A metric for this was the number of comments a user received for a posted item.
2. Increased distribution makes new users post more. Some metrics used for this was the number of times the content was viewed on someone's News Feed, as well as the total number of friends who viewed the item.
3. If a users friends post lots of content, users themselves will post more. A metric for this was the number of friends' photos this user viewed.
4. If a user is singled out in content that their friend produced, they will post more content. An accurate metric for this was the number of times a user was "tagged".

Even though the second hypothesis was considered weak, it was still included in the model. It can serve as a kind of control for validating the model, by showing that this variable has little effect on the response.

Facebook used both quantitative and qualitative methods. They measured the activity of new users during their first two weeks on the site, as well as the number of photos those users uploaded between their 3rd and 15th weeks. To get a more qualitative perspective, they also interviewed new users about their typical uses of Facebook and their impressions of the site.

One model included only users who uploaded photos in their first two weeks, while another model included all users, even those who did not post any photos in their first two weeks.
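The study design above can be pictured as one record per new user, with a predictor for each hypothesis and the later photo uploads as the outcome. The sketch below is only an illustration of that layout and of how the two model samples differ. The class name, field names, and function are hypothetical and do not come from the Facebook study.

```python
from dataclasses import dataclass

@dataclass
class NewUser:
    user_id: str
    # Predictors, measured during the first two weeks on the site.
    comments_received: int       # hypothesis 1: feedback
    feed_views: int              # hypothesis 2: distribution
    friend_photos_viewed: int    # hypothesis 3: friends post lots of content
    times_tagged: int            # hypothesis 4: being singled out
    photos_first_two_weeks: int
    # Outcome, measured between the 3rd and 15th weeks.
    photos_weeks_3_to_15: int

def split_cohorts(users):
    """Return the samples for the two models: early uploaders, and everyone."""
    all_users = list(users)
    early_uploaders = [u for u in all_users if u.photos_first_two_weeks > 0]
    return early_uploaders, all_users
```

Keeping the two samples separate matters because a predictor such as `comments_received` is only meaningful for users who uploaded something that could be commented on.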

Some overall inferences from the experiments are:
  • The number of comments a user received on a photo had no effect on future photo uploads.
  • But if a user received *no* comments on an uploaded photo, they were less likely to upload photos in the future.
  • The number of photo views led to only a small increase in future photo uploads.
  • If a user saw a friend post more photos, they were also more likely to post more photos.
  • If a user who was already uploading photos was singled out, it had no effect on that user's future photo uploads. However, if a user who had not uploaded photos was singled out, this had a positive effect, probably because it led the user to "discover" the photo app.

Overall, these results tell us that a user is likely to upload photos if they can see that their friends are also active users of the photo app (whether by commenting or by uploading new photos themselves). Thus, social learning is the main motivation for the user to produce content.

Part 4: Eric Sun on “Modeling Contagion through Facebook News Feed”

This part of the lecture dealt with a research project by Eric and the Data Science team at Facebook in which they tried to understand how ideas spread through Facebook. The aim was to model how content diffuses through a social network of friends. Results from this modeling would help in:
  • comparing with existing diffusion models in the literature
  • demonstrating the interconnectedness and diffusion properties of Facebook, which make it an ideal advertising platform.

What is a diffusion model (DM)?

  • DMs are used to forecast new product adoption, with emphasis on predicting the level of penetration (saturation) and the rate of approach to saturation. The theory considers how a new idea or piece of content, or the adoption of a new behavior, spreads throughout the market over time. Understanding this can be very helpful for a social media marketer.
  • There are strong parallels with epidemiology, the study of how a contagious disease spreads. An event triggers its existence and it infects one person. The first infected person (called the “influential actor”) then propagates the disease to other people in their network.
  • DMs also help us understand “influence and connectedness” at the level of the individual.
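To make "level of penetration" and "rate of approach to saturation" concrete, here is a minimal sketch of one classic diffusion model from the marketing literature, the Bass model. The notes do not say which model was discussed in class, so the choice of the Bass model is an assumption, and the parameter values in the usage note are purely illustrative. In this model, `p` captures adoption triggered from outside the network and `q` captures adoption driven by people who have already adopted.

```python
def bass_adoption(market_size, p, q, periods):
    """Cumulative adopters per period under a discrete-time Bass model.

    market_size: total number of potential adopters (saturation level)
    p: coefficient of external influence (e.g. advertising)
    q: coefficient of internal influence (word of mouth)
    """
    cumulative = []
    adopted = 0.0
    for _ in range(periods):
        remaining = market_size - adopted
        # New adopters: the remaining market times the adoption probability,
        # which rises with the fraction of the market that has already adopted.
        new_adopters = (p + q * adopted / market_size) * remaining
        adopted += new_adopters
        cumulative.append(adopted)
    return cumulative
```

For example, `bass_adoption(1000, 0.03, 0.38, 30)` produces an S-shaped curve that starts slowly, accelerates as word of mouth takes over, and flattens as it approaches the market size of 1000.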

Two theories of diffusion have been proposed in the literature, each with implications for social media marketers.
  • The first was proposed by Gladwell in 2000 [1]. It says that the way to reach mass popularity is to find the right “influential actors”, i.e. people with large influence and connectedness. A small number of such influential actors would give marketers quick access to a large user base.
  • The second was proposed by Duncan Watts and Peter Dodds in 2007 [2] and is called Contagion theory. It opposes Gladwell’s theory and says that influential actors are not really required for mass propagation. Propagation will take place as long as there are susceptible people in the network. These susceptible people are called “sheeple” because of their herd behavior.

Experiments on Facebook

Diffusion of content on Facebook takes place primarily through News Feed. For the experiments, only feed posts about Facebook Pages were used. Facebook Pages allow brands and businesses to create customized profile pages to promote themselves. A user can interact with a Page by becoming its “fan.” This action is then broadcast to all of the user's friends as a post in their News Feeds. Each friend then has the option to either fan the Page or ignore the message. The figure below shows an example of a News Feed as it appears on a follower’s page. Since most of these experiments were conducted in 2008, the feed is shown in the old Facebook design.


Ideally, without News Feed, the cumulative distribution function (CDF) of Page fan acquisition over time would have been linear. Because of the broadcasting of messages, however, the fan acquisition rate is highly irregular. Following are the graphs of 20 random Facebook Pages and their CDFs of Page fanning over time.
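The comparison between an observed fan-acquisition curve and the linear baseline can be sketched as follows. This is an illustrative example only: the function names and the input format (one day index per fan) are assumptions, not the format used in the Facebook study.

```python
def fan_cdf(fan_days, horizon):
    """Fraction of a Page's fans acquired by the end of each day.

    fan_days: day index (0 .. horizon-1) on which each fan joined
    horizon:  number of days observed
    """
    if not fan_days:
        raise ValueError("need at least one fan")
    counts = [0] * horizon
    for day in fan_days:
        counts[day] += 1
    cdf, running = [], 0
    for count in counts:
        running += count
        cdf.append(running / len(fan_days))
    return cdf

def max_gap_from_linear(cdf):
    """Largest distance between the observed CDF and a straight line,
    which is what steady, constant-rate fan acquisition would produce."""
    horizon = len(cdf)
    return max(abs(value - (day + 1) / horizon)
               for day, value in enumerate(cdf))
```

A Page whose fans arrive in bursts (for example after a story spreads through many News Feeds) shows a large gap, while a Page that gains fans at a constant rate shows a gap close to zero.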


The diffusion chains can be modeled as a tree in which the nodes are people and an edge links an actor to one of their followers. Followers are people who fanned a Page after receiving a post mentioning that the actor had done so. The tree grows as followers become secondary actors and their friends fan the Page. As the tree grows, links merge frequently because a node may have more than one parent. The result is a dense graph structure with clearly evident clusters of users.
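The structure described above can be sketched in a few functions: build the parent links, group users into clusters, and measure how long a chain each actor starts. This is a simplified illustration, not the procedure used in the study. The input formats, the function names, and the rule that a friend counts as a parent if they fanned the Page within a fixed `window` of hours beforehand are assumptions made for this example.

```python
from collections import defaultdict

def build_diffusion_graph(fan_times, friends, window=24):
    """Map each fan to the set of friends who may have influenced them.

    fan_times: {user: hour at which the user fanned the Page}
    friends:   {user: set of that user's friends}
    """
    parents = {user: set() for user in fan_times}
    for child, t_child in fan_times.items():
        for friend in friends.get(child, ()):
            t_parent = fan_times.get(friend)
            # A friend is a parent if they fanned earlier, within the window.
            if t_parent is not None and 0 < t_child - t_parent <= window:
                parents[child].add(friend)
    return parents

def find_clusters(parents):
    """Connected components of the diffusion graph, ignoring edge direction."""
    neighbours = {user: set() for user in parents}
    for child, ps in parents.items():
        for p in ps:
            neighbours[child].add(p)
            neighbours[p].add(child)
    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result

def chain_lengths(parents):
    """Length of the longest chain of followers started by each user."""
    children = defaultdict(set)
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    memo = {}
    def depth(user):
        if user not in memo:
            memo[user] = max((1 + depth(c) for c in children.get(user, ())),
                             default=0)
        return memo[user]
    return {user: depth(user) for user in parents}
```

Users with an empty parent set are the starting points of chains. Because a node can have several parents, the result is a directed acyclic graph rather than a strict tree, which matches the merging of links noted above.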

Following are some of the observations / results from the experiments:

  • Contagion theory supported: Initial “influential actors” made up around 15% of the fans in a cluster. This means around 15% of the fans found the Facebook Page of interest on their own, by searching on Facebook Search or clicking on an advertisement. The rest came through News Feed and fanned the same Page within 24 hours. This indicates that Gladwell’s theory, in which only a few users are responsible for the success of a Page, does not hold.
  • The clusters formed are very well connected, with some Pages having over 90% of their fans in one cluster. This shows the “interconnected” properties of Facebook.
  • The length of the diffusion chain produced by an actor shows their reach in the Facebook network. It was modeled on the basis of characteristics of the actor and the Page, using negative binomial regression with the predictor variables log age, gender, log Facebook_age, log activity_count, log friend_count, log feed_exposure and log popularity. Two interesting observations were made. First, the highest coefficient was for “feed_exposure”, i.e. the number of friends who saw your action in their News Feed. Second, when feed_exposure is varied, the effect of other variables such as friend_count on the diffusion length decreases.
  • The influential actors have the same characteristics (Facebook age, friend count, etc.) as their followers. The only difference is in activity count: influential actors seem to be more frequent users than their followers. Thus, from a social media marketer’s point of view, it is very difficult to identify the people who might help grow a diffusion chain.
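For reference, the regression mentioned above can be written in the standard textbook form of a negative binomial model with a log link. This is a sketch under assumptions: the symbols $Y_i$ (chain length started by actor $i$), $\mu_i$, $\alpha$ and $\beta_k$ are notation introduced here, and the notes do not report the fitted coefficients or the exact parameterization used.

```latex
\begin{aligned}
Y_i &\sim \mathrm{NegBin}(\mu_i, \alpha),
  \qquad \mathrm{E}[Y_i] = \mu_i,
  \qquad \mathrm{Var}(Y_i) = \mu_i + \alpha \mu_i^2, \\
\log \mu_i &= \beta_0
  + \beta_1 \log(\text{age}_i)
  + \beta_2\, \text{gender}_i
  + \beta_3 \log(\text{Facebook\_age}_i)
  + \beta_4 \log(\text{activity\_count}_i) \\
  &\quad + \beta_5 \log(\text{friend\_count}_i)
  + \beta_6 \log(\text{feed\_exposure}_i)
  + \beta_7 \log(\text{popularity}_i)
\end{aligned}
```

Because both the mean and most predictors are on a log scale, each such coefficient can be read as an elasticity: a 1% increase in feed_exposure changes the expected chain length by roughly $\beta_6$ percent, holding the other variables fixed. The extra variance term $\alpha \mu_i^2$ is what distinguishes this model from Poisson regression and lets it handle chain lengths that vary more than a Poisson model would allow.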

Part 5a: Homework and Course Feedback

Homework 2:

The purpose of homework 2 was to dispel fears that it's difficult to access data.

Google Trends:

Some commonly suggested improvements:
  • Increased granularity of time intervals. It would be interesting to see at what time of day people were viewing a page (whether at work or at home). Complication: the problem of time zones arises (when it's noon in the US, it's midnight in India).
  • Query refinement: being able to see where people ended up with their searches. What was the final source with which they concluded their search? This would be interesting because it would show how people learn.

Some reasons why Google doesn't give us this information:
  • Competition: Google doesn't want to give out so much information that it helps competitors.
  • Privacy: how much information is too much information? At what level of detail should information be released?

Yahoo! Analytics v. Google Analytics:

  • Yahoo!: more geared towards microsites. Data has lower latency, so you can get information more immediately. Geared towards helping you optimize your sites.
  • Google: custom reporting. Latency is not great, with a 24-hour delay. Some companies wouldn’t use it because they wouldn’t want to give away their data to Google.
  • Both are good sites but have different approaches to displaying similar information.

Yahoo! Pipes

Purpose: To show that you don’t have to be an expert programmer to get at important information.

RSS Readers

  • Class response: Fewer than half of the people in the classroom use them. Those who do use them to look at lots of things every day, since RSS compiles everything in one place.
  • Benefits: Some RSS readers learn from your preferences and give suggestions that suit your reader profile. They are able to do this by measuring how much time you spend on different articles. Given our attention data and the world’s data, we can do a much better job of surfacing the things that we are actually interested in seeing.
  • Social element of reader community: A reader can measure the aggregate time a community of readers spends on a specific article. There is the individual attention score that you receive and the shared attention score (the aggregate attention score of your group of friends). Shared-attention items are bubbled up.
  • Conclusions: atomization of information versus clustering. RSS readers do not focus on artificial intelligence techniques; their techniques are based on real social data and individuals.
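The individual and shared attention scores described above can be sketched with a simple aggregation over reading time. This is an illustrative example only: the log format, the function names, and the use of total seconds as the score are assumptions, not a description of how any particular RSS reader works.

```python
from collections import defaultdict

def attention_scores(reading_log, user, friends):
    """Compute per-article attention for a user and for their friends.

    reading_log: list of (reader, article, seconds_spent) tuples
                 (hypothetical log format)
    Returns (individual, shared), each mapping article -> total seconds.
    """
    individual = defaultdict(float)
    shared = defaultdict(float)
    for reader, article, seconds in reading_log:
        if reader == user:
            individual[article] += seconds
        if reader in friends:
            shared[article] += seconds
    return dict(individual), dict(shared)

def bubble_up(shared, top_n=3):
    """Articles with the highest shared attention, highest first.
    Ties are broken alphabetically so the order is deterministic."""
    return sorted(shared, key=lambda a: (-shared[a], a))[:top_n]
```

With this layout, an article that a user has never opened can still be recommended if their friends collectively spent a lot of time on it.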

SDR Survey:

25 questions - answers will be compiled into a Google Doc for class viewing

Part 5b: SDR Update

Team rankings can be found here
The graph of the total number of fans over time was deleted.
Comments: Currently "This OR That", "Would You Rather", and "Gossip Closet Club" are the top ranked teams. According to the graph of their total number of fans over time, "This OR That" started a trend of strong growth early on, which has put them at the front. "Gossip Closet Club" has seen increased growth in the last week (since 4/28) and "Would You Rather" has just started to see high growth (starting 5/3). Other top performers include "That's What She Said Photos".
Questions: Will any page be able to catch up to the fan count of "This OR That", given that they started faster than any other team? Will "Stanford FML", with the second highest number of fans, be able to improve in other areas to catch up to the #1 ranking of "This OR That"?
  • "Stanford FML" seems to be dominating the number of Page Views with some rather large peaks. Despite their cyclical page view history, their fan growth remains relatively steady, which suggests that their fans return to the Page at a high rate.
  • "That's What She Said Photos" and "National MS Society" had other significant peaks but their ranking remains at 5 and 7 respectively.
  • "Gossip Closet Club" was on the rise when this graph was created and now they are ranked number 3.
  • "This OR That" seems to have the most consistent number of page views; however, it has also seen a steady decline since 4/17.

Part 5c: Pointer to

million student march

by Robert Clegg, Founder of the :

The million student march will take place in a virtual world to enable students to have an impact on the direction of the future of education. Nowadays, society is experiencing a distrust of funded education. There is an imminent need to search for new ways to fund education, perhaps by building sponsored relationships with brands students trust.
The million student march will consist of a social network combined with a gaming platform. Its goal is to leverage the value in the network in order to start a new paradigm in the future of education. The project will culminate with a large logon event that will connect millions of students. Students interested in helping to build the platform should contact Mr. Clegg directly. To learn more about the project, send your questions to

[1] Gladwell, M. 2000. The Tipping Point.
[2] Watts, D. J. and Dodds, P. S. 2007. Influentials, Networks, and Public Opinion Formation.