Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
STATS 252, Stanford University, Spring 2009
Class time: Monday 2:15 - 5:05 pm
Class location: Gates B01


Class 6: LinkedIn

May 11, 2009
audio: podcast

Part 1: mp3 transcript
Part 2: mp3 transcript

Guest Speakers

Reid Hoffman

Reid Hoffman is a co-founder of PayPal and currently runs LinkedIn, a networking site for professionals.
Reid_Hoffman.jpg

DJ Patil

DJ, who used to be a professor of Concrete Mathematics at the University of Maryland, is the Chief Scientist and Sr. Director of Product Analytics at LinkedIn.
DJ.jpg

Reid and DJ discussed Data Analytics and how LinkedIn uses Data to build Products. This page summarizes the talk.

What Is LinkedIn?


LinkedIn, in summary, is a business-oriented social networking site used mainly for professional networking. It was founded in December 2002 and launched in May 2003. As of May 2009, it had more than 39 million registered users spanning 170 industries.

Networks

Networks are Information and People Representation Systems. They provide Reputation Systems that help
  • predict the reputation of a person as a likely expert in a field,
  • estimate both objective and subjective measures of reputation,
  • determine which people are good at something,
  • determine which people can be trusted, and
  • determine which sources of information can be trusted.

Applications of Professional Networks

  • Hiring: this is very important, but there is much more than just this.
  • Helping people make transactional decisions based on judgements of expertise.
  • Examples include hedge funds using LinkedIn to find experts for trades in the market, to source a deal, and to do all the reference checking on the deal.

Basic Challenges involved in building a network

  • Getting people: user #1 is valuable; user #1 and user #2? Not valuable. User #3? Not valuable. The key is to get enough people that the network becomes valuable.
  • Extracting data and building useful products.

The penalty of lying in public

A good network centers on good data. LinkedIn has an inherent advantage because the data it collects tends to be intrinsically accurate, for the reasons below.
  • Key concept: There are fewer factual inaccuracies on a person's LinkedIn profile with 10+ connections than there are on that person's resume. This allows more meaningful analytics to be drawn from LinkedIn profiles.
  • Lying on LinkedIn ~= Lying in public
  • Lying on your resume ~= Lying to a very limited audience (*hush *hush)
  • LinkedIn profiles tend to be much more honest than private profiles on other job websites such as Monster.com.
  • Takeaway: The potential for getting caught is a powerful motivator for telling the truth.
  • Kim Isaacs from Monster.com talks about lying on your resume


Analytics

What is Analytics?
The simplest definition of Analytics is "the science of analysis". A more practical definition is how an entity, typically a business, arrives at an optimal or realistic decision based on existing data.

Analytics is concerned with identifying and extracting useful data to build useful products.

KHAAA...N.png


KHAxN, x = 1 to 100

It is possible to come up with interesting experiments to gather data, but it is difficult to make sense of that data and to build useful products from it. An example is the KHAxN experiment, which looks at how many As there can be when spelling KHAN and plots the number of Google search results for each spelling.

Analytics over time

The use of Analytics as a way to represent one's skills in the industry, or as a means to find experts, has increased tremendously in recent years. The following graph shows the use of the term Analytics in LinkedIn profiles to describe a person's expertise.

DemandForAnalytics.png
Note: LinkedIn profiles allow users to go as far back as they want when listing their work experience.


The trend to some extent reflects changes in the ability to process massive amounts of data. Early on, there were big mainframe computers that could crunch huge amounts of data. Then came desktop PCs, which did not have such capabilities. Now we have easy-to-use new technologies such as MySQL, MPP systems, Hadoop, Pig, and Hive that again make it possible to process massive amounts of data.

Identifying Events

Analytics can provide insights into major events and trends. An example from LinkedIn is the surge in registrations from employees of Lehman Brothers just before its bankruptcy was announced. According to DJ, internal news about problems at Lehman Brothers led to increased activity on the corporate network. Lehman's IT department suspected an attack and shut down its systems, so people lost access to their corporate email and other address book tools. They turned to LinkedIn to get in touch with colleagues whose contact details they had there. In other words, LinkedIn enabled them to find people, get contact information, and establish a communication channel outside the channels approved by corporate IT. See http://money.cnn.com/2009/03/24/technology/hempel_linkedin.fortune/index.htm for a different view of why activity on LinkedIn was on the rise.

Business is recognizing the importance of analytics

BusinessWeek magazine ran a cover story named "Math Will Rock Your World" on January 23, 2006. See the article "Math Will Rock Your World" for details. According to DJ, this was the first time a mainstream media outlet stressed the importance of mathematicians and analytics. Other examples include Super Crunchers, our old friend The Wisdom of Crowds, Predictably Irrational, and the following best-selling books:

Business_Modeling.jpg competing_on_Analytics.jpg Web_Analytics.jpg


Analytics Need Effective Visualization

Visualization is key to presenting data in an easily digestible format. Effective visualization allows the viewer to glean many useful insights from the visual. A good visual should have multiple dimensions, but not so many that it becomes cluttered or confusing.
Barbara Tversky, Department of Psychology, Stanford University, outlines the ways that graphics can augment learning:
  • Record information
  • Convey information
  • Promote inferences
  • Enable new ideas
  • Facilitate collaboration

Visual.jpg
Type of Visualizations - Benefits and Drawbacks; Visualization and the Geosciences, Libarkin J., Brick C., Research Methodologies in Science Education pp.449-456


More resources:
What Makes an Effective Visualization

Need for Bright and Creative People

It is very important to find the right people to crunch your data. These people should have
  • Creative ways of seeing a problem
  • The technical skill and knowledge to crunch the problem
  • An excellent intuitive sense of design, i.e., of ways to display data
  • The ability to visualize a way for information to be communicated

Communication in Professional Networks

Besides using professional networks to maintain contacts for job searches, people also use them to share information. Apart from singular cases such as the Lehman Brothers bankruptcy, a significant part of communication on LinkedIn happens between people who work for different companies. A fair amount of communication is threaded, such as discussions on technical topics.


Other uses

Companies can benefit from building directories of employee information based on their employees' profiles on professional networks. E.g., people have more incentive to update their profiles on LinkedIn than on internal company directories, since a LinkedIn profile is forward-looking and provides long-term benefits and prospects. The public profiles also help people build individual brands.

Analytics in LinkedIn

Analytics is organized into two types at LinkedIn.


Product Analytics

User Facing
  • How to build user facing products from data?
  • Primary focus on two dimensions:
    1. Engagement: Engagement is measured through the following metrics:
      • a. Visits.
      • b. Did the user engage with a product? (A Yes/No binary metric.)
      • c. Did the user do anything? Time on site is not a very good variable, so what counts is how one uses the site.
      • d. Did the user share content? For example, recommend a news article?
      • e. Did the user contribute content?
      Note: The data collected is not made so granular as to eliminate possibilities.
    2. Revenue
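
As a minimal sketch of the engagement metrics listed above, the following computes visit counts and Yes/No flags per user from an event log. The event names, the sample log, and the function name are assumptions made for illustration; the talk did not describe LinkedIn's actual event schema.

```python
from collections import defaultdict

# Hypothetical event log of (user_id, action) pairs. The action names are
# illustrative assumptions, not LinkedIn's actual event schema.
EVENTS = [
    ("u1", "visit"),
    ("u1", "view_profile"),
    ("u1", "share_article"),
    ("u2", "visit"),
    ("u3", "visit"),
    ("u3", "post_answer"),
]

# Assumed grouping of actions into the metrics named in the notes.
ENGAGE = {"view_profile", "share_article", "post_answer"}
SHARE = {"share_article"}
CONTRIBUTE = {"post_answer"}


def engagement_flags(events):
    """Return per-user visit counts and Yes/No engagement flags."""
    flags = defaultdict(lambda: {"visits": 0, "engaged": False,
                                 "shared": False, "contributed": False})
    for user, action in events:
        row = flags[user]
        if action == "visit":
            row["visits"] += 1
        if action in ENGAGE:
            row["engaged"] = True
        if action in SHARE:
            row["shared"] = True
        if action in CONTRIBUTE:
            row["contributed"] = True
    return dict(flags)


flags = engagement_flags(EVENTS)
# Fraction of users who did anything beyond just visiting.
engaged_share = sum(f["engaged"] for f in flags.values()) / len(flags)
print(flags["u2"])
print(f"{engaged_share:.2f}")
```

The flags are deliberately binary, in line with the point that time on site is a poor variable and that what matters is whether and how the user used the site.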

Behind-the-scenes Product Enablement
  • The product runs behind the scenes; the user does not invoke it directly.
  • Examples
    • Suggestions for groups
    • Recommendations
    • Collaborative filtering - people who viewed this profile also viewed this other profile

Rapid Prototyping
  • Enables trying out new products in very short time.
  • Hypotheses from Analysis of Data are easily tested by rolling out features iteratively.
  • Data Visualization

Data Insights

  • Focuses on building funnels to drive traffic to products.
  • Tools, Dashboards, Reports.
  • Understanding user and usage.
    • What industries, countries, etc.
  • A/B Testing.
  • Exposing Demographic Trends.

To enable insights from interactions with the data, ad hoc reports are generated using a free-form SQL tool. Given a query criterion, reports usually provide:
    • # of members,
    • # of new members/day,
    • % that is VP or higher in positions held,
    • distribution by region.
Note: The queries return in 20-30 minutes when run at peak time.
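
A report of this shape can be sketched with an in-memory SQLite database. The table layout, column names, seniority labels, and sample rows below are all assumptions made for illustration; the talk did not describe LinkedIn's schema or tooling in this detail.

```python
import sqlite3

# Hypothetical members table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    "id INTEGER PRIMARY KEY, joined TEXT, seniority TEXT, "
    "region TEXT, industry TEXT)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2009-05-01", "VP", "Americas", "Finance"),
        (2, "2009-05-01", "Engineer", "Europe", "Finance"),
        (3, "2009-05-02", "Director", "Americas", "Finance"),
        (4, "2009-05-02", "CXO", "Asia", "Finance"),
        (5, "2009-05-02", "Engineer", "Americas", "Software"),
    ],
)


def report(industry):
    """Summarize the members that match one query criterion (industry)."""
    # In SQLite, `x IN (...)` evaluates to 0 or 1, so SUM counts the matches.
    total, senior = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(seniority IN ('VP', 'CXO')), 0) "
        "FROM members WHERE industry = ?",
        (industry,),
    ).fetchone()
    new_per_day = dict(conn.execute(
        "SELECT joined, COUNT(*) FROM members WHERE industry = ? "
        "GROUP BY joined ORDER BY joined",
        (industry,),
    ))
    by_region = dict(conn.execute(
        "SELECT region, COUNT(*) FROM members WHERE industry = ? "
        "GROUP BY region ORDER BY region",
        (industry,),
    ))
    return {
        "members": total,
        "new_members_per_day": new_per_day,
        "pct_vp_or_higher": 100.0 * senior / total if total else 0.0,
        "by_region": by_region,
    }


finance = report("Finance")
print(finance)
```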

Organizational Structure

The analytics team is part of LinkedIn's product organization and not the technology organization. This strengthens product development and enables quick evolution of products based on insights from understanding demographic trends. In particular, Analytics is not walled off in an ivory tower, where it would become a useless organization removed from the company's business.

avoid-edge-cases.jpg
Some sites are steaming heaps of edge cases.


Bottlenecks in the use of Analytics for Product Development

  1. Data manipulation and associated costs.
    • Requires processing huge amounts of data using various technologies such as MySQL, Aster Data, Hadoop, Greenplum, etc.
  2. Classification Problem: This is a classical problem in Computer Science. It is highly visible when mining data from the web, mainly because content on the web is free-form.

Edge Cases

The analytics team spends most of their time dealing with edge cases. No data fits any one model perfectly. Edge cases are the tail of the curve, and that tail tends to be very wide!

Title Standardization

DJ presented the problem of Title Standardization in LinkedIn as an instance of the Classification Problem with numerous examples of edge cases. The problem is to arrive at a standard representation of title for a user based on his/her profile.
  • Fields: The standardized title is inferred from the following fields:
    • Title,
    • Job Name,
    • Functional Area,
    • Seniority Level.
  • Basic Approach
    • For members with position information the standardized title is determined by a look-up of the user-entered title.
    • For members without position information
      • a generalized classification logic is applied on the profile headline and industry data,
      • special logic is applied to handle ambiguous titles and accurately determine functional area from other characteristics of profile data.
  • Issues
    • Titles vary from company to company. E.g., a Software Engineer at Yahoo! is called a Technical Yahoo!
    • User-provided titles occur in various representations, with additions, abbreviations, and misspellings. E.g., software engineer - china, sw. eng, and sotware engineer all map to Software Engineer. LinkedIn has encountered more than 6000 variants of Software Engineer!
    • Company names also occur in several formats, so it is difficult to apply rules based on company. E.g., ibm, IBM, I.B.M.; there are 8000+ variations of IBM.
    • Determining functional area from company names is difficult. E.g., does Continental refer to Continental Airlines or Continental Bakery?
    • When to classify profiles into a new group? When there are 3 people, 100 people, ...?
  • Solutions: There is no single solution to the above problems. Some approaches are listed below.
    • User clicks: Given a set of options, the ones that the user chooses are the most relevant to the user. The effectiveness of this approach depends on how well user intent is understood.
    • Similarity and Distance Measures: Compute a similarity measure between profiles to classify them and resolve ambiguities.
    • Mechanical Turk: Get people to solve the problem. This approach uses the wisdom of crowds and works better than machines on hard problems that require intelligence.
      However, the design of the system needs to be robust enough to avoid the same issues that it attempts to solve! E.g., people may provide low-quality solutions. Some approaches that may be used are
      • Pose Yes/No questions or provide choices to choose from instead of asking for an explanation of a solution.
      • Rank responses by analyzing their statistics.
      • Revise the system, with more/different options if the statistics do not point to conclusive answers.
    Note: Wow! This wiki-izing of lectures is a Mechanical Turk solution! How else could we reliably extract from video blobs such rich info as is on this wiki? Oh, no: this isn't a Mechanical Turk solution proper! See the next one.
    • Incentives for User Resolution: Ask users to resolve ambiguities themselves, providing incentives through campaigns or by explaining the value proposition. E.g., Profile Completion Tips (Why do this?)
      • Users set up profiles for their own use. They want to build their networks for their own benefit.
      • Incentives can be easily aligned with user interests, and people value such incentives more than campaigns. E.g., recommendations based on profiles (similar to Amazon's recommendations based on wishlists).
      • These approaches also lead to increased engagement, which in itself is an incentive, and perhaps the #1 option (Andreas). However, the number of times a user is touched needs to be limited so this is not seen as an annoyance (Reid).
      • This approach is similar to the Mechanical Turk in getting people to solve problems. However, the motivations are different. People are more motivated to solve their own problems than others' even if paid for that.
      • A good example of this approach is Andreas' experience at MoodLogic, where people help identify music similar to what they have in the system (rather than help classify music without any self-interest).
      • Another good example, and more immediate to this class, is the manner of summarizing the lecture sessions in this wiki. This task is incentivised by grades, which motivates students to do a good job.
    • Type Ahead/Auto-complete: Providing suggestions for key fields as the user types can significantly reduce ambiguities.
    • Rules: Use rules to cluster instances by mapping words and phrases to categories. E.g., a regular expression check such as /.*stanford.*/i can be used to classify Stanford Medical School, Stanford University, stanford univ., stanford school of bus. and many others as related to Stanford.
    • Canonicalization: Apply rules such as converting all text to lower case, removing punctuation, or replacing special characters with alternatives to arrive at a canonical representation of content before applying rules for classification.
    • Email address: Use the email address to determine company or functional area. E.g., if the registered email address is somebody@ebay.com, then the person works for eBay. This, according to DJ, is the number one approach! Why #1?
      • LinkedIn is a professional network, and people normally send connection invitations to people they work with or know through work. So the email addresses are commonly people's work addresses rather than personal ones, and can be used to determine company or functional area.
      • Email addresses are also the hardest to fake.
      • They can be used to authenticate and validate claims (Andreas).
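
Several of the approaches above (canonicalization, rules, a similarity measure, and the email domain) can be combined in a small sketch. The lookup tables, function names, and the 0.8 cutoff are assumptions made for illustration, and the fuzzy match uses Python's difflib as a stand-in similarity measure; the talk did not say which measure LinkedIn uses.

```python
import difflib
import re

# Hypothetical lookup tables; the real mappings were not given in the talk.
TITLE_LOOKUP = {
    "software engineer": "Software Engineer",
    "sw eng": "Software Engineer",
    "technical yahoo": "Software Engineer",
}
DOMAIN_TO_COMPANY = {"ebay.com": "eBay", "ibm.com": "IBM"}

# The rule quoted in the notes: anything that mentions "stanford".
STANFORD_RULE = re.compile(r".*stanford.*", re.IGNORECASE)


def canonicalize(text):
    """Lower-case, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def standardize_title(raw_title, cutoff=0.8):
    """Try an exact lookup first, then the closest known title above cutoff."""
    key = canonicalize(raw_title)
    if key in TITLE_LOOKUP:
        return TITLE_LOOKUP[key]
    close = difflib.get_close_matches(key, list(TITLE_LOOKUP), n=1, cutoff=cutoff)
    return TITLE_LOOKUP[close[0]] if close else None


def company_from_email(email):
    """Map the domain of an email address to a company, if it is known."""
    domain = email.rsplit("@", 1)[-1].lower()
    return DOMAIN_TO_COMPANY.get(domain)


def is_stanford(name):
    """Apply the regular-expression rule to a school or company name."""
    return STANFORD_RULE.match(name) is not None


for raw in ["sw. eng", "sotware engineer", "software engineer - china"]:
    print(raw, "->", standardize_title(raw))
print(canonicalize("I.B.M."))
print(company_from_email("somebody@ebay.com"))
print(is_stanford("stanford school of bus."))
```

Canonicalizing before the lookup keeps the rule tables small; the fuzzy fallback then only has to absorb misspellings and trailing additions. A cutoff that is too low would start merging unrelated titles, which is exactly the kind of edge case discussed above.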

Note: Ben Henick from A List Apart has some interesting ideas about how to avoid edge cases. He argues that edge cases are a problem of leaping before looking, and that you can avoid them by designing up front.

LinkedIn Products


People who viewed this profile also viewed this other profile


  • A huge traffic driver.
  • Very simple to implement via a collaborative filter. Look at pairs of pages that are viewed together more often than a threshold.
  • Iteration time is about a week.
  • Live to site in about a week and a half.
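
The pairwise idea can be sketched as a co-occurrence count over view logs. The sample data, the function name, and the threshold value are assumptions made for illustration; this is a minimal version of the general technique, not LinkedIn's implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical view log: each viewer maps to the set of profiles they viewed.
VIEWS = {
    "v1": {"alice", "bob", "carol"},
    "v2": {"alice", "bob"},
    "v3": {"alice", "bob", "dave"},
    "v4": {"carol", "dave"},
}


def also_viewed(views, threshold):
    """Return, per profile, the profiles co-viewed more than `threshold` times."""
    pair_counts = Counter()
    for profiles in views.values():
        # Sort so that each unordered pair is always counted under one key.
        for a, b in combinations(sorted(profiles), 2):
            pair_counts[(a, b)] += 1

    related = {}
    for (a, b), count in pair_counts.items():
        if count > threshold:
            related.setdefault(a, []).append((b, count))
            related.setdefault(b, []).append((a, count))
    # Most frequently co-viewed first; ties are broken alphabetically.
    for profile in related:
        related[profile].sort(key=lambda item: (-item[1], item[0]))
    return related


related = also_viewed(VIEWS, threshold=1)
print(related)
```

The threshold filters out pairs seen together only by chance. With very sparse data, few pairs clear it, which is one way the sparsity challenge shows up in practice.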

Challenges
  • With 40 million users, data is very sparse.
  • Privacy issues surround whether the user is comfortable sharing his/her name when viewing others' profiles. LinkedIn solves this through generalization, obfuscating names as shown above.

People Who Viewed You

WhosViewed.jpg
Key Concept - A balance must be struck between showing too much (which may be an issue for the sharer of the information) and showing too little (which limits the usefulness to the consumer of the information). LinkedIn gives its users the ability to toggle whether they are willing to share their names when viewing others' profiles. Are you comfortable sharing this information with others?

Internet privacy is still a very subjective topic.

Scalia.jpg
Supreme Court Justice Antonin Scalia questions whether all data about one's life deserves protection.
Supreme Court Justice Scalia has this to say about the privacy of our information on the internet:

"I stand by my remark at the Institute of American and Talmudic Law conference that
it is silly to think that every single datum about my life is private. I was referring, of course, to whether every single datum about my life deserves privacy protection in law.

It is not a rare phenomenon that what is legal may also be quite irresponsible. That appears in the First Amendment context all the time. What can be said often should not be said. Prof. Reidenberg's exercise is an example of perfectly legal, abominably poor judgment. Since he was not teaching a course in judgment, I presume he felt no responsibility to display any."

People you may know

  • A more challenging algorithm.
  • Overall time to go live on site is about the same.

Groups you might like

  • 3.5 days to build the algorithm - a basic logistic regression of keywords.
  • 2 days to site.
  • About 1000 new groups added every day.
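
To make "a basic logistic regression of keywords" concrete, here is a minimal sketch that trains such a model with plain gradient descent. The vocabulary, the training examples, the hypothetical target group, and the learning settings are all assumptions made for illustration; the talk did not describe the features or training procedure LinkedIn used.

```python
import math

# Hypothetical keyword vocabulary and training data. Each example pairs a
# profile's keywords with whether that member joined an (assumed) data group.
VOCAB = ["hadoop", "analytics", "marketing", "sales"]
TRAIN = [
    ({"hadoop", "analytics"}, 1),
    ({"hadoop"}, 1),
    ({"analytics"}, 1),
    ({"marketing", "sales"}, 0),
    ({"sales"}, 0),
    ({"marketing"}, 0),
]


def featurize(keywords):
    """Binary bag-of-keywords vector over VOCAB."""
    return [1.0 if word in keywords else 0.0 for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    rows = [(featurize(keywords), label) for keywords, label in examples]
    for _ in range(epochs):
        grad_w = [0.0] * len(VOCAB)
        grad_b = 0.0
        for x, y in rows:
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def score(keywords, weights, bias):
    """Estimated probability that a member with these keywords likes the group."""
    x = featurize(keywords)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


weights, bias = train(TRAIN)
print(score({"hadoop", "analytics"}, weights, bias))
print(score({"sales"}, weights, bias))
```

A group would then be suggested to members whose score clears some cutoff. A model this simple is quick to build and to retrain, which fits the short build times quoted above.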

Product Cycle

pymk.jpg
Activity during people-you-may-know prototyping: Blue lines represent pushes to the live site

  • An iterative cycle.
  • Features developed through rapid prototyping and pushed to site after each iteration.
  • Data collected (such as change in activity metrics) to validate hypotheses.
  • Formal product development.

By iterating the feature through multiple prototypes, LinkedIn avoids large up-front engineering costs.
When testing a feature, an increase in activity may be a consequence of built-up (pent-up) demand for the feature or of its novelty. It may also be due to a Hawthorne effect, i.e., users reacting to the knowledge that a feature is being tested (Andreas).
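
One common way to check whether a change in an activity metric is larger than chance is a two-proportion z-test, sketched below. The function name and the counts are made up for illustration, and the talk did not say which statistical test LinkedIn applies. Note that the test only addresses sampling noise; it cannot separate a lasting improvement from the novelty or Hawthorne effects mentioned above.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in two proportions."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: members who engaged, without and with the new feature.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Re-running the comparison on later cohorts, after the feature is no longer fresh, is one way to see whether the lift persists.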

Analytics Technologies used by LinkedIn


Data Technologies:
Aster Data
Hadoop (with Hive & Cassandra)
Amazon Mechanical Turk
MySQL / Oracle

Visualization:
Processing
Prefuse (interactivity graph)
Recommended reading: Visualizing Data by Ben Fry

Internal Projects:
Voldemort

Conclusions

Challenges for the next 3-5 years: Hadoop & MPP systems.
The Future of Work: The Free-Agent Nation. LinkedIn sees itself as the reputation system.
The world we live in now is about connecting data in a deep way and thoughtfully interpreting it, not about copying the features of LinkedIn or Facebook.
Iteration and analysis of self-metrics are crucial for improvement.


Initial Contributors:

Gary Chung
Pablo Paniagua
Yan Zhai
Xin Shi
Sampath Jinadasa

Qian, yana
Liu, Yun
Polcari, Mike
Jay R