6_LinkedIn+5.11

[|Andreas Weigend] STATS 252, Stanford University, Spring 2009 Class time: Monday 2:15 - 5:05 pm Class location: Gates B01
**Data Mining and E-Business: The Social Data Revolution**

=__Class 6: LinkedIn__= May 11, 2009 audio: [|podcast]

Part 1: [|mp3] [|transcript] Part 2: [|mp3] [|transcript]


[|Reid Hoffman]
Reid Hoffman is a co-founder of [|PayPal] and currently runs [|LinkedIn], a networking site for professionals.

[|DJ Patil]
DJ, who used to be a professor of Concrete Mathematics at the University of Maryland ([|University of Maryland]), is the Chief Scientist and Sr. Director of Product Analytics at LinkedIn.

Reid and DJ discussed Data Analytics and how LinkedIn uses Data to build Products. This page summarizes the talk.

What Is LinkedIn?
media type="youtube" key="IzT3JVUGUzM" height="340" width="560"

LinkedIn is a business-oriented social networking site, founded in December 2002 and launched in May 2003, that is mainly used for professional networking. As of May 2009, it had more than 39 million registered users spanning 170 industries.

Networks
Networks are //Information and People Representation Systems//. They provide //Reputation Systems// that help
 * predict the reputation of a person as a likely expert in a field,
 * estimate both objective and subjective measures of reputation,
 * determine which people are good at something,
 * determine which people can be trusted, and
 * determine which sources of information can be trusted.

Applications of Professional Networks

 * Hiring: this is very important, but there is much more than just this,
 * Helping people make transactional decisions based on judgements of expertise
 * Examples include hedge funds using LinkedIn to find experts before making trades in the market, sourcing a deal, and doing all the reference checking on the deal.

Basic Challenges involved in building a network

 * Getting people: user #1 is valuable; user #1 and user #2? Not valuable. User #3? Not valuable. Key is to get //enough people// so the network becomes valuable.
 * Extracting Data and building useful products.

The penalty of lying in public
A good network centralizes around good data. LinkedIn has an inherent advantage because the data it collects tends to be intrinsically accurate due to various factors.

media type="custom" key="3800953"
 * //Key concept:// There are fewer factual inaccuracies on a person's LinkedIn profile with 10+ connections than there are on that person's resume. This allows more meaningful analytics to be drawn from LinkedIn profiles.
 * Lying on LinkedIn ~= Lying in public
 * Lying on your resume ~= Lying to a very limited audience (hush-hush)
 * LinkedIn profiles tend to be much more honest than private profiles on other job websites such as [|Monster.com].
 * //Takeaway:// The potential for getting caught is a powerful motivator for telling the truth.
 * [|Kim Isaacs from Monster.com talks about lying on your resume]

Analytics
What is //Analytics//? The simplest definition of //Analytics// is "the science of analysis". A more practical definition is how an entity, such as a business, arrives at an optimal or realistic decision based on existing data.

Analytics is concerned with identifying and extracting useful data to build useful products.

KHA //x// N, x = 1 to 100
It is possible to come up with interesting experiments to gather data, but it is difficult to make sense of the data and to build useful products from it. An example is the **KHA //x// N** experiment, which asks how many //A//s there can be when spelling KHAN, and plots the number of Google search results for each spelling.
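As a minimal sketch of the first step of such an experiment, the spellings for x = 1 to 100 can be generated as below. The function name `khan_spellings` is hypothetical, and the search counts themselves are not reproduced here because they would have to come from a search engine.

```python
def khan_spellings(max_a=100):
    """Return K-H-(A repeated x times)-N for x = 1..max_a."""
    return ["KH" + "A" * x + "N" for x in range(1, max_a + 1)]


spellings = khan_spellings()
print(spellings[:3])   # the three shortest spellings
print(len(spellings))  # one spelling per value of x
```

Each spelling would then be submitted as a search query and the result counts plotted against x.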

Analytics over time
Use of the term //Analytics//, as a way to represent one's skills in the industry or as a means to find experts, has increased tremendously in recent years. The following graph shows the use of //Analytics// in LinkedIn profiles to describe a person's expertise.



The trend to some extent reflects changes in the ability to process massive amounts of data. Early on, there were big mainframe computers that could crunch huge amounts of data. Then came desktop PCs, which did not have such capabilities. Now we have easy-to-use technologies such as MySQL, MPP systems, Hadoop, Pig, and Hive that enable processing large amounts of data.

Identifying Events
Analytics can provide insights into major events and trends. An example from LinkedIn is the surge in registrations from employees of Lehman Brothers just before its bankruptcy was announced. According to DJ, internal news about problems at Lehman Brothers led to increased activity on the corporate network. Their IT department suspected an attack and shut down their systems. People lost access to their corporate email and other address book tools, so they turned to LinkedIn to get in touch with colleagues whose contact details they had there. In other words, LinkedIn enabled them to find people, get contact information, and establish a communication channel outside the channels approved by corporate IT. See [] for a different view of why activity on LinkedIn was on the rise.

Business is recognizing the importance of analytics
BusinessWeek ([|BusinessWeek Magazine]) ran a cover story titled "Math Will Rock Your World" on January 23, 2006. See [|Math Will Rock Your World] for details. According to DJ, this was the first time a mainstream media outlet stressed the importance of mathematicians and analytics. Other examples include [|Super Crunchers], our old friend [|The Wisdom of Crowds], [|Predictably Irrational], and the best-selling books:



Analytics Need Effective Visualization
Visualization is key to presenting data in an easily digestible format. Effective visualization allows the viewer to glean many useful insights from the visual. A good visual should have multiple dimensions, but not so many that it becomes cluttered or confusing. Barbara Tversky, //Department of Psychology, Stanford University//, [|outlines the ways that graphics can augment learning]:
 * Record information
 * Convey information
 * Promote inferences
 * Enable new ideas
 * Facilitate collaboration



More resources: [|What Makes an Effective Visualization]

Need for Bright and Creative People
Great importance is placed on finding the right people to crunch your data. These people should
 * find creative ways to see a problem,
 * have the technical skill and knowledge to crunch the problem,
 * have an excellent intuitive sense of design, i.e., of ways to display data, and
 * be able to visualize a way for information to be communicated.

Communication in Professional Networks
Besides using professional networks to maintain contacts for job searches, people also use the networks to share information. A significant part of communication on LinkedIn happens between people who work for different companies, unlike singular cases such as the Lehman Brothers bankruptcy. A fair amount of communication is //threaded//, such as discussions on technical topics.

media type="youtube" key="IUFwpUFQ_tM" height="344" width="425"

Other uses
Companies can benefit from building directories of employee information based on the profiles of their employees on professional networks. E.g., people have more incentive to update their profiles on LinkedIn than on internal company directories, since a LinkedIn profile is forward-looking and provides long-term benefits and prospects. The public profiles also help people build individual brands.

Analytics in LinkedIn
Analytics is organized into two types at LinkedIn.

media type="youtube" key="dRrkgvr9V_s" height="340" width="560"

Product Analytics
 * **User Facing**
 * How to build //user facing// products from data?
 * Primary focus on two dimensions
> 1. **Engagement** Engagement is measured through the following metrics:
>> a. Visits
>> b. Did the user //engage// with a product? Yes/No binary metric.
>> c. Did the user //do// anything? Time on site is not a very good variable, so what counts is how one uses the site.
>> d. Did the user //share// content? For example, recommend a news article?
>> e. Did the user //contribute// content?

//Note// The data collected is not made so granular as to eliminate possibilities.

> 2. **Revenue**

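A minimal sketch of how the engagement metrics listed above could be computed from an event log. The log format, the action names (`visit`, `use_product`, `share`, `contribute`), and the function `engagement_summary` are assumptions made for illustration, not LinkedIn's actual schema.

```python
# Hypothetical event log: (user_id, action) pairs.
EVENTS = [
    ("u1", "visit"), ("u1", "use_product"), ("u1", "share"),
    ("u2", "visit"),
    ("u3", "visit"), ("u3", "use_product"), ("u3", "contribute"),
    ("u1", "visit"),
]


def engagement_summary(events):
    """Return per-user visit counts and yes/no engagement flags."""
    summary = {}
    for user, action in events:
        row = summary.setdefault(
            user,
            {"visits": 0, "engaged": False, "shared": False, "contributed": False},
        )
        if action == "visit":
            row["visits"] += 1
        elif action == "use_product":
            row["engaged"] = True      # binary metric: did the user engage?
        elif action == "share":
            row["shared"] = True
        elif action == "contribute":
            row["contributed"] = True
    return summary


summary = engagement_summary(EVENTS)
# Share of users who engaged with a product at least once.
engaged_rate = sum(row["engaged"] for row in summary.values()) / len(summary)
print(summary["u1"], round(engaged_rate, 2))
```

The flags are deliberately binary, in line with the Yes/No metric described above, rather than time-on-site measurements.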
 * **Behind-the-scenes Product Enablement**
 * The product is used behind the scenes without direct user involvement to invoke the product.
 * Examples
 * Suggestions for groups
 * Recommendations
 * Collaborative filtering - //people who viewed this profile also viewed this other profile//


 * **Rapid Prototyping**
 * Enables trying out new products in a very short time.
 * Hypotheses from Analysis of Data are easily tested by rolling out features iteratively.
 * Data Visualization

Data Insights

 * Focuses on building funnels to drive traffic to products.
 * Tools, Dashboards, Reports.
 * Understanding user and usage.
 * What industries, countries, etc.
 * A/B Testing.
 * Exposing Demographic Trends.

To enable insights from interactions on data, ad hoc reports are generated using a free-form SQL tool. Given a query criterion, reports usually provide
 * # of members,
 * # of new members/day,
 * % that is VP or higher in positions held,
 * distribution by region.

//Note// The queries return in 20-30 minutes when run at peak time.
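The kind of ad hoc report described above can be sketched with an in-memory SQLite database. The `members` table, its columns, and the sample rows are invented for illustration. The actual tool and schema used at LinkedIn are not described in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    "id INTEGER PRIMARY KEY, joined TEXT, seniority TEXT, region TEXT, industry TEXT)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2009-05-10", "VP", "Americas", "Finance"),
        (2, "2009-05-10", "Engineer", "Europe", "Finance"),
        (3, "2009-05-11", "CEO", "Americas", "Finance"),
        (4, "2009-05-11", "Analyst", "Asia", "Finance"),
        (5, "2009-05-11", "Engineer", "Americas", "Software"),
    ],
)

criterion = ("Finance",)  # the query criterion: members in one industry

total = conn.execute(
    "SELECT COUNT(*) FROM members WHERE industry = ?", criterion
).fetchone()[0]

new_per_day = conn.execute(
    "SELECT joined, COUNT(*) FROM members WHERE industry = ? "
    "GROUP BY joined ORDER BY joined", criterion
).fetchall()

# Treat 'VP' and 'CEO' as "VP or higher" (an assumption for this sketch).
pct_vp = conn.execute(
    "SELECT 100.0 * SUM(seniority IN ('VP', 'CEO')) / COUNT(*) "
    "FROM members WHERE industry = ?", criterion
).fetchone()[0]

by_region = conn.execute(
    "SELECT region, COUNT(*) FROM members WHERE industry = ? "
    "GROUP BY region ORDER BY region", criterion
).fetchall()

print(total, new_per_day, pct_vp, by_region)
```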

Organizational Structure
The analytics team is part of LinkedIn's product organization and not the technology organization. This strengthens product development and enables quick evolution of products based on insights from understanding demographic trends. In particular, Analytics is not //white-castled// to become a useless organization removed from the company's business.

Bottlenecks in the use of Analytics for Product Development

 * 1) **Data manipulation** and associated costs.
 * Requires processing huge amounts of data using various technologies such as MySQL, Aster Data, Hadoop, Greenplum, etc.
 * 2) **Classification Problem** This is a classical problem in Computer Science. It is highly visible when mining data from the web, mainly because web content is free-form.

**Edge Cases**
The analytics team spends most of their time dealing with edge cases. No data fits any one model perfectly. Edge cases are the tail of the curve, and that tail tends to be very wide!

**Title Standardization**
DJ presented the problem of //Title Standardization// at LinkedIn as an instance of the Classification Problem, with numerous examples of edge cases. The problem is to arrive at a standard representation of the //title// for a user based on his/her profile.
 * **Fields** The //standardized title// is inferred from the following fields
 * Title,
 * Job Name,
 * Functional Area,
 * Seniority Level.
 * **Basic Approach**
 * For members with //position information//, the //standardized title// is determined by a look-up of the user-entered title.
 * For members without //position information//
 * a generalized classification logic is applied on the profile headline and industry data,
 * special logic is applied to handle ambiguous titles and accurately determine functional area from other characteristics of profile data.
 * **Issues**
 * Titles vary from company to company. E.g., a //Software Engineer// at Yahoo! is called a //Technical Yahoo!//
 * User-provided titles occur in various representations with //additions//, or as //abbreviations// and //misspellings//. E.g., //software engineer - china//, //sw. eng//, //sotware engineer// all map to //Software Engineer//. LinkedIn has encountered more than 6000 variants of Software Engineer!
 * Company names also come in several formats, so it is difficult to apply rules based on company. E.g., ibm, IBM, I.B.M.: there are 8000+ variations of IBM.
 * Determining functional area from company names is difficult. E.g., does //Continental// refer to //Continental Airlines// or //Continental Bakery//?
 * When to classify profiles into a new group? When there are 3 people, 100 people, ...?
 * **Solutions** There is no single solution to the above problems. Some approaches are listed below.
 * **User clicks** Given a set of options, the ones that the user chooses are the most relevant to the user. Efficiency of this approach depends on how well user intent is understood.
 * **Similarity and Distance Measures** Compute a similarity measure between profiles to classify them and resolve ambiguities.
 * [|Mechanical Turk] Get people to solve the problem. This approach uses the wisdom of crowds and works better than machines at solving hard problems that require intelligence.
>> However, the design of the system needs to be robust enough to avoid the same issues that it attempts to solve! E.g., people may provide poor solutions. Some approaches that may be used are
 * Pose Yes/No questions or provide choices to choose from instead of asking for an //explanation of a solution//.
 * Rank responses by analyzing their statistics.
 * Revise the system, with more/different options if the statistics do not point to conclusive answers.
> //Note// Wow! This wiki-izing of lectures is a Mechanical Turk solution! How else can we reliably extract from blob videos such rich info as on this wiki? O! No. This isn't a Mechanical Turk solution proper! See the next one.
 * **Incentives for User Resolution** Ask the users to themselves resolve ambiguities by providing incentives through campaigns or explaining the value propositions. E.g., //Profile Completion Tips ([|Why do this?])//
 * Users set up profiles for their own use. They want to build their networks for their own benefit.
 * Incentives can be easily aligned with user interests, and people value such incentives more than campaigns. E.g., recommendations based on profiles (similar to Amazon's recommendations based on wishlists).
 * These approaches also lead to increased engagement, which in itself is an incentive, and perhaps the #1 option (Andreas). However, the number of times a user is touched needs to be limited so this is not seen as an annoyance (Reid).
 * This approach is similar to the //Mechanical Turk// in getting people to solve problems. However, the motivations are different. People are more motivated to solve their own problems than others' even if paid for that.
 * A good example of this approach is Andreas' experience at MoodLogic, where people help identify music similar to //what they have// in the system (rather than help classify music without any self-interest).
 * Another good example, and more immediate to this class, is the manner of summarizing the lecture sessions in this wiki. This task is incentivised by grades, which motivates students to do a good job.
 * **Type Ahead/Auto-complete** Providing suggestions for key fields as the user types can significantly bring down ambiguities.
 * **Rules** Use rules to cluster instances by mapping words and phrases to categories. E.g., a regular expression check such as ///.*stanford.*/i// can be used to classify //Stanford Medical School//, //Stanford University//, //stanford univ.//, //stanford school of bus.// and many others as related to Stanford.
 * **Canonicalization** Apply rules such as //convert all text to lower-case//, //remove punctuation//, or //replace special characters with alternatives// to arrive at a canonical representation of content before applying rules for classification.
 * **Use email address** to determine company or functional area. E.g., if the registered email address is //somebody@ebay.com//, then the person works for eBay. This, according to DJ, is the //number one// approach! Why #1?
 * LinkedIn is a professional network, and people normally send connection invitations to people they work with or know through work. So the email addresses are commonly people's work addresses rather than personal ones, and can be used to determine company or functional area.
 * Email addresses are also the hardest to fake.
 * They can be used to authenticate and validate claims (Andreas).
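Several of the approaches above (canonicalization, look-up, a similarity measure, a regular-expression rule, and the email domain) can be combined in a short sketch. The look-up table, the function names, and the similarity cutoff of 0.8 are assumptions made for illustration. The similarity measure here is the one built into Python's `difflib`, swapped in as a stand-in for whatever measure LinkedIn uses.

```python
import difflib
import re

# Hypothetical look-up table: canonical user-entered title -> standardized title.
TITLE_LOOKUP = {
    "software engineer": "Software Engineer",
    "sw eng": "Software Engineer",
    "technical yahoo": "Software Engineer",
}

# Rule from the text: anything mentioning "stanford" is related to Stanford.
STANFORD_RULE = re.compile(r".*stanford.*", re.IGNORECASE)


def canonicalize(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(no_punct.split())


def standardize_title(raw_title, cutoff=0.8):
    """Exact look-up first, then the closest known spelling above `cutoff`."""
    key = canonicalize(raw_title)
    if key in TITLE_LOOKUP:
        return TITLE_LOOKUP[key]
    close = difflib.get_close_matches(key, list(TITLE_LOOKUP), n=1, cutoff=cutoff)
    return TITLE_LOOKUP[close[0]] if close else None


def is_stanford(school):
    """Apply the regular-expression rule to a free-form school name."""
    return STANFORD_RULE.match(school) is not None


def email_domain(address):
    """Return the lower-cased domain of an email address, or None."""
    local, at, domain = address.rpartition("@")
    return domain.lower() if at and local and domain else None


for raw in ["software engineer - china", "sw. eng", "sotware engineer"]:
    print(raw, "->", standardize_title(raw))
print(is_stanford("stanford school of bus."), email_domain("somebody@ebay.com"))
```

Titles that match nothing return `None`, which leaves them as candidates for the human-driven approaches (Mechanical Turk, user resolution) rather than forcing a guess.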

//Note// Ben Henick from [|A List Apart] has some interesting ideas about how to avoid edge cases. He argues that edge cases are a problem of leaping before looking. You can [|avoid edge cases by designing up front].

**People who viewed this profile also viewed this other profile**

 * A huge traffic driver.
 * Very simple to implement via a collaborative filter. Look at pairwise pages that occur more than a threshold.
 * Iteration time is about a week.
 * Live to site in about a week and a half.
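A minimal sketch of the pairwise approach described above: count how often two profiles are viewed by the same viewer, and keep the pairs whose count exceeds a threshold. The view log, the names, and the threshold value are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical view log: viewer -> profiles that viewer looked at.
VIEWS = {
    "v1": ["alice", "bob", "carol"],
    "v2": ["alice", "bob"],
    "v3": ["alice", "bob", "dave"],
    "v4": ["carol", "dave"],
}


def also_viewed(views, threshold=2):
    """Map each profile to the profiles co-viewed with it more than `threshold` times."""
    pair_counts = Counter()
    for profiles in views.values():
        # Count each unordered pair once per viewer.
        for a, b in combinations(sorted(set(profiles)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count > threshold:
            related[a].append((b, count))
            related[b].append((a, count))

    # Most frequently co-viewed first, ties broken alphabetically.
    for profile in related:
        related[profile].sort(key=lambda item: (-item[1], item[0]))
    return dict(related)


related = also_viewed(VIEWS)
print(related)
```

The threshold also helps with the sparsity noted under Challenges: pairs seen only once or twice are dropped rather than shown as recommendations.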


 * **Challenges**
 * With 40 million users, data is very sparse.
 * Privacy issues surround whether the user is comfortable sharing his/her name when viewing others' profiles. LinkedIn solves this through generalization, obfuscating names as shown above.

**People Who Viewed You**
//Key Concept// - A balance must be struck between showing //too much// (which may be an issue for the sharer of the information) and showing //too little// (which limits the usefulness to the consumer of the information). LinkedIn gives its users the ability to toggle whether they are willing to share their names when viewing others' profiles. Are you comfortable sharing this information with others?


 * **//Internet privacy is still a very subjective topic.//**
 * **[|Robert Cringely talks about Internet Privacy on PC World Magazine Online]**


 * [[image:Scalia.jpg align="right" caption="Supreme Court Justice Antonin Scalia questions whether all data about one's life deserves protection." link="http://en.wikipedia.org/wiki/Antonin_Scalia"]]Supreme Court Justice Scalia has this to say about the privacy of our information on the internet:

//"I stand by my remark at the Institute of American and Talmudic Law conference that **it is silly to think that every single datum about my life is private**. I was referring, of course, to whether every single datum about my life deserves privacy protection in law.//

//It is not a rare phenomenon that what is legal may also be quite irresponsible. That appears in the First Amendment context all the time. What can be said often should not be said. Prof. Reidenberg's exercise is an example of perfectly legal, abominably poor judgment. Since he was not teaching a course in judgment, I presume he felt no responsibility to display any."//

**People you may know**

 * A more challenging algorithm.
 * Overall time to go live on site is about the same.

**Groups you might like**

 * 3.5 days to build the algorithm - a basic logistic regression of keywords.
 * 2 days to site.
 * About 1000 new groups added every day.
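A minimal sketch of a logistic regression over keywords, assuming each profile is reduced to a set of keywords and the label says whether the member joined a given group. The vocabulary, the training data, and the plain stochastic gradient descent training loop are all assumptions made for illustration. The talk does not describe the features or the fitting procedure.

```python
import math

VOCAB = ["hadoop", "analytics", "sales", "marketing"]

# Hypothetical training data: (profile keywords, 1 if the member joined a "data" group).
TRAIN = [
    ({"hadoop", "analytics"}, 1),
    ({"analytics"}, 1),
    ({"hadoop"}, 1),
    ({"sales", "marketing"}, 0),
    ({"sales"}, 0),
    ({"marketing"}, 0),
]


def featurize(keywords):
    """One binary feature per vocabulary word."""
    return [1.0 if word in keywords else 0.0 for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit weights and bias by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for keywords, label in examples:
            x = featurize(keywords)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = label - p
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict(keywords, weights, bias):
    """Estimated probability that a member with these keywords likes the group."""
    x = featurize(keywords)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


weights, bias = train(TRAIN)
print(predict({"hadoop", "analytics"}, weights, bias))
print(predict({"sales", "marketing"}, weights, bias))
```

Groups would then be suggested to a member in decreasing order of predicted probability, one model per group or group category.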

**Product Cycle**

 * An iterative cycle.
 * Features developed through rapid prototyping and pushed to site after each iteration.
 * Data collected (such as change in activity metrics) to validate hypotheses.
 * Formal product development.

By iterating the feature through multiple prototypes, LinkedIn avoids large up-front engineering costs. When testing a feature, an increase in activity may be due to pent-up demand for the feature or to its novelty. It may also be due to a [|Hawthorne effect]: users react differently when they know a feature is being tested (Andreas).
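One common way to check whether a change in an activity metric is more than noise is a two-proportion z-test between users who saw the feature and users who did not. This is a generic statistical sketch, not a method described in the talk, and the counts below are invented. It does not correct for novelty or Hawthorne effects, which need a longer observation window or a hold-out group.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two proportions (pooled variance)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: active users out of 1000 in control (A) and treatment (B).
z = two_proportion_z(success_a=100, n_a=1000, success_b=130, n_b=1000)
p = two_sided_p_value(z)
print(round(z, 2), round(p, 3))
```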

Analytics Technologies used by LinkedIn
//**Data Technologies:**// [|Aster Data], Hadoop (with Hive & Cassandra), [|Amazon Mechanical Turk], MySQL / Oracle

//**Visualization:**// Processing, Prefuse (interactivity graph). //Recommended reading//: Visualizing Data by Ben Fry

//**Internal Projects:**// Voldemort

Conclusions
 * Challenges for the next 3-5 years: Hadoop & MPP systems.
 * The Future of Work: The Free-Agent Nation. LinkedIn sees itself as the reputation system.
 * The world we live in now is about connecting data in a deep way and thoughtfully interpreting it, not about copying the features of LinkedIn or Facebook.
 * Iteration and analysis of self-metrics is crucial for improvement.

Initial Contributors:
Gary Chung Pablo Paniagua Yan Zhai Xin Shi Sampath Jinadasa

Qian, yana Liu,Yun Polcari,Mike Jay R