Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Spring 2009
STATS 252
Stanford University

Homework 2: Getting to Data on the Web

GoogleTrends (Part A), Google Analytics (Part B), and Yahoo Pipes (Part C)

(Note: Homework 2 is to be done on an individual basis.)
Assigned: Mon Apr 13, 2009
Due: Thursday Apr 23, 2009, 5pm (all parts)
Everything you need to turn in is marked in RED.
Submit to: stats252.homework@gmail.com. Non-SCPD students should also bring a hard copy to class on Apr 27.


Part A: GOOGLE TRENDS

With Google Insights for Search ( http://www.google.com/insights/search/ ), you can compare search-volume patterns across specific regions, categories, time frames, and properties.
1) Please use Google Insights to analyze trends relating to your Facebook Page from Homework Assignment 1. Give three interesting results you find with explanations.
2) Google Insights aggregates information about regions, categories and time frames to show search trends. If you were Google, what other factors would you visualize with search trends and why?
3) There are many other websites that provide trending information; for example, Technorati ( http://technorati.com/ ) shows search volume across the blogosphere. Please find an example of a website that provides insights for "live" feeds of information.

Part B: GOOGLE ANALYTICS

Set up your web page, retrieve and analyze web access logs from your Leland account:

Step 1, download and install the necessary software for secure file transfer:
• Windows:
  • SecureFX, click here to learn and click here to download. You can use all default settings during installation.
• Mac
  • Fetch, click here (serial number included) to learn and click here to download.
    • Using Fetch is similar to SecureFX; follow the guide and remember to change "Hostname" to elaine.stanford.edu.

Step 2, in SecureFX, connect to elaine.stanford.edu and log in with your SUNet ID and password.
252_securefx_setting.png

If you don’t already have a webpage, you will want to transfer one to the WWW folder. The opening page should be called index.html (a simple example: index.html.txt.txt). (See picture below.)
securefx_1.png


(If you already have your own website from which you can get logs, you can skip Steps 3 and 4.)

Step 3, request here to have a log dump generated for your Stanford website (if you don’t do this, no logs will be visible to you by default).
Note: according to the request page, the log dump will be generated the morning after your request, so make sure you start this step early.
logdump.png

Note: if you experience problems, please write to the TA immediately. IT recently resolved an issue in the script that processes these requests, but problems may still occur.

Step 4, retrieve your web access logs from the server; it may take a day for the log dump to be generated. You can find the logs at your_home_directory/WWW/logdumps/ and retrieve them through SecureFX.
252_www_logdumps.png


If you don't know how to extract .bz2 or gzip files, you may want to try SecureZIP (freeware). If everything is squeezed into one big line in the extracted file, that's because the file uses Unix line endings; try opening it in Microsoft Word.
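If you prefer to skip extra software, Python's standard library can decompress the .bz2 log dump directly. This is an optional sketch, not part of the assignment; the sample bytes below are made up for illustration.

```python
import bz2

def extract_log(data: bytes) -> str:
    """Decompress bzip2-compressed bytes and return the log as text."""
    return bz2.decompress(data).decode("utf-8")

# Round-trip a tiny made-up sample; in practice you would read the
# bytes from a file under WWW/logdumps/.
sample = bz2.compress(b"line one\nline two\n")
print(extract_log(sample))
```

Writing the result back out with `open(path, "w", newline="\r\n")` converts the Unix line endings to Windows ones, which sidesteps the one-big-line problem in Notepad.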

After creating your page and having your friends hit it a few times, you will need to wait another day for the logs to be refreshed.

Step 5, now you can analyze your web logs:
1) Comment on the format of logs, and print out a snippet.
2) Formulate 3 questions to which you may be interested in finding the answers.

Some example questions: what is the most popular link on a certain page? How many unique IPs are there per day?
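To give a feel for how such questions can be answered, here is a minimal sketch that counts unique client IPs per day, assuming the server writes Apache-style Common Log Format lines (check your own dump's format first; the sample lines below are invented).

```python
import re
from collections import defaultdict

# Matches the start of an Apache Common Log Format line, e.g.
# 171.64.1.1 - - [13/Apr/2009:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 2326
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):')

def unique_ips_per_day(lines):
    """Return {date: number of distinct client IPs} for an access log."""
    ips = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ip, day = m.groups()
            ips[day].add(ip)
    return {day: len(s) for day, s in ips.items()}

log = [
    '171.64.1.1 - - [13/Apr/2009:10:15:32 -0700] "GET / HTTP/1.1" 200 2326',
    '171.64.1.2 - - [13/Apr/2009:11:02:03 -0700] "GET / HTTP/1.1" 200 2326',
    '171.64.1.1 - - [14/Apr/2009:09:00:00 -0700] "GET / HTTP/1.1" 304 0',
]
print(unique_ips_per_day(log))  # {'13/Apr/2009': 2, '14/Apr/2009': 1}
```

The "most popular link" question works the same way: capture the request path instead of the IP and count occurrences.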

Step 6, analyze your website using Google Analytics:
1) Follow the instructions in google_analytics_instructions2.pdf to set up your Google Analytics account. Note: don't forget step 6 in the instructions: put the tracking code right before the </body> tag of any page that you wish to have analyzed.

2) In Google Analytics, click "View Reports" for your website

google_analytics1_small.jpg

3) You will be shown a Dashboard consisting of several diagrams like those below. Take screenshots and submit these plots as part of your homework write-up. Comment on each plot, and on how you could use some of the information to improve performance (for example, if a product you are selling attracts far more visitors from Asia than from the U.S., you may want to focus on the Asian market).

google_analytics2_small.jpg


Part C: YAHOO PIPES

Automatic Data Service with Yahoo Pipes

In this exercise, we will use Yahoo Pipes to collect data automatically and build alerts on top of it.

Step 1, understanding the basic concepts
  • What is an RSS feed?
RSS is a popular method for announcing recently updated items. The data in an RSS feed is represented in XML format. Many online services let you subscribe to your favorite RSS feeds to keep up with changes, such as iGoogle, Google Reader, LiveJournal, NewsGator, etc. The typical use is to subscribe to an RSS feed in your favorite feed reader, so you can view all the content you care about in a single place. You can learn more about RSS here. RSS feeds are a common data source in Yahoo Pipes.
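Because a feed is just XML, it is easy to inspect programmatically. Here is a small optional sketch that pulls the item titles out of an RSS 2.0 document; the feed content is made up, and in practice you would fetch the XML from a feed URL with urllib.

```python
import xml.etree.ElementTree as ET

# A minimal, invented RSS 2.0 document.
RSS = """<rss version="2.0"><channel>
  <title>Sample feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

def item_titles(rss_xml):
    """Extract the title of every <item> in an RSS feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

print(item_titles(RSS))  # ['First post', 'Second post']
```

Each `<item>` is one announced update; a feed reader (or a Yahoo Pipe) simply re-fetches the XML and looks for items it has not seen before.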

  • What is Yahoo Pipes?
“Yahoo pipes is a powerful composition tool to aggregate, manipulate, and mashup content from around the web.” (from the Yahoo Pipes homepage). We will show you an example in Step 2, but we highly encourage you to learn more about it beforehand. Some very good video tutorials are available as well.

Step 2, understanding a real-world example

Assume you are sick of your landlord and are now looking for a new apartment. You want a “1-bedroom apartment in Palo Alto that asks less than $1400/month and is also cat-friendly”. So you go to craigslist and search, with something like
http://sfbay.craigslist.org/search/apa/pen?addTwo=purrr&bedrooms=1&maxAsk=1400. But you run into two problems: first, craigslist only lets you limit the search to the “peninsula” area, so you have to search for “palo alto” within the page; second, the search only happens when you remember to run it, and you are usually too busy to remember. Ideally, you want the process to be automated, so that whenever a new listing matches your requirements, you are alerted.

Here is the Yahoo Pipe we created to solve the problem:
http://pipes.yahoo.com/pipes/pipe.info?_id=bBXZNDgJ3RGCNBwWGsevXg (shown in the picture below). We can set up automatic alerts whenever the pipe's output changes.
Picture_2.png

You should go there, view the source of the pipe, and play around with it. If you don't understand how the source works, go back to Step 1 and re-study some of the concepts.
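Conceptually, the pipe fetches the craigslist feed and keeps only the items that mention Palo Alto; the alert service then diffs each fetch against what it has already reported. A hedged Python sketch of that logic (the listing data below is invented, and real pipes work on full RSS items rather than these toy dicts):

```python
def filter_listings(items, keyword="palo alto"):
    """Keep only feed items whose title mentions the keyword --
    the same job the Filter module does inside the pipe."""
    return [it for it in items if keyword in it["title"].lower()]

def new_listings(current, seen_links):
    """Items whose link has not been alerted on yet; an alert
    service does essentially this diff on every fetch."""
    return [it for it in current if it["link"] not in seen_links]

items = [
    {"title": "$1395 / 1br - Sunny 1BR in Palo Alto", "link": "http://example.org/a"},
    {"title": "$1300 / 1br - Menlo Park studio", "link": "http://example.org/b"},
]
matches = filter_listings(items)
print([it["link"] for it in matches])  # ['http://example.org/a']
```

Yahoo Pipes lets you wire these two steps together graphically, so understanding them in plain code makes the pipe's source much easier to read.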

After the pipe is created, you can set up an alert on it that fires whenever the result changes, and you will be informed through email, mobile, or Yahoo Messenger.
yahoo_pipes_alert2.JPG


yahoo_pipes_alert3.JPG


Step 3, questions for you:

Now you should design a similar problem and implement a Yahoo Pipe to solve it. Please publish your pipe and include the link in your homework submission, along with your problem definition.