Introduction to Web and Web Crawling

Comic image of a spider on a leaf over a web saying Excuse me a sec... I want to check out how many hits I got on my website
  • by Morten Goodwin Olsen

Lecture Outline

  • The Web
    • How large is the web
    • Why do we need web crawlers
  • Web Crawlers
    • What is a web crawler
    • Types of web crawlers
  • Measures of crawlers - web pages
    • Page Rank
    • Freshness, Age,..
  • Recrawling
    • Theoretical Approach - Knapsack

What is a web crawler?

  • A web crawler (also known as a web spider or web robot)
    • Program browses the World Wide Web in a methodical, automated manner.
    • ants
    • automatic indexers
    • bots
    • worms
    • Create a copy of pages for later processing

Basic functionality of a web crawler

  • Start with a list of URLs called seeds
  • (Selected) URLs to be visited
  • New hyper links are identified
  • New URLs added to the list
  • Go to start

Basic functionality of a web crawler (2)

Web pages being downloaded from the world wide web by web crawlers. The pages are then indexed in a storage facility. Afterwards the data is being processed for possible later presentation.

Types of Web Crawlers

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.

Types of Web Crawlers (2)

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Width First Crawlers (Entire Web Sites From Main Page)
    • Downloads all pages on the same level before it goes to the next.
      • A
      • B, C
      • D, E, F, G, H

Types Of Web Crawlers

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Depth First Crawlers (Entire Web Site From Main Page)
    • Downloads all pages on one arm, before it starts on the next.
      • A, B, D
      • (B), E
      • (B), F
      • (A), C, G
      • (C), H

Types of Web Crawlers

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Path Ascending Crawlers (Finding hidden resources)
    • Downloads pages from one or more URLs based on the paths
    • E.g; http://example.com/A/B/E
      • E,B,A
      • D, (B), (A)
      • F, (B), (A)
      • C, (A)
      • G, (C), (A)
      • H, (C), (A)

Types Of Web Crawlers

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Random Walk (Samples - Not entire sites)
    • Follows each link randomly
      • A
        • Random(B,C) = B
        • Random(D,E,F) = E
      • A
        • Random(B,C) = C
        • Random(G,H) = G

Types Of Web Crawlers (2)

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A links to B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Focused Crawling (Only Download What is interesting)
    • A
      • IsIntersting(B) = No
      • IsInteresting(C) = Yes
      • IsIntersting(G) = Yes
      • IsInteresting(H) No
    • Page Downloaded:
      • A,C,G

Robots.txt

  • http://www.vg.no/robots.txt
  • User-agent: *
  • Disallow: /pub/vgartmail.hbs
  • Disallow: /pub/skrivervennlig.hbs
  • Disallow: /tv
  • Disallow: /CGI
  • User-agent: Google
  • Cache delay: 1

PageRank

Picture of a Google search with EIAO as the highest rank.

PageRank (2)

  • (Danny Sullivan, 2007)
  • Page with higher PageRank is presented before pages with lower PageRank.
  • Numeric value of how important your page is
  • Voting Based.
  • A link from a page to another means this page votes on another page.
  • Example:
    • Page B links to page A
    • Page B votes for page A
    • Page A increases the PageRank

PageRank (3)

  • How is it calculated?
    • PR(A) = (1d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
  • PR = PageRank
  • d = Damping Factor
  • A = Page to be calculated
  • t1,t2,....tn are pages that links to A
  • C(tn) = Number of total links from tn
  • PR(tn) = PageRank of tn

PageRank - Example

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A is linked from B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Prior knowledge:
    • PR(A)=0.6
    • PR(B)=0.3

PageRank - Example (2)

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A is linked from B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Prior knowledge:
    • PR(B)=0.6
    • PR(C)=0.3
  • c(B) = 4
  • c(C) = 3
  • PR(A) = ?

PageRank

  • Example of calculating PageRank of page A
  • Three page A,B and C
    • B and C links to A
    • Page B
      • Four (4) links. c(B) = 4
      • PageRank 0.6. PR(B) = 0.6
    • Page C
      • Three (3) links c(C) = 3
      • PageRank 0.3. PR(C) = 0.3

PageRank (2)

  • PR(A) = (1-d) + d*(PR(B)/c(B) + PR(C)/c(C))
  • d = 0.85
  • PR(A) = (10.85) + 0.85*(0.6/4 + 0.3/3)
  • PR(A) = 0.15 + 0.85*(0.15 + 0.10)
  • PR(A) = 0.15 + 0.85*0.25 = 0.15 + 0.2125
  • PR(A) = 0.3625

PageRank - Example

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A is linked from B and C. Web page B links to D, E and F. Web page C links to G and H.
  • Prior knowledge:
    • PR(B)=0.6
    • PR(C)=0.3
  • c(B) = 4
  • c(C) = 3
  • PR(A) = 0.3625
  • Which of the pages A,B or C are most important (has the highest PageRank) in this example.

PageRank - Example (2)

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A is linked from B and B is linked from A. Additionally, C links to A and C. Web page B links to D, E and F. Web page C links to G and H.
  • PR(A) = ?
  • PR(B) = ?
  • PR(C)=0.3
  • c(A) = 1
  • c(B) = 4
  • c(C) = 3
  • PageRank
    • What if
      • Page A links to Page Ba
        and
        Page B links to Page A
        • PR(A) depends on PR(B)
        • PR(B) depends on PR(A)

PageRank - Example

Illustration of a web site 8 nodes - A,B,C,D,E,F,G,H. Web page A is linked from B and B is linked from A. Additionally, C links to A and C. Web page B links to D, E and F. Web page C links to G and H.
  • PR(A) = ?
  • PR(B) = ?
  • PR(C)=0.3
  • c(A) = 1
  • c(B) = 4
  • c(C) = 3

PageRank

  • Solution
    • Repeat until PR(A) and PR(B) are stable
      • Calculate PR(A) with the last known value of PR(B)
      • Calculate PR(B) with the last known value of PR(A)

PageRank (2)

  • What does Google do with PageRanks (as far as we know)
    • Presents the results according to the PageRank
    • Crawls the pages according to PageRank

Random Walk - PageRank

  • Follows each link randomly
    • Will visit pages with a lot of in-links (high pagerank) more often than pages with few in-links.

Incremental Web Crawling

  • "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained."
    (Edwards et al., 2001)
    .

How to measure your crawler

Showing the behaviour of a page that is updated with respect to freshness, age, number of downloads and number of missed updates.

Freshness

Graph of freshness

Age

Graph of age

Number of real downloads

Graph of number of real downloads

Number of missed updates

Graph of number of missed updates

Deep / hidden web

  • Some part of the web is hidden from crawlers such as;
    • Pages which are password protected (e.g. online banks)
    • Pages which are only visible through forms (e.g. guestbook)
    • Pages that update so fast that the crawlers never have the time to download them (e.g. online news papers).
    • Pages that are not linked from anywhere or provided as a seed URL.
    • Pages that are hidden with inaccessible content (e.g. only flash links not directly parsable by most crawlers).

Quote from Garcia

  • When a page changes too often, the crawler will waste time by trying to recrawl it too fast and still will not be able to keep its copy of the page fresh.
    (Cho and Garcia-Molina, 2003)
  • Example:
    • Try to google for an online newspaper and you will in most cases find out that the cache is a couple of days old (and thus useless).

Recrawling

  • What is the difference between crawling and recrawling?
    • Recrawling assumes you have already knownledge of all pages.

Round Robin

  • List of pages = {P1,P2,...PN}
  • 1. Get next available page from list of pages
  • 2. Download the page
  • 3. If capacity is full, wait until new time slot is available
  • 4. Goto 1

Weighted Proportional

  • List of pages = {P1,P2,...PN}
  • Each page has value one (1) if updated / zero (0) if not updated
  • Assumes that the average probability of a page being updated is known
  • 1. Order list of pages by probability of being updated
  • 2. Download Pi with the highest probability of being updated
  • 3. Remove Pi from list of pages
  • 4. If capacity met, goto 1
  • 5. Otherwise goto 2
  • Disadvantage ?

Weighted Proportional (2)

  • Disadvantage:
    • Needs to calculate the average probability of each page before it can be applied

Learning Automata Solution

  • Each web page is connected to a LA Learning Automata.
  • Each Learning Automata has a defined probability of being visited.
  • Whenever a page is downloaded and it is updated, the probability increases.
  • Whenever a page is downloaded and it is not updated, the probability decreases
  • 1. Chose randomly based on the probabilities of each page.
  • 2. Update probability (increase probability of updated, decrease if not)
  • 3. Goto 1
  • (John Oommen et. al, 2007)

Learning Automata Solution (2)

Learning Automata Solution where 1. The page has probability 0.5, is updated and increases the probability to 0.6. 2. The page has probability 0.6 and is updated and increases the probability to 0.7. 3. The page has probability 0.7 and is not updated and decreases the probability to 0.6.

Theroretical Approach to Recrawling

  • Knapsack
    • (Knapsack, 2007)
    • (John Oommen et. al, 2007)
    • Picture of knapsack with five (5) items and a capacity of 15 Kg

The Knapsack Problem

  • What does a knapsack has to do with resource allocation and crawling?
  • The knapsack problem can be described as following
    • A thief goes into a museum carrying a knapsack with limited capacity
    • He/She wants to steal as much of value as possible without his knapsack being exceeded.
    • In other words he wants to know how to pack his knapsack to let to reach the most valuable solution.

The Knapsack Problem (2)

  • The knapsack can be seen as the resources available from e.g. a crawler.
  • The museum items can be seen as contributions retrieved from accessing the content. E.g. how valuable the downloaded page is?
    • How probable is it that the page I download is updated?

Fractional Knapsack Problem

  • The thief can put any percentage of the items into the knapsack (not just complete parts).
  • The optimal solution to this is as following (known as the greedy algorithm):
    • 1. Take as much as possible of the most valueble item/size.
    • 2. When this is empty, take as much as possible of the second most valueble item/size.
    • 3. Continue until the knapsack is full.

Fractional Knapsack Solution

Picture of knapsack with five (5) items and a capacity of 15 Kg
  • Green: 10/12 = 0.83333
  • Blue: 4/2 = 2
  • Gray: 2/1 = 2
  • Red: 1/1 = 1
  • Yellow: 2/4 = 0.5
  • Optimal value: 16

Example of Assignment

  • Find the most optimal solution for the following Items
    • a (v=10,w=2.5)
    • b (v=20,w=6)
    • c (v=13,w=5)
    • d (v=14,w=5)
    • In a fractional knapsack with constraint C=10

Example of Solution

  • Find the most optimal solution for the following Items
    • a (v=10,w=2.5, v/w=4)
    • b (v=20,w=6, v/w=3.33)
    • c (v=13,w=5, v/w=2.6)
    • d (v=14,w=5, v/w=2.8)
    • In a fractional knapsack with constraint C=10
  • Solution:
    • 1*a+1*b+0.3*c+0*d
    • Weight = 1*2.5+1*6+0.3*5=10
    • Value = 1*10+1*20+0.3*14=34.2

Binary Knapsack Problem

  • The thief can put only 100% or 0% of the items into the knapsack (not just complete parts).
  • No direct optimal solution exists.

Binary Knapsack Solution

Picture of knapsack with five (5) items and a capacity of 15 Kg
  • Green: 10/12 = 0.83333
  • Blue: 4/2 = 2
  • Gray: 2/1 = 2
  • Red: 1/1 = 1
  • Yellow: 4/2 = 0.5
  • Optimal value: 16

Example of Assignment

  • Find the most optimal solution for the following Items
    • a (v=10,w=2.5)
    • b (v=20,w=6)
    • c (v=13,w=5)
    • d (v=14,w=5)
    • In a binary knapsack with constraint C=10

Example of Solution

  • Find the most optimal solution for the following Items
    • a (v=10,w=2.5)
    • b (v=20,w=6)
    • c (v=13,w=5)
    • d (v=14,w=5)
    • In a binary knapsack with constraint C=10
  • Solution
    • 1*a+1*b+0*c+0*d=1*a+1*b
    • Weight = 1*2.5+1*6 = 8.5
    • Value = 1*10+1*20 = 30

Comparing Crawling and Knapsack

  • Binary Approach
    • All pages are downloaded completely or not downloaded at all.
  • Fractional approach
    • All pages are downloaded with a given probability (fraction = probability).

Fractional Knapsack With Unknown Distribution

  • Unknown weight and value.
  • The thief can put any percentage (fraction) of the items into the knapsack (not just complete parts).
  • The thief does not know how much each item weights or how valuable it is before he has put it in his knapsack.
  • The thief can never be sure if he has the most optimal solution, because he does not know the value and weight of the other items.

Fractional Knapsack With Unknown Distribution (2)

  • With regards to web crawling
  • You do not know if a page is updated or not until you download it

Fractional Knapsack With Unknown Distribution (3)

  • Greedy algorithm does not work.
  • This is very close to web crawling:
  • The knapsack size = The limited resources (bandwidth)
  • Each item is a web page.
  • The weight of a web page is the cost of download (download time).
  • The value of the web page is how much it has contributed.

Why is detecting changes on the web so important?

  • Internet grows
    • (Janna Anderson, Lee Rainie, 2006)
  • Even Google is unable to keep up

The size and growth of the Web

Size and growth of the web
  • Outer circle shows the percentage of population that have daily access to Internet.
  • Inner circle shows the growth of Internet users from June 2000 to June 2007.
    • (Internet Usage Statistics)

The size of Internet tomorrow?

  • ?
  • Acronyms and abbreviations

    URL
    Unified Resource Locator
    EIAO
    European Internet Accessibility Observatory
    LA
    Learning Automata

    Bibliography

    Knapsack, 2007
    The Knapsack Problem, (2007), Online:http://en.wikipedia.org/wiki/Knapsack_problem
    Cho and Garcia-Molina, 2003
    J. Cho,H. Garcia-Molina: The Evolution of the Web and Implications for an Incremental Crawler. Technical Report, 1999
    Edwards et al., 2001
    Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In Proceedings of the Tenth Conference on World Wide Web: 106-113.
    John Oommen et. al, 2007
    John Oommen, Ole-Christoffer Granmo, Svein Arild Myrer and Morten Goodwin Olsen, (2007). Learning Automata-based Solutions to the Nonlinear Fractional Knapsack Problem with Applications to Optimal Resource Allocation IEEE Transactions on Systems, Man and Cybernetics
    Danny Sullivan, 2007
    Danny Sullivan (2007). "What Is Google PageRank? A Guide For Searchers & Webmasters". Online:http://searchengineland.com/070426-011828.php
    Janna Anderson, Lee Rainie, 2006
    Janna Anderson, Lee Rainie (2006). "The Future of the Internet II". Online:http://www.pewinternet.org/PPF/r/188/report_display.asp
    Internet Usage Statistics
    Internet Usage Statistics, (2007). Online:http://www.internetworldstats.com/stats.htm