Which of the pages A,B or C are most important (has
the highest PageRank) in this example.
PageRank - Example (2)
PR(A) = ?
PR(B) = ?
PR(C)=0.3
c(A) = 1
c(B) = 4
c(C) = 3
PageRank
What if
Page A links to Page Ba
and
Page B links to Page A
PR(A) depends on PR(B)
PR(B) depends on PR(A)
PageRank - Example
PR(A) = ?
PR(B) = ?
PR(C)=0.3
c(A) = 1
c(B) = 4
c(C) = 3
PageRank
Solution
Repeat until PR(A) and PR(B) are stable
Calculate PR(A) with the last known value
of PR(B)
Calculate PR(B) with the last known value
of PR(A)
PageRank (2)
What does Google do with PageRanks (as far as we
know)
Presents the results according to the PageRank
Crawls the pages according to PageRank
Random Walk - PageRank
Follows each link randomly
Will visit pages with a lot of in-links (high
pagerank) more often than pages with few in-links.
Incremental Web Crawling
"Given that the bandwidth for conducting crawls is
neither infinite nor free it is becoming essential to
crawl the Web in a not only scalable, but efficient way
if some reasonable measure of quality or freshness is
to be maintained."
(Edwards et al., 2001)
.
How to measure your crawler
Freshness
Age
Number of real downloads
Number of missed updates
Deep / hidden web
Some part of the web is hidden from crawlers such
as;
Pages which are password protected (e.g. online
banks)
Pages which are only visible through forms
(e.g. guestbook)
Pages that update so fast that the crawlers
never have the time to download them (e.g. online
news papers).
Pages that are not linked from anywhere or
provided as a seed URL.
Pages that are hidden with inaccessible content
(e.g. only flash links not directly parsable by
most crawlers).
Quote from Garcia
When a page changes too often, the crawler will
waste time by trying to recrawl it too fast and still
will not be able to keep its copy of the page fresh.
(Cho and Garcia-Molina, 2003)
Example:
Try to google for an online newspaper and you
will in most cases find out that the cache is a
couple of days old (and thus useless).
Recrawling
What is the difference between crawling and
recrawling?
Recrawling assumes you have already knownledge
of all pages.
Round Robin
List of pages = {P1,P2,...PN}
1. Get next available page from list of pages
2. Download the page
3. If capacity is full, wait until new time slot is
available
4. Goto 1
Weighted Proportional
List of pages = {P1,P2,...PN}
Each page has value one (1) if updated / zero (0)
if not updated
Assumes that the average probability of a page
being updated is known
1. Order list of pages by probability of being
updated
2. Download Pi with the highest probability of
being updated
3. Remove Pi from list of pages
4. If capacity met, goto 1
5. Otherwise goto 2
Disadvantage ?
Weighted Proportional (2)
Disadvantage:
Needs to calculate the average probability of
each page before it can be applied
Learning Automata Solution
Each web page is connected to a LA Learning Automata.
Each Learning Automata has a defined probability of
being visited.
Whenever a page is downloaded and it is updated,
the probability increases.
Whenever a page is downloaded and it is not
updated, the probability decreases
1. Chose randomly based on the probabilities of
each page.
2. Update probability (increase probability of
updated, decrease if not)
3. Goto 1
(John Oommen et. al, 2007)
Learning Automata Solution (2)
Theroretical Approach to Recrawling
Knapsack
(Knapsack, 2007)
(John Oommen et. al, 2007)
The Knapsack Problem
What does a knapsack has to do with resource
allocation and crawling?
The knapsack problem can be described as following
A thief goes into a museum carrying a knapsack
with limited capacity
He/She wants to steal as much of value as
possible without his knapsack being exceeded.
In other words he wants to know how to pack his
knapsack to let to reach the most valuable
solution.
The Knapsack Problem (2)
The knapsack can be seen as the resources available
from e.g. a crawler.
The museum items can be seen as contributions
retrieved from accessing the content. E.g. how valuable
the downloaded page is?
How probable is it that the page I download is
updated?
Fractional Knapsack Problem
The thief can put any percentage of the items into
the knapsack (not just complete parts).
The optimal solution to this is as following
(known as the greedy algorithm):
1. Take as much as possible of the most
valueble item/size.
2. When this is empty, take as much as possible
of the second most valueble item/size.
3. Continue until the knapsack is full.
Fractional Knapsack Solution
Green: 10/12 = 0.83333
Blue: 4/2 = 2
Gray: 2/1 = 2
Red: 1/1 = 1
Yellow: 2/4 = 0.5
Optimal value: 16
Example of Assignment
Find the most optimal solution for the following
Items
a (v=10,w=2.5)
b (v=20,w=6)
c (v=13,w=5)
d (v=14,w=5)
In a fractional knapsack with constraint C=10
Example of Solution
Find the most optimal solution for the following
Items
a (v=10,w=2.5, v/w=4)
b (v=20,w=6, v/w=3.33)
c (v=13,w=5, v/w=2.6)
d (v=14,w=5, v/w=2.8)
In a fractional knapsack with constraint C=10
Solution:
1*a+1*b+0.3*c+0*d
Weight = 1*2.5+1*6+0.3*5=10
Value = 1*10+1*20+0.3*14=34.2
Binary Knapsack Problem
The thief can put only 100% or 0% of the items into
the knapsack (not just complete parts).
No direct optimal solution exists.
Binary Knapsack Solution
Green: 10/12 = 0.83333
Blue: 4/2 = 2
Gray: 2/1 = 2
Red: 1/1 = 1
Yellow: 4/2 = 0.5
Optimal value: 16
Example of Assignment
Find the most optimal solution for the following
Items
a (v=10,w=2.5)
b (v=20,w=6)
c (v=13,w=5)
d (v=14,w=5)
In a binary knapsack with constraint C=10
Example of Solution
Find the most optimal solution for the following
Items
a (v=10,w=2.5)
b (v=20,w=6)
c (v=13,w=5)
d (v=14,w=5)
In a binary knapsack with constraint C=10
Solution
1*a+1*b+0*c+0*d=1*a+1*b
Weight = 1*2.5+1*6 = 8.5
Value = 1*10+1*20 = 30
Comparing Crawling and Knapsack
Binary Approach
All pages are downloaded completely or not
downloaded at all.
Fractional approach
All pages are downloaded with a given
probability (fraction = probability).
Fractional Knapsack With Unknown Distribution
Unknown weight and value.
The thief can put any percentage (fraction) of the
items into the knapsack (not just complete parts).
The thief does not know how much each item weights
or how valuable it is before he has put it in his
knapsack.
The thief can never be sure if he has the most
optimal solution, because he does not know the value
and weight of the other items.
Fractional Knapsack With Unknown Distribution (2)
With regards to web crawling
You do not know if a page is updated or not until
you download it
Fractional Knapsack With Unknown Distribution (3)
Greedy algorithm does not work.
This is very close to web crawling:
The knapsack size = The limited resources
(bandwidth)
Each item is a web page.
The weight of a web page is the cost of download
(download time).
The value of the web page is how much it has
contributed.
Why is detecting changes on the web so important?
Internet grows
(Janna Anderson, Lee Rainie, 2006)
Even Google is unable to keep up
The size and growth of the Web
Outer circle shows the percentage of population
that have daily access to Internet.
Inner circle shows the growth of Internet users
from June 2000 to June 2007.
(Internet Usage Statistics)
The size of Internet tomorrow?
?
Acronyms and abbreviations
URL
Unified Resource Locator
EIAO
European Internet Accessibility Observatory
LA
Learning Automata
Bibliography
Knapsack, 2007
The Knapsack Problem, (2007),
Online:http://en.wikipedia.org/wiki/Knapsack_problem
Cho and Garcia-Molina, 2003
J. Cho,H. Garcia-Molina: The Evolution of
the Web and Implications for an
Incremental Crawler. Technical Report,
1999
Edwards et al., 2001
Edwards, J., McCurley, K. S., and Tomlin,
J. A. (2001). "An adaptive model for
optimizing performance of an incremental
web crawler". In Proceedings of the Tenth
Conference on World Wide Web: 106-113.
John Oommen et. al, 2007
John Oommen, Ole-Christoffer Granmo,
Svein Arild Myrer and Morten Goodwin
Olsen, (2007). Learning Automata-based
Solutions to the Nonlinear Fractional
Knapsack Problem with Applications to
Optimal Resource Allocation IEEE
Transactions on Systems, Man and
Cybernetics
Danny Sullivan, 2007
Danny Sullivan (2007). "What Is Google
PageRank? A Guide For Searchers &
Webmasters".
Online:http://searchengineland.com/070426-011828.php
Janna Anderson, Lee Rainie, 2006
Janna Anderson, Lee Rainie (2006). "The
Future of the Internet II".
Online:http://www.pewinternet.org/PPF/r/188/report_display.asp
Internet Usage Statistics
Internet Usage Statistics, (2007).
Online:http://www.internetworldstats.com/stats.htm