Welcome to Web-mining and Data Analysis
- Introduction
- by Morten Goodwin Olsen
Lecture Outline
-
0800-0900:
-
- Web-mining and data analysis introduction
- Examples
- Course Outline
- Presentation of projects
- Individual introduction round
-
0900-1200:
-
What will you learn?
- Web and Crawling
-
- Lecturer
-
- How is it possible to search the almost
infinite World Wide Web in less than a second with
a search engine such as Google?
- Example
-
- When a web page is visited (crawled) each
hyperlink is extracted.
- The next web page to visit is one of the
hyperlinks just extracted.
- This simple random walk gradually crawls the
reachable web pages, visiting well-linked pages
(those with a high PageRank) most often.
- Date
-
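The crawling example above can be sketched as a random walk over a tiny, made-up link graph. This is a minimal illustration only: the `WEB` dictionary and the page names are assumptions standing in for real pages, and no network access is performed.

```python
import random

# Hypothetical in-memory "web": page -> hyperlinks found on that page.
WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": ["a.html"],
}

def random_walk_crawl(web, start, steps, seed=0):
    """Visit pages by repeatedly following one randomly chosen hyperlink."""
    rng = random.Random(seed)
    visits = {}
    page = start
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        links = web.get(page, [])      # "extract" the hyperlinks of this page
        if links:
            page = rng.choice(links)   # next page is one of the extracted links
        else:
            page = start               # dead end: restart from the start page
    return visits

visits = random_walk_crawl(WEB, "a.html", steps=1000)
for page, count in sorted(visits.items()):
    print(page, count)
```

Pages that many other pages link to should collect more visits, which is the intuition behind ranking pages by a random walk.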
What will you learn? (2)
- Pattern classification
-
-
Lecturer:
-
- How can I automatically classify my e-mail as
spam / not spam?
- Example
-
-
Two classes: Spam and not spam.
- The user manually categorises the e-mail as
spam or not spam.
- Each e-mail received is automatically
scanned, and each word is looked up to see how many
times it occurs in the two categories.
- If most of the words are in spam the e-mail
is categorised as spam. Otherwise it is
categorised as not spam.
- Date
-
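The spam example above can be sketched as a simple word-count vote. This is a minimal illustration under assumptions: the training e-mails are made up, words are split on whitespace, and a word votes for the category in which it has been seen more often.

```python
from collections import Counter

def train(labelled_emails):
    """Count how often each word occurs in each of the two categories."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in labelled_emails:
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Label an e-mail as spam if most of its words are more common in spam."""
    spam_votes = 0
    ham_votes = 0
    for word in text.lower().split():
        in_spam = counts["spam"][word]
        in_ham = counts["not spam"][word]
        if in_spam > in_ham:
            spam_votes += 1
        elif in_ham > in_spam:
            ham_votes += 1
    return "spam" if spam_votes > ham_votes else "not spam"

# Made-up e-mails, manually categorised by the "user".
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lecture notes for the course", "not spam"),
]
counts = train(training)
print(classify("buy cheap pills", counts))
print(classify("notes for monday lecture", counts))
```

Ties and unknown words fall back to "not spam" here; that is a design choice of this sketch, not something the slides specify.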
What will you learn? (3)
- Reinforcement Learning
-
- Lecturer
-
- How can I find the optimal action for an AI
in a real-time strategy game (e.g. Command and
Conquer)?
- Example
-
-
Two actions: Attack or retreat.
- At first, there is a probability of attack
at 0.5 (p(a)=0.5) and probability of retreat at
0.5 (p(r)=0.5).
-
- 1. Choose randomly among a and r based
on p(a) and p(r).
- 2. If attack fails at a given location,
decrease the probability of attack and
increase probability of retreat.
(Otherwise, do the opposite).
- 3. Goto 1
- The AI will then move towards the
optimal action without a teacher telling
it what the optimal action is.
- Date
-
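The three steps of the attack/retreat example can be sketched as follows. This is a minimal illustration, assuming a linear update with a hypothetical learning rate of 0.1 and made-up success probabilities for the simulated game; the slides do not fix any of these values.

```python
import random

def update(p_attack, action, succeeded, rate=0.1):
    """Step 2: shift probability towards the action that worked."""
    # Attack is favoured if attack succeeded or if retreat failed.
    favour_attack = (action == "attack") == succeeded
    if favour_attack:
        return p_attack + rate * (1.0 - p_attack)
    return p_attack - rate * p_attack

def learn(success_prob, rounds=2000, seed=1):
    """Repeat steps 1-3, starting from p(a) = p(r) = 0.5."""
    rng = random.Random(seed)
    p_attack = 0.5
    for _ in range(rounds):
        # Step 1: choose randomly among attack and retreat based on p(a), p(r).
        action = "attack" if rng.random() < p_attack else "retreat"
        # Simulated environment (assumption): each action succeeds with a fixed probability.
        succeeded = rng.random() < success_prob[action]
        p_attack = update(p_attack, action, succeeded)
    return p_attack

p = learn({"attack": 0.8, "retreat": 0.2})
print("p(a) =", round(p, 3), "p(r) =", round(1.0 - p, 3))
```

Only p(a) is stored, since p(r) = 1 - p(a); both updates keep the value between 0 and 1.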
What will you learn? (4)
- Content versus representation.
-
- Lecturer
-
- What is content?
- What is representation?
- How is it presented with CSS / CMS and what
Eclipse tools can be used for this?
- Date
-
What will you learn? (5)
- Software Agents
-
- Lecturer
-
- How to create software agents
- Example
-
-
Assignment of an agent: Turn
unstructured HTML into understandable and
usable material.
-
Input: Discussions as HTML.
-
Output: Discussions as XML including
meta data.
- Date
-
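The agent assignment above (discussions as HTML in, discussions as XML with metadata out) could look roughly like this. It is a sketch under assumptions: the `post`, `author` and `text` class names, the sample page and the `source` attribute are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class PostExtractor(HTMLParser):
    """Collect author and text of each post from a hypothetical forum layout."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # field that the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            self.posts.append({"author": "", "text": ""})
            self._field = None
        elif css_class in ("author", "text") and self.posts:
            self._field = css_class

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        if self._field and self.posts:
            self.posts[-1][self._field] += data.strip()

def html_to_xml(html, source):
    """Convert loosely structured discussion HTML to XML with metadata."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    # Metadata (assumed): where the discussion came from and how many posts it has.
    root = ET.Element("discussion", {"source": source, "posts": str(len(parser.posts))})
    for post in parser.posts:
        node = ET.SubElement(root, "post", {"author": post["author"]})
        node.text = post["text"]
    return ET.tostring(root, encoding="unicode")

# Sloppy input: unquoted attributes and missing closing tags.
SAMPLE = """<div class=post><span class=author>Kari</span><p class=text>Hello
<div class=post><span class=author>Ola</span><p class=text>Hi there"""

print(html_to_xml(SAMPLE, "forum.example"))
```

The parser tolerates the unclosed tags in `SAMPLE` because it reacts to start tags only; a real discussion board would need its own mapping from markup to fields.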
Lecture schedule
- 2007-08-23 (Thursday) - Web and Crawling
- 2007-08-30 (Thursday) - Pattern classification
- 2007-09-06 (Thursday) - Reinforcement learning
- 2007-09-11 (Tuesday) - Content versus
representation
- 2007-09-18 (Tuesday) - Software Agents
Deliverables
- 2007-09-20 - Choose project - Problem description
- 2007-10-04 - Exam (25% of your grade)
- 2007-10-18 - Motivation and planned experiments
- 2007-11-08 - Experiment results
- 2007-12-30 - Final project report
- 2007-12-07 - Presentation
How is your grade calculated?
- 25% - Result from your exam
- 75% - Project work including report and
presentation
Project nr. 1 - Readability index
- Develop a readability index as a plugin for the
Natural Language Toolkit (NLTK)
(http://nltk.sourceforge.net/index.php)
- Readability tests:
http://en.wikipedia.org/wiki/Category:Readability_tests
- Investigate the features of, and differences
between, the readability indices
-
Step 1: Develop a readability plugin for
English text.
-
Step 2: Extend the plugin to cover more
languages. Readability indices for other languages
have further requirements (determining the language
of a text, existing tests for different languages).
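As one concrete example of a readability test, the Automated Readability Index (ARI) uses only character, word and sentence counts: ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The sketch below is plain Python and is not tied to the NLTK plugin interface; the regular expressions used for tokenising are simplifying assumptions that only suit English text.

```python
import re

def automated_readability_index(text):
    """Compute the ARI of an English text with a very simple tokeniser."""
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of ASCII letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)

simple = "The cat sat. The dog ran."
hard = ("Comprehensive accessibility evaluations necessitate "
        "interdisciplinary collaboration.")
print(round(automated_readability_index(simple), 2))
print(round(automated_readability_index(hard), 2))
```

Higher scores indicate text that is harder to read; very short, simple texts can score below zero.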
Project nr. 2 - Subset sum problem
- Solve the subset sum problem using learning
algorithms.
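For reference, the subset sum problem asks whether some subset of a list of numbers adds up to a given target. The sketch below is an exact dynamic-programming baseline, not a learning algorithm; it assumes non-negative integers and could serve as a way to check the answers of a learning-based solver on small instances.

```python
def subset_sum(numbers, target):
    """Return one subset of non-negative integers summing to target, or None."""
    reachable = {0: []}  # sum -> one subset that produces that sum
    for n in numbers:
        # Iterate over a snapshot so each number is used at most once.
        for total, subset in list(reachable.items()):
            new_total = total + n
            if new_total <= target and new_total not in reachable:
                reachable[new_total] = subset + [n]
    return reachable.get(target)

print(subset_sum([3, 34, 4, 12, 5, 2], 9))
print(subset_sum([3, 34, 4, 12, 5, 2], 30))
```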
Project nr. 3 - Stock market indicator
- Use stock market information to develop a tool that
gives indications / stock market tips for investments in
high-risk companies. Stock brokers rely more and more
on tools to decide which stocks are
beneficial to invest in. This is especially true when it
comes to investing in high-risk
companies. In most situations it is not complicated to know
which companies will most likely increase
in value. However, for an investment to be
beneficial, this needs to be known before most other
people know it. The assignment is to use stock market
information as the basis for an investment tool. For
example:
http://www.dn.no/finans/aksjekurser/?marked=OSE&list=&sector=&symbol=FRO
includes information such as: "Anonymt
storkjøp i Sea Production" ("Anonymous large purchase in Sea Production"). This might
indicate that the stock will rise in value.
However, the information might be a rumour spread to
artificially increase the value of the company. In
contrast, "Klageflom på bil og mobil" ("Flood of complaints about cars and mobile phones") might
indicate that the stock will decrease in
value.
Project nr. 4 - Pathfinder
- Develop a pathfinder for ORTS. The idea for this
algorithm is to find the shortest path to an enemy unit
and attack this unit. In this assignment obstacles,
such as stationary units (mountains etc.) must be
considered.
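A shortest path around stationary obstacles can be illustrated with breadth-first search on a small grid. This is a sketch under assumptions: the map is a made-up grid where `#` marks an obstacle, movement is in four directions with equal cost, and nothing here uses the ORTS API.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a grid; cells marked '#' are obstacles."""
    rows, cols = len(grid), len(grid[0])
    previous = {start: None}  # cell -> cell it was reached from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back from goal to start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in previous:
                previous[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # the goal cannot be reached

# Hypothetical map: S = own unit, E = enemy unit, # = mountain.
GRID = [
    "S.#.",
    ".##.",
    "...E",
]
path = shortest_path(GRID, (0, 0), (2, 3))
print(path)
```

Breadth-first search is enough when every move costs the same; with varying terrain costs an algorithm such as Dijkstra's or A* would be the usual choice.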
Project nr. 5 - Resource allocation
- Resource allocation scheduling for ORTS: when to
devote time to gathering more resources. This could be a
so-called need-driven AI:
1. Observe and evaluate the environment.
2. Find the most urgent need, based on the environment and the self.
3. Act on that need.
(http://home.swipnet.se/dungeondweller/development/dev00055.htm)
Project nr. 6 - Scouting
- Scouting algorithm for ORTS. Develop an algorithm
that scouts the entire available map as fast as
possible.
Project nr. 7 - Build order
- Build order optimisation for ORTS. In which order
should the units be built to get the best
result?
Project nr. 8 - Attack strategy
- Attack strategy selector for ORTS. Several
attack strategies already exist in ORTS. These
include: stand (hold position), explore, group, defend
and attack. There are already automatic schemes
for finding the best strategy for each
unit using Monte Carlo evaluation (an example of Monte
Carlo evaluation can be found here:
http://graphics.stanford.edu/papers/scatteringeqns/se.pdf).
In this assignment you are to develop a learning
automata based scheme for choosing a strategy. Such a
scheme should ensure that:
- if attack fails (e.g. you lose more units than your enemy), the probability of choosing another strategy, such as defend, increases.
- if defend fails (e.g. you lose more units than your enemy), the probability of choosing another strategy, such as attack, increases.
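The penalty rule described above can be sketched for several strategies at once. This is a minimal illustration, assuming a hypothetical penalty rate of 0.2 and an even redistribution of the removed probability over the other strategies; it does not call into ORTS.

```python
import random

STRATEGIES = ["stand", "explore", "group", "defend", "attack"]

def penalise(probs, failed, rate=0.2):
    """Move probability from the failed strategy to all other strategies."""
    taken = rate * probs[failed]
    share = taken / (len(probs) - 1)
    return {name: (p - taken if name == failed else p + share)
            for name, p in probs.items()}

def choose(probs, rng):
    """Pick a strategy at random according to the current probabilities."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names])[0]

rng = random.Random(0)
probs = {name: 1.0 / len(STRATEGIES) for name in STRATEGIES}
updated = penalise(probs, "attack")  # e.g. attack lost more units than the enemy
print(updated)
print(choose(updated, rng))
```

After a failed attack, attack becomes less likely and every other strategy, including defend, becomes slightly more likely, while the probabilities still sum to one.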
Project nr. 9 - Accessibility of web pages for dyslexic
users
- Develop an indicator for the accessibility of a web
page for a dyslexic user.
- Apart from linguistic aspects (like the
readability) this should also take into account the
presentation, font and graphical representation.
- To be able to test the presentation of the page the
students have to evaluate the (X)HTML as well as the
CSS.
Project nr. 10 - Learning Automata based crawler resource
allocation
- The goal of the project is to successfully
implement a solution to the nonlinear fractional
knapsack problem, as proposed in the article
"Learning Automata-Based Solutions
to the Nonlinear Fractional Knapsack Problem With
Applications to Optimal Resource Allocation."
- For more information read the detailed
description on the web site.
- This project is given by Integrasco.
Project nr. 11 - Visualizing geographic distribution of
online discussions
- This project will consist of implementing a
framework for generating graphical presentations of the
geographic distribution of online discussions. The
framework will be used for displaying statistics in a
web portal.
- For more information read the detailed description
on the web site.
- This project is given by Integrasco.
Project nr. 12 - Social Networks
- The goal of this project is to investigate ways of
visualizing relationships between users in on-line
communities. The result should be an easy-to-understand
visual representation of these relations, preferably in
the form of a navigational and interactive map embedded
in a web page.
- For more information read the detailed description
on the web site.
- This project is given by Integrasco.
Project nr. 13 - Classification of On-Line Discussion
Board Structures
- The goal is to design and implement a classifier
that is able to classify which template a specific
discussion board uses. The students will be given a
training set of discussion boards by Integrasco. They
will also be supplied with a set of known templates.
The implementation should then be tested on a set of
unclassified discussion boards, also given by the
employer.
- For more information read the detailed description
on the web site.
- This project is given by Integrasco.
Project nr. 14 - The Artificial Intelligence of TripleA.
- TripleA is a turn-based strategy game engine and
Axis and Allies clone. In short, TripleA allows online
and offline play against human as well as AI opponents.
- Currently, the AI opponent is rather weak: it is,
for instance, unable to adapt its strategy to respond to
novel situations. The goal of this project is to
explore the current TripleA AI and to identify its main
weaknesses and strengths. It is also desirable to take
a look at the underlying implementation of the TripleA
AI, so that the identified weaknesses/strengths can be
explained. Finally, suggestions for improving the AI
are welcome (but not necessary for the purpose of this
project).
- Note that follow-up projects in later terms will
consider how machine learning and pattern recognition
can be applied to improve the TripleA AI.
-
More information about "TripleA" and "Axis and
Allies" can be found at:
-
- http://triplea.sourceforge.net/mywiki
-
http://triplea.sourceforge.net/mywiki/Developers
-
https://www.wizards.com/default.asp?x=ah/prod/axis
Project nr. 15 - Readability evaluation for people with
reading difficulties.
- People with reading difficulties often find it hard
to concentrate on a text because of the graphical
representation. Develop a test that evaluates the
graphical presentation of a web page with regard to its
design for people with dyslexia.
- Background:
http://www.thepickards.co.uk/Articles/Designing_for_Dyslexia.cfm
- Explore state of the art and background on reading
difficulties / dyslexia. Develop a set of indicators
that can be measured in web pages. How can HTML and
CSS input be assessed?
- Implement the test module.
Project nr. 16 - Accelerate stochastic search on the
line.
- How can the stochastic search on the line be accelerated
when the distribution of the parameter Lambda is known?
Project nr. 17 - Your suggestion.
- To be filled in by students.
Who is Morten Goodwin Olsen?
-
Name: Morten Goodwin Olsen
-
E-mail: morten.g.olsen@hia.no
-
Phone number: +47 95 24 86 79
-
Current research: The European Internet
Accessibility Observatory
-
General research interests:
-
- Web Crawling
- Resource Allocation
- Reinforcement Learning
- Mostly anything that cannot easily be modelled
as deterministic and requires learning algorithms
to survey
Briefly explain (max 3 minutes):
- Your name
- What prior knowledge you have related to the
web-mining course.
- What you want to learn from the web-mining course.
- Any other comments or questions