Welcome to Web-mining and Data Analysis

  • Introduction
  • by Morten Goodwin Olsen

Lecture Outline

  • 0800-0900:
    • Web-mining and data analysis introduction
    • Examples
    • Course Outline
    • Presentation of projects
    • Individual introduction round
  • 0900-1200:
    • Web and Crawling

What will you learn?

  • Web and Crawling
    • Lecturer
      • Morten Goodwin Olsen
    • How is it possible to search the almost infinite World Wide Web in less than a second with a search engine such as Google?
    • Example
      • When a web page is visited (crawled) each hyperlink is extracted.
      • The next web page to visit is one of the hyperlinks just extracted.
      • Repeating this gives a simple random walk that crawls the web automatically and tends to visit pages with a high PageRank more often.
    • Date
      • 2007-08-23
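
The crawling example above can be sketched as a random walk over a small link graph. This is only an illustration: the `LINKS` dictionary is a hypothetical in-memory "web" (no real pages are fetched), and restarting from the start page at a dead end is an assumption of this sketch, not part of the lecture.

```python
import random

# Hypothetical in-memory "web": page -> hyperlinks extracted from that page.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

def random_walk_crawl(links, start, steps, seed=0):
    """Visit `steps` pages, each time following one randomly chosen hyperlink."""
    rng = random.Random(seed)
    visits = {}
    page = start
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        outgoing = links.get(page, [])
        # Dead end: restart from the start page (an assumption of this sketch).
        page = rng.choice(outgoing) if outgoing else start
    return visits

visits = random_walk_crawl(LINKS, "a.html", steps=1000)
```

Pages with more incoming links (here `c.html`) collect more visits than pages with fewer (here `b.html`), which is the intuition behind ranking pages by how often a random surfer reaches them.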

What will you learn? (2)

  • Pattern classification
    • Lecturer:
      • Ole-Christoffer Granmo
    • How can I automatically classify my e-mail as spam / not spam?
    • Example
      • Two classes: Spam and not spam.
      • The user manually categorises the e-mail as spam or not spam.
      • Each e-mail received is automatically scanned, and each word is looked up to see how many times it occurs in the two categories.
      • If most of the words occur mainly in spam, the e-mail is categorised as spam. Otherwise it is categorised as not spam.
    • Date
      • 2007-08-30
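
The word-counting example above might look roughly like the following. It is a minimal sketch: the training e-mails are made up, the label `"ham"` is used here as shorthand for "not spam", and a tie is resolved as not spam, which is an assumption of this sketch.

```python
from collections import Counter

def train(labelled_emails):
    """Count how often each word occurs in e-mails marked spam / not spam."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in labelled_emails:
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Label an e-mail spam if most of its words are seen more often in spam."""
    spam_votes = ham_votes = 0
    for word in text.lower().split():
        if counts["spam"][word] > counts["ham"][word]:
            spam_votes += 1
        elif counts["ham"][word] > counts["spam"][word]:
            ham_votes += 1
    # Ties (including e-mails with only unknown words) count as not spam.
    return "spam" if spam_votes > ham_votes else "not spam"

# Hypothetical training data, categorised by hand.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lecture notes for monday", "ham"),
]
counts = train(training)
```

Calling `classify("win cheap money", counts)` then votes word by word, exactly as in the bullets above.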

What will you learn? (3)

  • Reinforcement Learning
    • Lecturer
      • Ole-Christoffer Granmo
    • How can I find the optimal action for an AI in a real-time strategy game (e.g. Command and Conquer)?
    • Example
      • Two actions: Attack or retreat.
      • At first, there is a probability of attack at 0.5 (p(a)=0.5) and probability of retreat at 0.5 (p(r)=0.5).
        • 1. Choose randomly among a and r based on p(a) and p(r).
        • 2. If attack fails at a given location, decrease the probability of attack and increase probability of retreat. (Otherwise, do the opposite).
        • 3. Goto 1
        • The AI will then move towards the optimal action without a teacher telling it which action is optimal.
    • Date
      • 2007-09-06
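
The attack / retreat loop above can be sketched as a two-action learning automaton. The slide only describes what happens when attack fails or succeeds; treating retreat symmetrically, the learning rate of 0.1, and the `success_prob` environment are assumptions of this sketch.

```python
import random

def update(p_attack, action, success, rate=0.1):
    """Shift probability towards the action that worked, away from one that failed."""
    # Attack is favoured if attack succeeded or retreat failed.
    favour_attack = (action == "attack") == success
    target = 1.0 if favour_attack else 0.0
    return p_attack + rate * (target - p_attack)

def run(success_prob, rounds=2000, seed=1):
    """Repeat steps 1-3 from the slide, starting from p(a) = p(r) = 0.5."""
    rng = random.Random(seed)
    p_attack = 0.5
    for _ in range(rounds):
        # Step 1: choose randomly among attack and retreat based on p(a), p(r).
        action = "attack" if rng.random() < p_attack else "retreat"
        # The (hypothetical) environment decides whether the action succeeded.
        success = rng.random() < success_prob[action]
        # Step 2: adjust the probabilities; step 3: go to 1.
        p_attack = update(p_attack, action, success)
    return p_attack

p_attack = run({"attack": 0.9, "retreat": 0.1})
```

Since p(r) is always 1 - p(a), only `p_attack` needs to be stored. In an environment where attack usually succeeds, `p_attack` drifts well above 0.5 without anyone telling the automaton which action is better.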

What will you learn? (4)

  • Content versus representation.
    • Lecturer
      • Andreas Prinz
    • What is content?
    • What is representation?
    • How is it presented with CSS / CMS and what Eclipse tools can be used for this?
    • Date
      • 2007-09-11

What will you learn? (5)

  • Software Agents
    • Lecturer
    • How to create software agents
    • Example
      • Assignment of an agent: Convert unstructured HTML into understandable and usable material.
      • Input: Discussions as HTML.
      • Output: Discussions as XML including meta data.
    • Date
      • 2007-09-18
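
As a rough illustration of the agent's input and output, the sketch below turns a discussion in HTML into XML with the author as metadata. The markup it expects (`class="post"`, `class="author"`, `class="text"`) and the output element names are hypothetical; real discussion boards will differ.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class PostExtractor(HTMLParser):
    """Collect author and text of each post from a hypothetical forum markup."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            self.posts.append({"author": "", "text": ""})
        elif css_class in ("author", "text") and self.posts:
            self.field = css_class

    def handle_endtag(self, tag):
        self.field = None

    def handle_data(self, data):
        if self.field and self.posts:
            self.posts[-1][self.field] += data.strip()

def html_to_xml(html):
    """Return the discussion as an XML string, one <post> element per post."""
    parser = PostExtractor()
    parser.feed(html)
    root = ET.Element("discussion")
    for post in parser.posts:
        node = ET.SubElement(root, "post", author=post["author"])
        node.text = post["text"]
    return ET.tostring(root, encoding="unicode")

# The <p> tags are deliberately left unclosed, as often happens in real pages.
SAMPLE = (
    '<div class="post"><b class="author">Ann</b><p class="text">Hello</div>'
    '<div class="post"><b class="author">Bob</b><p class="text">Hi Ann</div>'
)
xml_output = html_to_xml(SAMPLE)
```

The standard-library `HTMLParser` does not require well-formed input, which is why it is used here for the unstructured side, while `xml.etree.ElementTree` guarantees well-formed output.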

Lecture schedule

  • 2007-08-23 (Thursday) - Web and Crawling
  • 2007-08-30 (Thursday) - Pattern classification
  • 2007-09-06 (Thursday) - Reinforcement learning
  • 2007-09-11 (Tuesday) - Content versus representation
  • 2007-09-18 (Tuesday) - Software Agents

Deliverables

  • 2007-09-20 - Choose project - Problem description
  • 2007-10-04 - Exam (25% of your grade)
  • 2007-10-18 - Motivation and planned experiments
  • 2007-11-08 - Experiment results
  • 2007-12-30 - Final project report
  • 2007-12-07 - Presentation

How is your grade calculated?

  • 25% - Result from your exam
  • 75% - Project work including report and presentation

Project nr. 1 - Readability index

  • Develop a readability index as a plugin for the Natural Language toolkit (http://nltk.sourceforge.net/index.php)
  • Readability tests: http://en.wikipedia.org/wiki/Category:Readability_tests
  • Investigate the features of, and differences between, the readability indices
  • Step 1: Develop a readability plugin for English text
  • Step 2: Extend the plugin to cover more languages
    • Readability index for other languages: further requirements (determine the language of a text, existing tests for different languages)
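
As a starting point for Step 1, one well-known English readability test is the Flesch Reading Ease score, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below is plain Python, not an NLTK plugin; the syllable counter is a crude vowel-group heuristic and the function names are made up for illustration.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Institutional considerations necessitate comprehensive "
    "organisational reconsideration."
)
```

Short sentences of one-syllable words score high and long words in long sentences score low. The constants are tuned for English, which is one reason Step 2 needs language detection and language-specific tests.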

Project nr. 2 - Subset sum problem

  • Solve the subset sum problem using learning algorithms.
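
To check the answers of a learning-based solver on small instances, an exact baseline is useful. The sketch below is ordinary dynamic programming, not a learning algorithm, and it assumes non-negative integers.

```python
def subset_sum(numbers, target):
    """Return one subset of `numbers` that sums to `target`, or None."""
    reachable = {0: []}  # sum -> one subset that produces it
    for n in numbers:
        # Iterate over a snapshot so each number is used at most once.
        for total, subset in list(reachable.items()):
            new_total = total + n
            if new_total <= target and new_total not in reachable:
                reachable[new_total] = subset + [n]
    return reachable.get(target)

solution = subset_sum([3, 34, 4, 12, 5, 2], 9)
```

The table of reachable sums grows with the target value, so this baseline only works for small targets, which is part of the motivation for trying learning algorithms on larger instances.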

Project nr. 3 - Stock market indicator

  • Use stock market information to develop a tool that gives indications / stock market tips for investments in high-risk companies.
  • Stock brokers rely more and more on tools to decide which stocks are beneficial to invest in. This is especially true when it comes to investing in high-risk companies.
  • In most situations it is not complicated to know which companies will most likely increase in value. However, for an investment to be beneficial, this needs to be known before most other people know it.
  • The assignment is to use stock market information as the basis for an investment tool. For example, http://www.dn.no/finans/aksjekurser/?marked=OSE&list=&sector=&symbol=FRO includes information such as:
    • "Anonymt storkjøp i Sea Production" ("Anonymous large purchase in Sea Production"). This might indicate that the stock will rise in value. However, the information might be a rumour spread to artificially increase the value of the company.
    • "Klageflom på bil og mobil" ("Flood of complaints about cars and mobile phones"). In contrast, this might indicate that the stock will decrease in value.

Project nr. 4 - Pathfinder

  • Develop a pathfinder for ORTS. The idea for this algorithm is to find the shortest path to an enemy unit and attack this unit. In this assignment obstacles, such as stationary units (mountains etc.) must be considered.
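
The shortest-path idea can be sketched with breadth-first search on a grid. This is a simplification: the grid representation, the `#` obstacle marker and four-directional movement are assumptions of this sketch and do not use the ORTS API.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search; '#' cells are obstacles. Returns cells or None."""
    rows, cols = len(grid), len(grid[0])
    previous = {start: None}  # cell -> cell it was reached from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in previous:
                previous[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # the goal cannot be reached

# Hypothetical map: the unit starts top-left, the enemy is top-right.
GRID = [
    "..#.",
    "..#.",
    "....",
]
path = shortest_path(GRID, (0, 0), (0, 3))
```

Breadth-first search finds a shortest path when every move costs the same. With varying terrain costs or large maps, an algorithm such as A* would be the usual next step.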

Project nr. 5 - Resource allocation

  • Resource allocation scheduling for ORTS: when to devote time to getting more resources. This could be a so-called need-driven AI:
    • 1. Observe and evaluate the environment
    • 2. Find the most urgent need, based on the environment and the self
    • 3. Act on that need
  • Reference: http://home.swipnet.se/dungeondweller/development/dev00055.htm
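
The "find the most urgent need" step could be sketched as scoring each need from the observed state and picking the highest score. The state fields, need names and scoring formulas below are all invented for illustration and are not taken from ORTS or the referenced article.

```python
def most_urgent_need(state):
    """Score each need between 0 and 1 and return the most urgent one."""
    urgency = {
        # Few minerals in stock: gathering becomes urgent.
        "gather_resources": max(0, 100 - state["minerals"]) / 100,
        # Enemies near the base: defending becomes urgent.
        "defend_base": min(1.0, state["enemies_near_base"] / 5),
        # Small army: building units becomes urgent.
        "build_units": max(0, 10 - state["units"]) / 10,
    }
    return max(urgency, key=urgency.get)

# Hypothetical observations of the environment and the self.
calm = {"minerals": 20, "enemies_near_base": 0, "units": 8}
under_attack = {"minerals": 90, "enemies_near_base": 5, "units": 8}
```

Here `most_urgent_need(calm)` favours gathering, while `most_urgent_need(under_attack)` favours defence. Deciding how these scores should be computed, or learned, is the actual content of the project.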

Project nr. 6 - Scouting

  • Scouting algorithm for ORTS. Develop an algorithm that scouts the entire available map as fast as possible.

Project nr. 7 - Build order

  • Build order optimisation for ORTS. In what order should the units be built to get the best result?

Project nr. 8 - Attack strategy

  • Attack strategy selector for ORTS. Several attack strategies already exist in ORTS. These include: stand (hold position), explore, group, defend and attack.
  • There are already automatic schemes for finding the best strategy for each unit using Monte Carlo evaluation (an example of Monte Carlo evaluation can be found here: http://graphics.stanford.edu/papers/scatteringeqns/se.pdf).
  • In this assignment you are to develop a learning-automata-based scheme for choosing a strategy. Such a scheme should include:
    • If attack fails (e.g. you lose more units than your enemy), the probability of choosing another strategy, such as defend, should increase.
    • If defend fails (e.g. you lose more units than your enemy), the probability of choosing another strategy, such as attack, should increase.
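
One possible form of the probability update is sketched below: when a strategy fails, part of its probability is taken away and shared equally among the other strategies. The penalty rate of 0.2 and the equal sharing are assumptions of this sketch; the project is free to choose another learning automata scheme.

```python
STRATEGIES = ["stand", "explore", "group", "defend", "attack"]

def penalise(probs, failed, rate=0.2):
    """Move probability away from a failed strategy towards all the others."""
    removed = rate * probs[failed]
    share = removed / (len(probs) - 1)
    return {
        s: (p - removed if s == failed else p + share)
        for s, p in probs.items()
    }

# Start with every strategy equally likely, then record a failed attack.
probs = {s: 1 / len(STRATEGIES) for s in STRATEGIES}
probs = penalise(probs, "attack")
```

After the failed attack, `probs["attack"]` has dropped and every other strategy, including defend, has become slightly more likely, while the probabilities still sum to one.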

Project nr. 9 - Accessibility of web pages for dyslexic users

  • Develop an indicator for the accessibility of a web page for a dyslexic user.
  • Apart from linguistic aspects (like the readability) this should also take into account the presentation, font and graphical representation.
  • To be able to test the presentation of the page the students have to evaluate the (X)HTML as well as the CSS.

Project nr. 10 - Learning Automata based crawler resource allocation

  • The goal of the project is to successfully implement a solution to the nonlinear fractional knapsack problem, as proposed in the article “Learning Automata-Based Solutions to the Nonlinear Fractional Knapsack Problem With Applications to Optimal Resource Allocation.”
  • For more information read the detailed description on the web site.
  • This project is given by Integrasco.

Project nr. 11 - Visualizing geographic distribution of online discussions

  • This project will consist of implementing a framework for generating graphical presentations of the geographic distribution of online discussions. The framework will be used for displaying statistics in a web portal.
  • For more information read the detailed description on the web site.
  • This project is given by Integrasco.

Project nr. 12 - Social Networks

  • The goal of this project is to investigate ways of visualizing relationships between users in on-line communities. The result should be an easy-to-understand visual representation of these relations, preferably in the form of a navigational and interactive map embedded in a web page.
  • For more information read the detailed description on the web site.
  • This project is given by Integrasco.

Project nr. 13 - Classification of On-Line Discussion Board Structures

  • The goal is to design and implement a classifier that is able to classify which template a specific discussion board uses. The students will be given a training set of discussion boards by Integrasco. They will also be supplied with a set of known templates. The implementation should then be tested on a set of unclassified discussion boards, also given by the employer.
  • For more information read the detailed description on the web site.
  • This project is given by Integrasco.

Project nr. 14 - The Artificial Intelligence of TripleA.

  • TripleA is a turn-based strategy game engine and Axis and Allies clone. In short, TripleA allows online and offline play against human as well as AI opponents.
  • Currently, the AI opponent is rather weak: it is, for instance, unable to adapt its strategy to respond to novel situations. The goal of this project is to explore the current TripleA AI and to identify its main weaknesses and strengths. It is also desirable to take a look at the underlying implementation of the TripleA AI, so that the identified weaknesses/strengths can be explained. Finally, suggestions for improving the AI are welcome (but not necessary for the purpose of this project).
  • Note that follow-up projects in later terms will consider how machine learning and pattern recognition can be applied to improve the TripleA AI.
  • More information about "TripleA" and "Axis and Allies" can be found at:
    • http://triplea.sourceforge.net/mywiki
    • http://triplea.sourceforge.net/mywiki/Developers
    • https://www.wizards.com/default.asp?x=ah/prod/axis

Project nr. 15 - Readability evaluation for people with reading difficulties.

  • People with reading difficulties often find it hard to concentrate on a text because of the graphical representation. Develop a test that evaluates the graphical presentation of a web page with regard to its design for people with dyslexia.
  • Background: http://www.thepickards.co.uk/Articles/Designing_for_Dyslexia.cfm
  • Explore the state of the art and background on reading difficulties / dyslexia. Develop a set of indicators that can be measured in web pages. How can HTML + CSS input be assessed?
  • Implement the test module.

Project nr. 16 - Accelerate stochastic search on the line.

  • How to accelerate the stochastic search on the line when the distribution of the parameter lambda is known.

Project nr. 17 - Your suggestion.

  • To be filled in by students.

Example of ORTS

  • Show example of ORTS

Who is Morten Goodwin Olsen?

  • Name: Morten Goodwin Olsen
  • E-mail: morten.g.olsen@hia.no
  • Phone number: +47 95 24 86 79
  • Current research: The European Internet Accessibility Observatory
  • General research interests:
    • Web Crawling
    • Resource Allocation
    • Reinforcement Learning
    • Mostly anything that cannot easily be modelled as deterministic and requires learning algorithms to survey

Briefly explain (max 3 minutes):

  • Your name
  • What prior knowledge you have related to the web-mining course.
  • What you want to learn from the web-mining course.
  • Any other comments or questions