This is a static copy of what used to be a dynamic wiki page hosted on our former DBAI web server. Provided here just as is (Sept 2015)

Main course page


Administrative issues

Subtitle: Proseminar Scientific Research (181.081, PS 2.0h)
Location: Seminar room 184/2 (Favoritenstraße 9-11/3rd floor)
Supervisors: Georg Gottlob, Marcus Herzog
Lecturer: Wolfgang Gatterbauer
First session: Friday, October 14, 2005, 10:15
Registration: Per email before first session
Contact: gatterx(@) (Wolfgang Gatterbauer)
Web page from previous semester: Web Information Extraction - SS 2005

  • 09.10.2005: The new Web page is online at
  • 18.10.2005: The new schedule and a detailed description of first homeworks are online.
  • 02.11.2005: Reading on the Pyramid Principle sent out per email.
  • 04.11.2005: Individual papers distributed per email.
  • 07.11.2005: Optional storylining template sent out per email.
  • 14.11.2005: Presentations from class sent out per email.
  • 26.11.2005: Optional presentation template sent out per email.
  • 02.12.2005: Detailed guidance for last two homeworks is posted.
  • 12.12.2005: Internal PSwiki access possible.
  • 16.02.2006: Proseminar Web Information Extraction will be held again in SS 2006. This semester's lecturers will be Tamir Hassan, Max Göbel, Bernhard Krüpl and Wolfgang Holzinger. New Proseminar Web page: Web Information Extraction - SS 2006


The goal of this course is to prepare students for scientific work in the domain of Web Information Extraction. The course will focus on providing students with opportunities to demonstrate critical thinking, i.e. the ability to question existing results and to put things into perspective.


Web Information Extraction (WIE) is the process of extracting relevant information from Web pages and transforming the extracted content into a form suitable for computerized data-processing applications. For example, one application of WIE would be to locate individual products and their attributes within online product catalogues and to provide these data in a structured format such as XML or a relational database.

In order to solve this complex task, the research community is continuously working on combining and advancing a number of techniques from research areas such as artificial intelligence, information retrieval, natural language processing, wrapper learning, ontology modeling, supervised and unsupervised machine learning, statistics and data mining.
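The product catalogue example above can be illustrated with a minimal sketch. It is a hand-written wrapper for a single, invented page layout, far simpler than the learned techniques listed above. The sample markup, the class names `product`, `name` and `price`, and the XML element names are all assumptions made for this illustration and do not come from any paper discussed in the course.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical catalogue page; markup and class names are invented for this sketch.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Desk lamp</span> <span class="price">19.90</span></li>
  <li class="product"><span class="name">Bookshelf</span> <span class="price">54.00</span></li>
</ul>
"""


class ProductWrapper(HTMLParser):
    """Hand-written wrapper that collects the name and price of each product."""

    FIELDS = ("name", "price")

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per extracted product
        self._field = None   # attribute currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            # A new product record starts here.
            self.products.append({})
        elif tag == "span" and css_class in self.FIELDS and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            record = self.products[-1]
            # Append, in case the parser delivers the text in several chunks.
            record[self._field] = record.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Locate the products in the page and return them as a list of dicts."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.products


def to_xml(products):
    """Serialise the extracted records into a simple XML structure."""
    root = ET.Element("catalogue")
    for product in products:
        item = ET.SubElement(root, "product")
        for field, value in product.items():
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


products = extract_products(SAMPLE_PAGE)
xml_output = to_xml(products)
print(xml_output)
```

A wrapper like this breaks as soon as the page layout changes, which is one reason why research in this area looks at learning wrappers or exploiting visual and structural cues instead of hard-coding them.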

The course introduces students to current research through individual reading, analysis and comparison of selected publications. Students will present their insights in class and will engage in a collective discussion of the specific topics addressed each Friday.

Procedure of the course (teaching methodology)

General reading: Four readings serve as preparation for class discussions. Three short homeworks are related to the readings: two hand-drawn one-slide graphical summaries and a one-page written defence of one's own standpoint. Students will receive feedback twice: first, explicitly from the lecturers, and second, implicitly by being able to compare what they came up with to the solutions of their colleagues.

One-slide graphical summaries: For several reasons, graphics are often better suited to convey meaning than text. With these graphical summaries, students are asked to condense the content of a paper into one slide, which should contain the main proposition of the respective paper in the form of a legible hand drawing in landscape format. (See last term's intermediate hand-ins, pdf). This term students draw three slides: two for the general readings and one intermediate hand-in for the individual paper.

Example 1, Example 2, Example 3

Individual paper: Each student chooses one paper from the individual paper list and prepares the intermediate hand-in, a report and a presentation of his or her paper. The report should be no longer than four pages. The presentations should take a maximum of 15 minutes, leaving 15 minutes for discussion. Before the reports and presentations are handed in, students introduce their respective papers to each other in a preparatory meeting with a very short presentation (max. 30 sec) of their intermediate hand-ins. The slide should contain the main proposition of the respective paper in the form of a legible hand drawing in landscape format. This preparatory meeting enables the students to relate their papers to the other publications before they begin to analyse them in considerable depth.

Teams of two students: Every two students form one team. Discussions following each individual presentation start with a fictitious debate between the student presenting the paper and his or her partner, in which one student takes the role of the proponent and the other takes the role of the opponent of the solution presented in the paper. Later, the two partners prepare the wiki entries and the final hand-in together.

Wiki: This Web page is part of a larger knowledge base in the form of a wiki. More details are inside.

The wiki is password protected. The password is case-sensitive, so type it carefully.

Final hand-in: The final hand-ins are prepared in a team of two after all students have entered their results into the wiki. The teams will prepare a graphical representation, in which they try to relate all the papers covered in class to each other. (See last term's final hand-ins, pdf). In contrast to last semester, not only the choice of the representation scheme, but also the tool used to create the graphics, is entirely up to the teams. Examples of such tools are:

Active participation: An important aspect of this course is the active participation of the students in class. Students are expected to pose questions to the presenters and to participate in discussions recognizing the similarities and differences between the different techniques presented.

Feedback: After each presentation the audience is encouraged to give feedback to the presenters by using a simple feedback form (See last term's feedback form, pdf). The staff will not see this feedback. It is therefore not used for grading but serves merely as a diverse collection of impressions and suggestions for possible improvement from the students themselves.

To dos and grading

Overview of to dos and relative weights (see the detailed specifications of the to dos):
  • Analysis of general readings (~20%)
    • One-page hand drawing on first reading
    • One-page hand drawing on second reading
    • Written one-pager on the subject of text editors vs. WYSIWYG
  • Analysis of individual paper (~40%)
    • One-page hand-drawn intermediate hand-in
    • Short report: max. 4 pages
    • Presentation + discussion: max. 30 min per paper
  • Relating issues (~40%)
    • Active participation in classes
    • Wiki: 4 entries per team
    • Final hand-in: one graphic (skipped this term)

Focus on insights, not wording: In grading, we do not focus on wording, style or length, but rather on the quality and insightfulness of the interpretations and conclusions and the challenging remarks of the students. Hand-ins and presentations are expected to show the students' capacity to reflect on the papers and put the concepts into context with the overall topic. Simple summaries are not accepted.

English: Reports and presentations are to be written and held in English. The level of English, however, is not a focus of grading.


  • Willingness to work both independently and in a team of two
  • Willingness to actively participate in group discussions
  • Genuine interest in a booming research area
  • Reasonable grasp of English

Related courses

WS 2005/06

SS 2005

WS 2004/05

Schedule (preliminary)

Friday classes

Each entry lists the week, the date and time, the homework, and the class content and slides.

Week 41
Fr 14.10. 10:15
Homework: -
Class content and slides:
• Introduction to PS Web Information Extraction (slides)

Week 42
We 19.10. 23:59
Homework:
• Read [Eik99] and draw a one-page slide summary of the systems discussed

Fr 21.10. 10:00
Homework: -
Class content and slides:
• Discussion of [Eik99]
• Topic: Graphics vs. Text (slides)

Week 43: -

Week 44
We 2.11. 23:59
Homework:
• Read [Cot99] and draw a one-page summary of your opinion on the subject

Fr 4.11. 10:00
Homework:
• Read the short paper on the Pyramid Principle sent out by email
Class content and slides:
• Discussion of the Minto Pyramid Principle (slides)
• Discussion of [Cot99] (slides)
• Choice of individual papers

Week 45: -

Week 46
We 16.11. 23:59
Homework:
• Write a one-pager in which you argue for your opinion on the subject discussed in [Cot99] and in class
• Draw your intermediate hand-in

Fr 18.11. 10:00
Homework: -
Class content and slides:
• Discussion of intermediate hand-ins
• Tips on presentations (slides)

Week 47: -

Week 48
We 30.11. 23:59
Homework:
• Hand in your individual reports

Fr 2.12. 8:30
Homework: -
Class content and slides:
• Student presentations
• Use of PS WIE Wiki

Week 49: -

Week 50
We 14.12. 23:59
Homework:
• Deadline: hand-in of revised presentations
• Deadline: hand-in of revised reports
• Deadline: wiki entries
• Final hand-ins (skipped; instead prepare for our wrap-up next class)

Fr 16.12. 10:00
Homework: -
Class content and slides:
• Wrap-up: Putting things into context
• Current research at DBAI

Class material



  • One-page hand drawing on [Eik99]
  • One-slide opinion on [Cot99]
  • One-page personal statement on WYSIWYG vs. coding text
  • One-page hand-drawn intermediate hand-in
  • Short reports
  • Presentations

General reading list

  • [Eik99]: Line Eikvil, Information Extraction from World Wide Web - A Survey, Rapport Nr. 945, July, 1999. ISBN 82-539-0429-0. (online, November 7, 2005).
  • [Cot99]: Allin Cottrell, Word Processors: Stupid and Inefficient, Online position paper, 1999. (online, November 7, 2005).
  • [Pyramid principle]: Sent out per email
  • [Gat05]: Canceled; just discussed in class

Suggested individual paper list

  • [CHJ02]: William W. Cohen, Matthew Hurst, Lee S. Jensen, A flexible learning system for wrapping tables and lists in HTML documents, In Proc. of the 11th international conference on World Wide Web, May 2002, Honolulu, Hawaii, USA.
  • [CYW+03]: Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, Extracting content structure for web pages based on visual representation, In Proc. 5th APWeb, Springer, 2003, 406-417.
  • [Etz04]: Oren Etzioni et al., Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison, In Proc. of the 19th National Conference on Artificial Intelligence (AAAI), July 2004, San Jose, California, USA.
  • [Han00]: John C. Handley, Table analysis for multiline cell identification, Proc. SPIE Vol. 4307, p. 34-43, Document Recognition and Retrieval VIII, Paul B. Kantor; Daniel P. Lopresti; Jiangying Zhou; Eds., 2000.
  • [PMW+03]: David Pinto, Andrew McCallum, Xing Wei, W. Bruce Croft, Table extraction using conditional random fields, In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2003, Toronto, Canada.
  • [SRH+03]: Zach Solan, Eytan Ruppin, David Horn and Shimon Edelman, Unsupervised efficient learning and representation of language structure, In Proc. 25th Conference of the Cognitive Science Society, July 2003, Boston, MA, USA.
  • [WaH02]: Yalin Wang, Jianying Hu, A machine learning based approach for table detection on the web, In Proc. of the 11th international conference on World Wide Web, May 2002, Honolulu, Hawaii, USA.
  • [YTS01]: Minoru Yoshida, Kentaro Torisawa, Junichi Tsujii, A method to integrate tables of the world wide web, In Proc. 1st International Workshop on Web Document Analysis, Sept. 2001, Seattle, WA, USA.
  • [ZhL05]: Yanhong Zhai, Bing Liu, Web data extraction based on partial tree alignment, In Proc. of the 14th international conference on World Wide Web, May 2005, Chiba, Japan.