Web Information Extraction – SS 2005


Subtitle: Wissenschaftliches Arbeiten Proseminar (181.081), SS 2.0h

Location: Seminar rooms 184/2 and 184/3 (Favoritenstraße 9-11/3rd floor) 

Supervisors: Georg Gottlob, Marcus Herzog

Responsible for content and methodology: Wolfgang Gatterbauer, Bernhard Krüpl

First session: Monday, March 7, 2005

Web Information Extraction is the process of extracting relevant information from Web pages and transforming the extracted content into a form suitable for computerized data-processing applications. An example application is to locate individual products and their attributes within online product catalogues and provide these data in a structured format such as XML or a relational database.
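The product-catalogue example above can be illustrated with a minimal sketch. It is not a method taken from any paper on the literature list. The HTML snippet, the class names `product`, `name` and `price`, and the helper names `ProductParser`, `extract_products` and `to_xml` are all assumptions made for this illustration. Only the Python standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical catalogue snippet; the markup and class names are assumptions.
CATALOGUE_HTML = """
<ul>
  <li class="product"><span class="name">USB cable</span><span class="price">4.90</span></li>
  <li class="product"><span class="name">Keyboard</span><span class="price">19.00</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []   # extracted records, in document order
        self._field = None   # attribute currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})          # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class           # next text node fills this field

    def handle_data(self, data):
        if self._field is not None:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return the product records found in the given HTML string."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def to_xml(products):
    """Serialise the extracted records as a small XML document."""
    root = ET.Element("catalogue")
    for record in products:
        item = ET.SubElement(root, "product")
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


products = extract_products(CATALOGUE_HTML)
xml_output = to_xml(products)
print(xml_output)
```

A hand-written parser like this one only works while the page layout stays fixed. The publications discussed in this course address the harder problem of learning or inferring such extraction rules automatically.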


In order to solve this complex task, the research community is currently working on combining and advancing techniques from several disciplines such as artificial intelligence, information retrieval and information extraction.


The aim of this course is to introduce students to current research through individual reading, analysis and comparison of selected publications.  Students will present their insights in a seminar class and will engage in a collective discussion of the topics that have been addressed.


March 7, 2005: ShortIntroToWIE (pdf)




Teams of two students:  The literature list groups the publications into pairs. Students form teams of two, and each team works on one pair: each team member chooses one of the two papers in that pair.


Three reports and three presentations per team:  Each student prepares one report and one presentation on his or her paper. In addition, each team prepares one joint report and one joint presentation comparing the two papers. The three reports together should be no longer than eight pages.  The three consecutive presentations should take a maximum of 45 minutes, leaving 15 minutes for discussion.


Focus on insights, not wording:  In grading, we do not focus on wording, style or length, but rather on the quality and insightfulness of the interpretations and conclusions and the challenging remarks of the students. Reports and presentations are expected to show the students' capacity to reflect on the papers and put the concepts into context with the overall topic. Simple summaries are not accepted.


Intermediate hand-in: Prior to the hand-in of the reports and the presentations, students will introduce each other to their respective papers in a very short presentation (max. 30 sec) with one A4 slide. The slide should show the main proposition of the paper as a legible hand drawing in landscape format. This preparatory meeting enables the students to relate their papers to the other publications before they begin to analyse them in considerable depth.
(April 12, 2005: IntermediateHand-Ins, pdf)


Final hand-in: After the last presentation, each team will prepare one slide in which it tries to relate all the papers covered in class to each other. The choice of the representation scheme is entirely up to the team.
(April 27, 2005: FinalHand-Ins, pdf)


Active participation: An important aspect of this course is active participation of the students in class. Students are expected to pose questions to the presenters and to participate in discussions recognizing the similarities and differences between the different techniques presented.


English: Reports are to be written and presentations held in English. The level of English, however, is not a focus of grading.


Feedback: After each presentation the audience is encouraged to give feedback to the presenters by using a simple feedback form (FeedbackForm, pdf). The staff will not see this feedback. It is therefore not used for grading but serves merely as a diverse collection of impressions and suggestions for possible improvement from the students themselves.


March 7, 2005: ShortIntroToAdmin (pdf)




  Report: 30%

  Presentation and subsequent discussion: 40%

  Active participation in proseminar (incl. intermediate and final hand-ins): 30%




  Genuine interest in a booming research area

  Willingness to both work independently and in a team of two

  Willingness to actively participate in group discussions

  Reasonable grasp of English



Related courses

  Web Datenextraktion und –integration (181.130 VU), Robert Baumgartner, WS 2.0h

  Semistrukturierte Daten 2 (181.139 VU), Georg Lausen, Reinhard Pichler and Fang Wei, SS 2.0h




Date What Place
Mo 7.3. 13:30 - 14:00 First session: Content and methodology of class SE 184/2
Mo 14.3. 15:00 - 15:30 Literature list with papers is available online -
Fr 18.3. 23:59 Email us your choice of team partner and paper from literature list -
Fr 8.4. 23:59 Intermediate hand-in (1 scanned hand drawing per student) -
Tu 12.4. 15:00 - 16:30 Discussion of intermediate hand-ins SE 184/2
Fr 15.4. 23:59 Reports due by email (3 per team, pdf format) -
Mo 18.4. 12:00 Reports of students are available online -
Th 21.4. 14:00 - 15:00 #19, #20 Leopold Redlingshofer, Edvin Seferovic SE 184/3
15:00 - 16:00 #3, #4 Stefan & Marco Schönig
16:00 - 17:00 #5, #6 René Kiesler, Christoph Veigl
17:00 - 18:00 #9, #10 Sunil Pilani, Friedrich Dimmel
Fr 22.4. 09:30 - 10:10 Martin Zeilinger SE 184/3
10:15 - 11:05 Tamir Hassan
11:15 - 12:05 Rainer Dobiasch
12:30 - 14:00 Joint Pizza Lunch (optional) SE 184/2
14:00 - 15:00 #7, #8 Paul Bohunsky, Marian Schedenig SE 184/3
15:00 - 16:00 #21, #22 Jeremy Solarz, Tobias Dönz
16:00 - 17:00 #13, #14 Gregor Pridun, Max Arends
17:00 - 18:00 #15, #16 Stefan Bischof, Stefan Rümmele

Mo 25.4. 23:59 Presentations (3 per team, pdf format) and final hand-in (1 page per team, pdf format) due by email -




Literature list

Due to possible copyright issues, the directory with the papers is password protected. All publications are in English. Each pair of consecutively numbered papers (odd and even) belongs to one topic and is analysed together by one team of two students (Literature list).




Lecture notes and hand-ins of students (in pdf)

March 7, 2005: ShortIntroToWIE

March 7, 2005: ShortIntroToAdmin

April 12, 2005: IntermediateHand-Ins

April 27, 2005: FinalHand-Ins




#3 Stefan Schoenig: Ion Muslea, Steve Minton, Craig Knoblock. A Hierarchical Approach to Wrapper Induction. Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. (Paper, Report, Presentation)


#4 Marco Schoenig: Ling Liu, Calton Pu, Wei Han. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. International Conference on Data Engineering (ICDE), pages 611-621, 2000. (Paper, Report, Presentation)


Critical comparison #3/#4: (Report, Presentation)

#5 René Kiesler: D. W. Embley, Y. Jiang, Y.-K. Ng. Record-Boundary Discovery in Web Documents. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Philadelphia, Pennsylvania, 467 - 478. (Paper, Report, Presentation)


#6 Christoph Veigl: Kristina Lerman, Lise Getoor, Steven Minton, Craig Knoblock. Using the Structure of Web Sites for Automatic Segmentation of Tables. Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France. (Paper, Report, Presentation)


Critical comparison #5/#6: (Report, Presentation)

#7 Marian Schedenig: Kristina Lerman, Craig Knoblock, Steven Minton. Automatic Data Extraction from Lists and Tables in Web Sources. In Proceedings of the workshop on Advances in Text Extraction and Mining, IJCAI-2001. (Paper, Report, Presentation)


#8 Paul Bohunsky: Bing Liu, R. Grossman, Yanhong Zhai. Mining Web Pages for Data Records. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C., 2003. (Paper, Report, Presentation)


Critical comparison #7/#8: (Report, Presentation)

#9 Sunil Pilani: William W. Cohen, Matthew Hurst, Lee S. Jensen. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In The Eleventh International World Wide Web Conference WWW-2002, 2002. (Paper, Report, Presentation)


#10 Friedrich Dimmel: Tao Fu, Mengchi Liu. A Gateway From HTML to XML. IDEAS 2004: 205-21. (Paper, Report, Presentation)


Critical comparison #9/#10: (Report, Presentation)

#13 Gregor Pridun: Yalin Wang, Jianying Hu. A Machine Learning Based Approach for Table Detection on The Web. Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, 2002. (Paper, Report, Presentation)


#14 Max Arends: Jiying Wang, Fred H. Lochovsky. Data-rich Section Extraction from HTML pages. Proceedings of the 3rd International Conference on Web Information Systems Engineering, pages 313-322, 2002. (Paper, Report, Presentation)


Critical comparison #13/#14: (Report, Presentation)

#15 Stefan Bischof: Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, Nigel R. Shadbolt. Automatic Ontology-Based Extraction from Web Documents. IEEE Intelligent Systems, Volume 18, Issue 1, January 2003. (Paper, Report, Presentation)


#16 Stefan Rümmele: David W. Embley, Cui Tao, Stephen W. Liddle. Automating the Extraction of Data from HTML Tables with Unknown Structure. Submitted, May 2003. (Paper, Report, Presentation)


Critical comparison #15/#16: (Report, Presentation)

#19 Leopold Redlingshofer: Jiying Wang, Frederick H. Lochovsky. Data Extraction and Label Assignment for Web Databases. Proceedings of the twelfth international conference on World Wide Web, Budapest, Hungary, 2003. (Paper, Report, Presentation)


#20 Edvin Seferovic: Hui Song, Suraj Giri, Fanyuan Ma. Data Extraction and Annotation for Dynamic Web Pages. 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04) March 28 - 31, 2004 Taipei, Taiwan. (Paper, Report, Presentation)


Critical comparison #19/#20: (Report, Presentation)

#21 Jeremy Solarz: Chia-Hui Chang, Shao-Chen Lui, Yen-Chin Wu. Applying Pattern Mining to Web Information Extraction. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001. (Paper, Report, Presentation)


#22 Tobias Dönz: Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, California. (Paper, Report, Presentation)


Critical comparison #21/#22: (Report, Presentation)

R1 Martin Zeilinger: Newswrapper for Brokers (Report, Presentation)


R2 Tamir Hassan: Intelligent text extraction from PDF documents with Lixto (Report, Presentation)


R3 Rainer Dobiasch: Schema Matching (Report, Presentation)




