Web Information Extraction – SS 2005
This is last term's Web page.
Subtitle: Wissenschaftliches Arbeiten Proseminar (181.081), SS 2.0h
Web page from previous semester: WS 2004/05
First session: Monday, March 7, 2005
Registration: no longer possible
Contact: gatterX@Xdbai.tuwien.ac.at (Wolfgang Gatterbauer)
Web Information Extraction is the process of extracting relevant information from Web pages and transforming the extracted content into a form suitable for computerized data-processing applications. An example application is to locate individual products and their attributes within online product catalogues and provide these data in a structured format such as XML or a relational database.
In order to solve this complex task, the research community is currently working on combining and advancing techniques from several disciplines such as artificial intelligence, information retrieval and information extraction.
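The product-catalogue example above can be illustrated with a minimal sketch. It assumes a hypothetical, perfectly regular HTML table in which every product row and attribute cell carries a `class` attribute; the sample markup, the class names and the helper names `extract_products` and `to_xml` are illustrative assumptions, not part of the course material. Real catalogues are far less regular, which is why the papers in the literature list rely on learned or induced wrappers rather than hand-written rules like these.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical catalogue fragment (an assumption for illustration only).
SAMPLE_HTML = """
<table>
  <tr class="product"><td class="name">Laptop A</td><td class="price">999.00</td></tr>
  <tr class="product"><td class="name">Camera B</td><td class="price">249.50</td></tr>
</table>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <tr class="product">, keyed by each cell's class."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # dict for the row currently being read
        self._field = None    # attribute name of the cell currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "product":
            self._current = {}
        elif tag == "td" and self._current is not None:
            self._field = attrs.get("class")

    def handle_data(self, data):
        # Only text inside a labelled cell of a product row is kept.
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current is not None:
            self.products.append(self._current)
            self._current = None


def extract_products(html):
    """Return a list of attribute dicts, one per product row."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def to_xml(products):
    """Serialize the extracted records as a small XML document."""
    root = ET.Element("products")
    for product in products:
        item = ET.SubElement(root, "product")
        for key, value in product.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


products = extract_products(SAMPLE_HTML)
xml_output = to_xml(products)
```

The hand-written rules break as soon as the page layout changes, which is the maintenance problem that wrapper induction and automatic extraction approaches try to address.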
The aim of this course is to introduce students to current research through individual reading, analysis and comparison of selected publications. Students will present their insights in a seminar class and will engage in a collective discussion of the topics that have been addressed.
March 7, 2005: (pdf)
Teams of two students: The literature list groups the publications into pairs. Students form teams of two, and each team takes one pair from the list, with each team member choosing one of its two papers.
Three reports and three presentations per team: Each student prepares one report and one presentation on his or her paper. In addition, each team prepares one joint report and one joint presentation comparing the two papers. The three reports together should be no longer than eight pages. The three consecutive presentations should take a maximum of 45 minutes, leaving 15 minutes for discussion.
Focus on insights, not wording: In grading, we do not focus on wording, style or length, but rather on the quality and insightfulness of the interpretations and conclusions and the challenging remarks of the students. Reports and presentations are expected to show the students' capacity to reflect on the papers and put the concepts into context with the overall topic. Simple summaries are not accepted.
Intermediate hand-in: Prior to the hand-in of the reports and the presentations, students will introduce each other to their respective papers with a very short presentation of one A4 slide (max. 30 sec). The slide should contain the main proposition of the respective paper in the form of a legible hand drawing in landscape format. This preparatory meeting enables the students to relate their papers to the other publications before they begin to analyse them in considerable depth.
Final hand-in: After the last presentation, each team will prepare one slide that relates all the papers covered in class to each other. The choice of the representation scheme is entirely up to the team.
Active participation: An important aspect of this course is active participation of the students in class. Students are expected to pose questions to the presenters and to participate in discussions recognizing the similarities and differences between the different techniques presented.
English: Reports are to be written and presentations held in English. The level of English, however, is not a focus of grading.
Feedback: After each presentation the audience is encouraged to give feedback to the presenters by using a simple feedback form (pdf). The staff will not see this feedback. It is therefore not used for grading but serves merely as a diverse collection of impressions and suggestions for possible improvement from the students themselves.
March 7, 2005: (pdf)
• Report: 30%
• Presentation and subsequent discussion: 40%
• Active participation in proseminar (incl. intermediate and final hand-ins): 30%
• Genuine interest in a booming research area
• Willingness to both work independently and in a team of two
• Willingness to actively participate in group discussions
• Reasonable grasp of English
Please verify your data and update us with changes or missing information (ListOfParticipants).
Due to possible copyright issues, the directory with the papers is password protected. All publications are in English. Each two consecutively numbered papers (odd and even) belong to one topic and will be analysed together by one team of two students.
Lecture notes and hand-ins of students (in pdf)
March 7, 2005:
April 12, 2005:
#3 Stefan Schoenig: Ion Muslea, Steve Minton, Craig Knoblock. A Hierarchical Approach to Wrapper Induction. Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. (Paper, Report, Presentation)
#4 Marco Schoenig: Ling Liu, Calton Pu, Wei Han. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. International Conference on Data Engineering (ICDE), pages 611-621, 2000. (Paper, Report, Presentation)
#5 René Kiesler: D. W. Embley, Y. Jiang, Y.-K. Ng. Record-Boundary Discovery in Web Documents. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Philadelphia, Pennsylvania, pages 467-478. (Paper, Report, Presentation)
#6 Christoph Veigl: Kristina Lerman, Lise Getoor, Steven Minton, Craig Knoblock. Using the Structure of Web Sites for Automatic Segmentation of Tables. Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France. (Paper, Report, Presentation)
#7 Marian Schedenig: Kristina Lerman, Craig Knoblock, Steven Minton. Automatic Data Extraction from Lists and Tables in Web Sources. In Proceedings of the workshop on Advances in Text Extraction and Mining, IJCAI-2001. (Paper, Report, Presentation)
#9 Sunil Pilani: William W. Cohen, Matthew Hurst, Lee S. Jensen. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In The Eleventh International World Wide Web Conference WWW-2002, 2002. (Paper, Report, Presentation)
#13 Gregor Pridun: Yalin Wang, Jianying Hu. A Machine Learning Based Approach for Table Detection on The Web. Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, 2002. (Paper, Report, Presentation)
#14 Max Arends: Jiying Wang, Fred H. Lochovsky. Data-rich Section Extraction from HTML pages. Proceedings of the 3rd International Conference on Web Information Systems Engineering, pages 313-322, 2002. (Paper, Report, Presentation)
#15 Stefan Bischof: Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, Nigel R. Shadbolt. Automatic Ontology-Based Extraction from Web Documents. IEEE Intelligent Systems, Volume 18, Issue 1, January 2003. (Paper, Report, Presentation)
#16 Stefan Rümmele: David W. Embley, Cui Tao, Stephen W. Liddle. Automating the Extraction of Data from HTML Tables with Unknown Structure. Submitted, May 2003 (source: http://www.deg.byu.edu/papers/). (Paper, Report, Presentation)
#19 Leopold Redlingshofer: Jiying Wang, Frederick H. Lochovsky. Data Extraction and Label Assignment for Web Databases. Proceedings of the twelfth international conference on World Wide Web, Budapest, Hungary, 2003. (Paper, Report, Presentation)
#20 Edvin Seferovic: Hui Song, Suraj Giri, Fanyuan Ma. Data Extraction and Annotation for Dynamic Web Pages. 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04), March 28-31, 2004, Taipei, Taiwan. (Paper, Report, Presentation)
#21 Jeremy Solarz: Chia-Hui Chang, Shao-Chen Lui, Yen-Chin Wu. Applying Pattern Mining to Web Information Extraction. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001. (Paper, Report, Presentation)
#22 Tobias Dönz: Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, California. (Paper, Report, Presentation)
R1 Martin Zeilinger: Newswrapper for Brokers (Report, Presentation)
• "10 beliebte Fehler bei Vorträgen" ("10 common mistakes in presentations", pdf), Ludwig-Maximilians-Universität München
• ACM-Portal: Scientific publications of the Association for Computing Machinery
• CiteSeer: Relations between publications
• DBLP Library (Uni Trier): Relations between authors
• Google Scholar: Google for scientific publications
• Scopus: "The world's largest abstract database of scientific literature"