PSwie: Main course page
- 09.10.2005: The new Web page is online at http://education.dbai.tuwien.ac.at/wie/WS05/.
- 18.10.2005: The new schedule and a detailed description of the first homework assignments are online.
- 02.11.2005: Reading on the Pyramid Principle sent out by email.
- 04.11.2005: Individual papers distributed by email.
- 07.11.2005: Optional storylining template sent out by email.
- 14.11.2005: Presentations from class sent out by email.
- 26.11.2005: Optional presentation template sent out by email.
- 02.12.2005: Detailed guidance for the last two homework assignments is posted.
- 12.12.2005: Internal PSwiki access possible.
- 16.02.2006: Proseminar Web Information Extraction will be held again in SS 2006. This semester's lecturers will be Tamir Hassan, Max Göbel, Bernhard Krüpl and Wolfgang Holzinger. New Proseminar Web page: Web Information Extraction - SS 2006
The goal of this course is to
prepare students for scientific work in the domain of Web Information
Extraction. The course will focus on providing students with
opportunities to demonstrate critical thinking, i.e. the ability to
question existing results and to put things into perspective.
Web Information Extraction
(WIE) is the process of extracting relevant information from Web pages
and transforming the extracted content into a form suitable for
computerized data-processing applications. For example, one application
of WIE would be to locate individual products and their attributes
within online product catalogues and to provide these data in a
structured format such as XML or a relational database.
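The product catalogue example above can be made concrete with a small sketch. It is only an illustration, not a technique taken from the course readings: the HTML snippet, the class names `product`, `name` and `price`, and the output element name `catalogue` are all assumptions invented for this example. A hand-written parser like this is brittle; much of the research covered in the course aims to learn or generalise such extraction rules instead of coding them by hand.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical catalogue markup; the class names are assumptions for this sketch.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">19.90</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">34.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []  # extracted records, in document order
        self.field = None   # attribute currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self.field = css_class

    def handle_data(self, data):
        # Only text inside a recognised <span> is stored.
        if self.field is not None:
            self.products[-1][self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None


def extract_products(html):
    """Return a list of {attribute: value} dicts found in the page."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def to_xml(products):
    """Serialise the extracted records as a small XML document."""
    root = ET.Element("catalogue")
    for product in products:
        node = ET.SubElement(root, "product")
        for key, value in product.items():
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


products = extract_products(SAMPLE_HTML)
xml_output = to_xml(products)
print(xml_output)
```

The two steps mirror the definition: `extract_products` locates the relevant content, and `to_xml` transforms it into a structured form that other programs can process.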
In order to solve this
complex task, the research community is continuously working on
combining and advancing a number of techniques from research areas such
as artificial intelligence, information retrieval, natural language
processing, wrapper learning, ontology modeling, supervised and
unsupervised machine learning, statistics and data mining.
The course introduces
students to current research through individual reading, analysis and
comparison of selected publications. Students will present their
insights in class and will engage in a collective discussion of the
specific topics addressed each Friday.
Procedure of the course (teaching methodology)
The general readings serve as preparation for class discussions. Three
short homework assignments are related to the readings: two hand-drawn
one-slide graphical summaries and a one-page written defence of one's own
standpoint. Students will receive feedback twice: first, explicitly from
the lecturers, and second, implicitly by comparing what they came up
with against the solutions of their colleagues.
One-slide graphical summaries
For several reasons, graphics are often better suited to convey meaning
than text. With these graphical summaries, students are asked to
condense the content of a paper into one slide, which should contain the
main proposition of the respective paper in the form of a legible hand
drawing in landscape format. (See last term's intermediate hand-ins, pdf.)
This term students draw three slides: two for the general readings and one intermediate hand-in for the individual paper.
Example 1, Example 2, Example 3
Individual papers: Each student chooses one
paper from the individual paper list and prepares the intermediate
hand-in, a report and a presentation of his or her paper. The report
should be no longer than four pages. The presentation should take a
maximum of 15 minutes, leaving 15 minutes for discussion. Before the
reports and presentations are handed in, students will introduce their
respective papers to each other in a preparatory meeting with a
very short presentation (max. 30 sec) of their intermediate hand-ins.
The slide should contain the main proposition of the respective paper in
the form of a legible hand drawing in landscape format. This
preparatory meeting enables the students to relate their papers to the
other publications before they begin to analyse them in considerable detail.
Teams of two students
Every two students form one team. Discussions following each individual
presentation start with a fictitious debate between the student
presenting the paper and his or her partner, in which one student takes
the role of the proponent, and the other takes the role of the opponent
of the solution presented in the paper. Later, the two partners prepare
the entries into the wiki and also the final hand-in together.
Wiki: This Web page is part of a larger knowledge base in the form of a wiki. More details are inside.
The wiki is password protected. The password is case-sensitive! Type it carefully.
Final hand-ins: The final hand-ins are
prepared in a team of two after all students have entered their results
into the wiki. The teams will prepare a graphical representation, in
which they try to relate all the papers covered in class to each other.
(See last term's final hand-ins, pdf.)
In contrast to last semester, not only the choice of the representation
scheme, but also the tool used to create the graphics, is entirely up
to the teams. Examples of such tools are:
An important aspect of this course is the active participation of the
students in class. Students are expected to pose questions to the
presenters and to participate in discussions that identify the
similarities and differences between the techniques presented.
Feedback: After each presentation the audience is encouraged to give feedback to the presenters using a simple feedback form. (See last term's feedback form, pdf.)
The staff will not see this feedback. It is therefore not used for
grading but serves merely as a diverse collection of impressions and
suggestions for possible improvement from the students themselves.
To dos and grading
Overview of to dos and relative weight (see the detailed specifications of to dos):
- Analysis of general readings (~20%)
  - One-page hand drawing on first reading
  - One-page hand drawing on second reading
  - Written one-pager on the subject of text editors vs. WYSIWYG
- Analysis of individual paper (~40%)
  - One-page hand-drawn intermediate hand-in
  - Short report: max. 4 pages
  - Presentation + Discussion: max. 30 min per paper
- Relating issues (~40%)
  - Active participation in classes
  - Wiki: 4 entries per team
  - Final hand-in: one graphic (skipped this term)
Focus on insights, not wording
In grading, we do not focus on wording, style or length, but rather on
the quality and insightfulness of the interpretations and conclusions
and the challenging remarks of the students. Hand-ins and presentations
are expected to show the students' capacity to reflect on the papers and
put the concepts into context with the overall topic. Simple summaries
are not accepted.
Reports and presentations are to be written and held in English. The
level of English, however, is not a focus of grading.
- Willingness to work both independently and in a team of two
- Willingness to actively participate in group discussions
- Genuine interest in a booming research area
- Reasonable grasp of English
Class Content and Slides
• Introduction to PS Web Information Extraction (slides)
• Read [Eik99] and draw a one page slide summary of the systems discussed
• Discussion of [Eik99]
• Topic Graphics vs. Text (slides)
• Read [Cot99] and draw a one page summary of your opinion on the subject
• Read the short paper on the Pyramid Principle sent out by email
• Discussion of the Minto Pyramid Principle (slides)
• Discussion of [Cot99] (slides)
• Choice of individual papers
• Write a one pager in which you argue for your opinion on the subject discussed in [Cot99] and in class
• Draw your intermediate hand-in
• Discussion of intermediate hand-ins
• Tips on presentations (slides)
• Hand in your individual reports
• Student presentations
• Use of PS WIE Wiki
• Deadline: hand-in of revised presentations
• Deadline: hand-in of revised reports
• Deadline: wiki entries
• Final hand-ins (skipped; instead prepare for our wrap-up next class)
• Wrap-up: Putting things into context
• Current research at DBAI
- One-page hand drawing on [Eik99]
- One-slide opinion on [Cot99]
- One-page personal statement on WYSIWYG vs. coding text
- One-page hand-drawn intermediate hand-in
- Short reports
General reading list
- [Eik99]: Line Eikvil, Information Extraction from World Wide Web - A Survey, Rapport Nr. 945, July, 1999. ISBN 82-539-0429-0. (online, November 7, 2005).
- [Cot99]: Allin Cottrell, Word Processors: Stupid and Inefficient, Online position paper, 1999. (online, November 7, 2005).
- [Pyramid principle]: Sent out by email
- [Gat05]: Canceled; just discussed in class
Suggested individual paper list
- [CHJ02]: William W. Cohen, Matthew Hurst, Lee S. Jensen, A flexible learning system for wrapping tables and lists in HTML documents, In Proc. of the 11th international conference on World Wide Web, May 2002, Honolulu, Hawaii, USA.
- [CYW+03]: Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, Extracting content structure for web pages based on visual representation, In Proc. 5th APWeb, Springer, 2003, 406-417.
- [Etz04]: Oren Etzioni et al., Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison, In Proc. of the 19th National Conference on Artificial Intelligence (AAAI), July 2004, San Jose, California, USA.
- [Han00]: John C. Handley, Table analysis for multiline cell identification, Proc. SPIE Vol. 4307, p. 34-43, Document Recognition and Retrieval VIII, Paul B. Kantor; Daniel P. Lopresti; Jiangying Zhou; Eds., 2000.
- [PMW+03]: David Pinto, Andrew McCallum, Xing Wei, W. Bruce Croft, Table extraction using conditional random fields, In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2003, Toronto, Canada.
- [SRH+03]: Zach Solan, Eytan Ruppin, David Horn and Shimon Edelman, Unsupervised efficient learning and representation of language structure, In Proc. 25th Conference of the Cognitive Science Society, July 2003, Boston, MA, USA.
- [WaH02]: Yalin Wang, Jianying Hu, A machine learning based approach for table detection on the web, In Proc. of the 11th international conference on World Wide Web, May 2002, Honolulu, Hawaii, USA.
- [YTS01]: Minoru Yoshida, Kentaro Torisawa, Junichi Tsujii, A method to integrate tables of the world wide web, In Proc. 1st International Workshop on Web Document Analysis, Sept. 2001, Seattle, WA, USA.
- [ZhL05]: Yanhong Zhai, Bing Liu, Web data extraction based on partial tree alignment, In Proc. of the 14th international conference on World Wide Web, May 2005, Chiba, Japan.