Web Information Extraction – SS 2005
This is last term's Web page.
Subtitle: Wissenschaftliches Arbeiten Proseminar (181.081), SS 2.0h
Location: Seminar rooms 184/2 and 184/3 (Favoritenstraße 9-11/3rd floor)
Supervisors: Georg Gottlob, Marcus Herzog
Responsible for content and methodology: Wolfgang Gatterbauer, Bernhard Krüpl
Web page from previous semester: WS 2004/05
First session: Monday, March 7, 2005
Registration: no longer possible
Contact: gatterX@Xdbai.tuwien.ac.at (Wolfgang Gatterbauer)
Content
Web Information Extraction is the process of extracting relevant information from Web pages and transforming the extracted content into a form suitable for computerized data-processing applications. An example application is to locate individual products and their attributes within online product catalogues and provide these data in a structured format such as XML or a relational database. To solve this complex task, the research community is currently working on combining and advancing techniques from several disciplines such as artificial intelligence, information retrieval and information extraction. The aim of this course is to introduce students to current research through individual reading, analysis and comparison of selected publications. Students will present their insights in a seminar class and will engage in a collective discussion of the topics that have been addressed.
March 7, 2005: ShortIntroToWIE (pdf)
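The product-catalogue example mentioned above can be illustrated with a minimal sketch. It is not taken from the course material or from any of the listed papers: the HTML fragment, the class names `product`, `name` and `price`, and the helper names `extract_products` and `to_xml` are all assumptions made for illustration. The sketch is a hand-written wrapper that relies on known markup, which is exactly the manual effort that the wrapper-induction approaches in the literature list try to reduce.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical catalogue fragment; the markup and class names are assumptions.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">19.90</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []   # extracted records
        self._field = None   # attribute currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.products.append({})          # start a new record
        elif tag == "span" and cls in ("name", "price") and self.products:
            self._field = cls                 # next text node fills this field

    def handle_data(self, data):
        if self._field is not None:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None                # stop reading the field


def extract_products(html):
    """Return a list of {"name": ..., "price": ...} dicts found in html."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def to_xml(products):
    """Serialize the extracted records as a small XML document."""
    root = ET.Element("catalogue")
    for product in products:
        item = ET.SubElement(root, "product")
        for key, value in product.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_products(SAMPLE_HTML)))
```

The sketch breaks as soon as the page layout changes, which is one reason the research community looks for more robust, learned or structure-based techniques.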

Methodology
Teams of two students: The literature list groups the publications into pairs. Every two students form one team. Each team member chooses one paper from the same pair in the literature list.
Three reports and three presentations per team: Each student prepares one report and one presentation on his or her paper. In addition, each team prepares one report and one presentation comparing the two papers. The three reports together should be no longer than eight pages. The three consecutive presentations should take a maximum of 45 minutes, leaving 15 minutes for discussion.
Focus on insights, not wording: In grading, we do not focus on wording, style or length, but rather on the quality and insightfulness of the interpretations and conclusions and on the students' challenging remarks. Reports and presentations are expected to show the students' capacity to reflect on the papers and put the concepts into context with the overall topic. Simple summaries are not accepted.
Intermediate hand-in: Before handing in the reports and presentations, students will introduce each other to their respective papers in a very short presentation with one A4 slide (max. 30 sec). The slide should contain the main proposition of the respective paper in the form of a legible hand drawing in landscape format. This preparatory meeting enables the students to relate their papers to the other publications before they begin to analyse them in considerable depth.
Final hand-in: After the last presentation, all teams will prepare one slide in which they try to relate all the papers covered in class to each other. The choice of the representation scheme is entirely up to the students.
Active participation: An important aspect of this course is the active participation of the students in class. Students are expected to pose questions to the presenters and to participate in discussions, recognizing the similarities and differences between the techniques presented.
English: Reports and presentations are to be written and held in English. The level of English, however, is not a focus of grading.
Feedback: After each presentation the audience is encouraged to give feedback to the presenters by using a simple feedback form (FeedbackForm, pdf). The staff will not see this feedback. It is therefore not used for grading but serves merely as a diverse collection of impressions and suggestions for possible improvement from the students themselves.
March 7, 2005: ShortIntroToAdmin (pdf)

Grading
• Report: 30%
• Presentation and subsequent discussion: 40%
• Active participation in proseminar (incl. intermediate and final hand-ins): 30%

Prerequisites
• Genuine interest in a booming research area
• Willingness to work both independently and in a team of two
• Willingness to actively participate in group discussions
• Reasonable grasp of English

Related courses
• Web Datenextraktion und -integration (181.130 VU), Robert Baumgartner, WS 2.0h
• Semistrukturierte Daten 2 (181.139 VU), Georg Lausen, Reinhard Pichler and

Students
Please verify your data and update us with changes or missing information (ListOfParticipants).

Schedule

Literature list
Due to possible copyright issues, the directory with the papers is password protected. All publications are in English. Every two consecutively numbered papers (odd and even) belong to one topic and are analysed together by one pair of students.

Lecture notes and hand-ins of students (in pdf)
March 7, 2005: ShortIntroToWIE
March 7, 2005: ShortIntroToAdmin
April 12, 2005: IntermediateHand-Ins
April 27, 2005: FinalHand-Ins

#3 Stefan Schoenig: Ion Muslea, Steve Minton, Craig Knoblock. A Hierarchical Approach to Wrapper Induction. Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, WA. (Paper, Report, Presentation)
#4 Marco Schoenig: Ling Liu, Calton Pu, Wei Han. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. International Conference on Data Engineering (ICDE), pages 611-621, 2000. (Paper, Report, Presentation)
Critical comparison #3/#4: (Report, Presentation)
#5 René Kiesler: D. W. Embley, Y. Jiang, Y.-K. Ng. Record-Boundary Discovery in Web Documents. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Philadelphia, Pennsylvania, pages 467-478. (Paper, Report, Presentation)
#6 Christoph Veigl: Kristina Lerman, Lise Getoor, Steven Minton, Craig Knoblock. Using the Structure of Web Sites for Automatic Segmentation of Tables. Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France. (Paper, Report, Presentation)
Critical comparison #5/#6: (Report, Presentation)
#7 Marian Schedenig: Kristina Lerman, Craig Knoblock, Steven Minton. Automatic Data Extraction from Lists and Tables in Web Sources. In Proceedings of the workshop on Advances in Text Extraction and Mining, IJCAI-2001. (Paper, Report, Presentation)
#8
Critical comparison #7/#8: (Report, Presentation)
#9 Sunil Pilani: William W. Cohen, Matthew Hurst, Lee S. Jensen. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In The Eleventh International World Wide Web Conference WWW-2002, 2002. (Paper, Report, Presentation)
#10 Friedrich Dimmel: Tao Fu, Mengchi Liu. A Gateway From HTML to XML. IDEAS 2004: 205-21. (Paper, Report, Presentation)
Critical comparison #9/#10: (Report, Presentation)
#13 Gregor Pridun: Yalin Wang, Jianying Hu. A Machine Learning Based Approach for Table Detection on The Web. Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, 2002. (Paper, Report, Presentation)
#14 Max Arends: Jiying Wang, Fred H. Lochovsky. Data-rich Section Extraction from HTML pages. Proceedings of the 3rd International Conference on Web Information Systems Engineering, pages 313-322, 2002. (Paper, Report, Presentation)
Critical comparison #13/#14: (Report, Presentation)
#15 Stefan Bischof: Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, Nigel R. Shadbolt. Automatic Ontology-Based Extraction from Web Documents. IEEE Intelligent Systems, Volume 18, Issue 1, January 2003. (Paper, Report, Presentation)
#16 Stefan Rümmele: David W. Embley, Cui Tao, Stephen W. Liddle. Automating the Extraction of Data from HTML Tables with Unknown Structure. Submitted, May 2003 (source: http://www.deg.byu.edu/papers/). (Paper, Report, Presentation)
Critical comparison #15/#16: (Report, Presentation)
#19 Leopold Redlingshofer: Jiying Wang, Frederick H. Lochovsky. Data Extraction and Label Assignment for Web Databases. Proceedings of the twelfth international conference on World Wide Web, Budapest, Hungary, 2003. (Paper, Report, Presentation)
#20 Edvin Seferovic: Hui Song, Suraj Giri, Fanyuan Ma. Data Extraction and Annotation for Dynamic Web Pages. 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04), March 28-31, 2004, Taipei, Taiwan. (Paper, Report, Presentation)
Critical comparison #19/#20: (Report, Presentation)
#21 Jeremy Solarz: Chia-Hui Chang, Shao-Chen Lui, Yen-Chin Wu. Applying Pattern Mining to Web Information Extraction. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001. (Paper, Report, Presentation)
#22 Tobias Dönz: Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, California. (Paper, Report, Presentation)
Critical comparison #21/#22: (Report, Presentation)
R1 Martin Zeilinger: Newswrapper for Brokers (Report, Presentation)
R2 Tamir Hassan: Intelligent text extraction from PDF documents with Lixto (Report, Presentation)
R3 Rainer Dobiasch: Schema Matching (Report, Presentation)

Resources
• "10 beliebte Fehler bei Vorträgen" ("10 common mistakes in presentations", pdf), Ludwig-Maximilians-Universität München
• ACM Portal: Scientific publications of the Association for Computing Machinery
• CiteSeer: Relations between publications
• DBLP Library (Uni Trier): Relations between authors
• Google Scholar: Google for scientific publications
• Scopus: "The world's largest abstract database of scientific literature"