Introduction

Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, extracting tables from web pages is difficult, because HTML was designed to convey visual rather than semantic information. With our approach, called VENTex (Visualized Element Nodes Table EXtraction), we propose a robust technique for extracting tables from arbitrary web pages. The technique relies on the positional information of DOM element nodes as visualized in a browser and thereby separates the intricacies of the code implementation from the intended visual appearance.

Take, for example, this "<div>-Table" and this "<table>-Table", which are rendered exactly alike (at least in the CSS-compliant browser Firefox). This example demonstrates that rendering is non-injective: similar visual data structures may originate from different source code. Hence, similarity at the visual level does not imply similarity at the code level, and semantic interpretation has less chance of success when it ignores the visual layer (see graphics below).
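To make the point concrete, here is a minimal sketch of how positional information alone can recover the same grid from differently coded tables. It is an illustration under stated assumptions, not the VENTex implementation: the Box type, the grid_from_boxes function, the 2-pixel tolerance and the pixel coordinates are all hypothetical, and a real system would take the coordinates from the browser's rendering.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """A rendered element (hypothetical type; coordinates are made up)."""
    tag: str    # source element name, deliberately ignored by the grouping
    text: str   # visible text of the element
    x: float    # left edge in pixels
    y: float    # top edge in pixels


def grid_from_boxes(boxes: List[Box], tolerance: float = 2.0) -> List[List[str]]:
    """Group boxes into rows by top edge, then order each row by left edge."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        # Join the current row if the top edge matches within the tolerance,
        # otherwise start a new row.
        if rows and abs(rows[-1][0].y - box.y) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


# Assumed rendered positions for a <table>-based and a <div>-based layout.
table_boxes = [
    Box("td", "Name", 10, 10), Box("td", "Price", 110, 10),
    Box("td", "Apple", 10, 30), Box("td", "1.20", 110, 30),
]
div_boxes = [
    Box("div", "Name", 10, 10), Box("div", "Price", 110, 10),
    Box("div", "Apple", 10, 31), Box("div", "1.20", 110, 30),
]

# Both layouts yield the same grid, although their markup differs.
same_grid = grid_from_boxes(table_boxes) == grid_from_boxes(div_boxes)
```

Because the grouping never inspects the tag field, the two inputs are indistinguishable to it, which is the sense in which working at the visual level sidesteps differences in the source code.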

We encourage you to test our system and send us brief feedback on your impressions. Please note that we are still tweaking our algorithms, so things might change over time.


The Syntactic, the Semantic and the Visual Web