Content extraction: Identifying the main content in HTML documents

Gottron, Thomas

doi:http://doi.org/10.25358/openscience-3244

Files

1859.pdf (12.17 MB)

Date issued

2009

Reuse License

Description of rights: InC-1.0

Item

Dissertation

Open Access

Abstract

Apart from the article that forms the main content, most HTML documents on the WWW contain additional content such as navigation menus, design elements or commercial banners. Several applications need to distinguish between main and additional content automatically. Content extraction and template detection are the two approaches to solving this task. This thesis gives an extensive overview of existing algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures make it possible to draw the first objective comparison of existing extraction solutions. The newly introduced content code blurring algorithm overcomes several drawbacks of previous approaches and proves to be the best content extraction algorithm at the time of writing. An analysis of methods to cluster web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create sets of training documents for template detection algorithms. As the whole process can be automated, it makes it possible to perform template detection on a single document, thereby combining the advantages of single-document and multi-document algorithms.

URI

https://openscience.ub.uni-mainz.de/handle/20.500.12030/3246

Collections

JGU-Hochschulschriften
