Data Extraction using Content-Based Handles

Pouramini, A.; Khaje Hassani, S.; Nasiri, Sh.

doi:https://dx.doi.org/10.22044/jadm.2017.990

Author(s)

Pouramini, A.; Khaje Hassani, S.; Nasiri, Sh.

Download document

FullText

File size:

1.158 MB

File type (MIME):

PDF

Document type

Text
Research/Original/Regular Article

Document language

English

Abstract

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.
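The idea of handle-driven extraction can be illustrated with a minimal Python sketch. It is not the HWrap algorithm itself, only one plausible reading of the abstract under stated assumptions: the page is well-formed XHTML (so the standard library XML parser can read it), each handle is a regular expression whose first group captures a field value, and the names `HANDLES`, `find_records`, `extract_fields`, and `to_xml` are hypothetical. The bottom-up step climbs from a text match to the nearest ancestor that contains every handle; the top-down step then scans that region for field values and writes them to XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical handles: field name -> regex whose group 1 captures the value.
HANDLES = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"Price:\s*\$?([\d.]+)"),
}


def text_of(elem):
    """Visible text of an element and its descendants, whitespace-normalized."""
    return " ".join("".join(elem.itertext()).split())


def find_records(root, handles):
    """Bottom-up step: start at each element whose own text matches the first
    handle and climb to the nearest ancestor whose text matches every handle."""
    parent = {child: p for p in root.iter() for child in p}
    seed = next(iter(handles.values()))
    records = []
    for elem in root.iter():
        if not (elem.text and seed.search(elem.text)):
            continue
        node = elem
        while node is not None and not all(
            h.search(text_of(node)) for h in handles.values()
        ):
            node = parent.get(node)  # None once we climb past the root
        if node is not None and node not in records:
            records.append(node)
    return records


def extract_fields(record, handles):
    """Top-down step: walk the record region and capture each field once."""
    fields = {}
    for elem in record.iter():
        own_text = (elem.text or "").strip()
        for name, pattern in handles.items():
            match = pattern.search(own_text)
            if match and name not in fields:
                fields[name] = match.group(1).strip()
    return fields


def to_xml(page_source, handles=HANDLES):
    """Map every extracted record onto a simple hierarchical XML structure."""
    root = ET.fromstring(page_source)
    out = ET.Element("records")
    for record in find_records(root, handles):
        rec_el = ET.SubElement(out, "record")
        for name, value in extract_fields(record, handles).items():
            ET.SubElement(rec_el, name).text = value
    return ET.tostring(out, encoding="unicode")


# Two made-up pages with the same content but different markup.
LIST_PAGE = (
    '<html><body><div class="x"><ul>'
    "<li><b>Title: Dune</b><span>Price: $9.99</span></li>"
    "<li><b>Title: Emma</b><span>Price: $4.50</span></li>"
    "</ul></div></body></html>"
)
TABLE_PAGE = (
    "<html><body><table>"
    "<tr><td>Title: Dune</td><td>Price: $9.99</td></tr>"
    "</table></body></html>"
)

if __name__ == "__main__":
    print(to_xml(LIST_PAGE))
    print(to_xml(TABLE_PAGE))
```

Because the handles refer only to visible text, the same `HANDLES` dictionary works on both the list layout and the table layout, which is the kind of independence from HTML structure the abstract describes. Real pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser in place of `ET.fromstring`.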

Keywords

Web Data Record Extraction
Web Wrapper Generation
Web Information Extraction
Document and Text Processing

Issue number

Publication date

2018-07-01
1397-04-10

Publisher

Shahrood University of Technology

Author affiliation

Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.

ISSN

2322-5211
2322-4444

URI

https://dx.doi.org/10.22044/jadm.2017.990
http://jad.shahroodut.ac.ir/article_990.html
https://iranjournals.nlai.ir/handle/123456789/294898