Data Extraction using Content-Based Handles

Pouramini, A.; Khaje Hassani, S.; Nasiri, Sh.

doi:https://dx.doi.org/10.22044/jadm.2017.990

dc.contributor.author	Pouramini, A.	en_US
dc.contributor.author	Khaje Hassani, S.	en_US
dc.contributor.author	Nasiri, Sh.	en_US
dc.date.accessioned	1399-07-09T06:04:19Z	fa_IR
dc.date.accessioned	2020-09-30T06:04:19Z
dc.date.available	1399-07-09T06:04:19Z	fa_IR
dc.date.available	2020-09-30T06:04:19Z
dc.date.issued	2018-07-01	en_US
dc.date.issued	1397-04-10	fa_IR
dc.date.submitted	2016-01-17	en_US
dc.date.submitted	1394-10-27	fa_IR
dc.identifier.citation	Pouramini, A., Khaje Hassani, S., Nasiri, Sh. (2018). Data Extraction using Content-Based Handles. Journal of AI and Data Mining, 6(2), 399-407. doi: 10.22044/jadm.2017.990	en_US
dc.identifier.issn	2322-5211
dc.identifier.issn	2322-4444
dc.identifier.uri	https://dx.doi.org/10.22044/jadm.2017.990
dc.identifier.uri	http://jad.shahroodut.ac.ir/article_990.html
dc.identifier.uri	https://iranjournals.nlai.ir/handle/123456789/294898
dc.description.abstract	In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. The extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial-time algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.	en_US
dc.format.extent	1186
dc.format.mimetype	application/pdf
dc.language	English
dc.language.iso	en_US
dc.publisher	Shahrood University of Technology	en_US
dc.relation.ispartof	Journal of AI and Data Mining	en_US
dc.relation.isversionof	https://dx.doi.org/10.22044/jadm.2017.990
dc.subject	Web Data Record Extraction	en_US
dc.subject	Web Wrapper Generation	en_US
dc.subject	Web Information Extraction	en_US
dc.subject	Document and Text Processing	en_US
dc.title	Data Extraction using Content-Based Handles	en_US
dc.type	Text	en_US
dc.type	Research/Original/Regular Article	en_US
dc.contributor.department	Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.	en_US
dc.contributor.department	Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.	en_US
dc.contributor.department	Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.	en_US
dc.citation.volume	6
dc.citation.issue	2
dc.citation.spage	399
dc.citation.epage	407
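
The abstract describes matching text "handles" against page elements in a mixed bottom-up and top-down traversal and mapping the result onto XML. The sketch below is only an illustration of that general idea, not the HWrap algorithm itself: the page fragment, the handle patterns, the seed keyword, and all function names are assumptions made up for this example, and it uses Python's standard xml.etree.ElementTree on a well-formed fragment rather than a real HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (invented for illustration).
PAGE = """
<html><body>
  <div id="nav">Home | Books</div>
  <table>
    <tr><td><b>Title:</b> Dune</td><td>Price: $9.99</td></tr>
    <tr><td><b>Title:</b> Emma</td><td>Price: $4.50</td></tr>
  </table>
</body></html>
"""

# Assumed handles: a keyword or delimiter followed by a pattern for the value.
HANDLES = {
    "title": re.compile(r"Title:\s*(.+?)\s*(?=Price:|$)"),
    "price": re.compile(r"Price:\s*\$(\d+\.\d{2})"),
}
SEED = "Title:"  # assumed keyword used to locate a first candidate element


def visible_text(el):
    """Join the text a reader would see inside el, ignoring tag structure."""
    return " ".join(t.strip() for t in el.itertext() if t.strip())


def matches_all(el):
    """True if every handle pattern occurs in the element's visible text."""
    text = visible_text(el)
    return all(p.search(text) for p in HANDLES.values())


def find_region(root):
    """Bottom-up step: climb from a seed element to the smallest ancestor
    that satisfies all handles (one record); its parent is the data region."""
    parent = {child: p for p in root.iter() for child in p}
    for el in root.iter():
        if SEED in (el.text or ""):
            node = el
            while node is not None and not matches_all(node):
                node = parent.get(node)
            if node is not None:
                return parent.get(node)
    return None


def extract_records(page):
    """Top-down step: scan the region's children and map matches onto XML."""
    root = ET.fromstring(page)
    out = ET.Element("records")
    region = find_region(root)
    if region is None:
        return out
    for child in region:
        if matches_all(child):
            text = visible_text(child)
            rec = ET.SubElement(out, "record")
            for name, pattern in HANDLES.items():
                ET.SubElement(rec, name).text = pattern.search(text).group(1)
    return out


if __name__ == "__main__":
    print(ET.tostring(extract_records(PAGE), encoding="unicode"))
```

Because the patterns are checked against visible text only, the same handles would still match if the rows were rendered as div or li elements instead of table rows, which is the kind of independence from HTML structure the abstract refers to.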

Files in this item

Name: 2A827A432D057FE0742269A52DAC85 ...
Size: 1.158 MB
Format: PDF
Description: FullText

This item appears in the following collections:

Volume 6, Issue 2
