DocumentCode
133976
Title
Automated specification extraction for consolidated product catalogue
Author
Hareendran, Stuthi ; Parashar, Anuvrat ; Khan, Farhat Ullah
Author_Institution
Dept. of Comput. Sci. & Eng., Amity Univ., Noida, India
fYear
2014
fDate
1-2 March 2014
Firstpage
1
Lastpage
7
Abstract
This paper develops and implements a methodology for extracting product specifications from HTML product-detail pages on various e-commerce portals. The extracted data must be in a standardised, uniform format that does not reflect the structure of its source. The most significant problem in designing a solution is the source of the data itself: because the data is fetched from many different portals, format and structure vary from portal to portal. The paper considers two subproblems, data available in structured format and data available in unstructured format. For structured data, the methodology uses the information pattern contained in the underlying tree structure of the page's HTML content to perform extraction. For unstructured data, it uses pattern matching with regular expressions. The implementation is in Python and uses tools such as Scrapy and LXML.
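The two cases named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a two-cell table row layout for the structured case and "Label: value" text for the unstructured case, and it substitutes the standard-library `html.parser` for LXML so that it runs without dependencies. All names here (`SpecTableParser`, `extract_structured`, `extract_unstructured`, `PAIR_RE`) are hypothetical.

```python
import re
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect (label, value) pairs from table rows that have exactly two cells.

    Assumption: specifications sit in <tr> rows of the form
    <td>label</td><td>value</td>. Real portals differ.
    """

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._cells = []       # text of each cell in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and len(self._cells) == 2:
            label, value = (cell.strip() for cell in self._cells)
            if label and value:
                self.specs[label.lower()] = value


def extract_structured(html):
    """Structured case: walk the HTML tree and read label/value rows."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    return parser.specs


# Hypothetical pattern: "Label: value" pairs separated by ';' or newlines.
PAIR_RE = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*:\s*([^;\n]+)")


def extract_unstructured(text):
    """Unstructured case: pattern matching with a regular expression."""
    return {
        label.strip().lower(): value.strip()
        for label, value in PAIR_RE.findall(text)
    }
```

Both functions return the same kind of dictionary (lower-cased label mapped to value), which is one simple way to keep the output format independent of the source layout.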
Keywords
electronic commerce; hypermedia markup languages; pattern matching; portals; text analysis; tree data structures; HTML pages; LXML; Python; Scrapy; automated specification extraction; consolidated product catalogue; data source; data structure; e-commerce portals; information pattern; page HTML content; pattern matching; product details; product specification extraction; programming language; regular expressions; source format; tree structure; Data mining; Dictionaries; HTML; Pattern matching; Pediatrics; Portals; LXML; e-commerce; information extraction; specification extraction;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on
Conference_Location
Bhopal
Print_ISBN
978-1-4799-2525-4
Type
conf
DOI
10.1109/SCEECS.2014.6804527
Filename
6804527