• DocumentCode
    133976
  • Title
    Automated specification extraction for consolidated product catalogue
  • Author
    Hareendran, Stuthi ; Parashar, Anuvrat ; Khan, Farhat Ullah

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Amity Univ., Noida, India
  • fYear
    2014
  • fDate
    1-2 March 2014
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    This paper develops and implements a methodology for extracting product specifications from HTML product-detail pages on various e-commerce portals. The extracted data must be in a standardised, uniform format that does not reflect the structure of its source. The most significant problem in designing a solution is the source of the data itself: because the data is fetched from many different portals, its format and structure vary from portal to portal. The paper considers two subproblems: data available in structured format and data available in unstructured format. For structured data, the methodology performs extraction using the information pattern contained in the underlying tree structure of the page's HTML content. For unstructured data, it uses pattern matching with regular expressions. The implementation uses Python, with tools such as Scrapy and LXML.
  • Keywords
    electronic commerce; hypermedia markup languages; pattern matching; portals; text analysis; tree data structures; HTML pages; LXML; Python; Scrapy; automated specification extraction; consolidated product catalogue; data source; data structure; e-commerce portals; information pattern; page HTML content; pattern matching; product details; product specification extraction; programming language; regular expressions; source format; tree structure; Data mining; Dictionaries; HTML; Pattern matching; Pediatrics; Portals; LXML; e-commerce; information extraction; specification extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on
  • Conference_Location
    Bhopal
  • Print_ISBN
    978-1-4799-2525-4
  • Type
    conf
  • DOI
    10.1109/SCEECS.2014.6804527
  • Filename
    6804527
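
The two extraction modes named in the abstract (tree-structure patterns for structured data, regular expressions for unstructured data) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no code, so every name below (`SpecTableParser`, `extract_structured`, `extract_unstructured`, `SPEC_PATTERN`, `UNIT_TO_FIELD`) is hypothetical, the two-cell table layout and the unit list are assumptions, and the standard-library `html.parser` stands in for LXML so that the sketch has no third-party dependencies.

```python
import re
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect two-cell table rows as key/value pairs (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Only rows shaped like "name | value" are treated as specifications.
            if len(self._row) == 2:
                key, value = self._row
                self.specs[key.lower()] = value
            self._row = None
            self._cell = None


def extract_structured(html):
    """Structured case: read specifications from the HTML table structure."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    return parser.specs


# Hypothetical unit list; a real catalogue would need a much larger vocabulary.
SPEC_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>GB|MP|mAh|inch)\b", re.IGNORECASE
)
UNIT_TO_FIELD = {"gb": "storage", "mp": "camera", "mah": "battery", "inch": "display"}


def extract_unstructured(text):
    """Unstructured case: match value/unit pairs in free text with a regex."""
    specs = {}
    for match in SPEC_PATTERN.finditer(text):
        field = UNIT_TO_FIELD[match.group("unit").lower()]
        # Keep the first value seen for each field, in a uniform "value unit" form.
        specs.setdefault(field, f"{match.group('value')} {match.group('unit')}")
    return specs
```

Both functions return a plain dictionary, so output from either mode lands in the same uniform format regardless of how the source portal laid out the page.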