  • DocumentCode
    2690210
  • Title
    A machine-learning approach to discovering company home pages
  • Author
    Gryc, Wojciech ; Melville, Prem ; Lawrence, Richard D.
  • Author_Institution
    Oxford Internet Inst., Univ. of Oxford, Oxford, UK
  • fYear
    2010
  • fDate
    13-16 April 2010
  • Firstpage
    361
  • Lastpage
    366
  • Abstract
    For many marketing and business applications, it is necessary to know the home page of a company specified only by its company name. If we require the home page for a small number of big companies, this task is readily accomplished via use of Internet search engines or access to domain registration lists. However, if the entities of interest are small companies, these approaches can lead to mismatches, particularly if a specified company lacks a home page. We address this problem using a supervised machine-learning approach in which we train a binary classification model. We classify potential website matches for each company name based on a set of explanatory features extracted from the content on each candidate website. Our approach is related to web-based business intelligence in two ways: (1) we build the training set for our learning algorithms through crowdsourcing tools and illustrate their potential for business research, and (2) the success of our model allows one to easily use corporate home pages as data inputs into other research projects. Through the successful use of crowdsourcing, our approach is able to identify a correct home page or recognize that a valid home page does not exist with an accuracy that is 57% better than simply taking the highest ranked search engine result as the correct match.
  • Keywords
    Web sites; business data processing; feature extraction; learning (artificial intelligence); pattern classification; search engines; Internet search engines; Web site; Web-based business intelligence; binary classification model; business application; company home pages discovering; crowdsourcing tools; explanatory features extraction; learning algorithm; marketing application; supervised machine-learning approach; Biological system modeling; Companies; Feature extraction; Logistics; Search engines; Training; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Ecosystems and Technologies (DEST), 2010 4th IEEE International Conference on
  • Conference_Location
    Dubai
  • ISSN
    2150-4938
  • Print_ISBN
    978-1-4244-5551-5
  • Type
    conf
  • DOI
    10.1109/DEST.2010.5610621
  • Filename
    5610621