Title :
Practical token retrieval and indexing from binary data: An application in computer aided design
Author :
Gruber, M. ; Geschray, R. ; Hillbrand, C.
Author_Institution :
V-Res. GmbH, Dornbirn, Austria
Abstract :
In many commercial applications, proprietary file formats make it difficult to access the generated data. In the worst case, interoperability is impeded even further by shortcomings in interface technology. The objective of this work is to determine whether textual data can be retrieved from certain binary files at a quality sufficient to build a useful index. We propose a method to parse and filter binary data in multiple stages. In addition to stop-words, we use whitelists as well as phonetic and phonotactic criteria to create token data while minimizing noise. The results are promising: with a few simple steps, we are able to filter out most of the invalid tokens while preserving abbreviations and terms such as company names, even though they are not in a dictionary.
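The multi-stage filtering described in the abstract might look roughly like the following Python sketch. It is an illustration only, not the authors' implementation: the stop-word list, the whitelist, the minimum token length, and the consonant-run threshold are all hypothetical values chosen for the example, and the phonotactic test is a deliberately crude stand-in for whatever criteria the paper actually uses.

```python
import re

# Hypothetical lists; the abstract does not give the actual contents.
STOP_WORDS = {"the", "and", "for"}
WHITELIST = {"CAD", "ISO", "GmbH"}  # abbreviations and names kept unconditionally

VOWELS = set("aeiouy")


def extract_runs(data: bytes, min_len: int = 3) -> list[str]:
    """Stage 1: pull runs of ASCII letters out of raw bytes."""
    pattern = re.compile(b"[A-Za-z]{" + str(min_len).encode("ascii") + b",}")
    return [run.decode("ascii") for run in pattern.findall(data)]


def looks_pronounceable(token: str, max_consonant_run: int = 4) -> bool:
    """Stage 3 (assumed criterion): require a vowel and no long consonant cluster."""
    lower = token.lower()
    if not any(ch in VOWELS for ch in lower):
        return False
    run = 0
    for ch in lower:
        run = 0 if ch in VOWELS else run + 1
        if run > max_consonant_run:
            return False
    return True


def filter_tokens(data: bytes) -> list[str]:
    """Run all stages: extraction, whitelist, stop-words, phonotactic check."""
    kept = []
    for token in extract_runs(data):
        if token in WHITELIST:
            kept.append(token)  # whitelisted terms bypass every other filter
        elif token.lower() in STOP_WORDS:
            continue  # stage 2: drop stop-words
        elif looks_pronounceable(token):
            kept.append(token)
    return kept


if __name__ == "__main__":
    sample = b"\x00\x01Drawing\x00xqzt\x7fthe\x02GmbH\xffflange\x00kjhgfd"
    print(filter_tokens(sample))
```

Checking the whitelist first is what lets an abbreviation such as "GmbH" survive even though it would fail the vowel-based check, which mirrors the abstract's point about preserving terms that are not in a dictionary.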
Keywords :
CAD; indexing; information retrieval; binary data; binary files; computer aided design; interface technology; phonetic; phonotactic criteria; proprietary file formats; textual data retrieval; token retrieval; whitelists; worst case interoperability; Design automation; Encoding; Filtering algorithms; ISO standards; Indexes; Law; Software
Conference_Title :
Logistics and Industrial Informatics (LINDI), 2011 3rd IEEE International Symposium on
Conference_Location :
Budapest
Print_ISBN :
978-1-4577-1842-7
DOI :
10.1109/LINDI.2011.6031132