Abstract:
Ironically, although much work has been done on developing
algorithms that enable scientists to efficiently
retrieve relevant information from the glut of data
generated by the Human Genome Project
and similar projects, little has been done on
optimizing the levels of data economy across databases.
One technique for characterizing the degree of data economization
is Boisot’s Information Space (I-Space), which takes
into account the degree to which
data are written (codification), the degree to which the
data can be understood (abstraction), and the degree to
which the data are effectively communicated to an audience
(diffusion). A data system is said to be more data
economical if it scores relatively high on these dimensions.
Applying the approach to entries in two popular,
publicly available biological data repositories, the Protein
Data Bank (PDB) and GenBank, leads to the recommendation
that PDB increase its level of abstraction
by establishing a larger set of detailed keywords, its
diffusion by constructing hyperlinks to other databases,
and its codification by constructing additional
subsections. With these recommendations in place, PDB
would achieve the greater data economies currently
enjoyed by GenBank. The limitations of the approach
are also discussed.