Abstract:
As illustrated by the World Wide Web, the volume of
information in languages other than English has grown
significantly in recent years. This highlights the importance
of multilingual corpora. Much effort has been
devoted to the compilation of multilingual corpora for
the purpose of cross-lingual information retrieval and
machine translation. Existing parallel corpora mostly
involve European languages, such as English–French
and English–Spanish. There is still a lack of parallel
corpora between European languages and Asian
languages. In the authors’ previous work, an alignment
method that identifies one-to-one Chinese and English title
pairs was developed to construct an English–Chinese
parallel corpus automatically from the World Wide Web;
it achieved 100% precision and 87% recall. Careful analysis
of these results has helped
the authors to understand how the alignment method
can be improved. A conceptual analysis was conducted,
covering conceptual equivalence and conceptual
information alternation in the aligned
and nonaligned English–Chinese title pairs
produced by the alignment method. The result of the
analysis not only reflects the characteristics of parallel
corpora, but also gives insight into the strengths and
weaknesses of the alignment method. In particular, conceptual
alternation, such as omission and addition, is
found to have a significant impact on the performance of
the alignment method.