Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures
Toru Hisamitsu
Abstract
One year’s worth of Japanese newspaper articles contains about 300,000 ‘parenthetical expressions (PEs)’, pairs of character strings A and B related to each other by parentheses as in A(B). These expressions contain a large number of important terms, such as organization names, company names, and their abbreviations, and are easily extracted by pattern matching.
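The pattern-matching step can be illustrated with a minimal sketch. The regular expression and the function name `extract_pes` below are assumptions made for illustration, not the chapter's actual rules. In particular, Japanese text has no spaces between words, so this naive pattern will often capture too long a string for A; deciding where A really begins is the kind of problem that needs more than pattern matching.

```python
import re

# Hypothetical pattern for A(B): A is a run of characters containing no
# whitespace, punctuation, or parentheses; B is whatever sits inside one
# pair of half-width or full-width parentheses.
PE_PATTERN = re.compile(r"([^\s、。,.()()]+)[((]([^()()]+)[))]")


def extract_pes(text):
    """Return a list of (A, B) candidate pairs found in text."""
    return [(m.group(1), m.group(2)) for m in PE_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Prints [('国際連合', '国連')]
    print(extract_pes("国際連合(国連)は、"))
```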
We have developed a simple and accurate method for collecting unregistered terms from PEs. The method identifies two types of PEs by using pattern matching, bigram statistics, and entropy, and it collected about 17,000 terms with over 97% precision.
Bigram statistics, combined with a small number of rules, identified ‘pairs of exchangeable terms’ (PETs) in PEs, which mostly contained important terms and their abbreviations. Entropy served to highlight inner PE terms (for example, a term meaning ‘company personnel affairs’) that were useful clues for acquiring proper nouns such as company names, organization names, and person names.
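One way to read the role of entropy here is sketched below. It assumes that an inner term B is scored by the entropy of the distribution of outer strings A it appears with, so that an inner term attached to many different outer strings gets a high score. This reading, and the function name `inner_term_entropy`, are assumptions; the abstract does not state the exact definition used in the chapter.

```python
import math
from collections import Counter, defaultdict


def inner_term_entropy(pairs):
    """Map each inner term B to the entropy (in bits) of its outer strings A.

    pairs is an iterable of (A, B) tuples taken from expressions A(B).
    """
    outer_counts = defaultdict(Counter)
    for outer, inner in pairs:
        outer_counts[inner][outer] += 1

    entropy = {}
    for inner, counts in outer_counts.items():
        total = sum(counts.values())
        # Shannon entropy of the empirical distribution of A given B.
        entropy[inner] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy
```

Under this assumption, an inner term seen with four different outer strings, once each, scores 2 bits, while an inner term that always follows the same outer string scores 0.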
Identification of PETs provided the opportunity to evaluate the usefulness of various bigram co-occurrence statistics. Seven statistical measures (frequency, Mutual Information, the χ²-test, the χ²-test with Yates’ correction, the log-likelihood ratio, the Dice coefficient, and the modified Dice coefficient) were compared.
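For orientation, the seven measures can be computed from a 2×2 contingency table of co-occurrence counts, as in the sketch below. The formulas are the standard textbook ones and may differ in detail from those used in the chapter. The "modified Dice" line is an assumption: it uses one weighting found in the literature (Dice multiplied by log2 of the joint frequency), which the abstract does not confirm. The function name and parameters are hypothetical.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a pair (x, y) seen f_xy times together, given the marginal
    counts f_x and f_y and the total number of pairs n.

    Assumes f_xy > 0 and that no row or column of the table sums to zero.
    """
    if f_xy <= 0:
        raise ValueError("f_xy must be positive")

    # 2x2 contingency table: a = x with y, b = x without y, and so on.
    a, b = f_xy, f_x - f_xy
    c, d = f_y - f_xy, n - f_x - f_y + f_xy
    rows, cols = (a + b, c + d), (a + c, b + d)
    denom = rows[0] * rows[1] * cols[0] * cols[1]
    diff = abs(a * d - b * c)

    # Log-likelihood ratio: 2 * sum(O * ln(O / E)) over non-empty cells.
    llr = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            if observed > 0:
                llr += 2 * observed * math.log(observed / expected)

    dice = 2 * a / (f_x + f_y)
    return {
        "frequency": a,
        "mutual_information": math.log2(a * n / (f_x * f_y)),
        "chi2": n * diff ** 2 / denom,
        "chi2_yates": n * max(diff - n / 2, 0) ** 2 / denom,
        "log_likelihood": llr,
        "dice": dice,
        # Assumed weighting; not confirmed by the abstract.
        "modified_dice": math.log2(a) * dice,
    }
```

With `f_xy=10, f_x=20, f_y=20, n=100`, the χ² score is 14.0625 and the Dice coefficient is 0.5; with `f_xy=4` and the same margins, the pair occurs exactly as often as independence predicts, so Mutual Information, χ², and the log-likelihood ratio are all zero.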
Chapters in this book
- Prelim pages i
- Table of contents vi
- Introduction viii
- A graph-based approach to the automatic generation of multilingual keyword clusters 1
- The automatic construction of faceted terminological feedback for interactive document retrieval 29
- Automatic term detection 53
- Incremental extraction of domain-specific terms from online text resources 89
- Knowledge-based terminology management in medicine 111
- Searching for and identifying conceptual relationships via a corpus-based approach to a Terminological Knowledge Base (CTKB) 127
- Qualitative terminology extraction 149
- General considerations on bilingual terminology extraction 167
- Detection of synonymy links between terms 185
- Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures 209
- Software tools to support the construction of bilingual terminology lexicons 225
- Determining semantic equivalence of terms in information retrieval 245
- Term extraction using a similarity-based approach 261
- Extracting knowledge-rich contexts for terminography 279
- Experimental evaluation of ranking and selection methods in term extraction 303
- Corpus-based extension of a terminological semantic lexicon 327
- Term extraction for automatic abstracting 353
- About the contributors 371
- Subject Index 377