Abstract
Stemming, or suffix stripping, is the problem of removing suffixes from words to obtain their root forms. Word endings can be removed with stripping rules built on morphological knowledge of a specific language, but such an approach does not carry over to a multilingual environment. Statistical approaches do work in a multilingual environment, but they require a significant amount of computation. We formulate stemming as an optimization problem for the first time in the literature and develop an integer program (IP) for it. We demonstrate the approach on clusters of English and Spanish words, and for English we compare the proposed method with an established technique in the field. An AMPL program for the proposed method is given in Appendix A.2.
Appendices
A.1 Outputs of the IP stemmer and the Porter stemmer (Snowball) for 100 randomly chosen English words
S. No. | Word | IP stem | Porter (Snowball) stem | LD (IP) | LD (Porter) |
1. | Abject | Abject | Abject | 0 | 0 |
2. | Admiring | Admir | Admir | 3 | 3 |
3. | Admonishing | Admonish | Admonish | 3 | 3 |
4. | Agreement | Agreement | Agreement | 0 | 0 |
5. | Alto | Alto | Alto | 0 | 0 |
6. | Anxiously | Anxious | Anxious | 2 | 2 |
7. | Believing | Believ | Believ | 3 | 3 |
8. | Blissfully | Bliss | Bliss | 5 | 5 |
9. | Borrowed | Borrow | Borrow | 2 | 2 |
10. | Borrowing | Borrow | Borrow | 3 | 3 |
11. | Briskly | Brisk | Brisk | 2 | 2 |
12. | Casual | Casual | Casual | 0 | 0 |
13. | Consulted | Consult | Consult | 2 | 2 |
14. | Consulting | Consult | Consult | 3 | 3 |
15. | Cutter | Cutter | Cutter | 0 | 0 |
16. | Deceived | Deceiv | Deceiv | 2 | 2 |
17. | Deceiving | Deceiv | Deceiv | 3 | 3 |
18. | Dilution | Dilut | Dilut | 3 | 3 |
19. | Employed | Employ | Employ | 2 | 2 |
20. | Employing | Employ | Employ | 3 | 3 |
21. | Enormously | Enorm | Enorm | 5 | 5 |
22. | Explained | Explain | Explain | 2 | 2 |
23. | Explaining | Explain | Explain | 3 | 3 |
24. | Finished | Finish | Finish | 2 | 2 |
25. | Finishing | Finish | Finish | 3 | 3 |
26. | Fleet | Fleet | Fleet | 0 | 0 |
27. | Frightfully | Fright | Fright | 5 | 5 |
28. | Gathered | Gather | Gather | 2 | 2 |
29. | Gathering | Gather | Gather | 3 | 3 |
30. | Haze | Haze | Haze | 0 | 0 |
31. | Hopelessly | Hopeless | Hopeless | 2 | 2 |
32. | Improved | Improv | Improv | 2 | 2 |
33. | Improving | Improv | Improv | 3 | 3 |
34. | Inwardly | Inward | Inward | 2 | 2 |
35. | Laughed | Laugh | Laugh | 2 | 2 |
36. | Laughing | Laugh | Laugh | 3 | 3 |
37. | Listened | Listen | Listen | 2 | 2 |
38. | Listening | Listen | Listen | 3 | 3 |
39. | Lyric | Lyric | Lyric | 0 | 0 |
40. | Menace | Menac | Menac | 1 | 1 |
41. | Miserable | Miser | Miser | 4 | 4 |
42. | Monthly | Month | Month | 2 | 2 |
43. | Museums | Museum | Museum | 1 | 1 |
44. | Ordaining | Ordain | Ordain | 3 | 3 |
45. | Overseas | Oversea | Oversea | 1 | 1 |
46. | Parsons | Parson | Parson | 1 | 1 |
47. | Passion | Passion | Passion | 0 | 0 |
48. | Plucked | Pluck | Pluck | 2 | 2 |
49. | Plucking | Pluck | Pluck | 3 | 3 |
50. | Preached | Preach | Preach | 2 | 2 |
51. | Preaching | Preach | Preach | 3 | 3 |
52. | Predicted | Predict | Predict | 2 | 2 |
53. | Quiet | Quiet | Quiet | 0 | 0 |
54. | Solemnly | Solemn | Solemn | 2 | 2 |
55. | Swiftly | Swift | Swift | 2 | 2 |
56. | Training | Train | Train | 3 | 3 |
57. | Unloaded | Unload | Unload | 2 | 2 |
58. | Viciously | Vicious | Vicious | 2 | 2 |
59. | Abnormally | Abnormal | Abnorm | 2 | 4 |
60. | Abused | Abuse | Abus | 1 | 2 |
61. | Bloodier | Blood | Bloodier | 3 | 0 |
62. | Carrying | Carry | Carri | 3 | 2 |
63. | Delightfully | Delightful | Delight | 2 | 5 |
64. | Despotic | Despotic | Despot | 0 | 2 |
65. | Diligently | Diligent | Dilig | 2 | 5 |
66. | Eleventh | Eleven | Eleventh | 2 | 0 |
67. | Forbidden | Forbid | Forbidden | 3 | 0 |
68. | Jubilantly | Jubilant | Jubil | 2 | 5 |
69. | Macabre | Macabre | Macabr | 0 | 1 |
70. | Midwife | Midwife | Midwif | 0 | 1 |
71. | Ninety | Nine | Nineti | 2 | 1 |
72. | Obnoxiously | Obnoxious | Obnoxi | 2 | 5 |
73. | Quizzical | Quizzical | Quizzic | 0 | 2 |
74. | Reluctantly | Reluctant | Reluct | 2 | 5 |
75. | Substitute | Substitute | Substitut | 0 | 1 |
76. | Thankfully | Thankful | Thank | 2 | 5 |
77. | Undeveloped | Undeveloped | Undevelop | 0 | 2 |
78. | Vacancy | Vacancy | Vacanc | 0 | 1 |
79. | Wonderfully | Wonderful | Wonder | 2 | 5 |
80. | Abusing | Abusing | Abus | 0 | 3 |
81. | Admired | Admi | Admir | 3 | 2 |
82. | Believed | Belie | Believ | 3 | 2 |
83. | Braid | Bra | Braid | 2 | 0 |
84. | Carried | Carrie | Carri | 1 | 2 |
85. | Decorated | Decorat | Decor | 2 | 4 |
86. | Decorating | Decorati | Decor | 2 | 5 |
87. | Devastating | Devastat | Devast | 3 | 5 |
88. | Forbidding | Forbidd | Forbid | 3 | 4 |
89. | Freehand | Freeh | Freehand | 3 | 0 |
90. | Laceration | Lacerat | Lacer | 3 | 5 |
91. | Librarian | Librar | Librarian | 3 | 0 |
92. | Mainstream | Mainstre | Mainstream | 2 | 0 |
93. | Mended | Mende | Mend | 1 | 2 |
94. | Mending | Mending | Mend | 0 | 3 |
95. | Nipped | Nipp | Nip | 2 | 3 |
96. | Nipping | Nipping | Nip | 0 | 4 |
97. | Prophet | Prop | Prophet | 3 | 0 |
98. | Recognized | Recogni | Recogn | 3 | 4 |
99. | Refill | Refi | Refill | 2 | 0 |
100. | Stallion | Stall | Stallion | 3 | 0 |
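The two LD columns above are consistent with the Levenshtein edit distance between each word and its stem. That reading is an assumption based on the column names and the Levenshtein (1966) entry in the references, since the appendix itself does not define them. Under that assumption, a minimal Python sketch of the distance reproduces the tabulated values for the sample rows checked below; the function name `levenshtein` is illustrative and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Sample rows from the table: (word, IP stem, Porter stem)
rows = [
    ("Admiring", "Admir", "Admir"),
    ("Ninety", "Nine", "Nineti"),
    ("Carried", "Carrie", "Carri"),
]
for word, ip_stem, porter_stem in rows:
    print(word, levenshtein(word, ip_stem), levenshtein(word, porter_stem))
```

A smaller distance means the stem stays closer to the original word; a distance of 0 means the stemmer left the word unchanged.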
A.2 AMPL program for the proposed IP
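#---------- Integer Program for Suffix Stripping ----------
#------------------------- Model --------------------------
param n;                # number of positions (7 for PARSONS)
set I={2..n-1};         # positions that carry a decision variable
var Gama{I} binary;     # one binary decision variable per position
param C{I};             # objective coefficient of each position
param Cn;               # coefficient of the last position
param Gn;               # fixed value for the last position
maximize z: sum {e in 2..n-1} C[e]*Gama[e]+Cn*Gn;
subject to difference{x in 2..n-2}:
    C[x+1]*Gama[x+1]-C[x]*Gama[x]>=0;
#------------- Data: for the word PARSONS -----------------
data;
param n:=7;
param C:=
2 .288
3 .466
4 .009
5 .265
6 1
;
param Cn:=.894;
param Gn:=0;
option solver cplex;
solve;
display z, Gama;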
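For readers without AMPL or CPLEX, the same small model can be checked by enumeration. The following Python sketch is not the authors' implementation: it brute-forces every 0/1 assignment of Gama for the PARSONS data, assuming the data entries are index-value pairs (position 2 has coefficient 0.288, position 3 has 0.466, and so on). The function name `solve_stemming_ip` is hypothetical, and enumeration is only practical here because a single word has few positions.

```python
from itertools import product


def solve_stemming_ip(C, Cn, Gn):
    """Maximize sum C[e]*Gama[e] + Cn*Gn over binary Gama, subject to
    C[x+1]*Gama[x+1] - C[x]*Gama[x] >= 0 for consecutive positions."""
    positions = sorted(C)  # assumed consecutive integers, e.g. 2..n-1
    best_z, best_gama = None, None
    for bits in product((0, 1), repeat=len(positions)):
        gama = dict(zip(positions, bits))
        # The "difference" constraint of the AMPL model, one per adjacent pair.
        feasible = all(
            C[b] * gama[b] - C[a] * gama[a] >= 0
            for a, b in zip(positions, positions[1:])
        )
        if not feasible:
            continue
        z = sum(C[e] * gama[e] for e in positions) + Cn * Gn
        if best_z is None or z > best_z:
            best_z, best_gama = z, gama
    return best_z, best_gama


# Data for PARSONS, read as index-value pairs (an assumption, see above).
C_PARSONS = {2: 0.288, 3: 0.466, 4: 0.009, 5: 0.265, 6: 1.0}
z, gama = solve_stemming_ip(C_PARSONS, Cn=0.894, Gn=0)
print(round(z, 3), gama)  # 1.274 {2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
```

With these values the constraint between positions 3 and 4 forces Gama[3] to 0 (0.009 cannot cover 0.466), which in turn forces Gama[2] to 0, so the optimum sets only positions 4 to 6.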
References
Araujo, Lourdes et al. (2010): Structure of morphologically expanded queries: A genetic algorithm approach. Data & Knowledge Engineering, 69(3): 279–289. doi:10.1016/j.datak.2009.10.010
Corpus Del Español (2014): Available at: http://www.corpusdelespanol.org/, visited 30 September 2014.
Corpus Do Português (2014): Available at: http://www.corpusdoportugues.org/, visited 30 September 2014.
Corpus of Contemporary American English (COCA) (2014): Available at: http://corpus.byu.edu/coca/, visited 30 September 2014.
Dawson, John (1974): Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46.
English Joshua, S. (2005): English Stemming Algorithm. Working Paper. Westlake Village, CA: Pragmatic Solutions, Inc., 1–3.
Frakes, W. et al. (1998): DARE: Domain analysis and reuse environment. Annals of Software Engineering, 5(1): 125–141. doi:10.1023/A:1018972323770
Hafer, M. and Weiss, S. (1974): Word segmentation by letter successor varieties. Information Storage and Retrieval, 10: 371–385. doi:10.1016/0020-0271(74)90044-8
Harman, Donna (1991): How effective is suffixing? Journal of the American Society for Information Science, 42: 7–15. doi:10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P
Hull, David A. and Grefenstette, Gregory (1996): A detailed analysis of English stemming algorithms. Rank Xerox Research Center Technical Report.
Kraaij, Wessel and Pohlmann, Renee (1996): Viewing stemming as recall enhancement. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 40–48. doi:10.1145/243199.243209
Krovetz, Robert (1993): Viewing morphology as an inference process. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191–202. doi:10.1145/160688.160718
Levenshtein, V. I. (1966): Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 10(8): 707–710.
Lovins, J. B. (1968): Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22–31.
Majumder, Prasenjit et al. (2007): YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4): 18. doi:10.1145/1281485.1281489
Mayfield, James and McNamee, Paul (2003): Single N-gram stemming. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 415–416. doi:10.1145/860435.860528
Melucci, Massimo and Orio, Nicola (2007): Design, implementation, and evaluation of a methodology for automatic stemmer generation. Journal of the American Society for Information Science and Technology, 58(5): 673–686. doi:10.1002/asi.20509
Mood, A. M. et al. (1974): Introduction to the Theory of Statistics. New York: McGraw-Hill.
Paice, Chris D. (1990): Another stemmer. ACM SIGIR Forum, 24(3): 56–61. doi:10.1145/101306.101310
Paice, Chris D. (1994): An evaluation method for stemming algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–50.
Pande, B. P. and Dhami, H. S. (2011): Application of natural language processing tools in stemming. International Journal of Computer Applications, 27(6): 14–19. doi:10.5120/3302-4530
Pande, B. P., Tamta, P. and Dhami, H. S. (2014a): A simple algorithm for the problem of suffix stripping. International Journal of Applied Linguistics, 25(3): 315–328. doi:10.1111/ijal.12071
Pande, B. P., Tamta, P. and Dhami, H. S. (2014b): A Devanagari script based stemmer. International Journal of Computational Linguistics Research, 5(4): 119–130.
Peng, Fuchun et al. (2007): Context sensitive stemming for web search. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 639–646. doi:10.1145/1277741.1277851
Porter, M. F. (1980): An algorithm for suffix stripping. Program, 14: 130–137. doi:10.1108/eb046814
Porter, M. F. (2001): Snowball: A language for stemming algorithms. Available at: http://snowball.tartarus.org/ [1.10.2014].
Taha, Hamdy A. (2008): Operations Research: An Introduction, 8th edition. Boston et al.: Pearson.
Tomlinson, Stephen (2003): Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF 2003. CLEF, 3237: 286–300. doi:10.1007/978-3-540-30222-3_27
Xu, Jinxi and Croft, W. Bruce (1998): Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems, 16(11): 61–81. doi:10.1145/267954.267957
©2015 by De Gruyter Mouton