Suffix Stripping Problem as an Optimization Problem

Pawan Tamta; B. P. Pande; H. S. Dhami

doi:10.1515/glot-2015-0013

Published/Copyright: 1 December 2015

From the journal Glottotheory, Volume 6, Issue 2

Abstract

Stemming, or suffix stripping, is the problem of removing suffixes from words to obtain the root word. Word endings can be removed with stripping rules built on the morphological knowledge of a specific language, but such an approach does not carry over to a multilingual environment. Statistical approaches do work in a multilingual environment, but they require a significant amount of computation. We formulate stemming as an optimization problem for the first time in the literature and develop an integer program (IP) for it. We demonstrate the approach on clusters of English and Spanish words, and we compare the proposed method with an established technique for English. An AMPL program for the proposed method is given in Appendix A.2.

Keywords: information retrieval (IR); affix removal; stemming; conflation; integer program (IP)

Appendices

A.1

Table 4: Outputs of the IP stemmer and the Porter stemmer (Snowball) for 100 randomly chosen English words.

S. No.	Word	IP stem	Porter (Snowball) stem	LD_IP	LD_Porter
1.	Abject	Abject	Abject	0	0
2.	Admiring	Admir	Admir	3	3
3.	Admonishing	Admonish	Admonish	3	3
4.	Agreement	Agreement	Agreement	0	0
5.	Alto	Alto	Alto	0	0
6.	Anxiously	Anxious	Anxious	2	2
7.	Believing	Believ	Believ	3	3
8.	Blissfully	Bliss	Bliss	5	5
9.	Borrowed	Borrow	Borrow	2	2
10.	Borrowing	Borrow	Borrow	3	3
11.	Briskly	Brisk	Brisk	2	2
12.	Casual	Casual	Casual	0	0
13.	Consulted	Consult	Consult	2	2
14.	Consulting	Consult	Consult	3	3
15.	Cutter	Cutter	Cutter	0	0
16.	Deceived	Deceiv	Deceiv	2	2
17.	Deceiving	Deceiv	Deceiv	3	3
18.	Dilution	Dilut	Dilut	3	3
19.	Employed	Employ	Employ	2	2
20.	Employing	Employ	Employ	3	3
21.	Enormously	Enorm	Enorm	5	5
22.	Explained	Explain	Explain	2	2
23.	Explaining	Explain	Explain	3	3
24.	Finished	Finish	Finish	2	2
25.	Finishing	Finish	Finish	3	3
26.	Fleet	Fleet	Fleet	0	0
27.	Frightfully	Fright	Fright	5	5
28.	Gathered	Gather	Gather	2	2
29.	Gathering	Gather	Gather	3	3
30.	Haze	Haze	Haze	0	0
31.	Hopelessly	Hopeless	Hopeless	2	2
32.	Improved	Improv	Improv	2	2
33.	Improving	Improv	Improv	3	3
34.	Inwardly	Inward	Inward	2	2
35.	Laughed	Laugh	Laugh	2	2
36.	Laughing	Laugh	Laugh	3	3
37.	Listened	Listen	Listen	2	2
38.	Listening	Listen	Listen	3	3
39.	Lyric	Lyric	Lyric	0	0
40.	Menace	Menac	Menac	1	1
41.	Miserable	Miser	Miser	4	4
42.	Monthly	Month	Month	2	2
43.	Museums	Museum	Museum	1	1
44.	Ordaining	Ordain	Ordain	3	3
45.	Overseas	Oversea	Oversea	1	1
46.	Parsons	Parson	Parson	1	1
47.	Passion	Passion	Passion	0	0
48.	Plucked	Pluck	Pluck	2	2
49.	Plucking	Pluck	Pluck	3	3
50.	Preached	Preach	Preach	2	2
51.	Preaching	Preach	Preach	3	3
52.	Predicted	Predict	Predict	2	2
53.	Quiet	Quiet	Quiet	0	0
54.	Solemnly	Solemn	Solemn	2	2
55.	Swiftly	Swift	Swift	2	2
56.	Training	Train	Train	3	3
57.	Unloaded	Unload	Unload	2	2
58.	Viciously	Vicious	Vicious	2	2
59.	Abnormally	Abnormal	Abnorm	2	4
60.	Abused	Abuse	Abus	1	2
61.	Bloodier	Blood	Bloodier	3	0
62.	Carrying	Carry	Carri	3	2
63.	Delightfully	Delightful	Delight	2	5
64.	Despotic	Despotic	Despot	0	2
65.	Diligently	Diligent	Dilig	2	5
66.	Eleventh	Eleven	Eleventh	2	0
67.	Forbidden	Forbid	Forbidden	3	0
68.	Jubilantly	Jubilant	Jubil	2	5
69.	Macabre	Macabre	Macabr	0	1
70.	Midwife	Midwife	Midwif	0	1
71.	Ninety	Nine	Nineti	2	1
72.	Obnoxiously	Obnoxious	Obnoxi	2	5
73.	Quizzical	Quizzical	Quizzic	0	2
74.	Reluctantly	Reluctant	Reluct	2	5
75.	Substitute	Substitute	Substitut	0	1
76.	Thankfully	Thankful	Thank	2	5
77.	Undeveloped	Undeveloped	Undevelop	0	2
78.	Vacancy	Vacancy	Vacanc	0	1
79.	Wonderfully	Wonderful	Wonder	2	5
80.	Abusing	Abusing	Abus	0	3
81.	Admired	Admi	Admir	3	2
82.	Believed	Belie	Believ	3	2
83.	Braid	Bra	Braid	2	0
84.	Carried	Carrie	Carri	1	2
85.	Decorated	Decorat	Decor	2	4
86.	Decorating	Decorati	Decor	2	5
87.	Devastating	Devastat	Devast	3	5
88.	Forbidding	Forbidd	Forbid	3	4
89.	Freehand	Freeh	Freehand	3	0
90.	Laceration	Lacerat	Lacer	3	5
91.	Librarian	Librar	Librarian	3	0
92.	Mainstream	Mainstre	Mainstream	2	0
93.	Mended	Mende	Mend	1	2
94.	Mending	Mending	Mend	0	3
95.	Nipped	Nipp	Nip	2	3
96.	Nipping	Nipping	Nip	0	4
97.	Prophet	Prop	Prophet	3	0
98.	Recognized	Recogni	Recogn	3	4
99.	Refill	Refi	Refill	2	0
100.	Stallion	Stall	Stallion	3	0
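
The LD_IP and LD_Porter columns of Table 4 appear to hold the Levenshtein (edit) distance between each word and the stem produced for it. The column definitions are not part of this excerpt, so this reading is an assumption. Under that assumption, the following minimal Python sketch shows how such a value can be computed; the function name `levenshtein` and the sample rows are chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (word, stem) pairs taken from Table 4
rows = [("Admiring", "Admir"), ("Ninety", "Nineti"), ("Abnormally", "Abnorm")]
distances = [levenshtein(word, stem) for word, stem in rows]
print(distances)
```

For these three pairs the sketch yields 3, 1 and 4, which matches the corresponding table entries.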

A.2 AMPL program for the proposed IP

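# ---------- Integer program for suffix stripping ----------

# ---------- Model ----------
param n;                 # upper index bound (7 for PARSONS in the data below)
set I={2..n-1};          # index set of the decision variables
var Gama{I} binary;      # binary decision variables
param C{I};              # objective coefficient for each index in I
param Cn;                # coefficient for index n
param Gn;                # value for index n, supplied as data rather than optimized

maximize z: sum {e in 2..n-1} C[e]*Gama[e] + Cn*Gn;

subject to difference {x in 2..n-2}:
    C[x+1]*Gama[x+1] - C[x]*Gama[x] >= 0;

# ---------- Data: for the word PARSONS ----------
data;
param n := 7;
param C :=
    2  .288
    3  .466
    4  .009
    5  .265
    6  1
;
param Cn := .894;
param Gn := 0;

option solver cplex;
solve;
display z, Gama;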
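
For a word as short as PARSONS, the model can be cross-checked without a solver. The following is a minimal Python sketch, not part of the original paper, that enumerates every binary assignment of Gama, keeps those satisfying the `difference` constraints, and returns the best objective value. It takes the PARSONS coefficients as C = (.288, .466, .009, .265, 1) for indices 2 to 6, which is an assumption about the intended decimal placement in the listing. The function name `solve_by_enumeration` and the dictionary layout are illustrative, not the authors'.

```python
from itertools import product


def solve_by_enumeration(n, C, Cn, Gn):
    """Brute-force the IP: maximize sum(C[e]*Gama[e]) + Cn*Gn over binary Gama."""
    indices = list(range(2, n))  # 2..n-1, the AMPL set I
    best_z, best_gama = None, None
    for bits in product((0, 1), repeat=len(indices)):
        gama = dict(zip(indices, bits))
        # 'difference' constraints, one for each x in 2..n-2
        feasible = all(
            C[x + 1] * gama[x + 1] - C[x] * gama[x] >= 0
            for x in range(2, n - 1)
        )
        if not feasible:
            continue
        z = sum(C[e] * gama[e] for e in indices) + Cn * Gn
        if best_z is None or z > best_z:
            best_z, best_gama = z, gama
    return best_z, best_gama


# Coefficients for PARSONS (assumed decimal placement, see the text above)
C_parsons = {2: 0.288, 3: 0.466, 4: 0.009, 5: 0.265, 6: 1.0}
z, gama = solve_by_enumeration(n=7, C=C_parsons, Cn=0.894, Gn=0)
print(round(z, 3), gama)
```

Under these assumptions the enumeration gives z = 1.274 with Gama[4] = Gama[5] = Gama[6] = 1 and the remaining variables 0. Enumeration grows as 2^(n-2), so it serves only as a sanity check on short words, not as a replacement for the solver call in the AMPL program.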

References

Araujo, Lourdes et al. (2010): Structure of morphologically expanded queries: A genetic algorithm approach. Data & Knowledge Engineering, 69(3): 279–289. doi:10.1016/j.datak.2009.10.010

Corpus Del Español (2014): Available at: http://www.corpusdelespanol.org/, visited 30 September 2014.

Corpus Do Português (2014): Available at: http://www.corpusdoportugues.org/, visited 30 September 2014.

Corpus of Contemporary American English (COCA) (2014): Available at: http://corpus.byu.edu/coca/, visited 30 September 2014.

Dawson, John (1974): Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46.

English, Joshua S. (2005): English Stemming Algorithm. Working Paper. Westlake Village, CA: Pragmatic Solutions, Inc., 1–3.

Frakes, W. et al. (1998): DARE: Domain analysis and reuse environment. Annals of Software Engineering, 5(1): 125–141. doi:10.1023/A:1018972323770

Hafer, M. and Weiss, S. (1974): Word segmentation by letter successor varieties. Information Storage and Retrieval, 10: 371–385. doi:10.1016/0020-0271(74)90044-8

Harman, Donna (1991): How effective is suffixing? Journal of the American Society for Information Science, 42: 7–15. doi:10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P

Hull, David A. and Grefenstette, Gregory (1996): A detailed analysis of English stemming algorithms. Rank Xerox Research Center Technical Report.

Kraaij, Wessel and Pohlmann, Renee (1996): Viewing stemming as recall enhancement. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 40–48. doi:10.1145/243199.243209

Krovetz, Robert (1993): Viewing morphology as an inference process. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191–202. doi:10.1145/160688.160718

Levenshtein, V. I. (1966): Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 10(8): 707–710.

Lovins, J. B. (1968): Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22–31.

Majumder, Prasenjit et al. (2007): YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4): 18. doi:10.1145/1281485.1281489

Mayfield, James and McNamee, Paul (2003): Single N-gram stemming. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 415–416. doi:10.1145/860435.860528

Melucci, Massimo and Orio, Nicola (2007): Design, implementation, and evaluation of a methodology for automatic stemmer generation. Journal of the American Society for Information Science and Technology, 58(5): 673–686. doi:10.1002/asi.20509

Mood, A. M. et al. (1974): Introduction to the Theory of Statistics. New York: McGraw-Hill.

Paice, Chris D. (1990): Another stemmer. ACM SIGIR Forum, 24(3): 56–61. doi:10.1145/101306.101310

Paice, Chris D. (1994): An evaluation method for stemming algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–50.

Pande, B. P. and Dhami, H. S. (2011): Application of natural language processing tools in stemming. International Journal of Computer Applications, 27(6): 14–19. doi:10.5120/3302-4530

Pande, B. P., Tamta, P. and Dhami, H. S. (2014a): A simple algorithm for the problem of suffix stripping. International Journal of Applied Linguistics, 25(3): 315–328. doi:10.1111/ijal.12071

Pande, B. P., Tamta, P. and Dhami, H. S. (2014b): A Devanagari script based stemmer. International Journal of Computational Linguistics Research, 5(4): 119–130.

Peng, Funchun et al. (2007): Context sensitive stemming for web search. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 639–646. doi:10.1145/1277741.1277851

Porter, M. F. (1980): An algorithm for suffix stripping. Program, 14: 130–137. doi:10.1108/eb046814

Porter, M. F. (2001): Snowball: A language for stemming algorithms. Available at: http://snowball.tartarus.org/ [1.10.2014].

Taha, Hamdy A. (2008): Operations Research: An Introduction, 8th edition. Boston et al.: Pearson.

Tomlinson, Stephen (2003): Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF 2003. CLEF, 3237: 286–300. doi:10.1007/978-3-540-30222-3_27

Xu, Jinxi and Croft, W. Bruce (1998): Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems, 16(11): 61–81. doi:10.1145/267954.267957

Published Online: 2015-12-1

Published in Print: 2015-12-1
