Suffix Stripping Problem as an Optimization Problem

Pawan Tamta, B. P. Pande and H. S. Dhami
Published/Copyright: December 1, 2015

Abstract

Stemming, or suffix stripping, is the problem of removing suffixes from words to obtain the root word. Word endings can be removed by developing stripping rules based on morphological knowledge of a specific language, but such an approach does not carry over to a multilingual environment. Statistical approaches do work in a multilingual environment, but they require a significant amount of computation. We define stemming as an optimization problem for the first time in the literature and develop an integer program (IP) for it. We demonstrate our approach by applying it to clusters of English and Spanish words, and we compare the proposed method with an established technique for English. An AMPL program for the proposed method is given in Appendix A.2.

Appendices

A.1

Table 4: Outputs of the IP stemmer and the Porter stemmer (Snowball) for 100 randomly chosen English words.

S. No.  Word          IP stem      Porter (Snowball) stem  LD_IP  LD_Porter
1.      Abject        Abject       Abject                  0      0
2.      Admiring      Admir        Admir                   3      3
3.      Admonishing   Admonish     Admonish                3      3
4.      Agreement     Agreement    Agreement               0      0
5.      Alto          Alto         Alto                    0      0
6.      Anxiously     Anxious      Anxious                 2      2
7.      Believing     Believ       Believ                  3      3
8.      Blissfully    Bliss        Bliss                   5      5
9.      Borrowed      Borrow       Borrow                  2      2
10.     Borrowing     Borrow       Borrow                  3      3
11.     Briskly       Brisk        Brisk                   2      2
12.     Casual        Casual       Casual                  0      0
13.     Consulted     Consult      Consult                 2      2
14.     Consulting    Consult      Consult                 3      3
15.     Cutter        Cutter       Cutter                  0      0
16.     Deceived      Deceiv       Deceiv                  2      2
17.     Deceiving     Deceiv       Deceiv                  3      3
18.     Dilution      Dilut        Dilut                   3      3
19.     Employed      Employ       Employ                  2      2
20.     Employing     Employ       Employ                  3      3
21.     Enormously    Enorm        Enorm                   5      5
22.     Explained     Explain      Explain                 2      2
23.     Explaining    Explain      Explain                 3      3
24.     Finished      Finish       Finish                  2      2
25.     Finishing     Finish       Finish                  3      3
26.     Fleet         Fleet        Fleet                   0      0
27.     Frightfully   Fright       Fright                  5      5
28.     Gathered      Gather       Gather                  2      2
29.     Gathering     Gather       Gather                  3      3
30.     Haze          Haze         Haze                    0      0
31.     Hopelessly    Hopeless     Hopeless                2      2
32.     Improved      Improv       Improv                  2      2
33.     Improving     Improv       Improv                  3      3
34.     Inwardly      Inward       Inward                  2      2
35.     Laughed       Laugh        Laugh                   2      2
36.     Laughing      Laugh        Laugh                   3      3
37.     Listened      Listen       Listen                  2      2
38.     Listening     Listen       Listen                  3      3
39.     Lyric         Lyric        Lyric                   0      0
40.     Menace        Menac        Menac                   1      1
41.     Miserable     Miser        Miser                   4      4
42.     Monthly       Month        Month                   2      2
43.     Museums       Museum       Museum                  1      1
44.     Ordaining     Ordain       Ordain                  3      3
45.     Overseas      Oversea      Oversea                 1      1
46.     Parsons       Parson       Parson                  1      1
47.     Passion       Passion      Passion                 0      0
48.     Plucked       Pluck        Pluck                   2      2
49.     Plucking      Pluck        Pluck                   3      3
50.     Preached      Preach       Preach                  2      2
51.     Preaching     Preach       Preach                  3      3
52.     Predicted     Predict      Predict                 2      2
53.     Quiet         Quiet        Quiet                   0      0
54.     Solemnly      Solemn       Solemn                  2      2
55.     Swiftly       Swift        Swift                   2      2
56.     Training      Train        Train                   3      3
57.     Unloaded      Unload       Unload                  2      2
58.     Viciously     Vicious      Vicious                 2      2
59.     Abnormally    Abnormal     Abnorm                  2      4
60.     Abused        Abuse        Abus                    1      2
61.     Bloodier      Blood        Bloodier                3      0
62.     Carrying      Carry        Carri                   3      2
63.     Delightfully  Delightful   Delight                 2      5
64.     Despotic      Despotic     Despot                  0      2
65.     Diligently    Diligent     Dilig                   2      5
66.     Eleventh      Eleven       Eleventh                2      0
67.     Forbidden     Forbid       Forbidden               3      0
68.     Jubilantly    Jubilant     Jubil                   2      5
69.     Macabre       Macabre      Macabr                  0      1
70.     Midwife       Midwife      Midwif                  0      1
71.     Ninety        Nine         Nineti                  2      1
72.     Obnoxiously   Obnoxious    Obnoxi                  2      5
73.     Quizzical     Quizzical    Quizzic                 0      2
74.     Reluctantly   Reluctant    Reluct                  2      5
75.     Substitute    Substitute   Substitut               0      1
76.     Thankfully    Thankful     Thank                   2      5
77.     Undeveloped   Undeveloped  Undevelop               0      2
78.     Vacancy       Vacancy      Vacanc                  0      1
79.     Wonderfully   Wonderful    Wonder                  2      5
80.     Abusing       Abusing      Abus                    0      3
81.     Admired       Admi         Admir                   3      2
82.     Believed      Belie        Believ                  3      2
83.     Braid         Bra          Braid                   2      0
84.     Carried       Carrie       Carri                   1      2
85.     Decorated     Decorat      Decor                   2      4
86.     Decorating    Decorati     Decor                   2      5
87.     Devastating   Devastat     Devast                  3      5
88.     Forbidding    Forbidd      Forbid                  3      4
89.     Freehand      Freeh        Freehand                3      0
90.     Laceration    Lacerat      Lacer                   3      5
91.     Librarian     Librar       Librarian               3      0
92.     Mainstream    Mainstre     Mainstream              2      0
93.     Mended        Mende        Mend                    1      2
94.     Mending       Mending      Mend                    0      3
95.     Nipped        Nipp         Nip                     2      3
96.     Nipping       Nipping      Nip                     0      4
97.     Prophet       Prop         Prophet                 3      0
98.     Recognized    Recogni      Recogn                  3      4
99.     Refill        Refi         Refill                  2      0
100.    Stallion      Stall        Stallion                3      0
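
The LD columns can be spot-checked with a short script. The sketch below assumes that LD denotes the Levenshtein edit distance between the original word and its stem (Levenshtein 1966 appears in the reference list). The function name `levenshtein` and the choice of rows are illustrative and are not taken from the article.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (word, IP stem, Porter stem) copied from rows 2, 59, 71 and 100 of Table 4.
rows = [
    ("Admiring", "Admir", "Admir"),
    ("Abnormally", "Abnormal", "Abnorm"),
    ("Ninety", "Nine", "Nineti"),
    ("Stallion", "Stall", "Stallion"),
]

for word, ip_stem, porter_stem in rows:
    print(word, levenshtein(word, ip_stem), levenshtein(word, porter_stem))
```

For these four rows the script reproduces the tabulated pairs (3, 3), (2, 4), (2, 1) and (3, 0).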

A.2 AMPL program for the proposed IP
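
Read directly off the declarations in the listing that follows, the model can be written in conventional notation as below. The symbol \gamma_e for the AMPL variable Gama[e] is a notational assumption; the other symbols mirror the AMPL parameter names.

```latex
\begin{aligned}
\text{maximize}\quad   & z = \sum_{e=2}^{n-1} C_e\,\gamma_e + C_n G_n \\
\text{subject to}\quad & C_{x+1}\,\gamma_{x+1} - C_x\,\gamma_x \ge 0, \qquad x = 2,\dots,n-2, \\
                       & \gamma_e \in \{0,1\}, \qquad e = 2,\dots,n-1 .
\end{aligned}
```

Since G_n is declared as a parameter rather than a variable, the term C_n G_n is a constant and does not affect which \gamma values are optimal.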

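# ---------- Integer program for suffix stripping ----------

# ---------- Model ----------
param n;
set I={2..n-1};         # indices of the binary variables
var Gama{I} binary;     # one 0/1 decision variable per index in I
param C{I};             # objective coefficient for each index in I
param Cn;               # coefficient for index n
param Gn;               # value for index n, supplied as data rather than optimized

maximize z:
    sum {e in 2..n-1} C[e]*Gama[e] + Cn*Gn;

subject to difference {x in 2..n-2}:
    C[x+1]*Gama[x+1] - C[x]*Gama[x] >= 0;

# ---------- Data: for the word PARSONS ----------
data;
param n := 7;           # PARSONS has 7 letters
param C :=
    2  .288
    3  .466
    4  .009
    5  .265
    6  1
;
param Cn := .894;
param Gn := 0;

# ---------- Solve and report ----------
option solver cplex;
solve;
display z, Gama;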
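
For readers without AMPL or CPLEX, the same small model can be checked by enumerating all 0/1 assignments. The following is a minimal sketch, not the authors' implementation. It assumes that the data entries for C are index/value pairs with values .288, .466, .009, .265 and 1 (in line with Cn = .894), and the function name `solve_by_enumeration` is hypothetical. How the Gama values translate into a stem is defined in the body of the article and is not interpreted here.

```python
from itertools import product

# Data for the word PARSONS, as read from the AMPL data section
# (assumption: each entry of C is an index followed by a value below or equal to 1).
n = 7
C = {2: 0.288, 3: 0.466, 4: 0.009, 5: 0.265, 6: 1.0}
Cn, Gn = 0.894, 0


def solve_by_enumeration(n, C, Cn, Gn):
    """Brute-force the integer program by trying every binary vector Gama."""
    indices = list(range(2, n))                # the AMPL set I = 2..n-1
    best_z, best_gama = None, None
    for bits in product((0, 1), repeat=len(indices)):
        gama = dict(zip(indices, bits))
        # "difference" constraints for x = 2..n-2
        feasible = all(C[x + 1] * gama[x + 1] - C[x] * gama[x] >= 0
                       for x in range(2, n - 1))
        if not feasible:
            continue
        z = sum(C[e] * gama[e] for e in indices) + Cn * Gn
        if best_z is None or z > best_z:
            best_z, best_gama = z, gama
    return best_z, best_gama


z, gama = solve_by_enumeration(n, C, Cn, Gn)
print(round(z, 3), gama)   # 1.274 {2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
```

Enumeration is exponential in the number of variables, so it is only a cross-check for single short words such as this one; the AMPL model with a proper solver remains the practical route.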

References

Araujo, Lourdes et al. (2010): Structure of morphologically expanded queries: A genetic algorithm approach. Data & Knowledge Engineering, 69(3): 279–289. DOI: 10.1016/j.datak.2009.10.010

Corpus Del Español (2014): Available at: http://www.corpusdelespanol.org/, visited 30 September 2014.

Corpus Do Português (2014): Available at: http://www.corpusdoportugues.org/, visited 30 September 2014.

Corpus of Contemporary American English (COCA) (2014): Available at: http://corpus.byu.edu/coca/, visited 30 September 2014.

Dawson, John (1974): Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46.

English, Joshua S. (2005): English Stemming Algorithm. Working Paper. Westlake Village, CA: Pragmatic Solutions, Inc., 1–3.

Frakes, W. et al. (1998): DARE: Domain analysis and reuse environment. Annals of Software Engineering, 5(1): 125–141. DOI: 10.1023/A:1018972323770

Hafer, M. and Weiss, S. (1974): Word segmentation by letter successor varieties. Information Storage and Retrieval, 10: 371–385. DOI: 10.1016/0020-0271(74)90044-8

Harman, Donna (1991): How effective is suffixing? Journal of the American Society for Information Science, 42: 7–15. DOI: 10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P

Hull, David A. and Grefenstette, Gregory (1996): A detailed analysis of English stemming algorithms. Rank Xerox Research Center Technical Report.

Kraaij, Wessel and Pohlmann, Renee (1996): Viewing stemming as recall enhancement. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 40–48. DOI: 10.1145/243199.243209

Krovetz, Robert (1993): Viewing morphology as an inference process. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191–202. DOI: 10.1145/160688.160718

Levenshtein, V. I. (1966): Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 10(8): 707–710.

Lovins, J. B. (1968): Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22–31.

Majumder, Prasenjit et al. (2007): YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4): 18. DOI: 10.1145/1281485.1281489

Mayfield, James and McNamee, Paul (2003): Single N-gram stemming. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 415–416. DOI: 10.1145/860435.860528

Melucci, Massimo and Orio, Nicola (2007): Design, implementation, and evaluation of a methodology for automatic stemmer generation. Journal of the American Society for Information Science and Technology, 58(5): 673–686. DOI: 10.1002/asi.20509

Mood, A. M. et al. (1974): Introduction to the Theory of Statistics. New York: McGraw-Hill.

Paice, Chris D. (1990): Another stemmer. ACM SIGIR Forum, 24(3): 56–61. DOI: 10.1145/101306.101310

Paice, Chris D. (1994): An evaluation method for stemming algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–50.

Pande, B. P. and Dhami, H. S. (2011): Application of natural language processing tools in stemming. International Journal of Computer Applications, 27(6): 14–19. DOI: 10.5120/3302-4530

Pande, B. P., Tamta, P. and Dhami, H. S. (2014a): A simple algorithm for the problem of suffix stripping. International Journal of Applied Linguistics, 25(3): 315–328. DOI: 10.1111/ijal.12071

Pande, B. P., Tamta, P. and Dhami, H. S. (2014b): A Devanagari script based stemmer. International Journal of Computational Linguistics Research, 5(4): 119–130.

Peng, Funchun et al. (2007): Context sensitive stemming for web search. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 639–646. DOI: 10.1145/1277741.1277851

Porter, M. F. (1980): An algorithm for suffix stripping. Program, 14: 130–137. DOI: 10.1108/eb046814

Porter, M. F. (2001): Snowball: A language for stemming algorithms. Available at: http://snowball.tartarus.org/ [1.10.2014].

Taha, Hamdy A. (2008): Operations Research: An Introduction, 8th edition. Boston et al.: Pearson.

Tomlinson, Stephen (2003): Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF 2003. CLEF, 3237: 286–300. DOI: 10.1007/978-3-540-30222-3_27

Xu, Jinxi and Croft, W. Bruce (1998): Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems, 16(11): 61–81. DOI: 10.1145/267954.267957

Published Online: 2015-12-1
Published in Print: 2015-12-1

©2015 by De Gruyter Mouton

Downloaded on 16.10.2025 from https://www.degruyterbrill.com/document/doi/10.1515/glot-2015-0013/html