
Suffix Stripping Problem as an Optimization Problem

Pawan Tamta, B. P. Pande and H. S. Dhami
Published/Copyright: December 1, 2015

Abstract

Stemming, or suffix stripping, is the problem of removing suffixes from words to obtain the root word. Word endings can be removed with stripping rules built on morphological knowledge of a specific language, but such an approach does not carry over to a multilingual environment. Statistical approaches do work in a multilingual environment, but they require a significant amount of computation. We define stemming as an optimization problem for the first time in the literature and develop an integer program (IP) for it. We demonstrate the approach by applying it to clusters of English and Spanish words, and we compare the proposed method with an established technique in the field for English. An AMPL program for the proposed method is given in Appendix A.2.

Appendices

A.1

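Table 4: Outputs of IP stemmer and Porter stemmer (Snowball) over 100 randomly chosen English words.

| S. No. | Word | IP stem | Porter (Snowball) stem | LD_IP | LD_Porter |
|---|---|---|---|---|---|
| 1 | Abject | Abject | Abject | 0 | 0 |
| 2 | Admiring | Admir | Admir | 3 | 3 |
| 3 | Admonishing | Admonish | Admonish | 3 | 3 |
| 4 | Agreement | Agreement | Agreement | 0 | 0 |
| 5 | Alto | Alto | Alto | 0 | 0 |
| 6 | Anxiously | Anxious | Anxious | 2 | 2 |
| 7 | Believing | Believ | Believ | 3 | 3 |
| 8 | Blissfully | Bliss | Bliss | 5 | 5 |
| 9 | Borrowed | Borrow | Borrow | 2 | 2 |
| 10 | Borrowing | Borrow | Borrow | 3 | 3 |
| 11 | Briskly | Brisk | Brisk | 2 | 2 |
| 12 | Casual | Casual | Casual | 0 | 0 |
| 13 | Consulted | Consult | Consult | 2 | 2 |
| 14 | Consulting | Consult | Consult | 3 | 3 |
| 15 | Cutter | Cutter | Cutter | 0 | 0 |
| 16 | Deceived | Deceiv | Deceiv | 2 | 2 |
| 17 | Deceiving | Deceiv | Deceiv | 3 | 3 |
| 18 | Dilution | Dilut | Dilut | 3 | 3 |
| 19 | Employed | Employ | Employ | 2 | 2 |
| 20 | Employing | Employ | Employ | 3 | 3 |
| 21 | Enormously | Enorm | Enorm | 5 | 5 |
| 22 | Explained | Explain | Explain | 2 | 2 |
| 23 | Explaining | Explain | Explain | 3 | 3 |
| 24 | Finished | Finish | Finish | 2 | 2 |
| 25 | Finishing | Finish | Finish | 3 | 3 |
| 26 | Fleet | Fleet | Fleet | 0 | 0 |
| 27 | Frightfully | Fright | Fright | 5 | 5 |
| 28 | Gathered | Gather | Gather | 2 | 2 |
| 29 | Gathering | Gather | Gather | 3 | 3 |
| 30 | Haze | Haze | Haze | 0 | 0 |
| 31 | Hopelessly | Hopeless | Hopeless | 2 | 2 |
| 32 | Improved | Improv | Improv | 2 | 2 |
| 33 | Improving | Improv | Improv | 3 | 3 |
| 34 | Inwardly | Inward | Inward | 2 | 2 |
| 35 | Laughed | Laugh | Laugh | 2 | 2 |
| 36 | Laughing | Laugh | Laugh | 3 | 3 |
| 37 | Listened | Listen | Listen | 2 | 2 |
| 38 | Listening | Listen | Listen | 3 | 3 |
| 39 | Lyric | Lyric | Lyric | 0 | 0 |
| 40 | Menace | Menac | Menac | 1 | 1 |
| 41 | Miserable | Miser | Miser | 4 | 4 |
| 42 | Monthly | Month | Month | 2 | 2 |
| 43 | Museums | Museum | Museum | 1 | 1 |
| 44 | Ordaining | Ordain | Ordain | 3 | 3 |
| 45 | Overseas | Oversea | Oversea | 1 | 1 |
| 46 | Parsons | Parson | Parson | 1 | 1 |
| 47 | Passion | Passion | Passion | 0 | 0 |
| 48 | Plucked | Pluck | Pluck | 2 | 2 |
| 49 | Plucking | Pluck | Pluck | 3 | 3 |
| 50 | Preached | Preach | Preach | 2 | 2 |
| 51 | Preaching | Preach | Preach | 3 | 3 |
| 52 | Predicted | Predict | Predict | 2 | 2 |
| 53 | Quiet | Quiet | Quiet | 0 | 0 |
| 54 | Solemnly | Solemn | Solemn | 2 | 2 |
| 55 | Swiftly | Swift | Swift | 2 | 2 |
| 56 | Training | Train | Train | 3 | 3 |
| 57 | Unloaded | Unload | Unload | 2 | 2 |
| 58 | Viciously | Vicious | Vicious | 2 | 2 |
| 59 | Abnormally | Abnormal | Abnorm | 2 | 4 |
| 60 | Abused | Abuse | Abus | 1 | 2 |
| 61 | Bloodier | Blood | Bloodier | 3 | 0 |
| 62 | Carrying | Carry | Carri | 3 | 2 |
| 63 | Delightfully | Delightful | Delight | 2 | 5 |
| 64 | Despotic | Despotic | Despot | 0 | 2 |
| 65 | Diligently | Diligent | Dilig | 2 | 5 |
| 66 | Eleventh | Eleven | Eleventh | 2 | 0 |
| 67 | Forbidden | Forbid | Forbidden | 3 | 0 |
| 68 | Jubilantly | Jubilant | Jubil | 2 | 5 |
| 69 | Macabre | Macabre | Macabr | 0 | 1 |
| 70 | Midwife | Midwife | Midwif | 0 | 1 |
| 71 | Ninety | Nine | Nineti | 2 | 1 |
| 72 | Obnoxiously | Obnoxious | Obnoxi | 2 | 5 |
| 73 | Quizzical | Quizzical | Quizzic | 0 | 2 |
| 74 | Reluctantly | Reluctant | Reluct | 2 | 5 |
| 75 | Substitute | Substitute | Substitut | 0 | 1 |
| 76 | Thankfully | Thankful | Thank | 2 | 5 |
| 77 | Undeveloped | Undeveloped | Undevelop | 0 | 2 |
| 78 | Vacancy | Vacancy | Vacanc | 0 | 1 |
| 79 | Wonderfully | Wonderful | Wonder | 2 | 5 |
| 80 | Abusing | Abusing | Abus | 0 | 3 |
| 81 | Admired | Admi | Admir | 3 | 2 |
| 82 | Believed | Belie | Believ | 3 | 2 |
| 83 | Braid | Bra | Braid | 2 | 0 |
| 84 | Carried | Carrie | Carri | 1 | 2 |
| 85 | Decorated | Decorat | Decor | 2 | 4 |
| 86 | Decorating | Decorati | Decor | 2 | 5 |
| 87 | Devastating | Devastat | Devast | 3 | 5 |
| 88 | Forbidding | Forbidd | Forbid | 3 | 4 |
| 89 | Freehand | Freeh | Freehand | 3 | 0 |
| 90 | Laceration | Lacerat | Lacer | 3 | 5 |
| 91 | Librarian | Librar | Librarian | 3 | 0 |
| 92 | Mainstream | Mainstre | Mainstream | 2 | 0 |
| 93 | Mended | Mende | Mend | 1 | 2 |
| 94 | Mending | Mending | Mend | 0 | 3 |
| 95 | Nipped | Nipp | Nip | 2 | 3 |
| 96 | Nipping | Nipping | Nip | 0 | 4 |
| 97 | Prophet | Prop | Prophet | 3 | 0 |
| 98 | Recognized | Recogni | Recogn | 3 | 4 |
| 99 | Refill | Refi | Refill | 2 | 0 |
| 100 | Stallion | Stall | Stallion | 3 | 0 |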
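
The LD columns in Table 4 appear to report the Levenshtein (edit) distance between each word and its stem. That reading is an assumption based on the column labels and the Levenshtein (1966) entry in the reference list. Under that assumption, a minimal Python sketch of the distance computation could look as follows; the function name `levenshtein` is illustrative and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Row 59 of Table 4: word, IP stem, Porter (Snowball) stem.
ld_ip = levenshtein("Abnormally", "Abnormal")
ld_porter = levenshtein("Abnormally", "Abnorm")
```

For row 59 this gives 2 for the IP stem and 4 for the Porter stem, matching the values listed in the table.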

A.2 AMPL program for the proposed IP

# ---------- Integer Program for Suffix Stripping ----------

# ---------- Model ----------
param n;
set I={2..n-1};
var Gama{I} binary;
param C{I};
param Cn;
param Gn;

maximize z: sum {e in 2..n-1} C[e]*Gama[e]+Cn*Gn;

subject to difference{x in 2..n-2}:
    C[x+1]*Gama[x+1]-C[x]*Gama[x]>=0;

# ---------- Data: for the word PARSONS ----------
data;
param n:=7;
param C:=
2 .288
3 .466
4 .009
5 .265
6 1
;
param Cn:=.894;
param Gn:=0;

option solver cplex;
solve;
display z, Gama;
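
For readers without access to AMPL or CPLEX, the same small binary program can be checked by exhaustive enumeration. The Python sketch below is not the authors' implementation. It assumes the PARSONS coefficients in the listing are the decimals .288, .466, .009, .265 and 1 (consistent with `Cn:=.894`), and the function name `solve_stemming_ip` is hypothetical. Enumeration is only practical because a word of n letters yields n-2 binary variables.

```python
from itertools import product


def solve_stemming_ip(c, c_n, g_n):
    """Brute-force the binary program from Appendix A.2.

    c   : dict mapping positions 2..n-1 to the coefficients C[e]
    c_n : coefficient Cn of the fixed term
    g_n : fixed value Gn
    """
    positions = sorted(c)  # assumed contiguous: 2, 3, ..., n-1
    best_value, best_gamma = None, None
    for bits in product((0, 1), repeat=len(positions)):
        gamma = dict(zip(positions, bits))
        # Constraint "difference": C[x+1]*Gama[x+1] - C[x]*Gama[x] >= 0
        # for x = 2..n-2, i.e. every position except the last one.
        feasible = all(
            c[x + 1] * gamma[x + 1] - c[x] * gamma[x] >= 0
            for x in positions[:-1]
        )
        if not feasible:
            continue
        value = sum(c[e] * gamma[e] for e in positions) + c_n * g_n
        if best_value is None or value > best_value:
            best_value, best_gamma = value, gamma
    return best_value, best_gamma


# Data for the word PARSONS (n = 7), as read from the AMPL listing.
c_parsons = {2: 0.288, 3: 0.466, 4: 0.009, 5: 0.265, 6: 1.0}
z, gamma = solve_stemming_ip(c_parsons, c_n=0.894, g_n=0)
print(z, gamma)
```

With this data the enumeration selects Gama[4] = Gama[5] = Gama[6] = 1 and Gama[2] = Gama[3] = 0, giving an objective value of about 1.274. How the selected positions map to the stem is defined in the body of the paper and is not reproduced here.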

References

Araujo, Lourdes et al. (2010): Structure of morphologically expanded queries: A genetic algorithm approach. Data & Knowledge Engineering, 69(3): 279–289. doi:10.1016/j.datak.2009.10.010

Corpus Del Español (2014): Available at: http://www.corpusdelespanol.org/, visited 30 September 2014.

Corpus Do Português (2014): Available at: http://www.corpusdoportugues.org/, visited 30 September 2014.

Corpus of Contemporary American English (COCA) (2014): Available at: http://corpus.byu.edu/coca/, visited 30 September 2014.

Dawson, John (1974): Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46.

English, Joshua S. (2005): English Stemming Algorithm. Working Paper. Westlake Village, CA: Pragmatic Solutions, Inc., 1–3.

Frakes, W. et al. (1998): DARE: Domain analysis and reuse environment. Annals of Software Engineering, 5(1): 125–141. doi:10.1023/A:1018972323770

Hafer, M. and Weiss, S. (1974): Word segmentation by letter successor varieties. Information Storage and Retrieval, 10: 371–385. doi:10.1016/0020-0271(74)90044-8

Harman, Donna (1991): How effective is suffixing? Journal of the American Society for Information Science, 42: 7–15. doi:10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P

Hull, David A. and Grefenstette, Gregory (1996): A detailed analysis of English stemming algorithms. Rank Xerox Research Center Technical Report.

Kraaij, Wessel and Pohlmann, Renee (1996): Viewing stemming as recall enhancement. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 40–48. doi:10.1145/243199.243209

Krovetz, Robert (1993): Viewing morphology as an inference process. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191–202. doi:10.1145/160688.160718

Levenshtein, V. I. (1966): Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 10(8): 707–710.

Lovins, J. B. (1968): Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11: 22–31.

Majumder, Prasenjit et al. (2007): YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4): 18. doi:10.1145/1281485.1281489

Mayfield, James and McNamee, Paul (2003): Single N-gram stemming. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 415–416. doi:10.1145/860435.860528

Melucci, Massimo and Orio, Nicola (2007): Design, implementation, and evaluation of a methodology for automatic stemmer generation. Journal of the American Society for Information Science and Technology, 58(5): 673–686. doi:10.1002/asi.20509

Mood, A. M. et al. (1974): Introduction to the Theory of Statistics. New York: McGraw-Hill.

Paice, Chris D. (1990): Another stemmer. ACM SIGIR Forum, 24(3): 56–61. doi:10.1145/101306.101310

Paice, Chris D. (1994): An evaluation method for stemming algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–50.

Pande, B. P. and Dhami, H. S. (2011): Application of natural language processing tools in stemming. International Journal of Computer Applications, 27(6): 14–19. doi:10.5120/3302-4530

Pande, B. P., Tamta, P. and Dhami, H. S. (2014a): A simple algorithm for the problem of suffix stripping. International Journal of Applied Linguistics, 25(3): 315–328. doi:10.1111/ijal.12071

Pande, B. P., Tamta, P. and Dhami, H. S. (2014b): A Devanagari script based stemmer. International Journal of Computational Linguistics Research, 5(4): 119–130.

Peng, Fuchun et al. (2007): Context sensitive stemming for web search. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 639–646. doi:10.1145/1277741.1277851

Porter, M. F. (1980): An algorithm for suffix stripping. Program, 14: 130–137. doi:10.1108/eb046814

Porter, M. F. (2001): Snowball: A language for stemming algorithms. Available at: http://snowball.tartarus.org/, visited 1 October 2014.

Taha, Hamdy A. (2008): Operations Research: An Introduction, 8th edition. Boston et al.: Pearson.

Tomlinson, Stephen (2003): Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF 2003. CLEF, 3237: 286–300. doi:10.1007/978-3-540-30222-3_27

Xu, Jinxi and Croft, W. Bruce (1998): Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems, 16(11): 61–81. doi:10.1145/267954.267957

Published Online: 2015-12-1
Published in Print: 2015-12-1

©2015 by De Gruyter Mouton

Downloaded on 16.10.2025 from https://www.degruyterbrill.com/document/doi/10.1515/glot-2015-0013/html