Identification of similarities and clusters of bread baking recipes based on ingredient data
Stefan Anlauf, Sebastian Dorl, Theresa Hirz
Abstract
We define the similarity of bakery recipes using different distance calculations and identify groups of similar recipes using different clustering algorithms. Our analyses are based on the relative amounts of ingredients in the recipes. We compare three clustering algorithms (k-means, k-medoids, and hierarchical clustering) and determine the optimal number of clusters. Besides the standard distance calculation (Euclidean distance), we test three further distance measures (Hamming distance, Manhattan distance, and cosine similarity). Additionally, we reduce the impact of raw materials used in large quantities by applying two data transformations, namely taking the logarithm of the original data and binarizing it. Clustering recipes based on their ingredients can improve the search for similar recipes and thereby support the time-consuming process of developing new recipes. Using hierarchical clustering on the log-transformed data, we separate 704 recipes into three clusters, achieving a silhouette score of 0.531. We visualize our results via dendrograms representing the recipes’ hierarchical separation into groups and sub-groups.
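The following Python snippet is a minimal sketch of the pipeline described above, assuming log-transformed relative ingredient amounts, hierarchical clustering cut at three clusters, and silhouette-score evaluation. The random ingredient matrix and the Ward linkage criterion are illustrative assumptions; the actual dataset and the linkage settings used by the authors are not given in this abstract.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the 704-recipe dataset (not public):
# rows = recipes, columns = relative ingredient amounts.
rng = np.random.default_rng(0)
ingredients = rng.random((704, 20))

# The log transform dampens the influence of ingredients used in large
# quantities (e.g. flour or water); log1p keeps zero amounts defined.
X = np.log1p(ingredients)

# Hierarchical (agglomerative) clustering; Ward linkage is an assumption.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into three clusters

# Silhouette score for this synthetic data, not the paper's 0.531.
print("silhouette score:", silhouette_score(X, labels))

# scipy.cluster.hierarchy.dendrogram(Z) would plot the hierarchical
# separation into groups and sub-groups that the paper visualizes.
```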
Funding source: Österreichische Forschungsförderungsgesellschaft
Award Identifier / Grant number: INTEGRATE – 892418
Funding source: The research reported in this paper has been funded by BMK, BMDW, and the State of Upper Austria within the framework of the COMET Programme managed by FFG
Acknowledgment
The project is a cooperation between the University of Applied Sciences Upper Austria, the Software Competence Center Hagenberg, and backaldrin International The Kornspitz Company GmbH. The data was provided by backaldrin International The Kornspitz Company GmbH.
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: The research reported in this paper has been funded by BMK, BMDW, and the State of Upper Austria within the framework of the COMET Programme managed by FFG. This research was also funded by the Österreichische Forschungsförderungsgesellschaft (INTEGRATE – 892418).
- Data availability: Not applicable.