
Understanding the effects of temporal energy-data aggregation on clustering quality

  • Holger Trittenbach

    Holger Trittenbach is working towards the PhD degree at Karlsruhe Institute of Technology (KIT) at the Department of Informatics. His current research interest is data mining and machine learning in the fields of outlier detection and active learning.

  • Jakob Bach

    Jakob Bach is working towards the PhD degree at Karlsruhe Institute of Technology (KIT) at the Department of Informatics. His current research interest is machine learning in the fields of feature selection and meta-learning.

  • Klemens Böhm

    Klemens Böhm has been a full professor at Karlsruhe Institute of Technology (KIT) since 2004. Current research topics at his chair are knowledge discovery and data mining in big data, data privacy, and workflow management.

Published/Copyright: October 24, 2019

Abstract

Energy data is often available at high temporal resolution, which challenges the scalability of data-analysis methods. A common way to cope with this is to aggregate the data into, say, 15-minute-interval summaries. But it is often unknown how much information is lost this way, i.e., how good analysis results on aggregated data actually are. In this article, we study the effects of aggregating energy data on clustering. We propose an experimental design to compare a wide range of clustering methods found in the literature. We then introduce different ways to compare clustering results obtained with different aggregation schemes. Our evaluation shows that aggregation affects clustering quality significantly. Finally, we propose guidelines for selecting an aggregation scheme.

Award Identifier / Grant number: GRK 2153

Funding statement: This work was supported by the German Research Foundation (DFG) as part of the Research Training Group GRK 2153: Energy Status Data – Informatics Methods for its Collection, Analysis and Exploitation.


Appendix A Adaptation of indices

Table 4

Overview of clustering algorithms.

| Algorithm | Ref. | Category | Parameters |
| --- | --- | --- | --- |
| PAM | [35] | representative-based | 2 ≤ k ≤ 10 with maximum Silhouette |
| AP | [36] | representative-based | s(x,y) = −d(x,y)*, s(x,x) = median_{x,y}(s(x,y)), max iterations = 1000, λ = 0.9 |
| DBSCAN | [38] | density-based | minPts = 1, ϵ = mean_x(d_1NN(x))* |
| Hier.avg | [40] | hierarchical | average linkage, 2 ≤ k ≤ 10 with maximum Silhouette |
| Hier.comp | [40] | hierarchical | complete linkage, 2 ≤ k ≤ 10 with maximum Silhouette |
| Hier.sin | [40] | hierarchical | single linkage, 2 ≤ k ≤ 10 with maximum Silhouette |
| Hier.ward | [69] | hierarchical | Ward's criterion, 2 ≤ k ≤ 10 with maximum Silhouette |

* s(·,·) = similarity, d(·,·) = dissimilarity.
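Several algorithms in Table 4 choose the number of clusters k from 2 ≤ k ≤ 10 by maximizing the Silhouette coefficient [56]. The following sketch illustrates this selection loop over a precomputed dissimilarity matrix; the toy "clusterer" that enumerates contiguous partitions of sorted 1-D data is purely illustrative and not part of the experimental setup:

```python
from itertools import combinations

def silhouette(labels, dist):
    """Mean Silhouette coefficient from a precomputed dissimilarity matrix."""
    n = len(labels)
    clusters = {c: [i for i in range(n) if labels[i] == c] for c in set(labels)}
    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # convention: s(i) = 0 for singleton clusters
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in mem) / len(mem)
                for c, mem in clusters.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / n

def best_k(points, k_min=2, k_max=10):
    """Pick k by maximum Silhouette over all contiguous partitions of sorted 1-D data."""
    pts = sorted(points)
    n = len(pts)
    dist = [[abs(p - q) for q in pts] for p in pts]
    best = (-2.0, None)  # (silhouette, k); silhouette is always > -2
    for k in range(k_min, min(k_max, n - 1) + 1):
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0,) + cuts + (n,)
            labels = [c for c in range(k)
                      for _ in range(bounds[c + 1] - bounds[c])]
            best = max(best, (silhouette(labels, dist), k))
    return best[1]
```

With three well-separated groups, e.g. `best_k([0, 1, 2, 10, 11, 12, 20, 21, 22])`, the loop recovers k = 3.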

Notation

Let D = {x₁, x₂, …, x_m} be a set of m time series. A cluster C_i ⊆ D is a subset of all time series. A clustering 𝒞 partitions the data set D into k clusters C₁, …, C_k. The dissimilarity between two time series x and x′ is d(x, x′).

Connectivity

In the original version, high Connectivity indicates poor clustering quality [60]. We invert the index such that higher values indicate good clustering quality. As an intermediate step, we normalize Connectivity to [0, 1] by dividing by the maximum possible Connectivity. Connectivity attains its maximum if, for all objects, the L nearest neighbors are assigned to a different cluster. The inverted and normalized Connectivity is:

i.Con(𝒞) = 1 − Con(𝒞) / (|D| · ∑_{l=1}^{L} 1/l)
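A direct transcription of the inverted and normalized Connectivity might look as follows (a sketch over a precomputed dissimilarity matrix; breaking neighbor ties by index order is our assumption):

```python
def inv_norm_connectivity(dist, labels, L):
    """Inverted, normalized Connectivity: 1 means every object's L nearest
    neighbours share its cluster; 0 means none of them do."""
    n = len(dist)
    h_L = sum(1.0 / l for l in range(1, L + 1))  # maximum penalty per object
    con = 0.0
    for i in range(n):
        # rank all other objects by dissimilarity to object i
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for l, j in enumerate(order[:L], start=1):
            if labels[j] != labels[i]:
                con += 1.0 / l  # penalty for a neighbour in a foreign cluster
    return 1.0 - con / (n * h_L)
```

For two well-separated groups (all nearest neighbors within the own cluster), the index is 1; if every object's nearest neighbors all lie in foreign clusters, it drops to 0.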

Davies-Bouldin Index

The original Davies-Bouldin Index [57] relies on dissimilarities between cluster centroids and on dissimilarities of objects to their cluster centroids. To make the index applicable to algorithms that are not representative-based, we use average-based instead of centroid-based dissimilarities. The average intra-cluster dissimilarity of a cluster Ci is:

(1) δ_intra^avg(C_i) = 1 / (|C_i| · (|C_i| − 1)) · ∑_{x, x′ ∈ C_i, x ≠ x′} d(x, x′)

The average inter-cluster dissimilarity between two clusters Ci and Cj is:

(2) δ_inter^avg(C_i, C_j) = 1 / (|C_i| · |C_j|) · ∑_{x ∈ C_i, x′ ∈ C_j} d(x, x′)
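A direct transcription of Equations 1 and 2 over a precomputed dissimilarity matrix might look as follows (a sketch; returning 0 for singleton clusters is our convention, as a singleton has no intra-cluster pairs):

```python
def delta_intra(cluster, dist):
    """Equation (1): average pairwise dissimilarity within one cluster.
    `cluster` holds indices into the dissimilarity matrix `dist`."""
    m = len(cluster)
    if m < 2:
        return 0.0  # singleton: no intra-cluster pairs (our convention)
    total = sum(dist[x][y] for x in cluster for y in cluster if x != y)
    return total / (m * (m - 1))

def delta_inter(ci, cj, dist):
    """Equation (2): average pairwise dissimilarity between two clusters."""
    return sum(dist[x][y] for x in ci for y in cj) / (len(ci) * len(cj))
```

For instance, with objects at positions 0, 1, and 10 under the L1 distance, `delta_intra([0, 1], dist)` is 1.0 and `delta_inter([0, 1], [2], dist)` is 9.5.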

We also invert the summands of the original Davies-Bouldin definition such that higher index values indicate good clustering quality. Our generalized and inverted version of the Davies-Bouldin Index is:

i.D-B(𝒞) = (1/k) · ∑_{C_i ∈ 𝒞} min_{C_j ∈ 𝒞, j ≠ i} [ δ_inter^avg(C_i, C_j) / (δ_intra^avg(C_i) + δ_intra^avg(C_j)) ]
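Combining Equations 1 and 2, the generalized and inverted Davies-Bouldin Index can be sketched as follows (self-contained; it assumes that for each compared pair of clusters at least one cluster is a non-singleton, so the denominator stays positive):

```python
def inv_davies_bouldin(clusters, dist):
    """Generalized, inverted Davies-Bouldin Index over a dissimilarity matrix;
    higher values indicate better clustering quality."""
    def intra(c):  # Equation (1)
        m = len(c)
        if m < 2:
            return 0.0
        return sum(dist[x][y] for x in c for y in c if x != y) / (m * (m - 1))

    def inter(ci, cj):  # Equation (2)
        return sum(dist[x][y] for x in ci for y in cj) / (len(ci) * len(cj))

    k = len(clusters)
    total = 0.0
    for i, ci in enumerate(clusters):
        # take the worst (smallest) separation-to-compactness ratio for ci
        total += min(inter(ci, cj) / (intra(ci) + intra(cj))
                     for j, cj in enumerate(clusters) if j != i)
    return total / k
```

For two tight, well-separated clusters, the ratio is large; overlapping clusters push it towards 0.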

Dunn Index

The original Dunn Index [58] is defined as the ratio of the minimum dissimilarity between any two objects in different clusters to the maximum dissimilarity between any two objects belonging to the same cluster. We use one of the generalized forms proposed in [59] to make the index more stable and less prone to outliers. With Equation 1 and Equation 2, the generalized version of the Dunn Index is:

Dunn(𝒞) = min_{C_i, C_j ∈ 𝒞, i ≠ j} δ_inter^avg(C_i, C_j) / max_{C_i ∈ 𝒞} δ_intra^avg(C_i)
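The generalized Dunn Index can be transcribed the same way (a self-contained sketch; since δ_inter^avg is symmetric, it suffices to evaluate each unordered cluster pair once):

```python
def dunn(clusters, dist):
    """Generalized Dunn Index with average-based inter-/intra-cluster
    dissimilarities; higher values indicate better clustering quality."""
    def intra(c):  # Equation (1)
        m = len(c)
        if m < 2:
            return 0.0
        return sum(dist[x][y] for x in c for y in c if x != y) / (m * (m - 1))

    def inter(ci, cj):  # Equation (2)
        return sum(dist[x][y] for x in ci for y in cj) / (len(ci) * len(cj))

    min_inter = min(inter(ci, cj)
                    for i, ci in enumerate(clusters)
                    for j, cj in enumerate(clusters) if i < j)
    max_intra = max(intra(c) for c in clusters)
    return min_inter / max_intra
```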

External indices

We apply the normalizations proposed in [68]. We also invert the normalized van Dongen measure by subtraction from 1 such that higher values indicate good clustering quality.
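As an illustration, the inverted and normalized van Dongen measure might be computed as follows. The normalization formula below (dividing by 2n minus the largest row and column sums of the contingency table) is our reading of [68] and should be checked against the original; the function names are ours:

```python
from collections import Counter

def inv_norm_van_dongen(labels_a, labels_b):
    """Inverted, normalized van Dongen measure between two clusterings;
    1 means identical partitions. Undefined (division by zero) if both
    clusterings consist of a single cluster."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # contingency table n_ij
    rows = Counter(labels_a)                  # row sums n_i.
    cols = Counter(labels_b)                  # column sums n_.j
    # raw van Dongen distance: 2n minus the summed row-wise and column-wise maxima
    sum_row_max = sum(max(cnt for (a, b), cnt in joint.items() if a == ra)
                      for ra in rows)
    sum_col_max = sum(max(cnt for (a, b), cnt in joint.items() if b == cb)
                      for cb in cols)
    vd = 2 * n - sum_row_max - sum_col_max
    # assumed normalization (our reading of [68]), then inversion
    denom = 2 * n - max(rows.values()) - max(cols.values())
    return 1.0 - vd / denom
```

Identical clusterings yield 1; a clustering that is completely uninformative about the other drops towards 0.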

Table 5

Overview of dissimilarity measures.

| Diss. | Ref. | Category | Parameters | V* |
| --- | --- | --- | --- | --- |
| CDM^a | [47] | complexity | SAX alphabet size = 8, compression = gzip | yes |
| DTW^b | [42] | elastic | | yes |
| DTW.CID^b | [42], [52] | elastic + complexity | | yes |
| DTW.CORT | [42], [51] | elastic + lock-step | tuning parameter k = 2 | no |
| DTW.Band | [42] | elastic | Sakoe-Chiba window size = 10 % | no^c |
| ERP^b | [45] | elastic | gap value g = 0 | yes |
| L1 | | lock-step | | no |
| L2 | | lock-step | | no |
| L2.CID | [52] | lock-step + complexity | | no |
| L2.CORT | [51] | lock-step | tuning parameter k = 2 | no |
| Lmax | | lock-step | | no |
| PDD | [50] | complexity | embedding dimension m by entropy heuristic | no^d |
| SBD | [46] | elastic | | yes |

* Applicable to sequences of variable length (yes/no).

a We modify the formula of CDM slightly to obtain a dissimilarity in [0,1] instead of (0.5,1).

b We additionally normalize the resulting dissimilarities to account for differences in length.

c Undefined if the lengths of the sequences differ too much and therefore not used.

d Undefined for sequences of length 1 and therefore not used.
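Most of the elastic measures in Table 5 build on dynamic time warping [42]. A minimal DTW sketch follows; the length normalization in `dtw_normalized` (dividing by the summed sequence lengths) is one plausible reading of footnote b, not necessarily the variant used in the paper:

```python
from math import inf

def dtw(a, b):
    """Classic O(n*m) dynamic-time-warping dissimilarity for 1-D sequences."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between points
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def dtw_normalized(a, b):
    # assumed length normalization (cf. footnote b); other variants exist,
    # e.g. dividing by the warping-path length
    return dtw(a, b) / (len(a) + len(b))
```

Because DTW can align one point to several points of the other sequence, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0 even though the sequences differ in length.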

References

1. Omar Al-Jarrah et al. Multi-layered clustering for power consumption profiling in smart grids. IEEE Access, 2017. doi:10.1109/ACCESS.2017.2712258.

2. Sambaran Bandyopadhyay et al. Individual and aggregate electrical load forecasting: One for all and all for one. In e-Energy, 2015. doi:10.1145/2768510.2768539.

3. Mohamed Chaouch. Clustering-based improvement of nonparametric functional time series forecasting: Application to intra-day household-level load curves. IEEE Smart Grid, 2014. doi:10.1109/TSG.2013.2277171.

4. Wen Shen et al. An ensemble model for day-ahead electricity demand time series forecasting. In e-Energy, 2013. doi:10.1145/2487166.2487173.

5. Jungsuk Kwac, June Flora, and Ram Rajagopal. Household energy consumption segmentation using hourly data. IEEE Smart Grid, 2014. doi:10.1109/TSG.2013.2278477.

6. Ranjan Pal et al. Challenge: On online time series clustering for demand response: Optic – a theory to break the 'curse of dimensionality'. In e-Energy, 2015.

7. Michel Verleysen and Damien François. The curse of dimensionality in data mining and time series prediction. In IWANN, 2005. doi:10.1007/11494669_93.

8. Holger Trittenbach, Jakob Bach, and Klemens Böhm. On the tradeoff between energy data aggregation and clustering quality. In e-Energy, 2018. doi:10.1145/3208903.3212038.

9. T. Warren Liao. Clustering of time series data – a survey. Pattern Recognition, 2005. doi:10.1016/j.patcog.2005.01.025.

10. Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering – a decade review. Inform Syst, 2015. doi:10.1016/j.is.2015.04.007.

11. Xiaoyue Wang et al. Experimental comparison of representation methods and distance measures for time series data. Data Min Knowl Disc, 2013. doi:10.1007/s10618-012-0250-5.

12. Gianfranco Chicco. Overview and performance assessment of the clustering methods for electrical load pattern grouping. Energy, 2012. doi:10.1016/j.energy.2011.12.031.

13. Ling Jin et al. Comparison of clustering techniques for residential energy behavior using smart meter data. Technical report, LBNL, 2017.

14. Simon Bischof et al. HIPE – An Energy-Status-Data set from industrial production. In e-Energy, 2018. doi:10.1145/3208903.3210278.

15. Ian Dent et al. Finding the creatures of habit: Clustering households based on their flexibility in using electricity, 2012. doi:10.2139/ssrn.2828585.

16. Vera Figueiredo et al. An electric energy consumer characterization framework based on data mining techniques. IEEE Power Systems, 2005. doi:10.1109/TPWRS.2005.846234.

17. Alejandro Gómez-Boix, Leticia Arco, and Ann Nowé. Consumer segmentation through multi-instance clustering time-series energy data from smart meters. In Soft Computing for Sustainability Science. Springer, 2018. doi:10.1007/978-3-319-62359-7_6.

18. Stephen Haben, Colin Singleton, and Peter Grindrod. Analysis and clustering of residential customers energy behavioral demand using smart meter data. IEEE Smart Grid, 2016. doi:10.1109/TSG.2015.2409786.

19. Peter Laurinec and Mária Lucká. Comparison of representations of time series for clustering smart meter data. In WCECS, 2016.

20. Fintan McLoughlin, Aidan Duffy, and Michael Conlon. A clustering approach to domestic electricity load profile characterisation using smart metering data. Applied Energy, 2015. doi:10.1016/j.apenergy.2014.12.039.

21. Franklin L. Quilumba et al. Using smart meter data to improve the accuracy of intraday load forecasting considering customer behavior similarities. IEEE Smart Grid, 2015. doi:10.1109/TSG.2014.2364233.

22. Teemu Räsänen and Mikko Kolehmainen. Feature-based clustering for electricity use time series data. In ICANNGA, 2009. doi:10.1007/978-3-642-04921-7_41.

23. Abbas Shahzadeh, Abbas Khosravi, and Saeid Nahavandi. Improving load forecast accuracy by clustering consumers using smart meter data. In IJCNN, 2015. doi:10.1109/IJCNN.2015.7280393.

24. Yogesh Simmhan and Muhammad Usman Noor. Scalable prediction of energy consumption using incremental time series clustering. In Big Data, 2013. doi:10.1109/BigData.2013.6691774.

25. Tri Kurniawan Wijaya et al. Consumer segmentation and knowledge extraction from smart meter and survey data. In ICDM, 2014.

26. Alexander Lavin and Diego Klabjan. Clustering time-series energy data from smart meters. Energy Efficiency, 2015. doi:10.1007/s12053-014-9316-0.

27. Luis Hernández et al. Classification and clustering of electricity demand patterns in industrial parks. Energies, 2012. doi:10.3390/en5125215.

28. Félix Iglesias and Wolfgang Kastner. Analysis of similarity measures in time series clustering for the discovery of building energy patterns. Energies, 2013. doi:10.3390/en6020579.

29. Rishee K. Jain et al. Forecasting energy consumption of multi-family residential buildings using support vector regression: Investigating the impact of temporal and spatial monitoring granularity on performance accuracy. Applied Energy, 2014. doi:10.1016/j.apenergy.2014.02.057.

30. A. Vaghefi, Farbod Farzan, and Mohsen A. Jafari. Modeling industrial loads in non-residential buildings. Applied Energy, 2015. doi:10.1016/j.apenergy.2015.08.077.

31. Junjing Yang et al. k-Shape clustering algorithm for building energy usage patterns analysis and forecasting model accuracy improvement. Energ Buildings, 2017. doi:10.1016/j.enbuild.2017.03.071.

32. George J. Tsekouras, Nikos D. Hatziargyriou, and Evangelos N. Dialynas. Two-stage pattern recognition of load curves for classification of electricity customers. IEEE Power Systems, 2007. doi:10.1109/TPWRS.2007.901287.

33. Bogdan Neagu et al. Patterns discovery of load curves characteristics using clustering based data mining. In CPE-POWERENG, 2017. doi:10.1109/CPE.2017.7915149.

34. Charu C. Aggarwal. Data Mining: The Textbook. Springer, 2015. doi:10.1007/978-3-319-14142-8.

35. Leonard Kaufman and Peter J. Rousseeuw. Clustering by Means of Medoids. Elsevier, 1987.

36. Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 2007. doi:10.1126/science.1136800.

37. Ulrich Bodenhofer, Andreas Kothmeier, and Sepp Hochreiter. APCluster: An R package for affinity propagation clustering. Bioinformatics, 2011. doi:10.1093/bioinformatics/btr406.

38. Martin Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.

39. Dimitrios Kotsakos et al. Time-series data clustering. In Data Clustering: Algorithms and Applications. CRC Press, 2014.

40. Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2012.

41. Eamonn Keogh and Shruti Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min Knowl Disc, 2003. doi:10.1145/775047.775062.

42. Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In ICDM, 1994.

43. Joan Serra and Josep Ll. Arcos. An empirical evaluation of similarity measures for time series classification. Knowl-Based Syst, 2014. doi:10.1016/j.knosys.2014.04.035.

44. Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE T Acoust Speech, 1978. doi:10.1016/B978-0-08-051584-7.50016-4.

45. Lei Chen and Raymond Ng. On the marriage of Lp-norms and edit distance. In VLDB, 2004. doi:10.1016/B978-012088469-8.50070-X.

46. John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In PODS, 2015. doi:10.1145/2723372.2737793.

47. Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. Towards parameter-free data mining. In KDD, 2004. doi:10.1145/1014052.1014077.

48. Ming Li et al. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 2001. doi:10.1093/bioinformatics/17.2.149.

49. Jessica Lin et al. A symbolic representation of time series, with implications for streaming algorithms. In Workshop on DMKD, 2003.

50. Andreas M. Brandmaier. Permutation Distribution Clustering and Structural Equation Model Trees. PhD thesis, Universität des Saarlandes, 2011.

51. Ahlame Douzal Chouakria and Panduranga Naidu Nagabhushan. Adaptive dissimilarity index for measuring time series proximity. ADAC, 2007. doi:10.1007/s11634-006-0004-6.

52. Gustavo Batista et al. CID: An efficient complexity-invariant distance for time series. Data Min Knowl Disc, 2014. doi:10.1007/s10618-013-0312-3.

53. Eamonn J. Keogh and Michael J. Pazzani. Scaling up dynamic time warping for data mining applications. In KDD, 2000. doi:10.1145/347090.347153.

54. Eamonn Keogh et al. Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst, 2001. doi:10.1007/PL00011669.

55. Olatz Arbelaitz et al. An extensive comparative study of cluster validity indices. Pattern Recognit, 2013. doi:10.1016/j.patcog.2012.07.021.

56. Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math, 1987. doi:10.1016/0377-0427(87)90125-7.

57. David L. Davies and Donald W. Bouldin. A cluster separation measure. IEEE Pattern Analysis and Machine Intelligence, 1979. doi:10.1109/TPAMI.1979.4766909.

58. Joseph C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybernetics, 1973. doi:10.1080/01969727308546046.

59. James C. Bezdek and Nikhil R. Pal. Some new indexes of cluster validity. IEEE T Sys Man Cy B, 1998. doi:10.1109/3477.678624.

60. Julia Handl and Joshua D. Knowles. Exploiting the trade-off – the benefits of multiple objectives in data clustering. In EMO, 2005. doi:10.1007/978-3-540-31880-4_38.

61. Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and Sen Wu. Understanding and enhancement of internal clustering validation measures. IEEE Cybernetics, 2013. doi:10.1109/TSMCB.2012.2220543.

62. Silke Wagner and Dorothea Wagner. Comparing clusterings – an overview. Technical report, Faculty of Informatics, Universität Karlsruhe (TH), 2007.

63. Edward B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. J Am Stat Assoc, 1983. doi:10.1080/01621459.1983.10478008.

64. Karl Pearson. Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philos T R Soc Lond, 1900.

65. William M. Rand. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc, 1971. doi:10.1080/01621459.1971.10482356.

66. Stijn Van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical report, CWI, 2000.

67. Ana L. N. Fred and Anil K. Jain. Robust data clustering. In CVPR, 2003.

68. Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In KDD, 2009.

69. Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. J Am Stat Assoc, 1963. doi:10.1080/01621459.1963.10500845.

Received: 2019-04-29
Revised: 2019-07-21
Accepted: 2019-08-16
Published Online: 2019-10-24
Published in Print: 2019-04-24

© 2019 Walter de Gruyter GmbH, Berlin/Boston
