Identifying Important Packages of Object-Oriented Software Using Weighted k-Core Decomposition

Weifeng Pan; Bo Hu; Bo Jiang; Bo Xie

doi:10.1515/jisys-2014-0015

Published/Copyright: May 7, 2014

From the journal Journal of Intelligent Systems Volume 23 Issue 4

Abstract

Identifying important entities in software systems has many implications for effective resource allocation. Complex network research opens new opportunities for identifying important entities from software networks. However, the existing methods only focus on identifying important classes. Little work has been done on the identification of important packages. Moreover, the metrics they used to quantify the class importance are only designed for unweighted software networks and cannot fit in with the weighted software networks. To overcome these limitations, in this article, we introduce the weighted k-core decomposition method (W_k-core) to identify the important packages. First, we use a weighted software network to describe packages and their internal dependencies. Second, we use W_k-core to partition a software network into a layered structure. Then, the packages that are denoted by the nodes within the main core are the identified important packages. To evaluate our method, we use a variant of the susceptible–infectious–recovered model to examine the spreading influence of the nodes in six real weighted software networks. The results show that our method can well identify influential nodes, better than other four methods (i.e., original k-core decomposition, degree centrality, closeness centrality, and betweenness centrality methods). Furthermore, we demonstrate our method on two software networks and show that the important packages identified by our method are more meaningful from a software engineering perspective when compared with the other methods.

Keywords: Package; centrality metric; k-core; software network; spreading; SIR model

1 Introduction

Over the past few years, complex networks have gained overwhelming popularity across many fields of science. They provide a unified perspective for studying various complex systems, including social, biological, physiological, economic, and technological networks [9]. Software is a typical kind of diverse and sophisticated man-made system, which can also be represented as a network where software entities such as methods/fields, classes/interfaces, or packages are nodes and the relationships between them are edges [23]. The understanding of software structures, functions, and their relations has attracted much attention recently. Moreover, many shared topological properties of software networks have been revealed, such as scale-free phenomena [8, 20, 26] and the small-world feature [19, 23].

Identification of the important (or influential/key) nodes is of theoretical and practical significance in complex network research. It is an important step toward controlling rumor and disease spreading, creating new marketing tools, optimizing the use of available resources, and ensuring a more efficient spread of information [15]. In the field of software engineering, identification of important software entities such as important methods and important classes is also very useful. Software maintenance usually accounts for 60–80% of the total cost of a software system [27]. However, the resources that can be paid are <5% of the total maintenance costs. With such a large gap, how to allocate the limited resources effectively becomes a problem facing many people. Important software entities implement the key concepts of a system. They may be the hard problems of a software design and the focus of software testing [13]. Naturally, the limited resources should first be allocated to them. Furthermore, evolution is an intrinsic property of software. To evolve a system, people need to acquire enough knowledge about the software. These important entities can be used to focus their understanding efforts when starting to work on a new software project. However, it is a nontrivial task to identify these important software entities.

Centrality analysis is widely used in network science to identify the important nodes in a specific network [2, 5, 6, 15], and it has only recently been adopted to identify important software entities. Wang and Pan [28] used some centrality metrics, such as degree centrality, closeness centrality, and betweenness centrality, to rank the importance of classes. Li et al. [17] proposed a new metric, IC (importance of classes), by analyzing the bug dynamics to rank the importance of classes. One major limitation of the existing methods is that the centrality metrics they use were originally designed for unweighted networks. However, software networks are usually weighted [17, 22, 24]. The weights describe the coupling (or dependency) strength between the entities. Another major limitation of the existing methods is that they only focus on identifying important classes. Little work has been done on the identification of important software entities at the other levels of granularity, such as the field/method level and the package level. However, identifying important entities at other levels of granularity is also crucial, as it provides a complete perspective, from the macro (package) to the micro (methods/fields) level, for understanding a specific piece of software [23]. Furthermore, although packages are much coarser-grained than classes and methods/fields, identifying the important packages is also crucial as it can be the first step toward identifying the important classes or methods/fields, especially as software systems become ever larger and more complex. Thus, questions such as “how to identify the most important nodes in weighted software networks” and “how to identify important software entities at other levels of granularity” are natural to ask.

In this article, we only focus on identifying the important packages. To identify the important packages, we introduce a weighted k-core decomposition method to calculate the weighted k-core structure of weighted software networks at the package level. Our method will partition a software network at the package level into a layered structure. Then, the packages, which are denoted by the nodes within the main core (the definition will be given in Definition 4), are the identified important packages. To the best of our knowledge, research work on such a problem has never been reported. To evaluate our approach, we use a variant of the susceptible–infectious–recovered (SIR) model to examine the spreading influence of the nodes ranked by our method. The simulations on six real software networks show that our method can well identify influential nodes. Compared with the original k-core decomposition, degree centrality, closeness centrality, and betweenness centrality methods, our method performs best.

The main contributions of this article are

Extending the problem of important software entities identification from the class level to the package level, i.e., to identify the important packages;
Proposing a novel method to identify the important packages, which uses a weighted software network and a weighted k-core decomposition to identify the important packages; and
Evaluating the proposed method using six open-source software systems.

The rest of the article is structured as follows. Section 2 describes our approach in detail, with focus on the definition of related software networks and the weighted k-core decomposition method. In Section 3, we present the empirical evaluation to investigate the effectiveness of the proposed method and discuss the correlations between some metrics. In Section 4, we give final conclusions and identify areas of further research.

2 The Proposed Approach

The overall method taken in the current work is shown in Figure 1. In the following subsections, we will detail the main parts of Figure 1.

Figure 1

Workflow of the Proposed Method.

2.1 Software Entity Collection

This article mainly focuses on Java software systems. The choice of software systems in Java programming language is limited by our analysis tool and our interest. To identify the important packages of software systems, we consider the byte code of a software system. To represent a piece of software as a network, entities in the code should be extracted. As our focus is on identifying important packages, we first need to determine the entities that should be extracted. In the current work, packages, classes, and their dependencies are chosen as entities that will be represented as nodes and edges in a software network. We use the term “class” to designate classes and interfaces. We will be treating them the same from here on.

2.2 Software Network Definition

After the entities have been collected, we will introduce the software network to represent the topological characteristics of a Java software system. However, how is the software network defined?

As we all know, Java software systems are usually composed of entities at different levels of granularity, varying from methods and fields to packages. Upper-level entities are built by the lower-level ones, e.g., a package is composed of a collection of classes. As the interest of the current work mainly focuses on the identification of important packages, in the following, we will first introduce a package dependency network (PDN) to represent the topological structure of Java software systems at the package level.

Definition 1 In PDN, the nodes denote the packages of a specific Java software system. The undirected edge between every pair of nodes denotes the coupling interaction between the corresponding packages. The coupling interactions between packages are extracted from the interactions of the classes they enclosed, i.e., a dependency between two classes in two separate packages implies a dependency between the two packages. Therefore, PDN can be described as

$$\mathrm{PDN} = (V_p, E_p), \tag{1}$$

where V_p is the set of nodes in PDN (the subscript p denotes the software network is constructed at the package level) and E_p is the set of edges.

Each edge is also weighted with a value that signifies the coupling strength between the packages on its two sides. However, how to determine such weights is still a problem. In the current work, we use the dependencies between the classes that the two packages enclose to quantify the coupling strength. To calculate the weight, we introduce another type of software network, namely the class dependency network (CDN), to represent the classes and their dependencies. CDN can be described as follows.

Definition 2 In CDN, the nodes denote the classes of a specific Java software system and each class is represented by only one node. Undirected edges between two nodes indicate the use dependency between the corresponding classes, i.e., if class A uses the services provided by class B, there is an edge between the nodes denoting the two classes. The edge (or the use dependency) between class A and B can be defined under the following four circumstances: (i) A inherits from B via the keyword extends; (ii) A realizes interface B via the keyword implements; (iii) A has a field with the type of class B; and (iv) one method of A calls methods on an object of B. Therefore, CDN can be described as

$$\mathrm{CDN} = (V_c, E_c), \tag{2}$$

where V_c is the set of all nodes in CDN (the subscript c denotes the software network is constructed at the class level) and E_c is the set of edges.

Therefore, if there is an edge between class A and class B in CDN and the two classes are defined in two different packages, there will be an edge between the two packages. On the basis of the CDN, we can calculate the weight on the edges between package i and j in PDN according to formula (3):

$$w(i, j) = \sum_{m \in \mathrm{getClass}(i)} \left| R_m^1 \cap \mathrm{getClass}(j) \right|, \tag{3}$$

where w(i, j) is the weight on the edge between package i and j, $R_m^k$ denotes the set of all nodes reachable from the node of class m within a distance k along the edges, getClass(i) returns the set of all classes that package i contains, and |*| returns the number of elements in set *.

Figure 2 shows a simple code segment and its corresponding CDN and PDN. For illustration purposes, we take w(p1, p2) to show how to calculate the weight on the edge. As p1.classX in package p1 directly depends on p2.classY and p2.classZ in package p2, there is an edge between p1 and p2 in PDN. At the same time, $R_{p1.classX}^1$ = {p2.classY, p2.classZ}. Thus, w(p1, p2) = |{p2.classY, p2.classZ} ∩ {p2.classY, p2.classZ}| = 2.
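The weight calculation in formula (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the SNAT implementation: the function names `package_of` and `build_pdn` are hypothetical, classes are assumed to be given by fully qualified names, and only the two dependencies named in the text are used (the rest of Figure 2 is not reproduced). With k = 1, the set R_m^1 is the set of direct neighbours of class m, so the sum in formula (3) amounts to counting the CDN edges that cross the two packages.

```python
from collections import defaultdict

def package_of(cls):
    """Return the package part of a fully qualified class name (assumed format)."""
    return cls.rsplit(".", 1)[0]

def build_pdn(cdn_edges):
    """Derive weighted PDN edges from undirected CDN edges.

    Each cross-package class dependency adds 1 to w(i, j), which matches
    formula (3) when R_m^1 is the set of direct neighbours of class m.
    """
    # The CDN is treated as a simple undirected graph: drop duplicates and self-loops.
    unique = {frozenset(e) for e in cdn_edges if e[0] != e[1]}
    weights = defaultdict(int)
    for edge in unique:
        a, b = tuple(edge)
        pa, pb = package_of(a), package_of(b)
        if pa != pb:  # dependencies inside one package create no PDN edge
            weights[frozenset((pa, pb))] += 1
    return dict(weights)

# The two dependencies described in the text for Figure 2.
cdn = [("p1.classX", "p2.classY"), ("p1.classX", "p2.classZ")]
pdn = build_pdn(cdn)
print(pdn[frozenset(("p1", "p2"))])  # 2
```

Keying the PDN edges by `frozenset` keeps the edge undirected, so w(p1, p2) and w(p2, p1) are the same entry.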

Figure 2

A Simple Code Segment and Its Corresponding CDN and PDN.

2.3 The Weighted k-Core Decomposition Method

Here, we introduce the weighted k-core decomposition method (W_k-core) [11], which is a generalized version of the original k-core decomposition method and is applied for weighted networks. As the original k-core decomposition method [1, 4] is applied for unweighted networks, from now on, we will call it the unweighted k-core decomposition method (U_k-core). W_k-core differs from U_k-core mainly in its way of node degree calculation. W_k-core is based on a weighted version of the traditional node degree, which considers both the degree of a node and the weights of its edges. The weighted degree of node i, wDeg(i), is defined as

$$wDeg(i) = \left[ Deg(i)^{\alpha} \left( \sum_{j=1}^{Deg(i)} w_{ij} \right)^{\beta} \right]^{\frac{1}{\alpha+\beta}}, \tag{4}$$

where wDeg(i) is the weighted degree of node i, Deg(i) is the degree of node i, and $\sum_{j=1}^{Deg(i)} w_{ij}$ is the sum over all the edge weights of node i. In the current work, we also discuss only the case where α = β = 1 as [11] does, which treats the weight and the node degree equally. Therefore, for what follows, $wDeg(i) = \sqrt{Deg(i) \sum_{j=1}^{Deg(i)} w_{ij}}$. Consequently, in unweighted networks where w_ij = 1, wDeg(i) is equivalent to Deg(i), while in weighted networks, wDeg(i) is usually a decimal. We discretize these decimals by rounding to their closest integers.
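As a quick check of formula (4), the following Python sketch computes the rounded weighted degree of a single node from the weights of its incident edges. The function name `weighted_degree` is hypothetical, and rounding halves upward is an assumption, since the text only says that values are rounded to the closest integer.

```python
import math

def weighted_degree(weights, alpha=1.0, beta=1.0):
    """Formula (4) for one node, given the weights of its incident edges."""
    deg = len(weights)
    if deg == 0:
        return 0  # an isolated node has weighted degree 0
    value = (deg ** alpha * sum(weights) ** beta) ** (1.0 / (alpha + beta))
    return math.floor(value + 0.5)  # assumed tie rule: round half up

print(weighted_degree([1, 1, 1]))  # unweighted case: equals the degree, 3
print(weighted_degree([1, 9]))     # sqrt(2 * 10) = 4.47..., rounds to 4
```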

On the basis of the weighted degree of a node, we can introduce some definitions of weighted k-cores as that of the original k-cores proposed in [1, 4]. Let us consider a graph G = (V, E) with |V| = n nodes and |E| = e edges; the definitions of weighted k-cores are shown as follows.

Definition 3 A subgraph H = (C, E|C) induced by the set C⊆V is a weighted k-core or a weighted core of order k if and only if the weighted degree of every node v∈ C induced in H is greater than or equal to k (i.e., ∀v∈ C, wDeg(v) ≥ k), and H is the maximum subgraph with this property.

W_k-core applies a similar pruning routine as the U_k-core. A weighted k-core of G can therefore be obtained by recursively removing all the nodes of weighted degree less than k, until all nodes in the remaining graph have weighted degree at least k.

Definition 4 A node i has weighted coreness k if it belongs to the weighted k-core but not to the weighted (k+ 1)-core. Furthermore, the maximum weighted coreness, k_max, is such that the k_max-core is not empty, but the (k_max+ 1)-core is. k_max is usually named as the graph coreness. Moreover, the k_max-core is also called the main core.

The weighted coreness of a node also denotes how deep in the core a node is [20]. Obviously, all the nodes of a connected graph belong to the weighted 1-core.

Definition 5 A weighted k-shell, WS_k, is composed by all the nodes whose weighted coreness is k and the edges between them. The weighted k-core is thus the union of all WS_c with c ≥ k.

For illustration purposes, we give an example to show how to divide a network into the weighted k-core structure (see Figure 3). It is a simple network with only five nodes and four edges (see the leftmost part of Figure 3). All edge weights equal 1, except for the weight of the edge between nodes B and C, which equals 9. First, we calculate the weighted degree of all the nodes and remove from the network all nodes with weighted degree <1 (specifically node E). We obtain the weighted 1-core. Subsequently, we recalculate the weighted degree of the remaining nodes in the weighted 1-core, and remove all nodes with a weighted degree <2 (specifically node D). We obtain the weighted 2-core. Again, this procedure is repeated iteratively until only nodes with weighted degree no less than 3 are left in the network, and so on. This routine is applied until there are no nodes left in the network.
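The pruning routine described above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the names `wdeg` and `weighted_coreness` are hypothetical, α = β = 1 is used, halves are rounded upward, and the edge list is an assumed topology that merely agrees with the textual description (five nodes, four edges, weight 9 on B–C, node E isolated); it is not taken from Figure 3 itself.

```python
import math

def wdeg(adj, v):
    """Rounded weighted degree of v (formula (4) with alpha = beta = 1)."""
    deg = len(adj[v])
    return math.floor(math.sqrt(deg * sum(adj[v].values())) + 0.5)

def weighted_coreness(edges, nodes):
    """Return {node: weighted coreness} by iterative pruning."""
    adj = {v: {} for v in nodes}
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    coreness = {}
    k = 1
    while adj:
        # Prune until every remaining node has weighted degree >= k;
        # what remains is the weighted k-core.
        while True:
            low = [v for v in adj if wdeg(adj, v) < k]
            if not low:
                break
            for v in low:
                coreness[v] = k - 1  # v is in the (k-1)-core but not the k-core
                for u in adj[v]:
                    del adj[u][v]
                del adj[v]
        k += 1
    return coreness

# Assumed topology consistent with the description in the text.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 1), ("A", "C", 1), ("B", "C", 9), ("B", "D", 1)]
print(weighted_coreness(edges, nodes))
# {'E': 0, 'D': 1, 'A': 2, 'B': 3, 'C': 3}
```

On this assumed network, node E is removed first, then D, then A, and the main core consists of B and C, which are held together by the heavy edge between them.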

Figure 3

Illustration of the Layered Structure of a Network Obtained Using the Weighted k-Core Decomposition Method.

The text at the bottom of the figure denotes the weighted degree of the nodes in the corresponding networks.

3 Empirical Study

To investigate the effectiveness of the proposed method in the identification of important packages, we designed and conducted controlled experiments. Our experiments were carried out on a personal computer at 2.3 GHz with 2 GB of RAM.

3.1 Research Questions

Our experiment aims at addressing the following research questions:

What is the difference between the results of W_k-core and U_k-core when applied to PDNs? W_k-core is a weighted version of U_k-core. We wish to know the difference between the results of the two methods in the identification of important packages when applied to real PDNs.
What is the relation between the results of weighted coreness and other centrality metrics when applied to PDNs? Many centrality metrics have been proposed to rank the importance of a node in a software network. Therefore, we wish to know the relation between the results of the weighted coreness and other centrality metrics in the identification of important packages.
How about the effectiveness of the weighted coreness in identifying important packages when compared with other centrality metrics? As there are many centrality metrics, we wish to know whether the weighted coreness is better in effectiveness in the identification of important packages.
Can W_k-core find meaningful important packages in the software systems? W_k-core ranks the packages according to the weighted coreness from large to small. We wish to know whether the ranked packages make sense to the developer.

In the following sections, we provide details on the objects of study (Section 3.2), our experiment process and results (Section 3.3), and our analysis of the results (Section 3.4).

When performing case studies with new techniques aimed at understanding a software system, there basically exist two paths to follow when trying to validate the results. One path is to perform an extrinsic evaluation, where, e.g., a controlled experiment would serve as an evaluator. Another path is the intrinsic evaluation, where persons such as the original developers and maintainers serve as an oracle. Thus, in the current work, Sections 3.4.1, 3.4.2, and 3.4.3 follow the first path, while Section 3.4.4 follows the second path.

3.2 Objects of Study

Six open-source nontrivial Java systems are chosen as objects of our study. Azureus 3.0.1.4^[1] is a well-known P2P file-sharing client. Tomcat 6.0.18^[2] is a web server and servlet container. JMeter 2.0.1^[3] is a desktop application designed to load test functional behavior and measure performance. JFreeChart 1.0.12^[4] is a free Java chart library. XGen Source Code Generator (XGen) 0.5.0^[5] is a tool that creates text output from structured text input. Jakarta ECS 1.4.2^[6] is a Java API for generating elements for various markup languages. The reasons for selecting these specific projects are as follows:

Their source code is open and publicly available, allowing the replication of the experiment.
They are implemented in Java programming language that can be analyzed by our analysis tool.
They have previously been selected as research subjects [13, 23, 25]. Using the same subjects lays a basis for comparing our approach with others.
They originate from different application domains allowing, to some extent, the generalization of the conclusions.

The size characteristics of the examined software systems are shown in Table 1, where KLOC is the thousand lines of code, #P is the number of packages, #C is the number of classes, and #F is the number of features. The term “feature” is used to designate fields and methods. #P excludes the outer packages, #C includes the number of inner classes and interfaces, and KLOC is the practical lines of code, excluding the comment lines and blank lines.

Table 1

Size Characteristics of the Examined Software Systems.

Subject	KLOC	#P	#C	#F
Azureus 3.0.1.4	307.021	428	5102	47,434
Tomcat 6.0.18	161.933	166	2331	39,158
JMeter 2.0.1	78.304	290	3477	45,936
JFreeChart 1.0.12	137.034	107	1959	29,453
XGen 0.5.0	5.457	21	74	848
Jakarta ECS 1.4.2	28.691	13	390	6016

3.3 Experiment Process and Results

We follow the steps shown in Figure 1 to identify the important packages in a specific software system. The PDNs for all the subject systems are all automatically built by our own developed software analysis tool SNAT [23]. SNAT can parse the source code or compiled Java code of Java projects, extract the relevant information, build the CDNs, and finally build the PDNs.

For illustration purpose, we show in Figure 4A–D only the CDNs and their corresponding PDNs extracted from Azureus and Tomcat. Enlarging the corresponding networks can give you more information about the network such as the class (or package) each node denotes and the dependency between every pair of classes (or packages). The positions of the nodes in CDNs and PDNs are all calculated using the original circular algorithm in Pajek.^[7]

Figure 4

Illustration of the CDNs and PDNs for Azureus and Tomcat.

In (A) and (C), the nodes denote the classes, while in (B) and (D), the nodes denote the packages. The labels beside the nodes are the names of the software entities that the nodes denote. The values on the edges in (B) and (D) are the coupling strengths. Here, we omit the isolated nodes.

We summarize some detailed statistical properties of the PDNs built from the subject software systems in Table 2, where N_N is the number of nodes, N_E is the number of edges, <k> is the average degree of network nodes, d is the diameter, C is the clustering coefficient, L is the average path length, and H is the network heterogeneity. For our analysis from here on, if not stated otherwise, when we talk about the network, we refer to the largest connected component (LCC), and whenever we discuss network properties these are calculated from the LCC. The definition of these parameters can be found in [9].

Table 2

Statistical Properties of the PDNs Built from the Six Subject Software Systems.

Subject	N_N	N_E	<k>	d	C	L	H
Azureus 3.0.1.4	419	2865	13.675	5	0.472	2.494	1.445
Tomcat 6.0.18	156	635	8.141	6	0.549	2.963	1.015
JMeter 2.0.1	282	1529	10.844	7	0.556	3.263	0.980
JFreeChart 1.0.12	96	437	9.104	9	0.600	3.015	0.907
XGen 0.5.0	17	30	3.529	3	0.451	1.971	0.837
Jakarta ECS 1.4.2	11	15	2.727	2	0.571	1.727	0.899

3.4 Analysis of the Results

In this section, we analyze the obtained results aiming at answering the four research questions presented in Section 3.1.

3.4.1 What Is the Difference between the Results of W_k-core and U_k-core when Applied to PDNs?

Table 3 lists the results obtained by applying W_k-core and U_k-core to the six PDNs, respectively. In Table 3, S^U and S^W are the total number of shells, while k_max^U and k_max^W are the graph coreness, obtained by U_k-core and W_k-core, respectively. n^U and n^W are the total number of nodes in the main core obtained by U_k-core and W_k-core, respectively. N_c is the number of common nodes in both main cores, N_UW is the fraction of the nodes in the main core obtained by U_k-core that also belong to the main core obtained by W_k-core, and N_WU is the fraction of the nodes in the main core obtained by W_k-core that also belong to the main core obtained by U_k-core. Core_WU is the U_k-core with the smallest number of nodes that still encloses all the nodes in the main core obtained by W_k-core. We can see that in all the six software networks, S^W (or k_max^W) is larger than the corresponding S^U (or k_max^U). It indicates that W_k-core can yield a more refined decomposition, giving more detailed information about the internal structure of a software network.

Table 3

Comparison of the Results Obtained by U_k-core and W_k-core.

Subject	S^U	S^W	k_max^U	k_max^W	n^U	n^W	N_c	N_UW	N_WU	Core_WU
Azureus 3.0.1.4	13	37	13	40	42	25	20	0.476	0.800	12
Tomcat 6.0.18	8	28	8	31	28	8	6	0.214	0.750	6
JMeter 2.0.1	11	32	11	38	51	13	12	0.235	0.923	10
JFreeChart 1.0.12	11	35	11	56	18	5	5	0.278	1.000	11
XGen 0.5.0	3	4	3	6	9	5	3	0.333	0.600	2
Jakarta ECS 1.4.2	2	8	2	27	8	3	2	0.250	0.667	1
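The overlap columns of Table 3 follow directly from the two main cores viewed as node sets. The short Python sketch below makes the definitions concrete; the function name `main_core_overlap` and the node labels are hypothetical, and the set sizes are chosen only to mirror the Tomcat row (n^U = 28, n^W = 8, N_c = 6).

```python
def main_core_overlap(core_u, core_w):
    """Return (N_c, N_UW, N_WU) for two main cores given as sets of nodes."""
    common = core_u & core_w
    return len(common), len(common) / len(core_u), len(common) / len(core_w)

# Hypothetical node labels sized like the Tomcat row of Table 3.
core_u = {f"u{i}" for i in range(22)} | {f"c{i}" for i in range(6)}
core_w = {f"w{i}" for i in range(2)} | {f"c{i}" for i in range(6)}
n_c, n_uw, n_wu = main_core_overlap(core_u, core_w)
print(n_c, round(n_uw, 3), round(n_wu, 3))  # 6 0.214 0.75
```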

We can also observe from Table 3 that, compared with the lines of code, k_max^W and k_max^U are all very small. It means that there exists a hierarchical similarity across different software systems. Such a similarity may be very universal as a result of some design principles during software development. Moreover, three of the six networks have a k_max^U > 10, which is larger than that reported in Reference [18]. Such a difference may come from the different programming languages that these software systems use and the different levels of granularity that they focus on. The software systems examined in the current work are developed in Java, while those used in Reference [18] are developed in C++. In addition, our work is performed at the package level of granularity, while that of Reference [18] is performed at the class level.

Furthermore, for all the six studied software networks, the main core obtained by W_k-core contains fewer nodes than that of U_k-core, i.e., n^W < n^U. The nodes in the main core obtained by W_k-core all come from the last three shells of U_k-core (see Core_WU), and at least 60% of them also belong to the main core obtained by U_k-core (see N_WU). This means that W_k-core in most cases is able to split the main core obtained by U_k-core further and to identify which are the most central of the central nodes.

3.4.2 What Is the Relation between the Results of Weighted Coreness and Other Centrality Metrics when Applied to PDNs?

Our method evaluates the package importance of a specific software system by giving a weighted coreness. The greater the value is, the more important the package is. This process is essentially a rank of the important packages, i.e., the greater the value is, the higher the ranking is.

In statistics, a rank correlation coefficient is commonly used to measure the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. Kendall’s τ coefficient is one of the most popular rank correlation statistics [14]. There are three ways to calculate Kendall’s τ, i.e., the Tau-a (τ_A), Tau-b (τ_B), and Tau-c (τ_C) coefficients. As many nodes share the same weighted coreness and therefore the same rank, we use the Tau-b coefficient, which adjusts for ties, in the current work.

Let (x₁, y₁), (x₂, y₂), …, (x_n, y_n) be a set of observations of the joint random variables X and Y, respectively. Any pair of observations (x_i, y_i) and (x_j, y_j) is said to be concordant if the ranks for both elements agree: that is, if both x_i > x_j and y_i > y_j or if both x_i < x_j and y_i < y_j. They are said to be discordant if x_i > x_j and y_i < y_j or if x_i < x_j and y_i > y_j. If x_i = x_j or y_i = y_j, the pair is neither concordant nor discordant and is said to be tied. Then, the Kendall’s Tau-b τ_B coefficient is defined as

$$\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}, \tag{5}$$

where $n_0 = n(n - 1)/2$, $n_1 = \sum_i t_i(t_i - 1)/2$, $n_2 = \sum_j u_j(u_j - 1)/2$, n_c is the number of concordant pairs, n_d is the number of discordant pairs, t_i is the number of tied values in the i-th group of ties for the first quantity, and u_j is the number of tied values in the j-th group of ties for the second quantity.
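Formula (5) translates almost line by line into code. The following pure-Python sketch is a straightforward O(n²) illustration with a hypothetical function name `kendall_tau_b`; it is not the statistical package used in the study, and it does not compute significance levels. It assumes that neither input is constant, since the denominator would otherwise be zero.

```python
import math
from collections import Counter

def kendall_tau_b(x, y):
    """Kendall's Tau-b for two equally long sequences, following formula (5)."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1  # concordant pair
            elif s < 0:
                nd += 1  # discordant pair (s == 0 means the pair is tied)
    n0 = n * (n - 1) / 2
    n1 = sum(t * (t - 1) / 2 for t in Counter(x).values())  # ties in x
    n2 = sum(u * (u - 1) / 2 for u in Counter(y).values())  # ties in y
    return (nc - nd) / math.sqrt((n0 - n1) * (n0 - n2))

# Hypothetical scores for four nodes under two metrics.
print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))  # identical rankings: 1.0
print(kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed rankings: -1.0
```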

Figure 5 shows the relation between the weighted coreness and four other widely used centrality metrics (i.e., betweenness centrality, closeness centrality, degree centrality, and original coreness for unweighted networks). Moreover, τ_B is used to measure their rank correlations (see Table 4). For simplicity, from now on, we will use wCoreness to denote weighted coreness for weighted networks and use uCoreness to denote the original coreness for unweighted networks. As we can see from Table 4, generally wCoreness is positively correlated with other centrality metrics except in XGen and ECS. Such an exception may result from the small size (number of packages) of XGen and ECS, as there are only 21 and 13 packages in XGen and ECS, respectively.

Figure 5

The Relations between wCoreness and Betweenness, Closeness, Degree, and uCoreness Centrality on the Six Networks.

Each data point denotes a node.

Table 4

Rank Correlation between wCoreness and Other Four Centrality Metrics Using Kendall’s Tau-b.

Subject	Betweenness Centrality	Closeness Centrality	Degree Centrality	uCoreness
Azureus 3.0.1.4	0.527**	0.533**	0.787**	0.795**
Tomcat 6.0.18	0.507**	0.354**	0.780**	0.805**
JMeter 2.0.1	0.530**	0.428**	0.723**	0.697**
JFreeChart 1.0.12	0.482**	0.642**	0.845**	0.823**
XGen 0.5.0	0.380	0.501*	0.573**	0.608**
Jakarta ECS 1.4.2	0.054	–0.173	–0.173	–4.0

**Correlation is significant at the 0.01 level (two-tailed). *Correlation is significant at the 0.05 level (two-tailed).

Furthermore, it can also be seen from Table 4 that in Azureus, wCoreness has the strongest correlation with uCoreness and the weakest correlation with betweenness centrality; in Tomcat, wCoreness has the strongest correlation with uCoreness and the weakest correlation with closeness centrality; in JMeter, wCoreness has the strongest correlation with degree centrality and the weakest correlation with closeness centrality; in JFreeChart, wCoreness has the strongest correlation with degree centrality and the weakest correlation with betweenness centrality; in XGen, wCoreness has the strongest correlation with uCoreness and has no significant correlation with betweenness centrality; and in ECS, there is no significant correlation between wCoreness and the other centrality metrics. It can clearly be seen that, in different software systems, the correlation strength between wCoreness and other centrality metrics changes. However, which centrality metric does wCoreness have the strongest correlation with? To compare the correlation strength between weighted coreness and other centrality metrics, we introduce the Friedman test, which is widely used to compare the performance of different algorithms in problem solving [12]. In the current work, the smaller the ranking value is, the stronger the correlation strength is. Table 5 shows the ranking of correlation strength between wCoreness and other centrality metrics. As shown, wCoreness has the strongest correlation with uCoreness (ranked first, with the smallest ranking value) and the weakest correlation with betweenness centrality (ranked last, with the largest ranking value). It means that the ranking results of wCoreness are most similar to those of uCoreness, and most dissimilar to those of betweenness centrality.

Table 5

Ranking of the Correlation Strength.

Kendall’s Tau-b Coefficient	Ranking
wCoreness vs. Betweenness centrality	3.667
wCoreness vs. Closeness centrality	3.250
wCoreness vs. Degree centrality	1.750
wCoreness vs. uCoreness	1.333
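The ranking values in Table 5 are consistent with Friedman-style average ranks computed over the six rows of Table 4. The sketch below is an illustration under an assumption that the text does not state explicitly: within each system, the four coefficients are ranked by absolute value (rank 1 for the largest), and ties share the mean rank. The function name `average_ranks` is hypothetical, and the coefficients are copied from Table 4 as printed; only their order within each row affects the result.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

# Columns: betweenness, closeness, degree, uCoreness (Table 4, as printed).
table4 = [
    [0.527, 0.533, 0.787, 0.795],    # Azureus
    [0.507, 0.354, 0.780, 0.805],    # Tomcat
    [0.530, 0.428, 0.723, 0.697],    # JMeter
    [0.482, 0.642, 0.845, 0.823],    # JFreeChart
    [0.380, 0.501, 0.573, 0.608],    # XGen
    [0.054, -0.173, -0.173, -4.0],   # Jakarta ECS
]
per_system = [average_ranks([abs(v) for v in row]) for row in table4]
mean_ranks = [sum(col) / len(col) for col in zip(*per_system)]
print([round(r, 3) for r in mean_ranks])  # [3.667, 3.25, 1.75, 1.333]
```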

3.4.3 How about the Effectiveness of the Weighted Coreness in Identifying Important Packages when Compared with Other Centrality Metrics?

To evaluate the effectiveness of wCoreness and other centrality metrics, we examined the spreading influence of the top-ranked nodes by applying the SIR model, which has been extensively used in network research on epidemic spreading, economic crisis spreading, and rumor spreading [7, 10, 21].

There are three states in the SIR model: susceptible (S), infected (I), and recovered (R). Individuals in S are susceptible to (not yet infected with) the disease. Individuals in I have been infected and can spread the disease to susceptible individuals. Individuals in R have been infected and have recovered; they can neither be infected again nor transmit the infection to others. In the standard SIR model, the susceptible neighbors of an infected individual usually become infected with a fixed probability. As the PDN is a weighted network, we instead use an infection probability that depends on the edge weights. This setting is very similar to the one previously used to simulate the spreading of an economic crisis [10]. The probability is calculated by

$$p_{ij} \propto m \cdot \frac{w_{ij}}{\sum_{i} w_{ij}}, \tag{6}$$

where p_ij denotes the probability that infected node i infects its susceptible neighboring node j, w_ij is the weight of the edge between node i and node j, and m is an amplification parameter that determines the strength of the disease and can take any positive value. In software systems, the disease can be viewed as an error (or fault/defect).
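
As an illustration of formula (6), the following Python sketch computes p_ij from a small weight table. Several choices here are assumptions rather than part of the source: the proportionality is treated as an equality, the result is capped at 1 so that it remains a probability, the denominator is read literally as the sum over i of w_ij (the total weight of the edges pointing to node j), and the `weights` dictionary and node names are hypothetical.

```python
def infection_probability(weights, i, j, m):
    """Probability that infected node i infects susceptible neighbor j.

    weights maps a directed pair (source, target) to the edge weight.
    The denominator follows a literal reading of formula (6): the sum of
    w_ij over all sources i for the fixed target j (an assumption).
    """
    w_ij = weights.get((i, j), 0.0)
    total = sum(w for (_, target), w in weights.items() if target == j)
    if total == 0:
        return 0.0  # no edge reaches j, so j cannot be infected
    # Cap at 1 because large m would otherwise exceed a valid probability.
    return min(1.0, m * w_ij / total)


# Hypothetical weighted edges of a tiny network.
weights = {("a", "c"): 3.0, ("b", "c"): 1.0, ("a", "b"): 2.0}
p = infection_probability(weights, "a", "c", m=0.5)
print(p)  # 0.375
```

With m = 0.5, the edge from a to c carries 3 of the 4 units of weight reaching c, giving 0.5 · 3/4 = 0.375.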

Here, we also introduce a variant of the SIR model that takes into account the weights of the edges that mediate the spreading [10]. Initially, all nodes are set to S. Then, the node whose influence we want to investigate is chosen and set to I. This node infects each of its susceptible neighbors with the probability p calculated according to formula (6); the newly infected nodes change from S to I, and the node that initiated the process changes to R. In the subsequent steps, the process is repeated, with all infected nodes trying to infect their susceptible neighbors. The process stops when no infected node is left.

For each selected node, we performed 1000 independent realizations of the variant version of the SIR model, and we calculated the average number of infected nodes. The number of infected nodes is used as a score to rank the importance of nodes. In the current work, we consider the values of m in different intervals for different software systems, i.e., for Azureus m ∈ [2, 10], for Tomcat m ∈ [15, 30], for JMeter m ∈ [15, 18], for JFreeChart m ∈ [1, 2.5], for XGen m ∈ [4, 10], and for Jakarta ECS m ∈ [1, 9].
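
The spreading process and the averaging over independent realizations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: `p` is a hypothetical dictionary that maps a directed pair (i, j) to the infection probability p_ij, the seed node is counted among the infected nodes, and infection is attempted only along the listed direction (for an undirected network, both directions would have to be listed).

```python
import random


def sir_outbreak_size(p, seed_node, rng):
    """Run one realization and return how many nodes were ever infected.

    p maps a directed pair (i, j) to the probability that i infects j.
    Counting the seed node itself is an assumption of this sketch.
    """
    out_neighbors = {}
    for (i, j) in p:
        out_neighbors.setdefault(i, []).append(j)

    infected = {seed_node}
    recovered = set()
    while infected:
        newly_infected = set()
        for i in infected:
            for j in out_neighbors.get(i, []):
                susceptible = (j not in infected and j not in recovered
                               and j not in newly_infected)
                if susceptible and rng.random() < p[(i, j)]:
                    newly_infected.add(j)
        # Every node that was infectious in this step recovers.
        recovered |= infected
        infected = newly_infected
    return len(recovered)


def average_outbreak_size(p, seed_node, runs=1000, rng_seed=0):
    """Average the outbreak size over independent realizations."""
    rng = random.Random(rng_seed)
    total = sum(sir_outbreak_size(p, seed_node, rng) for _ in range(runs))
    return total / runs


# Toy probabilities chosen as 0 or 1 so that the outcome is deterministic.
toy_p = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 0.0}
print(average_outbreak_size(toy_p, "a", runs=10))  # 3.0
```

Starting from node a, the infection always reaches b and c but never d, so every realization infects three nodes.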

Figure 6 shows how the number of infected nodes changes with m. Here, the number of infected nodes is averaged over the top-n^W nodes ranked by each centrality metric. The n^W for each system is shown in Table 3. Note that, as n^U > n^W, the number of infected nodes for uCoreness is averaged over 1000 independent samplings of the nodes in the main core obtained by U_k-core. From Figure 6, we can see that, in general, the number of infected nodes grows as m increases. As m determines the strength of the error, a stronger error has a greater potential to spread. We also observed that the nodes ranked by wCoreness are more likely to initiate a severe outbreak than the nodes ranked by the other four centrality metrics. The results are robust across all networks used in this study and across the different values of m. This means that W_k-core can find the most influential nodes.

Figure 6

Number of Infected Nodes as a Function of m.

3.4.4 Can W_k-core Find Meaningful Important Packages in the Software Systems?

wCoreness measures the importance of a package from the perspective of the software system as a whole. It is an internal attribute that should be correlated with some external quality attributes of a software system to show its usefulness [3]. In the current work, we correlate wCoreness with an external quality factor, the understandability of the packages. In this subsection, we use only XGen and Jakarta ECS as subject systems, as understandability data have been reported in the literature only for these two systems. We could not find understandability data for the other four systems.

We decompose XGen and Jakarta ECS into their layered structures. The wCoreness values for the packages in the two systems are shown in Tables 6 and 7, respectively. Although there are a total of 21 packages in XGen and 13 packages in ECS, here we consider only the 6 packages in XGen (shown in Table 6) and the 12 packages in ECS (shown in Table 7), as Gupta and Chhabra [13] consider only these packages and neglect the subpackages of some packages, e.g., in XGen, the six subpackages of workzen.xgen.test, the four subpackages of workzen.xgen.model, and the one subpackage of workzen.xgen.ant. We also notice that the wCoreness values of org.apache.ecs.factory and org.apache.ecs.storage equal 0. By manually inspecting the source code of Jakarta ECS, we find that they are isolated nodes in the PDN. Thus, we set their wCoreness values to 0.

Table 6

wCoreness Values for Packages in XGen.

Package Name	wCoreness
workzen.xgen.ant	4
workzen.xgen.engine	4
workzen.xgen.loader	6
workzen.xgen.model	6
workzen.xgen.test	4
workzen.xgen.util	4

Table 7

wCoreness Values for Packages in Jakarta ECS.

Package Name	wCoreness
org.apache.ecs.examples	5
org.apache.ecs.factory	0
org.apache.ecs.filter	3
org.apache.ecs.html	27
org.apache.ecs.html2ecs	2
org.apache.ecs.jsp	8
org.apache.ecs.rtf	11
org.apache.ecs.storage	0
org.apache.ecs.vxml	9
org.apache.ecs.wml	13
org.apache.ecs.xhtml	27
org.apache.ecs.xml	8

Gupta and Chhabra [13] asked three teams to judge the effort required to understand a specific package. The effort is ranked using an integer from 1 to 10, where a higher rank indicates that more effort is required. The average effort required to understand each package is shown in Table 8.

Table 8

Average Rank for the Effort Required.

Package Name	Average Rank
workzen.xgen.ant	2.0
workzen.xgen.engine	2.3
workzen.xgen.loader	5.3
workzen.xgen.model	6.0
workzen.xgen.test	4.3
workzen.xgen.util	2.6
org.apache.ecs.examples	1.3
org.apache.ecs.factory	1.0
org.apache.ecs.filter	2.0
org.apache.ecs.html	9.0
org.apache.ecs.html2ecs	1.3
org.apache.ecs.jsp	1.6
org.apache.ecs.rtf	2.3
org.apache.ecs.storage	2.0
org.apache.ecs.vxml	1.3
org.apache.ecs.wml	8.0
org.apache.ecs.xhtml	9.0
org.apache.ecs.xml	4.0

To check the correlation between the wCoreness values and the understandability of the packages, we calculated their Spearman’s correlation [16]. For comparison, we also calculated the Spearman’s correlation between the understandability of the packages and the other centrality metrics. The results are shown in Table 9. The Spearman’s correlation between the wCoreness values and the understandability of the packages is 0.595, significant at the 0.01 level, which signifies a strong correlation. This provides evidence that wCoreness is a valid indicator of the external quality of a software system; that is, a package with a greater wCoreness value usually requires more effort to understand. Furthermore, we consider the result of wCoreness better than that of PCM reported in Reference [13] and those of the other four centrality metrics, for which there is no significant correlation. Note that although the coefficient of PCM is greater than that of wCoreness, the two are significant at different levels (0.05 and 0.01, respectively).
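
The coefficient for wCoreness can be checked from the tabulated data. The following Python sketch pools the 18 packages of Tables 6 to 8 and computes Spearman’s correlation as the Pearson correlation of tie-averaged ranks. Pooling the two systems and this particular tie handling are assumptions of the sketch; the function names and the shortened package labels are illustrative.

```python
from math import sqrt


def tie_averaged_ranks(values):
    """Ascending ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the tie-averaged ranks of x and y."""
    rx, ry = tie_averaged_ranks(x), tie_averaged_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# (package, wCoreness from Tables 6 and 7, average effort rank from Table 8)
data = [
    ("xgen.ant", 4, 2.0), ("xgen.engine", 4, 2.3), ("xgen.loader", 6, 5.3),
    ("xgen.model", 6, 6.0), ("xgen.test", 4, 4.3), ("xgen.util", 4, 2.6),
    ("ecs.examples", 5, 1.3), ("ecs.factory", 0, 1.0), ("ecs.filter", 3, 2.0),
    ("ecs.html", 27, 9.0), ("ecs.html2ecs", 2, 1.3), ("ecs.jsp", 8, 1.6),
    ("ecs.rtf", 11, 2.3), ("ecs.storage", 0, 2.0), ("ecs.vxml", 9, 1.3),
    ("ecs.wml", 13, 8.0), ("ecs.xhtml", 27, 9.0), ("ecs.xml", 8, 4.0),
]
rho = spearman([w for _, w, _ in data], [e for _, _, e in data])
print(round(rho, 3))  # 0.595
```

Under these assumptions, the pooled data give the same value as the wCoreness entry in Table 9.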

Table 9

Spearman’s Correlation Test Results.

Centrality Metrics	Correlation Coefficient
wCoreness	0.595**
PCM [13]	0.73*
Betweenness centrality	0.151
Closeness centrality	0.007
Degree centrality	0.140
uCoreness	0.118

**Correlation is significant at the 0.01 level (two-tailed). *Correlation is significant at the 0.05 level (two-tailed).

4 Conclusions and Future Work

In this work, we focused on identifying important packages in software systems. To identify these packages, we first proposed a weighted PDN to represent a piece of software at the package level of granularity. Then, a weighted k-core decomposition method was introduced to partition the software network into a layered structure. The packages denoted by the nodes in the main core are the important packages identified.

We evaluated our method by using a variant of the SIR model to examine the spreading influence of the nodes ranked by our method. The simulations on six real software networks (Azureus, Tomcat, JMeter, JFreeChart, XGen, and Jakarta ECS) show that our method identifies influential nodes well. Compared with the other four methods (the original k-core, degree centrality, closeness centrality, and betweenness centrality methods), our method performs best. Furthermore, we demonstrated our method on two software networks (XGen and Jakarta ECS) and showed that the important packages identified by our method are more meaningful from a software engineering perspective than those identified by the other methods.

Although our method proves feasible for identifying important packages, its broad validity requires further demonstration. Thus, future work should include (i) evaluating the method on more open-source software systems from different domains and of different sizes and (ii) extending the current work to other levels of granularity.

Corresponding author: Weifeng Pan, School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China, e-mail: panweifeng1982@gmail.com

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61202048), the Zhejiang Provincial Nature Science Foundation of China (nos. LY13F020010 and LQ13F020004), and the Open Foundation of State Key Laboratory of Software Engineering of Wuhan University of China (no. SKLSE-2012-09-21).

Bibliography

[1] J. I. Alvarez-Hamelin, L. Dall’Asta, A. Barrat and A. Vespignani, k-Core decomposition of Internet graphs: hierarchies, self-similarity and measurement biases, Net. Heterogen. Media 3 (2008), 371–394. doi:10.3934/nhm.2008.3.371

[2] S. Aral and D. Walker, Identifying influential and susceptible members of social networks, Science 337 (2012), 337–341. doi:10.1126/science.1215842

[3] V. Basili, L. Briand and W. Melo, A validation of object-oriented design metrics as quality indicators, IEEE Trans. Software Eng. 22 (1996), 751–761. doi:10.1109/32.544352

[4] V. Batagelj and M. Zaversnik, Generalized cores, Prep. Ser. – Univ. Ljubl. Inst. Math. 40 (2002), 1–10.

[5] F. Bauer and J. T. Lizier, Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: a walk counting approach, Europhys. Lett. 99 (2012), 68007. doi:10.1209/0295-5075/99/68007

[6] D. B. Chen, L. Y. Lu, M. S. Shang, Y. C. Zhang and T. Zhou, Identifying influential nodes in complex networks, Physica A 391 (2012), 1777–1787. doi:10.1016/j.physa.2011.09.017

[7] V. Colizza, A. Barrat, M. Barthélemy and A. Vespignani, The role of the airline transportation network in the prediction and predictability of global epidemics, Proc. Natl. Acad. Sci. USA 103 (2006), 2015–2020. doi:10.1073/pnas.0510525103

[8] G. Concas, M. Marchesi, S. Pinna and N. Serra, Power-laws in a large object-oriented software system, IEEE Trans. Software Eng. 33 (2007), 687–708. doi:10.1109/TSE.2007.1019

[9] L. F. Costa, F. A. Rodrigues, G. Travieso and P. R. V. Boas, Characterization of complex networks: a survey of measurements, Adv. Phys. 56 (2007), 167–242. doi:10.1080/00018730601170527

[10] A. Garas, P. Argyrakis, C. Rozenblat, M. Tomassini and S. Havlin, Worldwide spreading of economic crisis, New J. Phys. 12 (2010), 113043. doi:10.1088/1367-2630/12/11/113043

[11] A. Garas, F. Schweitzer and S. Havlin, A k-shell decomposition method for weighted networks, New J. Phys. 14 (2012), 083030. doi:10.1088/1367-2630/14/8/083030

[12] S. García, A. Fernández, J. Luengo and F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inform. Sci. 180 (2010), 2044–2064. doi:10.1016/j.ins.2009.12.010

[13] V. Gupta and J. K. Chhabra, Package coupling measurement in object-oriented software, J. Comput. Sci. Technol. 24 (2009), 273–283. doi:10.1007/s11390-009-9223-6

[14] M. Kendall, A new measure of rank correlation, Biometrika 30 (1938), 81–93. doi:10.1093/biomet/30.1-2.81

[15] M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley and H. Makse, Identification of influential spreaders in complex networks, Nat. Phys. 6 (2010), 888–893. doi:10.1038/nphys1746

[16] C. R. Kothari, Research methodology: methods and techniques, New Age International Publishers, New Delhi, 2007.

[17] D. W. Li, B. Li, P. He and W. F. Pan, Ranking the importance of classes via software structural analysis, Lect. Notes Elec. Eng. 141 (2012), 441–449. doi:10.1007/978-3-642-27311-7_59

[18] H. Li, H. Zhao, J. Q. Xu, B. Li, P. Li and J. L. Wang, Research on hierarchy of large-scale software macro-topology based on k-core, Acta Electron. Sinica 38 (2010), 2635–2643.

[19] Y. T. Ma, K. Q. He, B. Li, J. Liu and X. Y. Zhou, A hybrid set of complexity metrics for large-scale object-oriented software systems, J. Comput. Sci. Technol. 25 (2010), 1184–1201. doi:10.1007/s11390-010-9398-x

[20] C. R. Myers, Software systems as complex networks: structure, function, and evolvability of software collaboration graphs, Phys. Rev. E 68 (2003), 046116. doi:10.1103/PhysRevE.68.046116

[21] M. E. J. Newman, Spread of epidemic disease on networks, Phys. Rev. E 66 (2002), 016128. doi:10.1103/PhysRevE.66.016128

[22] W. F. Pan, B. Jiang and B. Li, Refactoring software packages via community detection in complex software networks, Int. J. Autom. Comput. 10 (2012), 9–17. doi:10.1007/s11633-013-0708-y

[23] W. F. Pan, B. Li, Y. T. Ma and J. Liu, Multi-granularity evolution analysis of software using complex network theory, J. Syst. Sci. Comp. 24 (2011), 1068–1082. doi:10.1007/s11424-011-0319-z

[24] W. F. Pan, B. Li, Y. T. Ma, Y. Y. Qin and X. Y. Zhou, Measuring structural quality of object-oriented softwares via bug propagation analysis on weighted software networks, J. Comput. Sci. Technol. 25 (2010), 1202–1213. doi:10.1007/s11390-010-9399-9

[25] W. F. Pan and B. Li, Software quality measurement based on error propagation analysis in software networks, J. Central. South Univ. (Sci. Technol.) 43 (2012), 4339–4348.

[26] A. Potanin, J. Noble, M. Frean and R. Biddle, Scale-free geometry in OO programs, Commun. ACM 48 (2005), 99–103. doi:10.1145/1060710.1060716

[27] R. S. Pressman, Software engineering: a practitioner’s approach, McGraw–Hill, New York, 1992.

[28] M. C. Wang and W. F. Pan, A comparative study of network centrality metrics in identifying key classes in software, J. Comput. Inform. Syst. 8 (2012), 10205–10212.

Received: 2014-2-13

Published Online: 2014-5-7

Published in Print: 2014-12-1

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

https://doi.org/10.1515/jisys-2014-0015

Keywords for this article

Package; centrality metric; k-core; software network; spreading; SIR model

Creative Commons

BY-NC-ND 3.0
