Article Open Access

Optimizing execution time and cost while scheduling scientific workflow in edge data center with fault tolerance awareness

  • Muhanad Mohammed Kadum and Xiaoheng Deng
Published/Copyright: July 10, 2024

Abstract

Scheduling scientific workflows is essential for edge data center operations. Fault tolerance is a crucial focus in workflow scheduling (WS) research. This study proposes fault-tolerant WS in edge data centers using Task Prioritization Adaptive Particle Swarm Optimization (TPAPSO). The aim is to minimize the Makespan and execution costs while overcoming failures at all workflow processing stages, including when virtual machines are insufficient or tasks fail. The approach comprises three components: an initial heuristic list, scheduling tasks with TPAPSO, and implementing performance monitoring with fault tolerance (PMWFT). TPAPSO-PMWFT is simulated using CloudSim 4.0. The experiments indicate that the suggested approach shows superior results compared to existing methods.

1 Introduction

The emergence of the Internet of Things (IoT) and the rapid increase in data generation have prompted the transition from conventional centralized cloud computing to edge computing. Edge data centers, strategically located at the outermost regions of the network, assume a crucial function in this emerging computing paradigm by offering reduced latency, increased bandwidth, and immediate availability of data and applications. Nevertheless, the dependability of these edge data centers is a crucial issue owing to their vulnerability to a range of faults, encompassing hardware malfunctions, software glitches, and network interruptions [1,2,3]. Regarding fault tolerance, the motivation for moving from cloud to edge is that moving specific computing operations from the cloud to the edge can enhance fault tolerance by reducing latency and potential failure sites. This can increase systems’ overall reliability and robustness, as tasks are allocated close to their desired locations, reducing the possibility of failure caused by network issues or cloud service outages. Furthermore, tasks can be executed quickly and efficiently by utilizing edge computing resources, improving fault tolerance and system reliability. However, resource limitations in edge computing still constrain the design of fault-tolerant systems. The matter becomes more complicated when there are additional goals, such as simultaneously reducing the execution time and cost, the storage cost, and the cost of data transfer [4,5].

It is widely recognized that the problem of scientific workflow scheduling (WS) with fault tolerance is NP-hard [6,7,8]. Moreover, numerous strategies, such as metaheuristic techniques, have been utilized to tackle the intricate challenge of attaining optimal scheduling [9,10,11]. Fault tolerance refers to a system’s ability to successfully carry out all its duties, irrespective of any defects that may arise due to software or hardware problems. Creating a robust fault tolerance framework is a difficult task [12]. To address this issue, the first step is to have the capability to forecast faults in advance and identify the underlying causes so that suitable corrective actions can be taken promptly. To ensure the system’s effectiveness, it is crucial to carry out these remedial activities promptly and in real time [13]. Furthermore, to comply with the rigorous demands of contemporary industrial norms, these systems must effectively address a wide range of system- or network-related malfunctions. This may entail determining the specific nature of the fault during the prediction phase to make more knowledgeable choices about the recovery process [14]. The inherent variability of workloads and resources dramatically increases the complexity of accurately predicting errors and their categories. To avoid potential errors with serious adverse effects and to minimize incorrect forecasts, prediction models must exhibit a high level of accuracy. Several current methodologies employ machine learning models because of their remarkable precision [15]. However, the issue of ensuring fault tolerance in edge data centers becomes apparent while managing the scheduling of scientific operations. Efficiently allocating resources while maintaining workflow resiliency and adhering to budget limits is challenging. 
Modern edge data centers face risks and limitations, especially when managing process activities with additional constraints. The emphasis on managing workflows in edge data centers underscores the necessity of efficiently distributing computational resources to improve task workload alleviation and concurrency ratios [16,17]. The complexity of fault tolerance in edge data centers is worsened by security issues, as the heightened intricacy of virtual network security presents IT difficulties and escalating hazards. As workloads and applications migrate between servers in a virtualized environment, monitoring security, rules, and configurations becomes increasingly complex. The ease of provisioning virtual machines (VMs) can leave security holes, and allocating resources to interdependent, computationally intensive tasks during WS adds further complexity. Edge data centers encounter difficulties scheduling workflow operations due to their limited resources while suffering oscillations in performance delivery [18,19]. Current fault-tolerant scheduling algorithms cannot be utilized directly for workflows in edge data centers sharing elasticity and virtualization attributes [20,21]. Many of these methods present drawbacks; e.g., some maintain multiple dependent copies of tasks, which adds memory cost and consumes time. Others utilize backup components that automatically replace failing parts to prevent service interruptions; these consist of hardware systems backed up by complementary or identical schemes, adding high cost by using more hardware [22,23,24]. In this work, we addressed the problem of scientific WS with fault tolerance in edge data centers and discussed two failure scenarios. We suggested a Task Prioritization Adaptive Particle Swarm Optimization (TPAPSO), which builds on Particle Swarm Optimization (PSO), and Performance Monitoring with Fault Tolerance (PMWFT). 
We proposed heuristic scheduling techniques depending on the size of the tasks and other suggested priorities; this sequence is the input to TPAPSO. TPAPSO enhances the original PSO, where PSO can choose the best schedule for the VM to carry out the task with an optimum decision. In addition, PSO has a rapid convergence characteristic [25] that sets it apart from other algorithms, such as the gravitational search algorithm [26], ant colony optimization [27], the discrete symbiotic organism search [28], and the genetic algorithm [29]. The algorithm’s performance can be enhanced if PSO is supported with an appropriate beginning sequence point [30]. We also presented PMWFT to monitor TPAPSO performance and achieve fault tolerance. In addition, we formulated the workflow as a directed acyclic graph (DAG) and introduced a hierarchical architecture in edge data center resources, with the top tier reserved for dangerous tasks. Our proposal uses the TPAPSO-PMWFT scheme to schedule the scientific workflow of the edge data center model to maximize resource utilization, system availability, and integrity by reducing and overcoming VM failure. Reducing Makespan and execution costs improves security and load balance and mitigates the impact of machine failure. Our method also schedules independent jobs onto high-risk VMs. The experiment demonstrates a considerable boost in success. In summary, our contribution is as follows:

  • Proposed initiating scheduling with priority, a heuristic ordering based on the priority factors suggested in this work.

  • Proposed TPAPSO, an enhanced variation of the PSO algorithm, to schedule workflow tasks in an edge data center.

  • Proposed PMWFT to monitor the TPAPSO performance, as well as to minimize the failure rate of VMs and overcome the failure.

  • Proposed the high-risk level in the data center resources to process the risky tasks.

The structure of the study is organized as follows. A short review of the state of the art is given in Section 2. Section 3 describes the task scheduling problem and objective function. Section 4 presents the proposed method. Section 5 illustrates the preparation for experiments and results, and Section 6 provides the discussion. Finally, Section 7 is devoted to the research conclusion.

2 Related works

2.1 Fault tolerance in cloud environment

Dynamic fault-tolerant workflow scheduling (DFTWS) was introduced by Wu et al. [31]. The DFTWS method anticipates crucial workflow pathways and calculates task times accordingly. It allocates the most appropriate VM to each task based on the relative importance of the tasks. By combining spatial re-execution and temporal re-execution, the DFTWS scheduling approach executes workflows in real time. However, employing a re-execution method for each task failure consumes time and increases the algorithm’s temporal complexity. An intrusion-tolerant scheduling algorithm (INHIBITOR) has been proposed by Wang et al. [32]. In this method, the authors modeled the workflow as a DAG and considered task replication, in which each task is represented by three different subtasks, each executed on a different operating system (e.g., MS Windows, GNU/Linux, and Solaris). Each subtask depends on the execution time and each replica’s success rate, and a voting machine selects one of them as the final execution of the task. However, the proposed method depends heavily on using many VMs to reduce the probability of processing the three replicas on three compromised machines. Moreover, the reliability requirement and the probability distributions associated with machine failure characteristics should be considered when determining the parameters of replication and resubmission. The INHIBITOR employed the replica approach, and the efficacy of this method is contingent upon the number of VMs utilized: increasing the number of VMs results in a higher success rate but at an increased cost. Xiang et al. [33] proposed a WS method that utilizes deep reinforcement learning. This proximal policy optimization-based WS (PPO-based WS) approach aims to achieve fault tolerance and cost efficiency in the scheduling process. 
Moreover, to enhance adherence to location constraints during task execution, the method utilizes flawed-action masking and optimizes the task assignment strategy using proximal policy optimization. The authors employed task replication and retransformation in the proposed methodology, resulting in an increase in execution time that their study did not consider. The RLFTWS approach, proposed by Dong et al. [34], combines failure prediction and fault-tolerant approaches in deep reinforcement learning. This adaptive fault-tolerant WS system reduces Makespan and resource use. The framework formulates fault-tolerant WS as a Markov decision process, treating resubmission and replication as distinct strategies. Fault-tolerant heuristics govern task allocation and execution. Based on the current environment, a Double Deep Q Network framework predicts and learns fault-tolerant strategies for individual tasks. The proposed approach combined task retransmission and replication, which increased costs and limited its effectiveness. Fault-tolerant cost-efficient workflow scheduling (FCWS), proposed by Tang [35], is an algorithm that optimizes WS using a cost-efficient bottom level. The FCWS algorithm may lower application execution costs and improve reliability. Furthermore, FCWS can be implemented in polynomial time. This study uses the Weibull distribution to assess task execution reliability and hazard rate in multi-cloud environments. The author duplicates tasks with high execution hazard rates across many cloud-provider VMs while minimizing task execution costs. In the suggested method’s objective function, the author weights execution cost over execution time. The suggested methodology works in multi-cloud environments, but resource constraints would make its implementation in an edge data center expensive and time-consuming.

2.2 Fault tolerance in edge environment

Long et al. [21] addressed the problem of fault tolerance in edge-IoT environments, which is essential when deploying computing infrastructures in a collaborative, distributed, dynamic environment vulnerable to failures. A new fault-tolerant edge-IoT collaborative scheduling mechanism is presented in this work. The fault-tolerant scheduling algorithm for collaborative edge-IoT workflows (FTAW) begins with a rigorous dependency-based task allocation study. Next, a primary-backup approach addresses task failures in the edge node.

Moreover, a deep Q-learning technique was proposed to identify the optimal workflow task scheduling scheme. The suggested methodology utilizes two distinct copies, namely, the primary copy and the backup copy, for every task within each DAG. They stored each copy on separate servers. However, due to the limited resources of the edge, this approach incurs additional costs in terms of storage and communication.

Moreover, a fault-tolerant collaborative storage technique in a secure cloud-edge context is presented by Chen et al. [4]. In addition, they provide an optimization strategy for efficient data writing. To enhance system reliability, reduce edge storage overhead, and safeguard data privacy, they proposed Hierarchical Cloud-Edge Collaborative Fault-Tolerant Storage (HCEFT). To enhance the effectiveness of the HCEFT creation procedure and further extend the optimization endeavors, they developed a novel data storage optimization technique for erasure-code data writing based on the Steiner tree and software-defined networking (SDN). This method aims to balance the time taken to write data against the amount of network traffic generated. The proposed fault tolerance approach encodes each file and divides it into segments distributed across multiple nodes. This strategy increases the cost of transmission and requires more edge nodes, which conflicts with the limited resources at the edge. Chen et al. [5] introduced a method called multicenter hierarchical federated learning (MCHFL) for edge-cloud heterogeneous wireless networks. MCHFL, a nascent architectural framework in federated learning, is crafted to enhance fault tolerance in edge computing settings and reduce reliance on a solitary mobile edge computing server. Nevertheless, the proposed solution does not consider the rise in implementation cost, communication cost, and execution time that arises from distributing jobs among multiple dispersed servers. Jing et al. [36] proposed a distributed protocol that allows n edge devices to achieve an (a, b)-majority consensus within identified time steps with high likelihood. The empirical results obtained from the simulation studies confirm the fault tolerance property and efficiency of this work in obtaining the (a, b)-majority consensus. 
However, the proposed technique is time-consuming as it requires time, depending on the node number, to ensure that all nodes are likely to be acknowledged and can make decisions for fault tolerance. Therefore, the method may be affected if there are many problematic nodes in the network, which can result in the postponement of decision-making and possible disturbances in the system. Also, the cost of job execution should have been considered in this work.

3 Task scheduling problem and objective function

3.1 Task scheduling problem

The scheduling of tasks in edge data centers is driven by maximizing the effectiveness of various QoS metrics. The task scheduling problem is to distribute jobs across a finite number of VMs. Furthermore, the following assumptions were used to model this issue:

  1. Every VM can perform different kinds of jobs.

  2. Failures can occur in a VM or a task.

  3. A group of VMs can be added if needed, and a VM can be added to each group.

In this context, the edge model with one data center contains multiple groups of VMs, represented by G = (g_1, g_2, …, g_s), where s is the maximum number of groups. Each group is a set of virtual machines and can be represented by g = (VM_1, VM_2, …, VM_n), where n is the maximum number of VMs in a group in the data center. There are three security levels for VM groups, squ(g) ∈ {Nr, Md, Hr}: Nr denotes the normal-security level, Md represents the middle-security level, and Hr indicates the high-risk security level. Each VM in a group has its own computational speed Sp(VM), where

(1) Nr ≤ 0.4, 0.4 < Md ≤ 0.6, 0.6 < Hr ≤ 0.99.

Further, we modeled one of these groups as a high-risk group (g_hr), which has a different VM speed and security of Hr type. Our edge model consists of two networks: the first contains all normal- and middle-security-level groups, the bandwidth (bw) is set to the same value among the groups in this network, and the cost of moving a task between VMs in this network is disregarded; the second network contains the group for high-risk tasks (g_hr). The bw between these two networks is doubled to improve the data transmission rate, while using the bw between these networks is charged a cost C(bw). The workflow considered in this study refers to the set of related tasks based on a DAG with no cycles. WF = (T, E), where T is a set of m tasks in the shape of:

(2) T = ⋃_{i=1}^{m} t_i.

The communications and dependencies set of data are represented by E, and every reliance e ij indicates the restriction in which task t i should be ended before task t j , and it can be represented as

(3) e ij = { ( t i , t j , Data ij ) | ( t i , t j ) ϵ T × T } ,

where Data_ij denotes the amount of data that should be exchanged between t_i and t_j. The task data are measured in Million Instructions per Second (MIPS); furthermore, the input and output tasks are represented as (t_inp) and (t_out), respectively. For each task t_i ∈ T, there is a data size ds(t_i), measured in MIPS. Also, there is a cost for executing task t_i on VM_j, denoted as C(t_i, VM_j). Moreover, the cost of processing any task in the high-risk group can be represented by C_ghr(t_i, VM_j). Each task has an associated arrival time ŧ_ar(t_i), start processing time ŧ_st(t_i), and end time ŧ_end(t_i). Each set of tasks T has its security Ƌ, and the following formulas explain the relationship between task security and the security level:

(4) Ƌ(T) ≤ squ(g),

where

(5) Ƌ(T) = Nr if Ƌ ≤ 0.4; Md if 0.4 < Ƌ ≤ 0.6; Hr if 0.6 < Ƌ ≤ 0.99.
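As a concrete illustration, the level mapping of Eq. (5) and the placement constraint of Eq. (4) can be sketched as below; the function names and the numeric rank encoding are assumptions made for this sketch, not part of the paper's implementation.

```python
# Sketch of the security-level mapping (Eq. (5)) and the placement
# constraint (Eq. (4)). Names and the rank encoding are illustrative.
def security_level(d: float) -> str:
    """Map a task-set security value to a level per Eq. (5)."""
    if d <= 0.4:
        return "Nr"   # normal-security level
    if d <= 0.6:
        return "Md"   # middle-security level
    if d <= 0.99:
        return "Hr"   # high-risk security level
    raise ValueError("security value out of range")

# Eq. (4): a task set may only be placed on a VM group whose
# security level is at least the task set's own level.
LEVEL_RANK = {"Nr": 0, "Md": 1, "Hr": 2}

def can_schedule(task_security: float, group_level: str) -> bool:
    return LEVEL_RANK[security_level(task_security)] <= LEVEL_RANK[group_level]
```

For example, a task set with security 0.5 maps to Md and may run on an Md or Hr group, but a 0.8 (Hr) task set may not run on a normal-security group.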

Each DAG has multiple levels of tasks (L_r), where the first level includes the input tasks (t_inp) with no parents, while the last level contains the output tasks (t_out), which have no successors. For each task t_i processed on VM_j, we can mathematically calculate the execution time as the mathematical execution time (MExŧ) [37]. Formula (6) illustrates the mathematical calculation of execution time.

(6) MEx ŧ ( t i , VM j ) = ds ( t i ) / Sp ( VM j ) .

In edge hosting, servers can fail due to software and hardware issues, and all or some virtual machines may lose service. Service problems include VM regression. The decline in resource performance and uncertainty regarding failures make scheduling process applications during VM performance regression problematic [38,39]. The impact of performance regression on virtual machines utilized in edge data centers is a significant concern. Eq. (7) depicts the relationship between the mathematical execution time MExŧ(t_i, VM_j) and the impact of VM performance regression, which yields the estimated execution time (EExŧ).

(7) EExŧ(t_i, VM_j) = MExŧ(t_i, VM_j) × (1 − PerReg(VM_j)).

It can be noted that PerReg(VM_j) reflects the ratio of VM_j’s performance to its regression, which may be computed using Eq. (13) in the study by Chakravarthi et al. [40]. Eq. (8) illustrates the PerReg(VM_j) formulation as follows:

(8) PerReg(VM_j) = (P_load − P_no-load) / P_no-load,

where P_no-load and P_load represent the performance of the VM without and with load, respectively. P_no-load can be estimated using the runtime of VM_j without interference, and P_load is estimated with interference from other VMs. Prior to commencing task planning, it is imperative to obtain both P_no-load and P_load to determine the execution time of a scientific workflow application. Eqs (9) and (10) demonstrate the computational methodology employed for the calculation.

(9) P_no-load = P(ŧ_st(t_i)),

(10) P_load = P(ŧ_end(t_i) − ŧ_st(t_i)).

The error rate (Ɛ) of VM_j takes almost the same value as PerReg(VM_j), since the value of PerReg(VM_j) falls within [0, 1). Thus, Eq. (8) can be rewritten as shown in formula (11).

(11) Ɛ_VMj = (P_load − P_no-load) / P_no-load.

Then, the estimated execution time (EExŧ) in Eq. (7), for any task t i processed by VM j , can be rewritten as depicted in Eq. (12).

(12) EExŧ(t_i, VM_j) = MExŧ(t_i, VM_j) × (1 − Ɛ_VMj).
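Eqs. (6), (8)/(11), and (12) chain together as in the following minimal sketch; variable names and units are assumptions for illustration, not the paper's code.

```python
# Sketch of Eqs. (6), (8)/(11), and (12). Names are illustrative.
def mexe_time(ds_ti: float, sp_vmj: float) -> float:
    """Eq. (6): mathematical execution time = task size / VM speed."""
    return ds_ti / sp_vmj

def error_rate(p_load: float, p_no_load: float) -> float:
    """Eqs. (8)/(11): performance regression, used as the error rate of VM_j."""
    return (p_load - p_no_load) / p_no_load

def eexe_time(ds_ti: float, sp_vmj: float,
              p_load: float, p_no_load: float) -> float:
    """Eq. (12): estimated execution time under performance regression."""
    return mexe_time(ds_ti, sp_vmj) * (1.0 - error_rate(p_load, p_no_load))
```

For instance, a task of size 1,000 on a VM of speed 500 whose loaded performance drops from 100 to 80 units has an error rate of −0.2, so the estimated execution time stretches from 2.0 to 2.4.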

The amount of time that is wasted (wasŧ(t_i, VM_fault)) in the event of a machine failure during the processing of any task can be determined as shown in formula (13).

(13) wasŧ(t_i, VM_fault) = ŧ_int(t_i, VM_fault) − ŧ_st(t_i, VM_fault),

where ŧ_int(t_i, VM_fault) denotes the interrupt time for processing task t_i on a machine that has failed (VM_fault), and ŧ_st(t_i, VM_fault) represents the start processing time for t_i on VM_fault.

In contrast, the computation of the total execution time and Makespan for any job can be formulated as in Eqs (14) and (15), where dl denotes the deadline determined for each task t_i ∈ T.

(14) ToExŧ(t_i) = wasŧ(t_i, VM_fault) + ŧ_migr(t_i) + MExŧ(t_i, VM_j),

(15) Makespan = ToEx ŧ ( t i ) , where Makespan < dl ,

where ToExŧ(t_i) represents the total execution time, and ŧ_migr(t_i) refers to the migration time of t_i between the networks; it can be expressed by dividing the size of the task by the bandwidth, as in Eq. (16).

(16) ŧ_migr(t_i) = ds(t_i) / (2 × bw).
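Under the assumption that a failed task migrates over the doubled inter-network bandwidth and is then re-executed, Eqs. (13), (16), and (14) compose as in this sketch (names are illustrative):

```python
# Sketch of Eqs. (13)-(16). All names are illustrative assumptions.
def wasted_time(t_interrupt: float, t_start: float) -> float:
    """Eq. (13): time lost on the failed VM before the interrupt."""
    return t_interrupt - t_start

def migration_time(ds_ti: float, bw: float) -> float:
    """Eq. (16): migration over the doubled inter-network bandwidth."""
    return ds_ti / (2 * bw)

def total_exec_time(wasted: float, t_migr: float, exec_time: float) -> float:
    """Eq. (14): wasted time + migration time + (re-)execution time."""
    return wasted + t_migr + exec_time
```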

Moreover, the total execution cost ToC(t_i) is the regular cost of the machines plus the cost of using the bandwidth between the two networks if the task migrates between them. Thus, ƈ is a factor indicating a task’s migration, ƈ ∈ {0, 1}. The following formulas show how ToC(t_i) is calculated:

(17) ƈ = 1 if the task t_i is migrated; 0 otherwise,

(18) ToC(t_i) = ∑_{i=1}^{m} ∑_{j=1}^{n} C(t_i, VM_j) + ƈ × (migrC(t_i) + C_ghr(t_i)),

where migrC(t_i) represents the cost of migrating task t_i to g_hr using the bandwidth between the two networks.
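The cost model of Eqs. (17)-(18) adds the migration surcharge only when ƈ = 1; a hedged sketch with assumed argument names:

```python
# Sketch of Eqs. (17)-(18): total cost with an optional migration surcharge.
def total_cost(exec_costs, migrated, migr_cost=0.0, ghr_cost=0.0):
    """exec_costs: per-assignment costs C(t_i, VM_j); migrated: the flag of Eq. (17)."""
    c = 1 if migrated else 0                              # Eq. (17)
    return sum(exec_costs) + c * (migr_cost + ghr_cost)   # Eq. (18)
```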

3.2 Objective function

The primary objective of this study is to strengthen the system’s fault tolerance capability while optimizing the Makespan and execution cost. Our objective functions are specifically developed to minimize the Makespan by restricting the length of each activity and to optimize execution costs while considering fault tolerance. To accomplish this, we employed several constraints. The constraints ensure that the error rate (Ɛ) of each VM falls within the range [0, 1), with a maximum value of 0.2; this maintains the efficiency of the VM as it processes a job and prevents VM failure. The deadline condition (dl) and the security level criterion must also be met. Eq. (19) explicitly illustrates the objective function.

(19) F = Min(Makespan) + Min(ToC(t_i)), subject to: Ɛ ≤ 0.2, Ƌ(T) ≤ squ(g), Makespan < dl.
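One way to read Eq. (19) operationally: a candidate schedule is scored by Makespan plus cost, and any schedule violating a constraint is rejected, here via an infinite fitness. This is a sketch with assumed names, not the paper's implementation.

```python
# Sketch of the constrained objective in Eq. (19).
def fitness(makespan, cost, error_rates, task_security, group_security, deadline):
    feasible = (all(e <= 0.2 for e in error_rates)   # error rate <= 0.2 per VM
                and task_security <= group_security  # security constraint, Eq. (4)
                and makespan < deadline)             # Makespan < dl
    return makespan + cost if feasible else float("inf")
```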

4 The proposed method

The suggested approach employs a sequential strategy to achieve the objective of this work. First, it initiates a sequence based on task priority at the same DAG level. Second, this sequence serves as the input for the proposed TPAPSO. Third, the proposed PMWFT focuses on preventing failures in the VM and task strategy and offers a solution in case a failure does occur.

4.1 Initiating scheduling with priority

The scientific workflow is formed as a DAG. The tasks in each DAG are then scheduled as a heuristic list according to the total task size Tods(t_i), which depends on the data size of the task, the number of predecessors (pred(t_i)), and the number of successors (suc(t_i)) of each task, calculated as in Eq. (20).

(20) Tods(t_i) = ds(t_i) + PD × pred(t_i) + PK × suc(t_i),

(21) PD = 1 if pred(t_i) ≥ 1, 0 otherwise; PK = 1 if suc(t_i) ≥ 1, 0 otherwise,

where PD and PK represent the factors indicating the existence of predecessors and successors, respectively. As mentioned above, the set of tasks T includes the input tasks; this kind of job has the highest priority. We therefore evaluate the priority as {0, 1}, where 0 indicates the highest priority and 1 represents all other cases, as depicted in Eq. (22).

(22) Pr(t_i) = 0 if PD = 0; 1 otherwise, with ties ordered by max(Tods(t_i)).
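Eqs. (20)-(22) can be sketched as follows; PD denotes the 0/1 predecessor-existence factor of Eq. (21), and the function and argument names are illustrative assumptions.

```python
# Sketch of Eqs. (20)-(22): total task size and task priority.
def tods(ds_ti: float, n_pred: int, n_suc: int) -> float:
    """Eq. (20), using the 0/1 existence factors of Eq. (21)."""
    pd = 1 if n_pred >= 1 else 0   # predecessor-existence factor (PD)
    pk = 1 if n_suc >= 1 else 0    # successor-existence factor (PK)
    return ds_ti + pd * n_pred + pk * n_suc

def priority(n_pred: int) -> int:
    """Eq. (22): input tasks (no predecessors) get the highest priority, 0."""
    return 0 if n_pred == 0 else 1
```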

In the proposed initial-heuristics procedure, the scheduling depends on the following parameters: Ƌ(T), Pr(t_i), and EExŧ(t_i, VM_j). Figure 1 depicts a sample DAG with task levels. The initial heuristics approach produces the ideal sequence of tasks, which serves as the input for the subsequent algorithm, TPAPSO. Consequently, performance is influenced by the arrangement of tasks. Optimizing the order in which actions are performed enhances performance by decreasing the system’s overall temporal complexity, which can result in expedited processing times and improved resource utilization.

Figure 1
Sample of workflow with task levels.

Algorithm 1. Initial-heuristics procedure

  1. Input: WF = (T, E), T = ⋃_{i=1}^{m} t_i, E = {(t_i, t_j, Data_ij) | (t_i, t_j) ∈ T × T}, L, squ(g)

  2. Output: schedule Sch as pairs of task to virtual machine (t, VM)

  3.  For j = 1 to n; {//n is the No. of VMs

  4.   Arrange VMs Top-Down as Sp(VM)

  5.   Check the security level Ƌ ( T ) of the set of task T;

  6.    For i = 1 to m; {//m is the no. of task in the set of task T

  7.     Current_task = t i ;

  8.     If Ƌ ( T ) ≤ squ ( g ) ; then

  9.      If the value of PD = 0; then //check the priority of task Pr ( t i )

  10.       Start scheduling

  11.       Assign current_task to VM j ;

  12.        Else

  13.       Find max(Tods(t i )) //use Eq. (20) to find Tods(t i )

  14.      Find the min ( MEx ŧ ( t i , VM j ) ) ; //use Eq. (6) to find MEx ŧ ( t i , VM j )

  15.     Assign current_task to VM j ;

  16.    Sch = list of ( t i ,  VM j );

  17.   End if

  18.  End if

  19.      End of scheduling

  20.    }

  21.   }

Algorithm 1 returns a schedule from a workflow (WF) with tasks (T) and their dependencies (E), the VM count, and the security level (squ(g)). Criteria determine VM allocation. The method first ranks VMs by capacity, starting with the highest, and then checks the security of T. The algorithm loops over each task in T and compares T’s security against squ(g). Upon meeting the requirement, it inspects the predecessor factor PD: zero signifies a highest-priority task, and the algorithm assigns the current task to the jth VM for scheduling. When PD is not zero, it uses Eq. (20) to obtain the largest value of Tods(t_i) and Eq. (6) to find the execution time MExŧ(t_i, VM_j) for the current task. After finding these values, the algorithm assigns the current task with maximum Tods(t_i) to the jth VM that satisfies the minimum MExŧ(t_i, VM_j), creates the sequence of pairs (t_i, VM_j), and updates Sch. The time complexity is determined by n, the number of VMs, and m, the number of tasks, computed as O(n × m).
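A compact Python rendering of Algorithm 1's flow follows; the data structures, the whole-set security check, and the tie-breaking rule are assumptions filled in for this sketch.

```python
# Sketch of Algorithm 1 (initial-heuristics). Structures are illustrative.
def initial_schedule(tasks, vms, task_level_rank, group_level_rank):
    """tasks: dicts with id, size, n_pred, n_suc; vms: dicts with id, speed."""
    vms = sorted(vms, key=lambda v: v["speed"], reverse=True)  # rank VMs top-down

    def tods(t):  # Eq. (20) with the 0/1 factors of Eq. (21)
        pd = 1 if t["n_pred"] >= 1 else 0
        pk = 1 if t["n_suc"] >= 1 else 0
        return t["size"] + pd * t["n_pred"] + pk * t["n_suc"]

    if task_level_rank > group_level_rank:   # security check, Eq. (4)
        return []
    # input tasks (priority 0) first, then descending Tods
    ordered = sorted(tasks, key=lambda t: (0 if t["n_pred"] == 0 else 1, -tods(t)))
    sch = []
    for t in ordered:
        # pick the VM minimizing the mathematical execution time, Eq. (6)
        best = min(vms, key=lambda v: t["size"] / v["speed"])
        sch.append((t["id"], best["id"]))
    return sch
```

As in the pseudocode, the result is an ordered list of (task, VM) pairs that seeds TPAPSO.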

4.2 TPAPSO

The PSO [41] method operates so that the created particles swarm as a population in search of the ideal solution, much like a flock of birds. This optimization method has gained substantial popularity due to its ease of use and low computing cost for solving various issues [42]. This study suggests a new approach called TPAPSO, based on the PSO method and explicitly designed for multi-objective scheduling. Utilizing adaptive techniques and task prioritization in PSO for scheduling workflows on edge data centers can yield favorable outcomes in minimizing execution time and cost. This approach can significantly enhance the convergence of PSO. By customizing the parameters of the PSO algorithm to match the unique attributes of the workflow and assigning priorities to tasks, the algorithm effectively and systematically seeks the best solutions, resulting in faster convergence and enhanced performance, specifically reduced execution time and cost. The input of the proposed algorithm is the sequence produced by Algorithm 1, a heuristic list with an optimum arrangement. This scenario adds another improvement to the PSO. The security type of task set T is also considered when selecting a VM group. Accordingly, each particle searches for the optimal schedule and the VM that can be allotted to it so that it may produce the most optimal results. Personal best (Pbest) and global best (Gbest) are the two fundamental parameters of the PSO algorithm. Pbest reflects the optimal location of a particle during its swarm.

Similarly, Gbest is the global best that the population has experienced throughout the swarm. During the execution of the PSO algorithm, each particle’s velocity is updated with random weights toward its personal best position and the global best. The positions of the particles are represented as p_u(ŧ) = (p_1(ŧ), p_2(ŧ), …, p_y(ŧ)) at time ŧ, where u = {1, 2, …, y}, and the velocities of the particles are represented as v_u(ŧ) = (v_1(ŧ), v_2(ŧ), …, v_y(ŧ)). The particle moves to a new position in the subsequent step (ŧ + 1). In the proposed approach, we add several parameters to the fundamental ones, including the maximum number of iterations (maxIter), acceleration coefficients (c_1 and c_2), upper and lower bounds for the inertia weight (wmax and wmin), and a function for determining the inertia weight (w) during each iteration. Furthermore, the task priority parameter is calculated during the update of Gbest. The proposed algorithm changes the velocity and position of every particle; the velocity is recalculated by incorporating the current position, Pbest, and Gbest, utilizing the acceleration coefficients and random values, and the current position is updated by integrating the velocity. The fitness of the new position is evaluated, and if it surpasses the current Pbest, both the Pbest and its corresponding fitness value are changed. If the fitness of the newly obtained position surpasses the current global best position Gbest, the Gbest is upgraded considering the task priority, which represents the particle priority obtained from Eq. (20). After the particles are updated, the inertia weight is updated through the preestablished formula to balance exploration and exploitation. The method proceeds for the designated number of iterations, iteratively updating the particles, inertia weight, and task priority in each iteration. Ultimately, the optimized schedule, denoted as (t_i, VM_j), is the Gbest. 
The proposed TPAPSO has several advantages over existing optimization algorithms: it dynamically adapts its parameters based on performance, continually optimizes task assignment and resource allocation, and is more efficient at identifying optimal solutions, which can lead to a more streamlined and economical scheduling process in the edge data center. The following pseudo-code illustrates the proposed TPAPSO approach.

Algorithm 2 TPAPSO

1. Input: sch resulting from Algorithm 1

2. Output: optimized schedule (t_i, VM_j)

3. Initialize particles;

4. For (u = 1; u <= y; u++) {

5.   Pbest(u) = sch;

6.   Velocity(u) = random();

7.   Pbfitness(u) = evaluateFitness(Pbest(u)); }

8. Gbest = argmin(Pbfitness(1), Pbfitness(2), …, Pbfitness(y));

9. Set maxIter, c1, c2, wmax, wmin, w();

10. For (iter = 1; iter <= maxIter; iter++) {

11.   For (u = 1; u <= y; u++) {

12.     r1 = random(); r2 = random(); // update velocity with consideration of task priority

13.     Find the fitness value F for Pbfitness according to Eq. (19);

14.     v(u) = w(iter) * v(u) + c1 * r1 * priority(u) * (Pbest(u) − current(u)) + c2 * r2 * priority(u) * (Gbest − current(u)); // update position

15.     current(u) = current(u) + v(u); // evaluate fitness value

16.     currentFitness = evaluateFitness(current(u)); // update Pbest

17.     If (currentFitness < Pbfitness(u)) {

18.       Pbfitness(u) = currentFitness;

19.       Pbest(u) = current(u); } // update Gbest with consideration of iteration weight

20.     If (currentFitness < evaluateFitness(Gbest) * iterWeight(iter)) {

21.       Gbest = current(u); } // calculate task priority

22.     For each task t_i in Gbest {

23.       Find the priority Pr(t_i) using Eqs. (20)–(22)

24.     }

25.   }

26.   w(iter + 1) = wmax − ((wmax − wmin)/maxIter) * iter; // update inertia weight

27. }

28. Optimized schedule (t_i, VM_j) = Gbest
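The inertia-weight schedule in step 26 can be written as a one-line helper. A minimal sketch; the 0.9/0.4 bounds are common PSO defaults assumed for illustration, not values stated in the paper.

```python
def inertia_weight(iteration, max_iter, w_max=0.9, w_min=0.4):
    """Linearly decreasing inertia weight, as in step 26 of Algorithm 2:
    w(iter + 1) = w_max - ((w_max - w_min) / max_iter) * iter.
    Starts at w_max (favoring exploration) and falls to w_min
    (favoring exploitation) by the final iteration."""
    return w_max - (w_max - w_min) / max_iter * iteration
```

Early iterations thus weight the inertia term heavily, while late iterations let the attraction toward Pbest and Gbest dominate.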

The algorithm presented is a modified version of PSO called TPAPSO. It improves upon the conventional PSO algorithm through two key enhancements: task priority and an adaptive inertia weight. In TPAPSO, every particle represents a candidate solution. Particles are updated by considering their personal best solution (Pbest) and the global best (Gbest). The velocity of each particle is updated using random values (r1 and r2) and the task priority, and the velocity in turn changes the particle's position. The fitness of the newly obtained position is assessed and compared to the particle's personal best fitness; if the new position has a better fitness value, Pbest and Pbfitness are updated. Gbest is updated by considering the new position's fitness value together with the iteration weight. Task priority is then calculated for each task in Gbest using the designated equations. The inertia weight is dynamically adjusted at each iteration to balance exploration and exploitation. Overall, incorporating task priority and an adaptive inertia weight into PSO can boost its convergence and optimization performance. The time complexity of the technique depends on the number of iterations (maxIter) and the number of particles (y): the inner loop that updates the velocities, positions, fitness values, and Pbest costs O(y), so the algorithm's overall time complexity is O(maxIter × y). Figure 2 demonstrates the proposed method.

Figure 2: Flowchart illustrating the proposed method.

4.3 PMWFT

In this suggested approach, we address monitoring of machine performance and two distinct failure scenarios in an edge data center system. The first scenario concerns failure of a VM while it is executing a task: how to resolve the issue and what action prevents the task from terminating abruptly. The second scenario concerns a problem in the task itself, for example a possible attack in which an attacker inserts a malicious instruction into the task's source code; such a task is assumed to incur unavoidable delays when processed on a regular VM. Our solution also includes a novel approach for scheduling scientific workflow tasks with a secure fault-tolerance strategy: it focuses on preventing flaws in existing processes, strengthens system security, and decreases the likelihood of failure, while the proposed design can also circumvent a failure once it occurs. To this end, a monitoring technique is introduced that tracks the performance of TPAPSO and integrates VM availability with edge system security, while TPAPSO itself minimizes the Makespan and cost of executing the workflow. Within edge data centers, the balance between execution time and cost is especially important owing to the scarcity of resources compared with conventional cloud data centers. Hence, the scheduling algorithms we present aim to achieve an optimal balance by considering task priority, resource availability, and network conditions when making scheduling decisions, which can reduce both the time and the expense of completing a task.

The VM's efficiency is first used to adjust the load balance and the number of VMs in the VM group. Our strategy checks a VM's efficiency during task processing and compares the actual processing time with the estimated execution time of task t_i on VM_j, taking the VM's degradation into account, to decide whether the VM should continue processing the task, whether it can serve another task, or whether it should be repaired. Whenever the system determines that the time required to complete the current task on its VM will exceed the original estimate, the task is switched to another VM. A counter records that the VM incurred a one-time delay, and the migrated task is recorded on a separate counter; if the same VM exhibits the problem a second time, it is restarted and another VM is added to the group. If the migrated task is problematic again, it is moved to a high-risk group, and the cost of using that group is charged. This mechanism promotes security in the edge data center by neutralizing concealed attacks within workflows that attempt to traverse VMs. Consequently, our solution encompasses the proposed fault-tolerance strategies as well as deadline restrictions. Moreover, the performance efficiency of any machine can be inferred from the VM's regression ε ∈ (0, 1], where efficiency decreases as regression increases. Therefore, we can state

(23) VM_j^ef ≤ 1,

then,

(24) VM_j^ef = 1 − ε_VMj.
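Eq. (24) and the 80% efficiency threshold used later in Algorithm 3 can be checked directly. A minimal sketch; the function names are ours.

```python
def vm_efficiency(regression):
    """Efficiency of a VM from its observed regression ε ∈ (0, 1], per Eq. (24):
    VM_ef = 1 - ε, so efficiency falls as regression grows and never exceeds 1
    (Eq. (23))."""
    if not 0.0 <= regression <= 1.0:
        raise ValueError("regression must lie in [0, 1]")
    return 1.0 - regression


def is_degraded(regression, threshold=0.8):
    """True when the VM's efficiency drops to or below the migration threshold
    used in step 7 of Algorithm 3."""
    return vm_efficiency(regression) <= threshold
```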

The TPAPSO-PMWFT framework assesses the availability of all VMs by evaluating their respective statuses. Moreover, each VM type is profiled by running activities comparable to the tasks of a scientific workflow, in terms of the quantity and type of data, while ensuring that no conflicts arise with the other VMs. The same VMs are then analyzed under task and resource-allocation interference, and the execution duration is recorded in each case. By comparing the times in the two scenarios, we can assess the impact of interference and the extent to which efficiency declines during execution.
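The actual-versus-estimated runtime check described above can be sketched as follows. This is a hypothetical helper of ours that extrapolates the projected total runtime linearly from the task's progress; the paper does not specify this exact formula.

```python
def should_migrate(elapsed_s, progress_fraction, estimated_total_s):
    """Decide whether a running task should be switched to another VM.

    Projects the total runtime from the fraction of work completed so far and
    compares it with the scheduler's original estimate for (t_i, VM_j).
    Linear extrapolation is an assumption for illustration only.
    """
    if progress_fraction <= 0.0:
        return False  # no progress information yet; keep the task in place
    projected_total_s = elapsed_s / progress_fraction
    return projected_total_s > estimated_total_s
```

For example, a task 25% complete after 60 s projects to 240 s total, so it would migrate if the estimate were 200 s.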

Algorithm 3. PMWFT

Input: the optimized schedule (t_i, VM_j) resulting from Algorithm 2

Output: optimized schedule with fault tolerance

1. VMrep = ∅; I = interrupt signal; K = 0; A = 0; b = 0;

2. For (j = 1; j <= n; j++) {

3.   For (i = 1; i <= m; i++) {

4.     Initial_schedule = initialize scheduling based on Algorithm 2;

5.     Current_task = t_igx; // first task in Initial_schedule

6.     Start scheduling based on Initial_schedule;

7.     If (VM_jgx^ef ≤ 0.8) { // use Eq. (24)

8.       Send I to VM_jgx;

9.       Start migrating Current_task to an appropriate VM_jgx; // reschedule as in Algorithm 2

10.      Current_task = t_igx+1; // track the next task in the schedule

11.      kVM_jgx = kVM_jgx + 1; // K counter for VM faults

12.      At_igx = At_igx + 1; // A counter for task faults

13.    } // End if

14.    If (kVM_jgx > 1) {

15.      Send VM_jgx to VMrep; } // End if

16.    If (At_igx > 1) {

17.      Start migrating Current_task to gh-r; // migrate to the high-risk group

18.      b = b + 1; } // b counts tasks migrated to gh-r

19.  }

20.  While (VM_jgx ∈ VMrep) {

21.    Reboot VM_jgx;

22.    VMrep = VMrep − {VM_jgx};

23.  } // End while

24. }

Algorithm 3, PMWFT, is pseudo-code for fault-tolerant schedule optimization. The algorithm takes the optimized schedule from Algorithm 2 and outputs a fault-tolerant optimized schedule. It begins by initializing VMrep, a set that stores VMs needing repair; I, an interrupt signal; K, a VM-fault counter; A, a task-fault counter; and b, a counter of tasks migrated to the high-risk group. The method initializes the schedule produced by Algorithm 2, selects the first task of each iteration as the current task, and schedules according to the initial schedule. If a VM's efficiency falls to 80% or below, it receives the interrupt signal I, and the scheduling mechanism of Algorithm 2 migrates the current task to a suitable VM; the next scheduled task then becomes the current one. The procedure increments the K counter for VM faults and the A counter for task faults, and increments the b counter to track migrations to the high-risk group.

If the K counter of a VM exceeds 1, that VM is added to VMrep. If the A counter of a task exceeds 1, the task is moved to the high-risk group. After the nested loops, every VM in VMrep is rebooted and removed from the set. In this way, VM performance is monitored, and tasks are migrated to an appropriate VM or to the high-risk group to keep the schedule fault tolerant. The time complexity of Algorithm 3 depends on the number of tasks m and the number of VMs n, giving O(n × m). The overall time complexity of TPAPSO-PMWFT is therefore O(n × m) × O(maxIter × y), i.e., O(n · m · maxIter · y).
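The counter and quarantine logic of Algorithm 3 can be condensed into a single decision step. A sketch of ours, not the authors' implementation; the actual migration and rescheduling are elided, and only the K/A counters, repair set, and high-risk set are modeled.

```python
def pmwft_step(vm_id, task_id, degraded, vm_faults, task_faults, repair_set, high_risk):
    """One PMWFT decision for a (task, VM) pair, mirroring Algorithm 3.

    vm_faults / task_faults are the K and A counters keyed by VM and task id;
    repair_set collects VMs scheduled for reboot; high_risk collects tasks
    quarantined to the charge-for-use high-risk group.
    Returns True when the task had to be migrated off its VM.
    """
    migrated = False
    if degraded:  # efficiency at or below 80%, per Eq. (24)
        migrated = True  # interrupt the VM and migrate the current task
        vm_faults[vm_id] = vm_faults.get(vm_id, 0) + 1          # K counter
        task_faults[task_id] = task_faults.get(task_id, 0) + 1  # A counter
    if vm_faults.get(vm_id, 0) > 1:      # same VM faulted twice
        repair_set.add(vm_id)            # reboot before reuse
    if task_faults.get(task_id, 0) > 1:  # same task failed twice
        high_risk.add(task_id)           # quarantine in the high-risk group
    return migrated
```

A first fault only records counters; a second fault on the same VM or task triggers the repair or quarantine action, matching the `> 1` checks in the pseudo-code.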

Finally, the proposed method has significant practical ramifications, particularly for scientific workflows. Optimizing task scheduling can increase productivity and lower costs, because workloads in edge data center environments are frequently dispersed across numerous virtual machines. Fault tolerance increases availability and reliability by ensuring the system continues to function even if one or more tasks or VMs fail, which is especially valuable for applications where downtime is expensive or unacceptable, such as financial services, healthcare, or essential infrastructure.

5 Experiments and results

5.1 Experimental setup

The key objective of this section is to describe the experimental setup created to evaluate the efficiency of the proposed method, TPAPSO-PMWFT. To assess the method, we employed four distinct types of Amazon EC2 instances to replicate our model faithfully. The details of these instances are given in Table 1.

Table 1

Typical EC2 VMs type for the proposed model

| Machine name | CPU (MIPS) | # Cores | Memory (GB) | Bandwidth (Mbps) | Failure rate (pk) | Cost ($ per hour) |
|---|---|---|---|---|---|---|
| m4.large | 1,000 | 2 | 8 | 10 | 0.1 | 0.16 |
| m4.xlarge | 2,000 | 4 | 16 | 10 | 0.09 | 0.26 |
| m4.2xlarge | 4,000 | 8 | 32 | 10 | 0.07 | 0.53 |
| m4.4xlarge | 6,000 | 16 | 64 | 10 | 0.06 | 0.93 |
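As a rough illustration of how the Table 1 figures translate into runtime and billing cost, the following sketch uses a simple length/MIPS timing model of the kind common in CloudSim simulations. This model and both function names are our assumptions for illustration; the paper defines its own timing and cost equations.

```python
def est_runtime_s(task_length_mi, vm_mips):
    """Estimated runtime in seconds of a task of task_length_mi million
    instructions on a VM rated vm_mips MIPS (simple length/speed model,
    assumed here; ignores cores, bandwidth, and queuing)."""
    return task_length_mi / vm_mips


def est_cost_usd(runtime_s, price_per_hour):
    """Billing cost for runtime_s seconds at the per-hour prices of Table 1."""
    return runtime_s / 3600.0 * price_per_hour
```

Under this model, a 2,000 MI task runs in 2 s on an m4.large (1,000 MIPS) but 0.5 s on an m4.2xlarge, at roughly 3.3 times the hourly price, which is the time/cost trade-off the scheduler must balance.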

In addition, we utilized four further types of Amazon EC2 instances to replicate a high-risk situation within the suggested model. The details of these instances are given in Table 2.

Table 2

Type of EC2 VMs used in high-risk group for the suggested model

| Machine name | vCPU | Instance storage (GB) | Memory (GB) | Bandwidth (Mbps) | EBS bandwidth (Mbps) | Failure rate (pk) | Cost ($ per hour) |
|---|---|---|---|---|---|---|---|
| c6g.4xlarge | 16 | EBS-only | 32 | Up to 10 | 4,750 | 0.07 | 1.079 |
| c6g.8xlarge | 32 | EBS-only | 64 | 12 | 9,000 | 0.05 | 1.39 |
| c6g.12xlarge | 48 | EBS-only | 96 | 20 | 13,500 | 0.03 | 1.86 |
| c6g.16xlarge | 64 | EBS-only | 128 | 25 | 19,000 | 0.02 | 2.01 |

The simulations were performed on the CloudSim 4.0 platform. To evaluate the efficacy of our proposed methodology against established methods, we included two distinct categories of workflows in our study.

The first workflow type was randomly generated using the WorkflowSim tool [43]. In this configuration, the number of edges is twice the number of nodes. The workflows used in the experiments ranged from 30 to 100 tasks, with file sizes from 10 MB to 50 MB. The WorkflowSim API was used to create the workflow structure, specifying the tasks, their interdependencies, the data dependencies between them, and the security level associated with each workflow.

As shown in Table 3, the second workflow category consisted of the Epigenomics and LIGO workflows [44]. The table lists the samples associated with these workflows; sizes are denoted (S) small, (S-M) small-medium, (M) medium, (L) large, and (EL) extra-large.

Table 3

Types of workflows tasks size

| Workflow | S | S-M | M | L | EL |
|---|---|---|---|---|---|
| Epigenomics | 50 | 100 | 200 | 400 | 600 |
| LIGO | 50 | 100 | 200 | 400 | 600 |

A series of rigorous experiments was conducted to assess the viability and efficacy of the suggested methodology. Its outcomes were compared with those of state-of-the-art approaches: PPO-based WS [33], RLFTWS [34], FCWS [35], and FTAW [21]. The comparison used several parameters, including task quantity, task size, and failure rate; these metrics were selected because they align with the objectives of Makespan, execution cost, and task completion rate.

To assess our proposed algorithm, TPAPSO-PMWFT, thoroughly and fairly, we performed a comparative analysis with the cutting-edge approaches PPO-based WS, RLFTWS, FCWS, and FTAW. This analysis accounted for the differences in objective functions between the state-of-the-art approaches and the suggested method, as well as the workflow types and metrics employed by each. The performance of each approach was evaluated using fundamental measures such as average Makespan, failure ratio, and task number.

Thus, a comparison was conducted between the proposed method and the state-of-the-art approaches indicated above, considering their respective objective functions.

Furthermore, we conducted a comparative analysis between our suggested approach and PPO-based WS and FTAW, explicitly emphasizing the task completion rate. Uniform measurements were employed across all methodologies to guarantee an equitable comparison. The experimental methodology employed in these comparisons adhered to the first type of workflow outlined previously, produced randomly using WorkflowSim.

Figures 3–6 depict the outcomes of these comparisons. In addition, we conducted a comparative analysis between our suggested technique and the FCWS method, considering the cost of execution, the overall completion time, and the task-size metric.

Figure 3: Average Makespan according to failure ratio.

Figure 4: Average Makespan according to task number.

Figure 5: Task completion rate according to failure ratio.

Figure 6: Task completion rate according to task number.

The experimental procedure utilized the Epigenomics and LIGO workflows, as outlined in Table 3; the results are depicted in Figures 7–10. Through this comprehensive comparative approach, we assessed the efficacy of the suggested technique against other modern methodologies across various workflow types and evaluation criteria.

Figure 7: Comparison of LIGO workflow execution cost according to task size.

Figure 8: Comparison of LIGO workflow execution time according to task size.

Figure 9: Comparison of Epigenomics workflow execution cost.

Figure 10: Comparison of Epigenomics workflow execution time according to task size.

5.2 Outcomes and analyses

Figure 3 shows that the TPAPSO-PMWFT strategy significantly reduced the average Makespan, achieving a 26.97% reduction relative to the PPO-based WS technique. The results were almost identical to those of the RLFTWS method, with the proposed strategy showing a modest further reduction of 1%. These results were obtained over the modeled failure range of the edge data center's virtual machines, with failure rates of 0.05, 0.1, 0.2, 0.3, and 0.4, using the first workflow type discussed previously.

As Figure 4 shows, our suggested method outperformed PPO-based WS and RLFTWS in terms of average Makespan by around 6 and 19%, respectively. These results were obtained using the task-number metric and, again, the first type of scientific workflow.

In addition, experiments aimed at enhancing the task completion rate showed that our suggested method outperformed the state-of-the-art PPO-based WS and FTAW techniques under the failure-ratio metric, again with failure rates of 0.05, 0.1, 0.2, 0.3, and 0.4. Our technique achieved an improvement of around 6.3% over PPO-based WS and 10.8% over FTAW, as illustrated in Figure 5, using the first workflow type.

Moreover, under the task-number metric, our proposed method matched the task completion rate of PPO-based WS almost exactly, an unexpected result, and improved on FTAW by approximately 12.3%, a gap that became more pronounced as task size grew, as shown in Figure 6. The first workflow type was used.

Additionally, when considering task execution cost, the suggested technique, as shown in Figure 7, performed similarly to FCWS for small and small-medium task sizes in the LIGO workflow (the second workflow type discussed earlier), while for large and extra-large sizes FCWS outperformed the proposed technique. This outcome was unforeseen; however, FCWS assigns a 70% weight to cost relative to completion time, which explains the result.

Nevertheless, when we repeated the LIGO experiments with the focus on optimizing task completion time across task sizes, the suggested method outperformed FCWS, as shown in Figure 8, achieving an improvement of around 11.3% across the small, small-medium, large, and extra-large LIGO scenarios.

Moreover, in terms of cost, the suggested technique, as illustrated in Figure 9, performed similarly to FCWS for small and small-medium task sizes of the Epigenomics workflow (the second workflow type), with FCWS exhibiting superior cost performance overall.

However, in the Epigenomics experiments aimed at improving task completion time across task sizes, the recommended method outperformed FCWS, as shown in Figure 10, with an approximate improvement of 12.7% across Epigenomics scenarios ranging from small to extra-large.

6 Discussion

The failure-rate experiments show that TPAPSO-PMWFT reduced the Makespan by 26.97% compared with the PPO-based WS technique, but only by a modest 1% compared with RLFTWS; although the proposed approach is efficient, it may struggle to outperform comparably sophisticated techniques such as RLFTWS by a large margin.

In addition, examining the task-number metric shows that our proposed approach outperforms PPO-based WS and RLFTWS in Makespan by approximately 6 and 19%, respectively. Under the failure-rate metric, however, the proposed method improves on PPO-based WS by only about 6.3% and on FTAW by 10.8%, indicating that its robustness to high failure rates still has room for improvement.

Based on the task-number metric, our TPAPSO-PMWFT approach closely matched the best task completion percentages. When evaluating execution cost, however, the proposed approach performed similarly to FCWS only for small and small-medium task sizes; for large and extra-large sizes, FCWS achieved lower execution cost in both the Epigenomics and LIGO scenarios. This implies that the suggested method still needs to reduce the overhead of executing larger tasks.

Nevertheless, considering that FCWS gives cost a 70% priority over execution time in its implementation, we can infer that the proposed method remains preferable, since its design gives equal weight to execution time and cost.

When evaluating the task completion time, the proposed technique showed a performance improvement of approximately 12% compared to FCWS when implemented on Epigenomics and LIGO workflows.

7 Conclusion

This research has specifically addressed the implementation of scientific workflows in edge computing, which has several benefits, especially in the context of modular edge data centers that prioritize fault tolerance. We suggested a three-fold scheduling technique. The initial schedule is generated using a heuristic approach, considering task size and other factors. The TPAPSO algorithm is a modified PSO version that inputs the previous algorithm’s sequence. Its objective is to minimize both execution time and cost. The third algorithm is a fault tolerance technique specifically created to oversee the performance of TPAPSO and handle any faults related to virtual machines or tasks.

Efficient task scheduling is crucial for assuring the timely completion of workflows. Our technique commences with an initial schedule based on heuristics, subsequently optimized via TPAPSO. TPAPSO-PMWFT aims to identify the optimal schedule that minimizes cost and Makespan, ensuring maximum efficiency.

Our proposed method outperforms state-of-the-art techniques such as PPO-based WS, RLFTWS, FCWS, and FTAW. More precisely, under the failure-ratio metric it achieved a 26.97% reduction in Makespan compared with PPO-based WS, and under the task-number metric it improved Makespan by 6 and 19% over PPO-based WS and RLFTWS, respectively.

Regarding the failure-ratio metric, our suggested technique demonstrated superior performance to PPO-based WS and FTAW, with improvements of 6.3 and 10.8%, respectively. Under the task-number metric, the TPAPSO-PMWFT method closely matched the best task completion ratio.

When evaluating across task sizes, our suggested approach achieved roughly a 12% improvement in completion time for the Epigenomics and LIGO workflows compared with FCWS. However, FCWS exhibited better cost results, owing to the 70% weight it assigns to cost over execution time.

Our TPAPSO-PMWFT algorithm presents a highly effective method for scheduling scientific workflows in edge computing environments. It outperforms other algorithms on several essential measures, making it a promising solution, although additional work is required to enhance its cost-effectiveness. In future work, we intend to improve the proposed method to overcome these limitations and explore a novel approach for scheduling scientific workflows in an edge environment with distributed heterogeneous data centers. We will also investigate collaboration between the edge and the cloud to optimize execution time, communication cost, latency, and energy consumption, and to improve QoS and security awareness.

  1. Funding information: This work was funded by National Natural Science Foundation of China, with Award Number: 62172441.

  2. Author contributions: All authors have personally and actively contributed to the article’s development and will assume public responsibility for its contents.

  3. Conflict of interest: The authors declare that they have no potential conflicts of interest.

  4. Data availability statement: The data used in our experiments can be found freely on https://pegasus.isi.edu/.

  5. Images/Graphics: The authors declare that all the images in our manuscript are original.

References

[1] Alsaidy SA, Abbood AD, Sahib MA. Heuristic initialization of PSO task scheduling algorithm in cloud computing. J King Saud Univ-Comput Inf Sci. 2022;34(6):2370–82. doi:10.1016/j.jksuci.2020.11.002.

[2] Haibeh LA, Yagoub MC, Jarray A. A survey on mobile edge computing infrastructure: Design, resource management, and optimization approaches. IEEE Access. 2022;10:27591–610. doi:10.1109/ACCESS.2022.3152787.

[3] Ray K, Banerjee A. Prioritized fault recovery strategies for multi-access edge computing using probabilistic model checking. IEEE Trans Dependable Secure Comput. 2022;20(1):797–812. doi:10.1109/TDSC.2022.3143877.

[4] Chen J, Wang Y, Ye M, Jiang Q. A secure cloud-edge collaborative fault-tolerant storage scheme and its data writing optimization. IEEE Access. 2023;11:66506–21. doi:10.1109/ACCESS.2023.3291452.

[5] Chen X, Xu G, Xu X, Jiang H, Tian Z, Ma T. Multicenter hierarchical federated learning with fault-tolerance mechanisms for resilient edge computing networks. IEEE Trans Neural Netw Learn Syst. 2024. doi:10.1109/TNNLS.2024.3362974.

[6] Ibrahim M, Nabi S, Baz A, Alhakami H, Raza MS, Hussain A, et al. An in-depth empirical investigation of state-of-the-art scheduling approaches for cloud computing. IEEE Access. 2020;8:128282–94. doi:10.1109/ACCESS.2020.3007201.

[7] Tong Z, Chen H, Deng X, Li K, Li K. A scheduling scheme in the cloud computing environment using deep Q-learning. Inf Sci. 2020;512:1170–91. doi:10.1016/j.ins.2019.10.035.

[8] Singh H, Bhasin A, Kaveri PR. QRAS: Efficient resource allocation for task scheduling in cloud computing. SN Appl Sci. 2021;3(4):1–7. doi:10.1007/s42452-021-04489-5.

[9] Houssein EH, Gad AG, Wazery YM, Suganthan PN. Task scheduling in cloud computing based on meta-heuristics: Review, taxonomy, open challenges, and future trends. Swarm Evol Comput. 2021;62:100841. doi:10.1016/j.swevo.2021.100841.

[10] Wang Y, Zuo X. An effective cloud workflow scheduling approach combining PSO and idle time slot-aware rules. IEEE/CAA J Autom Sin. 2021;8(5):1079–94. doi:10.1109/JAS.2021.1003982.

[11] Zhang L, Zhou L, Salah A. Efficient scientific workflow scheduling for deadline-constrained parallel tasks in cloud computing environments. Inf Sci. 2020;531:31–46. doi:10.1016/j.ins.2020.04.039.

[12] Ma X, Gao H, Xu H, Bian M. An IoT-based task scheduling optimization scheme considering the deadline and cost-aware scientific workflow for cloud computing. EURASIP J Wirel Commun Netw. 2019;2019(1):249. doi:10.1186/s13638-019-1557-3.

[13] Tuli S, Casale G, Jennings NR. PreGAN: Preemptive migration prediction network for proactive fault-tolerant edge computing. In: IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. IEEE; 2022. doi:10.1109/INFOCOM48880.2022.9796778.

[14] Mudassar M, Zhai Y, Lejian L. Adaptive fault-tolerant strategy for latency-aware IoT application executing in edge computing environment. IEEE Internet Things J. 2022;9(15):13250–62. doi:10.1109/JIOT.2022.3144026.

[15] Sharif A, Nickray M, Shahidinejad A. Fault-tolerant with load balancing scheduling in a fog-based IoT application. IET Commun. 2020;14(16):2646–57. doi:10.1049/iet-com.2020.0080.

[16] McEnroe P, Wang S, Liyanage M. A survey on the convergence of edge computing and AI for UAVs: Opportunities and challenges. IEEE Internet Things J. 2022;9(17):15435–59. doi:10.1109/JIOT.2022.3176400.

[17] Abbasi S, Rahmani AM, Balador A, Sahafi A. A fault-tolerant adaptive genetic algorithm for service scheduling in internet of vehicles. Appl Soft Comput. 2023;143:110413. doi:10.1016/j.asoc.2023.110413.

[18] Chakravarthi KK, Shyamala L. TOPSIS inspired budget and deadline aware multi-workflow scheduling for cloud computing. J Syst Archit. 2021;114:101916. doi:10.1016/j.sysarc.2020.101916.

[19] Bansal S, Bansal RK, Arora K. Energy efficient backup overloading schemes for fault tolerant scheduling of real-time tasks. J Syst Archit. 2021;113:101901. doi:10.1016/j.sysarc.2020.101901.

[20] Khaldi M, Rebbah M, Meftah B, Smail O. Fault tolerance for a scientific workflow system in a cloud computing environment. Int J Comput Appl. 2020;42(7):705–14. doi:10.1080/1206212X.2019.1647651.

[21] Long T, Ma Y, Wu L, Xia Y, Jiang N, Li J, et al. A novel fault-tolerant scheduling approach for collaborative workflows in an edge-IoT environment. Digit Commun Netw. 2022;8(6):911–22. doi:10.1016/j.dcan.2022.08.010.

[22] Hasan M, Goraya MS. Fault tolerance in cloud computing environment: A systematic survey. Comput Ind. 2018;99:156–72. doi:10.1016/j.compind.2018.03.027.

[23] Samanta A, Esposito F, Nguyen TG. Fault-tolerant mechanism for edge-based IoT networks with demand uncertainty. IEEE Internet Things J. 2021;8:16963–71. doi:10.1109/JIOT.2021.3075681.

[24] Karthikeyan L, Vijayakumaran C, Chitra S, Arumugam S. SALDEFT: Self-adaptive learning differential evolution based optimal physical machine selection for fault tolerance problem in cloud. Wirel Pers Commun. 2021;118:1453–80. doi:10.1007/s11277-021-08089-9.

[25] Wang D, Tan D, Liu L. Particle swarm optimization algorithm: An overview. Soft Comput. 2018;22:387–408. doi:10.1007/s00500-016-2474-6.

[26] Chaudhary D, Kumar B. Cloudy GSA for load scheduling in cloud computing. Appl Soft Comput. 2018;71:861–71. doi:10.1016/j.asoc.2018.07.046.

[27] Mahato DP, Singh RS, Tripathi AK, Maurya AK. On scheduling transactions in a grid processing system considering load through ant colony optimization. Appl Soft Comput. 2017;61:875–91. doi:10.1016/j.asoc.2017.08.047.

[28] Abdullahi M, Ngadi MA. Symbiotic organism search optimization based task scheduling in cloud computing environment. Future Gener Comput Syst. 2016;56:640–50. doi:10.1016/j.future.2015.08.006.

[29] Keshanchi B, Souri A, Navimipour NJ. An improved genetic algorithm for task scheduling in the cloud environments using the priority queues: Formal verification, simulation, and statistical testing. J Syst Softw. 2017;124:1–21. doi:10.1016/j.jss.2016.07.006.

[30] Alsaidy SA, Abbood AD, Sahib MA. Heuristic initialization of PSO task scheduling algorithm in cloud computing. J King Saud Univ-Comput Inf Sci. 2020.

[31] Wu N, Zuo D, Zhang Z. Dynamic fault-tolerant workflow scheduling with hybrid spatial-temporal re-execution in clouds. Information. 2019;10(5):169. doi:10.3390/info10050169.

[32] Wang Y, Guo Y, Wang W, Liang H, Huo S. INHIBITOR: An intrusion tolerant scheduling algorithm in cloud-based scientific workflow system. Future Gener Comput Syst. 2021;114:272–84. doi:10.1016/j.future.2020.08.004.

[33] Xiang Y, Yang X, Sun Y, Luo H. A fault-tolerant and cost-efficient workflow scheduling approach based on deep reinforcement learning for IT operation and maintenance. In: 2023 26th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE; 2023. doi:10.1109/CSCWD57460.2023.10152783.

[34] Dong T, Xue F, Tang H, Xiao C. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl Intell. 2023;53(9):9916–32.10.1007/s10489-022-03963-wSearch in Google Scholar

[35] Tang X. Reliability-aware cost-efficient scientific workflows scheduling strategy on multi-cloud systems. IEEE Trans Cloud Comput. 2021;10(4):2909–19.10.1109/TCC.2021.3057422Search in Google Scholar

[36] Jing G, Zou Y, Yu D, Luo C, Cheng X. Efficient fault-tolerant consensus for collaborative services in edge computing. IEEE Trans Computers. 2023;72:2139–50.10.1109/TC.2023.3238138Search in Google Scholar

[37] Sujana J, Revathi T, Priya T, Muneeswaran K. Smart PSO-based secured scheduling approaches for scientific workflows in cloud computing. Soft Comput. 2019;23(5):1745–65.10.1007/s00500-017-2897-8Search in Google Scholar

[38] Masoumi M, Motallebi H. A structure-aware algorithm for fault-tolerant scheduling of scientific workflows. J Supercomputing. 2022;78:17348–77.10.1007/s11227-022-04529-wSearch in Google Scholar

[39] Kumari P, Kaur P. A survey of fault tolerance in cloud computing. J King Saud Univ-Computer Inf Sci. 2021;33(10):1159–76.10.1016/j.jksuci.2018.09.021Search in Google Scholar

[40] Chakravarthi KK, Shyamala L, Vaidehi V. Cost-effective workflow scheduling approach on cloud under deadline constraint using firefly algorithm. Appl Intell. 2021;51(3):1629–44.10.1007/s10489-020-01875-1Search in Google Scholar

[41] Eberhart R, Kennedy J, editors. A new optimizer using particle swarm theory. MHS'95 Proceedings of the Sixth International Symposium on Micro Machine and Human Science. IEEE; 1995.Search in Google Scholar

[42] Ebrahimian H, Barmayoon S, Mohammadi M, Ghadimi N. The price prediction for the energy market based on a new method. Econ Res-Ekonomska istraživanja. 2018;31(1):313–37.10.1080/1331677X.2018.1429291Search in Google Scholar

[43] Chen W, Deelman E, editors. WorkflowSim: A toolkit for simulating scientific workflows in distributed environments. 2012 IEEE 8th International Conference on E-Science. IEEE; 2012.10.1109/eScience.2012.6404430Search in Google Scholar

[44] Yang L, Xia Y, Zhang X, Ye L, Zhan Y. Classification-based diverse workflows scheduling in clouds. IEEE Trans Autom Sci Eng. 2024;21:630–4110.1109/TASE.2022.3217666Search in Google Scholar

Received: 2023-10-17
Revised: 2024-04-15
Accepted: 2024-05-17
Published Online: 2024-07-10

© 2024 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
