Task parallelization and distributed computing are growing in popularity as the pace of computer hardware development slows and the demand for computing power increases. Applications such as deep learning and graph analytics require multiple computational resources (e.g., CPU/GPU, network bandwidth, memory)^{[1]}. Device placement requires proper scheduling of the available resources in edge computing^{[2,3]}. A scheduler is therefore needed that allocates resources reasonably and efficiently, so that tasks complete as soon as possible and resource utilization is maximized.
Such applications are commonly modelled as large-scale directed acyclic graphs (DAGs), in which nodes denote tasks and edges denote inter-task dependencies. These DAGs typically contain tens of thousands of nodes, and we call scheduling at this scale large-scale task scheduling. Large-scale task scheduling is a significant and challenging area of research in computer science. A task scheduling system must balance the execution time of tasks against the available computing resources, so that applications are executed in the shortest time while resource utilization is maximized. The demands a task places on computing resources vary significantly with its characteristics: some tasks demand substantial CPU time for computation, while others are dominated by I/O operations. Meanwhile, many criteria are used to evaluate the effectiveness of task scheduling, for example fairness, low latency, and high throughput, so hardly any single method can balance all these factors at once. If task nodes arrive sequentially over time, the problem is one of dynamic scheduling; if all tasks and task-related information are known before scheduling, it is a static scheduling problem. The scheduling problem is a widely acknowledged NP-hard problem, and researchers have therefore proposed various solutions for different application scenarios. In this paper, we focus on large-scale static task scheduling on a limited number of processors.
DAG scheduling is known as the static task scheduling problem^{[4-9]}. Scheduling can be done statically at compile time, since data about the application, such as task execution times, task dependencies, and communication costs, are known in advance. We model the application as a DAG, with nodes denoting tasks and edges denoting the dependencies between them. A certain number of processors is available to execute these tasks. We also know each task's execution time and the time to transmit data between tasks (if they are assigned to different processors). DAG scheduling aims to assign tasks to processors so that the overall completion time (makespan) is minimized.
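To make the model concrete, here is a minimal, hypothetical encoding of the task model in Python (the task ids, weights, and costs are made-up example data, not from the paper): each node has an execution time (weight) and each edge a communication cost that is paid only when the two endpoint tasks run on different processors.

```python
# w_i: execution time of task i (made-up values for illustration)
weights = {1: 3, 2: 2, 3: 4}
# c_{i,j}: cost of sending task i's output to task j
comm = {(1, 2): 5, (1, 3): 1}

def comm_cost(i, j, proc_of):
    # Communication is free when both tasks share a processor;
    # otherwise the edge's communication cost applies.
    return 0 if proc_of[i] == proc_of[j] else comm.get((i, j), 0)

# With tasks 1 and 2 on the same processor, edge (1, 2) costs nothing:
print(comm_cost(1, 2, {1: "p0", 2: "p0", 3: "p1"}))  # 0
print(comm_cost(1, 3, {1: "p0", 2: "p0", 3: "p1"}))  # 1
```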
The existing algorithms can be divided into three categories: list scheduling^{[5,7,10-13]}, cluster-based scheduling^{[5,14]}, and task duplication-based scheduling^{[4,6,15]}. Some researchers have also proposed mixed algorithms^{[5,16,17]}. Through analysis, we find that list-based scheduling is simple and easy to implement; however, it leads to a waste of computational resources, which is not acceptable. Duplication-based task scheduling performs well when scheduling small-scale tasks with a thousand nodes or fewer. Nevertheless, once the graph grows large, e.g., to tens of thousands of nodes, it takes a long time because of its usually high algorithmic complexity. Cluster-based task scheduling relies on a bottom-up clustering approach in which atomic tasks form clusters. We argue that the decisions in this method are local and cannot take the global structure of the graph into account. In addition, cluster-based scheduling tends to perform well only with an adequate number of processors, leading to a waste of computational resources. The mixed algorithms^{[5,16,17]} combine task clustering with list-based or duplication-based scheduling. Duplication-based scheduling^{[4,6,15]} allows a single task to be assigned to multiple clusters so that it can be duplicated, which reduces the communication cost. Our approach can also be viewed as a mix of cluster and task duplication scheduling, where the task duplication decisions are constrained by the cluster schedule.
In this paper, we propose a new large-scale Task Deduplication-based Partition Algorithm and Task Duplication (TDPATD) scheduling algorithm to reduce the complexity of the task duplication-based (TDB) scheme^{[4,6,15]} and accelerate large-scale task scheduling on a limited number of processors. TDPATD first applies a DAG partitioning algorithm that clusters tasks with complex dependencies into small task clusters. Subsequently, TDPATD applies an improved task duplication strategy to schedule the task clusters and obtains a better scheduling scheme. Lastly, the scheduling scheme is mapped back onto the large-scale task graph, and fine-grained scheduling optimization is carried out to eliminate duplicate tasks and attain an ideal result.
The main contributions of our TDPATD are as follows:
We compared TDPATD with state-of-the-art algorithms, including TDCA^{[4]}, a task duplication-based algorithm; BL_EST and BL_ETF^{[5]}, list-based scheduling algorithms; and BL_EST_PART and BL_ETF_PART^{[5]}, mixed cluster- and list-based scheduling algorithms. In addition, TDCA with a small clustering-based improvement is also compared. Moreover, a DAG generator is used to cover different types of DAGs when evaluating our algorithms. We have investigated our algorithm on datasets from different sources, and extensive experiments demonstrate that TDPATD achieves better results on large-scale task scheduling, while also achieving satisfactory results on small- and mid-scale scheduling tasks.
The organizational structure of this paper is as follows. In
We use the DAG graph to represent the task model. Let
The computing platform is a homogeneous cluster of identical processing units, called processors, denoted as
Parameters and interpretations

| Parameter | Interpretation |
| --- | --- |
| i, j | Index of a task (node) in the DAG. |
| Parent(i) | Set of parents of task i. |
| Child(i) | Set of children of task i. |
| W | Set of weights for all nodes. |
| w_i | Weight (execution time) of task i. |
| P | Set of processors available. |
| c_{i,j} | Communication cost from task i to task j. |
| EST(i) | Earliest start time: when task i's parents have all completed and their data has arrived, the earliest time at which task i can start. |
| ECT(i) | Earliest completion time: when task i starts at its earliest start time, the time at which it completes. |
| PAT(p) | Based on the current schedule, the time at which processor p becomes available. |
| RST(i) | Based on the current schedule, the runtime start time of task i. |
| RCT(i) | Based on the current schedule, the runtime completion time of task i. |
List and interpretation of formulas

| Eq. No. | Formula for calculation of parameters | Interpretation |
| --- | --- | --- |
| 1 | EST(i) = max_{j ∈ Parent(i)} (ECT(j) + c_{j,i}), with EST(i) = 0 for entry tasks | Earliest start time of task i. |
| 2 | ECT(i) = EST(i) + w_i | Earliest completion time of task i. |
| 3 | PAT(p) = 0 | Processor available time before the first task is assigned to processor p. |
| 4 | PAT(p) = RCT(last task assigned to p) | Processor available time under the current schedule. |
| 5 | RST(i) = max(PAT(p), max_{j ∈ Parent(i)} (RCT(j) + c̄_{j,i})), where c̄_{j,i} = 0 if Δ(j) = p | Runtime start time of task i on processor p. In the simulation, the processor that completed node j is recorded; here Δ represents that processor. |
| 6 | RCT(i) = RST(i) + w_i | The runtime completion time of task i under the current schedule. |
| 7 | makespan = max_i RCT(i) | Overall completion time of the schedule. |
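As a concrete illustration of Eqs. (1) and (2), the following sketch computes EST and ECT by recursion over the parents of each task. The toy DAG, weights, and communication costs are made-up example data, not taken from the paper:

```python
from functools import lru_cache

# Toy DAG: w[i] is the execution time w_i of task i, comm[(i, j)] is the
# communication cost c_{i,j} on the edge from task i to task j.
w = {1: 3, 2: 2, 3: 4, 4: 1}
comm = {(1, 2): 5, (1, 3): 1, (2, 4): 2, (3, 4): 3}
parents = {}
for (i, j) in comm:
    parents.setdefault(j, []).append(i)

@lru_cache(maxsize=None)
def ect(i):
    # Eq. (2): ECT(i) = EST(i) + w_i
    return est(i) + w[i]

@lru_cache(maxsize=None)
def est(i):
    # Eq. (1): entry tasks start at time 0; otherwise a task starts once every
    # parent has finished and that parent's data has arrived.
    if i not in parents:
        return 0
    return max(ect(j) + comm[(j, i)] for j in parents[i])

# Task 4 must wait for the later of its two parents:
print(est(4), ect(4))  # 12 13
```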
Task scheduling is classified into dynamic scheduling and static scheduling, and DAG task scheduling belongs to static task scheduling. Methods for static task scheduling can be roughly divided into three categories: (1) list-based scheduling methods; (2) cluster-based scheduling methods; and (3) task duplication-based scheduling methods.
In list-based scheduling methods^{[5,7,10-13,18,19]}, a priority is first allocated to each task. A priority list is formed in descending order of priority, and the tasks are then assigned to the processors in that order. These algorithms differ in how the priorities are defined or how the tasks are assigned to the processors. Shin
In cluster-based scheduling methods^{[5,21-24]}, the tasks are first partitioned into clusters, and tasks from the same cluster are scheduled as a block. Clusters usually consist of strongly correlated tasks. The essence of the method is that strongly correlated tasks are grouped onto the same processor, where the communication time between them is quite negligible. The clusters are first scheduled onto an unlimited number of processors and then merged down to the number of processors actually available. The cluster-based scheduling scheme works better when the number of available processors does not fall short of the number of clusters.
The underlying logic of the TDB^{[4,6,8,15]} scheduling algorithms is to reduce communication costs by redundantly assigning some tasks to multiple processors. In duplication-based scheduling, different strategies are available for selecting the ancestor nodes to duplicate. Some algorithms clone only the direct ancestors (e.g., TANH^{[6]}), while others try to clone all possible ancestors (e.g., TDCA^{[4]}).
The mixed algorithms^{[5,13,16,25-27]} combine task clustering with list-based or duplication-based scheduling. Existing list-based scheduling algorithms (e.g., LDCP^{[7]}, HEFD^{[13]}) duplicate predecessor tasks in order to reduce communication costs. Duplication-based scheduling^{[4,6]} allows a single task to be assigned to multiple clusters so that it can be duplicated, which reduces the communication cost. Our approach is close to cluster-based scheduling, as we partition the tasks into
We describe the details of TDPATD in this section. The proposed algorithm involves several vital parameters and three phases. In the first phase, TDPATD generates a partition with DAGP^{[28]}, from which we obtain a new graph; the partitioning algorithm guarantees that the new graph is directed and acyclic. In the second phase, we generate an initial scheduling scheme with an improved version of TDCA^{[4]}, since TDCA performs well when there are enough processors. The new graph is then mapped back to the original graph according to the partition information; this step ensures that the execution time does not increase, which will be proved later. Finally, deduplication is performed on the resulting schedule.
Global process diagram of the Task Deduplication-based Partition Algorithm and Task Duplication (TDPATD).
First, we will partition the original DAG graph by a partitioning algorithm and obtain the
In the second stage, the communication time between partitions is calculated. Several candidates are possible: (1) the sum of the communication weights between partitions; (2) the average of the communication weights between partitions; and (3) the maximum communication weight between partitions. The first rule models a processor that can only process tasks and send data serially. The second treats communication as a whole and suits the case where the communication weights are evenly distributed with little spread; however, the new graph is no longer accurate in extreme cases, such as a communication cost in a sub-interval significantly above the average, which would negatively affect our later strategies. We adopt the third option in this paper, which suits a multi-port duplex communication model. The communication cost between two partitions is calculated as follows:
The negative impact of taking the maximum value is optimized away in the deduplication phase. In the third step, the connectivity between partitions is computed. In the new DAG, the node numbers are those of the corresponding partitions, which simplifies the later mapping process; for example, partition 1 and partition 2 correspond to node 1 and node 2 in the new graph. If any node in partition 1 needs to communicate with a node in partition 2, then node 1 needs to communicate with node 2 in the new graph, so the new graph has a directed edge from node 1 to node 2 whose weight is the communication value obtained in the second step.
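The second and third steps together can be sketched as follows. This is a simplified illustration of the maximum-weight rule (option 3): the node-to-partition map and edge costs are made-up example data, and `part[i]` denotes a hypothetical mapping from each original node to its partition id:

```python
from collections import defaultdict

# Original edges with communication costs, and a node -> partition mapping
# (both made up for illustration).
comm = {(1, 2): 4, (1, 3): 9, (2, 4): 6, (3, 4): 2}
part = {1: 0, 2: 0, 3: 1, 4: 1}

# Weight of a partition-to-partition edge = MAX cost over all crossing edges.
new_edges = defaultdict(int)
for (i, j), c in comm.items():
    pi, pj = part[i], part[j]
    if pi != pj:                      # only edges that cross partitions
        new_edges[(pi, pj)] = max(new_edges[(pi, pj)], c)

# Edges (1,3) and (2,4) both cross from partition 0 to partition 1,
# so the new DAG gets one edge (0, 1) with weight max(9, 6) = 9.
print(dict(new_edges))
```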
In the task scheduling phase, our solution is primarily based on TDCA, a duplication-based scheduling algorithm, with some improvements to fit our setting. TDCA performs well when the number of processors is sufficient, but it does not give a suitable solution when all processors are occupied and a large number of tasks remain unassigned. Since we mainly deal with large-scale task scheduling, where the number of tasks is much larger than the number of available processors, we made the following change: we stop TDCA's initialization method when no processors are available; based on the existing schedule, we then calculate the RCT each remaining task would have and assign the task to the processor with the smallest RCT. A more detailed description of the algorithm is given in
Initial Task Array for no processor available
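The modified initialization above can be sketched as follows. This is a deliberately simplified version that ignores communication costs, with made-up task names and durations; the actual algorithm computes RCT from the full schedule via the formulas given earlier:

```python
import heapq

def init_schedule(tasks, durations, num_procs):
    """Assign each remaining task to the processor with the smallest RCT."""
    schedule = {p: [] for p in range(num_procs)}
    # Min-heap of (time the processor becomes free, processor id).
    free_at = [(0, p) for p in range(num_procs)]
    heapq.heapify(free_at)
    for t in tasks:
        rct, p = heapq.heappop(free_at)      # smallest completion time wins
        schedule[p].append(t)
        heapq.heappush(free_at, (rct + durations[t], p))
    return schedule

sched = init_schedule(["a", "b", "c", "d"], {"a": 4, "b": 1, "c": 2, "d": 2}, 2)
print(sched)  # the long task "a" occupies one processor, the rest fill the other
```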
After task initialization, we obtain a task schedule for the new DAG. The task scheduling algorithm is based on TDCA; in this paper, some parameters are tuned to obtain better scheduling results. All the parameters we use and the corresponding formulas are shown in
It is more challenging to further optimize the scheduling scheme on the new graph alone, so before the makespan can be reduced further, the scheduling scheme must be applied back to the original graph. Whereas Section 3.2 generates a new DAG based on partitioning, the work here is to decompose each node of the scheduling scheme into the original-graph nodes that make up the corresponding partition, using the partition information. This allows the makespan to be reduced further. Since our scheduling is based on task duplication, the same task may be processed more than once on different processors. Before mapping back to the original graph, the nodes of the same partition are treated as a block; after mapping back, the task granularity is smaller, and some tasks are executed repeatedly. We can therefore remove the duplicate tasks to improve the global completion time. More specifically, we simulate the execution of tasks on each processor based on the existing task scheduling scheme. When executing to a task
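The core deduplication idea can be sketched as follows. This is a hedged simplification with made-up schedule data: it keeps the first copy of each task encountered in simulated execution order and drops the rest, eliding the feasibility check (that the surviving copy finishes early enough to feed every consumer) which the full algorithm performs:

```python
def deduplicate(schedule):
    """Drop repeated copies of a task across processors, keeping the first."""
    seen = set()
    deduped = {}
    for proc, tasks in schedule.items():
        kept = []
        for t in tasks:
            if t not in seen:          # first copy of this task: keep it
                seen.add(t)
                kept.append(t)
        deduped[proc] = kept           # later copies are removed
    return deduped

# Tasks "a" and "b" were duplicated onto two processors each by the
# duplication-based schedule; deduplication leaves one copy of each.
result = deduplicate({0: ["a", "b"], 1: ["a", "c"], 2: ["b", "d"]})
print(result)
```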
Deduplication
The meanings of the symbols are explained at the beginning,
In the partition stage, we use the DAGP^{[28]} partitioning algorithm, which has a time complexity of
To sum up, the overall time complexity of TDPATD is
We will use a small instance to illustrate the process of TDPATD in this section. The DAG we use is shown in
The origin DAG
Weight of nodes for origin DAG
| Node | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Weight | 62 | 67 | 88 | 43 | 40 | 75 | 80 | 74 | 40 | 60 | 69 | 63 | 78 | 39 | 31 |
The cost of communication from node
The new DAG
Weight of nodes for new DAG
| Node | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Weight | 233 | 80 | 144 | 204 | 248 |
Next, we schedule
Scheduling result for new DAG
Mapping result for origin DAG
During the deduplication phase, we simulate the execution of the scheduling scheme on the processors. If we find any tasks that match the conditions, we remove them and update the scheduling scheme. First, we simulate the execution of processor 1. It can be seen that processor 1 can complete all its assigned tasks without receiving data from other processors. We can get
Deduplication result (makespan = 617).
We have evaluated instances from two sources. The first set is produced by our generator program, which creates the various types of graphs we need; the generation process is described in detail later. The second set is from the work of Lin
There are several important parameters in our generation procedure, which we explain in detail. The first is the number of tasks
The logic of the algorithm is relatively simple. First, all tasks that can be executed immediately are put into the ready queue; after initialization, these are usually the entry tasks. All completed tasks are put into the completion queue. A task is put into the waiting queue if any of its parents is not yet in the completion queue. When a processor becomes available, the first task in the ready queue is assigned to it.
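The ready-queue loop described above can be sketched as follows. The dependency data is made up for illustration, and processor timing is abstracted into a simple execution order:

```python
from collections import deque

def list_schedule(parents, all_tasks):
    """Return an execution order: the ready-queue head runs whenever a
    processor frees up; tasks wait until all their parents have completed."""
    done = set()
    ready = deque(t for t in all_tasks if not parents.get(t))  # entry tasks
    waiting = [t for t in all_tasks if parents.get(t)]
    order = []
    while ready:
        t = ready.popleft()           # next free processor takes the head
        order.append(t)
        done.add(t)
        for candidate in waiting[:]:  # promote tasks whose parents finished
            if all(p in done for p in parents[candidate]):
                waiting.remove(candidate)
                ready.append(candidate)
    return order

result = list_schedule({"c": ["a", "b"], "d": ["c"]}, ["a", "b", "c", "d"])
print(result)  # "c" waits for both "a" and "b"; "d" waits for "c"
```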
BL_ETF is a dynamic priority-based list scheduler. For tasks in the ready queue, the algorithm calculates the earliest start time of each task on each processor. Unlike BL_EST, which assigns processors directly, ETF considers the actual time at which a task gets executed on a processor, which yields relatively better schedules than BL_EST; however, its time complexity is higher.
These two are mixed cluster- and list-based scheduling algorithms. The main idea is to partition the tasks into a certain number of clusters first and then schedule them. The detailed scheduling procedure is basically similar to BL_EST and BL_ETF, except that tasks belonging to the same cluster are assigned to the same processor. This is an algorithm with balanced performance.
This is half of our algorithm, without the deduplication step. B_Deduplication (before deduplication) mainly contains our optimized task-queue initialization. We set this baseline to emphasize the contribution of each part of our work.
The parameters of the DAG graphs and the parameters of the algorithm involved in the experiments are as follows:
Number of nodes n ∈ {4000, 5000, 6000, 7000, 8000, 9000, 10000}.
Communication-computation ratio CCR ∈ {0.5, 1, 1.5, 2, 2.5, 5, 7}.
Processor number proc ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20}.
Partition number part ∈ {400, 500, 600, 700, 800, 900, 1000, 1200}.
In the first experiment, we evaluated the effect of different numbers of partitions on the performance of our strategy. Note that we cannot exhaustively explore all possibilities due to the large parameter space; in each experiment, we therefore fixed reasonable configurations for everything except the variable under investigation. As shown in
Comparison of makespan on the number of
Comparison of makespan on realistic dataset bert 128.
In the second experiment, we investigated the effect of CCR on makespan. As shown in
Comparison of makespan on CCR. TDPATD prefers small CCR.
We investigated the effect of the number of processors on the makespan in the third experiment. We found that the baseline algorithms are not very sensitive to the number of processors: as the number of processors we set increases, the actual number of occupied processors settles at a specific value and produces no additional variation. A possible reason for this phenomenon is that the communication time incurred by scheduling a cluster onto another processor exceeds the computation time on the current processor, because communication on the same processor is almost negligible. The results are shown in
Comparison of makespan on the number of processors. TDPATD obtains better results as the number of available processors grows.
In the fourth experiment, we investigated the effect of the number of nodes on the makespan.
Comparison of makespan on the number of nodes.
In addition, we compare with a traditional duplication-based scheduling algorithm. When TDCA^{[4]} runs on graphs with tens of thousands of nodes (large-scale DAGs), it takes too long, so we tested it on small-scale graphs. The results are shown in
Comparison of makespan and running time for TDCA and TDPATD on a graph with 1000 nodes. TDPATD has outstanding advantages in running speed.
Results for makespan with different number of nodes on the bert dataset. TDPATD has practically comparable results on the realistic data set.
In general, the experiments largely met our expectations. Compared with traditional algorithms, the advantage of our algorithm is self-evident when the graph is larger and fewer processors are available. The number of partitions and the CCR have a significant impact on our algorithm; the impact of the number of nodes depends on the number of partitions; and the number of processors affects the results to some extent.
For large-scale DAG task scheduling, we propose a new algorithm that mixes cluster-based and duplication-based scheduling. Cluster-based scheduling can significantly reduce the complexity and running time of the algorithm on large-scale scheduling tasks; however, it loses some scheduling effectiveness as a consequence. Therefore, we apply duplication-based task scheduling on top of clustering to reduce the loss of scheduling effectiveness due to clustering. Moreover, we optimize the task queue initialization strategy of TDCA to make it more suitable for our algorithm. In addition, we define several new parameters
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Huang W, Shi Z, Xiao Z
Funding acquisition, project administration, provide research resources, and supervision: Chen C, Li K
Not applicable.
This work was supported by Natural Science Foundation of Hunan Province (No. 2020JJ5083).
All authors declared that there are no conflicts of interest.
Not applicable.
Not applicable.
© The Author(s) 2021.