
## Real-Time Scheduling on Heterogeneous Multicore Processors

Cyber-Physical Systems, Multicore Processors, Real-Time Scheduling

By Bjorn Andersson, Senior Member of the Technical Staff, Research, Technology & System Solutions

Many DoD computing systems—particularly cyber-physical systems—are subject to stringent size, weight, and power requirements. The quantity of sensor readings and functionalities is also increasing, and their associated processing must fulfill real-time requirements. This situation motivates the need for computers with greater processing capacity. For example, to fulfill the requirements of nano-sized unmanned aerial vehicles (UAVs), developers must choose a computer platform that offers significant processing capacity and use its processing resources to meet the needs of autonomous surveillance missions. This blog post discusses these issues and highlights our research that addresses them.

To choose a computer platform that offers greater capacity, it is necessary to observe the major trends among chip makers. Historically, advances in semiconductor miniaturization (Moore's Law) periodically yielded microprocessors with significantly greater clock speeds. Unfortunately, microprocessor serial processing speed is reaching a physical limit due to excessive power consumption. As a result, semiconductor manufacturers now produce chips that do not increase the clock speed but instead increase the number of processor cores on a chip, resulting in multicore processors. For nearly a decade, the use of *homogeneous* multicore processors (chips with identical processing cores) gave us some headroom in terms of power consumption and allowed us to enjoy greater computing capacity. This headroom is diminishing, unfortunately, and is about to vanish, forcing semiconductor manufacturers to seek new solutions.

We are currently witnessing a shift among semiconductor manufacturers from homogeneous multicore processors with identical processor cores to *heterogeneous* multicore processors. The impetus for this shift is that processor cores tailored for a specific class of application behaviors have the potential to offer much better power efficiency. AMD Fusion and NVIDIA Tegra 3 are examples of this shift, as is Intel Sandy Bridge, which integrates a graphics processor onto the same chip as the conventional processor cores.

In a heterogeneous multicore environment, the execution time of a software task depends on which processor core it executes on. For example, a software task performing computer graphics rendering, simulating physics, or estimating trajectories of flying objects runs much faster on a graphics processor than on a normal processor. Conversely, some software tasks are inherently sequential and cannot benefit from the graphics processor; they execute much faster on a normal processor. For example, a software task with many branches and no inherent parallelism runs much faster on a normal processor than on a graphics processor. Ideally, each task would be assigned to the processor where it executes with the greatest speed, but unfortunately the workload is often not perfectly balanced to the types of processor cores available.

Efficient use of processing capacity in the new generation of microprocessors therefore requires that tasks be assigned to processors intelligently. In this context, "intelligently" means that the resources requested by the program match those possessed by the processor. Moreover, the desire for short design cycles, rapid fielding, and upgrades necessitates that task assignment be done automatically—with algorithms and associated tools.

**THE TASK ASSIGNMENT PROBLEM**

The problem of assigning tasks to processors can be described as follows: A task (such as computer graphics rendering, or a program determining whether the half-or-triple-plus-one process reaches one from a known starting value) is described by its processor utilization, but it has different utilizations on different processors. For example, a given task may have a utilization of 10 percent if assigned to a graphics processor but 70 percent if assigned to a normal processor. We are interested in assigning each task to exactly one processor such that, for each processor, the sum of the utilizations of all tasks assigned to it does not exceed 100 percent. If we can find such an assignment, then it is known that if tasks have deadlines described by the implicit-deadline sporadic task model—and if the Earliest-Deadline-First (EDF) scheduling algorithm is used—then all deadlines will be met at run-time (with a minor modification, we can use Rate-Monotonic scheduling as well).
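
As a minimal sketch in Python (with hypothetical utilization numbers), the feasibility condition above, that each processor's total utilization stays at or below 100 percent, can be checked like this:

```python
# Check whether a given task-to-processor assignment is feasible:
# on every processor, the total utilization must not exceed 100 percent.
# The utilization values below are made up for illustration.

def assignment_feasible(utilization, assignment, num_processors):
    """utilization[i][p]: utilization (percent) of task i on processor p.
    assignment[i]: the processor chosen for task i."""
    load = [0.0] * num_processors
    for task, proc in enumerate(assignment):
        load[proc] += utilization[task][proc]
    return all(total <= 100.0 for total in load)

# Two tasks, two processors (0 is a GPU-like core, 1 a normal core).
utilization = [
    [10.0, 70.0],  # the rendering-style task from the example above
    [90.0, 40.0],  # a mostly sequential task
]
print(assignment_feasible(utilization, [0, 1], 2))  # True: loads are 10% and 40%
print(assignment_feasible(utilization, [1, 1], 2))  # False: processor 1 carries 110%
```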

**PREVIOUS APPROACHES FOR TASK ASSIGNMENT**

The task assignment problem belongs to a class of problems that are computationally intractable (NP-hard), meaning it is highly unlikely that any algorithm can both always find a good assignment and always run fast. We must therefore choose: design an algorithm that always finds a good assignment, or one that always runs fast. For the goal of always finding a good assignment, task assignment can be modeled as an integer linear program (ILP) as follows:

Minimize z

subject to the constraints that

for each processor p: x_{1,p}*u_{1,p} + x_{2,p}*u_{2,p} + … + x_{n,p}*u_{n,p} <= z

and

for each task i: x_{i,1} + x_{i,2} + … + x_{i,m} = 1

and

for each pair (i,p) of task i and processor p: x_{i,p} is either 0 or 1

In the optimization problem above, n is the number of tasks, m is the number of processors, and u_{i,p} is the utilization of task i if it were assigned to processor p. x_{i,p} is a decision variable: it is one if task i is assigned to processor p and zero otherwise.

Unfortunately, solving this integer linear program can take a very long time; in the worst case, the running time grows exponentially with the number of tasks.
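
To make that cost concrete, here is a brute-force Python sketch that computes the same objective as the ILP above (minimize the maximum per-processor utilization z) by enumerating all m^n possible assignments; the exponential enumeration is exactly why exact solutions become impractical as n grows. The utilization numbers are hypothetical.

```python
# Exact solution by exhaustive search: correct, but exponential in the
# number of tasks, illustrating why exact ILP-style solving is slow.
from itertools import product

def min_max_utilization(utilization):
    """utilization[i][p]: utilization (percent) of task i on processor p.
    Returns (z, assignment), where z is the minimized maximum
    per-processor utilization, as in the ILP above."""
    n = len(utilization)
    m = len(utilization[0])
    best_z, best_assignment = float("inf"), None
    for assignment in product(range(m), repeat=n):  # m^n candidates
        load = [0.0] * m
        for task, proc in enumerate(assignment):
            load[proc] += utilization[task][proc]
        z = max(load)
        if z < best_z:
            best_z, best_assignment = z, assignment
    return best_z, best_assignment

utilization = [[10.0, 70.0], [90.0, 40.0], [30.0, 30.0]]
z, assignment = min_max_utilization(utilization)
print(z, assignment)  # 40.0 (0, 1, 0): tasks 0 and 2 on processor 0
```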

To design an algorithm that always runs reasonably fast, there are several algorithms, described in a research paper by Sanjoy K. Baruah, that transform the ILP into a linear program (LP) and then perform certain tricks. Although these LP-based algorithms run faster than solving the ILP, they still must solve an optimization problem, which can be time-consuming. To design even faster algorithms, we would like to perform task assignment in a way that does not require solving an LP at all.

**OUR APPROACH FOR TASK ASSIGNMENT**

Previous work on task assignment for homogeneous multicore processors (where all processor cores are identical) is based on a framework called bin-packing heuristics. Such algorithms work approximately as follows:

1. Sort tasks according to some criterion.

2. **for** each task **do**

3. **for** each processor **do**

4. **if** the task has not yet been assigned and it is possible to assign the task to the processor so that the sum of utilization of tasks on the processor does not exceed 100 percent **then**

5. Assign the task to the processor

6. **end if**

7. **end for**

8. **end for**
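
For the homogeneous case, the steps above can be sketched in Python as a first-fit heuristic. The decreasing-utilization sort in step 1 and the early return on failure are illustrative choices, not prescriptions from the algorithm under study:

```python
# First-fit decreasing, a classic bin-packing heuristic: sort tasks by
# decreasing utilization (step 1), then place each task on the first
# processor whose load stays at or below 100 percent (steps 2-8).

def first_fit_decreasing(utilizations, num_processors):
    """utilizations: per-task utilization in percent (identical cores).
    Returns a task -> processor map, or None if some task does not fit."""
    load = [0.0] * num_processors
    assignment = {}
    order = sorted(range(len(utilizations)), key=lambda i: -utilizations[i])
    for task in order:
        for proc in range(num_processors):
            if load[proc] + utilizations[task] <= 100.0:
                load[proc] += utilizations[task]
                assignment[task] = proc
                break                # task assigned; move to the next task
        else:
            return None              # no processor could take this task
    return assignment

print(first_fit_decreasing([70.0, 60.0, 40.0, 30.0], 2))
# {0: 0, 1: 1, 2: 1, 3: 0} -- both processors end up fully utilized
```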

Our approach involves adapting bin-packing heuristics to heterogeneous multicore processors. We believe it is possible to modify the algorithm structure outlined above so that tasks can be assigned to processors even when the utilization of a task depends on the processor to which it is assigned. One can show that bin-packing can perform poorly if processors and tasks are not considered in a particular order. Specifically, such an approach can fail on a task set that could have been assigned, even when given processors that are "infinitely" faster. One of our main research challenges is therefore to determine how to sort tasks (step 1) and in which order to consider processors (step 3). We are evaluating our new algorithms in the following ways:

- We plan to prove (mathematically) performance bounds for our new algorithms. Specifically, we are interested in proving that if it is possible to assign a task set to the given processors, then our algorithm will also succeed, provided it is given processors that are x times as fast. Here x is our performance metric; the lower its value, the better.
- We also plan to evaluate the performance of our algorithms by applying the algorithms on randomly-generated task sets. This will demonstrate the “typical” behavior of the algorithms.
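
As an illustration only (this is not the algorithm under study, merely one plausible way to fill in steps 1 and 3), a heterogeneous variant might sort tasks by how strongly they prefer one processor over another and then try each task's processors from cheapest to most expensive:

```python
# Hypothetical heterogeneous first-fit heuristic. Tasks with the largest
# penalty for a "wrong" placement (highest max/min utilization ratio)
# are assigned first; processors are tried in order of increasing
# utilization for that task. Assumes all utilizations are positive.

def heterogeneous_first_fit(utilization):
    """utilization[i][p]: utilization (percent) of task i on processor p.
    Returns a task -> processor map, or None on failure."""
    m = len(utilization[0])
    load = [0.0] * m
    assignment = {}
    order = sorted(range(len(utilization)),
                   key=lambda i: -(max(utilization[i]) / min(utilization[i])))
    for task in order:
        for proc in sorted(range(m), key=lambda p: utilization[task][p]):
            if load[proc] + utilization[task][proc] <= 100.0:
                load[proc] += utilization[task][proc]
                assignment[task] = proc
                break
        else:
            return None
    return assignment

print(heterogeneous_first_fit([[10.0, 70.0], [90.0, 40.0], [30.0, 30.0]]))
# {0: 0, 1: 1, 2: 0} -- each task lands on a core it runs cheaply on
```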

**CONCLUSION**

Most semiconductor manufacturers are shifting towards heterogeneous multicore processors to offer greater computing capacity while keeping power consumption sufficiently low. But using a heterogeneous multicore processor efficiently for cyber-physical systems with stringent size, weight, and power requirements requires that tasks be assigned properly. This blog post has discussed the state of the art and summarized our ongoing work in this area.

**ADDITIONAL RESOURCES**

To read the paper “Partitioning real-time tasks among heterogeneous multiprocessors” by Sanjoy Baruah, please visit

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1327956

To read the paper "Assigning Real-Time Tasks on Heterogeneous Multiprocessors with Two Unrelated Types of Processors", please visit

http://www.cister.isep.ipp.pt/activities/seminars/%28S%28otidyq454nfyy255alnrdb3m%29%29/GetFile.aspx?File=%2Fspringseminars2011%2Frtss10_het2.pdf

Mar 12, 2012 at 8:48 PM For current, practical defense embedded systems, it would be better to satisfice the processing performance and optimize on another criterion, such as power consumption. Power-aware computing is a significant problem for both commercial and defense handheld devices.

Mar 12, 2012 at 11:29 PM Thanks for your post. One can actually formulate an integer linear program (ILP) for performing task assignment with the objective of minimizing power as well (see details below). An interesting research question would be how to modify it into an approximation scheme that runs really fast. I have been toying with different models for describing tasks and their resource usage on heterogeneous multicores. If you have suggestions on what would be relevant to you, please let me know.

An ILP for minimizing power

Minimize \sum_{p=1..m} \sum_{i=1..n} x_{i,p}*power_{i,p}

subject to the constraints that

for each processor p: x_{1,p}*u_{1,p} + x_{2,p}*u_{2,p} + … + x_{n,p}*u_{n,p} <= 1

and

for each task i: x_{i,1} + x_{i,2} + … + x_{i,m} = 1

and

for each pair (i,p) of task i and processor p: x_{i,p} is either 0 or 1

where

power_{i,p} is the power consumption of task i if it is assigned to processor p.
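
A brute-force Python sketch (with hypothetical numbers) of what this power-minimizing ILP computes: among all assignments that keep every processor at or below full utilization, pick the one with the lowest total power.

```python
# Exhaustive search over all m^n assignments for the minimum-power
# feasible assignment; utilizations here are fractions of a core (<= 1),
# matching the power ILP's constraints. Illustration only.
from itertools import product

def min_power_assignment(utilization, power):
    """utilization[i][p], power[i][p]: cost of task i on processor p.
    Returns (total power, assignment) of the cheapest feasible choice."""
    n, m = len(utilization), len(utilization[0])
    best = (float("inf"), None)
    for assignment in product(range(m), repeat=n):
        load = [0.0] * m
        total_power = 0.0
        for task, proc in enumerate(assignment):
            load[proc] += utilization[task][proc]
            total_power += power[task][proc]
        if max(load) <= 1.0 and total_power < best[0]:
            best = (total_power, assignment)
    return best

utilization = [[0.1, 0.7], [0.9, 0.4]]
power = [[5.0, 2.0], [4.0, 3.0]]
print(min_power_assignment(utilization, power))  # (6.0, (1, 0))
```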

Mar 15, 2012 at 11:04 AM Considering real-world examples of such output for "We also plan to evaluate the performance of our algorithms by applying the algorithms on randomly-generated task sets. This will demonstrate the 'typical' behavior of the algorithms":

I would add that it's effective to see this introduced after, or at the same time as, calculating Amdahl's law, so you can weigh whether a given task would benefit from distribution. If you assume you have more than one processor, then not including Amdahl would seem ineffective as a sub-component to determine whether T can be divided into a set of N tasks Vn of set T, if optimal by some weight of output from calculating Amdahl. This assumes you want task T to complete in a target time or not. So your prioritization algorithm becomes recursive and must limit depth of recursion based on sub-sets of tasks, sub-sets of sub-sets of tasks, etc.

Assume you can distribute the processing code with the data set you want to process for a given task within your heterogeneous/homogeneous environment.

Add property tags that provide go or no-go compatibility with a given set of processors p, based on task (which provides classification and ability to tag).

Data classification, prioritization, grouping, and security can each be provided through associative property comparisons, with properties assigned to processor, task, and possibly sub-task ahead of time. To handle the power situation, one merely needs to know the usage per tick (or some unit of time) to calculate process time vs. task to determine the power usage, but you'd require the property assigned to the processor in the math (a data-driven process).

Either way, I assume that any definition of task in the context of what you are talking about is a sub-task or most granular non-distributable set of operations; otherwise, a model using pure Amdahl's law to choose which tasks to separate (based on data, of course) will prove more effective than this process when the data set is highly distributable.

Yet, based on the target of your goal set, your model should demonstrate a higher effective process on the same data; thus Amdahl must be considered, or your definition of task must be defined and not arbitrary.

Mar 15, 2012 at 10:14 PM I assume that you have been working with Peter Feiler on this. Back when we were working on a DARPA BAA with John Lehoczky, Peter was looking at multi-variant scheduling theory by (working from memory here) working in cost/benefit space for QoS-aware scheduling - it was part of CMU's DASADA offering. It primarily looked at scheduling a single resource based on multiple quality attributes, but there should not be any particular reason why it would not be extensible to a multi-processor environment, even for heterogeneous processors. It depended on coming up with both a cost function (in terms of the resources you were optimizing against) and a quality/benefit function.

Mar 16, 2012 at 1:43 PM I don't know Peter, but this is a basic data-mining approach to optimizing results for a given set of search criteria.

If heterogeneous processors have benefits, where is the control case that would allow a comparative statement? The definition of the task size would be required.