



Barcelona Supercomputing Center Centro Nacional de Supercomputación

# From the latency to the throughput age

Jesús Labarta BSC

> Keynote @ IWOMP 2016 Nara, October 7<sup>th</sup> 2016

### Vision

### ( The multicore and memory revolution

- ISA leak ...
- Plethora of architectures
  - Heterogeneity
  - Memory hierarchies

#### (Complexity +

(( + variability =

tro Nacional de Supercomputación

- ( = Divergence ...
  - ... between our mental models and actual system behavior

and the ISA interface to leak  $\rightarrow$  our world is shaking **Applications** ISA / API

The power wall made us go multicore

#### What programmers need ? **HOPE !!!**



### The programming revolution

( An age changing revolution

- From the latency age ...

- I need something ... I need it now !!!
- Performance dominated by latency in a broad sense
  - At all levels: sequential and parallel
  - Memory and communication, control flow, synchronizations
- ...to the throughput age
  - Ability to instantiate "lots" of work and avoid stalling for specific requests
    - I need this and that ... and as long as it keeps coming I am ok
    - (Much broader interpretation than just GPU computing !!)
  - Performance dominated by overall availability/balance of resources



### Vision: direction ?

Program logic independent of computing platform

General purpose Task based Concurrency + data

Intelligent runtime Parallelization Data management, Dynamic resource management Interoperability Coordination with OS, ... Re introduce "seny" Decouple, insight A quiet revolution Applications PM: High-level, clean, abstract interface Power to the runtime ISA / API





**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# STARSS



### The StarSs family of programming models

#### (Key concept

- Sequential task based program on single address/name space + directionality annotations
- Happens to execute parallel: Automatic run time computation of dependencies between tasks

#### ( Differentiation of StarSs

- Dependences: Tasks instantiated but not ready. Order IS defined
  - Lookahead
    - Avoid stalling the main control flow when a computation depending on previous tasks is reached
    - Possibility to "see" the future searching for further potential concurrency
  - Dependences built from data access specification
- Locality aware
  - Without defining new concepts
- Homogenizing heterogeneity
  - Device specific tasks but homogeneous program logic







**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# OMPSS

### OmpSs

#### ( Experimental platform

- Compiler, runtime, applications

#### ( Forerunner for OpenMP

- "extending" OpenMP
- "following" OpenMP

#### ( Minimalist set of concepts ...

- ... relaxing StarSs functional model
- ... still looking for elegance and fundamentals
- ... aggressively give "power to the runtime"



| Minimalist set of conce                                     | Color code: OpenMP, influenced OpenMP, pushing, not        |
|-------------------------------------------------------------|------------------------------------------------------------|
| [ concurrent ()] [co                                        |                                                            |
| <pre>#pragma omp taskwait [ { in   out   inout</pre>        | :} () ] [noflush]                                          |
| <pre>#pragma omp taskloop [grainsize()]    {for_loop}</pre> | [num_tasks() [nogroup] [ in ()] [reduction(identifier : li |
| #pragma omp target device ({ smp   ope                      | encl   cuda }) \                                           |
| [ implements ( function_name )]                             |                                                            |

BSC

A. Duran, et al, "Extending the OpenMP Tasking Model to Allow Dependent Tasks" IWOMP 2008, LNCS & IJPP

Barcelona Supercomputing Center Centro Nacional da E. Ayguade, et al, "A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures" IWOMP 2009 & IJPP

### **OpenMP** compatibility

#### ( Follow OpenMP syntax

- For adopted OmpSs features
- Adapt semantics for OpenMP features. Ensure High compatibility

#pragma omp parallel // ignore

#pragma omp for [ shared(...)][private(...)][firstprivate(...)][schedule\_clause] // ~ taskloop
{for\_loop}

#pragma omp task [depend (type: list)]



### Dependences vs OpenMP

#### **((**Regions

- Runtime versions that handle partial overlap.
- Overhead but useful.

#### (( | values

Mechanisms to reintroduce size if used with regions

#### (Commutative, concurrent

- relaxed inouts: possible reorders between those in a linear chain

#### **(** Reductions

Concurrent + privatization + associative & commutative operation

#### ( Multidependences

Variable number of in/outs







in (a[i:j])





### Regions



#### inout





#### Commutative BS Tasks executed vec ... out of order but not concurrently ... sum sum sum sum print for (int j; j < N; j + = BS) actual\_size = (N- j> BS ? BS: N-j); #pragma omp task in(vec[j;actual\_size]) commutative(result) for (int count = 0; count < actual\_size; count ++, j++)</pre> result += f(&vec[j], actual\_size) ; No mutual exclusion required #pragma omp task input (result) printf ("TOTAL is %d\n", result); Barcelona #pragma omp taskwait Supercomputing BSC Center 17 Centro Nacional de Supercomputación

### Concurrent, Commutative

#### ( Flexibility in execution orders

- Many other tasks can interleave
- Still maintain individual dependence relationships between tasks involved in the inout chain and the "outside world"
- May be interprocedural





### Task reductions

#### ( Task reductions

- While-loops, recursions



### Task reductions

#### ( Array reductions

 Typical pattern: reductions on large arrays with indirection

### ( Implementation

- Privatization becomes inefficient when scaling cores and data size
- Atomics can introduce significant overhead
- PIBOR Proposal: Privatization with in-lined block-ordered reductions
  - Save footprint
  - Trade processor cycles for locality
- Generalization of implementations
  - Inspector executor based, ...





### Homogenizing Heterogeneity

- ( ISA heterogeneity
- ( Single address space program ... executes in several non coherent address spaces
  - Copy clauses:
    - ensure sequentially consistent copy accessible in the address space where task is going to be executed
    - Requires precise specification of data accessed (e.g. array sections)
  - Runtime offloads data and computation and manages consistency

### (Kernel based programming

- Separation of iteration space identification and loop body

```
#pragma omp target device ({ smp | opencl | cuda }) \
      [ <u>copy_deps</u> | no_copy_deps ] [ copy_in ( array_spec ,...)] [ copy_out (...)] [ copy_inout (...)] } \
      [ implements ( function_name )] \
      [shmem(...) ] \
      [ndrange (dim, g_array, l_array)]
```

#pragma omp taskwait [ on (...) ][noflush]



## OmpSs@CUDA/OpenCL

#### ( Automatic memory management and kernel synchronization

main.c

#pragma target device (smp) copy\_deps
#pragma omp task input ([size] c) output ([size] b)
void scale\_task (double \*b, double \*c, double scalar, int size)

for (int j=0; j < size; j++) b[j] = scalar\*c[j];</pre>

#pragma target device (cuda) copy\_deps ndrange(1, size, 128)
#pragma omp task input ([size] c) output ([size] b)
\_\_\_global\_ void scale\_task\_cuda (double \*b, double \*c, double scalar, int size);



### Homogenizing Heterogeneity

#pragma omp target device(opencl) ndrange(1,size,128) copy\_deps implements (calculate\_forces)
#pragma omp task out([size] out) in([npart] part)

\_kernel void calculate\_force\_opencl(int size, float time, int npart, \_\_global Part\* part, \_\_global Part\* out, int gid);

#pragma omp target device(cuda) ndrange(1,size,128) copy\_deps implements (calculate\_forces)
#pragma omp task out([size] out) in([npart] part)

\_\_global\_\_\_ void calculate\_force\_cuda(int size, float time, int npar, Part\* part, Particle \*out, int gid);

#pragma omp target device(smp) copy\_deps
#pragma omp task out([size] out) in([npart] part)
void calculate\_forces(int size, float time, int npart, Part\* part, Particle \*out, int gid);

void Particle\_array\_calculate\_forces(Particle\* input, Particle \*output, int npart, float time) {
 for (int i = 0; i < npart; i += BS )
 calculate\_forces(BS, time, npart, input, &output[i], i);</pre>



}

### MACC (Mercurium ACcelerator Compiler)

- ( "OpenMP 4.0 accelerator directives" compiler
  - Generates OmpSs code + CUDA kernels (for Intel & Power8 + GPUs)
  - Propose clauses that improve kernel performance
- ( Extended semantics
  - Change in mentality ... minor details make a difference
  - Dynamic parallelism







G. Ozen et al, "On the roles of the programmer, the compiler and the runtime system when facing accelerators in OpenMP 4.0" IWOMP 2014

G. Ozen et al, "Multiple Target Work-sharing support for the OpenMP Accelerator Model" submitted

## Interoperability: MPI and OmpSs

( Taskifying MPI calls

#### ( Potential issues

- Deadlocks if blocking resources
  - → virtualize MPI engine
  - Nanos6 on pthreads, argobots,...
- Matching if executed out of order
- ( Throughput oriented Opportunity
  - Overlap between phases
    - Grid and frequency domain
  - Provide laxity for communications
    - Tolerate poorer communication
  - Shift load balance issue
    - Eliminate serialization
    - Increase granularity
  - Huge flexibility for changing behavior with minimal syntactic changes

V. Marjanovic, et al, "Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach" ICS 2010

IFS weather code kernel. ECMRWF





### Hybrid Amdahl's law

### ( A fairly "bad message" for programmers

- ( Significant non parallelized part
  - MPI calls + pack/unpack

#### MAXW-DGTD



Y

0





#### ( MPI + OmpSs: Hope for lazy programmers



### **MPI** offload

Centro Nacional de Supercomputación





**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# COMPILER AND RUNTIME

## Scheduling

- ( Locality aware scheduling
  - Affinity to core/node/device can be computed based on pragmas and knowledge of where was data
  - Following dependences reduces data movement
  - Interaction between locality and load balance (work-stealing)



- ( Some "reasonable" criteria
  - Task instantiation order is typically a fair criteria
  - Honor previous scheduling decisions when using nesting
    - Ensure a minimum amount of resources
    - Prioritize continuation of a father task in a taskwait when synchronization fulfilled



### Criticality-awareness in heterogeneous architectures

- ( Heterogeneous multicores
  - ARM big.LITTLE 4 A-15@2GHz; 4A-7@1.4GHz
  - Tasksim simulator: 16-256 cores; 2-4x
- ( Runtime approximation of critical path
  - Implementable, small overhead that pay off
  - Approximation is enough

1

5 Number of big cores Total number of cores

Centro Nacional de Supercomputación

2

3

3

4

SS FLEX

1.35 BF (bars)

1.3

1.25

1.15 em 1.1 1.05

1

1

Center

ent over 1.2

( Higher benefits the more cores, the more big cores, the higher performance ratio

2.8

2.6

2.4 2.2

2

1.8 1.6



| Kernel             | #Tasks | Measured Perf.<br>Ratio |
|--------------------|--------|-------------------------|
| Cholesky 8x8       | 120    |                         |
| Cholesky 16x16     | 816    | 2.23                    |
| Integral Histogram | 2048   | 1.71                    |
|                    |        |                         |
| QR                 | 1496   | 4.26                    |
| Heat difussion     | 5124   | 2.83                    |





### OmpSs: the potential of data access information

- ( Highly NUMA machine
  - Asynchrony with serialized initialization
  - Effect of parallel initialization and first touch

 Copy to workarray. Change of association

- NUMA aware workarray allocation





### **FPGAs**



#### **Device management mechanisms**

- ( Improvements in runtime mechanisms
  - Use of **multiple streams**
  - High asynchrony and overlap (transfers and kernels)
  - Overlap kernels
  - Take overheads out of the critical path
- ( Improvement in schedulers
  - Late binding of locality aware decisions
  - Propagate priorities

| egacy behavior  | devid       | ce sync de | evice sync | device sync | levice svn | c device | e svnc |
|-----------------|-------------|------------|------------|-------------|------------|----------|--------|
| t1 cycle        | HtD t1      | Run t1     | DtH t1     |             |            |          |        |
| t2 cycle        |             | HtD t2     | Run t2     | DtH         | :2         |          |        |
| t3 cycle        | ]           |            | HtD t3     | Run ti      | B Di       | tH t3    |        |
| t4 cycle        | 1           |            |            |             | Ru         | n t4     | DtH t4 |
| synchronous beh | avior @ GPU |            |            |             |            |          |        |
| t1 cycle        | HtD t1      | Run t1     | DtH t1     |             |            |          |        |
| t2 cycle        | ]           | HtD t2     | Run t2     | DtH t2      |            |          |        |
| t3 cycle        | ]           |            | HtD t3     | Run t3      | DtH t3     |          |        |
| t4 cycle        | Run t4      | DtH t4     |            |             |            |          |        |
| synchronous beh | avior @ MIC |            |            |             |            |          |        |
| t1 cycle        | HtD t1      | Run t1     | DtH t1     |             |            |          |        |
| t2 cycle        | HtD t2      | Run t2     | DtH t2     |             |            |          |        |
| t3 cycle        | HtD         | t3 Run     | t3 DtH t3  |             |            |          |        |
| t4 cycle        | Run t4      | DtH t4     |            |             |            |          |        |
|                 | time        | <b></b>    |            |             |            |          |        |



### **Dynamic Load Balancing**

( Dynamically shift cores between processes in node

- Enabled by application malleability (task based)
- Runtime in control (sees what happens and the future)
- Would greatly benefit from OS support (handoff scheduling, fast context switching)







**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# THE REAL REVOLUTION

# The real revolution

- ( Task based
  - OpenMP OK

#### ( "Proper" model does not guarantee "proper" programs

- Flexibility  $\rightarrow$  can be used "wrong"
- Possible to write an MPI program in OpenMP syntax

#### ( Revolution: is in the mindset of programmers

- "Forget" about hardware, resources
  - rely on the runtime system
- Focus on program logic
- Methodology
  - Top down programming methodology
  - Throughput oriented:
    - try not to stall !
    - First order, then overhead
  - Think global:
    - of potentials rather than how-to's
    - may be unprecise
  - Specify local:
    - needs and outcomes of the functionality being written
    - precise



# **Programming practices**

- ( What to avoid
  - Threads
    - Omp\_num\_threads
    - Thread\_private
    - Parallel, barrier
  - separate parallel and serial implementation
  - Ifdefs
  - Infer too much from scaling plots
  - Worry too early about actual performance

#### ( What to try

- Top down & nesting
- Lookahead
  - synchronizing tasks, handle control flow dependences
- Malleability
- Taskify communications
  - Overlap computation, other communications, shift critical path



foo() { if (small) for(;;) {...}; else { #pragma omp parallel for for(;;) {...};

# Examples

# ( Applications

- PARSECS
- Nt-chem
- Alya
- Lulesh
- IFSkernel
- Quantum Expreso





# Nt-chem

# (C Electronic structure calculation(C RIKEN FIBER miniapp)





imp2\_rmp2energy\_incore\_v\_mpiomp.F90

| 1: RIMP2_RMP2Energy_InCore_V_MPIOMP () |                                         |
|----------------------------------------|-----------------------------------------|
| 405:                                   | DO LNumber_Base                         |
| 498:                                   | DGEMM                                   |
| 518:                                   | <pre> if (something)     { wait ;</pre> |
| 588:                                   | Do loops<br>reduction MP2 correlation   |
| 636:                                   | END DO<br>ENDO                          |

# mn3: Scalability

## ( Original: Hybrid MPI + OpenMP



Not fully populated node → system activates TurboBoost increasing frequency from 2.6 to 3.1GHz

Load imbalance Global Serialization Noise

Some gain @ low threading count

High overhead, fine granularity @ large threading count

# Nt-chem

# ( Load imbalance

- Global
- Migrating load imbalance, Serialization
- ( Asynchrony: non blocking MPI calls







# mn3 : OmpSs taskification

- ( Taskify
  - Serial DGEMMs.
    - Coarser granularity than original OpenMP
  - Reduction loops
    - Not parallelized in original?
    - Serial task
  - Communications
    - Overlap, but fixed schedule in original source

# ( Outcome

- Possible with limited understanding of global application
- Happen to be fairly independent





# mn3 : OmpSs taskification

# ( Performance gain

- Sufficient task granularity
- Communication computation overlap
- Still measuring numbers with DLB



# Analyses

( Trace MPI + OmpSs, 4 x 4

#### ( Stalls:

- Interaction throttling ↔ scheduling
   ↔ loose dependence chains
- Prioritize communication tasks

## ( Concurrent MPI tasks $\rightarrow$

- ~ commutative send/rec tasks
  - only one thread executes MPI, avoid contention ...
  - ... but still contention with allreduces !

## ( Task generation overhead $\rightarrow$

- Reduce number of tasks
  - Separate send and receive tasks → MPI\_send\_rec
  - Increase DGEMM size, reduce #tasks (WIP)





# Alya

# ( FE + Particles



- ( Important imbalance
  - ~ unpredictable, not fixable at domain decomposition level
- ( By hybrid programming
  - Reduce processes  $\rightarrow$  less imbalance
    - Sequential performance?
    - "Hybrid Amdahl's law"?
  - Still imbalance





# Matrix Assembly

( Reduction on large matrix with indirection



#### Sequential performance ( Parallelization of indirect reduction on large object Impact on IPC - Still load imbalance Task 2DZoom range [11,11] @ alya 64x4 nocoloring.prv.gz THREAD 1.2.1 THREAD 1.11.1 THREAD 1.20.1 atomics THREAD 1.29.1 THREAD 1.38.1 THREAD 1.47.1 THREAD 1.56.1 THREAD 1.64.4 263,216,526 us 265.856.220 us Task 2DZoom range [20,20] @ alya 64x4 coloring.prv.gz THREAD 1.2.1 THREAD 1.11.1 THREAD 1.20.1 coloring THREAD 1.29.1 THREAD 1.38.1 THREAD 1.47.1 THREAD 1.56.1 THREAD 1.64.4 257,838,698 us 260.651.158 us Task 2DZoom range [18,18] @ alya 64x4 multideps.prv.gz THREAD 1.2.1 THREAD 1.11.1 Commutative THREAD 1.20.1 THREAD 1.29.1 multideps THREAD 1.38.1 THREAD 1.47.1 THREAD 1.56.1 THREAD 1.64.4 257,672,289 us 260,245,229 us Task 2DZoom range [19,19] @ alya 64x4 multideps prio.prv.gz THREAD 1.2.1 THREAD 1.11.1 THREAD 1.20.1 + priority THREAD 1.29.1 THREAD 1.38.1 THREAD 1.47.1 THREAD 1.56.1 THREAD 1.64.4 260,353,666 us 262.951.292 us



# Dynamic load balance

# ( DLB

- Across MPI processes
- Within a node
- 14% gain (38.5% over coloring)

# ( Side effects

- Frequently imbalance concentrated in contiguous processes
- Suggestion:
  - Interleave processes
- Result:
  - Better Pure MPI
     performance !!!





# **Processes & Threads**

# ( Throughput computing

Malleability (tasks) + DLB →
 flat lines

- ( DLB helps in all cases
  - Even more in the bad ones  $\ensuremath{\textcircled{}}$



## ( Side effect

 Hybrid Nx1 can be better than pure MPI !!!



# Granularity

(( Fairly wide range of good granularities

( → Throughput computing







**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# CONCLUSION

# OmpSs

- ( Other features
  - Memory hierarchy management
  - Nesting/recursion
  - Criticality and locality aware scheduling
  - Multiple implementations for a given task
  - CUDA, OpenCL, accelerator directives, FPGA, Cluster
  - Resilience
  - Real time
  - ...
- (Commitment: forerunner for OpenMP
  - Continuous development and use since 2004
  - Pushing ideas into the OpenMP standard
  - Developer Positioning: efforts in OmpSs will not be lost.



# What is important? Methodology

- ( The revolution requires a programming effort
  - Must make the transition it as simple as possible
  - Must make it as long lived as possible
- ( Top down !!
  - Every level contributes
  - Nesting
  - Think global, specify local
- ( Throughput oriented, asynchrony, do not stall
  - Granularity: stay within large plateau of "good" performance
  - Try to avoid predefined schedule of synchronizing operations
  - Try to avoid "machine" specificities
- ( First order and flexible decomposition, then overhead
- ( Malleability / Responsiveness
- ( Incremental:
  - Only where needed (e.g only taskify to enable DLB, overlap,...)
- ( Show your code, look at others code, don't get just dazzled by performance
- ( Do not fly blind
  - Very aggregated statistics may not be enough to gain the insight on actual behavior





#### Dynamic





# Contributors

- Eduard Ayguade
- Rosa M. Badia
- Xavier Martorell
- Vicenç Bertran
- Alex Duran (Intel)
- Roger Ferrer (ARM)
- Xavier Teruel
- Javier Bueno (Metempsy)
- Judit Planas (Lausanne)
- Sergi Mateo
- Carlos Alvarez
- Daniel Jimenez-Gonzalez
- Guillermo Miranda (UPC)
- Diego Caballero (Intel)
- Jorge Bellon

Barcelona

- Antonio Filgueras
- Florentino Sainz (BBVA)



Supercomputing Center Centro Nacional de Supercomputación

- Diego Nieto (...)
- Victor Lopez
- Marta Garcia
- Josep M. Perez
- Guray Ozen
- Antonio J. Peña
- Julian Morillo
- Sara Royuela
- Marc Josep
- Miquel Vidal
- Kevin Sala
- Marc Mari
- Aimar Rodriguez
- Daniel Peyrolon
- Ferran Pallares
- Albert Navarro
- Toni Navarro
- Omer Subasi

- Jan Ciesko
- Marc Casas

. . .

Miquel Moreto



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# THANKS