#### Pleasant Sevices in Computer Science

### Introduction to Parallel Processing

Algorithms and Architectures



#### Behrooz Parhami

#### Part II' Shared-Memory Parallelism

|                          | Part I:<br>Fundamental<br>Concepts        | Background and<br>Motivation<br>Complexity and<br>Models | <ol> <li>Introduction to Parallelism</li> <li>A Taste of Parallel Algorithms</li> <li>Parallel Algorithm Complexity</li> <li>Models of Parallel Processing</li> </ol> |  |  |
|--------------------------|-------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| 7                        | Part II'<br>Shared-Memory                 | Shared-Memory<br>Algorithms                              | 5. PRAM and Basic Algorithms<br>6A. More Shared-Memory Algorithms                                                                                                     |  |  |
| tions                    | Parallelism                               | Implementations<br>and Models                            | 6B. Implementation of Shared Memory<br>6C, Shared-Memory Abstractions                                                                                                 |  |  |
| Architectural Variations | Part III:<br>Mask Based                   | Data Movement on<br>2D Arrays                            | 9. Sorting on a 2D Mesh or Torus<br>10. Routing on a 2D Mesh or Torus<br>11. Numerical 2D Mesh Algorithms<br>12. Other Mesh-Related Architectures                     |  |  |
|                          | Mesh-Based<br>Architectures               | Mesh Algorithms<br>and Variants                          |                                                                                                                                                                       |  |  |
|                          | Part IV:<br>Low-Diameter<br>Architectures | The Hypercube<br>Architecture                            | 13. Hypercubes and Their Algorithms<br>14. Sorting and Routing on Hypercubes                                                                                          |  |  |
|                          |                                           | Hypercubic and<br>Other Networks                         | 15. Other Hypercubic Architectures<br>16. A Sampler of Other Networks                                                                                                 |  |  |
|                          | Part V:                                   | Coordination and<br>Data Access                          | 17. Emulation and Scheduling<br>18. Data Storage, Input, and Output                                                                                                   |  |  |
|                          | Some Broad<br>Topics                      | Robustness and<br>Ease of Use                            | 19. Reliable Parallel Processing<br>20. System and Software Issues                                                                                                    |  |  |
|                          | Part VI:                                  | Control-Parallel<br>Systems                              | 21. Shared-Memory MIMD Machines<br>22. Message-Passing MIMD Machines                                                                                                  |  |  |
|                          | Implementation<br>Aspects                 | Data Parallelism<br>and Conclusion                       | 23. Data-Parallel SIMD Machines<br>24. Past, Present, and Future                                                                                                      |  |  |







### About This Presentation

This presentation is intended to support the use of the textbook *Introduction to Parallel Processing: Algorithms and Architectures* (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

| Edition | Released    | Revised     | Revised     | Revised     |
|---------|-------------|-------------|-------------|-------------|
| First   | Spring 2005 | Spring 2006 | Fall 2008   | Fall 2010   |
|         |             | Winter 2013 | Winter 2014 | Winter 2016 |
|         |             | Winter 2019 | Winter 2020 | Winter 2021 |





# II' Shared-Memory Parallelism

Shared memory is the most intuitive parallel user interface:

- Abstract SM (PRAM); ignores implementation issues
- Implementation w/o worsening the memory bottleneck
- Shared-memory models and their performance impact

#### **Topics in This Part**

Chapter 5 PRAM and Basic Algorithms

Chapter 6A More Shared-Memory Algorithms

Chapter 6B Implementation of Shared Memory

Chapter 6C Shared-Memory Abstractions





# 5 PRAM and Basic Algorithms

PRAM, a natural extension of RAM (random-access machine):

- Present definitions of model and its various submodels
- Develop algorithms for key building-block computations

| Topics in This Chapter |                                           |  |  |
|------------------------|-------------------------------------------|--|--|
| 5.1                    | PRAM Submodels and Assumptions            |  |  |
| 5.2                    | Data Broadcasting                         |  |  |
| 5.3                    | Semigroup or Fan-in Computation           |  |  |
| 5.4                    | Parallel Prefix Computation               |  |  |
| 5.5                    | 5.5 Ranking the Elements of a Linked List |  |  |
| 5.6                    | Matrix Multiplication                     |  |  |







# Why Start with Shared Memory?

Study one extreme of parallel computation models:

- Abstract SM (PRAM); ignores implementation issues
- This abstract model is either realized or emulated
- In the latter case, benefits are similar to those of HLLs

In Part II", we will study the other extreme case of models:

- Concrete circuit model; incorporates hardware details
- Allows explicit latency/area/energy trade-offs
- Facilitates theoretical studies of speed-up limits

Everything else falls between these two extremes







# 5.1 PRAM Submodels and Assumptions



Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).

Winter 2021



Parallel Processing, Shared-Memory Parallelism

Processor *i* can do the following in three phases of one cycle:

- Fetch a value from address s<sub>i</sub> in shared memory
- 2. Perform computations on data held in local registers
- Store a value into address d<sub>i</sub> in shared memory





Fig. 5.1 Submodels of the PRAM model.





Parallel Processing, Shared-Memory Parallelism



### Examples of Exclusive/Concurrent Reads/Writes

Exclusivity not enforced by hardware; rather, it's done by the programmer

Exclusive read: for  $0 \le i < p$  processor *i* read from location *i* Exclusive write:

for  $0 \le i < p$  processor *i* write into location *i* + 1 mod *p* 

Concurrent read:

for  $0 \le i < p$  processor *i* read from location *i* mod 2

Concurrent write:

for  $0 \le i < p$  processor *i* write into location  $d_i$ 







### Types of CRCW PRAM

CRCW submodels are distinguished by the way they treat multiple writes:

| Undefined: | The value written is undefined ( | CRCW-U) |
|------------|----------------------------------|---------|
|------------|----------------------------------|---------|

- Detecting: A special code for "detected collision" is written (CRCW-D)
- Common: Allowed only if they all store the same value (CRCW-C) [This is sometimes called the consistent-write submodel]
- Random: The value is randomly chosen from those offered (CRCW-R)
- Priority: The processor with the lowest index succeeds (CRCW-P)
- Max/Min: The largest/smallest of the values is written (CRCW-M)
- Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of values is written



Parallel Processing, Shared-Memory Parallelism



### Power of CRCW PRAM Submodels

Model U is more powerful than model V if  $T_U(n) = o(T_V(n))$  for some problem

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

**Theorem 5.1:** A *p*-processor CRCW-P (priority) PRAM can be simulated (emulated) by a *p*-processor EREW PRAM with slowdown factor  $\Theta(\log p)$ .

Intuitive justification for concurrent read emulation (write is similar):

Write the *p* memory addresses in a list Sort the list in ascending order of addresses Remove all duplicate addresses Access data at desired addresses Replicate data via parallel prefix computation

Each step requires constant or  $O(\log p)$  time

| Proc Addr |     |     |
|-----------|-----|-----|
| 0,1       | 0,1 | 0,1 |
| 1,6       | 6,1 |     |
| 2,5       | 7,1 |     |
| 3,2       | 3,2 | 3,2 |
| 4,3       | 8,2 |     |
| 5,6       | 4,3 | 4,3 |
| 6,1       | 2,5 | 2,5 |
| 7,1       | 1,6 | 1,6 |
| 8,2       | 5,6 |     |





Parallel Processing, Shared-Memory Parallelism



### Implications of the CRCW Hierarchy of Submodels

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

A *p*-processor CRCW-P (priority) PRAM can be simulated (emulated) by a *p*-processor EREW PRAM with slowdown factor  $\Theta(\log p)$ .

Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown

Efficient parallel algorithms have polylogarithmic running times

Running time still polylogarithmic after slowdown due to emulation

We need not be too concerned with the CRCW submodel used

Simply use whichever submodel is most natural or convenient







### Some Elementary PRAM Computations



Adding two *n*-vectors and storing the results in a third (base addresses *B*', *B*", *B*)

Convolution of two *n*-vectors:  $W_k = \sum_{i+j=k} U_i \times V_j$ (base addresses  $B_W$ ,  $B_U$ ,  $B_V$ )





Parallel Processing, Shared-Memory Parallelism



# 5.2 Data Broadcasting

 $\begin{array}{l} \underline{\text{Making } p \text{ copies of } B[0]} \\ \underline{\text{by recursive doubling}} \\ \text{for } k = 0 \text{ to } \lceil \log_2 p \rceil - 1 \\ \text{Proc } j, \ 0 \leq j < p, \text{ do} \\ \text{Copy } B[j] \text{ into } B[j + 2^k] \\ \text{endfor} \end{array}$ 

Can modify the algorithm so that redundant copying does not occur and array bound is not exceeded



Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.



Fig. 5.3 EREW PRAM data broadcasting without redundant copying.

Parallel Processing, Shared-Memory Parallelism



### **Class Participation: Broadcast-Based Sorting**

Each person write down an arbitrary nonnegative integer with 3 or fewer digits on a piece of paper

Students take turn broadcasting their numbers by calling them out aloud

Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value

Each student counts the number of Xs on paper to determine the rank of his/her number

Students call out their numbers in order of the computed rank





Parallel Processing, Shared-Memory Parallelism





### All-to-All Broadcasting on EREW PRAM

<u>EREW PRAM algorithm for all-to-all broadcasting</u> Processor *j*,  $0 \le j < p$ , write own data value into *B*[*j*] for *k* = 1 to *p* – 1 Processor *j*,  $0 \le j < p$ , do Read the data value in *B*[(*j* + *k*) mod *p*] endfor

This O(p)-step algorithm is time-optimal

Naive EREW PRAM sorting algorithm (using all-to-all broadcasting) Processor j,  $0 \le j < p$ , write 0 into R[j]for k = 1 to p - 1 Processor j,  $0 \le j < p$ , do  $l := (j + k) \mod p$ if S[l] < S[j] or S[l] = S[j] and l < jthen R[j] := R[j] + 1endif endifor This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time

Processor *j*,  $0 \le j < p$ , write *S*[*j*] into *S*[*R*[*j*]]

Winter 2021



Parallel Processing, Shared-Memory Parallelism



0

p - q

# 5.3 Semigroup or Fan-in Computation

S

EREW PRAM semigroup <u>computation algorithm</u> Proc j,  $0 \le j < p$ , copy X[j] into S[j] s := 1while s < p Proc j,  $0 \le j , do$  $<math>S[j + s] := S[j] \otimes S[j + s]$  s := 2sendwhile

Broadcast S[p-1] to all proc's



|   | 0:0 |       |  |  |
|---|-----|-------|--|--|
|   | 1   | 0:1   |  |  |
|   |     | 0:2   |  |  |
|   |     | 0:3   |  |  |
| K |     | l 1:4 |  |  |
|   |     | 12:5  |  |  |
|   |     | 3:6   |  |  |
|   |     | 4:7   |  |  |
|   |     | 5:8   |  |  |
|   |     | 6:9   |  |  |

| 0:0 |
|-----|
| 0:1 |
| 0:2 |
| 0:3 |
| 0:4 |
| 0:5 |
| 0:6 |
| 0:7 |
| 0:8 |
| 0:9 |
|     |

0:0

0:1

0:2

0:3

0:4

0:5

0:6

0:7

1:8

2:9

Fig. 5.4 Semigroup computation in EREW PRAM.

This algorithm is optimal for PRAM, but its speedup of  $O(p/\log p)$  is not

If we use p processors on a list of size  $n = O(p \log p)$ , then optimal speedup can be achieved

Lower degree of parallelism near the root

Higher degree of parallelism near the leave

Fig. 5.5 Intuitive justification of why parallel slack helps improve the efficiency.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



## 5.4 Parallel Prefix Computation



# Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### A Divide-and-Conquer Parallel-Prefix Algorithm



Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme.





Parallel Processing, Shared-Memory Parallelism



Another Divide-and-Conquer Algorithm



Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.





Parallel Processing, Shared-Memory Parallelism



# 5.5 Ranking the Elements of a Linked List



Parallel Processing, Shared-Memory Parallelism



List Ranking via Recursive Doubling



Fig. 5.11 Element ranks initially and after each of the three iterations.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **PRAM List Ranking Algorithm**



Parallel Processing, Shared-Memory Parallelism

Winter 2021



# 5.6 Matrix Multiplication



### PRAM Matrix Multiplication with $m^2$ Processors



### PRAM Matrix Multiplication with *m* Processors



### PRAM Matrix Multiplication with Fewer Processors

Algorithm is similar, except that each processor is in charge of computing m/p rows of C

 $\Theta(m^3/p)$  steps: Time-optimal EREW model can be used

A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access.

This is particularly costly for NUMA shared-memory machines.





### More Efficient Matrix Multiplication (for NUMA)



Fig. 5.13 Partitioning the matrices for block matrix multiplication .

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Details of Block Matrix Multiplication**



Fig. 5.14 How Processor (i, j) operates on an element of A and one block-row of B to update one block-row of C.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6A More Shared-Memory Algorithms

Develop PRAM algorithm for more complex problems:

- Searching, selection, sorting, other nonnumerical tasks
- Must present background on the problem in some cases

| Topics in This Chapter |                                     |  |  |
|------------------------|-------------------------------------|--|--|
| 6A.1                   | Parallel Searching Algorithms       |  |  |
| 6A.2                   | Sequential Rank-Based Selection     |  |  |
| 6A.3                   | A Parallel Selection Algorithm      |  |  |
| 6A.4                   | A Selection-Based Sorting Algorithm |  |  |
| 6A.5                   | .5 Alternative Sorting Algorithms   |  |  |
| 6A.6                   | Convex Hull of a 2D Point Set       |  |  |







# 6A.1 Parallel Searching Algorithms

Searching an unordered list in PRAM

Sequential time: *n* worst-case *n*/2 on average

Divide the list of *n* items into *p* segments of  $\lceil n / p \rceil$  items (last segment may have fewer)

Processor *i* will be in charge of  $\lceil n / p \rceil$  list elements, beginning at address  $i \lceil n / p \rceil$ 

Parallel time:

*[n | p*] worst-case ??? on average

Perfect speed-up of *p* with *p* processors?

Pre- and postprocessing overheads

Winter 2021



Parallel Processing, Shared-Memory Parallelism





### Parallel (p + 1)-ary Search on PRAM

p probes, rather than 1, per step  $log_{p+1}(n + 1)$   $= log_2(n + 1) / log_2(p + 1)$   $= \Theta(log n / log p) steps$ 

Speedup  $\cong \log p$ 

Optimal: no comparison-based search algorithm can be faster

A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost:

Dynamic data (sorting overhead)

Batch searching (multiple lookups)



Slide 31





Parallel Processing, Shared-Memory Parallelism

## 6A.2 Sequential Ranked-Based Selection

Selection: Find the (or a) *k*th smallest among *n* elements

**Example: 5th smallest element in the following list is 1:** 6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 1 4 5 4 9 5



### Linear-Time Sequential Selection Algorithm



### Algorithm Complexity and Examples

T(n) = T(n/q) + T(3n/4) + cn

We must have  $q \ge 5$ ; for q = 5, the solution is T(n) = 20cn

| S      |               | -       | of <i>q</i> elements                        |
|--------|---------------|---------|---------------------------------------------|
| T<br>m | 6             | 3       | 3 2 5<br>3                                  |
| 111    | 1 2 1 0 2 1 1 | 3 3 6 4 | 5 6 7 5 8 4 5 6 7 4 5 4 9 5                 |
|        | L<br> L  = 7  | E = 2   | $\begin{array}{c} G\\  G  = 16 \end{array}$ |

To find the 5th smallest element in S, select the 5th smallest element in L



# 6A.3 A Parallel Selection Algorithm

Parallel rank-based selection algorithm PRAMselect(S, k, p)



### Algorithm Complexity and Efficiency

$$T(n,p) = T(n^{1-x},p) + T(3n/4,p) + cn^{x}$$

The solution is  $O(n^x)$ ; verify by substitution

Speedup = 
$$\Theta(n) / O(n^x) = \Omega(n^{1-x}) = \Omega(p)$$
  
Efficiency =  $\Omega(1)$   
Work(*n*, *p*) = *pT*(*n*, *p*) =  $\Theta(n^{1-x}) \Theta(n^x) = \Theta(r)$ 

Remember  $p = O(n^{1-x})$ 

What happens if we set x to 1? (i.e., use one processor)  $T(n, 1) = O(n^{x}) = O(n)$ 

What happens if we set x to 0? (i.e., use n processors)  $T(n, n) = O(n^{x}) = O(1)$ ? No, because in asymptotic analysis, we ignored several  $O(\log n)$  terms compared with  $O(n^x)$  terms





Parallel Processing, Shared-Memory Parallelism



Data Movement in Step 2 of the Algorithm



Consider the sublist *L*: Processor *i* contributes *a<sub>i</sub>* items to this sublist

Processor 0 starts storing at location 0, processor 1 at location  $a_0$ , processor 2 at location  $a_0 + a_1$ , Processor 3 at location  $a_0 + a_1 + a_2$ , ...





Parallel Processing, Shared-Memory Parallelism



## 6A.4 A Selection-Based Sorting Algorithm



### Algorithm Complexity and Efficiency

 $T(n, p) = 2T(n/k, 2p/k) + cn^{x}$ 

The solution is  $O(n^x \log n)$ ; verify by substitution

Speedup $(n, p) = \Omega(n \log n) / O(n^x \log n) = \Omega(n^{1-x}) = \Omega(p)$ Efficiency = speedup /  $p = \Omega(1)$ Work $(n, p) = pT(n, p) = \Theta(n^{1-x}) \Theta(n^x \log n) = \Theta(n \log n)$ 

What happens if we set x to 1? (i.e., use one processor)  $T(n, 1) = O(n^x \log n) = O(n \log n)$  Remember  $p = O(n^{1-x})$ 

Our asymptotic analysis is valid for x > 0 but not for x = 0;

i.e., *PRAMselectionsort* cannot sort *p* keys in optimal O(log *p*) time.





Parallel Processing, Shared-Memory Parallelism



#### **Example of Parallel Sorting**

S: 6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 0 4 5 4 9 5

Threshold values for k = 4 (i.e.,  $x = \frac{1}{2}$  and  $p = n^{1/2}$  processors):

|                          | $m_0 = -\infty$                                         |
|--------------------------|---------------------------------------------------------|
| <i>n/k</i> = 25/4 ≅ 6    | <i>m</i> <sub>1</sub> = <i>PRAMselect</i> (S, 6, 5) = 2 |
| 2 <i>n/k</i> = 50/4 ≅ 13 | $m_2 = PRAMselect(S, 13, 5) = 4$                        |
| 3 <i>n/k</i> = 75/4 ≅ 19 | $m_{3}^{-} = PRAMselect(S, 19, 5) = 6$                  |
|                          | $m_4 = +\infty$                                         |



**T:** 0 0 1 1 1 2 2 3 3 4 4 4 4 5 5 5 5 6 6 6 7 7 8 9

Winter 2021



Parallel Processing, Shared-Memory Parallelism



## 6A.5 Alternative Sorting Algorithms

Sorting via random sampling (assume  $p \ll \sqrt{n}$ )

Given a large list *S* of inputs, a random sample of the elements can be used to find *k* comparison thresholds

It is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor

Parallel randomized sort PRAMrandomsort(S, p)

- 1. Processor *j*,  $0 \le j < p$ , pick  $|S|/p^2$  random samples of its |S|/p elements and store them in its corresponding section of a list *T* of length |S|/p
- 2. Processor 0 sort the list T

{comparison threshold  $m_i$  is the  $(i|S|/p^2)$ th element of T}

- 3. Processor *j*,  $0 \le j < p$ , store its elements falling in  $(m_i, m_{i+1})$  into T(i)
- 4. Processor *j*,  $0 \le j < p$ , sort the sublist *T*(*j*)

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Parallel Binsort or Bucketsort

Suppose input values are in [0, *m*) Divide the range into *p* subranges

[0, *m*/*p*), [*m*/*p*, 2*m*/*p*), ....

Processor *j* sorts the elements in the *j*th subrange

If values are uniformly distributed, each processor gets  $\sim n/p$  values

Avg. time = 
$$O(n/p)$$
 +  $O((n/p) \log(n/p))$   
Fill Sort  
buckets buckets

Wikipedia images 29 25 3 49 9 37 21 43 2925 49 3 37 9 43 21 10 - 1920-29 30-39 0-9 40-49 20-29 30-39 40-49 0 - 910 - 193 43 37 49 9

#### **3 9 21 25 29 37 43 49**

If the list consists of small range of values (say, numbers 0-9), bucketsort can be used to count the number of occurrences of each value in O(n) time serially and in O(n/p) time in parallel

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Parallel Radixsort

In binary version of *radixsort*, we examine every bit of the *k*-bit keys in turn, starting from the LSB In Step *i*, bit *i* is examined,  $0 \le i < k$ Records are stably sorted by the value of the *i*th key bit

| Binary     | Input<br>list | Sort by<br>LSB | Sort by middle bit | Sort by<br>MSB |
|------------|---------------|----------------|--------------------|----------------|
| forms      | 5 (101)       | <br>4 (100)    | <u> </u>           | <u> </u>       |
|            | 7 (111)       | 2 (010)        | -5 (101)           | -2 (010)       |
| Question:  | 3 (011)       | <u>2 (010)</u> | <u>1 (001)</u>     | <b>2</b> (010) |
| How are    | 1 (001)       | 5 (101)        | 2 (010)            | <u>3 (011)</u> |
| the data   | 4 (100)       | / 7 (111) /    | 2 (010)            | 4 (100)        |
| movements  | 2 (010)       | 3 (011)        | 7 (111)            | 5 (101)        |
| performed? | 7 (111)       | 1 (001)        | 3 (011)            | 7 (111)        |
|            | 2 (010)       | 7 (111)        | 7 (111)            | 7 (111)        |







#### Data Movements in Parallel Radixsort

| Input<br>list | • | Diminished<br>prefix sums | Bit 0 | Prefix sums<br>plus 2 | Shifted<br>list |
|---------------|---|---------------------------|-------|-----------------------|-----------------|
| 5 (101)       | 0 | _                         | 1     | 1 + 2 = 3             | 4 (100)         |
| 7 (111)       | 0 | _                         | 1     | 2 + 2 = 4             | 2 (010)         |
| 3 (011)       | 0 | _                         | 1     | 3 + 2 = 5             | <u>2 (010)</u>  |
| 1 (001)       | 0 | —                         | 1     | 4 + 2 = 6             | 5 (101)         |
| 4 (100)       | 1 | 0                         | 0     | -                     | 7 (111)         |
| 2 (010)       | 1 | 1                         | 0     | -                     | 3 (011)         |
| 7 (111)       | 0 | —                         | 1     | 5 + 2 = 7             | 1 (001)         |
| 2 (010)       | 1 | 2                         | 0     | _                     | 7 (111)         |

Running time consists mainly of the time to perform 2k parallel prefix computations: O(log p) for k constant









### PRAM Convex Hull Algorithm

Parallel convex hull algorithm PRAMconvexhull(S, p)

- 1. Sort point set by *x* coordinates
- 2. Divide sorted list into  $\sqrt{p}$  subsets  $Q^{(i)}$  of size  $\sqrt{p}$ ,  $0 \le i < \sqrt{p}$
- 3. Find convex hull of each subset  $Q^{(i)}$  using  $\sqrt{p}$  processors
- 4. Merge  $\sqrt{p}$  convex hulls CH( $Q^{(i)}$ ) into overall hull CH(Q)



Fig. 6.4 Multiway divide and conquer for the convex hull problem

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Merging of Partial Convex Hulls



Tangent lines are found through binary search in log time

 $= T(p^{1/2}, p^{1/2}) + c \log p$ 

The initial sorting also takes  $O(\log p)$  time

(b) Points of CH(Q(i)) from A to B are on CH(Q)

Fig. 6.5 Finding points in a partial hull that belong to the combined hull.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6B Implementation of Shared Memory

Main challenge: Easing the memory access bottleneck

- Providing low-latency, high-bandwidth paths to memory
- Reducing the need for access to nonlocal memory
- Reducing conflicts and sensitivity to memory latency

| Topics in This Chapter |                                          |  |
|------------------------|------------------------------------------|--|
| 6B.1                   | Processor-Memory Interconnection         |  |
| 6B.2                   | Multistage Interconnection Networks      |  |
| 6B.3                   | Cache Coherence Protocols                |  |
| 6B.4                   | Data Allocation for Conflict-Free Access |  |
| 6B.5                   | Distributed Shared Memory                |  |
| 6B.6                   | Methods for Memory Latency Hiding        |  |







#### About the New Chapter 6B

This new chapter incorporates material from the following existing sections of the book:

- 6.6 Some Implementation Aspects
- 14.4 Dimension-Order Routing
- 15.3 Plus-or-Minus-2<sup>*i*</sup> Network
- 16.6 Multistage Interconnection Networks
- 17.2 Distributed Shared Memory
- 18.1 Data Access Problems and Caching
- 18.2 Cache Coherence Protocols
- 18.3 Multithreading and Latency Hiding
- 20.1 Coordination and Synchronization





Parallel Processing, Shared-Memory Parallelism



### Making PRAM Practical

PRAM needs a read and a write access to memory in every cycle

Even for a sequential computer, memory is tens to hundreds of times slower than arithmetic/logic operations; multiple processors accessing a shared memory only makes the situation worse

Shared access to a single large physical memory isn't scalable

Strategies and focal points for making PRAM practical

- 1. Make memory accesses faster and more efficient (pipelining)
- 2. Reduce the number of memory accesses (caching, reordering of accesses so as to do more computation per item fetched/stored)
- 3. Reduce synchronization, so that slow memory accesses for one computation do not slow down others (synch, memory consistency)
- 4. Distribute the memory and data to make most accesses local
- 5. Store data structures to reduce access conflicts (skewed storage)

Winter 2021



Parallel Processing, Shared-Memory Parallelism



## 6B.1 Processor-Memory Interconnection



Fig. 4.3 A parallel processor with global (shared) memory.





Parallel Processing, Shared-Memory Parallelism



#### Processor-to-Memory Network



An  $8 \times 8$  crossbar switch

Crossbar switches offer full permutation capability (they are *nonblocking*), but are complex and expensive:  $O(p^2)$ 

Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module

Practical processor-tomemory networks cannot realize all permutations (they are *blocking*)

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Bus-Based Interconnections**

#### Single-bus system:

Bandwidth bottleneck Bus loading limit Scalability: very poor Single failure point Conceptually simple Forced serialization



#### Multiple-bus system:

Bandwidth improved Bus loading limit Scalability: poor More robust More complex scheduling Simple serialization



Winter 2021





### Back-of-the-Envelope Bus Bandwidth Calculation

#### Single-bus system:

Bus frequency: 0.5 GHz Data width: 256 b (32 B) Mem. Access: 2 bus cycles  $(0.5G)/2 \times 32 = 8$  GB/s

Bus cycle = 2 ns Memory cycle = 100 ns 1 mem. cycle = 50 bus cycles

#### Multiple-bus system:

Peak bandwidth multiplied by the number of buses (actual bandwidth is likely to be much less)







#### **Hierarchical Bus Interconnection**



Fig. 4.9 Example of a hierarchical interconnection architecture.





Parallel Processing, Shared-Memory Parallelism



### Removing the Processor-to-Memory Bottleneck



Fig. 4.4 A parallel processor with global memory and processor caches.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Why Data Caching Works

Hit rate *r* (fraction of memory accesses satisfied by cache)

$$C_{\text{eff}} = C_{\text{fast}} + (1 - r)C_{\text{slow}}$$

#### **Cache parameters:**

Size Block length (line width) Placement policy Replacement policy Write policy

Fig. 18.1 Data storage and access in a two-way set-associative cache.



Parallel Processing, Shared-Memory Parallelism





### Benefits of Caching Formulated as Amdahl's Law



This corresponds to the miss-rate fraction 1-r of accesses being unaffected and the hit-rate fraction *r* (almost 1) being speeded up by a factor  $C_{slow}/C_{fast}$ 

#### **Generalized form of Amdahl's speedup formula:**

$$S = 1/(f_1/p_1 + f_2/p_2 + \ldots + f_m/p_m)$$
, with  $f_1 + f_2 + \ldots + f_m = 1$ 

In this case, a fraction 1 - r is slowed down by a factor  $(C_{slow} + C_{fast})/C_{slow}$ , and a fraction *r* is speeded up by a factor  $C_{slow}/C_{fast}$ 

Winter 2021



Parallel Processing, Shared-Memory Parallelism



## 6B.2 Multistage Interconnection Networks



#### Cray Y-MP's Interconnection Network



#### Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Butterfly Processor-to-Memory Network



Fig. 6.9 Example of a multistage memory access network.

Two ways to use the butterfly: - Edge switches 1 x 2 and 2 x 1 - All switches 2 x 2

Not a full permutation network (e.g., processor 0 cannot be connected to memory bank 2 alongside the two connections shown)

Is self-routing: i.e., the bank address determines the route

A request going to memory bank 3  $(0\ 0\ 1\ 1)$  is routed:



lower upper upper

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Butterfly as Multistage Interconnection Network



Fig. 6.9 Example of a multistage memory access network

log<sub>2</sub> p + 1 Columns of 2-by-2 Switches 000 001 010 Processors 011 100 101 110 111 000 001 Memory Banks 010 011 100 101 110 111

Fig. 15.8 Butterfly network used to connect modules that are on the same side

Generalization of the butterfly network

High-radix or *m*-ary butterfly, built of  $m \times m$  switches Has  $m^q$  rows and q + 1 columns (*q* if wrapped)

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Self-Routing on a Butterfly Network



From node 3 to 6: routing tag =  $011 \oplus 110 = 101$  "cross-straight-cross" From node 3 to 5: routing tag =  $011 \oplus 101 = 110$  "straight-cross-cross" From node 6 to 1: routing tag =  $110 \oplus 001 = 111$  "cross-cross-cross"

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Butterfly Is Not a Permutation Network



Fig. 14.7 Packing is a "good" routing problem for dimensionorder routing on the hypercube. Fig. 14.8 Bit-reversal permutation is a "bad" routing problem for dimensionorder routing on the hypercube.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Structure of Butterfly Networks

Switching these two row pairs converts this to the original butterfly network. Changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row *xyz* becomes row *xzy*)



Fig. 15.5 Butterfly network with permuted dimensions.



The 16-row butterfly network.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



#### Beneš Network



Fig. 15.9 Beneš network formed from two back-to-back butterflies.

A  $2^{q}$ -row Beneš network: Can route any  $2^{q} \times 2^{q}$  permutation It is "rearrangeable"

Winter 2021



Parallel Processing, Shared-Memory Parallelism



To which memory modules can we connect proc 4 without rearranging the other paths?

What about proc 6?



Winter 2021



Parallel Processing, Shared-Memory Parallelism

s referred s

#### Augmented Data **Manipulator Network**

Data manipulator network was used in Goodyear MPP, an early SIMD parallel machine.

"Augmented" means that switches in a column are independent, as opposed to all being set to same state (simplified control).

Fig. 15.12 Augmented data manipulator network.







Parallel Processing, Shared-Memory Parallelism



#### The Sea of Indirect Interconnection Networks

Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers

They differ in topological, performance, robustness, and realizability attributes

We have already seen the butterfly, hierarchical bus, beneš, and ADM networks

Fig. 4.8 (modified) The sea of indirect interconnection networks.

Winter 2021



Parallel Processing, Shared-Memory Parallelism

### Self-Routing Permutation Networks

Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network)

Permutation routing through a MIN is the same problem as sorting



### Partial List of Important MINs

Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12) **Banyan**: Any MIN with a unique path between any input and any output (e.g. butterfly) **Baseline**: Butterfly network with nodes labeled differently **Beneš**: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10) Bidelta: A MIN that is a delta network in either direction **Butterfly**: aka unfolded hypercube (Figs. 6.9, 15.4-5) **Data manipulator**: Same as ADM, but with switches in a column restricted to same state **Delta:** Any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for  $2 \times 2$  switches) and path label, composed of concatenating switch output labels leading from an input to an output depends only on the output **Flip**: Reverse of the omega network (inputs  $\times$  outputs) **Indirect cube**: Same as butterfly or omega **Omega**: Multi-stage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19) **Permutation:** Any MIN that can realize all permutations **Rearrangeable**: Same as permutation network **Reverse baseline**: Baseline network, with the roles of inputs and outputs interchanged





# 6B.3 Cache Coherence Protocols



Fig. 18.2 Various types of cached data blocks in a parallel processor with global memory and processor caches.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Example: A Bus-Based Snoopy Protocol

Each transition is labeled with the event that triggers it, followed by the action(s) that must be taken



Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Implementing a Snoopy Protocol



Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Scalable (Distributed) Shared Memory



#### Fig. 4.5 A parallel processor with distributed memory.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Example: A Directory-Based Protocol**



Fig. 18.4 States and transitions for a directory entry in a directory-based coherence protocol (*c* denotes the cache sending the message).

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### Implementing a Directory-Based Protocol

Sharing set implemented as a bit-vector (simple, but not scalable)

When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory



The sharing set can be maintained as a distributed doubly linked list (will discuss in Section 18.6 in connection with the SCI standard)

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6B.4 Data Allocation for Conflict-Free Access

Try to store the data such that parallel accesses are to different banks For many data structures, a compiler may perform the memory mapping



Each matrix column is stored in a different memory module (bank)

Accessing a column leads to conflicts

Fig. 6.6 Matrix storage in column-major order to allow concurrent accesses to rows.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Skewed Storage Format**



Fig. 6.7 Skewed matrix storage for conflict-free accesses to rows and columns.



Parallel Processing, Shared-Memory Parallelism



### A Unified Theory of Conflict-Free Access

| Vector<br>indices                                                                                                                               | 0<br>1<br>2<br>3<br>4<br>5 | 6<br>7<br>8<br>9<br>10<br>11                    | 12<br>13<br>14<br>15<br>16<br>17 | 18<br>19<br>20<br>21<br>22<br>23   | 24<br>25<br>26<br>27<br>28<br>29                                     | 30<br>31<br>32<br>33<br>34<br>35 |               | A qD array can be<br>viewed as a vector,<br>with "row"/"column"<br>accesses associated<br>with constant strides |
|-------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------|-------------------------------------------------|----------------------------------|------------------------------------|----------------------------------------------------------------------|----------------------------------|---------------|-----------------------------------------------------------------------------------------------------------------|
| $A_{ij}$ is viewed as vector element <i>i</i> + <i>jm</i><br>Fig. 6.8 A 6 × 6 matrix viewed, in column-<br>major order, as a 36-element vector. |                            |                                                 |                                  |                                    |                                                                      |                                  |               |                                                                                                                 |
| Column<br>Row:<br>Diagona<br>Antidiag                                                                                                           | al:                        | k,<br><i>k</i> ,<br><i>k</i> -<br>I: <i>k</i> , | k+m,<br>k+m+<br>+4(m+<br>k+m-    | k+2 <i>m</i><br>-1, k+2<br>1), k+2 | , <i>k</i> +3n<br>2( <i>m</i> +1<br>5( <i>m</i> +1<br>2( <i>m</i> –1 | ), <i>k</i> +3(                  | m, k+<br>(m+1 | ),<br>Stride = <i>m</i> + 1                                                                                     |





Parallel Processing, Shared-Memory Parallelism



### **Linear Skewing Schemes**



Place vector element *i* in memory bank *a* + *bi* mod *B* (word address within bank is irrelevant to conflict-free access; also, *a* can be set to 0)

 $A_{ij}$  is viewed as vector element i + jm

Fig. 6.8 A  $6 \times 6$  matrix viewed, in columnmajor order, as a 36-element vector.

With a linear skewing scheme, vector elements k, k + s, k + 2s, ..., k + (B - 1)s will be assigned to different memory banks iff *sb* is relatively prime with respect to the number *B* of memory banks.

A prime value for *B* ensures this condition, but is not very practical.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6B.5 Distributed Shared Memory



#### Fig. 4.5 A parallel processor with distributed memory.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Butterfly-Based Distributed Shared Memory**

Randomized emulation of the *p*-processor PRAM on *p*-node butterfly

Use hash function to map memory locations to modules

p locations  $\rightarrow p$  modules, not necessarily distinct

With high probability, at most  $O(\log p)$  of the *p* locations will be in modules located in the same row

Average slowdown =  $O(\log p)$ 



Fig. 17.2 Butterfly distributed-memory machine emulating the PRAM.



Parallel Processing, Shared-Memory Parallelism



### PRAM Emulation with Butterfly MIN

Emulation of the *p*-processor PRAM on (*p* log *p*)-node butterfly, with memory modules and processors connected to the two sides; O(log p) avg. slowdown

processors

Less efficient than Fig. 17.2, which uses a smaller butterfly

By using  $p/(\log p)$  physical processors to emulate the *p*-processor PRAM, this new emulation scheme becomes quite efficient (pipeline the memory accesses of the  $\log p$ virtual processors assigned to each physical processor)



Fig. 17.3 Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Deterministic Shared-Memory Emulation**

One of p =

 $2^{q}(a + 1)$ 

**Deterministic emulation** of p-processor PRAM on *p*-node butterfly

Store log<sub>2</sub> *m* copies of each of the *m* memory location contents

Time-stamp each updated value

A "write" is complete once a majority of copies are updated

A "read" is satisfied when a majority of copies are accessed and the one with latest time stamp is used

Why it works: A few congested links won't delay the operation



Winter 2021



Parallel Processing, Shared-Memory Parallelism

### **PRAM Emulation Using Information Dispersal**

Instead of (log *m*)-fold replication of data, divide each data element into *k* pieces and encode the pieces using a redundancy factor of 3, so that any k/3 pieces suffice for reconstructing the original data



Fig. 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



## 6B.6 Methods for Memory Latency Hiding

By assumption, PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched Not a smart strategy:

Memory access time = 100s times that of add time



# 6C Shared-Memory Abstractions

A precise memory view is needed for correct algorithm design

- Sequential consistency facilitates programming
- Less strict consistency models offer better performance

| Topics in This Chapter |                                     |  |  |  |
|------------------------|-------------------------------------|--|--|--|
| 6C.1                   | Atomicity in Memory Access          |  |  |  |
| 6C.2                   | Strict and Sequential Consistency   |  |  |  |
| 6C.3                   | Processor Consistency               |  |  |  |
| 6C.4                   | Weak or Synchronization Consistency |  |  |  |
| 6C.5                   | Other Memory Consistency Models     |  |  |  |
| 6C.6                   | Transactional Memory                |  |  |  |







## 6C.1 Atomicity in Memory Access

Performance optimization and latency hiding often imply that memory accesses are interleaved and perhaps not serviced in the order issued



### **Barrier Synchronization Overhead**





Fig. 20.1 Automatic synchronization in message-passing systems.





Parallel Processing, Shared-Memory Parallelism



### Synchronization with Shared Memory

Accomplished by accessing specially designated shared control variables

The fetch-and-add instruction constitutes a useful atomic operation

If the current value of x is c, fetch-and-add(x, a) returns c to the process and overwrites x = c with the value c + a

A second process executing fetch-and-add(x, b) then gets the now current value c + a and modifies it to c + a + b

Why atomicity of fetch-and-add is important: With ordinary instructions, the 3 steps of fetch-and-add for *A* and *B* may be interleaved as follows:

|                                       | Process A    | Process B      | <u>Comments</u>                                                          |
|---------------------------------------|--------------|----------------|--------------------------------------------------------------------------|
| Time step 1                           | read x       |                | A's accumulator holds c                                                  |
| Time step 2                           |              | read x         | B's accumulator holds c                                                  |
| Time step 3                           | add <i>a</i> |                | A's accumulator holds c + a                                              |
| Time step 4                           |              | add <i>b</i>   | <i>B</i> 's accumulator holds <i>c</i> + <i>b</i>                        |
| Time step 5                           | store x      |                | x holds c + a                                                            |
| Time step 6                           |              | store <i>x</i> | <i>x</i> holds <i>c</i> + <i>b</i> (not <i>c</i> + <i>a</i> + <i>b</i> ) |
| • • • • • • • • • • • • • • • • • • • |              |                | ( /                                                                      |

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### **Barrier Synchronization: Implementations**

Make each processor, in a designated set, wait at a barrier until all other processors have arrived at the corresponding points in their computations

**Software implementation** via fetch-and-add or similar instruction **Hardware implementation** via an AND tree (raise flag, check AND result)

A problem with the AND-tree: If a processor can be randomly delayed between raising it flag and checking the tree output, some processors might cross the barrier and lower their flags before others have noticed the change in the AND tree output

**Solution:** Use two AND trees for alternating barrier points



Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6C.2 Strict and Sequential Consistency

A global notion of time does not exist: The speed of light is finite; therefore, we do not become aware of events instantaneously

Suppose a group of people decide to synchronize their watches One person shouts: On the count of 3, set your watches to 1:00 PM

Not everyone will hear the command at the same time

In physics, we learn that two observers do not see the same time

Furthermore, the speed of time passing varies for different observers

# **Conclusion:** A universal notion of time does not exist!



Parallel Processing, Shared-Memory Parallelism



### Strict Consistency

With strict consistency, a read operation always returns the result of the latest write operation on that data object

Strict consistency is impossible to maintain in a distributed system which does not have a global clock

While clocks can be synchronized, there is always some error that causes trouble in near-simultaneous operations

Example: Three processes sharing variables 1-4 (r = read, w = write)



### Sequential Consistency

**Sequential consistency (original def.):** The result of any execution is the same as if processor operations were executed in some sequential order, and the operations of a particular processor appear in the sequence specified by the program it runs



Sequential consistency (new def.): Write operations on the same data object are seen in exactly the same order by all system nodes

Winter 2021



Parallel Processing, Shared-Memory Parallelism



### The Performance Penalty of Sequential Consistency

| Initially     | Thread 1 | <u>Thread 2</u> |  |
|---------------|----------|-----------------|--|
| X = Y = 0     | X := 1   | Y := 1          |  |
|               | R1 := Y  | R2 :=X          |  |
| <u>Exec 1</u> | Exec 2   | Exec 3          |  |
| X := 1        | Y := 1   | X := 1          |  |
| R1 := Y       | R2 :=X   | Y := 1          |  |
| Y := 1        | X := 1   | R1 := Y         |  |
| R2 := X       | R1 := Y  | R2 := X         |  |

If a compiler reorders the seemingly independent statements in Thread 1, the desired semantics (R1 and R2 not being both 0) is compromised

**Relaxed consistency (memory model):** Ease the requirements on doing things in program order and/or write atomicity to gain performance

When maintaining order is absolutely necessary, we use synchronization primitives to enforce it

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6C.3 Processor Consistency

**Processor consistency:** Writes by the same processors are seen by all other processors as occurring in the same order; writes by different processors may appear in different order at various nodes

Example: Linear array in which changes in values propagate at the rate of one node per time step



If P0 and P4 perform two write operations on consecutive time steps, then this is how the processors will see them

| Step 1 | WA |    |        |    | WX |
|--------|----|----|--------|----|----|
| Step 2 | WB | WA |        | WX | WY |
| Step 3 |    | WB | WA, WX | WY |    |
| Step 4 |    | WX | WB, WY | WA |    |
| Step 5 | WX | WY |        | WB | WA |
| Step 6 | WY |    |        |    | WB |

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6C.4 Weak or Synchronization Consistency

Weak consistency: Memory accesses are divided into two categories: (1) Ordinary data accesses Category-1 accesses can be reordered with no limitation If ordering of two operations is to be maintained, the programmer must specify at least one of them as a synchronizing, or Category-2, access



# 6C.5 Other Memory Consistency Models

**Release consistency:** Relaxes synchronization consistency somewhat (1) A process can access a shared variable only if all of its previous acquires have completed successfully

(2) A process can perform a release operation only if all of its previous reads and writes have completed

(3) Acquire and release accesses must be sequentially consistent

For more on memory consistency models, see:

Adve, S. V. and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," *IEEE Computer*, Dec. 1996.

Adve, S. V., H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," *Communications of the ACM*, Aug. 2010.

For general info on memory management, see:

Gaud, F. *et al.*, "Challenges of Memory Management on Modern NUMA Systems," *Communications of the ACM*, Dec. 2015.

Winter 2021



Parallel Processing, Shared-Memory Parallelism



# 6C.6 Transactional Memory

TM systems typically provide atomic statements that allow the execution of a block of code as an all-or-nothing entity (much like a transaction)

Example of transaction: Transfer \$x from account A to Account B

(1) if a ≥ x then a := a - x else return "insufficient funds"
(2) b := b + x
(3) return "transfer successful"

TM allows a group of read & write operations to be enclosed in a block, so that any changed values become observable to the rest of the system only upon the completion of the entire block





### **Examples of Memory Transactions**

A group of reads, writes, and intervening operations can be grouped into an atomic transaction

**Example:** If wa and wb are made part of the same memory transaction, every processor will see both changes or neither of them



### Implementations of Transactional Memory

**Software:** 2-7 times slower than sequential code [Laru08]

Hardware acceleration: Hardware assists for the most time-consuming parts of TM operations

e.g., maintenance and validation of read sets

Hardware implementation: All required bookkeeping operations are implemented directly in hardware

e.g., by modifying the L1 cache and the coherence protocol

For more information on transactional memory, see:

[Laru08] Larus, J. and C. Kozyrakis, "Transactional Memory," Communications of the ACM, Vol. 51, No. 7, pp. 80-88, July 2008.





Parallel Processing, Shared-Memory Parallelism

