Core Principles of Parallel Programming on CPUs and GPUs (Course 1, Module 2)
What is Concurrent Programming?
- Almost everything that runs on a computer is built around multiprocessing.
- Operating systems balance multiple applications and background threads.
- Applications that need to handle lots of data efficiently need to distribute computation
- Many artificial intelligence/machine learning techniques require complex logic that continually runs in parallel to work efficiently
- Video game graphics, video editing software, and compression algorithms can use these techniques to give rapid results to consumers and professionals.
Multiprocessing, concurrent, parallel, and threaded programming are terms that are often used interchangeably. Modern programming is built around maximizing the utilization of the multiple processors and cores of modern computers, and the best way to do that is by running multiple processes and threads. This is done without developer intervention by the operating system, but you can improve the efficiency of your software by writing quality code with threads that process data independently. Some highly complex tasks, such as artificial intelligence, audio/video processing, and signal processing, can be written with multiprocessing in mind, though this can go awry. Parallel processing has been the basis of software that requires lots of computing power but also allows for asynchronous interactions with the user.
How does it work?
- The state of a single computational process that takes multiple cycles can be managed as a thread.
- Consumer laptops can have 4-8 cores and have 4-16 active threads
- Beyond active threads the OS and application can have many more inactive threads
- Scheduling algorithms shift threads between inactive and active states
- Memory caching is used to shift thread-state memory between the hard drive and faster forms of memory such as RAM and registers
What is a thread? It is an independent collection of sequentially executed programming steps. Think of a program with multiple independent threads like train tracks that separate and come together at certain points: when the tracks are apart, multiple trains can go as fast as they want, but when they are together, each train is limited by the speed of the train in front of it. With modern CPUs (think of laptops or consumer desktops), you will often have 4-8 cores, and each core can have between one and four threads managed by the scheduler. The heart of any OS's multithreading capability is the scheduler, which shuffles different programs, including operating system tasks, between active and inactive states, moving them between cores, moving data around caches, etc. Computers have multiple levels of memory, from the hard disk to registers shared between cores. These allow for common data amongst threads, even the instructions that multiple threads from the same program execute. The goal of memory caching is to limit the amount of time spent waiting for data or instructions to be retrieved from slower and more distant memory, which causes threads to become inactive. What we see in this diagram is a CPU scheduler shifting between threads on different cores. In this case, presume that all four cores are on the same processor and share at least an L2 cache as threads are made inactive and shifted between cores.
They have their state changed due to cache misses, and the scheduler doesn’t want to waste CPU cycles on waiting for data or instructions, so instead it finds something that needs to be run and makes it active.
There is, of course, lost time in switching between threads, so constantly switching has a cost as well.
Memory caches are hierarchical in nature, and they really work on the principle that memory that is physically closer is more performant.
Caches are not just used for storing localized data and/or instructions for a CPU; they are also used as the sharing mechanism between cores, and as a way to store data and/or instructions that limits the number of times requests have to go to RAM or even hard drive space. There are also L3 caches, which apply a little differently when we get to GPUs, but they won't be discussed in detail here.
Concurrent Programming Pitfalls
Pitfalls in Concurrent Programming
- Race Conditions
- Resource Contention
- Dead Lock
- Live Lock
- Resource Over/Under Utilization
Race Condition
When the expected order of thread operations is not followed, issues arise.
Add picture of Race Condition.
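The lost-update race can also be sketched in code. The following is a minimal Python illustration (my own example, not from the course materials): several threads perform a non-atomic read-modify-write on a shared counter, so updates made between one thread's read and write are lost.

```python
import threading

counter = 0  # shared state touched by every thread

def increment_many(n):
    global counter
    for _ in range(n):
        value = counter      # read
        counter = value + 1  # write -- another thread may have run in between

threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 400000; interleaved reads and writes usually lose updates,
# so the printed value is often smaller.
print(counter)
```

Making the read-modify-write atomic (for example, by holding a lock around it) removes the race.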
Resource Contention
Two or more threads attempt to access/modify the same memory and conflict.
Add picture of Resource Contention.
Dead Lock
One or more processes are blocked from making progress while waiting for a resource.
Add picture of Dead Lock.
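A small Python sketch (an illustrative example of mine, assuming only the standard library) shows the classic two-lock scenario and the usual fix: every thread acquires the locks in the same global order, so no thread can hold one lock while waiting forever on the other.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def worker(name, first, second):
    # Both threads take the locks in the SAME global order (a, then b).
    # If one thread took a->b while the other took b->a, each could end
    # up holding one lock and waiting forever for the other: deadlock.
    with first:
        with second:
            done.append(name)

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_a, lock_b))
t1.start(); t2.start()
t1.join(); t2.join()
print(done)
```

Swapping the argument order for one of the threads would make the program liable to hang, which is why a consistent lock ordering is the standard defense.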
Live Lock
Like dead lock, but the processes are actively running while still failing to make progress.
Add picture of Live Lock.
Resource Under/Overutilization
- Too little work, CPUs are thus sitting idle
- Too many threads, CPUs are spending too much time context switching
- Processes are overly complex and therefore the CPU may be overextending itself
- Memory required for processes is too large or changes too frequently. This means that there are cache misses and data is transported to and from RAM and even the hard drive
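One common mitigation for the first two bullets is to scale the worker count to the machine and the workload. The sketch below (the sizing heuristic is illustrative, not a rule) sizes a thread pool from the CPU count rather than spawning a thread per task.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x

# Sizing the pool to the machine avoids both extremes: too few workers
# leave cores idle; far too many just add context-switching overhead.
n_workers = min(32, (os.cpu_count() or 1) * 2)  # heuristic only

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(work, range(100)))

print(results[:5])
```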
Hello. In this video, we will discuss issues that arise when writing multi-threaded programs. There are many issues that can occur if care isn’t taken when writing concurrent programs. They include race conditions, dead lock, live lock, non-optimal resource utilization, and contention over resources. One of the most common pitfalls that you encounter when writing multi-threaded programs is a race condition. This occurs when threads execute in a sequence different than intended, and it can have dire consequences. In this case, the programmer wanted the first thread to execute its three instructions, and then the second thread to execute its three instructions. Unfortunately, the instructions were interleaved, and therefore the final integer value is not what the developer wanted. I want to make it clear that in this case there isn’t an incorrect final value, since each thread performed its instructions in a valid manner relative to the shared variable. It is often best to consider minimizing shared or global variables, or, if they are required, to have a plan to ensure that each thread’s access is atomic, which we will talk about later. Similar to a race condition, but related to memory rather than the order of execution, is resource contention. The more threads and shared resources there are, the more often this will occur if the programmer hasn’t given sufficient consideration to it. This is a more asynchronous version of a race condition, which can extend from threads to completely separate machines. What happens is that resources need to be accessed in different ways by different independent threads, and when they access the same memory, file, etc., this can lead to conflict and constant rewriting of values. Consider the case of databases where values need to be incremented, but each time a value is retrieved from the database another thread is accessing the same row.
Now, it can happen that between the time the first thread accesses and updates the row, other threads do the same thing, and numerous increments go uncounted, since only the last updated value is kept. This is a very significant issue with databases; thankfully, they have numerous ways of ensuring consistency. Dead lock is similar to resource contention, except that one or more threads or processes require multiple shared resources and will not perform their action until they have access to all that they require. At the same time that they wait for the other resources they need, they do not relinquish their hold on the shared resources they already have. Multiple threads each hold a shared, required resource while simultaneously waiting for other resources, and none can proceed to the point where they can release what they hold. None of the participants in the dead lock will make it out of this situation. The first response is to have threads drop their holds on resources, but this can devolve into threads dropping resources and then frantically trying to grab every resource they can, possibly not helping at all. Live lock is like dead lock, but each thread is actively doing something that never makes it out of the overall process that requires multiple resources. This can happen when a programming loop tests for access prior to making a final change and goes to sleep for a small or no amount of time. Consider a do-while loop that doesn’t change an externally visible value until it can save the value to a resource. Numerous threads executing the same loop might come to the same while test, not be able to access a necessary variable, and thus keep incrementing another variable. This could result in a buffer or stack overflow, executing an instruction millions or even an infinite number of times, or recursively executing the same code and locking up a program. This can even happen if you incorrectly use some programming languages’ asynchronous capabilities, so be careful.
A slightly less worrisome issue is over- or underutilizing the computational power of your machine or machines, though if this happens on a powerful system or is replicated over a large cluster of computing resources, it can be expensive. As the name should indicate, this is when you have too few or too many threads, and they either have nothing to do, or a few threads are performing a series of computations that could be broken down and run in parallel, and therefore more efficiently. If too many threads are going from active to inactive status due to not having work to do, the cost of context switching can outweigh the speed offered by using many threads. This can be handled by scaling the number of threads based on the amount of data or on CPU utilization. When threads are too few, they may be in constant use, small memory leaks or inefficiencies can compound, and CPU utilization may spike, which slows all running threads. Also, if a thread requires a lot of data or lots of instructions, cache misses can occur more frequently, and the system will be slowed by data transfers to and from the cache, RAM, or, in the worst case, the hard drive.
Concurrent Programming Problems and Algorithms Presentation
Concurrent Programming Challenges and Algorithms
- Dining Philosophers
- Producer-Consumer
- Sleeping Barber
- Data and Code Synchronization
Dining Philosophers
- Each philosopher has a fork on their left and right.
- They want to eat eggs and a side dish.
- To eat, each philosopher needs both forks.
- We need an algorithm that allows all philosophers to eat.
[The Dining Philosophers Problem With Ron Swanson](https://www.adit.io/posts/2013-05-11-The-Dining-Philosophers-Problem-With-Ron-Swanson.html)
In the Dining Philosophers Problem, if half of the philosophers had both forks and did not release them, while the other half only attempted to get a fork if both were available, the second half would be experiencing
- live lock
- multithreading
- resource starvation (correct)
- synchronization lock
Explanation: Resource starvation is definitely happening from the perspective of the philosophers without forks, since they will never get forks, as the other philosophers do not release the forks that they have.
Producer-Consumer
- One or more consumers need to read data in order and without duplication
- One or more producers add data in the order that it needs to be processed
- It is easy to produce race conditions when trying to solve this problem
Race conditions can occur in the Producer-Consumer Problem under which of the following conditions:
- The shared counter is updated, read, and then the memory is updated. (correct)
- The shared memory/array is filled, and thus the producer cannot work
- The shared counter becomes negative or larger than the size of memory.
- There are no consumers or far less consumers than producers
Explanation: A race condition occurs in the above situation because the index into the array is moved to a new location, then the consumer reads the value from that location (which could be null or an initial default value), and then the producer updates the value at that index in the array, which is never read. The order of the basic three operations of this pattern is out of sequence and the results are incorrect, which is the definition of a race condition.
Sleeping Barber
- N customers can sit in the waiting room
- There is only 1 barber
- When inactive the barber sleeps
- If the barber is sleeping, a customer should wake him up and take the barber's chair
- If there is no space in the waiting room, a customer leaves
Common pitfalls of the Sleeping Barber are all but:
- Underutilization
- Dead Lock
- Resource Contention
- Race conditions (correct)
Explanation: Race conditions don’t happen with this pattern; it is not as if the customer gets into the seat and leaves it before the barber cuts their hair, or any other similar combination of operations. There is no order in which the operations can be executed that results in incorrect results.
Data and Code Synchronization
- There are numerous ways, including synchronization locks, to control access to data and code in many languages. Implementations of locks can be managed like a deli-counter ticketing system.
- Semaphores are a means of showing state that can be used to manage use of sections of code or data.
- Through locks, semaphores, and algorithms, partial solutions can be devised to handle many concurrent programming issues, though there will always be competing constraints of access and efficiency.
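The lock-versus-semaphore distinction can be sketched in Python (my own example, standard library only): a semaphore with N permits lets up to N threads into a section at once, like a boarding group, where a plain lock admits exactly one ticket holder at a time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A semaphore with 3 permits: at most 3 threads may be inside the
# protected section at once. A Lock is the special case of 1 permit.
permits = threading.Semaphore(3)
active = 0
peak = 0
state_lock = threading.Lock()  # protects the bookkeeping counters

def limited_task(i):
    global active, peak
    with permits:
        with state_lock:
            active += 1
            peak = max(peak, active)
        # ... do work that tolerates 3 concurrent users ...
        with state_lock:
            active -= 1
    return i

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(limited_task, range(50)))

print("peak concurrency:", peak)
```

Even with 8 workers competing, the recorded peak never exceeds the 3 permits.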
It is possible to guarantee unconstrained access to data and 100% utilization of processing resources
- True
- False (correct)
Explanation: This is correct since you can write code that guarantees 100% access or 100% utilization of processing resources, but when you try to do both, there will always be a case where one or the other constraint will need to be relaxed. The best thing to do is to consider which constraint matters more for your application.
Hello, and I hope you will enjoy this discussion of concurrent programming problems and the algorithms that solve them. Four canonical challenges and algorithms in the world of parallel programming are Dining Philosophers, Producer-Consumer, Sleeping Barber, and Data and Code Synchronization, the last of which is more of a parallel programming mechanism than a problem. Each of these is at the heart of different issues you will encounter, and a solution for them will solve numerous analogous issues. So keep them in mind as patterns to look for when programming with parallelism. I hope you enjoy this depiction of the dining philosophers problem using Ron Swanson from Parks and Rec; it seems a little less scary than having Socrates and Plato waiting to eat. Each philosopher wants to eat, but they need two forks to be able to do this. Don’t ask me why they need two forks, but they do. They can only do one thing at a time: pick up a fork, eat eggs, put a fork down, etc. So they need to eat with both forks, and a naive implementation, where a philosopher tries to pick up the left fork, then the right, then eats, then puts the left fork down, then the right fork down, will fail, since it will most probably result in each philosopher holding one fork and being very hungry. Does this seem familiar? It is a little bit like a few of the problems encountered in the previous video. If you expect each philosopher to neatly and politely get forks and eat without ever trying to pick up the same fork, then it is resource contention, with the fork being the resource. If the philosophers hold their ground and never drop a fork, then it is deadlock, almost literally. If each philosopher drops the fork they hold and tries to get the other fork, they probably have livelock. Consider a way to solve this. Could the philosophers communicate with each other? Or could a central authority determine the order?
There are many solutions to this problem, but they aren’t always easy, so think before you write the code. The Producer-Consumer, or reader-writer, pattern is a very common tool these days. It is the pattern used in message queues, which allow for data to be added sequentially or non-sequentially in a streaming or batch fashion. Users then pull from or subscribe to the message queue, and it returns data either in the order it was placed there or as it becomes available. One or more processes, call them producers, can add data to be consumed by the consumers. If there is more data than the queue can handle, there are a number of strategies based on how the data is used. If the more recent data is more important, then the queue will drop the oldest data. If only a sub-sampling of data is needed, then data can be randomly dropped from the queue. If all data is important, then newer data can be stored, in slower and less costly storage, for access once space is available, and consumers can be told to hold back from doing any more processing. There is a chance of race conditions if the data structure or queue allows for overwriting of values, or if the pointer index for the producer ends up ahead of the index of the consumer, in which case the consumer may never end up with new data. The Sleeping Barber is similar to Producer-Consumer, but order is not important; the threads are trying their best to optimize the barber's work. The waiting room is like the queue or data structure from the previous slide, but the barber or barbers can only put one customer in the chair at a time. The waiting room is of a finite size, and two issues can occur. First, if the barbers are chatty and slow at cutting hair, and there are lots of prospective customers, we can have something like livelock or overutilization, where customers stick their heads in, see the waiting room full, and leave.
The second, opposite case is that there are no customers, and the barbers sit around sleeping and talking about how much better baseball was back in the day. This is a situation of underutilization. There are solutions to this problem for one or more barbers. When I think of this problem, I think about inverting it and having the number of customers in the waiting room determine the number of barbers. Many, if not most, languages provide a mechanism for synchronizing data or code. This generally locks access to the data or code, so use it with caution; if you make all data synchronized, you can end up with deadlock or livelock. If a thread is accessing a synchronized piece of data, any other thread that requires that data will wait. This is useful in ticketing systems, where you want to lock access to code or data until it has been fully used by the previous holder of the ticket. Semaphores are implemented with locks, but have a limited number of states to manage access to data or code more indirectly, allowing for multiple concurrent accesses. Think of this like seating categories for flights: those in class 1A can be seated before those in class B or 2, and multiple individuals can board at the same time, which should minimize the possibility of being blocked by other passengers. The major note here is that there are rarely bulletproof solutions to multithreading problems, but you can come up with a reasonable solution based on prioritization of coherent data access and the acceptability of under- or overutilization.
For the assignments I could only find my solutions - however next time onwards I shall keep a copy of the unsolved version as well.
Producer Consumer Activity
#include <iostream>
#include <thread>
#include <deque>
#include <mutex>
#include <chrono>
#include <cstdlib>
#include <condition_variable>

using std::deque;

std::mutex mu, cout_mu;
std::condition_variable cond;

class Buffer
{
public:
    void add(int num) {
        std::unique_lock<std::mutex> locker(mu);
        // Block until there is room in the buffer.
        cond.wait(locker, [this]() { return buffer_.size() < size_; });
        buffer_.push_back(num);
        locker.unlock();
        cond.notify_all();
    }

    int remove() {
        std::unique_lock<std::mutex> locker(mu);
        // Block until there is something to consume.
        cond.wait(locker, [this]() { return buffer_.size() > 0; });
        int back = buffer_.back();
        buffer_.pop_back();
        locker.unlock();
        cond.notify_all();
        return back;
    }

    Buffer() {}

private:
    deque<int> buffer_;
    const unsigned int size_ = 10;
};

class Producer
{
public:
    Producer(Buffer* buffer, std::string name)
    {
        this->buffer_ = buffer;
        this->name_ = name;
    }

    void run() {
        while (true) {
            int num = std::rand() % 100;
            buffer_->add(num);
            cout_mu.lock();
            int sleep_time = std::rand() % 100;
            std::cout << "Name: " << name_ << " Produced: " << num
                      << " Sleep time: " << sleep_time << std::endl;
            // Note: sleeping while holding cout_mu serializes the workers.
            std::this_thread::sleep_for(std::chrono::milliseconds(sleep_time));
            cout_mu.unlock();
        }
    }

private:
    Buffer *buffer_;
    std::string name_;
};

class Consumer
{
public:
    Consumer(Buffer* buffer, std::string name)
    {
        this->buffer_ = buffer;
        this->name_ = name;
    }

    void run() {
        while (true) {
            int num = buffer_->remove();
            cout_mu.lock();
            int sleep_time = std::rand() % 100;
            std::cout << "Name: " << name_ << " Consumed: " << num
                      << " Sleep time: " << sleep_time << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(sleep_time));
            cout_mu.unlock();
        }
    }

private:
    Buffer *buffer_;
    std::string name_;
};

int main() {
    Buffer b;

    Producer p1(&b, "producer1");
    Producer p2(&b, "producer2");
    Producer p3(&b, "producer3");
    Consumer c1(&b, "consumer1");
    Consumer c2(&b, "consumer2");
    Consumer c3(&b, "consumer3");

    std::thread producer_thread1(&Producer::run, &p1);
    std::thread producer_thread2(&Producer::run, &p2);
    std::thread producer_thread3(&Producer::run, &p3);
    std::thread consumer_thread1(&Consumer::run, &c1);
    std::thread consumer_thread2(&Consumer::run, &c2);
    std::thread consumer_thread3(&Consumer::run, &c3);

    // run() never returns, so these joins block forever; stop with Ctrl-C.
    producer_thread1.join();
    producer_thread2.join();
    producer_thread3.join();
    consumer_thread1.join();
    consumer_thread2.join();
    consumer_thread3.join();

    getchar();
    return 0;
}
Dining Philosopher
Solved Assignment
# https://rosettacode.org/wiki/Dining_philosophers#Python
import threading
import random
import time

# Dining philosophers, 5 Phillies with 5 forks. Must have two forks to eat.
#
# Deadlock is avoided by never waiting for a fork while holding a fork (locked).
# Procedure is a blocking wait to get the first fork, and a nonblocking
# acquire of the second fork. If the second fork cannot be acquired, release
# the first fork, swap which fork is first and which is second, and retry
# until getting both.
#
# See discussion page note about 'live lock'.

class Philosopher(threading.Thread):
    running = True

    def __init__(self, xname, fork_on_left, fork_on_right):
        threading.Thread.__init__(self)
        self.name = xname
        self.fork_on_left = fork_on_left
        self.fork_on_right = fork_on_right

    def run(self):
        while self.running:
            # Philosopher is thinking (but really is sleeping).
            time.sleep(random.uniform(3, 13))
            print(f'{self.name} is hungry.')
            self.dine()

    def dine(self):
        fork1, fork2 = self.fork_on_left, self.fork_on_right
        while self.running:
            fork1.acquire(True)
            locked = fork2.acquire(False)
            if locked:
                break
            fork1.release()
            print(f'{self.name} swaps forks')
            fork1, fork2 = fork2, fork1
        else:
            return
        self.dining()
        fork2.release()
        fork1.release()

    def dining(self):
        print(f'{self.name} starts eating')
        time.sleep(random.uniform(1, 10))
        print(f'{self.name} finishes eating and leaves to think.')

def dining_philosophers():
    forks = [threading.Lock() for n in range(5)]
    philosopher_names = ('Aristotle', 'Kant', 'Buddha', 'Marx', 'Russel')
    philosophers = [Philosopher(philosopher_names[i], forks[i % 5], forks[(i + 1) % 5])
                    for i in range(5)]
    random.seed(507129)
    Philosopher.running = True
    for p in philosophers:
        p.start()
    time.sleep(100)
    Philosopher.running = False
    print("Now we're finishing.")

dining_philosophers()
Sleeping Barber
Solved Assignment
# Based on code from https://github.com/Nohclu/Sleeping-Barber-Python-3.6-/blob/master/barber.py
import time
import random
import threading
from queue import Queue

CUSTOMERS_SEATS = 15       # Number of seats in the barbershop waiting room
BARBERS = 3                # Number of barbers working
EVENT = threading.Event()  # Event flag, tracks Barber/Customer interactions

class Customer(threading.Thread):  # Producer thread
    def __init__(self, queue):  # Constructor receives the global Queue (all_customers)
        threading.Thread.__init__(self)
        self.queue = queue
        self.rate = self.what_customer()

    @staticmethod
    def what_customer():
        customer_types = ["adult", "senior", "student", "child"]
        customer_rates = {"adult": 16, "senior": 7, "student": 10, "child": 7}
        t = random.choice(customer_types)
        print(t + " rate.")
        return customer_rates[t]

    def run(self):
        if not self.queue.full():  # Check queue size
            EVENT.set()    # Sets the EVENT flag, i.e. a customer is available in the queue
            EVENT.clear()  # Alerts a barber that there is a customer available
        else:
            print("Queue full, customer has left.")  # If the queue is full, the customer leaves

    def trim(self):
        print("Customer haircut started.")
        a = 3 * random.random()  # Random haircut duration
        payment = self.rate
        time.sleep(a)  # Simulates the time it takes for a barber to give a haircut
        print("Haircut finished. Haircut took {}".format(a))
        global Earnings
        Earnings += payment

class Barber(threading.Thread):  # Consumer thread
    def __init__(self, queue):  # Constructor receives the global Queue (all_customers)
        threading.Thread.__init__(self)
        self.sleep = True  # No customers in the queue, so the barber sleeps by default
        self.queue = queue

    def is_empty(self):  # Checks whether there is a customer in the queue
        if self.queue.empty():
            self.sleep = True   # If nobody is in the queue, the barber sleeps.
        else:
            self.sleep = False  # Else he wakes up.
        print("------------------\nBarber sleep {}\n------------------".format(self.sleep))

    def run(self):
        global SHOP_OPEN
        while SHOP_OPEN:
            while self.queue.empty():
                # Waits for the EVENT flag to be set -- the barber is sleeping.
                EVENT.wait()
                print("Barber is sleeping...")
            print("Barber is awake.")
            self.is_empty()
            customer = self.queue.get()  # FIFO queue, so the first customer added is served
            customer.trim()              # The customer's hair is being cut
            self.queue.task_done()       # Marks this customer's haircut as complete
            print(self.name)             # Which barber served the customer

def wait():
    time.sleep(1 * random.random())

if __name__ == '__main__':
    Earnings = 0
    SHOP_OPEN = True
    barbers = []
    all_customers = Queue(CUSTOMERS_SEATS)  # A queue of size CUSTOMERS_SEATS
    for _ in range(BARBERS):
        b = Barber(all_customers)  # Pass the all_customers queue to the Barber constructor
        b.daemon = True  # Daemon thread, so it can be terminated easily
        b.start()        # Invokes the run method in the Barber class
        barbers.append(b)  # Keep the Barber threads for easy referencing later on
    for _ in range(10):  # Loop that creates customers
        print("----")
        print(all_customers.qsize())  # Simple tracker of qsize (NOT RELIABLE!)
        wait()
        c = Customer(all_customers)  # Passing the Queue object to the Customer class
        all_customers.put(c)         # Puts the Customer thread in the queue
        c.start()                    # Invokes the run method in the Customer class
    all_customers.join()  # Waits until all queued customers have been served
    print("€" + str(Earnings))
    SHOP_OPEN = False
    for i in barbers:
        i.join()  # Waits for all barbers
    # Program hangs due to the infinite loop in the Barber class; use Ctrl-C to exit.
Concurrent Programming Patterns
Concurrent Programming Patterns
- Divide and Conquer
- Map-Reduce
- Repository
- Pipelines/Workflows
- Recursion
Divide and Conquer
- Take large datasets and divide them up
- Often implemented via recursive algorithms
- Solve small problems and apply those solutions to solve ever larger problems
- Easily implemented on powerful computing platforms that can maintain complex state
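The divide-combine shape of the pattern can be sketched in Python (my own illustration; `dc_max` and the depth cap are inventions of this example, not course code): split the data, solve each half in parallel, and combine the sub-answers.

```python
from concurrent.futures import ThreadPoolExecutor

def dc_max(data, pool, depth=2):
    # Base case: the piece is small (or we are deep enough), solve directly.
    if depth == 0 or len(data) <= 2:
        return max(data)
    mid = len(data) // 2
    # Divide: hand each half to the pool, then combine the sub-answers.
    # The depth cap keeps the bounded pool from being exhausted by
    # parent tasks blocking on their children.
    left = pool.submit(dc_max, data[:mid], pool, depth - 1)
    right = pool.submit(dc_max, data[mid:], pool, depth - 1)
    return max(left.result(), right.result())

with ThreadPoolExecutor(max_workers=8) as pool:
    answer = dc_max(list(range(1000)), pool)
print(answer)  # 999
```

The combine step here is trivial (`max` of two sub-results); in sorting or searching it would be a merge or a comparison of candidates.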
Which is true of the concept of Divide and Conquer?
- It requires multiple processes/threads to work.
- By solving smaller subsets of the problems, they can be combined into larger solutions. (correct)
- It doesn’t work well with recursive function calls.
- It can only work if division is done binarily.
- If used to implement searching for a single value, all data must be investigated.
Explanation: This is the main reason for using divide and conquer. If you had to answer a question but the data structure it was stored in only allowed sequential access, then divide and conquer wouldn’t work.
Map-Reduce
- Similar to divide and conquer, with division of data/computation
- Often skips much of the hierarchy of divide and conquer, mapping a single data point to a single result.
- Reduces a series of inputs to a single result.
- Can include multiple stages and repeated iterations.
- Operations in the map and reduce steps work best if simple and uniform across a stage.
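A minimal Python sketch of the two stages (my own word-count example, not course code): the map step transforms each record independently and in parallel, and the reduce step folds the intermediate values into one result.

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def mapper(line):
    # Map: one input record -> one intermediate value (word count of the line).
    return len(line.split())

def reducer(acc, value):
    # Reduce: fold many intermediate values into a single result.
    return acc + value

lines = ["the quick brown fox", "jumps over", "the lazy dog"]

# The map stage runs in parallel across a pool of workers;
# the reduce stage combines the mapped values into one answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapped = list(pool.map(mapper, lines))

total_words = reduce(reducer, mapped, 0)
print(total_words)  # 9
```

Because each mapper touches only its own record, the map stage parallelizes with no shared state at all; only the reduce step combines results.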
Map-Reduce is a form of Divide and Conquer?
- True (correct)
- False
Explanation: This is correct, as the main principle is to take each dataset, break it down into single values, and execute map functions on each value; the map output is sent to a smaller number of reducers, which output the final result. This means that solutions to small subsets are combined to arrive at the overall solution, which is the definition of Divide and Conquer.
In Map-Reduce, which is false?
- Reducers handle multiple inputs and output a single solution.
- There need to be more reducers than mappers. (correct)
- The result of a map action is a single value.
- There is no requirement for recursion.
Explanation: This is the opposite of the map-reduce pattern: you want many mappers and the smallest number of reducers, since you want a single output value. If multiple reducers are needed, then there will need to be follow-on processing.
Repository
- Repository manages central state across multiple running processes
- Processes access and update their own and central state through predefined communication mechanisms
- Need to ensure that two processes don’t have access to same data simultaneously
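A small Python sketch of the pattern (the `Repository` class is my own illustration): worker processes never touch the shared state directly, only through the repository, which serializes access so two workers never modify the same data at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Repository:
    """Central shared state. All reads and updates go through the
    repository, which serializes access with an internal lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def update(self, key, delta):
        with self._lock:
            self._state[key] = self._state.get(key, 0) + delta

    def read(self, key):
        with self._lock:
            return self._state.get(key, 0)

repo = Repository()

def worker(_):
    # Workers communicate only via the repository's predefined methods.
    for _ in range(1000):
        repo.update("hits", 1)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(worker, range(4)))

print(repo.read("hits"))  # 4000 -- no updates lost
```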
In the Repository pattern, which role is defined appropriately?
- Repositories control the order in which processes do computation.
- Processes manage access to all shared data.
- Repositories manage access to all shared data (correct)
- Processes synchronize computation amongst each other.
Explanation: The only role that manages access to shared data is the repository.
Pipelines/Workflows
- Pipelines can manage multiple synchronous or asynchronous processing steps in a linear fashion
- Each step can have its own logic
- Workflows are similar to pipelines but involve more complex interactions and can be represented as Directed Acyclic Graphs (not required)
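A linear pipeline can be sketched in Python with one thread per stage and queues connecting the steps (my own illustration; the parse/transform stages and `STOP` sentinel are inventions of this example).

```python
import threading
from queue import Queue

# Linear pipeline: stage 1 parses, stage 2 transforms. Each stage has
# its own logic and its own thread; queues connect the steps.
STOP = object()  # sentinel marking the end of the stream

def stage(in_q, out_q, fn):
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)  # pass the sentinel downstream
            return
        out_q.put(fn(item))

q1, q2, q3 = Queue(), Queue(), Queue()
threads = [
    threading.Thread(target=stage, args=(q1, q2, int)),            # parse
    threading.Thread(target=stage, args=(q2, q3, lambda x: x * x)) # transform
]
for t in threads:
    t.start()
for token in ["1", "2", "3", STOP]:
    q1.put(token)

results = []
while (item := q3.get()) is not STOP:
    results.append(item)
for t in threads:
    t.join()
print(results)  # [1, 4, 9]
```

A workflow would generalize this by letting a stage read from, or write to, several queues (forks and joins), which is what turns the straight line into a DAG.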
What is not a valid difference between pipelines and workflows?
- Pipelines process data without forks and joins.
- Pipelines are not DAGs (correct)
- Workflow steps can receive input from multiple prior steps
- Workflow outputs can feed into multiple downstream steps.
Explanation: Pipelines are the simplest form of directed acyclic graphs (DAGs). They definitely have no cycles, and they proceed from one step to another without any more advanced branching schemes such as forks and joins.
Recursion
- Functions call themselves with arguments derived from current input
- Used in some implementations of divide and conquer, but has general-purpose uses
- Does not work well when distributed across CPUs, networked machines, or GPUs because management of state can be complex and wasteful.
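A minimal Python sketch of a recursive call (my own head/tail example, echoing the functional-language style discussed below): the function calls itself with a subset of its input, and the base case is what prevents the infinite-loop pitfall.

```python
def total(items):
    # Base case stops the recursion; without it, the function would
    # call itself forever (the classic recursion pitfall).
    if not items:
        return 0
    head, *tail = items      # split the input: first element and the rest
    return head + total(tail)  # recurse on the strictly smaller subset

print(total([1, 2, 3, 4]))  # 10
```

Note that this makes one recursive call per element rather than dividing the data in half, which is why recursion is more general than divide and conquer.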
Which is an accurate definition of recursion?
- Recursive functions call themselves with subsets of their input data. (correct)
- Recursive functions only work in divide and conquer algorithms (incorrect) [Recursive functions can be used under circumstances that are not necessarily divide and conquer, though it is the primary usage pattern for this type of call. An example of when recursion is not divide and conquer, would be in functional languages with data structures with head and tail or first and rest. A recursive call is made as say the same function is applied to each item of data, which would mean that there would be the same number of recursive calls as the amount of data, which is not dividing and conquering.]
- Recursive functions work in all computing architectures including networked CPUs and GPUs (incorrect) [Many types of hardware and associated software architectures do not handle recursion well. GPUs, Nvidia specifically, and CUDA do not encourage this pattern as recursive calls can halt all other threads. The better usage pattern might be a multipass map-reduce solution.]
- It is impossible for recursive functions to get caught in infinite loops. (incorrect) [Infinite loops are one of the biggest problems with badly written recursive functions. If solutions/base cases do not appropriately handle all solutions, it is possible for a recursive call to continually call itself with the same data and eventually loop until the software or the OS locks.]
- With CPU usage, hardware stack memory ensures that recursive functions can always complete. (incorrect) [While hardware-based stacks and heaps make recursive functions work, there is no guarantee that a badly written recursive algorithm will complete. This is why usage of recursive functions should only occur after careful thought and algorithm design.]
Explanation: This is the exact definition of a recursive call. The subset passed to each subsequent recursive call should be smaller than the current call’s input, but that is not strictly required, though an input that does not shrink can be cause for concern.
[1] Kim, Eun-Gyu. 2004. Parallel Programming Patterns. Department of Computer Science, University of Illinois. Retrieved from http://snir.cs.illinois.edu/patterns/patterns.pdf
Hello and welcome to this video, in which we’ll discuss patterns that you will use in parallel programming. Many of the solutions to parallel programming challenges fit into these five patterns: divide and conquer, map-reduce, the repository, pipelines or workflows, and recursion. Learn to recognize when one of these patterns can be used in your program to make it more efficient, and you’ll save yourself a lot of time by not reinventing the wheel. One of the most common ways to solve computer science problems, even without concurrency, is divide and conquer. You’ve probably seen it used in searching and sorting algorithms. The basic idea is to split up the data until it is small enough to answer the larger question (ordering or equality, say) for the smaller amount of data. Once a thread has an answer, it returns that answer to its caller, which takes the responses from all the smaller datasets and answers the question with less work. The result makes its way back to the main calling context and should have taken less time than doing pairwise comparisons. It’s not always the best answer. Finding a value in a set might still take n comparisons, and if the dataset is not already sorted, you’ll pay an extra cost for breaking it down and bringing it back together. If recursion is not allowed or is really inefficient, which is the case in CUDA, then this pattern should not be used frequently. Also, a program implementing it will need to be well designed and/or maintain a complex state for all of the various divides. Map-reduce can be thought of as a subset of divide and conquer. The main difference is that each time a map-reduce cycle is run, it immediately breaks the dataset down into individual data points, and the same operation is applied to every piece of data. Each mapper returns only one value, and the reducer has the job of taking the n returned values and reducing them to a single value.
An example would be testing whether a value is in a set: each mapper would return zero if the value passed to it was not the searched-for value and one if it was. The reducer would add all the values and convert zero to false and greater than zero to true. The nice thing about this approach is that it scales well regardless of the size of the data and the quantity of computing resources, presuming the map-reduce system is well architected and communication is not slow. What mappers and reducers do can be more complex, or they can be strung into a series of map-reduce jobs that feed into each other. The repository pattern can be used when state needs to be maintained across multiple threads or processes. Each process runs independently, and when it needs information or wants to change the overall state, it makes requests of the repository. The repository must ensure that data is maintained atomically within itself, but processes are responsible for maintaining their own state. Based on concerns about data consistency or staleness, a system may want to encourage more or less use of the repository, since it is possible for a process to work on data for a while and then want to update it; doing so could overwrite changes that were made based on a newer state. Pipelines and workflows are similar. Both can be represented as directed graphs without cycles, though that is not strictly the case, since some workflow systems allow cycles, but data still flows out of a node. Pipelines are workflows in which each step gets input from the previous step and outputs to a single future step. Workflows often employ fan-out and fan-in patterns, where either the same input is sent to different logical steps or the data is divided up and sent to the same logical code. Note that map-reduce and divide and conquer can be implemented using workflows; neither is exclusive of the other.
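The membership test described above (mappers emit 0 or 1, the reducer sums) can be sketched with Java parallel streams. This is an illustrative sketch of the map-reduce idea, not a distributed MR system; class and method names are invented for this example:

```java
import java.util.Arrays;

public class MapReduceSearch {
    // Map: each element -> 1 if it equals the target, else 0.
    // Reduce: sum the mapped values; a sum greater than 0 means "found".
    public static boolean contains(int[] data, int target) {
        int sum = Arrays.stream(data)
                        .parallel()                 // mappers may run on many threads
                        .map(v -> v == target ? 1 : 0)
                        .sum();                     // the reducer
        return sum > 0;
    }

    public static void main(String[] args) {
        int[] data = {4, 6, 7, 1, 2, 8};
        System.out.println(contains(data, 7)); // true
        System.out.println(contains(data, 3)); // false
    }
}
```

Because every mapper is independent and the reduction (addition) is associative, this scales with the number of available threads, which is the property the transcript highlights.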
Programming languages increasingly handle divide and conquer and map-reduce in a more functional-programming way. A key way to solve many complex problems is via recursion. Almost any problem can be solved recursively, though not always most efficiently. Functional programming languages and frameworks such as Lisp, Clojure, and Lodash are built around some level of recursion. In these cases, data is divided into head and tail, or first and rest. Functions operate on the current data and call themselves with the rest. Recursion does not always have to be handled in that way; the data can be divided in a number of ways, including in a binary manner. Recursion requires management of state, since you need to ensure that it is not infinite, which means that recursive algorithms need base cases, often reached when only one or two pieces of data are input to a function. Recursion is not advisable when you are using large data across multiple local or distributed CPUs that are not on the same processor, and GPUs do not perform well with this pattern.
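The head-and-rest style of recursion mentioned above can be sketched in Java (a Lisp or Clojure version would express this more directly; the class and method names here are invented for illustration):

```java
import java.util.Arrays;

public class HeadTailRecursion {
    // Sum an array recursively: operate on the head, recurse on the rest.
    // The base case (empty array) guarantees termination; there is exactly
    // one recursive call per element, so this is NOT divide and conquer.
    public static int sum(int[] data) {
        if (data.length == 0) return 0;                       // base case
        int head = data[0];
        int[] rest = Arrays.copyOfRange(data, 1, data.length);
        return head + sum(rest);
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[]{1, 2, 3, 4})); // 10
    }
}
```

Note the cost of managing state: each call copies the rest of the array, and each pending call occupies a stack frame, which is part of why this pattern distributes poorly across CPUs and GPUs.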
Serial Versus Parallel Code and Flynn’s Taxonomy
Concurrent Programming Patterns
- Serial Search Pseudocode
- Parallel Search Pseudocode
- Flynn’s Taxonomy
Inefficient Serial Search Pseudocode
public class SerialSearchExample {
    public static void main(String[] args) {
        int[] data = {1, 6, 9, 2, 3, 5, 4};
        int x = 3; // example search value
        int foundIdx = serialSearch(data, x);
        if (foundIdx != -1) {
            System.out.println("Value found at index: " + foundIdx);
        } else {
            System.out.println("Value not found.");
        }
    }

    public static int serialSearch(int[] data, int searchValue) {
        int foundIndex = -1;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == searchValue) {
                foundIndex = i;
                break;
            }
        }
        return foundIndex;
    }
}
Pros:
- Simple code
- Doesn’t depend on sorting
Cons:
- Scales linearly with data size
Recursive Serial Search Pseudocode
public class BinarySearchExample {
    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 9};
        int x = 5; // value to search
        int foundIdx = binarySearch(data, 0, data.length - 1, x);
        if (foundIdx != -1) {
            System.out.println("Value found at index: " + foundIdx);
        } else {
            System.out.println("Value not found.");
        }
    }

    public static int binarySearch(int[] arr, int l, int r, int x) {
        if (r >= l) {
            int mid = l + (r - l) / 2; // avoids overflow of (l + r) / 2
            if (arr[mid] == x)
                return mid;
            if (arr[mid] > x)
                return binarySearch(arr, l, mid - 1, x);
            return binarySearch(arr, mid + 1, r, x);
        }
        return -1;
    }
}
Presumes sorted data
Pros:
- Search time is logarithmic
- Sorting cost is paid only once if the data doesn’t change
Cons:
- Sorting cost recurs whenever the data changes
- Doesn’t distribute well
Parallel Search Pseudocode
int pSearch(int[] subData, int x, int offset, SharedResult result) {
    for (int idx = 0; idx < subData.length; idx++) {
        if (result.isFound()) return -1; // another thread found it
        if (subData[idx] == x) {
            return idx + offset; // return global index
        }
    }
    return -1;
}

class SharedResult {
    private int foundIndex = -1;
    synchronized void setFoundIndex(int idx) { if (foundIndex == -1) foundIndex = idx; }
    synchronized int getFoundIndex() { return foundIndex; }
    synchronized boolean isFound() { return foundIndex != -1; }
}

function killOtherThreads(Thread[] threads) {
    for (Thread t : threads) {
        if (t is still running)
            t.interrupt(); // or some other safe termination
    }
}

// Input
int[] data = [4, 6, 7, 1, 2, 8];
int x = 7; // example search value
int numThreads = 3;
int numElements = ceil(data.length / numThreads); // chunk size (ceiling division)
Thread[] threads = new Thread[numThreads];
SharedResult result = new SharedResult(); // thread-safe class to store result

// Launch threads
for (int tIdx = 0; tIdx < numThreads; tIdx++) {
    int startIdx = tIdx * numElements;
    int endIdx = min(startIdx + numElements, data.length);
    int[] sData = slice(data, startIdx, endIdx);
    threads[tIdx] = new Thread(() -> {
        int localResult = pSearch(sData, x, startIdx, result);
        if (localResult != -1) {
            result.setFoundIndex(localResult);
            killOtherThreads(threads);
        }
    });
    threads[tIdx].start();
}

// Wait for all threads
for (int tIdx = 0; tIdx < numThreads; tIdx++) {
    threads[tIdx].join();
}

if (result.getFoundIndex() != -1) {
    print("Found at index: " + result.getFoundIndex());
} else {
    print("Not found.");
}
Pros:
- Doesn’t require presorting
- Scales based on number of threads
Cons:
- Bad when the number of threads is much smaller than the data size
- Thread management isn’t easy
We are now onto the video discussion of serial versus parallel code, focusing on how you can develop code to handle different situations. We’ll explore how to create code to search for a value in a dataset using both serial and parallel approaches. Search is one of the most canonical challenges in computing, so understanding both good and bad ways to perform it can help you define analogous processes using sequential or concurrent programming.
Also, as a way to break down what happens and what types of programming work best for what types of data, we will investigate Flynn’s taxonomy.
In this code, you will recognize a very simple sequential algorithm for finding the index of a search value in a dataset. Simply stated: iterate through all values in the set. If the current value equals the search value, return the index. Otherwise, return -1 if not found.
The good thing about this approach is that it’s extremely simple to implement. There is no need to sort the data ahead of time, which can be costly. The downside is that you may need to search through all values in the dataset to find a match. Linear search is not efficient, especially if you need to perform it frequently. That’s why the cost of sorting, along with a good algorithm (like binary search), is often worthwhile. It allows you to achieve logarithmic-time search, which becomes very efficient as the dataset grows, even with the upfront cost of sorting, especially when the sorted data can be updated efficiently.
One of the more common and effective implementations of search is using a divide-and-conquer approach. This assumes that the dataset is always sorted, because otherwise it won’t work. The code divides the problem by picking midpoints and comparing them to the search value. If equal, return the index. If the search value is smaller or larger, the function recursively searches the left or right half of the dataset.
The benefit of this approach is that search becomes logarithmic and efficient. The cost of search is low relative to frequent usage, making it worth the upfront sorting cost. However, if the dataset doesn’t update efficiently, or the data and computation are distributed, this approach can become expensive.
With parallel search, things become more complex. You need to carefully assign the correct data subset to each thread. This is usually done by uniformly slicing the dataset into chunks to evenly distribute the workload.
Each thread searches only a small portion of the dataset. If it finds the target value, it needs to terminate the other threads (probably indirectly) and return the index. The advantages here are that no sorting is required, and you can scale the solution by increasing the number of threads.
However, there are drawbacks. If the number of threads is much smaller than the dataset size, then this is only slightly more efficient than inefficient serial search. Thread management isn’t easy, and safely killing threads requires well-thought-out logic.
This approach can be improved with pre-sorting and binary search within each thread’s data slice. But then it inherits the issues we discussed with the recursive version. Where parallel search really shines is when you have lots of threads—this is where GPUs come in. More on that later.
If each thread just needs to do a single comparison, and only update a shared variable if the value is found, then the search time becomes constant, which is better than logarithmic.
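A rough sketch of the one-comparison-per-thread idea: spawn one thread per element, each doing a single comparison against a shared atomic result. On a CPU this is wasteful (thread creation dwarfs the comparison), but it illustrates what a GPU does with hardware threads. Class and method names are invented for this example:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OneComparisonPerThread {
    // Each thread performs exactly one comparison. With enough hardware
    // threads (as on a GPU), the whole search takes constant time.
    public static int search(int[] data, int x) {
        AtomicInteger found = new AtomicInteger(-1);
        Thread[] threads = new Thread[data.length];
        for (int i = 0; i < data.length; i++) {
            final int idx = i;
            threads[i] = new Thread(() -> {
                if (data[idx] == x) found.compareAndSet(-1, idx); // atomic update
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return found.get();
    }

    public static void main(String[] args) {
        int[] data = {4, 6, 7, 1, 2, 8};
        System.out.println(search(data, 2)); // 4
    }
}
```

The shared `AtomicInteger` plays the same role as the SharedResult class in the earlier pseudocode: only the first matching thread records its index.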
Flynn’s Taxonomy
Flynn’s taxonomy is a way of categorizing programs based on how they process instructions and data. Each category combines an instruction axis with a data axis, yielding SISD, SIMD, MISD, or MIMD:
- SI (Single Instruction) or MI (Multiple Instruction): Refers to whether the same or different logic is applied across the data.
- SD (Single Data) or MD (Multiple Data): Refers to whether the data is a single complex entity or many simple elements.
In the top-left corner (SISD), we have sequential programs working on a single, complex state. Think of a program managing a chess game—lots of parts, but it likely can’t be decomposed. This isn’t about searching for the next move, just managing state.
In the bottom-right corner (MIMD), multiple different processes do different things to lots of small pieces of data. This is like running multiple filters on pixels in an image or processing video frames. Each operation may be independent, without needing context from nearby pixels or previous frames.
Most CPU-based sequential programs are MISD, since they include different processing steps meant to operate on a single or small number of complex objects.
GPUs and distributed programming solutions are often SIMD, as they apply the same logic to large amounts of data using many threads and processes.
Flynn’s taxonomy helps you assess your data and processing goal. Once you determine which category your task falls into, you can choose appropriate languages, frameworks, and hardware to implement your solution effectively.
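As a loose software analogy for two of the categories (Flynn’s taxonomy properly describes hardware, so this is only illustrative, with invented names): SIMD-style work applies one instruction to many data elements, while MIMD-style work runs different instructions concurrently on different data.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class FlynnSketch {
    // SIMD-style: the same instruction (squaring) applied to many data elements.
    public static int[] squareAll(int[] data) {
        return Arrays.stream(data).parallel().map(v -> v * v).toArray();
    }

    // MIMD-style: different instructions running concurrently on different data.
    public static String runDifferentTasks() {
        CompletableFuture<Integer> sumTask  = CompletableFuture.supplyAsync(() -> 1 + 2 + 3);
        CompletableFuture<String> textTask = CompletableFuture.supplyAsync(() -> "done".toUpperCase());
        return sumTask.join() + " " + textTask.join();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squareAll(new int[]{1, 2, 3, 4}))); // [1, 4, 9, 16]
        System.out.println(runDifferentTasks()); // 6 DONE
    }
}
```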
[Flynn’s Taxonomy](https://archive.ph/nY0WM)
Apply Flynn’s taxonomy to come up with candidate solutions (including existing ones) to each of the following domains:
- Search for Extraterrestrial Intelligence
- Cryptography (Encryption/Decryption)
- Protein Folding/Candidate Vaccine Analysis
- Search for Extraterrestrial Intelligence (SETI)
Nature of Task:
- Processing massive amounts of radio signal data from space.
- Looking for patterns, anomalies, or narrow-bandwidth signals across large frequency ranges.
- Highly parallelizable — each signal chunk can be independently analyzed.
Flynn’s Classification: SIMD / MIMD
- SIMD (Single Instruction, Multiple Data): Apply the same signal processing algorithms (FFT, filtering) across massive streams of independent data.
- MIMD (Multiple Instruction, Multiple Data): Different signal analyses (AI-based, frequency-specific heuristics, anomaly detection) applied in parallel across large datasets.
Candidate Solutions / Examples:
- SETI@home (now retired): Classic MIMD model—distributed volunteer computing using BOINC.
- GPU-accelerated signal analysis: Modern SIMD approach using CUDA/OpenCL for real-time FFTs and filtering.
- AI-based anomaly detectors: Running on MIMD-style high-performance clusters.
Best Fit: MIMD with optional SIMD sub-tasks (e.g., FFT)
- Cryptography (Encryption / Decryption)
Nature of Task:
- Modern encryption algorithms (AES, RSA, ECC) are deterministic, structured.
- Brute force or key cracking involves testing many keys — easily parallelizable.
- Secure communication often involves many concurrent encrypt/decrypt operations.
Flynn’s Classification: SIMD / MIMD
- SIMD: Apply the same encryption algorithm across multiple messages or keys — ideal for AES or block cipher modes on GPUs.
- MIMD: Each thread or process could try a different key or decrypt scheme in brute-force cracking; or process independent sessions.
Candidate Solutions / Examples:
- AES-NI: Hardware acceleration using SIMD (Intel/AMD instruction sets).
- Brute-force tools: e.g., Hashcat, John the Ripper — heavily GPU-optimized, SIMD-like.
- Parallelized public-key systems: Cloud-based encryption systems that use thread pools for concurrent sessions.
Best Fit:
- Encryption: SIMD for speed, instruction-level parallelism.
- Brute-force / Decryption: MIMD — each process tries a different key independently.
- Protein Folding / Candidate Vaccine Analysis
Nature of Task:
- Predicting 3D protein folding from amino acid sequences (very complex, nonlinear).
- Exploring huge search spaces to simulate molecular interactions and binding.
- Involves physics-based models (energy minimization), ML models, and stochastic simulations.
Flynn’s Classification: MIMD
- Highly diverse instructions across diverse data (e.g., different folding candidates, molecular models).
- Different simulation types and models may run concurrently.
Candidate Solutions / Examples:
- Folding@home: Classic MIMD model — each node simulates a portion of folding space.
- AlphaFold: Deep learning model (not parallel in Flynn’s sense, but trained using MIMD-style distributed frameworks like TensorFlow on TPU clusters).
- Molecular dynamics engines: e.g., GROMACS, AMBER, which run large-scale simulations often in MIMD supercomputers.
Best Fit: MIMD
- Due to heterogeneity of tasks and models (physics + biology + AI).
- SIMD may be used in low-level operations (e.g., matrix multiplications, force calculations).