Task Parallelism and Data Parallelism

Task Parallelism

Task parallelism, also known as function parallelism, focuses on distributing different tasks or functions across multiple threads or processors. Each thread or processor executes a distinct task independently, allowing for parallel execution of multiple tasks.

Here’s an example of task parallelism in Java using the ExecutorService:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TaskParallelismExample {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);

        Runnable task1 = () -> {
            System.out.println("Task 1 executed by " + Thread.currentThread().getName());
        };

        Runnable task2 = () -> {
            System.out.println("Task 2 executed by " + Thread.currentThread().getName());
        };

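        // submit() queues each task; the pool runs it on whichever worker thread is free
        executor.submit(task1);
        executor.submit(task2);

        // Stop accepting new tasks; tasks already submitted still run to completion
        executor.shutdown();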
    }
}

In this example, we create an ExecutorService with a fixed thread pool of size 2. We define two separate tasks, task1 and task2, as Runnable instances. Each task simply prints a message indicating which thread is executing it. We submit both tasks to the executor, which hands them to its worker threads. With two threads and two tasks, both can run at the same time, so the order of the two output lines is not guaranteed.

Task parallelism is suitable when you have independent tasks that can be executed concurrently without dependencies on each other. Because no task has to wait for another, the work can be spread across multiple threads or processors, which keeps more of them busy.
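The Java example discards the tasks' results. As a rough sketch of how two different tasks might return values, here is a Python version built on concurrent.futures. The functions count_words, longest_word, and run_tasks are hypothetical names chosen for illustration. Threads are used to keep the sketch short; in CPython with the global interpreter lock, CPU-bound tasks would likely need a process pool to run truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """Task A (illustrative): count whitespace-separated words."""
    return len(text.split())

def longest_word(text):
    """Task B (illustrative): find the longest word; the first one wins on ties."""
    return max(text.split(), key=len)

def run_tasks(text):
    # Two different functions are submitted as independent tasks
    # to a pool of two worker threads.
    with ThreadPoolExecutor(max_workers=2) as executor:
        count_future = executor.submit(count_words, text)
        longest_future = executor.submit(longest_word, text)
        # result() blocks until the corresponding task has finished.
        return count_future.result(), longest_future.result()

if __name__ == "__main__":
    print(run_tasks("task parallelism runs different functions concurrently"))
```

Each submit() call returns a Future, so the caller can collect the two results in whatever order it needs, regardless of which task finishes first.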

Data Parallelism

Data parallelism, on the other hand, focuses on distributing the processing of data across multiple threads or processors. It involves partitioning the data into smaller subsets and performing the same operation on each subset in parallel.

Here’s an example of data parallelism in Python using the multiprocessing module:

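import multiprocessing

def process_chunk(chunk):
    # The same operation is applied to every chunk: sum its elements
    return sum(chunk)

def parallel_sum(data):
    num_processes = multiprocessing.cpu_count()
    # Ceiling division gives at most one chunk per process; max(1, ...) avoids
    # a zero chunk size when there are fewer items than processes
    chunk_size = max(1, -(-len(data) // num_processes))

    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with multiprocessing.Pool(processes=num_processes) as pool:
        results = pool.map(process_chunk, chunks)

    return sum(results)

# The guard keeps worker processes from re-running this block on platforms
# that start workers by spawning a fresh interpreter
if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    result = parallel_sum(data)
    print("Sum:", result)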

In this example, we have a list of numbers called data. We want to calculate the sum of all the numbers in parallel. We define a process_chunk function that takes a chunk of data and calculates the sum of that chunk. The parallel_sum function splits the data into chunks based on the number of available processes and uses a process pool to map the process_chunk function to each chunk in parallel. Finally, we sum up the results from each chunk to obtain the final sum.

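Data parallelism is effective when you have a large dataset that can be partitioned and processed independently. It runs the same operation on different subsets of the data at the same time, which can shorten processing time when the work per chunk outweighs the cost of splitting the data and starting the workers. For a list as small as the one above, that overhead will usually dominate.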
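How the data is partitioned affects how evenly the work is spread. The following is a minimal sketch of one possible partitioning strategy, in which chunk sizes differ by at most one element. The helper name split_evenly is hypothetical and is not part of the multiprocessing module.

```python
def split_evenly(data, num_chunks):
    """Split data into contiguous chunks whose sizes differ by at most one.

    Illustrative helper: never creates more chunks than there are items,
    and always returns at least one chunk.
    """
    num_chunks = max(1, min(num_chunks, len(data)))
    base, extra = divmod(len(data), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional element each.
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks

if __name__ == "__main__":
    print(split_evenly(list(range(10)), 4))
```

With ten items and four workers this yields chunks of sizes 3, 3, 2, and 2, so no worker is left with a much smaller share than the others.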

Combining Task and Data Parallelism

In many real-world scenarios, task parallelism and data parallelism are combined. By dividing the problem into independent tasks and then applying data parallelism within each task, you can use the available threads or processors at both levels.

For example, consider a scenario where you need to process multiple files, and each file contains a large amount of data. You can employ task parallelism to process each file independently in separate threads or processes. Within each file processing task, you can further apply data parallelism to distribute the processing of the file’s data across multiple threads or processors.
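The file-processing scenario could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: each file is assumed to hold one integer per line, the per-file operation is assumed to be a sum, and the names sum_chunk, process_file, and process_files are hypothetical. Thread pools keep the sketch short; for CPU-bound work in CPython, a process pool would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_chunk(lines):
    """Same operation for every chunk: add up one integer per non-blank line."""
    return sum(int(line) for line in lines if line.strip())

def process_file(path, workers_per_file=2):
    """Data parallelism: split one file's lines into chunks and sum them concurrently."""
    with open(path) as handle:
        lines = handle.readlines()
    # Ceiling division; max(1, ...) keeps the step valid for an empty file.
    chunk_size = max(1, -(-len(lines) // workers_per_file))
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers_per_file) as pool:
        return sum(pool.map(sum_chunk, chunks))

def process_files(paths, file_workers=2):
    """Task parallelism: each file in the list is handled as an independent task."""
    with ThreadPoolExecutor(max_workers=file_workers) as pool:
        totals = pool.map(process_file, paths)
        # map() yields results in input order, so they line up with paths.
        return dict(zip(paths, totals))
```

The outer pool treats each file as its own task, while each call to process_file creates a separate inner pool for that file's chunks. Keeping the two pools separate means the file-level tasks never wait for a worker slot that they themselves occupy.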