Multicore and GPGPU Programming

Instructors: Kunal Kishore Korgaonkar

12 modules: gain insight into a topic and learn the fundamentals.

Intermediate level

8 weeks to complete at 10 hours a week

Flexible schedule: learn at your own pace

What you'll learn

Understand the fundamentals of multi-threaded programming and its applications in multicore systems.
Develop shared memory programs in OpenMP and distributed programming using MPI.
Gain a foundational understanding of GPGPU architecture and the CUDA programming model.

Tools you'll learn

C (Programming Language)

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

124 assignments

Taught in English

There are 12 modules in this course

The course "Multicore and GPGPU Programming" provides a foundational understanding of parallel programming, focusing on developing high-performance, multi-threaded applications in both CPU and GPU environments. Beginning with a review of multicore processor architectures, caching mechanisms, and Non-Uniform Memory Access (NUMA) systems, students will learn the essentials of shared memory programming, synchronisation techniques, and the use of locks to ensure data integrity across threads.

The course delves into designing shared memory data structures and introduces advanced synchronisation concepts, including lazy synchronisation, which are crucial for scalable and efficient concurrent applications. Students will also explore the architecture and programming model of General-Purpose Graphics Processing Units (GPGPUs) and learn CUDA programming to leverage GPU parallelism for compute-intensive tasks. By the end of the course, students will be adept at optimising multi-threaded and many-core applications, balancing workload across CPUs and GPUs to achieve high throughput and efficient resource utilisation. This course is essential for those aiming to develop expertise in high-performance computing and parallel programming for modern multi-core and GPU-based systems.

Course Introduction

In this module, learners are introduced to the course and its syllabus. The introductory video outlines the skills and knowledge they can expect to gain. The syllabus reading covers the essential course components: course values, assessment criteria, grading system, schedule, details of live sessions, and a recommended reading list that supports the learner’s understanding of the course concepts. The module also lets learners connect with their peers through a discussion prompt designed for introductions and exchanges within the course community.

What's included

4 videos, 1 reading, 1 discussion prompt

4 videos (total 51 minutes)

Course Introductory Video (2 minutes)
Meet Your Instructor - Dr. Gargi Prabhu (1 minute)
Meet Your Instructor - Dr. Kunal Korgaonkar (1 minute)
Recording of Multicore and GPGPU Programming: Week 1 - Live Session on 25-05-23 18:32:50 [47:25] (47 minutes)

1 reading (total 10 minutes)

Course Overview (10 minutes)

1 discussion prompt (total 10 minutes)

Meet Your Peers (10 minutes)

Introduction to Parallel and Multicore Programming

In this module, students gain foundational knowledge of parallel and multi-threaded programming, exploring the core principles behind the efficient use of modern multi-core and many-core processors. Beginning with an overview of parallel programming concepts, the module covers the main types of parallelism: data parallelism, task parallelism, and pipeline parallelism. Students also examine performance metrics such as speedup, efficiency, and scalability, which help in evaluating the benefits and trade-offs of parallel approaches.
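
The speedup and efficiency metrics mentioned here have simple closed forms. The following is a minimal sketch in C, assuming the standard textbook statements of Amdahl's and Gustafson's laws (both appear as lecture topics in this module). The function names are illustrative and are not taken from the course materials.

```c
#include <assert.h>

/* Amdahl's law: speedup on p processors when a fraction f of the
   serial run time can be parallelised and the problem size is fixed. */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / ((1.0 - f) + f / p);
}

/* Gustafson's law: scaled speedup when the parallel part of the work
   grows with p (f is the parallel fraction of the time on p processors). */
double gustafson_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return (1.0 - f) + f * p;
}

/* Efficiency: speedup divided by the number of processors used. */
double efficiency(double speedup, int p)
{
    assert(p >= 1);
    return speedup / p;
}
```

For instance, `amdahl_speedup(0.9, 10)` is about 5.26, so ten processors reach an efficiency of roughly 53% when 10% of the work stays serial.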

What's included

12 videos, 2 readings, 12 assignments, 1 discussion prompt

12 videos (total 73 minutes)

Need for Ever-Increasing Performance (8 minutes)
Parallel Systems and Parallel Programs (8 minutes)
Concurrent, Parallel, Distributed Systems (5 minutes)
Types of Parallelism: Data, Task and Pipeline Parallelism (8 minutes)
Speedup and Efficiency (5 minutes)
Amdahl’s Law (5 minutes)
Gustafson’s Law (5 minutes)
Scalability in Parallel Systems (5 minutes)
Cost of Parallelisation (7 minutes)
Sources of Overhead in Parallel Programs (5 minutes)
Timing Parallel Programs: Methods and Best Practices (7 minutes)
GPU Performance (5 minutes)

2 readings (total 120 minutes)

Recommended Reading: Fundamentals of Parallel Computing (60 minutes)
Recommended Reading: Introduction to Performance Metrics in Parallel Computing (60 minutes)

12 assignments (total 36 minutes)

Need for Ever-Increasing Performance (3 minutes)
Parallel Systems and Parallel Programs (3 minutes)
Concurrent, Parallel, Distributed Systems (3 minutes)
Types of Parallelism: Data, Task and Pipeline Parallelism (3 minutes)
Speedup and Efficiency (3 minutes)
Amdahl’s Law (3 minutes)
Gustafson’s Law (3 minutes)
Scalability in MIMD Systems (3 minutes)
Cost of Parallelisation (3 minutes)
Sources of Overhead in Parallel Programs (3 minutes)
Taking Timings of Parallel Programs (3 minutes)
GPU Performance (3 minutes)

1 discussion prompt (total 30 minutes)

Why Parallelism? Revisiting the Roots of Multicore Programming (30 minutes)

Multicore Processor Architectures and Caching Mechanisms

This module provides an in-depth exploration of multicore processor architectures, examining the design principles, performance considerations, and challenges involved in building efficient multicore systems. Students will study how multiple cores interact within a processor, focusing on memory hierarchies, caching mechanisms, and the role of parallelism in improving computational performance.

What's included

15 videos, 2 readings, 15 assignments, 1 discussion prompt

15 videos (total 160 minutes)

The Von Neumann Architecture (7 minutes)
Processes, Multitasking, and Threads (5 minutes)
The Basics of Caching (7 minutes)
Virtual Memory (7 minutes)
Instruction-Level Parallelism (9 minutes)
Hardware Multithreading (6 minutes)
Classifications of Parallel Computers (6 minutes)
SIMD and MIMD Systems (7 minutes)
Interconnection Networks: Shared Memory Systems (6 minutes)
Interconnection Networks: Distributed Memory Systems (8 minutes)
Cache Coherence (8 minutes)
Shared-Memory vs. Distributed-Memory (4 minutes)
Parallel Software: Coordinating Process and Threads (11 minutes)
Distributed Memory Software (7 minutes)
Recording of Multicore and GPGPU Programming: Week 2 - Live Session on 25-05-30 18:35:08 [02:05] (62 minutes)

2 readings (total 100 minutes)

Recommended Reading: Architecture Background (40 minutes)
Recommended Reading: Parallel Hardware and Software (60 minutes)

15 assignments (total 114 minutes)

Graded Quiz - Modules 1 and 2 (60 minutes)
The Von Neumann Architecture (3 minutes)
Processes, Multitasking, and Threads (3 minutes)
The Basics of Caching (3 minutes)
Virtual Memory (3 minutes)
Instruction-Level Parallelism (3 minutes)
Hardware Multithreading (3 minutes)
Classifications of Parallel Computers (3 minutes)
SIMD and MIMD Systems (3 minutes)
Interconnection Networks: Shared Memory Systems (3 minutes)
Interconnection Networks: Distributed Memory Systems (6 minutes)
Cache Coherence (3 minutes)
Shared-Memory vs. Distributed-Memory (3 minutes)
Parallel Software: Coordinating Process and Threads (12 minutes)
Distributed Memory Software (3 minutes)

1 discussion prompt (total 30 minutes)

From Von Neumann to Multicore: Evolving Architectures and Memory Realities (30 minutes)

GPGPU Architecture and Programming Model Overview

This module introduces students to the architectural principles of General-Purpose GPU (GPGPU) systems and the CUDA programming model. It explores the hardware components, including Streaming Multiprocessors (SMs), CUDA cores, and memory hierarchy, which form the foundation of GPU computing. The module also provides an overview of the CUDA programming model, emphasising its thread hierarchy, grid, and block organisation. By understanding these fundamental concepts, students will develop the ability to harness GPU architecture for high-performance parallel computing.
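
The grid and block organisation can be illustrated without a GPU. In CUDA C, a thread in a one-dimensional launch usually derives its element index as `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that arithmetic on the CPU in plain C. It is not CUDA code, and the function names are hypothetical.

```c
#include <assert.h>

/* Plain-C stand-in for the CUDA expression
   blockIdx.x * blockDim.x + threadIdx.x (1-D grid of 1-D blocks). */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_dim > 0 && thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Smallest number of blocks such that grid_dim * block_dim >= n. */
int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Emulate a vector-add launch: each (block, thread) pair handles one
   element, with a bounds check for the partially filled last block. */
void emulate_vector_add(const float *a, const float *b, float *c,
                        int n, int block_dim)
{
    int grid_dim = blocks_for(n, block_dim);
    for (int bx = 0; bx < grid_dim; bx++) {
        for (int tx = 0; tx < block_dim; tx++) {
            int i = global_index(bx, block_dim, tx);
            if (i < n)              /* guard threads past the end */
                c[i] = a[i] + b[i];
        }
    }
}
```

On a real device the two loops disappear: the hardware runs the loop body once per thread, and only the index computation and the bounds check remain in the kernel.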

What's included

15 videos, 2 readings, 14 assignments, 1 discussion prompt

15 videos (total 127 minutes)

GPUs and GPGPU (5 minutes)
GPU Architecture (5 minutes)
Heterogeneous Computing (4 minutes)
Paradigm of Heterogeneous Computing (5 minutes)
Introduction to CUDA (5 minutes)
Structure of a CUDA Program (8 minutes)
Threads, Blocks, and Grid (9 minutes)
Managing Memory (7 minutes)
Writing and Verifying Your Kernel (6 minutes)
Compiling and Running CUDA Program (4 minutes)
Nvidia Compute Capabilities and Device Architecture (6 minutes)
Timing Your Kernel (7 minutes)
Organising Parallel Threads (5 minutes)
Managing Devices (4 minutes)
Recording of Multicore and GPGPU Programming: Week 3 - Live Session on 25-06-06 18:31:21 [44:50] (45 minutes)

2 readings (total 75 minutes)

Recommended Reading: GPGPU Architecture and CUDA (15 minutes)
Recommended Reading: Programming Model Overview (60 minutes)

14 assignments (total 48 minutes)

GPUs and GPGPU (6 minutes)
GPU Architecture (3 minutes)
Heterogeneous Computing (3 minutes)
Paradigm of Heterogeneous Computing (3 minutes)
Introduction to CUDA (3 minutes)
Structure of a CUDA Program (3 minutes)
Threads, Blocks, and Grid (6 minutes)
Managing Memory (3 minutes)
Writing and Verifying Your Kernel (3 minutes)
Compiling and Running CUDA Program (3 minutes)
Nvidia Compute Capabilities and Device Architecture (3 minutes)
Timing Your Kernel (3 minutes)
Organising Parallel Threads (3 minutes)
Managing Devices (3 minutes)

1 discussion prompt (total 30 minutes)

Harnessing GPU Power: Exploring CUDA and the Architecture of Parallelism (30 minutes)

CUDA Execution Model

This module provides a comprehensive understanding of how CUDA executes programs on GPUs. It covers key concepts such as warps, warp scheduling, and resource partitioning, which are critical for understanding GPU hardware behaviour. The module delves into branch divergence and its impact on performance, offering strategies to minimise its effects. It also emphasises exposing parallelism effectively by leveraging CUDA’s hierarchical execution model. Students will learn how to design and optimise GPU programs by aligning with the underlying execution model to maximise efficiency and throughput.
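
Branch divergence arises when threads of the same warp take different sides of a branch. The sketch below is a CPU-side model in plain C, not CUDA code. It assumes a warp of 32 threads and counts how many warps contain lanes that disagree on a branch condition. The predicate and function names are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32   /* warp width on current NVIDIA GPUs */

/* A branch condition evaluated per thread, given its thread index. */
typedef int (*predicate_fn)(int tid);

/* Alternates thread by thread: lanes within a warp disagree. */
int even_thread(int tid) { return tid % 2 == 0; }

/* Alternates warp by warp: all lanes within a warp agree. */
int even_warp(int tid) { return (tid / WARP_SIZE) % 2 == 0; }

/* Count the warps in which the lanes do not all take the same branch. */
int count_divergent_warps(int num_threads, predicate_fn pred)
{
    assert(num_threads > 0 && num_threads % WARP_SIZE == 0);
    int divergent = 0;
    for (int base = 0; base < num_threads; base += WARP_SIZE) {
        int first = pred(base);
        for (int lane = 1; lane < WARP_SIZE; lane++) {
            if (pred(base + lane) != first) {
                divergent++;
                break;          /* one disagreement is enough */
            }
        }
    }
    return divergent;
}
```

With 128 threads, `even_thread` makes all four warps diverge, while `even_warp` splits the same work in half without any divergence. Branching on the warp index rather than the thread index is one common way to reduce divergence.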

What's included

15 videos, 2 readings, 15 assignments, 1 discussion prompt

15 videos (total 135 minutes)

Introduction to CUDA Execution Model (7 minutes)
Warps and Thread Blocks (4 minutes)
Warp Divergence (9 minutes)
Resource Partitioning (6 minutes)
Latency Hiding (10 minutes)
Occupancy (5 minutes)
Synchronization (4 minutes)
Scalability (5 minutes)
Exposing Parallelism (10 minutes)
Checking Active Warps with Nvprof (6 minutes)
Checking Memory Operations with Nvprof (7 minutes)
Avoiding Branch Divergence (3 minutes)
The Parallel Reduction Problem and Thread Divergence (7 minutes)
Improving Divergence in Parallel Reduction (6 minutes)
Recording of Multicore and GPGPU Programming: Week 4 - Live Session on 25-06-13 18:32:39 [49:37] (45 minutes)

2 readings (total 120 minutes)

Recommended Reading: Structure of a CUDA Program (60 minutes)
Recommended Reading: Exposing Parallelism and Avoiding Branch Divergence (60 minutes)

15 assignments (total 105 minutes)

Graded Quiz - Modules 3 and 4 (60 minutes)
Introduction to CUDA Execution Model (3 minutes)
Warps and Thread Blocks (3 minutes)
Warp Divergence (3 minutes)
Resource Partitioning (6 minutes)
Latency Hiding (3 minutes)
Occupancy (3 minutes)
Synchronization (3 minutes)
Scalability (3 minutes)
Exposing Parallelism (3 minutes)
Checking Active Warps with Nvprof (3 minutes)
Checking Memory Operations with Nvprof (3 minutes)
Avoiding Branch Divergence (3 minutes)
The Parallel Reduction Problem and Thread Divergence (3 minutes)
Improving Divergence in Parallel Reduction (3 minutes)

1 discussion prompt (total 30 minutes)

Under the Hood: Warps, Divergence, and CUDA Execution Dynamics (30 minutes)

CUDA Memory Model and Streams and Concurrency

This module introduces students to the memory hierarchy in CUDA, including global, shared, and local memory. It emphasises the importance of memory coalescing and efficient memory access patterns for GPU performance. The module also covers CUDA streams, explaining how concurrent kernel execution and memory operations can be managed to increase parallelism. By understanding these concepts, students will gain the ability to design GPU programs that maximise throughput and minimise latency.
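
Memory coalescing can be reasoned about with a small model. The sketch below, in plain C rather than CUDA, counts how many 128-byte segments a warp touches when each of its 32 lanes reads one 4-byte element at a given stride. The segment size, element size, and function name are assumptions made for illustration. Actual transaction granularity depends on the GPU architecture and cache configuration.

```c
#include <assert.h>

#define WARP_SIZE      32    /* threads per warp */
#define ELEMENT_BYTES  4     /* assumed element size (e.g. float) */
#define SEGMENT_BYTES  128   /* assumed transaction size for this model */

/* Count the distinct segments touched when lane k of a warp reads the
   element at index base + k * stride. */
int segments_touched(int base, int stride)
{
    assert(base >= 0 && stride >= 1);
    int count = 0;
    long last = -1;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        long index = (long)base + (long)lane * stride;
        long seg = index * ELEMENT_BYTES / SEGMENT_BYTES;
        if (seg != last) {   /* addresses only increase, so a new  */
            count++;         /* segment never repeats an old one   */
            last = seg;
        }
    }
    return count;
}
```

Under this model an aligned unit-stride access needs one segment, a misaligned one needs two, and a stride of 32 elements needs 32, one per lane.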

What's included

14 videos, 2 readings, 14 assignments, 1 discussion prompt, 1 ungraded lab

14 videos (total 126 minutes)

Introduction to CUDA Memory Model (8 minutes)
Memory Allocation and Deallocation (6 minutes)
Zero Copy Memory (4 minutes)
Unified Virtual Addressing and Unified Memory (3 minutes)
Aligned and Coalesced Access (6 minutes)
CUDA Shared Memory (6 minutes)
Shared Memory Banks and Access Mode (7 minutes)
Configuring the Amount of Shared Memory (5 minutes)
Synchronisation (9 minutes)
CUDA Streams (7 minutes)
Stream Scheduling and Priorities (6 minutes)
CUDA Events (6 minutes)
Concurrent Kernel Execution (6 minutes)
Recording of Multicore and GPGPU Programming: Week 5 - Live Session on 25-06-20 18:31:59 [47:36] (48 minutes)

2 readings (total 120 minutes)

Recommended Reading: CUDA Memory Model (60 minutes)
Recommended Reading: Streams and Concurrency (60 minutes)

14 assignments (total 342 minutes)

SGA-1: CUDA Programming and Performance Optimisation (300 minutes)
Introduction to CUDA Memory Model (3 minutes)
Memory Allocation and Deallocation (3 minutes)
Zero Copy Memory (3 minutes)
Unified Virtual Addressing and Unified Memory (3 minutes)
Aligned and Coalesced Access (3 minutes)
CUDA Shared Memory (6 minutes)
Shared Memory Banks and Access Mode (3 minutes)
Configuring the Amount of Shared Memory (3 minutes)
Synchronisation (3 minutes)
CUDA Streams (3 minutes)
Stream Scheduling and Priorities (3 minutes)
CUDA Events (3 minutes)
Concurrent Kernel Execution (3 minutes)

1 discussion prompt (total 30 minutes)

Smart Memory and Seamless Concurrency: CUDA Memory and Streams (30 minutes)

1 ungraded lab (total 60 minutes)

Hands-on lab: Parallel Matrix Addition Using CUDA (60 minutes)

Shared-Memory Programming with Pthreads

This module explains in depth the difference between processes and threads and introduces multithreaded programming with the Pthreads library. Students learn the main Pthreads functions and apply them to solve real-world problems with a multithreaded approach. The module also discusses the precautions to take when developing an algorithm that uses multithreading.
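
As a small illustration of the Pthreads functions and the precautions this module refers to, the sketch below protects a shared counter with a mutex. It uses only standard POSIX calls (`pthread_create`, `pthread_join`, `pthread_mutex_lock`, `pthread_mutex_unlock`). The function names and the thread limit are hypothetical, and on some toolchains the program must be built with `-pthread`.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

#define MAX_THREADS 16   /* arbitrary cap chosen for this sketch */

static long counter = 0;                         /* shared state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: add 1 to the shared counter `iterations` times, taking
   the mutex around each update (the critical section). */
static void *add_many(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Start num_threads workers, wait for all of them, return the total. */
long run_counter(int num_threads, long iterations)
{
    assert(num_threads >= 1 && num_threads <= MAX_THREADS);
    pthread_t threads[MAX_THREADS];
    counter = 0;
    for (int t = 0; t < num_threads; t++) {
        int rc = pthread_create(&threads[t], NULL, add_many, &iterations);
        assert(rc == 0);
        (void)rc;               /* silence unused warning with NDEBUG */
    }
    for (int t = 0; t < num_threads; t++)
        pthread_join(threads[t], NULL);
    return counter;
}
```

Without the lock and unlock calls, the increments from different threads can interleave and the final count may fall short of `num_threads * iterations`. Locking on every increment is also slow, so a real program would usually accumulate a private sum per thread and lock once at the end.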

What's included

10 videos, 11 readings, 10 assignments, 1 discussion prompt

10 videos (total 116 minutes)

Processes, Threads and Pthreads (4 minutes)
Hello World!! (9 minutes)
Matrix-Vector Multiplication (13 minutes)
Critical Sections (5 minutes)
Busy Waiting (6 minutes)
Mutexes (5 minutes)
Semaphores (7 minutes)
Barriers and Condition Variables (13 minutes)
Caches, Cache-Coherence and False Sharing (9 minutes)
Recording of Multicore and GPGPU Programming: Week 6 - Live Session on 25-06-27 18:38:36 [43:53] (44 minutes)

11 readings (total 295 minutes)

Recommended Reading: Processes, Threads and Pthreads (10 minutes)
Recommended Reading: Hello World!! (60 minutes)
Recommended Reading: Matrix-Vector Multiplication (15 minutes)
Recommended Reading: Critical Sections (30 minutes)
Recommended Reading: Busy Waiting (20 minutes)
Recommended Reading: Mutexes (15 minutes)
Recommended Reading: Semaphores (30 minutes)
Recommended Reading: Barriers and Condition Variables (30 minutes)
Recommended Reading: Read-Write Locks (60 minutes)
Recommended Reading: Caches, Cache-Coherence and False Sharing (15 minutes)
Lab Instruction Document (10 minutes)

10 assignments (total 135 minutes)

Graded Quiz - Modules 5 and 6 (60 minutes)
Processes, Threads and Pthreads (9 minutes)
Hello World!! (9 minutes)
Matrix-Vector Multiplication (9 minutes)
Critical Sections (9 minutes)
Busy Waiting (9 minutes)
Mutexes (9 minutes)
Semaphores (6 minutes)
Barriers and Condition Variables (6 minutes)
Caches, Cache-Coherence and False Sharing (9 minutes)

1 discussion prompt (total 10 minutes)

Thread Synchronization and Shared Memory: Building Reliable Parallel Programs with Pthreads (10 minutes)

Distributed Memory Programming with MPI

This module introduces students to distributed-memory programming using the Message Passing Interface (MPI). Students will learn the functions provided by the MPI library and what each one does. The module enables students to write parallel programs and to convert serial code into parallel code using MPI functions.
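
One of the worked examples in this module is odd-even transposition sort. The sketch below shows only the serial algorithm in plain C, as a starting point for the kind of serial-to-parallel conversion described here. It contains no MPI calls, and the function names are illustrative.

```c
#include <assert.h>

/* Swap a[i] and a[i + 1] if they are out of order. */
static void compare_exchange(int *a, int i)
{
    if (a[i] > a[i + 1]) {
        int tmp = a[i];
        a[i] = a[i + 1];
        a[i + 1] = tmp;
    }
}

/* Odd-even transposition sort: n phases, alternating between the
   pairs (0,1), (2,3), ... and the pairs (1,2), (3,4), ... */
void odd_even_sort(int *a, int n)
{
    assert(a != 0 || n == 0);
    for (int phase = 0; phase < n; phase++) {
        int start = (phase % 2 == 0) ? 0 : 1;
        /* The comparisons within one phase touch disjoint pairs,
           which is what makes the phase easy to parallelise. */
        for (int i = start; i + 1 < n; i += 2)
            compare_exchange(a, i);
    }
}
```

In a typical MPI version, each process would hold a sorted block of the array and, in each phase, exchange its block with a neighbouring rank (for example with `MPI_Sendrecv`), keeping either the smaller or the larger half.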

What's included

7 videos, 9 readings, 7 assignments, 1 discussion prompt

7 videos (total 70 minutes)

Introduction to MPI (4 minutes)
MPI Setup and Communicator Functions (6 minutes)
SPMD and Communication (10 minutes)
Potential Pitfalls (4 minutes)
Simple Serial Sorting Algorithm (20 minutes)
Parallel Odd-Even Transposition Sort (19 minutes)
Safety in MPI Programs (7 minutes)

9 readings (total 125 minutes)

Recommended Reading: Introduction to MPI (15 minutes)
Recommended Reading: MPI Setup and Communicator Functions (15 minutes)
Recommended Reading: SPMD and Communication (15 minutes)
Recommended Reading: Potential Pitfalls (15 minutes)
Recommended Reading: Simple Serial Sorting Algorithm (15 minutes)
Recommended Reading: Parallel Odd-Even Transposition Sort (15 minutes)
Recommended Reading: Safety in MPI Programs (15 minutes)
Lab: Practice Code (10 minutes)
Lab: Practice Solution (10 minutes)

7 assignments (total 63 minutes)

Introduction to MPI (9 minutes)
MPI Setup and Communicator Functions (9 minutes)
SPMD and Communication (9 minutes)
Potential Pitfalls (9 minutes)
Simple Serial Sorting Algorithm (9 minutes)
Parallel Odd-Even Transposition Sort (9 minutes)
Safety in MPI Programs (9 minutes)

1 discussion prompt (total 30 minutes)

MPI in Action: Understanding Setup, Communication, and Parallel Sorting (30 minutes)

Shared-Memory Programming with OpenMP

This module introduces the shared-memory programming model through OpenMP. Students will gain exposure to the directives and functions OpenMP provides and learn how to use them to express shared-memory parallelism in code. Through videos and readings, students explore the foundational concepts of OpenMP, starting with the basics and progressing to more advanced topics such as reduction clauses, variable scoping, and mutual exclusion. Through worked examples like the Trapezoidal Rule and sorting functions, learners will understand how to parallelise loops, manage scheduling, and apply critical sections and locks for safe concurrent execution. The module also covers tasking in OpenMP and classic concurrency problems like producers and consumers.
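
The Trapezoidal Rule example and the reduction clause fit together naturally. The following is a minimal sketch, assuming a simple integrand chosen only for illustration. The `parallel for` directive with `reduction(+:sum)` is standard OpenMP, while the function names are hypothetical. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Integrand used for illustration only: f(x) = x * x. */
static double integrand(double x) { return x * x; }

/* Trapezoidal rule on [a, b] with n trapezoids. The reduction clause
   gives each thread a private partial sum and adds the partial sums
   together when the loop ends. */
double trapezoid(double a, double b, int n)
{
    assert(n > 0 && b > a);
    double h = (b - a) / n;
    double sum = (integrand(a) + integrand(b)) / 2.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i < n; i++)
        sum += integrand(a + i * h);

    return sum * h;
}
```

Without the reduction clause, `sum` would be shared and the concurrent updates would race. The alternative of wrapping the update in a critical section is correct but serialises the loop body.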

What's included

12 videos, 12 readings, 13 assignments, 1 discussion prompt

12 videos (total 94 minutes)

Introduction to OpenMP (5 minutes)
Programming in OpenMP (10 minutes)
Trapezoidal Rule (10 minutes)
Scope of Variables (4 minutes)
Reduction Clause (7 minutes)
Parallel-For Directive and Caveats in Them (8 minutes)
Sorting Functions (20 minutes)
Scheduling (6 minutes)
Producers and Consumers (6 minutes)
Termination, Startup and Atomic Directive (7 minutes)
Critical Sections and Locks (6 minutes)
Tasking (5 minutes)

12 readings (total 152 minutes)

Recommended Reading: Introduction to OpenMP (15 minutes)
Recommended Reading: Programming in OpenMP (15 minutes)
Recommended Reading: Trapezoidal Rule (15 minutes)
Recommended Reading: Scope of Variables (15 minutes)
Recommended Reading: Reduction Clause (15 minutes)
Recommended Reading: Parallel-For Directive and Caveats in Them (15 minutes)
Recommended Reading: Sorting Functions (15 minutes)
Recommended Reading: Scheduling (15 minutes)
Recommended Reading: Producers and Consumers (15 minutes)
Recommended Reading: Termination, Startup and Atomic Directive (1 minute)
Recommended Reading: Critical Sections and Locks (1 minute)
Recommended Reading: Tasking (15 minutes)

13 assignments (total 168 minutes)

Graded Quiz - Modules 7 and 8 (60 minutes)
Introduction to OpenMP (9 minutes)
Programming in OpenMP (9 minutes)
Trapezoidal Rule (9 minutes)
Scope of Variables (9 minutes)
Reduction Clause (9 minutes)
Parallel-For Directive and Caveats in Them (9 minutes)
Sorting Functions (9 minutes)
Scheduling (9 minutes)
Producers and Consumers (9 minutes)
Termination, Startup and Atomic Directive (9 minutes)
Critical Sections and Locks (9 minutes)
Tasking (9 minutes)

1 discussion prompt (total 30 minutes)

Mastering OpenMP: From Parallel Patterns to Synchronisation (30 minutes)

Parallel Program Development 1

This module introduces the n-body problem in physics and its significance in simulating gravitational interactions among multiple particles. It explores classical and modern algorithmic approaches to solving the n-body problem and discusses their computational complexity. Emphasis is placed on identifying opportunities for parallelisation, and students will analyse and implement efficient parallel solutions using the programming languages and parallel computing directives covered in the course.
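
The module's lectures distinguish a basic solver from a reduced solver. The sketch below shows one plausible reading of that distinction for the force computation, simplified to one spatial dimension so that no square root is needed. The simplification, the function names, and the parameter layout are assumptions for illustration, not the course's implementation.

```c
#include <assert.h>

/* Signed 1-D gravitational force on particle i due to particle j
   (positive means "towards larger x"). Positions must differ. */
static double pair_force(double g, double mi, double mj,
                         double xi, double xj)
{
    double dx = xj - xi;
    double dist = dx < 0 ? -dx : dx;
    assert(dist > 0.0);
    return g * mi * mj * dx / (dist * dist * dist);
}

/* Basic solver step: every particle loops over all others, O(n^2). */
void forces_basic(int n, double g, const double *m, const double *x,
                  double *force)
{
    for (int i = 0; i < n; i++) {
        force[i] = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i)
                force[i] += pair_force(g, m[i], m[j], x[i], x[j]);
    }
}

/* Reduced solver step: each pair is visited once and Newton's third
   law supplies the opposite force, halving the pair evaluations. */
void forces_reduced(int n, double g, const double *m, const double *x,
                    double *force)
{
    for (int i = 0; i < n; i++)
        force[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double fij = pair_force(g, m[i], m[j], x[i], x[j]);
            force[i] += fij;
            force[j] -= fij;
        }
    }
}
```

The outer loop of the basic version writes only `force[i]`, so its iterations are independent and easy to split across threads. The reduced version also updates `force[j]`, which creates write conflicts between iterations and makes its parallelisation harder.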

What's included

13 videos, 13 readings, 13 assignments, 1 discussion prompt

13 videos (total 107 minutes)

Introduction to N-body Problem (8 minutes)
Serial Solutions to the N-body Problem (16 minutes)
Parallelising Strategy (13 minutes)
Parallelising Basic Solver Using OpenMP (9 minutes)
Parallelising Reduced Solver Using OpenMP (11 minutes)
Evaluating OpenMP Performance (5 minutes)
Parallelising Basic Solver Using Pthreads (4 minutes)
Parallelising Basic Solver Using MPI (9 minutes)
Parallelising Reduced Solver Using MPI (9 minutes)
Evaluating MPI Performance (6 minutes)
Parallelising Basic Solver Using CUDA (7 minutes)
Evaluating CUDA Solver and Improving Performance (4 minutes)
Using Shared Memory for Solvers (7 minutes)

13 readings (total 195 minutes)

Recommended Reading: Introduction to N-body Problem (15 minutes)
Recommended Reading: Serial Solutions to the N-body Problem (15 minutes)
Recommended Reading: Parallelising Strategy (15 minutes)
Recommended Reading: Parallelising Basic Solver Using OpenMP (15 minutes)
Recommended Reading: Parallelising Reduced Solver Using OpenMP (15 minutes)
Recommended Reading: Evaluating OpenMP performance (15 minutes)
Recommended Reading: Parallelising Basic Solver Using Pthreads (15 minutes)
Recommended Reading: Parallelising Basic Solver Using MPI (15 minutes)
Recommended Reading: Parallelising Reduced Solver Using MPI (15 minutes)
Recommended Reading: Evaluating MPI Performance (15 minutes)
Recommended Reading: Parallelising Basic Solver Using CUDA (15 minutes)
Recommended Reading: Evaluating CUDA Solver and Improving Performance (15 minutes)
Recommended Reading: Using Shared Memory for Solvers (15 minutes)

13 assignments (total 138 minutes)

Introduction to N-body Problem (9 minutes)
Serial Solutions to the N-body Problem (9 minutes)
Parallelising Strategy (9 minutes)
Parallelising Basic Solver Using OpenMP (9 minutes)
Parallelising Reduced Solver Using OpenMP (9 minutes)
Evaluating OpenMP Performance (9 minutes)
Parallelising Basic Solver Using Pthreads (9 minutes)
Parallelising Basic Solver Using MPI (30 minutes)
Parallelising Reduced Solver Using MPI (9 minutes)
Evaluating MPI Performance (9 minutes)
Parallelising Basic Solver Using CUDA (9 minutes)
Evaluating CUDA Solver and Improving Performance (9 minutes)
Using Shared Memory for Solvers (9 minutes)

1 discussion prompt (total 30 minutes)

The N-Body Solver: Exploring Parallelism Across Models (30 minutes)

Parallel Program Development 2

This module focuses on hands-on implementations of the Sample Sort algorithm using OpenMP, Pthreads, MPI, and CUDA. Students will explore the strengths and limitations of each parallel programming model through practical coding exercises. The module includes performance benchmarking and comparative analysis of the implementations to highlight trade-offs in scalability, efficiency, and suitability for different architectures. By the end of the module, students will have a strong grasp of each API and be equipped to make informed decisions about the most appropriate tool for a given parallel computing task.
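
Before comparing the parallel implementations, it helps to see the algorithm's steps in serial form. The sketch below is one simple serial variant of sample sort in plain C: it samples evenly spaced elements, derives splitters, scatters the input into buckets, and sorts each bucket. The sampling scheme and all names are assumptions made for illustration, and the course's implementations may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way comparison for qsort on ints. */
static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Bucket index for v: the number of splitters that are <= v. */
static int bucket_of(int v, const int *splitters, int num_splitters)
{
    int b = 0;
    while (b < num_splitters && v >= splitters[b])
        b++;
    return b;
}

/* Serial sample sort of a[0..n-1] using num_buckets buckets.
   Returns 0 on success, -1 if the scratch allocation fails. */
int sample_sort(int *a, int n, int num_buckets)
{
    assert(a != NULL && num_buckets >= 1 && n >= num_buckets);

    size_t total = (size_t)n + 3 * (size_t)num_buckets + 1;
    int *work = malloc(total * sizeof *work);
    if (work == NULL)
        return -1;
    int *out    = work;                     /* n items grouped by bucket */
    int *sample = out + n;                  /* num_buckets samples       */
    int *start  = sample + num_buckets;     /* num_buckets + 1 offsets   */
    int *cursor = start + num_buckets + 1;  /* num_buckets write cursors */

    /* 1. Sample evenly spaced elements; the sorted sample, minus its
          smallest entry, gives num_buckets - 1 splitters. */
    int step = n / num_buckets;
    for (int k = 0; k < num_buckets; k++)
        sample[k] = a[k * step];
    qsort(sample, (size_t)num_buckets, sizeof *sample, cmp_int);
    const int *splitters = sample + 1;
    int num_splitters = num_buckets - 1;

    /* 2. Count bucket sizes, then prefix-sum them into start offsets. */
    for (int b = 0; b <= num_buckets; b++)
        start[b] = 0;
    for (int i = 0; i < n; i++)
        start[bucket_of(a[i], splitters, num_splitters) + 1]++;
    for (int b = 1; b <= num_buckets; b++)
        start[b] += start[b - 1];

    /* 3. Scatter each element into its bucket's slice of out. */
    for (int b = 0; b < num_buckets; b++)
        cursor[b] = start[b];
    for (int i = 0; i < n; i++) {
        int b = bucket_of(a[i], splitters, num_splitters);
        out[cursor[b]++] = a[i];
    }

    /* 4. Sort each bucket independently (the step a parallel version
          can hand to separate threads or processes). */
    for (int b = 0; b < num_buckets; b++)
        qsort(out + start[b], (size_t)(start[b + 1] - start[b]),
              sizeof *out, cmp_int);

    for (int i = 0; i < n; i++)
        a[i] = out[i];
    free(work);
    return 0;
}
```

Because every value in one bucket is smaller than every value in the next, the sorted buckets concatenate into a sorted array with no merge step. How evenly the buckets fill depends on the sample, which is why practical versions usually take a larger sample than this sketch does.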

What's included

8 videos, 9 readings, 10 assignments, 1 discussion prompt

8 videos (total 61 minutes)

Sample Sort and Bucket Sort (10 minutes)
Map (17 minutes)
Implementing Sample Sort Using OpenMP: First Implementation (5 minutes)
Implementing Sample Sort Using OpenMP: Second Implementation (7 minutes)
Implementing Sample Sort Using Pthreads (4 minutes)
Implementing Sample Sort Using MPI (6 minutes)
Implementing Sample Sort Using MPI: Example (5 minutes)
Implementing Sample Sort Using CUDA (7 minutes)

9 readings (total 115 minutes)

Recommended Reading: Sample Sort and Bucket Sort (15 minutes)
Recommended Reading: Map (10 minutes)
Recommended Reading: Implementing Sample Sort Using OpenMP: First Implementation (15 minutes)
Recommended Reading: Implementing Sample Sort Using OpenMP: Second Implementation (15 minutes)
Recommended Reading: Implementing Sample Sort Using Pthreads (10 minutes)
Recommended Reading: Implementing Sample Sort Using MPI (15 minutes)
Recommended Reading: Implementing Sample Sort Using MPI: Example (15 minutes)
Recommended Reading: Implementing Sample Sort Using CUDA (10 minutes)
Recommended Reading: Which API? (10 minutes)

10 assignments (total 432 minutes)

Graded Quiz - Modules 9 and 10 (60 minutes)
SGA-2: Odd-Even Transposition Sort Parallelisation (300 minutes)
Sample Sort and Bucket Sort (9 minutes)
Map (Quiz) (9 minutes)
Implementing Sample Sort Using OpenMP: First Implementation (9 minutes)
Implementing Sample Sort Using OpenMP: Second Implementation (9 minutes)
Implementing Sample Sort Using Pthreads (9 minutes)
Implementing Sample Sort Using MPI (9 minutes)
Implementing Sample Sort Using MPI: Example (9 minutes)
Implementing Sample Sort Using CUDA (9 minutes)

1 discussion prompt (total 30 minutes)

Parallel Sample Sort Across Platforms (30 minutes)

Final Comprehensive Examination

What's included

1 assignment

Instructors

Kunal Kishore Korgaonkar

Birla Institute of Technology & Science, Pilani

2 courses, 1,945 learners

Prof. Gargi Prabhu

Birla Institute of Technology & Science, Pilani

1 course, 62 learners

Offered by

Birla Institute of Technology & Science, Pilani

There are 12 modules in this course

Course Introduction
Introduction to Parallel and Multicore Programming
Multicore Processor Architectures and Caching Mechanisms
GPGPU Architecture and Programming Model Overview
CUDA Execution Model
CUDA Memory Model and Streams and Concurrency
Shared-Memory Programming with Pthreads
Distributed Memory Programming with MPI
Shared-Memory Programming with OpenMP
Parallel Program Development 1
Parallel Program Development 2
Final Comprehensive Examination
