Lecture 5: Performance Optimization Part 1 - Work Distribution and Scheduling¶

约 361 个字 32 行代码 10 张图片预计阅读时间 2 分钟共被读过次

CMU 15-418/618, Fall 2018

Key Goals of Parallel Program Optimization 🎯¶

Balance workload across execution resources.
Reduce communication to avoid stalls.
Minimize overhead from parallelism management.

TIP #1: Always start with the simplest solution, then measure performance.

"My solution scales" = Your code scales as needed for your target hardware.

Balancing the Workload ⚖️¶

Ideal Scenario: All processors compute simultaneously and finish at the same time.

Amdahl’s Law Impact¶

Example: If P4 does 20% more work → P4 takes 20% longer → 20% of runtime becomes serial execution.
Serialized section (S) = 5% of total work → Limits maximum speedup.

Static Assignment 🔧¶

Definition: Pre-determine work-to-thread mapping (may depend on runtime parameters).

Example: Grid Solver¶

Assign equal grid cells to each thread.
Strategies:
Blocked: Contiguous chunks.
Interleaved: Cyclic distribution.

When to Use Static Assignment?¶

Predictable work cost (e.g., uniform task durations).
Known statistics (e.g., average execution time).

Example: 12 tasks with equal cost → Assign 3 tasks to each of 4 processors.

Semi-Static Assignment 🔄¶

Periodic profiling to adjust assignments.
Example: Particle simulation redistributes particles as they move slowly.

Dynamic Assignment 🚀¶

Use Case: Unpredictable task execution time/number.

Shared Counter Example (Prime Testing)¶

C++

int N = 1024;
int* x = new int[N];
bool* is_prime = new bool[N];
LOCK counter_lock;
int counter = 0;

while (1) {
    int i;
    lock(counter_lock);
    i = counter++;
    unlock(counter_lock);
    if (i >= N) break;
    is_prime[i] = test_primality(x[i]);
}

Problem: High synchronization overhead due to frequent lock contention.

Task Granularity Adjustment 🧩¶

Coarse Granularity: Reduce synchronization by grouping tasks.

C++

const int GRANULARITY = 10;
...
counter += GRANULARITY;
for (int j = i; j < end; j++) {
    is_prime[j] = test_primality(x[j]);
}

Trade-off:
- Small tasks → Better load balance.
- Large tasks → Lower overhead.

Smarter Task Scheduling 🧠¶

Problem: Load Imbalance¶

Solutions:
1. Split long tasks into smaller subtasks.
2. Schedule long tasks first to minimize "slop".

Distributed Work Queues 🗃️¶

Design:
- Each thread has its own work queue.
- Work stealing when local queue is empty.

Advantages:
- Reduced synchronization.
- Improved locality (producer-consumer pattern).

Fork-Join Parallelism with Cilk Plus 🛠️¶

Key Constructs:¶

cilk_spawn: Fork a task.
cilk_sync: Join all spawned tasks.

Example: Parallel QuickSort

C++

void quick_sort(int* begin, int* end) {
    if (begin >= end - PARALLEL_CUTOFF) std::sort(begin, end);
    else {
        int* middle = partition(begin, end);
        cilk_spawn quick_sort(begin, middle);
        quick_sort(middle + 1, end);
    }
}

Work Stealing Scheduler 🕵️¶

Continuation Stealing: Run child task first, leave continuation for stealing.
Greedy Policy: Idle threads steal work immediately.

Example: Loop with cilk_spawn

C++

for (int i = 0; i < N; i++) {
    cilk_spawn foo(i);
}
cilk_sync;

Summary 📝¶

Load Balance: Critical for maximizing resource utilization.
Assignment Strategies:
- Static: Predictable workloads.
- Dynamic: Unpredictable workloads (use work queues).
Fork-Join: Natural for divide-and-conquer (Cilk Plus uses work stealing).
Granularity: Balance task size to minimize overhead and ensure parallelism.

Key Insight: Use workload knowledge to choose the right strategy! 🚀