Lecture 2: A Modern Multi-Core Processor

CMU 15-418/15-618, Fall 2018


Key Concepts

  1. Parallel Execution
    - Multi-core processing
    - SIMD (Single Instruction, Multiple Data)
    - Instruction-Level Parallelism (ILP)
  2. Memory Access Challenges
    - Latency vs. Bandwidth
    - Caching, Prefetching, Multi-threading

Part 1: Parallel Execution

Example: Computing \(\sin(x)\) Using Taylor Expansion

Formula:
\[
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots
\]

Serial Code:

```c
void sinx(int N, int terms, float* x, float* result) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```

Problem: A serial program runs on only one core, so extra cores do nothing for it; on a chip built from simpler cores it can even get slower, since each simpler core delivers roughly 0.75x the performance of a single complex core.


Parallelizing with Pthreads
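
The notes jump straight to the data-parallel version below; for the pthreads decomposition the heading refers to, here is a minimal two-thread sketch reusing the serial `sinx` above (the `my_args` / `my_thread_start` / `parallel_sinx` names are illustrative):

```c
#include <pthread.h>

typedef struct {
    int N, terms;
    float *x, *result;
} my_args;

void *my_thread_start(void *thread_arg) {
    my_args *args = (my_args *)thread_arg;
    sinx(args->N, args->terms, args->x, args->result);  // process first half
    return NULL;
}

void parallel_sinx(int N, int terms, float *x, float *result) {
    pthread_t thread_id;
    my_args args = { N / 2, terms, x, result };  // first half of the input

    pthread_create(&thread_id, NULL, my_thread_start, &args);  // spawn worker
    sinx(N - N / 2, terms, x + N / 2, result + N / 2);  // second half on this thread
    pthread_join(thread_id, NULL);
}
```

Compile with `-pthread`. The split is static: each thread gets half the elements, which balances well here because every iteration does the same amount of work.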

Data-Parallel Approach:

```c
void sinx(int N, int terms, float* x, float* result) {
    forall (int i from 0 to N-1) {  // independent iterations
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```

Compiler Hint: `forall` is not standard C; it is the lecture's pseudocode for declaring that loop iterations are independent, so the compiler/runtime is free to execute them in any order, or in parallel.
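
`forall` itself will not compile; in practice the same intent can be expressed with an OpenMP parallel-for (a substitute for the lecture's pseudocode, not from the slides; compile with `-fopenmp`):

```c
// Same computation; the pragma asserts that iterations are independent
// and lets the runtime distribute them across threads.
void sinx_omp(int N, int terms, float* x, float* result) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;  // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```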

Multi-Core Scaling

| Cores | Speedup (vs. one complex core) |
| ----- | ------------------------------ |
| 1     | 1x   |
| 2     | 1.5x |
| 4     | 3x   |
| 16    | 12x  |

Speedups assume simpler cores running at ~0.75x the single complex core used as the 1x baseline, so \(n\) simpler cores give roughly \(0.75n\)x.

Example Architectures:
- Intel Core i7 (6 cores)
- NVIDIA GTX 1080 (20 SMs)
- Apple A9 (2 cores)


SIMD Vectorization

AVX Intrinsics Example:

```c
#include <immintrin.h>

// Assumes N is a multiple of 8 and that x/result are 32-byte aligned
// (required by _mm256_load_ps / _mm256_store_ps).
void sinx(int N, int terms, float* x, float* result) {
    float three_fact = 6;  // 3!
    for (int i = 0; i < N; i += 8) {
        __m256 origx = _mm256_load_ps(&x[i]);
        __m256 value = origx;
        __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));
        __m256 denom = _mm256_broadcast_ss(&three_fact);
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            // value += sign * numer / denom
            __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
            value = _mm256_add_ps(value, tmp);
            numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
            float fac = (float)((2*j+2) * (2*j+3));  // next factorial factor
            denom = _mm256_mul_ps(denom, _mm256_set1_ps(fac));
            sign *= -1;
        }
        _mm256_store_ps(&result[i], value);
    }
}
```

Effect: each 256-bit AVX instruction processes 8 single-precision floats at once 🚀 (compile with AVX enabled, e.g., `-mavx`).

Part 2: Memory Access

Latency vs. Bandwidth

  • Latency: Time to service a memory request (e.g., 100 cycles).
  • Bandwidth: Data transfer rate (e.g., 20 GB/s).
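
A quick back-of-the-envelope example using the figures above (the element-wise multiply is a standard illustration, not taken from these notes): computing \(C[i] = A[i] \times B[i]\) over floats moves 12 bytes per multiply (two 4-byte loads, one 4-byte store), so a 20 GB/s memory system caps throughput at

\[
\frac{20\ \text{GB/s}}{12\ \text{bytes/FLOP}} \approx 1.7\ \text{GFLOP/s},
\]

regardless of how many ALUs the chip has. Such a program is bandwidth-bound.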

Techniques to Mitigate Latency

  1. Caching: Reduce access time for frequently used data.
  2. Prefetching: Predict and load data before it’s needed (see the sketch after this list).
  3. Multi-threading: Hide latency by switching threads.
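
A minimal sketch of software prefetching via the GCC/Clang `__builtin_prefetch` builtin (the loop and the prefetch distance are illustrative assumptions; hardware prefetchers often handle simple streams like this automatically):

```c
// Stream through x, hinting upcoming elements into cache ahead of use.
void scale(int N, float a, const float* x, float* y) {
    const int DIST = 16;  // elements to prefetch ahead (tunable)
    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            __builtin_prefetch(&x[i + DIST], /*rw=*/0, /*locality=*/1);
        y[i] = a * x[i];
    }
}
```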

Multi-Threading Trade-offs

| Scenario | Pros | Cons |
| -------- | ---- | ---- |
| Single thread | Simple | Frequent stalls |
| 4 hardware threads | Better ALU utilization | Increased cache pressure |
| 16 SIMD cores | High throughput | Requires massive parallelism |

Example: NVIDIA GTX 480 (15 cores, 32 ALUs/core, 1.3 TFLOPS).
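
As a sanity check on that figure (assuming the GTX 480's ~1.4 GHz shader clock and counting a fused multiply-add as two flops):

\[
15\ \text{cores} \times 32\ \tfrac{\text{ALUs}}{\text{core}} \times 2\ \tfrac{\text{flops}}{\text{ALU·cycle}} \times 1.4\ \text{GHz} \approx 1.3\ \text{TFLOPS}.
\]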


GPU vs. CPU Memory Hierarchy


- CPU: Large caches, low latency.
- GPU: High bandwidth, optimized for throughput.


Key Takeaways

  1. Parallel Execution Forms:
    - Multi-core (thread-level)
    - SIMD (data-level)
    - Superscalar (ILP)
  2. Memory Challenges:
    - Bandwidth-bound programs need high arithmetic intensity.
    - Optimize for data reuse and locality.
  3. Hardware Trends:
    - Simpler cores + more parallelism > complex cores.
    - GPUs push throughput to extremes.

🔍 Pro Tip: Express independence with forall-style data-parallel loops and use SIMD vectorization to exploit parallelism!