A Study of Simple Parallel Computing Techniques

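This article is written mainly for those of us who are not computer science majors but still have to write programs and implement algorithms, some of whom have never even used multithreading. It therefore covers parallel computing techniques that are simple to apply and do not require large changes to existing code.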

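Parallel computing is divided here into two categories: multithreading on the CPU, and parallel computing on heterogeneous architectures (such as GPUs). The main CPU-based options are OpenMP, TBB, PPL, Parallel, and IPP; the main heterogeneous options are OpenCL, CUDA, and AMP. I have not used all of these myself, so only some are introduced here; I will add more as I use them.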

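Terminology
Thread lock: a mechanism that lets only one thread at a time into a protected region. A thread that fails to acquire the lock is blocked, and the CPU it was running on is rescheduled to run another thread or process.
Parallel computing: the process of using multiple computing resources simultaneously to solve a computational problem. It is an effective way to increase a computer system's computing speed and processing capacity.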
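
To make the thread-lock definition concrete, here is a minimal sketch that uses only the C++11 standard library rather than any of the libraries covered below. The function name count_with_lock and its parameters are made up for illustration. Several threads increment one shared counter, and a mutex ensures only one of them is inside the protected region at a time; the others block until the lock is released.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: starts num_threads threads that each add
// increments_per_thread to a shared counter, guarded by a lock.
int count_with_lock(int num_threads, int increments_per_thread) {
    int counter = 0;
    std::mutex counter_mutex;  // the "thread lock"
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &counter_mutex, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // A thread that cannot acquire the lock blocks here.
                std::lock_guard<std::mutex> guard(counter_mutex);
                ++counter;
            }  // guard is destroyed at the end of each iteration, releasing the lock
        });
    }
    for (std::thread& worker : workers) {
        worker.join();  // wait for every thread to finish
    }
    return counter;
}
```

With the lock in place, count_with_lock(4, 1000) should return 4000; without it, the unsynchronized increments would be a data race and the result would be unpredictable.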

OpenMP

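Requirements: C/C++ or Fortran; a compiler such as Sun Studio, Intel Compiler, Microsoft Visual Studio, or GCC (among others). Only the optimization of for loops is covered here.

Key points:

Enable the compiler's OpenMP switch: in VS, for example, open the project's Properties, go to Configuration Properties->C/C++->Language->OpenMP Support in the dialog, and select Yes from the drop-down.

Include the header: #include <omp.h>

Add parallelism: put #pragma omp parallel for in front of the for loop

Thread lock: #pragma omp critical, followed by the block to protect {…}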
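
As a rough sketch of the last two points, the function below (sum_with_critical is a name chosen for illustration) puts a parallel for in front of a loop and protects the shared variable with a critical block. It uses only pragmas, so it does not need omp.h; if the OpenMP switch is off, the pragmas are ignored and the loop simply runs serially.

```cpp
#include <vector>

// Hypothetical helper: sums a vector with an OpenMP parallel loop,
// using a critical section to guard the shared accumulator.
int sum_with_critical(const std::vector<int>& values) {
    int sum = 0;
    const int n = static_cast<int>(values.size());

#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Only one thread at a time may execute this block.
#pragma omp critical
        {
            sum += values[i];
        }
    }
    return sum;
}
```

Calling sum_with_critical({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}) gives 55. Because every iteration takes the lock, this is likely to be slower than a serial loop; it is meant to show the syntax only.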
Complete example:

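#include <iostream>
#include <omp.h>
int main()
{
    int sum = 0;
    int a[10] = {1,2,3,4,5,6,7,8,9,10};
    // Upper bound on the number of threads; thread IDs run from 0 to maxThreads-1.
    // (omp_get_num_procs() can be smaller than the thread count, e.g. when OMP_NUM_THREADS is set.)
    int maxThreads = omp_get_max_threads();
    int* sumArray = new int[maxThreads];// one partial sum per thread
    for (int i=0;i<maxThreads;i++)// initialize every element to 0
        sumArray[i] = 0;
#pragma omp parallel for
    for (int i=0;i<10;i++)
    {
        int k = omp_get_thread_num();// ID of the thread running this iteration
        sumArray[k] = sumArray[k]+a[i];
    }
    for (int i = 0;i<maxThreads;i++)
        sum = sum + sumArray[i];
    delete[] sumArray;// release the partial-sum array
    std::cout<<"sum: "<<sum<<std::endl;
    return 0;
}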
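
For a plain sum like the one above, OpenMP's reduction clause can usually take the place of the hand-written per-thread array: each thread works on a private copy of the variable, and the copies are combined when the loop ends. A minimal sketch, with sum_with_reduction as an illustrative name:

```cpp
#include <vector>

// Hypothetical helper: the same summation, written with a reduction clause
// instead of a manually managed array of partial sums.
int sum_with_reduction(const std::vector<int>& values) {
    int sum = 0;
    const int n = static_cast<int>(values.size());

    // Each thread accumulates into its own private "sum";
    // OpenMP adds the private copies together after the loop.
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}
```

This version needs neither omp.h nor any knowledge of the thread count, and it still produces the correct result when compiled without OpenMP support.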

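Note:
Parallelizing a for loop essentially means each core processes one segment of the iterations. For example, with for (int i=0;i<40;i++) on a CPU with 4 cores, CPU0 handles i=0-9, CPU1 handles i=10-19, and so on. So when an iteration depends on the result of an earlier one, do not parallelize the loop.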
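
The difference between the two kinds of loop can be sketched as follows. Both function names are invented for this illustration. In square_all each iteration touches only its own element, so the iterations can be split across cores in any order; in prefix_sum each iteration reads the value written by the previous one, so the loop is left serial.

```cpp
#include <vector>

// Independent iterations: out[i] depends only on in[i], so it is safe to parallelize.
std::vector<int> square_all(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    const int n = static_cast<int>(in.size());
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
    return out;
}

// Loop-carried dependency: out[i] needs out[i - 1], which another core
// might not have computed yet, so this loop is deliberately NOT parallelized.
std::vector<int> prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    const int n = static_cast<int>(in.size());
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] + (i > 0 ? out[i - 1] : 0);
    }
    return out;
}
```

Putting #pragma omp parallel for in front of the second loop would compile without complaint but could give wrong results, which is why the dependency has to be checked by hand.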
Further reading:
Some experience with using OpenMP (yangyangcv, cnblogs)
http://www.cnblogs.com/yangyangcv/archive/2012/03/23/2413335.html
Performance comparison of locks and atomic operations in threads created by OpenMP
http://blog.csdn.net/drzhouweiming/article/details/1689853

Parallel

Requirements: .NET Framework 4 or later

Key points:
Add the namespace: using System.Threading.Tasks
Use the following methods in place of for and foreach:
Parallel.For(int fromInclusive,int toExclusive,Action<int, ParallelLoopState> body)
Parallel.ForEach<TSource>(IEnumerable<TSource> source,Action<TSource> body)
Complete example:

using System;
using System.Threading.Tasks;

public class Example
{
   public static void Main()
   {
      ParallelLoopResult result = Parallel.For(0, 100, ctr => 
      { 
            Random rnd = new Random(ctr * 100000);
            Byte[] bytes = new Byte[100];
            rnd.NextBytes(bytes);
            int sum = 0;
            foreach(var byt in bytes)
                sum += byt;
            Console.WriteLine("Iteration {0,2}: {1:N0}", ctr, sum);
      });
      Console.WriteLine("Result: {0}", result.IsCompleted ? "Completed Normally" : String.Format("Completed to {0}", result.LowestBreakIteration));
   }
}

Further reading:
MSDN
https://msdn.microsoft.com/zh-cn/library/system.threading.tasks.parallel_methods(v=vs.100).aspx

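PPL

PPL is similar to C#'s Parallel; for details see the article "Meeting PPL: Parallelism and Asynchrony in C++" (《遇見PPL：C++ 的並行和非同步》).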

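C++ AMP

Why do parallel computing on the GPU? Today's multi-core CPUs are generally dual-core or quad-core; counting Hyper-Threading, they can be treated as four or eight logical cores. GPUs, by contrast, routinely have hundreds of cores: the mid-range NVIDIA GTX 560 SE has 288, and the top-end NVIDIA GTX 690 has as many as 3072. These many-core GPUs are well suited to large-scale parallel computing.
However, each GPU core is less powerful than a CPU core, so GPUs are best suited to simple processing over massive amounts of data.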

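Requirements: C/C++; VS2012 or later with C++11; a DX11 or later runtime (Windows 7 or later with an up-to-date graphics driver will do; XP is not supported)

Key points:
Include the headers: #include <amp.h> and #include <amp_math.h>
Add a namespace: using namespace concurrency::fast_math supports only single-precision floating point, whereas using namespace concurrency::precise_math supports both single and double precision.
Wrap the array in an array_view object.
Loop with parallel_for_each.
Complete example: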

#include <amp.h>
#include <iostream>
using namespace concurrency;

const int size = 5;

void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each( 
        // Define the compute domain, which is the set of threads that are created.
        sum.extent, 
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
    {
        sum[idx] = a[idx] + b[idx];
    }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}

int main() {
    CppAmpMethod();
    return 0;
}

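Note
A function that carries the restrict(amp) clause has the following limitations:

The function can call only functions that have the restrict(amp) clause.
The function must be inlinable.
The function can declare only int, unsigned int, float, and double variables, and classes and structures that contain only these types. bool is also allowed, but it must be 4-byte-aligned if you use it in a compound type.
Lambda functions cannot capture by reference and cannot capture pointers.
References and single-indirection pointers are supported only as local variables, function arguments, and return types.
The following are not allowed:
- Recursion.
- Variables declared with the volatile keyword.
- Virtual functions.
- Pointers to functions.
- Pointers to member functions.
- Pointers in structures.
- Pointers to pointers.
- goto statements.
- Labeled statements.
- try, catch, or throw statements.
- Global variables.
- Static variables. Use the tile_static keyword instead.
- dynamic_cast casts.
- The typeid operator.
- asm declarations.
- Varargs.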

Further reading
Introduction: http://www.infoq.com/cn/articles/cpp_amp_computing_on_GPU
MSDN: https://msdn.microsoft.com/zh-cn/library/hh265136.aspx