C#: Using SIMD vector types to speed up summing a float array (3): Loop unrolling

Source: https://www.cnblogs.com/zyl910/archive/2022/11/16/dotnet_simd_BenchmarkVector3.html

Author: zyl910

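1. Background

The previous two articles discussed which vector type to choose. This one looks at a question of usage: loop unrolling.

Modern CPUs use pipelining, superscalar execution and similar mechanisms to raise throughput. Purely sequential code keeps the pipeline working very well.
Programs, however, inevitably need branches and loops to express complex logic. Both compile to jump instructions, and a mispredicted jump stalls the pipeline, which hurts performance badly. Modern processors have branch prediction, but some predictions will always fail.
This matters even more for SIMD code written with vector types: because the vector types already squeeze as much as possible out of the CPU's ALUs (arithmetic logic units), the stall caused by a jump costs relatively more.

So when using vector types for large-scale numerical computation, avoid branches and loops as far as possible.
For branches, try to hoist them out of the inner loop. If a branch has to stay in the inner loop, try to write branch-free code with bit masks or similar techniques.
For loops, loop unrolling can usually be used to avoid short loops.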
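
The bit-mask idea can be sketched as follows. This is only an illustration, not code from the article: the class BranchlessDemo and the method SumPositive are hypothetical names, and the task (summing only the positive elements) is chosen just because it would normally need an `if` inside the inner loop. It uses Vector.GreaterThan and Vector.ConditionalSelect from System.Numerics.

```csharp
using System;
using System.Numerics;

public static class BranchlessDemo {
    // Hypothetical helper: sums only the elements that are greater than zero.
    public static float SumPositive(float[] src) {
        int width = Vector<float>.Count;   // Elements per vector (hardware dependent).
        int cntBlock = src.Length / width; // Number of whole vectors.
        Vector<float> vsum = Vector<float>.Zero;
        for (int i = 0; i < cntBlock; ++i) {
            Vector<float> v = new Vector<float>(src, i * width);
            // Per-lane mask: all bits set where v > 0, all bits clear elsewhere.
            Vector<int> mask = Vector.GreaterThan(v, Vector<float>.Zero);
            // Take v where the mask is set and 0 elsewhere; no `if` in the loop body.
            vsum += Vector.ConditionalSelect(mask, v, Vector<float>.Zero);
        }
        // Reduce the vector lanes to a scalar.
        float rt = 0;
        for (int i = 0; i < width; ++i) {
            rt += vsum[i];
        }
        // Scalar remainder; Math.Max plays the role of the mask here.
        for (int i = cntBlock * width; i < src.Length; ++i) {
            rt += Math.Max(src[i], 0f);
        }
        return rt;
    }
}
```

For the input { 1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11 } the positive elements are 1, 3, 5, 7, 9 and 11, so the method returns 36 regardless of the vector width.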

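1.1 A brief introduction to loop unrolling

An excerpt:

Loop unrolling is a very effective optimization for making programs run faster. It can be written by hand by the programmer or applied automatically by the compiler. In essence, it exploits the CPU's instruction-level parallelism to reduce loop overhead, and it also helps the instruction pipeline schedule efficiently.
……

Advantages of loop unrolling:
First, it reduces the chance of branch mispredictions.
Second, it makes it more likely that the statements in the loop body execute in parallel, provided there are no data dependencies between them.

Disadvantages of loop unrolling:
First, it bloats the code, so the ELF file (or Windows PE file) gets larger.
Second, readability drops noticeably; unrolled code written by one person may well be changed back by a later maintainer who is unfamiliar with the technique.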

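1.2 Test preparation

Note that loop unrolling improves pipeline utilization, so it shows most clearly on small loops, where the delay caused by the branch is often about as large as the work in the inner loop itself.
In a complex, large loop, the inner-loop work already takes a long time while the branch delay stays roughly constant, so its share drops a lot and unrolling gains less.

Because the unrolling here is written by hand, the unroll factor has to be decided before coding.
This article explores which unroll factor suits most situations.

Unrolling 2x gives at most 2x the original performance, so in most cases the result is only 1.x times, which is not much of a gain.
Unrolling 4x gives at most 4x the original performance. The range is wider, and the gain can often exceed 2x.
So 4x unrolling is a good starting point, and the tests below use it.

Test machine: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, Windows 10.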

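2. Using it in C#

To compare against the Avx instructions, the tests are run in the BenchmarkVectorCore30 project. The operating system is 64-bit, so the x64 Release results are used.

2.1 Unrolling the basic algorithm

Recall the basic algorithm: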

private static float SumBase(float[] src, int count, int loops) {
    float rt = 0; // Result.
    for (int j=0; j< loops; ++j) {
        for(int i=0; i< count; ++i) {
            rt += src[i];
        }
    }
    return rt;
}

After unrolling the loop 4x, the code becomes:

private static float SumBaseU4(float[] src, int count, int loops) {
    float rt = 0; // Result.
    float rt1 = 0;
    float rt2 = 0;
    float rt3 = 0;
    int nBlockWidth = 4; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    int p; // Index for src data.
    int i;
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            rt += src[p];
            rt1 += src[p + 1];
            rt2 += src[p + 2];
            rt3 += src[p + 3];
            p += nBlockWidth;
        }
        // Remainder process.
        //p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    rt = rt + rt1 + rt2 + rt3;
    return rt;
}

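Previously each inner-loop iteration handled 1 element; now it handles 4.
Note that the 4 elements are not all added to rt. The new variables rt1, rt2 and rt3 hold separate running sums. This removes the dependency between the additions: a chain of dependent operations limits pipeline throughput, so each addition gets its own independent variable.
Finally, the Reduce step adds rt1, rt2 and rt3 into rt.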
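
To make the dependency point concrete, here is a small sketch that puts the two ways of unrolling side by side. The class and method names (AccumulatorDemo, SumU4SingleAcc, SumU4FourAcc) are hypothetical and not part of the benchmark project. Both methods unroll 4x, but only the second one breaks the dependency chain. The sketch makes no claim about how large the speed difference is on a given machine.

```csharp
public static class AccumulatorDemo {
    // Unrolled 4x, but every addition still has to wait for the previous
    // one, because all four of them read and write the same variable rt.
    public static float SumU4SingleAcc(float[] src) {
        float rt = 0;
        int n4 = src.Length - src.Length % 4; // Largest multiple of 4 <= Length.
        for (int p = 0; p < n4; p += 4) {
            rt += src[p];
            rt += src[p + 1];
            rt += src[p + 2];
            rt += src[p + 3];
        }
        for (int p = n4; p < src.Length; ++p) {
            rt += src[p]; // Remainder.
        }
        return rt;
    }

    // Unrolled 4x with four independent accumulators, as in SumBaseU4:
    // the four additions of one iteration do not depend on each other.
    public static float SumU4FourAcc(float[] src) {
        float rt = 0, rt1 = 0, rt2 = 0, rt3 = 0;
        int n4 = src.Length - src.Length % 4;
        for (int p = 0; p < n4; p += 4) {
            rt += src[p];
            rt1 += src[p + 1];
            rt2 += src[p + 2];
            rt3 += src[p + 3];
        }
        for (int p = n4; p < src.Length; ++p) {
            rt += src[p]; // Remainder.
        }
        return rt + rt1 + rt2 + rt3; // Reduce.
    }
}
```

The two methods add the elements in a different order, so for general float data their results can differ in the last digits; for small integer-valued inputs they agree exactly.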

2.1.1 Test results:

The relevant results are:

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336

With 4x loop unrolling, the basic algorithm runs at 2.6336 times its original speed.
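
The two rows also print different sums (6.871948E+10 versus 2.748779E+11). The article does not comment on this. A likely explanation, offered here as an assumption, is float rounding: once a single float accumulator grows large enough, adding a small value no longer changes it, whereas several smaller accumulators keep absorbing values for longer. The sketch below shows the effect in isolation with hypothetical names (FloatAccumulationDemo, SumOnesSingle, SumOnesFour); it assumes single-precision arithmetic as performed by .NET Core on x64.

```csharp
public static class FloatAccumulationDemo {
    // Adds 1.0f `count` times into one accumulator.
    // A float has a 24-bit significand, so 16777216f + 1f rounds back
    // to 16777216f and the sum stops growing at 2^24.
    public static float SumOnesSingle(int count) {
        float rt = 0;
        for (int i = 0; i < count; ++i) {
            rt += 1.0f;
        }
        return rt;
    }

    // Same total number of additions, spread over four accumulators.
    // Assumes count is a multiple of 4.
    public static float SumOnesFour(int count) {
        float a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i < count; i += 4) {
            a += 1.0f;
            b += 1.0f;
            c += 1.0f;
            d += 1.0f;
        }
        return (a + b) + (c + d); // Reduce.
    }
}
```

With count = 1 << 25 (33554432), SumOnesSingle returns 16777216 while SumOnesFour returns the exact 33554432. So a change in the printed sum after unrolling does not by itself indicate a bug in the unrolled code.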

2.2 Unrolling the Vector4 version

Recall the Vector4 version:

private static float SumVector4(float[] src, int count, int loops) {
    float rt = 0; // Result.
    const int VectorWidth = 4;
    int nBlockWidth = VectorWidth; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector4 vrt = Vector4.Zero; // Vector result.
    int p; // Index for src data.
    int i;
    // Load.
    Vector4[] vsrc = new Vector4[cntBlock]; // Vector src.
    p = 0;
    for (i = 0; i < vsrc.Length; ++i) {
        vsrc[i] = new Vector4(src[p], src[p + 1], src[p + 2], src[p + 3]);
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            // Equivalent to scalar model: rt += src[i];
            vrt += vsrc[i]; // Add.
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    rt += vrt.X + vrt.Y + vrt.Z + vrt.W;
    return rt;
}

After unrolling the loop 4x, the code becomes:

private static float SumVector4U4(float[] src, int count, int loops) {
    float rt = 0; // Result.
    const int LoopUnrolling = 4;
    const int VectorWidth = 4;
    int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector4 vrt = Vector4.Zero; // Vector result.
    Vector4 vrt1 = Vector4.Zero;
    Vector4 vrt2 = Vector4.Zero;
    Vector4 vrt3 = Vector4.Zero;
    int p; // Index for src data.
    int i;
    // Load.
    Vector4[] vsrc = new Vector4[count / VectorWidth]; // Vector src.
    p = 0;
    for (i = 0; i < vsrc.Length; ++i) {
        vsrc[i] = new Vector4(src[p], src[p + 1], src[p + 2], src[p + 3]);
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vrt += vsrc[p]; // Add.
            vrt1 += vsrc[p + 1];
            vrt2 += vsrc[p + 2];
            vrt3 += vsrc[p + 3];
            p += LoopUnrolling;
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    vrt = vrt + vrt1 + vrt2 + vrt3;
    rt += vrt.X + vrt.Y + vrt.Z + vrt.W;
    return rt;
}

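As before, the new variables vrt1, vrt2 and vrt3 hold separate running sums to remove the dependency between the additions.
Finally, the Reduce step adds vrt1, vrt2 and vrt3 into vrt.

2.2.1 Test results:

The relevant results are: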

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVector4:     2.748779E+11    # msUsed=1218, MFLOPS/s=3362.8899835796387, scale=4.054187192118227
SumVector4U4:   1.0995116E+12   # msUsed=532, MFLOPS/s=7699.248120300752, scale=9.281954887218046

Compared with the basic algorithm (SumBase), SumVector4U4 is 9.281954887218046 times as fast.
Compared with the non-unrolled version (SumVector4), the ratio is 9.281954887218046/4.054187192118227 ≈ 2.2895.

2.3 Unrolling the Vector<T> version

Recall the Vector<T> version:

private static float SumVectorT(float[] src, int count, int loops) {
    float rt = 0; // Result.
    int VectorWidth = Vector<float>.Count; // Block width.
    int nBlockWidth = VectorWidth; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector<float> vrt = Vector<float>.Zero; // Vector result.
    int p; // Index for src data.
    int i;
    // Load.
    Vector<float>[] vsrc = new Vector<float>[cntBlock]; // Vector src.
    p = 0;
    for (i = 0; i < vsrc.Length; ++i) {
        vsrc[i] = new Vector<float>(src, p);
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vrt += vsrc[i]; // Add.
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt[i];
    }
    return rt;
}

After unrolling the loop 4x, the code becomes:

private static float SumVectorTU4(float[] src, int count, int loops) {
    float rt = 0; // Result.
    const int LoopUnrolling = 4;
    int VectorWidth = Vector<float>.Count; // Block width.
    int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector<float> vrt = Vector<float>.Zero; // Vector result.
    Vector<float> vrt1 = Vector<float>.Zero;
    Vector<float> vrt2 = Vector<float>.Zero;
    Vector<float> vrt3 = Vector<float>.Zero;
    int p; // Index for src data.
    int i;
    // Load.
    Vector<float>[] vsrc = new Vector<float>[count / VectorWidth]; // Vector src.
    p = 0;
    for (i = 0; i < vsrc.Length; ++i) {
        vsrc[i] = new Vector<float>(src, p);
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vrt += vsrc[p]; // Add.
            vrt1 += vsrc[p + 1];
            vrt2 += vsrc[p + 2];
            vrt3 += vsrc[p + 3];
            p += LoopUnrolling;
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    vrt = vrt + vrt1 + vrt2 + vrt3;
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt[i];
    }
    return rt;
}

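As before, the new variables vrt1, vrt2 and vrt3 hold separate running sums to remove the dependency between the additions.
Finally, the Reduce step adds vrt1, vrt2 and vrt3 into vrt.

2.3.1 Test results:

The relevant results are: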

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVectorT:     5.497558E+11    # msUsed=609, MFLOPS/s=6725.7799671592775, scale=8.108374384236454
SumVectorTU4:   2.1990233E+12   # msUsed=203, MFLOPS/s=20177.339901477833, scale=24.32512315270936

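Compared with the basic algorithm (SumBase), SumVectorTU4 is 24.32512315270936 times as fast.
Compared with the non-unrolled version (SumVectorT), the ratio is 24.32512315270936/8.108374384236454 ≈ 3.0.
A first observation: unrolling gains more for Vector<T> (about 3.0x) than for Vector4 (about 2.29x).

2.4 Unrolling the Avx version

Earlier articles used the Avx instruction set with three ways of accessing the data: arrays, Span and pointers. Here is a 4x unrolled version of each, listed after its non-unrolled counterpart: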

/// <summary>
/// Sum - Vector AVX.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvx(float[] src, int count, int loops) {
#if Allow_Intrinsics
    float rt = 0; // Result.
    //int VectorWidth = 32 / 4; // sizeof(__m256) / sizeof(float);
    int VectorWidth = Vector256<float>.Count; // Block width.
    int nBlockWidth = VectorWidth; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
    int p; // Index for src data.
    int i;
    // Load.
    Vector256<float>[] vsrc = new Vector256<float>[cntBlock]; // Vector src.
    p = 0;
    for (i = 0; i < cntBlock; ++i) {
        vsrc[i] = Vector256.Create(src[p], src[p + 1], src[p + 2], src[p + 3], src[p + 4], src[p + 5], src[p + 6], src[p + 7]); // Load.
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vrt = Avx.Add(vrt, vsrc[i]);    // Add. vrt += vsrc[i];
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt.GetElement(i);
    }
    return rt;
#else
    throw new NotSupportedException();
#endif
}

/// <summary>
/// Sum - Vector AVX - Loop unrolling *4.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvxU4(float[] src, int count, int loops) {
#if Allow_Intrinsics
    float rt = 0; // Result.
    const int LoopUnrolling = 4;
    int VectorWidth = Vector256<float>.Count; // Block width.
    int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
    Vector256<float> vrt1 = Vector256<float>.Zero;
    Vector256<float> vrt2 = Vector256<float>.Zero;
    Vector256<float> vrt3 = Vector256<float>.Zero;
    int p; // Index for src data.
    int i;
    // Load.
    Vector256<float>[] vsrc = new Vector256<float>[count / VectorWidth]; // Vector src.
    p = 0;
    for (i = 0; i < vsrc.Length; ++i) {
        vsrc[i] = Vector256.Create(src[p], src[p + 1], src[p + 2], src[p + 3], src[p + 4], src[p + 5], src[p + 6], src[p + 7]); // Load.
        p += VectorWidth;
    }
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vrt = Avx.Add(vrt, vsrc[p]);    // Add. vrt += vsrc[p];
            vrt1 = Avx.Add(vrt1, vsrc[p + 1]);
            vrt2 = Avx.Add(vrt2, vsrc[p + 2]);
            vrt3 = Avx.Add(vrt3, vsrc[p + 3]);
            p += LoopUnrolling;
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    vrt = Avx.Add(Avx.Add(vrt, vrt1), Avx.Add(vrt2, vrt3)); // vrt = vrt + vrt1 + vrt2 + vrt3;
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt.GetElement(i);
    }
    return rt;
#else
    throw new NotSupportedException();
#endif
}

/// <summary>
/// Sum - Vector AVX - Span.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvxSpan(float[] src, int count, int loops) {
#if Allow_Intrinsics
    float rt = 0; // Result.
    int VectorWidth = Vector256<float>.Count; // Block width.
    int nBlockWidth = VectorWidth; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
    int p; // Index for src data.
    ReadOnlySpan<Vector256<float>> vsrc; // Vector src.
    int i;
    // Body.
    for (int j = 0; j < loops; ++j) {
        // Vector process.
        vsrc = System.Runtime.InteropServices.MemoryMarshal.Cast<float, Vector256<float> >(new Span<float>(src)); // Reinterpret cast. `float*` to `Vector256<float>*`.
        for (i = 0; i < cntBlock; ++i) {
            vrt = Avx.Add(vrt, vsrc[i]);    // Add. vrt += vsrc[i];
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt.GetElement(i);
    }
    return rt;
#else
    throw new NotSupportedException();
#endif
}

/// <summary>
/// Sum - Vector AVX - Span - Loop unrolling *4.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvxSpanU4(float[] src, int count, int loops) {
#if Allow_Intrinsics
    float rt = 0; // Result.
    const int LoopUnrolling = 4;
    int VectorWidth = Vector256<float>.Count; // Block width.
    int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    int cntBlock = count / nBlockWidth; // Block count.
    int cntRem = count % nBlockWidth; // Remainder count.
    Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
    Vector256<float> vrt1 = Vector256<float>.Zero;
    Vector256<float> vrt2 = Vector256<float>.Zero;
    Vector256<float> vrt3 = Vector256<float>.Zero;
    int p; // Index for src data.
    ReadOnlySpan<Vector256<float>> vsrc; // Vector src.
    int i;
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Vector process.
        vsrc = System.Runtime.InteropServices.MemoryMarshal.Cast<float, Vector256<float>>(new Span<float>(src)); // Reinterpret cast. `float*` to `Vector256<float>*`.
        for (i = 0; i < cntBlock; ++i) {
            vrt = Avx.Add(vrt, vsrc[p]);    // Add. vrt += vsrc[p];
            vrt1 = Avx.Add(vrt1, vsrc[p + 1]);
            vrt2 = Avx.Add(vrt2, vsrc[p + 2]);
            vrt3 = Avx.Add(vrt3, vsrc[p + 3]);
            p += LoopUnrolling;
        }
        // Remainder process.
        p = cntBlock * nBlockWidth;
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    vrt = Avx.Add(Avx.Add(vrt, vrt1), Avx.Add(vrt2, vrt3)); // vrt = vrt + vrt1 + vrt2 + vrt3;
    for (i = 0; i < VectorWidth; ++i) {
        rt += vrt.GetElement(i);
    }
    return rt;
#else
    throw new NotSupportedException();
#endif
}

/// <summary>
/// Sum - Vector AVX - Ptr.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvxPtr(float[] src, int count, int loops) {
#if Allow_Intrinsics && UNSAFE
    unsafe {
        float rt = 0; // Result.
        int VectorWidth = Vector256<float>.Count; // Block width.
        int nBlockWidth = VectorWidth; // Block width.
        int cntBlock = count / nBlockWidth; // Block count.
        int cntRem = count % nBlockWidth; // Remainder count.
        Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
        Vector256<float> vload;
        float* p; // Pointer for src data.
        int i;
        // Body.
        fixed(float* p0 = &src[0]) {
            for (int j = 0; j < loops; ++j) {
                p = p0;
                // Vector process.
                for (i = 0; i < cntBlock; ++i) {
                    vload = Avx.LoadVector256(p);    // Load. vload = *(__m256*)p;
                    vrt = Avx.Add(vrt, vload);    // Add. vrt += vload;
                    p += nBlockWidth;
                }
                // Remainder process.
                for (i = 0; i < cntRem; ++i) {
                    rt += p[i];
                }
            }
        }
        // Reduce.
        for (i = 0; i < VectorWidth; ++i) {
            rt += vrt.GetElement(i);
        }
        return rt;
    }
#else
    throw new NotSupportedException();
#endif
}

/// <summary>
/// Sum - Vector AVX - Ptr - Loop unrolling *4.
/// </summary>
/// <param name="src">Source array.</param>
/// <param name="count">Source array count.</param>
/// <param name="loops">Benchmark loops.</param>
/// <returns>Return the sum value.</returns>
private static float SumVectorAvxPtrU4(float[] src, int count, int loops) {
#if Allow_Intrinsics && UNSAFE
    unsafe {
        float rt = 0; // Result.
        const int LoopUnrolling = 4;
        int VectorWidth = Vector256<float>.Count; // Block width.
        int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
        int cntBlock = count / nBlockWidth; // Block count.
        int cntRem = count % nBlockWidth; // Remainder count.
        Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
        Vector256<float> vrt1 = Vector256<float>.Zero;
        Vector256<float> vrt2 = Vector256<float>.Zero;
        Vector256<float> vrt3 = Vector256<float>.Zero;
        Vector256<float> vload;
        Vector256<float> vload1;
        Vector256<float> vload2;
        Vector256<float> vload3;
        float* p; // Pointer for src data.
        int i;
        // Body.
        fixed (float* p0 = &src[0]) {
            for (int j = 0; j < loops; ++j) {
                p = p0;
                // Vector process.
                for (i = 0; i < cntBlock; ++i) {
                    vload = Avx.LoadVector256(p);    // Load. vload = *(__m256*)p;
                    vload1 = Avx.LoadVector256(p + VectorWidth * 1);
                    vload2 = Avx.LoadVector256(p + VectorWidth * 2);
                    vload3 = Avx.LoadVector256(p + VectorWidth * 3);
                    vrt = Avx.Add(vrt, vload);    // Add. vrt += vload;
                    vrt1 = Avx.Add(vrt1, vload1);
                    vrt2 = Avx.Add(vrt2, vload2);
                    vrt3 = Avx.Add(vrt3, vload3);
                    p += nBlockWidth;
                }
                // Remainder process.
                for (i = 0; i < cntRem; ++i) {
                    rt += p[i];
                }
            }
        }
        // Reduce.
        vrt = Avx.Add(Avx.Add(vrt, vrt1), Avx.Add(vrt2, vrt3)); // vrt = vrt + vrt1 + vrt2 + vrt3;
        for (i = 0; i < VectorWidth; ++i) {
            rt += vrt.GetElement(i);
        }
        return rt;
    }
#else
    throw new NotSupportedException();
#endif
}
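
The listings above are switched on at compile time with the Allow_Intrinsics symbol. Whether the CPU actually supports AVX is a separate, run-time question that can be answered with the Avx.IsSupported property. The following is a minimal sketch of such a guard, not the article's benchmark code: AvxGuardDemo and its Sum method are hypothetical names, and the loop is deliberately left non-unrolled to keep the example short. It reuses the Span reinterpretation from SumVectorAvxSpan, so no unsafe code is needed.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AvxGuardDemo {
    // Hypothetical helper: uses AVX when the CPU supports it and
    // falls back to a plain scalar loop otherwise.
    public static float Sum(float[] src) {
        float rt = 0;
        int p = 0; // Index of the first element not yet summed.
        if (Avx.IsSupported) {
            // Reinterpret the float data as whole 256-bit vectors.
            ReadOnlySpan<Vector256<float>> vsrc =
                MemoryMarshal.Cast<float, Vector256<float>>(new ReadOnlySpan<float>(src));
            Vector256<float> vrt = Vector256<float>.Zero;
            for (int i = 0; i < vsrc.Length; ++i) {
                vrt = Avx.Add(vrt, vsrc[i]);
            }
            // Reduce the vector lanes to a scalar.
            for (int i = 0; i < Vector256<float>.Count; ++i) {
                rt += vrt.GetElement(i);
            }
            p = vsrc.Length * Vector256<float>.Count;
        }
        // Remainder, or the whole array when AVX is not available.
        for (; p < src.Length; ++p) {
            rt += src[p];
        }
        return rt;
    }
}
```

For an array holding the values 1 to 20 the method returns 210 on either path.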

2.4.1 Test results:

The relevant results are:

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVectorAvx:   5.497558E+11    # msUsed=609, MFLOPS/s=6725.7799671592775, scale=8.108374384236454
SumVectorAvxSpan:       5.497558E+11    # msUsed=625, MFLOPS/s=6553.6, scale=7.9008
SumVectorAvxPtr:        5.497558E+11    # msUsed=610, MFLOPS/s=6714.754098360656, scale=8.095081967213115
SumVectorAvxU4: 2.1990233E+12   # msUsed=328, MFLOPS/s=12487.80487804878, scale=15.054878048780488
SumVectorAvxSpanU4:     2.1990233E+12   # msUsed=312, MFLOPS/s=13128.205128205129, scale=15.826923076923078
SumVectorAvxPtrU4:      2.1990233E+12   # msUsed=157, MFLOPS/s=26089.171974522294, scale=31.452229299363058

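Without loop unrolling, the three approaches perform about the same, all around 8x.
With loop unrolling, the array version (SumVectorAvxU4) and the Span version (SumVectorAvxSpanU4) reach only about 15x, while the pointer version reaches 31x, twice as fast as the array and Span versions.
This may be because pointers are closer to the hardware and easier for the compiler to optimize. So when using intrinsics, prefer pointers.

Compared with the basic algorithm (SumBase), SumVectorAvxPtrU4 is 31.452229299363058 times as fast.
Compared with the non-unrolled version (SumVectorAvxPtr), the ratio is 31.452229299363058/8.095081967213115 ≈ 3.8854.

2.5 Unrolling the Avx version 16 times

The 4x unrolling just tried has a theoretical ceiling of 4x, and SumVectorAvxPtrU4 already gains about 3.8854x. That suggests going further, so the next test unrolls 4*4=16 times.

SumVectorAvxPtr rewritten with 16x unrolling: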

private static float SumVectorAvxPtrU16(float[] src, int count, int loops) {
#if Allow_Intrinsics && UNSAFE
    unsafe {
        float rt = 0; // Result.
        const int LoopUnrolling = 16;
        int VectorWidth = Vector256<float>.Count; // Block width.
        int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
        int cntBlock = count / nBlockWidth; // Block count.
        int cntRem = count % nBlockWidth; // Remainder count.
        Vector256<float> vrt = Vector256<float>.Zero; // Vector result.
        Vector256<float> vrt1 = Vector256<float>.Zero;
        Vector256<float> vrt2 = Vector256<float>.Zero;
        Vector256<float> vrt3 = Vector256<float>.Zero;
        Vector256<float> vrt4 = Vector256<float>.Zero;
        Vector256<float> vrt5 = Vector256<float>.Zero;
        Vector256<float> vrt6 = Vector256<float>.Zero;
        Vector256<float> vrt7 = Vector256<float>.Zero;
        Vector256<float> vrt8 = Vector256<float>.Zero;
        Vector256<float> vrt9 = Vector256<float>.Zero;
        Vector256<float> vrt10 = Vector256<float>.Zero;
        Vector256<float> vrt11 = Vector256<float>.Zero;
        Vector256<float> vrt12 = Vector256<float>.Zero;
        Vector256<float> vrt13 = Vector256<float>.Zero;
        Vector256<float> vrt14 = Vector256<float>.Zero;
        Vector256<float> vrt15 = Vector256<float>.Zero;
        float* p; // Pointer for src data.
        int i;
        // Body.
        fixed (float* p0 = &src[0]) {
            for (int j = 0; j < loops; ++j) {
                p = p0;
                // Vector process.
                for (i = 0; i < cntBlock; ++i) {
                    //vload = Avx.LoadVector256(p);    // Load. vload = *(__m256*)p;
                    vrt = Avx.Add(vrt, Avx.LoadVector256(p)); // Add. vrt[k] += *((__m256*)(p)+k);
                    vrt1 = Avx.Add(vrt1, Avx.LoadVector256(p + VectorWidth * 1));
                    vrt2 = Avx.Add(vrt2, Avx.LoadVector256(p + VectorWidth * 2));
                    vrt3 = Avx.Add(vrt3, Avx.LoadVector256(p + VectorWidth * 3));
                    vrt4 = Avx.Add(vrt4, Avx.LoadVector256(p + VectorWidth * 4));
                    vrt5 = Avx.Add(vrt5, Avx.LoadVector256(p + VectorWidth * 5));
                    vrt6 = Avx.Add(vrt6, Avx.LoadVector256(p + VectorWidth * 6));
                    vrt7 = Avx.Add(vrt7, Avx.LoadVector256(p + VectorWidth * 7));
                    vrt8 = Avx.Add(vrt8, Avx.LoadVector256(p + VectorWidth * 8));
                    vrt9 = Avx.Add(vrt9, Avx.LoadVector256(p + VectorWidth * 9));
                    vrt10 = Avx.Add(vrt10, Avx.LoadVector256(p + VectorWidth * 10));
                    vrt11 = Avx.Add(vrt11, Avx.LoadVector256(p + VectorWidth * 11));
                    vrt12 = Avx.Add(vrt12, Avx.LoadVector256(p + VectorWidth * 12));
                    vrt13 = Avx.Add(vrt13, Avx.LoadVector256(p + VectorWidth * 13));
                    vrt14 = Avx.Add(vrt14, Avx.LoadVector256(p + VectorWidth * 14));
                    vrt15 = Avx.Add(vrt15, Avx.LoadVector256(p + VectorWidth * 15));
                    p += nBlockWidth;
                }
                // Remainder process.
                for (i = 0; i < cntRem; ++i) {
                    rt += p[i];
                }
            }
        }
        // Reduce.
        vrt = Avx.Add( Avx.Add( Avx.Add(Avx.Add(vrt, vrt1), Avx.Add(vrt2, vrt3))
            , Avx.Add(Avx.Add(vrt4, vrt5), Avx.Add(vrt6, vrt7)) )
            , Avx.Add( Avx.Add(Avx.Add(vrt8, vrt9), Avx.Add(vrt10, vrt11))
            , Avx.Add(Avx.Add(vrt12, vrt13), Avx.Add(vrt14, vrt15)) ) )
        ; // vrt = vrt + vrt1 + vrt2 + vrt3 + ... vrt15;
        for (i = 0; i < VectorWidth; ++i) {
            rt += vrt.GetElement(i);
        }
        return rt;
    }
#else
    throw new NotSupportedException();
#endif
}

2.5.1 Test results:

The relevant results are:

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVectorAvxPtr:        5.497558E+11    # msUsed=610, MFLOPS/s=6714.754098360656, scale=8.095081967213115
SumVectorAvxPtrU4:      2.1990233E+12   # msUsed=157, MFLOPS/s=26089.171974522294, scale=31.452229299363058
SumVectorAvxPtrU16:     8.386202E+12    # msUsed=125, MFLOPS/s=32768, scale=39.504

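Compared with the basic algorithm (SumBase), SumVectorAvxPtrU16 is 39.504 times as fast.
Compared with the non-unrolled version (SumVectorAvxPtr), the ratio is 39.504/8.095081967213115 ≈ 4.88.
Compared with the 4x unrolled version (SumVectorAvxPtrU4), the ratio is 39.504/31.452229299363058 ≈ 1.256.

Going from 4x to 16x unrolling only raises the speedup from just over 31x to just over 39x, a gain of about 25%.
The gain is small and the code is much more tedious to write. 16x unrolling offers poor value, so 4x unrolling is enough in most cases.

2.6 Trying an array for the unrolled temporaries

Unrolling N times multiplies the number of temporary variables by N. SumVectorAvxPtrU16 above, for example, needs 16 times as many accumulators as the non-unrolled version, which is verbose to write.
These variables all have the same type, so an array would make the code much cleaner. But does it hurt performance?
Here is a test: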

private static float SumVectorAvxPtrU16A(float[] src, int count, int loops) {
#if Allow_Intrinsics && UNSAFE
    unsafe {
        float rt = 0; // Result.
        const int LoopUnrolling = 16;
        int VectorWidth = Vector256<float>.Count; // Block width.
        int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
        int cntBlock = count / nBlockWidth; // Block count.
        int cntRem = count % nBlockWidth; // Remainder count.
        int i;
        //Vector256<float>[] vrt = new Vector256<float>[LoopUnrolling]; // Vector result.
        Vector256<float>* vrt = stackalloc Vector256<float>[LoopUnrolling]; // Vector result.
        for (i = 0; i< LoopUnrolling; ++i) {
            vrt[i] = Vector256<float>.Zero;
        }
        float* p; // Pointer for src data.
        // Body.
        fixed (float* p0 = &src[0]) {
            for (int j = 0; j < loops; ++j) {
                p = p0;
                // Vector process.
                for (i = 0; i < cntBlock; ++i) {
                    //vload = Avx.LoadVector256(p);    // Load. vload = *(__m256*)p;
                    vrt[0] = Avx.Add(vrt[0], Avx.LoadVector256(p)); // Add. vrt[k] += *((__m256*)(p)+k);
                    vrt[1] = Avx.Add(vrt[1], Avx.LoadVector256(p + VectorWidth * 1));
                    vrt[2] = Avx.Add(vrt[2], Avx.LoadVector256(p + VectorWidth * 2));
                    vrt[3] = Avx.Add(vrt[3], Avx.LoadVector256(p + VectorWidth * 3));
                    vrt[4] = Avx.Add(vrt[4], Avx.LoadVector256(p + VectorWidth * 4));
                    vrt[5] = Avx.Add(vrt[5], Avx.LoadVector256(p + VectorWidth * 5));
                    vrt[6] = Avx.Add(vrt[6], Avx.LoadVector256(p + VectorWidth * 6));
                    vrt[7] = Avx.Add(vrt[7], Avx.LoadVector256(p + VectorWidth * 7));
                    vrt[8] = Avx.Add(vrt[8], Avx.LoadVector256(p + VectorWidth * 8));
                    vrt[9] = Avx.Add(vrt[9], Avx.LoadVector256(p + VectorWidth * 9));
                    vrt[10] = Avx.Add(vrt[10], Avx.LoadVector256(p + VectorWidth * 10));
                    vrt[11] = Avx.Add(vrt[11], Avx.LoadVector256(p + VectorWidth * 11));
                    vrt[12] = Avx.Add(vrt[12], Avx.LoadVector256(p + VectorWidth * 12));
                    vrt[13] = Avx.Add(vrt[13], Avx.LoadVector256(p + VectorWidth * 13));
                    vrt[14] = Avx.Add(vrt[14], Avx.LoadVector256(p + VectorWidth * 14));
                    vrt[15] = Avx.Add(vrt[15], Avx.LoadVector256(p + VectorWidth * 15));
                    p += nBlockWidth;
                }
                // Remainder process.
                for (i = 0; i < cntRem; ++i) {
                    rt += p[i];
                }
            }
        }
        // Reduce.
        for (i = 1; i < LoopUnrolling; ++i) {
            vrt[0] = Avx.Add(vrt[0], vrt[i]); // vrt[0] += vrt[i]
        }
        for (i = 0; i < VectorWidth; ++i) {
            rt += vrt[0].GetElement(i);
        }
        return rt;
    }
#else
    throw new NotSupportedException();
#endif
}

2.6.1 Test results:

The relevant results are:

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVectorAvxPtr:        5.497558E+11    # msUsed=610, MFLOPS/s=6714.754098360656, scale=8.095081967213115
SumVectorAvxPtrU4:      2.1990233E+12   # msUsed=157, MFLOPS/s=26089.171974522294, scale=31.452229299363058
SumVectorAvxPtrU16:     8.386202E+12    # msUsed=125, MFLOPS/s=32768, scale=39.504
SumVectorAvxPtrU16A:    8.3862026E+12   # msUsed=187, MFLOPS/s=21903.74331550802, scale=26.406417112299465

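SumVectorAvxPtrU16A performs worse than SumVectorAvxPtrU16.
At first I thought the cause was that the array was allocated on the heap (new Vector256), which adds heap allocation overhead and extra indirection to reach each element.
I then switched to a stack-allocated array (stackalloc Vector256) accessed through pointers, which is as close to the hardware as it gets, but the performance was almost the same. So my guess is that the compiler has trouble turning the array elements into register variables.

So when unrolling loops, don't store the temporaries in an array; declare them one by one as local variables.

2.7 Trying a stack array to reduce dependencies

I also tried using a stack array to reduce dependencies, with the following code: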

private static float SumVectorAvxPtrUX(float[] src, int count, int loops, int LoopUnrolling) {
#if Allow_Intrinsics && UNSAFE
    unsafe {
        float rt = 0; // Result.
        //const int LoopUnrolling = 16;
        if (LoopUnrolling <= 0) throw new ArgumentOutOfRangeException(nameof(LoopUnrolling), "Argument LoopUnrolling must be > 0.");
        int VectorWidth = Vector256<float>.Count; // Block width.
        int nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
        int cntBlock = count / nBlockWidth; // Block count.
        int cntRem = count % nBlockWidth; // Remainder count.
        int i;
        //Vector256<float>[] vrt = new Vector256<float>[LoopUnrolling]; // Vector result.
        Vector256<float>* vrt = stackalloc Vector256<float>[LoopUnrolling]; // Vector result.
        for (i = 0; i < LoopUnrolling; ++i) {
            vrt[i] = Vector256<float>.Zero;
        }
        float* p; // Pointer for src data.
        // Body.
        fixed (float* p0 = &src[0]) {
            for (int j = 0; j < loops; ++j) {
                p = p0;
                // Vector process.
                for (i = 0; i < cntBlock; ++i) {
                    for(int k=0; k< LoopUnrolling; ++k) {
                        vrt[k] = Avx.Add(vrt[k], Avx.LoadVector256(p + VectorWidth * k)); // Add. vrt[k] += *((__m256*)(p)+k);
                    }
                    p += nBlockWidth;
                }
                // Remainder process.
                for (i = 0; i < cntRem; ++i) {
                    rt += p[i];
                }
            }
        }
        // Reduce.
        for (i = 1; i < LoopUnrolling; ++i) {
            vrt[0] = Avx.Add(vrt[0], vrt[i]); // vrt[0] += vrt[i]
        } // vrt[0] = vrt[0] + vrt[1] + ... + vrt[LoopUnrolling - 1];
        for (i = 0; i < VectorWidth; ++i) {
            rt += vrt[0].GetElement(i);
        }
        return rt;
    }
#else
    throw new NotSupportedException();
#endif
}

Test code:

// SumVectorAvxPtrUX.
int[] LoopUnrollingArray = { 4, 8, 16 };
foreach (int loopUnrolling in LoopUnrollingArray) {
    tickBegin = Environment.TickCount;
    rt = SumVectorAvxPtrUX(src, count, loops, loopUnrolling);
    msUsed = Environment.TickCount - tickBegin;
    mFlops = countMFlops * 1000 / msUsed;
    scale = mFlops / mFlopsBase;
    tw.WriteLine(indent + string.Format("SumVectorAvxPtrUX[{4}]:\t{0}\t# msUsed={1}, MFLOPS/s={2}, scale={3}", rt, msUsed, mFlops, scale, loopUnrolling));
}

2.7.1 Test results

An excerpt of the test results:

SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVectorAvxPtr:        5.497558E+11    # msUsed=610, MFLOPS/s=6714.754098360656, scale=8.095081967213115
SumVectorAvxPtrU4:      2.1990233E+12   # msUsed=157, MFLOPS/s=26089.171974522294, scale=31.452229299363058
SumVectorAvxPtrU16:     8.386202E+12    # msUsed=125, MFLOPS/s=32768, scale=39.504
SumVectorAvxPtrU16A:    8.3862026E+12   # msUsed=187, MFLOPS/s=21903.74331550802, scale=26.406417112299465
SumVectorAvxPtrUX[4]:   2.1990233E+12   # msUsed=547, MFLOPS/s=7488.117001828154, scale=9.027422303473491
SumVectorAvxPtrUX[8]:   4.3980465E+12   # msUsed=500, MFLOPS/s=8192, scale=9.876
SumVectorAvxPtrUX[16]:  8.3862026E+12   # msUsed=500, MFLOPS/s=8192, scale=9.876

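SumVectorAvxPtrUX is somewhat faster than the non-unrolled version (SumVectorAvxPtr): the scale rises from about 8x to about 9x.
Once the stack array length reaches 8, performance barely changes, so it appears to have hit a bottleneck.

This approach gains little for the effort involved, so the classic hand-written loop unrolling is still recommended.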

2.8 Summary of test results

The full test results are as follows:

BenchmarkVectorCore30

IsRelease:      True
EnvironmentVariable(PROCESSOR_IDENTIFIER):      Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
Environment.ProcessorCount:     8
Environment.Is64BitOperatingSystem:     True
Environment.Is64BitProcess:     True
Environment.OSVersion:  Microsoft Windows NT 6.2.9200.0
Environment.Version:    3.1.26
RuntimeEnvironment.GetRuntimeDirectory: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.26\
RuntimeInformation.FrameworkDescription:        .NET Core 3.1.26
BitConverter.IsLittleEndian:    True
IntPtr.Size:    8
Vector.IsHardwareAccelerated:   True
Vector<byte>.Count:     32      # 256bit
Vector<float>.Count:    8       # 256bit
Vector<double>.Count:   4       # 256bit
Vector4.Assembly.CodeBase:      file:///C:/Program Files/dotnet/shared/Microsoft.NETCore.App/3.1.26/System.Numerics.Vectors.dll
Vector<T>.Assembly.CodeBase:    file:///C:/Program Files/dotnet/shared/Microsoft.NETCore.App/3.1.26/System.Private.CoreLib.dll

Benchmark:      count=4096, loops=1000000, countMFlops=4096
SumBase:        6.871948E+10    # msUsed=4938, MFLOPS/s=829.485621709194
SumBaseU4:      2.748779E+11    # msUsed=1875, MFLOPS/s=2184.5333333333333, scale=2.6336
SumVector4:     2.748779E+11    # msUsed=1218, MFLOPS/s=3362.8899835796387, scale=4.054187192118227
SumVector4U4:   1.0995116E+12   # msUsed=532, MFLOPS/s=7699.248120300752, scale=9.281954887218046
SumVectorT:     5.497558E+11    # msUsed=609, MFLOPS/s=6725.7799671592775, scale=8.108374384236454
SumVectorTU4:   2.1990233E+12   # msUsed=203, MFLOPS/s=20177.339901477833, scale=24.32512315270936
SumVectorAvx:   5.497558E+11    # msUsed=609, MFLOPS/s=6725.7799671592775, scale=8.108374384236454
SumVectorAvxSpan:       5.497558E+11    # msUsed=625, MFLOPS/s=6553.6, scale=7.9008
SumVectorAvxPtr:        5.497558E+11    # msUsed=610, MFLOPS/s=6714.754098360656, scale=8.095081967213115
SumVectorAvxU4: 2.1990233E+12   # msUsed=328, MFLOPS/s=12487.80487804878, scale=15.054878048780488
SumVectorAvxSpanU4:     2.1990233E+12   # msUsed=312, MFLOPS/s=13128.205128205129, scale=15.826923076923078
SumVectorAvxPtrU4:      2.1990233E+12   # msUsed=157, MFLOPS/s=26089.171974522294, scale=31.452229299363058
SumVectorAvxPtrU16:     8.386202E+12    # msUsed=125, MFLOPS/s=32768, scale=39.504
SumVectorAvxPtrU16A:    8.3862026E+12   # msUsed=187, MFLOPS/s=21903.74331550802, scale=26.406417112299465
SumVectorAvxPtrUX[4]:   2.1990233E+12   # msUsed=547, MFLOPS/s=7488.117001828154, scale=9.027422303473491
SumVectorAvxPtrUX[8]:   4.3980465E+12   # msUsed=500, MFLOPS/s=8192, scale=9.876
SumVectorAvxPtrUX[16]:  8.3862026E+12   # msUsed=500, MFLOPS/s=8192, scale=9.876

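3. Use in C++

3.1 Modifying the code

Drawing on the experience above, the C++ program is now converted to use loop unrolling as well.
The complete code of BenchmarkVectorCpp.cpp is as follows: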

// BenchmarkVectorCpp.cpp : This file contains the 'main' function. Program execution begins and ends there.
//

#include <immintrin.h>
#include <malloc.h>
#include <stdio.h>
#include <time.h>

#ifndef EXCEPTION_EXECUTE_HANDLER 
#define EXCEPTION_EXECUTE_HANDLER (1)
#endif // !EXCEPTION_EXECUTE_HANDLER 

// Sum - base.
float SumBase(const float* src, size_t count, int loops) {
    float rt = 0; // Result.
    size_t i;
    for (int j = 0; j < loops; ++j) {
        for (i = 0; i < count; ++i) {
            rt += src[i];
        }
    }
    return rt;
}

// Sum - base - Loop unrolling *4.
float SumBaseU4(const float* src, size_t count, int loops) {
    float rt = 0; // Result.
    float rt1=0;
    float rt2 = 0;
    float rt3 = 0;
    size_t nBlockWidth = 4; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    size_t p; // Index for src data.
    size_t i;
    for (int j = 0; j < loops; ++j) {
        p = 0;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            rt += src[p];
            rt1 += src[p + 1];
            rt2 += src[p + 2];
            rt3 += src[p + 3];
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += src[p + i];
        }
    }
    // Reduce.
    rt = rt + rt1 + rt2 + rt3;
    return rt;
}

// Sum - Vector AVX.
float SumVectorAvx(const float* src, size_t count, int loops) {
    float rt = 0; // Result.
    size_t VectorWidth = sizeof(__m256) / sizeof(float); // Vector width (floats per vector).
    size_t nBlockWidth = VectorWidth; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    __m256 vrt = _mm256_setzero_ps(); // Vector result. [AVX] Set zero.
    __m256 vload; // Vector load.
    const float* p; // Pointer for src data.
    size_t i;
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = src;
        // Vector process.
        for (i = 0; i < cntBlock; ++i) {
            vload = _mm256_load_ps(p);    // Load. vload = *(__m256*)p;
            vrt = _mm256_add_ps(vrt, vload);    // Add. vrt += vload;
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += p[i];
        }
    }
    // Reduce.
    p = (const float*)&vrt;
    for (i = 0; i < VectorWidth; ++i) {
        rt += p[i];
    }
    return rt;
}

// Sum - Vector AVX - Loop unrolling *4.
float SumVectorAvxU4(const float* src, size_t count, int loops) {
    float rt = 0;    // Result.
    const int LoopUnrolling = 4;
    size_t VectorWidth = sizeof(__m256) / sizeof(float); // Vector width (floats per vector).
    size_t nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    __m256 vrt = _mm256_setzero_ps(); // Vector result. [AVX] Set zero.
    __m256 vrt1 = _mm256_setzero_ps();
    __m256 vrt2 = _mm256_setzero_ps();
    __m256 vrt3 = _mm256_setzero_ps();
    __m256 vload; // Vector load.
    __m256 vload1, vload2, vload3;
    const float* p; // Pointer for src data.
    size_t i;
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = src;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            vload = _mm256_load_ps(p);    // Load. vload = *(__m256*)p;
            vload1 = _mm256_load_ps(p + VectorWidth * 1);
            vload2 = _mm256_load_ps(p + VectorWidth * 2);
            vload3 = _mm256_load_ps(p + VectorWidth * 3);
            vrt = _mm256_add_ps(vrt, vload);    // Add. vrt += vload;
            vrt1 = _mm256_add_ps(vrt1, vload1);
            vrt2 = _mm256_add_ps(vrt2, vload2);
            vrt3 = _mm256_add_ps(vrt3, vload3);
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += p[i];
        }
    }
    // Reduce.
    vrt = _mm256_add_ps(_mm256_add_ps(vrt, vrt1), _mm256_add_ps(vrt2, vrt3)); // vrt = vrt + vrt1 + vrt2 + vrt3;
    p = (const float*)&vrt;
    for (i = 0; i < VectorWidth; ++i) {
        rt += p[i];
    }
    return rt;
}

// Sum - Vector AVX - Loop unrolling *16.
float SumVectorAvxU16(const float* src, size_t count, int loops) {
    float rt = 0;    // Result.
    const int LoopUnrolling = 16;
    size_t VectorWidth = sizeof(__m256) / sizeof(float); // Vector width (floats per vector).
    size_t nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    __m256 vrt = _mm256_setzero_ps(); // Vector result. [AVX] Set zero.
    __m256 vrt1 = _mm256_setzero_ps();
    __m256 vrt2 = _mm256_setzero_ps();
    __m256 vrt3 = _mm256_setzero_ps();
    __m256 vrt4 = _mm256_setzero_ps();
    __m256 vrt5 = _mm256_setzero_ps();
    __m256 vrt6 = _mm256_setzero_ps();
    __m256 vrt7 = _mm256_setzero_ps();
    __m256 vrt8 = _mm256_setzero_ps();
    __m256 vrt9 = _mm256_setzero_ps();
    __m256 vrt10 = _mm256_setzero_ps();
    __m256 vrt11 = _mm256_setzero_ps();
    __m256 vrt12 = _mm256_setzero_ps();
    __m256 vrt13 = _mm256_setzero_ps();
    __m256 vrt14 = _mm256_setzero_ps();
    __m256 vrt15 = _mm256_setzero_ps();
    const float* p; // Pointer for src data.
    size_t i;
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = src;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            vrt = _mm256_add_ps(vrt, _mm256_load_ps(p));    // Add. vrt += *((__m256*)p + 0);
            vrt1 = _mm256_add_ps(vrt1, _mm256_load_ps(p + VectorWidth * 1));
            vrt2 = _mm256_add_ps(vrt2, _mm256_load_ps(p + VectorWidth * 2));
            vrt3 = _mm256_add_ps(vrt3, _mm256_load_ps(p + VectorWidth * 3));
            vrt4 = _mm256_add_ps(vrt4, _mm256_load_ps(p + VectorWidth * 4));
            vrt5 = _mm256_add_ps(vrt5, _mm256_load_ps(p + VectorWidth * 5));
            vrt6 = _mm256_add_ps(vrt6, _mm256_load_ps(p + VectorWidth * 6));
            vrt7 = _mm256_add_ps(vrt7, _mm256_load_ps(p + VectorWidth * 7));
            vrt8 = _mm256_add_ps(vrt8, _mm256_load_ps(p + VectorWidth * 8));
            vrt9 = _mm256_add_ps(vrt9, _mm256_load_ps(p + VectorWidth * 9));
            vrt10 = _mm256_add_ps(vrt10, _mm256_load_ps(p + VectorWidth * 10));
            vrt11 = _mm256_add_ps(vrt11, _mm256_load_ps(p + VectorWidth * 11));
            vrt12 = _mm256_add_ps(vrt12, _mm256_load_ps(p + VectorWidth * 12));
            vrt13 = _mm256_add_ps(vrt13, _mm256_load_ps(p + VectorWidth * 13));
            vrt14 = _mm256_add_ps(vrt14, _mm256_load_ps(p + VectorWidth * 14));
            vrt15 = _mm256_add_ps(vrt15, _mm256_load_ps(p + VectorWidth * 15));
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += p[i];
        }
    }
    // Reduce.
    vrt = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(_mm256_add_ps(vrt, vrt1), _mm256_add_ps(vrt2, vrt3))
        , _mm256_add_ps(_mm256_add_ps(vrt4, vrt5), _mm256_add_ps(vrt6, vrt7)))
        , _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(vrt8, vrt9), _mm256_add_ps(vrt10, vrt11))
        , _mm256_add_ps(_mm256_add_ps(vrt12, vrt13), _mm256_add_ps(vrt14, vrt15))))
    ; // vrt = vrt + vrt1 + vrt2 + vrt3 + ... vrt15;
    p = (const float*)&vrt;
    for (i = 0; i < VectorWidth; ++i) {
        rt += p[i];
    }
    return rt;
}

// Sum - Vector AVX - Loop unrolling *16 - Array.
float SumVectorAvxU16A(const float* src, size_t count, int loops) {
    float rt = 0;    // Result.
    const int LoopUnrolling = 16;
    size_t VectorWidth = sizeof(__m256) / sizeof(float); // Vector width (floats per vector).
    size_t nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    size_t i;
    __m256 vrt[LoopUnrolling]; // Vector result.
    for (i = 0; i < LoopUnrolling; ++i) {
        vrt[i] = _mm256_setzero_ps(); // [AVX] Set zero.
    }
    const float* p; // Pointer for src data.
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = src;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            vrt[0] = _mm256_add_ps(vrt[0], _mm256_load_ps(p));    // Add. vrt[k] += *((__m256*)p + k);
            vrt[1] = _mm256_add_ps(vrt[1], _mm256_load_ps(p + VectorWidth * 1));
            vrt[2] = _mm256_add_ps(vrt[2], _mm256_load_ps(p + VectorWidth * 2));
            vrt[3] = _mm256_add_ps(vrt[3], _mm256_load_ps(p + VectorWidth * 3));
            vrt[4] = _mm256_add_ps(vrt[4], _mm256_load_ps(p + VectorWidth * 4));
            vrt[5] = _mm256_add_ps(vrt[5], _mm256_load_ps(p + VectorWidth * 5));
            vrt[6] = _mm256_add_ps(vrt[6], _mm256_load_ps(p + VectorWidth * 6));
            vrt[7] = _mm256_add_ps(vrt[7], _mm256_load_ps(p + VectorWidth * 7));
            vrt[8] = _mm256_add_ps(vrt[8], _mm256_load_ps(p + VectorWidth * 8));
            vrt[9] = _mm256_add_ps(vrt[9], _mm256_load_ps(p + VectorWidth * 9));
            vrt[10] = _mm256_add_ps(vrt[10], _mm256_load_ps(p + VectorWidth * 10));
            vrt[11] = _mm256_add_ps(vrt[11], _mm256_load_ps(p + VectorWidth * 11));
            vrt[12] = _mm256_add_ps(vrt[12], _mm256_load_ps(p + VectorWidth * 12));
            vrt[13] = _mm256_add_ps(vrt[13], _mm256_load_ps(p + VectorWidth * 13));
            vrt[14] = _mm256_add_ps(vrt[14], _mm256_load_ps(p + VectorWidth * 14));
            vrt[15] = _mm256_add_ps(vrt[15], _mm256_load_ps(p + VectorWidth * 15));
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += p[i];
        }
    }
    // Reduce.
    vrt[0] = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(_mm256_add_ps(vrt[0], vrt[1]), _mm256_add_ps(vrt[2], vrt[3]))
        , _mm256_add_ps(_mm256_add_ps(vrt[4], vrt[5]), _mm256_add_ps(vrt[6], vrt[7])))
        , _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(vrt[8], vrt[9]), _mm256_add_ps(vrt[10], vrt[11]))
        , _mm256_add_ps(_mm256_add_ps(vrt[12], vrt[13]), _mm256_add_ps(vrt[14], vrt[15]))))
    ; // vrt = vrt + vrt1 + vrt2 + vrt3 + ... vrt15;
    p = (const float*)&vrt;
    for (i = 0; i < VectorWidth; ++i) {
        rt += p[i];
    }
    return rt;
}

// Sum - Vector AVX - Loop unrolling *X - Array.
float SumVectorAvxUX(const float* src, size_t count, int loops, const int LoopUnrolling) {
    float rt = 0;    // Result.
    size_t VectorWidth = sizeof(__m256) / sizeof(float); // Vector width (floats per vector).
    size_t nBlockWidth = VectorWidth * LoopUnrolling; // Block width.
    size_t cntBlock = count / nBlockWidth; // Block count.
    size_t cntRem = count % nBlockWidth; // Remainder count.
    size_t i;
    __m256* vrt = new __m256[LoopUnrolling]; // Vector result.
    for (i = 0; i < LoopUnrolling; ++i) {
        vrt[i] = _mm256_setzero_ps(); // [AVX] Set zero.
    }
    const float* p; // Pointer for src data.
    // Body.
    for (int j = 0; j < loops; ++j) {
        p = src;
        // Block process.
        for (i = 0; i < cntBlock; ++i) {
            for (int k = 0; k < LoopUnrolling; ++k) {
                vrt[k] = _mm256_add_ps(vrt[k], _mm256_load_ps(p + VectorWidth * k));    // Add. vrt[k] += *((__m256*)p + k);
            }
            p += nBlockWidth;
        }
        // Remainder process.
        for (i = 0; i < cntRem; ++i) {
            rt += p[i];
        }
    }
    // Reduce.
    for (i = 1; i < LoopUnrolling; ++i) {
        vrt[0] = _mm256_add_ps(vrt[0], vrt[i]); // vrt[0] += vrt[i]
    } // vrt[0] = vrt[0] + vrt[1] + ... + vrt[LoopUnrolling - 1];
    p = (const float*)&vrt[0];
    for (i = 0; i < VectorWidth; ++i) {
        rt += p[i];
    }
    delete[] vrt;
    return rt;
}

// Do Benchmark.
void Benchmark() {
    const size_t alignment = 256 / 8; // sizeof(__m256) / sizeof(BYTE);
    // init.
    clock_t tickBegin, msUsed;
    double mFlops; // MFLOPS/s .
    double scale;
    float rt;
    const int count = 1024 * 4;
    const int loops = 1000 * 1000;
    //const int loops = 1;
    const double countMFlops = count * (double)loops / (1000.0 * 1000);
    float* src = (float*)_aligned_malloc(sizeof(float)*count, alignment); // new float[count];
    if (NULL == src) {
        printf("Memory alloc fail!");
        return;
    }
    for (int i = 0; i < count; ++i) {
        src[i] = (float)i;
    }
    printf("Benchmark: \tcount=%d, loops=%d, countMFlops=%f\n", count, loops, countMFlops);
    // SumBase.
    tickBegin = clock();
    rt = SumBase(src, count, loops);
    msUsed = clock() - tickBegin;
    mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
    printf("SumBase:\t%g\t# msUsed=%d, MFLOPS/s=%f\n", rt, (int)msUsed, mFlops);
    double mFlopsBase = mFlops;
    // SumBaseU4.
    tickBegin = clock();
    rt = SumBaseU4(src, count, loops);
    msUsed = clock() - tickBegin;
    mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
    scale = mFlops / mFlopsBase;
    printf("SumBaseU4:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", rt, (int)msUsed, mFlops, scale);
    // SumVectorAvx.
    __try {
        tickBegin = clock();
        rt = SumVectorAvx(src, count, loops);
        msUsed = clock() - tickBegin;
        mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
        scale = mFlops / mFlopsBase;
        printf("SumVectorAvx:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", rt, (int)msUsed, mFlops, scale);
        // SumVectorAvxU4.
        tickBegin = clock();
        rt = SumVectorAvxU4(src, count, loops);
        msUsed = clock() - tickBegin;
        mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
        scale = mFlops / mFlopsBase;
        printf("SumVectorAvxU4:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", rt, (int)msUsed, mFlops, scale);
        // SumVectorAvxU16.
        tickBegin = clock();
        rt = SumVectorAvxU16(src, count, loops);
        msUsed = clock() - tickBegin;
        mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
        scale = mFlops / mFlopsBase;
        printf("SumVectorAvxU16:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", rt, (int)msUsed, mFlops, scale);
        // SumVectorAvxU16A.
        tickBegin = clock();
        rt = SumVectorAvxU16A(src, count, loops);
        msUsed = clock() - tickBegin;
        mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
        scale = mFlops / mFlopsBase;
        printf("SumVectorAvxU16A:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", rt, (int)msUsed, mFlops, scale);
        // SumVectorAvxUX.
        int LoopUnrollingArray[] = { 4, 8, 16 };
        for (int i = 0; i < sizeof(LoopUnrollingArray) / sizeof(LoopUnrollingArray[0]); ++i) {
            int loopUnrolling = LoopUnrollingArray[i];
            tickBegin = clock();
            rt = SumVectorAvxUX(src, count, loops, loopUnrolling);
            msUsed = clock() - tickBegin;
            mFlops = countMFlops * CLOCKS_PER_SEC / msUsed;
            scale = mFlops / mFlopsBase;
            printf("SumVectorAvxUX[%d]:\t%g\t# msUsed=%d, MFLOPS/s=%f, scale=%f\n", loopUnrolling, rt, (int)msUsed, mFlops, scale);
        }
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        printf("Run SumVectorAvx fail!");
    }
    // done.
    _aligned_free(src);
}

int main() {
    printf("BenchmarkVectorCpp\n");
    printf("\n");
    printf("Pointer size:\t%d\n", (int)(sizeof(void*)));
#ifdef _DEBUG
    printf("IsRelease:\tFalse\n");
#else
    printf("IsRelease:\tTrue\n");
#endif // _DEBUG
#ifdef _MSC_VER
    printf("_MSC_VER:\t%d\n", _MSC_VER);
#endif // _MSC_VER
#ifdef __AVX__
    printf("__AVX__:\t%d\n", __AVX__);
#endif // __AVX__
    printf("\n");
    // Benchmark.
    Benchmark();
}
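
Benchmark repeats the same measure-and-compute sequence for every variant. One way to factor that out, shown here only as a sketch, is a small helper that times a callable with std::chrono::steady_clock, which is monotonic and measures elapsed time, whereas the C standard specifies clock() as processor time. The names `TimeMs` and `SumScalar` are hypothetical and not part of the original program.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: run fn() once, store its return value in *result,
// and return the elapsed time in milliseconds.
template <typename Fn>
double TimeMs(Fn fn, float* result) {
    auto begin = std::chrono::steady_clock::now();
    *result = fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}

// Hypothetical stand-in for a function under test (same shape as SumBase).
float SumScalar(const float* src, std::size_t count, int loops) {
    float rt = 0; // Result.
    for (int j = 0; j < loops; ++j) {
        for (std::size_t i = 0; i < count; ++i) {
            rt += src[i];
        }
    }
    return rt;
}
```

A call such as `double ms = TimeMs([&] { return SumScalar(src, count, loops); }, &rt);` would then replace each tickBegin/msUsed pair, and mFlops could be computed as countMFlops * 1000 / ms.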

3.2 Test results

The test results are summarized below:

BenchmarkVectorCpp

Pointer size:   8
IsRelease:      True
_MSC_VER:       1916
__AVX__:        1

Benchmark:      count=4096, loops=1000000, countMFlops=4096.000000
SumBase:        6.87195e+10     # msUsed=4938, MFLOPS/s=829.485622
SumBaseU4:      2.74878e+11     # msUsed=1229, MFLOPS/s=3332.790887, scale=4.017901
SumVectorAvx:   5.49756e+11     # msUsed=616, MFLOPS/s=6649.350649, scale=8.016234
SumVectorAvxU4: 2.19902e+12     # msUsed=247, MFLOPS/s=16582.995951, scale=19.991903
SumVectorAvxU16:        8.3862e+12      # msUsed=89, MFLOPS/s=46022.471910, scale=55.483146
SumVectorAvxU16A:       8.3862e+12      # msUsed=89, MFLOPS/s=46022.471910, scale=55.483146
SumVectorAvxUX[4]:      2.19902e+12     # msUsed=465, MFLOPS/s=8808.602151, scale=10.619355
SumVectorAvxUX[8]:      4.39805e+12     # msUsed=336, MFLOPS/s=12190.476190, scale=14.696429
SumVectorAvxUX[16]:     8.3862e+12      # msUsed=323, MFLOPS/s=12681.114551, scale=15.287926

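Earlier, without loop unrolling, the C# and Visual C++ programs performed about the same. With loop unrolling, the gap has widened:

  • SumBaseU4: the C++ version reaches 3332.790887 MFLOPS/s and the C# version 2184.5333 MFLOPS/s. 3332.790887 / 2184.5333 ≈ 1.5256, so the C++ version is about 1.5256 times as fast.
  • SumVectorAvxU4: the C++ version reaches 16582.995951 MFLOPS/s and the C# version 12487.8049 MFLOPS/s. 16582.995951 / 12487.8049 ≈ 1.3279, so the C++ version is about 1.3279 times as fast.
  • SumVectorAvxU16: the C++ version reaches 46022.471910 MFLOPS/s and the C# version 32768 MFLOPS/s (the SumVectorAvxPtrU16 row, the only 16x local-variable variant in the C# results). 46022.471910 / 32768 ≈ 1.4045, so the C++ version is about 1.4045 times as fast.

In addition:

  • SumVectorAvxU16A and SumVectorAvxU16 perform the same (both msUsed=89), which indicates that the C++ compiler optimizes the array accesses well enough to match individually declared local variables. In C++, an array can therefore be used safely to hold the temporaries of an unrolled loop.
  • The C++ version of SumVectorAvxUX is somewhat faster than the C# one, and lengthening the temporary array beyond 8 still yields a small gain. Even so, the payoff remains poor, and classic loop unrolling is still the better choice.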
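
SumVectorAvxUX receives the unroll factor at run time, so the inner k loop and the heap-allocated accumulator array remain in the generated code. A possible middle ground, sketched below under the assumption that the factor is known at compile time, is to make it a template parameter so that the compiler is free to unroll the inner loop and keep the fixed-size array in registers. The sketch uses plain float accumulators instead of __m256 to stay portable. `SumUnrolled` is a hypothetical name, and this variant was not measured in this article.

```cpp
#include <cstddef>

// Hypothetical sketch: sum with N independent accumulators, N fixed at compile time.
template <int N>
float SumUnrolled(const float* src, std::size_t count) {
    static_assert(N > 0, "N must be positive");
    float acc[N] = {};                       // N independent partial sums.
    const std::size_t cntBlock = count / N;  // Block count.
    const std::size_t cntRem = count % N;    // Remainder count.
    const float* p = src;                    // Pointer for src data.
    // Block process: the inner trip count is a compile-time constant,
    // so an optimizing compiler may unroll it completely.
    for (std::size_t i = 0; i < cntBlock; ++i) {
        for (int k = 0; k < N; ++k) {
            acc[k] += p[k];
        }
        p += N;
    }
    // Remainder process.
    float rt = 0;
    for (std::size_t i = 0; i < cntRem; ++i) {
        rt += p[i];
    }
    // Reduce.
    for (int k = 0; k < N; ++k) {
        rt += acc[k];
    }
    return rt;
}
```

A call looks like `SumUnrolled<4>(src, count)`. Whether the compiler actually unrolls the loop and keeps `acc` in registers depends on the compiler and its optimization settings, so this would need to be benchmarked in the same way as the other variants.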

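4. Summary

Lessons learned from using loop unrolling in C#:

  • Loop unrolling improves performance, but it is tedious to write and raises the cost of maintaining the code. Treat it as a last resort and try other optimizations first.
  • After unrolling, the pointer version is faster than the array and Span versions, possibly because pointers are closer to the underlying hardware and easier for the compiler to optimize. Prefer pointers.
  • In C#, unrolling 4 times is usually enough. Unrolling 16 times brings only a small further gain and is rarely worth the effort.
  • Unrolling multiplies the number of temporaries. Keep defining them one by one as local variables: storing them in an array lowers performance, even when the array is stack-allocated and accessed through pointers.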