Job Sava
Tinymembench Results


RUN1:

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   1147.9 MB/s (1.3%)
 C copy backwards (32 byte blocks)                    :   1121.7 MB/s (1.5%)
 C copy backwards (64 byte blocks)                    :    995.3 MB/s (1.3%)
 C copy                                               :   1082.9 MB/s (1.0%)
 C copy prefetched (32 bytes step)                    :    903.5 MB/s (0.9%)
 C copy prefetched (64 bytes step)                    :    954.8 MB/s (0.4%)
 C 2-pass copy                                        :    858.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    597.4 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    304.4 MB/s
 C fill                                               :   2151.3 MB/s (0.1%)
 C fill (shuffle within 16 byte blocks)               :   2149.3 MB/s
 C fill (shuffle within 32 byte blocks)               :   2147.5 MB/s
 C fill (shuffle within 64 byte blocks)               :   2151.7 MB/s (0.2%)
 ---
 standard memcpy                                      :   1065.0 MB/s (0.9%)
 standard memset                                      :   2151.4 MB/s (0.1%)
 ---
 NEON LDP/STP copy                                    :   1175.8 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    790.2 MB/s (0.6%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    958.6 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1192.3 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1190.4 MB/s
 NEON LD1/ST1 copy                                    :   1094.0 MB/s (1.0%)
 NEON STP fill                                        :   2151.3 MB/s (0.2%)
 NEON STNP fill                                       :   2068.2 MB/s (3.8%)
 ARM LDP/STP copy                                     :   1181.9 MB/s (0.9%)
 ARM STP fill                                         :   2151.5 MB/s (0.1%)
 ARM STNP fill                                        :   2076.2 MB/s (6.9%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    12.6 ns
    524288 :   12.1 ns          /    18.3 ns
   1048576 :  127.3 ns          /   195.4 ns
   2097152 :  189.5 ns          /   250.4 ns
   4194304 :  226.5 ns          /   272.7 ns
   8388608 :  245.3 ns          /   282.3 ns
  16777216 :  255.6 ns          /   287.1 ns
  33554432 :  262.7 ns          /   289.7 ns
  67108864 :  275.4 ns          /   309.8 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.3 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   12.3 ns          /    17.0 ns
   1048576 :  126.9 ns          /   194.9 ns
   2097152 :  187.7 ns          /   248.4 ns
   4194304 :  218.2 ns          /   264.8 ns
   8388608 :  234.2 ns          /   270.6 ns
  16777216 :  242.2 ns          /   272.8 ns
  33554432 :  246.1 ns          /   273.7 ns
  67108864 :  248.0 ns          /   274.0 ns
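
Reading the RUN1 latency table: the extra latency stays at 0.0 ns up to 32768 bytes and rises sharply between 524288 and 1048576 bytes. The sketch below finds those two points from the RUN1 single random read column (MADV_NOHUGEPAGE). The function names are hypothetical helpers, not part of tinymembench. The pattern would be consistent with a 32 KiB L1 data cache and an L2 of roughly 512 KiB, but the log itself does not state the cache sizes, so treat that as an interpretation.

```python
# Single random read latencies (ns on top of L1 latency), copied from RUN1,
# MADV_NOHUGEPAGE. Keys are block sizes in bytes.
RUN1_SINGLE_READ_NS = {
    1024: 0.0, 2048: 0.0, 4096: 0.0, 8192: 0.0, 16384: 0.0, 32768: 0.0,
    65536: 4.2, 131072: 6.7, 262144: 8.1, 524288: 12.1,
    1048576: 127.3, 2097152: 189.5, 4194304: 226.5, 8388608: 245.3,
    16777216: 255.6, 33554432: 262.7, 67108864: 275.4,
}


def last_zero_size(latencies):
    """Return the largest block size whose extra latency is still 0.0 ns."""
    return max(size for size, ns in latencies.items() if ns == 0.0)


def largest_step(latencies):
    """Return the adjacent pair of block sizes with the biggest latency increase."""
    sizes = sorted(latencies)
    pairs = zip(sizes, sizes[1:])
    return max(pairs, key=lambda p: latencies[p[1]] - latencies[p[0]])


print("last size with no extra latency:", last_zero_size(RUN1_SINGLE_READ_NS))
print("largest step between:", largest_step(RUN1_SINGLE_READ_NS))
```

The same two helpers can be pointed at any other column of the latency tables by swapping in that column's values.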


RUN2:


root@mitysom-am62x:~# tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :   1148.8 MB/s (1.8%)
 C copy backwards (32 byte blocks)                    :   1102.7 MB/s (1.0%)
 C copy backwards (64 byte blocks)                    :   1001.1 MB/s (1.5%)
 C copy                                               :   1070.9 MB/s (0.8%)
 C copy prefetched (32 bytes step)                    :    900.7 MB/s (1.0%)
 C copy prefetched (64 bytes step)                    :    959.1 MB/s (0.5%)
 C 2-pass copy                                        :    859.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    594.2 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    304.6 MB/s
 C fill                                               :   2147.6 MB/s
 C fill (shuffle within 16 byte blocks)               :   2148.8 MB/s
 C fill (shuffle within 32 byte blocks)               :   2150.6 MB/s
 C fill (shuffle within 64 byte blocks)               :   2150.5 MB/s
 ---
 standard memcpy                                      :   1054.6 MB/s (1.0%)
 standard memset                                      :   2149.1 MB/s
 ---
 NEON LDP/STP copy                                    :   1183.2 MB/s (0.8%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    786.0 MB/s (0.8%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    957.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1191.6 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1188.3 MB/s
 NEON LD1/ST1 copy                                    :   1094.8 MB/s (0.9%)
 NEON STP fill                                        :   2148.5 MB/s
 NEON STNP fill                                       :   2078.3 MB/s (2.6%)
 ARM LDP/STP copy                                     :   1181.0 MB/s (0.7%)
 ARM STP fill                                         :   2149.4 MB/s
 ARM STNP fill                                        :   2075.5 MB/s (3.8%)

==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.5 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.4 ns
    524288 :   12.1 ns          /    18.4 ns
   1048576 :  127.0 ns          /   195.4 ns
   2097152 :  189.6 ns          /   250.5 ns
   4194304 :  226.6 ns          /   272.9 ns
   8388608 :  245.4 ns          /   282.3 ns
  16777216 :  255.7 ns          /   287.2 ns
  33554432 :  262.8 ns          /   289.6 ns
  67108864 :  275.0 ns          /   309.2 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.3 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.4 ns
    524288 :   12.0 ns          /    18.2 ns
   1048576 :  127.0 ns          /   195.2 ns
   2097152 :  187.8 ns          /   248.6 ns
   4194304 :  218.4 ns          /   265.0 ns
   8388608 :  234.4 ns          /   270.8 ns
  16777216 :  242.4 ns          /   273.0 ns
  33554432 :  246.3 ns          /   273.9 ns
  67108864 :  248.2 ns          /   274.2 ns


RUN3:
root@mitysom-am62x:~# tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :   1128.8 MB/s (1.6%)
 C copy backwards (32 byte blocks)                    :   1113.4 MB/s (1.7%)
 C copy backwards (64 byte blocks)                    :   1006.5 MB/s (2.0%)
 C copy                                               :   1079.4 MB/s (1.4%)
 C copy prefetched (32 bytes step)                    :    902.2 MB/s (1.1%)
 C copy prefetched (64 bytes step)                    :    956.9 MB/s (0.3%)
 C 2-pass copy                                        :    859.2 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    594.9 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    304.2 MB/s
 C fill                                               :   2150.6 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :   2150.3 MB/s (0.1%)
 C fill (shuffle within 32 byte blocks)               :   2148.0 MB/s
 C fill (shuffle within 64 byte blocks)               :   2149.0 MB/s
 ---
 standard memcpy                                      :   1059.1 MB/s (0.9%)
 standard memset                                      :   2150.6 MB/s (0.1%)
 ---
 NEON LDP/STP copy                                    :   1185.5 MB/s (0.7%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    787.7 MB/s (0.4%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    959.0 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1192.0 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1189.7 MB/s (0.1%)
 NEON LD1/ST1 copy                                    :   1098.0 MB/s (1.0%)
 NEON STP fill                                        :   2148.2 MB/s
 NEON STNP fill                                       :   2074.4 MB/s (1.4%)
 ARM LDP/STP copy                                     :   1174.4 MB/s (0.4%)
 ARM STP fill                                         :   2151.2 MB/s (0.1%)
 ARM STNP fill                                        :   2074.4 MB/s (4.1%)


==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.5 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    12.0 ns
    524288 :   12.4 ns          /    18.5 ns
   1048576 :  127.5 ns          /   195.3 ns
   2097152 :  189.6 ns          /   250.5 ns
   4194304 :  226.6 ns          /   272.9 ns
   8388608 :  245.4 ns          /   282.4 ns
  16777216 :  255.7 ns          /   287.3 ns
  33554432 :  262.9 ns          /   289.6 ns
  67108864 :  274.4 ns          /   307.5 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.5 ns
    524288 :   12.3 ns          /    18.1 ns
   1048576 :  127.0 ns          /   195.2 ns
   2097152 :  187.9 ns          /   248.6 ns
   4194304 :  218.5 ns          /   265.0 ns
   8388608 :  234.4 ns          /   270.8 ns
  16777216 :  242.4 ns          /   273.0 ns
  33554432 :  246.3 ns          /   273.8 ns
  67108864 :  248.2 ns          /   274.2 ns
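
RUN1 to RUN3 carry no configuration note, so one way to read them is as repeats of the same setup. That is an assumption; the log does not say so. Under that assumption, the sketch below estimates the run-to-run spread of one result ("standard memcpy") with Python's `statistics` module. The variable names are illustrative only.

```python
import statistics

# "standard memcpy" throughput in MB/s, copied from RUN1, RUN2 and RUN3.
# Treating the three runs as repeats of one configuration is an assumption.
MEMCPY_MBPS = [1065.0, 1054.6, 1059.1]

mean = statistics.mean(MEMCPY_MBPS)
stdev = statistics.stdev(MEMCPY_MBPS)  # sample standard deviation (n - 1)
relative = 100.0 * stdev / mean

print(f"mean  = {mean:.1f} MB/s")
print(f"stdev = {stdev:.1f} MB/s ({relative:.2f}% of mean)")
```

A spread of this kind gives a rough yardstick for deciding whether a difference seen in a later run is larger than ordinary run-to-run noise.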


RUN4: (U-Boot changes: TCR/ASR disabled)

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :   1067.3 MB/s (2.1%)
 C copy backwards (32 byte blocks)                    :   1038.0 MB/s (1.9%)
 C copy backwards (64 byte blocks)                    :    945.6 MB/s (2.1%)
 C copy                                               :   1016.2 MB/s (1.6%)
 C copy prefetched (32 bytes step)                    :    851.0 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :    903.4 MB/s (0.4%)
 C 2-pass copy                                        :    823.9 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    575.5 MB/s (0.6%)
 C 2-pass copy prefetched (64 bytes step)             :    292.2 MB/s (0.1%)
 C fill                                               :   2143.0 MB/s
 C fill (shuffle within 16 byte blocks)               :   2145.5 MB/s
 C fill (shuffle within 32 byte blocks)               :   2145.1 MB/s
 C fill (shuffle within 64 byte blocks)               :   2146.2 MB/s
 ---
 standard memcpy                                      :    978.3 MB/s (1.2%)
 standard memset                                      :   2146.9 MB/s (0.2%)
 ---
 NEON LDP/STP copy                                    :   1079.3 MB/s (0.9%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    741.7 MB/s (0.6%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    902.3 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1115.8 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1111.9 MB/s (0.2%)
 NEON LD1/ST1 copy                                    :   1034.6 MB/s (1.0%)
 NEON STP fill                                        :   2145.8 MB/s
 NEON STNP fill                                       :   2054.4 MB/s (1.7%)
 ARM LDP/STP copy                                     :   1072.7 MB/s (0.8%)
 ARM STP fill                                         :   2144.7 MB/s
 ARM STNP fill                                        :   2058.5 MB/s (1.4%)

==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.3 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.0 ns
    524288 :   12.9 ns          /    17.7 ns
   1048576 :  130.9 ns          /   201.5 ns
   2097152 :  195.2 ns          /   259.6 ns
   4194304 :  234.1 ns          /   283.6 ns
   8388608 :  253.7 ns          /   290.5 ns
  16777216 :  263.3 ns          /   295.9 ns
  33554432 :  268.3 ns          /   301.2 ns
  67108864 :  283.6 ns          /   323.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.5 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.4 ns
    524288 :   12.5 ns          /    18.4 ns
   1048576 :  130.8 ns          /   201.4 ns
   2097152 :  193.2 ns          /   257.4 ns
   4194304 :  222.4 ns          /   276.4 ns
   8388608 :  237.1 ns          /   283.0 ns
  16777216 :  244.8 ns          /   285.3 ns
  33554432 :  248.7 ns          /   286.0 ns
  67108864 :  250.7 ns          /   286.4 ns


RUN5:
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :   1043.6 MB/s (1.4%)
 C copy backwards (32 byte blocks)                    :   1019.6 MB/s (1.5%)
 C copy backwards (64 byte blocks)                    :    940.8 MB/s (1.6%)
 C copy                                               :   1039.5 MB/s (1.2%)
 C copy prefetched (32 bytes step)                    :    844.6 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :    905.3 MB/s (0.5%)
 C 2-pass copy                                        :    824.2 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    573.5 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    291.8 MB/s
 C fill                                               :   2144.6 MB/s
 C fill (shuffle within 16 byte blocks)               :   2146.1 MB/s
 C fill (shuffle within 32 byte blocks)               :   2147.4 MB/s (0.2%)
 C fill (shuffle within 64 byte blocks)               :   2146.8 MB/s
 ---
 standard memcpy                                      :    980.4 MB/s (0.9%)
 standard memset                                      :   2147.9 MB/s (0.5%)
 ---
 NEON LDP/STP copy                                    :   1077.2 MB/s (0.6%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    734.3 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    901.2 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1115.3 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1111.1 MB/s
 NEON LD1/ST1 copy                                    :   1037.7 MB/s (1.4%)
 NEON STP fill                                        :   2147.2 MB/s (0.2%)
 NEON STNP fill                                       :   2050.7 MB/s (1.6%)
 ARM LDP/STP copy                                     :   1076.5 MB/s (0.6%)
 ARM STP fill                                         :   2146.2 MB/s (0.1%)
 ARM STNP fill                                        :   2050.6 MB/s (2.8%)

==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.1 ns          /    12.0 ns
    524288 :   12.4 ns          /    18.3 ns
   1048576 :  130.7 ns          /   201.8 ns
   2097152 :  195.0 ns          /   259.5 ns
   4194304 :  234.0 ns          /   283.5 ns
   8388608 :  253.7 ns          /   290.4 ns
  16777216 :  263.3 ns          /   295.8 ns
  33554432 :  268.3 ns          /   301.1 ns
  67108864 :  282.3 ns          /   321.4 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   12.3 ns          /    18.5 ns
   1048576 :  130.8 ns          /   201.6 ns
   2097152 :  193.2 ns          /   257.3 ns
   4194304 :  222.5 ns          /   276.2 ns
   8388608 :  237.1 ns          /   282.9 ns
  16777216 :  244.8 ns          /   285.2 ns
  33554432 :  248.7 ns          /   286.0 ns
  67108864 :  250.6 ns          /   286.3 ns

RUN6: 
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :   1034.0 MB/s (1.1%)
 C copy backwards (32 byte blocks)                    :   1009.6 MB/s (1.9%)
 C copy backwards (64 byte blocks)                    :    942.8 MB/s (1.2%)
 C copy                                               :   1020.0 MB/s (1.5%)
 C copy prefetched (32 bytes step)                    :    847.1 MB/s (0.6%)
 C copy prefetched (64 bytes step)                    :    899.7 MB/s (0.4%)
 C 2-pass copy                                        :    825.8 MB/s (0.2%)
 C 2-pass copy prefetched (32 bytes step)             :    575.8 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    292.4 MB/s (0.1%)
 C fill                                               :   2146.5 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :   2144.4 MB/s
 C fill (shuffle within 32 byte blocks)               :   2146.4 MB/s
 C fill (shuffle within 64 byte blocks)               :   2146.5 MB/s
 ---
 standard memcpy                                      :    976.3 MB/s (1.2%)
 standard memset                                      :   2146.8 MB/s (0.3%)
 ---
 NEON LDP/STP copy                                    :   1070.0 MB/s (0.6%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    744.5 MB/s (0.8%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    902.0 MB/s (0.1%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1115.7 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1111.5 MB/s (0.2%)
 NEON LD1/ST1 copy                                    :   1028.8 MB/s (1.5%)
 NEON STP fill                                        :   2145.2 MB/s
 NEON STNP fill                                       :   2062.5 MB/s (8.3%)
 ARM LDP/STP copy                                     :   1069.0 MB/s (0.6%)
 ARM STP fill                                         :   2146.5 MB/s
 ARM STNP fill                                        :   2060.2 MB/s (3.9%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.3 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    12.0 ns
    524288 :   12.5 ns          /    18.6 ns
   1048576 :  130.7 ns          /   201.5 ns
   2097152 :  195.1 ns          /   259.5 ns
   4194304 :  234.0 ns          /   283.4 ns
   8388608 :  253.7 ns          /   290.5 ns
  16777216 :  263.3 ns          /   295.9 ns
  33554432 :  268.4 ns          /   301.3 ns
  67108864 :  281.3 ns          /   320.1 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    10.9 ns
    262144 :    8.1 ns          /    12.4 ns
    524288 :   12.8 ns          /    17.7 ns
   1048576 :  130.8 ns          /   201.5 ns
   2097152 :  193.2 ns          /   257.4 ns
   4194304 :  222.4 ns          /   276.3 ns
   8388608 :  237.2 ns          /   283.0 ns
  16777216 :  244.8 ns          /   285.3 ns
  33554432 :  248.7 ns          /   286.0 ns
  67108864 :  250.6 ns          /   286.4 ns


RUN7: (ECC enabled)
==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    849.3 MB/s (1.3%)
 C copy backwards (32 byte blocks)                    :    870.2 MB/s (2.0%)
 C copy backwards (64 byte blocks)                    :    859.6 MB/s (1.2%)
 C copy                                               :    863.9 MB/s (1.5%)
 C copy prefetched (32 bytes step)                    :    687.7 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :    772.1 MB/s (0.4%)
 C 2-pass copy                                        :    795.9 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    519.2 MB/s (1.2%)
 C 2-pass copy prefetched (64 bytes step)             :    281.5 MB/s (0.2%)
 C fill                                               :   1910.1 MB/s (0.1%)
 C fill (shuffle within 16 byte blocks)               :   1910.6 MB/s (0.2%)
 C fill (shuffle within 32 byte blocks)               :   1910.6 MB/s (0.1%)
 C fill (shuffle within 64 byte blocks)               :   1909.2 MB/s
 ---
 standard memcpy                                      :    879.2 MB/s (0.6%)
 standard memset                                      :   1910.2 MB/s (0.3%)
 ---
 NEON LDP/STP copy                                    :    895.4 MB/s (0.8%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    622.2 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    753.3 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1002.9 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1002.4 MB/s
 NEON LD1/ST1 copy                                    :    877.8 MB/s (1.2%)
 NEON STP fill                                        :   1908.2 MB/s
 NEON STNP fill                                       :   1852.8 MB/s (2.1%)
 ARM LDP/STP copy                                     :    894.8 MB/s (0.4%)
 ARM STP fill                                         :   1911.3 MB/s (0.1%)
 ARM STNP fill                                        :   1850.2 MB/s (0.9%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    12.5 ns
    524288 :   12.8 ns          /    18.1 ns
   1048576 :  139.2 ns          /   215.9 ns
   2097152 :  207.9 ns          /   279.4 ns
   4194304 :  247.2 ns          /   305.5 ns
   8388608 :  267.7 ns          /   315.2 ns
  16777216 :  280.3 ns          /   321.5 ns
  33554432 :  288.0 ns          /   326.0 ns
  67108864 :  302.1 ns          /   347.4 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.3 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   12.6 ns          /    18.8 ns
   1048576 :  139.0 ns          /   215.6 ns
   2097152 :  206.0 ns          /   277.1 ns
   4194304 :  239.0 ns          /   296.4 ns
   8388608 :  255.0 ns          /   303.5 ns
  16777216 :  263.1 ns          /   306.7 ns
  33554432 :  267.3 ns          /   308.2 ns
  67108864 :  269.3 ns          /   309.0 ns
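
Note: the bandwidth figures in RUN6 and RUN7 can be compared with a small script. The sketch below is illustrative only. It assumes RUN6 is a comparable baseline taken without ECC, which this log does not state explicitly, and the helper names `parse_bandwidth` and `percent_change` are made up for this example. Only three lines from each run are pasted in; the same parser should work on a full bandwidth section.

```python
import re

# Matches a tinymembench bandwidth line such as
# " C copy      :   1020.0 MB/s (1.5%)"; latency rows (ns) do not match.
LINE_RE = re.compile(r"^\s*(?P<name>.+?)\s*:\s*(?P<mbps>\d+(?:\.\d+)?) MB/s")


def parse_bandwidth(text):
    """Return {test name: MB/s} for every bandwidth line found in text."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            results[m.group("name")] = float(m.group("mbps"))
    return results


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Lines copied from RUN6 (assumed baseline) and RUN7 (ECC enabled).
run6 = parse_bandwidth("""
 C copy                                               :   1020.0 MB/s (1.5%)
 C fill                                               :   2146.5 MB/s (0.2%)
 standard memcpy                                      :    976.3 MB/s (1.2%)
""")
run7 = parse_bandwidth("""
 C copy                                               :    863.9 MB/s (1.5%)
 C fill                                               :   1910.1 MB/s (0.1%)
 standard memcpy                                      :    879.2 MB/s (0.6%)
""")

for name in run6:
    change = percent_change(run6[name], run7[name])
    print(f"{name:<20} {change:+.1f}%")
```

The sample standard deviation in brackets is ignored here; differences of about the same size as that deviation should not be read as significant.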

RUN8: 
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    881.9 MB/s (2.6%)
 C copy backwards (32 byte blocks)                    :    859.2 MB/s (1.6%)
 C copy backwards (64 byte blocks)                    :    847.6 MB/s (2.3%)
 C copy                                               :    869.8 MB/s (2.3%)
 C copy prefetched (32 bytes step)                    :    688.7 MB/s (0.7%)
 C copy prefetched (64 bytes step)                    :    770.7 MB/s (0.5%)
 C 2-pass copy                                        :    799.4 MB/s (0.2%)
 C 2-pass copy prefetched (32 bytes step)             :    520.7 MB/s (0.5%)
 C 2-pass copy prefetched (64 bytes step)             :    281.3 MB/s
 C fill                                               :   1911.5 MB/s (0.1%)
 C fill (shuffle within 16 byte blocks)               :   1911.2 MB/s (0.2%)
 C fill (shuffle within 32 byte blocks)               :   1911.5 MB/s (0.1%)
 C fill (shuffle within 64 byte blocks)               :   1911.2 MB/s (0.2%)
 ---
 standard memcpy                                      :    877.9 MB/s (0.7%)
 standard memset                                      :   1908.8 MB/s
 ---
 NEON LDP/STP copy                                    :    891.0 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    624.9 MB/s (0.3%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    753.3 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1002.9 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1003.0 MB/s
 NEON LD1/ST1 copy                                    :    883.1 MB/s (1.6%)
 NEON STP fill                                        :   1910.8 MB/s (0.1%)
 NEON STNP fill                                       :   1853.0 MB/s (1.2%)
 ARM LDP/STP copy                                     :    889.2 MB/s
 ARM STP fill                                         :   1910.9 MB/s (0.3%)
 ARM STNP fill                                        :   1852.6 MB/s (1.0%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    10.9 ns
    262144 :    8.1 ns          /    12.7 ns
    524288 :   12.6 ns          /    19.5 ns
   1048576 :  139.1 ns          /   215.9 ns
   2097152 :  207.8 ns          /   279.3 ns
   4194304 :  247.1 ns          /   305.5 ns
   8388608 :  267.5 ns          /   314.9 ns
  16777216 :  280.3 ns          /   321.4 ns
  33554432 :  288.0 ns          /   326.0 ns
  67108864 :  303.2 ns          /   350.0 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   12.8 ns          /    17.6 ns
   1048576 :  139.2 ns          /   215.4 ns
   2097152 :  205.9 ns          /   277.0 ns
   4194304 :  239.0 ns          /   296.3 ns
   8388608 :  254.9 ns          /   303.4 ns
  16777216 :  263.1 ns          /   306.7 ns
  33554432 :  267.2 ns          /   308.2 ns
  67108864 :  269.2 ns          /   308.9 ns

RUN9:
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    876.8 MB/s (1.7%)
 C copy backwards (32 byte blocks)                    :    860.7 MB/s (2.2%)
 C copy backwards (64 byte blocks)                    :    855.5 MB/s (1.7%)
 C copy                                               :    862.2 MB/s (1.4%)
 C copy prefetched (32 bytes step)                    :    684.3 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :    773.2 MB/s (0.4%)
 C 2-pass copy                                        :    797.4 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    506.8 MB/s (0.6%)
 C 2-pass copy prefetched (64 bytes step)             :    281.4 MB/s
 C fill                                               :   1910.6 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :   1910.6 MB/s (0.1%)
 C fill (shuffle within 32 byte blocks)               :   1911.3 MB/s (0.1%)
 C fill (shuffle within 64 byte blocks)               :   1910.4 MB/s
 ---
 standard memcpy                                      :    881.4 MB/s (0.8%)
 standard memset                                      :   1909.8 MB/s (0.6%)
 ---
 NEON LDP/STP copy                                    :    894.7 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    626.9 MB/s (0.3%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    752.7 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   1003.3 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   1002.3 MB/s
 NEON LD1/ST1 copy                                    :    874.1 MB/s (0.9%)
 NEON STP fill                                        :   1910.3 MB/s (0.1%)
 NEON STNP fill                                       :   1847.1 MB/s (1.4%)
 ARM LDP/STP copy                                     :    895.6 MB/s (0.7%)
 ARM STP fill                                         :   1908.8 MB/s
 ARM STNP fill                                        :   1844.5 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.7 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.1 ns          /    12.8 ns
    524288 :   12.8 ns          /    19.2 ns
   1048576 :  139.0 ns          /   215.8 ns
   2097152 :  207.9 ns          /   279.4 ns
   4194304 :  247.0 ns          /   305.3 ns
   8388608 :  267.5 ns          /   315.0 ns
  16777216 :  280.3 ns          /   321.6 ns
  33554432 :  288.1 ns          /   326.0 ns
  67108864 :  301.5 ns          /   346.0 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.0 ns          /    12.8 ns
    524288 :   13.0 ns          /    19.4 ns
   1048576 :  139.1 ns          /   215.7 ns
   2097152 :  206.0 ns          /   277.1 ns
   4194304 :  239.0 ns          /   296.4 ns
   8388608 :  255.0 ns          /   303.5 ns
  16777216 :  263.2 ns          /   306.7 ns
  33554432 :  267.2 ns          /   308.2 ns
  67108864 :  269.3 ns          /   308.9 ns

RUN10: (U-Boot changes: TCR/ASR disabled and ECC enabled)

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    812.7 MB/s (1.6%)
 C copy backwards (32 byte blocks)                    :    807.0 MB/s (1.5%)
 C copy backwards (64 byte blocks)                    :    817.1 MB/s (1.5%)
 C copy                                               :    814.8 MB/s (1.6%)
 C copy prefetched (32 bytes step)                    :    651.8 MB/s (0.6%)
 C copy prefetched (64 bytes step)                    :    726.0 MB/s
 C 2-pass copy                                        :    770.5 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    486.5 MB/s (0.7%)
 C 2-pass copy prefetched (64 bytes step)             :    269.8 MB/s
 C fill                                               :   1906.5 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :   1907.1 MB/s (1.2%)
 C fill (shuffle within 32 byte blocks)               :   1906.9 MB/s (0.2%)
 C fill (shuffle within 64 byte blocks)               :   1908.7 MB/s (0.1%)
 ---
 standard memcpy                                      :    836.3 MB/s (0.6%)
 standard memset                                      :   1905.6 MB/s
 ---
 NEON LDP/STP copy                                    :    849.5 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    591.6 MB/s (0.4%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    713.3 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :    947.3 MB/s (0.1%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :    947.9 MB/s
 NEON LD1/ST1 copy                                    :    831.5 MB/s (1.2%)
 NEON STP fill                                        :   1906.0 MB/s
 NEON STNP fill                                       :   1843.3 MB/s (2.2%)
 ARM LDP/STP copy                                     :    849.5 MB/s (0.4%)
 ARM STP fill                                         :   1905.8 MB/s
 ARM STNP fill                                        :   1842.1 MB/s (1.1%)


==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    10.7 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   12.9 ns          /    18.1 ns
   1048576 :  143.2 ns          /   223.5 ns
   2097152 :  215.7 ns          /   290.8 ns
   4194304 :  255.7 ns          /   314.0 ns
   8388608 :  275.4 ns          /   327.5 ns
  16777216 :  285.6 ns          /   336.9 ns
  33554432 :  291.5 ns          /   341.8 ns
  67108864 :  305.3 ns          /   360.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.1 ns          /    12.8 ns
    524288 :   12.7 ns          /    19.1 ns
   1048576 :  143.1 ns          /   223.4 ns
   2097152 :  213.7 ns          /   288.5 ns
   4194304 :  248.2 ns          /   308.0 ns
   8388608 :  264.1 ns          /   313.7 ns
  16777216 :  271.6 ns          /   315.9 ns
  33554432 :  275.2 ns          /   316.8 ns
  67108864 :  277.0 ns          /   317.3 ns

RUN11:
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :    825.2 MB/s (2.2%)
 C copy backwards (32 byte blocks)                    :    822.6 MB/s (1.5%)
 C copy backwards (64 byte blocks)                    :    809.6 MB/s (1.4%)
 C copy                                               :    817.7 MB/s (1.3%)
 C copy prefetched (32 bytes step)                    :    655.5 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :    728.1 MB/s (0.3%)
 C 2-pass copy                                        :    770.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    493.9 MB/s (1.4%)
 C 2-pass copy prefetched (64 bytes step)             :    269.8 MB/s
 C fill                                               :   1907.0 MB/s
 C fill (shuffle within 16 byte blocks)               :   1905.8 MB/s (0.2%)
 C fill (shuffle within 32 byte blocks)               :   1903.7 MB/s
 C fill (shuffle within 64 byte blocks)               :   1907.7 MB/s (0.2%)
 ---
 standard memcpy                                      :    838.0 MB/s (0.6%)
 standard memset                                      :   1907.6 MB/s (0.1%)
 ---
 NEON LDP/STP copy                                    :    846.3 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    590.8 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    713.0 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :    946.7 MB/s (0.2%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :    946.5 MB/s
 NEON LD1/ST1 copy                                    :    838.6 MB/s (1.0%)
 NEON STP fill                                        :   1907.1 MB/s (0.1%)
 NEON STNP fill                                       :   1846.0 MB/s (1.2%)
 ARM LDP/STP copy                                     :    847.6 MB/s (0.5%)
 ARM STP fill                                         :   1904.7 MB/s
 ARM STNP fill                                        :   1848.7 MB/s (1.2%)
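
As a reading aid for the bandwidth table above, here is a minimal Python sketch (not part of tinymembench) that parses lines in this format and applies Note 2 of the bandwidth banner: a copy result counts each byte once, so the combined read plus write traffic is roughly twice the reported figure. The names LINE_RE, parse_bandwidth and bus_traffic are hypothetical helpers introduced for illustration; the sample lines are taken from the table above.

```python
import re

# One bandwidth line: test name, MB/s value, optional std-dev in brackets.
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*(?P<mbps>\d+(?:\.\d+)?) MB/s"
    r"(?:\s*\((?P<sd>\d+(?:\.\d+)?)%\))?\s*$"
)

def parse_bandwidth(text):
    """Return {test name: (MB/s, std-dev percent or None)}."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            sd = float(m.group("sd")) if m.group("sd") else None
            results[m.group("name")] = (float(m.group("mbps")), sd)
    return results

def bus_traffic(name, mbps):
    """Approximate read+write traffic: copies touch every byte twice."""
    is_copy = "copy" in name or "memcpy" in name
    return 2 * mbps if is_copy else mbps

SAMPLE = """\
 C copy                                               :    817.7 MB/s (1.3%)
 C fill                                               :   1907.0 MB/s
 ---
 standard memcpy                                      :    838.0 MB/s (0.6%)
"""

if __name__ == "__main__":
    for name, (mbps, sd) in parse_bandwidth(SAMPLE).items():
        print(f"{name}: {mbps} MB/s reported, ~{bus_traffic(name, mbps):.1f} MB/s on the bus")
```

Read this way, the copy figures (about 2 x 820 MB/s of traffic) sit much closer to the fill figures (about 1900 MB/s) than the raw table suggests. This is only an approximation, since it ignores cache effects.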

==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    10.9 ns
    262144 :    8.1 ns          /    11.9 ns
    524288 :   13.1 ns          /    18.9 ns
   1048576 :  143.4 ns          /   224.1 ns
   2097152 :  216.0 ns          /   291.1 ns
   4194304 :  256.0 ns          /   314.5 ns
   8388608 :  275.6 ns          /   327.5 ns
  16777216 :  286.0 ns          /   337.5 ns
  33554432 :  291.8 ns          /   342.3 ns
  67108864 :  308.0 ns          /   364.8 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.4 ns
    131072 :    6.7 ns          /    10.8 ns
    262144 :    8.1 ns          /    12.5 ns
    524288 :   13.1 ns          /    19.5 ns
   1048576 :  143.4 ns          /   223.6 ns
   2097152 :  213.8 ns          /   288.7 ns
   4194304 :  248.5 ns          /   308.1 ns
   8388608 :  264.3 ns          /   313.9 ns
  16777216 :  271.8 ns          /   316.1 ns
  33554432 :  275.4 ns          /   317.0 ns
  67108864 :  277.2 ns          /   317.6 ns
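
The latency banner's Note 2 says that a memory subsystem unable to handle multiple outstanding requests would show dual random read timings equal to two single reads in a row. The sketch below is one way to check that against the tables: it computes the dual/single ratio per block size, where a value near 2.0 would mean no overlap. It is an illustration only; LAT_RE, parse_latency and dual_to_single are hypothetical names, and the sample rows are copied from the MADV_NOHUGEPAGE table of this run.

```python
import re

# One latency row: block size, single-read ns, dual-read ns.
LAT_RE = re.compile(r"^\s*(\d+)\s*:\s*(\d+\.\d+) ns\s*/\s*(\d+\.\d+) ns\s*$")

def parse_latency(text):
    """Return a list of (block size, single ns, dual ns) tuples."""
    rows = []
    for line in text.splitlines():
        m = LAT_RE.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows

def dual_to_single(rows):
    """Ratio dual/single per block size; rows with 0.0 ns single are skipped."""
    return {size: dual / single for size, single, dual in rows if single > 0}

SAMPLE = """\
     32768 :    0.0 ns          /     0.0 ns
   1048576 :  143.4 ns          /   224.1 ns
  67108864 :  308.0 ns          /   364.8 ns
"""

if __name__ == "__main__":
    for size, ratio in dual_to_single(parse_latency(SAMPLE)).items():
        print(f"{size:>9}: dual/single = {ratio:.2f}")
```

For the 64 MiB MADV_NOHUGEPAGE row of this run the ratio is roughly 1.18, well under 2.0, which suggests the two accesses overlap to a large extent. Note that the reported values are extra time on top of L1 latency, so the ratio is an indicator rather than an exact measure.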

RUN12:

tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==========================================================================

 C copy backwards                                     :    817.2 MB/s (2.0%)
 C copy backwards (32 byte blocks)                    :    821.4 MB/s (1.7%)
 C copy backwards (64 byte blocks)                    :    812.7 MB/s (1.2%)
 C copy                                               :    815.8 MB/s (1.2%)
 C copy prefetched (32 bytes step)                    :    651.9 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :    729.5 MB/s (0.3%)
 C 2-pass copy                                        :    771.5 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    499.8 MB/s (2.0%)
 C 2-pass copy prefetched (64 bytes step)             :    269.8 MB/s
 C fill                                               :   1905.1 MB/s
 C fill (shuffle within 16 byte blocks)               :   1907.2 MB/s (0.2%)
 C fill (shuffle within 32 byte blocks)               :   1907.4 MB/s (0.2%)
 C fill (shuffle within 64 byte blocks)               :   1908.2 MB/s (0.1%)
 ---
 standard memcpy                                      :    839.3 MB/s (0.4%)
 standard memset                                      :   1909.1 MB/s (0.3%)
 ---
 NEON LDP/STP copy                                    :    848.9 MB/s (0.5%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :    592.0 MB/s (0.3%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :    712.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :    947.4 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :    946.3 MB/s
 NEON LD1/ST1 copy                                    :    829.7 MB/s (1.2%)
 NEON STP fill                                        :   1907.5 MB/s (0.2%)
 NEON STNP fill                                       :   1840.8 MB/s (2.3%)
 ARM LDP/STP copy                                     :    849.9 MB/s (0.5%)
 ARM STP fill                                         :   1908.4 MB/s (0.2%)
 ARM STNP fill                                        :   1843.1 MB/s (0.5%)

==========================================================================
== Memory latency test                                                  ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.1 ns          /    12.8 ns
    524288 :   12.8 ns          /    18.4 ns
   1048576 :  143.2 ns          /   223.6 ns
   2097152 :  215.7 ns          /   290.9 ns
   4194304 :  255.7 ns          /   314.0 ns
   8388608 :  275.4 ns          /   327.2 ns
  16777216 :  285.7 ns          /   337.1 ns
  33554432 :  291.5 ns          /   342.0 ns
  67108864 :  307.0 ns          /   363.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    4.2 ns          /     7.6 ns
    131072 :    6.7 ns          /    11.2 ns
    262144 :    8.1 ns          /    12.8 ns
    524288 :   12.8 ns          /    19.6 ns
   1048576 :  143.3 ns          /   223.8 ns
   2097152 :  213.6 ns          /   288.4 ns
   4194304 :  248.2 ns          /   308.0 ns
   8388608 :  264.1 ns          /   313.7 ns
  16777216 :  271.6 ns          /   315.9 ns
  33554432 :  275.3 ns          /   316.8 ns
  67108864 :  277.0 ns          /   317.3 ns
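
The two latency tables of a run differ only in the madvise mode, so subtracting them gives a rough view of what huge pages save at each block size. The following is a minimal sketch under that assumption. The helper names split_by_mode and hugepage_saving are hypothetical, and the sample rows are copied from RUN12.

```python
import re

HEADER_RE = re.compile(r"\[(MADV_\w+)\]")
ROW_RE = re.compile(r"^\s*(\d+)\s*:\s*(\d+\.\d+) ns\s*/\s*(\d+\.\d+) ns")

def split_by_mode(text):
    """Return {MADV_* tag: {block size: single-read ns}}."""
    tables, mode = {}, None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            mode = header.group(1)
            tables[mode] = {}
            continue
        row = ROW_RE.match(line)
        if row and mode is not None:
            tables[mode][int(row.group(1))] = float(row.group(2))
    return tables

def hugepage_saving(tables):
    """Single-read ns saved with MADV_HUGEPAGE, per block size."""
    no_huge = tables["MADV_NOHUGEPAGE"]
    huge = tables["MADV_HUGEPAGE"]
    return {size: no_huge[size] - huge[size] for size in no_huge if size in huge}

SAMPLE = """\
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
   1048576 :  143.2 ns          /   223.6 ns
   8388608 :  275.4 ns          /   327.2 ns
  67108864 :  307.0 ns          /   363.3 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
   1048576 :  143.3 ns          /   223.8 ns
   8388608 :  264.1 ns          /   313.7 ns
  67108864 :  277.0 ns          /   317.3 ns
"""

if __name__ == "__main__":
    for size, saved in hugepage_saving(split_by_mode(SAMPLE)).items():
        print(f"{size:>9}: {saved:+.1f} ns")
```

On the RUN12 numbers the saving is negligible at 1 MiB and grows to about 30 ns (roughly 10%) at 64 MiB. That is consistent with the banner's remark that TLB misses contribute more as the buffer grows, though this log alone cannot confirm the cause.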