C2C - False Sharing Detection in Linux Perf

Do you run your application in a NUMA environment? Is it multi-threaded? Is it multi-process with shared memory? If so, is your performance impacted by false sharing?

Now there’s a way to easily find out. We’re posting patches for a new feature to the Linux perf tool, called “c2c” for cache-2-cache.
We at Red Hat have been running the development prototype of c2c on lots of big Linux applications and it’s uncovered many hot false sharing cachelines.

I’ve been playing with this tool quite a bit. It is pretty cool. Let me share a little about what it is and how to use it.

At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples for a cacheline came from, and which cpus were involved.

Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record”, then generate a report with “perf c2c report”.

Before covering the output data, here is a “how to” for the flags to use when calling “perf c2c”:
c2c usage flags

Then here’s an output file from a recent “perf c2c” run I did:
c2c output file

And, if you want to play with it yourself, here’s a simple source file to generate lots of false sharing.
False sharing .c src file

First I’ll go over the output file to highlight the interesting fields.

The first table in the output file gives a high level summary of all the load and store samples collected. It is interesting to see where your program’s load instructions got their data.
Notice the term “HITM”, which stands for a load that hit in a modified cacheline. That’s the key sign that false sharing has occurred. Remote HITMs, meaning across numa nodes, are the most expensive - especially when there are lots of readers and writers.

 1  =================================================
 2              Trace Event Information
 3  =================================================
 4    Total records                     :     329219  << Total loads and stores sampled.
 5    Locked Load/Store Operations      :      14654
 6    Load Operations                   :      69679  << Total loads
 7    Loads - uncacheable               :          0
 8    Loads - IO                        :          0
 9    Loads - Miss                      :       3972
10    Loads - no mapping                :          0
11    Load Fill Buffer Hit              :      11958
12    Load L1D hit                      :      17235  << loads that hit in the L1 cache.
13    Load L2D hit                      :         21
14    Load LLC hit                      :      14219  << loads that hit in the last level cache (LLC).
15    Load Local HITM                   :       3402  << loads that hit in a modified cache on the same numa node (local HITM).
16    Load Remote HITM                  :      12757  << loads that hit in a modified cache on a remote numa node (remote HITM).
17    Load Remote HIT                   :       5295
18    Load Local DRAM                   :        976  << loads that hit in the local node's main memory.
19    Load Remote DRAM                  :       3246  << loads that hit in a remote node's main memory.
20    Load MESI State Exclusive         :       4222 
21    Load MESI State Shared            :          0
22    Load LLC Misses                   :      22274  << loads not found in any local node caches.
23    LLC Misses to Local DRAM          :        4.4% << % hitting in local node's main memory.
24    LLC Misses to Remote DRAM         :       14.6% << % hitting in a remote node's main memory.
25    LLC Misses to Remote cache (HIT)  :       23.8% << % hitting in a clean cache in a remote node.
26    LLC Misses to Remote cache (HITM) :       57.3% << % hitting in remote modified cache. (most expensive - false sharing)
27    Store Operations                  :     259539  << store instruction sample count
28    Store - uncacheable               :          0
29    Store - no mapping                :         11
30    Store L1D Hit                     :     256696  << stores that got L1 cache when requested.
31    Store L1D Miss                    :       2832  << stores that couldn't get the L1 cache when requested (L1 miss).
32    No Page Map Rejects               :       2376
33    Unable to parse data source       :          1
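
The four percentage rows (lines 23 to 26 of the table) can be reproduced from the counts above them. Here is a minimal Python sketch of that arithmetic. It assumes, based only on these numbers adding up, that "Load LLC Misses" is the sum of the local DRAM, remote DRAM, remote HIT and remote HITM loads. The function and variable names are mine, not perf's.

```python
# Counts copied from the Trace Event Information table above.
load_sources = {
    "Local DRAM": 976,
    "Remote DRAM": 3246,
    "Remote cache (HIT)": 5295,
    "Remote cache (HITM)": 12757,
}
llc_misses = 22274  # the "Load LLC Misses" row


def llc_miss_breakdown(sources, total):
    """Return each source's share of the LLC misses, as a percentage with one decimal."""
    return {name: round(100.0 * count / total, 1) for name, count in sources.items()}


breakdown = llc_miss_breakdown(load_sources, llc_misses)
for name, pct in breakdown.items():
    print(f"LLC Misses to {name:<20}: {pct:5.1f}%")
```

If the remote HITM share dominates, as it does here, most of the loads that left the local node were served from another node's modified cacheline.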

The second table (below) in the output file gives a brief one-line summary of the hottest cachelines where false sharing was detected. It’s sorted by which line had the most remote HITMs (or local HITMs if you select that sort option). It gives a nice high level sense of the load and store activity for each cacheline.
I look to see if a cacheline has a high number of “Rmt LLC Load Hitm’s”. If so, it’s time to dig further.

54  =================================================
55             Shared Data Cache Line Table          
56  =================================================
57  #
58  #                              Total      Rmt  ----- LLC Load Hitm -----  ---- Store Reference ----  --- Load Dram ----      LLC    Total  ----- Core Load Hit -----  -- LLC Load Hit --
59  # Index           Cacheline  records     Hitm    Total      Lcl      Rmt    Total    L1Hit   L1Miss       Lcl       Rmt  Ld Miss    Loads       FB       L1       L2       Llc       Rmt
60  # .....  ..................  .......  .......  .......  .......  .......  .......  .......  .......  ........  ........  .......  .......  .......  .......  .......  ........  ........
61  #
62        0            0x602180   149904   77.09%    12103     2269     9834   109504   109036      468       727      2657    13747    40400     5355    16154        0      2875       529
63        1            0x602100    12128   22.20%     3951     1119     2832        0        0        0        65       200     3749    12128     5096      108        0      2056       652
64        2  0xffff883ffb6a7e80      260    0.09%       15        3       12      161      161        0         1         1       15       99       25       50        0         6         1
65        3  0xffffffff81aec000      157    0.07%        9        0        9        1        0        1         0         7       20      156       50       59        0        27         4
66        4  0xffffffff81e3f540      179    0.06%        9        1        8      117       97       20         0        10       25       62       11        1        0        24         7
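
The "Rmt Hitm" percentage column appears to be each cacheline's remote HITM count divided by the total remote HITM count from the Trace Event table (12757). That reading is inferred from the numbers in this run, not taken from the perf documentation. A small Python sketch to check it, with names of my own choosing:

```python
# (cacheline address, remote HITM count) pairs copied from the table above.
remote_hitms = [
    (0x602180, 9834),
    (0x602100, 2832),
    (0xFFFF883FFB6A7E80, 12),
    (0xFFFFFFFF81AEC000, 9),
    (0xFFFFFFFF81E3F540, 8),
]
total_remote_hitm = 12757  # "Load Remote HITM" from the Trace Event table


def remote_hitm_share(count, total):
    """Format one cacheline's share of all remote HITMs as a percentage string."""
    return f"{100.0 * count / total:.2f}%"


for cacheline, count in remote_hitms:
    print(f"{cacheline:#018x}  {remote_hitm_share(count, total_remote_hitm):>7}")
```

Read this way, the top two cachelines account for nearly all of the remote HITMs in this run, which is why they are the ones to dig into.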

Next is the Pareto table, which shows lots of valuable information about each contended cacheline. This is the most important table in the output. I only show three cachelines here to keep this blog simple. Here’s what’s in it.

    * Lines 71 and 72 are the column headers for what’s happening in each cacheline.
    * Line 76 shows the HITM and store activity for each cacheline - first with counts for load
       and store activity, followed by the cacheline virtual data address.
    * Then there’s the data address column. Line 76 shows the virtual address of the cacheline.
       Each row underneath represents the offset into the cacheline where those accesses occurred.
    * The next column shows the pid, and/or the thread id (tid) if you selected that for the output.
    * Following is the instruction pointer code address.
    * Next are three columns showing the average load latencies. I always look here for long
       latency averages, which is a sign for how painful the contention was to that cacheline.
    * The “cpu cnt” column shows how many different cpus the samples came from.
    * Then there’s the function name, binary object name, source file and line number.
    * The last column shows for each node, the specific cpus that samples came from.

67  =================================================
68        Shared Cache Line Distribution Pareto      
69  =================================================
70  #
71  #        ----- HITM -----  -- Store Refs --        Data address                               ---------- cycles ----------       cpu                                     Shared                                   
72  #   Num      Rmt      Lcl   L1 Hit  L1 Miss              Offset      Pid        Code address  rmt hitm  lcl hitm      load       cnt               Symbol                Object                  Source:Line  Node{cpu list}
73  # .....  .......  .......  .......  .......  ..................  .......  ..................  ........  ........  ........  ........  ...................  ....................  ...........................  ....
74  #
75    -------------------------------------------------------------
76        0     9834     2269   109036      468            0x602180
77    -------------------------------------------------------------
78            65.51%   55.88%   75.20%    0.00%                 0x0    14604            0x400b4f     27161     26039     26017         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:144   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
79             0.41%    0.35%    0.00%    0.00%                 0x0    14604            0x400b56     18088     12601     26671         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
80             0.00%    0.00%   24.80%  100.00%                 0x0    14604            0x400b61         0         0         0         9  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:145   0{0-1,4}  1{24-25,120}  2{48,54}  3{169}
81             7.50%    9.92%    0.00%    0.00%                0x20    14604            0x400ba7      2470      1729      1897         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:154   1{122}  2{144}
82            17.61%   20.89%    0.00%    0.00%                0x28    14604            0x400bc1      2294      1575      1649         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:158   2{53}  3{170}
83             8.97%   12.96%    0.00%    0.00%                0x30    14604            0x400bdb      2325      1897      1828         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:162   0{96}  3{171}

84    -------------------------------------------------------------
85        1     2832     1119        0        0            0x602100
86    -------------------------------------------------------------
87            29.13%   36.19%    0.00%    0.00%                0x20    14604            0x400bb3      1964      1230      1788         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:155   1{122}  2{144}
88            43.68%   34.41%    0.00%    0.00%                0x28    14604            0x400bcd      2274      1566      1793         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:159   2{53}  3{170}
89            27.19%   29.40%    0.00%    0.00%                0x30    14604            0x400be7      2045      1247      2011         2  [.] read_write_func  no_false_sharing.exe  false_sharing_example.c:163   0{96}  3{171}

90    -------------------------------------------------------------
91        2       12        3      161        0  0xffff883ffb6a7e80
92    -------------------------------------------------------------
93            58.33%  100.00%    0.00%    0.00%                 0x0    14604  0xffffffff810cf16d      1380       941      1229         9  [k] task_tick_fair              [kernel.kallsyms]  atomic64_64.h:21   0{0,4,96}  1{25,120,122}  2{53}  3{170-171}
94            16.67%    0.00%   98.76%    0.00%                 0x0    14604  0xffffffff810c9379      1794         0       625        13  [k] update_cfs_rq_blocked_load  [kernel.kallsyms]  atomic64_64.h:45   0{1,4,96}  1{25,120,122}  2{48,53-54,144}  3{169-171}
95            16.67%    0.00%    0.00%    0.00%                 0x0    14604  0xffffffff810ce098      1382         0       867        12  [k] update_cfs_shares           [kernel.kallsyms]  atomic64_64.h:21   0{1,4,96}  1{25,120,122}  2{53-54,144}  3{169-171}
96             8.33%    0.00%    0.00%    0.00%                 0x8    14604  0xffffffff810cf18c      2560         0       679         8  [k] task_tick_fair              [kernel.kallsyms]  atomic.h:26        0{4,96}  1{24-25,120,122}  2{54}  3{170}
97             0.00%    0.00%    1.24%    0.00%                 0x8    14604  0xffffffff810cf14f         0         0         0         2  [k] task_tick_fair              [kernel.kallsyms]  atomic.h:50        2{48,53}
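
Two of the Pareto columns are easy to post-process if you save the report to a file. The Python sketch below splits a data address into its cacheline base and offset, and expands a "Node{cpu list}" field into per-node cpu lists. It assumes 64-byte cachelines, and the helper names are hypothetical, not part of perf.

```python
import re

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def split_address(addr):
    """Return (cacheline base address, offset into the cacheline) for a data address."""
    base = addr & ~(CACHELINE_SIZE - 1)
    return base, addr - base


def parse_node_cpus(field):
    """Expand a field such as '0{0-1,4}  1{24-25,120}' into {0: [0, 1, 4], 1: [24, 25, 120]}."""
    nodes = {}
    for node, cpu_list in re.findall(r"(\d+)\{([^}]*)\}", field):
        cpus = []
        for part in cpu_list.split(","):
            first, _, last = part.partition("-")
            # A single cpu like "4" has no "-", so it is a range of one.
            cpus.extend(range(int(first), int(last or first) + 1))
        nodes[int(node)] = cpus
    return nodes
```

For the first row under cacheline 0, parse_node_cpus("0{0-1,4}  1{24-25,120}  2{48,54}  3{169}") gives nine cpus spread over four nodes, which agrees with that row's "cpu cnt" of 9.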

How I often use “perf c2c”

Here are the flags I most commonly use.

   perf c2c record -F 60000 -a --all-user sleep 5
   perf c2c record -F 60000 -a --all-user sleep 3     // or to sample for a shorter time.
   perf c2c record -F 60000 -a --all-kernel sleep 3   // or to only gather kernel samples.
   perf c2c record -F 60000 -a -u --ldlat 50 sleep 3  // or to collect only loads >= 50 cycles of load latency (30 is the ldlat default).

To generate report files, you can use the graphical tui report or send the output to stdout:

 perf c2c report -NN -c pid,iaddr                 // to use the tui interactive report
 perf c2c report -NN -c pid,iaddr --stdio         // or to send the output to stdout
 perf c2c report -NN -d lcl -c pid,iaddr --stdio  // or to sort on local hitms

By default, symbol names are truncated to a fixed width for readability.
You can use the “--full-symbols” flag to get full symbol names in the output.
For example:

 perf c2c report -NN -c pid,iaddr --full-symbols --stdio 

Finding the callers to these cachelines:

Sometimes it’s valuable to know who the callers are. Here is how to get call graph information.
I never generate call graph info initially because it emits so much data, it makes it very difficult to see if and where a false sharing problem exists. I find the problem first without call graphs, then if needed I’ll rerun with call graphs.

perf c2c record --call-graph dwarf,8192 -F 60000 -a --all-user sleep 5
perf c2c report -NN -g --call-graph -c pid,iaddr --stdio 

Does bumping perf’s sample rate help?

I’ll sometimes bump the perf sample rate with “-F 60000” or “-F 80000”.
There’s no requirement to do so, but it is a good way to get a richer sample collection in a shorter period of time. If you do, it’s helpful to raise the kernel’s perf sample rate limits with the first two echo commands below. (Check dmesg for “perf interrupt took too long …” entries, which show the kernel lowering the sample rate.)

 echo    500 > /proc/sys/kernel/perf_cpu_time_max_percent
 echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
 <then do your "perf c2c record" here>
 echo     50 > /proc/sys/kernel/perf_cpu_time_max_percent
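
If you script this, it can be handy to save the old values and put them back afterwards, rather than hard-coding them. Below is a minimal Python sketch of that idea. The helper name and its parameters are hypothetical. The two file names and the values 500 and 100000 are simply the ones from the echo commands above. It needs root, and a given kernel may refuse some values, in which case the write raises an error and the saved values are still written back.

```python
import contextlib
from pathlib import Path


@contextlib.contextmanager
def raised_perf_limits(sysctl_dir="/proc/sys/kernel",
                       cpu_time_max_percent=500,
                       max_sample_rate=100000):
    """Temporarily raise the two perf sysctls, then write back the previous values."""
    settings = {
        "perf_cpu_time_max_percent": cpu_time_max_percent,
        "perf_event_max_sample_rate": max_sample_rate,
    }
    root = Path(sysctl_dir)
    # Remember what was there before touching anything.
    saved = {name: (root / name).read_text().strip() for name in settings}
    try:
        for name, value in settings.items():
            (root / name).write_text(f"{value}\n")
        yield saved
    finally:
        for name, value in saved.items():
            (root / name).write_text(f"{value}\n")
```

You would run your "perf c2c record" command (for example through subprocess.run) inside a "with raised_perf_limits():" block.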

What to do when perf drowns in excessive samples:

When running on larger systems (e.g. 4, 8 or 16 socket systems), there can be so many samples that the perf tool can consume lots of cpu time and the perf.data file size grows significantly.
Some tips to help that include:
- Bump the ldlat from the default of 30 to 50. This frees perf to skip the faster, uninteresting loads.
- Lower the sample rate.
- Shorten the sleep time during the “perf c2c record” window. For example, from “sleep 5” to “sleep 3”.

What I’ve learned by using C2C on numerous applications:

It’s common to look at any performance tool output and ask ‘what does all this data mean?’.
Here are some things I’ve learned. Hopefully they’re of help.

    * I tend to run “perf c2c” for 3, 5, or 10 seconds. Running it any longer may take you
       from seeing concurrent false sharing to seeing cacheline accesses which are
       disjoint in time.
    * If you’re not interested in kernel samples, you’ll get better samples in your program by
       specifying --all-user. Conversely, specifying --all-kernel is useful when focusing on the
       kernel.
    * On busy systems with high cpu counts, like >148 cpus, setting --ldlat to a higher value
       (like 50 or even 70) may enable perf to generate richer C2C samples.
    * Look at the Trace Event table at the top, specifically the “LLC Misses to Remote cache HITM”
       number. If it’s not close to zero, then there’s likely worthwhile false sharing to pursue resolving.
    * Most of the time the top one, two, or three cachelines in the Shared Cache Line Distribution
       Pareto table are the ones to focus on.
    * However, sometimes you’ll see the same code from multiple threads causing “less hot”
       contention, but you will see it on multiple cachelines for different data addresses.
       Even though any one of those lines is less hot individually, fixing them is often a
       win because the benefit is spread across many cachelines. This can also happen with
       different processes executing the same code accessing shared memory.
    * In the Shared Cache Line Distribution Pareto table, if you see long average load latencies,
       it’s often a giveaway that false sharing contention is heavy and is hurting performance.
    * Then looking to see what nodes and cpus the samples for those accesses are coming from
       can often be a valuable guide to numa-pinning your processes or memory.
   
For processes using shared memory, it is possible for them to use different virtual addresses,
all pointing to (and contending with) the same shared memory location. They will show
up in the Pareto table as different cachelines, but in fact they are the same cacheline.
These can be tricky to spot. I usually uncover these by first looking to see that shared memory is being used, and then looking for similar patterns in the information provided
for each cacheline.

Last, the Shared Cache Line Distribution Pareto table can also provide great insight into any
ill-aligned hot data.
For example:
    * It’s easy to spot heavily modified variables that need to be placed into their own cachelines.
       This will enable them to be less contended (and run faster), and it will help accesses to
       the other variables that shared their cacheline to not be slowed down.
    * It’s easy to spot hot locks or mutexes that are unaligned and spill into multiple cachelines.
    * It’s easy to spot “read mostly” variables which can be grouped together into their own
       cachelines.
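
To make the first point concrete, here is a small Python sketch that uses ctypes to lay out two versions of a structure and report which cacheline each field lands in. The field names are made up for illustration, and 64-byte cachelines are assumed. ctypes follows the platform's C layout rules, so the offsets match what a C compiler would give the equivalent struct. In C you would normally get the same effect with an alignment attribute rather than a manual pad.

```python
import ctypes

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


class Packed(ctypes.Structure):
    """A heavily written counter sitting next to read-mostly fields."""
    _fields_ = [
        ("hot_counter", ctypes.c_uint64),
        ("config_a", ctypes.c_uint64),
        ("config_b", ctypes.c_uint64),
    ]


class Padded(ctypes.Structure):
    """Same fields, with the hot counter padded out to a full cacheline."""
    _fields_ = [
        ("hot_counter", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHELINE_SIZE - ctypes.sizeof(ctypes.c_uint64))),
        ("config_a", ctypes.c_uint64),
        ("config_b", ctypes.c_uint64),
    ]


def cachelines_used(struct_type):
    """Map each field to its cacheline index, assuming the struct starts on a line boundary."""
    return {
        name: getattr(struct_type, name).offset // CACHELINE_SIZE
        for name, *_ in struct_type._fields_
    }


for struct_type in (Packed, Padded):
    print(struct_type.__name__, cachelines_used(struct_type))
```

In the packed layout every field falls in cacheline 0, so each write to the counter invalidates the line the readers need. In the padded layout the read-mostly fields move to cacheline 1. This only holds if the structure itself starts on a cacheline boundary.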

The raw samples can be helpful.

I’ve often found it valuable to take a peek at the raw instruction samples contained in the perf.data file (the one generated by the “perf c2c record”). You can get those raw samples using “perf script”. See man perf-script. The output may be cryptic, but you can sort on the load weight (5th column) to see which loads suffered the most from false sharing contention and took the longest to execute.
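
Sorting on the weight column is a one-liner with sort(1), but here is a small Python sketch of the same idea in case you want to post-process further. It only assumes what the paragraph above says: fields are whitespace-separated and the load weight is the 5th column. The function name is hypothetical, and I am not reproducing real "perf script" output here.

```python
def sort_by_weight(lines, column=5):
    """Sort whitespace-separated lines by the integer in the given 1-based column, largest first."""
    def weight(line):
        fields = line.split()
        try:
            return int(fields[column - 1])
        except (IndexError, ValueError):
            return -1  # lines without a numeric weight sort to the end

    return sorted(lines, key=weight, reverse=True)
```

Feed it the lines of a saved "perf script" output file and print the first few results to see the slowest loads.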

The c2c functionality is available in the upstream perf as of the Linux 4.2 kernel.

Lastly, this was a collective effort.

Although Don Zickus, Dick Fowles and Joe Mario worked together to get this implemented, we got lots of early help from Arnaldo Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.
Additionally, Jiri has been heavily involved recently in integrating the c2c functionality into perf.
A big thanks to all of you for helping to pull this together!