Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count certain types of hardware events, such as
instructions executed, cache misses suffered, or branches mispredicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events has passed, and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per-task and per-CPU counters and counter
groups, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

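As a minimal sketch of the fcntl() usage mentioned above, the helper below
switches an already opened descriptor to non-blocking mode. The name
counter_set_nonblocking() is hypothetical and not part of the counter API;
the code relies only on the standard fcntl() file status flags and would
behave the same on any file descriptor.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: put an open fd into non-blocking mode.
 * Returns 0 on success, -1 on failure (errno is set by fcntl()).
 */
static int counter_set_nonblocking(int fd)
{
	/* Fetch the current file status flags first so they are preserved. */
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;

	/* Add O_NONBLOCK on top of whatever was already set. */
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;

	return 0;
}
```

Reading the flags before setting them keeps any other status flags on the
descriptor intact.
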
Multiple counters can be kept open at a time, and the counters

When creating a new counter fd, 'perf_counter_hw_event' is:

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
	...
	u32			disabled     :  1, /* off by default */
				nmi          :  1, /* NMI sampling   */
				raw          :  1, /* raw event type */
	...
};

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_CYCLES		=  0,
	PERF_COUNT_INSTRUCTIONS		=  1,
	PERF_COUNT_CACHE_REFERENCES	=  2,
	PERF_COUNT_CACHE_MISSES		=  3,
	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
	PERF_COUNT_BRANCH_MISSES	=  5,

	/*
	 * Special "software" counters provided by the kernel, even if
	 * the hardware does not support performance counters. These
	 * counters measure various physical and sw events of the
	 * kernel (and allow the profiling of them as well):
	 */
	PERF_COUNT_CPU_CLOCK		= -1,
	PERF_COUNT_TASK_CLOCK		= -2,
	/*
	 * Future software events:
	 */
	/* PERF_COUNT_PAGE_FAULTS	= -3,
	   PERF_COUNT_CONTEXT_SWITCHES	= -4, */
};

These are standardized types of events that work uniformly on all CPUs
that implement Performance Counters support under Linux. If a CPU is
not able to count branch misses, then the system call will return
an error.

More hw_event_types are supported as well, but they are CPU-specific
and are enumerated via /sys on a per-CPU basis. Raw hw event
types can be passed in under hw_event.type if hw_event.raw is 1.
For example, to count "External bus cycles while bus lock signal asserted"
events on Intel Core CPUs, pass in a 0x4064 event type value and set
hw_event.raw to 1.

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
	PERF_RECORD_SIMPLE		=  0,
	PERF_RECORD_IRQ			=  1,
	PERF_RECORD_GROUP		=  2,
};

A "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
a successful read of a simple counter.)

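The read of a simple counter described above could be sketched as follows.
This is an illustration only: read_simple_counter() is a hypothetical
helper name, uint64_t stands in for the kernel's u64, and the function
only assumes that a successful read() of the counter fd yields exactly
8 bytes, as stated above.

```c
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one "simple" counter value from fd.
 * Returns 0 and stores the count in *count if exactly 8 bytes were read,
 * -1 otherwise (read error, end of file, or short read).
 */
static int read_simple_counter(int fd, uint64_t *count)
{
	uint64_t value;
	ssize_t n = read(fd, &value, sizeof(value));

	/* A simple counter read is expected to return exactly 8 bytes. */
	if (n != (ssize_t)sizeof(value))
		return -1;

	*count = value;
	return 0;
}
```

Treating any result other than 8 bytes as a failure keeps the caller from
using a partially filled value.
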
An "irq" counter is one that will also provide IRQ context information:
the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.

The parameter 'hw_event_period' is the number of events before waking up
a read() that is blocked on a counter fd. A value of zero means a
non-blocking counter.

The 'pid' parameter allows the counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per-CPU counters)

The 'cpu' parameter allows a counter to be made specific to a
CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per-task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per-task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per-CPU counter that counts
all events on CPU-x. Per-CPU counters need CAP_SYS_ADMIN privilege.

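The 'pid' and 'cpu' rules above can be summarized in a small classifier.
This is a sketch under stated assumptions, not kernel code: the enum
counter_scope and the functions classify_counter() and
scope_needs_admin() are hypothetical names, and treating 'cpu < -1' as
invalid is an assumption, since the text above does not define it.

```c
#include <sys/types.h>

/* Hypothetical classification of a (pid, cpu) pair. */
enum counter_scope {
	SCOPE_INVALID,		/* rejected combination */
	SCOPE_PER_TASK,		/* counter is attached to one task */
	SCOPE_PER_CPU,		/* counter covers all tasks on one CPU */
};

static enum counter_scope classify_counter(pid_t pid, int cpu)
{
	/* Assumption: only -1 ("all CPUs") and CPU numbers >= 0 are valid. */
	if (cpu < -1)
		return SCOPE_INVALID;

	/* pid < 0 counts all tasks, which requires a specific CPU. */
	if (pid < 0)
		return cpu >= 0 ? SCOPE_PER_CPU : SCOPE_INVALID;

	/* pid == 0 (current task) or pid > 0 (specific task). */
	return SCOPE_PER_TASK;
}

/* Per the text above, only per-CPU counters need CAP_SYS_ADMIN. */
static int scope_needs_admin(enum counter_scope scope)
{
	return scope == SCOPE_PER_CPU;
}
```

For example, classify_counter(0, -1) describes a counter that follows the
current task across CPUs, while classify_counter(-1, 0) describes a
counter for all tasks on CPU 0.
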
Group counters are created by passing in a group_fd of another counter.
Groups are scheduled at once and can be used with PERF_RECORD_GROUP
to record multi-dimensional timestamps.