Commit | Line | Data |
---|---|---|
df78488d TG |
1 | |
2 | hrtimers - subsystem for high-resolution kernel timers | |
3 | ---------------------------------------------------- | |
4 | ||
5 | This patch introduces a new subsystem for high-resolution kernel timers. | |
6 | ||
7 | One might ask the question: we already have a timer subsystem | |
8 | (kernel/timers.c), why do we need two timer subsystems? After a lot of | |
9 | back and forth trying to integrate high-resolution and high-precision | |
10 | features into the existing timer framework, and after testing various | |
11 | such high-resolution timer implementations in practice, we came to the | |
12 | conclusion that the timer wheel code is fundamentally not suitable for | |
fff9289b | 13 | such an approach. We initially didn't believe this ('there must be a way |
df78488d TG |
14 | to solve this'), and spent a considerable effort trying to integrate |
15 | things into the timer wheel, but we failed. In hindsight, there are | |
16 | several reasons why such integration is hard/impossible: | |
17 | ||
18 | - the forced handling of low-resolution and high-resolution timers in | |
19 | the same way leads to a lot of compromises, macro magic and #ifdef | |
20 | mess. The timers.c code is very "tightly coded" around jiffies and | |
21 | 32-bitness assumptions, and has been honed and micro-optimized for a | |
22 | relatively narrow use case (jiffies in a relatively narrow HZ range) | |
23 | for many years - and thus even small extensions to it easily break | |
24 | the wheel concept, leading to even worse compromises. The timer wheel | |
25 | code is very good and tight code, there's zero problems with it in its | |
26 | current usage - but it is simply not suitable to be extended for | |
27 | high-res timers. | |
28 | ||
29 | - the unpredictable [O(N)] overhead of cascading leads to delays which | |
992caacf | 30 | necessitate a more complex handling of high resolution timers, which |
df78488d TG |
31 | in turn decreases robustness. Such a design still led to rather large |
32 | timing inaccuracies. Cascading is a fundamental property of the timer | |
33 | wheel concept, it cannot be 'designed out' without unevitably | |
34 | degrading other portions of the timers.c code in an unacceptable way. | |
35 | ||
36 | - the implementation of the current posix-timer subsystem on top of | |
37 | the timer wheel has already introduced a quite complex handling of | |
38 | the required readjusting of absolute CLOCK_REALTIME timers at | |
39 | settimeofday or NTP time - further underlying our experience by | |
40 | example: that the timer wheel data structure is too rigid for high-res | |
41 | timers. | |
42 | ||
43 | - the timer wheel code is most optimal for use cases which can be | |
44 | identified as "timeouts". Such timeouts are usually set up to cover | |
45 | error conditions in various I/O paths, such as networking and block | |
46 | I/O. The vast majority of those timers never expire and are rarely | |
47 | recascaded because the expected correct event arrives in time so they | |
48 | can be removed from the timer wheel before any further processing of | |
49 | them becomes necessary. Thus the users of these timeouts can accept | |
50 | the granularity and precision tradeoffs of the timer wheel, and | |
51 | largely expect the timer subsystem to have near-zero overhead. | |
52 | Accurate timing for them is not a core purpose - in fact most of the | |
53 | timeout values used are ad-hoc. For them it is at most a necessary | |
54 | evil to guarantee the processing of actual timeout completions | |
55 | (because most of the timeouts are deleted before completion), which | |
56 | should thus be as cheap and unintrusive as possible. | |
57 | ||
58 | The primary users of precision timers are user-space applications that | |
59 | utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel | |
60 | users like drivers and subsystems which require precise timed events | |
53cb4726 | 61 | (e.g. multimedia) can benefit from the availability of a separate |
df78488d TG |
62 | high-resolution timer subsystem as well. |
63 | ||
64 | While this subsystem does not offer high-resolution clock sources just | |
65 | yet, the hrtimer subsystem can be easily extended with high-resolution | |
66 | clock capabilities, and patches for that exist and are maturing quickly. | |
67 | The increasing demand for realtime and multimedia applications along | |
68 | with other potential users for precise timers gives another reason to | |
69 | separate the "timeout" and "precise timer" subsystems. | |
70 | ||
53cb4726 | 71 | Another potential benefit is that such a separation allows even more |
df78488d TG |
72 | special-purpose optimization of the existing timer wheel for the low |
73 | resolution and low precision use cases - once the precision-sensitive | |
74 | APIs are separated from the timer wheel and are migrated over to | |
75 | hrtimers. E.g. we could decrease the frequency of the timeout subsystem | |
76 | from 250 Hz to 100 HZ (or even smaller). | |
77 | ||
78 | hrtimer subsystem implementation details | |
79 | ---------------------------------------- | |
80 | ||
81 | the basic design considerations were: | |
82 | ||
83 | - simplicity | |
84 | ||
85 | - data structure not bound to jiffies or any other granularity. All the | |
86 | kernel logic works at 64-bit nanoseconds resolution - no compromises. | |
87 | ||
88 | - simplification of existing, timing related kernel code | |
89 | ||
90 | another basic requirement was the immediate enqueueing and ordering of | |
91 | timers at activation time. After looking at several possible solutions | |
92 | such as radix trees and hashes, we chose the red black tree as the basic | |
93 | data structure. Rbtrees are available as a library in the kernel and are | |
94 | used in various performance-critical areas of e.g. memory management and | |
95 | file systems. The rbtree is solely used for time sorted ordering, while | |
96 | a separate list is used to give the expiry code fast access to the | |
97 | queued timers, without having to walk the rbtree. | |
98 | ||
53cb4726 ML |
99 | (This separate list is also useful for later when we'll introduce |
100 | high-resolution clocks, where we need separate pending and expired | |
df78488d TG |
101 | queues while keeping the time-order intact.) |
102 | ||
103 | Time-ordered enqueueing is not purely for the purposes of | |
104 | high-resolution clocks though, it also simplifies the handling of | |
105 | absolute timers based on a low-resolution CLOCK_REALTIME. The existing | |
106 | implementation needed to keep an extra list of all armed absolute | |
107 | CLOCK_REALTIME timers along with complex locking. In case of | |
108 | settimeofday and NTP, all the timers (!) had to be dequeued, the | |
109 | time-changing code had to fix them up one by one, and all of them had to | |
110 | be enqueued again. The time-ordered enqueueing and the storage of the | |
111 | expiry time in absolute time units removes all this complex and poorly | |
112 | scaling code from the posix-timer implementation - the clock can simply | |
113 | be set without having to touch the rbtree. This also makes the handling | |
114 | of posix-timers simpler in general. | |
115 | ||
116 | The locking and per-CPU behavior of hrtimers was mostly taken from the | |
117 | existing timer wheel code, as it is mature and well suited. Sharing code | |
118 | was not really a win, due to the different data structures. Also, the | |
119 | hrtimer functions now have clearer behavior and clearer names - such as | |
120 | hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly | |
121 | equivalent to del_timer() and del_timer_sync()] - so there's no direct | |
122 | 1:1 mapping between them on the algorithmical level, and thus no real | |
123 | potential for code sharing either. | |
124 | ||
125 | Basic data types: every time value, absolute or relative, is in a | |
126 | special nanosecond-resolution type: ktime_t. The kernel-internal | |
127 | representation of ktime_t values and operations is implemented via | |
128 | macros and inline functions, and can be switched between a "hybrid | |
129 | union" type and a plain "scalar" 64bit nanoseconds representation (at | |
130 | compile time). The hybrid union type optimizes time conversions on 32bit | |
131 | CPUs. This build-time-selectable ktime_t storage format was implemented | |
132 | to avoid the performance impact of 64-bit multiplications and divisions | |
133 | on 32bit CPUs. Such operations are frequently necessary to convert | |
134 | between the storage formats provided by kernel and userspace interfaces | |
135 | and the internal time format. (See include/linux/ktime.h for further | |
136 | details.) | |
137 | ||
138 | hrtimers - rounding of timer values | |
139 | ----------------------------------- | |
140 | ||
141 | the hrtimer code will round timer events to lower-resolution clocks | |
142 | because it has to. Otherwise it will do no artificial rounding at all. | |
143 | ||
144 | one question is, what resolution value should be returned to the user by | |
145 | the clock_getres() interface. This will return whatever real resolution | |
146 | a given clock has - be it low-res, high-res, or artificially-low-res. | |
147 | ||
148 | hrtimers - testing and verification | |
149 | ---------------------------------- | |
150 | ||
151 | We used the high-resolution clock subsystem ontop of hrtimers to verify | |
152 | the hrtimer implementation details in praxis, and we also ran the posix | |
153 | timer tests in order to ensure specification compliance. We also ran | |
154 | tests on low-resolution clocks. | |
155 | ||
156 | The hrtimer patch converts the following kernel functionality to use | |
157 | hrtimers: | |
158 | ||
159 | - nanosleep | |
160 | - itimers | |
161 | - posix-timers | |
162 | ||
163 | The conversion of nanosleep and posix-timers enabled the unification of | |
164 | nanosleep and clock_nanosleep. | |
165 | ||
166 | The code was successfully compiled for the following platforms: | |
167 | ||
168 | i386, x86_64, ARM, PPC, PPC64, IA64 | |
169 | ||
170 | The code was run-tested on the following platforms: | |
171 | ||
172 | i386(UP/SMP), x86_64(UP/SMP), ARM, PPC | |
173 | ||
174 | hrtimers were also integrated into the -rt tree, along with a | |
175 | hrtimers-based high-resolution clock implementation, so the hrtimers | |
176 | code got a healthy amount of testing and use in practice. | |
177 | ||
178 | Thomas Gleixner, Ingo Molnar |