First a warning: do you really need Locking or CondVars?
Are you sure you need to use explicit locking and condition variables? In Chrome code, message passing is far more common (via TaskRunner and PostTask), and low-level primitives like locks and condition variables should be used only when necessary.
Some additional motivation:
- Condition variables are nearly impossible to implement correctly on Windows XP or earlier. Our implementation is correct, but _very_ slow. Whenever you use a CV you are disproportionately harming our performance on Windows.
- A lot of times people just want to wait on a boolean. In such cases, if message passing cannot work, please use WaitableEvent instead.
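For the second case above, a minimal sketch of waiting on a boolean with base::WaitableEvent (assuming the Signal()/Wait() interface declared in base/synchronization/waitable_event.h) might look like this:
#include "base/synchronization/waitable_event.h"

// One thread waits for work to become ready; another announces it.
base::WaitableEvent ready(base::WaitableEvent::ResetPolicy::MANUAL,
                          base::WaitableEvent::InitialState::NOT_SIGNALED);

void WaitForWork() {   // runs on the waiting thread
  ready.Wait();        // blocks until some other thread calls Signal()
  // ... work is now known to be ready ...
}

void AnnounceWork() {  // runs on the thread that makes the condition true
  ready.Signal();      // wakes every thread blocked in Wait()
}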
But for the times when you do need to use locks and condition
variables, this document will explain the best practices and pitfalls
in using them.
(Much of the below was originally written by Mike Burrows.)
(TODO: Figure out how to get anchor links to work)
Using Lock and ConditionVariable
Terminology and basics
The Lock class implements a mutual exclusion lock,
or mutex for short. A mutex is used to permit only one thread at a time
to have exclusive access to some resource, which is typically some variable or
data structure. Mutexes are so common that many words have been coined to
describe their operation.
Each Lock mu has two basic operations: mu.Acquire() and mu.Release(). Conceptually, it has just a single bit of abstract state: the boolean mu.held. When mu is created, mu.held is false and mu is said to be free or unlocked.
mu.Acquire() blocks the caller until some moment when mu is free, and then atomically changes mu.held from false to true; mu is then said to be held or locked by the caller. mu.Release() sets mu.held to false once more. Calling mu.Acquire() is often referred to as locking or acquiring mu, while calling mu.Release() is referred to as unlocking or releasing mu.
An action performed by a thread while holding mu is said to be performed under mu. Data structures manipulated under mu, and their invariants, are said to be protected by mu.
Clients of Lock must obey these rules:
- Each time a thread acquires mu it must later release mu.
- A thread may not attempt to release mu unless it holds mu.
Because mu.Acquire() acts atomically to change the state of mu.held, we are guaranteed that, if these rules are followed, only one thread may hold mu at any given time.
These rules are best followed by bracketing regions of code with matching calls to mu.Acquire() and mu.Release() in the same procedure. Such sections of code are called critical regions or critical sections, or occasionally monitors after Hoare monitors, from which mutexes were derived. (A Hoare monitor is an abstraction that automatically acquires a lock on entry and releases it on exit.) In Chrome C++ code, many use the idiom AutoLock l(mu), which acquires mu and releases it when l goes out of scope. (Less commonly, AutoUnlock l(mu) can also be used.)
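For illustration, here is a small sketch of the same critical section written first with explicit Acquire()/Release() calls and then with AutoLock; the two are equivalent, but the AutoLock form also releases the lock on early returns (counter and the function names are illustrative only):
Lock mu;          // protects counter
int counter = 0;  // guarded by mu

void IncrementExplicit() {
  mu.Acquire();
  counter++;       // work done under mu
  mu.Release();
}

void IncrementScoped() {
  AutoLock l(mu);  // acquires mu; releases it when l goes out of scope
  counter++;
}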
Mutexes perform two tasks in concurrent programming. Their primary role is to
protect the invariants of variables and data structures manipulated by multiple
threads (these invariants are often called monitor invariants, again
recalling the work of Hoare). The programmer is required to establish the
invariant before releasing the mutex; he can then assume the invariant holds whenever
he acquires the mutex, even if his updates temporarily invalidate the
invariant during the critical section. One cannot guarantee
the invariant is true in a thread that does not hold the mutex, because
the mutex holder may be changing the monitored state at that moment.
For example, suppose Lock mu protects the invariant that a + b == 0. This code is then legal:
mu.Acquire();
assert(a + b == 0); // invariant assumed to hold
a++; // invariant temporarily invalidated
b--; // invariant restored before mu is released
mu.Release();
while this code is erroneous:
mu.Acquire();
assert(a + b == 0); // invariant assumed to hold
a++; // invariant invalidated
mu.Release(); // BUG: mu released while invariant invalid
mu.Acquire();
b--; // attempt to restore the invariant, but the damage is already done
mu.Release();
The following does not invalidate the invariant,
but incorrectly assumes it is true when the lock is not held:
mu.Acquire();
assert(a + b == 0); // correct: invariant assumed to hold
mu.Release();
assert(a + b == 0); // BUG: can't assume invariant without lock
The invariant holds only when evaluated on state observed
in a single critical section:
mu.Acquire();
assert(a + b == 0); // correct: invariant assumed to hold
temp = a; // takes a temporary copy of "a"
mu.Release();
mu.Acquire();
assert(temp + b == 0); // BUG: can't assume invariant on state
// from two distinct critical sections
mu.Release();
A less obvious role of mutexes is to ensure a sequentially-consistent
view of such data structures on machines with
memory systems that are
not sequentially consistent. Mutexes also prevent compiler reorderings that could otherwise cause race conditions.
Other properties of Lock
- The call mu.Try() either returns false or acquires mu and returns true. It does not block. If mu is free, it is unlikely to return false. (A sketch of Try() and AssertAcquired() appears after this list.)
- The call mu.AssertAcquired() aborts the process in debug mode if mu is not held by the calling thread.
- Lock is able to synchronize its own deletion. For example, if an object *o contains a Lock mu and a correctly-maintained reference count c, this code is safe:
o->mu.Acquire();
bool del = (--(o->c) == 0);
o->mu.Release();
if (del) { delete o; }
Not all lock implementations have this property, so it should not be taken for granted when porting code. (To provide this property, we guarantee that mu.Release() does not touch mu after the atomic moment at which mu becomes free.)
- Lock is not re-entrant (also known as not recursive). See below.
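As promised above, a short sketch of Try() and AssertAcquired() (the stats example is hypothetical):
Lock mu;          // protects samples_
int samples_ = 0; // guarded by mu

// Called on a latency-sensitive path: record the sample only if mu is
// immediately available; otherwise drop it rather than block.
void MaybeRecordSample(int v) {
  if (mu.Try()) {      // acquires mu and returns true, or returns false without blocking
    samples_ += v;
    mu.Release();
  }
}

// Documents (and, in debug builds, checks) that the caller already holds mu.
void RecordSampleLocked(int v) {
  mu.AssertAcquired();
  samples_ += v;
}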
Condition variables
Condition variables are a means for blocking a thread until some condition has been satisfied. Viewed in isolation, a condition variable allows threads to block and to be
woken by other threads. However, condition variables are designed to be used
in a specific way; a condition variable interacts with a mutex to make it easy
to wait for an
arbitrary condition on state protected by the mutex.
Chrome's C++ condition variables have type ConditionVariable.
Suppose that a thread is to wait for some boolean expression cond_expr
to become true, where the state associated with cond_expr is
protected by Lock mu. The programmer would write:
// Waiter
mu.Acquire();
while (!cond_expr) {
cv.Wait(); // mu was passed to cv's constructor
}
// cond_expr now holds
...
mu.Release();
The Wait() call atomically unlocks mu (which the thread must hold) and blocks on the condition variable cv. When another thread signals the condition variable, the thread reacquires mu and goes around the mandatory while-loop to recheck cond_expr.
Another thread that makes cond_expr true might execute:
// Waker
mu.Acquire();
Make_cond_expr_True();
// cond_expr now holds
cv.Signal();
mu.Release();
The call to Signal() wakes at least one of the threads waiting on cv. Many threads may be blocked on a condition variable at any given time; if it makes sense to wake more than one such thread, Broadcast() can be used. (However, this may lead to contention and poor performance if all waiting threads use the same lock; a possibly better approach to getting a lot of threads out of Wait() is to have each thread, upon exiting Wait(), call Signal() to free up another thread still in Wait().)
A single condition variable can be used by threads waiting for different conditions. However, in this case, Broadcast() must be used when any of the conditions becomes true, because the ConditionVariable implementation cannot otherwise guarantee to wake the correct thread. It can be more efficient to use one condition variable for each different condition; any number of condition variables can be used with a single mutex.
Both Signal() and Broadcast() are efficient if there are no threads to wake. (TODO: verify this) Clients should call Signal() or Broadcast() inside the critical section that makes the condition true.
The call TimedWait() allows a thread to wait until either a condition is true or some time has elapsed. Like Wait(), TimedWait() always reacquires the mutex before returning. Example use:
static const int64 kMSToWait = 1000;  // we'll wait at most 1000ms
TimeDelta left_to_wait = TimeDelta::FromMilliseconds(kMSToWait);  // ms left to wait at any given time
Time deadline = Time::Now() + left_to_wait;
mu.Acquire();
while (!cond_expr && left_to_wait.InMilliseconds() > 0) {
  cv.TimedWait(left_to_wait);
  left_to_wait = deadline - Time::Now();
}
if (cond_expr) {
  // cond_expr true
} else {
  // cond_expr false; 1000ms timeout expired
}
mu.Release();
Recommended usage
Variables accessed by more than one thread and written by at least one thread should be consistently protected by a Lock.
An exception to this rule is that fields may be initialized in constructors
without a mutex, since no other thread should be able to reference the memory
at that time.
Recommended commenting convention
There are two main dangers when using threads and mutexes: deadlocks and data races.
These can be avoided using some simple rules and adding
small comments to variable and procedure declarations.
You are strongly advised to use such comments; they may seem
tedious to write, but they help tremendously in avoiding errors.
The particular commenting conventions shown below are derived from the work of Nelson and Manasse.
- Critical sections should almost always start and end in the same routine. That is, if a routine acquires a lock, it should release it before it returns, and it should release no locks that it does not itself acquire. This is normally achieved by writing Acquire() and Release() calls in pairs, or by using AutoLock l(mu), which automatically releases mu when l goes out of scope.
- Every shared variable/field should have a comment indicating which mutex protects it:
int accesses_; // count of accesses (guarded by mu_)
or a comment explaining why no mutex is needed:
int table_size_; // no. of elements in table (readonly after init)
- Every mutex should have a comment indicating which variables it protects, as well as any non-obvious invariants:
Lock mu_; // protects accesses_, list_, count_
// invariant: count_ == number of elements in linked-list list_
Think of the matching comments on variables and mutexes as analogous to matching types on procedure arguments and parameters; the redundancy can be very helpful to later maintainers of the code.
- Whenever a thread can hold two mutexes concurrently, either one (or both) of the mutexes should be commented with acquired before or acquired after to indicate which mutex must be acquired first:
Lock mu0_; // protects foo_ (acquired before mu1_)
If the mutex acquisition order is not consistent, deadlock may result.
- Each routine should have a comment indicating which mutexes must and must not be held on entry.
This allows implementors to edit routines without examining their call sites, and allows
clients to use routines without reading their bodies.
- Never hold locks when invoking a callback, as the callback may call into
the module once more, leading to deadlock. (Violations of this rule
should be extremely rare and conspicuously commented in the module's
interface.) Comments should indicate what threads can or cannot be used
for callbacks:
// Call cb.Run() in "ms" milliseconds.
// cb.Run() is called in a private thread; it will not be called
// from the thread calling RunAfter(), even if ms==0.
void RunAfter(Closure cb, int ms);
- In rare cases, it may be useful for a routine to acquire a lock and return without releasing it, or to release a lock (perhaps temporarily, using ConditionVariable::Wait) that it did not acquire. Such routines may surprise clients, and should
be commented clearly in interfaces. Note that a routine that acquires
a lock and returns without releasing it is practically a locking
primitive and should be commented as such.
- Every condition variable should have a comment indicating when it is signalled:
ConditionVariable non_empty_; // signalled when the queue becomes non-empty
- In some cases, exclusive access to data is ensured by referencing it only
from one thread. (See the section on message passing.)
The thread can be thought of as playing the part of a mutex;
you should name the thread and use the name in comments as if it were a lock.
int queue_length_; // length of receive queue (under net_thread)
...
// Process one packet from the queue.
// L >= net_thread
void ProcessPacket() {
...
}
- In very rare cases, a variable may be protected by more than one mutex.
This means that the variable may be read while holding any of the mutexes,
but may be written only when all the mutexes are held.
You should document it clearly in the comments.
int bytes_left_; // bytes remaining in queue (under mu_ and net_thread)
If these conventions are followed it is straightforward to tell what locks are held
at any point in a routine by reading the routine and its comments. By reading the comments
on shared variables and mutexes, it is then possible to tell that all variable accesses
are correctly protected by a mutex, and that mutexes are acquired in the correct order.
Without such comments, working with mutexes is significantly harder. We recommend
their use.
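Putting these conventions together, a hypothetical class might be commented as follows (the names and the invariant are illustrative only; includes are omitted as elsewhere in this document):
class Registry {
 public:
  // Adds a name to the registry.
  // mu_ must not be held on entry; it is acquired internally.
  void Add(const std::string& name) {
    AutoLock l(mu_);
    names_.push_back(name);
    count_++;
  }

 private:
  Lock mu_;                         // protects names_, count_
                                    // invariant: count_ == names_.size()
  std::vector<std::string> names_;  // registered names (guarded by mu_)
  int count_ = 0;                   // number of names (guarded by mu_)
  const size_t max_names_ = 100;    // capacity limit (readonly after init)
};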
Performance hints
Critical sections should not be too long
Normally, you should hold mutexes for short periods (nanoseconds to microseconds) at a time,
and the mutexes should be free most of the time, often approaching 99%.
To achieve this, it's best to avoid doing slow operations such as I/O inside
critical sections---assuming it is not the purpose of the mutex to serialize the I/O, of course.
Another, more complex technique is to make the locking more fine-grained
by employing more locks, each protecting a subset of the data.
Other transformations may help, such as breaking a critical section in two, or
arranging to perform long-running operations on state that is local to a
thread.
Lock acquisitions are cheap but not free
A lock acquisition is generally more expensive than
a cached variable access, but less expensive than a cache miss.
If a mutex is acquired and released too often (say, more than a hundred
thousand times per second) the overhead of these operations themselves
may start to be significant in CPU profiles.
Frequent acquisition can be avoided by combining critical sections,
or by delaying operations on shared state by buffering them
in memory local to a single thread. For example, tcmalloc
uses a per-thread cache of memory to avoid
locking overhead on every allocation.
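As a hedged sketch of that buffering idea (not Chrome's actual accounting code), each thread can accumulate into a thread-local tally and take the shared lock only occasionally:
Lock mu;                  // protects total_bytes
int64_t total_bytes = 0;  // guarded by mu

// Each thread counts locally and flushes under mu only once per
// kFlushThreshold bytes, so most calls acquire no lock at all.
constexpr int64_t kFlushThreshold = 1 << 20;
thread_local int64_t local_bytes = 0;

void AddBytes(int64_t n) {
  local_bytes += n;
  if (local_bytes >= kFlushThreshold) {
    AutoLock l(mu);
    total_bytes += local_bytes;
    local_bytes = 0;
  }
}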
Pitfalls
Deadlocks
A deadlock (sometimes called a
deadly embrace) occurs when an
activity attempts to acquire a limited
resource that has been
exhausted and cannot be replenished unless that activity makes progress.
When considering deadlocks involving only mutexes, each activity is typically
represented by a thread, and the resources are mutexes that are exhausted when
held, and replenished when released. However, other possibilities exist: In a
message passing environment,
an activity may be represented by a message, and a resource may be represented
by a bounded-length message queue or a bounded-size thread pool. When both
mutexes and message passing are used, deadlocks may involve combinations of
these activities and resources.
The simplest mutex-related deadlock is the self-deadlock:
mu.Acquire();
mu.Acquire(); // BUG: deadlock: thread already holds mu
Deadlocks involving two resources, such as a mutex and a bounded-sized thread pool,
are easily generated too, but deadlocks involving three or more resources
are less common. A two-mutex deadlock results when thread T0 attempts to
acquire M1 while holding M0 at the same time that thread T1
attempts to acquire M0 while holding M1; each thread will wait indefinitely
for the other.
Debugging deadlocks
Fortunately, deadlocks are among the easiest bugs to debug and avoid.
Debugging is typically easy because the address space stops exactly where
the bug occurs. A stack trace of the threads is usually all that is
required to see what the threads are waiting for and what resources
they hold. (Deadlocks involving messages in queues can be harder to spot.)
Avoiding deadlocks
Deadlocks can be avoided by disallowing cycles in the resources'
exhaust-before graph; this can be done by imposing a partial order
on the graph. If an activity can exhaust resource R0 and then attempt to use a resource R1 that may be exhausted, then we say that R0 precedes R1 (and R1 follows R0) in the exhaust-before graph. To guarantee no deadlocks, it is sufficient to guarantee that if R0 precedes R1, then R1 never precedes R0. That is, for all pairs of resources R0 and R1, either R0 and R1 are unordered (neither precedes the other), or their ordering is unambiguous (one precedes the other, but not vice versa).
Considering only mutexes, we can avoid deadlocks by ensuring
that the acquired-before graph is a partial order and is therefore free
of cycles. In practice, we simply pick an order for any two mutexes
that can be held simultaneously by the same thread, and comment the code
with this choice.
As described above, if a thread does mu0.Acquire(); mu1.Acquire(); then we should comment the declarations of mu0 and mu1 with either acquired before or acquired after (or both).
Because we wish our code to be modular, our comments
should also indicate what locks a caller must or must not hold on entry to a routine.
Combined, these comments allow the programmer to know whether he is about to violate the locking order by
acquiring a mutex or calling a routine.
Experience shows that if this convention is followed, deadlocks are usually
both rare and easy to correct.
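For example, a sketch of two mutexes with a documented acquisition order (the variables are illustrative):
Lock mu0_;     // protects foo_ (acquired before mu1_)
Lock mu1_;     // protects bar_ (acquired after mu0_)
int foo_ = 0;  // guarded by mu0_
int bar_ = 0;  // guarded by mu1_

// Neither mu0_ nor mu1_ may be held on entry.
void UpdateBoth() {
  mu0_.Acquire();  // always mu0_ first, matching the comments above
  mu1_.Acquire();
  foo_++;
  bar_--;
  mu1_.Release();
  mu0_.Release();
}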
A particularly important rule of thumb for deadlock avoidance is
never hold a lock while invoking a callback.
More generally, try to avoid holding locks for long periods, and across calls
to other levels of abstraction that you do not fully understand. (That is,
you might hold a lock across an access to a hash table, but you should not
hold a lock across a call to a complex subsystem.)
Races
Races occur in three main ways:
- A shared variable is accessed without being protected consistently by a mutex.
The reasons for the problems are discussed at length below.
This error can be avoided with the conventions already described;
simply ensure that each shared variable is accessed only when its
protecting mutex is known to be held. Such races can be detected
automatically by ThreadSanitizer
as described below.
- A critical section modifies protected state but does not
preserve the monitor invariant. Such bugs
are rare if invariants are commented correctly.
- A critical section reads protected state, which is then encoded in a temporary variable or the program counter. The lock is then released and reacquired, and the state from the previous critical section is used as though it were still valid:
string name_; // guarded by mu_
size_t NameLen() { AutoLock l(this->mu_); return this->name_.size(); }
// Requires 0 <= i < this->name_.size()
int CharFromName(size_t i) { AutoLock l(this->mu_); return this->name_[i]; }
...
size_t len = this->NameLen();
int x = 0;
if (len > 0) {
// BUG: temporary len encodes protected state from a previous
// critical section that is used inside another.
// The length of name_ may have changed since len was assigned.
x = this->CharFromName(len - 1);
}
...
This is the most insidious form of race, and the best-known way to avoid such races is vigilance (a repaired version of the example above is sketched below). Fortunately, they are quite rare.
There are algorithms that can detect such races using data flow analysis,
but as yet none has been applied to C++.
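For instance, the NameLen()/CharFromName() example above can be repaired by doing the length check and the indexing in a single critical section; a sketch, using a hypothetical LastCharFromName() helper:
// Returns the last character of name_, or -1 if name_ is empty.
// The check and the read happen in one critical section, so the
// length cannot change between them.
int LastCharFromName() {
  AutoLock l(this->mu_);
  if (this->name_.empty()) {
    return -1;
  }
  return this->name_[this->name_.size() - 1];
}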
Debugging
Lock and ConditionVariable have various features to aid debugging.
Assertions
Race detection
Race detection requires an external tool. One such tool is ThreadSanitizer, a dynamic race detector based on the Valgrind binary translation framework. See this page for more details on how to use it with Chrome.
Examples
The following examples show simple implementations of reader-writer locks and producer-consumer queues using condition variables.
Reader-writer lock
The example below could be improved in various ways at the cost of clarity. In particular, it allows readers to starve writers.
class RWLock {
 public:
  RWLock() : lockers_(0), lock_free_(&mu_) {}
  ~RWLock() {}

  void WriteAcquire() {   // acquire a write lock
    this->mu_.Acquire();
    while (this->lockers_ != 0) {
      this->lock_free_.Wait();
    }
    this->lockers_ = -1;
    this->mu_.Release();
  }

  void ReadAcquire() {    // acquire a read lock
    this->mu_.Acquire();
    while (this->lockers_ < 0) {
      this->lock_free_.Wait();
    }
    this->lockers_++;
    this->mu_.Release();
  }

  void Release() {        // release lock (either mode)
    this->mu_.Acquire();
    this->lockers_ = (this->lockers_ == -1 ? 0 : this->lockers_ - 1);
    if (this->lockers_ == 0) {  // if lock became free, wake all waiters
      this->lock_free_.Broadcast();
    }
    this->mu_.Release();
  }

 private:
  Lock mu_;                      // protects lockers_
  int lockers_;                  // 0 => free, -1 => writer, +ve => reader count
  ConditionVariable lock_free_;  // signalled when lockers_ becomes 0
};
Producer-consumer queue
class PCQ {  // a bounded, blocking FIFO producer-consumer queue
 public:
  PCQ(int n) : max_count_(n), non_full_(&mu_), non_empty_(&mu_) {}
  ~PCQ() { CHECK_EQ(this->queue_.size(), 0); }  // error if queue is not empty

  // waits until queue is not full, then adds value to its end
  void Add(void *value) {
    this->mu_.Acquire();
    while (this->queue_.size() == this->max_count_) {
      this->non_full_.Wait();
    }
    this->non_empty_.Signal();  // no need to Broadcast.
                                // (only one remover can consume this item)
    // Could use:
    // if (this->queue_.size() == 0) { this->non_empty_.Broadcast(); }
    this->queue_.push_back(value);
    this->mu_.Release();
  }

  // waits until this queue is non-empty, then removes and returns first value
  void *Remove() {
    this->mu_.Acquire();
    while (this->queue_.size() == 0) {
      this->non_empty_.Wait();
    }
    this->non_full_.Signal();  // no need to Broadcast.
                               // (only one adder can consume this gap)
    // Could use:
    // if (this->queue_.size() == this->max_count_) { this->non_full_.Broadcast(); }
    void *value = this->queue_.front();
    this->queue_.pop_front();
    this->mu_.Release();
    return value;
  }

 private:
  Lock mu_;                      // protects queue_
                                 // protects invariant 0 <= queue_.size() <= max_count_
  deque<void *> queue_;          // list of queued items
  const int max_count_;          // max number of items in queue_ (readonly after init)
  ConditionVariable non_full_;   // signalled when queue becomes non-full
  ConditionVariable non_empty_;  // signalled when queue becomes non-empty
};
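A hypothetical usage sketch for PCQ, with one producer thread and one consumer thread (thread creation elided):
PCQ queue(16);  // at most 16 items buffered at once

void Producer() {   // runs on the producer thread
  for (int i = 0; i < 100; i++) {
    queue.Add(new int(i));   // blocks while the queue is full
  }
}

void Consumer() {   // runs on the consumer thread
  for (int i = 0; i < 100; i++) {
    int *value = static_cast<int *>(queue.Remove());  // blocks while empty
    // ... use *value ...
    delete value;
  }
}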
Why are mutexes the way they are?
Why use a mutex when accessing shared data?
It is perilous to access data that another thread may be modifying concurrently.
Consider accesses to a C++ string. A well-formed string may have invariants, such as its length field indicating the true length of the string it represents. Such invariants may be temporarily broken by the string implementation when an update occurs. Clearly, one thread may not be allowed to read a string that is being written by another, as it may not observe a length consistent with the stored bytes. If the string is accessed only under a mutex mu, the string's invariants become mu's monitor invariants, and each thread will see a well-formed string.
It is tempting to assume that mutexes are unnecessary when there is no obvious monitor invariant to protect. Consider a variable or field with type double. One might expect to be able to read and write this variable from multiple threads without the protection of a mutex, but this is not safe:
- Many machines, including the x86, do not guarantee to access a double atomically. (A stack-allocated double need not be naturally aligned by the compiler, potentially leading to two memory operations for a single access.) Thus, there is an invariant we need to protect: that the double is well-formed.
- On some machines, seemingly obvious data-dependency properties do not hold without the cross-thread synchronization provided by a mutex; a thread might read a well-formed double but get a value from an apparently earlier time. This comment applies to all types, including integers, pointers, and even "atomic" types provided by the language or runtime.
- A variable's concrete type may change as a program is maintained, and this change may be hidden by a typedef.
In short, data accessed by multiple threads should be protected
with a mutex. To do otherwise is to court disaster.
Despite this advice, some programmers enjoy the intellectual challenge of using
lower-level primitives ("atomic" types, compare-and-swap, memory barrier) instead of mutexes.
There are many problems with such code:
- it usually does not work.
- it is often no faster overall than using mutexes.
- it may rely on assumptions about the hardware or compiler, or both.
However, the most important reason not to use such code is that it is complicated.
Even if the author understands it, the next maintainer of the code may not.
Worse, he may
think he understands it.
The best way to avoid locking is to avoid shared mutable state.
When shared mutable state is needed, use a mutex.
If you experience lock contention, consider using
more mutexes, each
protecting less data (that is, make the locking finer-grained).
If you feel you must access a shared mutable variable
without a mutex, and have data that shows it is worth the maintenance expense of
doing so, ask an expert how to do it.
Why can the holder of a Lock not reacquire it?
Some mutex implementations, notably Java's and the Windows CRITICAL_SECTION, are called reentrant or recursive. They allow the holder of a mutex to reacquire the mutex without deadlocking by maintaining an internal acquisition count and holder identity instead of the held boolean. The mutex is free when the count is zero. The acquisition count keeps track of the number of acquisitions performed by the holder; the holder is required to release the mutex the same number of times it acquired it to make the mutex free.
This bookkeeping allows a method of an object to call other methods
of the same object while holding the lock, even if those other methods acquire the lock.
We do not allow this apparently convenient usage in Lock, not because of the small additional cost of maintaining the counter, but because of two problems.
Recall that a mutex's primary purpose is to allow the programmer to maintain
monitor invariants, and that the programmer may assume a monitor invariant
just after acquiring the appropriate mutex. Consider a Java method M0 that acquires a mutex mu protecting invariant Inv. The author of M0 is entitled to assume Inv at the moment he acquires mu. Now consider another method M1 that acquires mu, invalidates Inv during an update, calls M0, restores Inv, and releases mu. M0 assumes Inv, but Inv is untrue when M0 is called, so M0 fails.
Remember that both M0 and M1 may have multiple implementations written by multiple authors
over years, and perhaps multiple implementations in the same program, due to inheritance.
The source code of M0 may not be available to the author of M1, or
vice versa. Without remarkably good discipline and documentation standards, the programmers
may not understand why M0 is not functioning correctly.
If a programmer attempted to do the same thing with a non-reentrant mutex such as Lock, his code would instantly deadlock, and a stack trace would show that a thread is attempting to reacquire a lock it already holds. Not only is the error discovered immediately, but the fix is usually trivial: write a small method M0 that acquires mu and calls a private method M0', which does the real work, assuming Inv but without acquiring mu. The specifications of M0 and M0' differ only in their locking
behaviour, so the programmer almost always documents this difference, often in
the names of M0 and M0'. The presence of the additional method and the
corresponding name or comment provides an additional
reminder to the author of M1. He realizes that
by calling M0' rather than M0, he has the burden of establishing
Inv before the call---it is not guaranteed automatically by the monitor invariant.
This solution is not a panacea, but disallowing reentrancy at least makes the error
apparent, rather than hiding it.
The second problem with reentrancy is associated with condition variables. In
the example above, imagine that M0 waits on a condition variable and thus
effectively contains more than one critical section.
Normally M0 will work, but if called from M1 with mu held, it is unclear what happens, and neither outcome is satisfactory. If Wait() decrements mu's acquisition count by one, mu does not become free, the condition never becomes true, and the thread deadlocks. If instead the acquisition count is set to zero by Wait(), mu becomes free during a critical section initiated by M1. This is likely to
cause M1 to malfunction silently. In the non-reentrant case, M0 must be split into
M0 and M0' as before. Since M0' waits on a condition variable, it now has an
interesting specification: it temporarily releases a mutex that it did not
acquire. This is unusual, and usually very dangerous,
so one might expect a comment to that effect. This comment then tells the author of M1
that he must be especially careful if he calls M0'.
Why not use monitored modules? (or automatically locked objects, locking pointers, lock-free data structures, ...)
It seems attractive to automate the acquisition and release of mutexes by
declaring somehow that a mutex will be acquired on entry to a module and
released on exit, as in a Hoare monitor. This can be used for trivial cases, but
even quite common examples require more complex locking.
Consider a table *t mapping strings to integer counts. The table might have methods insert(), lookup(), and remove().
If the table provides its own synchronization, perhaps inserted automatically in each
method by some mechanism, we eliminate data races within the table itself,
but this does not help the client. Consider this code, which increments the count for
"foo" in the table *t
:
int *v = t->lookup("foo");  // safe because *t is a monitor
if (v != 0) {
  (*v)++;                   // BUG: data race: unlocked increment
} else {
  t->insert("foo", 1);      // safe because *t is a monitor
}
If the client does not use his own mutex, counts may be missed.
If he does, the synchronization inside *t is redundant.
Thus, monitored modules are rarely helpful.
The implementors of SGI STL made the
same observation.
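For completeness, a hedged sketch of the client-side fix: the client's own Lock makes the whole read-modify-write one critical section (Table is a stand-in type for *t):
Lock mu;  // protects the lookup/increment/insert sequence on *t

void IncrementFoo(Table *t) {
  AutoLock l(mu);              // one critical section for the whole update
  int *v = t->lookup("foo");
  if (v != 0) {
    (*v)++;                    // safe: no other thread can interleave here
  } else {
    t->insert("foo", 1);
  }
}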
A further problem is that the designers of automatic locking mechanisms often
desire to reduce the amount of typing needed to implement a monitor, rather
than to improve the readability and maintainability of the code.
All too often, these two desires are in conflict; some code is
more readable
if one can tell when a lock is released.
Alternatives to mutexes
There are a number of ways to handle concurrency, and ways to avoid it
altogether. Of the various possible models, only two permit high-performance
implementations that can use multiple CPUs and sharing of resources, and still
allow large programs to be built from smaller components while maintaining
abstraction boundaries. These models are "threads + mutexes +
condition-variables", and "threads + message-passing". These
two can be used together, and often are.
Message passing
One can associate data with threads, so that each thread
owns some variables
and data structures; a variable is accessed only by its owning thread.
Other threads wishing to access the data then communicate with the owning thread via message passing, such as Chrome's TaskRunner.
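A hedged sketch of this style with Chrome's task-posting primitives (PostTask, base::BindOnce); the class and its members are illustrative, and lifetime management is elided:
// queue_length_ is touched only on the sequence behind task_runner_,
// so no Lock is needed.
class NetStats {
 public:
  explicit NetStats(scoped_refptr<base::SequencedTaskRunner> task_runner)
      : task_runner_(std::move(task_runner)) {}

  // May be called from any thread; the update runs on the owning sequence.
  void IncrementQueueLength() {
    task_runner_->PostTask(
        FROM_HERE, base::BindOnce(&NetStats::IncrementOnOwningSequence,
                                  base::Unretained(this)));
  }

 private:
  void IncrementOnOwningSequence() { queue_length_++; }

  scoped_refptr<base::SequencedTaskRunner> task_runner_;
  int queue_length_ = 0;  // only accessed on task_runner_'s sequence
};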
This style is a dual of the mutex-based style (see Lauer and Needham's
oft-cited paper on the subject): A message-send corresponds to
acquiring a mutex; running in a critical section corresponds to executing code
within the owning thread; and receiving a reply corresponds to releasing the
mutex. Thus, the most obvious difference between the approaches is that
in message-passing all the code that manipulates a particular data item is
brought together into one thread,
while with mutexes the data accesses can be interleaved with other code.
Message passing and mutexes can be intermixed; often one is preferred
either because the author is comfortable with the style, or because
one leads to a clearer module than the other.
The message-passing model tends to work well when there is a
natural resource that already serializes accesses (such as an I/O device),
a linear state machine best expressed as a single routine,
or when critical sections are long.
Mutexes work well when critical sections are short and may be
invoked in many places, or when reader-writer locks can be used effectively.
In Chrome code, message passing is much more common (via TaskRunner and PostTask), and low-level primitives like locks and condition variables are used only when necessary.
Both models allow high-throughput implementations,
and both can suffer from both races and deadlocks. Deadlocks can often be
eliminated in the message-passing model by using unbounded queues and/or threadpools.
Atomic types and atomic operations
Many runtimes and languages (including C++11)
provide atomic operations, such as compare-and-swap, and "atomic" types
that can be read and written atomically. Atomic operations and types
are much harder to use than one might first think,
and they should not be used in normal code.
Unfortunately, programmers are attracted to them for various reasons:
- Programmers enjoy the intellectual puzzle of using these operations.
This leads to clever, but ill-advised, and often broken code.
Algorithms involving atomic operations are extremely subtle.
For example, a general-purpose, efficient, lock-free,
singly-linked list algorithm took
significant research to discover, and requires care to implement. Almost
all programmers make mistakes when they attempt direct use of atomic
operations. Even when they don't make mistakes, the resulting code is
hard for others to maintain. Both CPUs and compilers can rearrange reads and writes in ways that lead to subtle race conditions. The simple-sounding pattern of double-checked locking is actually extremely subtle and is usually implemented incorrectly.
- Programmers assume that locking is expensive, and that using atomic
operations will be more efficient. But in reality, acquiring and
releasing a lock is cheaper than a cache miss; attention to cache behaviour
is usually a more fruitful way to improve performance. Furthermore,
lock-free data structures are often more expensive than using locks.
This is because a lock allows arbitrary changes to be made to a complex
data structure; if the same changes must be made without a lock, the
result is likely to take more atomic read-modify-write instructions, not
fewer.
- People wish to avoid lock contention when concurrency is high.
This is best solved by partitioning locked data structures to avoid lock
contention. For example, it is easier, more efficient, and
more useful to build a
high-concurrency hash table from many normal hash tables, each with its own
lock, than to build one lock-free hash table using atomic operations.
Atomic operations should be used in only a handful of low-level data
structures, written by a few experts, and then reviewed and tested thoroughly.
Unfortunately, many attempt to write lock-free code, and almost always this is
a mistake. Please do not fall into this trap: do
not use atomic
operations or types---if you do, you will make mistakes, and you will cost the
company time and money.
A single thread
A process that uses only a single thread requires no mutexes, and
this is often the best approach for simple programs that do not
require high performance or that are inherently sequential.
However, it is usually not the best choice for large programs,
or when high performance is required.
- A single-threaded application can use
only one CPU, which typically makes it far slower than other options, even when
the overheads of locking are taken into account.
If the application is simple
enough, one may be able to run multiple copies of the same program
on each machine, but this introduces two inefficiencies:
cross-address-space context switches are more expensive than thread context
switches because threads share TLB entries while address spaces do not; and
the address space separation precludes sharing some resources (caches, ports, etc.).
- Some programs, such as network servers, exhibit natural concurrency:
they must deal with many concurrent client requests, so some mechanism is needed to
allow this.
The fallacy of thread context-switch cost
Some try to argue that it is significantly cheaper to multiplex a single thread
than to use multiple threads because a single thread requires no thread context switches.
Such an argument stems from confusion about what constitutes a "context
switch" and what contributes to its cost. A context switch is simply the act of multiplexing
the processor between multiple activities; its dominant costs are similar
regardless of whether this is done in kernel-mode or in user-mode:
- When a program switches to a new activity, it incurs cache and TLB misses by touching the
data and instructions associated with a new activity. This cost is the most important,
and is essentially the same
regardless of whether the new activity
takes place in a different thread or in the same thread. The cost occurs
not only when a multithreaded program performs a thread context switch, but also when an event-driven program processes a new event, or when a co-operative multithreaded program switches context in
user-space. Multithreaded programs rarely context switch due to time-slicing
because time-slices are large.
- The cost of user-kernel mode switches is sometimes counted as part of
the context-switch cost between threads. However,
multithreaded programs usually context switch when
they have already entered the kernel for other reasons---typically,
via some system call or to service an interrupt.
A single-threaded program incurs these same mode switches,
and thus they are common to all models. One might expect
mutex and condition variable calls to add to the number of system calls,
but this effect is modest because
uncontended mutexes induce no system calls, while contended mutexes
and condition-variable operations should be relatively rare.
- Context switches between address spaces are more expensive because
TLB entries must be replaced.
To summarize, if a single address space is used, the costs of switching between activities
are nearly independent of the number of threads used within that address space;
the technique that leads to slowest execution is to run multiple copies of a
single-threaded program.
The event-driven style
To handle concurrent activities
in a single thread, some programmers adopt a style variously known as
event-driven, state-machine, or continuation-passing: one can decompose the actions for a given request into a graph of handler routines that never block for I/O internally, but rather specify which handlers should be invoked when pending I/O requests complete. This approach can be made to work and may even be
straightforward in simple programs, but it has bad effects on
readability, modularity, and abstraction, as well as using only one CPU.
To see the problems with the event-driven style,
consider the code
...
for (int i = 0; i != 10; i++) { foo(i); }
...
Now imagine that the third-party library routine foo() is modified at some future time to improve its functionality or average performance. Imagine that a side-effect of this improvement is that occasionally foo() must perform
some blocking I/O that should not be performed within a handler. Neither the
author of the for-loop nor the author of the new implementation of foo() has done anything unusual, and yet the program may show
poor throughput or even deadlock in an event-driven environment.
Even subtler changes can have undesirable effects.
Imagine that foo() includes a call to fprintf(); if one day the output stream is redirected to a device with high throughput but high latency, the program's throughput will drop precipitously, because foo()'s latency cannot be hidden in the event-driven model without rewriting both fprintf() and foo().
We can fix the performance problem if we change foo()'s signature to include a continuation to be called when foo() completes. However, this is not a local change: the loop must be restructured so that the loop variable is encoded in the continuation state. Worse, every use of foo() must be similarly modified, not only to handle the new signature, but to break the calling procedure into two---the prefix before the call to foo(), and the continuation. Thus, a local change in foo() has affected an arbitrary number of modules in a significant
way: the event-driven style does not preserve modularity.
Notice how this differs from the multithreaded case. Just as the event-driven style requires that foo() be non-blocking, the multithreaded style requires that foo() be thread-safe. However, this constraint is easier to live with. First, most libraries are thread-safe either naturally or by design, while few are designed for use in event-driven systems. (For example, fprintf() is thread-safe, but provides no callback mechanism.) Second, if foo() were not thread-safe, calls to it could be made safe either by a change to foo() or by wrapping foo() with a new routine that acquires a lock before calling the old foo().
Either change is local, and does not affect other aspects of the
interface, so modularity is preserved.
The problem with the event-driven style is worse for routines like printf(), whose signatures cannot be changed lightly. Even more worryingly, some I/O methods cannot be made efficient in a single-threaded event-driven system even with arbitrary restructuring of the entire program. For example, while one can with effort write a continuation-passing equivalent of printf(), memory-mapped I/O and programmed I/O have no such equivalent.
A further problem with the event-driven style is that the resulting code
becomes quite difficult to understand, maintain, and debug. This is primarily
because it is harder to tell from reading the code which routine caused the
current one to be called, and which routine will be called next. Standard
debugging and performance tools become less effective too, as they rarely have
support for tracing through continuations, and continuation mechanisms rarely
maintain call history. In contrast, non-event-driven programs maintain much of
their state and history conveniently on the thread stacks; debugging tools can
reveal a great deal simply by displaying stack traces.
Co-operative multithreading
An alternative style called co-operative multithreading allows the programmer to use multiple threads on a single CPU. The scheduler guarantees that no two threads can run concurrently, and guarantees never to pre-empt a thread that has not blocked. In theory, this allows mutexes to be omitted: the code a++; b--; will always execute atomically. Unfortunately, reliance on this property makes the code more fragile. For example, because any I/O may block, a++; printf("bar\n"); b--; probably does not execute atomically, and a++; foo(); b--; may or may not execute atomically, depending on the implementation of foo().
Thus, co-operative multithreading without explicit synchronization can lead to
code in which a bug may be introduced to one module by adding a debug
printf-statement in another module. If explicit synchronization is used,
the technique becomes equivalent to the straightforward use of threads.
For these reasons, unless a program is quite simple it usually pays in both
performance and maintainability to use multiple threads, and to protect shared
variables explicitly with mutexes or to communicate between threads with messages.
Why is cv.Wait() always in a while-loop?
Hoare's original condition variables did not require the while-loop,
but modern versions require it for somewhat subtle reasons:
- The presence of the while-loop allows one to tell by local inspection that the condition is true when the loop exits. Hoare's original precise semantics required inspection of all code that could potentially call Signal(), which made some errors rather harder to debug.
- The while-loop permits spurious wakeups, which gives the programmer the opportunity to trade performance for ease of programming. Suppose he arranges for threads always to signal the condition variable when they modify protected state, rather than only when they make a specific condition true. This allows modularity between waiters and wakers: the wakers don't need to know what conditions the waiters are waiting for, and each waiter can wait for a different condition without affecting the code of the wakers.
- The while-loop allows the condition variable implementation more freedom to schedule woken threads in any order. Consider a thread T0 that wakes thread T1, which was waiting for condition C. If the runtime semantics guarantee that T1 will enter the critical section next, T1 can assume C. But context switches have overhead, so it is usually more efficient merely to add T1 to the run queue while continuing to run T0 and perhaps other threads, which may then enter the critical section before T1. If any of those threads falsifies C, T1 must not assume C on entering the critical section; scheduling has made it appear that T1 has received a spurious wakeup. The while-loop ensures that T1 tests C, and continues only if C is really true. Thus, the while-loop effectively allows more freedom in choosing an efficient scheduling discipline.
- Timed waits become less error-prone. A timed wait may cause a thread to
wake before its condition C is true. Suppose the programmer forgets
to test for a timeout. If he is forced to use a while-loop, his thread
will go to sleep again and his program will probably deadlock, allowing
easy detection of the bug. Without the while-loop, the thread would
falsely assume C, and cause arbitrarily bad behaviour.
Why must the condition used with cv.Wait() be a function of state protected by the mutex?
Consider a thread W waiting for a condition cond_expr to become true:
mu.Acquire();
while (!cond_expr) {
cv.Wait(); // mu was passed to cv's constructor
}
// cond_expr now holds
...
mu.Release();
If cond_expr is not a function of state protected by mu, two bad things can happen:
- Suppose that thread W finds cond_expr false, and is about to call cv.Wait(). If the state associated with cond_expr is not protected by mu, another thread can make cond_expr true and call cv.Signal() before W calls cv.Wait(). This means that W may block indefinitely in Wait(), even though cond_expr is true (only a thread currently in Wait() is woken by a call to Signal()).
- Suppose that thread W finds cond_expr true, and is about to execute the code labelled "cond_expr now holds". If the state associated with cond_expr is not protected by mu, another thread can make cond_expr false before W runs the rest of the code, so W cannot assume cond_expr. This negates the purpose of the condition variable, which was to give W a guarantee about cond_expr.
Why put Signal() inside the critical section?
In most cases, it is correct to put Signal() after the critical section, but in Chrome code it is always both safe and efficient to put it inside the critical section. (TODO: verify this) Some texts recommend putting Signal() after the critical section because this makes it more likely that the mutex is free when a thread attempts to reacquire it on return from Wait(). If the Signal() were inside the critical section, a naive implementation might wake the thread, which could then block once more on the mutex held by the very thread that woke it. Chrome's condition variables (and most reasonable implementations) detect this case, and delay waking the waiting thread until the mutex is free. (TODO: verify this) Hence, there is no performance penalty for calling Signal() inside the critical section.
In rare cases, it is incorrect to call Signal() after the critical section, so we recommend always calling it inside the critical section. The following code can attempt to access the condition variable after it has been deleted, but would be safe if Signal() were called inside the critical section.
struct Foo { Lock mu; ConditionVariable cv; bool del; ... };
...
void Thread1(Foo *foo) {
  foo->mu.Acquire();
  while (!foo->del) {
    foo->cv.Wait();
  }
  foo->mu.Release();
  delete foo;
}
...
void Thread2(Foo *foo) {
  foo->mu.Acquire();
  foo->del = true;
  // Signal() should be called here
  foo->mu.Release();
  foo->cv.Signal();  // BUG: foo may have been deleted
}
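For clarity, the corrected version signals while still inside the critical section, as recommended above:
void Thread2(Foo *foo) {
  foo->mu.Acquire();
  foo->del = true;
  foo->cv.Signal();  // safe: foo cannot be deleted while this thread holds foo->mu
  foo->mu.Release();
}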
Why should implementors of mutexes pay attention to mutex performance under contention?
Clients should avoid lock contention, because contention necessarily implies
less parallelism; some threads are blocked while another executes the critical section.
Because clients must avoid contention, some implementors of mutexes pay less
attention to the performance of mutexes under contention.
However, contention is sometimes encountered despite clients' best efforts.
For example:
- A network server may become overloaded or see a changed pattern of use, causing a mutex to be used more than it normally would.
- A program may be run on an upgraded machine with more CPUs, causing contention on a mutex that was previously lightly loaded.
- Software developers encourage abstraction between parts of our programs, so the authors and clients of modules may have different expectations of how the module will be used. In particular, a client may cause contention on a mutex that he is unaware of.
While it is important for clients to fix contention to avoid loss of parallelism,
that loss of parallelism should be their main consideration.
The performance of the mutex itself should not degrade precipitously, even when
heavily contended. That is: an overloaded server should recover from overload
if the load drops once more; a machine with more CPUs should run no slower than
a machine with fewer CPUs; and calling a module more often should not reduce
the amount of work that gets done, even if it doesn't increase it.
Ideally, a critical section should provide approximately the same rate
of progress to many contending threads as it can to a single thread.
Mutex implementations can approximate this ideal by not providing
fairness, and by preventing multiple threads that have already blocked from
simultaneously competing for the lock.
Does every Lock operation imply a memory barrier?
Programmers should not use Lock operations as a means of inserting arbitrary memory barriers into their code (or of exerting control over when threads run). Lock operations imply only the ordering necessary for the protection of monitor invariants. In particular, the intent is:
If threads T0 and T1 execute the following code,
where some location is modified by one of T0_Inside() and T1_Inside() and read or written by the other:
// thread T0 // thread T1
T0_Before(); T1_Before();
mu.Acquire(); mu.Acquire();
T0_Inside(); T1_Inside();
mu.Release(); mu.Release();
T0_After(); T1_After();
then the memory operations in Tx_Before() and Tx_Inside() all precede the memory operations in Ty_Inside() and Ty_After(), either for x=0, y=1 or for x=1, y=0.
If the predicate does not hold, no memory ordering should be assumed from the Lock operations. This surprises programmers who expect the simplest possible
implementation, with no optimizations. Unfortunately, this expectation
is reinforced by some API standards.
We discourage such assumptions because they make
various transformations more difficult. Examples include:
- Removal of critical sections that are redundant with others.
- Removal of locks used by only one thread, or that protect no data.
- Coalescing and splitting of locks and critical sections.
- Conversion of exclusive locks to shared locks.
- Replacing locks with transactional memory.
Some lock implementations already apply some of these transformations, and are more efficient as a result. Therefore, Lock reserves the right to use such transformations when safe, even if that means removing memory barriers.