Friday, January 07, 2011

tricks for writing multi-threaded code

Today, I found and summarized an interesting paper about tricks for writing multi-threaded code as we move into the multicore era.

********************************
From the article
B. Cantrill and J. Bonwick. Real-world concurrency. Commun. ACM, 51(11):34-39, 2008.
********************************
This journal article pushes back against the claim that, with the rise of multicore, "the free lunch is over". Most concurrency can be achieved by standing on the shoulders of highly parallel subsystems such as operating systems and databases. Only system developers will continue to care about the details of writing performant multi-threaded code; for everyone else, the challenge is not how to implement parallel components but how to use them to deliver a scalable system. It is concurrency by architecture instead of by implementation. The authors note that, historically, it was the success of concurrent software that motivated putting multiple cores on a die. The free lunch is not exactly free anymore, but we can still enjoy it with care.

It is true that concurrency is for performance: we use concurrency to build an acceptably performing system. There are three major ways concurrency can improve performance: reduce latency, hide latency, and increase throughput.

To pass on experience and wisdom to programmers, the article distills a collection of tricks for writing multi-threaded code.
  • Know your cold paths from your hot paths. That is, know which parts of the program must run in parallel (hot paths) and which parts can execute sequentially without affecting performance (cold paths). In cold paths, keep the locking as coarse-grained as possible; in hot paths, the locking strategy should be simple and fine-grained. If you do not know which is which, assume it is a cold path and adopt the coarse-grained strategy.
  • Intuition is frequently wrong - be data-intensive. It is your data, not your intuition, that should decide performance questions about your system. Data can tell you a lot about performance. Therefore, you should (1) have a sufficiently concurrent machine to expose performance problems, (2) put load on the system that resembles the production workload, and (3) have an infrastructure to pinpoint performance problems. Modern dynamic instrumentation tools make it possible to gather enough data to exhibit performance issues.
  • Know when - and when not - to break up a lock. Breaking up a lock (into per-CPU locks, a hash table of locks, per-structure locks, and so on) is not the only way to reduce contention; contention can often be reduced more easily by shortening the lock's hold time through algorithmic improvements, or by avoiding the lock altogether.
  • Be wary of readers/writer locks. Replacing a mutex with a readers/writer lock pays off only when hold times are long, because acquiring the lock as a reader incurs the same bus traffic as acquiring a mutex. Moreover, readers/writer locks are more deadlock-prone.
  • Consider per-CPU locking. Per-CPU locking, in which the lock to acquire is chosen based on the current CPU identifier, is a good technique to consider. However, be cautious: this technique is more complex and more deadlock-prone.
  • Know when to broadcast - and when to signal. Broadcasting and signaling have different semantics: a signal wakes one sleeping thread, while a broadcast wakes all of them. Broadcast only to indicate a state change; when indicating resource availability, signaling suffices. Broadcasting where a signal would do adds needless scheduling and locking activity, because all threads wake up and contend only for most of them to go back to sleep.
  • Learn to debug post-mortem. Debugging from a core dump is a desirable skill for system developers because problems in highly parallel systems are not necessarily reproducible.
  • Design your systems to be composable. Locking systems can be made completely composable by (1) making locking entirely internal to the subsystem or (2) not locking at all (leaving it to consumers to ensure that they do not access the subsystem in parallel).
  • Don't use a semaphore where a mutex would suffice. Mutexes have a notion of ownership while semaphores do not, and the lack of ownership makes debugging much more onerous.
  • Consider memory retiring when implementing per-chain hash-table locks.
  • Be aware of false sharing. False sharing frequently arises in practice and is very difficult to detect dynamically: doing so requires not only a bus analyzer but also a way to translate physical addresses back to virtual ones. Padding array elements to a multiple of the coherence granularity is a popular technique for dealing with false sharing.
  • Consider using nonblocking synchronization routines to monitor contention. When an attempt to acquire a lock fails, we know there is contention. Therefore, instead of simply blocking on the lock, we can attempt to acquire it with techniques such as exponential-backoff acquisition.
  • When reacquiring locks, consider using generation counts to detect state change.
  • Use wait- and lock-free structures only if you absolutely must. Do this only when locks cannot be used for reasons of correctness (for example, when implementing a locking system itself). These structures trade enormous complexity and maintenance burden for little performance benefit.
  • Prepare for the thrill of victory - and the agony of defeat. It is often impossible to know whether the current scalability fix is the last one needed. Likewise, it is very discouraging when a new modification performs poorly and you have to return to the system to gather data all over again.
