Linux Parallel Processing Using SMP

Prof. Hank Dietz
Purdue University School of Electrical and Computer Engineering
hankd@ecn.purdue.edu
Still under construction... 8 July 1996

This document gives a brief overview of how to use SMP Linux systems for parallel processing. The most up-to-date information on SMP Linux is probably available via the SMP Linux project mailing list; send email to majordomo@vger.rutgers.edu with the text subscribe linux-smp to join the list.

Right now, the primary problem is that there really isn't much high-level support for shared memory parallel programs under SMP Linux. This document provides a simple overview of how to write shared memory parallel programs using standard Linux facilities. As you read this document, keep in mind that we haven't tested everything and we don't even have access to high-end SMP hardware.

That said, we have been able to start testing SMP parallel stuff on real hardware. In June 1996, we purchased a two-processor 100MHz Pentium system... both processors, Asus motherboard, 256K cache, 32M RAM, 1.6G disk, 6X CDROM, Stealth 64, and 15" Acer monitor all for $1,800! Getting SMP Linux up on this system was simply a matter of recompiling the kernel with the SMP=1 line in the makefile uncommented (although I find setting SMP to 1 a bit ironic ;-).

It is worth mentioning that an SMP Linux system can use most parallel processing software that was originally developed for a workstation cluster using socket communication. Sockets should work within an SMP Linux system, and even for multiple SMPs networked as a cluster; however, sockets also imply a lot of unnecessary overhead for an SMP. Worse still, much of that overhead is within the kernel, and SMP Linux currently allows at most one processor in the kernel at a time.

The remainder of this document discusses SMP hardware, reviews the basic Linux mechanisms for sharing memory across the processes of a parallel program, makes a few observations about atomicity, volatility, locks, and cache lines, and finally gives some pointers to other shared memory parallel processing resources.

SMP Hardware

Although SMP systems have been around for many years, until very recently, each such machine tended to implement basic functions differently enough so that operating system support was not portable. The thing that has changed this situation is Intel's Multiprocessor Specification, often referred to as simply MPS. A wide range of vendors are building MPS-compliant systems supporting up to four processors, but MPS theoretically allows many more processors. SMP Linux essentially supports most Intel MPS version 1.1 or 1.4 compliant machines with up to sixteen 486, Pentium, or Pentium Pro processors. It also supports Sun4m multiprocessor Sparc machines.

It is important to understand, however, that the performance of MPS-compliant systems can vary widely. As expected, one cause for performance differences is processor speed: faster clock speeds tend to yield faster systems, and a Pentium Pro processor is faster than a Pentium. However, MPS does not really specify how hardware implements shared memory, but only how that implementation must function from a software point of view; this means that performance is also a function of how the shared memory implementation interacts with the characteristics of SMP Linux and your particular programs.

The primary way in which systems that comply with MPS differ is in how they implement access to physically shared memory:

Does each processor have its own L2 (second-level) cache or do they share one?
Some MPS Pentium systems, and all MPS Pentium Pro systems, have independent L2 caches. (Each Pentium Pro "chip" really contains two chips: a processor and its L2 cache.) Separate L2 caches are generally viewed as maximizing compute performance, but things are not quite so obvious under Linux. The primary complication is that the current SMP Linux scheduler does not attempt to keep each process on the same processor, a concept known as processor affinity. Thus, having separate L2 caches may introduce significant overhead when a process is given a timeslice on a processor other than the one that was executing it last.
Many relatively inexpensive systems are organized so that two Pentium processors share a single L2 cache. The bad news is that this causes contention for the cache, seriously degrading performance when running multiple independent sequential programs. The good news is that many parallel programs might actually benefit from the shared cache because if both processors will want to access the same line from shared memory, only one had to fetch it into cache and contention for the bus is averted. The lack of processor affinity also causes less damage with a shared L2 cache. Thus, for parallel programs, it isn't really clear that sharing L2 cache is as harmful as one might expect.
Preliminary experience with our dual Pentium shared 256K cache system shows quite a wide range of performance depending on the level of kernel activity required. At worst, we see only about 1.2x speedup. However, we also have seen up to 2.1x speedup, which suggests that compute-intensive SPMD-style code really does profit from the "shared fetch" effect.
Is there more than one PCI bus?
Although an MPS system can achieve good speed-up for many compute-intensive parallel programs even if there is only one PCI bus, I/O operations occur at no better than uniprocessor performance... and probably a little worse due to bus contention from the processors. Thus, if you are looking to speed-up I/O, make sure that you get an MPS system with multiple independent PCI busses and I/O controllers (e.g., multiple SCSI chains). You will need to be careful to make sure SMP Linux supports what you get. Also keep in mind that the current SMP Linux essentially allows only one processor in the kernel at any time, so you should choose your I/O controllers carefully to pick ones that minimize the kernel time required for each I/O operation. For really high performance, you might even consider doing raw device I/O directly from user processes, without a system call... this isn't necessarily as hard as it sounds, and need not compromise security: e.g., see the PAPERS project kernel modification for giveioperm().
What about memory interleaving?
Memory interleaving actually has nothing whatsoever to do with MPS, but you will often see it mentioned for MPS systems because these systems are typically more demanding of memory bandwidth. Basically, two-way or four-way interleaving organizes RAM so that a block access is accomplished using multiple banks of RAM rather than just one. This provides higher memory access bandwidth, particularly for cache line loads and stores. The waters are a bit muddied about this, however, because EDO RAM and various other memory technologies tend to improve similar kinds of operations. Put another way, I'm not sure which PC memory structure is currently best... if anyone ever finds out empirically, please let me know.

Linux Mechanisms For Sharing Memory

Ok, so you have decided that parallel processing on an SMP is a great thing to do... how do you get started? Well, the very first thing that occurs to most shared-memory parallel processing folk is to get a copy of the documentation on the local threads library.

Threads are essentially "light-weight" UNIX processes that are not scheduled the same way as regular UNIX processes and allow pretty much the entire memory map to be shared among them. Trouble is, until very recently, Linux didn't do threads. The POSIX Pthreads package has been ported to Linux, and is roughly at the stage of beta testing. Of course, the big question is whether this threads package actually runs the threads of a program in parallel under SMP Linux. Apparently, it does not. The POSIX API doesn't seem to provide a way to specify concurrency.

The new bb_threads library for Linux is a very small library that uses the Linux clone() call to fork new, independently scheduled, Linux processes all sharing a single address space. Because each "thread" is a full Linux process, you do not get the same "light-weight" scheduling control provided by most thread libraries, but SMP Linux machines can run multiple of these "threads" in parallel. The library uses a bit of C-wrapped assembly code to install a new chunk of memory as each thread's stack and to provide atomic access functions for an array of locks (mutex objects). The primary documentation for this library is a very short demo program and a README file, both packaged with the library archive, but a brief overview is also presented here.

In addition to the "shared everything" mechanism of bb_threads, there are two mechanisms that allow groups of Linux processes to have independent memory spaces, all sharing only a relatively small memory segment. Assuming that you didn't foolishly exclude "System V IPC" when you configured your Linux system, Linux supports a very portable mechanism that has generally become known as "System V Shared Memory." The other alternative is a little feature from BSD UNIX that isn't completely implemented for Linux, but still does enough to be usable for sharing memory between processes: the mmap() system call. You can, and should, learn about these calls from the manual pages... but a brief of each is given here to help get you started.

Which shared memory model should you use? That is mostly a question of religion. A lot of people like the "shared everything" model, as implemented by bb_threads, because they do not really need to identify which data structures should be shared at the time they are declared... you simply put locks around potentially-conflicting accesses to shared objects to ensure that only one process(or) has access at any moment. Then again, that really isn't all that simple.... ;-) Given the choice, I'd recommend use of the System V mechanism simply because the complete mechanism is fully implemented, so unpleasant surprises are less likely.

The bb_threads Library

The bb_threads library is very simple and apparently easy to use, but it is also the easiest library to abuse. Why? Because it implements a "shared everything" model.

A lot of people prefer this type of shared memory model because it does not force them to identify which data structures should be shared at the time they are declared. One simply places locks around code that accesses shared objects if having more than one process(or) access those objects simultaneously could cause problems (typically, races). However, with everything in shared space, omission of even one lock can lead to a terrible parallel debugging nightmare. The flip side is that if one is too conservative in using locks, it is quite common for the lock accesses themselves to destroy system performance.

The basic program structure for using the bb_threads library is:

  1. Start the program running as a single process.
  2. You will need to estimate the maximum stack space that will be required for each thread. Guessing large is relatively harmless (that is what virtual memory is for ;-), but remember that all the stacks are coming from a single virtual address space, so guessing huge is not a great idea. The demo suggests 64K. This size is set to b bytes by:
    bb_threads_stacksize(b);
    
  3. The next step is to initialize any locks that you will need. The lock mechanism built-into this library numbers locks from 0 to MAX_MUTEXES, and initializes lock i by:
    bb_threads_mutexcreate(i);
    
  4. Spawning a new thread is done by calling a library routine that takes arguments specifying what function the new thread should execute and what arguments should be transmitted to it. To start a new thread executing the void-returning function f with the single argument arg, you do something like:
    bb_threads_newthread(f, &arg)
    
    Where f should be declared something like:
    void f(void *arg, size_t dummy)
    
    If you need to pass more than one argument, pass a pointer to a structure initialized to hold the argument values.
  5. Because all the address space is shared, it might seem that you can simply write your code freely accessing variables and calling functions and everything will be just fine... but this is not the case. The problems involving access to variables are common to all three ways of mapping shared memory, and are thus discussed in the section on Atomicity, Volatility, Locks, and Cache Lines. However, only this technique causes (local) static variables to be shared across all processes; the result of this is that user functions and C library routines that use static locals might not function correctly if simultaneously called from two or more processes. We say that such a function has upward exposed internal state.
    The suggested solution is to surround each such function call with bb_threads_lock(n) and bb_threads_unlock(n) where n is an integer identifying which lock to use. This causes at most one process to be allowed to enter the locked code at a time. Unfortunately, this is not really sufficient for two reasons: In fact, the library demo program fails to correctly use locks to prevent printf() from being executed simultaneously from within the functions fnn and main... and because of this, the demo does not always work. I'm not saying this to knock the demo, but rather to emphasize that this stuff is very tricky.
  6. When a thread executes a return, it actually destroys the process... but the local stack memory is not automatically deallocated. To be precise, Linux doesn't support deallocation, but the memory space is not automatically added back to the malloc free list. Thus, the parent process should reclaim the space for each dead child by:
    bb_threads_cleanup(wait(NULL));
    

In summary, the bb_threads library is a good start, but right now it is very small and very alpha. Even once it matures, the issues involving locking to avoid conflicts on upward exposed internal state will still be there.... Basically, too many locks destroy performance or yield deadlocks and too few locks cause races and incorrect behavior. In fact, although GCC generates re-entrant code, it is quite possible that some compilers would use static locations as pseudo-registers, in which case this entire model is essentially unusable. In my opinion, although the shared everything model seems the easiest to use, the subtle nature of its problems makes the other two models much easier to use correctly.

System V Shared Memory

The System V IPC support consists of a number of inter-process communication system calls providing message queues, semaphores, and a shared memory mechanism. Of course, this mechanism was really intended to be used for multiple processes to communicate within a uniprocessor system. However, that implies that it also should work to communicate between processes under SMP Linux, no matter which processors they run on. The basic procedure for creating a group of processes sharing access to a shared memory segment is:

  1. Start the program running as a single process.
  2. Typically, you will want each run of a parallel program to have its own shared memory segment, so you will need to call shmget() to create a new segment of the desired size. Alternatively, this call can be used to get the ID of a pre-existing shared memory segment. In either case, the return value is either the shared memory segment ID or -1 for error. For example, to create a shared memory segment of b bytes, the call might be:
    shmid = shmget(IPC_PRIVATE, b, (IPC_CREAT | 0666));
    
  3. The next step is to attach this shared memory segment to this process, literally adding it to the virtual memory map of this process. Although the shmat() call allows the programmer to specify the virtual address at which the segment should appear, the address selected must be aligned on a page boundary (i.e., be a multiple of the page size returned by getpagesize(), which is usually 4096 bytes), and will override the mapping of any memory formerly at that address. Thus, we instead prefer to let the system pick the address. In either case, the return value is a pointer to the base virtual address of the segment just mapped. The code is:
    shmptr = shmat(shmid, 0, 0);
    
    Notice that you can allocate all your static shared variables into this shared memory segment by simply declaring all shared variables as members of a struct type, and declaring shmptr to be a pointer to that type. Using this technique, shared variable x would be accessed as shmptr->x.
  4. Since this shared memory segment should be destroyed when the last process with access to it terminates or detaches from it, we need to call shmctl() to set-up this default action. The code is something like:
    shmctl(shmid, IPC_RMID, 0);
    
  5. Use the standard Linux fork() call to make the desired number of processes... each will inherit the shared memory segment.
  6. When a process is done using a shared memory segment, it really should detach from that shared memory segment. This is done by:
    shmdt(shmptr);
    

Although the above set-up does require a few system calls, once the shared memory segment has been established, any change made by one processor to a value in that memory will automatically be visible to all processes. Most importantly, each communication operation will occur without the overhead of a system call.

When debugging your code, it is useful to remember that the ipcs command will report the status of the System V IPC facilities currently in use.

Memory Map Call

Using system calls for file I/O can be very expensive; in fact, that is why there is a user-buffered file I/O library (getchar(), fwrite(), etc.). But user buffers don't work if multiple processes are accessing the same writeable file, and the user buffer management overhead is significant. The BSD UNIX fix for this was the addition of a system call that allows a portion of a file to be mapped into user memory, essentially using virtual memory paging mechanisms to cause updates. This same mechanism also has been used in systems from Sequent for many years as the basis for their shared memory parallel processing support. Despite some very negative comments in the (quite old) man page, Linux seems to correctly perform at least some of the basic functions, and it supports the degenerate use of this system call to map an anonymous segment of memory that can be shared across multiple processes.

In essence, the Linux implementation of mmap() is a plug-in replacement for steps 2, 3, and 4 in the System V shared memory scheme above. To create an anonymous shared memory segment:

shmptr =
    mmap(0,                        /* system assigns address */
         b,                        /* size of shared memory segment */
         (PROT_READ | PROT_WRITE), /* access rights, can be rwx */
         (MAP_ANON | MAP_SHARED),  /* anonymous, shared */
         0,                        /* file descriptor (not used) */
         0);                       /* file offset (not used) */

The equivalent to the System V shared memory shmdt() call is munmap():

munmap(shmptr, b);

In my opinion, there really isn't a compelling reason to use mmap() instead of the System V shared memory support.

Atomicity, Volatility, Locks, and Cache Lines

No matter which of the above two mechanisms you use, the result is pretty much the same: you get a pointer to a chunk of read/write memory that is accessible by all processes within your parallel program. Does that mean I can just have my parallel program access shared memory objects as though they were in ordinary local memory? Well, not quite....

Atomicity refers to the concept that an operation on an object is accomplished as an indivisible, uninterruptible, sequence. Unfortunately, sharing memory access does not imply that all operations on data in shared memory occur atomically. Worse still, "smart" compilers like GCC will often perform optimizations that could eliminate the memory operations needed to ensure that other processors can see what this processor has done. Fortunately, both these problems can be remedied... leaving only the relationship between access efficiency and cache line size for us to worry about.

However, before discussing these issues, it is useful to point-out that all of this assumes that memory references for each processor happen in the order in which they were coded. The Pentium does this, but also notes that future Intel processors might not. So, for future processors, keep in mind that it may also be necessary to surround some shared memory accesses with instructions that cause all pending memory accesses to complete, thus providing memory access ordering. The CPUID instruction apparently is reserved to have this side-effect.

Volatility

To prevent GCC's optimizer from buffering values of shared memory objects in registers, all objects in shared memory should be declared as having types with the volatile attribute. If this is done, all shared object reads and writes that require just one word access will occur atomically. For example, suppose that p is a pointer to an integer, where both the pointer and the integer it will point at are in shared memory; the ANSI C declaration might be:

volatile int * volatile p;
Yes, that is a annoying, but it is the price one pays for enabling GCC to perform some very powerful optimizations. At least in theory, the -traditional option to GCC might suffice to produce correct code at the expense of some optimization, because pre-ANSI K&R C essentially claimed that all variables were volatile unless explicitly declared as register. Still, if your typical GCC compile looks like cc -O6 ..., you really will want to explicitly mark things as volatile only where necessary.

It appears that there is a rumor accompanying the bb_threads library to the effect that using assembly-language locks that are marked as modifying "everything" will cause GCC to appropriately flush all variables, thus avoiding the "inefficient" handling associated with things declared as volatile. This hack appears to work for statically allocated global variables using version 2.7.0 of GCC... however, that behavior is not required by the ANSI C standard. Still worse, other processes that are making only read accesses can buffer the values in registers forever, thus never noticing that the shared memory value has actually changed. In summary, do what you want, but only variables accessed through volatile are guaranteed to work correctly.

Note that you can cause a volatile access to an ordinary variable by using a type cast that imposes the volatile attribute. For example, the ordinary int i; can be referenced as a volatile by *((volatile int) &i); thus, you can explicitly invoke the "overhead" of volatility only where it is critical.

Locks

If you thought that ++i; would always work to add one to a variable i in shared memory, you've got a nasty little surprise coming: even if coded as a single instruction, the load and store of the result are separate memory transactions, and other processors could access i between these two transactions. For example, having two processes both perform ++i; might only increment i by one, rather than by two. According to the Intel Pentium "Architecture and Programming Manual," the LOCK prefix can be used to ensure that any of the following instructions is atomic relative to the data memory location it accesses:

BTS, BTR, BTC                     mem, reg/imm
XCHG                              reg, mem
XCHG                              mem, reg
ADD, OR, ADC, SBB, AND, SUB, XOR  mem, reg/imm
NOT, NEG, INC, DEC                mem
CMPXCHG, XADD
However, it probably is not a good idea to use all these operations. For example, XADD did not even exist for the 386, so coding it may cause portability problems.

The XCHG instruction always asserts a lock, even without the LOCK prefix, and thus is clearly the preferred atomic operation from which to build higher-level atomic constructs such as semaphores and shared queues. Of course, you can't get GCC to generate this instruction just by writing C code... instead, you must use a bit of in-line assembly code. Given a word-size volatile object obj and a word-size register value reg, the GCC in-line assembly code is:

__asm__ __volatile__ ("xchgl %1,%0"
                      :"=r" (reg), "=m" (obj)
                      :"r" (reg), "m" (obj));

Examples of GCC in-line assembly code using bit operations for locking are given in the source code for the bb_threads library.

It is important to remember, however, that there is a cost associated with making operations atomic. A locking operation carries a fair amount of overhead and may delay memory activity from other processors, whereas ordinary references may simply use local cache. The best performance results when locking operations are used as infrequently as possible.

Cache Line Size

One more fundamental atomicity concern can have a dramatic impact on SMP performance: cache line size. Although the MPS standard requires references to be coherent no matter what caching is used, the fact is that when one processor writes to a particular line of memory, every cached copy of the old line must be invalidated or updated. This implies that if two or more processors are both writing data to different portions of the same line a lot of cache and bus traffic may result, effectively to pass the line from cache to cache. This problem is known as false sharing. The solution is simply to try to organize data so that what is accessed in parallel tends to come from a different cache line for each process.

You might be thinking that false sharing is not a problem using a system with a shared L2 cache, but remember that there are still separate L1 caches. Cache organization and number of separate levels can both vary, but the Pentium L1 cache line size is 32 bytes and typical external cache line sizes are around 256 bytes. Suppose that the addresses (physical or virtual) of two items are a and b and that the largest per-processor cache line size is c, which we assume to be a power of two. To be very precise, if ((int) a) & ~(c - 1) is equal to ((int) b) & ~(c - 1), then both references are in the same cache line. A simpler rule is that if shared objects being referenced in parallel are at least c bytes apart, they must be in different cache lines.

Other Shared Memory Parallel Processing Resources

Because SMP Linux does not yet have much parallel processing support, it is useful to see what some of the other SMP systems out there do. The following are a few such resources.

http://www.epm.ornl.gov/~zhou/ltpvm/ltpvm.html
This link points to the home page for development of an SMP-oriented Lightweight version of PVM, the Parallel Virtual Machine software commonly used for cluster parallel processing. Unfortunately, LPVM is being developed using threads, and thus might not directly take advantage of parallelism using SMP Linux and POSIX threads. It shouldn't be so different as to make porting to bb_threads difficult...?
http://www.sequent.com/
Sequent Computer Systems, Inc., claims to be the inventor of SMP, and has been building systems since 1983. Originally, their machines were based on National Semiconductor's RISC processors, but they were an early adopter of the Intel 386, and have been using Intel processors ever since. The Symmetry architecture supports up to 30 processors, and they are developing NUMA-Q/IQ-Link to "transparently join multiple Intel 4X P6 building blocks" to make machines with over 250 processors. There is also much to be learned from Sequent's UNIX-based SMP parallel processing support....
http://http.cs.berkeley.edu/~culler/smp.html
This is a generic list of commercially available SMPs, most of which are not MPS compliant and will never run SMP Linux. The reason they are listed here is that, like Sequent, these companies have all developed shared memory parallel processing support that may be useful as a guide in developing SMP Linux support.

HGD

This page was last modified July 08, 1996.
full statistics are available.