Friday, November 25, 2011

New Whitepaper Compares HPC App Performance on 10GbE vs IB

Despite the high bandwidth and performance claims of InfiniBand, Cisco has demonstrated that 10 Gigabit Ethernet is a viable solution for HPC clusters. The powerful Cisco C200 M2 server combined with Cisco Nexus 5000 Series Switches and RDMA NICs provides a low-latency solution that meets or beats QDR InfiniBand in real application performance for leading HPC applications.
University of Oslo to Build Big Data SMP with Numascale and IBM
This week Numascale announced an agreement with the University of Oslo to build a prototype Big Data SMP supercomputer for large-scale advanced research computation. Numascale technology enables commodity servers to be aggregated into a scalable computer system that shares all processor, memory, and I/O resources in a single system image.
“We believe the NumaConnect technology will revolutionize the way to perform advanced computation, whether in academic research or in the industry at large. The PRACE Research Infrastructure provides a perfect framework for this collaborative effort to evaluate NumaConnect, and we are very proud to work with the University of Oslo, IBM, and their partners in this effort,” says CEO Kare Lochsen of Numascale.

Funded by PRACE, the project will evaluate NumaConnect for future use in the pan-European Research Infrastructure. As part of the pilot program, IBM will provide the server infrastructure and Gridcore will do the system integration.
How Caching Works
Many factors contribute to processor performance. These factors include instruction execution resources, clock speed, internal bandwidth, cache design, and memory-access efficiency. If any one of these features is poorly designed or implemented, overall processor performance suffers.
Note that the converse is not true: if any single factor is particularly well implemented, it will rarely be enough to compensate for other limitations. This rule was made clear during the last decade, when buyers finally came to understand that clock speed alone is not a good measure of overall performance.
A key aspect of processor performance, but one which is lost in the hyperbole around clock speed, is the design of the cache architecture. The importance of cache architecture is frequently underestimated by developers who tend to focus primarily on cache size. Size is indeed an important factor for cache performance, but it is only one factor. This article explains how caches are implemented today on x86 and RISC processors.
The cache on a processor plays the same role as a cache anywhere else in computer hardware: it buffers data between two devices whose data transfer rates are significantly different. In this case, it buffers data between the fast processor and slower main memory. Its goal is to minimize the number of processor accesses to main memory.
In early processor designs, a single cache was the dominant model. However, as computing needs required greater performance, it became clear that a faster caching mechanism with several layers was required.
That's ancient history. For more than a decade now, most processors have used a two-tier cache. The first tier, called the Level 1 or L1 cache, is placed directly in the processing core and loaded with the instructions and data items the processor’s execution units need immediately. Due to its location at the innermost parts of the processor, access to L1 cache is extremely fast. A second, larger, and slower Level 2 (or L2) cache handles the interface to main memory on many designs. In this approach, the L2 cache feeds the L1 cache, which in turn feeds the processor core. AMD uses the caches slightly differently, as explained in the accompanying article, due to some interesting innovations. However, the basic two-tier cache design has been common for several generations of x86 processors.
RISC processors, such as IBM's Power chips, and some niche x86 server processors add a third layer of cache, called L3. This is a larger and slower cache frequently located either off the actual processor die or on the die but connected via a dedicated bus. An L3 cache is particularly important in applications that need large chunks of data to be in cache simultaneously—database transactions being the defining example.
While historically most x86 processors have used only L1 and L2 caches, AMD has announced that next year it will begin shipping Opteron processors with Level 3 cache.
Caches enable processors to access data without performing a memory fetch. Because those fetches are so expensive, considerable engineering is built into all aspects of a processor to avoid them. One common technology is the hardware pre-fetcher. It watches memory accesses and looks for patterns. Once it detects an access pattern, it begins anticipating the memory reads or writes and prefetches data into the cache. For example, if your code is stepping through a large array, the hardware prefetcher will recognize this pattern and begin loading upcoming data elements into cache in anticipation of need. This activity hides the expense, or latency, of accessing main memory.
A software version of the prefetch also exists. Instructions added in the Streaming SIMD Extensions (or SSE)—now standard on all x86 processors—enable programmers to preload data into cache. This option is particularly helpful when it is known that a data item will be needed but the hardware pre-fetcher will not anticipate it. There is a whole art to deciding how far ahead of need the prefetch should be issued. It clearly has to occur several instructions before the access, so that it can complete before the data is needed.
By the same token, if it's done too early, it evicts data from the L2 cache that might still be needed. AMD’s processor manuals have information on this calculation for proper pre-fetch instruction placement.
The minimum unit of data read into cache on a single fetch is called a cache line. The size of the cache line varies by processor type, but it has been fixed at 64 bytes on most x86 processors for a long time. (Some processor models are designed to read two cache lines at a time, so they effectively process chunks of data 128 bytes at a time.)
This concept of cache lines brings us to a subtle problem that can cause a devastating loss of performance: false sharing. The basic problem arises from a principle called cache coherency, which requires that data in caches always be the latest version. So, if the same cache line is in the caches of two different processor cores, they cannot differ. Any update to memory by one core must immediately be reflected in the other cache's version of the data. This obviously makes sense. The cache coherency mechanism that keeps these data items in sync is a behind-the-scenes notifier that tells a processor that a given cache line needs to be reloaded.
Suppose two threads—each running on its own core—have a variable of their own that only they control. Unfortunately, because the developer did not understand how caches operate, these two variables (let's say they’re integers) were placed side by side in the same cache line.
Now, every time variable A is updated by thread A, the copy of that cache line in core B must be updated. As a result, core B now spends cycles reloading the cache line even though none of the data it uses has changed. If variable A is a loop index, then for every iteration of the loop in A, core B must reload the cache line—for absolutely no benefit. So, a key design criterion for all software is to make sure that variables accessed by different threads are always at least one cache line apart. This problem can be extremely subtle to uncover even with careful profiling—clearly, knowledge of cache operations and proper design for multi-threading are important for good performance.
Once a cache has data loaded into it, how is that data accessed by the processor? To answer this, it's necessary to consider how a cache might be architected. Typically, the cache data is mapped to addresses in memory. Two principal models for this mapping exist. I'll illustrate them on a system that has a cache of 1MB and total RAM of 1GB. To make matters simple, let's say cache lines are 100 bytes each. In such a case, there are 10,000 cache lines available in this cache.
In the first model, called direct mapped, each address in the 1GB space can map to one particular cache line, which is determined by the low address bits. When the data is read, 100 consecutive bytes are stored in the cache line, and the upper address bits are stored as a tag associated with that cache line.
On an attempted read, if the upper bits of the desired address don't match the stored tag value, we know we have to fetch from memory, because no other cache line will have data from this range of addresses.
This design has the advantage that cache look-ups are very fast, because only one tag value needs to be tested. The drawback is that much of the cache goes unused while some cache lines are very heavily swapped out: it's like one of those 80/20 rules, where 20 percent of the cache lines do 80 percent of the work.
The second model, called fully associative, is to scrap the mapping of cache lines to memory and simply store the latest 1MB of data in cache, regardless of where it comes from in memory. The entire address is stored as a tag value. This approach keeps the cache loaded with all the freshest data. But it cripples the cache look-up process: how can the processor tell whether an address is in cache without examining every cache line's tag?
All x86 processors use a blend of these two approaches. They carve up memory into large blocks, and for each block they allocate several cache lines. The number of cache lines allocated for a given address block is the n in the term "n-way associative" cache.
For example, the Athlon 64 X2 processor has a 16-way associative L2 cache. This means that when a memory fetch is needed, the L2 cache checks 16 possible places (tags) for the cache line before resorting to main memory. Because the cache hardware performs these 16 look-ups in parallel, they take very little time and add virtually no overhead. And because 16 possible slots can hold the cache line, the probability that it is actually in cache is high. (Incidentally, 16-way is a comparatively high number. Most processors are 8-way, with some only 4-way.)
All data in cache comes from one of two sources: main memory or the processor core. In the first case, as we have seen, it is read into cache in cache-line-sized blocks. In the second case, the processor outputs modified data that must be written back to memory. On most processors today, this write is not done right away; rather, it is queued for later, when it will have the least effect on performance. The write to memory is generally a slower operation than a read, because the write generally requires a read of memory first to place the data to be overwritten in cache. This design, which is called read-before-write and has been used for years, is elegantly modified by AMD.
In this introduction, we have seen that cache management involves lots of different operations. Developers who know how these operations are performed are at an advantage because they can write their code to favor cache performance and avoid problems, such as false sharing, that can be disastrous.