Understanding CPU caching and performance
Another way to increase the number of sets in the cache without actually increasing the cache size is to reduce the number of blocks in each set. Check out the diagram below, which shows a two-way set associative cache.
To get a better idea of what’s going on, let’s briefly compare the two-way associative cache to both the direct-mapped and the four-way cache. For the sake of comparison, let’s assume that the cache size and the memory size both stay constant. And as you read the comparison, keep in mind that since each increase in the level of associativity (e.g. from two-way to four-way, or from four-way to eight-way) also increases the number of tags that must be checked in order to locate a specific block, an increase in associativity also means an increase in the cache’s latency.
Two-way vs. direct-mapped
With the two-way cache, the number of potential collisions (and hence the miss rate) is reduced versus the direct-mapped scheme. However, the number of tags that must be searched for each fetch is twice as high. Depending on the relative sizes of the cache and main memory, this may or may not increase the cache’s overall latency to the point that the decreased conflict miss rate is no longer worth it.
Two-way vs. four-way
Though a two-way cache’s latency is less than that of the four-way scheme, its number of potential collisions (and hence its miss rate) is higher than that of the four-way cache. Just as with the preceding comparison, how a two-way associative cache compares to a four-way associative cache depends on just how much latency the increase in associativity ends up costing versus the decrease in conflict miss rate.
Associativity: Conclusions
In general, when you factor in current technological conditions (e.g. the speed of tag RAM, the range of sizes that most caches fall into, etc.), some level of associativity less than or equal to eight-way turns out to be optimal for most caches. Any more than eight-way associativity and the cache’s latency is increased so much that the decrease in miss rate isn’t worth it. Any less than two-way associativity and the number of collisions often increases the miss rate to the point that the decrease in latency isn’t worth it. There are some direct-mapped caches out there, though.
Before I conclude the discussion of associativity, there are two minor bits of information that I should include for the sake of completeness. First, though you’ve probably already figured it out, a direct-mapped cache is simply a one-way associative cache and a fully associative cache is an n-way associative cache where n is equal to the total number of blocks in the cache.
The second thing you should know is the formula for computing the set in which an arbitrary block of memory should be stored in an n-way associative cache. From page 377 of Hennessy and Patterson, the cache placement formula is as follows:
(Block address) MOD (Number of sets in cache)
I recommend trying this formula out on some of the examples above. It might seem like a boring and pointless exercise, but if you take five minutes and place a few blocks using this simple formula in conjunction with the preceding diagrams, everything I’ve said in this section will really fall into place for you in a "big picture" kind of way. And it’s actually more fun to do than it probably sounds.
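If you’d rather let the computer do the arithmetic, here’s a minimal sketch in C of the placement formula. The cache parameters (a 512-byte cache with 32-byte blocks) and the function name set_index are made up purely for illustration and don’t correspond to any of the diagrams above.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache parameters, chosen only for illustration. */
#define CACHE_SIZE 512   /* total cache size in bytes */
#define BLOCK_SIZE 32    /* bytes per block           */

/* (Block address) MOD (Number of sets in cache) */
static unsigned set_index(uint32_t address, unsigned ways)
{
    unsigned num_blocks = CACHE_SIZE / BLOCK_SIZE;  /* 16 blocks total        */
    unsigned num_sets   = num_blocks / ways;        /* more ways = fewer sets */
    uint32_t block_addr = address / BLOCK_SIZE;     /* which block this byte falls in */
    return block_addr % num_sets;
}

int main(void)
{
    uint32_t addr = 0x1A7C;  /* an arbitrary example address */

    printf("direct-mapped (1-way): set %u\n", set_index(addr, 1));
    printf("two-way:               set %u\n", set_index(addr, 2));
    printf("four-way:              set %u\n", set_index(addr, 4));
    return 0;
}
```

Notice that the higher the associativity, the fewer sets there are, so the same block address gets folded into a smaller range of set numbers.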
Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes
Replacement/Eviction policies
Caches can increase the amount of benefit they derive from temporal locality by implementing an intelligent replacement policy (also called, conversely, an eviction policy). A replacement policy dictates which of the blocks currently in the cache will be replaced by any new block that gets fetched in. (Or, put the other way around, an eviction policy dictates which of the blocks currently in the cache will be evicted in order to make room for any new block that is fetched in.) One way to do things would be to pick a block at random to be replaced. Other alternatives would be FIFO (First In, First Out), LIFO (Last In, First Out), or some other such variation. However, none of these policies takes into account the fact that any block that was recently used is likely to be used again soon. With these algorithms, we wind up evicting blocks that will be used again shortly, thereby increasing the cache miss rate and eating up valuable memory bus bandwidth with a bunch of unnecessary fetches.
The optimal replacement policy that doesn’t involve predicting the future would be to evict the block that has gone the longest period of time without being used, or the Least Recently Used (LRU) block. The logic here is that if a block hasn’t been used in a while, it’s less likely to be part of the current working set than a block that was more recently used.
An LRU algorithm, though ideal, isn’t quite feasible to implement in real life. Control logic that checks each cache block to determine which one is the least recently used not only adds complexity to the cache design, but such a check would also take up valuable time and thereby add unwanted latency to each replacement. Most caches wind up implementing some type of pseudo-LRU algorithm that approximates true LRU by marking blocks as more and more "stale" the longer they sit unused in the cache. When a new block is fetched into the cache, the stalest blocks are the first to be replaced.
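To make the idea of a pseudo-LRU approximation concrete, here’s a minimal sketch in C of one common textbook scheme, tree-based pseudo-LRU, for a single four-way set. This is just one of several possible approximations and isn’t drawn from any particular CPU; the names plru_t, plru_touch, and plru_victim are my own.

```c
#include <stdio.h>
#include <stdint.h>

/* Tree-based pseudo-LRU for one four-way set. Three bits remember which
 * "side" of each tree node was touched least recently; following the bits
 * from the root picks an eviction victim that is usually (not always) the
 * true least recently used way. */
typedef struct {
    uint8_t b0;  /* which pair (ways 0/1 vs. 2/3) is less recently used */
    uint8_t b1;  /* which of ways 0/1 is less recently used             */
    uint8_t b2;  /* which of ways 2/3 is less recently used             */
} plru_t;

/* Record a hit (or fill) of `way`, pointing the tree bits away from it. */
static void plru_touch(plru_t *p, int way)
{
    if (way < 2) {
        p->b0 = 1;            /* least recently used pair is now 2/3     */
        p->b1 = (way == 0);   /* within the pair, point at the other way */
    } else {
        p->b0 = 0;            /* least recently used pair is now 0/1     */
        p->b2 = (way == 2);
    }
}

/* Follow the bits from the root to find the way to evict. */
static int plru_victim(const plru_t *p)
{
    if (p->b0 == 0)
        return p->b1 ? 1 : 0;
    else
        return p->b2 ? 3 : 2;
}

int main(void)
{
    plru_t p = {0, 0, 0};

    plru_touch(&p, 0);
    plru_touch(&p, 3);
    plru_touch(&p, 1);
    printf("victim: way %d\n", plru_victim(&p));  /* way 2, the only untouched way */
    return 0;
}
```

The appeal of this kind of scheme is that three bits per set and a couple of comparisons stand in for the bookkeeping that exact LRU would require, which is precisely the latency-versus-accuracy trade-off described above.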
Sometimes, blocks that aren’t all that stale get replaced by newer blocks just because there isn’t enough room in the cache to contain the entire working set. A miss that results when a block containing needed data has been evicted from the cache due to a lack of cache capacity is called a capacity miss. This is the third and final type of cache miss.
Conclusions
Block sizes
In the section on spatial locality I mentioned that storing whole blocks is one way that caches take advantage of spatial locality of reference. Now that we know a little more about how caches are organized internally, we can look a bit closer at the issue of block size. You might think that as cache sizes increase you could take even better advantage of spatial locality by making block sizes even bigger. Surely fetching more bytes per block into the cache would decrease the odds that some part of the working set will be evicted because it resides in a different block. This is true, to some extent, but we have to be careful. If we increase the block size while keeping the cache size the same, then we decrease the number of blocks that the cache can hold. Fewer blocks in the cache means fewer sets, and fewer sets means that collisions and therefore misses are more likely. And of course, with fewer blocks in the cache the likelihood that any particular block that the CPU needs will be available in the cache decreases.
The upshot of all this is that smaller block sizes allow us to exercise more fine-grained control of the cache. We can trace out the boundaries of a working set with a higher resolution by using smaller cache blocks. If our cache blocks are too large, we wind up with a lot of wasted cache space because many of the blocks will contain only a few bytes from the working set while the rest is irrelevant junk. If we think of this issue in terms of cache pollution, we can say that large cache blocks are more prone to pollute the cache with non-reusable data than small cache blocks.
The following image shows the memory map we’ve been using, with large block sizes.
This next image shows the same map, but with the block sizes decreased. Notice how much more control the smaller blocks allow over cache pollution.
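To put rough numbers on this trade-off, here’s a tiny sketch in C showing how, for a fixed cache, doubling the block size halves the number of blocks and therefore the number of sets. The 64 KB, four-way cache is a made-up example, not a reference to any of the diagrams.

```c
#include <stdio.h>

/* For a fixed (hypothetical) 64 KB four-way cache, show how growing
 * the block size shrinks the number of blocks and sets available. */
int main(void)
{
    const unsigned cache_size = 64 * 1024;  /* bytes, made-up example */
    const unsigned ways       = 4;

    printf("block size | blocks | sets\n");
    for (unsigned block = 32; block <= 256; block *= 2) {
        unsigned blocks = cache_size / block;   /* total blocks in the cache */
        unsigned sets   = blocks / ways;        /* sets = blocks / ways      */
        printf("%9uB | %6u | %4u\n", block, blocks, sets);
    }
    return 0;
}
```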
The other problems with large block sizes are bandwidth-related. Since a larger block size means more data is fetched with each LOAD, large block sizes can really eat up memory bus bandwidth, especially if the miss rate is high. So a system has to have plenty of bandwidth if it’s going to make good use of large cache blocks. Otherwise, the increase in bus traffic can increase the amount of time it takes to fetch a cache block from memory, thereby adding latency to the cache.
Write Policies: Write through vs. Write back
So far, this entire article has dealt with only one type of memory traffic: loads, or requests for data from memory. I’ve only talked about loads because they make up the vast majority of memory traffic. The remainder of memory traffic is made up of stores, which in simple uniprocessor systems are much easier to deal with. In this section, we’ll cover how to handle stores in single-processor systems with just an L1 cache. When you throw in more caches and multiple processors, things get more complicated than I want to go into, here.
Once a retrieved piece of data is modified by the CPU, it must be stored or written back out to main memory so that the rest of the system has access to the most up-to-date version of it. There are two ways to deal with such writes to memory. The first way is to immediately update all the copies of the modified data in each level of the hierarchy to reflect the latest changes. So a piece of modified data would be written to the L1 and main memory so that all of its copies are current. Such a policy for handling writes is called write through, since it writes the modified data through to all levels of the hierarchy.
A write through policy can be nice for multiprocessor and I/O-intensive system designs, since multiple clients are reading from memory at once and all need the most current data available. However, the multiple updates per write required by this policy can greatly increase memory traffic. For each STORE, the system must update multiple copies of the modified data. If there’s a large amount of data that has been modified, then that could eat up quite a bit of memory bandwidth that could be used for the more important LOAD traffic.
The alternative to write through is write back, and it can potentially result in less memory traffic. With a write back policy, changes propagate down to the lower levels of the hierarchy as cache blocks are evicted from the higher levels. So an updated piece of data in an L1 cache block will not be updated in main memory until it’s evicted from the L1.
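To see the difference in miniature, here’s a toy sketch in C of a single cache line under each policy. The structure and function names (line_t, store_write_through, store_write_back, evict) are invented for illustration, and the memory array stands in for main memory; real hardware does this with control logic, not C code.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32

static uint8_t memory[1024];   /* stand-in for main memory */

typedef struct {
    uint32_t tag;              /* which LINE_SIZE chunk of memory is cached */
    uint8_t  data[LINE_SIZE];
    int      valid;
    int      dirty;            /* only meaningful for write back */
} line_t;

/* Write through: update the cached copy AND main memory on every store.
 * (The sketch assumes addr falls inside the line this tag maps to.) */
static void store_write_through(line_t *l, uint32_t addr, uint8_t value)
{
    l->data[addr % LINE_SIZE] = value;
    memory[addr] = value;      /* extra memory traffic on every store */
}

/* Write back: update only the cached copy and mark the line dirty;
 * main memory is brought up to date later, when the line is evicted. */
static void store_write_back(line_t *l, uint32_t addr, uint8_t value)
{
    l->data[addr % LINE_SIZE] = value;
    l->dirty = 1;
}

static void evict(line_t *l)
{
    if (l->valid && l->dirty) {   /* flush modified data only at eviction time */
        memcpy(&memory[l->tag * LINE_SIZE], l->data, LINE_SIZE);
        l->dirty = 0;
    }
    l->valid = 0;
}

int main(void)
{
    line_t  l = { .tag = 3, .valid = 1 };   /* data[] zero-initialized */
    uint32_t addr = 3 * LINE_SIZE + 5;

    store_write_through(&l, addr, 0xAB);
    printf("write through:                memory byte = 0x%02X\n", memory[addr]);

    store_write_back(&l, addr, 0xCD);
    printf("write back (before eviction): memory byte = 0x%02X\n", memory[addr]);
    evict(&l);
    printf("write back (after eviction):  memory byte = 0x%02X\n", memory[addr]);
    return 0;
}
```

The dirty flag is the key piece of state that makes write back work: it records that the cached copy is newer than main memory, so the block only has to be written out once, at eviction time, no matter how many stores hit it in the meantime.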
Conclusions
There is much, much more that can be said about caching, and this article has covered only the basic concepts. In the next article, we’ll look in detail at the caching and memory systems of both the P4 and the G4e. This will provide an opportunity not only to fill in the preceding, general discussion with some real-world specifics, but also to introduce some more advanced caching concepts like data prefetching and cache coherency.
Bibliography
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach. Second Edition. Morgan Kaufmann Publishers, Inc.: San Francisco, 1996.
- Dennis C. Lee, Patrick J. Crowley, Jean-Loup Baer, Thomas E. Anderson, and Brian N. Bershad, "Execution Characteristics of Desktop Applications on Windows NT." 1998. http://citeseer.nj.nec.com/lee98execution.html
- Institute for System-Level Integration, "Chapter 5: The Memory Hierarchy."
- Manish J. Bhatt, "Locality of Reference." Proceedings of the 4th Pattern Languages of Programming Conference. 1997. http://st-www.cs.uiuc.edu/users/hanmer/PLoP-97/
- James R. Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic." 25 Years ISCA: Retrospectives and Reprints 1998: 255-262
- Luiz Andre Barroso, Kourosh Gharachorloo, and Edouard Bugnion, "Memory System Characterization of Commercial Workloads." Proceedings of the 25th International Symposium on Computer Architecture. June 1998.
Revision History
Date | Version | Changes
7/9/2002 | 1.0 | Release |
7/9/2002 | 1.1 | Page 3 was missing, so I’ve added it in. |