Ludwig-Maximilians-Universität München, Institut für Informatik
Lehr- und Forschungseinheit für Programmierung und Softwaretechnik

CMC test runs

CMC is model checker implemented in component technology. It uses the NIPS Promela virtual machine by Michael Weber.

Using double caching for large state spaces

We use a special caching technique for checking large state spaces (in the range of 10-100 million states, and even way beyond). This is done using hard disk memory, but we aim at reducing the required number of disk queries by employing caching structures in memory.

Lunar-4 scenario

A first example is the Lunar protocol, here variation 4 without any optimization technique. We can check this protocol, which has 8.1 million states, in under a GB of main memory. With 8GB main memory, the model can be checked without disk. The following run has been done on a system with 2GB. On a 64-bit machine with 16GB memory, it can be checked in less than 3 minutes!
Search complete: visited 8.119.319 states, created 8.723.018 states, search time elapsed: 1.917s, maximum state size 603
                   7.828.962 single successor states, 0 dead ends, 53.431 DFS problems (53 real)
Double caching: 6 explicit lookups
  Bloom filter, k=1, size 1073741824: 0 partial collisions
                8088540 bits set, 8119319 entries, collision rate: 1.00, fill rate:0.01
                8723019 queries, 8088540 resolved
  Lossy hashing with 8119319 elements; reclaimed 7727374 elements with 306.257.134 bytes in 74 runs
                634479 queries, 572781 resolved
  Cache failures: 61698
  Diskstore: 61.698 lookups, 290 maximum lookup size, 27.575 revisits detected
             71.745 pages written, 264.999 pages read, 65.536 pagesize, 128 partitions, 65.536 lookupbuf
State size histogram: 220:23, 262:3, 304:3, 346:3, 380:1, 387:1, 388:5, 434:157, 441:50, 488:3.445, 495:672, 542:77.329, 549:3.124, 596:8.633.222, 603:4.980
See the entire logfile!

This model has two fascinating properties: first, it has rather lot of different state sizes, and the largest one has only a few states! Also, only a very small amount of states are revisited. This helps to perform fast, of course!

This image shows the size of the candidate set (containing states that need to be queried against the disk), open set (containing states that need to be processed) and the total number of visited states.
We see that 90% of the states are found within the first 10% of the time. Afterwards, the disk checks dominate. During the checks, almost all candidates are discovered to be unvisited. Soon enough, the candidate set becomes very small, however, for the last few states the disk has to be checked completely. This produces an expensive "final phase" where the remaining states need to be checked - they might lead to a big, unexplored region, after all. Note that the first lookup reopens almost every candidate state, whereas the second lookup only reopens a fractional amount.

Peterson's mutual exclusion example for 5 processes

Search complete: visited 68.907.937 states, created 378.993.534 states, search time elapsed: 3.849s, maximum state size 152
                   26 single successor states, 0 dead ends, 415.581 DFS problems (415.581 real)
Double caching: 5 explicit lookups
  Bloom filter, k=1, size 1073741824: 0 partial collisions
                66744743 bits set, 68907937 entries, collision rate: 1.03, fill rate:0.06
	        378993535 queries, 66744743 resolved
  Lossy hashing with 68.907.937 elements; reclaimed 59871171 elements with 12.722.935.511 bytes in 60 runs
                312248792 queries, 231703062 resolved
  Cache failures: 80545730
  Diskstore: 80.545.730 lookups, 29.999 maximum lookup size, 43.280.445 revisits detected
             144.320 pages written, 2.244.434 pages read, 65.536 pagesize, 128 partitions, 30.000 lookupbuf
State size histogram: 57:2, 76:5, 95:5, 114:5, 133:5, 145:206.723.743, 152:172.269.769
See the entire logfile!

Using path reduction, 5 processes amount to 68.9 million states. Contrary to the Lunar-4 scenario, this protocol produces a whole bunch of revisits, namely 310 million states are visited again! This is of course very expensive. Still, the total size of the states visited stay short of the 16GB of main memory available (while consuming just a little more than 4GB of memory for the model checker), so the OS disk access caching takes over and speeds things up pretty much - nothing is ever read from the disk! For smaller memories, an estimated 90 minutes disk I/O overhead should be added, but this is not yet troublesome... I expected an entire week!

A similar size developement of the open set, but much more disk lookups are required! The "final phase" checks can be seen here, too. If we take a closer look at the first 1000 seconds, the cooperation of the different components of CMC becomes visible:

  1. For the first 280 seconds, the lossy hash table is filled. Memory is consumed in a linear fashion, and little candidates are found - they originate from hash collisions in the Bloom filter.
  2. At approx. 280 seconds, the capacity of the lossy hash table is exceeded, and states are reclaimed. Memory consumption is slowed down, but begins to grow again, as more candidates are found. This is due to the increasing unreliability of the lossy hash map. New memory is required for candidate states, and open states. Note that the open set has linear growth in this period.
  3. At approx. 600 seconds, the first disk check is triggered. The candidate set is emptied, and some states are added to the open set - not much, however. Note that due to the partitioning, the search still continues during this period, so upon checking the last partition, new candidates have been found for partitions checked earlier - this is why the candidate state count does not reach 0.
  4. At 620 seconds, the check is complete. Memory consumption stays stagnant. Note that there is a rapid decrease in the open set at 680 seconds - this may be induced by checking reopened states whose immediate environment has already been searched, and still sits in the lossy hashing.
  5. At 750 seconds, we see a small increase in total memory requirement: This is due to the growing open set (we never deallocate memory once it is allocated, right until search termination). While the open set posed no problem for this model, it might become the prime problem later. However, we could store parts of it to disk - but have to insert it to the candidate set upon loading it again, as the states might have been visited in the meantime.

The effectiveness of the caches

The following test presents results obtained from checking models with a different number of bits and lossy hashing capacity. This is a little problematic due to the neccesity to do perform a quadratic number of tests, which prohibits the use of really large models.

(This part is definitely outdated, since it makes wrong use of the Bloom filter (one hash function only). See the paper for the new figures.)

Leader election

The first example is the leader election protocol for 10 processes (get the logfiles). We fixed the reclaimahead to 10% of the maximum states in lossy hashing, the numbers of partitions to 16, the page size to 2MB and the candidate set to 65536 entries.
Althought the algorithm is not optimized to operate on disk only, we report on the total runtime, mainly to show that it strongly correlates with the pages read from disk:
Runtimes Pages read and written
The run with completely empty caches (at least almost empty caches, to keep the algorithm running) took 1 hour and 11 minutes to complete, whereas the algorithms that did not use the disk required just a little more than 6 minutes. Note that the number of written pages depends on the lossy hash size only.
Memory requirements, as measured with memtime, turn out to be somewhat unexpected:
Memory requirements
It is more memory consumptive to use no lossy hashing at all than using just a little bit! I assume that this is due to the increased amount of local transitions, which require memory for storing the candidates etc. Admittedly, the model is rather small, being a little shy of 2M states. The complete test took approx. 24 hours to complete. Also, our implementation uses a rather crude memory management that is not very good at reusing space.

A small deduction can be made on this data: Lossy hashing does not help too much (safe the small treshold to get over the local transitions - I guess), but consumes lots of memory. Bitstate hashing, however, is cheap and helps a lot.

Using direct IO

For the leader election protocol, this is a test of various page sizes with all other parameter staying fixed:
Influence of disk page size on total
runtime
The disk page size has a large influence on the runtime. A little can be gained by using asynchronous I/O, which is done by concurrently reading a disk page while the first disk page is processed against the state index. However, disk I/O is so prohibitively slow that it dominates the total runtime too much. This leads to the idea of spending more time on decreasing I/O times, by compressing the data; this will be discussed in the next chapter. Direct I/O is easy to achieve, but has some pitfalls: Just add a O_DIRECT flag to (f)open. However, you will get an error if the I/O buffer in memory are not softblocksize-aligned, and their size is no multiple of the softblocksize. The funny thing about this is that I have no clue where to get the softblocksize for my system! However, working with the memory page size (4096 on most Linux systems) works fine. Some implementation details:

Compressing the disk pages

As I/O time is prohibitively expensive, every effort should be made to keep I/O transfer as low as possible. There are two approaches here: Using multiple partitions reduces the number of disk reads on checking a few states. This becomes important at the end of the run, when relatively few states need to be checked. It will also play a big role with depth-first search. Still, most of the written pages need to be checked on every lookup of more pages than partitions - if the hashfunction is good (and Jenkin's hashing has shown to be quite good when it comes to this).

The other approach is to store a compressed form of the states. Like every compression, this may be lossless or lossful compression. Lossless compression may be employed at the state generator module, but can only compress one state at a time. When writing disk pages, multiple states may be compressed together, which is a novelty in model checking. Starting with an uncompressed state, all subsequent states on the page may be stored by diff strings only, which yields a compression from 15% to 50%, depending on the sorting of states. Even better, and almost independent from state order, is the Lempel-Ziv algorithm, as implemented in the zlib compression library. When compressing the disk pages written by the model checker, without compression, a ratio of a twentyfold size decrease may be obtained.

Even better compression, at the cost of completeness, is given by hash compaction. A hash function (we use SHA1) is used to calculate a hash digest, which is then stored in place of the original state. The more bytes that digest contains, the less likely a collision becomes. Using 40 bits produces a hundred-fold compression for larger models! Of course, this produces vast improvements in runtime. However, hash compaction provides an overapproximation on state revisits, and may thus prohibit the detection of errors in some rare, yet possible, cases.

Using the ZLIB

The zlib compression library provides a effective implementation of the DEFLATE algorithm, which is a combination of the Lempel-Ziv-Algorithm and Huffmann coding. It can be used to compress the pages written to disk, and uncompress them again upon checking states against the disk.

The results are surprisingly good. Usually, compression ratios far below 10% are achieved - at a relatively small overhead. This is due to two facts about state:

To illustrate this facts, take a look at an example dump file which contains the first 100 states of the sfbad4 example.

This are some early results (sizes are approximates):
modelnormal timezlib timeuncompressed state datacompressed state datastates
leader.10.20 4:02.6262:21.980 678MB20MB1.95M
lunar4 49:21.6418:48.301 4GB248MB8,11M
5peterson 3:37:46.0492:04:11.811 8.8GB953MB68,91M
The zlib provides a parameter for the compression quality, which can be set from 1 to 9 and influences the compression quality as well as the runtime required by the zlib. For the lunar4 model, this are results obtained for the different quality settings. As you can see, the model was checked in approx 50 minutes without the zlib, and in less than 9 minutes with the zlib and quality setting 6.

Further tests

We did a small test run to compare different reclaim sizes. Theoretically, it should be better to schedule only smal portions of the state space for reclamation (the states are only actually removed from the lossy hash table at the very moment a new state is inserted). If few states are scheduled, states are longer subjected to gathering of statistics, so the choice should be better.

However, this was not found in practical measurements, with the leader election protocol (see the logs):

The runtime is influenced by the number of lookup runs as well as the number of cache failures. The number of cache misses is strongly decreasing (or stagnant), which is unexpected. It can only be guessed that this is due to the premature evaluation of state before statistical data could be gathered. The stagnant cache-miss-ratio experienced with the outgoing and combined evaluators (combined takes both incoming, outgoing and age into account) indicates that this is the case; the data they base their evaluation on does not change over time (assuming that the incoming ratio is not affecting the treshold for the combined evaluator).

The runtime behaviour is very unexpected. Most likely, this is an artefact of the model used.

Further tests to be done


Moritz Hammer, Mar 06, 2006