Ludwig-Maximilians-Universität München, Institut für Informatik
Lehr- und Forschungseinheit für Programmierung und Softwaretechnik
CMC test runs
CMC is model checker implemented in component technology. It uses the
NIPS Promela virtual
machine by Michael
Weber.
Using double caching for large state spaces
We use a special caching technique for checking large state spaces (in
the range of 10-100 million states, and even way beyond). This is done
using hard disk memory, but we aim at reducing the required number of
disk queries by employing caching structures in memory.
Lunar-4 scenario
A first example is the Lunar protocol, here variation 4 without any
optimization technique. We can check this protocol, which has 8.1
million states, in under a GB of main memory. With 8GB main memory,
the model can be checked without disk. The following run has been done
on a system with 2GB. On a 64-bit machine with 16GB memory, it can be
checked in less than 3 minutes!
Search complete: visited 8.119.319 states, created 8.723.018 states, search time elapsed: 1.917s, maximum state size 603
7.828.962 single successor states, 0 dead ends, 53.431 DFS problems (53 real)
Double caching: 6 explicit lookups
Bloom filter, k=1, size 1073741824: 0 partial collisions
8088540 bits set, 8119319 entries, collision rate: 1.00, fill rate:0.01
8723019 queries, 8088540 resolved
Lossy hashing with 8119319 elements; reclaimed 7727374 elements with 306.257.134 bytes in 74 runs
634479 queries, 572781 resolved
Cache failures: 61698
Diskstore: 61.698 lookups, 290 maximum lookup size, 27.575 revisits detected
71.745 pages written, 264.999 pages read, 65.536 pagesize, 128 partitions, 65.536 lookupbuf
State size histogram: 220:23, 262:3, 304:3, 346:3, 380:1, 387:1, 388:5, 434:157, 441:50, 488:3.445, 495:672, 542:77.329, 549:3.124, 596:8.633.222, 603:4.980
See the entire logfile!
This model has two fascinating properties: first, it has rather lot of
different state sizes, and the largest one has only a few states!
Also, only a very small amount of states are revisited. This helps to
perform fast, of course!

This image shows the size of the candidate set (containing states that
need to be queried against the disk), open set (containing states that
need to be processed) and the total number of visited states.
We see that 90% of the states are found within the first 10% of the
time. Afterwards, the disk checks dominate. During the checks, almost
all candidates are discovered to be unvisited. Soon enough, the
candidate set becomes very small, however, for the last few states the
disk has to be checked completely. This produces an expensive "final
phase" where the remaining states need to be checked - they might lead
to a big, unexplored region, after all. Note that the first lookup
reopens almost every candidate state, whereas the second lookup only
reopens a fractional amount.
Peterson's mutual exclusion example for 5 processes
Search complete: visited 68.907.937 states, created 378.993.534 states, search time elapsed: 3.849s, maximum state size 152
26 single successor states, 0 dead ends, 415.581 DFS problems (415.581 real)
Double caching: 5 explicit lookups
Bloom filter, k=1, size 1073741824: 0 partial collisions
66744743 bits set, 68907937 entries, collision rate: 1.03, fill rate:0.06
378993535 queries, 66744743 resolved
Lossy hashing with 68.907.937 elements; reclaimed 59871171 elements with 12.722.935.511 bytes in 60 runs
312248792 queries, 231703062 resolved
Cache failures: 80545730
Diskstore: 80.545.730 lookups, 29.999 maximum lookup size, 43.280.445 revisits detected
144.320 pages written, 2.244.434 pages read, 65.536 pagesize, 128 partitions, 30.000 lookupbuf
State size histogram: 57:2, 76:5, 95:5, 114:5, 133:5, 145:206.723.743, 152:172.269.769
See the entire logfile!
Using path reduction, 5 processes amount to 68.9 million
states. Contrary to the Lunar-4 scenario, this protocol produces a
whole bunch of revisits, namely 310 million states are visited again!
This is of course very expensive. Still, the total size of the states
visited stay short of the 16GB of main memory available (while
consuming just a little more than 4GB of memory for the model
checker), so the OS disk access caching takes over and speeds things
up pretty much - nothing is ever read from the disk! For smaller
memories, an estimated 90 minutes disk I/O overhead should be added,
but this is not yet troublesome... I expected an entire week!

A similar size developement of the open set, but much more disk
lookups are required! The "final phase" checks can be seen here,
too. If we take a closer look at the first 1000 seconds, the
cooperation of the different components of CMC becomes visible:

- For the first 280 seconds, the lossy hash table is filled. Memory
is consumed in a linear fashion, and little candidates are found -
they originate from hash collisions in the Bloom filter.
- At approx. 280 seconds, the capacity of the lossy hash table is
exceeded, and states are reclaimed. Memory consumption is slowed down,
but begins to grow again, as more candidates are found. This is due to
the increasing unreliability of the lossy hash map. New memory is
required for candidate states, and open states. Note that the open set
has linear growth in this period.
- At approx. 600 seconds, the first disk check is triggered. The
candidate set is emptied, and some states are added to the open set -
not much, however. Note that due to the partitioning, the search still
continues during this period, so upon checking the last partition,
new candidates have been found for partitions checked earlier - this
is why the candidate state count does not reach 0.
- At 620 seconds, the check is complete. Memory consumption stays
stagnant. Note that there is a rapid decrease in the open set at 680
seconds - this may be induced by checking reopened states whose
immediate environment has already been searched, and still sits in the
lossy hashing.
- At 750 seconds, we see a small increase in total memory
requirement: This is due to the growing open set (we never deallocate
memory once it is allocated, right until search termination). While
the open set posed no problem for this model, it might become the
prime problem later. However, we could store parts of it to disk - but
have to insert it to the candidate set upon loading it again, as the
states might have been visited in the meantime.
The effectiveness of the caches
The following test presents results obtained from checking models with
a different number of bits and lossy hashing capacity. This is a
little problematic due to the neccesity to do perform a quadratic
number of tests, which prohibits the use of really large models.
(This
part is definitely outdated, since it makes wrong use of the Bloom
filter (one hash function only). See the paper for the new figures.)
Leader election
The first example is the leader election protocol for 10 processes
(get the logfiles). We fixed
the reclaimahead to 10% of the maximum states in lossy hashing, the
numbers of partitions to 16, the page size to 2MB and the candidate
set to 65536 entries.
Althought the algorithm is not optimized to operate on disk only, we
report on the total runtime, mainly to show that it strongly
correlates with the pages read from disk:
The run with completely empty caches (at least almost empty caches, to
keep the algorithm running) took 1 hour and 11 minutes to complete,
whereas the algorithms that did not use the disk required just a
little more than 6 minutes. Note that the number of written pages
depends on the lossy hash size only.
Memory requirements, as measured with memtime, turn out to
be somewhat unexpected:
It is more memory consumptive to use no lossy hashing at all than
using just a little bit! I assume that this is due to the increased
amount of local transitions, which require memory for storing the
candidates etc. Admittedly, the model is rather small, being a little
shy of 2M states. The complete test took approx. 24 hours to
complete. Also, our implementation uses a rather crude memory
management that is not very good at reusing space.
A small deduction can be made on this data: Lossy hashing
does not help too much (safe the small treshold to get over the local
transitions - I guess), but consumes lots of memory. Bitstate hashing,
however, is cheap and helps a lot.
Using direct IO
For the leader election protocol, this is a test of various page sizes
with all other parameter staying fixed:
The disk page size has a large influence on the runtime. A little can
be gained by using asynchronous I/O, which is done by concurrently
reading a disk page while the first disk page is processed against the
state index. However, disk I/O is so prohibitively slow that it
dominates the total runtime too much. This leads to the idea of
spending more time on decreasing I/O times, by compressing the data;
this will be discussed in the next chapter.
Direct I/O is easy to achieve, but has some pitfalls: Just add a
O_DIRECT flag to (f)open. However, you will
get an error if the I/O buffer in memory are not
softblocksize-aligned, and their size is no multiple of the
softblocksize. The funny thing about this is that I have no clue where
to get the softblocksize for my system! However, working with the
memory page size (4096 on most Linux systems) works fine. Some
implementation details:
- Obtaining page-aligned memory can be done using
valloc. This assumes that softblocksize and memory page
size match (or the latter is a multiple of the former).
- Using
O_TRUNC slows things down to a terrible crawl
for the first write access, down to a few minutes (writing pages at a
blazing speed of approx. two seconds per page), while the second write
takes up mere seconds. This only happens if the files are deleted; it
is therefore recommended not to delete them. Funny enough, extending
the files does not take time, once the first page is written (writing
a few bytes does not suffice).
- As a rule of thumb, the larger the page size, the faster the I/O
access (save for the writing the first page). There should be hardware
limits, but I've not yet reached them (maybe they, too, are
irrelevant, as we do a linear search). For the time being, it seems
that page size is limited by memory only.
- Writing pages consumes time where nothing depends on the
writing. Using parallelized I/O (
aio_) will save the
smaller of the two time requirements of reading and checking pages. In
practice, however, the I/O time is too dominating to have any
measurable effect in using asynchronous I/O. Rather, problems with
concurrent access to memory regions used as I/O buffers becomes a big
problem, so using mprotect is recommended. For further
developement, it might be interesting to use more CPU power in order
to save I/O time. The Lempel-Ziv algorithm can provide compression down to 5%
of the original size. This amounts to about 30 bytes per state in our
examples. Of course, using just 5 bytes with hash compaction is even faster,
but voids completeness.
Compressing the disk pages
As I/O time is prohibitively expensive, every effort should be made to
keep I/O transfer as low as possible. There are two approaches here:
Using multiple partitions reduces the number of disk reads on checking
a few states. This becomes important at the end of the run, when
relatively few states need to be checked. It will also play a big role
with depth-first search. Still, most of the written pages need to be
checked on every lookup of more pages than partitions - if the
hashfunction is good (and Jenkin's hashing has shown to be quite good
when it comes to this).
The other approach is to store a compressed form of the states. Like
every compression, this may be lossless or lossful
compression. Lossless compression may be employed at the state
generator module, but can only compress one state at a time. When
writing disk pages, multiple states may be compressed together, which
is a novelty in model checking. Starting with an uncompressed state,
all subsequent states on the page may be stored by diff strings only,
which yields a compression from 15% to 50%, depending on the sorting
of states. Even better, and almost independent from state order, is
the Lempel-Ziv algorithm, as implemented in the zlib compression
library. When compressing the disk pages written by the model checker,
without compression, a ratio of a twentyfold size decrease may be
obtained.
Even better compression, at the cost of completeness, is given by hash
compaction. A hash function (we use SHA1) is used to calculate a hash
digest, which is then stored in place of the original state. The more bytes
that digest contains, the less likely a collision becomes. Using 40
bits produces a hundred-fold compression for larger models! Of course,
this produces vast improvements in runtime. However, hash compaction
provides an overapproximation on state revisits, and may thus prohibit
the detection of errors in some rare, yet possible, cases.
Using the ZLIB
The zlib compression library
provides a effective implementation of the DEFLATE algorithm, which is
a combination of the Lempel-Ziv-Algorithm and Huffmann coding. It can
be used to compress the pages written to disk, and uncompress them
again upon checking states against the disk.
The results are surprisingly good. Usually, compression ratios far
below 10% are achieved - at a relatively small overhead. This is due
to two facts about state:
- They are riddled with zeroes
- States usually differ in very few locations only
To illustrate this facts, take a look at an example dump file which contains the
first 100 states of the sfbad4 example.
This are some early results (sizes are approximates):
| model | normal time | zlib
time | uncompressed state data | compressed state
data | states |
leader.10.20 |
4:02.626 | 2:21.980 |
678MB | 20MB | 1.95M |
lunar4 |
49:21.641 | 8:48.301 |
4GB | 248MB | 8,11M |
5peterson |
3:37:46.049 | 2:04:11.811 |
8.8GB | 953MB | 68,91M |
The zlib provides a parameter for the compression quality, which can
be set from 1 to 9 and influences the compression quality as well as
the runtime required by the zlib. For the lunar4 model,
this are results obtained for the different quality settings. As you
can see, the model was checked in approx 50 minutes without the zlib,
and in less than 9 minutes with the zlib and quality setting 6.

Further tests
We did a small test run to compare different reclaim
sizes. Theoretically, it should be better to schedule only smal
portions of the state space for reclamation (the states are only
actually removed from the lossy hash table at the very moment a new
state is inserted). If few states are scheduled, states are longer
subjected to gathering of statistics, so the choice should be better.
However, this was not found in practical measurements, with the leader
election protocol (see the logs):

The runtime is influenced by the
number of lookup runs as well as the number of cache failures. The
number of cache misses is strongly decreasing (or stagnant), which is
unexpected. It can only be guessed that this is due to the premature
evaluation of state before statistical data could be gathered. The
stagnant cache-miss-ratio experienced with the outgoing and combined
evaluators (combined takes both incoming, outgoing and age into
account) indicates that this is the case; the data they base their
evaluation on does not change over time (assuming that the incoming
ratio is not affecting the treshold for the combined evaluator).
The runtime behaviour is very unexpected. Most likely, this is an artefact of the model used.
Further tests to be done
- Compare disk I/O vs. CPU time
- Compare the oblivious caches
- Compare various LZ-level in terms of total runtime/disk I/O
Moritz Hammer, Mar 06, 2006