Creating a Perfect Hash Table

A Performant, Parallel, Random Acyclic-Hypergraph Approach

Current status: draft. Last update: 10th June, 2018.

TL;DR

This article documents a recent assignment regarding the implementation of a perfect hash table. I discuss the initial goal as a set of requirements, then capture design decisions and provide an implementation overview.

Requirements

Author a perfect hash table component that provides offline table generation and fast lookup performance. Assume a key and value width of ULONG (32 bits). Optimize for key set sizes ranging from 10,000 to 40,000 keys on average, with up to 100,000 keys on the high end.

Assume a key distribution similar to that of shared library address offsets; not linear, but definitely not random, either. Do not make assumptions about the presence of leading or trailing zeros (i.e. alignment), although feel free to optimize for this if it doesn't detract from the component's performance on less predictable key sets.

Prioritize lookup speed over generation speed. Optimize the lookup algorithm to minimize latency involved in a single Value = Lookup(Key) call. Table generation will be done ahead of time, in a separate process, and only needs to ensure that tables can be generated in a reasonable amount of time given the size of the input key set. Linear overhead is ideal, quadratic less so, exponential would be infeasible.

Prioritize lookup speed over table size (memory requirements). A larger table size, within reason, is an acceptable trade-off if it yields a faster lookup speed. A minimal perfect hash table, where each key maps to exactly one table location, is not a requirement, nor is it prohibited.

Note the inevitable tradeoff in size and performance with regards to the masking method used by the implementation. Modulus-oriented solutions, those that use the % operator in C, tend to be slower (modulus division can take upward of 90 cycles), but yield smaller table sizes. Solutions relying on power-of-2 based table sizes boast much faster masking routines (e.g. Input & (Size-1)), but incur greater table size overhead.

The offline generation process takes, as its input, a key file. The file will be an array of ULONG keys (i.e. binary data). The number of elements in the array can be ascertained by dividing the file size by sizeof(ULONG). It produces, as its output, a perfect hash table file that can be subsequently loaded and used in separate processes.

Callers wishing to use a given perfect hash table will need to load the file produced in the step above. This will yield an interface from which the hash table can be interacted with. At a bare minimum, the interface should support the following semantics:

extern PULONG Table;
ULONG Lookup(ULONG Key) { return Table[PerfectHashFunction(Key)]; }
VOID Insert(ULONG Key, ULONG Value) { Table[PerfectHashFunction(Key)] = Value; }

The interface requirements are flexible and can be extended as long as the criteria above are met at a bare minimum.

The behavior of looking up or inserting a key that wasn't in the original input set is undefined. That is, the implementation is not required to detect or protect against this scenario — that is the responsibility of the caller.

Feel free to review existing works on the topic, particularly the cmph open source project, the GNU gperf library, and the plethora of papers on the subject of perfect hashing and minimal perfect hashing.

Getting Started

This was an interesting project. I'd never written a perfect hash table before, nor was I familiar with the landscape for doing such a thing. I spent about three days reviewing existing work, including the cmph project's source code. (I ended up collecting about 147 (!) documents on the topic (papers, PhD thesis, slides, etc) in my PDFs repo over the course of the project.)

Algorithm Decisions

The algorithm I settled on is the acyclic random 2-part hypergraph (or r-graph, where r = 2). The algorithm works as follows: for each key, generate two independent hash values. Mask these values such that they fall within the confines of the number of vertices picked for the table (more on this later; for now, assume the number of vertices exceeds the number of keys, or edges by at least 2x). These masked hash values now become the two vertices, and are added to a graph structure by a connecting edge. The edge is simply the 0..N index being used for enumeration, e.g.:

for (Index = 0; Index < NumberOfKeys; Index++) {

    Key = Keys[Index];

    Hash1 = HashFunction1(Key);
    Hash2 = HashFunction2(Key);

    Vertex1 = MaskHashFunction(Hash1);
    Vertex2 = MaskHashFunction(Hash2);

    Edge = Index;

    GraphAddEdge(Graph, Edge, Vertex1, Vertex2);
}

Once constructed, the graph is assessed to determine whether or not it is acyclic. If the graph is acyclic, it means every vertex has at most 1 degree of connectivity to other vertices. We want an acyclic graph. If it's not acyclic, the attempt has failed, the graph is thrown away, and a new attempt is made, using new random seed data to drive the two hash functions. Once an acyclic graph is found, it's relatively straight forward to convert this into a data structure that can be used as a perfect hash table.

This algorithm first has roots in A Versatile Data Structure for Edge-Oriented Graph Algorithms (Ebert, 1987). Its application to perfect hashing appears in A Family of Perfect Hashing Methods (Majewski, Wormald, Havas, Czech, 1996), where they focus on more rigorous proofs of the runtime complexity associated with acyclic r-graphs, extending on the work in their earlier paper, Graphs, Hypergraphs and Hashing (1994), and their initial works on the matter, An Optimal Algorithm for generating Minimal Perfect Hash Functions (Majewski, Havas, Czech, 1992).

There is one thing that stood out in their 1996 paper (page 9) that I was able to verify experimentally (after finally hacking the CHM algorithm in the CMHP project into a working state). To summarize, sans heavy math notation: the probability that we find a perfect hash solution by identifying an acyclic r-graph (r = 2) is 99.9% within 18 iterations. On average, a solution is found in √3 attempts.

This is referred to as a geometric distribution, something I hadn't come across before. It is a very desirable trait, especially for this particular problem. In essence, the more we do it, the more likely we'll figure it out. Assuming there is a solution (i.e. the hash function is performing properly), the chance of us not solving it is provably infinitesimal, which is neat.

Consider a gambler who is going all-in on heads against the Universe in an infinite game of coin toss. We are the Universe. Play it long enough, and we'll eventually see a tail.

I made the following notes around day 3 of the project:

In my experiments, even with 10 million random keys, graph creation took about 6-7 seconds on the 64-bit release build. On average it found a solution usually within 1-3 iterations. The worst-case I saw was 7 iterations.

The nice thing about the graph creation step is that each iteration can be palmed off to a threadpool worker, such that you can attempt to find a graph solution in parallel up to NCPU. On my 12 core box at home, there is a very high probability I'll find a solution in the first batch of 12 iterations submitted in parallel — thus, my clock time for solving the perfect hash stays relatively consistent at 6-7 seconds, give or take.

This algorithm is not the fastest, nor the most state-of-the-art, nor does it have the lowest bits-per-key storage requirement, nor will I be aiming for a minimal perfect hash function. However, it's simple (relatively), I understand it, I can explain it on a whiteboard without having to continually reference a paper, and it's definitely fast enough in the generation stage based on our target static key sets. It's also old; graph theory has flourished since the 60s, and this particular algorithm came onto the scene in 1992, and has been cited widely.

(The current state-of-the-art depicted in papers like Fast Scalable Construction of (Minimal Perfect Hash) Functions (Genuzio, Ottaviano, Vigna, 2016) use "lazy Gaussian elimination" to try tackle the minimal perfect hash problem on ultra-large key sets (in the billions and above). That is interesting, but not a wise choice for me to tackle in a week, nor does it improve on our target static key sets, which are very modest in size in comparison.)

Another reason I'm favoring the algorithm I've chosen is that because the generation stage is so cheap, relatively, and we have that nice probabilistic guarantee that we'll "probably" find a solution by iteration 18... that gives us a lot of leeway with regards to experimenting with the underlying hash functions used. I envision there being much faster, non-mod based hash functions we can experiment with, that actually have relatively poor "randomness" qualities unless a particularly good seed is found. Combined with the threadpool infrastructure for submitting iterations in parallel, I have a hunch that I'll be able to find some very fast hash functions that can still be solved in an acceptable amount of time. This will help greatly with our evaluation time; reducing the latency and CPU cycles required to perform the lookup.

There is one large risk item associated with my current plan: the key validation step of the cmph command line program just flat-out isn't working properly. The graph generation step *appears* to be doing the right thing, at least with regards to finding cyclic graphs and discarding them, etc. The validation step works on small key sizes, however, after about 500, I'm seeing rampant conflicts and severe degradation of the underlying hash functions (i.e. everything is hashing to 394 or 85 or something).

The last paragraph depicts some of the issues I had trying to get the cmph program to validate the perfect hash tables it was supposedly generating. Try as I might, I just couldn't get the thing to generate solutions that were actually valid (despite it reporting that they were valid) after a few hours of fiddling.

More impressively, though, is that the bug survived and still exists in my complete reimplementation of their initial modulus-oriented masking approach. This only became apparent in the latter stages of the project, when I authored more robust validation and test logic. Thankfully though, my power-of-2 based approach does work, and it's a lot faster, so, who knows. The modulus functionality was only implemented for a reference point, I didn't anticipate using it as the final "fast" version of the perfect hash solution, so I haven't investigated why it doesn't work properly any further. It's a little disconcerting, though.

Initial Design Decisions

A few of the initial guiding sentiments regarding the design follow:

I wanted to be able to easily mix and match different algorithms, hash functions and masking types. I figured this would make experimentation easier, and generally promote sensible de-coupling of internal components. Thus, the main interface for creating a perfect hash table is parameterized by three enums for the desired algorithm, hash function, and masking type, respectively. These identifiers, depicted below, are also stored with the on-disk representation of the perfect hash table, such that the loading component knows which implementations to select when preparing an interface for use.

Algorithm
Hash Function
Masking Type

//
// Define an enumeration for identifying which backend algorithm variant to
// use for creating the perfect hash table.
//

typedef enum _PERFECT_HASH_TABLE_ALGORITHM_ID {

    //
    // Explicitly define a null algorithm to take the 0-index slot.
    // This makes enum validation easier.
    //

    PerfectHashTableNullAlgorithmId = 0,

    //
    // Begin valid algorithms.
    //

    PerfectHashTableDefaultAlgorithmId = 1,
    PerfectHashTableChm01AlgorithmId = 1,

    //
    // End valid algorithms.
    //

    //
    // N.B. Keep the next value last.
    //

    PerfectHashTableInvalidAlgorithmId,

} PERFECT_HASH_TABLE_ALGORITHM_ID;
typedef PERFECT_HASH_TABLE_ALGORITHM_ID *PPERFECT_HASH_TABLE_ALGORITHM_ID;

//
// Provide a simple inline algorithm validation routine.
//

FORCEINLINE
BOOLEAN
IsValidPerfectHashTableAlgorithmId(
    _In_ PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId
    )
{
    return (
        AlgorithmId > PerfectHashTableNullAlgorithmId &&
        AlgorithmId < PerfectHashTableInvalidAlgorithmId
    );
}

//
// Define an enumeration for identifying which hash function variant to use.
//

typedef enum _PERFECT_HASH_TABLE_HASH_FUNCTION_ID {

    //
    // Explicitly define a null algorithm to take the 0-index slot.
    // This makes enum validation easier.
    //

    PerfectHashTableNullHashFunctionId              = 0,

    //
    // Begin valid hash functions.
    //

    PerfectHashTableHashCrc32RotateFunctionId       = 1,
    PerfectHashTableDefaultHashFunctionId           = 1,

    PerfectHashTableHashJenkinsFunctionId           = 2,

    //
    // N.B. The following three hash functions are purposefully terrible from
    //      the perspective of generating a good distribution of hash values.
    //      They all have very simple operations and are intended to test the
    //      theory that even with a poor hash function, once we find the right
    //      seed, the hash quality is unimportant.
    //

    PerfectHashTableHashRotateXorFunctionId         = 3,
    PerfectHashTableHashAddSubXorFunctionId         = 4,
    PerfectHashTableHashXorFunctionId               = 5,

    //
    // End valid hash functions.
    //

    //
    // N.B. Keep the next value last.
    //

    PerfectHashTableInvalidHashFunctionId,

} PERFECT_HASH_TABLE_HASH_FUNCTION_ID;
typedef PERFECT_HASH_TABLE_HASH_FUNCTION_ID
      *PPERFECT_HASH_TABLE_HASH_FUNCTION_ID;

//
// Provide a simple inline hash function validation routine.
//

FORCEINLINE
BOOLEAN
IsValidPerfectHashTableHashFunctionId(
    _In_ PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId
    )
{
    return (
        HashFunctionId > PerfectHashTableNullHashFunctionId &&
        HashFunctionId < PerfectHashTableInvalidHashFunctionId
    );
}

//
// Define an enumeration for identifying the type of table masking used by the
// underlying perfect hash table.  This has performance and size implications.
// Modulus masking typically results in smaller tables at the expenses of slower
// modulus-based hash functions.  Non-modulus masking requires power-of-2 sized
// tables, which will be larger, but the resulting mask function can be done
// by logical AND instructions, which are fast.
//

typedef enum _PERFECT_HASH_TABLE_MASK_FUNCTION_ID {

    //
    // Null masking type.
    //

    PerfectHashTableNullMaskFunctionId          = 0,

    //
    // Being valid masking types.
    //

    PerfectHashTableModulusMaskFunctionId       = 1,

    PerfectHashTableAndMaskFunctionId           = 2,
    PerfectHashTableDefaultMaskFunctionId       = 2,

    PerfectHashTableXorAndMaskFunctionId        = 3,
    PerfectHashTableFoldAutoMaskFunctionId      = 4,
    PerfectHashTableFoldOnceMaskFunctionId      = 5,
    PerfectHashTableFoldTwiceMaskFunctionId     = 6,
    PerfectHashTableFoldThriceMaskFunctionId    = 7,

    //
    // End valid masking types.
    //

    //
    // N.B. Keep the next value last.
    //

    PerfectHashTableInvalidMaskFunctionId,


} PERFECT_HASH_TABLE_MASK_FUNCTION_ID;
typedef PERFECT_HASH_TABLE_MASK_FUNCTION_ID
      *PPERFECT_HASH_TABLE_MASK_FUNCTION_ID;

//
// Provide a simple inline masking type validation routine.
//

FORCEINLINE
BOOLEAN
IsValidPerfectHashTableMaskFunctionId(
    _In_ PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
    )
{
    return (
        MaskFunctionId > PerfectHashTableNullMaskFunctionId &&
        MaskFunctionId < PerfectHashTableInvalidMaskFunctionId
    );
}

//
// Masking tends to fall into one of two buckets: modulus and not-modulus.
// Provide an inline routine that guarantees to match all current and future
// modulus masking function IDs.
//

FORCEINLINE
BOOLEAN
IsModulusMasking(
    _In_ PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
    )
{
    return MaskFunctionId == PerfectHashTableModulusMaskFunctionId;
}

The problem parallelizes incredibly well. The more threads you can have looking for an acylic graph, the better. There was no way I was implementing this as a single-threaded solution; it's rare to get a problem so well suited to a multithreaded approach, and the threadpool scaffolding provided by NT is sublime, so, that was a no-brainer.
(In fact, the way that it is currently written, you can't actually solve the graph *without* using a threadpool. You can dictate the level of concurrency you want, and specifying 1 means the threadpool only gets 1 thread, which makes debugging easier, but there is no way to isolate the solving process to avoid this. This is by design.)

Memory map all the things. All file system interaction is achieved via sections and memory maps. There is a separate threadpool used for handling file-oriented work (such as saving the solution to disk in parallel whilst the main thread verifies it is correct — which it will be unless we've got internal bugs). Memory mappings are used for the source .keys file, the resulting .pht1 file for the perfect hash table, and the .pht1:Info NTFS stream used to capture metadata about the perfect hash table (such as which algorithm, hash function, and masking function was used, sizes, stats etc).

For handling testing, due to the limited time constraints, I went for a big-bang systems level "self-test". Coupled with aggressive internal ASSERT()ing, this worked out pretty well. A self-test .exe is provided, which, when pointed at a data directory and given a set of algorithm, hash and mask IDs, will process all *.keys files in the directory; for each one, a perfect hash file is created, the on-disk version is then loaded and subsequently tested. This is all handled by the routine SelfTestPerfectHashTable, which also serves as a good example for how to exercise the entire system end-to-end.

Current Performance

(Note: this section is an active work-in-progress. It documents observed performance at the time of writing: 10th June, 2018. It is the most recently authored section in this article, and thus, depicts newer code samples and implementation details compared to the subsequent sections. (For example, the FastIndex() and FastIndexUnsafe() routines were only recently added as part of performance tuning and benchmarking work; these routines aren't mentioned in subsequent sections yet.)

The performance of the solution can be evaulated in two distinct areas: the performance of the perfect hash table generation functionality, i.e. how quickly can we create a perfect hash table for any given key set, and the lookup performance of the resulting table, i.e. how many cycles does it take to look up a key (best case, worst case), how many memory references are made for a key lookup, etc.

Performance TL;DR

Generation is really fast. Although the parallelism helps immensely, the underlying acyclic hypergraph algorithm described in the original papers deserves all of the credit.

(That being said, I couldn't get their modulus-based algorithm to generate valid solutions, either in their original code or my reimplementation of their algorithm. I'm glad I got my AND-based masking version working, otherwise this assignment would have been a disaster.)

The hash function I use has three CRC32 instructions, one rotate, and one XOR. The latency for just the hashing logic is sub 10 cycles, give or take.

The Jenkins hash function, for comparison, clocks in at about 30 cycles for just the hashing logic. If Jenkins is paired with the original modulus masking, the minimum latency jumps to over 100 cycles.

I investigated even cheaper hash functions, such as simply XOR'ing the input with two random seeds. They often work on smaller graphs, but the CRC32 + rotate + XOR already offers very attractive latency and good randomness, so it's a better general purpose option. Also, reducing the hashing cost from sub 10 to, say, sub 6 cycles — we're completely limited by the overhead of the current function signature and table layout at that point (i.e. we could shave off more if we passed the seeds in as parameters via r8 and r9 instead of constantly loading them from the table).

The FastIndex() routine (i.e. what Lookup() and Insert() call in order to obtain the array offset for a given key) requires a minimum of two memory references within the assigned array in order to construct the final index. All other memory references within the routine will be served from the L1 cache with high probability (e.g. the seed values).

The best case scenario, where everything results in an L1 cache hit, is about 23 cycles of latency. I think I could get this down to about 15 cycles with an optimized LEAF_ENTRY routine written in assembly, however, it doesn't make sense to pursue that until more robust tools are in place for measuring performance.

Worst case scenario is cache misses for both lookups, resulting in a fetch from main memory. That'll be hundreds of cycles of penalty. Smaller table sizes should reduce the likelihood of misses on the lookup fast path, however, access patterns will factor heavily into this; if, say, the most frequent keys accessed fall within a smaller subset of the total key set — the penalty for a larger table is offset. (E.g. 90% of the lookups are on 10% of the keys.) Making real-world key frequency metrics available during table creation would definitely add value when attempting to assess the cost of a larger table size.

Additional areas I'd look to improve performance:
- Analysis of the values in the input key set. What's the first value, what's the last value, what's the minimum leading zeros, maximum leading zeros, ditto for trailing zeros. What's the value distribution look like? What's the gap between values look like? Certain patterns will emerge for target key sets that will allow for specific optimizations (like shifting away known min trailing zeros, etc). I currently don't look at a single key value during the creation step from the perspective of directing the underlying algorithm.
- Investigate prefetching. We could potentially dispatch a prefetch (especially a prefetch with write-intent in the Insert() case) early enough to offset some of the memory latency involved in the two-part vertex lookup.
- Write a LEAF_ENTRY in assembly that has zero error checking, no COM dependencies (returns the index in rax, not HRESULT + output parameter), and takes the seed values in registers. (Regarding error checking: our FastIndex() routine still does the sanity check that Vertex1 != Vertex2 — the only way to trigger an E_FAIL return code in this case is passing in a key that isn't in the original input set, AND having it trigger a collision with the selected hash function and random seed data such that the same value is produced for both values. The chances of this are very, very slim with reasonable hash functions and good seed data. So, we could omit the S_OK or E_FAIL return codes in favor of returning the index directly in rax.)
- Currently, the user is responsible for providing the algorithm, hash function and masking type to use in order to create the table. With the current work on benchmarking functions, this could be improved such that the creation routine is given a goal to maximize, e.g. lowest latency for FastIndex(), and then it tries all known algo/hash/masking variants until it finds the best solution. This wouldn't take that much more effort and would ensure that the most optimal combo is always being picked for a given key set.

Generation Performance

Broadly speaking, generation performance is very good. The parallel approach to searching for acyclic graphs has paid off quite nicely — the thing just flies. The current implementation is incredibly defensive at the moment, too, and includes numerous expensive operations that could be relegated to a debug build or removed completely.

For example, in order to check the incoming keys are unique, I simply allocate a 512MB bitmap and set a bit for every key I see in the range of 0-MAX_ULONG. I verify I only see every key once, and then as an extra check, count the number of set bits to ensure they match the reported number of keys. This is overly defensive and very abusive to the cache.

Similarly, once a graph has been found, an initial internal verification routine walks the entire graph and verifies none of the assigned indices for the entire key set appear more than once. This happens before the final solution is saved to disk. Then, after we load the table from the on-disk instance, we do another thorough system test, exercising all of the public API routines (e.g. Index(), Lookup(), Insert(), etc), verifying the table is behaving correctly and keys are being mapped to unique indices.

Even with all this additional overhead in place, a solution is generally found within about 5 seconds on my machines, even for the largest key sets we support (currently around ~750,000 for no particular reason other than I haven't tested larger yet). For the key set sizes more representative of our target domain (e.g. around 10,000 to 100,000), the generation time feels like it's basically instant, at least from the user perception of running the command line and waiting for it to report back that it solved the graph.

One aspect worth noting is related to the initial hunch I expressed earlier:

.... I envision there being much faster, non-mod based hash functions we can experiment with, that actually have relatively poor "randomness" qualities unless a particularly good seed is found. Combined with the threadpool infrastructure for submitting iterations in parallel, I have a hunch that I'll be able to find some very fast hash functions that can still be solved in an acceptable amount of time.

Just to elaborate a bit further on that note — I was anticipating a situation where, when fed some overly-simple hash functions, the graph solving would take far longer than the 99.9% probability of success by iteration 18 guarantees cited by the academic work on the subject.

For example, assume we've got a 37,000 key set that solves on the first set of attempts (i.e. within 1..NCPU), when fed a good, but expensive, hashing function.

I pictured a scenario whereby, if fed a poorer-but-cheaper hash function, a solution would be found in, say, 2 minutes. Or 5 minutes. Or basically any time period that falls within our gut feeling of being acceptable for the given input key set size.

That hunch appears to be wrong; I have not seen any evidence of this in practice. A graph will either be solved within the those first ~20 attempts, or it will never be solved. On numerous occasions, I left the solver running overnight, such that I had 11 out of 12 cores searching for solutions at 5GHz for 8 hours, and... nada. No solution was ever found. We often got really close — I put in functionality to track the highest number of deleted edges we'd seen during solving, and remember seeing that we were regularly hitting 499,987 for a key set size of 500,000 (so 13 away from finding a solution). Graphs would either solve very quickly, or not at all.

Something else I found interesting is how quickly graphs would be solved once I implemented support for detecting when the attempts have exceeded a certain threshold (currently 100) and then resizing the table such that we double the number of underlying vertices, and re-trying the graph solving in parallel. Once this was in place, graphs would always be rapidly solved. Supplying poor hash functions would result in a number of table resize events, but as soon as the right size was found, the graph would be solved within the first batch of attempts.

Lookup Performance

Ironically, the most critical piece of this project — optimizing the performance of the Insert() and Lookup() routines — received the least amount of attention, at least in relation to the other components like the graph solving or threadpool scaffolding.

The first hash function I half-heartedly wrote as a quick stop-gap, is, amusingly, still the best performing one in terms of offering the best trade-off between lookup latency and hash quality (with the latter being proportional to underlying table size). (It uses three CRC32 instructions, a rotate, and an XOR. I creatively gave it the moniker Crc32Rotate.)

Smaller tables are ideal — the minimum table size we can hope for when using power-of-2 masking is the number of keys, rounded up to a power of two, yielding our edge count, and then rounding that up to the next power-of-2, yielding our vertices count, and thus, table size.

If two hash functions yield a solution from the same vertices count, the function with the lower latency (cycles per lookup) is preferable, because there is simply no benefit to spending extra cycles in order to try improve the quality of the hash.

However, the same isn't necessarily true for two hash functions that yield different table sizes. Every power-of-2 increase in table size results in a larger memory footprint, which translates to a larger cache footprint, which translates to an increase likelihood of cache misses when looking up the two vertex values from the assigned array, which directly impacts performance. The type of access pattern greatly affects performance in this scenario too; if a subset of keys are looked up far more frequently than the others, then they're likely to be in the cache versus not. However, if the access pattern is truly random, the likelihood of a cache miss during lookup is increased, because there's a larger footprint of possible values the keys could be hashing to.

In order to give the Crc32Rotate hash function something to be evaluated against, I eventually ended up authoring a version of the Jenkins hash, which is widely used and cited, and is the hash function used by the authors of the CMHP package. I wasn't particularly attracted to it when first picking a hash function as it has long dependency chains in its instruction stream, which inhibit the out-of-order and superscalar nature of contemporary CPUs (the Jenkins hash was authored in the late 90s, for what it's worth).

Let's take a look at both implementations before reviewing performance. We'll review the FastIndex() implementations of each routine, which inline the AND-based masking such that we don't have to make three function calls through the COM vtbl pipeline to perform what equates to a very simple instruction (Value & (TableSize - 1)). (Note: the FastIndex() idiom was only added very recently as part of the benchmarking work; thus, it's not mentioned in the later code walkthrough sections yet.)

Crc32Rotate
Jenkins

_Use_decl_annotations_
HRESULT
PerfectHashTableFastIndexImplChm01Crc32RotateHashAndMask(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    PULONG Index
    )
/*++

Routine Description:

    Looks up given key in a perfect hash table and returns its index.  This
    is a fast version of the normal Index() routine that inlines the Crc32Rotate
    hash function and AND masking.

    N.B. If Key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         key will hash to either an existing key's location or an empty slot,
         so there is potential for returning a non-unique index.)

Arguments:

    Table - Supplies a pointer to the table for which the key lookup is to be
        performed.

    Key - Supplies the key to look up.

    Index - Receives the index associated with this key.  The index will be
        between 0 and Table->HashSize-1, and can be safely used to offset
        directly into an appropriately sized array (e.g. Table->Values[]).

Return Value:

    S_OK on success, E_FAIL if the underlying hash function returned a failure.
    This will happen if the two hash values for a key happen to be identical.
    It shouldn't happen once a perfect graph has been created (i.e. it only
    happens when attempting to solve the graph).  The Index parameter will
    be cleared in the case of E_FAIL.

--*/
{
    ULONG A;
    ULONG B;
    ULONG C;
    ULONG D;
    ULONG Seed1;
    ULONG Seed2;
    ULONG Seed3;
    ULONG Input;
    PULONG Seeds;
    ULONG Masked;
    ULONG Vertex1;
    ULONG Vertex2;
    PULONG Assigned;
    ULONG MaskedLow;
    ULONG MaskedHigh;
    ULONGLONG Combined;

    //IACA_VC_START();

    //
    // Initialize aliases.
    //

    Seeds = &Table->Header->FirstSeed;
    Seed1 = Seeds[0];
    Seed2 = Seeds[1];
    Seed3 = Seeds[2];
    Input = Key;
    Assigned = Table->Data;

    //
    // Calculate the individual hash parts.
    //

    A = _mm_crc32_u32(Seed1, Input);
    B = _mm_crc32_u32(Seed2, _rotl(Input, 15));
    C = Seed3 ^ Input;
    D = _mm_crc32_u32(B, C);

    //IACA_VC_END();

    Vertex1 = A;
    Vertex2 = D;

    if (Vertex1 == Vertex2) {
        goto Error;
    }

    //
    // Mask each hash value such that it falls within the confines of the
    // number of vertices.
    //

    MaskedLow = Vertex1 & Table->HashMask;
    MaskedHigh = Vertex2 & Table->HashMask;

    //
    // Obtain the corresponding vertex values for the masked high and low hash
    // values.  These are derived from the "assigned" array that we construct
    // during the creation routine's assignment step (GraphAssign()).
    //

    Vertex1 = Assigned[MaskedLow];
    Vertex2 = Assigned[MaskedHigh];

    //
    // Combine the two values, then perform the index masking operation, such
    // that our final index into the array falls within the confines of the
    // number of edges, or keys, in the table.  That is, make sure the index
    // value is between 0 and Table->Keys->NumberOfElements-1.
    //

    Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;

    Masked = Combined & Table->IndexMask;

    //
    // Update the caller's pointer and return success.  The resulting index
    // value represents the array offset index for this given key in the
    // underlying table, and is guaranteed to be unique amongst the original
    // keys in the input set.
    //

    *Index = Masked;

    //IACA_VC_END();

    return S_OK;

Error:

    //
    // Clear the caller's pointer and return failure.  We should only hit this
    // point if the caller supplies a key that both: a) wasn't in the original
    // input set, and b) happens to result in a hash value where both the high
    // part and low part are identical, which is rare, but not impossible.
    //

    *Index = 0;
    return E_FAIL;
}

_Use_decl_annotations_
HRESULT
PerfectHashTableFastIndexImplChm01JenkinsHashAndMask(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    PULONG Index
    )
/*++

Routine Description:

    Looks up given key in a perfect hash table and returns its index.  This
    is a fast version of the normal Index() routine that inlines the Jenkins
    hash function and AND masking.

    N.B. If Key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         key will hash to either an existing key's location or an empty slot,
         so there is potential for returning a non-unique index.)

Arguments:

    Table - Supplies a pointer to the table for which the key lookup is to be
        performed.

    Key - Supplies the key to look up.

    Index - Receives the index associated with this key.  The index will be
        between 0 and Table->HashSize-1, and can be safely used to offset
        directly into an appropriately sized array (e.g. Table->Values[]).

Return Value:

    S_OK on success, E_FAIL if the underlying hash function returned a failure.
    This will happen if the two hash values for a key happen to be identical.
    It shouldn't happen once a perfect graph has been created (i.e. it only
    happens when attempting to solve the graph).  The Index parameter will
    be cleared in the case of E_FAIL.

--*/
{
    ULONG A;
    ULONG B;
    ULONG C;
    ULONG D;
    ULONG E;
    ULONG F;
    PBYTE Byte;
    ULONG Seed1;
    ULONG Seed2;
    ULONG Input;
    PULONG Seeds;
    ULONG Masked;
    ULONG Vertex1;
    ULONG Vertex2;
    PULONG Assigned;
    ULONG MaskedLow;
    ULONG MaskedHigh;
    ULONGLONG Combined;

    //
    // Initialize aliases.
    //

    //IACA_VC_START();

    Seeds = &Table->Header->FirstSeed;
    Seed1 = Seeds[0];
    Seed2 = Seeds[1];
    Input = Key;

    Byte = (PBYTE)&Input;

    //
    // Generate the first hash.
    //

    A = B = 0x9e3779b9;
    C = Seed1;

    A += (((ULONG)Byte[3]) << 24);
    A += (((ULONG)Byte[2]) << 16);
    A += (((ULONG)Byte[1]) <<  8);
    A += ((ULONG)Byte[0]);

    A -= B; A -= C; A ^= (C >> 13);
    B -= C; B -= A; B ^= (A <<  8);
    C -= A; C -= B; C ^= (B >> 13);
    A -= B; A -= C; A ^= (C >> 12);
    B -= C; B -= A; B ^= (A << 16);
    C -= A; C -= B; C ^= (B >>  5);
    A -= B; A -= C; A ^= (C >>  3);
    B -= C; B -= A; B ^= (A << 10);
    C -= A; C -= B; C ^= (B >> 15);

    Vertex1 = C;

    //
    // Generate the second hash.
    //

    D = E = 0x9e3779b9;
    F = Seed2;

    D += (((ULONG)Byte[3]) << 24);
    D += (((ULONG)Byte[2]) << 16);
    D += (((ULONG)Byte[1]) <<  8);
    D += ((ULONG)Byte[0]);

    D -= E; D -= F; D ^= (F >> 13);
    E -= F; E -= D; E ^= (D <<  8);
    F -= D; F -= E; F ^= (E >> 13);
    D -= E; D -= F; D ^= (F >> 12);
    E -= F; E -= D; E ^= (D << 16);
    F -= D; F -= E; F ^= (E >>  5);
    D -= E; D -= F; D ^= (F >>  3);
    E -= F; E -= D; E ^= (D << 10);
    F -= D; F -= E; F ^= (E >> 15);

    //IACA_VC_END();

    Vertex2 = F;

    if (Vertex1 == Vertex2) {
        goto Error;
    }

    //
    // Mask each hash value such that it falls within the confines of the
    // number of vertices.
    //

    MaskedLow = Vertex1 & Table->HashMask;
    MaskedHigh = Vertex2 & Table->HashMask;

    //
    // Obtain the corresponding vertex values for the masked high and low hash
    // values.  These are derived from the "assigned" array that we construct
    // during the creation routine's assignment step (GraphAssign()).
    //

    Assigned = Table->Data;

    Vertex1 = Assigned[MaskedLow];
    Vertex2 = Assigned[MaskedHigh];

    //
    // Combine the two values, then perform the index masking operation, such
    // that our final index into the array falls within the confines of the
    // number of edges, or keys, in the table.  That is, make sure the index
    // value is between 0 and Table->Keys->NumberOfElements-1.
    //

    Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;

    Masked = Combined & Table->IndexMask;

    //
    // Update the caller's pointer and return success.  The resulting index
    // value represents the array offset index for this given key in the
    // underlying table, and is guaranteed to be unique amongst the original
    // keys in the input set.
    //

    *Index = Masked;

    //IACA_VC_END();

    return S_OK;

Error:

    //
    // Clear the caller's pointer and return failure.  We should only hit this
    // point if the caller supplies a key that both: a) wasn't in the original
    // input set, and b) happens to result in a hash value where both the high
    // part and low part are identical, which is rare, but not impossible.
    //

    *Index = 0;
    return E_FAIL;
}

As you can see, the CRC32 routine is doing far less work to derive the two 32-bit hash values for a given input key. The IACA profiles for each routine tell a similar story:

Crc32Rotate (IACA)
Jenkins (IACA)

S:\tracer>iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 19.00 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  2.5     0.0  |  2.0  |  7.0     7.0  |  7.0     7.0  |  2.0  |  2.5  |  2.0  |  2.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | mov rbx, r8
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov r9, qword ptr [rcx+0xc8]
|   1*     |             |      |             |             |      |      |      |      | mov r10, rcx
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov edi, dword ptr [r9+0x64]
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov eax, dword ptr [r9+0x6c]
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov r11d, dword ptr [r9+0x68]
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | xor eax, edx
|   0X     |             |      |             |             |      |      |      |      | crc32 edi, edx
|   2      | 1.0         |      |             |             |      |      | 1.0  |      | rol edx, 0xf
|   0X     |             |      |             |             |      |      |      |      | crc32 r11d, edx
|   0X     |             |      |             |             |      |      |      |      | crc32 r11d, eax
|   1*     |             |      |             |             |      |      |      |      | cmp edi, r11d
|   0*F    |             |      |             |             |      |      |      |      | jnz 0x19
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [r8], 0x0
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | mov eax, 0x80004005
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rdi, qword ptr [rsp+0x10]
|   3^     |             |      | 1.0     1.0 |             |      |      |      |      | ret
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov r8, qword ptr [rcx+0x98]
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [rcx+0x44]
|   1*     |             |      |             |             |      |      |      |      | mov eax, edi
|   1#     |             | 0.5  |             | 1.0     1.0 |      | 0.5  |      |      | mov rdi, qword ptr [rsp+0x10]
|   1      | 0.5         |      |             |             |      | 0.5  |      |      | and rax, rcx
|   1*     |             |      |             |             |      |      |      |      | mov edx, r11d
|   1      |             | 0.5  |             |             |      |      | 0.5  |      | and rdx, rcx
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [r8+rdx*4]
|   2      | 0.5         |      |             | 1.0     1.0 |      | 0.5  |      |      | add ecx, dword ptr [r8+rax*4]
|   1*     |             |      |             |             |      |      |      |      | xor eax, eax
|   2^     | 0.5         |      | 1.0     1.0 |             |      | 0.5  |      |      | and ecx, dword ptr [r10+0x48]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rbx], ecx
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
Total Num Of µops: 34

S:\tracer>iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 42.00 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles | 21.0     0.0  | 22.0  |  7.5     7.5  |  7.5     7.5  |  2.0  | 22.0  | 22.0  |  2.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov r10, qword ptr [rcx+0xc8]
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x10
|   1*     |             |      |             |             |      |      |      |      | mov r11d, edx
|   1*     |             |      |             |             |      |      |      |      | movzx edx, al
|   1*     |             |      |             |             |      |      |      |      | mov rbx, rcx
|   1      | 1.0         |      |             |             |      |      |      |      | shr r11d, 0x18
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov esi, dword ptr [r10+0x64]
|   1*     |             |      |             |             |      |      |      |      | mov rdi, r8
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov r10d, dword ptr [r10+0x68]
|   1      |             | 1.0  |             |             |      |      |      |      | mov r8d, 0x9e3779b9
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x8
|   1*     |             |      |             |             |      |      |      |      | movzx ecx, al
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xd
|   1      |             |      |             |             |      |      | 1.0  |      | shl r11d, 0x8
|   1      |             |      |             |             |      | 1.0  |      |      | add r11d, edx
|   1*     |             |      |             |             |      |      |      |      | movzx r9d, r9b
|   1*     |             |      |             |             |      |      |      |      | mov edx, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl r11d, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | sub edx, esi
|   1      |             |      |             |             |      | 1.0  |      |      | add r11d, ecx
|   1*     |             |      |             |             |      |      |      |      | mov ecx, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | shl r11d, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | sub ecx, esi
|   1      |             |      |             |             |      | 1.0  |      |      | add edx, r11d
|   1      |             | 1.0  |             |             |      |      |      |      | xor edx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, r10d
|   1      | 1.0         |      |             |             |      |      |      |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      |             |      |             |             |      |      | 1.0  |      | shl eax, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | sub r8d, r10d
|   1      |             |      |             |             |      | 1.0  |      |      | xor ecx, eax
|   1      | 1.0         |      |             |             |      |      |      |      | add r9d, r11d
|   1      |             | 1.0  |             |             |      |      |      |      | sub esi, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xd
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, edx
|   1      | 1.0         |      |             |             |      |      |      |      | xor esi, eax
|   1      |             | 1.0  |             |             |      |      |      |      | sub edx, esi
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xc
|   1      |             |      |             |             |      | 1.0  |      |      | sub edx, ecx
|   1      | 1.0         |      |             |             |      |      |      |      | xor edx, eax
|   1      |             | 1.0  |             |             |      |      |      |      | sub ecx, esi
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      |             |      |             |             |      |      | 1.0  |      | shl eax, 0x10
|   1      | 1.0         |      |             |             |      |      |      |      | xor ecx, eax
|   1      |             | 1.0  |             |             |      |      |      |      | sub esi, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x5
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, edx
|   1      | 1.0         |      |             |             |      |      |      |      | xor esi, eax
|   1      |             | 1.0  |             |             |      |      |      |      | sub edx, esi
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x3
|   1      |             |      |             |             |      | 1.0  |      |      | sub edx, ecx
|   1      | 1.0         |      |             |             |      |      |      |      | xor edx, eax
|   1      |             | 1.0  |             |             |      |      |      |      | sub ecx, esi
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      |             |      |             |             |      |      | 1.0  |      | shl eax, 0xa
|   1      | 1.0         |      |             |             |      |      |      |      | xor ecx, eax
|   1*     |             |      |             |             |      |      |      |      | mov eax, r10d
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xd
|   1      |             | 1.0  |             |             |      |      |      |      | sub esi, ecx
|   1      |             |      |             |             |      | 1.0  |      |      | xor r9d, eax
|   1      | 1.0         |      |             |             |      |      |      |      | shr ecx, 0xf
|   1      |             | 1.0  |             |             |      |      |      |      | sub r8d, r9d
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      |             |      |             |             |      |      | 1.0  |      | xor esi, ecx
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r10d, r8d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | sub r10d, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xd
|   1      |             | 1.0  |             |             |      |      |      |      | xor r10d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, r10d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r10d
|   1      |             |      |             |             |      |      | 1.0  |      | sub r9d, r8d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xc
|   1      |             | 1.0  |             |             |      |      |      |      | xor r9d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r8d, r10d
|   1      |             |      |             |             |      |      | 1.0  |      | sub r8d, r9d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x10
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r10d, r8d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | sub r10d, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0x5
|   1      |             | 1.0  |             |             |      |      |      |      | xor r10d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, r10d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r10d
|   1      |             |      |             |             |      |      | 1.0  |      | sub r9d, r8d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0x3
|   1      |             | 1.0  |             |             |      |      |      |      | sub r8d, r10d
|   1      |             |      |             |             |      | 1.0  |      |      | xor r9d, eax
|   1      |             |      |             |             |      |      | 1.0  |      | sub r8d, r9d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0xa
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r10d, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | shr r8d, 0xf
|   1      | 1.0         |      |             |             |      |      |      |      | sub r10d, r9d
|   1      |             | 1.0  |             |             |      |      |      |      | xor r10d, r8d
|   1*     |             |      |             |             |      |      |      |      | cmp esi, r10d
|   0*F    |             |      |             |             |      |      |      |      | jnz 0x1d
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdi], 0x0
|   1      |             |      |             |             |      | 1.0  |      |      | mov eax, 0x80004005
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rsi, qword ptr [rsp+0x10]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rdi, qword ptr [rsp+0x18]
|   3^     |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | ret
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov edx, dword ptr [rbx+0x44]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov r8, qword ptr [rbx+0x98]
|   1*     |             |      |             |             |      |      |      |      | mov ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, r10d
|   1      |             |      |             |             |      |      | 1.0  |      | and rcx, rax
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1#     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rsi, qword ptr [rsp+0x10]
|   1      |             | 1.0  |             |             |      |      |      |      | and rdx, rax
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov eax, dword ptr [r8+rcx*4]
|   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | add eax, dword ptr [r8+rdx*4]
|   2^     |             |      | 0.5     0.5 | 0.5     0.5 |      |      | 1.0  |      | and eax, dword ptr [rbx+0x48]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdi], eax
|   1*     |             |      |             |             |      |      |      |      | xor eax, eax
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rdi, qword ptr [rsp+0x18]
Total Num Of µops: 138

A Note on Modulus Masking

For the sake of comparison, let's take a look at the IACA profile of the modulus-masking based approach used in the original academic papers. (Note that I wasn't able to get their modulus-masking algorithm to generate valid graphs without collisions — both in the original CMPH project and my reimplementation of their algorithm. So, the following routine isn't used anywhere. It is useful for depicting the impact of the modulus operator, though.)

Converting the AND-masking to a %-mask for the Jenkins hash function results in a diff that looks like this:


% diff -u JenkinsAnd.c JenkinsMod.c
--- JenkinsAnd.c        2018-06-10 20:48:49.330504900 -0700
+++ JenkinsMod.c        2018-06-10 20:50:06.373304600 -0700
@@ -1,6 +1,6 @@
 _Use_decl_annotations_
 HRESULT
-PerfectHashTableFastIndexImplChm01JenkinsHashAndMask(
+PerfectHashTableFastIndexImplChm01JenkinsHashModMask(
     PPERFECT_HASH_TABLE Table,
     ULONG Key,
     PULONG Index
@@ -130,8 +130,8 @@
     // number of vertices.
     //

-    MaskedLow = Vertex1 & Table->HashMask;
-    MaskedHigh = Vertex2 & Table->HashMask;
+    MaskedLow = Vertex1 % Table->HashModulus;
+    MaskedHigh = Vertex2 % Table->HashModulus;

     //
     // Obtain the corresponding vertex values for the masked high and low hash
@@ -153,7 +153,7 @@

     Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;

-    Masked = Combined & Table->IndexMask;
+    Masked = Combined % Table->IndexModulus;

     //
     // Update the caller's pointer and return success.  The resulting index

The IACA profile is as follows:

Jenkins-Modulus (IACA)

S:\tracer>iaca x64\Release\PerfectHashTable.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\PerfectHashTable.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 119.00 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles | 30.0     0.0  | 28.0  |  7.0     7.0  |  7.0     7.0  |  2.0  | 29.0  | 29.0  |  2.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov r10, qword ptr [rcx+0xc8]
|   1*     |             |      |             |             |      |      |      |      | mov r9d, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1*     |             |      |             |             |      |      |      |      | mov r11d, edx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x10
|   1*     |             |      |             |             |      |      |      |      | mov rdi, r8
|   1*     |             |      |             |             |      |      |      |      | movzx edx, al
|   1      |             | 1.0  |             |             |      |      |      |      | mov r8d, 0x9e3779b9
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov esi, dword ptr [r10+0x64]
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x8
|   1*     |             |      |             |             |      |      |      |      | movzx ecx, al
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xd
|   1      |             |      |             |             |      |      | 1.0  |      | shr r11d, 0x18
|   1      | 1.0         |      |             |             |      |      |      |      | shl r11d, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | add r11d, edx
|   1*     |             |      |             |             |      |      |      |      | movzx r9d, r9b
|   1*     |             |      |             |             |      |      |      |      | mov edx, r9d
|   1      |             |      |             |             |      |      | 1.0  |      | shl r11d, 0x8
|   1      |             |      |             |             |      | 1.0  |      |      | sub edx, esi
|   1      |             | 1.0  |             |             |      |      |      |      | add r11d, ecx
|   1*     |             |      |             |             |      |      |      |      | mov ecx, r8d
|   1      | 1.0         |      |             |             |      |      |      |      | shl r11d, 0x8
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, esi
|   1      |             | 1.0  |             |             |      |      |      |      | add edx, r11d
|   1      |             |      |             |             |      | 1.0  |      |      | xor edx, eax
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xd
|   1      | 1.0         |      |             |             |      |      |      |      | sub esi, edx
|   1      |             | 1.0  |             |             |      |      |      |      | xor esi, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub edx, esi
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xc
|   1      | 1.0         |      |             |             |      |      |      |      | sub edx, ecx
|   1      |             | 1.0  |             |             |      |      |      |      | xor edx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, esi
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x10
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x5
|   1      | 1.0         |      |             |             |      |      |      |      | sub esi, edx
|   1      |             | 1.0  |             |             |      |      |      |      | xor esi, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub edx, esi
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0x3
|   1      | 1.0         |      |             |             |      |      |      |      | sub edx, ecx
|   1      |             | 1.0  |             |             |      |      |      |      | xor edx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, esi
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, edx
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0xa
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub esi, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | shr ecx, 0xf
|   1      | 1.0         |      |             |             |      |      |      |      | sub esi, edx
|   1      |             | 1.0  |             |             |      |      |      |      | xor esi, ecx
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [r10+0x68]
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | sub r8d, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      | 1.0         |      |             |             |      |      |      |      | add r9d, r11d
|   1      |             |      |             |             |      |      | 1.0  |      | shr eax, 0xd
|   1      |             | 1.0  |             |             |      |      |      |      | xor r9d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r8d, r9d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x8
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, r8d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xd
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | sub r9d, r8d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0xc
|   1      |             | 1.0  |             |             |      |      |      |      | xor r9d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r8d, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | sub r8d, r9d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0x10
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, r8d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | sub ecx, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0x5
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub r9d, ecx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      |             |             |      |      | 1.0  |      | sub r9d, r8d
|   1      | 1.0         |      |             |             |      |      |      |      | shr eax, 0x3
|   1      |             | 1.0  |             |             |      |      |      |      | sub r8d, ecx
|   1      |             |      |             |             |      | 1.0  |      |      | xor r9d, eax
|   1      |             |      |             |             |      |      | 1.0  |      | sub r8d, r9d
|   1*     |             |      |             |             |      |      |      |      | mov eax, r9d
|   1      | 1.0         |      |             |             |      |      |      |      | shl eax, 0xa
|   1      |             | 1.0  |             |             |      |      |      |      | xor r8d, eax
|   1      |             |      |             |             |      | 1.0  |      |      | sub ecx, r8d
|   1      |             |      |             |             |      |      | 1.0  |      | shr r8d, 0xf
|   1      | 1.0         |      |             |             |      |      |      |      | sub ecx, r9d
|   1      |             | 1.0  |             |             |      |      |      |      | xor ecx, r8d
|   1*     |             |      |             |             |      |      |      |      | cmp esi, ecx
|   0*F    |             |      |             |             |      |      |      |      | jnz 0x1d
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdi], 0x0
|   1      |             |      |             |             |      | 1.0  |      |      | mov eax, 0x80004005
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov rsi, qword ptr [rsp+0x10]
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rdi, qword ptr [rsp+0x18]
|   3^     |             |      | 1.0     1.0 |             |      |      |      |      | ret
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov r10, qword ptr [rbx+0x98]
|   1*     |             |      |             |             |      |      |      |      | xor edx, edx
|   1*     |             |      |             |             |      |      |      |      | mov eax, ecx
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov ecx, dword ptr [rbx+0x58]
|   0X     |             |      |             |             |      |      |      |      | div dword ptr [rbx+0x54]
|   1*     |             |      |             |             |      |      |      |      | mov eax, esi
|   1#     | 1.0         |      |             | 1.0     1.0 |      |      |      |      | mov rsi, qword ptr [rsp+0x10]
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov r9d, dword ptr [r10+rdx*4]
|   1*     |             |      |             |             |      |      |      |      | xor edx, edx
|   0X     |             |      |             |             |      |      |      |      | div dword ptr [rbx+0x54]
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rbx, qword ptr [rsp+0x8]
|   1      |             |      | 1.0     1.0 |             |      |      |      |      | mov eax, dword ptr [r10+rdx*4]
|   1*     |             |      |             |             |      |      |      |      | xor edx, edx
|   1      |             |      |             |             |      |      | 1.0  |      | add rax, r9
|  32      | 9.0         | 7.0  |             |             |      | 9.0  | 7.0  |      | div rcx
|   1*     |             |      |             |             |      |      |      |      | xor eax, eax
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | mov dword ptr [rdi], edx
|   1      |             |      |             | 1.0     1.0 |      |      |      |      | mov rdi, qword ptr [rsp+0x18]
Total Num Of µops: 168

The three div operations bump the µop count from 119 up to 168, a 1.4x increase. The block throughput is even worse, jumping from 43 to 119, ~2.7x increase. At 19 cycles, the CRC32-Rotate version is clearly in the lead, 2.2x better than Jenkins, and 6.2x better than Jenkins with modulus masking as per the original paper.

In Defense of Jenkins...

It definitely produces a better quality hash than the CRC32-based routine. This is evidenced by the fact that, against our 50 auto-generated linear/random key files of varying sizes — I've not seen Jenkins result in a single table resize event.

On the other hand, on the same data set, the CRC32 routine will typically need to resort to around 15 table resize events in order to find a solution. The affected 15 files are all the random ones (I've not seen any linear ones needing resizing, even the large ones). There could be some pathological issue that is easily fixed, but I haven't investigated in detail, yet. On all of the key sets that are more representative of our target, the CRC32 routine obtains a solution easily.

Implementation Notes

(The following section serves as a general introduction to the project's organization, concepts, files, etc. It aims to provide a bit of background context to some of the idioms that will be observed when reviewing the code. The next section, Code Walkthrough, digs into the implementation details in a more end-to-end type fashion.)

From an implementation perspective, I decided to develop the component as part of my tracer project for two reasons: a) the tracer project already has a lot of scaffolding in place (especially for the boring bits) that I'd be able to re-use, which is useful when time-constrained, and b) a perfect hash table is actually a pretty neat component that I can see my self using down the track.

The implementation is creatively-named PerfectHashTable, and is a DLL.

The PerfectHashTableSelfTestExe project, also very creatively-named, is the standalone self-test .exe.

Development of the project was done in Visual Studio 2017 via the PerfectHashTable.sln solution, which includes Debug, Release, PGInstrument and PGOptimize configurations.

The coding style and conventions are the same as the rest of the tracer project: Cutler Normal Form C, heavily commented, and generally very NT-esque in style.

Components in the tracer project have an interesting restriction in that they can't use the C runtime library in any way (due to the need to potentially be remote injected into a target process, and not wanting to deal with varying levels of C runtime availability). Instead, they make extensive, exclusive use of the NT runtime primitives afforded by ntdll.dll, kernel32.dll, and ntoskrnl.exe. This functionality is wrapped up by a component named Rtl (inspired by the NT runtime library of the same name). There is a huge structure named RTL, which contains the kitchen sink of things we need across all components, as well as function pointers to DDK-type functionality normally not available if you're including <Windows.h> (such as bitmaps, AVL tables, prefix tables, etc). (Note: one of the original design objectives of the RTL structure was to facilitate writing components that could run in both kernel and user mode without needing recompilation or #ifdefs. That is, you could author and test the vast majority of your functionality in user mode, where development, debugging and testing is easier, without needing to deploy it into the kernel until much later in the development lifecycle.)

The reason for mentioning this is that the first two parameters for every public function of the PerfectHashTable component are usually Rtl and Allocator. E.g.:

_Use_decl_annotations_
BOOLEAN
CreatePerfectHashTable(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PPERFECT_HASH_TABLE_CONTEXT Context,
    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId,
    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId,
    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId,
    PULARGE_INTEGER NumberOfTableElementsPointer,
    PPERFECT_HASH_TABLE_KEYS Keys,
    PCUNICODE_STRING HashTablePath
    )

The ALLOCATOR structure encapsulates the memory management functions like Calloc() and Free(), such that we're not dependent upon CRT linkage to equivalent functions. Initially, it just had the basic set of routines required to mimic Python's PyMemAllocatorEx structure: malloc(), calloc(), realloc() and free(). However, since then, it grew to support a plethora of routines, many of which driven by the TraceStore component, but also more general functions, such as aligned memory allocation.

Short story long: if you see Rtl our Allocator variables anywhere, it's just our runtime glue or memory allocator stuff.

Moving on, all public functions get a SAL-annotated function pointer typedef in the main public header file (PerfectHashTable.h). Functions are defined for creating and destroying keys and perfect hash tables, creating and destroying contexts (a context is required to create a table), and loading a previously-created table.

I use the same API pattern I used for the StringTable2 component, which I discuss here. This involves defining a structure named PERFECT_HASH_TABLE_API, and an inline function, LoadPerfectHashTableApi().

The structure looks like this:

typedef struct _Struct_size_bytes_(SizeOfStruct) _PERFECT_HASH_TABLE_API {

    //
    // Size of the structure, in bytes.  This is filled in automatically by
    // LoadPerfectHashTableApi() based on the initial SizeOfAnyApi parameter.
    //

    _In_range_(sizeof(struct _PERFECT_HASH_TABLE_API),
               sizeof(struct _PERFECT_HASH_TABLE_API_EX)) ULONG SizeOfStruct;

    //
    // Number of function pointers contained in the structure.  This is filled
    // in automatically by LoadPerfectHashTableApi() based on the initial
    // SizeOfAnyApi parameter divided by the size of a function pointer.
    //

    ULONG NumberOfFunctions;

    //
    // Begin function pointers.
    //

    union {
        PVOID FirstFunctionPointer;
        PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;
    };

    PLOAD_PERFECT_HASH_TABLE_KEYS LoadPerfectHashTableKeys;
    PDESTROY_PERFECT_HASH_TABLE_KEYS DestroyPerfectHashTableKeys;

    PCREATE_PERFECT_HASH_TABLE_CONTEXT CreatePerfectHashTableContext;
    PDESTROY_PERFECT_HASH_TABLE_CONTEXT DestroyPerfectHashTableContext;

    PCREATE_PERFECT_HASH_TABLE CreatePerfectHashTable;
    PLOAD_PERFECT_HASH_TABLE LoadPerfectHashTable;
    PTEST_PERFECT_HASH_TABLE TestPerfectHashTable;

    PINITIALIZE_PERFECT_HASH_TABLE_ALLOCATOR InitializePerfectHashAllocator;

    PINITIALIZE_PERFECT_HASH_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
        InitializePerfectHashAllocatorFromRtlBootstrap;

} PERFECT_HASH_TABLE_API;
typedef PERFECT_HASH_TABLE_API *PPERFECT_HASH_TABLE_API;

And the loader routine looks like this:

FORCEINLINE
BOOLEAN
LoadPerfectHashTableApi(
    _In_ PRTL Rtl,
    _Inout_ HMODULE *ModulePointer,
    _In_opt_ PUNICODE_STRING ModulePath,
    _In_ ULONG SizeOfAnyApi,
    _Out_writes_bytes_all_(SizeOfAnyApi) PPERFECT_HASH_TABLE_ANY_API AnyApi
    )
/*++

Routine Description:

    Loads the perfect hash table module and resolves all API functions for
    either the PERFECT_HASH_TABLE_API or PERFECT_HASH_TABLE_API_EX structure.
    The desired API is indicated by the SizeOfAnyApi parameter.

    Example use:

        PERFECT_HASH_TABLE_API_EX GlobalApi;
        PPERFECT_HASH_TABLE_API_EX Api;

        Success = LoadPerfectHashApi(Rtl,
                                     NULL,
                                     NULL,
                                     sizeof(GlobalApi),
                                     (PPERFECT_HASH_TABLE_ANY_API)&GlobalApi);
        ASSERT(Success);
        Api = &GlobalApi;

    In this example, the extended API will be provided as our sizeof(GlobalApi)
    will indicate the structure size used by PERFECT_HASH_TABLE_API_EX.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    ModulePointer - Optionally supplies a pointer to an existing module handle
        for which the API symbols are to be resolved.  May be NULL.  If not
        NULL, but the pointed-to value is NULL, then this parameter will
        receive the handle obtained by LoadLibrary() as part of this call.
        If the string table module is no longer needed, but the program will
        keep running, the caller should issue a FreeLibrary() against this
        module handle.

    ModulePath - Optionally supplies a pointer to a UNICODE_STRING structure
        representing a path name of the string table module to be loaded.
        If *ModulePointer is not NULL, it takes precedence over this parameter.
        If NULL, and no module has been provided via *ModulePointer, loading
        will be attempted via LoadLibraryA("PerfectHashTable.dll")'.

    SizeOfAnyApi - Supplies the size, in bytes, of the underlying structure
        pointed to by the AnyApi parameter.

    AnyApi - Supplies the address of a structure which will receive resolved
        API function pointers.  The API furnished will depend on the size
        indicated by the SizeOfAnyApi parameter.

Return Value:

    TRUE on success, FALSE on failure.

--*/

This is a little quirky, but I've found it to be a useful way to dynamically load and use small components at runtime without requiring any static linking or dll export/import glue.

Publicly, the keys and context structures are opaque. Routines are exposed to create them and destroy them, and that's it. The structure details are reserved for main private header for the component, PerfectHashTablePrivate.h, which is available for inclusion by all of our internal .c files. Pre-compiled headers are used, such that the only file each individual .c file needs to include is stdafx.h.

The actual perfect hash table interface that implements the required Insert() and Lookup() is actually exposed as a COM-style vtbl structure named PERFECT_HASH_TABLE_VTBL. It's not technically a COM interface, as we don't expose a QueryInterface() function pointer in the first slot, but it has the same initial IUnknown layout, and leverages the reference counting facilities implied by AddRef() and Release() to manage lifetime. (This differs from all of the other structures, which will typically have both a Create and Destroy-type method.) The structure, with an accompanying definition of the loader function pointer and all public vtbl methods for interacting with the perfect hash table, looks like this:

//
// Forward definition of the interface.
//

typedef struct _PERFECT_HASH_TABLE_VTBL PERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL *PPERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL **PPPERFECT_HASH_TABLE;

//
// Define a minimal vtbl encapsulation structure if we're a public
// (i.e. non-internal) build.  The actual structure is defined in
// PerfectHashTablePrivate.h.
//

#ifndef _PERFECT_HASH_TABLE_INTERNAL_BUILD
typedef struct _PERFECT_HASH_TABLE {
    PPERFECT_HASH_TABLE_VTBL Vtbl;
} PERFECT_HASH_TABLE;
#else
typedef struct _PERFECT_HASH_TABLE PERFECT_HASH_TABLE;
#endif
typedef PERFECT_HASH_TABLE *PPERFECT_HASH_TABLE;

typedef
_Check_return_
_Success_(return != 0)
BOOLEAN
(NTAPI LOAD_PERFECT_HASH_TABLE)(
    _In_ PRTL Rtl,
    _In_ PALLOCATOR Allocator,
    _In_opt_ PPERFECT_HASH_TABLE_KEYS Keys,
    _In_ PCUNICODE_STRING Path,
    _Out_ PPERFECT_HASH_TABLE *TablePointer
    );
typedef LOAD_PERFECT_HASH_TABLE *PLOAD_PERFECT_HASH_TABLE;

//
// Define the public perfect hash table functions.
//

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_INSERT)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Key,
    _In_ ULONG Value,
    _Out_opt_ PULONG PreviousValue
    );
typedef PERFECT_HASH_TABLE_INSERT *PPERFECT_HASH_TABLE_INSERT;

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_LOOKUP)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Key,
    _Out_ PULONG Value
    );
typedef PERFECT_HASH_TABLE_LOOKUP *PPERFECT_HASH_TABLE_LOOKUP;

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_DELETE)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Key,
    _Out_opt_ PULONG PreviousValue
    );
typedef PERFECT_HASH_TABLE_DELETE *PPERFECT_HASH_TABLE_DELETE;

//
// Given a key, this routine returns the relative index of the key in the
// underlying hash table.  This is guaranteed to be within the bounds of the
// table size.
//

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_INDEX)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Key,
    _In_ PULONG Index
    );
typedef PERFECT_HASH_TABLE_INDEX *PPERFECT_HASH_TABLE_INDEX;

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_HASH)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Input,
    _Out_ PULONGLONG Hash
    );
typedef PERFECT_HASH_TABLE_HASH *PPERFECT_HASH_TABLE_HASH;

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_MASK_HASH)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONG Input,
    _Out_ PULONG Masked
    );
typedef PERFECT_HASH_TABLE_MASK_HASH *PPERFECT_HASH_TABLE_MASK_HASH;

typedef
HRESULT
(NTAPI PERFECT_HASH_TABLE_MASK_INDEX)(
    _In_ PPERFECT_HASH_TABLE Table,
    _In_ ULONGLONG Input,
    _Out_ PULONG Masked
    );
typedef PERFECT_HASH_TABLE_MASK_INDEX *PPERFECT_HASH_TABLE_MASK_INDEX;

//
// Loaded hash tables are reference counted using the AddRef()/Release() COM
// semantics.  The number of AddRef() calls should match the number of Release()
// calls.  The resources will be released when the final Release() is called.
//

typedef
ULONG
(NTAPI PERFECT_HASH_TABLE_ADD_REF)(
    _In_ PPERFECT_HASH_TABLE Table
    );
typedef PERFECT_HASH_TABLE_ADD_REF *PPERFECT_HASH_TABLE_ADD_REF;

typedef
ULONG
(NTAPI PERFECT_HASH_TABLE_RELEASE)(
    _In_ PPERFECT_HASH_TABLE Table
    );
typedef PERFECT_HASH_TABLE_RELEASE *PPERFECT_HASH_TABLE_RELEASE;

//
// The interface as a vtbl.  Note that we're *almost* a valid COM interface,
// except for the NULL pointer that will occupy the first slot where the impl
// for QueryInterface() is meant to live.
//

typedef struct _PERFECT_HASH_TABLE_VTBL {
    PVOID Unused;
    PPERFECT_HASH_TABLE_ADD_REF AddRef;
    PPERFECT_HASH_TABLE_RELEASE Release;
    PPERFECT_HASH_TABLE_INSERT Insert;
    PPERFECT_HASH_TABLE_LOOKUP Lookup;
    PPERFECT_HASH_TABLE_DELETE Delete;
    PPERFECT_HASH_TABLE_INDEX Index;
    PPERFECT_HASH_TABLE_HASH Hash;
    PPERFECT_HASH_TABLE_MASK_HASH MaskHash;
    PPERFECT_HASH_TABLE_MASK_INDEX MaskIndex;
} PERFECT_HASH_TABLE_VTBL;
typedef PERFECT_HASH_TABLE_VTBL *PPERFECT_HASH_TABLE_VTBL;

The COM-style vtbl pattern worked quite nicely here, and provided the perfect amount of exposure between the public and private definitions of the table interface. It allows us to mix and match who provides what with regards to the vtbl function pointers, based on the selected algorithm, hash function and masking type. The PerfectHashTableConstants.h file exposes an internal routine named InitializeExtendedVtbl, which looks like this:

//
// Helper inline routine for initializing the extended vtbl interface.
//

FORCEINLINE
VOID
InitializeExtendedVtbl(
    _In_ PPERFECT_HASH_TABLE Table,
    _Inout_ PPERFECT_HASH_TABLE_VTBL_EX Vtbl
    )
{
    Vtbl->AddRef = PerfectHashTableAddRef;
    Vtbl->Release = PerfectHashTableRelease;
    Vtbl->Insert = PerfectHashTableInsert;
    Vtbl->Lookup = PerfectHashTableLookup;
    Vtbl->Delete = PerfectHashTableDelete;
    Vtbl->Index = IndexRoutines[Table->AlgorithmId];
    Vtbl->Hash = HashRoutines[Table->HashFunctionId];
    Vtbl->MaskHash = MaskHashRoutines[Table->MaskFunctionId];
    Vtbl->MaskIndex = MaskIndexRoutines[Table->MaskFunctionId];
    Vtbl->SeededHash = SeededHashRoutines[Table->HashFunctionId];
    Table->Vtbl = Vtbl;
}

Note that, thanks to COM, the first five routines can actually be serviced by generic, algorithm-agnostic implementations. That is, because the Insert(), Lookup(), and Delete() routines simply use the Index() routine to obtain the array offset for the given input key, they don't need to know anything about the backend algorithm, hash or masking type. This actually makes all of these functions very simple, as can be seen below:

Insert
Lookup
Delete
AddRef
Release

_Use_decl_annotations_
HRESULT
PerfectHashTableInsert(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    ULONG Value,
    PULONG PreviousValue
    )
/*++

Routine Description:

    Looks up given key in a perfect hash table and returns the value set by
    the Insert() routine.  If no insertion has taken place for this key, this
    routine guarantees to return 0 as the value.

    N.B. If Key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         key will hash to either an existing key's location or an empty slot,
         so there is potential to corrupt the table in the sense that previously
         inserted values will be trampled over.)

Arguments:

    Table - Supplies a pointer to the table to insert the key/value into.

    Key - Supplies the key to insert.

    Value - Supplies the value to insert.

    PreviousValue - Optionally supplies a pointer that will receive the previous
        value at the relevant table location prior to this insertion.  If no
        prior insertion, the previous value is guaranteed to be 0.

Return Value:

    S_OK in all normal operating conditions.  E_FAIL may be returned in some
    cases when passed a key not in the original input set.  The PreviousValue
    parameter, if non-NULL, will be cleared in this case.

--*/
{
    ULONG Index;
    ULONG Existing;
    HRESULT Result;

    //
    // Obtain the index for this key.
    //

    Result = Table->Vtbl->Index(Table, Key, &Index);

    if (FAILED(Result)) {

        //
        // Clear the caller's pointer if applicable and return error.
        //

        if (ARGUMENT_PRESENT(PreviousValue)) {
            *PreviousValue = 0;
        }

        return E_FAIL;
    }

    //
    // Get the existing value.
    //

    Existing = Table->Values[Index];

    //
    // Write the new value.
    //

    Table->Values[Index] = Value;

    //
    // Update the caller's pointer if applicable.
    //

    if (ARGUMENT_PRESENT(PreviousValue)) {
        *PreviousValue = Existing;
    }

    return S_OK;
}

_Use_decl_annotations_
HRESULT
PerfectHashTableLookup(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    PULONG Value
    )
/*++

Routine Description:

    Looks up given key in a perfect hash table and returns the value set by
    the Insert() routine.  If no insertion has taken place for this key, this
    routine guarantees to return 0 as the value.

    N.B. If key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         value returned will be the value for some other key in the table that
         hashes to the same location -- or potentially an empty slot in the
         table.)

Arguments:

    Table - Supplies a pointer to the table for which the key lookup is to be
        performed.

    Key - Supplies the key to look up.

    Value - Receives the vaue for the given key.

Return Value:

    S_OK in all normal operating conditions.  E_FAIL may be returned in some
    cases when passed a key not in the original input set.  The Value parameter
    will be set to NULL in this case.

--*/
{
    ULONG Index;
    HRESULT Result;

    //
    // Obtain the index for this key.
    //

    Result = Table->Vtbl->Index(Table, Key, &Index);

    if (FAILED(Result)) {

        //
        // Clear the caller's pointers and return the error code.
        //

        *Value = 0;
        return E_FAIL;
    }

    *Value = Table->Values[Index];

    return S_OK;
}

_Use_decl_annotations_
HRESULT
PerfectHashTableDelete(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    PULONG PreviousValue
    )
/*++

Routine Description:

    Deletes a key from a perfect hash table, optionally returning the value
    prior to deletion back to the caller.  Deletion simply clears the value
    associated with the key, and thus, is a simple O(1) operation.  Deleting
    a key that has not yet been inserted has no effect other than potentially
    returning 0 as the previous value.  That is, a caller can safely issue
    deletes of keys regardless of whether or not said keys were inserted first.

    N.B. If key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         key will hash to either an existing key's location or an empty slot,
         so there is potential to corrupt the table in the sense that a
         previously inserted value for an unrelated, valid key will be cleared.)

Arguments:

    Table - Supplies a pointer to the table for which the key is to be deleted.

    Key - Supplies the key to delete.

    PreviousValue - Optionally supplies a pointer that will receive the previous
        value at the relevant table location prior to this deletion.  If no
        prior insertion, the previous value is guaranteed to be 0.

Return Value:

    S_OK in all normal operating conditions.  E_FAIL may be returned in some
    cases when passed a key not in the original input set.  The PreviousValue
    parameter, if non-NULL, will be cleared in this case.

--*/
{
    ULONG Index;
    ULONG Existing;
    HRESULT Result;

    //
    // Obtain the index for this key.
    //

    Result = Table->Vtbl->Index(Table, Key, &Index);

    if (FAILED(Result)) {

        //
        // Clear the caller's pointer if applicable and return error.
        //

        if (ARGUMENT_PRESENT(PreviousValue)) {
            *PreviousValue = 0;
        }

        return E_FAIL;
    }

    //
    // Get the existing value.
    //

    Existing = Table->Values[Index];

    //
    // Clear the value.
    //

    Table->Values[Index] = 0;

    //
    // Update the caller's pointer if applicable.
    //

    if (ARGUMENT_PRESENT(PreviousValue)) {
        *PreviousValue = Existing;
    }

    return S_OK;
}

_Use_decl_annotations_
ULONG
PerfectHashTableAddRef(
    PPERFECT_HASH_TABLE Table
    )
/*++

Routine Description:

    Increments the reference count for a perfect hash table.

Arguments:

    Table - Supplies a pointer to the table for which the reference count
        is to be incremented.

Return Value:

    The new reference count.

--*/
{
    return InterlockedIncrement(&Table->ReferenceCount);
}

_Use_decl_annotations_
ULONG
PerfectHashTableRelease(
    PPERFECT_HASH_TABLE Table
    )
/*++

Routine Description:

    Decrements the reference count associated with a perfect hash table.  If
    this is the last reference, the table is destroyed.

Arguments:

    Table - Supplies a pointer to the table for which the reference count
        is to be decremented.

Return Value:

    The new reference count.

--*/
{
    ULONG Count = InterlockedDecrement(&Table->ReferenceCount);
    PPERFECT_HASH_TABLE TablePointer = Table;

    if (Count > 0) {
        return Count;
    }

    DestroyPerfectHashTable(&TablePointer, NULL);

    return Count;
}

I prefer this COM-style vtbl approach to the cmph project's approach, which uses giant switch statements to select appropriate backends, e.g.:


cmph_t *cmph_new(cmph_config_t *mph)
{
	cmph_t *mphf = NULL;
	double c = mph->c;

	DEBUGP("Creating mph with algorithm %s\n", cmph_names[mph->algo]);
	switch (mph->algo)
	{
		case CMPH_CHM:
			DEBUGP("Creating chm hash\n");
			mphf = chm_new(mph, c);
			break;
		case CMPH_BMZ: /* included - Fabiano */
			DEBUGP("Creating bmz hash\n");
			mphf = bmz_new(mph, c);
			break;
		case CMPH_BMZ8: /* included - Fabiano */
			DEBUGP("Creating bmz8 hash\n");
			mphf = bmz8_new(mph, c);
			break;
		case CMPH_BRZ: /* included - Fabiano */
			DEBUGP("Creating brz hash\n");
			if (c >= 2.0) brz_config_set_algo(mph, CMPH_FCH);
			else brz_config_set_algo(mph, CMPH_BMZ8);
			mphf = brz_new(mph, c);
			break;
		case CMPH_FCH: /* included - Fabiano */
			DEBUGP("Creating fch hash\n");
			mphf = fch_new(mph, c);
			break;
		case CMPH_BDZ: /* included - Fabiano */
			DEBUGP("Creating bdz hash\n");
			mphf = bdz_new(mph, c);
			break;
		case CMPH_BDZ_PH: /* included - Fabiano */
			DEBUGP("Creating bdz_ph hash\n");
			mphf = bdz_ph_new(mph, c);
			break;
		case CMPH_CHD_PH: /* included - Fabiano */
			DEBUGP("Creating chd_ph hash\n");
			mphf = chd_ph_new(mph, c);
			break;
		case CMPH_CHD: /* included - Fabiano */
			DEBUGP("Creating chd hash\n");
			mphf = chd_new(mph, c);
			break;
		default:
			assert(0);
	}
	return mphf;
}

int cmph_dump(cmph_t *mphf, FILE *f)
{
	switch (mphf->algo)
	{
		case CMPH_CHM:
			return chm_dump(mphf, f);
		case CMPH_BMZ: /* included - Fabiano */
			return bmz_dump(mphf, f);
		case CMPH_BMZ8: /* included - Fabiano */
			return bmz8_dump(mphf, f);
		case CMPH_BRZ: /* included - Fabiano */
			return brz_dump(mphf, f);
		case CMPH_FCH: /* included - Fabiano */
			return fch_dump(mphf, f);
		case CMPH_BDZ: /* included - Fabiano */
			return bdz_dump(mphf, f);
		case CMPH_BDZ_PH: /* included - Fabiano */
			return bdz_ph_dump(mphf, f);
		case CMPH_CHD_PH: /* included - Fabiano */
			return chd_ph_dump(mphf, f);
		case CMPH_CHD: /* included - Fabiano */
			return chd_dump(mphf, f);
		default:
			assert(0);
	}
	assert(0);
	return 0;
}

The Index() routine is algorithm-specific, and forms the backbone of the perfect hash table functionality: the ability to take a key and return a unique index that can be used to offset into an array of data.

Here's what the Index() routine looks like for our CHM algorithm implementation. Note that it too leverages the COM interface for driving the hash and masking routines.

_Use_decl_annotations_
HRESULT
PerfectHashTableIndexImplChm01(
    PPERFECT_HASH_TABLE Table,
    ULONG Key,
    PULONG Index
    )
/*++

Routine Description:

    Looks up given key in a perfect hash table and returns its index.

    N.B. If Key did not appear in the original set the hash table was created
         from, the behavior of this routine is undefined.  (In practice, the
         key will hash to either an existing key's location or an empty slot,
         so there is potential for returning a non-unique index.)

Arguments:

    Table - Supplies a pointer to the table for which the key lookup is to be
        performed.

    Key - Supplies the key to look up.

    Index - Receives the index associated with this key.  The index will be
        between 0 and NumberOfKeys-1, and can be safely used to offset directly
        into an appropriately sized array (e.g. Table->Values[]).

Return Value:

    S_OK on success, E_FAIL if the underlying hash function returned a failure.
    This will happen if the two hash values for a key happen to be identical.
    It shouldn't happen once a perfect graph has been created (i.e. it only
    happens when attempting to solve the graph).  The Index parameter will
    be cleared in the case of E_FAIL.

--*/
{
    ULONG Masked;
    ULONG Vertex1;
    ULONG Vertex2;
    ULONG MaskedLow;
    ULONG MaskedHigh;
    PULONG Assigned;
    ULONGLONG Combined;
    ULARGE_INTEGER Hash;

    //
    // Hash the incoming key into the 64-bit representation, which is two
    // 32-bit ULONGs in disguise, each one driven by a separate seed value.
    //

    if (FAILED(Table->Vtbl->Hash(Table, Key, &Hash.QuadPart))) {
        goto Error;
    }

    //
    // Mask each hash value such that it falls within the confines of the
    // number of vertices.  That is, make sure the value is between 0 and
    // Table->NumberOfVertices-1.
    //

    if (FAILED(Table->Vtbl->MaskHash(Table, Hash.LowPart, &MaskedLow))) {
        goto Error;
    }

    if (FAILED(Table->Vtbl->MaskHash(Table, Hash.HighPart, &MaskedHigh))) {
        goto Error;
    }

    //
    // Obtain the corresponding vertex values for the masked high and low hash
    // values.  These are derived from the "assigned" array that we construct
    // during the creation routine's assignment step (GraphAssign()).
    //

    Assigned = Table->Data;

    Vertex1 = Assigned[MaskedLow];
    Vertex2 = Assigned[MaskedHigh];

    //
    // Combine the two values, then perform the index masking operation, such
    // that our final index into the array falls within the confines of the
    // number of edges, or keys, in the table.  That is, make sure the index
    // value is between 0 and Table->Keys->NumberOfElements-1.
    //

    Combined = (ULONGLONG)Vertex1 + (ULONGLONG)Vertex2;

    if (FAILED(Table->Vtbl->MaskIndex(Table, Combined, &Masked))) {
        goto Error;
    }

    //
    // Update the caller's pointer and return success.  The resulting index
    // value represents the array offset index for this given key in the
    // underlying table, and is guaranteed to be unique amongst the original
    // keys in the input set.
    //

    *Index = Masked;

    return S_OK;

Error:

    //
    // Clear the caller's pointer and return failure.  We should only hit this
    // point if the caller supplies a key that both: a) wasn't in the original
    // input set, and b) happens to result in a hash value where both the high
    // part and low part are identical, which is rare, but not impossible.
    //

    *Index = 0;

    return E_FAIL;
}

This routine lives in our Chm_01.c file, which is our only algorithm backend at the moment. This file, together with its counterpart Chm_01.h header, features all of the algorithm-specific structures and routines related to the implementation. It is here we define the graph structures specific to the CHM algorithm: GRAPH_DIMENSIONS, GRAPH_INFO, and GRAPH.

GRAPH_DIMENSIONS
GRAPH_INFO
GRAPH

//
// Define the primary dimensions governing the graph size.
//

typedef struct _GRAPH_DIMENSIONS {

    //
    // Number of edges in the graph.  This corresponds to the number of keys
    // in our input set.  If modulus masking is active, the number of keys and
    // the number of edges will be identical.  Otherwise, the number of edges
    // will be the number of keys rounded up to a power of 2.
    //

    ULONG NumberOfEdges;

    //
    // Total number of edges in the graph.  This will be twice the size of the
    // NumberOfEdges value above, due to the quirky way the underlying r-graph
    // algorithm captures two hash values in the same list and offsets the
    // second set after the first, e.g.:
    //
    //      Edge2 = Edge1 + Graph->NumberOfEdges;
    //

    ULONG TotalNumberOfEdges;

    //
    // Number of vertices in the graph.  This will vary based on the masking
    // type.  It is doubled every time a graph resize event is encountered.
    //

    ULONG NumberOfVertices;

    //
    // The number of edges in the graph, rounded up to a power of 2, and then
    // shifted the appropriate amount to extract the exponent part for 2^n.
    //

    BYTE NumberOfEdgesPowerOf2Exponent;

    //
    // As above, but rounded up to the next power of 2 first.
    //

    BYTE NumberOfEdgesNextPowerOf2Exponent;

    //
    // The same exponent logic applied to the number of vertices as per the
    // NumberOfVertices field above.
    //

    BYTE NumberOfVerticesPowerOf2Exponent;

    //
    // And again for the next value.
    //

    BYTE NumberOfVerticesNextPowerOf2Exponent;

} GRAPH_DIMENSIONS;
C_ASSERT(sizeof(GRAPH_DIMENSIONS) == 16);
typedef GRAPH_DIMENSIONS *PGRAPH_DIMENSIONS;

//
// Define various memory offsets associated with a given graph structure.
// This allows parallel worker threads to reset their local GRAPH instance
// back to the initial state each time they want to try a new random seed.
//

typedef struct _GRAPH_INFO {

    //
    // Number of pages consumed by the entire graph and all backing arrays.
    //

    ULONG NumberOfPagesPerGraph;

    //
    // Page size (e.g. 4096, 2MB).
    //

    ULONG PageSize;

    //
    // Total number of graphs created.  This will match the maximum concurrency
    // level of the upstream context.
    //

    ULONG NumberOfGraphs;

    //
    // Number of RTL_BITMAP structures used by the graph.
    //

    USHORT NumberOfBitmaps;

    //
    // Size of the GRAPH structure.
    //

    USHORT SizeOfGraphStruct;

    //
    // System allocation granularity.  We align the memory map for the on-disk
    // structure using this value initially.
    //

    ULONG AllocationGranularity;

    //
    // If a masking type other than modulus is active, the AbsoluteEdge() needs
    // a way to mask edge values that exceed the number of edges in the table.
    // It does this via EdgeMask, which is initialized to the number of edges
    // (which will be power-of-2 sized for non-modulus masking), minus 1, such
    // that all lower bits will be set.
    //

    ULONG EdgeMask;

    //
    // Also capture the mask required to isolate vertices.
    //

    ULONG VertexMask;

    //
    // Graph dimensions.  This information is duplicated in the graph due to
    // it being accessed frequently.
    //

    GRAPH_DIMENSIONS Dimensions;

    //
    // Pointer to the owning context.
    //

    PPERFECT_HASH_TABLE_CONTEXT Context;

    //
    // Base address of the entire graph allocation.
    //

    union {
        PVOID BaseAddress;
        struct _GRAPH *FirstGraph;
    };

    //
    // Array sizes.
    //

    ULONGLONG EdgesSizeInBytes;
    ULONGLONG NextSizeInBytes;
    ULONGLONG FirstSizeInBytes;
    ULONGLONG PrevSizeInBytes;
    ULONGLONG AssignedSizeInBytes;
    ULONGLONG ValuesSizeInBytes;

    //
    // Deleted edges bitmap buffer size.
    //

    ULONGLONG DeletedEdgesBitmapBufferSizeInBytes;

    //
    // Visited vertices bitmap buffer size.
    //

    ULONGLONG VisitedVerticesBitmapBufferSizeInBytes;

    //
    // Assigned bitmap buffer size.
    //

    ULONGLONG AssignedBitmapBufferSizeInBytes;

    //
    // Index bitmap buffer size.
    //

    ULONGLONG IndexBitmapBufferSizeInBytes;

    //
    // The allocation size of the graph, including structure size and all
    // array and bitmap buffer sizes.
    //

    ULONGLONG AllocSize;

    //
    // Allocation size rounded up to the nearest page size multiple.
    //

    ULONGLONG FinalSize;

} GRAPH_INFO;
typedef GRAPH_INFO *PGRAPH_INFO;

//
// Define the graph structure.  This represents an r-graph, or a hypergraph,
// or an r-partite 2-uniform graph, or any other seemingly unlimited number
// of names floating around in academia for what appears to be exactly the
// same thing.
//

typedef struct _Struct_size_bytes_(SizeOfStruct) _GRAPH {

    //
    // List entry used to push the graph onto the context's work list.
    //

    SLIST_ENTRY ListEntry;

    //
    // Edge and vertex masks that can be used when non-modulus masking is in
    // place.  Both of these values are duplicated from the info structure as
    // they are accessed frequently.
    //
    //

    ULONG EdgeMask;
    ULONG VertexMask;

    //
    // Duplicate the mask type, as well, as this directs AbsoluteEdge()'s
    // decision to use the two masks above.
    //

    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;

    //
    // Duplicate the number of keys, as this is also frequently referenced.
    //

    ULONG NumberOfKeys;

    //
    // Structure size, in bytes.
    //

    _Field_range_(== , sizeof(struct _GRAPH)) ULONG SizeOfStruct;

    //
    // Graph flags.
    //

    GRAPH_FLAGS Flags;

    //
    // Pointer to the info structure describing various sizes.
    //

    PGRAPH_INFO Info;

    //
    // Graph attempt.  This ID is derived from an interlocked increment against
    // Context->Attempts, and represents the attempt number across all threads.
    //

    ULONGLONG Attempt;

    //
    // A localized attempt number that reflects the number of attempts made
    // by just this thread.
    //

    ULONG ThreadAttempt;

    //
    // Thread ID of the thread that owns us.  Each callback thread is provided
    // a single graph, and will attempt to solve the perfect hash table until
    // told otherwise.  Thus, there's a 1:1 relationship between graph instance
    // and owning thread.
    //

    ULONG ThreadId;

    //
    // Counter that is incremented each time we delete an edge during the
    // acyclic graph detection stage.
    //

    ULONG DeletedEdgeCount;

    //
    // Counter that is incremented each time we visit a vertex during the
    // assignment stage.
    //

    ULONG VisitedVerticesCount;

    //
    // Capture collisions during assignment step.
    //

    ULONG Collisions;

    //
    // Inline the GRAPH_DIMENSIONS structure.  This is available from the
    // GRAPH_INFO structure, however, it's accessed frequently, so we inline
    // it to avoid the extra level of indirection.
    //

    union {

        struct {
            ULONG NumberOfEdges;
            ULONG TotalNumberOfEdges;
            ULONG NumberOfVertices;
            BYTE NumberOfEdgesPowerOf2Exponent;
            BYTE NumberOfEdgesNextPowerOf2Exponent;
            BYTE NumberOfVerticesPowerOf2Exponent;
            BYTE NumberOfVerticesNextPowerOf2Exponent;
        };

        GRAPH_DIMENSIONS Dimensions;
    };

    //
    // Duplicate the context pointer.  (This is also available from Info.)
    //

    PPERFECT_HASH_TABLE_CONTEXT Context;

    //
    // Edges array.  The number of elements in this array is governed by the
    // TotalNumberOfEdges field, and will be twice the number of edges.
    //

    PEDGE Edges;

    //
    // Array of the "next" edge array, as per the referenced papers.  The number
    // of elements in this array is also governed by TotalNumberOfEdges.
    //

    PEDGE Next;

    //
    // Array of vertices.  Number of elements is governed by the
    // NumberOfVertices field.
    //

    PVERTEX First;

    //
    // The original CHM paper in 1996 references a "prev" array to "facilitate
    // fast deletion".  However, the chmp project appears to have switched to
    // using bitmaps.  Let's reserve a slot for the "prev" array anyway.
    //

    PVERTEX Prev;

    //
    // Array of assigned vertices.  Number of elements is governed by the
    // NumberOfVertices field.
    //

    PVERTEX Assigned;

    //
    // Array of values indexed by the offsets in the Assigned array.  This
    // essentially allows us to simulate a loaded table that supports the
    // Insert(), Index() and Lookup() routines as part of graph validation.
    //

    PULONG Values;

    //
    // Bitmap used to capture deleted edges as part of the acyclic detection
    // stage.  The SizeOfBitMap will reflect TotalNumberOfEdges.
    //

    RTL_BITMAP DeletedEdges;

    //
    // Bitmap used to capture vertices visited as part of the assignment stage.
    // The SizeOfBitMap will reflect NumberOfVertices.
    //

    RTL_BITMAP VisitedVertices;

    //
    // Bitmap used to test the correctness of the Assigned array.
    //

    RTL_BITMAP AssignedBitmap;

    //
    // Bitmap used to track indices during the assignment step.
    //

    RTL_BITMAP IndexBitmap;

    //
    // Capture the seeds used for each hash function employed by the graph.
    //

    ULONG NumberOfSeeds;

    struct {
        union {
            struct {
                union {
                    ULONG Seed1;
                    ULONG FirstSeed;
                };
                ULONG Seed2;
            };
            ULARGE_INTEGER Seeds12;
        };
        union {
            struct {
                ULONG Seed3;
                union {
                    ULONG Seed4;
                    ULONG LastSeed;
                };
            };
            ULARGE_INTEGER Seeds34;
        };
    };

} GRAPH;
typedef GRAPH *PGRAPH;

In general, the structures and functions defined in our PerfectHashTablePrivate.h file are algorithm agnostic. That is, they don't deal with the notion of a graph, as that's an implementation detail of the CHM algorithm.

Code Walkthrough

Let's examine the code by walking through the behavior of the SelfTestPerfectHashTable() routine.

Test Data

To generate some test data, you have two options. If you have Python and NumPy installed, you can generate a bunch of linear and random key files via the generate.py file that lives in the data subdirectory of the PerfectHashTable component. Alternatively, you can check out a small git repository that houses some existing sample test data.

Python Generated

The generate.py file and its output follows. It will generate linear and random arrays of given sizes, saving a .txt representation of the values, a .keys representation of the binary data (which is the actual file we load as part of perfect hash table creation), and, if matplotlib is installed, a .png plot of the data, which allows a quick sanity check that the value distribution is what we expect.

(Note: the random routine creates a double-sized array, then calls np.unique() against it to ensure all the key values are unique, then isolates the [:size] portion we're actually interested in. One side effect of calling np.unique() is that our resulting array is sorted. We don't stipulate whether key files are required to be sorted at the moment, as it makes no difference to the underlying perfect hash table algorithm we're using. i.e. we don't suffer the sort of pathological worst-case performance issues some data structures exhibit when inserting data in sorted order. Nevertheless, if you look at the random files compared to the linear files, they're usually quite similar, given everything is sorted.)

generate.py
Output


from __future__ import print_function

import numpy as np

try:

    import matplotlib.pyplot as plt

    # Turn off interactive mode to disable plots being displayed
    # prior to saving them to disk.
    plt.ioff()

    def save_array_plot_to_png_file(filename, a):
        plt.plot(a)
        plt.savefig(filename)
        print("Generated %s." % filename)

except ImportError:

    def save_array_plot_to_png_file(filename, a):
        pass

def save_array_to_text_file(filename, a):
    with open(filename, 'wb') as f:
        for i in a:
            f.write(str(i).zfill(9).encode('ascii'))
            f.write(b'\n')
    print("Generated %s." % filename)

def save_array_to_binary_file(filename, a):
    fp = np.memmap(filename, dtype='uint32', mode='w+', shape=a.shape)
    fp[:] = a[:]
    del fp
    print("Generated %s." % filename)

def save_array(prefix, a):
    l = len(a)
    png_filename = '%s-%d.png' % (prefix, l)
    text_filename = '%s-%d.txt' % (prefix, l)
    binary_filename = '%s-%d.keys' % (prefix, l)
    save_array_plot_to_png_file(png_filename, a)
    save_array_to_text_file(text_filename, a)
    save_array_to_binary_file(binary_filename, a)

def gen_random_unique_array(size):
    a = np.random.randint(0, (1 << 32)-1, size * 2, dtype='uint32')
    a = np.unique(a)[:size]
    return a if len(a) == size else None

def gen_linear_array(size):
    a = np.linspace(0, size, num=size, dtype='uint32')
    return a

sizes = (
    2000, 4000, 4050, 5000, 10000, 15000, 17000, 25000, 31000, 33000, 50000,
    63000, 65500, 75000, 100000, 121000, 125000, 150000, 200000, 225000, 245000,
    255000, 265000, 389161, 472374,
)

functions = (
    ('linear', gen_linear_array),
    ('random', gen_random_unique_array),
)

def main():
    for (name, func) in functions:
        for size in sizes:
            a = func(size)
            if a is None:
                continue
            save_array(name, a)

if __name__ == '__main__':
    main()

 (py35) C:\Users\Trent\Home\src\tracer\PerfectHashTable\data>python generate.py
Generated linear-2000.png.
Generated linear-2000.txt.
Generated linear-2000.keys.
Generated linear-4000.png.
Generated linear-4000.txt.
Generated linear-4000.keys.
Generated linear-4050.png.
Generated linear-4050.txt.
Generated linear-4050.keys.
Generated linear-5000.png.
Generated linear-5000.txt.
Generated linear-5000.keys.
Generated linear-10000.png.
Generated linear-10000.txt.
Generated linear-10000.keys.
Generated linear-15000.png.
Generated linear-15000.txt.
Generated linear-15000.keys.
Generated linear-17000.png.
Generated linear-17000.txt.
Generated linear-17000.keys.
Generated linear-25000.png.
Generated linear-25000.txt.
Generated linear-25000.keys.
Generated linear-31000.png.
Generated linear-31000.txt.
Generated linear-31000.keys.
Generated linear-33000.png.
Generated linear-33000.txt.
Generated linear-33000.keys.
Generated linear-50000.png.
Generated linear-50000.txt.
Generated linear-50000.keys.
Generated linear-63000.png.
Generated linear-63000.txt.
Generated linear-63000.keys.
Generated linear-65500.png.
Generated linear-65500.txt.
Generated linear-65500.keys.
Generated linear-75000.png.
Generated linear-75000.txt.
Generated linear-75000.keys.
Generated linear-100000.png.
Generated linear-100000.txt.
Generated linear-100000.keys.
Generated linear-121000.png.
Generated linear-121000.txt.
Generated linear-121000.keys.
Generated linear-125000.png.
Generated linear-125000.txt.
Generated linear-125000.keys.
Generated linear-150000.png.
Generated linear-150000.txt.
Generated linear-150000.keys.
Generated linear-200000.png.
Generated linear-200000.txt.
Generated linear-200000.keys.
Generated linear-225000.png.
Generated linear-225000.txt.
Generated linear-225000.keys.
Generated linear-245000.png.
Generated linear-245000.txt.
Generated linear-245000.keys.
Generated linear-255000.png.
Generated linear-255000.txt.
Generated linear-255000.keys.
Generated linear-265000.png.
Generated linear-265000.txt.
Generated linear-265000.keys.
Generated linear-389161.png.
Generated linear-389161.txt.
Generated linear-389161.keys.
Generated linear-472374.png.
Generated linear-472374.txt.
Generated linear-472374.keys.
Generated random-2000.png.
Generated random-2000.txt.
Generated random-2000.keys.
Generated random-4000.png.
Generated random-4000.txt.
Generated random-4000.keys.
Generated random-4050.png.
Generated random-4050.txt.
Generated random-4050.keys.
Generated random-5000.png.
Generated random-5000.txt.
Generated random-5000.keys.
Generated random-10000.png.
Generated random-10000.txt.
Generated random-10000.keys.
Generated random-15000.png.
Generated random-15000.txt.
Generated random-15000.keys.
Generated random-17000.png.
Generated random-17000.txt.
Generated random-17000.keys.
Generated random-25000.png.
Generated random-25000.txt.
Generated random-25000.keys.
Generated random-31000.png.
Generated random-31000.txt.
Generated random-31000.keys.
Generated random-33000.png.
Generated random-33000.txt.
Generated random-33000.keys.
Generated random-50000.png.
Generated random-50000.txt.
Generated random-50000.keys.
Generated random-63000.png.
Generated random-63000.txt.
Generated random-63000.keys.
Generated random-65500.png.
Generated random-65500.txt.
Generated random-65500.keys.
Generated random-75000.png.
Generated random-75000.txt.
Generated random-75000.keys.
Generated random-100000.png.
Generated random-100000.txt.
Generated random-100000.keys.
Generated random-121000.png.
Generated random-121000.txt.
Generated random-121000.keys.
Generated random-125000.png.
Generated random-125000.txt.
Generated random-125000.keys.
Generated random-150000.png.
Generated random-150000.txt.
Generated random-150000.keys.
Generated random-200000.png.
Generated random-200000.txt.
Generated random-200000.keys.
Generated random-225000.png.
Generated random-225000.txt.
Generated random-225000.keys.
Generated random-245000.png.
Generated random-245000.txt.
Generated random-245000.keys.
Generated random-255000.png.
Generated random-255000.txt.
Generated random-255000.keys.
Generated random-265000.png.
Generated random-265000.txt.
Generated random-265000.keys.
Generated random-389161.png.
Generated random-389161.txt.
Generated random-389161.keys.
Generated random-472374.png.
Generated random-472374.txt.
Generated random-472374.keys.

Git Clone

If you don't have Python and NumPy installed or you can't get the generate.py to work, you can clone a small git repository named perfecthash, which contains a data directory that has a test file in it named mshtml-37209.keys:

                    C:\Users\Trent\Home\src> git clone https://github.com/tpn/perfecthash
                    ...
                    C:\Users\Trent\Home\src> cd perfecthash\data
                    C:\Users\Trent\Home\src\perfecthash\data> dir
                    ...

Running the Self-Test

If you clone the tracer project and open PerfectHashTable.sln in Visual Studio 2017, you should be able to build the project from scratch. Selecting the Debug build configuration will make the code easier to step through if you'd like to follow along with an interactive debugging session.

Let's run the debug build of the self-test program against the data directory we prepared earlier. (I'll use the perfecthash\data directory for now as it only has two key files in it, mshtml-10.keys and mshtml-37209.keys.) It's worth noting up-front how not user friendly thePerfectHashTableSelfTest.exe program is. It's the bare minimum command line wrapper around the SelfTestPerfectHashTable() routine. It takes five mandatory parameters: the test directory containing the *.keys files you wish to process, the algorithm ID, the hash function ID, the mask function ID, and the maximum concurrency (specifying 0 will use all cores). (You can optionally add a dummy character as the final parameter which will print a little Press any key to continue. message and then wait for a keypress before exiting — this is useful when debugging via Visual Studio and you want the console window to stay open so you can review results.) Usage:

C:\Users\Trent\Home\src\tracer>x64\Debug\PerfectHashTableSelfTest.exe
Usage: PerfectHashTableSelfTest.exe <TestDataDirectory (must be fully-qualified)>
                                    <AlgorithmId>
                                    <HashFunctionId>
                                    <MaskFunctionId>
                                    <MaximumConcurrency (0-ncpu)>
                                    [PauseBeforeExit (can be any character)]
E.g.: PerfectHashTableSelfTest.exe C:\Users\Trent\Home\src\perfecthash\data 1 1 2 0

Let's try run it against our perfecthash\data directory. We'll use the CHM algorithm (1... the only option), the CRC32-Rotate hash function (1), the AND-based masking function (2), and a max concurrency of 0, which will default to all cores, which will be 4 on this Surface Book.

S:\tracer>x64\Debug\PerfectHashTableSelfTest.exe S:\perfecthash\data 1 1 2 0

    Starting perfect hash self-test for directory: S:\perfecthash\data.
    Processing key file: mshtml-10.keys
    Successfully created perfect hash table: S:\perfecthash\data\mshtml-10.pht1.
    Successfully loaded perfect hash table: S:\perfecthash\data\mshtml-10.pht1.
    Algorithm: Chm01 (1).
    Hash Function: Crc32Rotate (1).
    Mask Function: And (2).
    Successfully tested perfect hash table.
    Concurrency: 4.
    Number of attempts: 1.
    Number of failed attempts: 0.
    Number of solutions found: 1.
    Number of keys: 10.
    Number of table elements (vertices): 32.
    Seed 1: 1086980845.
    Seed 2: 3416976708.
    Seed 3: 2208417421.
    Seed 4: 3947037297.
    Cycles to solve: 178412.
    Cycles to verify: 3924.
    Cycles to prepare file: 821965.
    Cycles to save file: 9093112.
    Microseconds to solve: 63.
    Microseconds to verify: 1.
    Microseconds to prepare file: 292.
    Microseconds to save file: 3238.

    Processing key file: mshtml-37209.keys
    Successfully created perfect hash table: S:\perfecthash\data\mshtml-37209.pht1.
    Successfully loaded perfect hash table: S:\perfecthash\data\mshtml-37209.pht1.
    Algorithm: Chm01 (1).
    Hash Function: Crc32Rotate (1).
    Mask Function: And (2).
    Successfully tested perfect hash table.
    Concurrency: 4.
    Number of attempts: 3.
    Number of failed attempts: 0.
    Number of solutions found: 3.
    Number of keys: 37209.
    Number of table elements (vertices): 131072.
    Seed 1: 43757427.
    Seed 2: 283337092.
    Seed 3: 1819527900.
    Seed 4: 1437745965.
    Cycles to solve: 71649060.
    Cycles to verify: 6385313.
    Cycles to prepare file: 41221291.
    Cycles to save file: 51229166.
    Microseconds to solve: 25516.
    Microseconds to verify: 2274.
    Microseconds to prepare file: 14680.
    Microseconds to save file: 18244.

Your output should look similar to the output above. There will be two new files, mshtml-10.pht1 and mshtml-37209.pht1 in that perfecthash\data directory. The .pht1 file is the array of "assigned" values we use as part of index construction for the CHM algorithm. When the output says Successfully loaded perfect hash table ..., it is referring to the .pht1 files.

The metadata for the perfect hash table is stored in an NTFS stream named :Info that is attached to the .pht1 file. We can view this via the Get-Item -Path mshtml-37209.pht1 -Stream * PowerShell command:

PS S:\perfecthash\data> Get-Item -Path mshtml-37209.pht1 -Stream *

    PSPath        : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data\mshtml-37209.pht1::$DATA
    PSParentPath  : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data
    PSChildName   : mshtml-37209.pht1::$DATA
    PSDrive       : S
    PSProvider    : Microsoft.PowerShell.Core\FileSystem
    PSIsContainer : False
    FileName      : S:\perfecthash\data\mshtml-37209.pht1
    Stream        : :$DATA
    Length        : 524288

    PSPath        : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data\mshtml-37209.pht1:Info
    PSParentPath  : Microsoft.PowerShell.Core\FileSystem::S:\perfecthash\data
    PSChildName   : mshtml-37209.pht1:Info
    PSDrive       : S
    PSProvider    : Microsoft.PowerShell.Core\FileSystem
    PSIsContainer : False
    FileName      : S:\perfecthash\data\mshtml-37209.pht1
    Stream        : Info
    Length        : 256

You can see the normal file data in the first stream (the default file stream) named $DATA. This is reported as length 524,288 bytes. If we divide this by 4 (the sizeof(ULONG)), we get 131,072 — which corresponds to the number of vertices reported for the final table in the output:

...
Number of table elements (vertices): 131072.
...

The `:Info` Stream

Following that is our custom :Info NTFS stream of length 256 bytes. As mentioned earlier, all of our file I/O is done through memory maps. During the creation step, the :Info stream is created and mapped at a base address accessible through Table->InfoStreamBaseAddress (we'll discuss the Table structure soon). In our CHM implementation routine, we literally cast the base address to an on-disk structure named GRAPH_INFO_ON_DISK, which is required to embed a common header structure named TABLE_INFO_ON_DISK_HEADER as its first member.

Let's look at these two header structures, as well as some initialization and finalization code in the CHM implementation that depicts how we work with them. The initialization code is taken from the CreatePerfectHashTableImplChm01 routine, and the finalization code is taken from the FileWorkCallbackChm01.

GRAPH_INFO_ON_DISK
TABLE_INFO_ON_DISK_HEADER
Initialization
Finalization

//
// Define an on-disk representation of the graph's information.  This is stored
// in the NTFS stream extending from the backing file named :Info.  It is
// responsible for storing information about the on-disk mapping such that it
// can be reloaded from disk and used as a perfect hash table.  The structure
// must always embed the TABLE_INFO_ON_DISK_HEADER structure such that the
// generic loader routine can access on-disk versions saved by different algos
// in order to extract the algorithm ID and determine a suitable loader func to
// use.
//

typedef struct _Struct_size_bytes_(Header.SizeOfStruct) _GRAPH_INFO_ON_DISK {

    //
    // Include the required header.
    //

    TABLE_INFO_ON_DISK_HEADER Header;

    //
    // Additional information we capture is mostly just for informational
    // and debugging purposes.
    //

    //
    // Inline the GRAPH_DIMENSIONS structure.
    //

    union {

        struct {
            ULONG NumberOfEdges;
            ULONG TotalNumberOfEdges;
            ULONG NumberOfVertices;
            BYTE NumberOfEdgesPowerOf2Exponent;
            BYTE NumberOfEdgesNextPowerOf2Exponent;
            BYTE NumberOfVerticesPowerOf2Exponent;
            BYTE NumberOfVerticesNextPowerOf2Exponent;
        };

        GRAPH_DIMENSIONS Dimensions;
    };

} GRAPH_INFO_ON_DISK;
C_ASSERT(sizeof(GRAPH_INFO_ON_DISK) <= PAGE_SIZE);
typedef GRAPH_INFO_ON_DISK *PGRAPH_INFO_ON_DISK;

//
// Metadata about a perfect hash table is stored in an NTFS stream named :Info
// that is tacked onto the end of the perfect hash table's file name.  Define
// a structure, TABLE_INFO_ON_DISK_HEADER, that literally represents the on-disk
// layout of this metadata.  Each algorithm implementation must write out an
// info record that conforms with this common header.  They are free to extend
// it with additional details.
//

typedef union _TABLE_INFO_ON_DISK_HEADER_FLAGS {

    struct {

        //
        // Unused bits.
        //

        ULONG Unused:32;

    };

    LONG AsLong;
    ULONG AsULong;

} TABLE_INFO_ON_DISK_HEADER_FLAGS;
C_ASSERT(sizeof(TABLE_INFO_ON_DISK_HEADER_FLAGS) == sizeof(ULONG));

typedef struct _Struct_size_bytes_(SizeOfStruct) _TABLE_INFO_ON_DISK_HEADER {

    //
    // A magic value used to identify the structure.
    //

    ULARGE_INTEGER Magic;

    //
    // Size of the structure, in bytes.
    //
    // N.B. We don't annotate this with a _Field_range_ SAL annotation as the
    //      value will vary depending on which parameters were used to create
    //      the table.
    //

    ULONG SizeOfStruct;

    //
    // Flags.
    //

    TABLE_INFO_ON_DISK_HEADER_FLAGS Flags;

    //
    // Algorithm that was used.
    //

    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId;

    //
    // Hash function that was used.
    //

    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId;

    //
    // Masking type.
    //

    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;

    //
    // Size of an individual key element, in bytes.
    //

    ULONG KeySizeInBytes;

    //
    // The concurrency level used to generate the hash.
    //

    ULONG Concurrency;

    //
    // Pad out to 8 bytes.
    //

    ULONG Padding;

    //
    // Number of keys in the input set.  This is used to size an appropriate
    // array for storing values.
    //

    ULARGE_INTEGER NumberOfKeys;

    //
    // Final number of elements in the underlying table.  This will vary
    // depending on how the graph was created.  If modulus masking is in use,
    // this will reflect the number of keys (unless a custom table size was
    // requested during creation).  Otherwise, this will be the number of keys
    // rounded up to the next power of 2.  (That is, take the number of keys,
    // round up to a power of 2, then round that up to the next power of 2.)
    //

    ULARGE_INTEGER NumberOfTableElements;

    //
    // Capture the hash and index details required by MaskHash(), MaskIndex()
    // and Index() routines.
    //

    ULONG HashSize;
    ULONG IndexSize;

    ULONG HashShift;
    ULONG IndexShift;

    ULONG HashMask;
    ULONG IndexMask;

    ULONG HashFold;
    ULONG IndexFold;

    ULONG HashModulus;
    ULONG IndexModulus;

    //
    // Seed data.
    //

    ULONG NumberOfSeeds;

    union {
        ULONG Seed1;
        ULONG FirstSeed;
    };

    ULONG Seed2;
    ULONG Seed3;

    union {
        ULONG Seed4;
        ULONG LastSeed;
    };

    //
    // Capture statistics about the perfect hash table solution that can be
    // useful during analysis and performance comparisons.
    //

    //
    // Number of attempts at solving the solution.
    //

    ULONGLONG NumberOfAttempts;

    //
    // Number of failed attempts at solving the solution.
    //

    ULONGLONG NumberOfFailedAttempts;

    //
    // If solutions are being sought in parallel, more than one thread may
    // find a solution before it detects that someone else has already found
    // a solution (in which case, it stops solving and returns from the pool
    // callback).  This counter measures the number of solutions that were
    // found in parallel.  It corresponds to the Context->FinishedCount value.
    //
    // With a good hashing function and large concurrency value, this value
    // may often be relatively large.  E.g. with a concurrency level of 12, it
    // would not be surprising to see 12 attempts reported, and 10 solutions
    // found.  (Meaning that our concurrency level was a tad unnecessary, or
    // our hash function is unnecessarily good (which usually means expensive).)
    //
    // N.B. Finding multiple solutions in parallel is harmless, if not a little
    //      wasteful of CPU time.  The threads can detect if they are the first
    //      ones to find a solution prior to continuing with the assignment
    //      step, such that the assignment can be avoided if another thread has
    //      beaten them to that point.  They also check at regular intervals
    /       during graph solving, and will do a fast-path exit as soon as they
    //      detect another thread has found a solution.
    //

    ULONGLONG NumberOfSolutionsFound;

    //
    // If a solution can't be found within a configurable threshold, a "table
    // resize" event will be generated.  This results in doubling the number
    // of vertices being used for the backing table and trying the solution
    // again.  The following field captures how many times this occurred.
    //

    ULONGLONG NumberOfTableResizeEvents;

    //
    // This counter captures the sum of all prior attempts at solving the
    // solution before giving up and resizing the table.  It excludes the
    // attempts made by the winning table size (captured by NumberOfAttempts).
    // It will be zero if no resize events occurred.  We don't keep a separate
    // failed counter here, as the resize event implies all attempts failed.
    //

    ULONGLONG TotalNumberOfAttemptsWithSmallerTableSizes;

    //
    // The following counter captures the initial table size that was attempted
    // in order to solve the solution.  It will differ from the final table size
    // if there were resize events.  As we simply double the size of the table
    // on each resize event, we can extrapolate the different table sizes that
    // were tried prior to finding a winning one by looking at the initial size
    // attempted and the number of resize events.
    //

    ULONGLONG InitialTableSize;

    //
    // The following counter captures the closest we came to solving the graph
    // in previous attempts before a resize event occurred.  This is calculated
    // by taking the number of edges and subtracting the value of the context's
    // HighestDeletedEdgesCount counter.  The value represents the additional
    // number of 1 degree edges we needed to delete in order to obtain an
    // acyclic graph.  A very low number indicates that we came very close to
    // solving it as there were very few hash collisions with the seed values
    // we picked.  A very high number indicates we had no chance and there were
    // collisions galore.
    //

    ULONGLONG ClosestWeCameToSolvingGraphWithSmallerTableSizes;

    //
    // Number of cycles it took to solve the solution for the winning thread.
    // (This does not factor in the total cycle time consumed by all threads.)
    //

    ULARGE_INTEGER SolveCycles;

    //
    // Number of microseconds taken to solve the solution.
    //

    ULARGE_INTEGER SolveMicroseconds;

    //
    // Number of cycles taken to verify the solution.
    //

    ULARGE_INTEGER VerifyCycles;

    //
    // Number of microseconds taken to verify the solution.
    //

    ULARGE_INTEGER VerifyMicroseconds;

    //
    // Number of cycles taken to prepare the file.
    //

    ULARGE_INTEGER PrepareFileCycles;

    //
    // Number of microseconds taken to prepare the file.
    //

    ULARGE_INTEGER PrepareFileMicroseconds;

    //
    // Number of cycles taken to save the file.
    //

    ULARGE_INTEGER SaveFileCycles;

    //
    // Number of microseconds taken to save the file.
    //

    ULARGE_INTEGER SaveFileMicroseconds;

} TABLE_INFO_ON_DISK_HEADER;
typedef TABLE_INFO_ON_DISK_HEADER *PTABLE_INFO_ON_DISK_HEADER;

    //
    // Save the on-disk representation of the graph information.  This is a
    // smaller subset of data needed in order to load a previously-solved
    // graph as a perfect hash table.  The data resides in an NTFS stream named
    // :Info off the main perfect hash table file.  It will have been mapped for
    // us already at Table->InfoStreamBaseAddress.
    //

    OnDiskInfo = (PGRAPH_INFO_ON_DISK)Table->InfoStreamBaseAddress;
    ASSERT(OnDiskInfo);
    OnDiskHeader = &OnDiskInfo->Header;
    OnDiskHeader->Magic.LowPart = TABLE_INFO_ON_DISK_MAGIC_LOWPART;
    OnDiskHeader->Magic.HighPart = TABLE_INFO_ON_DISK_MAGIC_HIGHPART;
    OnDiskHeader->SizeOfStruct = sizeof(*OnDiskInfo);
    OnDiskHeader->Flags.AsULong = 0;
    OnDiskHeader->Concurrency = Context->MaximumConcurrency;
    OnDiskHeader->AlgorithmId = Context->AlgorithmId;
    OnDiskHeader->MaskFunctionId = Context->MaskFunctionId;
    OnDiskHeader->HashFunctionId = Context->HashFunctionId;
    OnDiskHeader->KeySizeInBytes = sizeof(ULONG);
    OnDiskHeader->HashSize = Table->HashSize;
    OnDiskHeader->IndexSize = Table->IndexSize;
    OnDiskHeader->HashShift = Table->HashShift;
    OnDiskHeader->IndexShift = Table->IndexShift;
    OnDiskHeader->HashMask = Table->HashMask;
    OnDiskHeader->IndexMask = Table->IndexMask;
    OnDiskHeader->HashFold = Table->HashFold;
    OnDiskHeader->IndexFold = Table->IndexFold;
    OnDiskHeader->HashModulus = Table->HashModulus;
    OnDiskHeader->IndexModulus = Table->IndexModulus;
    OnDiskHeader->NumberOfKeys.QuadPart = (
        Table->Keys->NumberOfElements.QuadPart
    );
    OnDiskHeader->NumberOfSeeds = ((
        FIELD_OFFSET(GRAPH, LastSeed) -
        FIELD_OFFSET(GRAPH, FirstSeed)
    ) / sizeof(ULONG)) + 1;

    //
    // This will change based on masking type and whether or not the caller
    // has provided a value for NumberOfTableElements.  For now, keep it as
    // the number of vertices.
    //

    OnDiskHeader->NumberOfTableElements.QuadPart = (
        NumberOfVertices.QuadPart
    );

    CopyMemory(&OnDiskInfo->Dimensions, Dim, sizeof(*Dim));

        case FileWorkSaveId: {

            PULONG Dest;
            PULONG Source;
            PGRAPH Graph;
            ULONG WaitResult;
            BOOLEAN Success;
            ULONGLONG SizeInBytes;
            LARGE_INTEGER EndOfFile;
            PPERFECT_HASH_TABLE Table;
            PTABLE_INFO_ON_DISK_HEADER Header;

            //
            // Indicate the save event has completed upon return of this
            // callback.
            //

            SetOnReturnEvent = SavedEvent;

            //
            // Initialize aliases.
            //

            Table = Context->Table;
            Dest = (PULONG)Table->BaseAddress;
            Graph = (PGRAPH)Context->SolvedContext;
            Source = Graph->Assigned;
            Header = Table->Header;

            SizeInBytes = (
                Header->NumberOfTableElements.QuadPart *
                Header->KeySizeInBytes
            );

            //
            // The graph has been solved.  Copy the array of assigned values
            // to the mapped area we prepared earlier (above).
            //

            CopyMemory(Dest, Source, SizeInBytes);

            //
            // Save the seed values used by this graph.  (Everything else in
            // the on-disk info representation was saved earlier.)
            //

            Header->Seed1 = Graph->Seed1;
            Header->Seed2 = Graph->Seed2;
            Header->Seed3 = Graph->Seed3;
            Header->Seed4 = Graph->Seed4;

            //
            // Kick off a flush file buffers now before we wait on the verified
            // event.  The flush will be a blocking call.  The wait on verified
            // will be blocking if the event isn't signaled.  So, we may as well
            // get some useful blocking work done, before potentially going into
            // another wait state where we're not doing anything useful.
            //

            ASSERT(FlushFileBuffers(Table->FileHandle));

            //
            // Stop the save file timer here, after flushing the file buffers,
            // but before we potentially wait on the verified state.
            //

            CONTEXT_END_TIMERS(SaveFile);

            //
            // Wait on the verification complete event.  This is done in the
            // main thread straight after it dispatches our file work callback
            // (that ended up here).  We need to block on this event as we want
            // to save the timings for verification to the header.
            //

            WaitResult = WaitForSingleObject(Context->VerifiedEvent, INFINITE);
            ASSERT(WaitResult == WAIT_OBJECT_0);

            //
            // When we mapped the array in the work item above, we used a size
            // that was aligned with the system allocation granularity.  We now
            // want to set the end of file explicitly to the exact size of the
            // underlying array.  To do this, we unmap the view, delete the
            // section, set the file pointer to where we want, set the end of
            // file (which will apply the file pointer position as EOF), then
            // close the file handle.
            //

            ASSERT(UnmapViewOfFile(Table->BaseAddress));
            Table->BaseAddress = NULL;

            ASSERT(CloseHandle(Table->MappingHandle));
            Table->MappingHandle = NULL;

            EndOfFile.QuadPart = SizeInBytes;

            Success = SetFilePointerEx(Table->FileHandle,
                                       EndOfFile,
                                       NULL,
                                       FILE_BEGIN);

            ASSERT(Success);

            ASSERT(SetEndOfFile(Table->FileHandle));

            ASSERT(CloseHandle(Table->FileHandle));
            Table->FileHandle = NULL;

            //
            // Stop the save file timers, then copy all the timer values into
            // the header (before closing the :Info stream).
            //

            CONTEXT_SAVE_TIMERS_TO_HEADER(Solve);
            CONTEXT_SAVE_TIMERS_TO_HEADER(Verify);
            CONTEXT_SAVE_TIMERS_TO_HEADER(PrepareFile);
            CONTEXT_SAVE_TIMERS_TO_HEADER(SaveFile);

            //
            // Save the number of attempts and number of finished solutions.
            //

            Header->NumberOfAttempts = Context->Attempts;
            Header->NumberOfFailedAttempts = Context->FailedAttempts;
            Header->NumberOfSolutionsFound = Context->FinishedCount;

            //
            // Finalize the :Info stream the same way we handled the backing
            // file above; unmap, delete section, set file pointer, set eof,
            // close file.
            //

            ASSERT(UnmapViewOfFile(Table->InfoStreamBaseAddress));
            Table->InfoStreamBaseAddress = NULL;

            ASSERT(CloseHandle(Table->InfoStreamMappingHandle));
            Table->InfoStreamMappingHandle = NULL;

            //
            // The file size for the :Info stream will be the size of our
            // on-disk info structure.
            //

            EndOfFile.QuadPart = sizeof(*OnDiskInfo);

            Success = SetFilePointerEx(Table->InfoStreamFileHandle,
                                       EndOfFile,
                                       NULL,
                                       FILE_BEGIN);

            ASSERT(Success);

            ASSERT(SetEndOfFile(Table->InfoStreamFileHandle));

            ASSERT(CloseHandle(Table->InfoStreamFileHandle));
            Table->InfoStreamFileHandle = NULL;

            break;
        }

So, to summarize, the :Info NTFS stream is memory mapped and then directly cast into a GRAPH_INFO_ON_DISK structure, which is required to embed the TABLE_INFO_ON_DISK_HEADER structure as its first element. The LoadPerfectHashTable() routine is then able to memory map the same :Info stream at a later point (or in a separate process), validate the header semantics, extract the algorithm ID, then call the algorithm implementation's custom loader routine.

This is how we satisfy the original off-line generation requirement. (That is, making sure the creation step persists something that can be loaded and used at an arbitrary future time, in a different process, potentially on a different machine.)

SelfTestPerfectHashTable

Let's step back and take a look at the self test routine, as this exercises all of the main public API routines, and serves as a good starting point. I'll provide two code samples; the first is a vastly simplified pseudo-code depiction of the routine, sans all details other than the absolute bare minimum, the second is the full routine.

Pseudo
Full

_Use_decl_annotations_
BOOLEAN
SelfTestPerfectHashTable(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PPERFECT_HASH_TABLE_ANY_API AnyApi,
    PCUNICODE_STRING TestDataDirectory,
    PULONG MaximumConcurrency,
    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId,
    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId,
    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
    )
/*++

Routine Description:

    Performs a self-test of the entire PerfectHashTable component.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for all memory allocations.

    AnyApi - Supplies a pointer to an initialized PERFECT_HASH_TABLE_ANY_API
        structure.  Note that this must be an instance of the extended API;
        this is verified by looking at the Api-&SizeOfStruct field and ensuring
        it matches our expected size of the extended API structure.

    TestDataDirectory - Supplies a pointer to a UNICODE_STRING structure that
        represents a fully-qualified path of the test data directory.

    MaximumConcurrency - Optionally supplies a pointer to a variable that
        contains the desired maximum concurrency to be used for the underlying
        threadpool.  If NULL, or non-NULL but points to a value of 0, then the
        number of system processors will be used as a default value.

        N.B. This value is passed directly to SetThreadpoolThreadMinimum() and
             SetThreadpoolThreadMaximum().

    AlgorithmId - Supplies the algorithm to use.

    MaskFunctionId - Supplies the type of masking to use.

    HashFunctionId - Supplies the hash function to use.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{

    //
    // Enumerate over all keys in the test data directory.
    //

    for (KeysPath in FindFiles(TestDataDirectory, "*.keys")) {

        //
        // Create a new perfect hash table context.
        //

        Api->CreatePerfectHashTableContext(Rtl,
                                           Allocator,
                                           MaximumConcurrency,
                                           &Context);


        //
        // Load the keys file.
        //

        Api->LoadPerfectHashTableKeys(Rtl,
                                      Allocator,
                                      &KeysPath,
                                      &Keys);

        //
        // Keys were loaded successfully.  Construct the equivalent path name
        // for the backing perfect hash table when persisted to disk.  Although
        // this can be automated for us as part of CreatePerfectHashTable(),
        // having it backed by our memory simplifies things a little further
        // down the track when we want to load the table via the path but have
        // destroyed the original table that came from CreatePerfectHashTable().
        //

        CreateTablePathFromKeysPath(&Keys, &TablePath);

        //
        // We now have the fully-qualified path name of the backing perfect
        // hash table file living in TablePath.  Continue with creation of the
        // perfect hash table, using this path we've just created and the keys
        // that were loaded.
        //

        Api->CreatePerfectHashTable(Rtl,
                                    Allocator,
                                    Context,
                                    AlgorithmId,
                                    MaskFunctionId,
                                    HashFunctionId,
                                    NULL,
                                    Keys,
                                    &TablePath);

        //
        // Load the perfect hash table we just created.
        //

        Api->LoadPerfectHashTable(Rtl,
                                     Allocator,
                                     Keys,
                                     &TablePath,
                                     &Table);

        //
        // Table was loaded successfully from disk.  Obtain the names of all
        // the enumeration IDs.  Currently these should always match the same
        // enums provided as input parameters to this routine.
        //

        Api->GetAlgorithmName(Table->AlgorithmId, &AlgorithmName);

        Api->GetHashFunctionName(Table->HashFunctionId, &HashFunctionName);

        Api->GetMaskFunctionName(Table->MaskFunctionId, &MaskFunctionName);

        //
        // Test the table.  The TRUE parameter corresponds to the DebugBreakOnFailure
        // boolean parameter, indicating that we want an immediate __debugbreak() if
        // a failure is detected.
        //

        Api->TestPerfectHashTable(Table, TRUE);

        //
        // Dump some stats from the header.
        //

        DumpHeaderStats(Table->Header);

        //
        // Destroy the table, keys, and context.
        //

        Table->Vtbl->Release(Table);

        Api->DestroyPerfectHashTableKeys(&Keys);

        Api->DestroyPerfectHashTableContext(&Context, NULL);

    }
}

_Use_decl_annotations_
BOOLEAN
SelfTestPerfectHashTable(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PPERFECT_HASH_TABLE_ANY_API AnyApi,
    PCUNICODE_STRING TestDataDirectory,
    PULONG MaximumConcurrency,
    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId,
    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId,
    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId
    )
/*++

Routine Description:

    Performs a self-test of the entire PerfectHashTable component.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for all memory allocations.

    AnyApi - Supplies a pointer to an initialized PERFECT_HASH_TABLE_ANY_API
        structure.  Note that this must be an instance of the extended API;
        this is verified by looking at the Api->SizeOfStruct field and ensuring
        it matches our expected size of the extended API structure.

    TestDataDirectory - Supplies a pointer to a UNICODE_STRING structure that
        represents a fully-qualified path of the test data directory.

    MaximumConcurrency - Optionally supplies a pointer to a variable that
        contains the desired maximum concurrency to be used for the underlying
        threadpool.  If NULL, or non-NULL but points to a value of 0, then the
        number of system processors will be used as a default value.

        N.B. This value is passed directly to SetThreadpoolThreadMinimum() and
             SetThreadpoolThreadMaximum().

    AlgorithmId - Supplies the algorithm to use.

    MaskFunctionId - Supplies the type of masking to use.

    HashFunctionId - Supplies the hash function to use.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    PWSTR Dest;
    PWSTR Source;
    USHORT Length;
    USHORT BaseLength;
    USHORT NumberOfPages;
    BOOLEAN Success;
    BOOLEAN Failed;
    BOOLEAN IsProcessTerminating = FALSE;
    PWCHAR Buffer;
    PWCHAR BaseBuffer;
    PWCHAR WideOutput;
    PWCHAR WideOutputBuffer;
    HANDLE FindHandle = NULL;
    HANDLE WideOutputHandle;
    HANDLE ProcessHandle = NULL;
    ULONG Failures;
    ULONG BytesWritten;
    ULONG WideCharsWritten;
    ULONGLONG BufferSize;
    ULONGLONG WideOutputBufferSize;
    LONG_INTEGER AllocSize;
    LARGE_INTEGER BytesToWrite;
    LARGE_INTEGER WideCharsToWrite;
    WIN32_FIND_DATAW FindData;
    UNICODE_STRING SearchPath;
    UNICODE_STRING KeysPath;
    UNICODE_STRING TablePath;
    PPERFECT_HASH_TABLE Table;
    PPERFECT_HASH_TABLE_KEYS Keys;
    PTABLE_INFO_ON_DISK_HEADER Header;
    PPERFECT_HASH_TABLE_CONTEXT Context;
    PPERFECT_HASH_TABLE_API Api;
    PPERFECT_HASH_TABLE_API_EX ApiEx;
    PUNICODE_STRING AlgorithmName;
    PUNICODE_STRING HashFunctionName;
    PUNICODE_STRING MaskFunctionName;
    UNICODE_STRING Suffix = RTL_CONSTANT_STRING(L"*.keys");
    UNICODE_STRING TableSuffix = RTL_CONSTANT_STRING(L"pht1");

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Rtl)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Allocator)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(AnyApi)) {

        return FALSE;

    } else {

        Api = &AnyApi->Api;

        if (Api->SizeOfStruct == sizeof(*ApiEx)) {

            ApiEx = &AnyApi->ApiEx;

        } else {

            //
            // The API should be the extended version.  (Otherwise, how did
            // the caller even find this function?)
            //

            return FALSE;
        }

    }

    if (!ARGUMENT_PRESENT(TestDataDirectory)) {
        return FALSE;
    }

    if (!IsValidMinimumDirectoryUnicodeString(TestDataDirectory)) {
        return FALSE;
    }

    if (!IsValidPerfectHashTableAlgorithmId(AlgorithmId)) {

        return FALSE;

    } else {


    }

    //
    // Arguments have been validated, proceed.
    //

    //
    // Create a buffer we can use for stdout, using a very generous buffer size.
    //

    NumberOfPages = 10;
    Success = Rtl->CreateBuffer(Rtl,
                                &ProcessHandle,
                                NumberOfPages,
                                NULL,
                                &WideOutputBufferSize,
                                &WideOutputBuffer);

    if (!Success) {
        return FALSE;
    }

    WideOutput = WideOutputBuffer;

    //
    // Create a buffer we can use for temporary path construction.  We want it
    // to be MAX_USHORT in size, so (1 << 16) >> PAGE_SHIFT converts this into
    // the number of pages we need.
    //

    NumberOfPages = (1 << 16) >> PAGE_SHIFT;
    Success = Rtl->CreateBuffer(Rtl,
                                &ProcessHandle,
                                NumberOfPages,
                                NULL,
                                &BufferSize,
                                &BaseBuffer);

    if (!Success) {
        return FALSE;
    }

    Buffer = BaseBuffer;

    //
    // Get a reference to the stdout handle.
    //

    WideOutputHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    ASSERT(WideOutputHandle);

    //
    // Calculate the size required for a new concatenated wide string buffer
    // that combines the test data directory with the "*.keys" suffix.  The
    // 2 * sizeof(*Dest) accounts for the joining slash and trailing NULL.
    //

    AllocSize.LongPart = TestDataDirectory->Length;
    AllocSize.LongPart += Suffix.Length + (2 * sizeof(*Dest));

    ASSERT(!AllocSize.HighPart);

    SearchPath.Buffer = (PWSTR)Buffer;

    if (!SearchPath.Buffer) {
        goto Error;
    }

    //
    // Copy incoming test data directory name.
    //

    Length = TestDataDirectory->Length;
    CopyMemory(SearchPath.Buffer,
               TestDataDirectory->Buffer,
               Length);

    //
    // Advance our Dest pointer to the end of the directory name, write a
    // slash, then copy the suffix over.
    //

    Dest = (PWSTR)RtlOffsetToPointer(SearchPath.Buffer, Length);
    *Dest++ = L'\\';
    CopyMemory(Dest, Suffix.Buffer, Suffix.Length);

    //
    // Wire up the search path length and maximum length variables.  The max
    // length will be our AllocSize, length will be this value minus 2 to
    // account for the trailing NULL.
    //

    SearchPath.MaximumLength = AllocSize.LowPart;
    SearchPath.Length = AllocSize.LowPart - sizeof(*Dest);
    ASSERT(SearchPath.Buffer[SearchPath.Length] == L'\0');

    //
    // Advance the buffer past this string allocation, up to the next 16-byte
    // boundary.
    //

    Buffer = (PWSTR)(
        RtlOffsetToPointer(
            Buffer,
            ALIGN_UP(SearchPath.MaximumLength, 16)
        )
    );

    WIDE_OUTPUT_RAW(WideOutput,
                    L"Starting perfect hash self-test for directory: ");
    WIDE_OUTPUT_UNICODE_STRING(WideOutput, TestDataDirectory);
    WIDE_OUTPUT_RAW(WideOutput, L".\n");
    WIDE_OUTPUT_FLUSH();

    //
    // Create a find handle for the <test data>\*.keys search pattern we
    // created.
    //

    FindHandle = FindFirstFileW(SearchPath.Buffer, &FindData);

    if (!FindHandle || FindHandle == INVALID_HANDLE_VALUE) {

        ULONG LastError;

        //
        // Check to see if we failed because there were no files matching the
        // wildcard *.keys in the test directory.  In this case, GetLastError()
        // will report ERROR_FILE_NOT_FOUND.
        //

        LastError = GetLastError();

        if (LastError == ERROR_FILE_NOT_FOUND) {

            WIDE_OUTPUT_RAW(WideOutput,
                            L"No files matching pattern '*.keys' found in "
                            L"test data directory.\n");
            WIDE_OUTPUT_FLUSH();

            goto End;

        } else {

            //
            // We failed for some other reason.
            //

            WIDE_OUTPUT_RAW(WideOutput,
                            L"FindFirstFileW() failed with error code: ");
            WIDE_OUTPUT_INT(WideOutput, LastError);
            WIDE_OUTPUT_LF(WideOutput);
            WIDE_OUTPUT_FLUSH();

            goto Error;
        }
    }

    //
    // Initialize the fully-qualified keys path.
    //

    KeysPath.Buffer = Buffer;
    CopyMemory(KeysPath.Buffer, TestDataDirectory->Buffer, Length);

    //
    // Advance our Dest pointer to the end of the directory name, then write
    // a slash.
    //

    Dest = (PWSTR)RtlOffsetToPointer(KeysPath.Buffer, Length);
    *Dest++ = L'\\';

    //
    // Update the length to account for the slash we just wrote, then make a
    // copy of it in the variable BaseLength.
    //

    Length += sizeof(*Dest);
    BaseLength = Length;

    //
    // Zero the failure count and begin the main loop.
    //

    Failures = 0;

    do {

        //
        // Clear the failure flag at the top of every loop invocation.
        //

        Failed = FALSE;

        WIDE_OUTPUT_RAW(WideOutput, L"Processing key file: ");
        WIDE_OUTPUT_WCSTR(WideOutput, (PCWSZ)FindData.cFileName);
        WIDE_OUTPUT_LF(WideOutput);
        WIDE_OUTPUT_FLUSH();

        //
        // Create a new perfect hash table context.
        //

        Success = Api->CreatePerfectHashTableContext(Rtl,
                                                     Allocator,
                                                     MaximumConcurrency,
                                                     &Context);

        if (!Success) {

            //
            // We can't do anything without a context.
            //

            WIDE_OUTPUT_RAW(WideOutput, L"Fatal: failed to create context.\n");
            WIDE_OUTPUT_FLUSH();
            Failures++;
            break;
        }

        //
        // Copy the filename over to the fully-qualified keys path.
        //

        Dest = (PWSTR)RtlOffsetToPointer(KeysPath.Buffer, BaseLength);
        Source = (PWSTR)FindData.cFileName;

        while (*Source) {
            *Dest++ = *Source++;
        }
        *Dest = L'\0';

        Length = (USHORT)RtlOffsetFromPointer(Dest, KeysPath.Buffer);
        KeysPath.Length = Length;
        KeysPath.MaximumLength = Length + sizeof(*Dest);
        ASSERT(KeysPath.Buffer[KeysPath.Length >> 1] == L'\0');
        ASSERT(&KeysPath.Buffer[KeysPath.Length >> 1] == Dest);

        Success = Api->LoadPerfectHashTableKeys(Rtl,
                                                Allocator,
                                                &KeysPath,
                                                &Keys);

        if (!Success) {

            WIDE_OUTPUT_RAW(WideOutput, L"Failed to load keys for ");
            WIDE_OUTPUT_UNICODE_STRING(WideOutput, &KeysPath);
            WIDE_OUTPUT_RAW(WideOutput, L".\n");
            WIDE_OUTPUT_FLUSH();

            Failures++;
            goto DestroyContext;
        }

        //
        // Keys were loaded successfully.  Construct the equivalent path name
        // for the backing perfect hash table when persisted to disk.  Although
        // this can be automated for us as part of CreatePerfectHashTable(),
        // having it backed by our memory simplifies things a little further
        // down the track when we want to load the table via the path but have
        // destroyed the original table that came from CreatePerfectHashTable().
        //

        //
        // Align the Dest pointer up to a 16-byte boundary.  (We add 1 to its
        // current value to advance it past the terminating NULL of the keys
        // path.)
        //

        Dest = (PWSTR)ALIGN_UP(Dest + 1, 16);

        //
        // We know the keys file ended with ".keys".  We're going to use the
        // identical name for the backing hash table file, except replace the
        // ".keys" extension at the end with ".pht1".  So, we can just copy the
        // lengths used for KeysPath, plus the entire buffer, then just copy the
        // new extension over the old one.  The 1 doesn't have any significance
        // other than it padding out the extension length such that it matches
        // the length of ".keys".  (Although it may act as a nice versioning
        // tool down the track.)
        //

        TablePath.Length = KeysPath.Length;
        TablePath.MaximumLength = KeysPath.MaximumLength;
        TablePath.Buffer = Dest;

        //
        // Copy the keys path over.
        //

        CopyMemory(Dest, KeysPath.Buffer, KeysPath.MaximumLength);

        //
        // Advance the Dest pointer to the end of the buffer, then retreat it
        // five characters, such that it's positioned on the 'k' of keys.
        //

        Dest += (KeysPath.MaximumLength >> 1) - 5;
        ASSERT(*Dest == L'k');
        ASSERT(*(Dest - 1) == L'.');

        //
        // Copy the "pht1" extension over "keys".
        //

        Source = TableSuffix.Buffer;
        while (*Source) {
            *Dest++ = *Source++;
        }
        *Dest = L'\0';

        //
        // Sanity check invariants.
        //

        ASSERT(TablePath.Buffer[TablePath.Length >> 1] == L'\0');
        ASSERT(&TablePath.Buffer[TablePath.Length >> 1] == Dest);

        //
        // We now have the fully-qualified path name of the backing perfect
        // hash table file living in TablePath.  Continue with creation of the
        // perfect hash table, using this path we've just created and the keys
        // that were loaded.
        //

        Success = Api->CreatePerfectHashTable(Rtl,
                                              Allocator,
                                              Context,
                                              AlgorithmId,
                                              MaskFunctionId,
                                              HashFunctionId,
                                              NULL,
                                              Keys,
                                              &TablePath);

        if (!Success) {

            WIDE_OUTPUT_RAW(WideOutput, L"Failed to create perfect hash "
                                        L"table for keys: ");
            WIDE_OUTPUT_UNICODE_STRING(WideOutput, &KeysPath);
            WIDE_OUTPUT_RAW(WideOutput, L".\n");
            WIDE_OUTPUT_FLUSH();

            Failed = TRUE;
            Failures++;
            goto DestroyKeys;
        }

        WIDE_OUTPUT_RAW(WideOutput, L"Successfully created perfect "
                                    L"hash table: ");
        WIDE_OUTPUT_UNICODE_STRING(WideOutput, &TablePath);
        WIDE_OUTPUT_RAW(WideOutput, L".\n");

        //
        // Load the perfect hash table we just created.
        //

        Success = Api->LoadPerfectHashTable(Rtl,
                                            Allocator,
                                            Keys,
                                            &TablePath,
                                            &Table);

        if (!Success) {

            WIDE_OUTPUT_RAW(WideOutput, L"Failed to load perfect hash table: ");
            WIDE_OUTPUT_UNICODE_STRING(WideOutput, &TablePath);
            WIDE_OUTPUT_RAW(WideOutput, L".\n");
            WIDE_OUTPUT_FLUSH();

            Failures++;
            goto DestroyKeys;

        }

        //
        // Table was loaded successfully from disk.  Obtain the names of all
        // the enumeration IDs.  Currently these should always match the same
        // enums provided as input parameters to this routine.
        //
        // N.B. I'm being lazy with the ASSERT()s here instead of reporting the
        //      error properly like we do with other failures.
        //

        ASSERT(Api->GetAlgorithmName(Table->AlgorithmId, &AlgorithmName));

        ASSERT(Api->GetHashFunctionName(Table->HashFunctionId,
                                        &HashFunctionName));

        ASSERT(Api->GetMaskFunctionName(Table->MaskFunctionId,
                                        &MaskFunctionName));


        WIDE_OUTPUT_RAW(WideOutput, L"Successfully loaded perfect "
                                    L"hash table: ");
        WIDE_OUTPUT_UNICODE_STRING(WideOutput, &TablePath);
        WIDE_OUTPUT_RAW(WideOutput, L".\n");

        WIDE_OUTPUT_RAW(WideOutput, L"Algorithm: ");
        WIDE_OUTPUT_UNICODE_STRING(WideOutput, AlgorithmName);
        WIDE_OUTPUT_RAW(WideOutput, L" (");
        WIDE_OUTPUT_INT(WideOutput, Table->AlgorithmId);
        WIDE_OUTPUT_RAW(WideOutput, L").\n");

        WIDE_OUTPUT_RAW(WideOutput, L"Hash Function: ");
        WIDE_OUTPUT_UNICODE_STRING(WideOutput, HashFunctionName);
        WIDE_OUTPUT_RAW(WideOutput, L" (");
        WIDE_OUTPUT_INT(WideOutput, Table->HashFunctionId);
        WIDE_OUTPUT_RAW(WideOutput, L").\n");

        WIDE_OUTPUT_RAW(WideOutput, L"Mask Function: ");
        WIDE_OUTPUT_UNICODE_STRING(WideOutput, MaskFunctionName);
        WIDE_OUTPUT_RAW(WideOutput, L" (");
        WIDE_OUTPUT_INT(WideOutput, Table->MaskFunctionId);
        WIDE_OUTPUT_RAW(WideOutput, L").\n");

        WIDE_OUTPUT_FLUSH();

        //
        // Test the table.
        //

        Success = Api->TestPerfectHashTable(Table, TRUE);

        if (!Success) {

            WIDE_OUTPUT_RAW(WideOutput, L"Test failed for perfect hash table "
                                        L"loaded from disk: ");
            WIDE_OUTPUT_UNICODE_STRING(WideOutput, &TablePath);
            WIDE_OUTPUT_RAW(WideOutput, L".\n");
            WIDE_OUTPUT_FLUSH();

            Failures++;
            Failed = TRUE;
            goto DestroyTable;
        }

        WIDE_OUTPUT_RAW(WideOutput, L"Successfully tested perfect hash "
                                    L"table.\n");

        //
        // Initialize header alias.
        //

        Header = Table->Header;

        //
        // Define some helper macros here for dumping stats.
        //

#define STATS_INT(String, Name)                               \
        WIDE_OUTPUT_RAW(WideOutput, String);                  \
        WIDE_OUTPUT_INT(WideOutput, Table->Header->##Name##); \
        WIDE_OUTPUT_RAW(WideOutput, L".\n")

#define STATS_QUAD(String, Name)                                       \
        WIDE_OUTPUT_RAW(WideOutput, String);                           \
        WIDE_OUTPUT_INT(WideOutput, Table->Header->##Name##.QuadPart); \
        WIDE_OUTPUT_RAW(WideOutput, L".\n")

        if (Header->NumberOfTableResizeEvents > 0) {

            STATS_INT(L"Number of table resize events: ",
                      NumberOfTableResizeEvents);

            STATS_INT(L"Total number of attempts with smaller table sizes: ",
                      TotalNumberOfAttemptsWithSmallerTableSizes);

            STATS_INT(L"First table size attempted: ",
                      InitialTableSize);

            STATS_INT(L"Closest we came to solving the graph in previous "
                      L"attempts by number of deleted edges away: ",
                      ClosestWeCameToSolvingGraphWithSmallerTableSizes);

        }

        STATS_INT(L"Concurrency: ", Concurrency);
        STATS_INT(L"Number of attempts: ", NumberOfAttempts);
        STATS_INT(L"Number of failed attempts: ", NumberOfFailedAttempts);
        STATS_INT(L"Number of solutions found: ", NumberOfSolutionsFound);

        STATS_QUAD(L"Number of keys: ", NumberOfKeys);
        STATS_QUAD(L"Number of table elements (vertices): ",
                   NumberOfTableElements);

        STATS_INT(L"Seed 1: ", Seed1);
        STATS_INT(L"Seed 2: ", Seed2);
        STATS_INT(L"Seed 3: ", Seed3);
        STATS_INT(L"Seed 4: ", Seed4);


        STATS_QUAD(L"Cycles to solve: ", SolveCycles);
        STATS_QUAD(L"Cycles to verify: ", VerifyCycles);
        STATS_QUAD(L"Cycles to prepare file: ", PrepareFileCycles);
        STATS_QUAD(L"Cycles to save file: ", SaveFileCycles);

        STATS_QUAD(L"Microseconds to solve: ", SolveMicroseconds);
        STATS_QUAD(L"Microseconds to verify: ", VerifyMicroseconds);
        STATS_QUAD(L"Microseconds to prepare file: ", PrepareFileMicroseconds);
        STATS_QUAD(L"Microseconds to save file: ", SaveFileMicroseconds);


        WIDE_OUTPUT_RAW(WideOutput, L"\n\n");
        WIDE_OUTPUT_FLUSH();

        //
        // Destroy the table.
        //

DestroyTable:

        Table->Vtbl->Release(Table);

DestroyKeys:

        Success = Api->DestroyPerfectHashTableKeys(&Keys);
        if (!Success) {

            WIDE_OUTPUT_RAW(WideOutput, L"Failed to destroy keys for ");
            WIDE_OUTPUT_UNICODE_STRING(WideOutput, &KeysPath);
            WIDE_OUTPUT_RAW(WideOutput, L".\n");
            WIDE_OUTPUT_FLUSH();

            Failures++;
        }

DestroyContext:

        Success = Api->DestroyPerfectHashTableContext(&Context, NULL);
        if (!Success) {

            //
            // Failure to destroy a context is a fatal error that we can't
            // recover from.  Bomb out now.
            //

            WIDE_OUTPUT_RAW(WideOutput, L"Fatal: failed to destroy context.\n");
            WIDE_OUTPUT_FLUSH();

            Failures++;
            break;
        }

    } while (FindNextFile(FindHandle, &FindData));

    //
    // Self test complete!
    //

    if (!Failures) {
        Success = TRUE;
        goto End;
    }

    //
    // Intentional follow-on to Error.
    //

Error:

    Success = FALSE;

    //
    // Intentional follow-on to End.
    //

End:

    //
    // We can't do much if any of these routines error out, hence the NOTHINGs.
    //

    if (WideOutputBuffer) {
        if (!Rtl->DestroyBuffer(Rtl, ProcessHandle, &WideOutputBuffer)) {
            NOTHING;
        }
        WideOutput = NULL;
    }

    if (BaseBuffer) {
        if (!Rtl->DestroyBuffer(Rtl, ProcessHandle, &BaseBuffer)) {
            NOTHING;
        }
        Buffer = NULL;
    }

    if (FindHandle) {
        if (!FindClose(FindHandle)) {
            NOTHING;
        }
        FindHandle = NULL;
    }

    return Success;
}

The self-test routine creates a context, loads a key file, creates a perfect hash table using the context from the key file, tests the perfect hash table, then destroys it, the keys, and the context. Let's take a look at each routine and any corresponding data structures involved.

Context Creation

This routine is responsible for creating a PERFECT_HASH_TABLE_CONTEXT structure, which we use to encapsulate all of the details regarding the threadpool machinery we provide to the backend algorithm implementations such that solutions can be sought in parallel. The context is passed to the CreatePerfectHashTable() routine (along with the keys we load in the next step).

PERFECT_HASH_TABLE_CONTEXT

The public header gets an opaque structure for the context. We don't need to expose any internal details of the structure to consumers, other than providing with an opaque pointer they can subsequently pass to the creation routine, and then to the DestroyPerfectHashTableContext() routine once they're finished creating perfect hash tables.

The structure is defined in the private header PerfectHashTablePrivate.h and looks like this:

//
// Define a runtime context to encapsulate threadpool resources.  This is
// passed to CreatePerfectHashTable() and allows for algorithms to search for
// perfect hash solutions in parallel.
//

typedef union _PERFECT_HASH_TABLE_CONTEXT_FLAGS {
    struct {
        ULONG Unused:32;
    };
    LONG AsLong;
    ULONG AsULong;
} PERFECT_HASH_TABLE_CONTEXT_FLAGS;
C_ASSERT(sizeof(PERFECT_HASH_TABLE_CONTEXT_FLAGS) == sizeof(ULONG));
typedef PERFECT_HASH_TABLE_CONTEXT_FLAGS *PPERFECT_HASH_TABLE_CONTEXT_FLAGS;

typedef struct _Struct_size_bytes_(SizeOfStruct) _PERFECT_HASH_TABLE_CONTEXT {

    //
    // Size of the structure, in bytes.
    //

    _Field_range_(==, sizeof(struct _PERFECT_HASH_TABLE_CONTEXT))
        ULONG SizeOfStruct;

    //
    // Flags.
    //

    PERFECT_HASH_TABLE_CONTEXT_FLAGS Flags;

    //
    // The algorithm in use.
    //

    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId;

    //
    // The masking type in use.
    //

    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;

    //
    // The hash function in use.
    //

    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId;

    //
    // Pointer to an initialized RTL structure.
    //

    PRTL Rtl;

    //
    // Pointer to an initialized allocator.
    //

    PALLOCATOR Allocator;

    //
    // Pointer to the API structure in use.
    //

    PPERFECT_HASH_TABLE_ANY_API AnyApi;

    //
    // Pointer to the active perfect hash table.
    //

    struct _PERFECT_HASH_TABLE *Table;

    //
    // The highest number of deleted edges count encountered by a worker thread.
    // This is useful when debugging a poorly performing hash/mask combo that is
    // failing to find a solution.
    //

    volatile ULONG HighestDeletedEdgesCount;

    //
    // The number of attempts we'll make at trying to solve the graph before
    // giving up and resizing with a larger underlying table.
    //

    ULONG ResizeTableThreshold;

    //
    // Limit on how many times a resize will be kicked off.
    //

    ULONG ResizeLimit;

    //
    // Define the events used to communicate various internal state changes
    // between the CreatePerfectHashTable() function and the algorithm-specific
    // creation routine.
    //
    // N.B. All of these events are created with the manual reset flag set to
    //      TRUE, such that they stay signalled even after they have satisfied
    //      a wait.
    //

    //
    // A global "shutdown" event handle that threads can query to determine
    // whether or not they should continue processing at various internal
    // checkpoints.
    //

    union {
        HANDLE ShutdownEvent;
        PVOID FirstEvent;
    };

    //
    // This event will be set if an algorithm was successful in finding a
    // perfect hash.  Either it or the FailedEvent will be set; never both.
    //

    HANDLE SucceededEvent;

    //
    // This event will be set if an algorithm failed to find a perfect hash
    // solution.  This may be due to the algorithm exhausting all possible
    // options, hitting a time limit, or potentially as a result of being
    // forcibly terminated or some other internal error.  It will never be
    // set if SucceededEvent is also set.
    //

    HANDLE FailedEvent;

    //
    // The following event is required to be set by an algorithm's creation
    // routine upon completion (regardless of success or failure).  This event
    // is waited upon by the CreatePerfectHashTable() function, and thus, is
    // critical in synchronizing the execution of parallel perfect hash solution
    // finding.
    //

    HANDLE CompletedEvent;

    //
    // The following event is set when a worker thread detects that the number
    // of attempts has exceeded a specified threshold, and that the main thread
    // should cancel the current attempts and try again with a larger vertex
    // table size.
    //

    HANDLE TryLargerTableSizeEvent;

    //
    // The following event is set when a worker thread has completed preparing
    // the underlying backing file in order for the solved graph to be persisted
    // to disk.
    //

    HANDLE PreparedFileEvent;

    //
    // The following event is set by the main thread when it has completed
    // verification of the solved graph.  It is used to signal to the save
    // file worker that verification has finished such that cycle counts can
    // be captured in order to calculate the number of cycles and microseconds
    // it took to verify the graph.
    //

    HANDLE VerifiedEvent;

    //
    // The following event is set when a worker thread has completed saving the
    // solved graph to disk.
    //

    union {
        HANDLE SavedFileEvent;
        PVOID LastEvent;
    };

    //
    // N.B. All events are created as named events, using the random object
    //      name generation helper Rtl->CreateRandomObjectNames().  This will
    //      fill out an array of PUNICODE_STRING pointers.  The next field
    //      points to the first element of that array.  Subsequent fields
    //      capture various book-keeping items about the random object names
    //      allocation (provided by the Rtl routine).
    //

    PUNICODE_STRING ObjectNames;
    PPUNICODE_STRING ObjectNamesPointerArray;
    PWSTR ObjectNamesWideBuffer;
    ULONG SizeOfObjectNamesWideBuffer;
    ULONG NumberOfObjects;

    //
    // Number of attempts made by the algorithm to find a solution.
    //

    volatile ULONGLONG Attempts;

    //
    // Counters used for capturing performance information.  We capture both a
    // cycle count, using __rdtsc(), plus a "performance counter" count, via
    // QueryPerformanceCounter().  The former provides a finer resolution, but
    // can't be used to calculate elapsed microseconds due to turbo boost and
    // variable frequencies.  The latter provides a coarser resolution, but
    // can be used to convert into elapsed microseconds (via the frequency,
    // also captured below).
    //

    LARGE_INTEGER Frequency;

    //
    // Capture the time required to solve the perfect hash table.  This is not
    // a sum of all cycles consumed by all worker threads; it is the cycles
    // consumed between the "main" thread (i.e. the CreatePerfectHashTable()
    // impl routine (CreatePerfectHashTableImplChm01())) dispatching parallel
    // work to the threadpool, and a solution being found.
    //

    ULARGE_INTEGER SolveStartCycles;
    LARGE_INTEGER SolveStartCounter;

    ULARGE_INTEGER SolveEndCycles;
    LARGE_INTEGER SolveEndCounter;

    ULARGE_INTEGER SolveElapsedCycles;
    ULARGE_INTEGER SolveElapsedMicroseconds;

    //
    // Capture the time required to verify the solution.  This involves walking
    // the entire key set, applying the perfect hash function to derive an index
    // into the Assigned array, and verifying that we only saw each index value
    // at most once.
    //
    // This is a reasonably good measure of the combined performance of the
    // chosen hash and mask algorithm, with lower cycles and counter values
    // indicating better performance.
    //

    ULARGE_INTEGER VerifyStartCycles;
    LARGE_INTEGER VerifyStartCounter;

    ULARGE_INTEGER VerifyEndCycles;
    LARGE_INTEGER VerifyEndCounter;

    ULARGE_INTEGER VerifyElapsedCycles;
    ULARGE_INTEGER VerifyElapsedMicroseconds;

    //
    // Capture the time required to prepare the backing .pht1 file in the file
    // work threadpool.
    //

    ULARGE_INTEGER PrepareFileStartCycles;
    LARGE_INTEGER PrepareFileStartCounter;

    ULARGE_INTEGER PrepareFileEndCycles;
    LARGE_INTEGER PrepareFileEndCounter;

    ULARGE_INTEGER PrepareFileElapsedCycles;
    ULARGE_INTEGER PrepareFileElapsedMicroseconds;

    //
    // Capture the time required to save the final Assigned array to the backing
    // file prepared in an earlier step.  This is also dispatched to the file
    // work thread pool, and consists of a memory copy from the assigned array
    // of the graph to the base address of the backing file's memory map, then
    // flushing the map, unmapping it, closing the section, and closing the
    // file.
    //

    ULARGE_INTEGER SaveFileStartCycles;
    LARGE_INTEGER SaveFileStartCounter;

    ULARGE_INTEGER SaveFileEndCycles;
    LARGE_INTEGER SaveFileEndCounter;

    ULARGE_INTEGER SaveFileElapsedCycles;
    ULARGE_INTEGER SaveFileElapsedMicroseconds;

    //
    // Number of failed attempts at solving the graph across all threads.
    //

    volatile ULONGLONG FailedAttempts;

    //
    // The main threadpool callback environment, used for solving perfect hash
    // solutions in parallel.
    //

    TP_CALLBACK_ENVIRON MainCallbackEnv;
    PTP_CLEANUP_GROUP MainCleanupGroup;
    PTP_POOL MainThreadpool;
    PTP_WORK MainWork;
    SLIST_HEADER MainWorkListHead;
    ULONG MinimumConcurrency;
    ULONG MaximumConcurrency;

    //
    // The algorithm is responsible for registering an appropriate callback
    // for main thread work items in this next field.
    //

    PPERFECT_HASH_TABLE_MAIN_WORK_CALLBACK MainWorkCallback;

    //
    // A threadpool for offloading file operations.
    //

    TP_CALLBACK_ENVIRON FileCallbackEnv;
    PTP_CLEANUP_GROUP FileCleanupGroup;
    PTP_POOL FileThreadpool;
    PTP_WORK FileWork;
    SLIST_HEADER FileWorkListHead;

    //
    // Provide a means for file work callbacks to indicate an error back to
    // the creation routine by incrementing the following counter.
    //

    volatile ULONG FileWorkErrors;
    volatile ULONG FileWorkLastError;

    //
    // The algorithm is responsible for registering an appropriate callback
    // for file work threadpool work items in this next field.
    //

    PPERFECT_HASH_TABLE_FILE_WORK_CALLBACK FileWorkCallback;

    //
    // If a threadpool worker thread finds a perfect hash solution, it will
    // enqueue a "Finished!"-type work item to a separate threadpool, captured
    // by the following callback environment.  This allows for a separate
    // threadpool worker to schedule the cancellation of other in-progress
    // and outstanding perfect hash solution attempts without deadlocking.
    //
    // This threadpool environment is serviced by a single thread.
    //
    // N.B. This cleanup only refers to the main graph solving thread pool.
    //      The file threadpool is managed by the implicit lifetime of the
    //      algorithm's creation routine (e.g. CreatePerfectHashTableImplChm01).
    //

    TP_CALLBACK_ENVIRON FinishedCallbackEnv;
    PTP_POOL FinishedThreadpool;
    PTP_WORK FinishedWork;
    SLIST_HEADER FinishedWorkListHead;

    //
    // If a worker thread successfully finds a perfect hash solution, it will
    // push its solution to the FinishedListHead above, then submit a finished
    // work item via SubmitThreadpoolWork(Context->FinishedWork).
    //
    // This callback will be processed by the finished group above, and provides
    // a means for that thread to set the ShutdownEvent and cancel outstanding
    // main work callbacks.
    //
    // N.B. Although we only need one solution, we don't prevent multiple
    //      successful solutions from being pushed to the FinishedListHead.
    //      Whatever the first solution is that the finished callback pops
    //      off that list is the solution that wins.
    //

    volatile ULONGLONG FinishedCount;

    //
    // Similar to the Finished group above, provide an Error group that also
    // consists of a single thread.  If a main threadpool worker thread runs
    // into a fatal error that requires termination of all in-progress and
    // outstanding threadpool work items, it can just dispatch a work item
    // to this particular pool (e.g. SubmitThreadpoolWork(Context->ErrorWork)).
    //
    // There is no ErrorListHead as no error information is captured that needs
    // communicating back to a central location.
    //

    TP_CALLBACK_ENVIRON ErrorCallbackEnv;
    PTP_POOL ErrorThreadpool;
    PTP_WORK ErrorWork;

    //
    // An opaque pointer that can be used by the algorithm to stash additional
    // context.
    //

    PVOID AlgorithmContext;

    //
    // An opaque pointer that can be used by the hash function to stash
    // additional context.
    //

    PVOID HashFunctionContext;

    //
    // An opaque pointer to the winning solution (i.e. the solved graph).
    //

    PVOID SolvedContext;

} PERFECT_HASH_TABLE_CONTEXT;
typedef PERFECT_HASH_TABLE_CONTEXT *PPERFECT_HASH_TABLE_CONTEXT;

//
// Define helper macros for marking start/end points for the context's
// cycle/counter fields.  When starting, we put __rdtsc() last, and when
// stopping we put it first, as its resolution is more sensitive than the
// QueryPerformanceCounter() routine.
//

#define CONTEXT_START_TIMERS(Name)                           \
    QueryPerformanceCounter(&Context->##Name##StartCounter); \
    Context->##Name##StartCycles.QuadPart = __rdtsc()

#define CONTEXT_END_TIMERS(Name)                              \
    Context->##Name##EndCycles.QuadPart = __rdtsc();          \
    QueryPerformanceCounter(&Context->##Name##EndCounter);    \
    Context->##Name##ElapsedCycles.QuadPart = (               \
        Context->##Name##EndCycles.QuadPart -                 \
        Context->##Name##StartCycles.QuadPart                 \
    );                                                        \
    Context->##Name##ElapsedMicroseconds.QuadPart = (         \
        Context->##Name##EndCounter.QuadPart -                \
        Context->##Name##StartCounter.QuadPart                \
    );                                                        \
    Context->##Name##ElapsedMicroseconds.QuadPart *= 1000000; \
    Context->##Name##ElapsedMicroseconds.QuadPart /= (        \
        Context->Frequency.QuadPart                           \
    )

#define CONTEXT_SAVE_TIMERS_TO_HEADER(Name)                                    \
    Header->##Name##Cycles.QuadPart = Context->##Name##ElapsedCycles.QuadPart; \
    Header->##Name##Microseconds.QuadPart = (                                  \
        Context->##Name##ElapsedMicroseconds.QuadPart                          \
    )

CreatePerfectHashTableContext()

The creation routine is defined in CreatePerfectHashTableContext.c. Pseudo and full code samples follow.

Pseudo
Full

_Use_decl_annotations_
BOOLEAN
CreatePerfectHashTableContext(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PULONG MaximumConcurrencyPointer,
    PPERFECT_HASH_TABLE_CONTEXT *ContextPointer
    )
/*++

Routine Description:

    Creates and initializes a PERFECT_HASH_TABLE_CONTEXT structure, which
    creates a main threadpool with a maximum of threads indicated by the
    MaximumConcurrency parameter, if applicable.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for all memory allocations.

    AnyApi - Supplies a pointer to the active API structure in use.

    MaximumConcurrency - Optionally supplies a pointer to a variable that
        contains the desired maximum concurrency to be used for the underlying
        threadpool.  If NULL, or non-NULL but points to a value of 0, then the
        number of system processors will be used as a default value.

        N.B. This value is passed directly to SetThreadpoolThreadMinimum() and
             SetThreadpoolThreadMaximum().

    ContextPointer - Supplies the address of a variable that receives the
        address of the newly created PERFECT_HASH_TABLE_CONTEXT structure on
        success, NULL on failure.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{

    ValidateArguments();

    //
    // Calculate allocation size required by the structure.
    //

    CalculateAllocationSize(&AllocSize);

    //
    // Allocate it.
    //

    Context = AllocateContext(AllocSize);

    //
    // Create the random object names for our underlying events.
    //

    Success = Rtl->CreateRandomObjectNames(Rtl,
                                           Allocator,
                                           Allocator,
                                           NumberOfContextObjectPrefixes,
                                           64,
                                           NULL,
                                           Context->ObjectNamesPointerArray,
                                           Prefixes,
                                           &SizeOfNamesWideBuffer,
                                           &NamesWideBuffer);


    //
    // Create named events.
    //

    for (Index = 0; Index < NumberOfEvents; Index++, Event++, Name++) {

        //
        // We want all of our events to be manual reset, such that they stay
        // signaled even after they've satisfied a wait.
        //

        BOOLEAN ManualReset = TRUE;

        *Event = Rtl->CreateEventW(NULL,
                                   ManualReset,
                                   FALSE,
                                   Name->Buffer);

    }

    //
    // Create our threadpools: main work, file work, finished, error.
    //

    CreateMainWorkThreadpoolResources(Context);

    CreateFileWorkThreadpoolResources(Context);

    CreateFinishedWorkThreadpoolResources(Context);

    CreateErrorWorkThreadpoolResources(Context);

    *ContextPointer = Context;

    return TRUE;
}

_Use_decl_annotations_
BOOLEAN
CreatePerfectHashTableContext(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PULONG MaximumConcurrencyPointer,
    PPERFECT_HASH_TABLE_CONTEXT *ContextPointer
    )
/*++

Routine Description:

    Creates and initializes a PERFECT_HASH_TABLE_CONTEXT structure, which
    creates a main threadpool with a maximum of threads indicated by the
    MaximumConcurrency parameter, if applicable.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for all memory allocations.

    AnyApi - Supplies a pointer to the active API structure in use.

    MaximumConcurrency - Optionally supplies a pointer to a variable that
        contains the desired maximum concurrency to be used for the underlying
        threadpool.  If NULL, or non-NULL but points to a value of 0, then the
        number of system processors will be used as a default value.

        N.B. This value is passed directly to SetThreadpoolThreadMinimum() and
             SetThreadpoolThreadMaximum().

    ContextPointer - Supplies the address of a variable that receives the
        address of the newly created PERFECT_HASH_TABLE_CONTEXT structure on
        success, NULL on failure.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    BYTE Index;
    BYTE NumberOfEvents;
    ULONG LastError;
    PHANDLE Event;
    BOOLEAN Success;
    PBYTE Buffer;
    PBYTE ExpectedBuffer;
    ULONG MaximumConcurrency;
    ULONG NumberOfProcessors;
    ULONG SizeOfNamesWideBuffer;
    PWSTR NamesWideBuffer;
    PTP_POOL Threadpool;
    ULARGE_INTEGER AllocSize;
    ULARGE_INTEGER ObjectNameArraySize;
    ULARGE_INTEGER ObjectNamePointersArraySize;
    PUNICODE_STRING Name;
    PPUNICODE_STRING Names;
    PPUNICODE_STRING Prefixes;
    PPERFECT_HASH_TABLE_CONTEXT Context = NULL;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Rtl)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Allocator)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(ContextPointer)) {
        return FALSE;
    }

    //
    // Default the maximum concurrency to 0 if a NULL pointer was provided.
    //

    if (ARGUMENT_PRESENT(MaximumConcurrencyPointer)) {

        MaximumConcurrency = *MaximumConcurrencyPointer;

    } else {

        MaximumConcurrency = 0;

    }

    //
    // Argument validation complete.  Clear the caller's pointer up-front and
    // continue with creation.
    //

    *ContextPointer = NULL;

    //
    // Calculate the size required by the array of UNICODE_STRING structures
    // that trail the context, then the array of addresses to those structures.
    //

    ObjectNameArraySize.QuadPart = (
        (NumberOfContextObjectPrefixes * sizeof(UNICODE_STRING))
    );

    ObjectNamePointersArraySize.QuadPart = (
        (NumberOfContextObjectPrefixes * sizeof(PUNICODE_STRING))
    );

    ASSERT(!ObjectNameArraySize.HighPart);
    ASSERT(!ObjectNamePointersArraySize.HighPart);

    //
    // Calculate allocation size required by the structure.
    //

    AllocSize.QuadPart = (

        //
        // Account for the size of the context structure itself.
        //

        sizeof(PERFECT_HASH_TABLE_CONTEXT) +

        //
        // Account for the object name overhead.
        //

        ObjectNameArraySize.QuadPart +
        ObjectNamePointersArraySize.QuadPart

    );

    //
    // Sanity check we haven't overflowed.
    //

    ASSERT(!AllocSize.HighPart);

    //
    // Allocate space for the context.
    //

    Context = (PPERFECT_HASH_TABLE_CONTEXT)(
        Allocator->Calloc(Allocator->Context,
                          1,
                          AllocSize.LowPart)
    );

    if (!Context) {
        return FALSE;
    }

    //
    // Allocation was successful, continue with initialization.
    //

    Context->SizeOfStruct = sizeof(*Context);
    Context->Rtl = Rtl;
    Context->Allocator = Allocator;
    Context->Flags.AsULong = 0;

    //
    // The context structure will be trailed by an array of UNICODE_STRING
    // structures that will be filled in by Rtl->CreateRandomObjectNames().
    // This is then followed by an array of PUNICODE_STRING pointers, with
    // each one initialized to point to the corresponding offset into the
    // first array.  This is a bit quirky, but it's required by the Rtl
    // method (which allows for prefixes to be ignored for certain objects
    // if desired).
    //

    Buffer = (PBYTE)Context;
    Buffer += sizeof(*Context);
    Context->ObjectNames = (PUNICODE_STRING)Buffer;

    Buffer += ObjectNameArraySize.LowPart;
    Context->ObjectNamesPointerArray = (PPUNICODE_STRING)Buffer;

    Buffer += ObjectNamePointersArraySize.LowPart;

    //
    // If our pointer arithmetic was correct, Buffer should match the base
    // address of the context plus the total allocation size at this point.
    // Assert this invariant now.
    //

    ExpectedBuffer = RtlOffsetToPointer(Context, AllocSize.LowPart);
    ASSERT(Buffer == ExpectedBuffer);

    //
    // Wire up the pointer array to the object names.
    //

    Names = Context->ObjectNamesPointerArray;

    for (Index = 0; Index < NumberOfContextObjectPrefixes; Index++) {
        Names[Index] = Context->ObjectNames + Index;
    }

    //
    // Create the random object names for our underlying events.
    //

    Prefixes = (PPUNICODE_STRING)&ContextObjectPrefixes;

    Success = Rtl->CreateRandomObjectNames(Rtl,
                                           Allocator,
                                           Allocator,
                                           NumberOfContextObjectPrefixes,
                                           64,
                                           NULL,
                                           Context->ObjectNamesPointerArray,
                                           Prefixes,
                                           &SizeOfNamesWideBuffer,
                                           &NamesWideBuffer);

    if (!Success) {
        goto Error;
    }

    //
    // Wire up the wide buffer pointer and containing size.
    //

    Context->ObjectNamesWideBuffer = NamesWideBuffer;
    Context->SizeOfObjectNamesWideBuffer = SizeOfNamesWideBuffer;
    Context->NumberOfObjects = NumberOfContextObjectPrefixes;


    //
    // Initialize the event pointer to the first handle, and the name pointer
    // to the first UNICODE_STRING pointer.  Obtain the number of events.
    //

    Event = (PHANDLE)&Context->FirstEvent;
    Name = &Context->ObjectNames[0];
    NumberOfEvents = GetNumberOfContextEvents(Context);

    for (Index = 0; Index < NumberOfEvents; Index++, Event++, Name++) {

        //
        // We want all of our events to be manual reset, such that they stay
        // signaled even after they've satisfied a wait.
        //

        BOOLEAN ManualReset = TRUE;

        *Event = Rtl->CreateEventW(NULL,
                                   ManualReset,
                                   FALSE,
                                   Name->Buffer);

        LastError = GetLastError();

        if (!*Event || LastError == ERROR_ALREADY_EXISTS) {

            //
            // As the event names are random, a last error that indicates the
            // name already exists is evident of a pretty serious problem.
            // Treat this as a fatal.
            //

            goto Error;
        }
    }

    //
    // If the maximum concurrency is 0, default to the number of processors.
    //

    NumberOfProcessors = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);

    if (MaximumConcurrency == 0) {
        MaximumConcurrency = NumberOfProcessors;
    }

    Context->MinimumConcurrency = MaximumConcurrency;
    Context->MaximumConcurrency = MaximumConcurrency;

    //
    // Create the Main threadpool structures.  This threadpool creates a fixed
    // number of threads equal to the maximum concurrency specified by the user
    // (e.g. min threads is set to the same value as max threads).
    //

    Threadpool = Context->MainThreadpool = CreateThreadpool(NULL);
    if (!Threadpool) {
        goto Error;
    }

    if (!SetThreadpoolThreadMinimum(Threadpool, MaximumConcurrency)) {
        goto Error;
    }

    SetThreadpoolThreadMaximum(Threadpool, MaximumConcurrency);

    //
    // Initialize the Main threadpool environment, set its priority to
    // low, then associate it with the Main threadpool.
    //

    InitializeThreadpoolEnvironment(&Context->MainCallbackEnv);
    SetThreadpoolCallbackPriority(&Context->MainCallbackEnv,
                                  TP_CALLBACK_PRIORITY_LOW);
    SetThreadpoolCallbackPool(&Context->MainCallbackEnv,
                              Context->MainThreadpool);

    //
    // Create a cleanup group for the Main threadpool and register it.
    //

    Context->MainCleanupGroup = CreateThreadpoolCleanupGroup();
    if (!Context->MainCleanupGroup) {
        goto Error;
    }

    SetThreadpoolCallbackCleanupGroup(&Context->MainCallbackEnv,
                                      Context->MainCleanupGroup,
                                      CleanupCallback);

    //
    // Create a work object for the Main threadpool.
    //

    Context->MainWork = CreateThreadpoolWork(MainWorkCallback,
                                             Context,
                                             &Context->MainCallbackEnv);

    if (!Context->MainWork) {
        goto Error;
    }

    //
    // Initialize the main work list head.
    //

    InitializeSListHead(&Context->MainWorkListHead);

    //
    // Create the File threadpool structures.  We currently use the default
    // number of system threads for this threadpool.  We don't have to clamp
    // it at the moment as we don't bulk-enqueue work items (that can cause
    // the threadpool machinery to think more threads need to be created).
    //

    Threadpool = Context->FileThreadpool = CreateThreadpool(NULL);
    if (!Threadpool) {
        goto Error;
    }

    //
    // Initialize the File threadpool environment and associate it with the
    // File threadpool.
    //

    InitializeThreadpoolEnvironment(&Context->FileCallbackEnv);
    SetThreadpoolCallbackPool(&Context->FileCallbackEnv,
                              Context->FileThreadpool);

    //
    // Create a cleanup group for the File threadpool and register it.
    //

    Context->FileCleanupGroup = CreateThreadpoolCleanupGroup();
    if (!Context->FileCleanupGroup) {
        goto Error;
    }

    SetThreadpoolCallbackCleanupGroup(&Context->FileCallbackEnv,
                                      Context->FileCleanupGroup,
                                      CleanupCallback);

    //
    // Create a work object for the File threadpool.
    //

    Context->FileWork = CreateThreadpoolWork(FileWorkCallback,
                                             Context,
                                             &Context->FileCallbackEnv);

    if (!Context->FileWork) {
        goto Error;
    }

    //
    // Initialize the main work list head.
    //

    InitializeSListHead(&Context->FileWorkListHead);

    //
    // Create the Finished and Error threadpools and associated resources.
    // These are slightly easier as we only have 1 thread maximum for each
    // pool and no cleanup group is necessary.
    //

    Context->FinishedThreadpool = CreateThreadpool(NULL);
    if (!Context->FinishedThreadpool) {
        goto Error;
    }

    if (!SetThreadpoolThreadMinimum(Context->FinishedThreadpool, 1)) {
        goto Error;
    }

    SetThreadpoolThreadMaximum(Context->FinishedThreadpool, 1);

    Context->FinishedWork = CreateThreadpoolWork(FinishedWorkCallback,
                                                 Context,
                                                 &Context->FinishedCallbackEnv);
    if (!Context->FinishedWork) {
        goto Error;
    }

    //
    // Initialize the finished list head.
    //

    InitializeSListHead(&Context->FinishedWorkListHead);

    //
    // Create the Error threadpool.
    //

    Context->ErrorThreadpool = CreateThreadpool(NULL);
    if (!Context->ErrorThreadpool) {
        goto Error;
    }

    if (!SetThreadpoolThreadMinimum(Context->ErrorThreadpool, 1)) {
        goto Error;
    }

    SetThreadpoolThreadMaximum(Context->ErrorThreadpool, 1);

    Context->ErrorWork = CreateThreadpoolWork(ErrorWorkCallback,
                                              Context,
                                              &Context->ErrorCallbackEnv);
    if (!Context->ErrorWork) {
        goto Error;
    }


    //
    // We're done!  Jump to the end of the routine to finish up.
    //

    Success = TRUE;
    goto End;

Error:

    Success = FALSE;

    //
    // Call the destroy routine on the context if it is non-NULL.
    //

    if (Context) {

        if (!DestroyPerfectHashTableContext(&Context, NULL)) {

            //
            // There's nothing we can do here.
            //

            NOTHING;
        }

        //
        // N.B. DestroyPerfectHashTableContext() should clear the Context
        //      pointer.  Assert that invariant now.
        //

        ASSERT(Context == NULL);
    }

    //
    // Intentional follow-on to End.
    //

End:

    //
    // Update the caller's pointer and return.
    //
    // N.B. Context could be NULL here, which is fine.
    //

    *ContextPointer = Context;

    return Success;
}

The Rtl->CreateRandomObjectNames() function warrants a little extra explanation. Its role can best be appreciated when viewing the named event handles created for the context. The following screenshot was taken with Process Hacker 2, which has a convenient means for interacting with a process's named events:

Being able to select an event and set, reset or pulse it can be very useful during development and debugging, especially in a mulithreaded environment where the events will typically have one or more threads waiting on them at various points.

The CreateRandomObjectNames() routine, when given an array of object name prefixes, will create names suitable for passing directly to any of the NT routines that allow naming of resources (events, sections, etc). It appends a configurable amount of Base64-encoded random data to each name, ensuring uniqueness across the entire machine.

Key Loading

The key loader routine is one of our more simple routines. It memory maps the given .keys file such that it can be accessed as a linear 1D array of ULONG values by the creation routines down the track. It produces an instance of a PERFECT_HASH_TABLE_KEYS structure, for which an opaque pointer is returned to the caller.

PERFECT_HASH_TABLE_KEYS

//
// Cap the maximum key set size we're willing to process.
//

#define MAXIMUM_NUMBER_OF_KEYS 500000

//
// Define the PERFECT_HASH_TABLE_KEYS_FLAGS structure.
//

typedef union _PERFECT_HASH_TABLE_FLAGS_KEYS {
    struct _Struct_size_bytes_(sizeof(ULONG)) {

        //
        // Unused bits.
        //

        ULONG Unused:32;
    };

    LONG AsLong;
    ULONG AsULong;
} PERFECT_HASH_TABLE_KEYS_FLAGS;
C_ASSERT(sizeof(PERFECT_HASH_TABLE_KEYS_FLAGS) == sizeof(ULONG));
typedef PERFECT_HASH_TABLE_KEYS_FLAGS *PPERFECT_HASH_TABLE_KEYS_FLAGS;

//
// Define the PERFECT_HASH_TABLE_KEYS structure.
//

typedef struct _Struct_size_bytes_(SizeOfStruct) _PERFECT_HASH_TABLE_KEYS {

    //
    // Reserve a slot for a vtable.  Currently unused.
    //

    PPVOID Vtbl;

    //
    // Size of the structure, in bytes.
    //

    _Field_range_(==, sizeof(struct _PERFECT_HASH_TABLE_KEYS))
        ULONG SizeOfStruct;

    //
    // Flags.
    //

    PERFECT_HASH_TABLE_KEYS_FLAGS Flags;

    //
    // Pointer to an initialized RTL structure.
    //

    PRTL Rtl;

    //
    // Pointer to an initialized ALLOCATOR structure.
    //

    PALLOCATOR Allocator;

    //
    // Pointer to the API structure in use.
    //

    PPERFECT_HASH_TABLE_ANY_API AnyApi;

    //
    // Number of keys in the mapping.
    //

    ULARGE_INTEGER NumberOfElements;

    //
    // Handle to the underlying keys file.
    //

    HANDLE FileHandle;

    //
    // Handle to the memory mapping for the keys file.
    //

    HANDLE MappingHandle;

    //
    // Base address of the memory map.
    //

    union {
        PVOID BaseAddress;
        PULONG Keys;
    };

    //
    // Fully-qualified, NULL-terminated path of the source keys file.
    //

    UNICODE_STRING Path;

} PERFECT_HASH_TABLE_KEYS;
typedef PERFECT_HASH_TABLE_KEYS *PPERFECT_HASH_TABLE_KEYS;

LoadPerfectHashTableKeys

Pseudo
Full

_Use_decl_annotations_
BOOLEAN
LoadPerfectHashTableKeys(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PCUNICODE_STRING Path,
    PPERFECT_HASH_TABLE_KEYS *KeysPointer
    )
/*++

Routine Description:

    Loads a keys file from the file system and returns a PERFECT_HASH_TABLE_KEYS
    structure.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for al memory allocations.

    Path - Supplies a pointer to a UNICODE_STRING structure that represents
        a fully-qualified path of the keys to use for the perfect hash table.

        N.B. Path must be NULL-terminated, which is not normally required for
             UNICODE_STRING structures.  Howver, the underlying buffer is passed
             to CreateFileW(), which requires a NULL-terminated wstr.

    KeysPointers - Supplies the address of a variable that will receive the
        address of the newly created PERFECT_HASH_TABLE_KEYS structure if the
        routine is successful, or NULL if the routine fails.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{

    ValidateArguments();

    //
    // Open the file, create a file mapping, then map it into memory.
    //

    FileHandle = Rtl->CreateFileW(
        Path->Buffer,
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED,
        NULL
    );

    //
    // Verify file details such as size.
    //

    VerifyFileDetails();

    //
    // Create the file mapping.
    //

    MappingHandle = CreateFileMappingW(FileHandle,
                                       NULL,
                                       PAGE_READONLY,
                                       FileInfo.EndOfFile.HighPart,
                                       FileInfo.EndOfFile.LowPart,
                                       NULL);

    //
    // Successfully created a file mapping.  Now map it into memory.
    //

    BaseAddress = MapViewOfFile(MappingHandle,
                                FILE_MAP_READ,
                                0,
                                0,
                                FileInfo.EndOfFile.LowPart);

    //
    // The file has been mapped successfully.  Calculate the size required for
    // the keys structure and a copy of the Path's underlying unicode string
    // buffer.
    //

    CalculateAllocSize(&AllocSize);

    //
    // Allocate the structure.
    //

    Keys = AllocKeys(AllocSize);

    //
    // Initialize fields then duplicate the keys path.
    //

    InitializeFields(Keys);
    CopyKeysPath(Keys, Path);

    *KeysPointer = Keys;

    return Success;
}

_Use_decl_annotations_
BOOLEAN
LoadPerfectHashTableKeys(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PCUNICODE_STRING Path,
    PPERFECT_HASH_TABLE_KEYS *KeysPointer
    )
/*++

Routine Description:

    Loads a keys file from the file system and returns a PERFECT_HASH_TABLE_KEYS
    structure.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for al memory allocations.

    Path - Supplies a pointer to a UNICODE_STRING structure that represents
        a fully-qualified path of the keys to use for the perfect hash table.

        N.B. Path must be NULL-terminated, which is not normally required for
             UNICODE_STRING structures.  Howver, the underlying buffer is passed
             to CreateFileW(), which requires a NULL-terminated wstr.

    KeysPointers - Supplies the address of a variable that will receive the
        address of the newly created PERFECT_HASH_TABLE_KEYS structure if the
        routine is successful, or NULL if the routine fails.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    BOOLEAN Success;
    PVOID BaseAddress = NULL;
    HANDLE FileHandle = NULL;
    HANDLE MappingHandle = NULL;
    LARGE_INTEGER AllocSize;
    LARGE_INTEGER NumberOfElements;
    FILE_STANDARD_INFO FileInfo;
    PPERFECT_HASH_TABLE_KEYS Keys = NULL;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Rtl)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Allocator)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(KeysPointer)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Path)) {
        return FALSE;
    }

    if (!IsValidMinimumDirectoryNullTerminatedUnicodeString(Path)) {
        return FALSE;
    }

    //
    // Clear the caller's pointer up-front.
    //

    *KeysPointer = NULL;

    //
    // Open the file, create a file mapping, then map it into memory.
    //

    FileHandle = Rtl->CreateFileW(
        Path->Buffer,
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED,
        NULL
    );

    if (!FileHandle || FileHandle == INVALID_HANDLE_VALUE) {
        return FALSE;
    }

    //
    // N.B. Error handling should 'goto Error' from this point onward now that
    //      we have resources that may need to be cleaned up.
    //

    Success = GetFileInformationByHandleEx(
        FileHandle,
        (FILE_INFO_BY_HANDLE_CLASS)FileStandardInfo,
        &FileInfo,
        sizeof(FileInfo)
    );

    if (!Success) {
        goto Error;
    }

    //
    // Make sure the file is a multiple of our key size.
    //

    if (FileInfo.EndOfFile.QuadPart % 4ULL) {
        goto Error;
    }

    //
    // The number of elements in the key file can be ascertained by right
    // shifting by 2.
    //

    NumberOfElements.QuadPart = FileInfo.EndOfFile.QuadPart >> 2;

    //
    // Sanity check the number of elements.  There shouldn't be more than
    // MAX_ULONG.
    //

    ASSERT(!NumberOfElements.HighPart);

    //
    // Create the file mapping.
    //

    MappingHandle = CreateFileMappingW(FileHandle,
                                       NULL,
                                       PAGE_READONLY,
                                       FileInfo.EndOfFile.HighPart,
                                       FileInfo.EndOfFile.LowPart,
                                       NULL);

    if (!MappingHandle || MappingHandle == INVALID_HANDLE_VALUE) {
        goto Error;
    }

    //
    // Successfully created a file mapping.  Now map it into memory.
    //

    BaseAddress = MapViewOfFile(MappingHandle,
                                FILE_MAP_READ,
                                0,
                                0,
                                FileInfo.EndOfFile.LowPart);

    if (!BaseAddress) {
        goto Error;
    }

    //
    // The file has been mapped successfully.  Calculate the size required for
    // the keys structure and a copy of the Path's underlying unicode string
    // buffer.
    //

    AllocSize.QuadPart = (

        //
        // Account for the keys structure size.
        //

        sizeof(*Keys) +

        //
        // Account for the length of the UNICODE_STRING buffer and terminating
        // NULL.
        //

        Path->Length + sizeof(Path->Buffer[0])

    );

    //
    // Sanity check our size.
    //

    ASSERT(!AllocSize.HighPart);

    //
    // Proceed with allocation.
    //

    Keys = (PPERFECT_HASH_TABLE_KEYS)(
        Allocator->Calloc(Allocator->Context,
                          1,
                          AllocSize.LowPart)
    );

    if (!Keys) {
        goto Error;
    }

    //
    // Fill out the main structure details.
    //

    Keys->SizeOfStruct = sizeof(*Keys);
    Keys->Rtl = Rtl;
    Keys->Allocator = Allocator;
    Keys->FileHandle = FileHandle;
    Keys->MappingHandle = MappingHandle;
    Keys->BaseAddress = BaseAddress;
    Keys->NumberOfElements.QuadPart = NumberOfElements.QuadPart;

    //
    // Initialize the path length variables, point the buffer at the space after
    // our structure, then copy the string over and NULL-terminate it.
    //

    Keys->Path.Length = Path->Length;
    Keys->Path.MaximumLength = Path->Length + sizeof(Path->Buffer[0]);
    Keys->Path.Buffer = (PWSTR)RtlOffsetToPointer(Keys, sizeof(*Keys));
    CopyMemory(Keys->Path.Buffer, Path->Buffer, Path->Length);
    Keys->Path.Buffer[Path->Length >> 1] = L'\0';

    //
    // We've completed initialization, indicate success and jump to the end.
    //

    Success = TRUE;

    goto End;

Error:

    Success = FALSE;

    //
    // Clean up any resources we may have allocated.
    //

    if (BaseAddress) {
        UnmapViewOfFile(BaseAddress);
        BaseAddress = NULL;
    }

    if (MappingHandle) {
        CloseHandle(MappingHandle);
        MappingHandle = NULL;
    }

    if (FileHandle) {
        CloseHandle(FileHandle);
        FileHandle = NULL;
    }

    if (Keys) {
        Allocator->FreePointer(Allocator->Context, &Keys);
    }

    //
    // Intentional follow-on to End.
    //

End:

    //
    // Update the caller's pointer and return.
    //
    // N.B. Keys could be NULL here, which is fine.
    //

    *KeysPointer = Keys;

    return Success;
}

With a context and keys structure in hand, we're ready to try create a perfect hash table.

Perfect Hash Table Creation

Table creation, if successful, does not return an opaque pointer to a structure that can be subsequently interacted with by other routines — unlike the context creation and key loading routines we've examined up to now.

Instead, it writes out the .pht1 perfect hash table file to disk, such that it can be loaded (via the filename) in a subsequent call to the LoadPerfectHashTable() routine. Now, that being said, both the create and load routines use the same C structure for the table: PERFECT_HASH_TABLE. The public definition of this structure, as we saw earlier, was simply a wrapper around a vtbl pointer. The private definition in PerfectHashTablePrivate.h, still adheres to the vtbl interface, but extends it with all the additional data we need.

Let's take a look at the table structure first, then examine the creation routine.

`PERFECT_HASH_TABLE`

//
// Define the PERFECT_HASH_TABLE_FLAGS structure.
//

typedef union _PERFECT_HASH_TABLE_FLAGS {
    struct _Struct_size_bytes_(sizeof(ULONG)) {

        //
        // When set, indicates the table came from CreatePerfectHashTable().
        //
        // Invariant:
        //
        //  - If Created == TRUE:
        //      Assert Loaded == FALSE
        //

        ULONG Created:1;

        //
        // When set, indicates the table came from LoadPerfectHashTable().
        //
        // Invariant:
        //
        //  - If Loaded == TRUE:
        //      Assert Created == FALSE
        //

        ULONG Loaded:1;

        //
        // Unused bits.
        //

        ULONG Unused:30;
    };

    LONG AsLong;
    ULONG AsULong;
} PERFECT_HASH_TABLE_FLAGS;
C_ASSERT(sizeof(PERFECT_HASH_TABLE_FLAGS) == sizeof(ULONG));
typedef PERFECT_HASH_TABLE_FLAGS *PPERFECT_HASH_TABLE_FLAGS;

//
// Define the PERFECT_HASH_TABLE structure.
//

typedef struct _Struct_size_bytes_(SizeOfStruct) _PERFECT_HASH_TABLE {

    //
    // Our extended vtbl slot comes first, COM-style.
    //

    struct _PERFECT_HASH_TABLE_VTBL_EX *Vtbl;

    //
    // Size of the structure, in bytes.
    //

    _Field_range_(==, sizeof(struct _PERFECT_HASH_TABLE)) ULONG SizeOfStruct;

    //
    // Flags.
    //

    PERFECT_HASH_TABLE_FLAGS Flags;

    //
    // Generic singly-linked list entry.
    //

    SLIST_ENTRY ListEntry;

    //
    // Pointer to an initialized RTL structure.
    //

    PRTL Rtl;

    //
    // Pointer to an initialized ALLOCATOR structure.
    //

    PALLOCATOR Allocator;

    //
    // Reference count.  Affected by AddRef() and Release().
    //

    volatile ULONG ReferenceCount;

    //
    // Capture the number of elements in the underlying perfect hash table.
    // This refers to the number of vertices for the CHM algorithm, or can
    // mean the rounded-up power-of-2 size.  The masking implementations need
    // an agnostic way to access this value, which is why it is provided here
    // at the table level (despite being obtainable from things like the number
    // of vertices or Keys->NumberOfElements).
    //

    ULONG HashSize;
    ULONG IndexSize;

    //
    // Similarly, provide a convenient way to access the table "shift" amount
    // if a shifting mask routine is active.  This value is the number of
    // trailing zeros of the Size above for tables whose size is a power of 2.
    // It is not used if modulus masking is active.
    //

    ULONG HashShift;
    ULONG IndexShift;

    //
    // Mask.
    //

    ULONG HashMask;
    ULONG IndexMask;

    //
    // The following value represents how many times we need to XOR the high
    // part of a word with the low part of a word -- where word is being used
    // in the general computer word sense (i.e. not a 2-byte short) -- such
    // that the final value is within the bounds of the table mask.  It is
    // also the value of log2(Table->Shift).
    //

    ULONG HashFold;
    ULONG IndexFold;

    //
    // If modulus masking is active, this represents the modulus that will be
    // used for masking, e.g. Input %= Table->Modulus.
    //

    ULONG HashModulus;
    ULONG IndexModulus;

    //
    // If a caller provided the number of table elements as a parameter to the
    // CreatePerfectHashTable() function, that value will be captured here.  It
    // overrides the default sizing heuristics.  (If non-zero, it will be at
    // least equal to or greater than the number of keys.)
    //

    ULARGE_INTEGER RequestedNumberOfTableElements;

    //
    // The algorithm in use.
    //

    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId;

    //
    // The masking type in use.
    //

    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;

    //
    // The hash function in use.
    //

    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId;

    //
    // Pointer to the keys corresponding to this perfect hash table.  May be
    // NULL.
    //

    PPERFECT_HASH_TABLE_KEYS Keys;

    //
    // Pointer to the PERFECT_HASH_TABLE_CONTEXT structure in use.
    //

    PPERFECT_HASH_TABLE_CONTEXT Context;

    //
    // Handle to the backing file.
    //

    HANDLE FileHandle;

    //
    // Handle to the memory mapping for the backing file.
    //

    HANDLE MappingHandle;

    //
    // Base address of the memory map for the backing file.
    //

    union {
        PVOID BaseAddress;
        PULONG Data;
    };

    //
    // Fully-qualified, NULL-terminated path of the backing file.  The path is
    // automatically derived from the keys file.
    //

    UNICODE_STRING Path;

    //
    // Capture the mapping size of the underlying array, which will be aligned
    // up to a system allocation granularity.
    //

    ULARGE_INTEGER MappingSizeInBytes;

    //
    // Handle to the info stream backing file.
    //

    HANDLE InfoStreamFileHandle;

    //
    // Handle to the memory mapping for the backing file.
    //

    HANDLE InfoStreamMappingHandle;

    //
    // Base address of the memory map for the :Info stream.
    //

    union {
        PVOID InfoStreamBaseAddress;
        struct _TABLE_INFO_ON_DISK_HEADER *Header;
    };

    //
    // Fully-qualified, NULL-terminated path of the :Info stream associated with
    // the path above.
    //

    UNICODE_STRING InfoStreamPath;

    //
    // Capture the mapping size and actual structure size for the :Info stream.
    //

    ULARGE_INTEGER InfoMappingSizeInBytes;
    ULARGE_INTEGER InfoActualStructureSizeInBytes;

    //
    // If a table is loaded successfully, an array will be allocated for storing
    // values (as part of the Insert()/Lookup() API), the base address for which
    // is captured by the next field.
    //

    union {
        PVOID ValuesBaseAddress;
        PULONG Values;
    };

    //
    // During creation, a large bitmap is created to cover the entire range of
    // possible ULONG keys.  This is used to ensure no duplicate keys appear in
    // the input key set, and also assists in debugging.
    //

    RTL_BITMAP KeysBitmap;

} PERFECT_HASH_TABLE;
typedef PERFECT_HASH_TABLE *PPERFECT_HASH_TABLE;

`CreatePerfectHashTable()`

The creation routine is the first point where we dispatch to algorithm-specific implementation routines.

Full


BOOLEAN
CreatePerfectHashTable(
    PRTL Rtl,
    PALLOCATOR Allocator,
    PPERFECT_HASH_TABLE_CONTEXT Context,
    PERFECT_HASH_TABLE_ALGORITHM_ID AlgorithmId,
    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId,
    PERFECT_HASH_TABLE_HASH_FUNCTION_ID HashFunctionId,
    PULARGE_INTEGER NumberOfTableElementsPointer,
    PPERFECT_HASH_TABLE_KEYS Keys,
    PCUNICODE_STRING HashTablePath
    )
/*++

Routine Description:

    Creates and initializes a PERFECT_HASH_TABLE structure from a given set
    of keys, using the requested algorithm.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    Allocator - Supplies a pointer to an initialized ALLOCATOR structure that
        will be used for all memory allocations.

    Context - Supplies a pointer to an initialized PERFECT_HASH_TABLE_CONTEXT
        structure that can be used by the underlying algorithm in order to
        search for perfect hash solutions in parallel.

    AlgorithmId - Supplies the algorithm to use.

    MaskFunctionId - Supplies the type of masking to use.  The algorithm and hash
        function must both support the requested masking type.

    HashFunctionId - Supplies the hash function to use.

    NumberOfTableElementsPointer - Optionally supplies a pointer to a
        ULARGE_INTEGER structure that, if non-zero, indicates the number of
        elements to assume when sizing the hash table.  If a non-NULL pointer
        is supplied, it will receive the final number of elements in the table
        if a solution could be found.

        N.B. If this value is non-zero, it must be equal to or greater than
             the number of keys indicated by the Keys parameter.  (It should be
             at least 2.09 times the number of keys; the higher the value, the
             larger the hash table, the faster a perfect hash solution will be
             found.)

    Keys - Supplies a pointer to a PERFECT_HASH_TABLE_KEYS structure obtained
        from LoadPerfectHashTableKeys().

    HashTablePath - Optionally supplies a pointer to a UNICODE_STRING structure
        that represents the fully-qualified, NULL-terminated path of the backing
        file used to save the hash table.  If NULL, the file name of the keys
        file will be used, with ".pht1" appended to it.

Return Value:

    TRUE on success, FALSE on failure.

    If TRUE, the table will be persisted at the path described by for the
    HashTablePath parameter above.  This can be subsequently interacted with
    once loaded via LoadPerfectHashTable().

--*/
{
    BOOLEAN Success;
    BOOLEAN UsingKeysPath;
    PWSTR Dest;
    PWSTR Source;
    PBYTE Buffer;
    PWSTR PathBuffer;
    ULONG Key;
    ULONG Index;
    ULONG Result;
    ULONGLONG Bit;
    ULONG ShareMode;
    ULONG LastError;
    ULONG NumberOfKeys;
    ULONG DesiredAccess;
    ULONG NumberOfSetBits;
    ULONG InfoMappingSize;
    ULONG FlagsAndAttributes;
    PLONGLONG BitmapBuffer;
    USHORT VtblExSize;
    PVOID BaseAddress;
    HANDLE FileHandle;
    HANDLE MappingHandle;
    HANDLE ProcessHandle;
    SYSTEM_INFO SystemInfo;
    ULARGE_INTEGER AllocSize;
    ULONG_INTEGER PathBufferSize;
    ULONGLONG KeysBitmapBufferSize;
    ULONG IncomingPathBufferSizeInBytes;
    ULONG_INTEGER AlignedPathBufferSize;
    ULONG_INTEGER InfoStreamPathBufferSize;
    ULONG_INTEGER AlignedInfoStreamPathBufferSize;
    ULARGE_INTEGER NumberOfTableElements;
    PPERFECT_HASH_TABLE_VTBL_EX Vtbl;
    PPERFECT_HASH_TABLE Table = NULL;
    UNICODE_STRING Suffix = RTL_CONSTANT_STRING(L".pht1");
    UNICODE_STRING InfoStreamSuffix = RTL_CONSTANT_STRING(L":Info");

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Rtl)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Allocator)) {
        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Context)) {

        return FALSE;

    } else if (Context->Table) {

        //
        // We don't support a context being used more than once at the moment.
        // If it has been used before, Context->Table will have a value, and
        // thus, we need to error out.
        //

        return FALSE;
    }

    if (!ARGUMENT_PRESENT(Keys)) {

        return FALSE;

    } else {

        //
        // Ensure the number of keys is within our maximum tested limit.
        //

        if (Keys->NumberOfElements.QuadPart > MAXIMUM_NUMBER_OF_KEYS) {

            return FALSE;

        }
    }

    if (!IsValidPerfectHashTableMaskFunctionId(MaskFunctionId)) {
        return FALSE;
    }

    if (ARGUMENT_PRESENT(NumberOfTableElementsPointer)) {

        //
        // If the number of table elements is non-zero, verify it is greater
        // than or equal to the number of keys present.
        //

        NumberOfTableElements.QuadPart = (
            NumberOfTableElementsPointer->QuadPart
        );

        if (NumberOfTableElements.QuadPart > 0) {

            if (NumberOfTableElements.QuadPart <
                Keys->NumberOfElements.QuadPart) {

                //
                // Requested table size is too small, abort.
                //

                return FALSE;
            }
        }
    } else {

        NumberOfTableElements.QuadPart = 0;

    }

    if (ARGUMENT_PRESENT(HashTablePath) &&
        !IsValidMinimumDirectoryNullTerminatedUnicodeString(HashTablePath)) {

        return FALSE;
    }

    if (!IsValidPerfectHashTableAlgorithmId(AlgorithmId)) {
        return FALSE;
    }

    if (!IsValidPerfectHashTableHashFunctionId(HashFunctionId)) {
        return FALSE;
    }

    //
    // Calculate the allocation size required for the structure, including the
    // memory required to take a copy of the backing file name path.
    //

    AllocSize.QuadPart = sizeof(*Table);

    if (ARGUMENT_PRESENT(HashTablePath)) {

        //
        // Use the length of the caller-provided path, plus a trailing NULL.
        //

        PathBufferSize.LongPart = (
            HashTablePath->Length +
            sizeof(HashTablePath->Buffer[0])
        );

        UsingKeysPath = FALSE;
        IncomingPathBufferSizeInBytes = HashTablePath->Length;
        PathBuffer = HashTablePath->Buffer;

    } else {

        //
        // No path has been provided by the caller, so we'll use the path of
        // the keys file with ".pht1" appended.  Perform a quick invariant check
        // first: maximum length should be 1 character (2 bytes) larger than
        // length.  (This is handled in LoadPerfectHashTableKeys().)
        //

        ASSERT(Keys->Path.Length + sizeof(Keys->Path.Buffer[0]) ==
               Keys->Path.MaximumLength);

        PathBufferSize.LongPart = (Keys->Path.MaximumLength + Suffix.Length);

        UsingKeysPath = TRUE;
        IncomingPathBufferSizeInBytes = Keys->Path.Length;
        PathBuffer = Keys->Path.Buffer;
    }

    //
    // Align the path buffer up to a 16 byte boundary.
    //

    AlignedPathBufferSize.LongPart = ALIGN_UP(PathBufferSize.LongPart, 16);

    //
    // Sanity check we haven't overflowed MAX_USHORT for the path buffer size.
    //

    ASSERT(!AlignedPathBufferSize.HighPart);

    //
    // Add the path buffer size to the structure allocation size, then check
    // we haven't overflowed MAX_ULONG.
    //

    AllocSize.QuadPart += AlignedPathBufferSize.LowPart;

    ASSERT(!AllocSize.HighPart);

    //
    // Calculate the size required by the :Info stream that will be created
    // for the on-disk metadata.  We derive this by adding the length of the
    // path to the length of the :Info suffix, plus an additional trailing NULL.
    //

    InfoStreamPathBufferSize.LongPart = (
        PathBufferSize.LowPart +
        InfoStreamSuffix.Length +
        sizeof(InfoStreamSuffix.Buffer[0])
    );

    //
    // Align the size up to a 16 byte boundary.
    //

    AlignedInfoStreamPathBufferSize.LongPart = (
        ALIGN_UP(
            InfoStreamPathBufferSize.LowPart,
            16
        )
    );

    //
    // Sanity check we haved overflowed MAX_USHORT.
    //

    ASSERT(!AlignedInfoStreamPathBufferSize.HighPart);

    //
    // Add the stream path size to the total size and perform a final overflow
    // check.
    //

    AllocSize.QuadPart += AlignedInfoStreamPathBufferSize.LowPart;

    ASSERT(!AllocSize.HighPart);

    //
    // Account for the vtbl interface size.
    //

    VtblExSize = GetVtblExSizeRoutines[AlgorithmId]();
    AllocSize.QuadPart += VtblExSize;

    ASSERT(!AllocSize.HighPart);

    //
    // Allocate space for the structure and backing paths.
    //

    Table = (PPERFECT_HASH_TABLE)(
        Allocator->Calloc(Allocator->Context,
                          1,
                          AllocSize.LowPart)
    );

    if (!Table) {
        return FALSE;
    }

    //
    // Allocation was successful, continue with initialization.
    //

    Table->SizeOfStruct = sizeof(*Table);
    Table->Rtl = Rtl;
    Table->Allocator = Allocator;
    Table->Flags.AsULong = 0;
    Table->Keys = Keys;
    Table->Context = Context;
    Context->Table = Table;

    //
    // Our main enumeration IDs get replicated in both structures.
    //

    Table->AlgorithmId = Context->AlgorithmId = AlgorithmId;
    Table->MaskFunctionId = Context->MaskFunctionId = MaskFunctionId;
    Table->HashFunctionId = Context->HashFunctionId = HashFunctionId;

    //
    // Carve out the backing memory structures for the unicode string buffers
    // for the path names.
    //

    Buffer = RtlOffsetToPointer(Table, sizeof(*Table));
    Table->Path.Buffer = (PWSTR)Buffer;
    CopyMemory(Table->Path.Buffer, PathBuffer, IncomingPathBufferSizeInBytes);

    if (!UsingKeysPath) {

        //
        // Inherit the lengths provided by the input parameter string.
        //

        Table->Path.Length = HashTablePath->Length;
        Table->Path.MaximumLength = HashTablePath->MaximumLength;

    } else {

        //
        // Replace the ".keys" suffix with ".pht1".
        //

        Dest = Table->Path.Buffer;
        Dest += (Keys->Path.MaximumLength >> 1) - 5ULL;

        ASSERT(*Dest == L'k');
        ASSERT(*(Dest - 1) == L'.');

        Source = Suffix.Buffer;

        while (*Source) {
            *Dest++ = *Source++;
        }

        *Dest = L'\0';

        //
        // We can use the Keys->Path lengths directly.
        //

        Table->Path.Length = Keys->Path.Length;
        Table->Path.MaximumLength = Keys->Path.MaximumLength;
    }

    //
    // Advance past the aligned path buffer size such that we're positioned at
    // the start of the info stream buffer.
    //

    Buffer += AlignedPathBufferSize.LongPart;
    Table->InfoStreamPath.Buffer = (PWSTR)Buffer;
    Table->InfoStreamPath.MaximumLength = InfoStreamPathBufferSize.LowPart;
    Table->InfoStreamPath.Length = (
        Table->InfoStreamPath.MaximumLength -
        sizeof(Table->InfoStreamPath.Buffer[0])
    );

    //
    // Advance the buffer to the vtbl interface area and initialize it.
    //

    Buffer += AlignedInfoStreamPathBufferSize.LongPart;
    Vtbl = (PPERFECT_HASH_TABLE_VTBL_EX)Buffer;
    InitializeExtendedVtbl(Table, Vtbl);

    //
    // Copy the full path into the info stream buffer.
    //

    CopyMemory(Table->InfoStreamPath.Buffer,
               Table->Path.Buffer,
               Table->Path.Length);

    Dest = Table->InfoStreamPath.Buffer;
    Dest += (Table->Path.Length >> 1);
    ASSERT(*Dest == L'\0');

    //
    // Copy the :Info suffix over.
    //

    Source = InfoStreamSuffix.Buffer;

    while (*Source) {
        *Dest++ = *Source++;
    }

    *Dest = L'\0';

    //
    // We've finished initializing our two unicode string buffers for the
    // backing file and it's :Info counterpart.  Now, let's open file handles
    // to them.
    //

    //
    // Open the file handle for the backing hash table store.
    //

    ShareMode = (
        FILE_SHARE_READ  |
        FILE_SHARE_WRITE |
        FILE_SHARE_DELETE
    );

    DesiredAccess = (
        GENERIC_READ |
        GENERIC_WRITE
    );

    FlagsAndAttributes = FILE_FLAG_OVERLAPPED;

    FileHandle = Rtl->CreateFileW(Table->Path.Buffer,
                                  DesiredAccess,
                                  ShareMode,
                                  NULL,
                                  OPEN_ALWAYS,
                                  FlagsAndAttributes,
                                  NULL);

    LastError = GetLastError();

    Table->FileHandle = FileHandle;

    if (!FileHandle || FileHandle == INVALID_HANDLE_VALUE) {

        //
        // Failed to open the file successfully.
        //

        goto Error;

    } else if (LastError == ERROR_ALREADY_EXISTS) {

        //
        // The file was opened successfully, but it already existed.  Clear the
        // local last error variable then truncate the file.
        //

        LastError = ERROR_SUCCESS;

        Result = SetFilePointer(FileHandle, 0, NULL, FILE_BEGIN);
        if (Result == INVALID_SET_FILE_POINTER) {
            LastError = GetLastError();
            goto Error;
        }

        Success = SetEndOfFile(FileHandle);
        if (!Success) {
            LastError = GetLastError();
            goto Error;
        }

        //
        // We've successfully truncated the file.  The creation routine
        // implementation can now allocate the space required for it as part
        // of successful graph solving.
        //

    }

    //
    // The :Info stream is slightly different.  As it's a fixed size metadata
    // record, we can memory map an entire section up-front prior to calling
    // the algorithm implementation.  So, do that now.
    //

    FileHandle = Rtl->CreateFileW(Table->InfoStreamPath.Buffer,
                                  DesiredAccess,
                                  ShareMode,
                                  NULL,
                                  OPEN_ALWAYS,
                                  FlagsAndAttributes,
                                  NULL);

    Table->InfoStreamFileHandle = FileHandle;

    LastError = GetLastError();

    if (!FileHandle || FileHandle == INVALID_HANDLE_VALUE) {

        //
        // Failed to open the file successfully.
        //

        goto Error;

    } else if (LastError == ERROR_ALREADY_EXISTS) {

        //
        // The file was opened successfully, but it already existed.  Clear the
        // local last error variable then truncate the file.
        //

        LastError = ERROR_SUCCESS;

        Result = SetFilePointer(FileHandle, 0, NULL, FILE_BEGIN);
        if (Result == INVALID_SET_FILE_POINTER) {
            LastError = GetLastError();
            goto Error;
        }

        Success = SetEndOfFile(FileHandle);
        if (!Success) {
            LastError = GetLastError();
            goto Error;
        }

        //
        // We've successfully truncated the :Info file.
        //

    }

    //
    // Get the system allocation granularity, as we use this to govern the size
    // we request of the underlying file mapping.
    //

    GetSystemInfo(&SystemInfo);

    InfoMappingSize = SystemInfo.dwAllocationGranularity;
    ASSERT(InfoMappingSize >= PAGE_SIZE);

    //
    // Create a file mapping for the :Info stream.
    //

    MappingHandle = CreateFileMappingW(FileHandle,
                                       NULL,
                                       PAGE_READWRITE,
                                       0,
                                       InfoMappingSize,
                                       NULL);

    Table->InfoStreamMappingHandle = MappingHandle;
    Table->InfoMappingSizeInBytes.QuadPart = InfoMappingSize;

    if (!MappingHandle || MappingHandle == INVALID_HANDLE_VALUE) {
        goto Error;
    }

    //
    // We successfully created a file mapping.  Proceed with mapping it into
    // memory.
    //

    BaseAddress = MapViewOfFile(MappingHandle,
                                FILE_MAP_READ | FILE_MAP_WRITE,
                                0,
                                0,
                                InfoMappingSize);

    Table->InfoStreamBaseAddress = BaseAddress;

    if (!BaseAddress) {
        goto Error;
    }

    //
    // Set the number of table elements requested by the user (0 is a valid
    // value here).
    //

    Table->RequestedNumberOfTableElements.QuadPart = (
        NumberOfTableElements.QuadPart
    );

    //
    // Allocate a 512MB buffer for the keys bitmap.
    //

    KeysBitmapBufferSize = ((1ULL << 32ULL) >> 3ULL);

    //
    // Try a large page allocation for the bitmap buffer.
    //

    ProcessHandle = GetCurrentProcess();

    BaseAddress = Rtl->TryLargePageVirtualAllocEx(ProcessHandle,
                                                  NULL,
                                                  KeysBitmapBufferSize,
                                                  MEM_COMMIT,
                                                  PAGE_READWRITE);

    Table->KeysBitmap.Buffer = (PULONG)BaseAddress;

    if (!BaseAddress) {

        //
        // Failed to create a bitmap buffer, abort.
        //

        LastError = GetLastError();
        goto Error;
    }

    //
    // Initialize the keys bitmap.
    //

    Table->KeysBitmap.SizeOfBitMap = (ULONG)-1;
    BitmapBuffer = (PLONGLONG)Table->KeysBitmap.Buffer;

    ASSERT(!Keys->NumberOfElements.HighPart);

    NumberOfKeys = Keys->NumberOfElements.LowPart;

    //
    // Loop through all the keys, obtain the bitmap bit representation, verify
    // that the bit hasn't been set yet, and set it.
    //

    for (Index = 0; Index < NumberOfKeys; Index++) {

        Key = Keys->Keys[Index];
        Bit = Key + 1;

        ASSERT(!BitTestAndSet64(BitmapBuffer, Bit));

    }

    //
    // Count all bits set.  It should match the number of keys.
    //

    NumberOfSetBits = Rtl->RtlNumberOfSetBits(&Table->KeysBitmap);
    ASSERT(NumberOfSetBits == NumberOfKeys);

    //
    // Common initialization is complete, dispatch remaining work to the
    // algorithm's creation routine.
    //

    Success = CreationRoutines[AlgorithmId](Table);
    if (!Success) {
        goto Error;
    }

    //
    // Update the caller's number of table elements pointer, if applicable.
    //

    if (ARGUMENT_PRESENT(NumberOfTableElementsPointer)) {
        NumberOfTableElementsPointer->QuadPart = Table->HashSize;
    }

    //
    // We're done!  Set the reference count to 1 and finish up.
    //

    Table->ReferenceCount = 1;
    goto End;

Error:

    Success = FALSE;

    //
    // Intentional follow-on to End.
    //

End:

    //
    // Call the destroy routine on the table if one is present.
    //
    // N.B. We currently always delete the table if it is created successfully
    //      so as to ensure the only way to use a table is by loading one from
    //      disk via LoadPerfectHashTable().
    //

    if (Table) {

        if (!DestroyPerfectHashTable(&Table, NULL)) {

            //
            // There's nothing we can do here.
            //

            NOTHING;
        }

        //
        // N.B. DestroyPerfectHashTable() should clear the Table pointer.
        //      Assert that invariant now.
        //

        ASSERT(Table == NULL);
    }

    return Success;
}

Algorithm-Specific Implementation Details

This section covers implementation details specific to our implementation of the CHM algorithm.

CreatePerfectHashTableImplChm01

This is the implementation of our CHM algorithm's creation routine. It is responsible for: determining the appropriate memory requirements for a given graph used to try solve the perfect hash, allocating and initializing the appropriate data structures, submitting threadpool work for file preparation, parallel graph solving, and file-saving (once a solution has been found), and verifying the found solution is actually correct. Additionally, it has support for detecting table resize events, which will be signaled by a worker when it detects the number of attempts at solving a graph have exceeded a threshold. This results in the routine doubling the number of vertices used for the underlying graph and re-attempting to solve.

It makes extensive use of threadpools. It can be viewed as the main "parent" coordinator thread, dispatching relevant work to the file-work and main-work threadpools as required.

Full

_Use_decl_annotations_
BOOLEAN
CreatePerfectHashTableImplChm01(
    PPERFECT_HASH_TABLE Table
    )
/*++

Routine Description:

    Attempts to create a perfect hash table using the CHM algorithm and a
    2-part random hypergraph.

Arguments:

    Table - Supplies a pointer to a partially-initialized PERFECT_HASH_TABLE
        structure.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    PRTL Rtl;
    USHORT Index;
    PULONG Keys;
    PGRAPH Graph;
    PBYTE Buffer;
    BOOLEAN Success;
    USHORT PageSize;
    USHORT PageShift;
    ULONG_PTR LastPage;
    ULONG_PTR ThisPage;
    BYTE NumberOfEvents;
    PVOID BaseAddress = NULL;
    ULONG WaitResult;
    GRAPH_INFO Info;
    PBYTE Unusable;
    ULONG LastError;
    ULONG NumberOfKeys;
    BOOLEAN CaughtException;
    PALLOCATOR Allocator;
    HANDLE ProcessHandle;
    PHANDLE Event;
    USHORT NumberOfGraphs;
    USHORT NumberOfPagesPerGraph;
    USHORT NumberOfGuardPages;
    ULONG TotalNumberOfPages;
    USHORT NumberOfBitmaps;
    PGRAPH_DIMENSIONS Dim;
    PSLIST_ENTRY ListEntry;
    SYSTEM_INFO SystemInfo;
    FILE_WORK_ITEM SaveFile;
    FILE_WORK_ITEM PrepareFile;
    PGRAPH_INFO_ON_DISK OnDiskInfo;
    PTABLE_INFO_ON_DISK_HEADER OnDiskHeader;
    ULONGLONG Closest;
    ULONGLONG LastClosest;
    ULONGLONG NextSizeInBytes;
    ULONGLONG PrevSizeInBytes;
    ULONGLONG FirstSizeInBytes;
    ULONGLONG EdgesSizeInBytes;
    ULONGLONG ValuesSizeInBytes;
    ULONGLONG AssignedSizeInBytes;
    ULONGLONG TotalBufferSizeInBytes;
    ULONGLONG UsableBufferSizeInBytesPerBuffer;
    ULONGLONG ExpectedTotalBufferSizeInBytes;
    ULONGLONG ExpectedUsableBufferSizeInBytesPerBuffer;
    ULONGLONG GraphSizeInBytesIncludingGuardPage;
    PERFECT_HASH_TABLE_MASK_FUNCTION_ID MaskFunctionId;
    ULARGE_INTEGER AllocSize;
    ULARGE_INTEGER NumberOfEdges;
    ULARGE_INTEGER NumberOfVertices;
    ULARGE_INTEGER TotalNumberOfEdges;
    ULARGE_INTEGER DeletedEdgesBitmapBufferSizeInBytes;
    ULARGE_INTEGER VisitedVerticesBitmapBufferSizeInBytes;
    ULARGE_INTEGER AssignedBitmapBufferSizeInBytes;
    ULARGE_INTEGER IndexBitmapBufferSizeInBytes;
    PPERFECT_HASH_TABLE_CONTEXT Context = Table->Context;
    HANDLE Events[5];

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Table)) {
        return FALSE;
    }

    //
    // The following label is jumped to by code later in this routine when we
    // detect that we've exceeded a plausible number of attempts at finding a
    // graph solution with the given number of vertices, and have bumped up
    // the vertex count (by adjusting Table->RequestedNumberOfElements) and
    // want to try again.
    //

RetryWithLargerTableSize:

    //
    // Initialize aliases.
    //

    Rtl = Table->Rtl;
    Keys = (PULONG)Table->Keys->BaseAddress;;
    Allocator = Table->Allocator;
    Context = Table->Context;
    MaskFunctionId = Context->MaskFunctionId;

    //
    // If no threshold has been set, use the default.
    //

    if (!Context->ResizeTableThreshold) {
        Context->ResizeTableThreshold = GRAPH_SOLVING_ATTEMPTS_THRESHOLD;
        Context->ResizeLimit = GRAPH_SOLVING_RESIZE_TABLE_LIMIT;
    }

    //
    // Explicitly reset all events.  This ensures everything is back in the
    // starting state if we happen to be attempting to solve the graph after
    // a resize event.
    //

    Event = (PHANDLE)&Context->FirstEvent;
    NumberOfEvents = GetNumberOfContextEvents(Context);

    for (Index = 0; Index < NumberOfEvents; Index++, Event++) {

        ASSERT(*Event && *Event != INVALID_HANDLE_VALUE);
        ASSERT(ResetEvent(*Event));

    }

    //
    // The number of edges in our graph is equal to the number of keys in the
    // input data set if modulus masking is in use.  It will be rounded up to
    // a power of 2 otherwise.
    //

    NumberOfEdges.QuadPart = Table->Keys->NumberOfElements.QuadPart;

    //
    // Sanity check we're under MAX_ULONG.
    //

    ASSERT(!NumberOfEdges.HighPart);

    NumberOfKeys = NumberOfEdges.LowPart;

    //
    // Determine the number of vertices.  If a caller has requested a certain
    // table size, Table->RequestedNumberOfTableElements will be non-zero, and
    // takes precedence.
    //

    if (Table->RequestedNumberOfTableElements.QuadPart) {

        NumberOfVertices.QuadPart = (
            Table->RequestedNumberOfTableElements.QuadPart
        );

        if (IsModulusMasking(MaskFunctionId)) {

            //
            // Nothing more to do with modulus masking; we'll verify the number
            // of vertices below.
            //

            NOTHING;

        } else {

            //
            // For non-modulus masking, make sure the number of vertices are
            // rounded up to a power of 2.  The number of edges will be rounded
            // up to a power of 2 from the number of keys.
            //

            NumberOfVertices.QuadPart = (
                RoundUpPowerOf2(NumberOfVertices.LowPart)
            );

            NumberOfEdges.QuadPart = (
                RoundUpPowerOf2(NumberOfEdges.LowPart)
            );

        }

    } else {

        //
        // No table size was requested, so we need to determine how many
        // vertices to use heuristically.  The main factor is what type of
        // masking has been requested.  The chm.c implementation, which is
        // modulus based, uses a size multiplier (c) of 2.09, and calculates
        // the final size via ceil(nedges * (double)2.09).  We can avoid the
        // need for doubles and linking with a math library (to get ceil())
        // and just use ~2.25, which we can calculate by adding the result
        // of right shifting the number of edges by 1 to the result of left
        // shifting said edge count by 2 (simulating multiplication by 0.25).
        //
        // If we're dealing with modulus masking, this will be the exact number
        // of vertices used.  For other types of masking, we need the edges size
        // to be a power of 2, and the vertices size to be the next power of 2.
        //

        if (IsModulusMasking(MaskFunctionId)) {

            NumberOfVertices.QuadPart = NumberOfEdges.QuadPart << 1ULL;
            NumberOfVertices.QuadPart += NumberOfEdges.QuadPart >> 2ULL;

        } else {

            //
            // Round up the edges to a power of 2.
            //

            NumberOfEdges.QuadPart = RoundUpPowerOf2(NumberOfEdges.LowPart);

            //
            // Make sure we haven't overflowed.
            //

            ASSERT(!NumberOfEdges.HighPart);

            //
            // For the number of vertices, round the number of edges up to the
            // next power of 2.
            //

            NumberOfVertices.QuadPart = (
                RoundUpNextPowerOf2(NumberOfEdges.LowPart)
            );

        }
    }

    //
    // Another sanity check we haven't exceeded MAX_ULONG.
    //

    ASSERT(!NumberOfVertices.HighPart);

    //
    // The r-graph (r = 2) nature of this implementation results in various
    // arrays having twice the number of elements indicated by the edge count.
    // Capture this number now, as we need it in various size calculations.
    //

    TotalNumberOfEdges.QuadPart = NumberOfEdges.QuadPart;
    TotalNumberOfEdges.QuadPart <<= 1ULL;

    //
    // Another overflow sanity check.
    //

    ASSERT(!TotalNumberOfEdges.HighPart);

    //
    // Make sure vertices > edges.
    //

    ASSERT(NumberOfVertices.QuadPart > NumberOfEdges.QuadPart);

    //
    // Calculate the size required for the DeletedEdges bitmap buffer.  One
    // bit is used per TotalNumberOfEdges.  Convert the bits into bytes by
    // shifting right 3 (dividing by 8) then align it up to a 16 byte boundary.
    // We add 1 before shifting to account 1-based bitmaps vs 0-based indices.
    //

    DeletedEdgesBitmapBufferSizeInBytes.QuadPart = (
        ALIGN_UP(((ALIGN_UP(TotalNumberOfEdges.QuadPart + 1, 8)) >> 3), 16)
    );

    ASSERT(!DeletedEdgesBitmapBufferSizeInBytes.HighPart);

    //
    // Calculate the size required for the VisitedVertices bitmap buffer.  One
    // bit is used per NumberOfVertices.  Convert the bits into bytes by
    // shifting right 3 (dividing by 8) then align it up to a 16 byte boundary.
    // We add 1 before shifting to account 1-based bitmaps vs 0-based indices.
    //

    VisitedVerticesBitmapBufferSizeInBytes.QuadPart = (
        ALIGN_UP(((ALIGN_UP(NumberOfVertices.QuadPart + 1, 8)) >> 3), 16)
    );

    ASSERT(!VisitedVerticesBitmapBufferSizeInBytes.HighPart);

    //
    // Calculate the size required for the AssignedBitmap bitmap buffer.  One
    // bit is used per NumberOfVertices.  Convert the bits into bytes by shifting
    // right 3 (dividing by 8) then align it up to a 16 byte boundary.
    // We add 1 before shifting to account 1-based bitmaps vs 0-based indices.
    //

    AssignedBitmapBufferSizeInBytes.QuadPart = (
        ALIGN_UP(((ALIGN_UP(NumberOfVertices.QuadPart + 1, 8)) >> 3), 16)
    );

    ASSERT(!AssignedBitmapBufferSizeInBytes.HighPart);

    //
    // Calculate the size required for the IndexBitmap bitmap buffer.  One
    // bit is used per NumberOfVertices.  Convert the bits into bytes by shifting
    // right 3 (dividing by 8) then align it up to a 16 byte boundary.
    // We add 1 before shifting to account 1-based bitmaps vs 0-based indices.
    //

    IndexBitmapBufferSizeInBytes.QuadPart = (
        ALIGN_UP(((ALIGN_UP(NumberOfVertices.QuadPart + 1, 8)) >> 3), 16)
    );

    ASSERT(!IndexBitmapBufferSizeInBytes.HighPart);

    //
    // Calculate the sizes required for each of the arrays.  We collect them
    // into independent variables as it makes carving up the allocated buffer
    // easier down the track.
    //

    EdgesSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->Edges) * TotalNumberOfEdges.QuadPart)
    );

    NextSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->Next) * TotalNumberOfEdges.QuadPart)
    );

    FirstSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->First) * NumberOfVertices.QuadPart)
    );

    PrevSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->Prev) * TotalNumberOfEdges.QuadPart)
    );

    AssignedSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->Assigned) * NumberOfVertices.QuadPart)
    );

    //
    // Calculate the size required for the values array.  This is used as part
    // of verification, where we essentially do Insert(Key, Key) in combination
    // with bitmap tracking of assigned indices, which allows us to detect if
    // there are any colliding indices, and if so, what was the previous key
    // that mapped to the same index.
    //

    ValuesSizeInBytes = (
        ALIGN_UP_POINTER(sizeof(*Graph->Values) * NumberOfVertices.QuadPart)
    );

    //
    // Calculate the total size required for the underlying graph, such that
    // we can allocate memory via a single call to the allocator.
    //

    AllocSize.QuadPart = ALIGN_UP_POINTER(

        //
        // Account for the size of the graph structure.
        //

        sizeof(GRAPH) +

        //
        // Account for the size of the Graph->Edges array, which is double
        // sized.
        //

        EdgesSizeInBytes +

        //
        // Account for the size of the Graph->Next array; also double sized.
        //

        NextSizeInBytes +

        //
        // Account for the size of the Graph->First array.  This is sized
        // proportional to the number of vertices.
        //

        FirstSizeInBytes +

        //
        // Account for the size of the Graph->Prev array, also double sized.
        //

        PrevSizeInBytes +

        //
        // Account for Graph->Assigned array of vertices.
        //

        AssignedSizeInBytes +

        //
        // Account for the Table->Values array of values for the perfect hash
        // table, indexed via the result of the table's Index() method.
        //

        ValuesSizeInBytes +

        //
        // Account for the size of the bitmap buffer for Graph->DeletedEdges.
        //

        DeletedEdgesBitmapBufferSizeInBytes.QuadPart +

        //
        // Account for the size of the bitmap buffer for Graph->VisitedVertices.
        //

        VisitedVerticesBitmapBufferSizeInBytes.QuadPart +

        //
        // Account for the size of the bitmap buffer for Graph->AssignedBitmap.
        //

        AssignedBitmapBufferSizeInBytes.QuadPart +

        //
        // Account for the size of the bitmap buffer for Graph->IndexBitmap.
        //

        IndexBitmapBufferSizeInBytes.QuadPart +

        //
        // Keep a dummy 0 at the end such that the last item above can use an
        // addition sign at the end of it, which minimizes the diff churn when
        // adding a new size element.
        //

        0

    );

    //
    // Capture the number of bitmaps here, where it's close to the lines above
    // that indicate how many bitmaps we're dealing with.  The number of bitmaps
    // accounted for above should match this number.  Visually confirm this any
    // time a new bitmap buffer is accounted for.
    //
    // N.B. We ASSERT() in InitializeGraph() if we detect a mismatch between
    //      Info->NumberOfBitmaps and a local counter incremented each time
    //      we initialize a bitmap.
    //

    NumberOfBitmaps = 4;

    //
    // Sanity check the size hasn't overflowed.
    //

    ASSERT(!AllocSize.HighPart);

    //
    // Calculate the number of pages required by each graph, then extrapolate
    // the number of guard pages and total number of pages.  We currently use
    // 4KB for the page size (i.e. we're not using large pages).
    //

    PageSize = PAGE_SIZE;
    PageShift = (USHORT)TrailingZeros(PageSize);
    NumberOfGraphs = (USHORT)Context->MaximumConcurrency;
    NumberOfPagesPerGraph = ROUND_TO_PAGES(AllocSize.LowPart) >> PageShift;
    NumberOfGuardPages = (USHORT)Context->MaximumConcurrency;
    TotalNumberOfPages = (
        (NumberOfGraphs * NumberOfPagesPerGraph) +
        NumberOfGuardPages
    );
    GraphSizeInBytesIncludingGuardPage = (
        (ULONGLONG)PageSize +
        ((ULONGLONG)NumberOfPagesPerGraph * (ULONGLONG)PageSize)
    );

    //
    // Create multiple buffers separated by guard pages using a single call
    // to VirtualAllocEx().
    //

    ProcessHandle = NULL;

    Success = Rtl->CreateMultipleBuffers(Rtl,
                                         &ProcessHandle,
                                         PageSize,
                                         NumberOfGraphs,
                                         NumberOfPagesPerGraph,
                                         NULL,
                                         NULL,
                                         &UsableBufferSizeInBytesPerBuffer,
                                         &TotalBufferSizeInBytes,
                                         &BaseAddress);

    if (!Success) {
        LastError = GetLastError();
        return FALSE;
    }

    //
    // N.B. Subsequent errors must 'goto Error' at this point to ensure our
    //      cleanup logic kicks in.
    //

    //
    // Assert the sizes returned by the buffer allocation match what we're
    // expecting.
    //

    ExpectedTotalBufferSizeInBytes = (
        (ULONGLONG)TotalNumberOfPages *
        (ULONGLONG)PageSize
    );

    ExpectedUsableBufferSizeInBytesPerBuffer = (
        (ULONGLONG)NumberOfPagesPerGraph *
        (ULONGLONG)PageSize
    );

    ASSERT(TotalBufferSizeInBytes == ExpectedTotalBufferSizeInBytes);
    ASSERT(UsableBufferSizeInBytesPerBuffer ==
           ExpectedUsableBufferSizeInBytesPerBuffer);

    //
    // Initialize the GRAPH_INFO structure with all the sizes captured earlier.
    // (We zero it first just to ensure any of the padding fields are cleared.)
    //

    ZeroStruct(Info);

    Info.PageSize = PageSize;
    Info.AllocSize = AllocSize.QuadPart;
    Info.Context = Context;
    Info.BaseAddress = BaseAddress;
    Info.NumberOfPagesPerGraph = NumberOfPagesPerGraph;
    Info.NumberOfGraphs = NumberOfGraphs;
    Info.NumberOfBitmaps = NumberOfBitmaps;
    Info.SizeOfGraphStruct = sizeof(GRAPH);
    Info.EdgesSizeInBytes = EdgesSizeInBytes;
    Info.NextSizeInBytes = NextSizeInBytes;
    Info.FirstSizeInBytes = FirstSizeInBytes;
    Info.PrevSizeInBytes = PrevSizeInBytes;
    Info.AssignedSizeInBytes = AssignedSizeInBytes;
    Info.ValuesSizeInBytes = ValuesSizeInBytes;
    Info.AllocSize = AllocSize.QuadPart;
    Info.FinalSize = UsableBufferSizeInBytesPerBuffer;

    Info.DeletedEdgesBitmapBufferSizeInBytes = (
        DeletedEdgesBitmapBufferSizeInBytes.QuadPart
    );

    Info.VisitedVerticesBitmapBufferSizeInBytes = (
        VisitedVerticesBitmapBufferSizeInBytes.QuadPart
    );

    Info.AssignedBitmapBufferSizeInBytes = (
        AssignedBitmapBufferSizeInBytes.QuadPart
    );

    Info.IndexBitmapBufferSizeInBytes = (
        IndexBitmapBufferSizeInBytes.QuadPart
    );

    //
    // Capture the system allocation granularity.  This is used to align the
    // backing memory maps used for the table array.
    //

    GetSystemInfo(&SystemInfo);
    Info.AllocationGranularity = SystemInfo.dwAllocationGranularity;

    //
    // Copy the dimensions over.
    //

    Dim = &Info.Dimensions;
    Dim->NumberOfEdges = NumberOfEdges.LowPart;
    Dim->TotalNumberOfEdges = TotalNumberOfEdges.LowPart;
    Dim->NumberOfVertices = NumberOfVertices.LowPart;

    Dim->NumberOfEdgesPowerOf2Exponent = (BYTE)(
        TrailingZeros64(RoundUpPowerOf2(NumberOfEdges.LowPart))
    );

    Dim->NumberOfEdgesNextPowerOf2Exponent = (BYTE)(
        TrailingZeros64(RoundUpNextPowerOf2(NumberOfEdges.LowPart))
    );

    Dim->NumberOfVerticesPowerOf2Exponent = (BYTE)(
        TrailingZeros64(RoundUpPowerOf2(NumberOfVertices.LowPart))
    );

    Dim->NumberOfVerticesNextPowerOf2Exponent = (BYTE)(
        TrailingZeros64(RoundUpNextPowerOf2(NumberOfVertices.LowPart))
    );

    //
    // If non-modulus masking is active, initialize the edge and vertex masks.
    //

    if (!IsModulusMasking(MaskFunctionId)) {

        Info.EdgeMask = NumberOfEdges.LowPart - 1;
        Info.VertexMask = NumberOfVertices.LowPart - 1;

        //
        // Sanity check our masks are correct: their popcnts should match the
        // exponent value identified above whilst filling out the dimensions
        // structure.
        //

        ASSERT(_mm_popcnt_u32(Info.EdgeMask) ==
               Dim->NumberOfEdgesPowerOf2Exponent);

        ASSERT(_mm_popcnt_u32(Info.VertexMask) ==
               Dim->NumberOfVerticesPowerOf2Exponent);

    }

    //
    // Set the Modulus, Size, Shift, Mask and Fold fields of the table, such
    // that the Hash and Mask vtbl functions operate correctly.
    //
    // N.B. Shift, Mask and Fold are meaningless for modulus masking.
    //
    // N.B. If you change these fields, you'll probably need to change something
    //      in LoadPerfectHashTableImplChm01() too.
    //

    Table->HashModulus = NumberOfVertices.LowPart;
    Table->IndexModulus = NumberOfEdges.LowPart;
    Table->HashSize = NumberOfVertices.LowPart;
    Table->IndexSize = NumberOfEdges.LowPart;
    Table->HashShift = TrailingZeros(Table->HashSize);
    Table->IndexShift = TrailingZeros(Table->IndexSize);
    Table->HashMask = (Table->HashSize - 1);
    Table->IndexMask = (Table->IndexSize - 1);
    Table->HashFold = Table->HashShift >> 3;
    Table->IndexFold = Table->IndexShift >> 3;

    //
    // If auto folding has been requested, set the appropriate function ID based
    // on the table fold and update the vtbl.
    //

    if (MaskFunctionId == PerfectHashTableFoldAutoMaskFunctionId) {

        ASSERT(Table->HashFold >= 0 && Table->HashFold <= 4);

        switch (Table->HashFold) {
            case 4:
            case 3:
            case 2:
                MaskFunctionId = PerfectHashTableFoldOnceMaskFunctionId;
                break;

            case 1:
                MaskFunctionId = PerfectHashTableFoldTwiceMaskFunctionId;
                break;

            case 0:
                MaskFunctionId = PerfectHashTableFoldThriceMaskFunctionId;
                break;
        }

        Table->MaskFunctionId = MaskFunctionId;
        Context->MaskFunctionId = MaskFunctionId;
        Table->Vtbl->MaskHash = MaskHashRoutines[MaskFunctionId];
        Table->Vtbl->MaskIndex = MaskIndexRoutines[MaskFunctionId];
    }

    //
    // Save the on-disk representation of the graph information.  This is a
    // smaller subset of data needed in order to load a previously-solved
    // graph as a perfect hash table.  The data resides in an NTFS stream named
    // :Info off the main perfect hash table file.  It will have been mapped for
    // us already at Table->InfoStreamBaseAddress.
    //

    OnDiskInfo = (PGRAPH_INFO_ON_DISK)Table->InfoStreamBaseAddress;
    ASSERT(OnDiskInfo);
    OnDiskHeader = &OnDiskInfo->Header;
    OnDiskHeader->Magic.LowPart = TABLE_INFO_ON_DISK_MAGIC_LOWPART;
    OnDiskHeader->Magic.HighPart = TABLE_INFO_ON_DISK_MAGIC_HIGHPART;
    OnDiskHeader->SizeOfStruct = sizeof(*OnDiskInfo);
    OnDiskHeader->Flags.AsULong = 0;
    OnDiskHeader->Concurrency = Context->MaximumConcurrency;
    OnDiskHeader->AlgorithmId = Context->AlgorithmId;
    OnDiskHeader->MaskFunctionId = Context->MaskFunctionId;
    OnDiskHeader->HashFunctionId = Context->HashFunctionId;
    OnDiskHeader->KeySizeInBytes = sizeof(ULONG);
    OnDiskHeader->HashSize = Table->HashSize;
    OnDiskHeader->IndexSize = Table->IndexSize;
    OnDiskHeader->HashShift = Table->HashShift;
    OnDiskHeader->IndexShift = Table->IndexShift;
    OnDiskHeader->HashMask = Table->HashMask;
    OnDiskHeader->IndexMask = Table->IndexMask;
    OnDiskHeader->HashFold = Table->HashFold;
    OnDiskHeader->IndexFold = Table->IndexFold;
    OnDiskHeader->HashModulus = Table->HashModulus;
    OnDiskHeader->IndexModulus = Table->IndexModulus;
    OnDiskHeader->NumberOfKeys.QuadPart = (
        Table->Keys->NumberOfElements.QuadPart
    );
    OnDiskHeader->NumberOfSeeds = ((
        FIELD_OFFSET(GRAPH, LastSeed) -
        FIELD_OFFSET(GRAPH, FirstSeed)
    ) / sizeof(ULONG)) + 1;

    //
    // This will change based on masking type and whether or not the caller
    // has provided a value for NumberOfTableElements.  For now, keep it as
    // the number of vertices.
    //

    OnDiskHeader->NumberOfTableElements.QuadPart = (
        NumberOfVertices.QuadPart
    );

    CopyMemory(&OnDiskInfo->Dimensions, Dim, sizeof(*Dim));

    //
    // Set the context's main work callback to our worker routine, and the algo
    // context to our graph info structure.
    //

    Context->MainWorkCallback = ProcessGraphCallbackChm01;
    Context->AlgorithmContext = &Info;

    //
    // Set the context's file work callback to our worker routine.
    //

    Context->FileWorkCallback = FileWorkCallbackChm01;

    //
    // Prepare the initial "file preparation" work callback.  This will extend
    // the backing file to the appropriate size.
    //

    ZeroStruct(PrepareFile);
    PrepareFile.FileWorkId = FileWorkPrepareId;
    InterlockedPushEntrySList(&Context->FileWorkListHead,
                              &PrepareFile.ListEntry);

    CONTEXT_START_TIMERS(PrepareFile);

    SubmitThreadpoolWork(Context->FileWork);

    //
    // Capture initial cycles as reported by __rdtsc() and the performance
    // counter.  The former is used to report a raw cycle count, the latter
    // is used to convert to microseconds reliably (i.e. unaffected by turbo
    // boosting).
    //

    QueryPerformanceFrequency(&Context->Frequency);

    CONTEXT_START_TIMERS(Solve);

    //
    // We're ready to create threadpool work for the graph.
    //

    Buffer = (PBYTE)BaseAddress;
    Unusable = Buffer;

    for (Index = 0; Index < NumberOfGraphs; Index++) {

        //
        // Invariant check: at the top of the loop, Buffer and Unusable should
        // point to the same address (which will be the base of the current
        // graph being processed).  Assert this invariant now.
        //

        ASSERT(Buffer == Unusable);

        //
        // Carve out the graph pointer, and bump the unusable pointer past the
        // graph's pages, such that it points to the first byte of the guard
        // page.
        //

        Graph = (PGRAPH)Buffer;
        Unusable = Buffer + UsableBufferSizeInBytesPerBuffer;

        //
        // Sanity check the page alignment logic.  If we subtract 1 byte from
        // Unusable, it should reside on a different page.  Additionally, the
        // two pages should be separated by at most a single page size.
        //

        ThisPage = ALIGN_DOWN(Unusable,   PageSize);
        LastPage = ALIGN_DOWN(Unusable-1, PageSize);
        ASSERT(LastPage < ThisPage);
        ASSERT((ThisPage - LastPage) == PageSize);

        //
        // Verify the guard page is working properly by wrapping an attempt to
        // write to it in a structured exception handler that will catch the
        // access violation trap.
        //
        // N.B. We only do this if we're not actively being debugged, as the
        //      traps get dispatched to the debugger engine first as part of
        //      the "first-pass" handling logic of the kernel.
        //

        if (!IsDebuggerPresent()) {

            CaughtException = FALSE;

            TRY_PROBE_MEMORY{

                *Unusable = 1;

            } CATCH_EXCEPTION_ACCESS_VIOLATION{

                CaughtException = TRUE;

            }

            ASSERT(CaughtException);
        }

        //
        // Guard page is working properly.  Push the graph onto the context's
        // main work list head and submit the corresponding threadpool work.
        //

        InterlockedPushEntrySList(&Context->MainWorkListHead,
                                  &Graph->ListEntry);
        SubmitThreadpoolWork(Context->MainWork);

        //
        // Advance the buffer past the graph size and guard page.  Copy the
        // same address to the Unusable variable as well, such that our top
        // of the loop invariants hold true.
        //

        Buffer += GraphSizeInBytesIncludingGuardPage;
        Unusable = Buffer;

        //
        // If our key set size is small and our maximum concurrency is large,
        // we may have already solved the graph, in which case, we can stop
        // submitting new solver attempts and just break out of the loop here.
        //

        if (!ShouldWeContinueTryingToSolveGraph(Context)) {
            break;
        }
    }

    //
    // Wait on the context's events.
    //

    Events[0] = Context->SucceededEvent;
    Events[1] = Context->CompletedEvent;
    Events[2] = Context->ShutdownEvent;
    Events[3] = Context->FailedEvent;
    Events[4] = Context->TryLargerTableSizeEvent;

    WaitResult = WaitForMultipleObjects(ARRAYSIZE(Events),
                                        Events,
                                        FALSE,
                                        INFINITE);

    //
    // If the wait result indicates the try larger table size event was set,
    // deal with that, first.
    //

    if (WaitResult == WAIT_OBJECT_0+4) {

        //
        // The number of attempts at solving this graph have exceeded the
        // threshold.  Set the shutdown event in order to trigger all worker
        // threads to abort their current attempts and wait on the main thread
        // work to complete.
        //

        SetEvent(Context->ShutdownEvent);
        WaitForThreadpoolWorkCallbacks(Context->MainWork, TRUE);

        //
        // Perform a blocking wait for the prepare file work to complete.  (It
        // would be highly unlikely that this event hasn't been set yet.)
        //

        WaitResult = WaitForSingleObject(Context->PreparedFileEvent, INFINITE);

        ASSERT(WaitResult == WAIT_OBJECT_0);

        //
        // There are no more threadpool callbacks running.  However, a thread
        // could have finished a solution between the time the try larger table
        // size event was set, and this point.  So, check the finished count
        // first.  If it indicates a solution, jump to that handler code.
        //

        if (Context->FinishedCount > 0) {
            goto FinishedSolution;
        }

        //
        // Destroy the existing buffer we allocated for this attempt.  We'll
        // need a new, larger one to accomodate the resize.
        //

        ASSERT(Rtl->DestroyBuffer(Rtl, ProcessHandle, &BaseAddress));

        //
        // Increment the resize counter and update the total number of attempts
        // in the header.  Then, determine how close we came to solving the
        // graph, and store that in the header as well if it's the best so far
        // (or no previous version is present).
        //

        OnDiskHeader->NumberOfTableResizeEvents++;
        OnDiskHeader->TotalNumberOfAttemptsWithSmallerTableSizes += (
            Context->Attempts
        );

        Closest = NumberOfEdges.LowPart - Context->HighestDeletedEdgesCount;
        LastClosest = (
            OnDiskHeader->ClosestWeCameToSolvingGraphWithSmallerTableSizes
        );

        if (!LastClosest || Closest < LastClosest) {

            OnDiskHeader->ClosestWeCameToSolvingGraphWithSmallerTableSizes = (
                Closest
            );

        }

        //
        // If this is our first resize, capture the initial size we used.
        //

        if (!OnDiskHeader->InitialTableSize) {

            OnDiskHeader->InitialTableSize = NumberOfVertices.QuadPart;

        }

        //
        // Reset the remaining counters.
        //

        Context->Attempts = 0;
        Context->FailedAttempts = 0;
        Context->HighestDeletedEdgesCount = 0;

        //
        // Double the vertex count.  If we have overflowed max ULONG, abort.
        //

        Table->RequestedNumberOfTableElements.QuadPart = (
            NumberOfVertices.QuadPart
        );

        Table->RequestedNumberOfTableElements.QuadPart <<= 1ULL;

        if (Table->RequestedNumberOfTableElements.HighPart) {
            Success = FALSE;
            goto End;
        }

        //
        // Unmap the existing mapping and close the section.
        //

        ASSERT(UnmapViewOfFile(Table->BaseAddress));
        Table->BaseAddress = NULL;

        ASSERT(CloseHandle(Table->MappingHandle));
        Table->MappingHandle = NULL;

        //
        // Jump back to the start and try again with a larger vertex count.
        //

        goto RetryWithLargerTableSize;
    }

    //
    // The wait result did not indicate a resize event.  Ignore the wait
    // result for now; determine if the graph solving was successful by the
    // finished count of the context.  We'll corroborate that with whatever
    // events have been signaled shortly.
    //

    Success = (Context->FinishedCount > 0);

    if (!Success) {

        //
        // Invariant check: if no worker thread registered a solved graph (i.e.
        // Context->FinishedCount > 0), then verify that the shutdown event was
        // set.  If our WaitResult above indicates WAIT_OBJECT_2, we're done.
        // If not, verify explicitly.
        //

        if (WaitResult != WAIT_OBJECT_0+2) {

            //
            // Manually test that the shutdown event has been signaled.
            //

            WaitResult = WaitForSingleObject(Context->ShutdownEvent, 0);

            ASSERT(WaitResult == WAIT_OBJECT_0);
        }

        //
        // Clean up the main thread work group members and clear the work field.
        // This will block until all the worker threads have returned.  We need
        // to put this in place prior to jumping to the End: label as that step
        // will destroy the buffer we allocated earlier for the parallel graphs,
        // which we mustn't do if any threads are still working.
        //

        Context->MainWork = NULL;
        CloseThreadpoolCleanupGroupMembers(Context->MainCleanupGroup,
                                           TRUE,
                                           NULL);

        //
        // Perform the same operation for the file work threadpool.  Note that
        // the only work we've dispatched to this pool at this point is the
        // initial file preparation work.
        //

        Context->FileWork = NULL;
        CloseThreadpoolCleanupGroupMembers(Context->FileCleanupGroup,
                                           TRUE,
                                           NULL);

        goto End;
    }

    //
    // Pop the winning graph off the finished list head.
    //

FinishedSolution:

    ListEntry = InterlockedPopEntrySList(&Context->FinishedWorkListHead);
    ASSERT(ListEntry);

    Graph = CONTAINING_RECORD(ListEntry, GRAPH, ListEntry);

    //
    // Note this graph as the one solved to the context.  This is used by the
    // save file work callback we dispatch below.
    //

    Context->SolvedContext = Graph;

    //
    // Graphs always pass verification in normal circumstances.  The only time
    // they don't is if there's an internal bug in our code.  So, knowing that
    // the graph is probably correct, we can dispatch the file work required to
    // save it to disk to the file work threadpool whilst we verify it has been
    // solved correctly.
    //

    ZeroStruct(SaveFile);
    SaveFile.FileWorkId = FileWorkSaveId;

    //
    // Before we dispatch the save file work, make sure the preparation has
    // completed.
    //

    WaitResult = WaitForSingleObject(Context->PreparedFileEvent, INFINITE);
    if (WaitResult != WAIT_OBJECT_0 || Context->FileWorkErrors > 0) {
        __debugbreak();
        Success = FALSE;
        goto End;
    }

    //
    // Push this work item to the file work list head and submit the threadpool
    // work for it.
    //

    CONTEXT_START_TIMERS(SaveFile);

    InterlockedPushEntrySList(&Context->FileWorkListHead, &SaveFile.ListEntry);
    SubmitThreadpoolWork(Context->FileWork);

    //
    // Capture another round of cycles and performance counter values, then
    // continue with verification of the solution.
    //

    CONTEXT_START_TIMERS(Verify);

    Success = VerifySolvedGraph(Graph);

    CONTEXT_END_TIMERS(Verify);

    //
    // Set the verified event (regardless of whether or not we succeeded in
    // verification).  The save file work will be waiting upon it in order to
    // write the final timing details to the on-disk header.
    //

    SetEvent(Context->VerifiedEvent);

    ASSERT(Success);

    //
    // Wait on the saved file event before returning.
    //

    WaitResult = WaitForSingleObject(Context->SavedFileEvent, INFINITE);
    if (WaitResult != WAIT_OBJECT_0 || Context->FileWorkErrors > 0) {
        __debugbreak();
        Success = FALSE;
    }

End:

    //
    // Destroy the buffer we created earlier.
    //
    // N.B. Although we used Rtl->CreateMultipleBuffers(), we can still free
    //      the underlying buffer via Rtl->DestroyBuffer(), as only a single
    //      VirtualAllocEx() call was dispatched for the entire buffer.
    //

    if (BaseAddress) {
        Rtl->DestroyBuffer(Rtl, ProcessHandle, &BaseAddress);
    }

    //
    // Explicitly reset all events before returning.
    //

    Event = (PHANDLE)&Context->FirstEvent;
    NumberOfEvents = GetNumberOfContextEvents(Context);

    for (Index = 0; Index < NumberOfEvents; Index++, Event++) {

        ASSERT(*Event && *Event != INVALID_HANDLE_VALUE);
        ASSERT(ResetEvent(*Event));

    }

    return Success;
}

Appendix

Rtl->CreateRandomObjectNames()

This helper routine is used to assist with the creation of event names that have a set of prefixes we specify (e.g. PerfectHashTableContext_ShutdownEvent), and then a configurable amount of trailing Base64-encoded random data, such that the final name is likely to be unique within the current session with high probability.

_Use_decl_annotations_
BOOL
CreateRandomObjectNames(
    PRTL Rtl,
    PALLOCATOR TemporaryAllocator,
    PALLOCATOR WideBufferAllocator,
    USHORT NumberOfNames,
    USHORT LengthOfNameInChars,
    PUNICODE_STRING NamespacePrefix,
    PPUNICODE_STRING NamesArrayPointer,
    PPUNICODE_STRING PrefixArrayPointer,
    PULONG SizeOfWideBufferInBytes,
    PPWSTR WideBufferPointer
    )
/*++

Routine Description:

    This routine writes Base64-encoded random data to an existing buffer in a
    format suitable for subsequent use of UNICODE_STRING-based system names for
    things like events or shared memory handles.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.  If the crypto
        subsystem hasn't yet been initialized, this routine will also initialize
        it.

    TemporaryAllocator - Supplies a pointer to an initialized ALLOCATOR struct
        that this routine will use for temporary allocations.  (Any temporarily
        allocated memory will be freed before the routine returns, regardless
        of success/error.)

    WideBufferAllocator - Supplies a pointer to an initialized ALLOCATOR struct
        that this routine will use to allocate the final wide character buffer
        that contains the base64-encoded random data.  This data will then have
        the object namespace+prefix and trailing NULL characters overlaid on
        top of it.  (That is, the UNICODE_STRING structures pointed to by the
        NamesArray will have their Buffer addresses point within this buffer
        space.)  The caller is responsible for freeing this address (which will
        be received via the output param WideBufferPointer).

    NumberOfNames - Supplies the number of names that will be carved out of the
        provided WideBuffer by the caller.  This parameter is used in concert
        with the LengthOfNameInChars parameter to ensure the buffer is laid out
        in the correct format.

    LengthOfNameInChars - Supplies the desired length of each name string in
        characters.  This length is assumed to include the trailing NULL and
        the prefix -- that is, the space required to contain the prefix and
        trailing NULL will be subtracted from this parameter.  For optimal
        layout, this parameter should be a power of 2 -- with 64 and 128 being
        good default values.

    NamespacePrefix - Optionally supplies a pointer to a UNICODE_STRING to
        use as the namespace (prefix) for each string.  If NULL, this value
        defaults L"Local\\".  (If L"Global\\" is used, the caller is responsible
        for ensuring the SeCreateGlobalPrivilege privilege is enabled.)

    NamesArrayPointer - Supplies a pointer to the first element of an array of
        of pointers to UNICODE_STRING structures that will be filled out with
        the details of the corresponding object name.  Sufficient space should
        be allocated such that the array contains sizeof(UNICODE_STRING) *
        NumberOfNames in space.

    PrefixArrayPointer - Optionally supplies a pointer to the first element of
        an array of pointers to UNICODE_STRING structures which can be used to
        further customize the name of the object after the namespace but before
        the random character data.  If a NULL pointer resides at a given array
        element, it is assumed no prefix is desired for this element.

    SizeOfWideBufferInBytes - Receives the size in bytes of the buffer allocated
        to store the object names' wide character data.

    WideBufferPointer - Receives the base address of the object names wide char
        buffer.


Return Value:

    TRUE on success, FALSE on error.

--*/
{
    BOOL Success;
    USHORT Index;
    USHORT Count;
    LONG PrefixLengthInChars;
    LONG NumberOfWideBase64CharsToCopy;
    LONG CharsRemaining;
    LONG CharsUsed;
    LONG RandomCharsUsed;
    LONG FinalCharCount;
    ULONG CryptFlags;
    ULONG SizeOfBinaryBufferInBytes;
    ULONG SizeOfWideBase64BufferInBytes = 0;
    ULONG OldLengthOfWideBase64BufferInChars;
    ULONG LengthOfWideBase64BufferInChars;
    PBYTE BinaryBuffer;
    PWCHAR Dest;
    PWCHAR WideBase64Buffer = NULL;
    PUNICODE_STRING String;
    PUNICODE_STRING Prefix;
    PPUNICODE_STRING Prefixes;
    PUNICODE_STRING Namespace;
    UNICODE_STRING LocalNamespace = RTL_CONSTANT_STRING(L"Local\\");

    //
    // Validate arguments.
    //

    if (ARGUMENT_PRESENT(NamespacePrefix)) {
        if (!IsValidUnicodeStringWithMinimumLengthInChars(NamespacePrefix,7)) {
            return FALSE;
        }
        Namespace = NamespacePrefix;
    } else {
        Namespace = &LocalNamespace;
    }

    //
    // Namespace length should be (far) less than the desired name length.
    //

    if (Namespace->Length >= (LengthOfNameInChars << 1)) {
        __debugbreak();
        return FALSE;
    }

    //
    // If the namespace ends with a trailing NULL, omit it by reducing the
    // length by one wide character.  Then, verify the final character is a
    // slash.
    //

    if (Namespace->Buffer[(Namespace->Length >> 1) - 1] == L'\0') {
        Namespace->Length -= sizeof(WCHAR);
    }

    if (Namespace->Buffer[(Namespace->Length >> 1) - 1] != L'\\') {
        __debugbreak();
        return FALSE;
    }

    if (ARGUMENT_PRESENT(PrefixArrayPointer)) {
        Prefixes = PrefixArrayPointer;
    } else {
        Prefixes = NULL;
    }

    //
    // Make sure the crypto subsystem is available.
    //

    if (!Rtl->Flags.Crypt32Initialized) {
        if (!InitCrypt32(Rtl)) {
            return FALSE;
        }
    }

    //
    // Allocate a buffer for the initial binary data; we generate more random
    // data than we need here, but it's easier than trying to get everything
    // exact up-front (i.e. base64->binary size conversions).
    //
    // N.B. We use Allocator->Malloc() instead of the Calloc() here as the
    //      existing memory data will contribute as a seed value.
    //

    SizeOfBinaryBufferInBytes = NumberOfNames * LengthOfNameInChars;

    BinaryBuffer = (PBYTE)(
        TemporaryAllocator->Malloc(
            TemporaryAllocator->Context,
            SizeOfBinaryBufferInBytes
        )
    );

    if (!BinaryBuffer) {
        return FALSE;
    }

    //
    // Allocate a wide character buffer for the base64-encoded binary data that
    // is double the length of the binary buffer -- this is simpler than trying
    // to get the exact conversion right.
    //

    SizeOfWideBase64BufferInBytes = ALIGN_UP_POINTER(
        NumberOfNames *
        LengthOfNameInChars *
        sizeof(WCHAR) *
        2
    );

    WideBase64Buffer = (PWCHAR)(
        WideBufferAllocator->Calloc(
            WideBufferAllocator->Context,
            1,
            SizeOfWideBase64BufferInBytes
        )
    );

    if (!WideBase64Buffer) {
        goto Error;
    }

    //
    // We successfully allocated space for our two buffers.  Fill the first one
    // with random data now.
    //

    Success = Rtl->CryptGenRandom(Rtl,
                                  SizeOfBinaryBufferInBytes,
                                  (PBYTE)BinaryBuffer);
    if (!Success) {
        Rtl->LastError = GetLastError();
        __debugbreak();
        goto Error;
    }

    //
    // Convert the entire binary data buffer into base64-encoded wide character
    // representation.  We calculate the number of wide characters the buffer
    // can receive by shifting the byte size right by one, then copy that value
    // into a second variable that CryptBinaryToStringW() can overwrite with
    // the actual length converted.
    //

    OldLengthOfWideBase64BufferInChars = SizeOfWideBase64BufferInBytes >> 1;
    LengthOfWideBase64BufferInChars = OldLengthOfWideBase64BufferInChars;

    CryptFlags = CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF;
    Success = Rtl->CryptBinaryToStringW(BinaryBuffer,
                                        SizeOfBinaryBufferInBytes,
                                        CryptFlags,
                                        WideBase64Buffer,
                                        &LengthOfWideBase64BufferInChars);

    if (!Success) {
        Rtl->LastError = GetLastError();
        __debugbreak();
        goto Error;
    }

    //
    // Conversion of the binary data into base64-encoded data was successful,
    // so we can free the binary buffer now.
    //

    TemporaryAllocator->FreePointer(
        TemporaryAllocator->Context,
        &BinaryBuffer
    );

    //
    // Loop through the array of pointers to UNICODE_STRING structures and fill
    // each one out, adding the namespace and prefixes accordingly.
    //

    RandomCharsUsed = 0;

    for (Index = 0; Index < NumberOfNames; Index++) {

        //
        // Resolve the next unicode string pointer.
        //

        String = *(NamesArrayPointer + Index);

        //
        // Reset counters.  CharsUsed has one subtracted from it in addition
        // to the namespace length to account for the trailing NULL.
        //

        CharsUsed = (Namespace->Length >> 1) + 1;
        CharsRemaining = (LONG)LengthOfNameInChars - CharsUsed;

        if (Prefixes && ((Prefix = *(Prefixes + Index)) != NULL)) {

            //
            // Omit any trailing NULLs from the custom prefix provided by the
            // caller, then subtract the prefix length from the remaining bytes
            // and verify we've got a sensible number left.
            //

            PrefixLengthInChars = Prefix->Length >> 1;
            if (Prefix->Buffer[PrefixLengthInChars - 1] == L'\0') {
                PrefixLengthInChars -= 1;
            }

            CharsUsed += PrefixLengthInChars;
            CharsRemaining -= PrefixLengthInChars;
        } else {
            Prefix = NULL;
            PrefixLengthInChars = 0;
        }

        if (CharsRemaining <= 0) {
            __debugbreak();
            goto Error;
        }

        //
        // Final sanity check that the lengths add up.
        //

        NumberOfWideBase64CharsToCopy = CharsRemaining;

        FinalCharCount = (
            (Namespace->Length >> 1) +
            PrefixLengthInChars +
            NumberOfWideBase64CharsToCopy +
            1
        );

        if (FinalCharCount != LengthOfNameInChars) {
            __debugbreak();
            goto Error;
        }

        //
        // Everything checks out, fill out the unicode string details and copy
        // the relevant parts over: namespace, optional prefix and then finally
        // the random characters.
        //

        String->Length = (USHORT)(FinalCharCount << 1) - sizeof(WCHAR);
        String->MaximumLength = String->Length + sizeof(WCHAR);

        Dest = String->Buffer = (WideBase64Buffer + RandomCharsUsed);

        //
        // Copy the namespace and optionally prefix into the initial part of
        // the random character data buffer, then NULL terminate the name and
        // update counters.
        //

        Count = Namespace->Length >> 1;
        __movsw(Dest, Namespace->Buffer, Count);
        Dest += Count;
        RandomCharsUsed += Count;

        if (Prefix) {
            Count = (USHORT)PrefixLengthInChars;
            __movsw(Dest, Prefix->Buffer, Count);
            Dest += Count;
            RandomCharsUsed += Count;
        }

        Count = (USHORT)NumberOfWideBase64CharsToCopy + 1;
        Dest += Count - 1;
        *Dest = L'\0';
        RandomCharsUsed += Count;
    }

    //
    // We're done, indicate success and finish up.
    //

    Success = TRUE;
    goto End;

Error:

    Success = FALSE;

    //
    // Intentional follow-on to End.
    //

End:

    if (BinaryBuffer) {
        TemporaryAllocator->FreePointer(
            TemporaryAllocator->Context,
            &BinaryBuffer
        );
    }

    *WideBufferPointer = WideBase64Buffer;
    *SizeOfWideBufferInBytes = SizeOfWideBase64BufferInBytes;

    return Success;
}

Rtl->CreateMultipleBuffers()

This routine is used to create multiple buffers of equal sizes (where the size is specified in numbers of pages), separated by a "guard" page that has all permissions removed from it (such that any attempts to read or write to memory in that page will trap).

It is used to allocate all memory required by concurrent graph solving using a single VirtualAlloc() call.

_Use_decl_annotations_
BOOL
CreateMultipleBuffers(
    PRTL Rtl,
    PHANDLE TargetProcessHandle,
    ULONG PageSize,
    ULONG NumberOfBuffers,
    ULONG NumberOfPagesPerBuffer,
    PULONG AdditionalProtectionFlags,
    PULONG AdditionalAllocationTypeFlags,
    PULONGLONG UsableBufferSizeInBytesPerBuffer,
    PULONGLONG TotalBufferSizeInBytes,
    PPVOID BufferAddress
    )
/*++

Routine Description:

    Allocates a single buffer using VirtualAllocEx() sized using the following:

        (PageSize * NumberOfPagesPerBuffer * NumberOfBuffers) +
        (NumberOfBuffers * PageSize)

    This includes a guard page following each buffer's set of pages.  Memory
    protection is set on the guard page such that any access will trap.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    TargetProcessHandle - Optionally supplies a pointer to a variable that
        contains a process handle for which the memory is to be allocated.
        If non-NULL, but pointed-to value is 0, this will receive the handle
        of the current process.

    PageSize - Indicates the page size indicated by the NumberOfPagesPerBuffer
        parameter.  This should be either 4096 or 2MB.  If 2MB, the caller
        should also ensure a) MEM_LARGE_PAGES is also provided as the value in
        the parameter AdditionalAllocationTypeFlags, and b) relevant privileges
        have been obtained (e.g. Rtl->EnableManageVolumePrivilege() and
        Rtl->EnableLockMemoryPrivilege()).

    NumberOfBuffers - Supplies the number of individual buffers being serviced
        by this request.  This will also reflect the number of implicit guard
        pages included in the allocation.

    NumberOfPagesPerBuffer - Supplies the number of pages to assign for each
        buffer.

    AdditionalProtectionFlags - Optionally supplies flags that will be OR'd
        with the flProtect flags parameter of VirtualAllocEx() (e.g.
        PAGE_NOCACHE or PAGE_WRITECOMBINE).

    AdditionalAllocationTypeFlags - Optionally supplies flags that will be OR'd
        with the flAllocationType flags parameter of VirtualAllocEx().  This
        should include MEM_LARGE_PAGES if large pages are desired.

    UsableBufferSizeInBytesPerBuffer - Supplies the address of a variable that
        receives the number of usable bytes per buffer.

    TotalBufferSizeInBytes - Supplies the address of a variable that receives
        the total allocation size provided to VirtualAllocEx().

    BufferAddress - Supplies the address of a variable that receives the
        base address of the allocation if successful.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    BOOL Success;
    ULONG Index;
    PVOID Buffer;
    PBYTE Unusable;
    HANDLE ProcessHandle;
    ULONG ProtectionFlags;
    ULONG OldProtectionFlags;
    ULONG AllocationFlags;
    ULARGE_INTEGER TotalNumberOfPages;
    ULARGE_INTEGER AllocSizeInBytes;
    ULONGLONG UsableBufferSizeInBytes;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(Rtl)) {
        return FALSE;
    }

    if (!NumberOfPagesPerBuffer) {
        return FALSE;
    }

    if (!NumberOfBuffers) {
        return FALSE;
    }

    //
    // Verify page size is either 4KB or 2MB.
    //

    if (PageSize != 4096 && PageSize != (1 << 21)) {
        return FALSE;
    }

    //
    // If page size is 2MB, verify MEM_LARGE_PAGES has also been requested.
    //

    if (PageSize == (1 << 21)) {
        if (!ARGUMENT_PRESENT(AdditionalAllocationTypeFlags)) {
            return FALSE;
        }
        if (!(*AdditionalAllocationTypeFlags & MEM_LARGE_PAGES)) {
            return FALSE;
        }
    }

    if (!ARGUMENT_PRESENT(TargetProcessHandle)) {
        ProcessHandle = GetCurrentProcess();
    } else {
        if (!*TargetProcessHandle) {
            *TargetProcessHandle = GetCurrentProcess();
        }
        ProcessHandle = *TargetProcessHandle;
    }

    if (!ARGUMENT_PRESENT(UsableBufferSizeInBytesPerBuffer)) {
        return FALSE;
    } else {
        *UsableBufferSizeInBytesPerBuffer = 0;
    }

    if (!ARGUMENT_PRESENT(TotalBufferSizeInBytes)) {
        return FALSE;
    } else {
        *TotalBufferSizeInBytes = 0;
    }


    if (!ARGUMENT_PRESENT(BufferAddress)) {
        return FALSE;
    } else {
        *BufferAddress = NULL;
    }

    TotalNumberOfPages.QuadPart = (

        //
        // Account for the buffer related pages.
        //

        ((ULONGLONG)NumberOfPagesPerBuffer * (ULONGLONG)NumberOfBuffers) +

        //
        // Account for the guard pages; one for each buffer.
        //

        ((ULONGLONG)NumberOfBuffers)
    );

    //
    // Calculate the total allocation size required.
    //

    AllocSizeInBytes.QuadPart = (
        TotalNumberOfPages.QuadPart * (ULONGLONG)PageSize
    );

    ProtectionFlags = PAGE_READWRITE;
    if (ARGUMENT_PRESENT(AdditionalProtectionFlags)) {
        ProtectionFlags |= *AdditionalProtectionFlags;
    }

    AllocationFlags = MEM_RESERVE | MEM_COMMIT;
    if (ARGUMENT_PRESENT(AdditionalAllocationTypeFlags)) {
        AllocationFlags |= *AdditionalAllocationTypeFlags;
    }

    //
    // Validation of parameters complete.  Proceed with buffer allocation.
    //

    Buffer = Rtl->VirtualAllocEx(ProcessHandle,
                                 NULL,
                                 AllocSizeInBytes.QuadPart,
                                 AllocationFlags,
                                 ProtectionFlags);

    if (!Buffer) {
        ULONG LastError = GetLastError();
        return FALSE;
    }

    //
    // Buffer was successfully allocated.  Any failures after this point should
    // `goto Error` to ensure the memory is freed.
    //

    UsableBufferSizeInBytes = (ULONGLONG)(
        (ULONGLONG)NumberOfPagesPerBuffer *
        (ULONGLONG)PageSize
    );

    //
    // Loop through the assigned memory and toggle protection to PAGE_NOACCESS
    // for the guard pages.
    //

    Unusable = (PBYTE)Buffer;
    ProtectionFlags = PAGE_NOACCESS;

    for (Index = 0; Index < NumberOfBuffers; Index++) {

        Unusable += UsableBufferSizeInBytes;

        Success = Rtl->VirtualProtectEx(ProcessHandle,
                                        Unusable,
                                        PageSize,
                                        ProtectionFlags,
                                        &OldProtectionFlags);

        if (!Success) {
            goto Error;
        }

        //
        // Advance past this guard page.
        //

        Unusable += PageSize;
    }

    //
    // We're done, indicate success and finish up.
    //

    Success = TRUE;
    goto End;

Error:

    Success = FALSE;

    //
    // Buffer should be non-NULL at this point.  Assert this invariant, free the
    // allocated memory, clear the buffer pointer and set the alloc size to 0.
    //

    ASSERT(Buffer);
    Rtl->VirtualFreeEx(ProcessHandle, Buffer, 0, MEM_RELEASE);
    Buffer = NULL;

    //
    // Intentional follow-on to End.
    //

End:

    //
    // Update caller's parameters and return.
    //

    *BufferAddress = Buffer;

    if (Success) {
        *TotalBufferSizeInBytes = AllocSizeInBytes.QuadPart;
        *UsableBufferSizeInBytesPerBuffer = UsableBufferSizeInBytes;
    }

    return Success;
}

View this page's source on GitHub.

Follow @trentnelson