Is Prefix Of String In Table?

A Journey Into SIMD String Processing

AVX2
SIMD
C
Assembly
MASM

This article details an approach for efficiently determining if a given string prefix-matches a set of known strings. That is, do any of the known strings represent the prefix of a given string? A custom data structure is employed with successive implementations benchmarked to find the fastest possible solution.

Author

Trent Nelson

Published

May 4, 2018

Published: 4th May, 2018. Last updated: 1st November, 2024.

Thanks to Fabian Giesen, Wojciech Muła, Geoff Langdale, Daniel Lemire, and Kendall Willets for their valuable feedback on an early draft of this article.

Hours spent on this article to date: 230.56. Hours spent porting this article from raw HTML to Markdown in 2024: about 16-20. See Colophon for more details.

Hacker News discussion | Reddit discussion


TL;DR

I wrote some C and assembly code that uses SIMD instructions to perform prefix matching of strings. The C code was between 4-7x faster than the baseline implementation for prefix matching. The assembly code was 9-12x faster than the baseline specifically for the negative match case (determining that an incoming string definitely does not prefix match any of our known strings). The fastest negative match could be done in around 6 CPU cycles, which is pretty quick. (Integer division, for example, takes about 90 cycles.)

Overview

Goal: given a string, determine if it prefix-matches a set of known strings as fast as possible. That is, in a set of known strings, do any of them prefix match the incoming search string?

A reference implementation was written in C as a baseline, which simply looped through an array of strings, comparing each one, byte-by-byte, looking for a prefix match. Prefix match performance ranged from 28 CPU cycles to 130, and negative match performance was around 74 cycles.

A SIMD-friendly C structure called STRING_TABLE was derived. It is optimized for up to 16 strings, ideally of length less than or equal to 16 characters. The table is created from the set of known strings up-front; it is sorted by length, ascending, and a unique character (with regards to other characters at the same byte offset) is then extracted, along with its index. A 16-byte character array, STRING_SLOT, is used to capture the unique characters. A 16-element array of unsigned characters, SLOT_INDEX, is used to capture the index. Similarly, lengths are stored in the same fashion via SLOT_LENGTHS. Finally, a 16-element array of STRING_SLOTs is used to capture up to the first 16 bytes of each string in the set.

An example of the memory layout of the STRING_TABLE structure at run time, using sample test data, is depicted below. Note the width of each row is 16 bytes (128 bits), which is the size of an XMM register.

The STRING_TABLE Structure

The STRING_TABLE Structure

The layout of the STRING_TABLE structure allows us to determine if a given search string does not prefix match all 16 strings at once in 12 assembly instructions. This breaks down into 18 μops, with a block throughput of 3.48 cycles on Intel’s Skylake architecture. (In practice, this clocks in at around 6 CPU cycles.)

mov      rax,  String.Buffer[rdx]                   ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx]      ; Broadcast string length.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu  xmm0, xmmword ptr [rax]                    ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4                           ; Identify slots > string len.
vpshufb  xmm5, xmm0, StringTable.UniqueIndex[rcx]   ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx]   ; Compare rearranged to unique.
vptest   xmm1, xmm5                                 ; Unique slots AND (!long slots).
jnc      short Pfx10                                ; CY=0, continue with routine.
xor      eax, eax                                   ; CY=1, no match.
not      al                                         ; al = -1 (NO_MATCH_FOUND).
ret                                                 ; Return NO_MATCH_FOUND.
S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|μops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |1.0|   |       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ |   |1.0|0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of μops: 18

Here’s a simplified walk-through of a negative match in action, using the search string “CAT”:

StringTable Negative Match

StringTable Negative Match

Ten iterations of a function named IsPrefixOfStringInTable were authored. The tenth and final iteration was the fastest, prefix matching in as little as 19 cycles—a 4x improvement over the baseline. Negative matching took 11 cycles—a 6.7x improvement.

An assembly version of the algorithm was authored specifically to optimize for the negative match case, and was able to do so in as little as 8 cycles, representing a 9x improvement over the baseline. (It was a little bit slower than the fastest C routine in the case of prefix matches, though, as can be seen below.)

Feedback for an early draft of this article was then solicited via Twitter, resulting in four more iterations of the C version, and three more iterations of the assembly version. The PGO build of the fastest C version prefix matched in about 16 cycles (and also had the best “worst case input string” performance where three slots needed comparison), negative matching in about 26 cycles. The fifth iteration of the assembly version negative matched in about 6 cycles, a 3 and 1 cycle improvement, respectively.

Benchmark Overview

Benchmark Overview

We were then ready to publish, but felt compelled to investigate an odd performance quirk we’d noticed with one of the assembly routines, which yielded 7 more assembly versions. Were any of them faster? Let’s find out.

The Background

The Tracer Project

One of the frustrations I had with existing Python profilers was that there was no easy or efficient means to filter or exclude trace information based on the module name of the code being executed. I tackled this in my tracer project, which allows you to set an environment variable named TRACER_MODULE_NAMES to restrict which modules should be traced, e.g.:

set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy

If the code being executed is coming from the module myproject3.subproject.foo, then we need to trace it, as that string prefix matches the third entry on our list.

This article details the custom data structure and algorithm I came up with in order to try and solve the prefix matching problem more optimally with a SIMD approach. The resulting StringTable component is used extensively within the tracer project, and as such, must conform to unique constraints such as no use of the C runtime library and allocating all memory through TraceStore-backed allocators. Thus, it’s not really something you’d drop in to your current project in its current form. Hopefully, the article still proves to be interesting.

Note

The code samples provided herein are copied directly from the tracer project, which is written in C and assembly, and uses the Pascal-esque Cutler Normal Form style for C. If you’re used to the more UNIX-style Kernel Normal Form of C, it’s quite like that, except that it’s absolutely nothing like that, and all these code samples will probably be very jarring.

Baseline C Implementation

The simplest way of solving this in C is to have an array of C strings (i.e., NULL-terminated byte arrays), then for each string, loop through byte by byte and see if it prefix matches the search string.

//
// Declare a set of module names to be used as a string array.
//

const PCSZ ModuleNames[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    NULL,
};

//
// Define the function pointer typedef.
//

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_CSTR_IN_ARRAY)(
    _In_ PCSZ *StringArray,
    _In_ PCSZ String,
    _Out_opt_ PSTRING_MATCH Match
    );
typedef IS_PREFIX_OF_CSTR_IN_ARRAY *PIS_PREFIX_OF_CSTR_IN_ARRAY;

//
// Forward declaration.
//

IS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

        while (*Left && *Right && *Left++ == *Right++) {
            Count++;
        }

        if (Count > 0 && !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)Count;
                Match->String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}
const char *module_names[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    0,
};

struct string_match {
    /* Index of the match. */
    unsigned char index;

    /* Number of characters matched. */
    unsigned char number_of_chars_matched;

    /* Pad out to an 8-byte boundary. */
    unsigned short padding[3];

    /* Pointer to the string that was matched. */
    char *str;
};

unsigned char
is_prefix_of_c_str_in_array(const char **array,
                            const char *str,
                            struct string_match *match)
{
    char *left, *right, **target;
    unsigned int c, i = 0;

    for (target = array; target; target++, i++) {
        c = 0;
        left = str;
        right *target;
        while (*left && *right && *left++ == *right) {
            c++;
        }
        if (c > 0 && !*right) {
            if (match) {
                match->index = i;
                match->chars_matched = c;
                match->str = target[i];
            }
            return i;
        }
    }

    return -1;
}

Another type of code pattern that the string table attempts to replace is anything that does a lot of if/else if/else if-type string comparisons to look for keywords. For example, in the Quake III source, there’s some symbol/string processing logic that looks like this:

// call instructions reset currentArgOffset
if ( !strncmp( token, "CALL", 4 ) ) {
    EmitByte( &segment[CODESEG], OP_CALL );
    instructionCount++;
    currentArgOffset = 0;
    return;
}

// arg is converted to a reversed store
if ( !strncmp( token, "ARG", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_ARG );
    instructionCount++;
    if ( 8 + currentArgOffset >= 256 ) {
        CodeError( "currentArgOffset >= 256" );
        return;
    }
    EmitByte( &segment[CODESEG], 8 + currentArgOffset );
    currentArgOffset += 4;
    return;
}

// ret just leaves something on the op stack
if ( !strncmp( token, "RET", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_LEAVE );
    instructionCount++;
    EmitInt( &segment[CODESEG], 8 + currentLocals + currentArgs );
    return;
}

// pop is needed to discard the return value of
// a function
if ( !strncmp( token, "pop", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_POP );
    instructionCount++;
    return;
}
...

An example of using the string table approach for this problem is discussed in the Other Applications section.

The Proposed Interface

Let’s take a look at the interface we’re proposing, the IsPrefixOfStringInTable function, that this article is based upon:

The IsPrefixOfStringInTable Function

//
// Our string table index is simply a char, with -1 indicating no match found.
//

typedef CHAR STRING_TABLE_INDEX;
#define NO_MATCH_FOUND -1

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_STRING_IN_TABLE)(
    _In_ PSTRING_TABLE StringTable,
    _In_ PSTRING String,
    _Out_opt_ PSTRING_MATCH StringMatch
    );
typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE;

IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/

All implementations discussed in this article adhere to that function signature. The STRING_TABLE structure will be discussed shortly.

The STRING_MATCH Structure

The STRING_MATCH structure is used to optionally communicate information about the prefix match back to the caller. The index and characters matched fields are often very useful when using the string table for text parsing; see the other applications section below for an example.

The structure is defined as follows:

//
// This structure is used to communicate matches back to the caller.
//

typedef struct _STRING_MATCH {

    //
    // Index of the match.
    //

    BYTE Index;

    //
    // Number of characters matched.
    //

    BYTE NumberOfMatchedCharacters;

    //
    // Pad out to 8-bytes.
    //

    USHORT Padding[3];

    //
    // Pointer to the string that was matched.  The underlying buffer will
    // stay valid for as long as the STRING_TABLE struct persists.
    //

    PSTRING String;

} STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH;
C_ASSERT(sizeof(STRING_MATCH) == 16);

The Test Data

Instead of using some arbitrary Python module names, this article is going to focus on a string table constructed out of a set of 16 strings that represent reserved names of the NTFS file system, at least when it was first released way back in the early 90s.

This list is desirable as it has good distribution of characters, there is a good mix of both short and long entries, plus one oversized one ($INDEX_ALLOCATION, which clocks in at 17 characters), and almost all strings lead with a common character (the dollar sign), preventing a simple first character optimization used by the initial version of the StringTable component I wrote in 2016.

So the scenario we’ll be emulating, in this case, is that we’ve just been passed a filename for creation, and we need to check if it prefix matches any of the reserved names.

Here’s the full list of NTFS names we’ll be using. We’re assuming 8-bit ASCII encoding (no UTF-8) and case sensitive. (If this were actually the NT kernel, we’d need to use wide characters with UTF-16 encoding, and be case-insensitive.)

NTFS Reserved Names

  • $AttrDef
  • $BadClus
  • $Bitmap
  • $Boot
  • $Extend
  • $LogFile
  • $MftMirr
  • $Mft
  • $Secure
  • $UpCase
  • $Volume
  • $Cairo
  • $INDEX_ALLOCATION
  • $DATA
  • ????
  • .

The ordering is important in certain cases. For example, when you have overlapping strings, such as $MftMirr and $Mft, you should put the longest strings first. They will be matched first, and as our routine terminates upon the first successful prefix match—if a longer string resided after a shorter one, it would never get detected.

Let’s review some guiding design requirements and cover some of the design decisions I made, which should help shape your understanding of the implementation.

Requirements and Design Decisions

  1. The STRING struct will be used to capture incoming search strings as well as the representation of any strings registered in the table (or more accurately, in the corresponding StringArray structure associated with the string table.
//
// The STRING structure used by the NT kernel.  Our STRING_ARRAY structure
// relies on an array of these structures.  We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//

typedef struct _STRING {
    USHORT Length;
    USHORT MaximumLength;
    ULONG  Padding;
    PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;
  1. The design should optimize for string lengths less than or equal to 16. Lengths greater than 16 are permitted, up to 128 bytes, but they incur more overhead during the prefix lookup.

  2. The design should prioritize the fast-path code where there is no match for a given search string. Being able to terminate the search as early as possible is ideal.

  3. The performance hits taken by unaligned data access are non-negligible, especially when dealing with XMM/YMM loads. Pay special care to alignment constraints and make sure that everything under our control is aligned on a suitable boundary.

Note

The only thing we can’t really control in the real world is the alignment of the incoming search string buffer, which will often be at undesirable alignments like 2, 4, 6, etc. Our test program explicitly aligns the incoming search strings on 32-byte boundaries to avoid the penalties associated with unaligned access.

The string table is geared toward a single-shot build. Once you’ve created it with a given string array or used a delimited environment variable, that’s it. There are no AddString() or RemoveString() routines. The order you provided the strings in will be the same order the table uses—no re-ordering will be done. Thus, for prefix matching purposes, if two strings share a common prefix, the longer one should go first, as the prefix search routine will check it first.

Only single matches are performed; the first match that qualifies as a prefix match (target string in table had length less than or equal to the search string, and all of its characters matched). There is no support for obtaining multiple matches—if you’ve constructed your string tables properly (no duplicate or incorrectly-ordered overlapping fields), you shouldn’t need to.

So, to summarize, the design guidelines are as follows:

  • Prioritize fast-path exit in the non-matched case. (I refer to this as negative matching in a lot of places.)

  • Optimize for up to 16 string slots, where each slot has up to 16 characters, ideally. It can have up to 128 in total; however, any bytes outside of the first sixteen live in the string array structure supporting the string table (accessible via pStringArray).

  • If a slot is longer than 16 characters, optimize for the assumption that it won’t be that much longer. For instance, assume a string of length 18 bytes is more common than 120 bytes.

The Data Structures

The primary data structure employed by this solution is the STRING_TABLE structure. It is composed of supporting structures: STRING_SLOT, SLOT_INDEX, and SLOT_LENGTH, and either embeds or points to the originating STRING_ARRAY structure from which it was created.

STRING_TABLE

Let’s review the STRING_TABLE view on GitHub structure first and then touch on the supporting structures.

//
// The STRING_TABLE struct is an optimized structure for testing whether a
// prefix entry for a string is in a table, with the expectation that the
// strings being compared will be relatively short (ideally <= 16 characters),
// and the table of string prefixes to compare to will be relatively small
// (ideally <= 16 strings).
//
// The overall goal is to be able to prefix match a string with the lowest
// possible (amortized) latency.  Fixed-size, memory-aligned character arrays,
// and SIMD instructions are used to try and achieve this.
//

typedef struct _STRING_TABLE {

    //
    // A slot where each individual element contains a uniquely-identifying
    // letter, with respect to the other strings in the table, of each string
    // in an occupied slot.
    //

    STRING_SLOT UniqueChars;

    //
    // (16 bytes consumed.)
    //

    //
    // For each unique character identified above, the following structure
    // captures the 0-based index of that character in the underlying string.
    // This is used as an input to vpshufb to rearrange the search string's
    // characters such that it can be vpcmpeqb'd against the unique characters
    // above.
    //

    SLOT_INDEX UniqueIndex;

    //
    // (32 bytes consumed.)
    //

    //
    // Length of the underlying string in each slot.
    //

    SLOT_LENGTHS Lengths;

    //
    // (48 bytes consumed, aligned at 16 bytes.)
    //

    //
    // Pointer to the STRING_ARRAY associated with this table, which we own
    // (we create it and copy the caller's contents at creation time and
    // deallocate it when we get destroyed).
    //
    // N.B.  We use pStringArray here instead of StringArray because the
    //       latter is a field name at the end of the struct.
    //
    //

    PSTRING_ARRAY pStringArray;

    //
    // (56 bytes consumed, aligned at 8 bytes.)
    //

    //
    // String table flags.
    //

    STRING_TABLE_FLAGS Flags;

    //
    // (60 bytes consumed, aligned at 4 bytes.)
    //

    //
    // A 16-bit bitmap indicating which slots are occupied.
    //

    USHORT OccupiedBitmap;

    //
    // A 16-bit bitmap indicating which slots have strings longer than 16 chars.
    //

    USHORT ContinuationBitmap;

    //
    // (64 bytes consumed, aligned at 64 bytes.)
    //

    //
    // The 16-element array of STRING_SLOT structs.  We want this to be aligned
    // on a 64-byte boundary, and it consumes 256-bytes of memory.
    //

    STRING_SLOT Slots[16];

    //
    // (320 bytes consumed, aligned at 64 bytes.)
    //

    //
    // We want the structure size to be a power of 2 such that an even number
    // can fit into a 4KB page (and reducing the likelihood of crossing page
    // boundaries, which complicates SIMD boundary handling), so we have an
    // extra 192-bytes to play with here.  The CopyStringArray() routine is
    // special-cased to allocate the backing STRING_ARRAY structure plus the
    // accommodating buffers in this space if it can fit.
    //
    // (You can test whether or not this occurred by checking the invariant
    //  `StringTable->pStringArray == &StringTable->StringArray`, if this
    //  is true, the array was allocated within this remaining padding space.)
    //

    union {
        STRING_ARRAY StringArray;
        CHAR Padding[192];
    };

} STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE;

//
// Assert critical size and alignment invariants at compile time.
//

C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots)   == 64);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320);
C_ASSERT(sizeof(STRING_TABLE) == 512);
struct string_table {
    char                       unique_chars[16];
    unsigned char              unique_index[16];
    unsigned char              slot_lengths[16];
    struct string_array       *string_array_ptr;
    struct string_table_flags  flags;
    unsigned short             occupied_bitmap;
    unsigned short             continuation_bitmap;
    char                       slots[16][16];
    union {
        struct string_array    string_array;
        char                   padding[184];
    } u;
};
STRING_TABLE struct
    UniqueChars         CHAR 16 dup  (?)
    UniqueIndex         BYTE 16 dup  (?)
    Lengths             BYTE 16 dup  (?)
    pStringArray        PSTRING_ARRAY ?
    Flags               ULONG         ?
    OccupiedBitmap      USHORT        ?
    ContinuationBitmap  USHORT        ?
    Slots               STRING_SLOT 16 dup ({ })
    union
        StringArray STRING_ARRAY {?}
        Padding CHAR 192 dup (?)
    ends
STRING_TABLE ends

;
; Assert our critical field offsets and structure size as per the same approach
; taken in StringTable.h.
;

.erre (STRING_TABLE.UniqueIndex  eq  16), @CatStr(<UnexpectedOffset STRING_TABLE.UniqueIndex: >, %(STRING_TABLE.UniqueIndex))
.erre (STRING_TABLE.Lengths      eq  32), @CatStr(<UnexpectedOffset STRING_TABLE.Lengths: >, %(STRING_TABLE.Lengths))
.erre (STRING_TABLE.pStringArray eq  48), @CatStr(<UnexpectedOffset STRING_TABLE.pStringArray: >, %(STRING_TABLE.pStringArray))
.erre (STRING_TABLE.Slots        eq  64), @CatStr(<UnexpectedOffset STRING_TABLE.Slots: >, %(STRING_TABLE.Slots))
.erre (STRING_TABLE.Padding      eq 320), @CatStr(<UnexpectedOffset STRING_TABLE.Padding: >, %(STRING_TABLE.Padding))
.erre (size STRING_TABLE eq 512), @CatStr(<IncorrectStructSize: STRING_TABLE: >, %(size STRING_TABLE))

PSTRING_TABLE typedef ptr STRING_TABLE

;
; CamelCase typedefs that are nicer to work with in assembly
; than their uppercase counterparts.
;

StringTable typedef STRING_TABLE

The following diagram depicts an in-memory representation of the STRING_TABLE structure using our NTFS reserved prefix names. It is created via the CreateStringTable routine, which we feature in the appendix of this article.

STRING_TABLE Diagram

STRING_TABLE Diagram

In order to improve the uniqueness of the unique characters selected from each string, the strings are sorted by length during string table creation and enumerated in this order while identifying unique characters. The rationale behind this is that shorter strings simply have fewer characters to choose from, while longer strings have more to choose from. If we identified unique characters in the order they appear in the string table, we may have longer strings preceding shorter ones, such that toward the end of the table, nothing unique can be extracted from the short ones.

The utility of the string table is maximized by ensuring a unique character is selected from every string; thus, we sort by length first. Note that the uniqueness is actually determined by offset:character pairs, with the offsets becoming the indices stored in the UniqueIndex slot. If you trace through the diagram above, you’ll see that the unique character in each slot matches the character in the corresponding string slot, indicated by the underlying index.

STRING_ARRAY

The string array captures a raw array representation of the underlying strings making up the string table. It is either embedded within the padding area at the end of the string table, or a separate allocation is made during string table creation. The main interface to creating a string table is via a STRING_ARRAY structure. The helper functions, CreateStringTableFromDelimitedString and CreateStringTableFromDelimitedEnvironmentVariable, simply break down their input into a STRING_ARRAY representation first before calling CreateStringTable.

typedef struct _Struct_size_bytes_(SizeInQuadwords>>3) _STRING_ARRAY {

    //
    // Size of the structure, in quadwords.  Why quadwords?  It allows us to
    // keep this size field to a USHORT, which helps with the rest of the
    // alignment in this struct (we want the STRING Strings[] array to start
    // on an 8-byte boundary).
    //
    // N.B.  We can't express the exact field size in the SAL annotation
    //       below, because the array of buffer sizes are inexpressible;
    //       however, we know the maximum length, so we can use the implicit
    //       invariant that the total buffer size can't exceed whatever num
    //       elements * max size is.
    //

    _Field_range_(<=, (
        sizeof(struct _STRING_ARRAY) +
        ((NumberOfElements - 1) * sizeof(STRING)) +
        (MaximumLength * NumberOfElements)
    ) >> 3)
    USHORT SizeInQuadwords;

    //
    // Number of elements in the array.
    //

    USHORT NumberOfElements;

    //
    // Minimum and maximum lengths for the String->Length fields.  Optional.
    //

    USHORT MinimumLength;
    USHORT MaximumLength;

    //
    // A pointer to the STRING_TABLE structure that "owns" us.
    //

    struct _STRING_TABLE *StringTable;

    //
    // The string array.  Number of elements in the array is governed by the
    // NumberOfElements field above.
    //

    STRING Strings[ANYSIZE_ARRAY];

} STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;
Note

The odd-looking macros _Struct_size_bytes_ and _Field_range_ are SAL Annotations. There’s a neat deck called Engineering Better Software at Microsoft which captures some interesting details about SAL, for those wanting to read more. The Code Analysis engine that uses the annotations is built upon the Z3 Theorem Prover, which is a fascinating little project in its own right.

And finally, we’re left with the smaller helper structs that we use to encapsulate the various innards of the string table. (I use unions that feature XMMWORD representations (which is a typedef of __m128i, representing an XMM register) as well as underlying byte/character representations, as I personally find it makes the resulting C code a bit nicer.)

STRING_SLOT

//
// String tables are composed of a 16 element array of 16 byte string "slots",
// which represent a unique character (with respect to other strings in the
// table) for a string in a given slot index. The STRING_SLOT structure
// provides a convenient wrapper around this construct.
//

typedef union DECLSPEC_ALIGN(16) _STRING_SLOT {
    XMMWORD CharsXmm;
    CHAR Char[16];
} STRING_SLOT, *PSTRING_SLOT, **PPSTRING_SLOT;
C_ASSERT(sizeof(STRING_SLOT) == 16);

SLOT_INDEX

//
// A 16 element array of 1 byte unsigned integers, used to capture the length
// of each string slot in a single XMM 128-bit register.
//

typedef union DECLSPEC_ALIGN(16) _SLOT_LENGTHS {
    XMMWORD SlotsXmm;
    BYTE Slots[16];
} SLOT_LENGTHS, *PSLOT_LENGTHS, **PPSLOT_LENGTHS;
C_ASSERT(sizeof(SLOT_LENGTHS) == 16);

String Table Construction

The CreateSingleStringTable routine is responsible for the construction of a new STRING_TABLE. It is here that we identify the unique set of characters (and their indices) to store in the first two fields of the string table.

//
// Define private types used by this module.
//

typedef struct _LENGTH_INDEX_ENTRY {
    BYTE Length;
    BYTE Index;
} LENGTH_INDEX_ENTRY;
typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY;

typedef struct _LENGTH_INDEX_TABLE {
    LENGTH_INDEX_ENTRY Entry[16];
} LENGTH_INDEX_TABLE;
typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE;

typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP {
    YMMWORD Ymm;
    XMMWORD Xmm[2];
    LONG Bits[(256 / (4 << 3))];  // 8
} CHARACTER_BITMAP;
C_ASSERT(sizeof(CHARACTER_BITMAP) == 32);
typedef CHARACTER_BITMAP *PCHARACTER_BITMAP;

typedef struct _SLOT_BITMAPS {
    CHARACTER_BITMAP Bitmap[16];
} SLOT_BITMAPS;
typedef SLOT_BITMAPS *PSLOT_BITMAPS;

//
// Function implementation.
//

_Use_decl_annotations_
PSTRING_TABLE
CreateSingleStringTable(
    PRTL Rtl,
    PALLOCATOR StringTableAllocator,
    PALLOCATOR StringArrayAllocator,
    PSTRING_ARRAY StringArray,
    BOOL CopyArray
    )
/*++

Routine Description:

    Allocates space for a STRING_TABLE structure using the provided allocators,
    then initializes it using the provided STRING_ARRAY.  If CopyArray is set
    to TRUE, the routine will copy the string array such that the caller is
    free to destroy it after the table has been successfully created.  If it
    is set to FALSE and StringArray->StringTable has a non-NULL value, it is
    assumed that sufficient space has already been allocated for the string
    table and this pointer will be used to initialize the rest of the structure.

    DestroyStringTable() must be called against the returned PSTRING_TABLE when
    the structure is no longer needed in order to ensure resources are released.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which
        will be used for creating the STRING_TABLE.

    StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which
        may be used to create the STRING_ARRAY if it cannot fit within the
        padding of the STRING_TABLE structure.  This is kept separate from the
        StringTableAllocator due to the stringent alignment requirements of the
        string table.

    StringArray - Supplies a pointer to an initialized STRING_ARRAY structure
        that contains the STRING structures that are to be added to the table.

    CopyArray - Supplies a boolean value indicating whether or not the
        StringArray structure should be deep-copied during creation.  This is
        typically set when the caller wants to be able to free the structure
        as soon as this call returns (or can't guarantee it will persist past
        this function's invocation, i.e. if it was stack allocated).

Return Value:

    A pointer to a valid PSTRING_TABLE structure on success, NULL on failure.
    Call DestroyStringTable() on the returned structure when it is no longer
    needed in order to ensure resources are cleaned up appropriately.

--*/
{
    BYTE Byte;
    BYTE Count;
    BYTE Index;
    BYTE Length;
    BYTE NumberOfElements;
    ULONG HighestBit;
    ULONG OccupiedMask;
    PULONG Bits;
    USHORT OccupiedBitmap;
    USHORT ContinuationBitmap;
    PSTRING_TABLE StringTable;
    PSTRING_ARRAY StringArray;
    PSTRING String;
    PSTRING_SLOT Slot;
    STRING_SLOT UniqueChars;
    SLOT_INDEX UniqueIndex;
    SLOT_INDEX LengthIndex;
    SLOT_LENGTHS Lengths;
    LENGTH_INDEX_TABLE LengthIndexTable;
    PCHARACTER_BITMAP Bitmap;
    SLOT_BITMAPS SlotBitmaps;
    PLENGTH_INDEX_ENTRY Entry;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(StringTableAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(StringArrayAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(SourceStringArray)) {
        return NULL;
    }

    if (SourceStringArray->NumberOfElements == 0) {
        return NULL;
    }

    //
    // Copy the incoming string array if applicable.
    //

    if (CopyArray) {

        StringArray = CopyStringArray(
            StringTableAllocator,
            StringArrayAllocator,
            SourceStringArray,
            FIELD_OFFSET(STRING_TABLE, StringArray),
            sizeof(STRING_TABLE),
            &StringTable
        );

        if (!StringArray) {
            return NULL;
        }

    } else {

        //
        // We're not copying the array, so initialize StringArray to point at
        // the caller's SourceStringArray, and StringTable to point at the
        // array's StringTable field (which will be non-NULL if sufficient
        // space has been allocated).
        //

        StringArray = SourceStringArray;
        StringTable = StringArray->StringTable;

    }

    //
    // If StringTable has no value, we've either been called with CopyArray set
    // to FALSE, or CopyStringArray() wasn't able to allocate sufficient space
    // for both the table and itself.  Either way, we need to allocate space for
    // the table.
    //

    if (!StringTable) {

        StringTable = (PSTRING_TABLE)(
            StringTableAllocator->AlignedCalloc(
                StringTableAllocator->Context,
                1,
                sizeof(STRING_TABLE),
                STRING_TABLE_ALIGNMENT
            )
        );

        if (!StringTable) {
            return NULL;
        }
    }

    //
    // Make sure the fields that are sensitive to alignment are, in fact,
    // aligned correctly.
    //

    if (!AssertStringTableFieldAlignment(StringTable)) {
        DestroyStringTable(StringTableAllocator,
                           StringArrayAllocator,
                           StringTable);
        return NULL;
    }

    //
    // At this point, we have copied the incoming StringArray if necessary,
    // and we've allocated sufficient space for the StringTable structure.
    // Enumerate over all of the strings, set the continuation bit if the
    // length > 16, set the relevant slot length, set the relevant unique
    // character entry, then move the first 16-bytes of the string into the
    // relevant slot via an aligned SSE mov.
    //

    //
    // Initialize pointers and counters, clear stack-based structures.
    //

    Slot = StringTable->Slots;
    String = StringArray->Strings;

    OccupiedBitmap = 0;
    ContinuationBitmap = 0;
    NumberOfElements = (BYTE)StringArray->NumberOfElements;
    UniqueChars.CharsXmm = _mm_setzero_si128();
    UniqueIndex.IndexXmm = _mm_setzero_si128();
    LengthIndex.IndexXmm = _mm_setzero_si128();

    //
    // Set all the slot lengths to 0x7f up front instead of defaulting
    // to zero.  This allows for simpler logic when searching for a prefix
    // string, which involves broadcasting a search string's length to an XMM
    // register, then doing _mm_cmpgt_epi8() against the lengths array and
    // the string length.  If we left the lengths as 0 for unused slots, they
    // would get included in the resulting comparison register (i.e. the high
    // bits would be set to 1), so we'd have to do a subsequent masking of
    // the result at some point using the OccupiedBitmap.  By defaulting the
    // lengths to 0x7f, we ensure they'll never get included in any cmpgt-type
    // SIMD matches.  (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8()
    // intrinsic assumes packed signed integers.)
    //

    Lengths.SlotsXmm = _mm_set1_epi8(0x7f);

    ZeroStruct(LengthIndexTable);
    ZeroStruct(SlotBitmaps);

    for (Count = 0; Count < NumberOfElements; Count++) {

        XMMWORD CharsXmm;

        //
        // Set the string length for the slot.
        //

        Length = Lengths.Slots[Count] = (BYTE)String->Length;

        //
        // Set the appropriate bit in the continuation bitmap if the string is
        // longer than 16 bytes.
        //

        if (Length > 16) {
            ContinuationBitmap |= (Count == 0 ? 1 : 1 << (Count + 1));
        }

        if (Count == 0) {

            Entry = &LengthIndexTable.Entry[0];
            Entry->Index = 0;
            Entry->Length = Length;

        } else {

            //
            // Perform a linear scan of the length-index table in order to
            // identify an appropriate insertion point.
            //

            for (Index = 0; Index < Count; Index++) {
                if (Length < LengthIndexTable.Entry[Index].Length) {
                    break;
                }
            }

            if (Index != Count) {

                //
                // New entry doesn't go at the end of the table, so shuffle
                // everything else down.
                //

                Rtl->RtlMoveMemory(&LengthIndexTable.Entry[Index + 1],
                                   &LengthIndexTable.Entry[Index],
                                   (Count - Index) * sizeof(*Entry));
            }

            Entry = &LengthIndexTable.Entry[Index];
            Entry->Index = Count;
            Entry->Length = Length;
        }

        //
        // Copy the first 16-bytes of the string into the relevant slot.  We
        // have taken care to ensure everything is 16-byte aligned by this
        // stage, so we can use SSE intrinsics here.
        //

        CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);
        _mm_store_si128(&(*Slot).CharsXmm, CharsXmm);

        //
        // Advance our pointers.
        //

        ++Slot;
        ++String;

    }

    //
    // Store the slot lengths.
    //

    _mm_store_si128(&(StringTable->Lengths.SlotsXmm), Lengths.SlotsXmm);

    //
    // Loop through the strings in order of shortest to longest and construct
    // the uniquely-identifying character table with corresponding index.
    //


    for (Count = 0; Count < NumberOfElements; Count++) {
        Entry = &LengthIndexTable.Entry[Count];
        Length = Entry->Length;
        Slot = &StringTable->Slots[Entry->Index];

        //
        // Iterate over each character in the slot and find the first one
        // without a corresponding bit set.
        //

        for (Index = 0; Index < Length; Index++) {
            Bitmap = &SlotBitmaps.Bitmap[Index];
            Bits = (PULONG)&Bitmap->Bits[0];
            Byte = Slot->Char[Index];
            if (!BitTestAndSet(Bits, Byte)) {
                break;
            }
        }

        UniqueChars.Char[Count] = Byte;
        UniqueIndex.Index[Count] = Index;
        LengthIndex.Index[Count] = Entry->Index;
    }

    //
    // Loop through the elements again such that the unique chars are stored
    // in the order they appear in the table.
    //

    for (Count = 0; Count < NumberOfElements; Count++) {
        for (Index = 0; Index < NumberOfElements; Index++) {
            if (LengthIndex.Index[Index] == Count) {
                StringTable->UniqueChars.Char[Count] = UniqueChars.Char[Index];
                StringTable->UniqueIndex.Index[Count] = UniqueIndex.Index[Index];
                break;
            }
        }
    }

    //
    // Generate and store the occupied bitmap.  Each bit, from low to high,
    // corresponds to the index of a slot.  When set, the slot is occupied.
    // When clear, it is not.  So, fill bits from the highest bit set down.
    //

    HighestBit = (1 << (StringArray->NumberOfElements-1));
    OccupiedMask = _blsmsk_u32(HighestBit);
    StringTable->OccupiedBitmap = (USHORT)OccupiedMask;

    //
    // Store the continuation bitmap.
    //

    StringTable->ContinuationBitmap = (USHORT)(ContinuationBitmap);

    //
    // Wire up the string array to the table.
    //

    StringTable->pStringArray = StringArray;

    //
    // And we're done, return the table.
    //

    return StringTable;
}

The Benchmark

The performance comparison graphs in the subsequent sections were generated in Excel, using CSV data output by the creatively-named program StringTable2BenchmarkExe.

Modern CPUs are fast, and timing is challenging, especially when you’re dealing with CPU cycle comparisons. No approach is perfect. Here’s what I settled on:

  1. The benchmark utility has #pragma optimize("", off) at the start of the file, which disables global optimizations, even in release (optimized) builds. This prevents the compiler from doing clever things regarding the scheduling of the timestamping logic, which affects reported times.

  2. The benchmark utility pins itself to a single core and sets its thread priority to the highest permissible value at startup. (Turbo is disabled on the computer, so the frequency is pinned to 3.68GHz.)

  3. The benchmark utility is fed an array of function pointers and test inputs. It iterates over each test input, then iterates over each function, calling it with the test input and potentially verifying the result (some functions are included for comparison but don’t produce correct results, so they don’t have their results verified).

  4. The test input string is copied into a local buffer aligned on a 32-byte boundary. This ensures that all test inputs are being compared fairly. (The natural alignment of the buffers varies anywhere from 2 to 512 bytes, and unaligned buffers have a significant impact on the timings.)

  5. The function is run once, with the result captured. If verification has been requested, the result is verified. We __debugbreak() immediately if there’s a mismatch, which is handy during development.

  6. NtDelayExecution(TRUE, 1) is called, which results in a sleep of approximately 100 nanoseconds. This forces a context switch, giving the thread a new scheduling quantum before each function is run.

  7. The function is executed 100 times for warmup.

  8. Timings are taken for 1000 iterations of the function using the given test input. The __rdtscp() intrinsic is used (which forces some serialization) to capture the timestamp counter before and after the iterations.

  9. This process is repeated 100 times. The minimum time observed to perform 1000 iterations (out of 100 attempts) is captured as the function’s best time.

Release vs PGO Oddities

All of the times in the graphs come from the profile-guided optimization (PGO) build of the StringTable component. The PGO build is faster than the normal release build in every case except one, where it is notably slower.

It’s… odd. I haven’t investigated it. The following graph depicts the affected function, IsPrefixOfStringInTable_1, and a few other versions for reference, showing the performance of the PGO build compared to the release build on the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".

Benchmark Release vs PGO

Benchmark Release vs PGO

Only that function is affected, and the problem mainly manifests with the two example test strings shown. As this routine essentially serves as one of the initial baseline implementations, it would be misleading to compare all optimized PGO versions to the abnormally slow baseline implementation. Therefore, the release and PGO timings were blended into a single CSV, and the Excel PivotTables select the minimum time for a given function and test input.

Thus, you’re always looking at the PGO timings, except for this outlier case where the release versions are faster.

The Implementations

Round 1

In this section, we’ll take a look at the various implementations I experimented with on the first pass, prior to soliciting any feedback. I figured there were a couple of ways I could present this information. First, I could hand-pick what I choose to show and hide, creating a rosy picture that makes it seem like I effortlessly arrived at the fastest implementation without much actual effort whatsoever.

Or I could show the gritty reality of how everything actually went down in chronological fashion, errors and all. And there were definitely some errors! For better or worse, I’ve chosen this route, so you’ll get to see some pretty tedious tweaks (changing a single line, for example) before the juicy stuff really kicks in.

Additionally, with the benefit of writing this section introduction retroactively, iterations 4 and 5 aren’t testing what I initially thought they were testing. I’ve left them in as is; if anything, it demonstrates the importance of only changing one thing at a time and making sure you’re testing what you think you’re testing. I’ll discuss the errors with those iterations later in the article.

C Implementations

IsPrefixOfCStrInArray

IsPrefixOfStringInTable_1 →

Let’s review the baseline implementation again, as that’s what we’re ultimately comparing ourselves against. This version enumerates the string array (and thus has a slightly different function signature than the STRING_TABLE-based functions) looking for prefix matches. No SIMD instructions are used. The timings captured should be proportional to the location of the test input string in the array. That is, it should take less time to prefix match strings that occur earlier in the array versus those that appear later.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

        while (*Left && *Right && *Left++ == *Right++) {
            Count++;
        }

        if (Count > 0 && !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)Count;
                Match->String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}

IsPrefixOfStringInTable_1

← IsPrefixOfCStrInArray | IsPrefixOfStringInTable_2 →

This version is similar to the IsPrefixOfCStrInArray implementation, except it utilizes the slot length information provided by the STRING_ARRAY structure and conforms to our standard IsPrefixOfStringInTable function signature. It uses no SIMD instructions.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_1(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine performs a simple linear scan of the string table looking for
    a prefix match against each slot.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    BYTE Left;
    BYTE Right;
    ULONG Index;
    ULONG Count;
    PSTRING_ARRAY StringArray;
    PSTRING TargetString;

    //IACA_VC_START();

    StringArray = StringTable->pStringArray;

    if (StringArray->MinimumLength > String->Length) {
        return NO_MATCH_FOUND;
    }

    for (Count = 0; Count < StringArray->NumberOfElements; Count++) {

        TargetString = &StringArray->Strings[Count];

        if (String->Length < TargetString->Length) {
            continue;
        }

        for (Index = 0; Index < TargetString->Length; Index++) {
            Left = String->Buffer[Index];
            Right = TargetString->Buffer[Index];
            if (Left != Right) {
                break;
            }
        }

        if (Index == TargetString->Length) {

            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Count;
                Match->NumberOfMatchedCharacters = (BYTE)Index;
                Match->String = TargetString;

            }

            return (STRING_TABLE_INDEX)Count;
        }

    }

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 1

Here’s the performance of these two baseline routines:

Benchmark 1

Benchmark 1

That’s an interesting result! Even without using any SIMD instructions, version 1, the IsPrefixOfStringInTable_1 routine, is faster (in all but one case) than the baseline IsPrefixOfCStrInArray routine, thanks to a more sophisticated data structure.

(And really, it’s not even using the sophisticated parts of the STRING_TABLE; it’s just leveraging the fact that we’ve captured the lengths of each string in the backing STRING_ARRAY structure by virtue of using the STRING structure to wrap our strings (instead of relying on the standard NULL-terminated C string approach).)


Before we look at IsPrefixOfStringInTable_2, which is the first of the routines to use SIMD instructions, it’s helpfu to know some backstory. The _2 version is based on the prefix matching routine I wrote for the first version of the StringTable component back in 2016. The layout of the STRING_TABLE struct differed in the first version; only the first character of each slot was used to do the initial exclusion (as opposed to the unique character), and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits), so the match bitmap had to be constructed slightly differently.

None of those details really apply to our second attempt at the StringTable component, detailed in this article. Our lengths are 8 bits, and we use unique characters in the initial negative match fast-path. However, the first version used an elaborate AVX2 prefix match routine geared toward matching long strings, attempting to use non-temporal streaming load instructions where possible (which would only make sense for a large number of long strings in specific cache-thrashing scenarios).

Compare our simpler implementation, IsPrefixMatch, used from version 3 onward, to the far more elaborate (and unnecessary) IsPrefixMatchAvx2:

IsPrefixMatch

FORCEINLINE
BYTE
IsPrefixMatch(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ BYTE Offset
    )
{
    PBYTE Left;
    PBYTE Right;
    BYTE Matched = 0;
    BYTE Remaining = (SearchString->Length - Offset) + 1;

    Left = (PBYTE)RtlOffsetToPointer(SearchString->Buffer, Offset);
    Right = (PBYTE)RtlOffsetToPointer(TargetString->Buffer, Offset);

    while (--Remaining && *Left++ == *Right++) {
        Matched++;
    }

    Matched += Offset;
    if (Matched != TargetString->Length) {
        return NO_MATCH_FOUND;
    }

    return Matched;
}

IsPrefixMatchAvx2

The AVX2 routine is overkill, especially considering the emphasis we put on favoring short strings over longer ones in the requirements section. However, we want to put broad statements like that to the test, so let’s include it as our first SIMD implementation to see how it stacks up against the simpler versions.

FORCEINLINE
USHORT
IsPrefixMatchAvx2(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ USHORT Offset
    )
{
    USHORT SearchStringRemaining;
    USHORT TargetStringRemaining;
    ULONGLONG SearchStringAlignment;
    ULONGLONG TargetStringAlignment;
    USHORT CharactersMatched = Offset;

    LONG Count;
    LONG Mask;

    PCHAR SearchBuffer;
    PCHAR TargetBuffer;

    STRING_SLOT SearchSlot;

    XMMWORD SearchXmm;
    XMMWORD TargetXmm;
    XMMWORD ResultXmm;

    YMMWORD SearchYmm;
    YMMWORD TargetYmm;
    YMMWORD ResultYmm;

    SearchStringRemaining = SearchString->Length - Offset;
    TargetStringRemaining = TargetString->Length - Offset;

    SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString->Buffer, Offset);
    TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString->Buffer, Offset);

    //
    // This routine is only called in the final stage of a prefix match when
    // we've already verified the slot's corresponding original string length
    // (referred in this routine as the target string) is less than or equal
    // to the length of the search string.
    //
    // We attempt as many 32-byte comparisons as we can, then as many 16-byte
    // comparisons as we can, then a final < 16-byte comparison if necessary.
    //
    // We use aligned loads if possible, falling back to unaligned if not.
    //

StartYmm:

    if (SearchStringRemaining >= 32 && TargetStringRemaining >= 32) {

        //
        // We have at least 32 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 32-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);
        TargetStringAlignment = GetAddressAlignment(TargetBuffer);

        if (SearchStringAlignment < 32) {
            SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer);
        } else {
            SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer);
        }

        if (TargetStringAlignment < 32) {
            TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer);
        } else {
            TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer);
        }

        //
        // Compare the two vectors.
        //

        ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm256_movemask_epi8(ResultYmm);

        //
        // There were at least 32 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 32 characters, we can terminate
        // this prefix search here.  (-1 == 0xffffffff == all bits set == all
        // characters matched.)
        //

        if (Mask != -1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 32 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 32-byte processing.
        //

        SearchStringRemaining -= 32;
        TargetStringRemaining -= 32;

        CharactersMatched += 32;

        SearchBuffer += 32;
        TargetBuffer += 32;

        goto StartYmm;
    }

    //
    // Intentional follow-on to StartXmm.
    //

StartXmm:

    //
    // Update the search string's alignment.
    //

    if (SearchStringRemaining >= 16 && TargetStringRemaining >= 16) {

        //
        // We have at least 16 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 16-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);

        if (SearchStringAlignment < 16) {
            SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
        } else {
            SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer);
        }

        TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer);

        //
        // Compare the two vectors.
        //

        ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm_movemask_epi8(ResultXmm);

        //
        // There were at least 16 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 16 characters, we can terminate
        // this prefix search here.  (-1 == 0xffff -> all bits set -> all chars
        // matched.)
        //

        if ((SHORT)Mask != (SHORT)-1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 16 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 16-byte processing.
        //

        SearchStringRemaining -= 16;
        TargetStringRemaining -= 16;

        CharactersMatched += 16;

        SearchBuffer += 16;
        TargetBuffer += 16;

        goto StartXmm;
    }

    if (TargetStringRemaining == 0) {

        //
        // We'll get here if we successfully prefix matched the search string
        // and all our buffers were aligned (i.e. we don't have a trailing
        // < 16 bytes comparison to perform).
        //

        return CharactersMatched;
    }

    //
    // If we get here, we have less than 16 bytes to compare.  Our target
    // strings are guaranteed to be 16-byte aligned, so we can load them
    // using an aligned stream load as in the previous cases.
    //

    TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer);

    //
    // Loading the remainder of our search string's buffer is a little more
    // complicated.  It could reside within 15 bytes of the end of the page
    // boundary, which would mean that a 128-bit load would cross a page
    // boundary.
    //
    // At best, the page will belong to our process and we'll take a performance
    // hit.  At worst, we won't own the page, and we'll end up triggering a hard
    // page fault.
    //
    // So, see if the current search buffer address plus 16 bytes crosses a page
    // boundary.  If it does, take the safe but slower approach of a ranged
    // memcpy (movsb) into a local stack-allocated STRING_SLOT structure.
    //

    if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) {

        //
        // No page boundary is crossed, so just do an unaligned 128-bit move
        // into our Xmm register.  (We could do the aligned/unaligned dance
        // here, but it's the last load we'll be doing (i.e. it's not
        // potentially on a loop path), so I don't think it's worth the extra
        // branch cost, although I haven't measured this empirically.)
        //

        SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);

    } else {

        //
        // We cross a page boundary, so only copy the the bytes we need via
        // __movsb(), then do an aligned stream load into the Xmm register
        // we'll use in the comparison.
        //

        __movsb((PBYTE)&SearchSlot.Char,
                (PBYTE)SearchBuffer,
                SearchStringRemaining);

        SearchXmm = _mm_stream_load_si128(&SearchSlot.CharsXmm);
    }

    //
    // Compare the final vectors.
    //

    ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

    //
    // Generate a mask from the result of the comparison, but mask off (zero
    // out) high bits from the target string's remaining length.
    //

    Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining);

    //
    // Count how many characters were matched and determine if we were a
    // successful prefix match or not.
    //

    Count = __popcnt(Mask);

    if ((USHORT)Count == TargetStringRemaining) {

        //
        // If we matched the same amount of characters as remaining in the
        // target string, we've successfully prefix matched the search string.
        // Return the total number of characters we matched.
        //

        CharactersMatched += (USHORT)Count;
        return CharactersMatched;
    }

    //
    // After all that work, our string match failed at the final stage!  Return
    // to the caller indicating we were unable to make a prefix match.
    //

    return NO_MATCH_FOUND;
}

IsPrefixOfStringInTable_2

← IsPrefixOfStringInTable_1 | IsPrefixOfStringInTable_3 →

Note

This is is the first time we’re seeing the full body of the SIMD-style IsPrefixOfStringInTable implementation. It’s heavily commented, and generally, the core algorithm doesn’t fundamentally change across iterations (just slight tweaks). I’d recommend reading through it thoroughly to build a mental model of how the matching algorithm works. It’s straightforward, and the subsequent iterations will make much more sense, as they’re typically presented as diffs against the previous version.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_2(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 2

Let’s see how version 2, our first SIMD attempt, performs in comparison to the two baselines.

Benchmark 2

Benchmark 2

Eek! Our first SIMD attempt actually has worse prefix matching performance in most cases! The only area where it shows a performance improvement is in negative matching.

IsPrefixOfStringInTable_3

← IsPrefixOfStringInTable_2 | IsPrefixOfStringInTable_4 →

For version 3, let’s replace the call to IsPrefixMatchAvx2 with our simpler version, IsPrefixMatch:

% diff -u IsPrefixOfStringInTable_2.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_2.c 2018-04-15 22:35:55.458773500 -0400
+++ IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_2(
+IsPrefixOfStringInTable_3(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -278,7 +278,7 @@

             TargetString = &StringTable->pStringArray->Strings[Index];

-            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
+            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

             if (CharactersMatched == NO_MATCH_FOUND) {
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 3

Benchmark 3

Benchmark 3

Phew! We finally see superior performance across the board. This ends the short-lived tenure of version 2, which is demonstrably worse in every case.

We’ll also omit the IsPrefixOfCStrInArray routine from the graphs for now (for the most part), as it has served its initial baseline purpose.

IsPrefixOfStringInTable_4

← IsPrefixOfStringInTable_3 | IsPrefixOfStringInTable_5 →

When I first wrote the initial string table code, I was experimenting with different strategies for loading the initial search string buffer. That resulted in the file StringLoadStoreOperations.h, which defined a bunch of helper macros. I’ve included them below, but don’t spend too much time absorbing them—they’re not good practice, and they all become irrelevant as soon as we switch to _mm_loadu_si128() in a few versions. I’m including them because they set the scene for versions 4, 5, and 6.

/*++

    VOID
    LoadSearchStringIntoXmmRegister_SEH(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    Attempts an aligned 128-bit load of String->Buffer into Slot.CharXmm via
    the _mm_load_si128() intrinsic.  The intrinsic is surrounded in a __try/
    __except block that catches EXCEPTION_ACCESS_VIOLATION exceptions.

    If such an exception is caught, the routine will check to see if the string
    buffer's address will cross a page boundary if 16-bytes are loaded.  If a
    page boundary would be crossed, a __movsb() intrinsic is used to copy only
    the bytes specified by String->Length, otherwise, an unaligned 128-bit load
    is attemped via the _mm_loadu_si128() intrinsic.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar)   \
    LengthVar = min(String->Length, 16);                               \
    TRY_SSE42_ALIGNED {                                                \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);      \
    } CATCH_EXCEPTION_ACCESS_VIOLATION {                               \
        if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {  \
            __movsb(Slot.Char, String->Buffer, LengthVar);             \
        } else {                                                       \
            Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \
        }                                                              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlignmentCheck(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine checks to see if a page boundary will be crossed if 16-bytes
    are loaded from the address supplied by String->Buffer.  If a page boundary
    will be crossed, a __movsb() intrinsic is used to only copy String->Length
    bytes into the given Slot.

    If no page boundary will be crossed by a 128-bit load, the alignment of
    the address supplied by String->Buffer is checked.  If the alignment isn't
    at least on a 16-byte boundary, an unaligned load will be issued via the
    _mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String,LengthVar) \
    LengthVar = min(String->Length, 16);                                       \
    if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {              \
        __movsb(Slot.Char, String->Buffer, LengthVar);                         \
    } else if (GetAddressAlignment(String->Buffer) < 16) {                     \
        Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);             \
    } else {                                                                   \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysUnaligned(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine performs an unaligned 128-bit load of the address supplied by
    String->Buffer into the given Slot via the _mm_loadu_si128() intrinsic.
    No checks are done regarding whether or not a page boundary will be crossed.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \
    LengthVar = min(String->Length, 16);                                   \
    if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {          \
        __movsb(Slot.Char, String->Buffer, LengthVar);                     \
    } else if (GetAddressAlignment(String->Buffer) < 16) {                 \
        Slot.CharsXmm = _mm_loadu_si128(String->Buffer);                   \
    } else {                                                               \
        Slot.CharsXmm = _mm_load_si128(String->Buffer);                    \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysMovsb(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine copies min(String->Length, 16) bytes from String->Buffer
    into the given Slot via the __movsb() intrinsic.  The memory referenced by
    the Slot is not cleared first via SecureZeroMemory().

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \
    LengthVar = min(String->Length, 16);                                     \
    __movsb(Slot.Char, String->Buffer, LengthVar);

In our StringTable2.vcxproj file, we have the following:


  <PropertyGroup Label="Globals">
    ...
    <LoadSearchStringStrategy>AlwaysMovsb</LoadSearchStringStrategy>
    <!--
    <LoadSearchStringStrategy>SEH</LoadSearchStringStrategy>
    <LoadSearchStringStrategy>AlignmentCheck</LoadSearchStringStrategy>
    <LoadSearchStringStrategy>AlwaysUnaligned</LoadSearchStringStrategy>
    -->

This setup allowed me to toggle which strategy I wanted to use for loading the search string into an XMM register. As shown above, the default is to use the AlwaysMovsb approach*; so, for version 4, let’s swap that out for the SEH approach, which wraps the aligned load in a structured exception handler that falls back to __movsb() if the aligned load fails and the pointer plus 16 bytes crosses a page boundary.

[*]: Or was it?

Narrator: it wasn’t.

% diff -u IsPrefixOfStringInTable_4.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_4.c 2018-04-15 22:35:55.453274200 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_4(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses a structured exception
+    handler for loading the initial search string.

 Arguments:

@@ -123,7 +124,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_4(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses a structured exception
    handler for loading the initial search string.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 4

The performance of version 4 was slightly worse than 3 in every case:

Benchmark 4

Benchmark 4

Version 3 is still in the lead with the AlwaysMovsb-based search string loading approach.

Narrator: except the AlignmentCheck macro was actually active, not the AlwaysMovsb one.

IsPrefixOfStringInTable_5

← IsPrefixOfStringInTable_4 | IsPrefixOfStringInTable_6 →

Version 5 is an interesting one. It’s the first time we attempt to validate our claim that it’s more efficient to give the CPU a bunch of independent things to do up front, rather than adding more branches and attempting to terminate as early as possible.

Note: we’ll also explicitly use the LoadSearchStringIntoXmmRegister_AlwaysMovsb macro here, instead of LoadSearchStringIntoXmmRegister, to make it clear that we’re actually relying on the __movsb()-based string loading routine.

Narrator: can anyone spot the mistake with this logic?

% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_5.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_5.c 2018-04-15 13:24:52.480972900 -0400
@@ -16,9 +16,13 @@

 #include "stdafx.h"

+//
+// Variant of v3 with early-exits.
+//
+
 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_5(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +35,11 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses early exits (i.e.
+    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
+    the assertion we've been making that it's more optimal to give the CPU
+    to do a bunch of things up front versus doing something, then potentially
+    branching, doing the next thing, potentially branching, etc.

 Arguments:

@@ -51,6 +59,8 @@
 --*/
 {
     ULONG Bitmap;
+    ULONG CharBitmap;
+    ULONG LengthBitmap;
     ULONG Mask;
     ULONG Count;
     ULONG Length;
@@ -71,7 +81,6 @@
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
     XMMWORD IncludeSlotsByLength;
-    XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     StringArray = StringTable->pStringArray;
@@ -123,7 +132,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string
@@ -133,11 +142,6 @@
     UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                   StringTable->UniqueIndex.IndexXmm);

-    //
-    // Load the slot length array into an XMM register.
-    //
-
-    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

     //
     // Load the string table's unique character array into an XMM register.
@@ -146,13 +150,6 @@
     TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

     //
-    // Broadcast the search string's length into an XMM register.
-    //
-
-    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
-    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
-
-    //
     // Compare the search string's unique character with all of the unique
     // characters of strings in the table, saving the results into an XMM
     // register.  This comparison will indicate which slots we can ignore
@@ -162,6 +159,25 @@

     IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

+    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);
+
+    if (!CharBitmap) {
+        return NO_MATCH_FOUND;
+    }
+
+    //
+    // Load the slot length array into an XMM register.
+    //
+
+    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
+
+    //
+    // Broadcast the search string's length into an XMM register.
+    //
+
+    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
+    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
+
     //
     // Find all slots that are longer than the incoming string length, as these
     // are the ones we're going to exclude from any prefix match.
@@ -182,31 +198,16 @@

     IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

-    //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
-    //
-
-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

-    //
-    // Generate a mask.
-    //
+    if (!LengthBitmap) {
+        return NO_MATCH_FOUND;
+    }

-    Bitmap = _mm_movemask_epi8(IncludeSlots);
+    Bitmap = CharBitmap & LengthBitmap;

     if (!Bitmap) {
-
-        //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
-        //
-
-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_5(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses early exits (i.e.
    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
    the assertion we've been making that it's more optimal to give the CPU
    to do a bunch of things up front versus doing something, then potentially
    branching, doing the next thing, potentially branching, etc.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG CharBitmap;
    ULONG LengthBitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);


    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);

    if (!CharBitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

    if (!LengthBitmap) {
        return NO_MATCH_FOUND;
    }

    Bitmap = CharBitmap & LengthBitmap;

    if (!Bitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 5

If our theory is correct, the performance of this version should be worse, due to all the extra branches in the initial test. Let’s see if we’re right:

Benchmark 5

Benchmark 5

Holy smokes, version 5 is bad! It’s so bad it’s actually closest in performance to the failed version 2 with the elaborate AVX2 prefix matching routine.

Note

It was actually so close I double-checked the two routines to ensure they were correct; they were, so this is just a coincidence.

That’s good news, though, as it validates the assumption we’ve been working with since inception:

//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary.  That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa.  However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//

That’s the end of version 5’s tenure. TL;DR: fewer branches > more branches.

Narrator: more accurate TL;DR: __movsb() is slow, and always make sure you’re testing what you think you’re testing.]

IsPrefixOfStringInTable_6

← IsPrefixOfStringInTable_5 | IsPrefixOfStringInTable_7 →

Version 6 is boring. We tweak the initial loading of the search string, explicitly loading it via an unaligned load. If the underlying buffer is aligned on a 16-byte boundary, this is just as fast as an aligned load. If not, at least it doesn’t crash—it’s just slow.

Tip

If you attempt an aligned load on an address that isn’t aligned at a 16-byte boundary, the processor will generate an exception, causing your program to crash (assuming you don’t have any structured exception handlers in place to catch the error).

% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_6.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_6.c 2018-04-26 18:29:40.594556800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_6(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine differs from version 3 in that we do an unaligned load of
+    the search string buffer without any SEH wrappers or alignment checks.

 Arguments:

@@ -123,7 +124,8 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    SearchLength = min(String->Length, 16);
+    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

     //
     // Broadcast the search string's unique characters according to the string
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_6(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine differs from version 3 in that we do an unaligned load of
    the search string buffer without any SEH wrappers or alignment checks.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    SearchLength = min(String->Length, 16);
    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 6

Version 6 should be faster than version 3; we omit alignment checks, all of our input buffers are aligned at 32 bytes, and an unaligned XMM load of an aligned buffer should definitely be faster than a __movsb(). Let’s see:

Benchmark 6

Benchmark 6

We have a new winner! Version 3 had a good run, but it’s time to retire. Let’s tweak version 6 going forward.

Narrator: this is actually testing _mm_loadu_si128() against the AlignmentCheck routine, which first calls PointerToOffsetCrossesPageBoundary(), and then checks the address alignment before calling _mm_load_si128(). Since unaligned loads are just as fast as aligned loads as long as the underlying buffer is aligned, all this shows is that it’s slightly faster to skip the pointer boundary and address alignment checks, which isn’t too surprising.

IsPrefixOfStringInTable_7

← IsPrefixOfStringInTable_6 | IsPrefixOfStringInTable_8 →

Version 7 tweaks version 6 a little bit. We don’t need the search string length calculated so early in the routine. Let’s move it to later.

% diff -u IsPrefixOfStringInTable_6.c IsPrefixOfStringInTable_7.c
--- IsPrefixOfStringInTable_6.c 2018-04-15 22:35:55.450273700 -0400
+++ IsPrefixOfStringInTable_7.c 2018-04-26 10:00:53.905933700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_6(
+IsPrefixOfStringInTable_7(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,9 +31,10 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine differs from version 3 in that we do an aligned load of the
-    search string buffer without any SEH wrappers or alignment checks.  (Thus,
-    this routine will fault if the buffer is unaligned.)
+    This routine is based off version 6, but alters when we calculate the
+    "search length" for the given string, which is done via the expression
+    'min(String->Length, 16)'.  We don't need this value until later in the
+    routine, when we're ready to start comparing strings.

 Arguments:

@@ -125,7 +126,6 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    SearchLength = min(String->Length, 16);
     Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

     //
@@ -213,6 +213,13 @@
     }

     //
+    // Calculate the "search length" of the incoming string, which ensures we
+    // only compare up to the first 16 characters.
+    //
+
+    SearchLength = min(String->Length, 16);
+
+    //
     // A popcount against the mask will tell us how many slots we matched, and
     // thus, need to compare.
     //
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_7(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 6, but alters when we calculate the
    "search length" for the given string, which is done via the expression
    'min(String->Length, 16)'.  We don't need this value until later in the
    routine, when we're ready to start comparing strings.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

This is a tiny change; if it shows any performance difference, it should lean towards a positive change, although it’s possible the compiler deferred scheduling until after the initial negative match logic since the expression wasn’t used immediately. Let’s see.

Benchmark 7

Benchmark 7

Benchmark 7

Tiny change, tiny performance improvement! Looks like this saves a couple of cycles, thus ending the short-lived reign of version 6.

IsPrefixOfStringInTable_8

← IsPrefixOfStringInTable_7 | IsPrefixOfStringInTable_9 →

Version 8 is based off version 7, but omits the initial length test. Again, it’s another small change, but if version 5 was anything to go off, the less branches, the better.


% diff -u IsPrefixOfStringInTable_7.c IsPrefixOfStringInTable_8.c
--- IsPrefixOfStringInTable_7.c 2018-04-26 10:21:43.253466500 -0400
+++ IsPrefixOfStringInTable_8.c 2018-04-26 10:21:27.109761800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_7(
+IsPrefixOfStringInTable_8(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,10 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 6, but alters when we calculate the
-    "search length" for the given string, which is done via the expression
-    'min(String->Length, 16)'.  We don't need this value until later in the
-    routine, when we're ready to start comparing strings.
+    This routine is based off version 7, but omits the initial minimum
+    length test of the string array.

 Arguments:

@@ -63,7 +61,6 @@
     ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
-    PSTRING_ARRAY StringArray;
     STRING_SLOT Slot;
     STRING_SLOT Search;
     STRING_SLOT Compare;
@@ -77,17 +74,6 @@
     XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

-    StringArray = StringTable->pStringArray;
-
-    //
-    // If the minimum length of the string array is greater than the length of
-    // our search string, there can't be a prefix match.
-    //
-
-    if (StringArray->MinimumLength > String->Length) {
-        goto NoMatch;
-    }
-
     //
     // Unconditionally do the following five operations before checking any of
     // the results and determining how the search should proceed:
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_8(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 7, but omits the initial minimum
    length test of the string array.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 8

Benchmark 8

Benchmark 8

Hey, look at that, another win across the board! Omitting the length test shaves off a few more cycles for both prefix and negative matching. Version 7’s one-round reign has come to a timely end.

IsPrefixOfStringInTable_9

← IsPrefixOfStringInTable_8 | IsPrefixOfStringInTable_10 →

Version 9 tweaks version 8 by simply using return NO_MATCH_FOUND after the initial bitmap check instead of goto NoMatch. (The use of goto was a bit peculiar there anyway. We’re going to rewrite the body similarly for version 10, but let’s try to stick to making one change at a time.)

--- IsPrefixOfStringInTable_8.c 2018-04-26 10:30:52.337935400 -0400
+++ IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_8(
+IsPrefixOfStringInTable_9(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 7, but omits the initial minimum
-    length test of the string array.
+    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
+    after the initial bitmap check versus 'goto NoMatch'.

 Arguments:

@@ -195,7 +195,7 @@
         // search string.
         //

-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //
@@ -330,8 +330,6 @@
     // If we get here, we didn't find a match.
     //

-NoMatch:
-
     //IACA_VC_END();

     return NO_MATCH_FOUND;
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_9(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is a tweaked version of version 9 that does 'return NO_MATCH_FOUND'
    after the initial bitmap check versus 'goto NoMatch'.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 9

Benchmark 09

Benchmark 09

This is an interesting one. The return versus goto appears to have cost us a tiny bit with the first few test inputs—only about 0.2 more cycles, which is negligible in the grand scheme of things. (Though let’s not pull on that thread too much, or the entire premise of the article might start to unravel!)

Version 9 improves the negative match performance by a few cycles, so let’s keep it.

IsPrefixOfStringInTable_10

← IsPrefixOfStringInTable_9 | IsPrefixOfStringInTable_11 →

At this point, we’ve exhausted all the small, easy tweaks. Let’s rewrite the inner loop that performs the character comparison and see how that affects performance.

This should be an interesting one because the way it’s written now is… a bit odd. (I’ve clearly made some assumptions about optimal branch organization, to say the least.)

% diff -u IsPrefixOfStringInTable_9.c IsPrefixOfStringInTable_10.c
--- IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
+++ IsPrefixOfStringInTable_10.c        2018-04-26 10:38:09.357890400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_9(
+IsPrefixOfStringInTable_10(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
-    after the initial bitmap check versus 'goto NoMatch'.
+    This version is based off version 9, but rewrites the inner loop that
+    checks for comparisons.

 Arguments:

@@ -264,7 +264,17 @@

         CharactersMatched = __popcnt(Mask);

-        if ((USHORT)CharactersMatched == 16 && Length > 16) {
+        if ((USHORT)CharactersMatched < Length && Length <= 16) {
+
+            //
+            // The slot length is longer than the number of characters matched
+            // from the search string; this isn't a prefix match.  Continue.
+            //
+
+            continue;
+        }
+
+        if (Length > 16) {

             //
             // The first 16 characters in the string matched against this
@@ -283,46 +293,24 @@
                 //

                 continue;
-
-            } else {
-
-                //
-                // We successfully prefix matched the search string against
-                // this slot.  The code immediately following us deals with
-                // handling a successful prefix match at the initial slot
-                // level; let's avoid an unnecessary branch and just jump
-                // directly into it.
-                //
-
-                goto FoundMatch;
             }
         }

-        if ((USHORT)CharactersMatched == Length) {
-
-FoundMatch:
-
-            //
-            // This slot is a prefix match.  Fill out the Match structure if the
-            // caller provided a non-NULL pointer, then return the index of the
-            // match.
-            //
-
-
-            if (ARGUMENT_PRESENT(Match)) {
+        //
+        // This slot is a prefix match.  Fill out the Match structure if the
+        // caller provided a non-NULL pointer, then return the index of the
+        // match.
+        //

-                Match->Index = (BYTE)Index;
-                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
-                Match->String = &StringTable->pStringArray->Strings[Index];
+        if (ARGUMENT_PRESENT(Match)) {

-            }
+            Match->Index = (BYTE)Index;
+            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
+            Match->String = &StringTable->pStringArray->Strings[Index];

-            return (STRING_TABLE_INDEX)Index;
         }

-        //
-        // Not enough characters matched, so continue the loop.
-        //
+        return (STRING_TABLE_INDEX)Index;

     } while (--Count);
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_10(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 8, but rewrites the inner loop that
    checks for comparisons.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

That’s a nicer bit of logic—more C-like, less assembly-like, and arguably clearer. Let’s see how they compare. (This is an interesting one, as I genuinely don’t have a strong hunch about what kind of performance impact this will have; obviously, I thought the initial way of structuring the loop was optimal, and I had it in place for two years before deciding to embark on this article, which led to the rework we just saw. )

Benchmark 10

Benchmark 10

Benchmark 10

Hey, look at that! We’ve shaved off a few more cycles in most cases, especially for the negative matches!

Speeding Up Negative Matches with Assembly

Note

If you build the Tracer project, you can run a helper batch file in the root directory called cdb-simple.bat, which uses cdb to launch one of the project’s executables, ModuleLoader.exe. This will start up, load all of our tracing project’s DLLs, then allow the debugger to break in, yielding a debugger prompt from which we can easily disassemble functions, inspect runtime function entries, etc. This is the approach I used for capturing the output over the next couple of sections.

Now for the fun part! Let’s take a look at the disassembly of the initial part of version 10 responsible for the negative match logic and see if there are any improvements we can make.

0:000> uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h
00007fff`f69c1e04 c5fa6f5920      vmovdqu xmm3,xmmword ptr [rcx+20h]
00007fff`f69c1e09 4c8b6a08        mov     r13,qword ptr [rdx+8]
00007fff`f69c1e0d 4d8bf0          mov     r14,r8
00007fff`f69c1e10 440fb63a        movzx   r15d,byte ptr [rdx]
00007fff`f69c1e14 33ed            xor     ebp,ebp
00007fff`f69c1e16 44883c24        mov     byte ptr [rsp],r15b
00007fff`f69c1e1a 488bf1          mov     rsi,rcx
00007fff`f69c1e1d c4e279780c24    vpbroadcastb xmm1,byte ptr [rsp]
00007fff`f69c1e23 c4c17a6f6500    vmovdqu xmm4,xmmword ptr [r13]
00007fff`f69c1e29 c4e259004110    vpshufb xmm0,xmm4,xmmword ptr [rcx+10h]
00007fff`f69c1e2f c5f97411        vpcmpeqb xmm2,xmm0,xmmword ptr [rcx]
00007fff`f69c1e33 c5e164c9        vpcmpgtb xmm1,xmm3,xmm1
00007fff`f69c1e37 c5f1ef0d41320000 vpxor   xmm1,xmm1,xmmword ptr [StringTable2!_xmmffffffffffffffffffffffffffffffff (00007fff`f69c5080)]
00007fff`f69c1e3f c5e9dbd1        vpand   xmm2,xmm2,xmm1
00007fff`f69c1e43 c579d7c2        vpmovmskb r8d,xmm2
00007fff`f69c1e47 c5fa7f5c2410    vmovdqu xmmword ptr [rsp+10h],xmm3
00007fff`f69c1e4d 4585c0          test    r8d,r8d
00007fff`f69c1e50 0f849a000000    je      StringTable2!IsPrefixOfStringInTable_10+0x100 (00007fff`f69c1ef0)

There’s a bit of cruft at the start regarding setting up the function’s prologue (pushing non-volatile registers to the stack, etc. ). That’s to be expected for C (and C++, and basically every language); as the programmer, you don’t have any direct control over how many registers a compiler uses for a routine, how much stack space it uses, which registers it uses when, etc.

However, with assembly, we’re on the opposite end of the spectrum: we can control everything! We also have a little trick up our sleeves: the venerable LEAF_ENTRY.

Windows x64 ABI Calling Conventions

First, some background. The Windows x64 ABI and calling convention dictate two types of functions: NESTED_ENTRY and LEAF_ENTRY.

NESTED_ENTRY

NESTED_ENTRY is by far the most common; C and C++ functions are all implicitly NESTED_ENTRY functions. (The LEAF_ENTRY and NESTED_ENTRY symbols are MASM (ml64.exe) macro names, but the concept applies to all languages.)

LEAF_ENTRY

A LEAF_ENTRY can only be implemented in assembly. It is constrained in that it may not manipulate any of the non-volatile x64 registers (rbx, rdi, rsi, rsp, rbp, r12, r13, r14, r15, xmm6-15), nor may it call any other functions (since call implicitly modifies the stack pointer), nor may it have a structured exception handler (since handling an exception for a given stack frame also manipulates the stack pointer).

The reason for these constraints is that LEAF_ENTRY routines do not have any unwind information generated for them in their runtime function entries. Unwind information is used by the kernel to, well, unwind the modifications made to non-volatile registers while traversing back up through the call stack looking for an exception handler in the event of an exception.

For example, here’s the function entry and associated unwind information for the PGO build of the IsPrefixOfStringInTable_10 function:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_10
Debugger function entry 000001d8`2ea03cf8 for:
(00007fff`f8411df0)   StringTable2!IsPrefixOfStringInTable_10
Exact matches:
    StringTable2!IsPrefixOfStringInTable_10 (struct _STRING_TABLE *,
                                             struct _STRING *,
                                             struct _STRING_MATCH *)

BeginAddress      = 00000000`00001df0
EndAddress        = 00000000`00001e59
UnwindInfoAddress = 00000000`000054f8

Unwind info at 00007fff`f84154f8, 14 bytes
  version 1, flags 0, prolog 14, codes 8
  00: offs 14, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 58 reg: rsi.
  02: offs 14, unwind op 4, op info 5   UWOP_SAVE_NONVOL FrameOffset: 50 reg: rbp.
  04: offs 14, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
  05: offs 10, unwind op 0, op info f   UWOP_PUSH_NONVOL reg: r15.
  06: offs e, unwind op 0, op info e    UWOP_PUSH_NONVOL reg: r14.
  07: offs c, unwind op 0, op info d    UWOP_PUSH_NONVOL reg: r13.

We can see that this routine manipulates six non-volatile registers in total, including the stack pointer. The first instructions of the routine constitute the function’s prologue; in the disassembly, you can see that three of the rxx registers are pushed to the stack, followed by the allocation of 0x20 (32) bytes of stack space:

0:000> uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h

It also cheekily uses the home parameter space for stashing rbp and rsi instead of pushing them to the stack. That’s fair game, though—this is the PGO build, so I’d expect it to use some extra tricks to shave off a few cycles here and there. I’d do the same if I were writing assembly. (Side note: if you view the source of this page, there’s a commented-out section below that shows the runtime function entry for the release build of version 10; it uses nine registers instead of six and 40 bytes of stack space instead of 32. I wrote it before switching to using the PGO build for everything.)

The home parameter space is a 32-byte area that immediately follows the return address (i.e., the value of rsp when the function is entered); it is mandated by the x64 calling convention on Windows and is primarily intended to provide scratch space for a routine to home its parameter registers (i.e., the registers used for the first four arguments of a function: rcx, rdx, r8, and r9). This allows the four volatile registers to be repurposed within a routine while still providing a way to refer to the parameters if needed. That’s its intended use—however, it’s not strictly enforced, so you can essentially treat this area as a free 32-byte scratch space if you’re writing assembly.

Note

On a semi-related note, I’d highly recommend reading A History of Modern 64-bit Computing if you have some spare time. It’s a fascinating insight into contemporary x64 conventions we often take for granted, drawing on numerous interviews with industry luminaries like Dave Cutler and Linus Torvalds. I found it incredibly useful for understanding the why behind concepts like home parameter space, structured exception handling, runtime function entries, and why you can’t write inline assembly for x64 with MSVC anymore—apparently, it provides a direct vector for disrupting the mechanisms relied upon by the kernel stack unwinding functionality. (At least, I think that’s the reason—can anyone from Microsoft confirm?))


Assembly Implementations

IsPrefixOfStringInTable_x64_1

IsPrefixOfStringInTable_x64_2 →

So, knowing what we now know about the venerable little LEAF_ENTRY trick, let’s see if we can construct a simple routine in assembly that just deals with the negative match case.

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00

        ;IACA_VC_START

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result back into xmm0.
;

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm0, xmm0, xmm1

;
; Load the string table's unique character array into xmm2, and the lengths for
; each string slot into xmm3.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm5 to all ones.  This is used later.
;

        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's unique character array (xmm0) against the string
; table's unique chars (xmm2), saving the result back into xmm0.
;

        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.

;
; Compare the search string's length, which we've broadcasted to all 8-byte
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm5            ; Invert the result.

;
; Intersect-via-test xmm0 and xmm1 to identify string slots of a suitable
; length with a matching unique character.
;

        vptest      xmm0, xmm1                  ; Check for no match.
        ;jnz        short @F                    ; There was a match.
                                                ; (Not yet implemented.)

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ;
        not         al                          ; rax = -1
        ret

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Note how we don’t need to push anything to the stack since we didn’t manipulate any non-volatile registers. If an exception occurs within the body of our implementation (say we dereference a NULL pointer), the kernel knows it doesn’t have to undo any non-volatile register modifications (using offsets specified by the unwind information) because there isn’t any unwind information. It can simply advance to the frame before us (e. g. , rsp at the time of the fault, minus 8 bytes) as it continues its search for runtime function entries and associated unwind information. As you can see, the unwind info is effectively empty:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_x64_1
Debugger function entry 000001f9`048edf98 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_1 (void)

BeginAddress      = 00000000`00003290
EndAddress        = 00000000`000032cb
UnwindInfoAddress = 00000000`00004468

Unwind info at 00007ffd`15594468, 4 bytes
  version 1, flags 0, prolog 0, codes 0

Benchmark x64 1

Let’s see how this scrappy little fellow (who always returns NO_MATCH_FOUND but still mimics the steps required to successfully negative match) does against the leading C implementation at this point, version 10:

Benchmark x64 1

Benchmark x64 1

Fwoah, look at that, we’ve shaved about three cycles off the C version!

(Note that when I first wrote this, I was comparing the assembly version against the release build (not the PGO build), which was clocking in at about 13-14 cycles for negative matching. So getting it down to ~7.5 from 13-14 was a bit more exciting. Damn the PGO build and its 10.9-ish cycles for negative matching!)

The good news is that our theory about the performance of the LEAF_ENTRY looks like it’s paid off: we can reliably get about 7.5 cycles for negative matching.

IsPrefixOfStringInTable_x64_2

← IsPrefixOfStringInTable_x64_1 | IsPrefixOfStringInTable_x64_3 →

The bad news is that we now need to implement the rest of the functionality within the constraints of a LEAF_ENTRY!

The problem with a LEAF_ENTRY for anything more than a trivial bit of code is that you only have a handful of volatile registers to work with, and no stack space can be used for register spilling or temporaries. (Technically I could use the home parameter space, but, eh, we’re already avoiding stack spills, why not make life harder for ourselves and try to avoid all memory spilling.)

If you can’t spill to memory, your only option is really spilling to XMM registers via vpinsr and vpextr combinations, which, as you can see in the implementation of version 2 below, I have to do a lot.

(Also note: when I wrote this version, I didn’t use the disassembly from the C routines for guidance. I find that as soon as you start to grok the disassembly for a given routine, it becomes harder to think of ways to approach it from a fresh angle. Also, the LEAF_ENTRY aspect significantly limited what I could do anyway, so I figured I may as well just give it a crack from scratch and see what I could come up with. It would be an interesting point of reference compared to a future iteration that tries to improve on the disassembly of an optimized PGO version, for example.)

The diff view for this routine is less useful given the vast majority of the code is new, so I’ve put the full version of the code first. It’s based more or less on the approach used by version 8 of the C routine (I actually wrote it after I wrote version 8; versions 9 and 10 of the C routine (with the latter having the improved loop logic) came after).

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-byte
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function paramter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length <= 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
% diff -u IsPrefixOfStringInTable_x64_1.asm IsPrefixOfStringInTable_x64_2.asm
--- IsPrefixOfStringInTable_x64_1.asm   2018-04-29 11:03:46.403568800 -0400
+++ IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
@@ -50,12 +50,12 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result back into xmm0.
+; result into xmm5.
 ;

         ;IACA_VC_START
@@ -63,34 +63,36 @@
         mov     rax, String.Buffer[rdx]
         vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
         vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm0, xmm0, xmm1
+        vpshufb xmm5, xmm0, xmm1

 ;
-; Load the string table's unique character array into xmm2, and the lengths for
-; each string slot into xmm3.
-;
+; Load the string table's unique character array into xmm2.

         vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
-        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

 ;
-; Set xmm5 to all ones.  This is used later.
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (xmm2), saving the result back into xmm5.
 ;

-        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.
+        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Load the lengths of each string table slot into xmm3.
 ;
+        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+;
+; Set xmm2 to all ones.  We use this later to invert the length comparison.
+;
+
+        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

 ;
-; Compare the search string's unique character array (xmm0) against the string
-; table's unique chars (xmm2), saving the result back into xmm0.
+; Broadcast the byte-sized string length into xmm4.
 ;

-        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
 ; Compare the search string's length, which we've broadcasted to all 8-byte
@@ -100,30 +102,378 @@
 ; a slot with a length less than or equal to our search string's length.
 ;

-        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm5            ; Invert the result.
+        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
+        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

 ;
-; Intersect-and-test the unique character match xmm mask register (xmm0) with
+; Intersect-and-test the unique character match xmm mask register (xmm5) with
 ; the length match mask xmm register (xmm1).  This affects flags, allowing us
 ; to do a fast-path exit for the no-match case (where ZF = 1).
 ;

-        vptest      xmm0, xmm1                  ; Check for no match.
-        ;jnz        short @F                    ; There was a match.
-                                                ; (Not yet implemented.)
+        vptest      xmm5, xmm1                  ; Check for no match.
+        jnz         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
 ;

-        xor         eax, eax                    ;
-        not         al                          ; rax = -1
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+        ;IACA_VC_END
+
+;
+; (There was at least one match, continue with processing.)
+;
+
+;
+; Calculate the "search length" for the incoming search string, which is
+; equivalent of 'min(String->Length, 16)'.  (The search string's length
+; currently lives in xmm4, albeit as a byte-value broadcasted across the
+; entire register, so extract that first.)
+;
+; Once the search length is calculated, deposit it back at the second byte
+; location of xmm4.
+;
+;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
+;
+;   r11 - String length (String->Length)
+;
+
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
+        mov         rax, 16                     ; Load 16 into rax.
+        mov         r10, r11                    ; Copy into r10.
+        cmp         r10w, ax                    ; Compare against 16.
+        cmova       r10w, ax                    ; Use 16 if length is greater.
+        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].
+
+;
+; Home our parameter registers into xmm registers instead of their stack-backed
+; location, to avoid memory writes.
+;
+
+        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
+        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
+        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].
+
+;
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; yielding a mask identifying indices we need to perform subsequent matches
+; upon.  Convert this into a bitmap and save in xmm2d[2].
+;
+
+        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.
+
+;
+; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
+;
+
+        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
+        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].
+
+;
+; Summary of xmm register stashing for the rest of the routine:
+;
+; xmm2:
+;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
+;       64:127  (vpinsrq 1)     rdx (2nd function paramter, String)
+;
+; xmm4:
+;       0:7     (vpinsrb 0)     length of search string
+;       8:15    (vpinsrb 1)     min(String->Length, 16)
+;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
+;      24:31    (vpinsrb 3)     shift count
+;
+; xmm5:
+;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
+;      64:95    (vpinsrd 2)     bitmap of slots to compare
+;      96:127   (vpinsrd 3)     index of slot currently being processed
+;
+
+;
+; Initialize rcx as our counter register by doing a popcnt against the bitmap
+; we just generated in edx, and clear our shift count register (r9).
+;
+
+        popcnt      ecx, edx                    ; Count bits in bitmap.
+        xor         r9, r9                      ; Clear r9.
+
+        align 16
+
+;
+; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
+; trailing zeros of the bitmap, and then add in the shift count, producing an
+; index (rax) we can use to load the corresponding slot.
+;
+; Register usage at top of loop:
+;
+;   rax - Index.
+;
+;   rcx - Loop counter.
+;
+;   rdx - Bitmap initially, then slot length.
+;
+;   r9 - Shift count.
+;
+;   r10 - Search length.
+;
+;   r11 - String length.
+;
+
+Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
+        mov         eax, r8d                    ; Copy tzcnt to rax,
+        add         rax, r9                     ; Add shift to create index.
+        inc         r8                          ; tzcnt + 1
+        shrx        rdx, rdx, r8                ; Reposition bitmap.
+        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
+        xor         edx, edx                    ; Clear edx.
+        mov         r9, rax                     ; Copy index back to shift.
+        inc         r9                          ; Shift = Index + 1
+        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
+
+;
+; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
+; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
+;
+; Then, load the string table slot at this index into xmm1, then shift rax back.
+;
+
+        shl         eax, 4
+        vpextrq     r8, xmm2, 0
+        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
+        shr         eax, 4
+
+;
+; The search string's first 16 characters are already in xmm0.  Compare this
+; against the slot that has just been loaded into xmm1, storing the result back
+; into xmm1.
+;
+
+        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.
+
+;
+; Convert the XMM mask into a 32-bit representation, then zero high bits after
+; our "search length", which allows us to ignore the results of the comparison
+; above for bytes that were after the search string's length, if applicable.
+; Then, count the number of bits remaining, which tells us how many characters
+; we matched.
+;
+
+        vpmovmskb   r8d, xmm1                   ; Convert into mask.
+        bzhi        r8d, r8d, r10d              ; Zero high bits.
+        popcnt      r8d, r8d                    ; Count bits.
+
+;
+; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
+; it, we can load rax (the current index) into xmm1 and use it to extract the
+; slot length via shuffle.  (The length will be in the lowest byte of xmm1
+; after the shuffle, which we can then vpextrb.)
+;
+
+        movd        xmm1, rax                   ; Load index into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
+        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.
+
+;
+; If 16 characters matched, and the search string's length is longer than 16,
+; we're going to need to do a comparison of the remaining strings.
+;
+
+        cmp         r8w, 16                     ; Compare chars matched to 16.
+        je          short @F                    ; 16 chars matched.
+        jmp         Pfx30                       ; Less than 16 matched.
+
+;
+; All 16 characters matched.  If the slot length is greater than 16, we need
+; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
+; then great, that's a slot match, we're done.
+;
+
+@@:     cmp         dl, 16                      ; Compare length to 16.
+        ja          Pfx50                       ; Length is > 16.
+        je          short Pfx35                 ; Lengths match!
+                                                ; Length <= 16, fall through...
+
+;
+; Less than or equal to 16 characters were matched.  Compare this against the
+; length of the slot; if equal, this is a match, if not, no match, continue.
+;
+
+Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
+        jne         @F                          ; No match found.
+        jmp         short Pfx35                 ; Match found!
+
+;
+; No match against this slot, decrement counter and either continue the loop
+; or terminate the search and return no match.
+;
+
+@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        dec         cx                          ; Decrement counter.
+        jnz         Pfx20                       ; cx != 0, continue.
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
+; former is used when we need to copy the number of characters matched from r8
+; back to rax.  The latter jump target doesn't require this.
+;
+
+Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.
+
+;
+; Load the match parameter back into r8 and test to see if it's not-NULL, in
+; which case we need to fill out a STRING_MATCH structure for the match.
+;
+
+Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
+        test        r8, r8                      ; Is NULL?
+        jnz         short @F                    ; Not zero, need to fill out.
+
+;
+; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
+;
+
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        ret                                     ; StringMatch == NULL, finish.
+
+;
+; StringMatch is not NULL.  Fill out characters matched (currently rax), then
+; reload the index from xmm5 into rax and save.
+;
+
+@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        mov         byte ptr StringMatch.Index[r8], al
+
+;
+; Final step, loading the address of the string in the string array.  This
+; involves going through the StringTable, so we need to load that parameter
+; back into rcx, then resolving the string array address via pStringArray,
+; then the relevant STRING offset within the StringArray.Strings structure.
+;
+
+        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
+        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
+        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
+        shr         eax, 4                  ; Revert the scaling.
+
         ret

+;
+; 16 characters matched and the length of the underlying slot is greater than
+; 16, so we need to do a little memory comparison to determine if the search
+; string is a prefix match.
+;
+; The slot length is stored in rax at this point, and the search string's
+; length is stored in r11.  We know that the search string's length will
+; always be longer than or equal to the slot length at this point, so, we
+; can subtract 16 (currently stored in r10) from rax, and use the resulting
+; value as a loop counter, comparing the search string with the underlying
+; string slot byte-by-byte to determine if there's a match.
+;
+
+Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.
+
+;
+; Free up some registers by stashing their values into various xmm offsets.
+;
+
+        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
+        mov         rcx, rdx                ; Free up rdx, rcx is now counter.
+
+;
+; Load the search string buffer and advance it 16 bytes.
+;
+
+        vpextrq     r11, xmm2, 1            ; Extract String into r11.
+        mov         r11, String.Buffer[r11] ; Load buffer address.
+        add         r11, r10                ; Advance buffer 16 bytes.
+
+;
+; Loading the slot is more involved as we have to go to the string table, then
+; the pStringArray pointer, then the relevant STRING offset within the string
+; array (which requires re-loading the index from xmm5d[3]), then the string
+; buffer from that structure.
+;
+
+        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
+        mov         r8, StringTable.pStringArray[r8] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+
+        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
+        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
+        add         r8, r10                 ; Advance buffer 16 bytes.
+
+        xor         eax, eax                ; Clear eax.
+
+;
+; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
+; Do a byte-by-byte comparison.
+;
+
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;
+
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.
+
+;
+; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
+; how many characters we matched, and then jump to Pfx40 for finalization.
+;
+
+        add         rax, r10
+        jmp         Pfx40
+
+;
+; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
+; it.  If it's zero, we have no more strings to compare, so we can do a quick
+; exit.  If there are still comparisons to be made, restore the other registers
+; we trampled then jump back to the start of the loop Pfx20.
+;
+
+Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
+        dec         cx                          ; Decrement counter.
+        jnz         short @F                    ; Jump forward if not zero.
+
+;
+; No more comparisons remaining, return.
+;
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; More comparisons remain; restore the registers we clobbered and continue loop.
+;
+
+@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
+        vpextrb     r11, xmm4, 0                ; Restore r11.
+        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        jmp         Pfx20                       ; Continue comparisons.
+
         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Looking back on my time logs (shout out to my favorite iPhone app, HoursTracker!), the routine above took about 8 hours to implement over the course of about two days, give or take. Writing assembly is slow; writing correct assembly is even slower. I generally find that there’s a noticeable hump I need to get over in the first, say, 30 minutes of any assembly programming session, but once you get into the zone, things can start flowing quite nicely. I’m an aggressive debugger user; often, to get started, I’ll write a simple LEAF_ENTRY that looks like this:

    LEAF_ENTRY Foo, _TEXT$00
        int 3
        xor eax, eax
        ret
    LEAF_END Foo, _TEXT$00

That’ll allow me to attach the debugger and at least inspect the parameter registers so I can write the next couple of instructions. I find it definitely helps get me into the zone quicker.

Anyway, enough about that. Let’s look at performance. Again, this will be an interesting one—other than the optimal negative match logic that I copied from version 1, the sole focus was on getting a working assembly version; I wasn’t giving any thought to performance at this stage.

So, it’ll be interesting to see how it compares to a) version 1 in the negative matching case (it should be very close), and b) against the C versions in the prefix matching case (it hopefully won’t be prohibitively worse).

Benchmark x64 2: Negative Matching

Benchmark Negative Match

Benchmark Negative Match

Hmmm, that’s not too bad! We’re very close to version 1 for negative matching, within about 0.5 cycles or so. That sounds about right, given that our initial logic had to be tweaked a bit to play nicer with the rest of the implementation. And we’re still about 3-4 cycles faster than the fastest C version.

What about prefix matching performance?

Benchmark x64 2: Prefix Matching

Benchmark Prefix Match

Benchmark Prefix Match

The prefix matching performance isn’t too bad either! We’re definitely slower than the C version, ranging from about 4 cycles to 10 cycles in most cases, with the $INDEX_ALLOCATION input about 13 cycles slower.

(I’ve just noticed the pattern with regards to the first 8 entries, $AttrDef to $Mft, clocking in at about 18 and 24 cycles respectively. But the next four entries, $Secure to $Cairo, consistently clock in at about 24 and 34 cycles respectively. $Secure is the 9th slot, which puts it at memory offset 192 bytes from the start of the string table. And then the 18 and 24 cycle behavior returns for the last two items, ???? and ., which are at the end of the string table’s inner slot array. This pattern is prevalent in all of our iterations. Very peculiar! We’ll investigate later.)

IsPrefixOfStringInTable_x64_3

← IsPrefixOfStringInTable_x64_2 | IsPrefixOfStringInTable_x64_4 →

(We’re nearly at the end of the first round of iterations, I promise!)

Seeing the performance of the second version in assembly, I decided to try whipping up a third version, which would switch from a LEAF_ENTRY to NESTED_ENTRY and use rep cmps for the byte comparison for long strings (instead of the byte-by-byte approach used now).

In order to use rep cmps, you need to use two non-volatile registers, rsi (the source index) and rdi (the destination index). You also need to specify the direction of the comparison, which means mutating the flags, which are also classed as non-volatile, so they need to be pushed to the stack in the prologue and popped back off in the epilogue.

I didn’t really expect this to offer a measurable speedup, but it was a tangible reason to use a NESTED_ENTRY, and otherwise allowed me to stay within the confines of the version 2 implementation.

Let’s take a look at the implementation. At the very least, it’s useful to see how you can go about organizing your prologue in MASM. For NESTED_ENTRY routines, I always define a Locals structure that incorporates the return address and home parameter space for easy access. Mainly because it allows me to write code like this:

    mov     Locals.HomeRcx[rsp], rcx        ; Home our first param.
    mov     Locals.HomeRdx[rsp], rdx        ; Home our second param.
    mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
    mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.

Instead of working wiht offsets like this:

    mov     qword ptr [rsp+30h], rcx        ; Home our first param.
    mov     qword ptr [rsp+38h], rdx        ; Home our second param.
    mov     rsi, qword ptr [rsp+10h]        ; Restore rsi.
    mov     rdi, qword ptr [rsp+8]          ; Restore rdi.

This routine was written last, after version 10 of the C routine, so it incorporates the slightly re-arranged loop logic that proved to be faster for that version. Other than that, the main changes involved converting all the early exit returns in the body of the function to jump to a single exit point, Pfx90, mainly to simplify epilogue exit code.

 % diff -u IsPrefixOfStringInTable_x64_2.asm IsPrefixOfStringInTable_x64_3.asm
--- IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
+++ IsPrefixOfStringInTable_