Is Prefix Of String In Table?

A Journey Into SIMD String Processing

AVX2

SIMD

Assembly

MASM

This article details an approach for efficiently determining if a given string prefix-matches a set of known strings. That is, do any of the known strings represent the prefix of a given string? A custom data structure is employed with successive implementations benchmarked to find the fastest possible solution.

Author

Trent Nelson

Published

May 4, 2018

Published: 4th May, 2018. Last updated: 1st November, 2024.

Thanks to Fabian Giesen, Wojciech Muła, Geoff Langdale, Daniel Lemire, and Kendall Willets for their valuable feedback on an early draft of this article.

Hours spent on this article to date: 230.56. Hours spent porting this article from raw HTML to Markdown in 2024: about 16-20. See Colophon for more details.

Hacker News discussion | Reddit discussion

TL;DR

I wrote some C and assembly code that uses SIMD instructions to perform prefix matching of strings. The C code was between 4-7x faster than the baseline implementation for prefix matching. The assembly code was 9-12x faster than the baseline specifically for the negative match case (determining that an incoming string definitely does not prefix match any of our known strings). The fastest negative match could be done in around 6 CPU cycles, which is pretty quick. (Integer division, for example, takes about 90 cycles.)

Overview

Goal: given a string, determine if it prefix-matches a set of known strings as fast as possible. That is, in a set of known strings, do any of them prefix match the incoming search string?

A reference implementation was written in C as a baseline, which simply looped through an array of strings, comparing each one, byte-by-byte, looking for a prefix match. Prefix match performance ranged from 28 CPU cycles to 130, and negative match performance was around 74 cycles.

A SIMD-friendly C structure called STRING_TABLE was derived. It is optimized for up to 16 strings, ideally of length less than or equal to 16 characters. The table is created from the set of known strings up-front; it is sorted by length, ascending, and a unique character (with regards to other characters at the same byte offset) is then extracted, along with its index. A 16-byte character array, STRING_SLOT, is used to capture the unique characters. A 16-element array of unsigned characters, SLOT_INDEX, is used to capture the index. Similarly, lengths are stored in the same fashion via SLOT_LENGTHS. Finally, a 16-element array of STRING_SLOTs is used to capture up to the first 16 bytes of each string in the set.

An example of the memory layout of the STRING_TABLE structure at run time, using sample test data, is depicted below. Note the width of each row is 16 bytes (128 bits), which is the size of an XMM register.

The layout of the STRING_TABLE structure allows us to determine if a given search string does not prefix match all 16 strings at once in 12 assembly instructions. This breaks down into 18 μops, with a block throughput of 3.48 cycles on Intel’s Skylake architecture. (In practice, this clocks in at around 6 CPU cycles.)

The assembly listing is shown first, followed by the IACA throughput analysis of the same instructions.

mov      rax,  String.Buffer[rdx]                   ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx]      ; Broadcast string length.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu  xmm0, xmmword ptr [rax]                    ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4                           ; Identify slots > string len.
vpshufb  xmm5, xmm0, StringTable.UniqueIndex[rcx]   ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx]   ; Compare rearranged to unique.
vptest   xmm1, xmm5                                 ; Unique slots AND (!long slots).
jnc      short Pfx10                                ; CY=0, continue with routine.
xor      eax, eax                                   ; CY=1, no match.
not      al                                         ; al = -1 (NO_MATCH_FOUND).
ret                                                 ; Return NO_MATCH_FOUND.

S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|μops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |1.0|   |       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ |   |1.0|0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of μops: 18
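For readers who don't read vector assembly fluently, the logic of that routine can be modelled one slot at a time in plain C. This is only a scalar sketch of the idea, not the code the project uses: `mini_table`, `definitely_no_match`, `make_demo_table`, the hand-picked unique characters, and the choice of 127 as the length of an unused slot are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 16
#define EMPTY_SLOT_LENGTH 127 /* Assumption: unused slots always look "too long". */

/* Hypothetical, cut-down stand-in for the first three rows of STRING_TABLE. */
struct mini_table {
    unsigned char unique_chars[SLOT_COUNT];
    unsigned char unique_index[SLOT_COUNT];
    unsigned char lengths[SLOT_COUNT];
};

/* Returns 1 if no slot can possibly prefix match the search string, else 0. */
int
definitely_no_match(const struct mini_table *table, const char *str, size_t len)
{
    unsigned char search[SLOT_COUNT] = { 0 };
    size_t i;

    assert(table != NULL && str != NULL);

    /* vmovdqu: take (up to) the first 16 bytes of the search string. */
    memcpy(search, str, len < SLOT_COUNT ? len : SLOT_COUNT);

    for (i = 0; i < SLOT_COUNT; i++) {
        /* vpcmpgtb: is the string in this slot longer than the search string? */
        int too_long = (size_t)table->lengths[i] > len;
        /* vpshufb: pick the search byte that sits at this slot's unique index. */
        unsigned char picked = search[table->unique_index[i] & 15];
        /* vpcmpeqb: does that byte equal this slot's unique character? */
        int same = (picked == table->unique_chars[i]);
        /* vptest/jnc: a slot that matches and is not too long is a candidate. */
        if (same && !too_long) {
            return 0;
        }
    }
    return 1;
}

/* Hand-built table for "$Boot", "$Mft" and "$DATA"; other slots are empty. */
struct mini_table
make_demo_table(void)
{
    struct mini_table t;

    memset(t.unique_chars, 0, sizeof(t.unique_chars));
    memset(t.unique_index, 0, sizeof(t.unique_index));
    memset(t.lengths, EMPTY_SLOT_LENGTH, sizeof(t.lengths));

    t.unique_chars[0] = 'o'; t.unique_index[0] = 2; t.lengths[0] = 5; /* $Boot */
    t.unique_chars[1] = 'M'; t.unique_index[1] = 1; t.lengths[1] = 4; /* $Mft  */
    t.unique_chars[2] = 'D'; t.unique_index[2] = 1; t.lengths[2] = 5; /* $DATA */
    return t;
}
```

With this table, "CAT" is rejected because every slot is longer than three characters, while "$Bootstrap" survives the filter and would go on to a full comparison. The real routine does all 16 slot tests in parallel; note also that vpcmpgtb compares signed bytes, so a faithful version has to keep stored lengths below 128.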

Here’s a simplified walk-through of a negative match in action, using the search string “CAT”:

Ten iterations of a function named IsPrefixOfStringInTable were authored. The tenth and final iteration was the fastest, prefix matching in as little as 19 cycles—a 4x improvement over the baseline. Negative matching took 11 cycles—a 6.7x improvement.

An assembly version of the algorithm was authored specifically to optimize for the negative match case, and was able to do so in as little as 8 cycles, representing a 9x improvement over the baseline. (It was a little bit slower than the fastest C routine in the case of prefix matches, though, as can be seen below.)

Feedback for an early draft of this article was then solicited via Twitter, resulting in four more iterations of the C version, and three more iterations of the assembly version. The PGO build of the fastest C version prefix matched in about 16 cycles (and also had the best “worst case input string” performance where three slots needed comparison), negative matching in about 26 cycles. The fifth iteration of the assembly version negative matched in about 6 cycles, a 3 and 1 cycle improvement, respectively.

We were then ready to publish, but felt compelled to investigate an odd performance quirk we’d noticed with one of the assembly routines, which yielded 7 more assembly versions. Were any of them faster? Let’s find out.

The Background

The Tracer Project

One of the frustrations I had with existing Python profilers was that there was no easy or efficient means to filter or exclude trace information based on the module name of the code being executed. I tackled this in my tracer project, which allows you to set an environment variable named TRACER_MODULE_NAMES to restrict which modules should be traced, e.g.:

set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy

If the code being executed is coming from the module myproject3.subproject.foo, then we need to trace it, as that string prefix matches the third entry on our list.
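To make the matching rule concrete, here is a minimal sketch in plain C of checking a module name against a semicolon-delimited list like the one above. The function name `find_traced_module` is hypothetical and this is not how the tracer project does it; it only illustrates what "prefix matches an entry on our list" means.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the 0-based position of the first entry in a ';'-delimited list
 * that is a prefix of `module`, or -1 if no entry qualifies.
 */
int
find_traced_module(const char *list, const char *module)
{
    size_t module_len;
    int index = 0;

    assert(list != NULL && module != NULL);
    module_len = strlen(module);

    while (*list != '\0') {
        /* Length of the current entry, up to the next ';' or the end. */
        size_t entry_len = strcspn(list, ";");

        if (entry_len > 0 && entry_len <= module_len &&
            memcmp(list, module, entry_len) == 0) {
            return index;
        }

        /* Step over the entry and its delimiter, if there is one. */
        list += entry_len;
        if (*list == ';') {
            list++;
        }
        index++;
    }
    return -1;
}
```

Called with the list above and "myproject3.subproject.foo", this returns 2 (the third entry); a module such as "requests.api" returns -1. The comparison is a plain byte-wise prefix test, so "numpyfoo" would also match the "numpy" entry.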

This article details the custom data structure and algorithm I came up with in order to try and solve the prefix matching problem more optimally with a SIMD approach. The resulting StringTable component is used extensively within the tracer project, and as such, must conform to unique constraints such as no use of the C runtime library and allocating all memory through TraceStore-backed allocators. Thus, it’s not really something you’d drop into an existing project as-is. Hopefully, the article still proves to be interesting.

Note

The code samples provided herein are copied directly from the tracer project, which is written in C and assembly, and uses the Pascal-esque Cutler Normal Form style for C. If you’re used to the more UNIX-style Kernel Normal Form of C, it’s quite like that, except that it’s absolutely nothing like that, and all these code samples will probably be very jarring.

Baseline C Implementation

The simplest way of solving this in C is to have an array of C strings (i.e., NULL-terminated byte arrays), then for each string, loop through byte by byte and see if it prefix matches the search string.

The baseline is shown first in Cutler Normal Form, then in Kernel Normal Form.

//
// Declare a set of module names to be used as a string array.
//

const PCSZ ModuleNames[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    NULL,
};

//
// Define the function pointer typedef.
//

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_CSTR_IN_ARRAY)(
    _In_ PCSZ *StringArray,
    _In_ PCSZ String,
    _Out_opt_ PSTRING_MATCH Match
    );
typedef IS_PREFIX_OF_CSTR_IN_ARRAY *PIS_PREFIX_OF_CSTR_IN_ARRAY;

//
// Forward declaration.
//

IS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

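        //
        // Only advance past characters that actually match; otherwise a
        // mismatch on the final character would leave Right pointing at the
        // terminator and be reported as a match.
        //

        while (*Left && *Right && *Left == *Right) {
            Left++;
            Right++;
            Count++;
        }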

        if (Count > 0 && !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)Count;
                Match->String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}

const char *module_names[] = {
    "myproject1",
    "myproject2",
    "myproject3.subproject",
    "numpy",
    "pandas",
    "scipy",
    0,
};

struct string_match {
    /* Index of the match. */
    unsigned char index;

    /* Number of characters matched. */
    unsigned char number_of_chars_matched;

    /* Pad out to an 8-byte boundary. */
    unsigned short padding[3];

    /* Pointer to the string that was matched. */
    char *str;
};

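/* Returns the index of the matching entry, or -1 if none matched. */
signed char
is_prefix_of_c_str_in_array(const char **array,
                            const char *str,
                            struct string_match *match)
{
    const char *left, *right, **target;
    unsigned int c, i = 0;

    /* Walk the NULL-terminated array of candidate prefixes. */
    for (target = array; *target; target++, i++) {
        c = 0;
        left = str;
        right = *target;

        /* Advance only while both strings have bytes left and they agree. */
        while (*left && *right && *left == *right) {
            left++;
            right++;
            c++;
        }

        /* A prefix match means the whole table entry was consumed. */
        if (c > 0 && !*right) {
            if (match) {
                match->index = (unsigned char)i;
                match->number_of_chars_matched = (unsigned char)c;
                match->str = (char *)*target;
            }
            return (signed char)i;
        }
    }

    return -1;
}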

Another type of code pattern that the string table attempts to replace is anything that does a lot of if/else if/else if-type string comparisons to look for keywords. For example, in the Quake III source, there’s some symbol/string processing logic that looks like this:

// call instructions reset currentArgOffset
if ( !strncmp( token, "CALL", 4 ) ) {
    EmitByte( &segment[CODESEG], OP_CALL );
    instructionCount++;
    currentArgOffset = 0;
    return;
}

// arg is converted to a reversed store
if ( !strncmp( token, "ARG", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_ARG );
    instructionCount++;
    if ( 8 + currentArgOffset >= 256 ) {
        CodeError( "currentArgOffset >= 256" );
        return;
    }
    EmitByte( &segment[CODESEG], 8 + currentArgOffset );
    currentArgOffset += 4;
    return;
}

// ret just leaves something on the op stack
if ( !strncmp( token, "RET", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_LEAVE );
    instructionCount++;
    EmitInt( &segment[CODESEG], 8 + currentLocals + currentArgs );
    return;
}

// pop is needed to discard the return value of
// a function
if ( !strncmp( token, "pop", 3 ) ) {
    EmitByte( &segment[CODESEG], OP_POP );
    instructionCount++;
    return;
}
...

An example of using the string table approach for this problem is discussed in the Other Applications section.
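As a rough illustration of the shape such a replacement takes, the chain of strncmp calls can be folded into one table lookup followed by a switch on the returned index. The sketch below uses an ordinary linear scan as a stand-in for the string table, and every name in it (`keyword`, `keywords`, `match_keyword`, `opcode_for_token`) is hypothetical rather than taken from Quake III or the tracer project.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum keyword { KW_NONE = -1, KW_CALL, KW_ARG, KW_RET, KW_POP };

struct keyword_entry {
    const char *text;
    size_t length;
};

/* Entries are listed in the same order as enum keyword. */
static const struct keyword_entry keywords[] = {
    { "CALL", 4 },
    { "ARG", 3 },
    { "RET", 3 },
    { "pop", 3 },
};

/*
 * Linear stand-in for the table lookup: reports which keyword is a prefix
 * of `token` and how many characters of the token it accounts for.
 */
enum keyword
match_keyword(const char *token, size_t *matched)
{
    size_t i;

    assert(token != NULL && matched != NULL);

    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++) {
        if (strncmp(token, keywords[i].text, keywords[i].length) == 0) {
            *matched = keywords[i].length;
            return (enum keyword)i;
        }
    }
    *matched = 0;
    return KW_NONE;
}

/* One lookup, then a switch, instead of a ladder of strncmp() calls. */
const char *
opcode_for_token(const char *token)
{
    size_t matched;

    switch (match_keyword(token, &matched)) {
    case KW_CALL: return "OP_CALL";
    case KW_ARG:  return "OP_ARG";
    case KW_RET:  return "OP_LEAVE";
    case KW_POP:  return "OP_POP";
    default:      return NULL;
    }
}
```

The matched-character count plays the same role as NumberOfMatchedCharacters in STRING_MATCH: a parser can use it to skip past the keyword and carry on with whatever follows it.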

The Proposed Interface

Let’s take a look at the interface we’re proposing, the IsPrefixOfStringInTable function, that this article is based upon:

The IsPrefixOfStringInTable Function

//
// Our string table index is simply a char, with -1 indicating no match found.
//

typedef CHAR STRING_TABLE_INDEX;
#define NO_MATCH_FOUND -1

typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_STRING_IN_TABLE)(
    _In_ PSTRING_TABLE StringTable,
    _In_ PSTRING String,
    _Out_opt_ PSTRING_MATCH StringMatch
    );
typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE;

IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/

All implementations discussed in this article adhere to that function signature. The STRING_TABLE structure will be discussed shortly.

The STRING_MATCH Structure

The STRING_MATCH structure is used to optionally communicate information about the prefix match back to the caller. The index and characters matched fields are often very useful when using the string table for text parsing; see the other applications section below for an example.

The structure is defined as follows:

//
// This structure is used to communicate matches back to the caller.
//

typedef struct _STRING_MATCH {

    //
    // Index of the match.
    //

    BYTE Index;

    //
    // Number of characters matched.
    //

    BYTE NumberOfMatchedCharacters;

    //
    // Pad out to 8-bytes.
    //

    USHORT Padding[3];

    //
    // Pointer to the string that was matched.  The underlying buffer will
    // stay valid for as long as the STRING_TABLE struct persists.
    //

    PSTRING String;

} STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH;
C_ASSERT(sizeof(STRING_MATCH) == 16);

The Test Data

Instead of using some arbitrary Python module names, this article is going to focus on a string table constructed out of a set of 16 strings that represent reserved names of the NTFS file system, at least when it was first released way back in the early 90s.

This list is desirable as it has good distribution of characters, there is a good mix of both short and long entries, plus one oversized one ($INDEX_ALLOCATION, which clocks in at 17 characters), and almost all strings lead with a common character (the dollar sign), preventing a simple first character optimization used by the initial version of the StringTable component I wrote in 2016.

So the scenario we’ll be emulating, in this case, is that we’ve just been passed a filename for creation, and we need to check if it prefix matches any of the reserved names.

Here’s the full list of NTFS names we’ll be using. We’re assuming 8-bit ASCII encoding (no UTF-8) and case-sensitive matching. (If this were actually the NT kernel, we’d need to use wide characters with UTF-16 encoding, and be case-insensitive.)

NTFS Reserved Names

$AttrDef
$BadClus
$Bitmap
$Boot
$Extend
$LogFile
$MftMirr
$Mft
$Secure
$UpCase
$Volume
$Cairo
$INDEX_ALLOCATION
$DATA
????
.

The ordering is important in certain cases. For example, when you have overlapping strings, such as $MftMirr and $Mft, you should put the longest strings first. Our routine terminates upon the first successful prefix match, so if a longer string resided after a shorter one that it starts with, it would never get detected.
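The effect of ordering is easy to demonstrate with any first-match-wins search. The following is a minimal sketch using a naive linear scan; `first_prefix_match` and the two arrays are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The same two overlapping names, in both possible orders. */
const char *const longest_first[]  = { "$MftMirr", "$Mft" };
const char *const shortest_first[] = { "$Mft", "$MftMirr" };

/* Returns the first table entry that is a prefix of `str`, or NULL. */
const char *
first_prefix_match(const char *const *table, size_t count, const char *str)
{
    size_t i;

    assert(table != NULL && str != NULL);

    for (i = 0; i < count; i++) {
        size_t n = strlen(table[i]);

        /* strncmp() fails if `str` ends before all n bytes are compared. */
        if (strncmp(str, table[i], n) == 0) {
            return table[i];
        }
    }
    return NULL;
}
```

Searching for "$MftMirr" in `longest_first` reports "$MftMirr", whereas the same search in `shortest_first` stops at "$Mft" and never reaches the longer entry.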

Let’s review some guiding design requirements and cover some of the design decisions I made, which should help shape your understanding of the implementation.

Requirements and Design Decisions

The STRING struct will be used to capture incoming search strings as well as the representation of any strings registered in the table (or more accurately, in the corresponding StringArray structure associated with the string table).

//
// The STRING structure used by the NT kernel.  Our STRING_ARRAY structure
// relies on an array of these structures.  We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//

typedef struct _STRING {
    USHORT Length;
    USHORT MaximumLength;
    ULONG  Padding;
    PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;

The design should optimize for string lengths less than or equal to 16. Lengths greater than 16 are permitted, up to 128 bytes, but they incur more overhead during the prefix lookup.
The design should prioritize the fast-path code where there is no match for a given search string. Being able to terminate the search as early as possible is ideal.
The performance hits taken by unaligned data access are non-negligible, especially when dealing with XMM/YMM loads. Take special care with alignment constraints and make sure that everything under our control is aligned on a suitable boundary.

Note

The only thing we can’t really control in the real world is the alignment of the incoming search string buffer, which will often be at undesirable alignments like 2, 4, 6, etc. Our test program explicitly aligns the incoming search strings on 32-byte boundaries to avoid the penalties associated with unaligned access.

The string table is geared toward a single-shot build. Once you’ve created it with a given string array or used a delimited environment variable, that’s it. There are no AddString() or RemoveString() routines. The order you provided the strings in will be the same order the table uses—no re-ordering will be done. Thus, for prefix matching purposes, if two strings share a common prefix, the longer one should go first, as the prefix search routine will check it first.

Only single matches are performed: the routine returns the first entry that qualifies as a prefix match (the target string in the table has a length less than or equal to that of the search string, and all of its characters match). There is no support for obtaining multiple matches; if you’ve constructed your string tables properly (no duplicate or incorrectly-ordered overlapping fields), you shouldn’t need to.

So, to summarize, the design guidelines are as follows:

Prioritize fast-path exit in the non-matched case. (I refer to this as negative matching in a lot of places.)
Optimize for up to 16 string slots, where each slot has up to 16 characters, ideally. It can have up to 128 in total; however, any bytes outside of the first sixteen live in the string array structure supporting the string table (accessible via pStringArray).
If a slot is longer than 16 characters, optimize for the assumption that it won’t be that much longer. For instance, assume a string of length 18 bytes is more common than 120 bytes.

The Data Structures

The primary data structure employed by this solution is the STRING_TABLE structure. It is composed of supporting structures: STRING_SLOT, SLOT_INDEX, and SLOT_LENGTHS, and either embeds or points to the originating STRING_ARRAY structure from which it was created.

STRING_TABLE

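Let’s review the STRING_TABLE structure first and then touch on the supporting structures. It is shown in Cutler Normal Form, then in Kernel Normal Form, and finally as the MASM definition used by the assembly routines.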

//
// The STRING_TABLE struct is an optimized structure for testing whether a
// prefix entry for a string is in a table, with the expectation that the
// strings being compared will be relatively short (ideally <= 16 characters),
// and the table of string prefixes to compare to will be relatively small
// (ideally <= 16 strings).
//
// The overall goal is to be able to prefix match a string with the lowest
// possible (amortized) latency.  Fixed-size, memory-aligned character arrays,
// and SIMD instructions are used to try and achieve this.
//

typedef struct _STRING_TABLE {

    //
    // A slot where each individual element contains a uniquely-identifying
    // letter, with respect to the other strings in the table, of each string
    // in an occupied slot.
    //

    STRING_SLOT UniqueChars;

    //
    // (16 bytes consumed.)
    //

    //
    // For each unique character identified above, the following structure
    // captures the 0-based index of that character in the underlying string.
    // This is used as an input to vpshufb to rearrange the search string's
    // characters such that it can be vpcmpeqb'd against the unique characters
    // above.
    //

    SLOT_INDEX UniqueIndex;

    //
    // (32 bytes consumed.)
    //

    //
    // Length of the underlying string in each slot.
    //

    SLOT_LENGTHS Lengths;

    //
    // (48 bytes consumed, aligned at 16 bytes.)
    //

    //
    // Pointer to the STRING_ARRAY associated with this table, which we own
    // (we create it and copy the caller's contents at creation time and
    // deallocate it when we get destroyed).
    //
    // N.B.  We use pStringArray here instead of StringArray because the
    //       latter is a field name at the end of the struct.
    //
    //

    PSTRING_ARRAY pStringArray;

    //
    // (56 bytes consumed, aligned at 8 bytes.)
    //

    //
    // String table flags.
    //

    STRING_TABLE_FLAGS Flags;

    //
    // (60 bytes consumed, aligned at 4 bytes.)
    //

    //
    // A 16-bit bitmap indicating which slots are occupied.
    //

    USHORT OccupiedBitmap;

    //
    // A 16-bit bitmap indicating which slots have strings longer than 16 chars.
    //

    USHORT ContinuationBitmap;

    //
    // (64 bytes consumed, aligned at 64 bytes.)
    //

    //
    // The 16-element array of STRING_SLOT structs.  We want this to be aligned
    // on a 64-byte boundary, and it consumes 256-bytes of memory.
    //

    STRING_SLOT Slots[16];

    //
    // (320 bytes consumed, aligned at 64 bytes.)
    //

    //
    // We want the structure size to be a power of 2 such that an even number
    // can fit into a 4KB page (and reducing the likelihood of crossing page
    // boundaries, which complicates SIMD boundary handling), so we have an
    // extra 192-bytes to play with here.  The CopyStringArray() routine is
    // special-cased to allocate the backing STRING_ARRAY structure plus the
    // accommodating buffers in this space if it can fit.
    //
    // (You can test whether or not this occurred by checking the invariant
    //  `StringTable->pStringArray == &StringTable->StringArray`, if this
    //  is true, the array was allocated within this remaining padding space.)
    //

    union {
        STRING_ARRAY StringArray;
        CHAR Padding[192];
    };

} STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE;

//
// Assert critical size and alignment invariants at compile time.
//

C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots)   == 64);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320);
C_ASSERT(sizeof(STRING_TABLE) == 512);

struct string_table {
    char                       unique_chars[16];
    unsigned char              unique_index[16];
    unsigned char              slot_lengths[16];
    struct string_array       *string_array_ptr;
    struct string_table_flags  flags;
    unsigned short             occupied_bitmap;
    unsigned short             continuation_bitmap;
    char                       slots[16][16];
    union {
        struct string_array    string_array;
        char                   padding[192];
    } u;
};
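The Kernel Normal Form layout can be given the same compile-time checks as the other two versions. The sketch below is self-contained, so it has to invent stand-ins for the types the struct refers to: `struct string_table_flags`, `struct string`, and `struct string_array` here are simplified guesses, the padding is taken to be 192 bytes to agree with the 512-byte size asserted for the other versions, and a 64-bit target with 8-byte pointers is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for types defined elsewhere in the project. */
struct string_table_flags { unsigned int unused; };

struct string {
    unsigned short length;
    unsigned short maximum_length;
    unsigned int   padding;
    char          *buffer;
};

struct string_table;

struct string_array {
    unsigned short       size_in_quadwords;
    unsigned short       number_of_elements;
    unsigned short       minimum_length;
    unsigned short       maximum_length;
    struct string_table *string_table;
    struct string        strings[1];
};

struct string_table {
    char                       unique_chars[16];
    unsigned char              unique_index[16];
    unsigned char              slot_lengths[16];
    struct string_array       *string_array_ptr;
    struct string_table_flags  flags;
    unsigned short             occupied_bitmap;
    unsigned short             continuation_bitmap;
    char                       slots[16][16];
    union {
        struct string_array    string_array;
        char                   padding[192];
    } u;
};

/* Mirror the offsets asserted by the C_ASSERT and .erre checks. */
static_assert(offsetof(struct string_table, unique_index) == 16, "unique_index");
static_assert(offsetof(struct string_table, slot_lengths) == 32, "slot_lengths");
static_assert(offsetof(struct string_table, string_array_ptr) == 48, "pointer");
static_assert(offsetof(struct string_table, slots) == 64, "slots");
static_assert(offsetof(struct string_table, u) == 320, "trailing union");
static_assert(sizeof(struct string_table) == 512, "struct size");
```

If a field is resized or reordered, the build fails at the offending assertion instead of the SIMD loads silently reading the wrong row.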

STRING_TABLE struct
    UniqueChars         CHAR 16 dup  (?)
    UniqueIndex         BYTE 16 dup  (?)
    Lengths             BYTE 16 dup  (?)
    pStringArray        PSTRING_ARRAY ?
    Flags               ULONG         ?
    OccupiedBitmap      USHORT        ?
    ContinuationBitmap  USHORT        ?
    Slots               STRING_SLOT 16 dup ({ })
    union
        StringArray STRING_ARRAY {?}
        Padding CHAR 192 dup (?)
    ends
STRING_TABLE ends

;
; Assert our critical field offsets and structure size as per the same approach
; taken in StringTable.h.
;

.erre (STRING_TABLE.UniqueIndex  eq  16), @CatStr(<UnexpectedOffset STRING_TABLE.UniqueIndex: >, %(STRING_TABLE.UniqueIndex))
.erre (STRING_TABLE.Lengths      eq  32), @CatStr(<UnexpectedOffset STRING_TABLE.Lengths: >, %(STRING_TABLE.Lengths))
.erre (STRING_TABLE.pStringArray eq  48), @CatStr(<UnexpectedOffset STRING_TABLE.pStringArray: >, %(STRING_TABLE.pStringArray))
.erre (STRING_TABLE.Slots        eq  64), @CatStr(<UnexpectedOffset STRING_TABLE.Slots: >, %(STRING_TABLE.Slots))
.erre (STRING_TABLE.Padding      eq 320), @CatStr(<UnexpectedOffset STRING_TABLE.Padding: >, %(STRING_TABLE.Padding))
.erre (size STRING_TABLE eq 512), @CatStr(<IncorrectStructSize: STRING_TABLE: >, %(size STRING_TABLE))

PSTRING_TABLE typedef ptr STRING_TABLE

;
; CamelCase typedefs that are nicer to work with in assembly
; than their uppercase counterparts.
;

StringTable typedef STRING_TABLE

The following diagram depicts an in-memory representation of the STRING_TABLE structure using our NTFS reserved prefix names. It is created via the CreateStringTable routine, which we feature in the appendix of this article.

In order to improve the uniqueness of the unique characters selected from each string, the strings are sorted by length during string table creation and enumerated in this order while identifying unique characters. The rationale behind this is that shorter strings simply have fewer characters to choose from, while longer strings have more to choose from. If we identified unique characters in the order they appear in the string table, we may have longer strings preceding shorter ones, such that toward the end of the table, nothing unique can be extracted from the short ones.

The utility of the string table is maximized by ensuring a unique character is selected from every string; thus, we sort by length first. Note that the uniqueness is actually determined by offset:character pairs, with the offsets becoming the indices stored in the UniqueIndex slot. If you trace through the diagram above, you’ll see that the unique character in each slot matches the character in the corresponding string slot, indicated by the underlying index.
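One way to picture the selection is as a greedy pass over the strings in length order, claiming the first offset:character pair that no earlier string has claimed. The sketch below does exactly that in portable C. It is an assumption about the general approach rather than a copy of the project's routine, and `unique_pick` and `pick_unique_chars` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 16

struct unique_pick {
    unsigned char ch;    /* The chosen character. */
    unsigned char index; /* Its 0-based offset within the string. */
    int found;           /* Nonzero if a pair could be claimed. */
};

/*
 * Picks one (offset, character) pair per string such that no two strings
 * share a pair.  Returns the number of strings that received a pair.
 */
size_t
pick_unique_chars(const char *const *strings, size_t count,
                  struct unique_pick *picks)
{
    /* seen[offset][c] is nonzero once the pair (offset, c) is claimed. */
    unsigned char seen[SLOT_COUNT][256];
    size_t order[SLOT_COUNT];
    size_t i, j, found = 0;

    assert(count <= SLOT_COUNT);
    memset(seen, 0, sizeof(seen));

    /* Stable insertion sort of the slot numbers by length, ascending. */
    for (i = 0; i < count; i++) {
        size_t len = strlen(strings[i]);
        for (j = i; j > 0 && strlen(strings[order[j - 1]]) > len; j--) {
            order[j] = order[j - 1];
        }
        order[j] = i;
    }

    /* Shortest strings choose first, as they have the fewest options. */
    for (i = 0; i < count; i++) {
        size_t slot = order[i];
        const unsigned char *s = (const unsigned char *)strings[slot];
        size_t len = strlen(strings[slot]);
        size_t offset;

        picks[slot].found = 0;
        for (offset = 0; offset < len && offset < SLOT_COUNT; offset++) {
            if (!seen[offset][s[offset]]) {
                seen[offset][s[offset]] = 1;
                picks[slot].ch = s[offset];
                picks[slot].index = (unsigned char)offset;
                picks[slot].found = 1;
                found++;
                break;
            }
        }
    }
    return found;
}
```

Given "$Boot", "$Mft" and "$Bitmap", the shortest string ("$Mft") claims the dollar sign at offset 0, "$Boot" then has to settle for 'B' at offset 1, and "$Bitmap" ends up with 'i' at offset 2. A greedy pass like this lets the first string grab a poorly discriminating character, so a production routine may well choose more carefully.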

STRING_ARRAY

The string array captures a raw array representation of the underlying strings making up the string table. It is either embedded within the padding area at the end of the string table, or a separate allocation is made during string table creation. The main interface to creating a string table is via a STRING_ARRAY structure. The helper functions, CreateStringTableFromDelimitedString and CreateStringTableFromDelimitedEnvironmentVariable, simply break down their input into a STRING_ARRAY representation first before calling CreateStringTable.

typedef struct _Struct_size_bytes_(SizeInQuadwords<<3) _STRING_ARRAY {

    //
    // Size of the structure, in quadwords.  Why quadwords?  It allows us to
    // keep this size field to a USHORT, which helps with the rest of the
    // alignment in this struct (we want the STRING Strings[] array to start
    // on an 8-byte boundary).
    //
    // N.B.  We can't express the exact field size in the SAL annotation
    //       below, because the array of buffer sizes are inexpressible;
    //       however, we know the maximum length, so we can use the implicit
    //       invariant that the total buffer size can't exceed whatever num
    //       elements * max size is.
    //

    _Field_range_(<=, (
        sizeof(struct _STRING_ARRAY) +
        ((NumberOfElements - 1) * sizeof(STRING)) +
        (MaximumLength * NumberOfElements)
    ) >> 3)
    USHORT SizeInQuadwords;

    //
    // Number of elements in the array.
    //

    USHORT NumberOfElements;

    //
    // Minimum and maximum lengths for the String->Length fields.  Optional.
    //

    USHORT MinimumLength;
    USHORT MaximumLength;

    //
    // A pointer to the STRING_TABLE structure that "owns" us.
    //

    struct _STRING_TABLE *StringTable;

    //
    // The string array.  Number of elements in the array is governed by the
    // NumberOfElements field above.
    //

    STRING Strings[ANYSIZE_ARRAY];

} STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;
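The quadword trick is simple arithmetic: a 16-bit count of 8-byte units can describe a structure of up to 65,535 × 8 = 524,280 bytes, eight times more than a 16-bit byte count could. A minimal sketch of the conversions follows; the helper names are hypothetical, and rounding up to a whole quadword is an assumption rather than something taken from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a byte count up to a whole number of 8-byte quadwords. */
size_t
bytes_to_quadwords(size_t bytes)
{
    assert(bytes <= SIZE_MAX - 7);
    return (bytes + 7) >> 3;
}

/* Converts a quadword count back into bytes. */
size_t
quadwords_to_bytes(size_t quadwords)
{
    assert(quadwords <= (SIZE_MAX >> 3));
    return quadwords << 3;
}

/* Returns 1 if a structure of `bytes` bytes fits a 16-bit quadword field. */
int
fits_in_ushort_quadwords(size_t bytes)
{
    return bytes_to_quadwords(bytes) <= 0xFFFF;
}
```

For example, a 33-byte structure is recorded as 5 quadwords (40 bytes), and the largest size that still fits the USHORT field is 524,280 bytes.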

Note

The odd-looking macros _Struct_size_bytes_ and _Field_range_ are SAL Annotations. There’s a neat deck called Engineering Better Software at Microsoft which captures some interesting details about SAL, for those wanting to read more. The Code Analysis engine that uses the annotations is built upon the Z3 Theorem Prover, which is a fascinating little project in its own right.

And finally, we’re left with the smaller helper structs that we use to encapsulate the various innards of the string table. (I use unions that pair an XMMWORD representation with the underlying byte/character representation, as I personally find it makes the resulting C code a bit nicer. XMMWORD is a typedef of __m128i, which represents an XMM register.)

STRING_SLOT

//
// String tables are composed of a 16 element array of 16 byte string "slots",
// which represent a unique character (with respect to other strings in the
// table) for a string in a given slot index. The STRING_SLOT structure
// provides a convenient wrapper around this construct.
//

typedef union DECLSPEC_ALIGN(16) _STRING_SLOT {
    XMMWORD CharsXmm;
    CHAR Char[16];
} STRING_SLOT, *PSTRING_SLOT, **PPSTRING_SLOT;
C_ASSERT(sizeof(STRING_SLOT) == 16);

SLOT_LENGTHS

//
// A 16 element array of 1 byte unsigned integers, used to capture the length
// of each string slot in a single XMM 128-bit register.
//

typedef union DECLSPEC_ALIGN(16) _SLOT_LENGTHS {
    XMMWORD SlotsXmm;
    BYTE Slots[16];
} SLOT_LENGTHS, *PSLOT_LENGTHS, **PPSLOT_LENGTHS;
C_ASSERT(sizeof(SLOT_LENGTHS) == 16);
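No definition of SLOT_INDEX, the type of the UniqueIndex field, appears in this section. Presumably it follows the same pattern as the other helper unions. The sketch below shows what such a union could look like; the member names `IndexXmm` and `Index` are guesses modelled on SLOT_LENGTHS, and the fallback XMMWORD type exists only so the example compiles on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
typedef __m128i XMMWORD;
#else
/* Fallback so the sketch still compiles where SSE2 is unavailable. */
typedef struct { _Alignas(16) unsigned char Bytes[16]; } XMMWORD;
#endif

typedef unsigned char BYTE;

/*
 * Hypothetical: sixteen 1-byte offsets, one per slot, each giving the
 * position of that slot's unique character within its string.
 */
typedef union _SLOT_INDEX {
    XMMWORD IndexXmm;
    BYTE Index[16];
} SLOT_INDEX, *PSLOT_INDEX;

static_assert(sizeof(SLOT_INDEX) == 16, "SLOT_INDEX must fill one XMM register");
static_assert(_Alignof(SLOT_INDEX) == 16, "SLOT_INDEX must be 16-byte aligned");
```

Whatever the real member names are, the property that matters is that all sixteen indices occupy a single aligned 16-byte row, so they can be fed to vpshufb as one operand.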

String Table Construction

The CreateSingleStringTable routine is responsible for the construction of a new STRING_TABLE. It is here that we identify the unique set of characters (and their indices) to store in the first two fields of the string table.

//
// Define private types used by this module.
//

typedef struct _LENGTH_INDEX_ENTRY {
    BYTE Length;
    BYTE Index;
} LENGTH_INDEX_ENTRY;
typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY;

typedef struct _LENGTH_INDEX_TABLE {
    LENGTH_INDEX_ENTRY Entry[16];
} LENGTH_INDEX_TABLE;
typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE;

typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP {
    YMMWORD Ymm;
    XMMWORD Xmm[2];
    LONG Bits[(256 / (4 << 3))];  // 8
} CHARACTER_BITMAP;
C_ASSERT(sizeof(CHARACTER_BITMAP) == 32);
typedef CHARACTER_BITMAP *PCHARACTER_BITMAP;

typedef struct _SLOT_BITMAPS {
    CHARACTER_BITMAP Bitmap[16];
} SLOT_BITMAPS;
typedef SLOT_BITMAPS *PSLOT_BITMAPS;

//
// Function implementation.
//

_Use_decl_annotations_
PSTRING_TABLE
CreateSingleStringTable(
    PRTL Rtl,
    PALLOCATOR StringTableAllocator,
    PALLOCATOR StringArrayAllocator,
    PSTRING_ARRAY SourceStringArray,
    BOOL CopyArray
    )
/*++

Routine Description:

    Allocates space for a STRING_TABLE structure using the provided allocators,
    then initializes it using the provided STRING_ARRAY.  If CopyArray is set
    to TRUE, the routine will copy the string array such that the caller is
    free to destroy it after the table has been successfully created.  If it
    is set to FALSE and SourceStringArray->StringTable has a non-NULL value,
    it is assumed that sufficient space has already been allocated for the
    string table and this pointer will be used to initialize the rest of the
    structure.

    DestroyStringTable() must be called against the returned PSTRING_TABLE when
    the structure is no longer needed in order to ensure resources are released.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which
        will be used for creating the STRING_TABLE.

    StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which
        may be used to create the STRING_ARRAY if it cannot fit within the
        padding of the STRING_TABLE structure.  This is kept separate from the
        StringTableAllocator due to the stringent alignment requirements of the
        string table.

    SourceStringArray - Supplies a pointer to an initialized STRING_ARRAY
        structure that contains the STRING structures that are to be added to
        the table.

    CopyArray - Supplies a boolean value indicating whether or not the
        StringArray structure should be deep-copied during creation.  This is
        typically set when the caller wants to be able to free the structure
        as soon as this call returns (or can't guarantee it will persist past
        this function's invocation, i.e. if it was stack allocated).

Return Value:

    A pointer to a valid PSTRING_TABLE structure on success, NULL on failure.
    Call DestroyStringTable() on the returned structure when it is no longer
    needed in order to ensure resources are cleaned up appropriately.

--*/
{
    BYTE Byte;
    BYTE Count;
    BYTE Index;
    BYTE Length;
    BYTE NumberOfElements;
    ULONG HighestBit;
    ULONG OccupiedMask;
    PULONG Bits;
    USHORT OccupiedBitmap;
    USHORT ContinuationBitmap;
    PSTRING_TABLE StringTable;
    PSTRING_ARRAY StringArray;
    PSTRING String;
    PSTRING_SLOT Slot;
    STRING_SLOT UniqueChars;
    SLOT_INDEX UniqueIndex;
    SLOT_INDEX LengthIndex;
    SLOT_LENGTHS Lengths;
    LENGTH_INDEX_TABLE LengthIndexTable;
    PCHARACTER_BITMAP Bitmap;
    SLOT_BITMAPS SlotBitmaps;
    PLENGTH_INDEX_ENTRY Entry;

    //
    // Validate arguments.
    //

    if (!ARGUMENT_PRESENT(StringTableAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(StringArrayAllocator)) {
        return NULL;
    }

    if (!ARGUMENT_PRESENT(SourceStringArray)) {
        return NULL;
    }

    if (SourceStringArray->NumberOfElements == 0) {
        return NULL;
    }

    //
    // Copy the incoming string array if applicable.
    //

    if (CopyArray) {

        StringArray = CopyStringArray(
            StringTableAllocator,
            StringArrayAllocator,
            SourceStringArray,
            FIELD_OFFSET(STRING_TABLE, StringArray),
            sizeof(STRING_TABLE),
            &StringTable
        );

        if (!StringArray) {
            return NULL;
        }

    } else {

        //
        // We're not copying the array, so initialize StringArray to point at
        // the caller's SourceStringArray, and StringTable to point at the
        // array's StringTable field (which will be non-NULL if sufficient
        // space has been allocated).
        //

        StringArray = SourceStringArray;
        StringTable = StringArray->StringTable;

    }

    //
    // If StringTable has no value, we've either been called with CopyArray set
    // to FALSE, or CopyStringArray() wasn't able to allocate sufficient space
    // for both the table and itself.  Either way, we need to allocate space for
    // the table.
    //

    if (!StringTable) {

        StringTable = (PSTRING_TABLE)(
            StringTableAllocator->AlignedCalloc(
                StringTableAllocator->Context,
                1,
                sizeof(STRING_TABLE),
                STRING_TABLE_ALIGNMENT
            )
        );

        if (!StringTable) {
            return NULL;
        }
    }

    //
    // Make sure the fields that are sensitive to alignment are, in fact,
    // aligned correctly.
    //

    if (!AssertStringTableFieldAlignment(StringTable)) {
        DestroyStringTable(StringTableAllocator,
                           StringArrayAllocator,
                           StringTable);
        return NULL;
    }

    //
    // At this point, we have copied the incoming StringArray if necessary,
    // and we've allocated sufficient space for the StringTable structure.
    // Enumerate over all of the strings, set the continuation bit if the
    // length > 16, set the relevant slot length, set the relevant unique
    // character entry, then move the first 16-bytes of the string into the
    // relevant slot via an aligned SSE mov.
    //

    //
    // Initialize pointers and counters, clear stack-based structures.
    //

    Slot = StringTable->Slots;
    String = StringArray->Strings;

    OccupiedBitmap = 0;
    ContinuationBitmap = 0;
    NumberOfElements = (BYTE)StringArray->NumberOfElements;
    UniqueChars.CharsXmm = _mm_setzero_si128();
    UniqueIndex.IndexXmm = _mm_setzero_si128();
    LengthIndex.IndexXmm = _mm_setzero_si128();

    //
    // Set all the slot lengths to 0x7f up front instead of defaulting
    // to zero.  This allows for simpler logic when searching for a prefix
    // string, which involves broadcasting a search string's length to an XMM
    // register, then doing _mm_cmpgt_epi8() against the lengths array and
    // the string length.  If we left the lengths as 0 for unused slots, they
    // would get included in the resulting comparison register (i.e. the high
    // bits would be set to 1), so we'd have to do a subsequent masking of
    // the result at some point using the OccupiedBitmap.  By defaulting the
    // lengths to 0x7f, we ensure they'll never get included in any cmpgt-type
    // SIMD matches.  (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8()
    // intrinsic assumes packed signed integers.)
    //

    Lengths.SlotsXmm = _mm_set1_epi8(0x7f);

    ZeroStruct(LengthIndexTable);
    ZeroStruct(SlotBitmaps);

    for (Count = 0; Count < NumberOfElements; Count++) {

        XMMWORD CharsXmm;

        //
        // Set the string length for the slot.
        //

        Length = Lengths.Slots[Count] = (BYTE)String->Length;

        //
        // Set the appropriate bit in the continuation bitmap if the string is
        // longer than 16 bytes.
        //

        if (Length > 16) {
            ContinuationBitmap |= (USHORT)(1 << Count);
        }

        if (Count == 0) {

            Entry = &LengthIndexTable.Entry[0];
            Entry->Index = 0;
            Entry->Length = Length;

        } else {

            //
            // Perform a linear scan of the length-index table in order to
            // identify an appropriate insertion point.
            //

            for (Index = 0; Index < Count; Index++) {
                if (Length < LengthIndexTable.Entry[Index].Length) {
                    break;
                }
            }

            if (Index != Count) {

                //
                // New entry doesn't go at the end of the table, so shuffle
                // everything else down.
                //

                Rtl->RtlMoveMemory(&LengthIndexTable.Entry[Index + 1],
                                   &LengthIndexTable.Entry[Index],
                                   (Count - Index) * sizeof(*Entry));
            }

            Entry = &LengthIndexTable.Entry[Index];
            Entry->Index = Count;
            Entry->Length = Length;
        }

        //
        // Copy the first 16-bytes of the string into the relevant slot.  We
        // have taken care to ensure everything is 16-byte aligned by this
        // stage, so we can use SSE intrinsics here.
        //

        CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);
        _mm_store_si128(&(*Slot).CharsXmm, CharsXmm);

        //
        // Advance our pointers.
        //

        ++Slot;
        ++String;

    }

    //
    // Store the slot lengths.
    //

    _mm_store_si128(&(StringTable->Lengths.SlotsXmm), Lengths.SlotsXmm);

    //
    // Loop through the strings in order of shortest to longest and construct
    // the uniquely-identifying character table with corresponding index.
    //


    for (Count = 0; Count < NumberOfElements; Count++) {
        Entry = &LengthIndexTable.Entry[Count];
        Length = Entry->Length;
        Slot = &StringTable->Slots[Entry->Index];

        //
        // Iterate over each character in the slot and find the first one
        // without a corresponding bit set.
        //

        for (Index = 0; Index < Length; Index++) {
            Bitmap = &SlotBitmaps.Bitmap[Index];
            Bits = (PULONG)&Bitmap->Bits[0];
            Byte = Slot->Char[Index];
            if (!BitTestAndSet(Bits, Byte)) {
                break;
            }
        }

        UniqueChars.Char[Count] = Byte;
        UniqueIndex.Index[Count] = Index;
        LengthIndex.Index[Count] = Entry->Index;
    }

    //
    // Loop through the elements again such that the unique chars are stored
    // in the order they appear in the table.
    //

    for (Count = 0; Count < NumberOfElements; Count++) {
        for (Index = 0; Index < NumberOfElements; Index++) {
            if (LengthIndex.Index[Index] == Count) {
                StringTable->UniqueChars.Char[Count] = UniqueChars.Char[Index];
                StringTable->UniqueIndex.Index[Count] = UniqueIndex.Index[Index];
                break;
            }
        }
    }

    //
    // Generate and store the occupied bitmap.  Each bit, from low to high,
    // corresponds to the index of a slot.  When set, the slot is occupied.
    // When clear, it is not.  So, fill bits from the highest bit set down.
    //

    HighestBit = (1 << (StringArray->NumberOfElements-1));
    OccupiedMask = _blsmsk_u32(HighestBit);
    StringTable->OccupiedBitmap = (USHORT)OccupiedMask;

    //
    // Store the continuation bitmap.
    //

    StringTable->ContinuationBitmap = (USHORT)(ContinuationBitmap);

    //
    // Wire up the string array to the table.
    //

    StringTable->pStringArray = StringArray;

    //
    // And we're done, return the table.
    //

    return StringTable;
}
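
Two details from the routine above are easy to try in isolation: the 0x7f default for unused slot lengths, and the occupied bitmap. The following is a small sketch assuming an SSE2-capable x86 compiler. The function names are hypothetical, and occupied_mask uses a plain shift-and-subtract instead of _blsmsk_u32 so that it builds without BMI1; it should produce the same low-bit mask.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2
#include <string.h>

// Returns a 16-bit mask with bit i set when slot i holds a string whose
// length is <= search_length, i.e. a slot that could still prefix match.
// `lengths` must hold 16 entries; unused entries are expected to be 0x7f.
static unsigned slots_not_longer_than(const unsigned char lengths[16],
                                      unsigned char search_length)
{
    __m128i slot_lengths, search, too_long;

    // The compare is signed, so the search length must stay below 0x7f.
    assert(search_length < 0x7f);

    slot_lengths = _mm_loadu_si128((const __m128i *)lengths);
    search = _mm_set1_epi8((char)search_length);

    // 0xff in every lane where the slot length exceeds the search length.
    too_long = _mm_cmpgt_epi8(slot_lengths, search);
    return ~(unsigned)_mm_movemask_epi8(too_long) & 0xffffu;
}

// Low `count` bits set, for count in 1..16.
static unsigned occupied_mask(unsigned count)
{
    assert(count >= 1 && count <= 16);
    return (1u << count) - 1u;
}
```

With three strings of length 4, 5 and 7 and a 5-byte search string, the 0x7f-filled array yields 0x3 (slots 0 and 1 only). The same array with zeros in the unused slots yields 0xfffb, which is why a zero default would need an extra masking step against the occupied bitmap.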

The Benchmark

The performance comparison graphs in the subsequent sections were generated in Excel, using CSV data output by the creatively-named program StringTable2BenchmarkExe.

Modern CPUs are fast, and timing is challenging, especially when you’re dealing with CPU cycle comparisons. No approach is perfect. Here’s what I settled on:

The benchmark utility has #pragma optimize("", off) at the start of the file, which disables global optimizations, even in release (optimized) builds. This prevents the compiler from doing clever things regarding the scheduling of the timestamping logic, which affects reported times.
The benchmark utility pins itself to a single core and sets its thread priority to the highest permissible value at startup. (Turbo is disabled on the computer, so the frequency is pinned to 3.68GHz.)
The benchmark utility is fed an array of function pointers and test inputs. It iterates over each test input, then iterates over each function, calling it with the test input and potentially verifying the result (some functions are included for comparison but don’t produce correct results, so they don’t have their results verified).
The test input string is copied into a local buffer aligned on a 32-byte boundary. This ensures that all test inputs are being compared fairly. (The natural alignment of the buffers varies anywhere from 2 to 512 bytes, and unaligned buffers have a significant impact on the timings.)
The function is run once, with the result captured. If verification has been requested, the result is verified. We __debugbreak() immediately if there’s a mismatch, which is handy during development.
NtDelayExecution(TRUE, 1) is called, which results in a sleep of approximately 100 nanoseconds. This forces a context switch, giving the thread a new scheduling quantum before each function is run.
The function is executed 100 times for warmup.
Timings are taken for 1000 iterations of the function using the given test input. The __rdtscp() intrinsic is used (which forces some serialization) to capture the timestamp counter before and after the iterations.
This process is repeated 100 times. The minimum time observed to perform 1000 iterations (out of 100 attempts) is captured as the function’s best time.
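
The warmup, timed-iterations and minimum-of-attempts steps can be sketched as follows. This is not the benchmark utility's code: best_ticks, bench_fn and sample_workload are hypothetical, the sketch assumes GCC or Clang on x86 (MSVC exposes __rdtscp via intrin.h instead), and it leaves out the core pinning, priority changes, NtDelayExecution call, buffer alignment and result verification described above.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  // __rdtscp on GCC/Clang

typedef int (*bench_fn)(const char *input);

// Hypothetical workload and call counter, used only to show the call pattern.
static unsigned long sample_calls;
static int sample_workload(const char *input)
{
    sample_calls++;
    return input[0] == '$';
}

// Returns the fewest timestamp-counter ticks observed for `iterations`
// back-to-back calls of fn(input), across `attempts` timed runs.
static uint64_t best_ticks(bench_fn fn, const char *input,
                           unsigned warmup, unsigned attempts,
                           unsigned iterations)
{
    uint64_t best = UINT64_MAX;
    unsigned aux, a, i;
    volatile int sink = 0;  // keeps the calls from being optimized away

    assert(attempts > 0 && iterations > 0);

    // Untimed warmup calls.
    for (i = 0; i < warmup; i++) {
        sink += fn(input);
    }

    for (a = 0; a < attempts; a++) {
        uint64_t start = __rdtscp(&aux);
        uint64_t elapsed;

        for (i = 0; i < iterations; i++) {
            sink += fn(input);
        }

        elapsed = __rdtscp(&aux) - start;
        if (elapsed < best) {
            best = elapsed;  // keep the minimum across attempts
        }
    }

    (void)sink;
    return best;
}
```

A call such as best_ticks(sample_workload, "$Bitmap", 100, 100, 1000) mirrors the counts listed above. Taking the minimum rather than the mean is a way of discarding runs that were disturbed by interrupts or other noise.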

Release vs PGO Oddities

All of the times in the graphs come from the profile-guided optimization (PGO) build of the StringTable component. The PGO build is faster than the normal release build in every case except one, where it is notably slower.

It’s… odd. I haven’t investigated it. The following graph depicts the affected function, IsPrefixOfStringInTable_1, and a few other versions for reference, showing the performance of the PGO build compared to the release build on the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".

Only that function is affected, and the problem mainly manifests with the two example test strings shown. As this routine essentially serves as one of the initial baseline implementations, it would be misleading to compare all optimized PGO versions to the abnormally slow baseline implementation. Therefore, the release and PGO timings were blended into a single CSV, and the Excel PivotTables select the minimum time for a given function and test input.

Thus, you’re always looking at the PGO timings, except for this outlier case where the release versions are faster.

The Implementations

Round 1

In this section, we’ll take a look at the various implementations I experimented with on the first pass, prior to soliciting any feedback. I figured there were a couple of ways I could present this information. First, I could hand-pick what I choose to show and hide, creating a rosy picture that makes it seem like I arrived at the fastest implementation without much effort.

Or I could show the gritty reality of how everything actually went down in chronological fashion, errors and all. And there were definitely some errors! For better or worse, I’ve chosen this route, so you’ll get to see some pretty tedious tweaks (changing a single line, for example) before the juicy stuff really kicks in.

Additionally, with the benefit of writing this section introduction retroactively, iterations 4 and 5 aren’t testing what I initially thought they were testing. I’ve left them in as is; if anything, it demonstrates the importance of only changing one thing at a time and making sure you’re testing what you think you’re testing. I’ll discuss the errors with those iterations later in the article.

C Implementations

IsPrefixOfCStrInArray

IsPrefixOfStringInTable_1 →

Let’s review the baseline implementation again, as that’s what we’re ultimately comparing ourselves against. This version enumerates the string array (and thus has a slightly different function signature than the STRING_TABLE-based functions) looking for prefix matches. No SIMD instructions are used. The timings captured should be proportional to the location of the test input string in the array. That is, it should take less time to prefix match strings that occur earlier in the array versus those that appear later.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfCStrInArray(
    PCSZ *StringArray,
    PCSZ String,
    PSTRING_MATCH Match
    )
{
    PCSZ Left;
    PCSZ Right;
    PCSZ *Target;
    ULONG Index = 0;
    ULONG Count;

    for (Target = StringArray; *Target != NULL; Target++, Index++) {
        Count = 0;
        Left = String;
        Right = *Target;

        while (*Left && *Right && *Left++ == *Right++) {
            Count++;
        }

        if (Count > 0 && !*Right) {
            if (ARGUMENT_PRESENT(Match)) {
                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)Count;
                Match->String = NULL;
            }
            return (STRING_TABLE_INDEX)Index;
        }
    }

    return NO_MATCH_FOUND;
}

IsPrefixOfStringInTable_1

← IsPrefixOfCStrInArray | IsPrefixOfStringInTable_2 →

This version is similar to the IsPrefixOfCStrInArray implementation, except it utilizes the slot length information provided by the STRING_ARRAY structure and conforms to our standard IsPrefixOfStringInTable function signature. It uses no SIMD instructions.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_1(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether the search string "starts with or is
    equal to" any string in the table.

    This routine performs a simple linear scan of the string table looking for
    a prefix match against each slot.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    BYTE Left;
    BYTE Right;
    ULONG Index;
    ULONG Count;
    PSTRING_ARRAY StringArray;
    PSTRING TargetString;

    //IACA_VC_START();

    StringArray = StringTable->pStringArray;

    if (StringArray->MinimumLength > String->Length) {
        return NO_MATCH_FOUND;
    }

    for (Count = 0; Count < StringArray->NumberOfElements; Count++) {

        TargetString = &StringArray->Strings[Count];

        if (String->Length < TargetString->Length) {
            continue;
        }

        for (Index = 0; Index < TargetString->Length; Index++) {
            Left = String->Buffer[Index];
            Right = TargetString->Buffer[Index];
            if (Left != Right) {
                break;
            }
        }

        if (Index == TargetString->Length) {

            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Count;
                Match->NumberOfMatchedCharacters = (BYTE)Index;
                Match->String = TargetString;

            }

            return (STRING_TABLE_INDEX)Count;
        }

    }

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

Benchmark 1

Here’s the performance of these two baseline routines:

That’s an interesting result! Even without using any SIMD instructions, version 1, the IsPrefixOfStringInTable_1 routine, is faster (in all but one case) than the baseline IsPrefixOfCStrInArray routine, thanks to a more sophisticated data structure.

(And really, it’s not even using the sophisticated parts of the STRING_TABLE; it’s just leveraging the fact that we’ve captured the lengths of each string in the backing STRING_ARRAY structure by virtue of using the STRING structure to wrap our strings (instead of relying on the standard NULL-terminated C string approach).)

Before we look at IsPrefixOfStringInTable_2, which is the first of the routines to use SIMD instructions, it’s helpful to know some backstory. The _2 version is based on the prefix matching routine I wrote for the first version of the StringTable component back in 2016. The layout of the STRING_TABLE struct differed in the first version; only the first character of each slot was used to do the initial exclusion (as opposed to the unique character), and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits), so the match bitmap had to be constructed slightly differently.

None of those details really apply to our second attempt at the StringTable component, detailed in this article. Our lengths are 8 bits, and we use unique characters in the initial negative match fast-path. However, the first version used an elaborate AVX2 prefix match routine geared toward matching long strings, attempting to use non-temporal streaming load instructions where possible (which would only make sense for a large number of long strings in specific cache-thrashing scenarios).

Compare our simpler implementation, IsPrefixMatch, used from version 3 onward, to the far more elaborate (and unnecessary) IsPrefixMatchAvx2:

IsPrefixMatch

FORCEINLINE
BYTE
IsPrefixMatch(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ BYTE Offset
    )
{
    PBYTE Left;
    PBYTE Right;
    BYTE Matched = 0;
    BYTE Remaining = (SearchString->Length - Offset) + 1;

    Left = (PBYTE)RtlOffsetToPointer(SearchString->Buffer, Offset);
    Right = (PBYTE)RtlOffsetToPointer(TargetString->Buffer, Offset);

    while (--Remaining && *Left++ == *Right++) {
        Matched++;
    }

    Matched += Offset;
    if (Matched != TargetString->Length) {
        return NO_MATCH_FOUND;
    }

    return Matched;
}
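
To show how the Offset parameter is meant to be used, here is a portable restatement with a usage example. It is a sketch under assumptions, not the routine above: CountedString, NO_MATCH and prefix_match_from are hypothetical stand-ins, and the loop is bounded by the target length, which relies on the caller having already checked that the target is no longer than the search string.

```c
#include <assert.h>
#include <stddef.h>

#define NO_MATCH 0xff  // hypothetical sentinel standing in for NO_MATCH_FOUND

// Hypothetical stand-in for STRING: a length-counted string.
typedef struct {
    unsigned short length;  // byte count; no terminating NUL is required
    const char *buffer;
} CountedString;

// Continues a prefix comparison at `offset`, assuming the caller has already
// verified the first `offset` bytes and that target->length <= search->length.
// Returns the total matched length (equal to target->length) or NO_MATCH.
static unsigned char prefix_match_from(const CountedString *search,
                                       const CountedString *target,
                                       unsigned char offset)
{
    unsigned char matched = offset;

    assert(target->length <= search->length);
    assert(offset <= target->length);

    while (matched < target->length &&
           search->buffer[matched] == target->buffer[matched]) {
        matched++;
    }

    return matched == target->length ? matched : NO_MATCH;
}
```

For a 17-byte target such as "$INDEX_ALLOCATION", a caller that has already compared the first 16 bytes via the slot passes an offset of 16, so only the final byte is compared here.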

IsPrefixMatchAvx2

The AVX2 routine is overkill, especially considering the emphasis we put on favoring short strings over longer ones in the requirements section. However, we want to put broad statements like that to the test, so let’s include it as our first SIMD implementation to see how it stacks up against the simpler versions.

FORCEINLINE
USHORT
IsPrefixMatchAvx2(
    _In_ PCSTRING SearchString,
    _In_ PCSTRING TargetString,
    _In_ USHORT Offset
    )
{
    USHORT SearchStringRemaining;
    USHORT TargetStringRemaining;
    ULONGLONG SearchStringAlignment;
    ULONGLONG TargetStringAlignment;
    USHORT CharactersMatched = Offset;

    LONG Count;
    LONG Mask;

    PCHAR SearchBuffer;
    PCHAR TargetBuffer;

    STRING_SLOT SearchSlot;

    XMMWORD SearchXmm;
    XMMWORD TargetXmm;
    XMMWORD ResultXmm;

    YMMWORD SearchYmm;
    YMMWORD TargetYmm;
    YMMWORD ResultYmm;

    SearchStringRemaining = SearchString->Length - Offset;
    TargetStringRemaining = TargetString->Length - Offset;

    SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString->Buffer, Offset);
    TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString->Buffer, Offset);

    //
    // This routine is only called in the final stage of a prefix match when
    // we've already verified the slot's corresponding original string length
    // (referred in this routine as the target string) is less than or equal
    // to the length of the search string.
    //
    // We attempt as many 32-byte comparisons as we can, then as many 16-byte
    // comparisons as we can, then a final < 16-byte comparison if necessary.
    //
    // We use aligned loads if possible, falling back to unaligned if not.
    //

StartYmm:

    if (SearchStringRemaining >= 32 && TargetStringRemaining >= 32) {

        //
        // We have at least 32 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 32-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);
        TargetStringAlignment = GetAddressAlignment(TargetBuffer);

        if (SearchStringAlignment < 32) {
            SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer);
        } else {
            SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer);
        }

        if (TargetStringAlignment < 32) {
            TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer);
        } else {
            TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer);
        }

        //
        // Compare the two vectors.
        //

        ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm256_movemask_epi8(ResultYmm);

        //
        // There were at least 32 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 32 characters, we can terminate
        // this prefix search here.  (-1 == 0xffffffff == all bits set == all
        // characters matched.)
        //

        if (Mask != -1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 32 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 32-byte processing.
        //

        SearchStringRemaining -= 32;
        TargetStringRemaining -= 32;

        CharactersMatched += 32;

        SearchBuffer += 32;
        TargetBuffer += 32;

        goto StartYmm;
    }

    //
    // Intentional follow-on to StartXmm.
    //

StartXmm:

    //
    // Update the search string's alignment.
    //

    if (SearchStringRemaining >= 16 && TargetStringRemaining >= 16) {

        //
        // We have at least 16 bytes to compare for each string.  Check the
        // alignment for each buffer and do an aligned streaming load (non-
        // temporal hint) if our alignment is at a 16-byte boundary or better;
        // reverting to an unaligned load when not.
        //

        SearchStringAlignment = GetAddressAlignment(SearchBuffer);

        if (SearchStringAlignment < 16) {
            SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
        } else {
            SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer);
        }

        TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer);

        //
        // Compare the two vectors.
        //

        ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

        //
        // Generate a mask from the result of the comparison.
        //

        Mask = _mm_movemask_epi8(ResultXmm);

        //
        // There were at least 16 characters remaining in each string buffer,
        // thus, every character needs to have matched in order for this search
        // to continue.  If there were less than 16 characters, we can terminate
        // this prefix search here.  (-1 == 0xffff -> all bits set -> all chars
        // matched.)
        //

        if ((SHORT)Mask != (SHORT)-1) {

            //
            // Not all characters were matched, terminate the prefix search.
            //

            return NO_MATCH_FOUND;
        }

        //
        // All 16 characters were matched.  Update counters and pointers
        // accordingly and jump back to the start of the 16-byte processing.
        //

        SearchStringRemaining -= 16;
        TargetStringRemaining -= 16;

        CharactersMatched += 16;

        SearchBuffer += 16;
        TargetBuffer += 16;

        goto StartXmm;
    }

    if (TargetStringRemaining == 0) {

        //
        // We'll get here if we successfully prefix matched the search string
        // and all our buffers were aligned (i.e. we don't have a trailing
        // < 16 bytes comparison to perform).
        //

        return CharactersMatched;
    }

    //
    // If we get here, we have less than 16 bytes to compare.  Our target
    // strings are guaranteed to be 16-byte aligned, so we can load them
    // using an aligned stream load as in the previous cases.
    //

    TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer);

    //
    // Loading the remainder of our search string's buffer is a little more
    // complicated.  It could reside within 15 bytes of the end of the page
    // boundary, which would mean that a 128-bit load would cross a page
    // boundary.
    //
    // At best, the page will belong to our process and we'll take a performance
    // hit.  At worst, we won't own the page, and we'll end up triggering a hard
    // page fault.
    //
    // So, see if the current search buffer address plus 16 bytes crosses a page
    // boundary.  If it does, take the safe but slower approach of a ranged
    // memcpy (movsb) into a local stack-allocated STRING_SLOT structure.
    //

    if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) {

        //
        // No page boundary is crossed, so just do an unaligned 128-bit move
        // into our Xmm register.  (We could do the aligned/unaligned dance
        // here, but it's the last load we'll be doing (i.e. it's not
        // potentially on a loop path), so I don't think it's worth the extra
        // branch cost, although I haven't measured this empirically.)
        //

        SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);

    } else {

        //
        // We cross a page boundary, so only copy the bytes we need via
        // __movsb(), then do an aligned stream load into the Xmm register
        // we'll use in the comparison.
        //

        __movsb((PBYTE)&SearchSlot.Char,
                (PBYTE)SearchBuffer,
                SearchStringRemaining);

        SearchXmm = _mm_stream_load_si128(&SearchSlot.CharsXmm);
    }

    //
    // Compare the final vectors.
    //

    ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);

    //
    // Generate a mask from the result of the comparison, but mask off (zero
    // out) high bits from the target string's remaining length.
    //

    Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining);

    //
    // Count how many characters were matched and determine if we were a
    // successful prefix match or not.
    //

    Count = __popcnt(Mask);

    if ((USHORT)Count == TargetStringRemaining) {

        //
        // If we matched the same amount of characters as remaining in the
        // target string, we've successfully prefix matched the search string.
        // Return the total number of characters we matched.
        //

        CharactersMatched += (USHORT)Count;
        return CharactersMatched;
    }

    //
    // After all that work, our string match failed at the final stage!  Return
    // to the caller indicating we were unable to make a prefix match.
    //

    return NO_MATCH_FOUND;
}
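
The final stage of the routine above (the page-boundary check followed by a masked comparison of fewer than 16 bytes) can be sketched in isolation. The assumptions: an SSE2-capable x86 compiler, 4 KiB pages, and hypothetical helper names (crosses_page is only a guess at what PointerToOffsetCrossesPageBoundary does). The sketch compares the masked result against the expected mask directly, rather than using _bzhi_u32 and __popcnt, which should give the same answer.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <string.h>

#define ASSUMED_PAGE_SIZE 4096u  // assumption: 4 KiB pages

// True when reading `size` bytes starting at `address` would run into the
// next page.
static int crosses_page(uintptr_t address, size_t size)
{
    return (address & (ASSUMED_PAGE_SIZE - 1)) + size > ASSUMED_PAGE_SIZE;
}

// Compares the final (< 16 byte) tail of a search string against a target
// tail.  `target` must point at 16 readable bytes (as a table slot would);
// `search` is only guaranteed to have `search_remaining` readable bytes.
// Requires target_remaining <= search_remaining < 16.
static int tail_is_prefix_match(const char *search, unsigned search_remaining,
                                const char *target, unsigned target_remaining)
{
    __m128i search_xmm, target_xmm;
    unsigned equal, wanted;

    assert(target_remaining <= search_remaining && search_remaining < 16);

    target_xmm = _mm_loadu_si128((const __m128i *)target);

    if (!crosses_page((uintptr_t)search, 16)) {
        // The 16-byte load stays within one page, so it cannot fault.
        search_xmm = _mm_loadu_si128((const __m128i *)search);
    } else {
        // Copy only the bytes we own into a zeroed local buffer first.
        char local[16] = { 0 };
        memcpy(local, search, search_remaining);
        search_xmm = _mm_loadu_si128((const __m128i *)local);
    }

    equal = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(search_xmm, target_xmm));
    wanted = (1u << target_remaining) - 1u;  // low target_remaining bits set
    return (equal & wanted) == wanted;
}
```

Note that the fast path deliberately reads past the end of the search string whenever it is safe at the hardware level. ISO C does not sanction that, so tools such as AddressSanitizer may flag it even though the load cannot fault.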

IsPrefixOfStringInTable_2

← IsPrefixOfStringInTable_1 | IsPrefixOfStringInTable_3 →

Note

This is the first time we’re seeing the full body of the SIMD-style IsPrefixOfStringInTable implementation. It’s heavily commented, and generally, the core algorithm doesn’t fundamentally change across iterations (just slight tweaks). I’d recommend reading through it thoroughly to build a mental model of how the matching algorithm works. It’s straightforward, and the subsequent iterations will make much more sense, as they’re typically presented as diffs against the previous version.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_2(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
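The up-front filtering in the routine above (unique character plus length) can be modelled one lane at a time. The following is a minimal scalar sketch, assuming a simplified, hypothetical `table_sketch` struct in place of the real STRING_TABLE, and assuming every unique index is in the range 0..15. The length comparison is kept signed because `_mm_cmpgt_epi8` compares signed bytes, so the sketch only covers search lengths below 0x7f.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar mirror of the table fields used by the filter. */
typedef struct {
    uint8_t unique_chars[16];  /* one distinguishing character per slot     */
    uint8_t unique_index[16];  /* byte offset of that character in the slot */
    int8_t  lengths[16];       /* slot lengths; empty slots hold 0x7f       */
} table_sketch;

/* Returns a bitmap with bit i set when slot i is still a candidate: the
   search string has the slot's unique character at the slot's unique index,
   and the slot is no longer than the search string. */
uint32_t candidate_slots(const table_sketch *table,
                         const uint8_t search[16],
                         uint8_t search_length)
{
    uint32_t bitmap = 0;

    /* Signed comparison below, like _mm_cmpgt_epi8. */
    assert(search_length < 0x7f);

    for (int i = 0; i < 16; i++) {
        /* _mm_shuffle_epi8: lane i receives search[unique_index[i]]. */
        uint8_t picked = search[table->unique_index[i] & 0x0f];

        int char_matches = (picked == table->unique_chars[i]);
        int too_long = (table->lengths[i] > (int8_t)search_length);

        if (char_matches && !too_long) {
            bitmap |= 1u << i;   /* same bit _mm_movemask_epi8 would set */
        }
    }
    return bitmap;
}
```

With slots "ab", "ac", and "abcd" (unique characters 'b', 'c', and 'd' at offsets 1, 1, and 3), searching for "abcd" leaves slots 0 and 2 as candidates, and each of those still needs the full 16-byte comparison.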

Benchmark 2

Let’s see how version 2, our first SIMD attempt, performs in comparison to the two baselines.

Eek! Our first SIMD attempt actually has worse prefix matching performance in most cases! The only area where it shows a performance improvement is in negative matching.

IsPrefixOfStringInTable_3

← IsPrefixOfStringInTable_2 | IsPrefixOfStringInTable_4 →

For version 3, let’s replace the call to IsPrefixMatchAvx2 with our simpler version, IsPrefixMatch:

Diff
Full

% diff -u IsPrefixOfStringInTable_2.c IsPrefixOfStringInTable_3.c
--- IsPrefixOfStringInTable_2.c 2018-04-15 22:35:55.458773500 -0400
+++ IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_2(
+IsPrefixOfStringInTable_3(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -278,7 +278,7 @@

             TargetString = &StringTable->pStringArray->Strings[Index];

-            CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
+            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

             if (CharactersMatched == NO_MATCH_FOUND) {

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is our first AVX-optimized version of the routine.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
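The slot-selection loop in the listing above walks the bitmap with a trailing-zero count and a running shift. Here is a small scalar sketch of just that bookkeeping, with hypothetical names (`trailing_zeros32`, `extract_slot_indexes`). Unlike the routine above, which uses the popcount as its loop counter, this version simply loops until the bitmap is empty.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for _tzcnt_u32; returns 32 for a zero input. */
unsigned trailing_zeros32(uint32_t value)
{
    unsigned count = 0;
    if (value == 0) {
        return 32;
    }
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Writes the index of every set bit in bitmap to indexes, lowest first,
   and returns how many indexes were written. */
unsigned extract_slot_indexes(uint32_t bitmap, unsigned indexes[32])
{
    unsigned shift = 0;
    unsigned count = 0;

    while (bitmap) {
        unsigned zeros = trailing_zeros32(bitmap);
        unsigned index = zeros + shift;   /* position in the original bitmap */

        indexes[count++] = index;

        /* Shift past the zeros and the 1 just found.  Two steps, because a
           single shift by zeros + 1 is undefined in C when zeros is 31. */
        bitmap >>= zeros;
        bitmap >>= 1;
        shift = index + 1;
    }
    return count;
}
```

For a bitmap of 0x8005 this yields the slot indexes 0, 2, and 15, which is the order in which the routine would load and compare slots.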

Benchmark 3

Phew! We finally see superior performance across the board. This ends the short-lived tenure of version 2, which is demonstrably worse in every case.

We’ll also omit the IsPrefixOfCStrInArray routine from the graphs for now (for the most part), as it has served its initial baseline purpose.

IsPrefixOfStringInTable_4

← IsPrefixOfStringInTable_3 | IsPrefixOfStringInTable_5 →

When I first wrote the initial string table code, I was experimenting with different strategies for loading the initial search string buffer. That resulted in the file StringLoadStoreOperations.h, which defined a bunch of helper macros. I’ve included them below, but don’t spend too much time absorbing them—they’re not good practice, and they all become irrelevant as soon as we switch to _mm_loadu_si128() in a few versions. I’m including them because they set the scene for versions 4, 5, and 6.

/*++

    VOID
    LoadSearchStringIntoXmmRegister_SEH(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    Attempts an aligned 128-bit load of String->Buffer into Slot.CharsXmm via
    the _mm_load_si128() intrinsic.  The intrinsic is surrounded in a __try/
    __except block that catches EXCEPTION_ACCESS_VIOLATION exceptions.

    If such an exception is caught, the routine will check to see if the string
    buffer's address will cross a page boundary if 16-bytes are loaded.  If a
    page boundary would be crossed, a __movsb() intrinsic is used to copy only
    the bytes specified by String->Length, otherwise, an unaligned 128-bit load
    is attempted via the _mm_loadu_si128() intrinsic.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar)   \
    LengthVar = min(String->Length, 16);                               \
    TRY_SSE42_ALIGNED {                                                \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);      \
    } CATCH_EXCEPTION_ACCESS_VIOLATION {                               \
        if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {  \
            __movsb(Slot.Char, String->Buffer, LengthVar);             \
        } else {                                                       \
            Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \
        }                                                              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlignmentCheck(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine checks to see if a page boundary will be crossed if 16-bytes
    are loaded from the address supplied by String->Buffer.  If a page boundary
    will be crossed, a __movsb() intrinsic is used to only copy String->Length
    bytes into the given Slot.

    If no page boundary will be crossed by a 128-bit load, the alignment of
    the address supplied by String->Buffer is checked.  If the alignment isn't
    at least on a 16-byte boundary, an unaligned load will be issued via the
    _mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String,LengthVar) \
    LengthVar = min(String->Length, 16);                                       \
    if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {              \
        __movsb(Slot.Char, String->Buffer, LengthVar);                         \
    } else if (GetAddressAlignment(String->Buffer) < 16) {                     \
        Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);             \
    } else {                                                                   \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);              \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysUnaligned(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine performs an unaligned 128-bit load of the address supplied by
    String->Buffer into the given Slot via the _mm_loadu_si128() intrinsic.
    No checks are done regarding whether or not a page boundary will be crossed.

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \
    LengthVar = min(String->Length, 16);                                   \
    if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) {          \
        __movsb(Slot.Char, String->Buffer, LengthVar);                     \
    } else if (GetAddressAlignment(String->Buffer) < 16) {                 \
        Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);         \
    } else {                                                               \
        Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);          \
    }

/*++

    VOID
    LoadSearchStringIntoXmmRegister_AlwaysMovsb(
        _In_ STRING_SLOT Slot,
        _In_ PSTRING String,
        _In_ USHORT LengthVar
        );

Routine Description:

    This routine copies min(String->Length, 16) bytes from String->Buffer
    into the given Slot via the __movsb() intrinsic.  The memory referenced by
    the Slot is not cleared first via SecureZeroMemory().

Arguments:

    Slot - Supplies the STRING_SLOT local variable name within the calling
        function that will receive the results of the load operation.

    String - Supplies the name of the PSTRING variable that is to be loaded
        into the slot.  This will usually be one of the function parameters.

    LengthVar - Supplies the name of a USHORT local variable that will receive
        the value of min(String->Length, 16).

Return Value:

    None.

--*/
#define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \
    LengthVar = min(String->Length, 16);                                     \
    __movsb(Slot.Char, String->Buffer, LengthVar);
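Several of the macros above lean on two helpers that aren't shown here, PointerToOffsetCrossesPageBoundary and GetAddressAlignment. The sketch below shows one plausible way to express the same two checks in portable C. It is an assumption about their intent, not their actual definitions: the names `read_crosses_page_boundary` and `is_aligned_16` are made up, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Returns 1 if reading `size` bytes starting at `address` would touch more
   than one page, 0 if the read stays within a single page. */
int read_crosses_page_boundary(uintptr_t address, size_t size)
{
    assert(size > 0 && size <= SKETCH_PAGE_SIZE);
    uintptr_t offset_in_page = address & (SKETCH_PAGE_SIZE - 1u);
    return (offset_in_page + size) > SKETCH_PAGE_SIZE;
}

/* Returns 1 if `address` is a multiple of 16, which is the requirement for
   an aligned 128-bit load such as _mm_load_si128(). */
int is_aligned_16(uintptr_t address)
{
    return (address & 15u) == 0;
}
```

The page check matters because a 16-byte load can read past the end of a short string. That is harmless while the extra bytes sit on the same page, but can fault if they fall on an unmapped neighboring page.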

In our StringTable2.vcxproj file, we have the following:

  <PropertyGroup Label="Globals">
    ...
    <LoadSearchStringStrategy>AlwaysMovsb</LoadSearchStringStrategy>
    <!--
    <LoadSearchStringStrategy>SEH</LoadSearchStringStrategy>
    <LoadSearchStringStrategy>AlignmentCheck</LoadSearchStringStrategy>
    <LoadSearchStringStrategy>AlwaysUnaligned</LoadSearchStringStrategy>
    -->

This setup allowed me to toggle which strategy I wanted to use for loading the search string into an XMM register. As shown above, the default is to use the AlwaysMovsb approach*; so, for version 4, let’s swap that out for the SEH approach, which wraps the aligned load in a structured exception handler that falls back to __movsb() if the aligned load fails and the pointer plus 16 bytes crosses a page boundary.

[*]: Or was it?

Narrator: it wasn’t.

Diff
Full

% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_4.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_4.c 2018-04-15 22:35:55.453274200 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_4(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses a structured exception
+    handler for loading the initial search string.

 Arguments:

@@ -123,7 +124,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_4(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses a structured exception
    handler for loading the initial search string.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register, and load
    //     the string table's slot lengths array into a second XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
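Stripped of the candidate loop, the filtering part of the routine above is only a handful of instructions. Here is a cut-down sketch of that step that sticks to SSE2 intrinsics, so it should build on x86-64 without extra compiler flags. `FilterSlots` and its flat-array parameters are invented for illustration (the real routine reads these from `STRING_TABLE`), a scalar gather loop replaces `_mm_shuffle_epi8`, and `_mm_set1_epi8` replaces the `_mm_broadcastb_epi8` broadcast.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only.  UniqueChars[i] is the distinguishing character of slot
 * i, UniqueIndex[i] (assumed 0..15) is its byte offset within the string, and
 * Lengths[i] is the slot's string length (0x7f for an empty slot).  Returns a
 * 16-bit bitmap with bit i set if slot i is still a candidate.
 */
static unsigned
FilterSlots(const unsigned char UniqueChars[16],
            const unsigned char UniqueIndex[16],
            const unsigned char Lengths[16],
            const char *SearchBuffer,
            size_t SearchLength)
{
    unsigned char Search[16] = { 0 };
    unsigned char Gathered[16];
    size_t Capped = (SearchLength < 16) ? SearchLength : 16;
    int Index;
    __m128i UniqueChar, TableUniqueChars, SlotLengths, LengthXmm;
    __m128i IncludeByChar, IgnoreByLength, IncludeByLength, IncludeSlots;

    /* The length comparison below is a signed byte compare. */
    assert(SearchLength <= 0x7f);

    memcpy(Search, SearchBuffer, Capped);

    /* Scalar stand-in for _mm_shuffle_epi8(Search, UniqueIndex). */
    for (Index = 0; Index < 16; Index++) {
        Gathered[Index] = Search[UniqueIndex[Index] & 15];
    }

    UniqueChar = _mm_loadu_si128((const __m128i *)Gathered);
    TableUniqueChars = _mm_loadu_si128((const __m128i *)UniqueChars);
    SlotLengths = _mm_loadu_si128((const __m128i *)Lengths);
    LengthXmm = _mm_set1_epi8((char)SearchLength);

    /* 0xff where the unique character matches, 0x00 elsewhere. */
    IncludeByChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    /* 0xff where the slot is longer than the search string; then invert. */
    IgnoreByLength = _mm_cmpgt_epi8(SlotLengths, LengthXmm);
    IncludeByLength = _mm_xor_si128(IgnoreByLength, _mm_set1_epi8((char)0xff));

    /* Intersect the two filters and compress to one bit per slot. */
    IncludeSlots = _mm_and_si128(IncludeByChar, IncludeByLength);

    return (unsigned)_mm_movemask_epi8(IncludeSlots);
}
```

A zero return value is the negative-match fast path: no slot survived both filters, so there is nothing left to compare.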

Benchmark 4

The performance of version 4 was slightly worse than 3 in every case:

Version 3 is still in the lead with the AlwaysMovsb-based search string loading approach.

Narrator: except the AlignmentCheck macro was actually active, not the AlwaysMovsb one.

IsPrefixOfStringInTable_5

Version 5 is an interesting one. It’s the first time we attempt to validate our claim that it’s more efficient to give the CPU a bunch of independent things to do up front, rather than adding more branches and attempting to terminate as early as possible.

Note: we’ll also explicitly use the LoadSearchStringIntoXmmRegister_AlwaysMovsb macro here, instead of LoadSearchStringIntoXmmRegister, to make it clear that we’re actually relying on the __movsb()-based string loading routine.

Narrator: can anyone spot the mistake with this logic?
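To make the two control-flow shapes concrete before wading through the diff, here is a scalar model of both. It is a sketch under my own assumptions: `CharMask`, `LengthMask`, `FilterUpFront`, and `FilterEarlyExit` are hypothetical helpers, and plain loops stand in for the SIMD compare and movemask pairs. The two filters always return the same bitmap; only the number of branches on the way there differs, and which one is faster is a question for the benchmark rather than for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of _mm_cmpeq_epi8 followed by _mm_movemask_epi8. */
static unsigned
CharMask(const unsigned char Gathered[16], const unsigned char UniqueChars[16])
{
    unsigned Mask = 0;
    unsigned Index;

    for (Index = 0; Index < 16; Index++) {
        if (Gathered[Index] == UniqueChars[Index]) {
            Mask |= 1u << Index;
        }
    }
    return Mask;
}

/* Scalar model of the inverted length compare: slot length <= search length. */
static unsigned
LengthMask(const unsigned char Lengths[16], unsigned char SearchLength)
{
    unsigned Mask = 0;
    unsigned Index;

    for (Index = 0; Index < 16; Index++) {
        if (Lengths[Index] <= SearchLength) {
            Mask |= 1u << Index;
        }
    }
    return Mask;
}

/* Version 3 shape: do all the work, then branch once on the combined result. */
static unsigned
FilterUpFront(const unsigned char Gathered[16],
              const unsigned char UniqueChars[16],
              const unsigned char Lengths[16],
              unsigned char SearchLength)
{
    unsigned CharBitmap = CharMask(Gathered, UniqueChars);
    unsigned LengthBitmap = LengthMask(Lengths, SearchLength);

    return CharBitmap & LengthBitmap;
}

/* Version 5 shape: branch after every step and bail out as early as possible. */
static unsigned
FilterEarlyExit(const unsigned char Gathered[16],
                const unsigned char UniqueChars[16],
                const unsigned char Lengths[16],
                unsigned char SearchLength)
{
    unsigned CharBitmap;
    unsigned LengthBitmap;

    assert(Gathered != NULL && UniqueChars != NULL && Lengths != NULL);

    CharBitmap = CharMask(Gathered, UniqueChars);
    if (!CharBitmap) {
        return 0;
    }

    LengthBitmap = LengthMask(Lengths, SearchLength);
    if (!LengthBitmap) {
        return 0;
    }

    return CharBitmap & LengthBitmap;
}
```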

Diff, followed by the full source:

% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_5.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_5.c 2018-04-15 13:24:52.480972900 -0400
@@ -16,9 +16,13 @@

 #include "stdafx.h"

+//
+// Variant of v3 with early-exits.
+//
+
 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_5(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +35,11 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine is a variant of version 3 that uses early exits (i.e.
+    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
+    the assertion we've been making that it's more optimal to give the CPU
+    a bunch of things to do up front versus doing something, then potentially
+    branching, doing the next thing, potentially branching, etc.

 Arguments:

@@ -51,6 +59,8 @@
 --*/
 {
     ULONG Bitmap;
+    ULONG CharBitmap;
+    ULONG LengthBitmap;
     ULONG Mask;
     ULONG Count;
     ULONG Length;
@@ -71,7 +81,6 @@
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
     XMMWORD IncludeSlotsByLength;
-    XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     StringArray = StringTable->pStringArray;
@@ -123,7 +132,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

     //
     // Broadcast the search string's unique characters according to the string
@@ -133,11 +142,6 @@
     UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                   StringTable->UniqueIndex.IndexXmm);

-    //
-    // Load the slot length array into an XMM register.
-    //
-
-    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

     //
     // Load the string table's unique character array into an XMM register.
@@ -146,13 +150,6 @@
     TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

     //
-    // Broadcast the search string's length into an XMM register.
-    //
-
-    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
-    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
-
-    //
     // Compare the search string's unique character with all of the unique
     // characters of strings in the table, saving the results into an XMM
     // register.  This comparison will indicate which slots we can ignore
@@ -162,6 +159,25 @@

     IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

+    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);
+
+    if (!CharBitmap) {
+        return NO_MATCH_FOUND;
+    }
+
+    //
+    // Load the slot length array into an XMM register.
+    //
+
+    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
+
+    //
+    // Broadcast the search string's length into an XMM register.
+    //
+
+    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
+    LengthXmm = _mm_broadcastb_epi8(LengthXmm);
+
     //
     // Find all slots that are longer than the incoming string length, as these
     // are the ones we're going to exclude from any prefix match.
@@ -182,31 +198,16 @@

     IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

-    //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
-    //
-
-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

-    //
-    // Generate a mask.
-    //
+    if (!LengthBitmap) {
+        return NO_MATCH_FOUND;
+    }

-    Bitmap = _mm_movemask_epi8(IncludeSlots);
+    Bitmap = CharBitmap & LengthBitmap;

     if (!Bitmap) {
-
-        //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
-        //
-
-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_5(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is a variant of version 3 that uses early exits (i.e.
    returning NO_MATCH_FOUND as early as we can).  It is designed to evaluate
    the assertion we've been making that it's more optimal to give the CPU
    a bunch of things to do up front versus doing something, then potentially
    branching, doing the next thing, potentially branching, etc.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG CharBitmap;
    ULONG LengthBitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register, and load
    //     the string table's slot lengths array into a second XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);


    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);

    if (!CharBitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);

    if (!LengthBitmap) {
        return NO_MATCH_FOUND;
    }

    Bitmap = CharBitmap & LengthBitmap;

    if (!Bitmap) {
        return NO_MATCH_FOUND;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
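The candidate loop in the listing above is easier to follow with the intrinsics replaced by plain C. The sketch below models it under a few assumptions of mine: `CountTrailingZeros`, `PopCount`, and `ZeroHighBits` are portable stand-ins for `_tzcnt_u32`, `__popcnt`, and `_bzhi_u32`; `EqualityMask` models `_mm_cmpeq_epi8` plus `_mm_movemask_epi8`; the slots are passed as one flat 256-byte array; and the oversized-slot path (strings longer than 16 bytes) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_MATCH (-1)  /* hypothetical sentinel for this sketch */

static unsigned
CountTrailingZeros(unsigned Value)
{
    unsigned Count = 0;

    assert(Value != 0);
    while (!(Value & 1u)) {
        Value >>= 1;
        Count++;
    }
    return Count;
}

static unsigned
PopCount(unsigned Value)
{
    unsigned Count = 0;

    while (Value) {
        Value &= Value - 1;  /* clear the lowest set bit */
        Count++;
    }
    return Count;
}

/* Keep only the low `Index` bits of `Value`. */
static unsigned
ZeroHighBits(unsigned Value, unsigned Index)
{
    return (Index >= 32) ? Value : (Value & ((1u << Index) - 1u));
}

/* Bit i is set when byte i of the two 16-byte blocks is equal. */
static unsigned
EqualityMask(const unsigned char *Left, const unsigned char *Right)
{
    unsigned Mask = 0;
    unsigned Index;

    for (Index = 0; Index < 16; Index++) {
        if (Left[Index] == Right[Index]) {
            Mask |= 1u << Index;
        }
    }
    return Mask;
}

/*
 * Walk the set bits of a 16-bit candidate bitmap from lowest to highest and
 * return the first slot whose whole string matches the start of the search
 * string.  Slots points at 16 consecutive zero-padded 16-byte slots.
 */
static int
FindPrefixMatch(unsigned Bitmap,
                const unsigned char *Slots,
                const unsigned char Lengths[16],
                const unsigned char Search[16],
                unsigned SearchLength)
{
    unsigned Shift = 0;
    unsigned Count = PopCount(Bitmap);

    assert(Bitmap <= 0xffffu);

    while (Count--) {
        unsigned Zeros = CountTrailingZeros(Bitmap);
        unsigned Index = Zeros + Shift;
        unsigned Mask;

        /* Step past the zeros and the 1 we just found. */
        Bitmap >>= (Zeros + 1);
        Shift = Index + 1;

        Mask = ZeroHighBits(EqualityMask(&Slots[Index * 16], Search),
                            SearchLength);

        if (PopCount(Mask) == Lengths[Index]) {
            return (int)Index;
        }
    }

    return SKETCH_NO_MATCH;
}
```

Masking with the search length matters because the zero padding in a short slot compares equal to the zero padding in a short search string, which would otherwise inflate the popcount.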

Benchmark 5

If our theory is correct, the performance of this version should be worse, due to all the extra branches in the initial test. Let’s see if we’re right:

Holy smokes, version 5 is bad! It’s so bad it’s actually closest in performance to the failed version 2 with the elaborate AVX2 prefix matching routine.

Note

It was actually so close I double-checked the two routines to ensure they were correct; they were, so this is just a coincidence.

That’s good news, though, as it validates the assumption we’ve been working with since inception:

//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary.  That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa.  However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//

That’s the end of version 5’s tenure. TL;DR: fewer branches > more branches.

Narrator: more accurate TL;DR: __movsb() is slow, and always make sure you’re testing what you think you’re testing.

IsPrefixOfStringInTable_6

Version 6 is boring. We tweak the initial loading of the search string, explicitly loading it via an unaligned load. If the underlying buffer is aligned on a 16-byte boundary, this is just as fast as an aligned load. If not, at least it doesn’t crash; it’s just slower.

Tip

If you attempt an aligned load on an address that isn’t aligned at a 16-byte boundary, the processor will generate an exception, causing your program to crash (assuming you don’t have any structured exception handlers in place to catch the error).
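The distinction is easy to see in a small SSE2 sketch. `IsAligned16` is the condition under which `_mm_load_si128` is safe to issue, and `LoadSearchString16` mirrors the shape of version 6 by always using `_mm_loadu_si128`. Both names are mine rather than anything from the string table code, and the sketch assumes the caller guarantees 16 readable bytes at `Buffer` even when the string itself is shorter.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Nonzero if Address sits on a 16-byte boundary. */
static int
IsAligned16(const void *Address)
{
    return ((uintptr_t)Address & 15) == 0;
}

/*
 * Load the first 16 bytes at Buffer with an unaligned load, store them to
 * Out, and return the search length capped at 16.  No alignment check and
 * no exception handler: the unaligned load is legal at any address, as long
 * as all 16 bytes are readable (an assumption of this sketch).
 */
static unsigned
LoadSearchString16(unsigned char Out[16], const char *Buffer, unsigned Length)
{
    __m128i Search;

    assert(Out != NULL && Buffer != NULL);

    Search = _mm_loadu_si128((const __m128i *)Buffer);
    _mm_storeu_si128((__m128i *)Out, Search);

    return (Length < 16) ? Length : 16;
}
```

Swapping `_mm_loadu_si128` for `_mm_load_si128` here would be fine for any `Buffer` where `IsAligned16` holds, and would fault for the rest.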

Diff, followed by the full source:

% diff -u IsPrefixOfStringInTable_3.c IsPrefixOfStringInTable_6.c
--- IsPrefixOfStringInTable_3.c 2018-04-15 22:35:55.456274700 -0400
+++ IsPrefixOfStringInTable_6.c 2018-04-26 18:29:40.594556800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_3(
+IsPrefixOfStringInTable_6(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,7 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is our first AVX-optimized version of the routine.
+    This routine differs from version 3 in that we do an unaligned load of
+    the search string buffer without any SEH wrappers or alignment checks.

 Arguments:

@@ -123,7 +124,8 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
+    SearchLength = min(String->Length, 16);
+    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

     //
     // Broadcast the search string's unique characters according to the string

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_6(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine differs from version 3 in that we do an unaligned load of
    the search string buffer without any SEH wrappers or alignment checks.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register, and load
    //     the string table's slot lengths array into a second XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    SearchLength = min(String->Length, 16);
    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
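
The slot-iteration loop in the routine above (popcount to get the number of candidate slots, then repeated trailing-zero counts and shifts to recover each slot index) can be modeled in portable C. This is only a sketch: `count_trailing_zeros`, `population_count` and `collect_slot_indexes` are hypothetical helpers standing in for `_tzcnt_u32()`, `__popcnt()` and the loop body, and they are not part of the article's code.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); the caller guarantees x != 0. */
static unsigned count_trailing_zeros(uint32_t x)
{
    unsigned n = 0;
    assert(x != 0);
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for __popcnt(). */
static unsigned population_count(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Writes the index of every set bit in a 16-bit slot bitmap to `indexes`,
 * lowest index first, using the same tzcnt/shift bookkeeping as the routine.
 * Returns the number of indexes written.
 */
static unsigned collect_slot_indexes(uint32_t bitmap, unsigned indexes[16])
{
    unsigned shift = 0;
    unsigned written = 0;
    unsigned count = population_count(bitmap);

    assert(bitmap <= 0xFFFFu);

    while (count--) {
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;

        /* Skip past the zeros and the 1 bit we just consumed. */
        bitmap >>= (zeros + 1);
        shift = index + 1;

        indexes[written++] = index;
    }
    return written;
}
```

For a bitmap of `0x8005` the helper visits slots 0, 2 and 15 in that order, which is the order in which the routine would compare them against the search string.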

Benchmark 6

Version 6 should be faster than version 3: we omit the alignment checks, all of our input buffers are aligned on 32-byte boundaries, and an unaligned XMM load of an aligned buffer should definitely be faster than a __movsb(). Let’s see:

We have a new winner! Version 3 had a good run, but it’s time to retire. Let’s tweak version 6 going forward.

Narrator: this is actually testing _mm_loadu_si128() against the AlignmentCheck routine, which first calls PointerToOffsetCrossesPageBoundary(), and then checks the address alignment before calling _mm_load_si128(). Since unaligned loads are just as fast as aligned loads as long as the underlying buffer is aligned, all this shows is that it’s slightly faster to skip the pointer boundary and address alignment checks, which isn’t too surprising.
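
To make the skipped work concrete, here is a minimal sketch of the two kinds of checks described above. The names `read_crosses_page` and `is_aligned_16`, and the 4096-byte page size, are assumptions made for illustration; this is not the article's PointerToOffsetCrossesPageBoundary() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* True if reading `bytes` bytes starting at `address` would touch the next page. */
static bool read_crosses_page(uintptr_t address, size_t bytes)
{
    assert(bytes > 0 && bytes <= SKETCH_PAGE_SIZE);
    return (address & (SKETCH_PAGE_SIZE - 1)) + bytes > SKETCH_PAGE_SIZE;
}

/* True if `address` is a multiple of 16, as an aligned XMM load requires. */
static bool is_aligned_16(uintptr_t address)
{
    return (address & 15u) == 0;
}
```

Both checks are cheap, but each one adds arithmetic and a branch ahead of the load. Version 6 avoids them by always issuing an unaligned load, which relies on the caller supplying a buffer from which 16 bytes can safely be read.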

IsPrefixOfStringInTable_7

← IsPrefixOfStringInTable_6 | IsPrefixOfStringInTable_8 →

Version 7 tweaks version 6 a little bit. We don’t need the search string length calculated so early in the routine. Let’s move it to later.

Diff
Full

% diff -u IsPrefixOfStringInTable_6.c IsPrefixOfStringInTable_7.c
--- IsPrefixOfStringInTable_6.c 2018-04-15 22:35:55.450273700 -0400
+++ IsPrefixOfStringInTable_7.c 2018-04-26 10:00:53.905933700 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_6(
+IsPrefixOfStringInTable_7(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,9 +31,10 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine differs from version 3 in that we do an aligned load of the
-    search string buffer without any SEH wrappers or alignment checks.  (Thus,
-    this routine will fault if the buffer is unaligned.)
+    This routine is based off version 6, but alters when we calculate the
+    "search length" for the given string, which is done via the expression
+    'min(String->Length, 16)'.  We don't need this value until later in the
+    routine, when we're ready to start comparing strings.

 Arguments:

@@ -125,7 +126,6 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    SearchLength = min(String->Length, 16);
     Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

     //
@@ -213,6 +213,13 @@
     }

     //
+    // Calculate the "search length" of the incoming string, which ensures we
+    // only compare up to the first 16 characters.
+    //
+
+    SearchLength = min(String->Length, 16);
+
+    //
     // A popcount against the mask will tell us how many slots we matched, and
     // thus, need to compare.
     //

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_7(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 6, but alters when we calculate the
    "search length" for the given string, which is done via the expression
    'min(String->Length, 16)'.  We don't need this value until later in the
    routine, when we're ready to start comparing strings.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    PSTRING_ARRAY StringArray;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    StringArray = StringTable->pStringArray;

    //
    // If the minimum length of the string array is greater than the length of
    // our search string, there can't be a prefix match.
    //

    if (StringArray->MinimumLength > String->Length) {
        goto NoMatch;
    }

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

This is a tiny change. If it shows any performance difference, it should be a positive one, although it’s possible the compiler was already deferring the calculation until after the initial negative match logic (since the result wasn’t used immediately), in which case we’d see no change at all. Let’s see.
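
For context, SearchLength is only consumed inside the comparison loop, where it caps how many bits of the comparison mask are counted. A portable sketch of that step follows; `zero_high_bits`, `count_set_bits` and `characters_matched` are hypothetical names standing in for `_bzhi_u32()`, `__popcnt()` and the two lines of the loop that use them.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _bzhi_u32(): zero every bit at position >= n. */
static uint32_t zero_high_bits(uint32_t value, unsigned n)
{
    return (n >= 32) ? value : (value & ((1u << n) - 1u));
}

/* Portable stand-in for __popcnt(). */
static unsigned count_set_bits(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Given the 16-bit mask that _mm_movemask_epi8() would produce for a
 * byte-wise equality compare, counts how many of the first
 * min(search_length, 16) bytes compared equal.
 */
static unsigned characters_matched(uint32_t compare_mask, unsigned search_length)
{
    unsigned capped = (search_length < 16) ? search_length : 16;
    assert(compare_mask <= 0xFFFFu);
    return count_set_bits(zero_high_bits(compare_mask, capped));
}
```

Because nothing before the loop reads the capped length, computing it after the initial bitmap test means the negative-match path never has to pay for it.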

Benchmark 7

Tiny change, tiny performance improvement! Looks like this saves a couple of cycles, thus ending the short-lived reign of version 6.

IsPrefixOfStringInTable_8

← IsPrefixOfStringInTable_7 | IsPrefixOfStringInTable_9 →

Version 8 is based off version 7, but omits the initial length test. Again, it’s another small change, but if version 5 was anything to go by, the fewer branches, the better.
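
One way to see why the early test can be dropped without changing results: the vectorized length comparison already excludes every slot longer than the search string, so a search string shorter than the table's minimum length yields an empty bitmap anyway. The following is a scalar model of the cmpgt/xor/movemask sequence only; `include_slots_by_length` and `EMPTY_SLOT_LENGTH` are hypothetical names, with the 0x7f value taken from the routine's comments.

```c
#include <assert.h>
#include <stdint.h>

#define EMPTY_SLOT_LENGTH 0x7f   /* length stored for unused slots */

/*
 * Scalar model of the length filter: bit i of the result is set when
 * slot i is no longer than the search string.
 */
static uint32_t include_slots_by_length(const int8_t slot_lengths[16],
                                        int8_t search_length)
{
    uint32_t bitmap = 0;

    assert(search_length >= 0);

    for (unsigned i = 0; i < 16; i++) {
        /* _mm_cmpgt_epi8() is a signed byte comparison. */
        int ignore = slot_lengths[i] > search_length;
        if (!ignore) {
            bitmap |= 1u << i;
        }
    }
    return bitmap;
}
```

With slot lengths of 3, 5 and 7 and the remaining slots empty, a search length of 2 produces a bitmap of zero. That is the same outcome the explicit MinimumLength test would have reached, just without the extra load and branch.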

Diff
Full


% diff -u IsPrefixOfStringInTable_7.c IsPrefixOfStringInTable_8.c
--- IsPrefixOfStringInTable_7.c 2018-04-26 10:21:43.253466500 -0400
+++ IsPrefixOfStringInTable_8.c 2018-04-26 10:21:27.109761800 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_7(
+IsPrefixOfStringInTable_8(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,10 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 6, but alters when we calculate the
-    "search length" for the given string, which is done via the expression
-    'min(String->Length, 16)'.  We don't need this value until later in the
-    routine, when we're ready to start comparing strings.
+    This routine is based off version 7, but omits the initial minimum
+    length test of the string array.

 Arguments:

@@ -63,7 +61,6 @@
     ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
-    PSTRING_ARRAY StringArray;
     STRING_SLOT Slot;
     STRING_SLOT Search;
     STRING_SLOT Compare;
@@ -77,17 +74,6 @@
     XMMWORD IncludeSlots;
     const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

-    StringArray = StringTable->pStringArray;
-
-    //
-    // If the minimum length of the string array is greater than the length of
-    // our search string, there can't be a prefix match.
-    //
-
-    if (StringArray->MinimumLength > String->Length) {
-        goto NoMatch;
-    }
-
     //
     // Unconditionally do the following five operations before checking any of
     // the results and determining how the search should proceed:

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_8(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This routine is based off version 7, but omits the initial minimum
    length test of the string array.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        goto NoMatch;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

NoMatch:

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
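The masking step inside the loop (movemask, then `_bzhi_u32`, then popcount) is easier to reason about in isolation. What follows is a portable scalar sketch, not the article's code: the helper names `equal_byte_mask`, `zero_high_bits`, `popcount32`, and `characters_matched` are made up for illustration, and each one only models the intrinsic named in its comment.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for _mm_cmpeq_epi8 followed by _mm_movemask_epi8:
   bit i is set when byte i of the two 16-byte buffers is equal. */
static uint32_t equal_byte_mask(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++) {
        if (a[i] == b[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
}

/* Scalar stand-in for _bzhi_u32 for the small n used here:
   clear every bit at position n and above. */
static uint32_t zero_high_bits(uint32_t value, uint32_t n)
{
    return (n >= 32) ? value : (value & ((1u << n) - 1u));
}

/* Portable popcount (the listing uses the __popcnt intrinsic). */
static uint32_t popcount32(uint32_t value)
{
    uint32_t count = 0;
    while (value) {
        value &= value - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Number of equal bytes within the first search_len (<= 16) positions. */
static uint32_t characters_matched(const uint8_t slot[16],
                                   const uint8_t search[16],
                                   uint32_t search_len)
{
    assert(search_len <= 16);
    return popcount32(zero_high_bits(equal_byte_mask(slot, search),
                                     search_len));
}
```

With a zero-padded slot holding "foo" and a zero-padded search buffer holding "foobar", a search length of 6 yields 3 matched characters, which equals the slot length. Without the masking step, the ten trailing zero bytes in both buffers would also compare equal and inflate the count.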

Benchmark 8

Hey, look at that, another win across the board! Omitting the length test shaves off a few more cycles for both prefix and negative matching. Version 7’s one-round reign has come to a timely end.

IsPrefixOfStringInTable_9


Version 9 tweaks version 8 by simply using return NO_MATCH_FOUND after the initial bitmap check instead of goto NoMatch. (The use of goto was a bit peculiar there anyway. We’re going to rewrite the body similarly for version 10, but let’s try to stick to making one change at a time.)

Diff
Full

--- IsPrefixOfStringInTable_8.c 2018-04-26 10:30:52.337935400 -0400
+++ IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_8(
+IsPrefixOfStringInTable_9(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This routine is based off version 7, but omits the initial minimum
-    length test of the string array.
+    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
+    after the initial bitmap check versus 'goto NoMatch'.

 Arguments:

@@ -195,7 +195,7 @@
         // search string.
         //

-        goto NoMatch;
+        return NO_MATCH_FOUND;
     }

     //
@@ -330,8 +330,6 @@
     // If we get here, we didn't find a match.
     //

-NoMatch:
-
     //IACA_VC_END();

     return NO_MATCH_FOUND;

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_9(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
    after the initial bitmap check versus 'goto NoMatch'.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched == 16 && Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;

            } else {

                //
                // We successfully prefix matched the search string against
                // this slot.  The code immediately following us deals with
                // handling a successful prefix match at the initial slot
                // level; let's avoid an unnecessary branch and just jump
                // directly into it.
                //

                goto FoundMatch;
            }
        }

        if ((USHORT)CharactersMatched == Length) {

FoundMatch:

            //
            // This slot is a prefix match.  Fill out the Match structure if the
            // caller provided a non-NULL pointer, then return the index of the
            // match.
            //


            if (ARGUMENT_PRESENT(Match)) {

                Match->Index = (BYTE)Index;
                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
                Match->String = &StringTable->pStringArray->Strings[Index];

            }

            return (STRING_TABLE_INDEX)Index;
        }

        //
        // Not enough characters matched, so continue the loop.
        //

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
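The index extraction at the top of the loop (count trailing zeros, add the running shift, then shift the bitmap past the bit just found) can be exercised on its own. This is a portable sketch under a few assumptions: `trailing_zeros` and `popcount32` stand in for `_tzcnt_u32` and `__popcnt`, the `candidate_slots` helper and its `out` array are hypothetical, and a `while` loop replaces the `do`/`while` so that an empty bitmap is also handled.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32; the caller guarantees value != 0. */
static uint32_t trailing_zeros(uint32_t value)
{
    uint32_t count = 0;
    assert(value != 0);
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for __popcnt. */
static uint32_t popcount32(uint32_t value)
{
    uint32_t count = 0;
    while (value) {
        value &= value - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Writes the index of every set bit of a 16-bit slot bitmap into out[],
   lowest index first, and returns how many indexes were written. */
static uint32_t candidate_slots(uint32_t bitmap, uint32_t out[16])
{
    uint32_t shift = 0;
    uint32_t written = 0;
    uint32_t count;

    assert(bitmap <= 0xFFFFu);
    count = popcount32(bitmap);

    while (count--) {
        uint32_t zeros = trailing_zeros(bitmap);
        uint32_t index = zeros + shift;

        /* Skip the zeros and the 1 just found; remember how far we moved. */
        bitmap >>= (zeros + 1);
        shift = index + 1;

        out[written++] = index;
    }
    return written;
}
```

For a bitmap of 0x8005 (bits 0, 2, and 15 set), the loop visits slots 0, 2, and 15 in that order, which is the order in which the listing would compare them.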

Benchmark 9

This is an interesting one. The return versus goto appears to have cost us a tiny bit with the first few test inputs—only about 0.2 more cycles, which is negligible in the grand scheme of things. (Though let’s not pull on that thread too much, or the entire premise of the article might start to unravel!)

Version 9 improves the negative match performance by a few cycles, so let’s keep it.

IsPrefixOfStringInTable_10


At this point, we’ve exhausted all the small, easy tweaks. Let’s rewrite the inner loop that performs the character comparison and see how that affects performance.

This should be an interesting one because the way it’s written now is… a bit odd. (I’ve clearly made some assumptions about optimal branch organization, to say the least.)

Diff
Full

% diff -u IsPrefixOfStringInTable_9.c IsPrefixOfStringInTable_10.c
--- IsPrefixOfStringInTable_9.c 2018-04-26 10:32:04.986734400 -0400
+++ IsPrefixOfStringInTable_10.c        2018-04-26 10:38:09.357890400 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_9(
+IsPrefixOfStringInTable_10(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
-    after the initial bitmap check versus 'goto NoMatch'.
+    This version is based off version 9, but rewrites the inner loop that
+    checks for comparisons.

 Arguments:

@@ -264,7 +264,17 @@

         CharactersMatched = __popcnt(Mask);

-        if ((USHORT)CharactersMatched == 16 && Length > 16) {
+        if ((USHORT)CharactersMatched < Length && Length <= 16) {
+
+            //
+            // The slot length is longer than the number of characters matched
+            // from the search string; this isn't a prefix match.  Continue.
+            //
+
+            continue;
+        }
+
+        if (Length > 16) {

             //
             // The first 16 characters in the string matched against this
@@ -283,46 +293,24 @@
                 //

                 continue;
-
-            } else {
-
-                //
-                // We successfully prefix matched the search string against
-                // this slot.  The code immediately following us deals with
-                // handling a successful prefix match at the initial slot
-                // level; let's avoid an unnecessary branch and just jump
-                // directly into it.
-                //
-
-                goto FoundMatch;
             }
         }

-        if ((USHORT)CharactersMatched == Length) {
-
-FoundMatch:
-
-            //
-            // This slot is a prefix match.  Fill out the Match structure if the
-            // caller provided a non-NULL pointer, then return the index of the
-            // match.
-            //
-
-
-            if (ARGUMENT_PRESENT(Match)) {
+        //
+        // This slot is a prefix match.  Fill out the Match structure if the
+        // caller provided a non-NULL pointer, then return the index of the
+        // match.
+        //

-                Match->Index = (BYTE)Index;
-                Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
-                Match->String = &StringTable->pStringArray->Strings[Index];
+        if (ARGUMENT_PRESENT(Match)) {

-            }
+            Match->Index = (BYTE)Index;
+            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
+            Match->String = &StringTable->pStringArray->Strings[Index];

-            return (STRING_TABLE_INDEX)Index;
         }

-        //
-        // Not enough characters matched, so continue the loop.
-        //
+        return (STRING_TABLE_INDEX)Index;

     } while (--Count);

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_10(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 9, but rewrites the inner loop that
    checks for comparisons.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

That’s a nicer bit of logic: more C-like, less assembly-like, and arguably clearer. Let’s see how the two versions compare. (This is an interesting one, as I genuinely don’t have a strong hunch about what kind of performance impact this will have; obviously, I thought the initial way of structuring the loop was optimal, and I had it in place for two years before deciding to embark on this article, which led to the rework we just saw.)
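To make the restructured branches easier to inspect, the per-slot decision from the version 10 loop can be pulled out into a small pure function. This is only a sketch: `slot_decision`, `decide_slot`, and the `oversize_ok` flag (which stands in for the result of the `IsPrefixMatch` call on slots longer than 16 characters) are hypothetical names, not part of the article's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { SLOT_CONTINUE, SLOT_MATCH } slot_decision;

/* Mirrors the branch order of the version 10 loop body.  oversize_ok models
   the outcome of the remaining-buffer comparison and is only consulted for
   slots longer than 16 characters. */
static slot_decision decide_slot(uint32_t matched,
                                 uint32_t slot_len,
                                 bool oversize_ok)
{
    if (matched < slot_len && slot_len <= 16) {
        /* Fewer characters matched than the slot holds: not a prefix. */
        return SLOT_CONTINUE;
    }

    if (slot_len > 16 && !oversize_ok) {
        /* Oversized slot whose remaining bytes did not match. */
        return SLOT_CONTINUE;
    }

    return SLOT_MATCH;
}
```

Calling `decide_slot(3, 3, false)` reports a match, `decide_slot(2, 3, false)` continues the search, and an oversized slot such as `decide_slot(16, 20, ok)` depends entirely on `ok`.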

Benchmark 10

Hey, look at that! We’ve shaved off a few more cycles in most cases, especially for the negative matches!
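Since the next section focuses on negative matches, it may help to see the up-front filter without the SIMD notation. The sketch below is a scalar model, assuming a simplified, hypothetical `table_model` layout (the real STRING_TABLE has more fields); `candidate_bitmap` and `table_model_clear` are illustrative names. A result of zero corresponds to the early `return NO_MATCH_FOUND`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified table layout for illustration only. */
typedef struct {
    uint8_t unique_chars[16];  /* one distinguishing character per slot    */
    uint8_t unique_index[16];  /* byte offset of that character            */
    int8_t  lengths[16];       /* slot lengths; empty slots hold 0x7f      */
} table_model;

/* Marks every slot as empty. */
static void table_model_clear(table_model *table)
{
    for (int i = 0; i < 16; i++) {
        table->unique_chars[i] = 0;
        table->unique_index[i] = 0;
        table->lengths[i] = 0x7f;
    }
}

/* Returns a 16-bit bitmap with bit i set when slot i is still a candidate:
   its unique character matches and it is no longer than the search string. */
static uint32_t candidate_bitmap(const table_model *table,
                                 const uint8_t search[16],
                                 uint8_t search_len)
{
    uint32_t bitmap = 0;

    /* The length compare is signed, so keep the length in int8_t range. */
    assert(search_len <= 127);

    for (int i = 0; i < 16; i++) {
        uint8_t idx = table->unique_index[i];

        /* Models _mm_shuffle_epi8: a set high bit yields 0, otherwise the
           low four bits select a byte of the search string. */
        uint8_t c = (idx & 0x80) ? 0 : search[idx & 0x0f];

        int same_char = (c == table->unique_chars[i]);

        /* Models _mm_cmpgt_epi8 followed by the XOR inversion. */
        int too_long = (table->lengths[i] > (int8_t)search_len);

        if (same_char && !too_long) {
            bitmap |= 1u << i;
        }
    }
    return bitmap;
}
```

Empty slots never survive the filter because their stored length of 0x7f exceeds any search length the model accepts, regardless of what the character comparison says.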

Speeding Up Negative Matches with Assembly

Note

If you build the Tracer project, you can run a helper batch file in the root directory called cdb-simple.bat, which uses cdb to launch one of the project’s executables, ModuleLoader.exe. This will start up, load all of our tracing project’s DLLs, then allow the debugger to break in, yielding a debugger prompt from which we can easily disassemble functions, inspect runtime function entries, etc. This is the approach I used for capturing the output over the next couple of sections.

Now for the fun part! Let’s take a look at the disassembly of the initial part of version 10 responsible for the negative match logic and see if there are any improvements we can make.

0:000> uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h
00007fff`f69c1e04 c5fa6f5920      vmovdqu xmm3,xmmword ptr [rcx+20h]
00007fff`f69c1e09 4c8b6a08        mov     r13,qword ptr [rdx+8]
00007fff`f69c1e0d 4d8bf0          mov     r14,r8
00007fff`f69c1e10 440fb63a        movzx   r15d,byte ptr [rdx]
00007fff`f69c1e14 33ed            xor     ebp,ebp
00007fff`f69c1e16 44883c24        mov     byte ptr [rsp],r15b
00007fff`f69c1e1a 488bf1          mov     rsi,rcx
00007fff`f69c1e1d c4e279780c24    vpbroadcastb xmm1,byte ptr [rsp]
00007fff`f69c1e23 c4c17a6f6500    vmovdqu xmm4,xmmword ptr [r13]
00007fff`f69c1e29 c4e259004110    vpshufb xmm0,xmm4,xmmword ptr [rcx+10h]
00007fff`f69c1e2f c5f97411        vpcmpeqb xmm2,xmm0,xmmword ptr [rcx]
00007fff`f69c1e33 c5e164c9        vpcmpgtb xmm1,xmm3,xmm1
00007fff`f69c1e37 c5f1ef0d41320000 vpxor   xmm1,xmm1,xmmword ptr [StringTable2!_xmmffffffffffffffffffffffffffffffff (00007fff`f69c5080)]
00007fff`f69c1e3f c5e9dbd1        vpand   xmm2,xmm2,xmm1
00007fff`f69c1e43 c579d7c2        vpmovmskb r8d,xmm2
00007fff`f69c1e47 c5fa7f5c2410    vmovdqu xmmword ptr [rsp+10h],xmm3
00007fff`f69c1e4d 4585c0          test    r8d,r8d
00007fff`f69c1e50 0f849a000000    je      StringTable2!IsPrefixOfStringInTable_10+0x100 (00007fff`f69c1ef0)

There’s a bit of cruft at the start regarding setting up the function’s prologue (pushing non-volatile registers to the stack, etc.). That’s to be expected for C (and C++, and basically every language); as the programmer, you don’t have any direct control over how many registers a compiler uses for a routine, how much stack space it uses, which registers it uses when, etc.

However, with assembly, we’re on the opposite end of the spectrum: we can control everything! We also have a little trick up our sleeves: the venerable LEAF_ENTRY.

Windows x64 ABI Calling Conventions

First, some background. The Windows x64 ABI and calling convention dictate two types of functions: NESTED_ENTRY and LEAF_ENTRY.

NESTED_ENTRY

NESTED_ENTRY is by far the most common; C and C++ functions are all implicitly NESTED_ENTRY functions. (The LEAF_ENTRY and NESTED_ENTRY symbols are MASM (ml64.exe) macro names, but the concept applies to all languages.)

LEAF_ENTRY

A LEAF_ENTRY can only be implemented in assembly. It is constrained in that it may not manipulate any of the non-volatile x64 registers (rbx, rdi, rsi, rsp, rbp, r12, r13, r14, r15, xmm6-15), nor may it call any other functions (since call implicitly modifies the stack pointer), nor may it have a structured exception handler (since handling an exception for a given stack frame also manipulates the stack pointer).

The reason for these constraints is that LEAF_ENTRY routines do not have any unwind information generated for them in their runtime function entries. Unwind information is used by the kernel to, well, unwind the modifications made to non-volatile registers while traversing back up through the call stack looking for an exception handler in the event of an exception.

For example, here’s the function entry and associated unwind information for the PGO build of the IsPrefixOfStringInTable_10 function:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_10
Debugger function entry 000001d8`2ea03cf8 for:
(00007fff`f8411df0)   StringTable2!IsPrefixOfStringInTable_10
Exact matches:
    StringTable2!IsPrefixOfStringInTable_10 (struct _STRING_TABLE *,
                                             struct _STRING *,
                                             struct _STRING_MATCH *)

BeginAddress      = 00000000`00001df0
EndAddress        = 00000000`00001e59
UnwindInfoAddress = 00000000`000054f8

Unwind info at 00007fff`f84154f8, 14 bytes
  version 1, flags 0, prolog 14, codes 8
  00: offs 14, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 58 reg: rsi.
  02: offs 14, unwind op 4, op info 5   UWOP_SAVE_NONVOL FrameOffset: 50 reg: rbp.
  04: offs 14, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
  05: offs 10, unwind op 0, op info f   UWOP_PUSH_NONVOL reg: r15.
  06: offs e, unwind op 0, op info e    UWOP_PUSH_NONVOL reg: r14.
  07: offs c, unwind op 0, op info d    UWOP_PUSH_NONVOL reg: r13.

We can see that this routine manipulates six non-volatile registers in total, including the stack pointer. The first instructions of the routine constitute the function’s prologue; in the disassembly, you can see that three of the rxx registers are pushed to the stack, followed by the allocation of 0x20 (32) bytes of stack space:

0:000> uf StringTable2!IsPrefixOfStringInTable_10
StringTable2!IsPrefixOfStringInTable_10:
00007fff`f69c1df0 48896c2418      mov     qword ptr [rsp+18h],rbp
00007fff`f69c1df5 4889742420      mov     qword ptr [rsp+20h],rsi
00007fff`f69c1dfa 4155            push    r13
00007fff`f69c1dfc 4156            push    r14
00007fff`f69c1dfe 4157            push    r15
00007fff`f69c1e00 4883ec20        sub     rsp,20h

It also cheekily uses the home parameter space for stashing rbp and rsi instead of pushing them to the stack. That’s fair game, though—this is the PGO build, so I’d expect it to use some extra tricks to shave off a few cycles here and there. I’d do the same if I were writing assembly. (Side note: if you view the source of this page, there’s a commented-out section below that shows the runtime function entry for the release build of version 10; it uses nine registers instead of six and 40 bytes of stack space instead of 32. I wrote it before switching to using the PGO build for everything.)
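As a cross-check, the unwind codes can be reconciled with the prologue by hand. The following is a simplified sketch in C, not the debugger's or the kernel's implementation: the operation codes and register numbering are the ones visible in the `.fnent` output, while the helper names (`nonvol_reg_name`, `alloc_small_size`, `frame_offset_from_entry`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Unwind operation codes, as they appear in the .fnent output. */
enum unwind_op {
    UWOP_PUSH_NONVOL = 0,   /* push of a non-volatile register         */
    UWOP_ALLOC_SMALL = 2,   /* sub rsp, N  where N is 8..128 bytes     */
    UWOP_SAVE_NONVOL = 4    /* mov [rsp+offset], non-volatile register */
};

/* Register numbers used in the "op info" field (hardware encoding order). */
static const char *const reg_names[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Hypothetical helper: name of the register identified by an op info value. */
const char *nonvol_reg_name(unsigned op_info)
{
    assert(op_info < 16);
    return reg_names[op_info];
}

/* Hypothetical helper: UWOP_ALLOC_SMALL stores (size / 8) - 1 in op info. */
unsigned alloc_small_size(unsigned op_info)
{
    assert(op_info < 16);
    return op_info * 8 + 8;
}

/*
 * Hypothetical helper: a UWOP_SAVE_NONVOL FrameOffset is relative to rsp
 * after the prologue has run.  Given the number of pushes and the size of
 * the stack allocation, convert an offset relative to rsp at function entry
 * (as written in the disassembly) into that post-prologue frame offset.
 */
unsigned frame_offset_from_entry(unsigned entry_offset, unsigned pushes,
                                 unsigned alloc_bytes)
{
    return entry_offset + pushes * 8 + alloc_bytes;
}
```

With the values from the listing, `alloc_small_size(3)` gives 32 (the `sub rsp,20h`), and `frame_offset_from_entry(0x20, 3, 32)` gives 0x58, which is the FrameOffset reported for rsi; the same arithmetic on `[rsp+18h]` gives the 0x50 reported for rbp.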

The home parameter space is a 32-byte area that immediately follows the return address (i.e., the value of rsp when the function is entered); it is mandated by the x64 calling convention on Windows and is primarily intended to provide scratch space for a routine to home its parameter registers (i.e., the registers used for the first four arguments of a function: rcx, rdx, r8, and r9). This allows the four volatile registers to be repurposed within a routine while still providing a way to refer to the parameters if needed. That’s its intended use—however, it’s not strictly enforced, so you can essentially treat this area as a free 32-byte scratch space if you’re writing assembly.
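To make the layout concrete, here is a minimal sketch in C of where each home slot sits relative to rsp at function entry. The enum and the `home_slot_offset` helper are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The four register-passed integer arguments, in calling-convention order. */
enum param_reg { PARAM_RCX = 0, PARAM_RDX = 1, PARAM_R8 = 2, PARAM_R9 = 3 };

/*
 * Hypothetical helper: byte offset of a parameter's home slot from rsp at
 * function entry.  [rsp] holds the return address, so the 32-byte home area
 * starts at rsp + 8 and each slot is 8 bytes wide.
 */
size_t home_slot_offset(enum param_reg reg)
{
    assert((unsigned)reg <= (unsigned)PARAM_R9);
    return 8 + (size_t)reg * 8;
}
```

`home_slot_offset(PARAM_R8)` is 0x18 and `home_slot_offset(PARAM_R9)` is 0x20, which are the two slots the PGO build borrows for rbp and rsi in the prologue shown earlier.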

Note

On a semi-related note, I’d highly recommend reading A History of Modern 64-bit Computing if you have some spare time. It’s a fascinating insight into contemporary x64 conventions we often take for granted, drawing on numerous interviews with industry luminaries like Dave Cutler and Linus Torvalds. I found it incredibly useful for understanding the why behind concepts like home parameter space, structured exception handling, runtime function entries, and why you can’t write inline assembly for x64 with MSVC anymore: apparently, it provides a direct vector for disrupting the mechanisms relied upon by the kernel stack unwinding functionality. (At least, I think that’s the reason; can anyone from Microsoft confirm?)

Assembly Implementations

IsPrefixOfStringInTable_x64_1


So, knowing what we now know about the venerable little LEAF_ENTRY trick, let’s see if we can construct a simple routine in assembly that just deals with the negative match case.

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether the search string "starts with or is
;   equal to" any string in the table.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00

        ;IACA_VC_START

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result back into xmm0.
;

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm0, xmm0, xmm1

;
; Load the string table's unique character array into xmm2, and the lengths for
; each string slot into xmm3.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm5 to all ones.  This is used later.
;

        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's unique character array (xmm0) against the string
; table's unique chars (xmm2), saving the result back into xmm0.
;

        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm5            ; Invert the result.

;
; Intersect-via-test xmm0 and xmm1 to identify string slots of a suitable
; length with a matching unique character.
;

        vptest      xmm0, xmm1                  ; Check for no match.
        ;jnz        short @F                    ; There was a match.
                                                ; (Not yet implemented.)

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ;
        not         al                          ; rax = -1
        ret

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Note how we don’t need to push anything to the stack since we didn’t manipulate any non-volatile registers. If an exception occurs within the body of our implementation (say we dereference a NULL pointer), the kernel knows it doesn’t have to undo any non-volatile register modifications (using offsets specified by the unwind information) because there isn’t any unwind information. It can simply advance to the frame before us (the return address sits at rsp at the time of the fault, so the caller’s frame begins at rsp plus 8 bytes) as it continues its search for runtime function entries and associated unwind information. As you can see, the unwind info is effectively empty:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_x64_1
Debugger function entry 000001f9`048edf98 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_1 (void)

BeginAddress      = 00000000`00003290
EndAddress        = 00000000`000032cb
UnwindInfoAddress = 00000000`00004468

Unwind info at 00007ffd`15594468, 4 bytes
  version 1, flags 0, prolog 0, codes 0
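
For readers less used to reading vector code, the filtering steps of the routine can be restated in scalar C. This is a sketch of what the instructions compute, following the comparison as the comments describe it (a slot is excluded when its length is greater than the search string's length); it is not the author's C implementation. The `filter_table` type, the function names and the demo table are hypothetical, and giving unused slots a length of 127 so they never pass the length test is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the three 16-byte arrays loaded. */
typedef struct {
    uint8_t unique_chars[16];  /* one distinguishing character per slot     */
    uint8_t unique_index[16];  /* byte offset of that character in a string */
    int8_t  lengths[16];       /* slot lengths, compared as signed bytes    */
} filter_table;

/* Scalar equivalent of vpshufb for one 16-byte lane. */
static void shuffle16(uint8_t out[16], const uint8_t src[16],
                      const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}

/*
 * Returns a 16-bit bitmap with bit i set when slot i survives both tests:
 * its unique character matches, and its length is <= the search length.
 * A result of 0 corresponds to the negative match fast path.
 */
uint16_t candidate_slots(const filter_table *t, const char *s, size_t len)
{
    uint8_t buf[16] = {0}, shuffled[16];
    uint16_t bitmap = 0;

    assert(len <= 127);                   /* keep the signed compare simple */
    memcpy(buf, s, len < 16 ? len : 16);  /* the assembly always loads 16   */
    shuffle16(shuffled, buf, t->unique_index);

    for (int i = 0; i < 16; i++) {
        int char_match = (shuffled[i] == t->unique_chars[i]);  /* vpcmpeqb */
        int too_long   = (t->lengths[i] > (int8_t)len);        /* vpcmpgtb */
        if (char_match && !too_long)                   /* vpxor + vptest   */
            bitmap |= (uint16_t)(1u << i);
    }
    return bitmap;
}

/* Two-slot demo table holding "ab" and "cat"; unused slots get length 127. */
filter_table demo_table(void)
{
    filter_table t;
    memset(&t, 0, sizeof t);
    memset(t.lengths, 127, sizeof t.lengths);
    t.unique_chars[0] = 'a'; t.unique_index[0] = 0; t.lengths[0] = 2;
    t.unique_chars[1] = 'c'; t.unique_index[1] = 0; t.lengths[1] = 3;
    return t;
}
```

With the demo table, `candidate_slots(&t, "cathedral", 9)` leaves only bit 1 set (the "cat" slot), while `candidate_slots(&t, "ca", 2)` returns 0 because the only slot with a matching unique character is longer than the search string.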

Benchmark x64 1

Let’s see how this scrappy little fellow (who always returns NO_MATCH_FOUND but still mimics the steps required to successfully negative match) does against the leading C implementation at this point, version 10:

Fwoah, look at that, we’ve shaved about three cycles off the C version!

(Note that when I first wrote this, I was comparing the assembly version against the release build (not the PGO build), which was clocking in at about 13-14 cycles for negative matching. So getting it down to ~7.5 from 13-14 was a bit more exciting. Damn the PGO build and its 10.9-ish cycles for negative matching!)

The good news is that our theory about the performance of the LEAF_ENTRY looks like it’s paid off: we can reliably get about 7.5 cycles for negative matching.

IsPrefixOfStringInTable_x64_2


The bad news is that we now need to implement the rest of the functionality within the constraints of a LEAF_ENTRY!

The problem with a LEAF_ENTRY for anything more than a trivial bit of code is that you only have a handful of volatile registers to work with, and no stack space can be used for register spilling or temporaries. (Technically I could use the home parameter space, but, eh, we’re already avoiding stack spills, why not make life harder for ourselves and try to avoid all memory spilling.)

If you can’t spill to memory, your only option is really spilling to XMM registers via vpinsr and vpextr combinations, which, as you can see in the implementation of version 2 below, I have to do a lot.
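One way to picture this kind of spilling is to treat an XMM register as 16 bytes of scratch storage that can be addressed at several lane widths. The union below is a scalar model in C, assuming a little-endian machine (as x64 is); `xmm_model` and the `insert_*` helpers are hypothetical names that only mimic the effect of the insert instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one 128-bit XMM register, viewed at three lane widths. */
typedef union {
    uint64_t q[2];   /* vpinsrq / vpextrq lanes */
    uint32_t d[4];   /* vpinsrd / vpextrd lanes */
    uint8_t  b[16];  /* vpinsrb / vpextrb lanes */
} xmm_model;

/* Hypothetical helpers mirroring the insert instructions' effect. */
void insert_q(xmm_model *x, uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 2);
    x->q[lane] = v;
}

void insert_d(xmm_model *x, uint32_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    x->d[lane] = v;
}

void insert_b(xmm_model *x, uint8_t v, int lane)
{
    assert(lane >= 0 && lane < 16);
    x->b[lane] = v;
}
```

Storing a 64-bit value in `q[0]` and two 32-bit values in `d[2]` and `d[3]` keeps all three intact, because the two doubleword lanes together occupy `q[1]`. The cost is that every save and restore is an explicit instruction rather than a plain memory operand.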

(Also note: when I wrote this version, I didn’t use the disassembly from the C routines for guidance. I find that as soon as you start to grok the disassembly for a given routine, it becomes harder to think of ways to approach it from a fresh angle. Also, the LEAF_ENTRY aspect significantly limited what I could do anyway, so I figured I may as well just give it a crack from scratch and see what I could come up with. It would be an interesting point of reference compared to a future iteration that tries to improve on the disassembly of an optimized PGO version, for example.)

The diff view for this routine is less useful given the vast majority of the code is new, so I’ve put the full version of the code first. It’s based more or less on the approach used by version 8 of the C routine. (I actually wrote it after I wrote version 8; versions 9 and 10 of the C routine, the latter with the improved loop logic, came after.)


;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether the search string "starts with or is
;   equal to" any string in the table.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap (edx), later saved in xmm5d[2].
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (using the index, which is still in rax), then the string buffer from
; that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
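
The main loop of the listing above can also be restated in scalar C, which may help when following the register shuffling. This is a sketch under simplifying assumptions, not a translation of the whole routine: it only handles slots of 16 bytes or fewer (the byte-by-byte path at Pfx50 is left out), it ignores the optional match structure, and the helper names (`tzcnt32`, `popcnt32`, `bzhi32`, `matched_chars`, `first_prefix_slot`) are hypothetical portable stand-ins for the instructions used.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-ins for tzcnt, popcnt and bzhi on 32-bit values. */
static unsigned tzcnt32(uint32_t v)
{
    unsigned n = 0;
    if (v == 0)
        return 32;
    while ((v & 1u) == 0) { v >>= 1; n++; }
    return n;
}

static unsigned popcnt32(uint32_t v)
{
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

static uint32_t bzhi32(uint32_t v, unsigned n)
{
    return n >= 32 ? v : v & ((1u << n) - 1);
}

/* Count bytes of `search` equal to `slot` within the first search_len bytes. */
unsigned matched_chars(const uint8_t slot[16], const uint8_t search[16],
                       unsigned search_len)
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++)                 /* vpcmpeqb + vpmovmskb */
        if (slot[i] == search[i])
            mask |= 1u << i;
    return popcnt32(bzhi32(mask, search_len));   /* bzhi + popcnt */
}

/*
 * Walk the candidate bitmap lowest bit first, as the Pfx20 loop does, and
 * return the index of the first slot whose length equals the number of
 * matched characters, or -1 if none does.
 */
int first_prefix_slot(uint32_t bitmap, const uint8_t slots[][16],
                      const uint8_t lengths[16], const uint8_t search[16],
                      unsigned string_len)
{
    unsigned search_len = string_len < 16 ? string_len : 16;
    unsigned shift = 0;

    assert(bitmap <= 0xffffu);            /* one bit per slot, 16 slots      */
    while (bitmap) {
        unsigned tz    = tzcnt32(bitmap);
        unsigned index = tz + shift;      /* position in the original bitmap */
        bitmap >>= tz + 1;                /* shrx: drop the bit just found   */
        shift = index + 1;

        assert(lengths[index] <= 16);     /* long slots are out of scope     */
        if (matched_chars(slots[index], search, search_len) == lengths[index])
            return (int)index;
    }
    return -1;
}
```

The assembly terminates the loop with a popcnt-derived counter rather than by testing the bitmap for zero, but both visit the same slots in the same order. Note that the match test counts equal bytes within the search length and compares that count to the slot length, as the listing does.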

% diff -u IsPrefixOfStringInTable_x64_1.asm IsPrefixOfStringInTable_x64_2.asm
--- IsPrefixOfStringInTable_x64_1.asm   2018-04-29 11:03:46.403568800 -0400
+++ IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
@@ -50,12 +50,12 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result back into xmm0.
+; result into xmm5.
 ;

         ;IACA_VC_START
@@ -63,34 +63,36 @@
         mov     rax, String.Buffer[rdx]
         vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
         vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm0, xmm0, xmm1
+        vpshufb xmm5, xmm0, xmm1

 ;
-; Load the string table's unique character array into xmm2, and the lengths for
-; each string slot into xmm3.
-;
+; Load the string table's unique character array into xmm2.

         vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
-        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

 ;
-; Set xmm5 to all ones.  This is used later.
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (xmm2), saving the result back into xmm5.
 ;

-        vpcmpeqq    xmm5, xmm5, xmm5                    ; Set xmm5 to all ones.
+        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Load the lengths of each string table slot into xmm3.
 ;
+        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+;
+; Set xmm2 to all ones.  We use this later to invert the length comparison.
+;
+
+        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

 ;
-; Compare the search string's unique character array (xmm0) against the string
-; table's unique chars (xmm2), saving the result back into xmm0.
+; Broadcast the byte-sized string length into xmm4.
 ;

-        vpcmpeqb    xmm0, xmm0, xmm2            ; Compare unique chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
 ; Compare the search string's length, which we've broadcasted to all 8-bit
@@ -100,30 +102,378 @@
 ; a slot with a length less than or equal to our search string's length.
 ;

-        vpcmpgtb    xmm1, xmm4, xmm3            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm5            ; Invert the result.
+        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
+        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

 ;
-; Intersect-and-test the unique character match xmm mask register (xmm0) with
+; Intersect-and-test the unique character match xmm mask register (xmm5) with
 ; the length match mask xmm register (xmm1).  This affects flags, allowing us
 ; to do a fast-path exit for the no-match case (where ZF = 1).
 ;

-        vptest      xmm0, xmm1                  ; Check for no match.
-        ;jnz        short @F                    ; There was a match.
-                                                ; (Not yet implemented.)
+        vptest      xmm5, xmm1                  ; Check for no match.
+        jnz         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
 ;

-        xor         eax, eax                    ;
-        not         al                          ; rax = -1
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+        ;IACA_VC_END
+
+;
+; (There was at least one match, continue with processing.)
+;
+
+;
+; Calculate the "search length" for the incoming search string, which is
+; equivalent of 'min(String->Length, 16)'.  (The search string's length
+; currently lives in xmm4, albeit as a byte-value broadcasted across the
+; entire register, so extract that first.)
+;
+; Once the search length is calculated, deposit it back at the second byte
+; location of xmm4.
+;
+;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
+;
+;   r11 - String length (String->Length)
+;
+
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
+        mov         rax, 16                     ; Load 16 into rax.
+        mov         r10, r11                    ; Copy into r10.
+        cmp         r10w, ax                    ; Compare against 16.
+        cmova       r10w, ax                    ; Use 16 if length is greater.
+        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].
+
+;
+; Home our parameter registers into xmm registers instead of their stack-backed
+; location, to avoid memory writes.
+;
+
+        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
+        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
+        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].
+
+;
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; yielding a mask identifying indices we need to perform subsequent matches
+; upon.  Convert this into a bitmap and save in xmm2d[2].
+;
+
+        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.
+
+;
+; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
+;
+
+        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
+        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].
+
+;
+; Summary of xmm register stashing for the rest of the routine:
+;
+; xmm2:
+;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
+;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
+;
+; xmm4:
+;       0:7     (vpinsrb 0)     length of search string
+;       8:15    (vpinsrb 1)     min(String->Length, 16)
+;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
+;      24:31    (vpinsrb 3)     shift count
+;
+; xmm5:
+;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
+;      64:95    (vpinsrd 2)     bitmap of slots to compare
+;      96:127   (vpinsrd 3)     index of slot currently being processed
+;
+
+;
+; Initialize rcx as our counter register by doing a popcnt against the bitmap
+; we just generated in edx, and clear our shift count register (r9).
+;
+
+        popcnt      ecx, edx                    ; Count bits in bitmap.
+        xor         r9, r9                      ; Clear r9.
+
+        align 16
+
+;
+; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
+; trailing zeros of the bitmap, and then add in the shift count, producing an
+; index (rax) we can use to load the corresponding slot.
+;
+; Register usage at top of loop:
+;
+;   rax - Index.
+;
+;   rcx - Loop counter.
+;
+;   rdx - Bitmap initially, then slot length.
+;
+;   r9 - Shift count.
+;
+;   r10 - Search length.
+;
+;   r11 - String length.
+;
+
+Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
+        mov         eax, r8d                    ; Copy tzcnt to rax,
+        add         rax, r9                     ; Add shift to create index.
+        inc         r8                          ; tzcnt + 1
+        shrx        rdx, rdx, r8                ; Reposition bitmap.
+        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
+        xor         edx, edx                    ; Clear edx.
+        mov         r9, rax                     ; Copy index back to shift.
+        inc         r9                          ; Shift = Index + 1
+        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
+
+;
+; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
+; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
+;
+; Then, load the string table slot at this index into xmm1, then shift rax back.
+;
+
+        shl         eax, 4
+        vpextrq     r8, xmm2, 0
+        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
+        shr         eax, 4
+
+;
+; The search string's first 16 characters are already in xmm0.  Compare this
+; against the slot that has just been loaded into xmm1, storing the result back
+; into xmm1.
+;
+
+        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.
+
+;
+; Convert the XMM mask into a 32-bit representation, then zero high bits after
+; our "search length", which allows us to ignore the results of the comparison
+; above for bytes that were after the search string's length, if applicable.
+; Then, count the number of bits remaining, which tells us how many characters
+; we matched.
+;
+
+        vpmovmskb   r8d, xmm1                   ; Convert into mask.
+        bzhi        r8d, r8d, r10d              ; Zero high bits.
+        popcnt      r8d, r8d                    ; Count bits.
+
+;
+; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
+; it, we can load rax (the current index) into xmm1 and use it to extract the
+; slot length via shuffle.  (The length will be in the lowest byte of xmm1
+; after the shuffle, which we can then vpextrb.)
+;
+
+        movd        xmm1, rax                   ; Load index into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
+        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.
+
+;
+; If 16 characters matched, and the search string's length is longer than 16,
+; we're going to need to do a comparison of the remaining strings.
+;
+
+        cmp         r8w, 16                     ; Compare chars matched to 16.
+        je          short @F                    ; 16 chars matched.
+        jmp         Pfx30                       ; Less than 16 matched.
+
+;
+; All 16 characters matched.  If the slot length is greater than 16, we need
+; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
+; then great, that's a slot match, we're done.
+;
+
+@@:     cmp         dl, 16                      ; Compare length to 16.
+        ja          Pfx50                       ; Length is > 16.
+        je          short Pfx35                 ; Lengths match!
+                                                ; Length <= 16, fall through...
+
+;
+; Less than or equal to 16 characters were matched.  Compare this against the
+; length of the slot; if equal, this is a match, if not, no match, continue.
+;
+
+Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
+        jne         @F                          ; No match found.
+        jmp         short Pfx35                 ; Match found!
+
+;
+; No match against this slot, decrement counter and either continue the loop
+; or terminate the search and return no match.
+;
+
+@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        dec         cx                          ; Decrement counter.
+        jnz         Pfx20                       ; cx != 0, continue.
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
+; former is used when we need to copy the number of characters matched from r8
+; back to rax.  The latter jump target doesn't require this.
+;
+
+Pfx35:  mov         rax, r8                     ; Copy number of chars matched.
+
+;
+; Load the match parameter back into r8 and test to see if it's not-NULL, in
+; which case we need to fill out a STRING_MATCH structure for the match.
+;
+
+Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
+        test        r8, r8                      ; Is NULL?
+        jnz         short @F                    ; Not zero, need to fill out.
+
+;
+; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
+;
+
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        ret                                     ; StringMatch == NULL, finish.
+
+;
+; StringMatch is not NULL.  Fill out characters matched (currently rax), then
+; reload the index from xmm5 into rax and save.
+;
+
+@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
+        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
+        mov         byte ptr StringMatch.Index[r8], al
+
+;
+; Final step, loading the address of the string in the string array.  This
+; involves going through the StringTable, so we need to load that parameter
+; back into rcx, then resolving the string array address via pStringArray,
+; then the relevant STRING offset within the StringArray.Strings structure.
+;
+
+        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
+        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
+        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
+        shr         eax, 4                  ; Revert the scaling.
+
         ret

+;
+; 16 characters matched and the length of the underlying slot is greater than
+; 16, so we need to do a little memory comparison to determine if the search
+; string is a prefix match.
+;
+; The slot length is stored in rax at this point, and the search string's
+; length is stored in r11.  We know that the search string's length will
+; always be longer than or equal to the slot length at this point, so, we
+; can subtract 16 (currently stored in r10) from rax, and use the resulting
+; value as a loop counter, comparing the search string with the underlying
+; string slot byte-by-byte to determine if there's a match.
+;
+
+Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.
+
+;
+; Free up some registers by stashing their values into various xmm offsets.
+;
+
+        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
+        mov         rcx, rdx                ; Free up rdx, rcx is now counter.
+
+;
+; Load the search string buffer and advance it 16 bytes.
+;
+
+        vpextrq     r11, xmm2, 1            ; Extract String into r11.
+        mov         r11, String.Buffer[r11] ; Load buffer address.
+        add         r11, r10                ; Advance buffer 16 bytes.
+
+;
+; Loading the slot is more involved as we have to go to the string table, then
+; the pStringArray pointer, then the relevant STRING offset within the string
+; array (which requires re-loading the index from xmm5d[3]), then the string
+; buffer from that structure.
+;
+
+        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
+        mov         r8, StringTable.pStringArray[r8] ; Load string array.
+
+        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+
+        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
+        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
+        add         r8, r10                 ; Advance buffer 16 bytes.
+
+        xor         eax, eax                ; Clear eax.
+
+;
+; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
+; Do a byte-by-byte comparison.
+;
+
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;
+
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.
+
+;
+; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
+; how many characters we matched, and then jump to Pfx40 for finalization.
+;
+
+        add         rax, r10
+        jmp         Pfx40
+
+;
+; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
+; it.  If it's zero, we have no more strings to compare, so we can do a quick
+; exit.  If there are still comparisons to be made, restore the other registers
+; we trampled then jump back to the start of the loop Pfx20.
+;
+
+Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
+        dec         cx                          ; Decrement counter.
+        jnz         short @F                    ; Jump forward if not zero.
+
+;
+; No more comparisons remaining, return.
+;
+
+        xor         eax, eax                    ; Clear rax.
+        not         al                          ; al = -1
+        ret                                     ; Return.
+
+;
+; More comparisons remain; restore the registers we clobbered and continue loop.
+;
+
+@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
+        vpextrb     r11, xmm4, 0                ; Restore r11.
+        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+        jmp         Pfx20                       ; Continue comparisons.
+
         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_1, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
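
For readers more comfortable with C, the bookkeeping in the Pfx20 loop of the listing above can be modeled roughly as follows. This is only a scalar sketch, not code from the StringTable project: the helper names (count_trailing_zeros, count_bits, walk_candidates, matched_chars) are made up for illustration, and stand in for tzcnt, popcnt, the shift-count bookkeeping, and the vpmovmskb/bzhi/popcnt sequence respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt; the caller guarantees a non-zero value. */
static unsigned count_trailing_zeros(uint32_t value)
{
    unsigned count = 0;
    assert(value != 0);
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for popcnt. */
static unsigned count_bits(uint32_t value)
{
    unsigned count = 0;
    while (value != 0) {
        count += value & 1u;
        value >>= 1;
    }
    return count;
}

/*
 * Visit every set bit of the 16-bit candidate bitmap in the same order as
 * the Pfx20 loop, recording each slot index.  Returns the number visited.
 */
static unsigned walk_candidates(uint32_t bitmap, unsigned indexes[16])
{
    unsigned remaining;
    unsigned shift = 0;                         /* xor    r9, r9        */
    unsigned visited = 0;

    assert(bitmap <= 0xffffu);                  /* One bit per slot.    */
    remaining = count_bits(bitmap);             /* popcnt ecx, edx      */

    while (remaining != 0) {
        unsigned tz = count_trailing_zeros(bitmap);   /* tzcnt r8d, edx */
        unsigned index = tz + shift;                  /* add   rax, r9  */

        bitmap >>= tz + 1;                      /* shrx rdx, rdx, r8    */
        shift = index + 1;                      /* Shift = Index + 1    */
        indexes[visited++] = index;
        remaining--;                            /* dec cx               */
    }
    return visited;
}

/*
 * Given the 16-bit byte-equality mask for one slot (vpmovmskb), discard the
 * bits at or beyond the search length (bzhi) and count the rest (popcnt).
 */
static unsigned matched_chars(uint32_t equal_mask, unsigned search_length)
{
    assert(search_length <= 16);
    return count_bits(equal_mask & ((1u << search_length) - 1u));
}
```

With a hypothetical bitmap in which slots 1, 4 and 9 survived the initial filter, walk_candidates visits index 1, then 4, then 9. The shift count is what turns each "trailing zeros of what is left" value back into an absolute slot index.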

Looking back on my time logs (shout out to my favorite iPhone app, HoursTracker!), version 2 of the assembly routine took about 8 hours to implement, spread over roughly two days. Writing assembly is slow; writing correct assembly is even slower. I generally find that there’s a noticeable hump I need to get over in the first, say, 30 minutes of any assembly programming session, but once I get into the zone, things can start flowing quite nicely. I’m an aggressive debugger user; often, to get started, I’ll write a simple LEAF_ENTRY that looks like this:

    LEAF_ENTRY Foo, _TEXT$00
        int 3
        xor eax, eax
        ret
    LEAF_END Foo, _TEXT$00

The int 3 triggers a breakpoint as soon as the routine is called, so I can attach the debugger and at least inspect the parameter registers before writing the next couple of instructions. I find it definitely helps get me into the zone quicker.

Anyway, enough about that. Let’s look at performance. Again, this will be an interesting one—other than the optimal negative match logic that I copied from version 1, the sole focus was on getting a working assembly version; I wasn’t giving any thought to performance at this stage.

So, it’ll be interesting to see how it compares to a) version 1 in the negative matching case (it should be very close), and b) against the C versions in the prefix matching case (it hopefully won’t be prohibitively worse).

Benchmark x64 2: Negative Matching

Hmmm, that’s not too bad! We’re very close to version 1 for negative matching, within about 0.5 cycles or so. That sounds about right, given that our initial logic had to be tweaked a bit to play nicer with the rest of the implementation. And we’re still about 3-4 cycles faster than the fastest C version.
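
As a reminder of what the negative-match fast path being measured here actually computes, the following is a scalar C model of the vpshufb / vpcmpeqb / vpcmpgtb / vpxor / vptest sequence. It is a sketch under assumptions: table_model and candidate_bitmap are hypothetical names, and the struct only mirrors the three 16-byte fields the assembly loads (UniqueChars, UniqueIndex, Lengths), not the real STRING_TABLE layout.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the three 16-byte vectors used by the fast path. */
typedef struct {
    uint8_t unique_chars[16];   /* StringTable.UniqueChars */
    uint8_t unique_index[16];   /* StringTable.UniqueIndex */
    int8_t  lengths[16];        /* StringTable.Lengths     */
} table_model;

/*
 * Returns a bitmap with bit i set when slot i is still a candidate: its
 * unique character matches and its length does not exceed the search
 * string's length.  'search' must have 16 readable bytes, just as the
 * vmovdqu load in the assembly reads 16 bytes.
 */
static uint32_t candidate_bitmap(const table_model *table,
                                 const uint8_t search[16],
                                 int8_t search_length)
{
    uint32_t bitmap = 0;
    unsigned i;

    assert(search_length >= 0);

    for (i = 0; i < 16; i++) {
        uint8_t selector = table->unique_index[i];
        /* vpshufb: a selector with its high bit set yields zero. */
        uint8_t shuffled = (selector & 0x80) ? 0 : search[selector & 0x0f];
        /* vpcmpeqb against the table's unique characters. */
        int char_match = (shuffled == table->unique_chars[i]);
        /* vpcmpgtb is a signed compare; the vpxor inverts its result. */
        int length_ok = !(table->lengths[i] > search_length);

        if (char_match && length_ok) {
            bitmap |= 1u << i;
        }
    }
    return bitmap;   /* Zero corresponds to the ZF = 1 fast-path exit. */
}
```

A return value of zero is the case the assembly detects with a single vptest; anything else is the bitmap that version 2 goes on to walk slot by slot.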

What about prefix matching performance?

Benchmark x64 2: Prefix Matching

The prefix matching performance isn’t too bad either! We’re definitely slower than the C version, by about 4 to 10 cycles in most cases, with the $INDEX_ALLOCATION input about 13 cycles slower.

(I’ve just noticed the pattern with regards to the first 8 entries, $AttrDef to $Mft, clocking in at about 18 and 24 cycles respectively. But the next four entries, $Secure to $Cairo, consistently clock in at about 24 and 34 cycles respectively. $Secure is the 9th slot, which puts it at memory offset 192 bytes from the start of the string table. And then the 18 and 24 cycle behavior returns for the last two items, ???? and ., which are at the end of the string table’s inner slot array. This pattern is prevalent in all of our iterations. Very peculiar! We’ll investigate later.)

IsPrefixOfStringInTable_x64_3

(We’re nearly at the end of the first round of iterations, I promise!)

Seeing the performance of the second version in assembly, I decided to try whipping up a third version, which would switch from a LEAF_ENTRY to NESTED_ENTRY and use rep cmps for the byte comparison for long strings (instead of the byte-by-byte approach used now).

In order to use rep cmps, you need two non-volatile registers: rsi (the source index) and rdi (the destination index). You also need to set the direction of the comparison (via cld, which clears the direction flag), which means mutating the flags. These are also classed as non-volatile, so they need to be pushed to the stack in the prologue and popped back off in the epilogue.
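
For anyone who hasn’t used the string instructions before, here is a rough C model of what repe cmpsb does once rsi, rdi and rcx are loaded and the direction flag is clear. The names repe_result and repe_cmpsb_model are invented for this sketch; it only mirrors the documented behavior of the instruction (compare, advance both pointers, decrement the counter, stop on the first mismatch or when the counter hits zero).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t remaining;   /* Value left in rcx when the instruction stops.  */
    int    zero_flag;   /* ZF as set by the last byte comparison.         */
} repe_result;

/*
 * Model of 'repe cmpsb' with DF = 0 (i.e. after 'cld').  When rcx is zero
 * no comparison runs and the real instruction leaves ZF untouched; this
 * model reports 1 in that case purely as a simplification.
 */
static repe_result repe_cmpsb_model(const uint8_t *rsi,
                                    const uint8_t *rdi,
                                    size_t rcx)
{
    repe_result result;

    assert(rsi != NULL && rdi != NULL);

    result.remaining = rcx;
    result.zero_flag = 1;

    while (result.remaining != 0) {
        result.zero_flag = (*rsi++ == *rdi++);  /* cmpsb, then advance.   */
        result.remaining--;                     /* rcx is decremented     */
                                                /* even on a mismatch.    */
        if (!result.zero_flag) {
            break;                              /* repe stops when ZF = 0 */
        }
    }
    return result;
}
```

One detail worth noting from the model: a mismatch on the very last byte leaves the counter at zero, exactly like a full match does. The counter alone therefore cannot distinguish the two outcomes, which is why code following repe cmpsb typically branches on ZF (je/jne).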

I didn’t really expect this to offer a measurable speedup, but it was a tangible reason to use a NESTED_ENTRY, and otherwise allowed me to stay within the confines of the version 2 implementation.

Let’s take a look at the implementation. At the very least, it’s useful to see how you can go about organizing your prologue in MASM. For NESTED_ENTRY routines, I always define a Locals structure that incorporates the return address and home parameter space for easy access, mainly because it allows me to write code like this:

    mov     Locals.HomeRcx[rsp], rcx        ; Home our first param.
    mov     Locals.HomeRdx[rsp], rdx        ; Home our second param.
    mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
    mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.

Instead of working with offsets like this:

    mov     qword ptr [rsp+30h], rcx        ; Home our first param.
    mov     qword ptr [rsp+38h], rdx        ; Home our second param.
    mov     rsi, qword ptr [rsp+10h]        ; Restore rsi.
    mov     rdi, qword ptr [rsp+8]          ; Restore rdi.
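
If the MASM struct syntax is unfamiliar, the same idea can be expressed in C with offsetof. The sketch below mirrors the Locals structure used by version 3 of the routine (shown in the listing that follows); the C type and the check_locals_layout helper are illustrative only and do not exist in the project. It also shows that the LOCALS_SIZE expression simply reduces to the offset of ReturnAddress, i.e. the size of everything the routine itself allocates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C mirror of the MASM 'Locals struct'; every 'dq' becomes a uint64_t. */
typedef struct {
    uint64_t Padding;
    uint64_t SavedRdi;
    uint64_t SavedRsi;
    uint64_t SavedFlags;

    uint64_t ReturnAddress;     /* Pushed by the caller's 'call'.        */
    uint64_t HomeRcx;           /* Home space provided by the caller.    */
    uint64_t HomeRdx;
    uint64_t HomeR8;
    uint64_t HomeR9;
} Locals;

/*
 * Same arithmetic as the MASM 'LOCALS_SIZE equ ...' line: the struct size
 * plus a negative correction that removes the return address onward.
 */
enum {
    LOCALS_SIZE = (int)sizeof(Locals) +
                  ((int)offsetof(Locals, ReturnAddress) - (int)sizeof(Locals))
};

/* Sanity check of the arithmetic above; call from a test or at startup. */
static void check_locals_layout(void)
{
    assert(LOCALS_SIZE == (int)offsetof(Locals, ReturnAddress));
    assert(offsetof(Locals, SavedRdi) == 0x08);
    assert(offsetof(Locals, SavedRsi) == 0x10);
}
```

Naming the fields means the assembler (or here, the compiler) does the offset arithmetic, so rearranging the struct doesn’t require hunting down hard-coded displacements.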

This routine was written last, after version 10 of the C routine, so it incorporates the slightly re-arranged loop logic that proved to be faster for that version. Other than that, the main changes involved converting all the early exit returns in the body of the function to jump to a single exit point, Pfx90, mainly to simplify epilogue exit code.

Diff
Full

 % diff -u IsPrefixOfStringInTable_x64_2.asm IsPrefixOfStringInTable_x64_3.asm
--- IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
+++ IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:01:10.033827200 -0400
@@ -18,6 +18,31 @@

 include StringTable.inc

+;
+; Define a locals struct for saving flags, rsi and rdi.
+;
+
+Locals struct
+
+    Padding             dq      ?
+    SavedRdi            dq      ?
+    SavedRsi            dq      ?
+    SavedFlags          dq      ?
+
+    ReturnAddress       dq      ?
+    HomeRcx             dq      ?
+    HomeRdx             dq      ?
+    HomeR8              dq      ?
+    HomeR9              dq      ?
+
+Locals ends
+
+;
+; Exclude the return address onward from the frame calculation size.
+;
+
+LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
+
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -33,6 +58,14 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
+;   This routine is based off version 2.  It has been converted into a nested
+;   entry (version 2 is a leaf entry), and uses 'repe cmpsb' to do the string
+;   comparison for long strings (instead of the byte-by-byte comparison used
+;   in version 2).  This requires use of the rsi and rdi registers, and the
+;   direction flag.  These are all non-volatile registers and thus, must be
+;   saved to the stack in the function prologue (hence the need to make this
+;   a nested entry).
+;
 ; Arguments:
 ;
 ;   StringTable - Supplies a pointer to a STRING_TABLE struct.
@@ -50,7 +83,19 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
+
+;
+; Begin prologue.  Allocate stack space and save non-volatile registers.
+;
+
+        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
+
+        push_eflags                                 ; Save flags.
+        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
+        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
+
+        END_PROLOGUE

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -120,7 +165,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp         Pfx90                       ; Return.

         ;IACA_VC_END

@@ -214,7 +259,7 @@
 ;
 ;   rcx - Loop counter.
 ;
-;   rdx - Bitmap initially, then slot length.
+;   rdx - Bitmap.
 ;
 ;   r9 - Shift count.
 ;
@@ -228,8 +273,6 @@
         add         rax, r9                     ; Add shift to create index.
         inc         r8                          ; tzcnt + 1
         shrx        rdx, rdx, r8                ; Reposition bitmap.
-        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
-        xor         edx, edx                    ; Clear edx.
         mov         r9, rax                     ; Copy index back to shift.
         inc         r9                          ; Shift = Index + 1
         vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
@@ -252,7 +295,7 @@
 ; into xmm1.
 ;

-        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.
+        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

 ;
 ; Convert the XMM mask into a 32-bit representation, then zero high bits after
@@ -267,17 +310,6 @@
         popcnt      r8d, r8d                    ; Count bits.

 ;
-; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
-; it, we can load rax (the current index) into xmm1 and use it to extract the
-; slot length via shuffle.  (The length will be in the lowest byte of xmm1
-; after the shuffle, which we can then vpextrb.)
-;
-
-        movd        xmm1, rax                   ; Load index into xmm1.
-        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
-        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.
-
-;
 ; If 16 characters matched, and the search string's length is longer than 16,
 ; we're going to need to do a comparison of the remaining strings.
 ;
@@ -287,37 +319,38 @@
         jmp         Pfx30                       ; Less than 16 matched.

 ;
-; All 16 characters matched.  If the slot length is greater than 16, we need
-; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
-; then great, that's a slot match, we're done.
+; All 16 characters matched.  Load the underlying slot's length from the
+; relevant offset in the xmm3 register, then check to see if it's greater than,
+; equal or less than 16.
 ;

-@@:     cmp         dl, 16                      ; Compare length to 16.
+@@:     movd        xmm1, rax                   ; Load into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
+        vpextrb     rax, xmm1, 0                ; And extract back into rax.
+        cmp         al, 16                      ; Compare length to 16.
         ja          Pfx50                       ; Length is > 16.
         je          short Pfx35                 ; Lengths match!
                                                 ; Length <= 16, fall through...

 ;
 ; Less than or equal to 16 characters were matched.  Compare this against the
-; length of the slot; if equal, this is a match, if not, no match, continue.
+; length of the search string; if equal, this is a match.
 ;

-Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
-        jne         @F                          ; No match found.
-        jmp         short Pfx35                 ; Match found!
+Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
+        je          short Pfx35                 ; Match found!

 ;
 ; No match against this slot, decrement counter and either continue the loop
 ; or terminate the search and return no match.
 ;

-@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
         dec         cx                          ; Decrement counter.
         jnz         Pfx20                       ; cx != 0, continue.

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp         Pfx90                       ; Return.

 ;
 ; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
@@ -341,7 +374,7 @@
 ;

         vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        ret                                     ; StringMatch == NULL, finish.
+        jmp         Pfx90                       ; StringMatch == NULL, finish.

 ;
 ; StringMatch is not NULL.  Fill out characters matched (currently rax), then
@@ -367,7 +400,7 @@
         mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
         shr         eax, 4                  ; Revert the scaling.

-        ret
+        jmp         Pfx90

 ;
 ; 16 characters matched and the length of the underlying slot is greater than
@@ -382,14 +415,15 @@
 ; string slot byte-by-byte to determine if there's a match.
 ;

-Pfx50:  sub         rdx, r10                ; Subtract 16 from search length.
+Pfx50:  sub         rax, r10                ; Subtract 16 from search length.

 ;
 ; Free up some registers by stashing their values into various xmm offsets.
 ;

+        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
         vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
-        mov         rcx, rdx                ; Free up rdx, rcx is now counter.
+        mov         rcx, rax                ; Free up rax, rcx is now counter.

 ;
 ; Load the search string buffer and advance it 16 bytes.
@@ -409,31 +443,27 @@
         vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
         mov         r8, StringTable.pStringArray[r8] ; Load string array.

+        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
         shl         eax, 4                  ; Scale the index; sizeof STRING=16.

         lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
         mov         r8, String.Buffer[r8]   ; Load string table buffer address.
         add         r8, r10                 ; Advance buffer 16 bytes.

-        xor         eax, eax                ; Clear eax.
+        mov         rax, rcx                ; Copy counter.

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Do a byte-by-byte comparison.
+; Set up rsi/rdi so we can do a 'rep cmps'.
 ;

-        align 16
-@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
-        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
-        jne         short Pfx60                 ; If not equal, jump.
-
-;
-; The two bytes were equal, update rax, decrement rcx and potentially continue
-; the loop.
-;
+        cld
+        mov         rsi, r11
+        mov         rdi, r8
+        repe        cmpsb

-        inc         ax                          ; Increment index.
-        loopnz      @B                          ; Decrement cx and loop back.
+        test        cl, 0
+        jnz         short Pfx60                 ; Not all bytes compared, jump.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -460,7 +490,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        ret                                     ; Return.
+        jmp Pfx90                               ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -473,7 +503,17 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00
+        align   16
+
+Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
+        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
+        popfq                                   ; Restore flags.
+        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
+
+        ret
+
+        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 2.  It has been converted into a nested
;   entry (version 2 is a leaf entry), and uses 'repe cmpsb' to do the string
;   comparison for long strings (instead of the byte-by-byte comparison used
;   in version 2).  This requires use of the rsi and rdi registers, and the
;   direction flag.  These are all non-volatile registers and thus, must be
;   saved to the stack in the function prologue (hence the need to make this
;   a nested entry).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

        alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] if needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

        jnz         short Pfx60                 ; A byte differed, jump.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
        popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
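
For readers who find C easier to follow than MASM, the candidate-walking loop at Pfx20 above can be sketched roughly as follows. This is an illustrative, portable approximation rather than the routine itself: ctz32, popcount32 and candidate_slots are hypothetical helper names, the portable loops stand in for the tzcnt and popcnt instructions, and the sketch simply records every candidate index instead of comparing each slot and returning on the first match.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt on a non-zero 32-bit value. */
static unsigned ctz32(uint32_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for popcnt. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;     /* Clear the lowest set bit. */
        n++;
    }
    return n;
}

/*
 * Walk a 16-bit slot bitmap the way the Pfx20 loop does, writing each
 * candidate slot index into 'out'.  Returns the number of candidates.
 */
static unsigned candidate_slots(uint32_t bitmap, unsigned char out[16])
{
    unsigned count;
    unsigned shift = 0;                 /* r9: bits consumed so far.    */
    unsigned i;

    bitmap &= 0xffffu;                  /* Only 16 slots exist.         */
    count = popcount32(bitmap);         /* rcx: loop counter.           */

    for (i = 0; i < count; i++) {
        unsigned tz = ctz32(bitmap);    /* tzcnt r8d, edx               */
        unsigned index = tz + shift;    /* add rax, r9                  */
        bitmap >>= (tz + 1);            /* inc r8; shrx rdx, rdx, r8    */
        shift = index + 1;              /* Shift = Index + 1            */
        out[i] = (unsigned char)index;
    }

    return count;
}
```

Given a bitmap of 0x8004 (bits 2 and 15 set), candidate_slots() would report two candidates, slots 2 and 15, which is the order in which the assembly loop would visit them.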

I don’t have a strong hunch as to how version 3 will perform; like I said earlier, it was mainly done to set up the scaffolding for using a NESTED_ENTRY in the future, so that we’ll have the glue in place if we want to iterate on the disassembly of the PGO versions. If I had to guess, I suspect it will be slightly slower than version 2, but surely not by much, right? It’s a pretty minor change in the grand scheme of things. Let’s take a look.

Benchmark x64 3

Hah! Version 3 is much, much worse! Even its negative matching performance is terrible, which is the one thing the assembly versions have been good at so far. How peculiar.

Now, in the interest of keeping events chronological, as much as I’d like to dive in and figure out why, I’ll stick to what I actually did when I first encountered this performance gap: I laughed, shelved the version 3 experiment, and moved on.

That’s a decidedly unsatisfying end to the matter, though, I’ll admit. We’ll come back to it later in the article and try to get some closure as to why it was so slow, comparatively.

Internet Feedback

So, at this point, with version 10 of the C routine and version 2 of the assembly routine in hand, and a very early draft of this article, I solicited feedback on Twitter and got some great responses. Thanks again to Fabian Giesen, Wojciech Muła, Geoff Langdale, Daniel Lemire, and Kendall Willets for their discussion and input over the course of a few days!

Round 2—Post-Internet Feedback

Let’s take a look at the iterations that came about after receiving feedback.

IsPrefixOfStringInTable_11

Both Fabian Giesen and Wojciech Muła pointed out that we could use _mm_andnot_si128() to avoid the need to invert the results of the IncludeSlotsByLength XMM register (via _mm_xor_si128()). Let’s try that.
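
To see why the two forms are interchangeable, here is a minimal sketch using the SSE2 intrinsics from emmintrin.h. The function names (include_bitmap_xor_and, include_bitmap_andnot, include_bitmap) and the byte-array parameters are made up for this illustration and are not part of the StringTable code; each array is assumed to hold sixteen mask bytes that are either 0x00 or 0xff. Note the operand order: _mm_andnot_si128(a, b) computes (~a) & b, so the mask being inverted goes first.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics. */

/* Two-step form: invert the "ignore" mask with XOR, then AND. */
static int include_bitmap_xor_and(const unsigned char include_by_char[16],
                                  const unsigned char ignore_by_length[16])
{
    const __m128i all_ones = _mm_set1_epi8(-1);
    __m128i include = _mm_loadu_si128((const __m128i *)include_by_char);
    __m128i ignore = _mm_loadu_si128((const __m128i *)ignore_by_length);
    __m128i include_by_length = _mm_xor_si128(ignore, all_ones);

    return _mm_movemask_epi8(_mm_and_si128(include, include_by_length));
}

/* One-step form: (~ignore) & include in a single operation. */
static int include_bitmap_andnot(const unsigned char include_by_char[16],
                                 const unsigned char ignore_by_length[16])
{
    __m128i include = _mm_loadu_si128((const __m128i *)include_by_char);
    __m128i ignore = _mm_loadu_si128((const __m128i *)ignore_by_length);

    return _mm_movemask_epi8(_mm_andnot_si128(ignore, include));
}

/* Hypothetical wrapper: use the one-step form, checking it against the
   two-step form in builds where assertions are enabled. */
static int include_bitmap(const unsigned char include_by_char[16],
                          const unsigned char ignore_by_length[16])
{
    int bitmap = include_bitmap_andnot(include_by_char, ignore_by_length);

    assert(bitmap == include_bitmap_xor_and(include_by_char,
                                            ignore_by_length));
    return bitmap;
}
```

With slots 0, 2 and 3 matching on their unique character and slot 2 excluded by length, both forms should yield a bitmap with bits 0 and 3 set (0x9).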

Diff
Full

% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_11.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 10:38:09.357890400 -0400
+++ IsPrefixOfStringInTable_11.c        2018-04-26 12:43:44.184528000 -0400
@@ -18,7 +18,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_11(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -31,8 +31,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but with the vpandn used at the
+    end of the initial test, as suggested by Wojciech Mula (@pshufb).

 Arguments:

@@ -70,9 +70,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following five operations before checking any of
@@ -158,28 +156,25 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we invert the mask shortly after.
+    //      we do the "and not" intersection with the include slots next.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
-    //
-
-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
-    //
     // We're now ready to intersect the two XMM registers to determine which
     // slots should still be included in the comparison (i.e. which slots have
     // the exact same unique character as the string and a length less than or
     // equal to the length of the search string).
     //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
+    //

-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);

     //
     // Generate a mask.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_11(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but with the vpandn used at the
    end of the initial test, as suggested by Wojciech Mula (@pshufb).

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following five operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  3. Load the string table's slot lengths array into an XMM register.
    //
    //  4. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  5. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 3.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all five of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we do the "and not" intersection with the include slots next.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

We’re only shaving one instruction off here, so the performance gain, if any, should be very modest.

Benchmark 11

Definitely a slight improvement over version 10 in most cases!

IsPrefixOfStringInTable_x64_4

Something I didn’t know about vptest that Fabian pointed out is that it actually does two operations. The first essentially does an AND of the two input registers and sets the zero flag (ZF=1) if the result is all 0s. We’ve been using that aspect in the assembly version up to now.

However, it also does the equivalent of (second and (not first)) on its two operands, and sets the carry flag (CY=1) if that expression evaluates to all zeros. That’s handy, because it’s exactly the expression we want!
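
The two flag computations can be modelled in plain C to make the operand order explicit. This is a hypothetical scalar model, not real SIMD code: u128_t, ptest_flags_t, model_vptest and has_candidate are names invented for this sketch, and a pair of 64-bit integers stands in for an XMM register. For 'vptest a, b', ZF is set when (a and b) is all zeros, and CF is set when ((not a) and b) is all zeros.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128_t;     /* Stand-in for an XMM reg. */
typedef struct { int zf, cf; } ptest_flags_t;   /* The two flags of interest. */

/* Model of 'vptest a, b': ZF = ((a & b) == 0), CF = ((~a & b) == 0). */
static ptest_flags_t model_vptest(u128_t a, u128_t b)
{
    ptest_flags_t flags;

    flags.zf = (((a.lo & b.lo) | (a.hi & b.hi)) == 0);
    flags.cf = (((~a.lo & b.lo) | (~a.hi & b.hi)) == 0);
    return flags;
}

/*
 * Returns non-zero if at least one slot matched on its unique character
 * and was not excluded by length, i.e. the case where CF is clear.
 */
static int has_candidate(u128_t ignore_by_length, u128_t include_by_char)
{
    ptest_flags_t flags = model_vptest(ignore_by_length, include_by_char);
    u128_t survivors;

    /* The explicit "and not" that the carry flag summarizes. */
    survivors.lo = ~ignore_by_length.lo & include_by_char.lo;
    survivors.hi = ~ignore_by_length.hi & include_by_char.hi;
    assert(!flags.cf == ((survivors.lo | survivors.hi) != 0));

    return !flags.cf;
}
```

Passing the length-based ignore mask as the first operand and the unique-character match mask as the second means CF = 0 exactly when at least one slot matched on its unique character and was not excluded by length, which is the condition a jnc would branch on.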

So, let’s take version 2 of our assembly routine, remove the vpxor bit, and re-arrange the vptest inputs such that we can do a jnc instead of jnz:

Diff
Full

% diff -u IsPrefixOfStringInTable_x64_2.asm IsPrefixOfStringInTable_x64_4.asm
--- IsPrefixOfStringInTable_x64_2.asm   2018-04-26 14:15:53.805409700 -0400
+++ IsPrefixOfStringInTable_x64_4.asm   2018-04-26 14:16:37.909717200 -0400
@@ -33,6 +33,10 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
+;   This routine is based off version 2, but leverages the fact that
+;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
+;   to all 0s, avoiding the need to do the pxor or pandn steps.
+;
 ; Arguments:
 ;
 ;   StringTable - Supplies a pointer to a STRING_TABLE struct.
@@ -50,7 +54,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_2, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -83,12 +87,6 @@
         vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

 ;
-; Set xmm2 to all ones.  We use this later to invert the length comparison.
-;
-
-        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.
-
-;
 ; Broadcast the byte-sized string length into xmm4.
 ;

@@ -103,16 +101,16 @@
 ;

         vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

 ;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
-; the length match mask xmm register (xmm1).  This affects flags, allowing us
-; to do a fast-path exit for the no-match case (where ZF = 1).
+; the inverted length match mask xmm register (xmm1).  This will set the carry
+; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
+; us to do a fast-path exit for the no-match case.
 ;

-        vptest      xmm5, xmm1                  ; Check for no match.
-        jnz         short Pfx10                 ; There was a match.
+        vptest      xmm1, xmm5                  ; Check for no match.
+        jnc         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
@@ -159,12 +157,12 @@
         vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

 ;
-; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
 ; yielding a mask identifying indices we need to perform subsequent matches
 ; upon.  Convert this into a bitmap and save in xmm2d[2].
 ;

-        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
         vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

 ;
@@ -473,7 +471,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_2, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 2, but leverages the fact that
;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
;   to all 0s, avoiding the need to do the pxor or pandn steps.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Unlike version 2, we don't
; invert the result here: each 0xff element of xmm1 indicates a slot that is
; longer than the search string and must be ignored.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the inverted length match mask xmm register (xmm1).  This will set the carry
; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
; us to do a fast-path exit for the no-match case.
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00
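
Two steps in the routine above lean on vpshufb: gathering the search string's bytes at the table's unique indexes, and looking up a single slot length by index. As a rough illustration only, here is a scalar C model of that instruction and of the length lookup. The names `pshufb_model` and `slot_length_at` are made up for this sketch and are not part of the string table code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of a 128-bit (v)pshufb: each output byte is selected from
 * 'src' by the low four bits of the matching control byte, or zeroed if the
 * control byte's high bit is set.
 */
static void pshufb_model(uint8_t out[16],
                         const uint8_t src[16],
                         const uint8_t ctrl[16])
{
    uint8_t tmp[16];
    unsigned i;

    for (i = 0; i < 16; i++) {
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0f];
    }

    /* Copy via a temporary so that 'out' may alias 'src' or 'ctrl'. */
    for (i = 0; i < 16; i++) {
        out[i] = tmp[i];
    }
}

/*
 * Models the slot length lookup: place the index in byte 0 of an otherwise
 * zeroed control register, shuffle the lengths, then read byte 0 back.
 */
static uint8_t slot_length_at(const uint8_t lengths[16], unsigned index)
{
    uint8_t ctrl[16] = { 0 };
    uint8_t out[16];

    assert(index < 16);

    ctrl[0] = (uint8_t)index;
    pshufb_model(out, lengths, ctrl);
    return out[0];
}
```

With a control register of unique indexes, `pshufb_model` yields the search string's "unique character" at every slot position in one step, which is what the first vpshufb in the routine is doing.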

Let’s see how that stacks up against the existing version 2 of the assembly routine.

Benchmark x64 4

Nice, we’ve shaved an entire cycle off the negative match path! I say that both seriously and sarcastically. A single cycle, wow, stop the press! On the other hand, going from 8 cycles to 7 cycles is usually a lot harder than, say, going from 100,000 cycles to 80,000 cycles. We’re so close to the lower bound that additional cycle improvements are a lot like trying to get blood out of a stone.
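
To make the flag trick a little more concrete, here is a scalar C model of the two flags vptest produces, and of the version 2 and version 4 "no match" decisions built on them. This is only a sketch: `vec128`, `vptest_zf`, `vptest_cf`, `no_match_v2`, `no_match_v4` and `check_one_lane` are illustrative names, and the byte loops stand in for what the hardware does in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for an XMM register: 16 byte lanes. */
typedef struct { uint8_t b[16]; } vec128;

/* ZF after 'vptest dst, src': 1 if (src AND dst) is all 0s. */
static int vptest_zf(const vec128 *dst, const vec128 *src)
{
    uint8_t acc = 0;
    size_t i;

    for (i = 0; i < 16; i++) {
        acc |= (uint8_t)(src->b[i] & dst->b[i]);
    }
    return acc == 0;
}

/* CF after 'vptest dst, src': 1 if (src AND (NOT dst)) is all 0s. */
static int vptest_cf(const vec128 *dst, const vec128 *src)
{
    uint8_t acc = 0;
    size_t i;

    for (i = 0; i < 16; i++) {
        acc |= (uint8_t)(src->b[i] & ~dst->b[i]);
    }
    return acc == 0;
}

/* Version 2: invert the "too long" mask first, then test ZF. */
static int no_match_v2(const vec128 *unique_match, const vec128 *too_long)
{
    vec128 include_by_length;
    size_t i;

    for (i = 0; i < 16; i++) {
        include_by_length.b[i] = (uint8_t)~too_long->b[i];   /* vpxor */
    }
    return vptest_zf(unique_match, &include_by_length);      /* ZF = 1 */
}

/* Version 4: no inversion; let vptest do the "and not" and test CF. */
static int no_match_v4(const vec128 *unique_match, const vec128 *too_long)
{
    return vptest_cf(too_long, unique_match);                /* CF = 1 */
}

/* Both formulations agree for every 0x00/0xff combination in a lane. */
static void check_one_lane(void)
{
    static const uint8_t values[2] = { 0x00, 0xff };
    int u, t;

    for (u = 0; u < 2; u++) {
        for (t = 0; t < 2; t++) {
            vec128 unique = { { 0 } };
            vec128 too_long = { { 0 } };

            unique.b[0] = values[u];
            too_long.b[0] = values[t];
            assert(no_match_v2(&unique, &too_long) ==
                   no_match_v4(&unique, &too_long));
        }
    }
}
```

The only difference between the two decisions is where the inversion happens: version 2 pays for it with an extra instruction, whereas version 4 gets it for free as part of how the carry flag is defined.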

IsPrefixOfStringInTable_12

← IsPrefixOfStringInTable_11 | IsPrefixOfStringInTable_13 →

The vptest fast-path exit definitely yielded a repeatable and measurable gain for the assembly version. Let’s replicate it in a C version.

Diff
Full

% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_12.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 13:28:06.006627100 -0400
+++ IsPrefixOfStringInTable_12.c        2018-04-26 17:47:54.970331600 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_12(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,8 +32,15 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but factors in the improvements
+    made to version 4 of the x64 assembly version, thanks to suggestions from
+    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
+
+    Like version 11, we omit the vpxor to invert the lengths, but instead of
+    an initial vpandn, we leverage the fact that vptest sets the carry flag
+    if all 0s result from the expression: "(not param1) and param2".  This
+    allows us to do a fast-path early exit (like x64 version 2 does) if no
+    match is found.

 Arguments:

@@ -71,9 +78,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following six operations before checking any of
@@ -159,47 +164,58 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we invert the mask shortly after.
+    //      we do the "and not" intersection with the include slots next.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
+    // We can do a fast-path test for no match here via _mm_testc_si128(),
+    // which is essentially equivalent to the following logic, just with
+    // fewer instructions:
     //
-
-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
-    //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
+    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+    //                                      IncludeSlotsByUniqueChar);
     //
-
-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
-
+    //      if (!IncludeSlots) {
+    //          return NO_MATCH_FOUND;
+    //      }
     //
-    // Generate a mask.
     //

-    Bitmap = _mm_movemask_epi8(IncludeSlots);
-
-    if (!Bitmap) {
+    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

         //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
+        // No remaining slots were left after we intersected the slots with
+        // matching unique characters with the inverted slots to ignore due
+        // to length.  Thus, no prefix match was found.
         //

         return NO_MATCH_FOUND;
     }

     //
+    // Continue with the remaining logic, including actually generating the
+    // IncludeSlots, which we need for bitmap generation as part of our
+    // comparison loop.
+    //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
+    //
+
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);
+
+    //
+    // Generate a mask, count the number of bits, and initialize the search
+    // length.
+    //
+
+    Bitmap = _mm_movemask_epi8(IncludeSlots);
+
+    //
     // Calculate the "search length" of the incoming string, which ensures we
     // only compare up to the first 16 characters.
     //

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_12(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but factors in the improvements
    made to version 4 of the x64 assembly version, thanks to suggestions from
    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).

    Like version 11, we omit the vpxor to invert the lengths, but instead of
    an initial vpandn, we leverage the fact that vptest sets the carry flag
    if all 0s result from the expression: "(not param1) and param2".  This
    allows us to do a fast-path early exit (like x64 version 2 does) if no
    match is found.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Count;
    ULONG Length;
    ULONG Index;
    ULONG Shift = 0;
    ULONG CharactersMatched;
    ULONG NumberOfTrailingZeros;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we do the "and not" intersection with the include slots next.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We can do a fast-path test for no match here via _mm_testc_si128(),
    // which is essentially equivalent to the following logic, just with
    // fewer instructions:
    //
    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
    //                                      IncludeSlotsByUniqueChar);
    //
    //      if (!IncludeSlots) {
    //          return NO_MATCH_FOUND;
    //      }
    //
    //

    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

        //
        // No remaining slots were left after we intersected the slots with
        // matching unique characters with the inverted slots to ignore due
        // to length.  Thus, no prefix match was found.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Continue with the remaining logic, including actually generating the
    // IncludeSlots, which we need for bitmap generation as part of our
    // comparison loop.
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask, count the number of bits, and initialize the search
    // length.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    //
    // A popcount against the mask will tell us how many slots we matched, and
    // thus, need to compare.
    //

    Count = __popcnt(Bitmap);

    do {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap and adding the amount we've already shifted by.
        //

        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
        Index = NumberOfTrailingZeros + Shift;

        //
        // Shift the bitmap right, past the zeros and the 1 that was just found,
        // such that it's positioned correctly for the next loop's tzcnt. Update
        // the shift count accordingly.
        //

        Bitmap >>= (NumberOfTrailingZeros + 1);
        Shift = Index + 1;

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    } while (--Count);

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
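
To make the per-slot check inside the loop a bit more concrete, here is a minimal, portable sketch of the compare, mask and popcount step. It is not the article's code: zero_high_bits(), popcount32() and slot_is_prefix() are hypothetical helpers standing in for _bzhi_u32(), __popcnt() and the _mm_cmpeq_epi8()/_mm_movemask_epi8() pair. It assumes slots of 16 characters or fewer that are zero-padded, and strings without embedded NULs.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for _bzhi_u32(): keep only the low n bits of x. */
static uint32_t zero_high_bits(uint32_t x, unsigned n)
{
    return n >= 32 ? x : x & ((1u << n) - 1u);
}

/* Portable stand-in for __popcnt(): count the set bits in x. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/*
 * Hypothetical model of the per-slot check.  Returns 1 if the slot (at most
 * 16 characters, zero-padded to 16 bytes) is a prefix of the search string.
 */
static int slot_is_prefix(const char slot[16], unsigned slot_len,
                          const char *search, unsigned search_len)
{
    char padded[16] = {0};
    unsigned capped = search_len < 16 ? search_len : 16;
    uint32_t eq = 0;
    unsigned matched;

    /* Copy instead of doing a 16-byte load, so we never over-read. */
    memcpy(padded, search, capped);

    /* Models _mm_movemask_epi8(_mm_cmpeq_epi8(Slot, Search)). */
    for (unsigned i = 0; i < 16; i++) {
        if (slot[i] == padded[i]) {
            eq |= 1u << i;
        }
    }

    /* Ignore lanes at or beyond the search length, then count matches. */
    matched = popcount32(zero_high_bits(eq, capped));

    return matched >= slot_len;
}
```

In this model, zero padding in the slot and in the search buffer compares equal, so for $Mft against $MftMirr the raw comparison mask has bits 8 through 15 set as well as bits 0 through 3. Zeroing everything at or above the search length leaves four matched characters, which equals the slot length, so the slot counts as a prefix match.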

Benchmark 12

Eh, there’s not much in this one. The negative match fast path is basically identical, and the normal prefix matches are a tiny bit slower.

IsPrefixOfStringInTable_13

← IsPrefixOfStringInTable_12 | IsPrefixOfStringInTable_14 →

Another tip from Fabian: we can tweak the loop logic further. Instead of shifting the bitmap right each iteration (and keeping a separate shift count), we can just leverage the blsr instruction (exposed via the _blsr_u32() intrinsic), which stands for reset lowest set bit, and is equivalent to doing x & (x - 1). This allows us to tweak the loop organization as well, such that we can simply do while (Bitmap) { } instead of the do { } while (--Count) approach we’ve been using.
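
A rough, portable sketch of the two loop shapes follows, assuming a 16-bit bitmap like the one _mm_movemask_epi8() produces here. The helpers trailing_zeros() and count_set_bits() are hypothetical stand-ins for _tzcnt_u32() and __popcnt(), and bitmap &= bitmap - 1 stands in for _blsr_u32(). The real routines compare a slot at each index; this sketch just records the indexes so the two approaches can be compared.

```c
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); assumes x is non-zero. */
static unsigned trailing_zeros(uint32_t x)
{
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for __popcnt(). */
static unsigned count_set_bits(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/*
 * Older loop shape: shift the bitmap past each set bit and keep a running
 * shift count.  Assumes bitmap fits in 16 bits.  Writes each visited index
 * to out[] and returns how many were written.
 */
static unsigned visit_by_shifting(uint32_t bitmap, unsigned out[16])
{
    unsigned n = 0;
    unsigned shift = 0;
    unsigned count = count_set_bits(bitmap);

    if (count == 0) {
        return 0;   /* the do/while below would otherwise run once */
    }

    do {
        unsigned zeros = trailing_zeros(bitmap);
        unsigned index = zeros + shift;

        bitmap >>= (zeros + 1);
        shift = index + 1;
        out[n++] = index;
    } while (--count);

    return n;
}

/*
 * Newer loop shape: clear the lowest set bit each time around, so no
 * count, shift or separate zero check is needed.
 */
static unsigned visit_by_clearing(uint32_t bitmap, unsigned out[16])
{
    unsigned n = 0;

    while (bitmap) {
        out[n++] = trailing_zeros(bitmap);
        bitmap &= bitmap - 1;   /* what _blsr_u32(bitmap) computes */
    }

    return n;
}
```

Both functions visit the same indexes in the same ascending order; the second one simply carries less state from one iteration to the next.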


% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_13.c
--- IsPrefixOfStringInTable_10.c        2018-04-26 18:22:23.926168500 -0400
+++ IsPrefixOfStringInTable_13.c        2018-04-26 19:16:34.926170200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_13(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,8 +32,10 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 8, but rewrites the inner loop that
-    checks for comparisons.
+    This version is based off version 10, but does away with the bitmap
+    shifting logic and `do { } while (--Count)` loop, instead simply using
+    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
+    Giesen (@rygorous) for pointing this approach out.

 Arguments:

@@ -54,12 +56,9 @@
 {
     ULONG Bitmap;
     ULONG Mask;
-    ULONG Count;
     ULONG Length;
     ULONG Index;
-    ULONG Shift = 0;
     ULONG CharactersMatched;
-    ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
     STRING_SLOT Slot;
@@ -206,31 +205,26 @@

     SearchLength = min(String->Length, 16);

-    //
-    // A popcount against the mask will tell us how many slots we matched, and
-    // thus, need to compare.
-    //
-
-    Count = __popcnt(Bitmap);
-
-    do {
+    while (Bitmap) {

         //
         // Extract the next index by counting the number of trailing zeros left
-        // in the bitmap and adding the amount we've already shifted by.
+        // in the bitmap.
         //

-        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
-        Index = NumberOfTrailingZeros + Shift;
+        Index = _tzcnt_u32(Bitmap);

         //
-        // Shift the bitmap right, past the zeros and the 1 that was just found,
-        // such that it's positioned correctly for the next loop's tzcnt. Update
-        // the shift count accordingly.
+        // Clear the bitmap's lowest set bit, such that it's ready for the next
+        // loop's tzcnt if no match is found in this iteration.  Equivalent to
+        //
+        //      Bitmap &= Bitmap - 1;
+        //
+        // (Which the optimizer will convert into a blsr instruction anyway in
+        //  non-debug builds.  But it's nice to be explicit.)
         //

-        Bitmap >>= (NumberOfTrailingZeros + 1);
-        Shift = Index + 1;
+        Bitmap = _blsr_u32(Bitmap);

         //
         // Load the slot and its length.
@@ -313,7 +307,7 @@

         return (STRING_TABLE_INDEX)Index;

-    } while (--Count);
+    }

     //
     // If we get here, we didn't find a match.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_13(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version is based off version 10, but does away with the bitmap
    shifting logic and `do { } while (--Count)` loop, instead simply using
    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
    Giesen (@rygorous) for pointing this approach out.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Length;
    ULONG Index;
    ULONG CharactersMatched;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlotsByLength;
    XMMWORD IncludeSlots;
    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // Invert the result of the comparison; we want 0xff for slots to include
    // and 0x0 for slots to ignore (it's currently the other way around).  We
    // can achieve this by XOR'ing the result against our all-ones XMM register.
    //

    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);

    //
    // We're now ready to intersect the two XMM registers to determine which
    // slots should still be included in the comparison (i.e. which slots have
    // the exact same unique character as the string and a length less than or
    // equal to the length of the search string).
    //

    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
                                 IncludeSlotsByLength);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    if (!Bitmap) {

        //
        // No bits were set, so there are no strings in this table starting
        // with the same character and of a lesser or equal length as the
        // search string.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    while (Bitmap) {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap.
        //

        Index = _tzcnt_u32(Bitmap);

        //
        // Clear the bitmap's lowest set bit, such that it's ready for the next
        // loop's tzcnt if no match is found in this iteration.  Equivalent to
        //
        //      Bitmap &= Bitmap - 1;
        //
        // (Which the optimizer will convert into a blsr instruction anyway in
        //  non-debug builds.  But it's nice to be explicit.)
        //

        Bitmap = _blsr_u32(Bitmap);

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    }

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}

I like this change. It was a great suggestion from Fabian. Let’s see how it performs. Hopefully it’ll do slightly better at prefix matching, given that we’re effectively reducing the number of instructions required as part of the string comparison logic.

Benchmark 13

Ah! A measurable, repeatable speed-up! Excellent!

IsPrefixOfStringInTable_14

← IsPrefixOfStringInTable_13 | IsPrefixOfStringInTable_15 →

Let’s give the C version the same chance as the assembly version with regards to negative matching; we’ll take version 13 above and factor in the vptest logic from version 12.
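
As a reminder of what that fast path tests, here is a small portable model of the check. It is only a sketch: vec16, testc() and andnot_movemask() are hypothetical names that model an XMM register, _mm_testc_si128() and the _mm_andnot_si128() plus _mm_movemask_epi8() combination respectively. It relies on the documented behaviour that _mm_testc_si128(a, b) returns 1 when (NOT a) AND b is all zeros.

```c
#include <stdint.h>
#include <string.h>

/* Sixteen byte lanes, standing in for an XMM register. */
typedef struct {
    uint8_t lane[16];
} vec16;

/*
 * Models _mm_testc_si128(a, b): returns 1 when ((~a) & b) has no bits set,
 * which is the condition under which vptest sets the carry flag.
 */
static int testc(const vec16 *a, const vec16 *b)
{
    uint8_t any = 0;

    for (int i = 0; i < 16; i++) {
        any |= (uint8_t)(~a->lane[i] & b->lane[i]);
    }

    return any == 0;
}

/*
 * Models _mm_movemask_epi8(_mm_andnot_si128(a, b)): one bit per lane, taken
 * from the high bit of ((~a) & b) in that lane.
 */
static uint32_t andnot_movemask(const vec16 *a, const vec16 *b)
{
    uint32_t mask = 0;

    for (int i = 0; i < 16; i++) {
        uint8_t v = (uint8_t)(~a->lane[i] & b->lane[i]);
        mask |= (uint32_t)(v >> 7) << i;
    }

    return mask;
}
```

With a playing the role of IgnoreSlotsByLength and b the role of IncludeSlotsByUniqueChar, and every lane being either 0x00 or 0xff (as the byte comparisons produce), testc() returning 1 corresponds to a bitmap of zero, i.e. no candidate slots. The appeal of the fast path is that the negative case can bail out before the bitmap is ever generated.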

% diff -u IsPrefixOfStringInTable_13.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_13.c        2018-04-26 19:16:34.926170200 -0400
+++ IsPrefixOfStringInTable_14.c        2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_13(
+IsPrefixOfStringInTable_14(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,10 +32,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 10, but does away with the bitmap
-    shifting logic and `do { } while (--Count)` loop, instead simply using
-    blsr in conjunction with `while (Bitmap) { }`.  Credit goes to Fabian
-    Giesen (@rygorous) for pointing this approach out.
+    This version combines the altered bitmap logic from version 13 with the
+    fast-path _mm_testc_si128() exit from version 12.

 Arguments:

@@ -70,9 +68,7 @@
     XMMWORD TableUniqueChars;
     XMMWORD IncludeSlotsByUniqueChar;
     XMMWORD IgnoreSlotsByLength;
-    XMMWORD IncludeSlotsByLength;
     XMMWORD IncludeSlots;
-    const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);

     //
     // Unconditionally do the following five operations before checking any of
@@ -164,22 +160,43 @@
     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

     //
-    // Invert the result of the comparison; we want 0xff for slots to include
-    // and 0x0 for slots to ignore (it's currently the other way around).  We
-    // can achieve this by XOR'ing the result against our all-ones XMM register.
+    // We can do a fast-path test for no match here via _mm_testc_si128(),
+    // which is essentially equivalent to the following logic, just with
+    // fewer instructions:
     //
+    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+    //                                      IncludeSlotsByUniqueChar);
+    //
+    //      if (!IncludeSlots) {
+    //          return NO_MATCH_FOUND;
+    //      }
+    //
+    //
+
+    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

-    IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
+        //
+        // No remaining slots were left after we intersected the slots with
+        // matching unique characters with the inverted slots to ignore due
+        // to length.  Thus, no prefix match was found.
+        //
+
+        return NO_MATCH_FOUND;
+    }

     //
-    // We're now ready to intersect the two XMM registers to determine which
-    // slots should still be included in the comparison (i.e. which slots have
-    // the exact same unique character as the string and a length less than or
-    // equal to the length of the search string).
+    // Continue with the remaining logic, including actually generating the
+    // IncludeSlots, which we need for bitmap generation as part of our
+    // comparison loop.
+    //
+    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+    // at the moment (we want 0xff for slots to include, and 0x00 for slots
+    // to ignore; it's currently the other way around), we use _mm_andnot_si128
+    // instead of just _mm_and_si128.
     //

-    IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
-                                 IncludeSlotsByLength);
+    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+                                    IncludeSlotsByUniqueChar);

     //
     // Generate a mask.
@@ -187,17 +204,6 @@

     Bitmap = _mm_movemask_epi8(IncludeSlots);

-    if (!Bitmap) {
-
-        //
-        // No bits were set, so there are no strings in this table starting
-        // with the same character and of a lesser or equal length as the
-        // search string.
-        //
-
-        return NO_MATCH_FOUND;
-    }
-
     //
     // Calculate the "search length" of the incoming string, which ensures we
     // only compare up to the first 16 characters.

% diff -u IsPrefixOfStringInTable_12.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_12.c        2018-04-26 17:47:54.970331600 -0400
+++ IsPrefixOfStringInTable_14.c        2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@

 _Use_decl_annotations_
 STRING_TABLE_INDEX
-IsPrefixOfStringInTable_12(
+IsPrefixOfStringInTable_14(
     PSTRING_TABLE StringTable,
     PSTRING String,
     PSTRING_MATCH Match
@@ -32,15 +32,8 @@
     search string.  That is, whether any string in the table "starts with
     or is equal to" the search string.

-    This version is based off version 10, but with factors in the improvements
-    made to version 4 of the x64 assembly version, thanks to suggestions from
-    both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
-
-    Like version 11, we omit the vpxor to invert the lengths, but instead of
-    an initial vpandn, we leverage the fact that vptest sets the carry flag
-    if all 0s result from the expression: "param1 and (not param2)".  This
-    allows us to do a fast-path early exit (like x64 version 2 does) if no
-    match is found.
+    This version combines the altered bitmap logic from version 13 with the
+    fast-path _mm_testc_si128() exit from version 12.

 Arguments:

@@ -61,12 +54,9 @@
 {
     ULONG Bitmap;
     ULONG Mask;
-    ULONG Count;
     ULONG Length;
     ULONG Index;
-    ULONG Shift = 0;
     ULONG CharactersMatched;
-    ULONG NumberOfTrailingZeros;
     ULONG SearchLength;
     PSTRING TargetString;
     STRING_SLOT Slot;
@@ -118,7 +108,7 @@
     // Load the first 16-bytes of the search string into an XMM register.
     //

-    Search.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);
+    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

     //
     // Broadcast the search string's unique characters according to the string
@@ -164,7 +154,7 @@
     // N.B. Because we default the length of empty slots to 0x7f, they will
     //      handily be included in the ignored set (i.e. their words will also
     //      be set to 0xff), which means they'll also get filtered out when
-    //      we do the "and not" intersection with the include slots next.
+    //      we invert the mask shortly after.
     //

     IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
@@ -209,8 +199,7 @@
                                     IncludeSlotsByUniqueChar);

     //
-    // Generate a mask, count the number of bits, and initialize the search
-    // length.
+    // Generate a mask.
     //

     Bitmap = _mm_movemask_epi8(IncludeSlots);
@@ -222,31 +211,26 @@

     SearchLength = min(String->Length, 16);

-    //
-    // A popcount against the mask will tell us how many slots we matched, and
-    // thus, need to compare.
-    //
-
-    Count = __popcnt(Bitmap);
-
-    do {
+    while (Bitmap) {

         //
         // Extract the next index by counting the number of trailing zeros left
-        // in the bitmap and adding the amount we've already shifted by.
+        // in the bitmap.
         //

-        NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
-        Index = NumberOfTrailingZeros + Shift;
+        Index = _tzcnt_u32(Bitmap);

         //
-        // Shift the bitmap right, past the zeros and the 1 that was just found,
-        // such that it's positioned correctly for the next loop's tzcnt. Update
-        // the shift count accordingly.
+        // Clear the bitmap's lowest set bit, such that it's ready for the next
+        // loop's tzcnt if no match is found in this iteration.  Equivalent to
+        //
+        //      Bitmap &= Bitmap - 1;
+        //
+        // (Which the optimizer will convert into a blsr instruction anyway in
+        //  non-debug builds.  But it's nice to be explicit.)
         //

-        Bitmap >>= (NumberOfTrailingZeros + 1);
-        Shift = Index + 1;
+        Bitmap = _blsr_u32(Bitmap);

         //
         // Load the slot and its length.
@@ -329,7 +313,7 @@

         return (STRING_TABLE_INDEX)Index;

-    } while (--Count);
+    }

     //
     // If we get here, we didn't find a match.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_14(
    PSTRING_TABLE StringTable,
    PSTRING String,
    PSTRING_MATCH Match
    )
/*++

Routine Description:

    Searches a string table to see if any strings "prefix match" the given
    search string.  That is, whether any string in the table "starts with
    or is equal to" the search string.

    This version combines the altered bitmap logic from version 13 with the
    fast-path _mm_testc_si128() exit from version 12.

Arguments:

    StringTable - Supplies a pointer to a STRING_TABLE struct.

    String - Supplies a pointer to a STRING struct that contains the string to
        search for.

    Match - Optionally supplies a pointer to a variable that contains the
        address of a STRING_MATCH structure.  This will be populated with
        additional details about the match if a non-NULL pointer is supplied.

Return Value:

    Index of the prefix match if one was found, NO_MATCH_FOUND if not.

--*/
{
    ULONG Bitmap;
    ULONG Mask;
    ULONG Length;
    ULONG Index;
    ULONG CharactersMatched;
    ULONG SearchLength;
    PSTRING TargetString;
    STRING_SLOT Slot;
    STRING_SLOT Search;
    STRING_SLOT Compare;
    SLOT_LENGTHS Lengths;
    XMMWORD LengthXmm;
    XMMWORD UniqueChar;
    XMMWORD TableUniqueChars;
    XMMWORD IncludeSlotsByUniqueChar;
    XMMWORD IgnoreSlotsByLength;
    XMMWORD IncludeSlots;

    //
    // Unconditionally do the following six operations before checking any of
    // the results and determining how the search should proceed:
    //
    //  1. Load the search string into an Xmm register, and broadcast the
    //     character indicated by the unique character index (relative to
    //     other strings in the table) across a second Xmm register.
    //
    //  2. Load the string table's unique character array into an Xmm register.
    //
    //  3. Broadcast the search string's length into an XMM register.
    //
    //  4. Load the string table's slot lengths array into an XMM register.
    //
    //  5. Compare the unique character from step 1 to the string table's unique
    //     character array set up in step 2.  The result of this comparison
    //     will produce an XMM register with each byte set to either 0xff if
    //     the unique character was found, or 0x0 if it wasn't.
    //
    //  6. Compare the search string's length from step 3 to the string table's
    //     slot length array set up in step 4.  This allows us to identify the
    //     slots that have strings that are of lesser or equal length to our
    //     search string.  As we're doing a prefix search, we can ignore any
    //     slots longer than our incoming search string.
    //
    // We do all six of these operations up front regardless of whether or not
    // they're strictly necessary.  That is, if the unique character isn't in
    // the unique character array, we don't need to load array lengths -- and
    // vice versa.  However, we assume the benefits afforded by giving the CPU
    // a bunch of independent things to do unconditionally up-front outweigh
    // the cost of putting in branches and conditionally loading things if
    // necessary.
    //

    //
    // Load the first 16-bytes of the search string into an XMM register.
    //

    Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);

    //
    // Broadcast the search string's unique characters according to the string
    // table's unique character index.
    //

    UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
                                  StringTable->UniqueIndex.IndexXmm);

    //
    // Load the slot length array into an XMM register.
    //

    Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);

    //
    // Load the string table's unique character array into an XMM register.
    //

    TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);

    //
    // Broadcast the search string's length into an XMM register.
    //

    LengthXmm.m128i_u8[0] = (BYTE)String->Length;
    LengthXmm = _mm_broadcastb_epi8(LengthXmm);

    //
    // Compare the search string's unique character with all of the unique
    // characters of strings in the table, saving the results into an XMM
    // register.  This comparison will indicate which slots we can ignore
    // because the characters at a given index don't match.  Matched slots
    // will be 0xff, unmatched slots will be 0x0.
    //

    IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);

    //
    // Find all slots that are longer than the incoming string length, as these
    // are the ones we're going to exclude from any prefix match.
    //
    // N.B. Because we default the length of empty slots to 0x7f, they will
    //      handily be included in the ignored set (i.e. their words will also
    //      be set to 0xff), which means they'll also get filtered out when
    //      we invert the mask shortly after.
    //

    IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);

    //
    // We can do a fast-path test for no match here via _mm_testc_si128(),
    // which is essentially equivalent to the following logic, just with
    // fewer instructions:
    //
    //      IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
    //                                      IncludeSlotsByUniqueChar);
    //
    //      if (!IncludeSlots) {
    //          return NO_MATCH_FOUND;
    //      }
    //
    //

    if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {

        //
        // No remaining slots were left after we intersected the slots with
        // matching unique characters with the inverted slots to ignore due
        // to length.  Thus, no prefix match was found.
        //

        return NO_MATCH_FOUND;
    }

    //
    // Continue with the remaining logic, including actually generating the
    // IncludeSlots, which we need for bitmap generation as part of our
    // comparison loop.
    //
    // As the IgnoreSlotsByLength XMM register is the inverse of what we want
    // at the moment (we want 0xff for slots to include, and 0x00 for slots
    // to ignore; it's currently the other way around), we use _mm_andnot_si128
    // instead of just _mm_and_si128.
    //

    IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
                                    IncludeSlotsByUniqueChar);

    //
    // Generate a mask.
    //

    Bitmap = _mm_movemask_epi8(IncludeSlots);

    //
    // Calculate the "search length" of the incoming string, which ensures we
    // only compare up to the first 16 characters.
    //

    SearchLength = min(String->Length, 16);

    while (Bitmap) {

        //
        // Extract the next index by counting the number of trailing zeros left
        // in the bitmap.
        //

        Index = _tzcnt_u32(Bitmap);

        //
        // Clear the bitmap's lowest set bit, such that it's ready for the next
        // loop's tzcnt if no match is found in this iteration.  Equivalent to
        //
        //      Bitmap &= Bitmap - 1;
        //
        // (Which the optimizer will convert into a blsr instruction anyway in
        //  non-debug builds.  But it's nice to be explicit.)
        //

        Bitmap = _blsr_u32(Bitmap);

        //
        // Load the slot and its length.
        //

        Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
        Length = Lengths.Slots[Index];

        //
        // Compare the slot to the search string.
        //

        Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);

        //
        // Create a mask of the comparison, then filter out high bits from the
        // search string's length (which is capped at 16).  (This shouldn't be
        // technically necessary as the string array buffers should have been
        // calloc'd and zeroed, but optimizing compilers can often ignore the
        // zeroing request -- which can produce some bizarre results where the
        // debug build is correct (because the buffers were zeroed) but the
        // release build fails because the zeroing got ignored and there are
        // junk bytes past the NULL terminator, which get picked up in our
        // 128-bit loads.)
        //

        Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);

        //
        // Count how many characters matched.
        //

        CharactersMatched = __popcnt(Mask);

        if ((USHORT)CharactersMatched < Length && Length <= 16) {

            //
            // The slot length is longer than the number of characters matched
            // from the search string; this isn't a prefix match.  Continue.
            //

            continue;
        }

        if (Length > 16) {

            //
            // The first 16 characters in the string matched against this
            // slot, and the slot is oversized (longer than 16 characters),
            // so do a direct comparison between the remaining buffers.
            //

            TargetString = &StringTable->pStringArray->Strings[Index];

            CharactersMatched = IsPrefixMatch(String, TargetString, 16);

            if (CharactersMatched == NO_MATCH_FOUND) {

                //
                // The prefix match failed, continue our search.
                //

                continue;
            }
        }

        //
        // This slot is a prefix match.  Fill out the Match structure if the
        // caller provided a non-NULL pointer, then return the index of the
        // match.
        //

        if (ARGUMENT_PRESENT(Match)) {

            Match->Index = (BYTE)Index;
            Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
            Match->String = &StringTable->pStringArray->Strings[Index];

        }

        return (STRING_TABLE_INDEX)Index;

    }

    //
    // If we get here, we didn't find a match.
    //

    //IACA_VC_END();

    return NO_MATCH_FOUND;
}
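The character-counting step in the loop above (compare, movemask, bzhi, popcnt) can be modeled in portable scalar C. This is only a sketch: ZeroHighBits, CountBits and CountMatchedCharacters are hypothetical helpers standing in for the intrinsics, and the sample strings are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Portable stand-in for _bzhi_u32(): keep the low Count bits, zero the rest.
static uint32_t ZeroHighBits(uint32_t Value, uint32_t Count)
{
    return (Count >= 32) ? Value : (Value & ((1u << Count) - 1u));
}

// Portable stand-in for __popcnt(): count the set bits.
static uint32_t CountBits(uint32_t Value)
{
    uint32_t Count = 0;

    while (Value) {
        Value &= Value - 1;     // Clear the lowest set bit.
        Count++;
    }

    return Count;
}

// Scalar model of cmpeq + movemask + bzhi + popcnt for one 16-byte slot.
// Both buffers must be 16 bytes long; SearchLength is capped at 16.
static uint32_t CountMatchedCharacters(const uint8_t Slot[16],
                                       const uint8_t Search[16],
                                       uint32_t SearchLength)
{
    uint32_t Mask = 0;

    assert(SearchLength <= 16);

    for (uint32_t Index = 0; Index < 16; Index++) {
        if (Slot[Index] == Search[Index]) {
            Mask |= 1u << Index;    // One bit per matching byte.
        }
    }

    // Ignore any "matches" past the end of the search string.
    return CountBits(ZeroHighBits(Mask, SearchLength));
}
```

With a slot holding "table" and a search string of "tablespoon" (length 10), the raw comparison mask also has bits set for the zero padding both buffers happen to share past byte 10. Zeroing those high bits leaves a count of 5, which equals the slot length, so the slot is a prefix match.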

We’re obviously clutching at straws here when it comes to eking out more performance. The _mm_testc_si128() alteration was a tiny bit slower for version 12 across the board. However, version 4 of our assembly, which uses vptest (the underlying instruction that the _mm_testc_si128() intrinsic maps to), was definitely a little bit faster than the other versions. Let’s see how our final C version performs.
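For reference, here's a scalar model of what the two approaches compute. It's a sketch only, not the intrinsics themselves: AndNotMoveMask and TestC are hypothetical names, and each 128-bit register is modeled as a plain 16-byte array.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

// Model of _mm_movemask_epi8(_mm_andnot_si128(Ignore, Include)): one bit per
// byte lane, taken from the high bit of (~Ignore & Include).
static uint32_t AndNotMoveMask(const uint8_t Ignore[LANES],
                               const uint8_t Include[LANES])
{
    uint32_t Bitmap = 0;

    assert(Ignore != 0 && Include != 0);

    for (int Lane = 0; Lane < LANES; Lane++) {
        uint8_t Value = (uint8_t)(~Ignore[Lane] & Include[Lane]);
        Bitmap |= (uint32_t)(Value >> 7) << Lane;
    }

    return Bitmap;
}

// Model of _mm_testc_si128(Ignore, Include): returns 1 (carry flag set) when
// (~Ignore & Include) is zero in every bit, otherwise 0.
static int TestC(const uint8_t Ignore[LANES], const uint8_t Include[LANES])
{
    uint8_t Any = 0;

    for (int Lane = 0; Lane < LANES; Lane++) {
        Any |= (uint8_t)(~Ignore[Lane] & Include[Lane]);
    }

    return Any == 0;
}
```

When every lane is either 0x00 or 0xff (which is what the byte comparisons produce), TestC returns 1 exactly when AndNotMoveMask returns an empty bitmap. That's why the test-and-branch form can stand in for the separate and-not, movemask and zero check on the negative match path.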

Benchmark 14

Welp, at least it’s consistent! Like version 12, the _mm_testc_si128() change doesn’t really offer a compelling improvement for version 14. That makes version 13 officially our fastest C implementation for round 2.

IsPrefixOfStringInTable_x64_5


Before we conclude round 2, let’s see if we can eke any more performance out of the negative match fast path of our fastest assembly version so far: version 4. For this step, I’m going to leverage the Intel Architecture Code Analyzer, or IACA for short.

This is a handy little static analysis tool that can provide useful information for fine-tuning performance-sensitive code. Let’s take a look at the output from IACA for our assembly version 4. To do this, I uncomment the two macros, IACA_VC_START and IACA_VC_END, which reside at the start and end of the negative match logic. These macros are defined in StringTable.inc, and look like this:

IACA_VC_START macro Name

        mov     byte ptr gs:[06fh], 06fh

        endm

IACA_VC_END macro Name

        mov     byte ptr gs:[0deh], 0deh

        endm

The equivalent versions for C are defined in Rtl.h, and look like this:

//
// Define start/end markers for IACA.
//

#define IACA_VC_START() __writegsbyte(111, 111)
#define IACA_VC_END()   __writegsbyte(222, 222)

You may have noticed commented-out versions of these macros in both the C and assembly code. What they do is emit a specific byte pattern into the instruction stream that the IACA tool can detect. You place the start and end markers around the code you’re interested in, recompile it, then run IACA against the final executable (or library).
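To make the placement concrete, here's a minimal sketch of wrapping a region of C code with the markers. The two marker definitions are the ones quoted above; everything else is an assumption added for illustration: the USE_IACA_MARKERS opt-in switch, the no-op fallback, and the SumBytes function are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

//
// USE_IACA_MARKERS is a hypothetical opt-in switch for this sketch.  When it
// isn't defined (or the compiler isn't MSVC targeting x64), the markers
// compile away to nothing.
//

#if defined(USE_IACA_MARKERS) && defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#define IACA_VC_START() __writegsbyte(111, 111)
#define IACA_VC_END()   __writegsbyte(222, 222)
#else
#define IACA_VC_START() ((void)0)
#define IACA_VC_END()   ((void)0)
#endif

// Hypothetical routine; the region between the markers is what would be
// analyzed.
static uint32_t SumBytes(const uint8_t *Buffer, size_t Length)
{
    uint32_t Sum = 0;

    assert(Buffer != NULL || Length == 0);

    IACA_VC_START();

    for (size_t Index = 0; Index < Length; Index++) {
        Sum += Buffer[Index];
    }

    IACA_VC_END();

    return Sum;
}
```

Keeping the markers behind a switch means normal builds are unaffected. Because the enabled markers are real stores to gs-relative memory, it's probably safest to treat a marked build as something you analyze rather than something you run.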

Let’s see what happens when we do this for our version 4 assembly routine. I’ll include the relevant assembly snippet, reformatted into a more concise fashion, followed by the IACA output (also reformatted into a more concise fashion):


mov      rax,  String.Buffer[rdx]                       ; Load address of string buffer.
vmovdqu  xmm0, xmmword ptr [rax]                        ; Load search buffer.
vmovdqa  xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
vpshufb  xmm5, xmm0, xmm1                               ; Rearrange string by uniq. ix.
vmovdqa  xmm2, xmmword ptr StringTable.UniqueChars[rcx] ; Load unique chars.
vpcmpeqb xmm5, xmm5, xmm2                               ; Compare unique chars.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx]     ; Load table lengths.
vpbroadcastb xmm4, byte ptr String.Length[rdx]          ; Broadcast string length.
vpcmpgtb xmm1, xmm3, xmm4                               ; Identify long slots.
vptest   xmm1, xmm5                                     ; Unique slots AND (!long slots).
jnc      short Pfx10                                    ; CY=0, continue with routine.
xor      eax, eax                                       ; CY=1, no match.  Clear rax.
not      al                                             ; al = -1 (NO_MATCH_FOUND)
ret                                                     ; Return NO_MATCH_FOUND

S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.74 Cycles       Throughput Bottleneck: Dependency Chains
Loop Count:  22
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|µops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm1, xmmword ptr [rcx+0x10]
| 1  |   |   |       |       | |1.0|   | | vpshufb xmm5, xmm0, xmm1
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm2, xmmword ptr [rcx]
| 1  |1.0|   |0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmm2
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |1.0|       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of µops: 18

The Intel Architecture Code Analyzer User Manual (v3.0) provides decent documentation about how to interpret the output, so I won’t go over the gory details. What I’m really looking at in this pass is what my block throughput is, and potentially what the bottleneck is.

In this case, our block throughput is being reported as 3.74 cycles, which basically indicates how many CPU cycles each pass through the block costs when it’s executed back-to-back in a loop (which is how IACA’s throughput analysis models it). Our bottleneck is dependency chains, which refers to the situation where, say, instruction C can’t start because the results from instruction A aren’t ready yet.
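As a rough illustration of what a dependency chain looks like, here's a generic example (it's not code from the string table project, and SumOneChain and SumTwoChains are hypothetical names). Both functions compute the same sum; the difference is how many independent chains of additions they expose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Every addition needs the result of the previous one: a single serial chain.
static uint64_t SumOneChain(const uint32_t *Values, size_t Count)
{
    uint64_t Sum = 0;

    assert(Values != NULL || Count == 0);

    for (size_t Index = 0; Index < Count; Index++) {
        Sum += Values[Index];
    }

    return Sum;
}

// Two accumulators form two independent chains; an addition into Even never
// has to wait on an addition into Odd.
static uint64_t SumTwoChains(const uint32_t *Values, size_t Count)
{
    uint64_t Even = 0;
    uint64_t Odd = 0;
    size_t Index = 0;

    assert(Values != NULL || Count == 0);

    for (; Index + 1 < Count; Index += 2) {
        Even += Values[Index];
        Odd += Values[Index + 1];
    }

    if (Index < Count) {
        Even += Values[Index];      // Odd element count: pick up the last one.
    }

    return Even + Odd;
}
```

Whether the second form is actually faster depends on the compiler and the CPU (an optimizer may well transform either version), so treat it as a picture of the concept rather than a tuning recipe.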

Alright, well, what can we do? A good answer would be that with an intimate understanding of contemporary Intel CPU architecture, you can pinpoint exactly what needs changing in order to reduce dependencies, maximize port utilization, and generally make the CPU happier.

Or you can just move shit around until the number gets smaller. That’s what I did.

Well, that’s not entirely true. When Fabian was reviewing some of my assembly, he pointed out that I was often needlessly doing a load into an XMM register only to use it once in a subsequent operation. Instead of doing that, I could just use the load-op version of the instruction, which allows one of the instruction’s input operands to be sourced directly from memory.

For example, instead of this:

vmovdqa  xmm2, xmmword ptr StringTable.UniqueChars[rcx] ; Load unique chars.
vpcmpeqb xmm5, xmm5, xmm2                               ; Compare unique chars.

You can just do this:

vpcmpeqb xmm5, xmm5, xmmword ptr StringTable.UniqueChars[rcx] ; Compare...

But yeah, other than a few load-op tweaks, I basically just shuffled shit around until the block throughput reported lower. Very rigorous methodology, I know. Here’s the final version, which also happens to be the version quoted in the introduction of this article:


mov      rax,  String.Buffer[rdx]                   ; Load address of string buffer.
vpbroadcastb xmm4, byte ptr String.Length[rdx]      ; Broadcast string length.
vmovdqa  xmm3, xmmword ptr StringTable.Lengths[rcx] ; Load table lengths.
vmovdqu  xmm0, xmmword ptr [rax]                    ; Load string buffer.
vpcmpgtb xmm1, xmm3, xmm4                           ; Identify slots > string len.
vpshufb  xmm5, xmm0, StringTable.UniqueIndex[rcx]   ; Rearrange string by unique index.
vpcmpeqb xmm5, xmm5, StringTable.UniqueChars[rcx]   ; Compare rearranged to unique.
vptest   xmm1, xmm5                                 ; Unique slots AND (!long slots).
jnc      short Pfx10                                ; CY=0, continue with routine.
xor      eax, eax                                   ; CY=1, no match.
not      al                                         ; al = -1 (NO_MATCH_FOUND).
ret                                                 ; Return NO_MATCH_FOUND.

S:\Source\tracer>iaca x64\Release\StringTable2.dll
Intel(R) Architecture Code Analyzer
Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  x64\Release\StringTable2.dll
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.48 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  24
Port Binding In Cycles Per Iteration:
----------------------------------------------------------------------------
| Port   |  0  - DV  |  1  |  2  - D   |  3  - D   |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------------
| Cycles | 2.0   0.0 | 1.0 | 3.5   3.5 | 3.5   3.5 | 0.0 | 3.0 | 2.0 | 0.0 |
----------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

|    | Ports pressure in cycles        | |
|µops|0DV| 1 | 2 - D | 3 - D |4| 5 | 6 |7|
-------------------------------------------
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | mov rax, qword ptr [rdx+0x8]
| 2  |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpbroadcastb xmm4, byte ptr [rdx]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqa xmm3, xmmword ptr [rcx+0x20]
| 1  |   |   |0.5 0.5|0.5 0.5| |   |   | | vmovdqu xmm0, xmmword ptr [rax]
| 1  |1.0|   |       |       | |   |   | | vpcmpgtb xmm1, xmm3, xmm4
| 2^ |   |   |0.5 0.5|0.5 0.5| |1.0|   | | vpshufb xmm5, xmm0, xmmword ptr [rcx+0x10]
| 2^ |   |1.0|0.5 0.5|0.5 0.5| |   |   | | vpcmpeqb xmm5, xmm5, xmmword ptr [rcx]
| 2  |1.0|   |       |       | |1.0|   | | vptest xmm1, xmm5
| 1  |   |   |       |       | |   |1.0| | jnb 0x10
| 1* |   |   |       |       | |   |   | | xor eax, eax
| 1  |   |   |       |       | |   |1.0| | not al
| 3^ |   |   |0.5 0.5|0.5 0.5| |   |   | | ret
Total Num Of µops: 18

As you can see, that is reporting a block throughput of 3.48 instead of 3.74. A whopping 0.26 reduction! Also note the bottleneck is now being reported as FrontEnd, which basically means that the thing holding up this code now is literally the CPU’s ability to decode the actual instruction stream into actionable internal work. (Again, super simplistic explanation of a very complex process.)
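If the vector instructions are hard to follow, the fast-path snippets above can be read against a scalar model. The following is a sketch under stated assumptions, not the project's code: FastPathTable and CandidateSlots are hypothetical names, unique indexes are assumed to be in the range 0-15, lengths are assumed to fit in a signed byte (vpcmpgtb is a signed compare), and unused slots are assumed to carry a length of 127 so they always fail the length test.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 16

// Hypothetical, simplified stand-in for the three 16-byte arrays that the
// fast path reads from the string table.
typedef struct {
    uint8_t UniqueChars[SLOTS];     // One distinguishing character per slot.
    uint8_t UniqueIndex[SLOTS];     // Byte offset of that character.
    uint8_t Lengths[SLOTS];         // Length of the string in each slot.
} FastPathTable;

// Returns a bitmap of the slots that survive both tests.  A result of zero
// corresponds to the NO_MATCH_FOUND fast-path exit.
static uint32_t CandidateSlots(const FastPathTable *Table,
                               const uint8_t Search[SLOTS],
                               uint8_t SearchLength)
{
    uint32_t Bitmap = 0;

    assert(SearchLength <= 127);

    for (int Slot = 0; Slot < SLOTS; Slot++) {

        // vpshufb: pick the search byte at this slot's unique index.
        uint8_t Picked = Search[Table->UniqueIndex[Slot] & 0x0f];

        // vpcmpeqb: does it equal the slot's unique character?
        int UniqueMatch = (Picked == Table->UniqueChars[Slot]);

        // vpcmpgtb: is the slot longer than the search string?
        int TooLong = (Table->Lengths[Slot] > SearchLength);

        // vptest: keep slots whose unique character matched and which
        // aren't too long.
        if (UniqueMatch && !TooLong) {
            Bitmap |= 1u << Slot;
        }
    }

    return Bitmap;
}
```

A surviving slot is only a candidate; it still has to go through the full slot comparison. What the fast path buys is that an empty bitmap proves no slot can possibly prefix-match, so the routine can return straight away.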

For the sake of completeness, here’s the proper diff and full version of assembly version 5:


% diff -u IsPrefixOfStringInTable_x64_4.asm IsPrefixOfStringInTable_x64_5.asm
--- IsPrefixOfStringInTable_x64_4.asm   2018-04-26 17:56:37.934374900 -0400
+++ IsPrefixOfStringInTable_x64_5.asm   2018-04-26 18:17:26.087861100 -0400
@@ -33,9 +33,14 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2, but leverages the fact that
-;   vptest sets the carry flag if '(xmm0 and (not xmm1))' evaluates
-;   to all 0s, avoiding the the need to do the pxor or pandn steps.
+;   This routine is identical to version 4, but has the initial negative match
+;   instructions re-ordered and tweaked in order to reduce the block throughput
+;   reported by IACA (from 3.74 to 3.48).
+;
+;   N.B. Although this does result in a measurable speedup, the clarity suffers
+;        somewhat due to the fact that instructions that were previously paired
+;        together are now spread out (e.g. moving the string buffer address into
+;        rax and then loading that into xmm0 three instructions later).
 ;
 ; Arguments:
 ;
@@ -54,32 +59,21 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_4, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_5, _TEXT$00

 ;
-; Load the string buffer into xmm0, and the unique indexes from the string table
-; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result into xmm5.
+; Load the address of the string buffer into rax.
 ;

         ;IACA_VC_START

-        mov     rax, String.Buffer[rdx]
-        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
-        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm5, xmm0, xmm1
-
-;
-; Load the string table's unique character array into xmm2.
-
-        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
+        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

 ;
-; Compare the search string's unique character array (xmm5) against the string
-; table's unique chars (xmm2), saving the result back into xmm5.
+; Broadcast the byte-sized string length into xmm4.
 ;

-        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
 ; Load the lengths of each string table slot into xmm3.
@@ -88,26 +82,38 @@
         vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Load the search string buffer into xmm0.
 ;

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

 ;
 ; Compare the search string's length, which we've broadcasted to all 8-byte
 ; elements of the xmm4 register, to the lengths of the slots in the string
-; table, to find those that are greater in length.  Invert the result, such
-; that we're left with a masked register where each 0xff element indicates
-; a slot with a length less than or equal to our search string's length.
+; table, to find those that are greater in length.
 ;

         vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

 ;
+; Shuffle the buffer in xmm0 according to the unique indexes, and store the
+; result into xmm5.
+;
+
+        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.
+
+;
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (xmm2), saving the result back into xmm5.
+;
+
+        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.
+
+;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
-; the inverted length match mask xmm register (xmm1).  This will set the carry
-; flag (CY = 1) if the result of 'xmm5 and (not xmm1)' is all 0s, which allows
-; us to do a fast-path exit for the no-match case.
+; the length match mask xmm register (xmm1).  This affects flags, allowing us
+; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
+; been inverted).
 ;

         vptest      xmm1, xmm5                  ; Check for no match.
@@ -472,7 +478,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_4, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_5, _TEXT$00

 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is identical to version 4, but has the initial negative match
;   instructions re-ordered and tweaked in order to reduce the block throughput
;   reported by IACA (from 3.74 to 3.48).
;
;   N.B. Although this does result in a measurable speedup, the clarity suffers
;        somewhat due to the fact that instructions that were previously paired
;        together are now spread out (e.g. moving the string buffer address into
;        rax and then loading that into xmm0 three instructions later).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_5, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (sourced directly from memory), saving the result back
; into xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; the equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (it gets stashed in xmm5d[2] at the
; top of the comparison loop).
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap initially, then slot length.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        vpinsrd     xmm5, xmm5, edx, 2          ; Store bitmap, free up rdx.
        xor         edx, edx                    ; Clear edx.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm1, xmm0            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Load the slot length into rdx.  As xmm3 already has all the slot lengths in
; it, we can load rax (the current index) into xmm1 and use it to extract the
; slot length via shuffle.  (The length will be in the lowest byte of xmm1
; after the shuffle, which we can then vpextrb.)
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle lengths.
        vpextrb     rdx, xmm1, 0                ; Extract target length to rdx.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  If the slot length is greater than 16, we need
; to do an inline memory comparison of the remaining bytes.  If it's 16 exactly,
; then great, that's a slot match, we're done.
;

@@:     cmp         dl, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the slot; if equal, this is a match, if not, no match, continue.
;

Pfx30:  cmp         r8b, dl                     ; Compare against slot length.
        jne         @F                          ; No match found.
        jmp         short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

@@:     vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rdx at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rdx, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rdx, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rdx                ; Free up rdx, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        xor         eax, eax                ; Clear eax.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_5, _TEXT$00

Did it make a difference? Were we able to shave any time off the negative match fast path? Let’s find out.

Benchmark x64 5

Hurrah! We’ve got a new winner! Our final tweaks yielded a very small but measurable and repeatable improvement in both prefix matching and negative matching! Let’s mark that up as a win.

Reviewing x64 v3…

Alright, we need to get some closure on why IsPrefixOfStringInTable_x64_3 was so bad in comparison to IsPrefixOfStringInTable_x64_2. Let’s review the performance chart again quickly:

What immediately stands out to me with those results is how everything seems to be impacted; it’s not just the prefix matching performance that’s bad, it’s the negative match performance as well. This is odd, as we didn’t really change anything in the negative match logic.

Except for that pesky prologue we added to stash the values of rsi, rdi, and the flags register. Hmmm! That seems like as good a place as any to start investigating. Let’s whip up another version that defers the prologue until after the initial negative match logic. This exploits a little detail regarding prologues: they need to appear within the first 255 bytes of the function’s byte code, but they don’t necessarily need to appear at the very start. As long as the prologue definition for the register is the first time the register is mutated, you’ve got a bit of room to play with regarding where to actually put it.
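
For readers more at home in C, here is a loose analogy of the same idea: do the cheap rejection test first, and only reach the code that needs extra setup once a match is still possible. The function names and the 0/-1 return convention are hypothetical and not part of the StringTable code. Whether a C compiler actually defers its register saves past the early return is entirely up to the compiler, so treat this as a sketch of the shape rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical slow path.  Stands in for the part of the routine that needs
 * extra saved state (the rsi/rdi/flags saves in the assembly version).
 */
static int slow_path_compare(const char *entry, const char *s, size_t len)
{
    size_t entry_len = strlen(entry);

    if (entry_len > len) {
        return -1;              /* Table entry is longer than the input. */
    }
    return memcmp(entry, s, entry_len) == 0 ? 0 : -1;
}

/*
 * Hypothetical entry point.  The cheap rejection test runs first and returns
 * straight away, so negative matches never reach the slow path.
 * Returns 0 if 'entry' is a prefix of 's', -1 otherwise.
 */
int is_prefix_with_fast_reject(const char *entry, const char *s, size_t len)
{
    assert(entry != NULL && s != NULL);

    if (len == 0 || s[0] != entry[0]) {
        return -1;              /* Negative match: early exit. */
    }
    return slow_path_compare(entry, s, len);
}
```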

So, here’s version 7 of the routine, based off version 3, that simply relocates the prologue code to appear after the initial negative match logic:

Diff (7 vs 3), followed by the full listing of version 7:

 % diff -u IsPrefixOfStringInTable_x64_3.asm IsPrefixOfStringInTable_x64_7.asm
--- IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:13:23.879193700 -0400
+++ IsPrefixOfStringInTable_x64_7.asm   2018-04-29 19:33:06.374193900 -0400
@@ -58,13 +58,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2.  It has been converted into a nested
-;   entry (version 2 is a leaf entry), and uses 'rep cmpsb' to do the string
-;   comparison for long strings (instead of the byte-by-byte comparison used
-;   in version 2).  This requires use of the rsi and rdi registers, and the
-;   direction flag.  These are all non-volatile registers and thus, must be
-;   saved to the stack in the function prologue (hence the need to make this
-;   a nested entry).
+;   This routine is based off version 3, but relocates the prologue code to
+;   after the initial negative match logic (jump target Pfx10).
 ;
 ; Arguments:
 ;
@@ -83,19 +78,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
-
-;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-        push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00

 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
@@ -165,11 +148,23 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

         ;IACA_VC_END

 ;
+; Begin prologue.  Allocate stack space and save non-volatile registers.
+;
+
+Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.
+
+        push_eflags                                 ; Save flags.
+        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
+        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
+
+        END_PROLOGUE
+
+;
 ; (There was at least one match, continue with processing.)
 ;

@@ -187,7 +182,7 @@
 ;   r11 - String length (String->Length)
 ;

-Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
+        vpextrb     r11, xmm4, 0                ; Load length.
         mov         rax, 16                     ; Load 16 into rax.
         mov         r10, r11                    ; Copy into r10.
         cmp         r10w, ax                    ; Compare against 16.
@@ -512,7 +507,7 @@

         ret

-        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 3, but relocates the prologue code to
;   after the initial negative match logic (jump target Pfx10).
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00

;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

        vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] at Pfx50).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

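;
; N.B. 'test cl, 0' ANDs cl with zero, so ZF is always set and the jnz below
; is never taken.  'repe cmpsb' already leaves ZF clear on a mismatch, so a
; 'jne short Pfx60' placed directly after it would perform the intended check.
;

        test        cl, 0
        jnz         short Pfx60                 ; Not all bytes compared, jump.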

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
        popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Let’s see how that change impacts the runtime function entry with regards to the unwind code information. Here’s the entry for assembly version 3:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_x64_3
Debugger function entry 00000185`395f2a88 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_3 (void)

BeginAddress      = 00000000`00003dc0
EndAddress        = 00000000`00003fb0
UnwindInfoAddress = 00000000`00005508

Unwind info at 00007fff`f8425508, 10 bytes
  version 1, flags 0, prolog f, codes 6
  00: offs f, unwind op 4, op info 7    UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs a, unwind op 4, op info 6    UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 5, unwind op 2, op info 0    UWOP_ALLOC_SMALL.
  05: offs 4, unwind op 2, op info 3    UWOP_ALLOC_SMALL.

Compare that to the entry for the routine we just wrote, version 7, with the prologue appearing much later in the routine:

0:000> .fnent StringTable2!IsPrefixOfStringInTable_x64_7
Debugger function entry 00000185`395f2a88 for:
Exact matches:
    StringTable2!IsPrefixOfStringInTable_x64_7 (void)

BeginAddress      = 00000000`00004540
EndAddress        = 00000000`00004730
UnwindInfoAddress = 00000000`00005530

Unwind info at 00007fff`f8425530, 10 bytes
  version 1, flags 0, prolog 4c, codes 6
  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.

As you can see, the prolog value has changed to 0x4c, and the offsets for each entry have also changed accordingly. Let’s disassemble the function and see if we can correlate the addresses of our prologue instructions to the offsets indicated above:

0:000> uf StringTable2!IsPrefixOfStringInTable_x64_7
StringTable2!IsPrefixOfStringInTable_x64_7:
00007fff`f8424540 488b4208        mov     rax,qword ptr [rdx+8]
00007fff`f8424544 c5fa6f00        vmovdqu xmm0,xmmword ptr [rax]
00007fff`f8424548 c5f96f4910      vmovdqa xmm1,xmmword ptr [rcx+10h]
00007fff`f842454d c4e27900e9      vpshufb xmm5,xmm0,xmm1
00007fff`f8424552 c5f96f11        vmovdqa xmm2,xmmword ptr [rcx]
00007fff`f8424556 c5d174ea        vpcmpeqb xmm5,xmm5,xmm2
00007fff`f842455a c5f96f5920      vmovdqa xmm3,xmmword ptr [rcx+20h]
00007fff`f842455f c4e26929d2      vpcmpeqq xmm2,xmm2,xmm2
00007fff`f8424564 c4e2797822      vpbroadcastb xmm4,byte ptr [rdx]
00007fff`f8424569 c5e164cc        vpcmpgtb xmm1,xmm3,xmm4
00007fff`f842456d c5f1efca        vpxor   xmm1,xmm1,xmm2
00007fff`f8424571 c4e27917e9      vptest  xmm5,xmm1
00007fff`f8424576 7505            jne     StringTable2!IsPrefixOfStringInTable_x64_7+0x3d (00007fff`f842457d)

StringTable2!IsPrefixOfStringInTable_x64_7+0x38:
00007fff`f8424578 33c0            xor     eax,eax
00007fff`f842457a f6d0            not     al
00007fff`f842457c c3              ret

StringTable2!IsPrefixOfStringInTable_x64_7+0x3d:
00007fff`f842457d 4883ec20        sub     rsp,20h
00007fff`f8424581 9c              pushfq
00007fff`f8424582 4889742410      mov     qword ptr [rsp+10h],rsi
00007fff`f8424587 48897c2408      mov     qword ptr [rsp+8],rdi
00007fff`f842458c c4c37914e300    vpextrb r11d,xmm4,0
00007fff`f8424592 48c7c010000000  mov     rax,10h

All of the addresses share 0x00007fff'f8424 as the first 13 digits, so we can ignore that part to simplify the values we’re working with. Let’s take a look at the first of our prologue instructions, sub rsp, 20h. This maps to our alloc_stack LOCALS_SIZE line:

Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

        push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE
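
As a quick sanity check on where that 20h comes from, here is a small C mirror of the Locals struct from the listing, with the LOCALS_SIZE arithmetic redone via offsetof. The C struct and the locals_size helper are my own illustration (the real definition is the MASM one); the only assumption is that each dq is an 8-byte field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C mirror of the MASM 'Locals' struct; every 'dq' is an 8-byte field. */
typedef struct Locals {
    uint64_t Padding;
    uint64_t SavedRdi;
    uint64_t SavedRsi;
    uint64_t SavedFlags;

    uint64_t ReturnAddress;
    uint64_t HomeRcx;
    uint64_t HomeRdx;
    uint64_t HomeR8;
    uint64_t HomeR9;
} Locals;

/*
 * Same arithmetic as the MASM line:
 *   LOCALS_SIZE equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
 * which boils down to the offset of ReturnAddress within the struct.
 */
size_t locals_size(void)
{
    ptrdiff_t total = (ptrdiff_t)sizeof(Locals);
    ptrdiff_t size = total + ((ptrdiff_t)offsetof(Locals, ReturnAddress) - total);

    assert(size >= 0);
    return (size_t)size;
}
```

With 8-byte fields, locals_size() comes out at 32 (20h), and the SavedRsi and SavedRdi offsets come out at 10h and 8, the same displacements used by the two mov instructions in the disassembly.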

The sub rsp, 20h line appears at byte offset 0x57d. If we subtract the address of the very first instruction, 0x540, from that, we get 61, or 0x3d in hex.

Hmmmm. That doesn’t map to any of the offsets that appear in version 7’s runtime function entry. Let’s try the address of the pushfq instruction, which is at offset 0x581. If we subtract the start address 0x540 from that, we’re left with 65, which in hex is, drum roll, 0x41! That matches the last line of the runtime function entry:

  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
      ^^^^^^^

That makes sense if we think about the purpose of the unwind entries. They are there for the kernel to compare against a faulting instruction’s address (i.e., the value contained in the RIP register at the time of the fault) to determine what needs to be unwound as part of exception handling. In this case, at byte offset 0x41, the sub rsp, 20h instruction will have already been executed, so the kernel knows it needs to unwind this (e.g., by doing what will effectively equate to add rsp, 20h) within the exception handling logic when it needs to unwind the entire frame and restore all of the non-volatile registers.
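
To make that concrete, here is a deliberately simplified C model of the decision: given the offset of a faulting instruction, apply only the unwind entries whose offset has already been reached. The entries are copied from version 7’s .fnent output above; the struct, the function name and the register bitmask are hypothetical. The one rule I’m leaning on from the x64 unwind format is that UWOP_ALLOC_SMALL encodes an allocation of (op info * 8) + 8 bytes. Real unwinding does a lot more than this (epilogue detection, chained info, and so on).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UWOP_ALLOC_SMALL = 2, UWOP_SAVE_NONVOL = 4 };

typedef struct {
    uint8_t code_offset;    /* Offset of the instruction after the operation. */
    uint8_t op;             /* Unwind operation code. */
    uint8_t op_info;        /* Register number or scaled allocation size. */
} UnwindEntry;

/* Version 7's entries, in the order the debugger printed them. */
static const UnwindEntry kVersion7Entries[] = {
    { 0x4c, UWOP_SAVE_NONVOL, 7 },      /* rdi */
    { 0x47, UWOP_SAVE_NONVOL, 6 },      /* rsi */
    { 0x42, UWOP_ALLOC_SMALL, 0 },      /* 8 bytes; lines up with pushfq */
    { 0x41, UWOP_ALLOC_SMALL, 3 },      /* 32 bytes; sub rsp, 20h */
};

/*
 * Returns how many bytes would need to be added back to rsp for a fault at
 * 'fault_offset' (relative to the function start), and sets a bit in
 * '*restore_mask' for each register number that would need reloading.
 */
unsigned unwind_stack_bytes(uint8_t fault_offset, unsigned *restore_mask)
{
    unsigned bytes = 0;
    size_t count = sizeof(kVersion7Entries) / sizeof(kVersion7Entries[0]);
    size_t i;

    assert(restore_mask != NULL);
    *restore_mask = 0;

    for (i = 0; i < count; i++) {
        const UnwindEntry *entry = &kVersion7Entries[i];

        if (entry->code_offset > fault_offset) {
            continue;       /* That prologue instruction hasn't run yet. */
        }
        if (entry->op == UWOP_ALLOC_SMALL) {
            bytes += (unsigned)entry->op_info * 8u + 8u;
        } else if (entry->op == UWOP_SAVE_NONVOL) {
            *restore_mask |= 1u << entry->op_info;
        }
    }
    return bytes;
}
```

A fault at offset 0x3d (before sub rsp, 20h has run) yields nothing to undo, a fault at 0x41 yields the 32 bytes from the paragraph above, and anything at or beyond 0x4c yields 40 bytes plus both rsi (6) and rdi (7).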

If we take a look at the first instruction after our last prologue instruction, vpextrb r11d, xmm4, 0, it resides at address offset 0x58c. Subtracting the start address 0x540 from that, we get 76, which is 0x4c in hex. That matches the offset of the unwind entry for our last prologue instruction (listed first, as the entries appear in descending offset order), as well as the prologue endpoint:

Unwind info at 00007fff`f8425530, 10 bytes
  version 1, flags 0, prolog 4c, codes 6
                      ^^^^^^^^^
  00: offs 4c, unwind op 4, op info 7   UWOP_SAVE_NONVOL FrameOffset: 8 reg: rdi.
      ^^^^^^^
  02: offs 47, unwind op 4, op info 6   UWOP_SAVE_NONVOL FrameOffset: 10 reg: rsi.
  04: offs 42, unwind op 2, op info 0   UWOP_ALLOC_SMALL.
  05: offs 41, unwind op 2, op info 3   UWOP_ALLOC_SMALL.
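
The same subtraction works for all four prologue instructions. Here is a small C sketch that redoes the arithmetic: the addresses (low three hex digits only) and instruction lengths are read straight off the disassembly above, and the struct and function names are just for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FUNCTION_START 0x540u   /* Low digits of the function's first address. */

typedef struct {
    uint32_t address;           /* Low digits of the instruction's address. */
    uint32_t length;            /* Encoded length in bytes. */
} PrologueInsn;

static const PrologueInsn kPrologue[] = {
    { 0x57du, 4u },             /* 4883ec20    sub rsp,20h */
    { 0x581u, 1u },             /* 9c          pushfq */
    { 0x582u, 5u },             /* 4889742410  mov qword ptr [rsp+10h],rsi */
    { 0x587u, 5u },             /* 48897c2408  mov qword ptr [rsp+8],rdi */
};

/*
 * The unwind offset for a prologue instruction is the offset of the *next*
 * instruction, i.e. (address + length) relative to the function start.
 */
uint32_t prologue_unwind_offset(size_t index)
{
    assert(index < sizeof(kPrologue) / sizeof(kPrologue[0]));
    return kPrologue[index].address + kPrologue[index].length - FUNCTION_START;
}
```

Indexes 0 through 3 give 0x41, 0x42, 0x47 and 0x4c respectively, which are the four offs values in the unwind info, read bottom to top.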

The prologue must occur within the first 255 bytes of the function simply because the prologue size and offsets are each stored in a single byte, so 255 is the maximum value that can be represented. When writing a NESTED_ENTRY with MASM, you need to have the END_PROLOGUE macro (which expands to .endprolog) occur within the first 255 bytes of your function.

If we move the END_PROLOGUE line in version 7 way down to the bottom of the routine and try and compile, MASM balks:

IsPrefixOfStringInTable_x64_7.asm(511): error A2247: size of prolog too big, must be < 256 bytes
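
The single-byte limit is easier to see when the unwind data is laid out as a struct. What follows is a rough C sketch of the header and code-slot layout as I understand it from the x64 exception handling documentation. The type and field names are mine, not the Windows SDK’s, and the bit-fields are flattened into plain bytes to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the fixed 4-byte unwind info header (names are illustrative). */
typedef struct {
    uint8_t version_and_flags;          /* Version: low 3 bits; flags: high 5. */
    uint8_t size_of_prolog;             /* One byte, hence the 255 ceiling. */
    uint8_t count_of_codes;             /* Number of 2-byte code slots. */
    uint8_t frame_register_and_offset;  /* 4 bits each. */
} UnwindInfoHeader;

/* Sketch of a single unwind code slot. */
typedef struct {
    uint8_t code_offset;                /* One byte, same ceiling as above. */
    uint8_t op_and_info;                /* Operation: low 4 bits; info: high 4. */
} UnwindCodeSlot;

/* Non-zero if a prologue of this many bytes can be described at all. */
int prolog_size_fits(size_t prolog_bytes)
{
    return prolog_bytes <= UINT8_MAX;
}

/* Total size of the header plus 'count_of_codes' slots. */
size_t unwind_info_bytes(uint8_t count_of_codes)
{
    assert(sizeof(UnwindInfoHeader) == 4 && sizeof(UnwindCodeSlot) == 2);
    return sizeof(UnwindInfoHeader) +
           (size_t)count_of_codes * sizeof(UnwindCodeSlot);
}
```

Version 7’s prologue size of 0x4c fits comfortably, whereas 256 does not. With 6 code slots, unwind_info_bytes(6) comes to 16, which would line up with the "10 bytes" in the .fnent output if that figure is printed in hex.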

Note

I have no idea why the spelling of prolog vs prologue and epilog vs epilogue is so inconsistent within the Microsoft tooling and docs.

Let’s get back on track. We need to review the performance of version 7 to see if relocating the prologue has any impact on the negative matching performance of the routine. If it does, this is a strong indicator that it’s at fault, especially if the prefix matching still shows the same performance issues. Here’s the comparison.

Benchmark x64 7

Hah! Look at that, the negative match performance is back on par with version 2. So, the blame now squarely points to something peculiar in the prologue inducing a huge (well, relatively huge) performance hit. But the prologue is so simple! It’s only pushing flags, and two registers!

IsPrefixOfStringInTable_x64_8

I know register pushing is cheap. Borderline free in the grand scheme of things. Flags though. Flags are an interesting one. The bane of the out-of-order CPU pipeline, they could very well be forcing a synchronization point within the code, preventing all the contemporary goodies you get when you let the CPU do its thing whenever it wants, rather than when you need. (Goodies like… Meltdown!)

Let’s test the theory. We’ll take version 7 and simply comment out the flag pushing and popping behavior.

Note

Technically we’re not allowed to do that: the direction flag is classed as non-volatile. If the calling function has it set to reverse and we return with it set to forward, things are going to be problematic for a caller that actually wanted it set to reverse. In practice, this isn’t that common. At least with our current stack, what with our aversion to even using a C runtime library, we know nothing in our benchmark environment is going to be faced with that predicament.

Diff (8 v 7), followed by the full listing:

% diff -u IsPrefixOfStringInTable_x64_7.asm IsPrefixOfStringInTable_x64_8.asm
--- IsPrefixOfStringInTable_x64_7.asm   2018-04-29 21:10:09.061479900 -0400
+++ IsPrefixOfStringInTable_x64_8.asm   2018-04-29 22:08:02.761164300 -0400
@@ -58,8 +58,9 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 3, but relocates the prologue code to
-;   after the initial negative match logic (jump target Pfx10).
+;   This routine is based off version 7, but comments-out the pushing and
+;   popping of flags to the stack in the prologue and epilogue, respectively,
+;   in order to test a theory regarding performance.
 ;
 ; Arguments:
 ;
@@ -78,8 +79,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_7, _TEXT$00
-
+        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -158,7 +158,7 @@

 Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

-        push_eflags                                 ; Save flags.
+       ;push_eflags                                 ; Save flags.
         save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
         save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

@@ -502,12 +502,12 @@

 Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
         mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-        popfq                                   ; Restore flags.
+       ;popfq                                   ; Restore flags.
         add     rsp, LOCALS_SIZE                ; Deallocate stack space.

         ret

-        NESTED_END   IsPrefixOfStringInTable_x64_7, _TEXT$00
+        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

;
; Define a locals struct for saving flags, rsi and rdi.
;

Locals struct

    Padding             dq      ?
    SavedRdi            dq      ?
    SavedRsi            dq      ?
    SavedFlags          dq      ?

    ReturnAddress       dq      ?
    HomeRcx             dq      ?
    HomeRdx             dq      ?
    HomeR8              dq      ?
    HomeR9              dq      ?

Locals ends

;
; Exclude the return address onward from the frame calculation size.
;

LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 7, but comments out the pushing and
;   popping of flags to the stack in the prologue and epilogue, respectively,
;   in order to test a theory regarding performance.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.

       ;push_eflags                                 ; Save flags.
        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.

        END_PROLOGUE

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

        vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in xmm2d[2].
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from search length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Set up rsi/rdi so we can do a 'rep cmps'.
;

        cld
        mov         rsi, r11
        mov         rdi, r8
        repe        cmpsb

        test        cl, 0
        jnz         short Pfx60                 ; Not all bytes compared, jump.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp Pfx90                               ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        align   16

Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
       ;popfq                                   ; Restore flags.
        add     rsp, LOCALS_SIZE                ; Deallocate stack space.

        ret

        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Let’s see how we perform when we cheat by not saving flags:

Crikey! Flags were clearly at fault! Not only that, but look at the performance of the routine in comparison with version 2 for prefix matching; there’s a definite improvement in performance! (I also looked up the latency of pushfq: 9 cycles! I had no idea it was that expensive.)

…wait wait wait. Shut the front door! This new assembly version is nearly as fast as the fastest C versions, and it doesn’t even have the optimized negative match re-work in place. Plot twist!

That means the slight tweak we made re-arranging the logic for IsPrefixOfStringInTable_x64_3 actually provided a tangible speed-up, but it was lost in the noise of pushfq slowing things down so much. Or perhaps IsPrefixOfStringInTable_x64_2 is just doing something particularly bad.

Either way, it means we might be able to wrangle an assembly version that can dominate the negative matching fast path and give the C version a run for its money with prefix matching, which would be a great way to end the article! Let’s give it a shot.
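
As a reminder of what that negative matching fast path computes, the opening instructions of the version 8 listing (vpshufb, vpcmpeqb, vpcmpgtb, vpxor, vptest) can be modelled in plain scalar C. This is a sketch for reading along, not the C implementation benchmarked elsewhere in the article: TABLE_MODEL and candidate_bitmap are hypothetical names, and the struct holds only the three arrays the fast path touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 16

/* Hypothetical, pared-down stand-in for the fields the fast path reads. */
typedef struct TABLE_MODEL {
    uint8_t UniqueChars[SLOTS];  /* One distinguishing character per slot.  */
    uint8_t UniqueIndex[SLOTS];  /* Byte offset of that character per slot. */
    int8_t  Lengths[SLOTS];      /* Length of each slot's string.           */
} TABLE_MODEL;

/*
 * Returns a 16-bit bitmap of slots that could still prefix-match.  Zero is
 * the negative match: no slot survived both tests.  'buffer' must have at
 * least 16 readable bytes, as the 16-byte load in the assembly assumes.
 */
static uint16_t candidate_bitmap(const TABLE_MODEL *table,
                                 const uint8_t *buffer,
                                 int8_t length)
{
    uint16_t bitmap = 0;

    assert(table != NULL && buffer != NULL);

    for (int i = 0; i < SLOTS; i++) {
        uint8_t index = table->UniqueIndex[i];

        /* vpshufb: a set high bit selects zero, else the low 4 bits index. */
        uint8_t shuffled = (index & 0x80) ? 0 : buffer[index & 0x0f];

        /* vpcmpeqb: does the search string have this slot's unique char? */
        bool char_matches = (shuffled == table->UniqueChars[i]);

        /* vpcmpgtb (signed) then vpxor: keep slots that are not longer. */
        bool length_ok = !(table->Lengths[i] > length);

        /* vpand/vptest: a slot is a candidate only if both tests pass. */
        if (char_matches && length_ok) {
            bitmap |= (uint16_t)(1u << i);
        }
    }

    return bitmap;
}
```

A zero result corresponds to the early ret with al = -1; anything else is the bitmap that the main comparison loop goes on to walk.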

Note

This is the first point in the article where I’m not retroactively documenting what I’ve done—it’s all live! I have no idea if I’ll be able to produce a final assembly version that’s competitive with C in all aspects. Then again, I’m persistent and stubborn, so who knows.

We’ll do this in a couple of pieces. First, we’ll convert version 8 (which has version 3’s logic) into a LEAF_ENTRY and restore the byte-by-byte comparison logic instead of repe cmpsb, but keep everything else identical. This will be version 9. For version 10, we can tidy up version 9 a bit and replace some of the jumps to the epilogue area (Pfx90) with a simple ret where applicable.
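
For orientation, the byte-by-byte comparison being restored looks roughly like this in C. It is a sketch under stated assumptions rather than the routine itself: match_tail is a hypothetical name, the first 16 bytes are assumed to have matched already, and the search string is assumed to be at least as long as the table string (which the earlier length mask guarantees in the assembly).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compare the bytes beyond the first 16 of a table string against the search
 * string, one byte at a time.  Returns the total number of characters matched
 * (the table string's full length) on success, or 0 on the first mismatch.
 */
static size_t match_tail(const char *search,
                         const char *target,
                         size_t target_length)
{
    assert(target_length > 16);

    size_t remaining = target_length - 16;  /* Bytes left after the SIMD part. */
    const char *s = search + 16;            /* Advance both buffers 16 bytes.  */
    const char *t = target + 16;

    for (size_t i = 0; i < remaining; i++) {
        if (s[i] != t[i]) {
            return 0;                       /* Mismatch: not a prefix match.   */
        }
    }

    return target_length;                   /* 16 + remaining bytes matched.   */
}
```

Nothing here needs rsi, rdi or the direction flag, which is why dropping repe cmpsb lets the routine go back to being a LEAF_ENTRY with no prologue.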

From there, we’ll make version 11, which will combine version 10 and the optimized negative match logic we established in assembly version 5. After that, we can use versions 12 onward to try replicating the superior inner loop approach identified by Fabian that led to the C routine IsPrefixOfStringInTable_13. And to think we were almost going to publish this article without investigating the slowdown associated with version 3 of the assembly!
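
For comparison with whatever replaces it, the inner loop in the version 8 listing (label Pfx20) walks the candidate bitmap using a trailing-zero count plus a running shift. A portable C sketch of just that bookkeeping follows; the helper names are hypothetical, and the loops standing in for tzcnt and popcnt are there only so the example builds without intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt (hypothetical helper name). */
static unsigned count_trailing_zeros(uint32_t value)
{
    unsigned n = 0;
    while (n < 32 && !(value & 1u)) {
        value >>= 1;
        n++;
    }
    return n;
}

/* Portable stand-in for popcnt (hypothetical helper name). */
static unsigned count_bits(uint32_t value)
{
    unsigned n = 0;
    while (value) {
        value &= value - 1;   /* Clear the lowest set bit. */
        n++;
    }
    return n;
}

/*
 * Visit each set bit of a 16-bit candidate bitmap, lowest first, writing the
 * slot indexes to 'out'.  Returns the number of slots visited.
 */
static unsigned walk_bitmap(uint32_t bitmap, uint8_t out[16])
{
    assert(bitmap <= 0xffffu);

    unsigned count = count_bits(bitmap);   /* popcnt ecx, edx              */
    unsigned shift = 0;                    /* xor r9, r9                   */

    for (unsigned n = 0; n < count; n++) {
        unsigned tz = count_trailing_zeros(bitmap);  /* tzcnt r8d, edx     */
        unsigned index = tz + shift;       /* add rax, r9                  */

        bitmap >>= (tz + 1);               /* inc r8; shrx rdx, rdx, r8    */
        shift = index + 1;                 /* mov r9, rax; inc r9          */
        out[n] = (uint8_t)index;           /* Slot to compare this round.  */
    }

    return count;
}
```

Each iteration does a handful of dependent operations just to find the next slot, before any string comparison happens, so this bookkeeping is a natural place to look for savings in the inner loop.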

IsPrefixOfStringInTable_x64_9


As mentioned, let’s take the version 8 NESTED_ENTRY and convert it into a LEAF_ENTRY with the least amount of code churn possible. As version 8 is essentially version 3 with a relocated prologue and the push_eflags/popfq bits commented out, I’ll provide a diff against version 3 as well.

% diff -u IsPrefixOfStringInTable_x64_8.asm IsPrefixOfStringInTable_x64_9.asm
--- IsPrefixOfStringInTable_x64_8.asm   2018-04-29 22:08:02.761164300 -0400
+++ IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
@@ -18,31 +18,6 @@

 include StringTable.inc

-;
-; Define a locals struct for saving flags, rsi and rdi.
-;
-
-Locals struct
-
-    Padding             dq      ?
-    SavedRdi            dq      ?
-    SavedRsi            dq      ?
-    SavedFlags          dq      ?
-
-    ReturnAddress       dq      ?
-    HomeRcx             dq      ?
-    HomeRdx             dq      ?
-    HomeR8              dq      ?
-    HomeR9              dq      ?
-
-Locals ends
-
-;
-; Exclude the return address onward from the frame calculation size.
-;
-
-LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
-
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -58,9 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 7, but comments out the pushing and
-;   popping of flags to the stack in the prologue and epilogue, respectively,
-;   in order to test a theory regarding performance.
+;   This routine is based off version 8, but reverts the 'repe cmps' to the
+;   same byte-by-byte comparison loop we used in all the previous version.
+;   As this removes the dependency on rsi, rdi and the direction flag, we
+;   no longer need to push those values to the stack, so we also revert back
+;   to a LEAF_ENTRY.
 ;
 ; Arguments:
 ;
@@ -79,7 +56,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_8, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -153,18 +130,6 @@
         ;IACA_VC_END

 ;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-Pfx10:  alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-       ;push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
-
-;
 ; (There was at least one match, continue with processing.)
 ;

@@ -182,7 +147,7 @@
 ;   r11 - String length (String->Length)
 ;

-        vpextrb     r11, xmm4, 0                ; Load length.
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
         mov         rax, 16                     ; Load 16 into rax.
         mov         r10, r11                    ; Copy into r10.
         cmp         r10w, ax                    ; Compare against 16.
@@ -449,16 +414,21 @@

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Set up rsi/rdi so we can do a 'rep cmps'.
+; Do a byte-by-byte comparison.
 ;

-        cld
-        mov         rsi, r11
-        mov         rdi, r8
-        repe        cmpsb
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;

-        test        cl, 0
-        jnz         short Pfx60                 ; Not all bytes compared, jump.
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -485,7 +455,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp Pfx90                               ; Return.
+        ret                                     ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -498,16 +468,9 @@

         ;IACA_VC_END

-        align   16
-
-Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
-        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-       ;popfq                                   ; Restore flags.
-        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
-
-        ret
+Pfx90:  ret

-        NESTED_END   IsPrefixOfStringInTable_x64_8, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

% diff -u IsPrefixOfStringInTable_x64_3.asm IsPrefixOfStringInTable_x64_9.asm
--- IsPrefixOfStringInTable_x64_3.asm   2018-04-29 16:13:23.879193700 -0400
+++ IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
@@ -18,31 +18,6 @@

 include StringTable.inc

-;
-; Define a locals struct for saving flags, rsi and rdi.
-;
-
-Locals struct
-
-    Padding             dq      ?
-    SavedRdi            dq      ?
-    SavedRsi            dq      ?
-    SavedFlags          dq      ?
-
-    ReturnAddress       dq      ?
-    HomeRcx             dq      ?
-    HomeRdx             dq      ?
-    HomeR8              dq      ?
-    HomeR9              dq      ?
-
-Locals ends
-
-;
-; Exclude the return address onward from the frame calculation size.
-;
-
-LOCALS_SIZE  equ ((sizeof Locals) + (Locals.ReturnAddress - (sizeof Locals)))
-
 ;++
 ;
 ; STRING_TABLE_INDEX
@@ -58,13 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 2.  It has been converted into a nested
-;   entry (version 2 is a leaf entry), and uses 'rep cmpsb' to do the string
-;   comparison for long strings (instead of the byte-by-byte comparison used
-;   in version 2).  This requires use of the rsi and rdi registers, and the
-;   direction flag.  These are all non-volatile registers and thus, must be
-;   saved to the stack in the function prologue (hence the need to make this
-;   a nested entry).
+;   This routine is based off version 8, but reverts the 'repe cmps' to the
+;   same byte-by-byte comparison loop we used in all the previous version.
+;   As this removes the dependency on rsi, rdi and the direction flag, we
+;   no longer need to push those values to the stack, so we also revert back
+;   to a LEAF_ENTRY.
 ;
 ; Arguments:
 ;
@@ -83,20 +56,7 @@
 ;
 ;--

-        NESTED_ENTRY IsPrefixOfStringInTable_x64_3, _TEXT$00
-
-;
-; Begin prologue.  Allocate stack space and save non-volatile registers.
-;
-
-        alloc_stack LOCALS_SIZE                     ; Allocate stack space.
-
-        push_eflags                                 ; Save flags.
-        save_reg    rsi, Locals.SavedRsi            ; Save non-volatile rsi.
-        save_reg    rdi, Locals.SavedRdi            ; Save non-volatile rdi.
-
-        END_PROLOGUE
-
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -165,7 +125,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

         ;IACA_VC_END

@@ -454,16 +414,21 @@

 ;
 ; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Set up rsi/rdi so we can do a 'rep cmps'.
+; Do a byte-by-byte comparison.
 ;

-        cld
-        mov         rsi, r11
-        mov         rdi, r8
-        repe        cmpsb
+        align 16
+@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
+        jne         short Pfx60                 ; If not equal, jump.
+
+;
+; The two bytes were equal, update rax, decrement rcx and potentially continue
+; the loop.
+;

-        test        cl, 0
-        jnz         short Pfx60                 ; Not all bytes compared, jump.
+        inc         ax                          ; Increment index.
+        loopnz      @B                          ; Decrement cx and loop back.

 ;
 ; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
@@ -490,7 +455,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp Pfx90                               ; Return.
+        ret                                     ; Return.

 ;
 ; More comparisons remain; restore the registers we clobbered and continue loop.
@@ -503,16 +468,9 @@

         ;IACA_VC_END

-        align   16
-
-Pfx90:  mov     rsi, Locals.SavedRsi[rsp]       ; Restore rsi.
-        mov     rdi, Locals.SavedRdi[rsp]       ; Restore rdi.
-        popfq                                   ; Restore flags.
-        add     rsp, LOCALS_SIZE                ; Deallocate stack space.
-
-        ret
+Pfx90:  ret

-        NESTED_END   IsPrefixOfStringInTable_x64_3, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
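
To make the change in the diff above concrete, here is a minimal C sketch of the byte-by-byte tail comparison that replaces `repe cmpsb`. It models what the surrounding comments describe rather than transcribing the assembly instruction for instruction. The function name `match_tail` and its parameters are hypothetical and do not appear in the article's source.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the article's source): compare the bytes
 * that follow the first 16, which the SIMD slot compare has already verified.
 * Assumes the caller has checked that the search string is at least slot_len
 * bytes long, as the length mask does in the assembly.
 * Returns slot_len (total characters matched) on success, 0 on mismatch.
 */
size_t match_tail(const char *search, const char *slot, size_t slot_len)
{
    assert(slot_len > 16);              /* Only long slots reach this path. */

    const char *a = search + 16;        /* Advance both buffers 16 bytes.   */
    const char *b = slot + 16;
    size_t remaining = slot_len - 16;   /* Loop counter, as set up at Pfx50. */

    for (size_t i = 0; i < remaining; i++) {
        if (a[i] != b[i]) {
            return 0;                   /* Mismatch: move on to next slot.  */
        }
    }
    return slot_len;                    /* Every remaining byte matched.    */
}
```

Because the loop only touches two pointers and a counter, it needs none of rsi, rdi or the direction flag, which is why the routine can go back to being a leaf function.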

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is based off version 8, but reverts the 'repe cmps' to the
;   same byte-by-byte comparison loop we used in all the previous versions.
;   As this removes the dependency on rsi, rdi and the direction flag, we
;   no longer need to push those values to the stack, so we also revert back
;   to a LEAF_ENTRY.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent to 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] when needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        jmp         Pfx90                       ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        jmp         Pfx90                       ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        jmp         Pfx90

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement cx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

Pfx90:  ret

        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
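
The opening of the routine above (from the first load through `vptest`) is easier to reason about as a scalar model. The C sketch below computes the same candidate bitmap one slot at a time: a slot survives only if its unique character matches the search string at the unique index and its length does not exceed the search string's length. The `table_model` struct, the helper names, and the `UNUSED_LENGTH` marker for empty slots are all assumptions made for this illustration; they are not the article's `STRING_TABLE` definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 16
#define UNUSED_LENGTH 0x7f  /* Assumption of this sketch: marks an empty slot. */

/* Hypothetical, simplified model of the fields the fast path reads. */
typedef struct {
    uint8_t unique_chars[SLOT_COUNT];  /* One distinguishing char per slot. */
    uint8_t unique_index[SLOT_COUNT];  /* Byte offset of that char.         */
    uint8_t lengths[SLOT_COUNT];       /* Length of each slot's string.     */
} table_model;

void table_init(table_model *t)
{
    memset(t, 0, sizeof(*t));
    memset(t->lengths, UNUSED_LENGTH, sizeof(t->lengths));
}

void table_set_slot(table_model *t, unsigned slot, char unique_char,
                    uint8_t unique_index, uint8_t length)
{
    assert(slot < SLOT_COUNT && unique_index < 16 && length < UNUSED_LENGTH);
    t->unique_chars[slot] = (uint8_t)unique_char;
    t->unique_index[slot] = unique_index;
    t->lengths[slot] = length;
}

/*
 * Returns a bitmap with bit i set when slot i still needs a full compare.
 * Zero corresponds to the fast "no match" exit taken after vptest.
 */
uint16_t candidate_bitmap(const table_model *t, const char *s, size_t len)
{
    /* For lengths below 128 the signed vpcmpgtb agrees with '<=' below. */
    assert(len < UNUSED_LENGTH);

    /* Model of the 16-byte load; unlike the assembly, this zero-pads. */
    uint8_t buf[SLOT_COUNT] = {0};
    memcpy(buf, s, len < SLOT_COUNT ? len : SLOT_COUNT);

    uint16_t bitmap = 0;
    for (unsigned i = 0; i < SLOT_COUNT; i++) {
        /* vpshufb + vpcmpeqb: test the byte at the slot's unique index. */
        int char_ok = buf[t->unique_index[i] & 0x0f] == t->unique_chars[i];
        /* vpcmpgtb + vpxor: keep slots no longer than the search string. */
        int len_ok = t->lengths[i] <= len;
        if (char_ok && len_ok) {
            bitmap |= (uint16_t)(1u << i);
        }
    }
    return bitmap;
}
```

The assembly evaluates all 16 slots at once and never materializes the bitmap on the no-match path; the loop here only exists to show which two conditions are being intersected.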

Let’s go straight into version 10, as it’s only a very minor tweak to version 9 above.

IsPrefixOfStringInTable_x64_10

← IsPrefixOfStringInTable_x64_9 | IsPrefixOfStringInTable_x64_11 →

Version 10 removes the final remnants of the NESTED_ENTRY approach: each jump to the exit label Pfx90 is replaced with a direct ret, and the now-unused label is dropped.

Diff (10 v 9)
Full

% diff -u IsPrefixOfStringInTable_x64_9.asm IsPrefixOfStringInTable_x64_10.asm
--- IsPrefixOfStringInTable_x64_9.asm   2018-04-30 20:14:58.067237400 -0400
+++ IsPrefixOfStringInTable_x64_10.asm  2018-05-02 08:16:39.672110400 -0400
@@ -33,11 +33,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 8, but reverts the 'repe cmps' to the
-;   same byte-by-byte comparison loop we used in all the previous versions.
-;   As this removes the dependency on rsi, rdi and the direction flag, we
-;   no longer need to push those values to the stack, so we also revert back
-;   to a LEAF_ENTRY.
+;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
+;   been replaced by normal 'ret' lines.
 ;
 ; Arguments:
 ;
@@ -56,7 +53,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_9, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
 ;
 ; Load the string buffer into xmm0, and the unique indexes from the string table
 ; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
@@ -310,7 +307,7 @@

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
-        jmp         Pfx90                       ; Return.
+        ret                                     ; Return.

 ;
 ; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
@@ -334,7 +331,7 @@
 ;

         vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        jmp         Pfx90                       ; StringMatch == NULL, finish.
+        ret                                     ; StringMatch == NULL, finish.

 ;
 ; StringMatch is not NULL.  Fill out characters matched (currently rax), then
@@ -360,7 +357,7 @@
         mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
         shr         eax, 4                  ; Revert the scaling.

-        jmp         Pfx90
+        ret

 ;
 ; 16 characters matched and the length of the underlying slot is greater than
@@ -468,9 +465,7 @@

         ;IACA_VC_END

-Pfx90:  ret
-
-        LEAF_END   IsPrefixOfStringInTable_x64_9, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
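
The diff above leaves the main comparison loop at Pfx20 untouched. Its index bookkeeping (count trailing zeros, add the running shift, reposition the bitmap) can be sketched in portable C as follows. The function names are hypothetical, and the sketch loops while bits remain, whereas the assembly counts down a `popcnt`-derived counter in rcx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable count-trailing-zeros; stands in for the tzcnt instruction. */
unsigned count_trailing_zeros(uint32_t v)
{
    assert(v != 0);
    unsigned n = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/*
 * Hypothetical model of the Pfx20 bookkeeping: write the index of every set
 * bit in 'bitmap' to 'out', lowest first, and return how many were written.
 */
size_t slot_indexes(uint16_t bitmap, uint8_t out[16])
{
    uint32_t bits = bitmap;
    unsigned shift = 0;                 /* r9: bits consumed so far.        */
    size_t n = 0;

    while (bits != 0) {
        unsigned tz = count_trailing_zeros(bits);
        unsigned index = tz + shift;    /* Slot index in the original map.  */
        bits >>= tz + 1;                /* shrx: drop the bit just handled. */
        shift = index + 1;              /* Shift = Index + 1.               */
        out[n++] = (uint8_t)index;
    }
    return n;
}
```

For example, a bitmap of 0x00A4 (bits 2, 5 and 7 set) yields the slot indexes 2, 5 and 7 in that order, which is the order in which the assembly would load and compare the slots.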

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether any string in the table "starts with
;   or is equal to" the search string.
;
;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
;   been replaced by normal 'ret' lines.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
;
; Load the string buffer into xmm0, and the unique indexes from the string table
; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
; result into xmm5.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]
        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
        vpshufb xmm5, xmm0, xmm1

;
; Load the string table's unique character array into xmm2.
;

        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (xmm2), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.

;
; Load the lengths of each string table slot into xmm3.
;
        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.

;
; Set xmm2 to all ones.  We use this later to invert the length comparison.
;

        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.  Invert the result, such
; that we're left with a masked register where each 0xff element indicates
; a slot with a length less than or equal to our search string's length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
        vpxor       xmm1, xmm1, xmm2            ; Invert the result.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where ZF = 1).
;

        vptest      xmm5, xmm1                  ; Check for no match.
        jnz         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent to 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] when needed).
;

        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax.
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy number of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement rcx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Let’s review performance. I’ll omit the C versions from the graphs for now while we focus on optimizing the assembly versions. In this next comparison, we want to verify that versions 9 and 10 retain the performance gains we saw in version 8. If the timings for versions 9 and 10 differ, I’d expect version 10 to be better, but not by much.

Note

I had to generate new CSV files for these graphs, as the old ones didn’t have any timings for these new functions we’ve added. It’s easier to just regenerate timings for everything versus trying to splice in the new timings into the old files.

So, there will be small differences in the numbers you see here for old routines referenced earlier (i.e., the timings for assembly versions 2, 4, 5, and 8 aren’t identical to earlier graphs). The differences are negligible (a handful of cycles per 1000 iterations).

All the source files live here on GitHub.

Benchmark x64 10

Excellent! Version 10 is a tiny bit faster than 9, but both retain the speed advantages we saw from version 8. We can also see how expensive the setup cost is for repe cmpsb, which version 8 used. It’s not necessarily a fair comparison, as only one byte is being compared ($INDEX_ALLOCATION is 17 bytes long, so only the final byte, past the first 16, is left for that comparison), and there’s a fixed overhead with the repe cmp/stos/lods-type instructions that can’t be avoided. (They can prove optimal for longer sequences, though.)

IsPrefixOfStringInTable_x64_11

← IsPrefixOfStringInTable_x64_10 | IsPrefixOfStringInTable_x64_12 →

Let’s take version 10 and blend in the optimal negative match instruction ordering we used for version 5. (Version 10 is essentially derived from version 3, and we wrote that before we’d come up with the optimizations explored in versions 4 and 5.)
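As a rough illustration of why the reordering removes an instruction, here is a scalar sketch in C of the negative-match test, with one array element standing in for each byte lane. All names here (`no_match_possible`, `example`, the two-string table, and the 0x7f sentinel length for unused slots) are assumptions made for this sketch, not taken from the StringTable sources. Version 10 inverts the length mask with vpxor and then tests for overlap; version 11 leaves the mask alone and relies on the carry flag from vptest, which is set when (NOT xmm1) AND xmm5 is zero in every lane.

```c
#include <assert.h>
#include <string.h>

#define LANES 16

/*
 * Scalar model of the fast path.  Returns 1 when no slot can possibly
 * prefix-match, mirroring CF = 1 after 'vptest xmm1, xmm5'.
 */
static int no_match_possible(const unsigned char search[LANES],
                             unsigned char search_length,
                             const unsigned char unique_index[LANES],
                             const unsigned char unique_chars[LANES],
                             const unsigned char slot_lengths[LANES])
{
    int lane;

    /* vpcmpgtb compares signed bytes, so this sketch assumes short lengths. */
    assert(search_length < 128);

    for (lane = 0; lane < LANES; lane++) {
        /* vpcmpgtb: the slot is longer than the search string. */
        int long_slot = slot_lengths[lane] > search_length;

        /* vpshufb + vpcmpeqb: the unique character matches at its offset. */
        int unique_match =
            search[unique_index[lane] & (LANES - 1)] == unique_chars[lane];

        /* vptest tests (NOT xmm1) AND xmm5; any surviving lane clears CF. */
        if (!long_slot && unique_match) {
            return 0;
        }
    }
    return 1;
}

/*
 * Hypothetical table holding "foo" and "barbaz", both identified by their
 * first character.  Unused slots get a sentinel length of 0x7f (an
 * assumption of this sketch) so that they always look too long.
 */
static int example(const char *text)
{
    static const unsigned char unique_index[LANES] = { 0 };
    static const unsigned char unique_chars[LANES] = { 'f', 'b' };
    unsigned char slot_lengths[LANES];
    unsigned char search[LANES] = { 0 };
    size_t length = strlen(text);

    memset(slot_lengths, 0x7f, sizeof(slot_lengths));
    slot_lengths[0] = 3;    /* "foo"    */
    slot_lengths[1] = 6;    /* "barbaz" */

    memcpy(search, text, length < LANES ? length : LANES);
    return no_match_possible(search, (unsigned char)length, unique_index,
                             unique_chars, slot_lengths);
}
```

With this hypothetical table, `example("foobar")` returns 0 because slot 0 remains a candidate, while `example("fo")` and `example("quux")` return 1, which corresponds to the assembly taking the early ret.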

Diff (11 v 10)
Full

% diff -u IsPrefixOfStringInTable_x64_10.asm IsPrefixOfStringInTable_x64_11.asm
--- IsPrefixOfStringInTable_x64_10.asm  2018-05-02 08:16:39.672110400 -0400
+++ IsPrefixOfStringInTable_x64_11.asm  2018-05-03 17:21:18.181161900 -0400
@@ -33,8 +33,8 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is identical to version 9, except the 'jmp Pfx90' lines have
-;   been replaced by normal 'ret' lines.
+;   This routine is based off version 10, with the optimized negative prefix
+;   matching logic in place from version 5.
 ;
 ; Arguments:
 ;
@@ -53,68 +53,65 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_10, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00
+
 ;
-; Load the string buffer into xmm0, and the unique indexes from the string table
-; into xmm1.  Shuffle the buffer according to the unique indexes, and store the
-; result into xmm5.
+; Load the address of the string buffer into rax.
 ;

         ;IACA_VC_START

-        mov     rax, String.Buffer[rdx]
-        vmovdqu xmm0, xmmword ptr [rax]                 ; Load search buffer.
-        vmovdqa xmm1, xmmword ptr StringTable.UniqueIndex[rcx] ; Load indexes.
-        vpshufb xmm5, xmm0, xmm1
+        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

 ;
-; Load the string table's unique character array into xmm2.
+; Broadcast the byte-sized string length into xmm4.
+;

-        vmovdqa xmm2, xmmword ptr StringTable.UniqueChars[rcx]  ; Load chars.
+        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

 ;
-; Compare the search string's unique character array (xmm5) against the string
-; table's unique chars (xmm2), saving the result back into xmm5.
+; Load the lengths of each string table slot into xmm3.
 ;

-        vpcmpeqb    xmm5, xmm5, xmm2            ; Compare unique chars.
+        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

 ;
-; Load the lengths of each string table slot into xmm3.
+; Load the search string buffer into xmm0.
 ;
-        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]      ; Load lengths.
+
+        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

 ;
-; Set xmm2 to all ones.  We use this later to invert the length comparison.
+; Compare the search string's length, which we've broadcasted to all 8-bit
+; elements of the xmm4 register, to the lengths of the slots in the string
+; table, to find those that are greater in length.
 ;

-        vpcmpeqq    xmm2, xmm2, xmm2            ; Set xmm2 to all ones.
+        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

 ;
-; Broadcast the byte-sized string length into xmm4.
+; Shuffle the buffer in xmm0 according to the unique indexes, and store the
+; result into xmm5.
 ;

-        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.
+        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

 ;
-; Compare the search string's length, which we've broadcasted to all 8-byte
-; elements of the xmm4 register, to the lengths of the slots in the string
-; table, to find those that are greater in length.  Invert the result, such
-; that we're left with a masked register where each 0xff element indicates
-; a slot with a length less than or equal to our search string's length.
+; Compare the search string's unique character array (xmm5) against the string
+; table's unique chars (in memory), saving the result back into xmm5.
 ;

-        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.
-        vpxor       xmm1, xmm1, xmm2            ; Invert the result.
+        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

 ;
 ; Intersect-and-test the unique character match xmm mask register (xmm5) with
 ; the length match mask xmm register (xmm1).  This affects flags, allowing us
-; to do a fast-path exit for the no-match case (where ZF = 1).
+; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
+; been inverted).
 ;

-        vptest      xmm5, xmm1                  ; Check for no match.
-        jnz         short Pfx10                 ; There was a match.
+        vptest      xmm1, xmm5                  ; Check for no match.
+        jnc         short Pfx10                 ; There was a match.

 ;
 ; No match, set rax to -1 and return.
@@ -161,12 +158,12 @@
         vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

 ;
-; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm5, xmm1'),
+; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
 ; yielding a mask identifying indices we need to perform subsequent matches
 ; upon.  Convert this into a bitmap and save in xmm2d[2].
 ;

-        vpand       xmm5, xmm5, xmm1            ; Intersect unique + lengths.
+        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
         vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

 ;
@@ -465,7 +462,7 @@

         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_10, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether the search string "starts with or is
;   equal to" any string in the table.
;
;   This routine is based off version 10, with the optimized negative prefix
;   matching logic in place from version 5.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a variable that contains the
;       address of a STRING_MATCH structure.  This will be populated with
;       additional details about the match if a non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcasted to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (in memory), saving the result back into xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent of 'min(String->Length, 16)'.  (The search string's length
; currently lives in xmm4, albeit as a byte-value broadcasted across the
; entire register, so extract that first.)
;
; Once the search length is calculated, deposit it back at the second byte
; location of xmm4.
;
;   r10 and xmm4[15:8] - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
        mov         rax, 16                     ; Load 16 into rax.
        mov         r10, r11                    ; Copy into r10.
        cmp         r10w, ax                    ; Compare against 16.
        cmova       r10w, ax                    ; Use 16 if length is greater.
        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].

;
; Home our parameter registers into xmm registers instead of their stack-backed
; location, to avoid memory writes.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap in edx (stashed in xmm5d[2] when needed).
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
; xmm2:
;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
;
; xmm4:
;       0:7     (vpinsrb 0)     length of search string
;       8:15    (vpinsrb 1)     min(String->Length, 16)
;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
;      24:31    (vpinsrb 3)     shift count
;
; xmm5:
;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
;      64:95    (vpinsrd 2)     bitmap of slots to compare
;      96:127   (vpinsrd 3)     index of slot currently being processed
;

;
; Initialize rcx as our counter register by doing a popcnt against the bitmap
; we just generated in edx, and clear our shift count register (r9).
;

        popcnt      ecx, edx                    ; Count bits in bitmap.
        xor         r9, r9                      ; Clear r9.

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, and then add in the shift count, producing an
; index (rax) we can use to load the corresponding slot.
;
; Register usage at top of loop:
;
;   rax - Index.
;
;   rcx - Loop counter.
;
;   rdx - Bitmap.
;
;   r9 - Shift count.
;
;   r10 - Search length.
;
;   r11 - String length.
;

Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
        mov         eax, r8d                    ; Copy tzcnt to rax,
        add         rax, r9                     ; Add shift to create index.
        inc         r8                          ; tzcnt + 1
        shrx        rdx, rdx, r8                ; Reposition bitmap.
        mov         r9, rax                     ; Copy index back to shift.
        inc         r9                          ; Shift = Index + 1
        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1, then shift rax back.
;

        shl         eax, 4
        vpextrq     r8, xmm2, 0
        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
        shr         eax, 4

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; If 16 characters matched, and the search string's length is longer than 16,
; we're going to need to do a comparison of the remaining strings.
;

        cmp         r8w, 16                     ; Compare chars matched to 16.
        je          short @F                    ; 16 chars matched.
        jmp         Pfx30                       ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register, then check to see if it's greater than,
; equal or less than 16.
;

@@:     movd        xmm1, rax                   ; Load into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
        vpextrb     rax, xmm1, 0                ; And extract back into rax.
        cmp         al, 16                      ; Compare length to 16.
        ja          Pfx50                       ; Length is > 16.
        je          short Pfx35                 ; Lengths match!
                                                ; Length < 16, fall through...

;
; Less than or equal to 16 characters were matched.  Compare this against the
; length of the search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
        je          short Pfx35                 ; Match found!

;
; No match against this slot, decrement counter and either continue the loop
; or terminate the search and return no match.
;

        dec         cx                          ; Decrement counter.
        jnz         Pfx20                       ; cx != 0, continue.

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
; former is used when we need to copy the number of characters matched from r8
; back to rax.  The latter jump target doesn't require this.
;

Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.

;
; Load the match parameter back into r8 and test to see if it's not-NULL, in
; which case we need to fill out a STRING_MATCH structure for the match.
;

Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
        test        r8, r8                      ; Is NULL?
        jnz         short @F                    ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
;

        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        ret                                     ; StringMatch == NULL, finish.

;
; StringMatch is not NULL.  Fill out characters matched (currently rax), then
; reload the index from xmm5 into rax and save.
;

@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
        mov         byte ptr StringMatch.Index[r8], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable, so we need to load that parameter
; back into rcx, then resolving the string array address via pStringArray,
; then the relevant STRING offset within the StringArray.Strings structure.
;

        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.

        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
        shr         eax, 4                  ; Revert the scaling.

        ret

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; The slot length is stored in rax at this point, and the search string's
; length is stored in r11.  We know that the search string's length will
; always be longer than or equal to the slot length at this point, so, we
; can subtract 16 (currently stored in r10) from rax, and use the resulting
; value as a loop counter, comparing the search string with the underlying
; string slot byte-by-byte to determine if there's a match.
;

Pfx50:  sub         rax, r10                ; Subtract 16 from slot length.

;
; Free up some registers by stashing their values into various xmm offsets.
;

        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
        mov         rcx, rax                ; Free up rax, rcx is now counter.

;
; Load the search string buffer and advance it 16 bytes.
;

        vpextrq     r11, xmm2, 1            ; Extract String into r11.
        mov         r11, String.Buffer[r11] ; Load buffer address.
        add         r11, r10                ; Advance buffer 16 bytes.

;
; Loading the slot is more involved as we have to go to the string table, then
; the pStringArray pointer, then the relevant STRING offset within the string
; array (which requires re-loading the index from xmm5d[3]), then the string
; buffer from that structure.
;

        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
        mov         r8, StringTable.pStringArray[r8] ; Load string array.

        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
        shl         eax, 4                  ; Scale the index; sizeof STRING=16.

        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
        add         r8, r10                 ; Advance buffer 16 bytes.

        mov         rax, rcx                ; Copy counter.

;
; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
; Do a byte-by-byte comparison.
;

        align 16
@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
        jne         short Pfx60                 ; If not equal, jump.

;
; The two bytes were equal, update rax, decrement rcx and potentially continue
; the loop.
;

        inc         ax                          ; Increment index.
        loopnz      @B                          ; Decrement rcx and loop back.

;
; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
; how many characters we matched, and then jump to Pfx40 for finalization.
;

        add         rax, r10
        jmp         Pfx40

;
; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
; it.  If it's zero, we have no more strings to compare, so we can do a quick
; exit.  If there are still comparisons to be made, restore the other registers
; we trampled then jump back to the start of the loop Pfx20.
;

Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
        dec         cx                          ; Decrement counter.
        jnz         short @F                    ; Jump forward if not zero.

;
; No more comparisons remaining, return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; More comparisons remain; restore the registers we clobbered and continue loop.
;

@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
        vpextrb     r11, xmm4, 0                ; Restore r11.
        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
        jmp         Pfx20                       ; Continue comparisons.

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

Notice the similarity between the diff above and the one for IsPrefixOfStringInTable_x64_5. Let’s see how the performance compares. The negative match performance for version 11 should be on par with version 5.

We have a new winner! Version 11 is now the fastest assembly version across the board for both prefix and negative matching. Before we start our final pass on version 12, let’s take a quick look at how we currently compare against the fastest C version:

It’s already very close! We just need to shave off a few more cycles on the assembly version to take the crown.

IsPrefixOfStringInTable_x64_12

← IsPrefixOfStringInTable_x64_11

Let’s start by updating the main loop logic so that it matches IsPrefixOfStringInTable_13. We’ll drop the bitmap shifting and loop counter in favor of the blsr approach.
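A minimal sketch in C of the two ways of walking the candidate bitmap may help here. The helper names are hypothetical, and `count_trailing_zeros` and `count_set_bits` are portable stand-ins for tzcnt and popcnt; none of this is the article's actual C code. The first loop mirrors the shift-and-count logic used through version 11. The second clears the lowest set bit with `bitmap & (bitmap - 1)`, which is what blsr computes, and needs neither a loop counter nor a running shift.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt; the caller must pass a non-zero value. */
static unsigned count_trailing_zeros(uint32_t value)
{
    unsigned count = 0;

    assert(value != 0);
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for popcnt. */
static unsigned count_set_bits(uint32_t value)
{
    unsigned count = 0;

    while (value != 0) {
        value &= value - 1;
        count++;
    }
    return count;
}

/*
 * Shift-and-count walk, as in the versions up to 11.  Assumes only the low
 * 16 bits of the bitmap can be set (one bit per slot).
 */
static unsigned walk_with_shift(uint32_t bitmap, unsigned indexes[16])
{
    unsigned remaining = count_set_bits(bitmap);    /* popcnt ecx, edx */
    unsigned shift = 0;                             /* xor r9, r9      */
    unsigned found = 0;

    while (remaining-- != 0) {
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;

        bitmap >>= zeros + 1;       /* shrx: drop the bit just visited. */
        shift = index + 1;
        indexes[found++] = index;
    }
    return found;
}

/* blsr-style walk: clear the lowest set bit until the bitmap is empty. */
static unsigned walk_with_blsr(uint32_t bitmap, unsigned indexes[16])
{
    unsigned found = 0;

    while (bitmap != 0) {
        indexes[found++] = count_trailing_zeros(bitmap);
        bitmap &= bitmap - 1;       /* Same result as blsr. */
    }
    return found;
}
```

Both functions visit the same slot indexes in the same order; the second simply carries less state from one iteration to the next, which frees up registers in the assembly version.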

(About 5 hours pass…)

Alright, I’m back! Version 12 of our assembly routine is complete! This was really the first major change to the routine since version 2, and I had the benefit of the ~220 hours already spent obsessing over this topic, so I’m actually pretty happy with the result! Let’s take a look. (The diff view of this version is pretty messy compared to the others, given the amount of code churn involved.)

Diff (12 v 11)
Full

% diff -u IsPrefixOfStringInTable_x64_11.asm IsPrefixOfStringInTable_x64_12.asm
--- IsPrefixOfStringInTable_x64_11.asm  2018-05-03 17:21:18.181161900 -0400
+++ IsPrefixOfStringInTable_x64_12.asm  2018-05-04 12:36:24.773963100 -0400
@@ -33,8 +33,11 @@
 ;   search string.  That is, whether any string in the table "starts with
 ;   or is equal to" the search string.
 ;
-;   This routine is based off version 10, with the optimized negative prefix
-;   matching logic in place from version 5.
+;   This routine is based on version 11, but leverages the inner loop logic
+;   tweak we used in version 13 of the C version, pointed out by Fabian Giesen
+;   (@rygorous).  That is, we do away with the shifting logic and explicit loop
+;   counting, and simply use blsr to keep iterating through the bitmap until it
+;   is empty.
 ;
 ; Arguments:
 ;
@@ -53,7 +56,7 @@
 ;
 ;--

-        LEAF_ENTRY IsPrefixOfStringInTable_x64_11, _TEXT$00
+        LEAF_ENTRY IsPrefixOfStringInTable_x64_12, _TEXT$00

 ;
 ; Load the address of the string buffer into rax.
@@ -129,9 +132,7 @@

 ;
 ; Calculate the "search length" for the incoming search string, which is
-; equivalent of 'min(String->Length, 16)'.  (The search string's length
-; currently lives in xmm4, albeit as a byte-value broadcasted across the
-; entire register, so extract that first.)
+; equivalent of 'min(String->Length, 16)'.
 ;
 ; Once the search length is calculated, deposit it back at the second byte
 ; location of xmm4.
@@ -141,21 +142,18 @@
 ;   r11 - String length (String->Length)
 ;

-Pfx10:  vpextrb     r11, xmm4, 0                ; Load length.
-        mov         rax, 16                     ; Load 16 into rax.
-        mov         r10, r11                    ; Copy into r10.
-        cmp         r10w, ax                    ; Compare against 16.
-        cmova       r10w, ax                    ; Use 16 if length is greater.
-        vpinsrb     xmm4, xmm4, r10d, 1         ; Save back to xmm4b[1].
+Pfx10:  vpextrb     r11, xmm4, 0                ; Load string length.
+        mov         r9, 16                      ; Load 16 into r9.
+        mov         r10, r11                    ; Copy length into r10.
+        cmp         r10w, r9w                   ; Compare against 16.
+        cmova       r10w, r9w                   ; Use 16 if length is greater.

 ;
-; Home our parameter registers into xmm registers instead of their stack-backed
-; location, to avoid memory writes.
+; Home our parameter register rdx into the base of xmm2.
 ;

         vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
-        vpinsrq     xmm2, xmm2, rcx, 0          ; Save rcx into xmm2q[0].
-        vpinsrq     xmm2, xmm2, rdx, 1          ; Save rdx into xmm2q[1].
+        vmovq       xmm2, rdx                   ; Save rdx into xmm2q[0].

 ;
 ; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
@@ -171,77 +169,70 @@
 ;

         vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
-        vpinsrq     xmm5, xmm5, r8, 0           ; Save r8 into xmm5q[0].
+        vmovq       xmm5, r8                    ; Save r8 into xmm5q[0].

 ;
 ; Summary of xmm register stashing for the rest of the routine:
 ;
-; xmm2:
-;        0:63   (vpinsrq 0)     rcx (1st function parameter, StringTable)
-;       64:127  (vpinsrq 1)     rdx (2nd function parameter, String)
-;
-; xmm4:
-;       0:7     (vpinsrb 0)     length of search string
-;       8:15    (vpinsrb 1)     min(String->Length, 16)
-;      16:23    (vpinsrb 2)     loop counter (when doing long string compares)
-;      24:31    (vpinsrb 3)     shift count
+;   xmm2:
+;        0:63   (vpinsrq 0)     rdx (2nd function parameter, String)
 ;
-; xmm5:
+;   xmm4:
+;       0:7     (vpinsrb 0)     length of search string [r11]
+;       8:15    (vpinsrb 1)     min(String->Length, 16) [r10]
+;
+;   xmm5:
 ;       0:63    (vpinsrq 0)     r8 (3rd function parameter, StringMatch)
 ;      64:95    (vpinsrd 2)     bitmap of slots to compare
 ;      96:127   (vpinsrd 3)     index of slot currently being processed
 ;
-
+; Non-stashing xmm register use:
 ;
-; Initialize rcx as our counter register by doing a popcnt against the bitmap
-; we just generated in edx, and clear our shift count register (r9).
+;   xmm0: First 16 characters of search string.
+;
+;   xmm3: Slot lengths.
+;
+;   xmm1: Freebie!
 ;
-
-        popcnt      ecx, edx                    ; Count bits in bitmap.
-        xor         r9, r9                      ; Clear r9.

         align 16

 ;
 ; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
-; trailing zeros of the bitmap, and then add in the shift count, producing an
-; index (rax) we can use to load the corresponding slot.
+; trailing zeros of the bitmap, producing an index (rax) we can use to load the
+; corresponding slot.
 ;
-; Register usage at top of loop:
+; Volatile register usage at top of loop:
 ;
-;   rax - Index.
-;
-;   rcx - Loop counter.
+;   rcx - StringTable.
 ;
 ;   rdx - Bitmap.
 ;
-;   r9 - Shift count.
+;   r9 - Constant value of 16.
+;
+;   r10 - Search length (min(String->Length, 16))
+;
+;   r11 - Search string length (String->Length).
 ;
-;   r10 - Search length.
+; Use of remaining volatile registers during loop:
 ;
-;   r11 - String length.
+;   rax - Index.
+;
+;   r8 - Freebie!
 ;

-Pfx20:  tzcnt       r8d, edx                    ; Count trailing zeros.
-        mov         eax, r8d                    ; Copy tzcnt to rax,
-        add         rax, r9                     ; Add shift to create index.
-        inc         r8                          ; tzcnt + 1
-        shrx        rdx, rdx, r8                ; Reposition bitmap.
-        mov         r9, rax                     ; Copy index back to shift.
-        inc         r9                          ; Shift = Index + 1
-        vpinsrd     xmm5, xmm5, eax, 3          ; Store the raw index xmm5d[3].
+Pfx20:  tzcnt       eax, edx                    ; Count trailing zeros = index.

 ;
 ; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
 ; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
 ;
-; Then, load the string table slot at this index into xmm1, then shift rax back.
+; Then, load the string table slot at this index into xmm1.
 ;

-        shl         eax, 4
-        vpextrq     r8, xmm2, 0
-        vmovdqa     xmm1, xmmword ptr [rax + StringTable.Slots[r8]]
-        shr         eax, 4
+        mov         r8, rax                     ; Copy index (rax) into r8.
+        shl         r8, 4                       ; "Scale" the index.
+        vmovdqa     xmm1, xmmword ptr [r8 + StringTable.Slots[rcx]]

 ;
 ; The search string's first 16 characters are already in xmm0.  Compare this
@@ -264,187 +255,220 @@
         popcnt      r8d, r8d                    ; Count bits.

 ;
-; If 16 characters matched, and the search string's length is longer than 16,
-; we're going to need to do a comparison of the remaining strings.
+; Determine if less than 16 characters matched, as this avoids needing to do
+; a more convoluted test to see if a byte-by-byte string comparison is needed
+; (for lengths longer than 16).
 ;

-        cmp         r8w, 16                     ; Compare chars matched to 16.
-        je          short @F                    ; 16 chars matched.
-        jmp         Pfx30                       ; Less than 16 matched.
+        cmp         r8w, r9w                    ; Compare chars matched to 16.
+        jl          short Pfx30                 ; Less than 16 matched.

 ;
 ; All 16 characters matched.  Load the underlying slot's length from the
-; relevant offset in the xmm3 register, then check to see if it's greater than,
-; equal or less than 16.
-;
-
-@@:     movd        xmm1, rax                   ; Load into xmm1.
-        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length...
-        vpextrb     rax, xmm1, 0                ; And extract back into rax.
-        cmp         al, 16                      ; Compare length to 16.
-        ja          Pfx50                       ; Length is > 16.
-        je          short Pfx35                 ; Lengths match!
-                                                ; Length <= 16, fall through...
+; relevant offset in the xmm3 register into r11, then check to see if it's
+; greater than 16.  If it is, we're going to need to do a string compare,
+; handled by Pfx50.
+;
+; N.B. The approach for loading the slot length here is a little quirky.  We
+;      have all the lengths for slots in xmm3, and we have the current match
+;      index in rax.  If we move rax into an xmm register (xmm1 in this case),
+;      we can use it to shuffle xmm3, such that the length we're interested in
+;      will be deposited back into the lowest byte, which we can then extract
+;      via vpextrb.
+;
+
+        movd        xmm1, rax                   ; Load index into xmm1.
+        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length by index.
+        vpextrb     r11, xmm1, 0                ; Extract slot length into r11.
+        cmp         r11w, r9w                   ; Compare length to 16.
+        ja          short Pfx50                 ; Length is > 16.
+        jmp         short Pfx40                 ; Lengths match!

 ;
-; Less than or equal to 16 characters were matched.  Compare this against the
-; length of the search string; if equal, this is a match.
+; Less than 16 characters were matched.  Compare this against the length of the
+; search string; if equal, this is a match.
 ;

 Pfx30:  cmp         r8d, r10d                   ; Compare against search string.
-        je          short Pfx35                 ; Match found!
+        je          short Pfx40                 ; Match found!

 ;
-; No match against this slot, decrement counter and either continue the loop
-; or terminate the search and return no match.
+; No match against this slot.  Clear the lowest set bit of the bitmap and check
+; to see if there are any bits remaining in it.
 ;

-        dec         cx                          ; Decrement counter.
-        jnz         Pfx20                       ; cx != 0, continue.
+        blsr        edx, edx                    ; Reposition bitmap.
+        test        edx, edx                    ; Is bitmap empty?
+        jnz         short Pfx20                 ; Bits remain, continue loop.
+
+;
+; No more bits remain set in the bitmap, we're done.  Indicate no match found
+; and return.
+;

         xor         eax, eax                    ; Clear rax.
         not         al                          ; al = -1
         ret                                     ; Return.

 ;
-; Pfx35 and Pfx40 are the jump targets for when the prefix match succeeds.  The
-; former is used when we need to copy the number of characters matched from r8
-; back to rax.  The latter jump target doesn't require this.
+; Load the match parameter into r9 and test to see if it's not-NULL, in which
+; case we need to fill out a STRING_MATCH structure for the match, handled by
+; jump target Pfx80 at the end of this routine.
 ;

-Pfx35:  mov         rax, r8                     ; Copy numbers of chars matched.
+Pfx40:  vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
+        test        r9, r9                      ; Is NULL?
+        jnz         Pfx80                       ; Not zero, need to fill out.

 ;
-; Load the match parameter back into r8 and test to see if it's not-NULL, in
-; which case we need to fill out a STRING_MATCH structure for the match.
+; StringMatch is NULL, we're done.  We can return straight from here, rax will
+; still have the index stored.
 ;

-Pfx40:  vpextrq     r8, xmm5, 0                 ; Extract StringMatch.
-        test        r8, r8                      ; Is NULL?
-        jnz         short @F                    ; Not zero, need to fill out.
+        ret                                     ; StringMatch == NULL, finish.

 ;
-; StringMatch is NULL, we're done. Extract index of match back into rax and ret.
+; 16 characters matched and the length of the underlying slot is greater than
+; 16, so we need to do a little memory comparison to determine if the search
+; string is a prefix match.
 ;
-
-        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        ret                                     ; StringMatch == NULL, finish.
-
+; Register use on block entry:
 ;
-; StringMatch is not NULL.  Fill out characters matched (currently rax), then
-; reload the index from xmm5 into rax and save.
+;   rax - Index.
 ;
-
-@@:     mov         byte ptr StringMatch.NumberOfMatchedCharacters[r8], al
-        vpextrd     eax, xmm5, 3                ; Extract raw index for match.
-        mov         byte ptr StringMatch.Index[r8], al
-
+;   rcx - StringTable.
 ;
-; Final step, loading the address of the string in the string array.  This
-; involves going through the StringTable, so we need to load that parameter
-; back into rcx, then resolving the string array address via pStringArray,
-; then the relevant STRING offset within the StringArray.Strings structure.
+;   rdx - Bitmap.
 ;
-
-        vpextrq     rcx, xmm2, 0            ; Extract StringTable into rcx.
-        mov         rcx, StringTable.pStringArray[rcx] ; Load string array.
-
-        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
-        lea         rdx, [rax + StringArray.Strings[rcx]] ; Resolve address.
-        mov         qword ptr StringMatch.String[r8], rdx ; Save STRING ptr.
-        shr         eax, 4                  ; Revert the scaling.
-
-        ret
-
+;   r9 - Constant value of 16.
 ;
-; 16 characters matched and the length of the underlying slot is greater than
-; 16, so we need to do a little memory comparison to determine if the search
-; string is a prefix match.
+;   r10 - Search length (min(String->Length, 16))
 ;
-; The slot length is stored in rax at this point, and the search string's
-; length is stored in r11.  We know that the search string's length will
-; always be longer than or equal to the slot length at this point, so, we
-; can subtract 16 (currently stored in r10) from rax, and use the resulting
-; value as a loop counter, comparing the search string with the underlying
-; string slot byte-by-byte to determine if there's a match.
+;   r11 - Slot length.
+;
+; Register use during the block (after we've freed things up and loaded the
+; values we need):
+;
+;   rax - Index/accumulator.
+;
+;   rcx - Loop counter (for byte comparison).
+;
+;   rdx - Byte loaded into dl for comparison.
+;
+;   r8 - Target string buffer.
+;
+;   r9 - Search string buffer.
 ;
-
-Pfx50:  sub         rax, r10                ; Subtract 16 from search length.

 ;
-; Free up some registers by stashing their values into various xmm offsets.
+; Initialize r8 such that it's pointing to the slot's String->Buffer address.
+; This is a bit fiddly as we need to go through StringTable.pStringArray first
+; to get the base address of the STRING_ARRAY, then the relevant STRING offset
+; within the array, then the String->Buffer address from that structure.  Then,
+; add 16 to it, such that it's ready as the base address for comparison.
 ;

-        vpinsrd     xmm5, xmm5, edx, 2      ; Free up rdx register.
-        vpinsrb     xmm4, xmm4, ecx, 2      ; Free up rcx register.
-        mov         rcx, rax                ; Free up rax, rcx is now counter.
+Pfx50:  mov         r8, StringTable.pStringArray[rcx] ; Load string array addr.
+        mov         r9, rax                 ; Copy index into r9.
+        shl         r9, 4                   ; "Scale" index; sizeof STRING=16.
+        lea         r8, [r9 + StringArray.Strings[r8]] ; Load STRING address.
+        mov         r8, String.Buffer[r8]   ; Load String->Buffer address.
+        add         r8, r10                 ; Advance it 16 bytes.

 ;
-; Load the search string buffer and advance it 16 bytes.
+; Load the string's buffer address into r9.  We need to get the original
+; function parameter value (rdx) from xmm2q[0], then load the String->Buffer
+; address, then advance it 16 bytes.
 ;

-        vpextrq     r11, xmm2, 1            ; Extract String into r11.
-        mov         r11, String.Buffer[r11] ; Load buffer address.
-        add         r11, r10                ; Advance buffer 16 bytes.
+        vpextrq     r9, xmm2, 0             ; Extract String into r9.
+        mov         r9, String.Buffer[r9]   ; Load buffer address.
+        add         r9, r10                 ; Advance buffer 16 bytes.

 ;
-; Loading the slot is more involved as we have to go to the string table, then
-; the pStringArray pointer, then the relevant STRING offset within the string
-; array (which requires re-loading the index from xmm5d[3]), then the string
-; buffer from that structure.
+; Save the StringTable parameter, currently in rcx, into xmm1, which is a free
+; use xmm register at this point.  This frees up rcx, allowing us to copy the
+; slot length, currently in r11, and then subtract 16 (currently in r10), in
+; order to account for the fact that we've already matched 16 bytes.  This
+; allows us to then use rcx as the loop counter for the byte-by-byte comparison.
 ;

-        vpextrq     r8, xmm2, 0             ; Extract StringTable into r8.
-        mov         r8, StringTable.pStringArray[r8] ; Load string array.
+        vmovq       xmm1, rcx               ; Free up rcx.
+        mov         rcx, r11                ; Copy slot length.
+        sub         rcx, r10                ; Subtract 16.

-        vpextrd     eax, xmm5, 3            ; Extract index from xmm5.
-        shl         eax, 4                  ; Scale the index; sizeof STRING=16.
+;
+; We'd also like to use rax as the accumulator within the loop.  It currently
+; stores the index, which is important, so, stash that in r10 for now.  (We
+; know r10 is always 16 at this point, so it's easy to restore afterward.)
+;

-        lea         r8, [rax + StringArray.Strings[r8]] ; Resolve address.
-        mov         r8, String.Buffer[r8]   ; Load string table buffer address.
-        add         r8, r10                 ; Advance buffer 16 bytes.
+        mov         r10, rax                ; Save rax to r10.
+        xor         eax, eax                ; Clear rax.

-        mov         rax, rcx                ; Copy counter.
+;
+; And we'd also like to use rdx/dl to load each byte of the search string.  It
+; currently holds the bitmap, which we need, so stash that in r11 for now, which
+; is the last of our free volatile registers at this point (after we've copied
+; the slot length from it above).
+;
+
+        mov         r11, rdx                ; Save rdx to r11.
+        xor         edx, edx                ; Clear rdx.

 ;
-; We've got both buffer addresses + 16 bytes loaded in r11 and r8 respectively.
-; Do a byte-by-byte comparison.
+; We've got both buffer addresses + 16 bytes loaded in r8 and r9 respectively.
+; We need to do a byte-by-byte comparison.  The loop count is in rcx, and rax
+; is initialized to 0.  We're ready to go!
 ;

-        align 16
-@@:     mov         dl, byte ptr [rax + r11]    ; Load byte from search string.
-        cmp         dl, byte ptr [rax + r8]     ; Compare against target.
-        jne         short Pfx60                 ; If not equal, jump.
+@@:     mov         dl, byte ptr [rax + r9] ; Load byte from search string.
+        cmp         dl, byte ptr [rax + r8] ; Compare to byte in slot.
+        jne         short Pfx60             ; Bytes didn't match, exit loop.

 ;
-; The two bytes were equal, update rax, decrement rcx and potentially continue
+; The two bytes were equal, update rax, decrement rcx, and potentially continue
 ; the loop.
 ;
+        inc         al                      ; Increment index.
+        dec         cl                      ; Decrement counter.
+        jnz         short @B                ; Continue if not 0.

-        inc         ax                          ; Increment index.
-        loopnz      @B                          ; Decrement cx and loop back.
+;
+; All bytes matched!  The number of characters matched will live in rax, and
+; we also need to add 16 to it to account for the first chunk that was already
+; matched.  However, rax is also our return value, and needs to point at the
+; index of the slot that matched.  Exchange it with r8 first, as if we do have
+; a StringMatch parameter, the jump target Pfx80 will be expecting r8 to hold
+; the number of characters matched.
+;
+
+        mov         r8, rax                     ; Save characters matched.
+        mov         rax, r10                    ; Re-load index from r10.
+        vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
+        test        r9, r9                      ; Is NULL?
+        jnz         short Pfx75                 ; Not zero, need to fill out.

 ;
-; All bytes matched!  Add 16 (still in r10) back to rax such that it captures
-; how many characters we matched, and then jump to Pfx40 for finalization.
+; StringMatch is NULL, we're done.  Return rax, which will have the index in it.
 ;

-        add         rax, r10
-        jmp         Pfx40
+        ret                                     ; StringMatch == NULL, finish.

 ;
-; Byte comparisons were not equal.  Restore the rcx loop counter and decrement
-; it.  If it's zero, we have no more strings to compare, so we can do a quick
-; exit.  If there are still comparisons to be made, restore the other registers
-; we trampled then jump back to the start of the loop Pfx20.
+; The byte comparisons were not equal.  Re-load the bitmap from r11 into rdx,
+; reposition it by clearing the lowest set bit, and potentially exit if there
+; are no more bits remaining.
 ;

-Pfx60:  vpextrb     rcx, xmm4, 2                ; Restore rcx counter.
-        dec         cx                          ; Decrement counter.
-        jnz         short @F                    ; Jump forward if not zero.
+Pfx60:  mov         rdx, r11                    ; Reload bitmap.
+        blsr        edx, edx                    ; Clear lowest set bit.
+        test        edx, edx                    ; Is bitmap empty?
+        jnz         short Pfx65                 ; Bits remain.

 ;
-; No more comparisons remaining, return.
+; No more bits remain set in the bitmap, we're done.  Indicate no match found
+; and return.
 ;

         xor         eax, eax                    ; Clear rax.
@@ -452,17 +476,65 @@
         ret                                     ; Return.

 ;
-; More comparisons remain; restore the registers we clobbered and continue loop.
+; We need to continue the loop, having had this oversized string test (length >
+; 16 characters) fail.  Before we do that though, restore the registers we
+; clobbered to comply with Pfx20's top-of-the-loop register use assumptions.
 ;

-@@:     vpextrb     r10, xmm4, 1                ; Restore r10.
-        vpextrb     r11, xmm4, 0                ; Restore r11.
-        vpextrd     edx, xmm5, 2                ; Restore rdx bitmap.
+Pfx65:  vpextrb     r11, xmm4, 0                ; Restore string length.
+        vpextrq     rcx, xmm1, 0                ; Restore rcx (StringTable).
+        mov         r9, 16                      ; Restore constant 16 to r9.
+        mov         r10, r9                     ; Restore search length.
+                                                ; (We know it's always 16 here.)
         jmp         Pfx20                       ; Continue comparisons.

+;
+; This is the target for when we need to fill out the StringMatch structure.
+; It's located at the end of this routine because we're optimizing for the
+; case where the parameter is NULL in the loop body above, and we don't want
+; to pollute the code cache with this logic (which is quite convoluted).
+
+; N.B. Pfx75 is the jump target when we need to add 16 to the characters matched
+;      count stored in r8.  This particular path is exercised by the long string
+;      matching logic (i.e. when strings are longer than 16 and the prefix match
+;      is confirmed via byte-by-byte comparison).  We also need to reload rcx
+;      from xmm1.
+;
+; Expected register use at this point:
+;
+;   rax - Index of match.
+;
+;   rcx - StringTable.
+;
+;   r8 - Number of characters matched.
+;
+;   r9 - StringMatch.
+;
+;
+
+Pfx75:  add         r8, 16                                  ; Add 16 to count.
+        vpextrq     rcx, xmm1, 0                            ; Reload rcx.
+
+Pfx80:  mov         byte ptr StringMatch.NumberOfMatchedCharacters[r9], r8b
+        mov         byte ptr StringMatch.Index[r9], al
+
+;
+; Final step, loading the address of the string in the string array.  This
+; involves going through the StringTable to find the string array address via
+; pStringArray, then the relevant STRING offset within the StringArray.Strings
+; structure.
+;
+
+        mov         rcx, StringTable.pStringArray[rcx]      ; Load string array.
+        mov         r8, rax                                 ; Copy index to r8.
+        shl         r8, 4                                   ; "Scale" index.
+        lea         rdx, [r8 + StringArray.Strings[rcx]]    ; Resolve address.
+        mov         qword ptr StringMatch.String[r9], rdx   ; Save STRING ptr.
+        ret                                                 ; Return!
+
         ;IACA_VC_END

-        LEAF_END   IsPrefixOfStringInTable_x64_11, _TEXT$00
+        LEAF_END   IsPrefixOfStringInTable_x64_12, _TEXT$00


 ; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :
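The heart of this diff is the loop control: the popcnt-derived counter and the shift bookkeeping are gone, and the loop now just takes the index of the lowest set bit (tzcnt), tests that slot, and clears the bit (blsr) until the bitmap is empty. What follows is a minimal C sketch of that iteration pattern only, not of the routine itself. The helper names (`count_trailing_zeros`, `clear_lowest_set_bit`, `collect_candidate_slots`) are hypothetical, and portable C stands in for the tzcnt and blsr instructions so that the sketch compiles without BMI support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for tzcnt.  The bitmap must be non-zero. */
static unsigned count_trailing_zeros(uint32_t bitmap)
{
    unsigned index = 0;

    assert(bitmap != 0);
    while ((bitmap & 1u) == 0) {
        bitmap >>= 1;
        index++;
    }
    return index;
}

/* Portable stand-in for blsr: clears the lowest set bit. */
static uint32_t clear_lowest_set_bit(uint32_t bitmap)
{
    return bitmap & (bitmap - 1u);
}

/*
 * Hypothetical helper: visits the set bits of a 16-slot candidate bitmap in
 * ascending order, the same order the Pfx20 loop uses, and records each slot
 * index.  Returns the number of slots visited.
 */
static size_t collect_candidate_slots(uint32_t bitmap, unsigned char out[16])
{
    size_t count = 0;

    bitmap &= 0xFFFFu;                      /* Only 16 slots exist. */
    while (bitmap != 0) {
        out[count++] = (unsigned char)count_trailing_zeros(bitmap);
        bitmap = clear_lowest_set_bit(bitmap);
    }
    return count;
}
```

Because clearing the lowest set bit never disturbs the positions of the remaining bits, each tzcnt result is already the slot index, which is why the shift count and the separate loop counter can be dropped.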

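Before it reaches the loop, the routine narrows the 16 slots down to a candidate bitmap: a slot survives only if its unique character matches the search string at the slot's unique index, and the slot is not longer than the search string. Here is a scalar C model of that filter, offered as a sketch under stated assumptions rather than as the actual implementation. The function and parameter names are hypothetical, the search buffer is assumed to be readable for 16 bytes, every unique index is assumed to be below 16, and lengths are compared as signed bytes because that is how vpcmpgtb compares them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical scalar model of the candidate filter.  Each array holds one
 * entry per slot.  The returned bitmap has bit N set when slot N still needs
 * a full comparison; a result of zero corresponds to the fast-path exit.
 */
static uint32_t candidate_slot_bitmap(const unsigned char search[16],
                                      unsigned char search_length,
                                      const unsigned char slot_lengths[16],
                                      const unsigned char unique_index[16],
                                      const unsigned char unique_chars[16])
{
    uint32_t bitmap = 0;
    unsigned slot;

    for (slot = 0; slot < 16; slot++) {
        int too_long;
        int unique_match;

        /* Assumption: indexes stay in range (no vpshufb zeroing case). */
        assert(unique_index[slot] < 16);

        /* Models vpcmpgtb: signed compare of slot length to search length. */
        too_long = (signed char)slot_lengths[slot] >
                   (signed char)search_length;

        /* Models vpshufb followed by vpcmpeqb against the unique chars. */
        unique_match = search[unique_index[slot]] == unique_chars[slot];

        /* Models vpandn + vpmovmskb: keep matches that are not too long. */
        if (unique_match && !too_long) {
            bitmap |= 1u << slot;
        }
    }
    return bitmap;
}
```

In the assembly, the "is the result zero?" question is answered by vptest before the bitmap is ever materialized, which is what keeps the negative match path so short.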
;++
;
; STRING_TABLE_INDEX
; IsPrefixOfStringInTable_x64_*(
;     _In_ PSTRING_TABLE StringTable,
;     _In_ PSTRING String,
;     _Out_opt_ PSTRING_MATCH Match
;     )
;
; Routine Description:
;
;   Searches a string table to see if any strings "prefix match" the given
;   search string.  That is, whether the search string "starts with or is
;   equal to" any string in the table.
;
;   This routine is based on version 11, but leverages the inner loop logic
;   tweak we used in version 13 of the C version, pointed out by Fabian Giesen
;   (@rygorous).  That is, we do away with the shifting logic and explicit loop
;   counting, and simply use blsr to keep iterating through the bitmap until it
;   is empty.
;
; Arguments:
;
;   StringTable - Supplies a pointer to a STRING_TABLE struct.
;
;   String - Supplies a pointer to a STRING struct that contains the string to
;       search for.
;
;   Match - Optionally supplies a pointer to a STRING_MATCH structure.  This
;       will be populated with additional details about the match if a
;       non-NULL pointer is supplied.
;
; Return Value:
;
;   Index of the prefix match if one was found, NO_MATCH_FOUND if not.
;
;--

        LEAF_ENTRY IsPrefixOfStringInTable_x64_12, _TEXT$00

;
; Load the address of the string buffer into rax.
;

        ;IACA_VC_START

        mov     rax, String.Buffer[rdx]         ; Load buffer addr.

;
; Broadcast the byte-sized string length into xmm4.
;

        vpbroadcastb xmm4, byte ptr String.Length[rdx]  ; Broadcast length.

;
; Load the lengths of each string table slot into xmm3.
;

        vmovdqa xmm3, xmmword ptr StringTable.Lengths[rcx]  ; Load lengths.

;
; Load the search string buffer into xmm0.
;

        vmovdqu xmm0, xmmword ptr [rax]         ; Load search buffer.

;
; Compare the search string's length, which we've broadcast to all 8-bit
; elements of the xmm4 register, to the lengths of the slots in the string
; table, to find those that are greater in length.
;

        vpcmpgtb    xmm1, xmm3, xmm4            ; Identify long slots.

;
; Shuffle the buffer in xmm0 according to the unique indexes, and store the
; result into xmm5.
;

        vpshufb     xmm5, xmm0, StringTable.UniqueIndex[rcx] ; Rearrange string.

;
; Compare the search string's unique character array (xmm5) against the string
; table's unique chars (read directly from memory), saving the result back into
; xmm5.
;

        vpcmpeqb    xmm5, xmm5, StringTable.UniqueChars[rcx] ; Compare to uniq.

;
; Intersect-and-test the unique character match xmm mask register (xmm5) with
; the length match mask xmm register (xmm1).  This affects flags, allowing us
; to do a fast-path exit for the no-match case (where CY = 1 after xmm1 has
; been inverted).
;

        vptest      xmm1, xmm5                  ; Check for no match.
        jnc         short Pfx10                 ; There was a match.

;
; No match, set rax to -1 and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

        ;IACA_VC_END

;
; (There was at least one match, continue with processing.)
;

;
; Calculate the "search length" for the incoming search string, which is
; equivalent to 'min(String->Length, 16)'.  Unlike version 11, the result is
; kept in r10 only; it is no longer deposited back into xmm4.
;
;   r10 - Search length (min(String->Length, 16))
;
;   r11 - String length (String->Length)
;

Pfx10:  vpextrb     r11, xmm4, 0                ; Load string length.
        mov         r9, 16                      ; Load 16 into r9.
        mov         r10, r11                    ; Copy length into r10.
        cmp         r10w, r9w                   ; Compare against 16.
        cmova       r10w, r9w                   ; Use 16 if length is greater.

;
; Home our parameter register rdx into the base of xmm2.
;

        vpxor       xmm2, xmm2, xmm2            ; Clear xmm2.
        vmovq       xmm2, rdx                   ; Save rdx into xmm2q[0].

;
; Intersect xmm5 and xmm1 (as we did earlier with the 'vptest xmm1, xmm5'),
; yielding a mask identifying indices we need to perform subsequent matches
; upon.  Convert this into a bitmap and save in edx.
;

        vpandn      xmm5, xmm1, xmm5            ; Intersect unique + lengths.
        vpmovmskb   edx, xmm5                   ; Generate a bitmap from mask.

;
; We're finished with xmm5; repurpose it in the same vein as xmm2 above.
;

        vpxor       xmm5, xmm5, xmm5            ; Clear xmm5.
        vmovq       xmm5, r8                    ; Save r8 into xmm5q[0].

;
; Summary of xmm register stashing for the rest of the routine:
;
;   xmm2:
;        0:63   (vmovq)         rdx (2nd function parameter, String)
;
;   xmm4:
;       0:7     (vpextrb 0)     length of search string [also in r11]
;
;   xmm5:
;       0:63    (vmovq)         r8 (3rd function parameter, StringMatch)
;
; N.B. Unlike version 11, the search length (r10), the bitmap (rdx) and the
;      current slot index (rax) are kept in general purpose registers only.
;
; Non-stashing xmm register use:
;
;   xmm0: First 16 characters of search string.
;
;   xmm3: Slot lengths.
;
;   xmm1: Freebie!
;

        align 16

;
; Top of the main comparison loop.  The bitmap will be present in rdx.  Count
; trailing zeros of the bitmap, producing an index (rax) we can use to load the
; corresponding slot.
;
; Volatile register usage at top of loop:
;
;   rcx - StringTable.
;
;   rdx - Bitmap.
;
;   r9 - Constant value of 16.
;
;   r10 - Search length (min(String->Length, 16))
;
;   r11 - Search string length (String->Length).
;
; Use of remaining volatile registers during loop:
;
;   rax - Index.
;
;   r8 - Freebie!
;

Pfx20:  tzcnt       eax, edx                    ; Count trailing zeros = index.

;
; "Scale" the index (such that we can use it in a subsequent vmovdqa) by
; shifting left by 4 (i.e. multiply by '(sizeof STRING_SLOT)', which is 16).
;
; Then, load the string table slot at this index into xmm1.
;

        mov         r8, rax                     ; Copy index (rax) into r8.
        shl         r8, 4                       ; "Scale" the index.
        vmovdqa     xmm1, xmmword ptr [r8 + StringTable.Slots[rcx]]

;
; The search string's first 16 characters are already in xmm0.  Compare this
; against the slot that has just been loaded into xmm1, storing the result back
; into xmm1.
;

        vpcmpeqb    xmm1, xmm0, xmm1            ; Compare search string to slot.

;
; Convert the XMM mask into a 32-bit representation, then zero high bits after
; our "search length", which allows us to ignore the results of the comparison
; above for bytes that were after the search string's length, if applicable.
; Then, count the number of bits remaining, which tells us how many characters
; we matched.
;

        vpmovmskb   r8d, xmm1                   ; Convert into mask.
        bzhi        r8d, r8d, r10d              ; Zero high bits.
        popcnt      r8d, r8d                    ; Count bits.

;
; Determine if less than 16 characters matched, as this avoids needing to do
; a more convoluted test to see if a byte-by-byte string comparison is needed
; (for lengths longer than 16).
;

        cmp         r8w, r9w                    ; Compare chars matched to 16.
        jl          short Pfx30                 ; Less than 16 matched.

;
; All 16 characters matched.  Load the underlying slot's length from the
; relevant offset in the xmm3 register into r11, then check to see if it's
; greater than 16.  If it is, we're going to need to do a string compare,
; handled by Pfx50.
;
; N.B. The approach for loading the slot length here is a little quirky.  We
;      have all the lengths for slots in xmm3, and we have the current match
;      index in rax.  If we move rax into an xmm register (xmm1 in this case),
;      we can use it to shuffle xmm3, such that the length we're interested in
;      will be deposited back into the lowest byte, which we can then extract
;      via vpextrb.
;

        movd        xmm1, rax                   ; Load index into xmm1.
        vpshufb     xmm1, xmm3, xmm1            ; Shuffle length by index.
        vpextrb     r11, xmm1, 0                ; Extract slot length into r11.
        cmp         r11w, r9w                   ; Compare length to 16.
        ja          short Pfx50                 ; Length is > 16.
        jmp         short Pfx40                 ; Lengths match!

;
; Less than 16 characters were matched.  Compare this against the length of the
; search string; if equal, this is a match.
;

Pfx30:  cmp         r8d, r10d                   ; Compare to search length.
        je          short Pfx40                 ; Match found!

;
; No match against this slot.  Clear the lowest set bit of the bitmap and check
; to see if there are any bits remaining in it.
;

        blsr        edx, edx                    ; Reposition bitmap.
        test        edx, edx                    ; Is bitmap empty?
        jnz         short Pfx20                 ; Bits remain, continue loop.

;
; No more bits remain set in the bitmap, we're done.  Indicate no match found
; and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; Load the match parameter into r9 and test to see if it's not-NULL, in which
; case we need to fill out a STRING_MATCH structure for the match, handled by
; jump target Pfx80 at the end of this routine.
;

Pfx40:  vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
        test        r9, r9                      ; Is NULL?
        jnz         Pfx80                       ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done.  We can return straight from here, rax will
; still have the index stored.
;

        ret                                     ; StringMatch == NULL, finish.

;
; 16 characters matched and the length of the underlying slot is greater than
; 16, so we need to do a little memory comparison to determine if the search
; string is a prefix match.
;
; Register use on block entry:
;
;   rax - Index.
;
;   rcx - StringTable.
;
;   rdx - Bitmap.
;
;   r9 - Constant value of 16.
;
;   r10 - Search length (min(String->Length, 16))
;
;   r11 - Slot length.
;
; Register use during the block (after we've freed things up and loaded the
; values we need):
;
;   rax - Index/accumulator.
;
;   rcx - Loop counter (for byte comparison).
;
;   rdx - Byte loaded into dl for comparison.
;
;   r8 - Target string buffer.
;
;   r9 - Search string buffer.
;

;
; Initialize r8 such that it's pointing to the slot's String->Buffer address.
; This is a bit fiddly as we need to go through StringTable.pStringArray first
; to get the base address of the STRING_ARRAY, then the relevant STRING offset
; within the array, then the String->Buffer address from that structure.  Then,
; add 16 to it, such that it's ready as the base address for comparison.
;

Pfx50:  mov         r8, StringTable.pStringArray[rcx] ; Load string array addr.
        mov         r9, rax                 ; Copy index into r9.
        shl         r9, 4                   ; "Scale" index; sizeof STRING=16.
        lea         r8, [r9 + StringArray.Strings[r8]] ; Load STRING address.
        mov         r8, String.Buffer[r8]   ; Load String->Buffer address.
        add         r8, r10                 ; Advance it 16 bytes.

;
; Load the string's buffer address into r9.  We need to get the original
; function parameter value (rdx) from xmm2q[0], then load the String->Buffer
; address, then advance it 16 bytes.
;

        vpextrq     r9, xmm2, 0             ; Extract String into r9.
        mov         r9, String.Buffer[r9]   ; Load buffer address.
        add         r9, r10                 ; Advance buffer 16 bytes.

;
; Save the StringTable parameter, currently in rcx, into xmm1, which is a free
; use xmm register at this point.  This frees up rcx, allowing us to copy the
; slot length, currently in r11, and then subtracting 16 (currently in r10),
; in order to account for the fact that we've already matched 16 bytes.  This
; allows us to then use rcx as the loop counter for the byte-by-byte comparison.
;

        vmovq       xmm1, rcx               ; Free up rcx.
        mov         rcx, r11                ; Copy slot length.
        sub         rcx, r10                ; Subtract 16.

;
; We'd also like to use rax as the accumulator within the loop.  It currently
; stores the index, which is important, so, stash that in r10 for now.  (We
; know r10 is always 16 at this point, so it's easy to restore afterward.)
;

        mov         r10, rax                ; Save rax to r10.
        xor         eax, eax                ; Clear rax.

;
; And we'd also like to use rdx/dl to load each byte of the search string.  It
; currently holds the bitmap, which we need, so stash that in r11 for now, which
; is the last of our free volatile registers at this point (after we've copied
; the slot length from it above).
;

        mov         r11, rdx                ; Save rdx to r11.
        xor         edx, edx                ; Clear rdx.

;
; We've got both buffer addresses + 16 bytes loaded in r8 and r9 respectively.
; We need to do a byte-by-byte comparison.  The loop count is in rcx, and rax
; is initialized to 0.  We're ready to go!
;

@@:     mov         dl, byte ptr [rax + r9] ; Load byte from search string.
        cmp         dl, byte ptr [rax + r8] ; Compare to byte in slot.
        jne         short Pfx60             ; Bytes didn't match, exit loop.

;
; The two bytes were equal, update rax, decrement rcx, and potentially continue
; the loop.
;
        inc         al                      ; Increment index.
        dec         cl                      ; Decrement counter.
        jnz         short @B                ; Continue if not 0.

;
; All bytes matched!  The number of characters matched will live in rax, and
; we also need to add 16 to it to account for the first chunk that was already
; matched.  However, rax is also our return value, and needs to point at the
; index of the slot that matched.  Exchange it with r8 first, as if we do have
; a StringMatch parameter, the jump target Pfx80 will be expecting r8 to hold
; the number of characters matched.
;

        mov         r8, rax                     ; Save characters matched.
        mov         rax, r10                    ; Re-load index from r10.
        vpextrq     r9, xmm5, 0                 ; Extract StringMatch.
        test        r9, r9                      ; Is NULL?
        jnz         short Pfx75                 ; Not zero, need to fill out.

;
; StringMatch is NULL, we're done.  Return rax, which will have the index in it.
;

        ret                                     ; StringMatch == NULL, finish.

;
; The byte comparisons were not equal.  Re-load the bitmap from r11 into rdx,
; reposition it by clearing the lowest set bit, and potentially exit if there
; are no more bits remaining.
;

Pfx60:  mov         rdx, r11                    ; Reload bitmap.
        blsr        edx, edx                    ; Clear lowest set bit.
        test        edx, edx                    ; Is bitmap empty?
        jnz         short Pfx65                 ; Bits remain.

;
; No more bits remain set in the bitmap, we're done.  Indicate no match found
; and return.
;

        xor         eax, eax                    ; Clear rax.
        not         al                          ; al = -1
        ret                                     ; Return.

;
; We need to continue the loop, having had this oversized string test (length >
; 16 characters) fail.  Before we do that though, restore the registers we
; clobbered to comply with Pfx20's top-of-the-loop register use assumptions.
;

Pfx65:  vpextrb     r11, xmm4, 0                ; Restore string length.
        vpextrq     rcx, xmm1, 0                ; Restore rcx (StringTable).
        mov         r9, 16                      ; Restore constant 16 to r9.
        mov         r10, r9                     ; Restore search length.
                                                ; (We know it's always 16 here.)
        jmp         Pfx20                       ; Continue comparisons.

;
; This is the target for when we need to fill out the StringMatch structure.
; It's located at the end of this routine because we're optimizing for the
; case where the parameter is NULL in the loop body above, and we don't want
; to pollute the code cache with this logic (which is quite convoluted).
;
; N.B. Pfx75 is the jump target when we need to add 16 to the characters matched
;      count stored in r8.  This particular path is exercised by the long string
;      matching logic (i.e. when strings are longer than 16 and the prefix match
;      is confirmed via byte-by-byte comparison).  We also need to reload rcx
;      from xmm1.
;
; Expected register use at this point:
;
;   rax - Index of match.
;
;   rcx - StringTable.
;
;   r8 - Number of characters matched.
;
;   r9 - StringMatch.
;
;

Pfx75:  add         r8, 16                                  ; Add 16 to count.
        vpextrq     rcx, xmm1, 0                            ; Reload rcx.

Pfx80:  mov         byte ptr StringMatch.NumberOfMatchedCharacters[r9], r8b
        mov         byte ptr StringMatch.Index[r9], al

;
; Final step, loading the address of the string in the string array.  This
; involves going through the StringTable to find the string array address via
; pStringArray, then the relevant STRING offset within the StringArray.Strings
; structure.
;

        mov         rcx, StringTable.pStringArray[rcx]      ; Load string array.
        mov         r8, rax                                 ; Copy index to r8.
        shl         r8, 4                                   ; "Scale" index.
        lea         rdx, [r8 + StringArray.Strings[rcx]]    ; Resolve address.
        mov         qword ptr StringMatch.String[r9], rdx   ; Save STRING ptr.
        ret                                                 ; Return!

        ;IACA_VC_END

        LEAF_END   IsPrefixOfStringInTable_x64_12, _TEXT$00


; vim:set tw=80 ts=8 sw=4 sts=4 et syntax=masm fo=croql comments=\:;           :

I’m really happy with how that turned out! Switching to blsr really improved the layout of the inner loop, and vastly reduced our register pressure, which means less XMM register spilling is required, which is always a good thing.
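
For readers who think in C rather than assembly, the bit-clearing step that blsr performs is equivalent to `x & (x - 1)`. The following is a portable sketch of that loop shape only, not the routine above: `lowest_set_bit_index`, `clear_lowest_set_bit` and `visit_candidates` are hypothetical names chosen for illustration, and a real build would presumably use compiler intrinsics rather than the shift loop shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for a trailing-zero count; bitmap must be non-zero. */
static unsigned lowest_set_bit_index(uint32_t bitmap)
{
    unsigned index = 0;

    assert(bitmap != 0);
    while ((bitmap & 1u) == 0) {
        bitmap >>= 1;
        index++;
    }
    return index;
}

/* Same effect as blsr: clear the lowest set bit, leave the rest alone. */
static uint32_t clear_lowest_set_bit(uint32_t bitmap)
{
    return bitmap & (bitmap - 1u);
}

/*
 * Walk every candidate slot flagged in the bitmap, lowest index first.
 * The visited slot indexes are written to `out` (32 entries is always
 * enough for a 32-bit bitmap); the return value is how many were visited.
 */
static unsigned visit_candidates(uint32_t bitmap, unsigned *out)
{
    unsigned count = 0;

    while (bitmap != 0) {
        out[count++] = lowest_set_bit_index(bitmap);
        bitmap = clear_lowest_set_bit(bitmap);
    }
    return count;
}
```

The loop runs exactly once per set bit and never needs a separate shift counter, which is the property that simplified the inner loop in the assembly.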

But does it improve performance? Eek! It’s our final Hail Mary attempt at an improvement. Can we beat the fastest profile-guided optimization build of the C version in both prefix matching and negative matching?

Drum roll

The performance for version 12 of the assembly is…

The assembly version brings in gold across the board! A quick run through VTune suggests the routine is clocking in with a CPI of 0.266, which is pretty darn close to the theoretical minimum of 0.25 (which corresponds to 4 instructions retired per clock cycle).

Other Applications

Once I’d written the first version of the StringTable component, for better or worse, it became the hammer for all of my string-related problems! My favorite example of this is the code I wrote for parsing the output of Windows debug engine’s examine symbols command.

Here’s an example of a few lines of output from the cdb command x /v /t Rtl!*:

prv global 00007ffd`1570d100   10 struct _STRING Rtl!ExtendedLengthVolumePrefixA = struct _STRING "\\?\"
prv global 00007ffd`1570d110   10 struct _UNICODE_STRING Rtl!ExtendedLengthVolumePrefixW = "\\?\"
prv global 00007ffd`1570da30  5a8 char *[181] Rtl!RtlFunctionNames = char *[181]
prv global 00007ffd`15711018    8 <function> * Rtl!__C_specific_handler_impl = 0x00007ffd`214c0f00
prv global 00007ffd`1570d820  208 char *[65] Rtl!RtlExFunctionNames = char *[65]
prv global 00007ffd`15711000    8 <function> * Rtl!atexit_impl = 0x00007ffd`15704370
...
prv global 00007ffd`1570d120  1a8 char *[53] Rtl!CuFunctionNames = char *[53]
prv func   00007ffd`15708be0   5d <function> Rtl!AppendCharBufferToCharBuffer (char **, char *, unsigned long)
prv func   00007ffd`15702450   2e <function> Rtl!RtlHeapAllocatorFreePointer (void *, void **)
prv func   00007ffd`15702730   3c <function> Rtl!RtlHeapAllocatorAlignedFreePointer (void *, void **)
prv func   00007ffd`157093b0   1e <CLR type> Rtl!UnregisterRtlAtExitEntry$fin$0 (void)
...
prv func   00007ffd`157061f0   48 <function> Rtl!RtlCryptGenRandom (struct _RTL *, unsigned long, unsigned char *)
prv func   00007ffd`15707500   b2 <function> Rtl!AppendTailGuardedListTsx (struct _GUARDED_LIST *, struct _LIST_ENTRY *)
prv func   00007ffd`157075c0    d <function> Rtl!DummyVectorCall1 (union __m128i *, union __m128i *, ...
prv func   00007ffd`157025c0   4f <function> Rtl!RtlHeapAllocatorAlignedMalloc (void *, unsigned int64, unsigned int64)
prv func   00007ffd`15703cc0   9e <function> Rtl!DisableCreateSymbolicLinkPrivilege (void)

The function ExamineSymbolsParseLine is called for each line of output and is responsible for parsing it into a DEBUG_ENGINE_EXAMINED_SYMBOL structure. It’s some good ol’ fashioned string processing using nothing but pointer arithmetic and a bunch of string tables.

It was the first time I needed to match more than 16 strings in a given category, though. A pattern emerged that was quite reasonable, and it became my de facto way of dealing with multiple string tables for a given category.

Let’s look at the basic type category. Two string tables were constructed from the following constant delimited strings (view on GitHub):

#define DSTR(String) String ";"

//
// ExamineSymbolsBasicTypes
//

const STRING ExamineSymbolsBasicTypes1 = RTL_CONSTANT_STRING(
    DSTR("<NoType>")
    DSTR("<function>")
    DSTR("char")
    DSTR("wchar_t")
    DSTR("short")
    DSTR("long")
    DSTR("int64")
    DSTR("int")
    DSTR("unsigned char")
    DSTR("unsigned wchar_t")
    DSTR("unsigned short")
    DSTR("unsigned long")
    DSTR("unsigned int64")
    DSTR("unsigned int")
    DSTR("union")
    DSTR("struct")
);

const STRING ExamineSymbolsBasicTypes2 = RTL_CONSTANT_STRING(
    DSTR("<CLR type>")
    DSTR("bool")
    DSTR("void")
    DSTR("class")
    DSTR("float")
    DSTR("double")
    DSTR("_SAL_ExecutionContext")
    DSTR("__enative_startup_state")
);
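
The DSTR macro leans on C's adjacent string literal concatenation, so each definition above collapses into a single semicolon-delimited literal at compile time. Here is a small sketch of that effect using a plain `char` array instead of a STRING; `ExampleTypes` and `count_delimited_entries` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DSTR(String) String ";"

/* Adjacent literals are joined by the compiler: "bool;void;class;". */
static const char ExampleTypes[] =
    DSTR("bool")
    DSTR("void")
    DSTR("class");

/* Hypothetical helper: every entry ends in a delimiter, so count those. */
static size_t count_delimited_entries(const char *text, char delimiter)
{
    size_t count = 0;

    for (; *text != '\0'; text++) {
        if (*text == delimiter) {
            count++;
        }
    }
    return count;
}
```

Because every entry carries its own trailing delimiter, adding or reordering entries never requires touching the neighbouring lines.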

In concert with the two string tables, an enumeration was defined (view on GitHub):

//
// The order of these enumeration symbols must match the exact order of the
// corresponding string in the relevant ExamineSymbolsBasicTypes[1..n] STRING
// structure (see DebugEngineConstants.c).  This is because string tables are
// created from the delimited strings and the match index is cast directly to
// an enum of this type.
//

typedef enum _DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE {
    UnknownType = -1,

    //
    // First 16 types captured by BasicTypeStringTable1.
    //

    NoType = 0,
    FunctionType,

    CharType,
    WideCharType,
    ShortType,
    LongType,
    Integer64Type,
    IntegerType,

    UnsignedCharType,
    UnsignedWideCharType,
    UnsignedShortType,
    UnsignedLongType,
    UnsignedInteger64Type,
    UnsignedIntegerType,

    UnionType,
    StructType,

    //
    // Next 16 types captured by BasicTypeStringTable2.
    //

    CLRType = 16,
    BoolType,
    VoidType,
    ClassType,
    FloatType,
    DoubleType,
    SALExecutionContextType,
    ENativeStartupStateType,

    //
    // Any types that don't map directly to literal type names extracted from
    // the output string are listed here.  The first one starts at 48 in order
    // to differentiate it from the string tables.
    //

    //
    // Call site of an inline function.
    //

    InlineCallerType = 48,

    //
    // Enum is special in that it doesn't map to a string in the string table;
    // if a type can't be inferred from the list above, it defaults to Enum.
    //

    EnumType,

    //
    // Any enumeration value >= InvalidType is invalid.  Make sure this always
    // comes last in the enum layout.
    //

    InvalidType

} DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE;

Here’s the part of the logic within ExamineSymbolsParseLine that deals with matching the basic type part of the line. This refers to the 5th column of the output, e.g., the struct, char *[181], <function>, <CLR type> bits in the following output:

prv global 00007ffd`1570d110   10 struct _UNICODE_STRING Rtl!ExtendedLengthVolumePrefixW = "\\?\"
prv global 00007ffd`1570da30  5a8 char *[181] Rtl!RtlFunctionNames = char *[181]
prv global 00007ffd`15711018    8 <function> * Rtl!__C_specific_handler_impl = 0x00007ffd`214c0f00
prv func   00007ffd`157093b0   1e <CLR type> Rtl!UnregisterRtlAtExitEntry$fin$0 (void)

    //
    // (Type declarations of the variables being referenced shortly.)
    //

    SHORT MatchOffset;
    USHORT MatchIndex;
    USHORT MatchAttempts;
    USHORT NumberOfStringTables;
    STRING BasicType;
    STRING_MATCH Match;
    PSTRING_TABLE StringTable;
    DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE SymbolType;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

    ...

    //
    // The basic type will be next.  Set up the variable then search the string
    // table for a match.  Set the length to the BytesRemaining for now; as long
    // as it's greater than or equal to the basic type length (which it should
    // always be), that will be fine.
    //

    BasicType.Buffer = Char;
    BasicType.Length = (USHORT)BytesRemaining;
    BasicType.MaximumLength = (USHORT)BytesRemaining;

    StringTable = Session->ExamineSymbolsBasicTypeStringTable1;
    IsPrefixOfStringInTable = Session->StringTableApi->IsPrefixOfStringInTable;
    MatchOffset = 0;
    MatchAttempts = 0;
    NumberOfStringTables = Session->NumberOfBasicTypeStringTables;
    ZeroStruct(Match);

RetryBasicTypeMatch:

    MatchIndex = IsPrefixOfStringInTable(StringTable, &BasicType, &Match);

    if (MatchIndex == NO_MATCH_FOUND) {
        if (++MatchAttempts >= NumberOfStringTables) {

            //
            // We weren't able to match the name to any known types.
            // Default to the enum type.
            //

            SymbolType = EnumType;

        } else {

            //
            // There are string tables remaining.  Attempt another match.
            //

            StringTable++;
            MatchOffset += MAX_STRING_TABLE_ENTRIES;
            goto RetryBasicTypeMatch;

        }

    } else {

        //
        // We found a match.  Our enums are carefully offset in order to allow
        // the following `index + offset = enum value` logic to work.
        //

        SymbolType = MatchIndex + MatchOffset;
    }

    //
    // N.B. This next part doesn't occur in the source file, but I wanted
    //      to include it to demonstrate how you could then simply switch
    //      on the resulting symbol type directly, e.g.:
    //

    switch (SymbolType) {
        case CharType:
        case WideCharType:
            ...
            break;

        case UnionType:
        case StructType:
            ...
            break;

        default:
            ...
            break;

    }

If there’s no match found, we check to see if we’ve performed the maximum number of attempts, that is, whether or not we’ve exhausted all our string tables. If we have, we just default to the EnumType.

Otherwise, bump the StringTable pointer (which relies on the fact that the underlying string table pointers in the session structure are contiguous — a handy implementation detail), bump the match offset by the number of entries per string table, and try the match again.

If we found a match, we can obtain the SymbolType enum representation of the underlying match by simply adding the match index to the match offset. I like that. It’s simple and fast. It also plays nicely with switch statements; do your lookup, resolve the underlying enum value, and process each possible path in a case statement like you’d do with any other integer representation of an option.
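
To make the `index + offset = enum value` idea concrete without any of the surrounding machinery, here is a stripped-down sketch. It swaps the SIMD string table for a naive linear prefix scan over C strings, so `SIMPLE_TABLE`, `find_prefix_in_table`, `resolve_type` and `ExampleTables` are all hypothetical stand-ins; only the retry-with-offset structure mirrors the logic above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ENTRIES_PER_TABLE 16
#define NO_MATCH (-1)
#define FALLBACK_TYPE 49    /* Same value as EnumType (InlineCallerType + 1). */

typedef struct {
    const char *entries[ENTRIES_PER_TABLE];
    size_t count;
} SIMPLE_TABLE;

/* Return the index of the first entry that is a prefix of `text`. */
static int find_prefix_in_table(const SIMPLE_TABLE *table, const char *text)
{
    for (size_t i = 0; i < table->count; i++) {
        size_t length = strlen(table->entries[i]);
        if (strncmp(text, table->entries[i], length) == 0) {
            return (int)i;
        }
    }
    return NO_MATCH;
}

/* Try each table in turn; a hit in table N yields index + N * 16. */
static int resolve_type(const SIMPLE_TABLE *tables, size_t table_count,
                        const char *text)
{
    int offset = 0;

    for (size_t t = 0; t < table_count; t++) {
        int index = find_prefix_in_table(&tables[t], text);
        if (index != NO_MATCH) {
            return index + offset;
        }
        offset += ENTRIES_PER_TABLE;
    }
    return FALLBACK_TYPE;
}

/* Abbreviated tables; unused slots are implicitly NULL. */
static const SIMPLE_TABLE ExampleTables[2] = {
    { { "<NoType>", "<function>", "char", "wchar_t" }, 4 },
    { { "<CLR type>", "bool", "void" }, 3 },
};
```

With these abbreviated tables, a line fragment starting with `char` resolves to 2 (CharType), one starting with `<CLR type>` resolves to 16 (CLRType), and anything unrecognised falls through to the enum default.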

The other nice side-effect is that it forces you to pick which table a given string should go in. I made this decision by looking at which types occurred most frequently, and simply put those in the first table. Less frequent types go in subsequent tables.

I have a hunch there’s a lot of mileage in that approach; that is, linear scanning an array of string tables until a match is found. There will be an inflection point where some form of a log(n) binary tree search will perform better overall, but it would be very interesting to see how many strings you need to potentially match against before that point is hit.

Unless the likelihood of matching any given string in your set is completely random, by ordering the strings in your tables by how frequently they occur, the amortized cost of parsing a chunk of text would be very competitive using this approach, I would think.
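
A rough way to reason about that hunch is to count the expected number of table lookups per search. The sketch below is a back-of-the-envelope model, not a benchmark: `expected_table_probes` is a hypothetical helper, the probabilities are made up, and it assumes every search matches exactly one string and that each table lookup costs the same.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expected number of table lookups per search, given the probability of
 * each string (listed in table order) and the number of strings stored
 * per table.  A string at position i lives in table i / per_table, and
 * reaching it costs one lookup for that table plus one per earlier table.
 */
static double expected_table_probes(const double *probability, size_t count,
                                    size_t per_table)
{
    double expected = 0.0;

    assert(per_table > 0);
    for (size_t i = 0; i < count; i++) {
        expected += probability[i] * (double)(i / per_table + 1);
    }
    return expected;
}
```

With invented probabilities of 0.5, 0.25, 0.125 and 0.125 and two strings per table, putting the frequent strings first gives 1.25 lookups on average, whereas the reverse order gives 1.75.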

A fun experiment for next time, perhaps!

Appendix

And now here’s all the stuff that wasn’t important enough to occur earlier in the article.

Implementation Considerations

One issue with writing so many versions of the exact same function is… how do you actually handle this? Downstream consumers of the component don’t need to access the 30 different function pointers for each function you’ve experimented with, but things like unit tests and benchmark programs do.

Here’s what I did for the StringTable component. Define two API structures, a normal one and an “extended” one. The extended one mirrors the normal one, and then adds all of its additional functions to the end.

I use a .def file to control the DLL function exports, with an alias to easily control which version of a function is the official version. The main header file then contains some bootstrap glue (in the form of an inline function) that dynamically loads the target library and resolves the number of API methods according to the size of the API structure provided.
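
The size-based selection is easier to see in miniature. This sketch uses two toy structures made solely of function pointers; `EXAMPLE_API`, `EXAMPLE_API_EX` and `number_of_symbols_for_size` are hypothetical names, and the only thing carried over from the real header is the idea of deriving the symbol count from the structure size the caller passes in.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*EXAMPLE_FUNCTION)(void);

typedef struct {
    EXAMPLE_FUNCTION Create;
    EXAMPLE_FUNCTION Destroy;
    EXAMPLE_FUNCTION Lookup;
} EXAMPLE_API;

/* The extended structure repeats the normal layout, then appends extras. */
typedef struct {
    EXAMPLE_FUNCTION Create;
    EXAMPLE_FUNCTION Destroy;
    EXAMPLE_FUNCTION Lookup;
    EXAMPLE_FUNCTION Lookup_1;
    EXAMPLE_FUNCTION Lookup_2;
} EXAMPLE_API_EX;

/*
 * Return how many function pointers should be resolved for a caller that
 * supplied a structure of the given size, or 0 if the size is unknown.
 */
static size_t number_of_symbols_for_size(size_t size_of_api)
{
    if (size_of_api == sizeof(EXAMPLE_API)) {
        return sizeof(EXAMPLE_API) / sizeof(EXAMPLE_FUNCTION);
    }
    if (size_of_api == sizeof(EXAMPLE_API_EX)) {
        return sizeof(EXAMPLE_API_EX) / sizeof(EXAMPLE_FUNCTION);
    }
    return 0;
}
```

Because the extended structure starts with the same members in the same order, a single ordered list of symbol names can serve both callers; the shorter structure simply consumes fewer names.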

This currently means that the StringTable2.dll includes all 14 C and 5 assembly variants, which is harmless, but it does increase the size of the module unnecessarily. (The module is currently about 19KB in size, whereas it would be under 4KB if only the official versions were included.) What I’ll probably end up doing is setting up a second project called StringTableEx, and, in conjunction with some #ifdefs, have that be the version of the module that contains all the additional functions, with the normal version just containing the official versions.

Here’s the bootstrap glue from StringTable.h and the StringTable.def file I currently use. (Note: this routine uses the LoadSymbols() function from the Rtl component.)

Bootstrap Header Glue
StringTable.def

//
// Define the string table API structure.
//

typedef struct _STRING_TABLE_API {

    PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;

    PCOPY_STRING_ARRAY CopyStringArray;
    PCREATE_STRING_TABLE CreateStringTable;
    PDESTROY_STRING_TABLE DestroyStringTable;

    PINITIALIZE_STRING_TABLE_ALLOCATOR
        InitializeStringTableAllocator;

    PINITIALIZE_STRING_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
        InitializeStringTableAllocatorFromRtlBootstrap;

    PCREATE_STRING_ARRAY_FROM_DELIMITED_STRING
        CreateStringArrayFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_STRING
        CreateStringTableFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_ENVIRONMENT_VARIABLE
        CreateStringTableFromDelimitedEnvironmentVariable;

    PIS_STRING_IN_TABLE IsStringInTable;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

} STRING_TABLE_API;
typedef STRING_TABLE_API *PSTRING_TABLE_API;

typedef struct _STRING_TABLE_API_EX {

    //
    // Inline STRING_TABLE_API.
    //

    PSET_C_SPECIFIC_HANDLER SetCSpecificHandler;

    PCOPY_STRING_ARRAY CopyStringArray;
    PCREATE_STRING_TABLE CreateStringTable;
    PDESTROY_STRING_TABLE DestroyStringTable;

    PINITIALIZE_STRING_TABLE_ALLOCATOR
        InitializeStringTableAllocator;

    PINITIALIZE_STRING_TABLE_ALLOCATOR_FROM_RTL_BOOTSTRAP
        InitializeStringTableAllocatorFromRtlBootstrap;

    PCREATE_STRING_ARRAY_FROM_DELIMITED_STRING
        CreateStringArrayFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_STRING
        CreateStringTableFromDelimitedString;

    PCREATE_STRING_TABLE_FROM_DELIMITED_ENVIRONMENT_VARIABLE
        CreateStringTableFromDelimitedEnvironmentVariable;

    PIS_STRING_IN_TABLE IsStringInTable;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;

    //
    // Extended API methods used for benchmarking.
    //

    PIS_PREFIX_OF_CSTR_IN_ARRAY IsPrefixOfCStrInArray;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_1;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_2;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_3;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_4;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_5;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_6;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_7;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_8;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_9;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_10;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_11;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_12;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_13;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_14;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_1;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_2;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_3;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_4;
    PIS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable_x64_5;
    PIS_PREFIX_OF_STRING_IN_TABLE IntegerDivision_x64_1;

} STRING_TABLE_API_EX;
typedef STRING_TABLE_API_EX *PSTRING_TABLE_API_EX;

typedef union _STRING_TABLE_ANY_API {
    STRING_TABLE_API Api;
    STRING_TABLE_API_EX ApiEx;
} STRING_TABLE_ANY_API;
typedef STRING_TABLE_ANY_API *PSTRING_TABLE_ANY_API;

FORCEINLINE
BOOLEAN
LoadStringTableApi(
    _In_ PRTL Rtl,
    _Inout_ HMODULE *ModulePointer,
    _In_opt_ PUNICODE_STRING ModulePath,
    _In_ ULONG SizeOfAnyApi,
    _Out_writes_bytes_all_(SizeOfAnyApi) PSTRING_TABLE_ANY_API AnyApi
    )
/*++

Routine Description:

    Loads the string table module and resolves all API functions for either
    the STRING_TABLE_API or STRING_TABLE_API_EX structure.  The desired API
    is indicated by the SizeOfAnyApi parameter.

    Example use:

        STRING_TABLE_API_EX GlobalApi;
        PSTRING_TABLE_API_EX Api;

        Success = LoadStringTableApi(Rtl,
                                     NULL,
                                     NULL,
                                     sizeof(GlobalApi),
                                     (PSTRING_TABLE_ANY_API)&GlobalApi);
        ASSERT(Success);
        Api = &GlobalApi;

    In this example, the extended API will be provided as our sizeof(GlobalApi)
    will indicate the structure size used by STRING_TABLE_API_EX.

    See ../StringTable2BenchmarkExe/main.c for a complete example.

Arguments:

    Rtl - Supplies a pointer to an initialized RTL structure.

    ModulePointer - Optionally supplies a pointer to an existing module handle
        for which the API symbols are to be resolved.  May be NULL.  If not
        NULL, but the pointed-to value is NULL, then this parameter will
        receive the handle obtained by LoadLibrary() as part of this call.
        If the string table module is no longer needed, but the program will
        keep running, the caller should issue a FreeLibrary() against this
        module handle.

    ModulePath - Optionally supplies a pointer to a UNICODE_STRING structure
        representing a path name of the string table module to be loaded.
        If *ModulePointer is not NULL, it takes precedence over this parameter.
        If NULL, and no module has been provided via *ModulePointer, an attempt
        will be made to load the library via 'LoadLibraryA("StringTable2.dll")'.

    SizeOfAnyApi - Supplies the size, in bytes, of the underlying structure
        pointed to by the AnyApi parameter.

    AnyApi - Supplies the address of a structure which will receive resolved
        API function pointers.  The API furnished will depend on the size
        indicated by the SizeOfAnyApi parameter.

Return Value:

    TRUE on success, FALSE on failure.

--*/
{
    BOOL Success;
    HMODULE Module = NULL;
    ULONG NumberOfSymbols;
    ULONG NumberOfResolvedSymbols;

    //
    // Define the API names.
    //
    // N.B. These names must match STRING_TABLE_API_EX exactly (including the
    //      order).
    //

    const PCSTR Names[] = {
        "SetCSpecificHandler",
        "CopyStringArray",
        "CreateStringTable",
        "DestroyStringTable",
        "InitializeStringTableAllocator",
        "InitializeStringTableAllocatorFromRtlBootstrap",
        "CreateStringArrayFromDelimitedString",
        "CreateStringTableFromDelimitedString",
        "CreateStringTableFromDelimitedEnvironmentVariable",
        "IsStringInTable",
        "IsPrefixOfStringInTable",
        "IsPrefixOfCStrInArray",
        "IsPrefixOfStringInTable_1",
        "IsPrefixOfStringInTable_2",
        "IsPrefixOfStringInTable_3",
        "IsPrefixOfStringInTable_4",
        "IsPrefixOfStringInTable_5",
        "IsPrefixOfStringInTable_6",
        "IsPrefixOfStringInTable_7",
        "IsPrefixOfStringInTable_8",
        "IsPrefixOfStringInTable_9",
        "IsPrefixOfStringInTable_10",
        "IsPrefixOfStringInTable_11",
        "IsPrefixOfStringInTable_12",
        "IsPrefixOfStringInTable_13",
        "IsPrefixOfStringInTable_14",
        "IsPrefixOfStringInTable_x64_1",
        "IsPrefixOfStringInTable_x64_2",
        "IsPrefixOfStringInTable_x64_3",
        "IsPrefixOfStringInTable_x64_4",
        "IsPrefixOfStringInTable_x64_5",
        "IntegerDivision_x64_1",
    };

    //
    // Define an appropriately sized bitmap we can pass to Rtl->LoadSymbols().
    //

    ULONG BitmapBuffer[(ALIGN_UP(ARRAYSIZE(Names), sizeof(ULONG) << 3) >> 5)+1];
    RTL_BITMAP FailedBitmap = { ARRAYSIZE(Names)+1, (PULONG)&BitmapBuffer };

    //
    // Determine the number of symbols we want to resolve based on the size of
    // the API indicated by the caller.
    //

    if (SizeOfAnyApi == sizeof(AnyApi->Api)) {
        NumberOfSymbols = sizeof(AnyApi->Api) / sizeof(ULONG_PTR);
    } else if (SizeOfAnyApi == sizeof(AnyApi->ApiEx)) {
        NumberOfSymbols = sizeof(AnyApi->ApiEx) / sizeof(ULONG_PTR);
    } else {
        return FALSE;
    }

    //
    // Attempt to load the underlying string table module if necessary.
    //

    if (ARGUMENT_PRESENT(ModulePointer)) {
        Module = *ModulePointer;
    }

    if (!Module) {
        if (ARGUMENT_PRESENT(ModulePath)) {
            Module = LoadLibraryW(ModulePath->Buffer);
        } else {
            Module = LoadLibraryA("StringTable2.dll");
        }
    }

    if (!Module) {
        return FALSE;
    }

    //
    // We've got a handle to the string table module.  Load the symbols we want
    // dynamically via Rtl->LoadSymbols().
    //

    Success = Rtl->LoadSymbols(
        Names,
        NumberOfSymbols,
        (PULONG_PTR)AnyApi,
        NumberOfSymbols,
        Module,
        &FailedBitmap,
        TRUE,
        &NumberOfResolvedSymbols
    );

    ASSERT(Success);

    //
    // Debug helper: if the breakpoint below is hit, then the symbol names
    // have potentially become out of sync.  Look at the value of first failed
    // symbol to assist in determining the cause.
    //

    if (NumberOfSymbols != NumberOfResolvedSymbols) {
        PCSTR FirstFailedSymbolName;
        ULONG FirstFailedSymbol;
        ULONG NumberOfFailedSymbols;

        NumberOfFailedSymbols = Rtl->RtlNumberOfSetBits(&FailedBitmap);
        FirstFailedSymbol = Rtl->RtlFindSetBits(&FailedBitmap, 1, 0);
        FirstFailedSymbolName = Names[FirstFailedSymbol-1];
        __debugbreak();
    }

    //
    // Set the C specific handler for the module, such that structured
    // exception handling will work.
    //

    AnyApi->Api.SetCSpecificHandler(Rtl->__C_specific_handler);

    //
    // Update the caller's pointer and return success.
    //

    if (ARGUMENT_PRESENT(ModulePointer)) {
        *ModulePointer = Module;
    }

    return TRUE;
}

LIBRARY StringTable2
EXPORTS
    SetCSpecificHandler
    CopyStringArray
    CreateStringTable
    DestroyStringTable
    InitializeStringTableAllocator
    InitializeStringTableAllocatorFromRtlBootstrap
    CreateStringArrayFromDelimitedString
    CreateStringTableFromDelimitedString
    CreateStringTableFromDelimitedEnvironmentVariable
    TestIsPrefixOfStringInTableFunctions
    IsStringInTable
    IsPrefixOfStringInTable_1
    IsPrefixOfStringInTable_2
    IsPrefixOfStringInTable_3
    IsPrefixOfStringInTable_4
    IsPrefixOfStringInTable_5
    IsPrefixOfStringInTable_6
    IsPrefixOfStringInTable_7
    IsPrefixOfStringInTable_8
    IsPrefixOfStringInTable_9
    IsPrefixOfStringInTable_10
    IsPrefixOfStringInTable_11
    IsPrefixOfStringInTable_12
    IsPrefixOfStringInTable_13
    IsPrefixOfStringInTable_14
    IsPrefixOfStringInTable_x64_1
    IsPrefixOfStringInTable_x64_2
    IsPrefixOfStringInTable_x64_3
    IsPrefixOfStringInTable_x64_4
    IsPrefixOfStringInTable_x64_5
    IsPrefixOfCStrInArray
    IntegerDivision_x64_1
    IsPrefixOfStringInTable=IsPrefixOfStringInTable_13

Release Build versus Profile Guided Optimization Build

It’s interesting to see a side-by-side comparison of the optimized release build next to the PGO build. The changes are mostly to do with branching and jump direction. The following diagram was generated via IDA Pro 6.95.

[Figure: IsPrefixOfStringInTable_13, release build versus PGO build (IsPrefixOfStringInTable_13-Release-vs-PGO.png)]

Typedefs

If there’s one thing you can’t argue about with the Pascal-style Cutler Normal Form, it’s that it loves a good typedef. For the sake of completeness, here’s a list of all the explicit or implied typedefs featured in the code on this page.

//
// Standard NT/Windows typedefs (typically living in minwindef.h).
//

typedef void *PVOID;
typedef char CHAR;
typedef CHAR *PCHAR;
typedef short SHORT;
typedef long LONG;
typedef unsigned long ULONG;
typedef ULONG *PULONG;
typedef unsigned short USHORT;
typedef USHORT *PUSHORT;
typedef unsigned char UCHAR;
typedef UCHAR *PUCHAR;
typedef _Null_terminated_ char *PSZ;
typedef const _Null_terminated_ char *PCSZ;

typedef int BOOL;
typedef unsigned char BYTE;
typedef unsigned short WORD;

typedef BYTE BOOLEAN;
typedef BOOLEAN *PBOOLEAN;

//
// The STRING structure used by the NT kernel.  Our STRING_ARRAY structure
// relies on an array of these structures.  We never pass raw 'char *'s
// around, only STRING/PSTRING structs/pointers.
//

typedef struct _STRING {
    USHORT Length;
    USHORT MaximumLength;
    ULONG  Padding;
    PCHAR Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;

//
// Our SIMD register typedefs.
//

typedef __m128i DECLSPEC_ALIGN(16) XMMWORD, *PXMMWORD, **PPXMMWORD;
typedef __m256i DECLSPEC_ALIGN(32) YMMWORD, *PYMMWORD, **PPYMMWORD;
typedef __m512i DECLSPEC_ALIGN(64) ZMMWORD, *PZMMWORD, **PPZMMWORD;
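To make the counted-string convention concrete, here is a minimal sketch built around the STRING layout listed above. It is not taken from the article's source code. The fixed-width stand-ins for USHORT and ULONG are an assumption, made so the layout stays the same on non-Windows compilers (where unsigned long may be 64 bits), and MakeString and IsPrefixOf are hypothetical helpers introduced purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portable stand-ins for the NT typedefs (assumption: fixed-width types
 * replace USHORT/ULONG so the layout matches off Windows as well).
 */
typedef uint16_t USHORT;
typedef uint32_t ULONG;
typedef char CHAR, *PCHAR;

typedef struct _STRING {
    USHORT Length;          /* Bytes in use; no NUL terminator required. */
    USHORT MaximumLength;   /* Capacity of Buffer, in bytes.             */
    ULONG  Padding;         /* Explicit filler; a 64-bit compiler would  */
                            /* otherwise insert 4 hidden bytes here.     */
    PCHAR  Buffer;
} STRING, *PSTRING;
typedef const STRING *PCSTRING;

/* With 8-byte pointers the structure occupies exactly 16 bytes. */
_Static_assert(sizeof(void *) != 8 || sizeof(STRING) == 16,
               "unexpected STRING layout");

/* Hypothetical helper: wrap a NUL-terminated C string in a STRING. */
static STRING MakeString(const char *Text)
{
    STRING String;
    size_t Length = strlen(Text);

    assert(Length < UINT16_MAX);
    String.Length = (USHORT)Length;
    String.MaximumLength = (USHORT)(Length + 1);
    String.Padding = 0;
    String.Buffer = (PCHAR)Text;
    return String;
}

/* Hypothetical helper: nonzero if Prefix is a prefix of Search. */
static int IsPrefixOf(PCSTRING Prefix, PCSTRING Search)
{
    if (Prefix->Length > Search->Length) {
        return 0;
    }
    return memcmp(Prefix->Buffer, Search->Buffer, Prefix->Length) == 0;
}
```

For example, wrapping "$Extend" and "$Extend\$RmMetadata" with MakeString and passing them to IsPrefixOf in that order yields a nonzero result. Swapping the arguments yields zero, because the candidate prefix is then longer than the search string and the length check rejects it before any bytes are compared.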

Colophon

This article was originally written in raw HTML when it was first published back in May 2018.

In October-November 2024, I converted my entire website to use Quarto, and “ported” the raw HTML version of this article to Quarto/Markdown by hand. It took about 16-20 hours, mainly because the article is so f’n long and it took a lot of fiddling to match the aesthetics of the original article, particularly the syntax highlighting.

Quarto & Markdown

The syntax highlighting for C and assembly is provided by c.xml and asm.xml respectively.

These files were copied from the KDE syntax-highlighting repository and then modified to add support for things like SIMD intrinsics, SAL annotations, NT types and definitions, custom types and definitions used in the article, etc.

For the syntax color scheme, I copied the dracula.theme from the Quarto repository into tpn.theme and then just hacked on it until I was mostly content with the results.

Data Visualization

Algorithm Diagrams

The algorithm diagrams at the start of the article were created using my trusty ol’ copy of Visio 2019. The diagrams were saved as SVG. I had to inject a <rect width="100%" height="100%" fill="#ffffff" /> into the start of each SVG file when porting to Quarto/Markdown in order to support dark mode. This set a white background for the diagram, which was otherwise transparent and did not render correctly in dark mode.
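The injection described above was done by hand, but it is mechanical enough to sketch in code. The following is only an illustration, not the tool that was actually used: InjectWhiteBackground is a hypothetical helper, and it assumes the opening svg tag contains no '>' character inside an attribute value. The rect is placed immediately after the opening tag so that it is painted first and therefore sits behind everything else in the diagram.

```c
#include <stddef.h>
#include <string.h>

#define WHITE_RECT \
    "<rect width=\"100%\" height=\"100%\" fill=\"#ffffff\" />"

/*
 * Hypothetical helper: copy Svg into Out with a white background rect
 * inserted right after the opening <svg ...> tag.  Returns 0 on success,
 * or -1 if no opening tag is found or Out is too small.
 */
static int InjectWhiteBackground(const char *Svg, char *Out, size_t OutSize)
{
    const char *Open;
    const char *End;
    size_t HeadLength;
    size_t RectLength;
    size_t TailLength;

    Open = strstr(Svg, "<svg");
    if (Open == NULL) {
        return -1;
    }

    /* Assumption: the first '>' after "<svg" closes the opening tag. */
    End = strchr(Open, '>');
    if (End == NULL) {
        return -1;
    }

    HeadLength = (size_t)(End - Svg) + 1;   /* Everything up to and including '>'. */
    RectLength = strlen(WHITE_RECT);
    TailLength = strlen(End + 1);           /* Remainder of the document.          */

    if (HeadLength + RectLength + TailLength + 1 > OutSize) {
        return -1;
    }

    memcpy(Out, Svg, HeadLength);
    memcpy(Out + HeadLength, WHITE_RECT, RectLength);
    memcpy(Out + HeadLength + RectLength, End + 1, TailLength + 1);
    return 0;
}
```

Reading each file into memory, calling the helper, and writing the result back out would be enough to process a directory of diagrams. A real XML parser would be the safer choice for arbitrary SVG input.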

Benchmarks

All of the resources used to generate this article, including raw .csv data, supporting .svgs, etc., live in the resources directory.

The charts were created in Excel. Each .csv data file was added as a new data source and imported, and then a PivotTable was created to generate the desired bar graph. Each chart was saved as a PDF, which I edited in Inkscape: deleting the surrounding border, cropping the canvas to fit the diagram, and saving as SVG.

After saving as SVG, I edited each chart in Vim to correct the fonts. Looking at my .vimrc file, I believe this is the mapping I used:

"
" SVG helpers.
"

map <leader>v
    \ <ESC>:% s/GillSans;/Calibri,GillSans-Light;/ge<CR>
    \ <ESC>:% s/Monaco/Monaco,Consolas,Menlo,monospace/ge<CR>

I can’t remember if I did any other modifications whilst in Vim (I’m writing this colophon in 2024, but the SVG hacking was done over six years ago).