String comparison with SSE4.2

Is it The Flash? Nope, it’s SIMD instructions! - Date: 01/02/2016

The story tale

Hello ladies and gentlemen, royal readers of my gitbook! No more jokes: I wrote this post in English. Last week, while studying search algorithms to gain some performance in my private projects, I came across “SSE4.2”. When I saw the possibility of using “xmm0” (a 128-bit register), I thought “oh my god! I want to use it! This is awesome!”. After some days studying the concepts around SSE4.2 with my friend João Victorino, aka “Pl4kt0n”, I ended up writing a program. Relax, folks! I don’t have a karate trick at this point!

The benchmark

To explain: I wrote two functions, one that simply calls “strcmp()” and another with my own implementation using SSE4.2 in Assembly. (I switched from AT&T to Intel syntax, because AT&T is very boring and Intel syntax makes it easier to follow the examples in Intel’s manual.) I tested both over an array of words and collected the CPU cycles each one took, so we have something concrete to compare, and I plotted the numbers as a simple bar chart with Gnuplot. You can view the results here, and the gnuplot commands here!

So there is no trick. The generic approach gives you a typical result, so I followed another path to look for an uncommon gain. There is no hidden scheme in this code.
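
A minimal sketch of how such a measurement could look, assuming GCC or Clang on x86. The helper name `count_matches` and the use of `__rdtsc()` are assumptions made for illustration; the repository may count cycles in a different way.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc(): reads the CPU time-stamp counter */

/* Any function with the strcmp() signature can be measured. */
typedef int (*cmp_fn)(const char *, const char *);

/* Hypothetical helper: counts the words equal to `needle` using `cmp`
 * and stores the elapsed time-stamp-counter ticks in *ticks. */
size_t count_matches(cmp_fn cmp, const char *const *words, size_t n,
                     const char *needle, uint64_t *ticks)
{
    size_t matches = 0;
    uint64_t start = __rdtsc();

    for (size_t i = 0; i < n; i++)
        if (cmp(words[i], needle) == 0)
            matches++;

    *ticks = __rdtsc() - start;
    return matches;
}
```

Calling it once with `strcmp` and once with the SSE4.2 routine over the same word list gives two tick counts and two match counts to compare. On modern CPUs the time-stamp counter runs at a fixed reference rate, so the numbers are best read as relative, not as exact core cycles.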

pcmpistri

I use the instruction “pcmpistri” (Packed Compare Implicit Length Strings, Return Index). The “movdqu” (Move Unaligned Double Quadword) instruction is used to load the string data from memory into an XMM register. With these instructions you can do many things with strings. Take a look at the following:

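global strcmp_sse42_64
; by Cooler_  c00f3r[at]gmail[dot]com
; 64 bit
; nasm -f elf64 code.s -o code.o
; int strcmp_sse42_64(const char *, const char *);  // declare in C code
strcmp_sse42_64:
    push        rbp
    mov         rbp, rsp
    mov         rax, rdi            ; rax = s1 (first argument)
    mov         rdx, rsi            ; rdx = s2 (second argument)
    sub         rax, rdx            ; rax = s1 - s2, so [rdx+rax] addresses s1
    sub         rdx, 16             ; pre-decrement, the loop adds 16 first

strloop_64:
    add         rdx, 16             ; one XMM register holds 16 bytes per pass
    movdqu      xmm0, [rdx]         ; load 16 unaligned bytes of s2
    ; imm8 0011000b: unsigned bytes, "equal each", negative polarity,
    ; so ecx receives the index of the first byte that differs
    pcmpistri   xmm0, [rdx+rax], 0011000b
    ja          strloop_64          ; CF=0 and ZF=0: no difference, no terminator yet
    jc          blockmov_64         ; CF=1: a differing byte was found
    xor         rax, rax            ; strings are equal, return 0
    jmp         quit

blockmov_64:
    add         rax, rdx            ; rax = address of the current block of s1
    movzx       rax, byte [rax+rcx] ; zero-extended differing byte of s1
    movzx       rdx, byte [rdx+rcx] ; zero-extended differing byte of s2
    sub         rax, rdx            ; return the difference of the two bytes

quit:
    pop         rbp
    ret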
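
For readers who prefer C to assembly, the same kind of loop can be sketched with the SSE4.2 intrinsics from `<nmmintrin.h>`. This is an illustration, not code from the repository: the name `strcmp_sse42_intrin` is hypothetical, the `target` attribute assumes GCC or Clang, and the sketch assumes that 16 bytes can always be read from the current position of both strings without touching an unmapped page.

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics (_mm_cmpistri and friends) */

/* Control byte 0x18, i.e. the 0011000b immediate used with pcmpistri:
 * unsigned bytes, "equal each", negative polarity, lowest index. */
enum {
    CMP_MODE = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
               _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT
};

/* Hypothetical name; a sketch, not the repository's implementation. */
__attribute__((target("sse4.2")))
int strcmp_sse42_intrin(const char *s1, const char *s2)
{
    /* Walk both strings in blocks of 16 bytes (one XMM register). */
    for (;; s1 += 16, s2 += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)s2);
        __m128i b = _mm_loadu_si128((const __m128i *)s1);

        /* CF set: at least one byte position differs in this block. */
        if (_mm_cmpistrc(a, b, CMP_MODE)) {
            int i = _mm_cmpistri(a, b, CMP_MODE);  /* first differing index */
            return (unsigned char)s1[i] - (unsigned char)s2[i];
        }
        /* ZF set: the block of s1 holds the terminator and nothing differed. */
        if (_mm_cmpistrz(a, b, CMP_MODE))
            return 0;
    }
}
```

The three intrinsics use the same operands and control byte, so an optimizing compiler can usually fold them into a single `pcmpistri`. Thanks to the `target` attribute no extra flag is needed; without it, the file would be compiled with `-msse4.2`.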

Then I hook it up through a function pointer, picking the 32-bit or the 64-bit version at compile time:

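#include <stdint.h>  /* UINTPTR_MAX */

#if UINTPTR_MAX == 0xffffffff
int strcmp_sse42_32(const char *, const char *);  /* implemented in assembly */
static int (*strcmp_sse42)(const char *, const char *) = strcmp_sse42_32;
#elif UINTPTR_MAX == 0xffffffffffffffff
int strcmp_sse42_64(const char *, const char *);  /* implemented in assembly */
static int (*strcmp_sse42)(const char *, const char *) = strcmp_sse42_64;
#else
#error "unsupported architecture: expected 32-bit or 64-bit pointers"
#endif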

Before hooking it up, we need to check whether or not your machine has SSE4.2 support. There are many ways of doing it. However, for the sake of simplicity, let’s go with the following one:

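#include <stdint.h>  /* UINTPTR_MAX */
#include <stdio.h>   /* puts() */
#include <stdlib.h>  /* exit() */

/* Run CPUID with leaf `info`; cpuinfo[0..3] receive EAX, EBX, ECX, EDX. */
void cpu_get(int* cpuinfo, int info)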
{
#if UINTPTR_MAX == 0xffffffff
 __asm__ __volatile__(
  "xchg %%ebx, %%edi;"
  "cpuid;"
  "xchg %%ebx, %%edi;"
  :"=a" (cpuinfo[0]), "=D" (cpuinfo[1]), "=c" (cpuinfo[2]), "=d" (cpuinfo[3])
  :"0" (info)
 );
#elif UINTPTR_MAX == 0xffffffffffffffff
 __asm__ __volatile__(
  "xchg %%rbx, %%rdi;"
  "cpuid;"
  "xchg %%rbx, %%rdi;"
  :"=a" (cpuinfo[0]), "=D" (cpuinfo[1]), "=c" (cpuinfo[2]), "=d" (cpuinfo[3])
  :"0" (info)
 );
#endif
}
 
void test_sse42_enable(void)
{
    int cpuinfo[4];
    int sse42=0;

    cpu_get(cpuinfo,1);  /* CPUID leaf 1 returns the feature flags */

    /* SSE4.2 support is reported in bit 20 of ECX */
    sse42=(cpuinfo[2] >> 20) & 1;

    if(sse42)
        puts("SSE4.2 Test...\n OK SSE 4.2 instruction enable !\n");
    else {
        puts("SSE4.2 not enabled\n your CPU needs the SSE 4.2 instructions to run this program\n");
        exit(EXIT_FAILURE);
    }
}
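
As one of those other ways, GCC and Clang ship a `<cpuid.h>` header whose `__get_cpuid()` wraps the instruction, so no hand-written inline assembly is needed. A minimal sketch, assuming one of those compilers on x86; the function name `has_sse42` is hypothetical:

```c
#include <cpuid.h>  /* __get_cpuid(), provided by GCC and Clang */

/* Returns 1 if the CPU reports SSE4.2 (CPUID leaf 1, ECX bit 20), else 0. */
int has_sse42(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (ecx >> 20) & 1;
}
```

With the same compilers, `__builtin_cpu_supports("sse4.2")` is another option that answers the question in a single call.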

The full source code is in the repository below. Clone it, build it, and run the test:

$ git clone https://github.com/CoolerVoid/cooler_sse42_strcmp
$ make; ./test
SSE4.2 Test...
OK SSE 4.2 instruction enable!

::: strcmp() with SSE42: 2812 cicles
Array size of words is: 245
Benchmark strcmp() with SSE42 matchs is: 84

::: simple strcmp(): 12663 cicles
Array size of words is: 245
Benchmark strcmp() matchs is: 84
$ cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
$ gcc -v 2>&1 | grep "gcc version"
gcc version 4.8.3 20140911 (Red Hat 4.8.3-7) (GCC)
$ uname -a
Linux localhost.localdomain 3.15.10-201.fc20.i686 #1 SMP Wed Aug 27 21:33:30 UTC 2014 i686 i686 i386 GNU/Linux

Other cool stuff

SSE is very common in image processing, and game developers use it too. Take a look at the following:

https://software.intel.com/en-us/articles/using-intel-streaming-simd-extensions-and-intel-integrated-performance-primitives-to-accelerate-algorithms

Do you like CPU features? Take a look at this!

Thank you for reading, cheers!
