Saturday 28 March 2015

Substring matching: battle royale

At the end of the recent optimisation series I concluded (amongst other things) that C++ regex matching was faster than Python's, but that conclusion did not sit well with me. It was unclear whether the difference was down to the languages themselves or to the algorithms involved.
Back then I promised to explore it further, and no time is better than the present.

To understand things a bit better, I went all the way back, to before regexes made their appearance. If you recall, the naive approach simply took a set of patterns and searched for them one by one in the text.
The reason I made this journey back is that such a basic algorithm (or lack thereof) would, apparently, serve as the perfect boxing ring for comparing the two languages, without different algorithms playing the role of illegal substances.

You probably noticed the words apparently and would. Yes, I was proven wrong: these "illegal substances" played an even larger role than they did with regexes. However, sometimes it is better to follow a bumpy road than to drive down a straight motorway; you find out a few more things about the car you're driving.

Let's shake the dust off the Python snippet from a month ago and convert it from URLs to files (asynchronous optimisation is not a goal for us today).
import sys

def getMatchingPatterns(patterns, text):
   return filter(text.__contains__, patterns)

def serialScan(filenames, patterns):
   return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   resultTuple = serialScan(filenames, patterns)
   for filename, patterns in resultTuple:
      print ': '.join([filename, ','.join(patterns)])

Now it is time to take a deep breath, since its C++ sibling is not as compact:

#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;

MatchResult serialMatch(const vector<string> &filenames, 
                        const vector<string> &patterns)
   {
   MatchResult res;
   for (auto &filename : filenames)
      {
      ifstream file(filename);
      const string fileContents((istreambuf_iterator<char>(file)),
                                 istreambuf_iterator<char>());
      vector<string> matches;
      std::copy_if(patterns.cbegin(), patterns.cend(), 
                    back_inserter(matches),
            [&fileContents] (const string &pattern) 
            { return fileContents.find(pattern) != string::npos; } );
      res.insert(make_pair(filename, std::move(matches)));
      }
   return res;
   }

int main(int argc, char **argv)
   {
   vector<string> filenames;
   ifstream filenamesListFile(argv[1]);
   std::copy(istream_iterator<string>(filenamesListFile), 
             istream_iterator<string>(),
             back_inserter(filenames));

   vector<string> patterns;
   patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
   ifstream patternsFile(argv[2]);
   std::copy(istream_iterator<string>(patternsFile), 
             istream_iterator<string>(),
             back_inserter(patterns));

   auto matchResult = serialMatch(filenames, patterns);
    
   for (const auto &matchItem : matchResult)
       {
       cout << matchItem.first << ": ";
       for (const auto &matchString : matchItem.second)
            cout << matchString << ",";
       cout << endl;
       }
   }

Well, I did warn that C++11 is going to come back!
Just in case something looks unfamiliar:
  • The using MatchResult = ... line is an alias declaration. I could have also used the old trusty typedef, but using reads a bit more elegantly.
  • The matching itself is a copy_if with a capturing lambda function. For now it uses the straightforward std::string::find to identify substrings. (Not for very long, though.)
  • Each vector of matches is moved, rather than copied, into the eventual output (look for std::move).
  • A hash map also pays a visit (look for unordered_map). To be fair, it existed long before 2011; it's just the syntax that is new.
There are a few other bits and pieces sprinkled about, e.g. auto and inserter iterators, but they are beside the point; there are far better resources available if you wish to know more.
Regardless, it can't escape one's attention that not all languages are born equal. The C++ sibling is three times longer, despite aggressive use of stream shortcuts and lambda functions.

However, surely it must be faster? When embarking on this road, I hoped the answer would be a cautious 'yes'.

Not so:
roman@localhost:~/blog/Python-C++ comparison$ time C++/stlMatch filenamesList.txt twoThousandWords.txt
yahoo.html:


real    0m5.084s
user    0m4.716s
sys     0m0.060s


roman@localhost:~/blog/Python-C++ comparison$ time python Python/serialMatch.py filenamesList.txt twoThousandWords.txt

real    0m4.051s
user    0m3.260s
sys     0m0.040s



The first round of this battle royale is won fair and square by Python. The million-dollar question is why.

For one, Python more often than not delegates to optimised C implementations in its core libraries, and that is undoubtedly the case here. It is also quite possible that handcrafted C is more efficient than the C++ I came up with, even when the latter is encouraged with -O3 and move semantics.

But a 20% difference? That can't be explained by any of the above, so I went to Stack Overflow and looked for experts' wisdom.

Said wisdom was provided within a couple of hours (aren't specialist sites great?). It turns out that Python 2.5 and later use a bespoke string matching algorithm based on Boyer-Moore-Horspool.
libstdc++'s std::string::find, on the other hand, uses a simple character-by-character comparison: it is templatised for arbitrary character types and tries to be as generic as possible. It could specialise the template for 8-bit characters, but it doesn't.

Together with the excellent answer, I also received a piece of advice: try the more efficient string matching algorithms included in Boost. These are the same Boyer-Moore family of algorithms that Python employs by default; in C++, they have to be invoked explicitly.

Good advice ought to be followed, so I amended my C++ example slightly, changing the lambda function in the copy_if to the following variant (just add the word horspool for the third algorithm, and don't forget the Boost includes).
[&fileContents] (const string &pattern) 
               { return boost::algorithm::boyer_moore_search(
                              fileContents.begin(), 
                              fileContents.end(), 
                              pattern.begin(), pattern.end()) 
                          != fileContents.end(); } );

For the baseline/worst case I also used plain strstr on the raw C strings. Time for the results!
Searched string set size   strstr     std::string::find   boyer_moore_search   boyer_moore_horspool_search   Python/string
                    20     0.149s     0.087s              0.082s               0.069s                        0.085s
                   200     1.004s     0.537s              0.402s               0.277s                        0.380s
                  2000     9.502s     4.975s              3.537s               2.305s                        3.420s
                 20000     133.449s   57.084s             34.896s              22.369s                       33.969s

As usual, a few observations:

a) When we deal with small set sizes, C++ has a slight edge. With only a handful of patterns to look for, the algorithms are not as dominant, and the various micro-optimisations matter.

b) Don't use strstr!

c) If string matching is on your critical path, use one of the Boost algorithms. I haven't measured the memory usage of the horspool variant (it trades memory for speed), but it was unnoticeable with the 175 KB pattern set and 300 KB texts.

This is all very educational, but where does it leave us? My original idea of using string matching as an arena in which to compare the languages and nothing but the languages was deeply flawed. Ninety percent of the difference, if not more, was down to algorithms; language specifics played very little role.

However, as so often happens, the road turned out to be far more interesting than the destination: we got some stats on string matching and found good optimisation opportunities in C++, so the time was still well spent.

But, before closing the post on that philosophical note, you might ask - what about regexes? How does the insight into substring search performance help us there?

It does not help a lot; or, you might say, at all. Languages are not inherently fast or slow (unless their name starts with an "R", ends with a "y" and has a couple more letters in the middle). Their comparative performance comes down to which functions you use, which algorithms those functions hide, and how optimised those algorithms are.
This implies that the regex performance delta was down to algorithms, not to C++ being faster as a rule.

Of course, whenever the critical path lies entirely within your own business logic, a generic comparison becomes valid. That could be a good topic for yet another performance post...

3 comments:

  1. Can you use ctypes and call directly into pcre library? This sounds like it's an easy implementation (and of course, releases the GIL) but maybe performs really badly.

    Replies
    1. That comment works better for a different post. Sorry.

    2. No problem. You're right - I should have tried that, and it did not occur to me a couple of posts back. It might not be very straightforward though - is there an easy way to return a non-integral type via ctypes? I'm referring to the "pcre2_code *" from pcre2_compile.
