Sunday 28 June 2015

Cython and regular expressions

Recap

I'm returning yet again to the long suffering text matching example from a couple of months back. The goal is/was to take a text document, a set of patterns, and see which of these patterns could be found. We went through a variety of techniques: Python micro-optimisations, regexes, C++ extensions, Twisted, reactors and improved string matching algorithms. My intent this time around is to try improvements brought out by Cython.

In case you're unfamiliar with Cython, I'll do a two sentences' introduction: its aim is to preserve the ease of development inherent to Python while extending it with static type definitions and easy C/C++ bindings. Considering that dynamic typing comes at a great cost, it's perfectly suited for optimising hotspots in Python code and striking a balance between productivity and performance.

As before, the end result turned out to be surprising (at least to me), but what matters is the journey. Let's embark on it.

Firstly, I'll shake the dust off the pure Python example:
from twisted.internet import reactor, defer, threads
import sys, re

def stopReactor(ignore_res):
   reactor.stop()

def printResults(filename, matchingPatterns):
   print ': '.join([filename, ','.join(matchingPatterns)])

def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(filename, matchingPatterns)

def parallelScan(filenames, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for filename in filenames:
      d = threads.deferToThread(scanFile, filename, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelScan(filenames, patterns)
   reactor.run()

This was just a copy/paste to save you the bother of clicking through to an older post. I still preserved the reactor and multi-threading, but they won't be playing a major role today.

We know that it's scanFile where we spend most of the time, and this is the hotspot that needs optimisation treatment with Cython. Since the matching is done via regexes, we cannot in good faith claim that we're optimising without also switching their library; here, I'm going to use regex.h.

Writing Cython code is not a common skill, and we're not dealing with Hello, World here, so as the narrator, I have two choices: jump to the end result, and go for an autopsy, or build it piece by piece. Taking the mantra that it's the journey that matters, I'll go for the second option.

Cython

Going for the easy bit, here's the updated scanFile function:
def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   matchingPatterns = contentMatchPatternCython.matchPatterns(pageContent, pattern_regex)
   printResults(filename, matchingPatterns)

Nothing glamorous here, we just defer matching to the Cython extension that will be built.
Let's tick the build checkbox too:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(name='contentMatchPatternCython',
      cmdclass={'build_ext': build_ext},
      ext_modules = [Extension('contentMatchPatternCython', 
                     sources = ['contentMatchPatternCython.pyx'])])

Still nothing exciting, as this is just a small bootstrapping script that will build the Cython extension.

Time to take a deep breath, and look at how we write the actual algorithm.
This is the skeleton:

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   return matchingPatterns
Baby steps here. We just return an empty set to satisfy the function signature, however it already differs from vanilla Python code by defining types for pageContent, regex and matchingPatterns.
Why? We'd like to use Cython's memory and performance optimisations by pre-defining the variable types which will get us optimised variable storage and access. In this specific case, there is no requirement to support unicode, hence the bytes type definition.

Now, let's import regex.h functions:
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 
Note that we only import what we need later on, and that we tell Cython to release the GIL when executing the imported functions via the magic nogil keyword.

Ok, so we have the interface and the required C library function imported into PCython. All that's left is the algorithm, and tying it all together:
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   cdef regex_t regex_obj
   cdef regmatch_t regmatch_obj[1]
   cdef int regex_res = 0
   cdef int current_str_pos = 0
   
   regcomp(&regex_obj, regex, REG_EXTENDED)
   regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
   while regex_res == 0:
      matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
      current_str_pos += regmatch_obj[0].rm_eo
      regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

   regfree(&regex_obj)
   return matchingPatterns

The interesting stuff happens at lines 19-24 where the usual compilation and regex execution takes place. The code is not one-to-one to pure Python since the underlying C library does not support the notion of generators, and we have to do string slicing (which is cousin once removed to pointer arithmetic).

Cython is relatively new to me as well, and even though I've written the code above, it caused a certain degree of cognitive dissonance: a bit akin to someone seeing platypus for the first time.


The code looks like C, but semicolons are nowhere to be found, while there are Python indentations and weird brackets around strings. If you've been developing in both languages for a while, seeing them unified in matrimony within a single function is a new experience. 
Anyhow, time to cut the nostalgia, stop coding and start measuring! At this point, my intent was to try this on a single <file, pattern> pair, and move on to doing further Cython tweaks to ensure we can release the GIL on the entire regex matching loop. Reality, however, turned to be different...

Performance

From here, and until the end of the post, the inputs are fixed: we use a single haystack (17KB source of google.html) and a fixed set of needles (first 2000 tokens of /usr/share/dict/words).

Making sure to run a few iterations and take the fastest one, here's the timing using the Python-only example:

$ time python pythonOnly.py shortFilenamesList.txt shortWords.txt

real    0m0.868s
user    0m0.546s
sys     0m0.312s

All right, so how does Cython compare? Drum roll please...

$ time python pythonWithCython.py shortFilenamesList.txt shortWords.txt

real    0m2.396s
user    0m2.012s
sys     0m0.374s

Ouch - 2.7 times slower! Of course, such a difference has to be explained. In my mind, there are two possible sources: different regex library or Cython itself. The former is far more likely, but we should never make fast assumptions with performance tuning. Hence, the natural next step was to write the same in C. 
I'll omit the C code for brevity, as it mirrors the Cython code with the semicolons and pointer arithmetics thrown in. Let's jump to the result:

$ time ./a.exe

real    0m1.547s
user    0m1.513s
sys     0m0.031s

No joy - it's still way slower than Python. Time for

Conclusions

So, we know by now that it's the difference in the regex library that matters; no low-level optimisations in Cython or C can overcome that. In hindsight, there was little basis to suppose that Cython will matter on this specific task, as Python's re implementation is also native C.

However, this was not time wasted, and here's why:

  1. This is yet another tangible example of why blanket statements such as Language X is slow, and Y is fast are untrue. In this case, Python came up faster; in another it may well be slower. 
  2. Thanks to user @nhahtdh on StackOverflow, I gained insight into the difference in performance. regex.h implements a backtracking engine, and does not optimise non-backtracking regexes, which is almost certainly something that Python re does.
  3. Good Cython practice. There is a reasonable template in place to import C libraries into Cython, link with it, build etc. (There are of course plenty of other resources that show the same, but it's always helpful to do and gradually demonstrate it by yourself)

What next?

Coming back to the task at hand: can we truly say that Python's re engine is simply better? My answer is a definite 'no' since engines shine on different patterns. Ours has been a long sequence of OR expressions, and it is just one possibility among many.

So, there are many options to take it further: try other matching patterns, compare with C++ boost::regex, or go for examples that are less dependent on supporting libraries. While Cython has not been of much help with optimising our venerable string matching example, it yet has a future in this blog.

Monday 15 June 2015

Interviews: what not to say or do

My previous couple of posts on hiring were for interviewer's sake: what to ask, what to tell, and in general, how to land that perfect candidate.

However, what about the interviewees? They are the ones that withstand a barrage of questions, and have far more at stake: shouldn't we give them a bit of advice too?

My recent experience runs towards the cushy part, i.e. the interviewer, but after talking to dozens of potential colleagues, I've acquired a mini-collection of behaviours that either put me off a particular candidate or vice versa. This is precisely the collection I'm going to share below.

As for the order: a while back I've been taught that in each monologue it's good to cover the negatives first and finish on positives. So, let's start with


What not to do on interviews



Don't go on a tangent. There's a great British comedy talk show called QI. In that show, one gains points by bringing up quite interesting - hence the name - facts that might or might not be related to the question asked (hint: it's usually the latter).

Well, job interview is exactly the opposite of that. Talking about B while being asked about A is a bad idea: and in case you're unsure why, I'll elaborate. Firstly, it gives me, the interviewer, a clear indication that the guy on the other end knows nothing about A. Secondly, it leads me to a deduction that on this subject they might only know about B. Thirdly, it points to a potential communication problem.
It may well sound trite and obvious, but this is by far the most frequent violation I've encountered.

A while back, I hired senior C++ developers, and one of the more advanced questions was:
When would you not define a virtual destructor? 
I took great pains to underline the word not when asking the question. Nevertheless, many people chose to explain when one would have defined a virtual d'tor, which is of course a far simpler question. A large subset answered the same inverted question again after I gave them another chance.
Needless to say, they would have been in a far better situation by just admitting ignorance.




Don't guess. Speaking of admitting ignorance - guessing an answer is never the path to fame and glory. If you honestly say that you're unfamiliar with the answer, then it perhaps highlights a technical gap, but not a personal one. Guessing at an interview, however, shows that the candidate might do the same when working with others and writing production code, and let's face it - none of us are omniscient.

One might object that you have a chance of guessing correctly, and thus avoid the double whammy. Well, unless we have a software developer and an actor all rolled into a single package, the lack of confidence in the answer will be still audible and noticeable. Note that there are shades of grey here, and it's perfectly valid to reason about the possible answer, e.g. "I'm not sure how database X implements durability, but based on database Y, I'd guess they use a temporary log buffer".

Again, it happened to me a number of times. On one interview, I've asked about the difference between HTTP/1.0 and 1.1, and after a short pause got a seemingly confident and completely incorrect answer about the latter introducing load balancing. In my mind, the interview ended right there, although of course we ran it to completion.

In short, "Don't know" is not a rude or dirty phrase. It is much better than giving the right answer to the wrong question or guessing.



Don't project over confidence. Continuing the previous thought: it's crucial to have a realistic assessment of your skills.

Picture yourself two candidates: both being able to write basic Python code, but not going far into intermediate concepts such as generators, property methods or mixin classes. One claims his Python skills as moderate, while the other puts emphasises Python experience on the CV, and declares full mastery. Which of the candidates would you progress with?

This exact situation happened to me, and obviously the second candidate got filtered out early on. The frustrating thing here is that his skill and salary expectations were good enough for the role, which was fairly junior. He simply overestimated himself, and that means that he would have been doing the same at work, and that is a recipe for interpersonal issues.

Basically, try and grade yourself ahead of the interview on each of the keywords in the CV. Go to sites such as StackOverflow and try answering questions on your favourite keywords. Does your self assessment stack up?

Don't be emotional. This is a very big no-no. It's great to express a gamut of emotions when auditioning for a role. It's also fine to be expressive during important keynotes and motivation speeches. Not so good on job interviews.



From my point of view: if a person is unbalanced even at such an artificial and special setting as an interview, what will happen when they starting working with us? Would they be a living incarnation of jack-in-the-box?
Of course, it might be that they are simply nervous, and in day-to-day interactions, they'll be an oasis of calm and reliability. They just might. But, this interview is all I have to go by, and employing a person erroneously is a costly and emotionally charged mistake.

Don't speed up. It's not the most common fallacy, but it did happen to me on a number of occasions. The interviewee treats the end of my question as a starter gun in a 100 meter dash and proceeds to firing out sentences at a serious rate of knots per minute.
I can (sort of) follow the thought pattern here: "I'm not quite sure what he is looking for, so let's do a scatter shot and see if one of the answers hits the target".




Unfortunately, the main result achieved by this method is mild headache at the other end of the wire (or room). Also, even if one of the answers is correct, it's the wrong ones that are going to matter.

However, let's switch subjects a little bit. Apart from knowing the answers to the technical questions and avoiding the pitfalls above, 

What to do before interviews?


Read up on the company and its products. It always makes a good impression when the candidate knows more than just the title page of the website. It does not take long to Google the company, and become aware about their business model and latest happenings. It also kills two birds with one stone: shows the interviewer that you are a diligent person, and helps with figuring out whether it's the right place for you.




Again, must sound very obvious, and again, not many people do this - especially software engineers. I guess sometimes the mindset is: "I'm going to develop code and get paid, their business model is not my business."
There are more than enough reasons why this is not the right thought pattern, but nevertheless - even the first bird, i.e. showing diligence, should be convincing enough.

At one call, I had a candidate succinctly describe to me all of our recent developments, with highlights from the latest Gartner report thrown in. It gave him a +50 karma points boost for the rest of the chat.

Read up on the interviewer. Fifteen years ago, average online presence was minimal. Unless you were lucky enough to be quizzed by a public figure, all you had to go by ahead of time was the interviewer name.
Today, this excuse does not hold any longer. We've got LinkedIn, blogs, forums - plenty of input to figure out what your counterpart knows and cares about. For example, if you're interviewing with me (yes, we have roles open!) - finding this blog might help.
Moreover, if that person is also your hiring manager, then you get a chance to understand compatibility and yet again, show personal diligence.

Get good questions ready. Don't think that the interview is finished whenever the technical questions end and you get the virtual or physical microphone. It is still going on, and the questions you ask cast a shadow (or light of glory) on you as a person.

Here, I had the whole range: from silence, to vocalised "10 questions to ask on interview" articles.
Silence is practically the worst option, since it shows detachment and lack of care. Asking stock questions, such as "What do you like best about working in your company?" does not do any harm, but does not do much good either. Asking stock questions gets you stock answers, and does not show personal preparation. It is a bit like the worn off "Name-your-three-best-worst-qualities" question that some employers still stubbornly hang on to.

As with the other bulletpoints, it's best to ask questions pertinent to the company, role and what you heard so far. For example,
You do your development in C++; do you go for the latest standards, and how extensively do you use external libraries?
You mentioned that this role has a customer element to it: can you describe typical customer interactions I'm likely to have?
How often do you release software? Can you describe a typical release cycle? 
As you are based on third-part IaaS systems, is there any special testing process that you follow and how do you get development environment similar to the target deployment?
Can you describe the technical career progression in your company?
These questions show that you've been attentive, wish to understand the position in depth, and that you have previous experience with similar processes.

Be ready to back up your CV. This is of course the biggie. Before an important interview, try re-reading your CV, and see if you have any stale skills that need a refresher. Obviously, it does not mean that you stand a chance of getting away with techniques you've never used, and inserted in the resume to gain attention. But, if you genuinely used a specific technology for a few years in the past, then brushing up won't do any harm, and may avoid the embarrassment of stalling up on basic questions. 




I was once interviewing a person who was mostly brought in due to his networking stack knowledge: SSL, HTTP, TCP/IP, DNS - he had them all. The most frustrating thing was not even that he could not answer entry level questions on most of them. It was that it was evident that he knew them at some point, since snippets of right terms were popping up while he was searching for the answers. However, there were also plenty of holes in other topics, and hiring based on a hunch is simply too much a risk: including for the candidate who had active employment at the time. We had to pass, while I'm sure (or hope) that if he did a half day refresher ahead of our meeting, the decision could have been different.

Wrapping up

If you followed me all the way here, you're probably regretting the 240 seconds wasted. So much obvious advice, which is already spread upon countless blogs and articles.
If that's the case, you are correct - few people will find genuinely new information here. The hardest part about changing behaviour is not reading about it, but doing it, and my task at hand was giving a few examples and sharing personal experience to stress why it's so important. 

Landing the right job is a crucial moment in our lives. It's disappointing, yet understandable, not to get it because we simply lack the skills, experience, or even if we had encountered interviewers on a bad day.
But, losing an opportunity due to not presenting yourself in the best light possible is more than disappointing; it means that we lost a chance to invest a few hours to make the next few years better.

Monday 8 June 2015

Hiring: QA screening

This post is a small bonus chapter on top of the phone screening narrative.

So far, I've concentrated on pre-filtering developers, and mentioned test specialists very (too) briefly.

Let's take QA then, and ask again

What qualities are we looking for?

They are not that different from developers, with only coding making way for testing:


  • Can they find bugs?
  • Is there a large discrepancy between what they think they know and what they actually know? 
  • Can we work together, i.e. is there a communication problem?

And going a little bit beyond the land of the utterly obvious, and zooming in upon bug finding:
  • Is there relevant industry background (e.g. hardware appliances, or financial systems - depending on what we do)?
  • Do they understand system interactions?
  • How about intuition and ability to poke holes in software (including corner cases)?
The last one is the absolute must, so let's expand a bit:


Poking holes in software

Like with developers, here I tend to pose a simple scenario, and see how far we get. For the aesthetic sense of symmetry, I can ask to provide test cases for the same function that was brought up in the developer coding interview.

Now, the good thing is that with test cases the sky is the limit. It is possible to raise just the obvious ones, and yet, it is also possible to show full expertise while staying within the confines of the original task.


Let's take an example: prepare test cases for a function that reverses a string. If you like, you can take a small break here, and think what test cases you'd come up with.




The first checkpoint are the requirements. Note that I never mentioned what should be done when reversing empty or NULL strings: should the function return the same value, throw an exception or assert? It was very rarely that I was asked questions about edge cases, but on those rare instances it was a major bonus for the interviewee.

Now, let's move to the basic functional test cases; there's the very obvious ones:


abc→cba
aba→aba

and slightly less so:


a→a
bbb→bbb
cC→Cc

Nevertheless, that was still the boring bit. For more excitement, let's turn to speed complexity:

  • Does the function perform quicker on palindromes?
  • What is the speed on huge strings; how does it scale with size? 
  • Does it perform better on multi-core machines?

Of course performance does not finish here, as we also have memory considerations
  • Does the function swap characters in place of uses a new copy?
  • What is the practical input size before we start swapping on typical hardware/VM?

However, we can go back to the functional tests, as we're not done there yet - we need to consider localization and special character sets:
  • What is the behaviour on UTF-16, UTF-8, UTF-32?
  • How does the function deal with special characters, such as newlines, and NULLs?

If we really want to be persistent about it, we can throw concurrency in, and consider what happens if we modify the string in parallel to calling it in a different thread (though you might argue that we cross into Software-Engineer-in-Test land here).

It's quite possible that we can squeeze out more, but if a candidate reached up to here, then they should already get the top prize: they know non-functional testing, exploratory testing, localisation, and concurrency (spoiler alert: never happened to me).
In any case, the intent is not to declare this exercise as the only way to filter candidates, but to demonstrate that with even the simplest task we can go far if need be. 


Do they understand system interactions?


Ability to recognise risky interactions is one of the main skills for test professionals. Ideally, I'd like to describe a system and ask what we should be worried about at a very high level. The problem, as always, is time - we can't devote more than 10 minutes, and that's not a lot of time to describe a system and discuss test strategy.

For this reason, I usually left this until face-to-face stage, but from time to time, if the previous stage went well, I'd describe a hypothetical system that relates to the person's previous experience. For example, if they have SaaS on their CV, I'd outline a licensing system that contains entitlements from different subscribers, and ask what typical faults they would explore. It's important not to overdo it, as such a Q&A can easily take up the rest of the call, but when it works well, it kills two proverbial birds with one stone: both establish SaaS credentials, and check ability to view systems from above.


CV/skillset discrepancy - still applies

Now, what I mentioned last time about CV/skillset discrepancy still applies. If someone scatters different skills in their CV, they're always fair game irrespective of what we are looking for. For this reason, I've been asking about automation, networking, security, Java and everything in between. As with developers, this works both ways - it can highlight important experience that could complement the team, or it can show CV inflation. Usually, I would not spend too much time on this: just a couple of questions per topic to do broad and shallow coverage.

In fact, everything else in the previous post also applies. We need to figure out if we can work together, so I might go for an open question that does not have a single answer, and see whether we can have a reasonable (but short) chat. To take an example at random: is it fine to concentrate on component testing only with microservices?
Also, this should be a non-confrontational experience for the interviewee, so I'll similarly present the role, the company, and let them ask any number of questions within reason.

Summary

Phone screening for QA candidates is not vastly dissimilar to developers, with the main accent shifted into ability to think of failure modes, and system interactions. At least one practical question during the call is an absolute must, and even the simplest test tasks can go very far.