Saturday 28 February 2015

Python optimisation - Part III: Parallelism and C++ extensions

In this post, I'm continuing the series on Python optimisation.

The task at hand is/was to take a set of URLs as input, download them, and match their content against a set of patterns.

We started with the most naive serial implementation, and gradually enhanced it with asynchronous I/O via Twisted and by compiling the patterns into a single regular expression.

Today, I'd like to look at further performance improvements by taking our example into the land of parallelism and multi-core execution.

Before we kick off - two small adjustments to the test set and environment.
  • The test bed is moved to a 6-core PC; the single-core laptop used for the previous optimisations is left behind.
  • The set of URLs has been increased from four to six to utilise all of the cores in parallel.
To ensure we keep ourselves honest, I've re-run exactly the same code from the previous post on the new platform (and using the expanded URL set) with the following results.

Serial fetch and scan: 7.95 sec
Parallel fetch: 5.66 sec
Regex optimisation: 3.50 sec

Note: You might correctly point out that these numbers depend heavily on network latency. I've done a few runs to derive them, averaged the results, and checked the variance. If not precise, they should be in the right ballpark.

This will be our baseline. Now, with the introductions behind us, let's start covering new ground.



Deferring to Python threads


Whenever the spectre of parallelism appears, many people naturally see the words "thread" and "process" entering their conversations. 

Of course, modern parallelism is far from being restricted to OS constructs, but Python is a classical language in this particular aspect, and we are going to be using those words too.

If you have any mileage with Python and multi-core optimisation, you probably already raise both hands in favour of processes. The creator of Python definitely does. I'm a bit less categorical; for sure, multiple processes use the cores, but there are situations where the lack of convenient shared memory and the cost of IPC start to weigh things down.


However, I'm jumping ahead. The best way to understand why something might not work is to actually try it, so let's give threads a go.

Moreover, we're in luck, since Twisted provides a convenient mechanism called deferToThread. It does exactly what it says on the tin: takes a function, and executes it, as well as its callbacks, in a thread.
from twisted.internet import reactor, defer, threads
from twisted.web import client
import sys, re

def stopReactor(ignore_res):
   reactor.stop()

def printResults(url, matchingPatterns):
   print ': '.join([url, ','.join(matchingPatterns)])

def matchPatterns(pageContent, url, pattern_regex):
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(url, matchingPatterns)

def gotPage(pageContent, url, pattern_regex):
   return threads.deferToThread(matchPatterns, pageContent, url, pattern_regex)

def parallelFetchAndScan(urls, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()


This is not far removed from the last optimisation, with two key differences:

- The gotPage callback makes use of thread deferral. It's as simple as handing the function and its parameters to threads.deferToThread.
- In matchPatterns, we print the results directly as we get them rather than saving them into a shared data collection. Keeping shared writable state across threads would be a recipe for race conditions, so avoiding it looks like a good idea.

Before we look at the performance (yes, a bit of suspense), let's have an inside look into Twisted thread management. 
I've sneaked in the following line of code within matchPatterns (plus an import threading at the top) to confirm that we are within the realm of multi-threading:
print 'I am in thread: %s' % threading.currentThread().name
These are the lines it generated (of course, they would vary from run to run): 

I am in thread: PoolThread-twisted.internet.reactor-3
I am in thread: PoolThread-twisted.internet.reactor-5
I am in thread: PoolThread-twisted.internet.reactor-4

I am in thread: PoolThread-twisted.internet.reactor-1
I am in thread: PoolThread-twisted.internet.reactor-3
I am in thread: PoolThread-twisted.internet.reactor-2

Interesting. Twisted uses a thread pool instead of spawning a new thread each time, and reuses threads as needed (we managed to process one URL in full while another was still downloading).
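Incidentally, the size of that pool can be tuned via the reactor. Here is a minimal sketch; the number 12 is arbitrary and purely illustrative:

from twisted.internet import reactor

# Allow the reactor's thread pool (the one used by deferToThread)
# to grow up to 12 worker threads.
reactor.suggestThreadPoolSize(12)

We won't need this for six URLs, but it becomes handy once the number of concurrent scans grows.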

Ok, this is all fine and good, but what do the numbers look like?

time python parallelFetchThread.py urls.txt words.txt

real    0m3.470s
user    0m1.388s
sys     0m1.278s


 
Looks like all those good people at Intel and in the Twisted community have worked for no gain! We did not get any benefit from threads or extra cores, and I can finally mention the dreaded GIL, also known as the Global Interpreter Lock.
It is the reason our performance has not shifted; we never had any real parallelism, since the lock was held whenever any instance of matchPatterns was running. This is why people usually go for spawning processes rather than threads in Python.
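For completeness, here is a minimal sketch of what the process-based route could look like for the scanning part alone, using the standard multiprocessing module (the helper function and the inputs are illustrative; this is not code we'll be using in this series):

import multiprocessing, re

def scanPage(args):
   # Runs in a separate process: its own interpreter, its own GIL.
   pageContent, pattern_regex = args
   return set(m.group(0) for m in re.finditer(pattern_regex, pageContent))

if __name__ == '__main__':
   pages = ['...already fetched page one...', '...page two...']
   pattern_regex = 'Google|NoSuchPattern'
   pool = multiprocessing.Pool(processes=6)   # one worker per core
   results = pool.map(scanPage, [(page, pattern_regex) for page in pages])

This sidesteps the GIL, but every page's content has to be pickled and shipped to a worker process - exactly the IPC cost mentioned above.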


However, let's assume, for the sake of argument, that we want to stubbornly stick with threads. Maybe we have a large pool of shared configuration data, or maybe a memory-hungry cache that makes multi-processing a challenge.


Is there any way of escaping the GIL while staying on threads? Of course, the question is semi-rhetorical, since if the answer were "no", this post would have been of little value.


Escaping the GIL via C++ extensions


The answer we're going to shoot for is escaping the GIL by reimplementing matchPatterns in C++.

Since compiled code does not touch any of the interpreter state, we can free the global lock, and (hopefully) use the 6 cores as we were meant to.
Of course, C is also an option, but C++ can create more concise and maintainable code, and I haven't seen many resources on the Web that go specifically into Python->C++ extensions.

Here's the master plan:

  • Replace matchPatterns in our Python module with an invocation of a new C++ extension we're going to create. Since we cannot pass compiled regexes across, we do not pre-compile, but rather pass the regex in its string form.
  • Create a wrapper C interface which accepts the page content and the regex string, and writes back the list of found patterns.
  • Re-implement matchPatterns using C++11 with Boost::Regex.
  • Compile and install the new extension using the Python distutils module.
  • Profit!
Let's start with the easier bit, which is the slightly cut down Python code.

from twisted.internet import reactor, defer, threads
from twisted.web import client
import sys
import contentMatchPattern

def stopReactor(ignore_res):
   reactor.stop()

def printResults(url, matchingPatterns):
   print ': '.join([url, ','.join(matchingPatterns)])

def matchPatterns(pageContent, url, pattern_regex):
   matchingPatterns = contentMatchPattern.matchPatterns(pageContent, pattern_regex)
   printResults(url, matchingPatterns)

def gotPage(pageContent, url, pattern_regex):
   return threads.deferToThread(matchPatterns, pageContent, url, pattern_regex)

def parallelFetchAndScan(urls, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = '(' + '|'.join(patterns) + ')'

   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()

contentMatchPattern is the upcoming extension I've talked so much about. Another minor difference is in parallelFetchAndScan, where we no longer compile the regex, and add parentheses around it to accommodate the differences between Boost and Python regex processing.

Now, let's have a look at the wrapper C code:

#include <Python.h>

// TBD - matchPatternsImpl

static PyObject *pythonExt_matchPatterns(PyObject *self, PyObject *args)
 {
 const char *content, *pattern_regex;
 PyArg_ParseTuple(args, "ss", &content, &pattern_regex);
 
 set<string> matchedPatterns = matchPatternsImpl(content, pattern_regex);

 PyObject *list = PyList_New(matchedPatterns.size());
 size_t listPos = 0;
 for (auto pattern : matchedPatterns)
  {
  PyList_SetItem(list, listPos, Py_BuildValue("s", pattern.c_str()));
  ++listPos;
  }
 return list;
 }
  
extern "C"
{
 
static PyMethodDef module_methods[] = {
   { "matchPatterns", (PyCFunction)pythonExt_matchPatterns, METH_VARARGS, NULL },
   { NULL, NULL, 0, NULL }
};   
 
void initcontentMatchPattern(void)
 {
    Py_InitModule3("contentMatchPattern", module_methods,
                   "Match patterns in text");
 }

}

Just don't try compiling this, as we haven't finished yet. A couple of C++11 bits have been carried along, but the main interface is implemented via module_methods, init<extension name>, and the various PyList* and PyArg calls.
You can either use it as a mini-template for your own extensions, or look at the more detailed documentation of Python.h.

Let's augment that with the C++ implementation of the actual algorithm:
#include <Python.h>

#include <iostream>
#include <set>
#include <boost/regex.hpp>
using namespace std;

set<string> matchPatternsImpl(const string &content, const string &pattern_regex)
 {
 set<string> ret;
 boost::regex regex(pattern_regex);
 boost::match_results<std::string::const_iterator> what;
 boost::match_flag_type flags = boost::match_default;

 auto searchIt = content.cbegin();
 while (boost::regex_search(searchIt, content.cend(), what, regex, flags))
  {
  ret.insert(string(what[1].first, what[1].second));
  searchIt = what[1].second;
  }
 return ret;
 }

static PyObject *pythonExt_matchPatterns(PyObject *self, PyObject *args)
 {
 // Parse the arguments while we still hold the GIL
 const char *content, *pattern_regex;
 PyArg_ParseTuple(args, "ss", &content, &pattern_regex);

 set<string> matchedPatterns;
  {
  // Release the GIL only while the pure C++ work runs;
  // no Python C API calls are allowed in this scope
  PySaveThread threadStateGuard;
  matchedPatterns = matchPatternsImpl(content, pattern_regex);
  }

 // Back under the GIL - safe to build Python objects again
 PyObject *list = PyList_New(matchedPatterns.size());
 size_t listPos = 0;
 for (auto pattern : matchedPatterns)
  {
  PyList_SetItem(list, listPos, Py_BuildValue("s", pattern.c_str()));
  ++listPos;
  }
 return list;
 }
  
extern "C"
{
 
static PyMethodDef module_methods[] = {
   { "matchPatterns", (PyCFunction)pythonExt_matchPatterns, METH_VARARGS, NULL },
   { NULL, NULL, 0, NULL }
};   
 
void initcontentMatchPattern(void)
 {
    Py_InitModule3("contentMatchPattern", module_methods,
                   "Match patterns in text");
 }

}

Your mileage may vary here, but even if C++ is not entirely your thing, the general gist should come across: we compile a regex, iterate over the content, insert the matches into a set, and return them to Python as a list.

I could not resist slipping in a couple of C++11 constructs. Here they are for those of you who have a C++ background, but are not fully up to speed with the latest and greatest:
  • auto type deduction, as in auto searchIt = content.cbegin();
  • range-based for loops, as in for (auto pattern : matchedPatterns).

The one new feature I did not use was std::regex. The reason is simple - its performance with my compiler was snail-like in comparison to Boost, so I stuck to the "old guns".

There could have been more, but this is not a primer on C++11, nor do I claim full expertise in it. (Though it might still get more stage presence on this blog.)

Now, if you looked closely at the code, one line should raise suspicions. Where did this PySaveThread come from? 
Yep, guilty as charged - this is my class and I haven't defined it yet. Here it is:
class PySaveThread
 {
 public:
  PySaveThread()
   {
   _threadState = PyEval_SaveThread();
   }
  ~PySaveThread()
   {
   PyEval_RestoreThread(_threadState);
   }
 private:
  PyThreadState *_threadState;
 };

This is the reason we shook the dust off the C++ compiler. The PyEval_SaveThread() and PyEval_RestoreThread() calls release the GIL and lock it again, saving and restoring the thread state in the process.
Without them, the GIL would remain held throughout our extension code.

Note that I'm using the RAII (Resource Acquisition Is Initialization) paradigm. This way, we make sure that PyEval_RestoreThread is called even if the C++ code throws an exception.


Note: You might argue at this point that Python has become secondary, and we might as well write the whole thing in C++. Well, yes and no. Python is a concise language, and even this small example would be inflated in C++. Moreover, consider the task at hand as part of a larger framework which includes less performance-conscious code that can be written quickly in Python.


All that remains now is building and installing. Here is the mini-Python script I'm using for that:

from distutils.core import setup
from distutils.extension import Extension

setup(name='contentMatchPattern',
      ext_modules = [Extension('contentMatchPattern', 
                     sources = ['contentMatchPattern.cpp'], 
                     extra_compile_args = ['-std=c++0x'], 
                     libraries = ['boost_regex'])])
This makes use of the Python distutils package. Note the extra_compile_args and libraries parameters which take care of compiling with C++11 support and linking with Boost::Regex.

Now, we can install the extension by running python extInstall.py install.
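Before timing the full program, a quick sanity check in the interpreter doesn't hurt; the strings below are made up purely for illustration:

>>> import contentMatchPattern
>>> contentMatchPattern.matchPatterns('Google and friends', '(Google|NoSuchPattern)')
['Google']

Phew! Finally, we can run and time this: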


time python parallelFetchThreadCpp.py urls.txt words.txt

real    0m3.000s
user    0m1.871s
sys     0m1.232s

A 14% saving. Of course, it's not zero, but is that all? Well, for big applications, 14% is no laughing matter - it can still mean a large CapEx reduction.
Another angle to point out is the example itself; since my aim was to exemplify both asynchronous I/O optimisation and parallelism, it has a strong networking element. We still do quite a bit of URL fetching, and there is only so much we can save on the processing side.

Just to show what we have gained by escaping the GIL, let's take the magic PySaveThread definition out, recompile, and run again:

real    0m3.900s
user    0m1.872s
sys     0m1.278s

So, did we gain anything by escaping the GIL? Yes, we did. Moreover, the C++ implementation with the GIL in place is slower than the Python one, which might be a bit counter-intuitive. There are two reasons for that:


  1. Python's re module is implemented in highly-optimised C code; it is at least at parity with the Boost::Regex implementation in my environment.
  2. If you recall, we no longer pre-compile the regex once for all URLs, so we ended up compiling it six times, once per URL.
This also shows that the gain from removing the GIL is greater than 14% in a like-for-like comparison: going from 3.9 seconds to 3.0 is roughly a 23% saving.

So, it has been quite a journey. We started with a naive, serial implementation, and gradually introduced asynchronous programming with Twisted, optimised matching with regexes, and added parallelism with C++ extensions.
We went all the way down from 7.95 seconds of execution time to 3 seconds.
Each of these stages could merit its own series of posts/articles/books, but others have already written them. The goal here was to show various techniques working in practice on a single example, and let the combination of your interest and Google do the rest.

This example has served its purpose, but the series is not at an end just yet. In the next post, I'd like to explore further the benefits of removing the GIL on pure processing tasks.

Thursday 19 February 2015

Scrum - Part VII: Demos and retrospectives


Continuing the Agile series, I'd like to turn over to demos and retrospectives.

Let's start with the latter, as it's the most politically charged of all Agile ceremonies. If you have not been to many of those, you might be wondering why. After all, isn't it just a short meeting to look back as a team on the sprint and see what we did well, suggest what we can improve, and then move on?

To explain myself, I need to get a bit philosophical. Scrum, and other "collectivist" team methodologies, stand in sharp contrast to the many individualistic attributes of an IT career. Namely, it's not teams who get annual reviews, promotions, salary raises and bonuses - people do.
Even the most altruistic IT professional cannot abstract themselves from that: they can work within a team as well as anyone else, but at the end of the day they know it's their personal contribution and their skillset that they are being paid for.

So, whenever we come to a meeting and ask the question "Did we do well?", each engineer inevitably, inaudibly, also asks the question: "Have I done well?". If the answers to these questions are different, especially if the first one is a "No" and the second one is a "Yes", then the meeting may backfire.

Of course, in an idealistic world, it's only the team that matters. In football, it should not matter, in theory, that a midfielder has played well when the team has lost 0-3. (In practice, it does matter to a degree, since there are scouts and reviewers out there who would still single him out.)
So you might say that there is no way these two questions can be answered differently.

In the real world, however, sprints partially fail more often due to technical gaps and oversights than due to process issues. Developer X is working on project Y and has greatly underestimated the task's complexity. Engineer A has let defect B into production, causing disruption to deployment. Mistakes happen - this is all part of the job.
Moreover, sometimes A, B, X and Y all happen at the same time, but somewhere in the corner of the room sits person Z, who has done nothing wrong, and who maybe worked 8 AM - 7 PM throughout the sprint, completing both their own and others' tasks and amortising some of the impact. This person needs a reward - not a rant.

In short, the retrospective is great for process adjustments, but many sprints go south because of purely technical or personal issues, and in my opinion an Agile ceremony is not the best place to solve those.

With all of that, I tend to make retrospectives optional, and use them when there is a process improvement, rather than a personal one, to be made. This happens, of course, but not in each and every sprint. In fact, if we had modified our processes in each of the 120-odd sprints in my career, our process would have amounted to not having one.

Now, let's move on to the subject of demos. For me, the one and main purpose of demos is that they need to be interesting. Not comprehensive, not formal, not long or short - just interesting.
The real acceptance and validation happens during the sprint when tasks get completed, and someone who understands those tasks well (QA Engineer, Product Manager, Product Owner etc.) validates them in depth. Demos are more about sharing, getting visibility and celebrating success.

This is all good and fine when there is UI involved, but when a team develops micro-services, extends APIs, deploys load balancers, or optimises a Web service, making a subject interesting becomes more challenging. 
Of course, for the right person, seeing a new port being taken by a service and a sample request on APIv2 triggering a HelloWorld response is exciting. Unfortunately, not all of us were born this way, so keeping interest is not easy.

Another dilemma inevitably pops up towards the end of the sprint's last week. Since demos tend to include the entire department, they are not something we should cut corners on; if we did, they won't be interesting™.
This, in turn, means that we spend effort specifically on the demo: setting up the environment, preparing a slide deck, getting notes in order, etc. By Murphy's law, this comes at a time when customers and Technical Support breathe down our collective neck, while the test/demo environment is doing its best impression of a yo-yo as far as stability goes.

Hence, my ideal approach is:


  • Identify in advance what we want to demo. Think whether the target audience (i.e. other engineers) will care about this.
  • Allocate time (again, in advance) to get the environment ready.
  • Use demos as a visibility / celebration opportunity, not acceptance criteria.
  • Explain the business need behind each feature.
  • Call out future plans.
  • Ask difficult questions when someone else demos - let them feel loved. If a demo goes by without a single question, then it almost certainly was just a tickbox exercise.

Let's take my long-running Multimedia player tasks from previous posts, assume we just completed them, and apply these principles:


Integrate Chinese UI localisation provided by an external agency - Yes, do a demo. Highly visible, and gives a ton of creative chances to show off.


Enhance skin support for more customisation - Only if we have a sample skin implementation visualising the new features (we should, but it might be a separate follow-on task that's not ready yet).

Address 3 visual defects - Do not demo. Too minor, and not very interesting.

Support two more codecs - Yes, do a demo: especially if we can dig out a nice movie clip with one of these codecs. Visible, and can appeal to engineers: we can even prepare a slide deck with features and specifics of these codecs, and explain how we will gain market share because of their support.


Prepare high-level effort estimate for porting to the MacOS platform - Do not demo. Wait until we actually have the MacOS platform supported.



As usual, I've strayed a bit from the classic sprint review recipe. Again, as usual, this is not because recipes are wrong, but because adapting them to various situations and teams is what managers do.

In the next post in this series, I'll talk a bit about reconciling long-term planning with the dynamic nature of sprints.

Sunday 15 February 2015

Python profiling with cProfile - Part II: Asynchronous programming with Twisted

Continuing the previous post on Python profiling, I'm moving on to the optimisation of concurrent and asynchronous software.

The task for today is: find patterns within a set of web pages.
More specifically, given a set {X} of URLs and a set {Y} of patterns, retrieve the contents of {X}, and match a subset of {Y} for each.

For example, if X is:

{http://www.google.com}

and Y is:

{Google, NoSuchPattern}

then the result is

{http://www.google.com: [Google]}

Since I'd like to show off both asynchronous and parallel optimisation, we'll use the top 4 Alexa domains as the list of URLs:

   http://www.google.com
   http://www.facebook.com
   http://www.youtube.com
   http://www.yahoo.com


As for the list of patterns, I've simply taken the first 4000 lines of /usr/share/dict/words.

Let's start with a straightforward serial attempt, and then start optimising.

(Also, just to put the usual disclaimer: error handling, parameter validation etc. are omitted to save space)


Serial fetch and scan

import sys, urllib

def getMatchingPatterns(patterns, text):
   return filter(text.__contains__, patterns)

def serialScan(urls, patterns):
   return zip(urls, [getMatchingPatterns(patterns, urllib.urlopen(url).read()) for url in urls])

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   resultTuple = serialScan(urls, patterns)
   for url, patterns in resultTuple:
      print ': '.join([url, ','.join(patterns)])

The algorithm is as simplistic as it gets - just fetch the URLs serially and scan them one by one.
Before we move on, maybe it's just worth briefly zooming into the use of list comprehensions here:
def getMatchingPatterns(patterns, text):
   return filter(text.__contains__, patterns)

def serialScan(urls, patterns):
   return zip(urls, [getMatchingPatterns(patterns, urllib.urlopen(url).read()) for url in urls])


urllib.urlopen fetches a URL synchronously, we read the content entirely into memory, and then filter the list of patterns with the text's __contains__ method.
This returns the list of patterns contained in the URL's content, and we combine that with the URL itself using zip.
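To make that concrete, here is roughly what the filter call does for the Google example above (toy values typed into a Python 2 interpreter):

>>> patterns = ['Google', 'NoSuchPattern']
>>> text = '<html>Google Search</html>'
>>> filter(text.__contains__, patterns)
['Google']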

Here is the output and timing:

roman@localhost:~/blog/prAv$ time python  serialFetch.py urls.txt words.txt 
http://www.google.com: A,Ag,At,Au,Australia,B,Ba,C,Ca,Chi,Cl,Co,Col,Colo,Cr,Crick,Cu,D
http://www.facebook.com: A,Ac,Ag,Al,Am,Apr,Ar,Art,As,At,Au,Aug,Av,B,Ba,Bi,Bk,Br,Brazil,C,Ca,Cal,Capt,Cd,Che,Chi,China,Chinese,Cl,Co,Com,Cook,Cr,Cs,Cu,D,Day,Dec
http://www.youtube.com: A,Ac,Advent,Ag,Al,Ali,Alice,Alyson,Andersen,Apr,April,Ar,As,At,Atwood,Au,Audi,Aug,August,Av,B,Ba,Ball,Be,Ben,Best,Bi,Bk,Br,Bran,Brit,Britain,Bruce,C,Ca,Cal,Carter,Cf,Che,Chi,Ci,Cl,Co,Col,Collin,Com,Conan,Cr,Creator,Cs,Cu,D,Day,Dec,December
http://www.yahoo.com: A,Abe,Ac,Adidas,Africa,Ag,Al,Am,Amazon,America,American,Ann,Anna,Apple,Aquarius,Ar,Argentina,Aries,Art,Arthur,As,Asia,At,Au,Aug,Australia,Av,B,Ba,Banks,Barney,Bart,Be,Beau,Begin,Belfast,Belgium,Best,Bi,Bill,Bk,Blake,Blu,Bob,Bobby,Boru,Br,Bran,Brazil,Brit,British,Briton,Britons,Brits,Brooke,Brown,Bud,C,Ca,Cal,Camel,Canada,Cancer,Capri,Capricorn,Carney,Case,Cd,Cf,Chan,Chang,Che,Chi,Chile,Chris,Church,Ci,Cl,Cm,Co,Col,Colo,Colombia,Com,Cook,Cr,Cs,Cu,D,Dale,Damian,Day,Dec,December

real    0m7.107s
user    0m3.484s
sys     0m0.040s


There are two parts to timing that are interesting:

(a) We spent about 3.5 seconds in user space. It won't be far-fetched to assume that it was the string matching.
(b) About 3.6 seconds was spent retrieving network resources.

Let's first focus on (b), and look at Twisted.


Parallel fetch, serial scan


Twisted provides quite a few constructs, mostly focused on networking and asynchronous programming, and it's the latter that I'm going to zoom into.

What we'd like to do is fetch all the URLs in parallel, and once they are all retrieved, scan them using the same serial algorithm. This should reduce the 3.6 seconds cost we had in the first, simple attempt.

(Note: Yes, it is more efficient to scan URLs as and when they are retrieved, rather than wait for all of them. We sacrifice a bit of performance for the sake of clarity, but we'll come back to this later on)

The main mechanism for that is the Deferred, which essentially means: "I do not have a result for you right now, but I'll call your function when I'm ready". In other languages, Deferreds are known as futures.
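If you have not met Deferreds before, here is a toy illustration of the callback mechanism, detached from our example (the function and values are mine, purely illustrative):

from twisted.internet import defer

def printLength(result):
   print 'Got %d characters' % len(result)

d = defer.Deferred()             # a promise of a future result
d.addCallback(printLength)       # what to do once the result arrives
d.callback('some page content')  # the result arrives; prints 'Got 17 characters'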

Whatever they are called and whichever language they are used in, they are a cornerstone of asynchronous programming; if we have a resource we can use in parallel, such as the network, I/O, or another CPU core, then asynchronous events are always an option. Here, we use them to fetch multiple URLs at the same time:


from twisted.internet import reactor, defer
from twisted.web import client
import sys

def printResults(resultTuple):
   for url, patterns in resultTuple:
      print ': '.join([url, ','.join(patterns)])

   reactor.stop()

def gotAllPages(results, urls, patterns):
   matchingPatterns = [filter(result[1].__contains__, patterns) for result in results]
   printResults(zip(urls, matchingPatterns))

def parallelFetchAndScan(urls, patterns):
   deferreds = [client.getPage(url) for url in urls]
   defer.DeferredList(deferreds).addCallback(gotAllPages, urls, patterns)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()

This is quite a jump from the previous version, so let's go over it.


Firstly, the flow of the program is not sequential, but rather event-driven. We start a reactor at the very end of the script with reactor.run(); it is responsible for listening for events and triggering callbacks. Once we have gathered all the results, we stop the reactor via reactor.stop(), which ends the program.

In parallelFetchAndScan, we invoke twisted.web.client.getPage, which, as opposed to urllib.urlopen, returns immediately with a Deferred. We could register a callback on each Deferred object; however, to keep things simple for a while, we'd like to be notified only when all results become available, hence we combine them into a DeferredList. The end result allows us to get a single event once all asynchronous operations have completed.
We register our callback on that DeferredList using addCallback, and make sure the callback gets the urls and patterns parameters.
The actual pattern matching is the same as before; this part we're yet to optimise.

So, with all of that, did things improve?

roman@localhost:~/blog/prAv$ time python parallelFetch.py urls.txt words.txt 
real    0m4.996s
user    0m3.988s
sys     0m0.068s


We did save 2 seconds, and now we spend most of the time in pattern matching.

We can try taking it down further by scanning page content as and when it comes rather than wait for all pages to be downloaded:

from twisted.internet import reactor, defer
from twisted.web import client
import sys
results = {}

def printResults(ignore_res):
   for url, patterns in results.iteritems():
      print ': '.join([url, ','.join(patterns)])

   reactor.stop()

def gotPage(pageContent, url, patterns):
   matchingPatterns = filter(pageContent.__contains__, patterns)
   results[url] = matchingPatterns

def parallelFetchAndScan(urls, patterns):
   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, patterns)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(printResults)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()


Here, we use the per-URL gotPage callback to extract matching patterns, and use the DeferredList event only to print out the results and finish the program. I also reluctantly used a global variable for the results; in real life, of course, I would have encapsulated this within an object.

Did this improve things?
roman@localhost:~/blog/prAv$ time python parallelFetchProgressiveScan.py urls.txt words.txt
real    0m4.689s
user    0m3.920s
sys     0m0.116s


Yes, but not a lot. It is time to look at how we optimise the actual matching.
(By the way, I have removed the results from the actual output; you can take my word for it, or run the program, to confirm they are consistent.)


Parallel fetch, optimised serial scan

Time to run the profiler. This time rather than adding profiler calls within the software itself, I'm using the following command line:

python -m cProfile -o profilerResult.dmp parallelFetchProgressiveScan.py urls.txt words.txt

 

There are two reasons for that:

a) The program is more complicated, and the profiler output is prohibitively long. So, we need to do post-processing of the profiler result.

b) There is no single entry point any longer: we have the reactor in addition to the initial page-fetching calls.

Then, I use the following small Python script to extract the top 10 tottime entries - these are the functions we spent the most time in, not counting subroutines (see the previous post).

import pstats
import sys

# Load the profiler dump given on the command line and print the ten
# most expensive functions by internal (tottime) time.
p = pstats.Stats(sys.argv[1])
p.sort_stats('tottime').print_stats(10)


In general, pstats is quite a handy utility to post-process profiling results. Here, the output is:


roman@localhost:~/blog/prAv$ python topTotTime.py profilerResult.txt
Sun Feb 15 12:33:17 2015    profilerResult.txt

         128752 function calls (126060 primitive calls) in 5.449 seconds

   Ordered by: internal time
   List reduced from 1554 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    3.652    0.730    3.652    0.730 {filter}
      429    0.969    0.002    0.969    0.002 {method 'poll' of 'select.epoll' objects}
      724    0.031    0.000    0.031    0.000 {method 'recv' of 'OpenSSL.SSL.Connection' objects}
       87    0.030    0.000    0.030    0.000 {method 'send' of 'OpenSSL.SSL.Connection' objects}
   252/47    0.027    0.000    0.077    0.002 /usr/lib/python2.7/sre_parse.py:379(_parse)
  439/438    0.020    0.000    0.119    0.000 /usr/lib/python2.7/dist-packages/zope/interface/interface.py:222(changed)
     4279    0.017    0.000    0.024    0.000 /usr/lib/python2.7/dist-packages/zope/interface/interface.py:545(__hash__)
    10528    0.016    0.000    0.016    0.000 {method 'append' of 'list' objects}
   530/44    0.016    0.000    0.048    0.001 /usr/lib/python2.7/sre_compile.py:32(_compile)
     8166    0.014    0.000    0.014    0.000 {getattr}



The result is unsurprising - our CPU time is dominated by the filter call in gotPage. It was more or less expected; however, it's always worth confirming suspicions when optimising.
Of course, the algorithm is inefficient, to say the least: our complexity is O(|X| * |Y|), or in simpler terms, we scan for each pattern in each URL's content.
It would be great to do smarter matching using state machines, where we look for patterns such as Twist and Twisted simultaneously as we iterate through the string.
And - this is exactly what regular expressions do! So, the plan is straightforward: construct a gigantic regex or-ing all the patterns, compile it, and match.

Let's try that:
from twisted.internet import reactor, defer
from twisted.web import client
import sys, re
results = {}

def printResults(ignore_res):
   for url, patterns in results.iteritems():
      print ': '.join([url, ','.join(patterns)])

   reactor.stop()

def gotPage(pageContent, url, pattern_regex):
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   results[url] = matchingPatterns

def parallelFetchAndScan(urls, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(printResults)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()

A few noteworthy moments:
  • In parallelFetchAndScan, we sort the patterns in order of decreasing length. This is because Python OR regex matching is non-greedy, which means that a regex of a|ab will match a in the string abc. We would like to match the longest patterns possible (see the short illustration after this list).
  • The following line is where the gigantic OR regex is created, via re.compile('|'.join(patterns)).
  • During matching, in gotPage, we use a set. As opposed to the previous algorithm, we can find the same pattern many times, so duplicates need to be eliminated.
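Here is a quick interpreter illustration of the ordering point (toy patterns, not from the real word list):

>>> import re
>>> re.findall('a|ab', 'abc')
['a']
>>> re.findall('ab|a', 'abc')
['ab']

Sorting by decreasing length ensures the longer alternative gets its chance first.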
Let's time it:

roman@localhost:~/blog/prAv$ time python parallelFetchWithRegex.py urls.txt words.txt 

real    0m2.241s
user    0m1.244s
sys     0m0.120s


That's better! We went down almost by a factor of three compared to the first, simple, attempt.

Note: We did get that at the cost of a slight functional change. Now, for example, the pattern Col is not matched for Google, as it is part of the larger string Colorado. This difference might be justifiable, especially if our goal is to find the most specific pattern.

To recap, we went through several successive optimisation stages - both at the architecture level (asynchronous, event-driven programming), and at the low level (regexes and state machines rather than scanning for each sub-string).

In the next post in this series, I'd like to go into further optimisation of this example, by running and profiling the pattern matching on multiple cores.

Mini-acknowledgement: What I find good about these posts is the opportunity to learn while writing. Though I was aware of all of these resources before starting, I had to drill into many other options that did not make it into the post. It did take a few hours from start to finish, but it was worth the time.

Friday 6 February 2015

Scrum - Part VI: Task lifecycle

After a not-very-brief foray into low-level optimisation, I'm going back to the Agile series, this time to talk about task/story lifecycle within a sprint.

Without becoming too formal, I'd define lifecycle as a way of figuring out where a given task is in the currently running sprint.
Is it being worked on? How likely is it to be completed soon? Is it at a point where it can be shown to others?

Usually Scrum (and not only Scrum) managers define this using workflows. Here's a very typical one I "borrowed" from the Atlassian site:

[Workflow diagram from the Atlassian site]

It defines the most basic states in any workflow; all are self-explanatory.

We could end the post here, if not for one little fact: most people need to go beyond those 4-5 basic states.

For example, for a development task, we usually need a specific state to signify that we're done with writing code and are now in code review.
In some organisations and teams, where there is tight SCM<->Scrum integration, this state might even be required before submitting code.

In some cases, we need to say that the code is in the build (i.e. reviewed and submitted), and is now in testing.
Simply put, "In Progress" needs to be broken up a bit, otherwise the process is too opaque.

However, sometimes people go a bit too far, and start creating states (or swimlanes) such as:
  • In deployment
  • Low-level design
  • UX review
  • In documentation
  • Demo ready
And so on and so forth. Actually, I take that back - it might not be too far.
This all depends on what the team is doing; if we have a DevOps scenario where we are capable of deploying individual code changes, then yes - In Deployment might make sense.
If we are a UI team that is preparing small incremental widget updates, then UX review might be useful as well.

I've usually been trying to follow a couple of simple rules:
  • Start simple. Better to have a few states, and add new ones only when absolutely needed. Too many states are an overhead on the team, and an overhead on me.
    Conversely, if there is a swimlane that we mostly ignore or jump over when moving tasks, it should be a candidate for elimination.
  • Adjust the states to whoever is in the team. If we have both development and QA in the same team, then In Testing is a good state to have. If we're a pure QA automation group, then perhaps having a state of Added to Continuous Integration would help.
My main background is either pure Development teams, or a combination of DevOps/QA, so I'll use those for my next examples:

  • Spike planning task. I plan to get into those later, but in one sentence: a non-coding activity to get a better understanding of future work (e.g. low-level design, UX). It might also be a QA engineer preparing a test strategy.
  • Internal refactoring. There's definitely coding and code review involved, but QA might not be: especially if we already have a stabilisation cycle in the same area.
  • Deploy software.
  • Troubleshoot and analyse customer escalations.
  • Training and getting familiarised with specific technology.
  • Performance testing.
  • Review technical documentation.

And so on and so forth. All of these tasks may have valid acceptance criteria, but the simple sequence of To Do-In progress-Review-Testing-Done does not map to them as-is. It's purely a matter of personal preference and judgement whether using that sequence as the default is good enough, and whether it's acceptable to allow free form interpretation on all the others.

In my case, To Do, In Progress and Done are the cornerstones, and Review just works too well with integrated review systems (e.g. Swarm/Perforce or Crucible with JIRA) to get dropped, but Testing was a mini-dilemma for a while.

Firstly, most tasks simply did not have a testing element to them, as per the examples above. Secondly, even if they did, sometimes the testing is just too substantial, and it is impossible to fit both Dev and QA into the same sprint.

Just to give an example, let's revive candidate sprint tasks from one of the first posts:

  1. Integrate Chinese UI localisation provided by an external agency.
  2. Enhance skin support for more customisation.
  3. Address 3 visual defects.
  4. Support two more codecs.
  5. Prepare high-level effort estimate for porting to the MacOS platform.
 
Even if we could develop the first task in a given sprint, we might not be able to test it. So, I'd argue for a dedicated QA task in this case: maybe even two - one for planning and one for execution. Same might be true for enhanced skin support, especially if Dev completes towards the end of the sprint.
On the other hand, we might be able to fix and test the visual defects in the same iteration.

Taking all of that into account, my experience was that the majority of tasks either did not need, or could not accommodate, a testing swimlane.

However, after a year or two, I still decided to reintroduce it, simply because the risk of missing a testing activity on those small defects is more of a problem than the overhead of dragging the other 80% of tasks across that state.

All this is not so much a guide to how QA planning should be managed, but rather an example of the Start Simple and Tailor States to the Team principles.
To reinforce the latter: teams are not born equal, and if, for example, I were responsible for a team that creates internal components (rather than customer-facing ones), I might omit Testing altogether and embed API auto-tests in the acceptance criteria.

Speaking of those, I'd like to touch upon the standard template of: "As <X>, I'd like to do <Y> in order to achieve <Z>".

Many Agile advocates suggest this as a mandatory definition for all user stories, and thus, we end up with acceptance criteria such as:

"As a QA Engineer, I'd like to define a test plan, so that I'll understand my future work better", or

"As a developer, I would like to refactor class FooBar to Foo and Bar, so that we'll be able to maintain it", or

"As a customer of IniTech, I would like defect #12345 to be fixed, so that I'll be able to enjoy the product"

Like a legal document, each of these is semantically correct, fits the template, and yet adds zero information - it's obvious why we want to write a test plan, refactor, or fix defects, without explicitly naming the actor.

This is why I've been using this template very sparingly, and only in the rare cases where it's not obvious why we are doing a specific task. Even then, the rationale usually came at a much higher level than an individual story.
Of course, it might work much better for more UI-focused teams: indeed, if you look at my own examples for the multimedia player, you'll notice that the template would fit most of them quite well.

Summary


Workflows are not born equal, and it is important to get them right. They are like a mini-checklist: miss a testing state, and you might end up with untested defect fixes. On the other hand, it's easy to fall into the opposite extreme, so deciding what enters acceptance criteria and what becomes an explicit swimlane is a delicate judgement.
This is where a strong understanding of what the team does most of the time comes in, and where the choice of tools matters.
Lastly, acceptance criteria exist for us, so acceptance criteria templates are a guideline. As long as the person doing the task understands what's expected and why, the criteria are good enough.