Mini language, parsers, recursion, lambda, yield ..... in Python

I needed to generate patterns of sequences of numbers for a project of mine. It just happen it is very small program, but also uses several important concepts in programming, namely writing parsers for Domain specific languages, recursion, lambda functions and Generators. So in this article we will explore those concepts with you using the Python language.
   You can find the source code here : The Program

Recursion
First, what exactly is a recursion ? Glad you asked ! Simply defined :

Definition : Recursion, see Recursion.

or expressed more humorously :

To understand recursion, you must understand recursion.

Pretty clear, right ?
One very popular and easy way to illustrate the concept is the so called Fibonacci numbers, where every number is defined as sum of the previous two numbers in the sequence. In mathematics this is expressed like this :
fib(n) = fib(n-1) + fib(n-2)
Math is descriptive language, so we just declared the formula, but in writing a program there is one important thing we have to provide and that is the condition that will end the recursion, because if we don't do that the execution will continue forever.
Now let's see how this will look in Python :
def fib(n):
  rv = 1 if n <= 2 else fib(n-1) + fib(n-2)
  return rv

#example of Factorial implementation
def fac(n): return 1 if n < 1 else fac(n-1)*n
Here is how we calculate the first 10 Fibonacci numbers using the function we defined :
>> [ fib(x) for x in range(1,10) ]
[1, 1, 2, 3, 5, 8, 13, 21, 34]
Recursion is one of the core concept in Functional programming languages, it provides very succinct and mathematically looking functions. One disadvantage of recursion though is the way it is implemented, it does rely heavily on using a stack to track the progress of those nested calls, in general this is the way how the arguments are passed when function is called.
The solution to this in Functional and other languages is to use the so called Tail-recursion. The idea is simple, whenever function is called recursively, instead of using the stack, a jump to the beginning is done (normally the arguments are redefined as local variables). Of course for this to happen the compiler or the run-time has to be able to rearrange the function in such a way that the call happen at the end of the function execution (Tail Call Optimization).
Python does not do TCO, so if you have to do heavy recursive function, 1000++ or more nested calls it may overflow. Keep that in mind. AFAIK there is no plan to implement TCO in Python.
Enough, let's jump to the next topic ...
Lambda functions
Very often we need to create functions on the fly. In Python we do this with lambda/anonymous functions. Instead of using def keyword we use lambda. Look at the example below. Here we assign on-the-fly created function to a variable and then we use that variable as if it is a function(). In fact it is a function :).
>> x2 = lambda x: x**2
>> x2(5)
25
>> map(lambda x: x*x, [1,2,3,4,5])
[1, 4, 9, 16, 25]
>> map(x2, [1,2,3,4,5])
[1, 4, 9, 16, 25]
In the second example we process a list and again create the function on the fly, but as you see x2() would have worked too.

The Program

Now that we know what recursion and lambda is we can start solving the real problem. As I said in the beginning we needed a simplified mini-expression-language to generate patterns of numbers.
The language
The expression language is pretty simple, but this does not mean that the parsing will be easy. Currently it supports only two basic operations : That should do it. Let me show you couple of examples :
>> 5*4      : repeat 5, 4 times
5,5,5,5
>> (3*2)*3  : repeat 3, 2 times, then repeat the resulting pattern 3 times
3,3,3,3,3,3
>> 1:5      : generate sequence from 1 to 4
1,2,3,4
>> 0:11:2   : sequence from 0 to 10, with step 2
0,2,4,6,8,10
>> (2,3*2,1:5)*2,6
2,3,3,1,2,3,4;2,3,3,1,2,3,4;6
I think it is pretty obvious ...

Here is high level overview of the process that we have to code :
(LoL - is shortcut for List of Lists, you can also encounter in other articles abbreviations as AoH /Array of Hashes/, HoH /Hash of Hashes/ and many other abbreviation-permutations)

You can find the source code of the program here : The Program
The parser
As we mentioned already, the first step is to parse the expression. For this we will use a module called lepl. Here is how to install it, from the console just type :
$ easy_install lepl
Now that you have it installed, let's look how the Parser code look like :
    def node(self,val):
       return val

 1: def build_grammar(self):
 2:   #definition of tokens
 3:   num = Integer() >> int
 4:   comma = Literal(',')
 5:   spaces = Space()[:]
 6:   dcol = Literal(':')
 7:   mult = Literal('*')
 8:   left_bracket = Literal('(')
 9:   right_bracket = Literal(')')
10:
11:   with Separator(~spaces):
12:      repeat = num & mult & num > self.node
13:      #greedier first
14:      ranges = num & dcol & ( (num & dcol & num) | num ) > self.node
15:      ops = repeat | ranges | num #operations
16:      op_list = ops[1:,~comma] #more than one ops separated by comma
17:      group = Delayed() #there will something encased in brackets
18:      group_repeat = group & mult & num > self.node #like a repeat but for block
19:      group_list = (group_repeat | group | op_list | ops)[1:,~comma] #block-list
20:      group += ~left_bracket & group_list & ~right_bracket > self.node #finally define the block
21:      atom = group_repeat | group | ops #group or standalone op
22:      self.grammar = atom[1:,~comma] > self.node
23:
24: def parse(self,txt):
25:    return self.grammar.parse(txt)
As you can see we have two methods one to build the grammar and the other to actually parse() the expression.
The parse() is obvious, lets discus how the grammar works.
What does the Parser do ? As the name imply a Parser consume a text and generate some sort of malleable data-structure. In our case it will generate LoL-tree or you could say tree of lists if we have to be more punctual.
The first step is to define Tokens. Tokens are normally straightforward definitions of chunk of text which act like a sort of atoms for parsing. The basic characteristic of tokens is that they are not too variable and are easy to define. You can think of them as the words in sentences or like word-stem in a word. Another benefit of tokenizing is that it makes the parser cleaner and easier to modify later.
The way to define parse-sub-expressions in LEPL is to assign the definition to a variable, it is very easy as you will see.
So the tokens are defined from Line:3 to Line:9 in the code snippet above. 3 different functions are used in those definitions, here is what they mean : There is two subtle things which may caught your eye in the token definitions. The first one is >> after the Integer() on Line:3.
As the parser goes trough the text, we would want to take some action when specific condition arise. In this case when we are sure that the current part of the string look like an Integer we want to generate this number as output of the process. That is what >> do. There is also > operator, which passes every match one by one to the function that follows, instead of passing all the results as a list in the former case.
So at Line:3 every character that the Integer() function matches will be passed as single argument to the Python function int, which will convert the string to Python integer.
Now let's look at Line:5, Space() is understandable, what is this [:] at the end. LEPL had stolen the Python array syntax to set the match-greediness. In the current case [:] translates to : please match as many space symbols as possible, until you reach non-space. It does not have to be this way always, sometime you may want to match as less as possible [:1] OR exact number of times [n:m] ... and so on.
Btw LEPL has a dedicated Token() function, with which you can do some fancy things, but I'm not using it here.
OK, now that we know what Tokens are and how to use them, let's get to the meat of the matter. First on Line:11 we specify that all the following rules (those within the with block) should disregard spaces. The ~tilde~ sign is a shortcut to DO NOT generate parse output, just make sure that the matched content is there i.e. match but don't generate.
Earlier we said we would support two operations in our mini-language, the repetition defined at Line:12 and ranges defined at Line:14. Repetition is easy to figure out it is Number followed by multiplication sign followed by another number.
The & symbol translates to AND, which mean that we have a successive match only if all three sub-matches succeed.
At the end we send the matched string to the PatternLang.node() method, which in our case does nothing, but just returns a List. LEPL has its own Node and List functions, but you would normally use those if want to create some more complex grammars, in our case just simple Python-list would suffice.
Now the range is a bit trickier we have two type of ranges start:end and start:end:step, so we describe a range as Number followed by double-colon followed either by another number OR by another Number-Colon-Number. One more thing if you look at the comment we put the longer pattern Number-Colon-Number match first, why ? Because it is greedier i.e. if it was the other way around the match would be satisfied at the moment we get just a Number and the parser will never bother to check for the more complex sub-expression with step.
As you would suspect | symbol is shortcut for OR and we use brackets to set a precedence so the the parser know what is the order of evaluation.
At the end again we send the matched string to PatternLang.node() method, to produce us a List.
On Line:15 we define all operators, and again use the same trick, the 'greedier' expressions fist.
Next on Line:16 we specify how we can combine multiple operations. Look how we use the Python-array syntax ops[1:,~comma], please match at least one operator followed by comma, but continue to match as long as there are more ops and don't include the commas in the output.
That was the easy part so far. As we mentioned earlier we would wanna be able to group together sub-expressions and have the ability to repeat them. If it was just that it wouldn't be a big deal, but once we allow that we have to allow for the case of nested expressions.

Hmmmm .... sound familiar ? Exactly we would have to use recursion to do that. But doing this in a Parser is syntactically a little bit more awkward than in general programming language. But don't despair, we can do that.. First if you remember we have to have a condition that will end the recursion so we don't go to infinity. In LEPL that is done by Delayed() function, look at Line:17, which says I will define how group will look like, but not now, a little later.
Our case is even trickier because we don't just have simple recursion, but a (group)*X is not only recursive but could also be repeated and then to add insult to injury it could also be a part of a list.
Line:18 group repetition is group followed by multiplication sign, followed by Number
Line:19 group list is many things separated by comma. If we do not define group-list this way we can only handle the case of ((group)), but now we can also handle a list of stuff ... F.e.: ((group),num,(group),op,op,...) i.e. compound nested expressions. Let me note again we put the greedier expression in front, so that they have a chance being matched.
Line:20 is centerpiece of our small expression language. Here the recursion starts, where a group is defined as something which is surrounded by brackets, but this something could also contain a group itself via group_list, ad infinitum. We use the ~ sign to tell the parser again that we do not want to include those in the parsed result. We need the += operation to tell LEPL that we are 'adding' to the Delayed() definitions of group.
The next two lines are just some fluff, because for simple expression we want to be lazy and skip the top-brackets.
F.e. if we do not have those lines we would not be able to simply say obj.generate('3*4'), but have to do obj.generate('(3*4)').
Pretty neat huh...
Processing ...
That's good we know how to convert the text to LoL structure, so our next task is to figure out how to walk that tree and expand/evaluate the expressions.
Let's first see the whole process visually, step by step how the string-of-commands is converted to LoL, which then is processed and finally flattened at the end :
pattern> 1:4,(3*4,6,(2,7)*2)

parsed LoL>
[
  [1, ':', 4],
  [
    [3, '*', 4],
    6,
    [ [2, 7], '*', 2 ]
  ]
]

after processing >
[
  [1, 2, 3],
  [
    [3, 3, 3, 3],
    6,
    [ [2, 7], [2, 7] ]
  ]
]

final result >
[1, 2, 3, 3, 3, 3, 3, 6, 2, 7, 2, 7]
So we will split the processing in two pieces one that does the walking of the LoL-tree process() and the other that does the evaluation of a core-subexpression calc().
Here is the code :
 1: def calc(self,lst):
 2:    if len(lst) >= 3:
 3:       if lst[1] == '*' : return [ lst[0] ] * lst[2]
 4:       if lst[1] == ':' :
 5:          #extract only the numbers
 6:          args = filter(lambda n: isinstance(n,int), lst)
 7:          return range( *args )
 8:    return lst

 9: def process(self,lst):
10:    for i in range(len(lst)): #go over the list
11:       if isinstance(lst[i],list): #if list process it
12:          lst[i] = self.process(lst[i]) #recursivly
13:    #it is operation, do it
14:    return self.calc(lst)
Let's think ... What is the more natural way to process a tree structure ? ... Of course recursion ;)
That is what process() do, it goes over the items of a List passed as argument one by one (Line:10) and if it encounters another list inside this list, calls itself recursively this time passing this inner-list as argument (Line:12). But there is one more thing, after the result is returned it takes the place of the current element, you may have noticed the assignment which does that.
As you can see we try to first intercept any sub-list and sort-of postpone their evaluation until we reach the leafs of the tree. When we reach a leaf the calc() take care of evaluating it.
Remember we said that we need a condition to end the recursion otherwise it will go forever. Well in our case the condition is if you don't see any sub-list i.e. not list no recursive call.
So what exactly calc() does ? In Line:2 we check if the list have 3 or more arguments. Both operation we support are either 3 or 5 element lists. It is not very clever idea if we want to extend the evaluation in some weird ways in the future, but for now it will suffice.
Implementing repetition is easy, albeit idiosyncratic Python ... [ thing-to-repeat ] * how-many-times.
For generating range of values, we just need to extract the numbers and pass them as argument the Python range() function. Did you see on Line:6, how we used lambda function to filter double-colon symbol :.
So we are done with the processing ... what's next ?
Yield .....
Normally when you write a function in Python you will return value using the return keyword, if you don't specify it Python will assume return None. That is good enough in 99% of the cases, but there is a class of applications that could benefit if you could make a function able to keep its own state.
Look for examples of this in the article I have on Closures and Iterators for some ideas of what does it mean.
Using the yield keyword you can create the so called Generator, which is subset of Iterators. Yield is also often used to provide lazy evaluation in handling infinite loops or structures. In our example below we will use yield capability to return list of values.
How does it work ? The general idea is that yield act as return with the small difference that the next time you call the same method the execution will continue at the place it exited previously i.e. the current function yield temporary control to the caller, until called back again. Also the method will preserve the context of the previous call, so you don't have to worry of passing data around.
Here is the code which will receive LoL-tree and will return back a flat-list of all the elements unpacked correctly :
    #Flatten list of list of list .... LoL
 1: def flatten(lst):
 2:   for el in lst :
 3:      try: #if elem is list, process is recursively
 4:         for e in flatten(el) : yield e
 5:      except TypeError : #if is not list return it
 6:         yield el

 7: def generate(self,pattern):
 8:    return list( flatten( self.process( self.parse(pattern) ) ))
We use several tricks to accomplish this.
At Line:2;5 we process the list one element at a time. Using try ... catch mechanism, we check if the element is list or not.
(We could use isinstance(x,list) as we did earlier, but we would have not learned anything new ;) )
If it is not a list Line:6 we yield back the element. If on the other hand it is a list then we have to unwind it. For the unwinding we will of course have to use recursion again (Line:4), this is our general theme here, right ? If you take a closer look at the code you may catch a peculiarity, compared to our earlier use of recursion. There we were doing similar recursion, but we had just one for loop. Why are we doing it differently ? Think about it !! In the process() method we were replacing the current element with whatever came up unpacked by calc(), we were in a sense returning elements only to feed the replacements. In comparison here we do not do any replacements, we just yield back a stream of elements as we go.
The final step in our class is to have a general method to go trough all the steps from parsing to flattening (Line:7;8). The yield will give us a Generator object, which we can loop over, but in 99% of the cases we just need the final list. That is why we pass the result of flatten() to be listified.
Examples of using the code :
from pattern_lang import PatternLang, flatten

x = PatternLang()
print x.generate('4*3')
print x.generate('0:11:2')
print x.generate('6:1:-1,0*3')
print x.generate('((((0*2),1)*2,2)*2,3)*2')
print x.generate('1*3,(5,1,7,(3*3)*2,9)*2,(0:5)*2')
$ python test.py
[4, 4, 4]
[0, 2, 4, 6, 8, 10]
[6, 5, 4, 3, 2, 0, 0, 0]
[0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 3]
[1, 1, 1, 5, 1, 7, 3, 3, 3, 3, 3, 3, 9, 5, 1, 7, 3, 3, 3, 3, 3, 3, 9, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]


You can find the source code here : The Program