Perl-like Scalar in C++


C++ is nice, Perl is nicer. Let the war begin :))

Programmers have different ideas about the best way to program and which language to use. Different personalities, different preferences, I think. I have always preferred weakly typed languages, even though my introduction to programming was in Pascal. I like to approach a problem top-down: start with a rough, more abstract solution and drill down into the details as I go. The drawback of programming this way is that it is somewhat slower initially, but you get a better view of how to optimize the algorithm of the application conceptually. Most of the time my pre-alpha version is a one-off script, which at the alpha stage becomes an OO skeleton app ... and so on.
Weakly typed languages are normally better suited to this kind of task. Once you have a working prototype you can "fish out" the details. On the other hand, in a language where types and other minute details can be specified, the compiler has more ways to optimize the code for speed and memory. So I was wondering: can I use a strongly typed language the way I use a weakly typed one, so that I get the prototyping speed of a scripting language with the ability to optimize for speed over time if I want to? That is how I decided to implement in C++ some of the idioms I normally use when I write Perl.


Something like this would serve several purposes.


Probably the most obvious obstacle we would want to get rid of is the need to micromanage variable type declarations, casting, conversion and so on. We can do this by implementing a Perl-like Scalar type, i.e. a type that can hold a number, a string or a boolean and that behaves as expected when you use operations like +, -, *, comparison, regex matching, etc. (I was wondering whether adding references to the mix would be beneficial; maybe in the future.) If you are in a hurry you can jump straight to the implementation here: Scalar.h, but don't worry, we will get there step by step.

Let's start with a simple example :

#include "perl_like.h"
using namespace pl;

int main() {

	Scalar s1 = "The quick brown fox...";

	cout << "First it is a string: " << s1 << nl;
	s1 = 15;
	cout << " and then a number : "<< s1 << nl;

	return 0;
}

What we did here is first #include the perl_like.h header file (this file includes all the other files needed for Scalar and other future projects). Then we use namespace pl, so we don't have to prefix everything we use from the Scalar class with pl:: (namely, we can write Scalar instead of pl::Scalar).
Our next task is to implement the main() function, where we put our test code.
  • First things first: we declare a scalar variable s1 and assign a string to it.
  • Then we print it, so we can be sure the assignment worked.
  • Afterwards we assign a number to the same variable.
  • And print again.
    See, we did not declare a new variable of any of the number types, we just assigned the number to the same variable. Let's compile and execute to see whether the magic works :
    > g++ scalar_example.cpp  -o example
    > ./example
    First it is a string: The quick brown fox...
     and then a number : 15
    
    Cool, it seems to be working. Let's add some more stuff to the mix :
    00: #include "perl_like.h"
    01: using namespace pl;
    02:
    03:  int main() {
    04:
    05:  	scalar s1 = "The quick brown fox...";
    06:
    07:  	cout << "First it is a string: " << s1 << nl;
    08:  	s1 = 15;
    09:  	cout << " and then a number : "<< s1 << nl;
    10:
    11:  	pnl;
    12:  	scalar s2(10);
    13:  	cout << "Then we sum two scalars : s1(15) + s2(10) = " << s1 + s2 << nl;
    14:
    15:  	//string concatenation is done using |, instead of +
    16:  	scalar $s3, $s4;
    17:  	$s3 = "conca"; $s4 = "tenated";
    18:  	cout << "Lets concatenate strings : ";
    19:	scalar $s5 = $s3 | $s4;
    20:	cout << $s5 << nl;
    21:
    22:	//regex match
    23:	if ($s5 ^= "tena") {
    24:		cout << "match regular expression, succesfull" << nl;
    25:	}
    26:
    27:	//refer to how it behaves in Perl, if in doubt
    28:	scalar $s6 = $s5 * s1;
    29:	cout << "Multiplying a string * number yelds : " << $s6 << nl;
    30:
    31:	scalar $s7 = 0;
    32:	if ($s7) {}
    33:	else { cout << "perl idiom if($s) : 0 => false" << nl; }
    34:
    35:	$s7 = 1;
    36:	if ($s7) cout << "perl idiom if($s) : not 0 => true" << nl;
    37:
    38:	//you can always dump a variable to see the internal
    39:	cout << "Here is how dumping a variable works $s5.dump()" << nl;
    40:	$s5.dump();
    41:
    42:	return 0;
    43:  }
    
    
    Here is what the output looks like :
    > g++ scalar_example.cpp  -o example
    > ./example
    First it is a string: The quick brown fox...
     and then a number : 15
    
    Then we sum two scalars : s1(15) + s2(10) = 25
    Lets concatenate strings : concatenated
    match regular expression, succesfull
    Multiplying a string * number yelds : 0
    perl idiom if($s) : 0 => false
    perl idiom if($s) : not 0 => true
    Here is how dumping a variable works $s5.dump()
      .type:2, .num:-1.43217e-05, .str:concatenated
    
    
    And now the explanation :

    Line 12: We declare another scalar variable, s2(10), this time using function-like syntax (this calls the class constructor, as we shall see later).
    Line 13: Then we sum the two scalars s1 + s2 and print the result. We expect numerical addition, because both variables are internally numbers at the moment.
    Line 16: What follows is something important to know about Scalar, which at first may seem counter-intuitive but makes perfect sense if you think about it. Because "+" is a numerical operation, we can't use it on strings. Normally in C++ using "+" to concatenate strings makes sense, because you are certain you operate only on strings, but remember that a Scalar could be either. That is why we steal "|" (bitwise OR) for string concatenation.
    (Because of the way we are implementing Scalar, i.e. by overloading almost all the operators, we can't use ".", which is what Perl uses for concatenation: "." cannot be overloaded in C++.) By the way, if it makes sense to concatenate with "+" when both operands are internally strings (instead of returning 0), we can accomplish this with a little tweak. But let's not get ahead of ourselves... So to concatenate two scalars we write something like this : $s3 | $s4.
    The most observant of you probably spotted something weird for C++: when I declare scalar variables I prefix the name with a dollar sign "$", as we do in Perl. This is nothing special. g++ (like most compilers) accepts $ as part of an identifier, although this is a compiler extension rather than standard C++, so we will use it to mark Scalar variables. I think this is good practice anyway, because the sigil makes it easy to tell a Scalar from a normal C++ variable at a glance.
    Line 23: Here is another Perl idiom: matching a string against a regular expression. If you have to do regexes in C/C++ you normally have to do a lot of preparation work. I decided to steal yet another operator for this purpose, namely ^= (the bitwise XOR assignment operator). I couldn't use ~ because it is a unary operator ;(. So we end up with a very Perl-ish looking match : if ($s1 ^= "regex").
    Note: One peculiarity to keep in mind about C++ and writing regular expressions. If you want to use a backslash \ inside your regular expression you have to double it, e.g. to match digits you have to write "\\d+". That is because the compiler treats the backslash as an escape character inside a string literal.
    Line 28: The next thing is not important, I included it just for demonstration purposes. Multiplying (string * number) returns 0 here. If you want to predict the result of a Scalar operation whose two operands have different internal types, put ZERO in place of any string that does not convert to a number and then do the numerical operation.
    Line 32,36 : The following two examples are yet another Perl idiom, namely checking a variable in an if() statement. The expectation is that 0 means false and anything else means true. (I will have more to say about this further down.) Non-Perl developers may be a bit scared of all those shortcut behaviors (probably not the C/C++ programmers, because they have their own batch of idiosyncrasies), but that is the beauty of it: once you get comfortable using them it becomes second nature and improves both the flexibility and the readability of the code. It is just like natural languages. Today's kludge could be tomorrow's rule if it makes sense.
    Line 40: Last but not least, I provided a .dump() method, which is always a good thing to have, especially if you prefer good old non-IDE debugging. It is such a loss that there is no clean way to implement general variable dumping in C++ ;(
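
The backslash note for line 23 can be tried outside of Scalar. What follows is a minimal sketch, assuming a C++11 compiler and std::regex (Scalar.h may well use a different regex engine); matches() is a hypothetical helper invented for this example and is not part of the library:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper, not part of Scalar.h: true if `pattern` matches
// anywhere inside `text`, which is what a Perl-style =~ test does.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Both spellings below hold the same three characters: \d+
const std::string escaped = "\\d+";    // backslash doubled in a normal literal
const std::string raw     = R"(\d+)";  // C++11 raw string literal, no doubling
```

A raw string literal R"(...)" switches off escape processing, so the pattern can be typed exactly as it would be in Perl. Calling matches("abc123", escaped) and matches("abc123", raw) gives the same answer, because both strings are identical once compiled.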

    Implementation details

    Now that we know it works, let's see how it works. If you are not interested in the internals you can skip this section. The ability of Scalar to mimic numbers and strings is achieved via a C++ struct. Here is what the declaration looks like :
    typedef double number;
    enum scalar_subtype { NUMBER = 1, STRING = 2 };
    .....
    struct {
    	number n;
    	string s;
    } value;
    scalar_subtype sub_type;
    

    As you can see, in addition to the data itself we have one more attribute, sub_type, which holds the current internal type of the Scalar.
    Any operation that we are about to perform on the Scalar has to consult the internal representation first. The next core part of the implementation is the operations themselves. Here we have a little problem on our hands... there are 4 major operations in which numbers can participate, i.e. addition, subtraction, multiplication and division, then we have the 4 shortcut operations +=, -=, *=, /=, and on top of that we have 6 comparison operators (==, !=, >, <, >=, <=) and the increment/decrement operations, plus some others... In any of those operations each operand can be one of three types : number, string or scalar.
    So a rough calculation says we have 3^2 = 9 combinations of operand types * ~20 operations ~= 180 methods to handle all the possibilities.
    To lower the number of those possibilities we will be clever in three ways.
    • First we will implement the += operation for Scalar-Scalar. This allows us to "skip" the implementation of "+", because it is almost the same thing, as we will see in a minute.
    • Second we will use templates to "shortcut" the Scalar-String and Scalar-Number operations to Scalar-Scalar.
    • And third we will provide conversions from the C++ intrinsic types to Scalar, so we can make those Scalar shortcuts work (if you know how to convert a specific type to a Scalar, you can always convert first and then apply the Scalar-Scalar operation).

    Before we start with the gritty details let me mention several utility methods that we will use all over the place : the attribute accessors and such.
    We will use two levels of access to the Scalar storage. The first level consists of protected methods that should be used only by the class constructors and by the second level methods. This way we protect ourselves from calling the storage accessors recursively and make it easy to implement Scalar with other ways of storing the data in the future.

    Level 1

    • set_num(), get_num()
    • set_str(), get_str()
    • set_type(), get_type()
    Which, as I said, are used in the constructors and in the next level of accessor methods, i.e. :

    Level 2

    • getters: num(), str()
    • setters: num(x), str(x) - set the internal value and type
    • checkers: is_num(), is_str() - return the current internal type
    • converters: to_num(), to_str() - whatever the internal type is, give us back the representation we want.
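
To make the two levels concrete, here is a cut-down sketch. It is a stand-in built on assumptions, not the code from Scalar.h: the class name MiniScalar is made up, set_num()/set_str() are assumed to update the type tag as well as the value, and to_num() is assumed to return 0 for text that does not parse as a number.

```cpp
#include <cassert>
#include <sstream>
#include <string>

typedef double number;
enum scalar_subtype { NUMBER = 1, STRING = 2 };

// Hypothetical cut-down class; the real one lives in Scalar.h.
class MiniScalar {
protected:
    // Level 1: the only code that touches the raw storage.
    void set_num(number n) { value.n = n; sub_type = NUMBER; }
    void set_str(const std::string& s) { value.s = s; sub_type = STRING; }
    number get_num() const { return value.n; }
    const std::string& get_str() const { return value.s; }
    scalar_subtype get_type() const { return sub_type; }

public:
    MiniScalar() { set_num(0); }
    MiniScalar(number n) { set_num(n); }
    MiniScalar(const std::string& s) { set_str(s); }

    // Level 2: built only on top of level 1.
    number num() const { return get_num(); }
    const std::string& str() const { return get_str(); }
    void num(number n) { set_num(n); }
    void str(const std::string& s) { set_str(s); }
    bool is_num() const { return get_type() == NUMBER; }
    bool is_str() const { return get_type() == STRING; }

    // Converters: return the requested representation whatever is stored.
    number to_num() const {
        if (is_num()) return num();
        std::istringstream in(str());
        number n = 0;
        in >> n;
        return in.fail() ? 0 : n;   // assumption: unparsable text counts as 0
    }
    std::string to_str() const {
        if (is_str()) return str();
        std::ostringstream out;
        out << num();
        return out.str();
    }

private:
    struct { number n; std::string s; } value;
    scalar_subtype sub_type;
};
```

Because nothing outside level 1 reads value or sub_type directly, swapping the struct for another storage scheme would only mean rewriting the five protected methods.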

    There are at least two other possible implementations: as a pure string (you convert back and forth on every use, as needed) or as a C++ union. The new standard (C++11) seems to allow unions of intrinsic types and strings; I didn't implement it this way because I wanted Scalar to be as backward compatible as possible. For the pure string implementation I have a semi-working version, which I may publish with some useful remarks if I have time some day. To go off on yet another tangent: if it were not for the "restrictiveness" of C++, i.e. it disallowing implicit conversion to string via operator string() in certain cases, the current implementation would have been much more compact and concise.

    ... where were we ... yep, now that we know how to access and edit the scalar's internal data, let's implement the class constructors :
    Scalar() { set_num(0); }
    //copy constructor : Scalar $s = $other
    Scalar(const Scalar& c) {
    	if (c.get_type() == NUMBER) set_num(c.get_num());
    	else set_str(c.get_str());
    }
    Scalar(const string& x) { set_str(x); }
    Scalar(const char* x) {
    	string s = x;//convert to string first
    	set_str(s);
    }
    Scalar(const number& n) { set_num(n); }//Scalar $s = 5.5
    Scalar(const int& n) { set_num(n); }//Scalar $s = 0; resolves the 0 ambiguity (null pointer vs number)
    
    ..... starting with the empty (default) constructor and the copy constructor, which is normally called when you pass an argument by value or initialize one Scalar from another... and last but not least the constructors that create a scalar from intrinsic C++ types, which we can also call using the function-like syntax, i.e. scalar $s(55).
    Next we have the conversion (cast) operators, so that the compiler can handle conversions automatically instead of us doing explicit casting.
    operator number() const { return to_num(); }
    operator float() const { return to_num(); }
    operator int() const { return to_num(); }
    operator char() const { return to_num(); }
    operator string() const { return to_str(); }
    //mimic boolean
    operator bool() const { return is_num() ? num() != 0 : str2num(str()).first; }
    
    Another important operator is assignment :
    Scalar& operator = (const Scalar& rhs) { 
    	if (this == &rhs) return *this;//self-assignment no,no..!
    	rhs.is_num() ? num(rhs.num()) : str(rhs.str());
    	return *this;
    }
    
    template<class T>
    Scalar& operator = (const T& rhs) {
    	return *this = Scalar(rhs);
    }
    
    The variable rhs holds the right hand operand of the expression being evaluated.
    The first thing we do is check that this is not a self-assignment. Next, based on the internal type of rhs, we set the value of the lhs Scalar, and finally we return a reference to it (this, as you know, points to the current object).
    The second method in this snippet handles the cases where the right hand operand is not a Scalar. We use a template to catch all other types. What this catch-all operator does is first create a new Scalar on the fly (remember: we already implemented the constructors that create a Scalar from intrinsic types) and then call our Scalar = Scalar assignment. Simple, eh!
    You should read the template declaration as follows : "Match T to the type of the variable that is passed. Then, in the method declaration and body that follow, substitute every occurrence of T with the matched type". This all happens at compile time, so if there is a mismatch the compiler will catch it. As far as I understand, an overload is generated only for the types that are actually used. Template programming is much more powerful, scary and downright ugly than this, but I won't say more at this time.
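
The forwarding trick can be isolated in a few lines. This is only a sketch under assumptions: Tiny is a hypothetical stand-in with public fields, not the real Scalar, and it supports just the four source types listed in its constructors.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Scalar, just enough to show the forwarding.
struct Tiny {
    bool is_number;
    double n;
    std::string s;

    Tiny() : is_number(true), n(0) {}
    Tiny(double x) : is_number(true), n(x) {}
    Tiny(int x) : is_number(true), n(x) {}
    Tiny(const char* x) : is_number(false), n(0), s(x) {}
    Tiny(const std::string& x) : is_number(false), n(0), s(x) {}

    // The one "real" assignment: Tiny = Tiny.
    Tiny& operator=(const Tiny& rhs) {
        if (this == &rhs) return *this;   // self-assignment
        is_number = rhs.is_number;
        n = rhs.n;
        s = rhs.s;
        return *this;
    }

    // Catch-all: T is deduced at compile time, the value is wrapped in a
    // temporary Tiny and handed to the non-template operator above.
    template <class T>
    Tiny& operator=(const T& rhs) {
        return *this = Tiny(rhs);
    }
};
```

When the right hand side is already a Tiny, overload resolution prefers the non-template operator, so the catch-all does not call itself recursively. Assigning a type that has no matching constructor (a std::vector, say) fails at compile time, which is the "mismatch" the compiler catches.
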
    OK. The accessors, constructors, conversion operators, assignment and the utility methods give us everything we need, so we can finally implement the overloaded Scalar operators.
    We will start with the shortcut-addition operator:
    Scalar& operator += (const Scalar& rhs) {
    	if (is_num() && rhs.is_num()) { num(num() + rhs.num()); return *this; }
    	if (is_num() && rhs.is_str()) {
    		number n1 = rhs.to_num(); num( n1 ? num() + n1 : num() + 0);
    		return *this;
    	}
    	if (is_str() && rhs.is_num()) {
    		number n1 = to_num(); num( n1 ? n1 + rhs.num() : 0 + rhs.num());
    		return *this;
    	}
    	if (is_str() && rhs.is_str()) {
    		number n1 = to_num();
    		number n2 = rhs.to_num();
    		//logical XOR : exactly one of the two converted to a non-zero number
    		if (!n1 != !n2) { n1 && !n2 ? num(n1 + 0) : num(0 + n2); }
    			else { num(n1 + n2); }//both non-zero, or both zero
    	}
    	return *this;
    }
    
    
    template<class T>
    Scalar operator + (const T& rhs) {
    	//first make a copy then do shortcut-summation
    	Scalar $rv = *this;
    	$rv += Scalar(rhs);
    	return $rv;
    }
    

    Our first argument is always the one we operate on, the second one is our right hand operand, i.e. this += rhs, where this points to the current object. In the code above we don't have to write this-> explicitly to call lhs.method(), we just call method(). For the right hand operand we have to be explicit, i.e. rhs.method().
    Our first order of business is to check the internal type of both operands and act accordingly.
    As I already mentioned, but it does not hurt to repeat, there are two methods that help us with that : is_num() returns true if the Scalar is (internally) a number, and is_str() returns true if it is a string. We also have two methods to get the current value of the scalar, num() and str(), and of course two to set it, num(X) and str(X). The setters also reset the internal type automatically, based on which one you call.
    The code is almost self-explanatory, but I will go ahead and explain it anyway. In principle we need to cover 4 cases, i.e. number-number, number-string, string-number and string-string.
  • Number-number is easy, just do the operation. See how the summation is wrapped in a call to the setter method num(..operation..).
  • When one of the operands is a string and the other a number, we try to convert the string to a number. If we are successful, i.e. we get a number back, we do the numerical operation. If not, we substitute ZERO for the string operand and do the numerical operation. If you look in the code, only division differs slightly, because of the division by zero case.
  • And finally the most complex case is the one where both operands are strings. This is a microcosm of our original += operation, because once both strings are converted we are presented with the same 4 conditions we mentioned earlier.
    Don't be scared of the (!a != !b), it simply imitates a logical XOR operator (because C++ does not have one). It splits our 4 cases in two: either exactly one operand converted to a non-zero number, or both did or neither did. Then we do something like what we already did for lhs<->rhs, only this time on the converted values. As a final requirement of overloading the += operator we have to return a reference to the result, which in our case is the left hand operand.
    If you look carefully again, all modifications were made to the left operand, i.e. *this.
  • Then see how easy it is to implement the + operator; we put it in a catch-all template again. The implementation is as follows : first instantiate a new Scalar from the current one, then create another Scalar from the right hand operand and finally shortcut-sum them.
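
Since a string that fails to convert is replaced by zero, every one of the four cases boils down to "convert both sides, then add". The sketch below shows that observation in isolation. It is not the code from Scalar.h: Sum is a hypothetical stand-in, and its to_num() uses std::strtod on the assumption that unparsable text should count as 0.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for Scalar with just enough state for addition.
struct Sum {
    bool is_number;
    double n;
    std::string s;

    Sum(double x) : is_number(true), n(x) {}
    Sum(int x) : is_number(true), n(x) {}
    Sum(const char* x) : is_number(false), n(0), s(x) {}

    // Assumption: text that does not start with a number converts to 0.
    double to_num() const {
        return is_number ? n : std::strtod(s.c_str(), NULL);
    }

    // All four cases (num/num, num/str, str/num, str/str) end up here.
    Sum& operator+=(const Sum& rhs) {
        n = to_num() + rhs.to_num();
        is_number = true;   // the result of += is always a number
        s.clear();
        return *this;
    }

    // "+" is a copy followed by "+=".
    template <class T>
    Sum operator+(const T& rhs) const {
        Sum rv = *this;
        rv += Sum(rhs);
        return rv;
    }
};
```

With this stand-in, Sum("15") + "10" gives 25 and Sum("fox") + 7 gives 7. The explicit four-way branching in the real operator is still handy when one case needs special treatment, as division does.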

    And now the next piece, the comparison operators.
    bool operator == (const Scalar& rhs) const {
    	if (is_num() && rhs.is_num()) return num() == rhs.num();
    	if (is_num() && rhs.is_str()) return num() == rhs.to_num();
    	if (is_str() && rhs.is_num()) return to_num() == rhs.num();
    	if (is_str() && rhs.is_str()) return str() == rhs.str();
    	return false;
    }
    
    template<class T>
    bool operator == (const T& rhs) const {
    	return *this == Scalar(rhs);
    }
    
    We can see that the comparison operator looks a lot like the summation operator. Again we have a case for every combination of sub types of the lhs and rhs operands. One difference though is that instead of returning a Scalar we return a boolean value this time.
    Then again we use a template to handle the non-Scalar cases. Inside this method we create a Scalar from the intrinsic number or string type and then let the Scalar-Scalar implementation handle the rest.
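
One consequence of these rules is easy to miss: two strings are compared as text, but as soon as one side is a number both sides are compared numerically. A small sketch of that behavior follows, using a hypothetical stand-in named Cmp and assuming std::strtod for the conversion (the real class uses its own to_num()):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for Scalar with just enough state for ==.
struct Cmp {
    bool is_number;
    double n;
    std::string s;

    Cmp(double x) : is_number(true), n(x) {}
    Cmp(int x) : is_number(true), n(x) {}
    Cmp(const char* x) : is_number(false), n(0), s(x) {}

    // Assumption: text that does not start with a number converts to 0.
    double to_num() const {
        return is_number ? n : std::strtod(s.c_str(), NULL);
    }

    bool operator==(const Cmp& rhs) const {
        // string vs string: compare the text itself
        if (!is_number && !rhs.is_number) return s == rhs.s;
        // a number on either side: compare numerically
        return to_num() == rhs.to_num();
    }

    // Catch-all for non-Cmp right hand sides.
    template <class T>
    bool operator==(const T& rhs) const {
        return *this == Cmp(rhs);
    }
};
```

So Cmp("15") == 15 holds, Cmp(15) == "15.0" holds, but Cmp("15") == "15.0" does not, because in that last case both operands are strings.
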
    I promised earlier to revisit the implementation of the Perl idiom of pretending that the Scalar is a boolean and just doing a simple if(). Here is one example (you can see more in the test script).
    Scalar $s = "0";
    if ($s) { cout << "$s is true" << endl; }
    else { cout << "$s is false" << endl; }
    
    This will print "$s is false", as you may expect. One subtle thing for non-Perl programmers to notice is that the zero we are using is in fact a string. The logic is the following :
    • "0" => false
    • 0 => false
    • "0dsad" => false (conversion of the string to a number yields 0, i.e. false)
    • "sadoa" => true, any regular string is treated as true
    • "4567sdawq" => true, any string convertible to a non-zero number is true
    • 89 => true, any number except zero is treated as true
    • "0E0" => true, special case: zero but true
    The example above is pretty boring, but it lets me elaborate on the implementation details.
    typedef std::pair<bool,number> pBool;//(value in boolean context, numeric value)
    pBool str2num(const string& str) {
    	istringstream is(str);
    	pBool rv(false,0);
    	if (str == "0E0") {//Zero but true
    		rv = std::make_pair(true,0);
    	} else {
    		is >> rv.second;//convert
    		//logical XOR : true if (converted and non-zero) or (not converted at all)
    		if ( !is.fail() != (rv.second == 0) ) rv.first = true;
    	}
    	return rv;
    }
    
    
    The utility function we use does not just convert from string to number, it also returns a boolean that tells the receiving end how the string behaves in boolean context. We do that by returning a pair of values. We can access the two elements of the pair like this : str2num(str).first and str2num(str).second. The other weird thing you can see is the special handling of the string "0E0", which translated means "Zero but true", i.e. if we use the result of the conversion as a number it is interpreted as zero, but in boolean context it is interpreted as true. Of what use could that be ? To give you an example, the Perl database interface DBI uses it when a query was executed successfully but no rows were affected. So if you check the status in boolean context you get true, but in numerical context you get 0 rows.
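
The truth table for strings can be checked with a small stand-alone variant of the conversion function. Note the swap: this sketch uses std::strtod instead of a string stream, an assumption made here to keep it short and predictable, and the name str2num_sketch is made up. The rule it applies is the same as in the table above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>

typedef double number;
typedef std::pair<bool, number> pBool;   // (value in boolean context, numeric value)

// Variant of str2num built on std::strtod (an assumption of this sketch).
pBool str2num_sketch(const std::string& str) {
    if (str == "0E0") return pBool(true, 0);   // zero but true
    const char* begin = str.c_str();
    char* end = 0;
    number value = std::strtod(begin, &end);
    bool converted = (end != begin);           // at least one character was parsed
    // converted text is true only when non-zero; unconverted text is a
    // "regular" string and therefore true
    bool truth = converted ? (value != 0) : true;
    return pBool(truth, value);
}
```

Under this rule an empty string does not convert and so counts as true, whereas Perl itself treats the empty string as false. Whether that difference matters depends on how closely you want to follow Perl.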

    TODO

    * stream operator (friend)
    * increments : ++$i vs $i++
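
For reference, here is the general C++ shape of the two TODO items. This is a sketch only, not something taken from Scalar.h: Counter is a hypothetical numeric-only stand-in, and the real Scalar would also have to decide what incrementing a string means.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical numeric-only stand-in for Scalar.
class Counter {
    double n;
public:
    explicit Counter(double x = 0) : n(x) {}
    double num() const { return n; }

    // ++$i : increment, then return the updated object by reference.
    Counter& operator++() { n += 1; return *this; }

    // $i++ : the unused int parameter marks the postfix form;
    // return a copy of the old value.
    Counter operator++(int) { Counter old = *this; ++*this; return old; }

    // Declared as a friend so that it can read the private storage.
    friend std::ostream& operator<<(std::ostream& os, const Counter& c) {
        return os << c.n;
    }
};
```

The postfix form has to make a copy, which is one reason the prefix form is usually preferred for class types when the old value is not needed.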
    


    Implementation details - musings

    Some of you may wonder whether we could simplify things even more... yes we can, but at the expense of "debug-ability" :)
    How is that ?
    If you look at Scalar.h, you will see that +=, -=, *=, /= are very similar in implementation.
    In fact when I first started implementing Scalar I wrongly began with the addition operation rather than with the shortcut, and as I was doing it I saw that I could describe the operations as micro-units built from macros. If you glance over the code you will see there are three basic micro-operations involved : convert a value and store it in a temp variable, do a logical check, and evaluate the operation with the operands in different order.
    Here is a glimpse of how it looked :
    ....
    //shortcut for string+string and string+number logic
    #define NumNum(op,l,r)	return scalar(l op r)
    #define StrStr(op,l,r)	{ mn_cn12(l,r)	; return scalar(logical_xor(n1,n2) ? n1 op n2 : (n1 && !n2 ? n1 op 0 : 0 op n2 ) ); }
    //Ex: convert lhs to num. If it can be converted then return result of num-operation, else convert rhs to string and...
    #define StrNum(op,l,r)	{ mn_cn1(l); return scalar(n1 ? n1 op r  : 0 op r ); }
    #define NumStr(op,l,r)	{ mn_cn1(r); return scalar(n1 ? l  op n1 : l op 0 ); }
    ......
    

    Of course using macros is always asking for trouble, but anyway it was a good exercise.
    Templates over the operand types are of no use here, because they handle "type-variability". In this case the thing that varies is the operation (+,-,*,/), and as we know the operation is part of the method name, not of the method argument types.
    As far as I know only functional and logical languages and Perl (via AUTOLOAD) allow you to play with the "functor" name. Strongly typed languages like C++ normally discourage such freedom, because it can end badly if not used carefully.
    I have personally witnessed what a monstrous abuse Perl's AUTOLOAD can become in the wrong hands ;). (I'm talking about the God object anti-pattern, look it up on Wikipedia.)
    But in the hands of experienced programmers it can work miracles...
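
One possible alternative to the macros, offered as a sketch rather than as what Scalar.h does: the standard <functional> header wraps each arithmetic operator in a function object (std::plus, std::minus, std::multiplies, std::divides), so the operation itself can be passed to a template. The helper names as_number() and apply_op() are made up for this example, and operands are plain strings to keep it short.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <string>

// Hypothetical helper: text that does not parse as a number counts as 0.
double as_number(const std::string& s) {
    return std::strtod(s.c_str(), NULL);
}

// One function template covers every arithmetic operator.
// Op is a function object type such as std::plus<double>.
template <class Op>
double apply_op(const std::string& lhs, const std::string& rhs, Op op) {
    return op(as_number(lhs), as_number(rhs));
}
```

For instance apply_op("15", "10", std::plus<double>()) adds the two converted values and apply_op("3", "fox", std::multiplies<double>()) multiplies 3 by 0. The conversion logic is written once, and unlike a macro the template is type checked and can be stepped through in a debugger.
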
    One more overcomplication/oversimplification we could do, if we wanted to ;), is to implement all the operations in one method and then call it from everywhere.
    How would we do something like this ?
    Simple : use templates to describe all possible variations of types and call one all-in-one method with the operands plus their types as arguments. This way the receiving end gets all the information required to split the operation handling logically via if-then-elses and do the correct typecasting (that's because we have to pass all the variables as void pointers).
    Here is an example from my old test.cpp :
    const string INT_TYPEID	= typeid(int).name();
    const string NUM_TYPEID	= typeid(number).name();
    const string STR_TYPEID	= typeid(string).name();
    ...
    void test(string op, string n1_type, void* n1, string n2_type, void* n2, number got, number expected, string msg) {
    	string n1_str = n1_type == NUM_TYPEID ? num2str(*(number*) n1) : "'" + *(string*) n1 + "'";
    	string n2_str = n2_type == NUM_TYPEID ? num2str(*(number*) n2) : "'" +  *(string*) n2 + "'";
    	string details = msg +  " : " + n1_str + op + n2_str + " = " + num2str(expected);
    	myis(got,expected,details);
    }
    

    The idea seems simple, but the implementation is ugly.
    Do you see how we pass n1 and n2 as void* and then cast them when needed ? One thing you may find hard to figure out is the *(type*) var expression.
    What it translates to is : first typecast var to (type*), then give me the value the pointer points to. Remember, the compiler has no idea what the type behind a void* pointer is. We have to help it, but we can do this only if we know the type of the variable at run time, which is why we pass it as a separate parameter.
    I thought test_scalar.cpp was a good place to illustrate this approach, rather than implementing it in Scalar.h.

    Test suite

    I've built some ~100 tests for the Scalar implementation, you can find them here: test_scalar.cpp. This way, if you play with Scalar.h, you can easily find any change that breaks something else, and believe me you will need them. One thing to mention : you have to install the libtap++ library. You can find it here : libtap++. Then you compile test_scalar.cpp like this :
    > g++ test_scalar.cpp -ltap++ -o test
    >./test
    .....
    

    There is a little problem with libtap++ (it may have been fixed already), namely that the is() test does not print the specified message in some cases. You can resolve the problem by editing tap++.h :
    142c142,143
    < 			bool ret = ok(2 * fabs(left - right) / (fabs(left) + fabs(right)) < epsilon);
    ---
    > 			bool ret = 2 * fabs(left - right) / (fabs(left) + fabs(right)) < epsilon;
    > 			ok(ret, message);
    
    

    Conclusions

    Phew... finally we got to this point... You can find a zip file which contains all the files we discussed here.
    +> Scalar.zip
    And by the way, use an integer in for loops, not a Scalar ...

    What's next ?

    I tried to isolate all access to the struct{} that holds the real data, in the hope that I could play with other implementations, such as making it a union{} so that a Scalar occupies less space.

    I'm currently experimenting with implementing Hash; again my idea is for it to look and feel as close to the Perl behavior as possible.
    Of course it won't be 100% possible because we are working within the framework of C++, still... I already have a working implementation, here: hash.h. I just have to write an article like the one you just read and create some example scripts.
    Next comes HoH, i.e. hash of hashes. Again I have an early working variant of it... but it is even trickier to make it Perl-like.