Web scrubber and more ...

Have you ever encountered a nice site with a lot of useful content that you would like to have in your personal book library? Yeah ... me too. With some luck and a little bit of coding I will show you how to extract information from HTML pages and then build a PDF file out of them.
A prerequisite is that the site has a uniform structure, which nowadays most sites do.

You can download or review the app from here: scrubber.pl, if you are in a hurry. And here is what the generated PDF looks like: scrubber.pdf

Recently I've been reading many Economics books and articles from different schools of Economics, and I'm preparing some articles on this theme. So it is no coincidence that the site we will use for our experiment is the site of the Austrian school of Economics: mises.org.
The good thing about mises.org is the uniform structure of the site, which is exactly what we need for our little experiment.

Prerequisites ...

Now that we have picked a site to mangle ;), we have to decide on the programming language and tools to use for the task. For this one I decided to use Perl, more specifically the Mojolicious framework, which is a framework similar to Ruby on Rails with some additional goodies that we will repurpose for our needs.
Yep, we won't use the framework per se; we don't want to build a web site, just browse an already built site and extract information from it. Mojolicious comes with some very nice additional modules as part of the whole package, specifically tools for fetching and processing HTML pages.
Mind you, you could use the LWP library to achieve similar results, but we won't.
Installing Mojolicious is a breeze and takes 20-30 seconds. Here is how (for more details look at the original site):
$ curl -L https://cpanmin.us | perl - -M https://cpan.metacpan.org -n Mojolicious
One important thing to do before we start: we have to manually review the HTML code of the articles we are interested in and find the element(s) that contain the content we want to extract. Remember, in this simplified scenario we expect that the articles will have a similar structure.
Variable HTML structure would not be a problem either, but it would complicate our task. I will leave that as an exercise for you once you understand how to handle a homogeneous HTML structure; it won't be hard to extrapolate once you know the basics.
In our current example the information is contained inside a <div> tag with class name "body-content". Keep this in mind when we start discussing the script operation.
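As a quick illustration (assuming Mojolicious is installed), here is how Mojo::DOM picks such an element out of an inline HTML snippet; the snippet itself is made up, but the selector is the same one the script uses:

```perl
use strict;
use warnings;
use Mojo::DOM;

#a tiny stand-in for a real article page
my $html = q{
  <html><body>
    <div class="menu">Navigation we do not care about</div>
    <div class="body-content"><p>The article text we want.</p></div>
  </body></html>
};

my $dom = Mojo::DOM->new($html);
#'div.body-content' is the same CSS selector the script will use
my $content = $dom->at('div.body-content')->all_text;
print $content, "\n";   #only the article text, no menus
```

Note that at() returns the first matching element, which is all we need when the page has a single content container.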


Apart from the basic description of this specific program, I also want to share with you in this article the methodology I normally use when writing applications. It may not be the way you write apps, but I think it won't hurt to have one more idea in mind when you approach a problem.

Without further ado, here are the stages it normally takes me to write a more involved app. After I get a general idea of what I want to do and approximately know the final goal, I write a quick ad-hoc script (this is stage one), which does not have any subroutines/procedures or classes and such, just straightforward linear code. No bells and whistles. I do this to confirm my technical requirements and to figure out the tools and libraries I will use; e.g. in the current scenario I could have gone with LWP and some HTML-parser lib, or used complicated regular expressions to extract the content.

The good thing is that from my past experience I knew that Mojolicious has a UserAgent class and a DOM parser (which I had never used before), so I decided to give it a go. In this early stage it is OK to pick and choose tools, because you will not need to do a huge rewrite if something goes astray. This is the time to experiment. That is the basic idea: you try something, and if it does not work you throw it away ... then you try something else. Perl, and scripting languages in general, are best suited for this type of approach.

Now that we have decided which libraries to use and approximately know the direction we are going, it is time for ... reorganizing.
The second stage is the place to "divide and conquer", i.e. whenever you have an accumulation of code that does logically similar things, separate it into subroutines. The program we are currently discussing is in this stage. You can see how the main body of code is mostly calls to subroutines, which are logically separate routines.

The benefit is that having the code "subroutined" allows you to make changes that won't impact the rest of the code, which should be logically different, right?! Logical task separation means more granular control and a quicker, easier reaction to changes. It also provides better readability. In this stage the code is still fluid, but it starts to acquire structure that, we hope, will better match the requirements and the way you think.
As a side note, most functional languages normally "force" you to write in stage-2 mode by design, so doing this will help you build a habit for working with them.

Stage three is where we componentize our code. We have gained some experience we did not have beforehand, so we can manipulate pieces of code easily without breaking the whole app.
If we are using an object-oriented language, this is the stage when we cross from procedural code to object-oriented.
If we are using a functional language, this is the time to separate the app into packages and/or "interfaces".
What is the goal of programming this way? If you are writing a more complicated app, you need some "breathing space" to experiment before you commit to an approach, libraries, and structure.
Now that we are clear on the theory, let's continue with our current program.

The algorithm

So how are we going to implement the "web-scrubbing"? Here are the steps:
1. Fetch every page from the list of URLs.
2. Extract the content element from the fetched HTML.
3. Save the extracted content into numbered HTML files.
4. Collate the HTML files into a single PDF.
Seems pretty easy, doesn't it ... and it is, as we will see.

The program

I'm printing the whole program below, because it is not that big. You can also download it from here: scrubber.pl. The app is pretty well commented, so I will mostly discuss the overall structure and some nuances in a general way.

The first thing that catches our attention is the two modules we import, which provide us with the ability to fetch web pages (Mojo::UserAgent) and parse HTML (Mojo::DOM) into structures that are more natural to work with than pure text (which is simply a stream of characters). You can think of Mojo::UserAgent as a sort of "browser".

Next we declare some constants and variables. The important constant CONTENT_ELEMENT is the one we mentioned in the beginning of the article. It uses CSS-selector notation to specify that we will be looking for a <div> HTML tag with class name body-content; this is where the article's main content is. We are interested in the actual information, rather than site menus, navigation, and such. (If you have not worked with CSS selectors, take a look at this tutorial.)
Next we have the list of URLs we want to extract from. Because we may want to structure the resulting PDF file into sections or chapters, I decided that any entry that starts with a dash "-" will be treated differently ... i.e. we will create an "empty" page which contains only the specified text in an enlarged font. So whenever you need a new chapter, start the entry with a dash and write the section name.
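To make the convention concrete, here is a minimal core-Perl sketch of that classification; the entries are made up, but the dash test is the same one the main LOOP performs:

```perl
use strict;
use warnings;

#made-up entries: a dash-prefixed section title and a normal article URL
my @urls = (
    '-Introduction',
    'https://example.org/some-article',
);

#classify every entry the same way the main LOOP does
my @kinds = map { /^-/ ? 'section page' : 'article page' } @urls;

print "$kinds[$_] : $urls[$_]\n" for 0 .. $#urls;
```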

Next we have a list of very simple subroutines that handle very simple tasks (remember our stage 2). Sometimes it may seem pointless to pack such a small piece of code into a separate subroutine, but it is a good habit to get into. Why is that? Take for example the get_filename() sub; it seems like total nonsense. Why didn't we just fold it into the main LOOP (discussed in a sec) instead of spending time separating it? Glad you asked ... what if building the file name were a more elaborate procedure, and the name were not part of the URL? If it were part of the main LOOP, we would have to write the whole logic there, which would have made reading the code harder. Because we have it as a separate sub(), we would at most need to pass some additional arguments and hide the logic inside, away from the main code. This approach gives us separation of intent.
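One detail worth pointing out about the calling style: every sub in the script starts with `my %a = @_;`, which turns the `key => value` pairs at the call site into a hash of named arguments. A small self-contained sketch of the idiom, using the same filename-gleaning regex as the script (the URL here is made up):

```perl
use strict;
use warnings;

#same named-argument idiom the script's subs use
sub get_filename {
    my %a = @_;            #key => value pairs become a hash
    #glean the file name from the last path segment of the url
    (my $fname) = $a{url} =~ m!.+/(.+)!;
    return $fname;
}

my $name = get_filename url => 'https://example.org/articles/my-article';
print "$name\n";   # my-article
```

The named arguments make the call sites self-documenting, which is part of why the main LOOP reads so cleanly.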

The part that seems more complex is the fetch_and_extract() subroutine.
What we do here is use the UserAgent object we created in the beginning to fetch the page specified by the URL (passed as an argument). Three things are of importance in this chained call. First, the max_redirects() call ensures that if we can't access the page directly, but the web server redirects us, we can still fetch it.
Second, the res() method gives us back the response object. Third, because the response contains the full HTML page plus the HTTP headers and such, we use the ->body() call to get back just the HTML.

Then we create a DOM object, which accepts as argument the result we got back from the get() chain. Once we have the character stream parsed into DOM objects, we can look up our element with a CSS-selector expression, which is what we do with find(CONTENT_ELEMENT). Now that we have the content of the <div> tag, there is nothing more to do but pass it back. Notice that we return a reference to the string instead of the string itself. The reason is that if the string is big, returning it by value would cause unnecessary copying of memory around and would also be slower.
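The reference trick deserves a tiny standalone example: returning \$str hands back the address of the string, so no copy of a potentially large buffer is made (the size here is just illustrative):

```perl
use strict;
use warnings;

sub make_big_string {
    my $str = 'x' x 1_000_000;   #pretend this is a large HTML page
    return \$str;                #return a reference, not a copy
}

my $ref = make_big_string();
#dereference with ${...} when you need the actual content
print length(${$ref}), "\n";     # 1000000
```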

Another thing to notice, from a methodological point of view, is that this subroutine does two logical things: fetching and parsing. Even the name reflects the two actions we implement inside. I did this on purpose :), to show you that the rules are not always clear-cut. If it makes sense to combine similar code into somewhat bigger chunks, do it. Splitting the sub() would be OK too.
My golden rule is not to have subs/methods bigger than 1-2 screen pages. Multi-page subs/methods are a scourge in programming and should be avoided at all cost. KISS: keep it small and simple. If you are worried about the penalty of having too many calls, readable source code is in most cases a better trade-off than faster, ugly code.
If I rework this code into OO at stage three, I will most probably split it into two or even three chunks: fetching, parsing, and a third method which calls those two in sequence. This logical separation helps you in another way too: if you later want to add pre/post-processing around the fetching/parsing, you have "clean entry points" which do not disturb the overall flow of the code.
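Sketched in procedural form (the sub bodies below are stubs and the names are hypothetical; only the structure matters), that stage-three split might look like:

```perl
use strict;
use warnings;

#hypothetical stage-3 split: each step in its own sub, plus a thin wrapper
sub fetch_page { my %a = @_; return "<html>$a{url}</html>" }          #stub
sub parse_page { my %a = @_; my $s = "parsed: $a{html}"; return \$s } #stub

sub fetch_and_extract {
    my %a = @_;
    #a pre-processing hook would slot in here without touching the steps
    my $html = fetch_page(url => $a{url});
    #a post-processing hook would slot in here
    return parse_page(html => $html);
}

print ${ fetch_and_extract(url => 'https://example.org/a') }, "\n";
```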

Finally, the LOOP. Look at it; it is almost bare-bones subroutine calls. If you pick suitable function/method names, it will read almost like plain English sentences. Compare that to a ginormous linear program, where you have to read every portion of code and understand it, so you can then separate it in your head into logical steps, scrolling back and forth over many pages... Instead, the sub names stand there to remind you of the logic of every specific step.
FOR every URL
	IF the URL starts with dash THEN format the header, the content of the page is empty string
	ELSE given the URL : FETCH the page, EXTRACT the HTML, CONSTRUCT the file name and FORMAT the header

	NOW SAVE the result into a file

Simple English. As a final step, it would be a good idea to put the loop in its own subroutine.

That is everything we required from our code. Now we have all the articles and section pages numbered, so we can preserve the order in which they get collated into the PDF file. How do we do that? There are many tools; the one I picked for the task is called wkhtmltopdf. It has many dependencies for compilation, but the project provides a download page with precompiled versions you can use. Once you have the tool installed, just run this command inside the HTML dir:
> wkhtmltopdf *html scrubber.pdf
If you want to see what the generated PDF looks like, download it from here: scrubber.pdf
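The zero-padded index that full_path() prepends is what makes this single command work: the shell glob expands in lexical order, and padded names like 00, 01, ... 10, 11 sort lexically the same way they sort numerically, while unpadded numbers would not. A quick core-Perl check:

```perl
use strict;
use warnings;

#file-name prefixes padded the way full_path() does it
my @padded = map { sprintf "%02d", $_ } (0, 1, 2, 10, 11);
#lexical sort (what a shell glob uses) keeps them in numeric order
my $glob_order = join ',', sort @padded;
print "$glob_order\n";   # 00,01,02,10,11

#without padding, 10 and 11 would jump ahead of 2
my $bad_order = join ',', sort (0, 1, 2, 10, 11);
print "$bad_order\n";    # 0,1,10,11,2
```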
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::DOM;

#CSS selector for the HTML element that contains the content we are interested in
use constant CONTENT_ELEMENT => 'div.body-content';
#where to generate the html files
use constant DEST_DIR => './tmp';
#object used to fetch the pages
our $ua = Mojo::UserAgent->new;
#make sure the destination dir exists
mkdir DEST_DIR unless -d DEST_DIR;

#List of urls you want to fetch; entries starting with a dash become section pages
our @urls = (
	#... put your article URLs and "-Section name" entries here
);
#used to generate the filename which we use to save on the hard-drive
sub get_filename {
	my %a = @_;
	#glean the file name from the url (everything after the last slash)
	(my $fname) = $a{url} =~ m!.+/(.+)!;
	return $fname;
}

sub get_head_str {
	my %a = @_;
	#we want the page header to be clean of dashes
	(my $head_str = $a{str}) =~ s/-/ /g;
	return $head_str;
}

sub full_path {
	my %a = @_;
	#generate a number to prepend to the filename, so we preserve the order for pdf generation
	my $fn = sprintf "%02d", $a{idx};
	return DEST_DIR . "/$fn$a{fname}.html";
}

sub save_html {
	my %a = @_;
	open my $fh, '>', $a{full_name} or die $!;
	print $fh $a{header};
	print $fh ${$a{str}};
	close $fh;
}

sub fetch_and_extract {
	my %a = @_;
	#fetch the html page (following up to 3 redirects) and keep just the body
	my $body = $ua->max_redirects(3)->get($a{url})->res->body;
	#make it a DOM object so we can traverse the HTML
	my $dom = Mojo::DOM->new($body);
	#extract the matching element(s) and stringify the result
	my $str = $dom->find(CONTENT_ELEMENT)->join("\n")->to_string;
	$str =~ s/[^[:ascii:]]+//g;#clean up non-ascii characters
	return \$str;
}

sub format_header {
	my %a = @_;
	return "<h1 style='color:#556698'>" . $a{str} . "</h1><hr>\n"	if $a{type} eq 'normal';
	return "<br><br><br><br><hr><h1 style='font-size:35pt;color:#556698'>" . $a{str} . "</h1><hr>\n";
}

#LOOP over all URL's
for my $i (0 .. $#urls) {

	print "$urls[$i]\n";
	#initialize vars
	my ($str_ref,$fname,$header) = (0,0,0);

	if ($urls[$i] =~ /^-/) {#Section page if it starts with dash

		$fname = $urls[$i];
		$str_ref = \'';#section pages have empty content
		#format the Section page header
		$header = format_header type => 'section', str => $fname;

	} else {# Normal page

		$str_ref = fetch_and_extract url => $urls[$i];

		$fname = get_filename url => $urls[$i];
		#header is build from the filename
		my $head_str = ucfirst get_head_str str => $fname;
		$header = format_header type => 'normal', str => $head_str;
	}

	my $full_name = full_path idx => $i, fname => $fname;
	save_html full_name => $full_name, str => $str_ref, header => $header;
}

#To convert the html files to PDF, run this command inside the dir where you generated the HTML files
# wkhtmltopdf *html scrubber.pdf