Scrape a blog: a statistical approach

epimenidecretese has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear All,

I am trying to build a text corpus from a blog.

I have downloaded the whole site locally and removed all the files I wouldn't need, so I am left just with the html pages containing the posts (pure text, just the post: that's what I want at the end).

What follows is the smallest and cleanest code I have:

#scrape beppegrillo.it
############################################################
#!/usr/bin/perl;
use utf8;
use strict;
use warnings;
use File::Find;
use HTML::Tree;
use LWP::Simple;
############################################################
#set the working directory where the html files are
############################################################
my $dir = "CORPUS/2005/01/";
my $url;
#############################################################
#call a sub routine in order to operate on each file in the directory
############################################################
find(\&edit, $dir);
#
############################################################
#specifica cosa deve fare la subroutine edit
############################################################
sub edit() {
############################################################
#check that what you're working on is a file and not a directory; chec
+k that is an html file
############################################################
    my $file = $_;
    if ((-e $file) && (! -d $file) && (/.html?/)){    
############################################################
#open filehandle in order to read
############################################################
    open (FH, "<",$file) || die $!;
############################################################
#build the tree or die
############################################################
    my $tree = HTML::Tree->new();
    $tree->parse_file($file) || die $!;
############################################################
#get the main div, the one that contains the post and print it as html
############################################################
    my $getmaindiv = $tree->look_down(_tag => "div",id  => "post_princ
+ipale") || die $!;
    print $getmaindiv->as_HTML, "\n";    
    close FH;
############################################################
#
############################################################
    }
}
[download]

Now, it more or less works. I've been able with this code to get what I want. I had to add some more lines (but just a few, in order to get only the <p> inside the main <div> tag) but it did the work.

So, I would like to try a different approach. I would like to pile up all the code and get the most frequent sequence of tag, so to statistically identify what is NOT pure post text.

I've seen HTML::ExtractMain which should use the Readability algorithm but it doesn't seem to work.

Do you guys think that what I trying to do is possibile in a few lines of code?

At the moment I only can get all the html code togeather, but I would like to have it line by line, so that then some frequency list could be possible.

Any idea?

Comment on Scrape a blog: a statistical approach Download Code

Replies are listed 'Best First'.
Re: Scrape a blog: a statistical approach by roboticus (Chancellor) on Apr 12, 2014 at 14:48 UTC
emimenidecretese: I'll apologize up front--I'm not answering your question. Instead, I'm going to provide a couple comments on your code. When you create a subroutine, it's a bad idea to add a prototype (i.e., the parenthesized part) unless you know exactly what you're asking for. Perl prototypes aren't like prototypes in other languages. When you write your code with good variable names and flow, then it's generally self-documenting, and comments can actually get in the way of making your intentions clear. Your comments are large and blocky, so they can be visually distracting. If you delete most of the comments in your code, it actually reads pretty clearly. If you keep comments, make them simple and non-distracting. After applying these two suggestions to your subroutine, I get this: ############################################################ #specifica cosa deve fare la subroutine edit ############################################################ sub edit { my $file = $_; # only operate on html files if ((-e $file) && (! -d $file) && (/.html?/)){ open (FH, "<",$file) \|\| die $!; my $tree = HTML::Tree->new(); $tree->parse_file($file) \|\| die $!; # The main div contains the post of interest my $getmaindiv = $tree->look_down(_tag => "div",id => "post_princ +ipale") \|\| die $!; print $getmaindiv->as_HTML, "\n"; close FH; } } [download] Most of your subroutine is inside an if statement. In cases like this, I prefer^[] to simply return if the case isn't met, then you save an indentation level, reducing the visual complexity a bit. `sub edit { my $file = $_; # only operate on html files return unless (-e $file) && (! -d $file) && (/.html?/); open (FH, "<",$file) \|\| die $!; my $tree = HTML::Tree->new(); $tree->parse_file($file) \|\| die $!; # The main div contains the post of interest my $getmaindiv = $tree->look_down(_tag => "div",id => "post_princ +ipale") \|\| die $!; print $getmaindiv->as_HTML, "\n"; close FH; }` [download] Now that the code is a little easier to read, I notice that you're not actually using the file handle you open. You're using the HTML parsers ability to accept a filename instead of a file handle. So I'd just remove the file handle code: `sub edit { my $file = $_; # only operate on html files return unless (-e $file) && (! -d $file) && (/.html?/); my $tree = HTML::Tree->new(); $tree->parse_file($file) \|\| die $!; # The main div contains the post of interest my $getmaindiv = $tree->look_down(_tag => "div",id => "post_princ +ipale") \|\| die $!; print $getmaindiv->as_HTML, "\n"; }` [download] [] Just one of my preferences. Of course all of my suggestions are based on my preferences, but the other ones are pretty-well accepted, while this one is the most discretionary. Since I'm just another programmer among many, take it with a grain of salt. I hope you find some of this useful... Update:** I specifically said "don't use prototypes", yet I left the prototype in all versions...removed. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re^2: Scrape a blog: a statistical approach by Laurent_R (Canon) on Apr 12, 2014 at 19:06 UTC
Hi, I definitely agree with your two suggestions, roboticus. The original post is a pathological case of comments making the code much less readable. And I of course also agree that prototypes should only be used by people who really understand what they do and in which cases they are useful.	[reply]
Re^3: Scrape a blog: a statistical approach by Anonymous Monk on Apr 13, 2014 at 12:12 UTC
Thanks for your tips guys. I have learned a good lesson.	[reply]
Re: Scrape a blog: a statistical approach by kcott (Archbishop) on Apr 13, 2014 at 02:39 UTC
G'day epimenidecretese, I'm not completely sure what you're after. A small, representative sample of data and expected output would have helped. If by "pile up all the code" you mean process all of your HTML files in a single run, then yes, getting a frequency count of all the tags in all the files "is possibile in a few lines of code". Here's `pm_1082082_html_tag_count.pl`: `#!/usr/bin/env perl -l use strict; use warnings; my %tags; while (<>) { ++$tags{$1} while /<(?!\/)([^ >]+)/g; } print "$_\t$tags{$_}" for sort { $tags{$b} <=> $tags{$a} } keys %tags;` [download] Here's some dummy input data. You can view those in a browser if you want: they render OK but they're not very interesting. `$ ls -l pm_1082082_.html -rw-r--r-- 1 ken staff 237 13 Apr 12:13 pm_1082082_1.html -rw-r--r-- 1 ken staff 237 13 Apr 12:12 pm_1082082_2.html` [download] `$ cat pm_1082082_1.html <h1 id="H-1">Heading 1</h1> <p class="sub-heading"> Some <strong>bold</strong> and <em>italic</em> text.</p> <h2 id="H-1-2">Heading 1.2</h2> <p>Para1 (1.2.1)</p><p>Para2 (1.2.2)</p> <p><strong>Fake newlines:</strong><br /><br><br /></p>` [download] `$ cat pm_1082082_2.html <h1 id="H-2">Heading 2</h1> <p class="sub-heading"> Some <strong>bold</strong> and <em>italic</em> text.</p> <h2 id="H-2-2">Heading 2.2</h2> <p>Para1 (2.2.1)</p><p>Para2 (2.2.2)</p> <p><strong>Fake newlines:</strong><br /><br><br /></p>` [download] Here's a sample run: `$ pm_1082082_html_tag_count.pl pm_1082082_.html p 8 br 6 strong 4 h1 2 em 2 h2 2` [download] If that's not what you're after, you'll need to clarify what you do want and, as already mentioned, sample input and expected output will help. [If you're unsure of what information to provide, the guidelines in "How do I post a question effectively?" should help.] -- Ken	[reply] [d/l] [select]
Re^2: Scrape a blog: a statistical approach by epimenidecretese (Acolyte) on Apr 13, 2014 at 12:26 UTC
I did a search and, if I understood it correctly, I am trying to remove the boilerplate from some html pages: is it the right term? So, I have a lot's of data (web pages from 2005 to 2014) from the same blog. And I only want the text of the post. So I have a lot's of data to provide as example of what I don't want. In fact, more or less, in all the html files I have, the post text (title,date, ecc.) should be the only things that change. More or less. So I am trying to figure out how to statistically identify those lines of code that don't change. So far I processed all the pages and got one big html file. There I have lot's of lines that are the same. I want to somehow count how frequently those lines occurs, so to be able to identify them as boilerplate. With the (bad) code I posted I've already been able to strip off lot's of code from the original html page. Now I want to clean the rest: but since I have lot's of pages I thought it would be a good idea to try to somehow weight the boilerplate lines (maybe with mutual information?) I'll post same examplef of code and the results I've got so far as soon as I can.	[reply]
Re^3: Scrape a blog: a statistical approach by soonix (Chancellor) on Apr 13, 2014 at 22:18 UTC
You could take a diff between consecutive pages instead of counting lines. You'd have to experiment with different modules like e.g. HTML::Diff or Text::Diff, but this approach could also help with style/layout changes.	[reply]
Re: Scrape a blog: a statistical approach by epimenidecretese (Acolyte) on Apr 15, 2014 at 14:10 UTC
Ok, I did some further researches and I've found that this stuff is too complicated to be solved "in a few lines of code". For those who will have to handle the same problem I post the following link which contains up to date informations and useful libraries. Now I using justext within python. There is a NCleaner perl module but I've not been able to use it. As always, thanks guys for your support. https://sites.google.com/a/morganclaypool.com/wcc/home/software	[reply]