Re: Scrape a blog: a statistical approach

I'll apologize up front--I'm not answering your question. Instead, I'm going to offer a couple of comments on your code.

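When you create a subroutine, it's a bad idea to add a prototype (i.e., the parenthesized part) unless you know *exactly* what you're asking for. Perl prototypes aren't like prototypes in other languages: they don't check argument types, they change how the call is parsed and what context the arguments are evaluated in.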
When you write your code with good variable names and flow, then it's generally self-documenting, and comments can actually get in the way of making your intentions clear. Your comments are large and blocky, so they can be visually distracting. If you delete most of the comments in your code, it actually reads pretty clearly. If you keep comments, make them simple and non-distracting.
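
To illustrate the first point, here's a minimal sketch of what a prototype can do behind your back. The sub names are made up for this example. A `($)` prototype forces scalar context on the argument, so an array gets passed as its element count rather than as its elements:

```perl
use strict;
use warnings;

# Hypothetical subs, for illustration only.
# The ($) prototype imposes scalar context on the single argument.
sub first_with_proto($) { return $_[0] }

# No prototype: the array is flattened into @_ as usual.
sub first_plain { return $_[0] }

my @items = (10, 20, 30);

my $with    = first_with_proto(@items);   # receives scalar(@items)
my $without = first_plain(@items);        # receives (10, 20, 30)

print "with prototype:    $with\n";       # 3
print "without prototype: $without\n";    # 10
```

Both subs have the same body, yet they return different values for the same call. That kind of surprise is why I'd leave prototypes off unless you need that exact behavior.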

After applying these two suggestions to your subroutine, I get this:

############################################################
# specifies what the edit subroutine should do
############################################################
sub edit {
    my $file = $_;

    # only operate on html files
    if ((-e $file) && (! -d $file) && (/\.html?$/)) {
        open (FH, "<", $file) || die $!;

        my $tree = HTML::Tree->new();
        $tree->parse_file($file) || die $!;

        # The main div contains the post of interest
        my $getmaindiv = $tree->look_down(_tag => "div", id => "post_principale") || die $!;
        print $getmaindiv->as_HTML, "\n";
        close FH;
    }
}

Most of your subroutine is inside an if statement. In cases like this, I prefer [*] to simply return early when the condition isn't met. That saves an indentation level and reduces the visual complexity a bit.

sub edit {
    my $file = $_;

    # only operate on html files
    return unless (-e $file) && (! -d $file) && (/\.html?$/);

    open (FH, "<", $file) || die $!;

    my $tree = HTML::Tree->new();
    $tree->parse_file($file) || die $!;

    # The main div contains the post of interest
    my $getmaindiv = $tree->look_down(_tag => "div", id => "post_principale") || die $!;
    print $getmaindiv->as_HTML, "\n";
    close FH;
}

Now that the code is a little easier to read, I notice that you're not actually using the file handle you open. You're relying on the HTML parser's ability to accept a filename instead of a file handle. So I'd just remove the file handle code:

sub edit {
    my $file = $_;

    # only operate on html files
    return unless (-e $file) && (! -d $file) && (/\.html?$/);

    my $tree = HTML::Tree->new();
    $tree->parse_file($file) || die $!;

    # The main div contains the post of interest
    my $getmaindiv = $tree->look_down(_tag => "div", id => "post_principale") || die $!;
    print $getmaindiv->as_HTML, "\n";
}
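
One more note on the "only operate on html files" test. A pattern like `/.html?/` has an unescaped dot and no anchor, so it matches any character followed by "htm" anywhere in the name. Escaping the dot and anchoring at the end, as in `/\.html?$/`, is stricter. Here's a small sketch comparing the two; the helper names and file names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
# Loose: any character followed by "htm", anywhere in the name.
sub loose_match  { my ($name) = @_; return $name =~ /.html?/   ? 1 : 0 }

# Strict: a literal dot, then "htm" or "html", at the end of the name.
sub strict_match { my ($name) = @_; return $name =~ /\.html?$/ ? 1 : 0 }

for my $name (qw(index.html page.htm xhtml_notes.txt index.html.bak)) {
    printf "%-16s loose=%d strict=%d\n",
        $name, loose_match($name), strict_match($name);
}
```

The loose pattern accepts all four names, while the strict one accepts only `index.html` and `page.htm`. Whether you want the strict form depends on how your files are named, so treat it as a suggestion.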

[*] Just one of my preferences. Of course all of my suggestions are based on my preferences, but the other ones are pretty well accepted, while this one is the most discretionary. Since I'm just another programmer among many, take it with a grain of salt.

I hope you find some of this useful...

Update: I specifically said "don't use prototypes", yet I left the prototype in all versions...removed.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^2: Scrape a blog: a statistical approach by Laurent_R (Canon) on Apr 12, 2014 at 19:06 UTC
Hi, I definitely agree with your two suggestions, roboticus. The original post is a pathological case of comments making the code much less readable. And I of course also agree that prototypes should only be used by people who really understand what they do and in which cases they are useful.
Re^3: Scrape a blog: a statistical approach by Anonymous Monk on Apr 13, 2014 at 12:12 UTC
Thanks for your tips guys. I have learned a good lesson.