Hints for getting this perly...

hynek has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm currently trying to sharpen my perl-skills and do nearly everything with a perl-script. Currently, I decided to move my blog from blogger.com to a own blosxom-installation. To be not forced to do massive copy'n'paste I decided to write a perl-script. I also succeeded, the job is done but the script looks somehow unperlish to me. :( So I'd like some suggestion how to do it better.

As it's a one-use script, some assumptions are made: The old .html-files from blogger.com are in /home/hynek/old-blog/tmp and the new files are put into /home/hynek/old-blog/new. The date has always the same form and appears only once and the body and the subject are also easy to distinguish so it doesn't pay off to use a full-blown html-parser, however it leads to a rather ugly loop. An example of such a page is at here. Blosxom-files are simple .txt files with html in it and with the first line being the subject.

So here's the code:

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

use File::Find;
use Date::Format;
use Date::Parse;

sub process {
    if (-f $_ && /\.html$/) {
    my $fn = $_;

    open(F, "<:utf8", $fn) or die("Cannot open ${File::Find::name}: $!
+");

    my @file = <F>;
    chomp(@file);
    close F;

    (my $date) = grep m%\w+, \w+ \d\d, \d\d\d\d%, @file;
    $date =~ m%\w+, (\w+ \d\d, \d\d\d\d)%;
    my $time = str2time($1);

    my ($in_text, $in_subject) = undef;
    my ($text, $subject) = "";
READ:    foreach (@file) {
        if (m%<h3 class="post-title">%) { # Subject
        $in_subject = 1;
        } elsif ($in_subject && $_ !~ m%</h3>%) {
        $subject = $_ if /\w/;
        } elsif ($in_subject && $_ =~ m%</h3>%) {
        $in_subject = undef;
        $text = "$subject\n";
        } elsif (m%<div class="post-body">%) { # Text
        $in_text = 1;
        } elsif ($in_text && $_ !~ m%</div>%) { 
        $text .= $_ . "\n";
        } elsif ($in_text && $_ =~ m%</div>%) {
        last READ;
        }
    }

    $text =~ s/(\r|<p>|<\/p>|^\s*)//gm;
    $fn =~ s/html$/txt/;

    open(OUT, ">:encoding(iso-8859-15)", "/home/hynek/old-blog/new/$fn
+") or die "Cannot open for write: $!.";
    print OUT $text;
    close(OUT);
    utime $time, $time, ("/home/hynek/old-blog/new/$fn");
    }
}

find(\&process, ("tmp"));
[download]

What I especially dislike is the loop and the reading of the date consisting of two regexps.

Thanks in advance.

UPDATE: Removed the copy'n'paste error from utime.

Comment on Hints for getting this perly... Download Code

Replies are listed 'Best First'.
Re: Hints for getting this perly... by Anonymous Monk on May 11, 2005 at 08:52 UTC
My 2 cents, The code has a syntax error: (the ')' on the utime-line) no need to label the loop, but that's not a big problem though The style of the script could be better \d\d\d\d can be written as \d{4} which is lots shorter Using grep to select ALL elements matching a regex from the file, only using the first one that matches, and then the same regex again on it... perhaps a for/while loop would be better? Reading the file in memory is not something I would do... I would probably put the date-regex in the main-while loop... (while (<F>))	[reply]
Re^2: Hints for getting this perly... by Animator (Hermit) on May 11, 2005 at 08:55 UTC
Hmm, apperently I wasn't logged in... (the previous post is mine too) Another note: instead of `$in_text` you can use the .. operator. (something like `if (m#<div ...># .. m#</div>#) { ...` Also, you are doing a match, but not checking if it returns true or false, and then go on using `$1`... If the match fails then you will end up using `$1` from the previous match, and that might corrupt data... Update: a good document about the style (I mean the indeting etc) is the perlstyle POD The main thing that botters me is that last READ etc start at the same position as your elsif... (or is that only the way it is displayed here?)	[reply] [d/l] [select]
Re^3: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 09:10 UTC
Ah I see, it's just here. Of course I indent my code...dunno why it's fubared here. :( However I'd say that the ...-operator won't work, when working line-wise?	[reply]
Re^2: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 09:00 UTC
Could you elaborate on the style? That's one of the points I'm hunting in special. I'm an old C-fellow and changing habits is hard. :( I'll try to implement the rest though and'll post the result here.	[reply]
Re: Hints for getting this perly... by bart (Canon) on May 11, 2005 at 08:56 UTC
You can try map instead of grep, and just keep the matches. From `(my $date) = grep m%\w+, \w+ \d\d, \d\d\d\d%, @file; $date =~ m%\w+, (\w+ \d\d, \d\d\d\d)%;` [download] to `my($date) = map m%\w+, (\w+ \d\d, \d\d\d\d)%, @file;` [download] If the regex doesn't match, it returns an empty list, so it'll disappear from the results. Also, if the files are largish and the date always appears near the top, I'd avoid reading in the whole file and just read in, say, the first k of text, as a single string. Then you don't even need map. `local $/ = \1024; my $begin = <F>; my ($date) = $begin =~ m%\w+, (\w+ \d\d, \d\d\d\d)%; my $time = str2time($date);` [download] update Thanks to Animator for the hint... You do appear to need the entire file. I don't know what is eeasier, checking the file line by line, as you do, or reading in the whole file as a single string, and processing it that way. You can achieve that by undeffing $/ instead: `undef $/;` [download] I think, for this specific file format, your way is easier.	[reply] [d/l] [select]
Re^2: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 09:16 UTC
Well, I've just put it also into the while loop. d'oh As you might see, the development of this tiny script was pretty....evolutionary so it's full of strange artifacts. :) Regarding the update, I also played with the undef $\-idea however now when considering "..." it might really pay off.	[reply]
Re: Hints for getting this perly... by tlm (Prior) on May 11, 2005 at 11:23 UTC
Here's how I would write your script (in my admittedly idiosyncratic, but hopefully sufficiently Perl-ish, style). You probably won't like all of it, or even most of it; take what you like. Read more... (2 kB) TIMTOWTDI-ly, the lowliest monk	[reply] [d/l]
Re^2: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 13:37 UTC
Wow. Thanks for taking the time to recode it completely. It's really inspiring and admittedly much cleaner with respect to perl (eg. using BEGIN whose use I still haven't grokked).	[reply]
Re^3: Hints for getting this perly... by tlm (Prior) on May 11, 2005 at 15:11 UTC
In this case the `BEGIN` basically causes the line initializing `$output_dir` to be executed before the sub that uses it gets called; without it, you'd have to rearrange the source code so that the initialization happened lexically before the first call to the sub. This is no big deal, at least in this case, but I like the freedom to arrange my code that the `BEGIN` trick allows. Combined with the enclosing block (around both the `BEGIN` block and the definition of `process`) it gives you something similar to (in fact better than) a C static variable. This is discussed at length here¹. ¹NB: in that thread I advocated using `INIT` blocks instead of `BEGIN` blocks, a position that was strongly challenged by ihb and ikegami. Since then I have learned that `INIT` blocks are broken, so I no longer recommend them. the lowliest monk	[reply]
Re: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 10:28 UTC
Thanks for the feedback so for, a first update based on the suggestions would be: #!/usr/bin/perl use strict; use warnings; use diagnostics; use File::Find; use Date::Format; use Date::Parse; sub process { if (-f $_ && /\.html$/) { my $fn = $_; open(F, "<:utf8", $fn) or die("Cannot open ${File::Find::name} +: $!"); my ($in_text, $in_subject) = undef; my ($text, $subject, $time); while (<F>) { chomp; if ($_ =~ /\w+, (\w+ \d\d, \d{4})/) { $time = str2time($1); } elsif (/<h3 class="post-title">/) { # Subject $in_subject = 1; } elsif ($in_subject && $_ !~ m%</h3>%) { $subject = $_ if /\w/; } elsif ($in_subject && $_ =~ m%</h3>%) { $in_subject = undef; $text = "$subject\n"; } elsif (/<div class="post-body">/) { # Text $in_text = 1; } elsif ($in_text && $_ !~ m%</div>%) { $text .= $_ . "\n"; } elsif ($in_text && $_ =~ m%</div>%) { last; } } die "Invalid fileformat: $text $time." unless ($text && $time) +; $text =~ s/(\r\|<p>\|<\/p>\|^\s*)//gm; $fn =~ s/html$/txt/; open(OUT, ">:encoding(iso-8859-15)", "/home/hynek/old-blog/new +/$fn") or die "Cannot open for write: $!."; print OUT $text; close(OUT); utime $time, $time, ("/home/hynek/old-blog/new/$fn"); } } find(\&process, ("tmp")); [download] Now I'm going to read it flatly and try to utilitize the ".."-operator...let's see how it further shortens the script (comments on this code are still welcome though).	[reply] [d/l]
Re^2: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 13:26 UTC
".." proved to be kinda pointless...but I really like my new version: :) #!/usr/bin/perl use strict; use warnings; use diagnostics; use File::Find; use Date::Format; use Date::Parse; sub process { if (-f $_ && /\.html$/) { my $fn = $_; open(F, "<:utf8", $fn) or die("Cannot open ${File::Find::name} +: $!"); my $file = <F>; close F; my ($text, $time); ($text = $1) =~ s/\r\n//mg if ($file =~ m%<h3 class="post-titl +e">(.?)</h3>%s); $text .= $1 if ($file =~ m%<div class="post-body">(.?)</div>% +s); $time = str2time($1) if ($file =~ /\w+, (\w+ \d\d, \d{4})/s); die "Invalid fileformat: $text $time." unless ($text && $time) +; $text =~ s/(\r\|<p>\|<\/p>\|^\s*\|[\t ]{2,})//g; $fn =~ s/html$/txt/; open(OUT, ">:encoding(iso-8859-15)", "/home/hynek/old-blog/new +/$fn") or die "Cannot open for write: $!."; print OUT $text; close(OUT); utime $time, $time, ("/home/hynek/old-blog/new/$fn"); } } undef $/; find(\&process, ("tmp")); [download] Any comments? Thanks for the support so far!	[reply] [d/l]