in reply to Hints for getting this perly...

Here's how I would write your script (in my admittedly idiosyncratic, but hopefully sufficiently Perl-ish, style). You probably won't like all of it, or even most of it; take what you like.

(Thoroughly untested.)

#!/usr/bin/perl

use strict;
use warnings;
# use diagnostics;

use File::Find;
use Date::Parse;

find \&process, 'tmp';

exit 0;

{
  my $output_dir;
  BEGIN { $output_dir = '/home/hynek/old-blog/new'; }

  sub process {
    if ( -f $_ && /\.html$/ ) {
      my ( $time, $text ) = parse( $File::Find::name );
      ( my $output_file = "$output_dir/$_" ) =~ s/html$/txt/;
      print_out( $text, $output_file );
      utime $time, $time, $output_file;
    }
  }
}

sub parse {
  my $input_file = shift;
  open my $in, "<:utf8", $input_file
    or die "Can't read $input_file: $!";

  my ( $time, $subject, $text );

  while ( <$in> ) {
    $time = Date::Parse::str2time( $1 ) and next
      if !$time && /\w+, (\w+ \d\d, \d\d\d\d)/;

    chomp;  # is this *really* necessary for all lines?

    my $match_no;
    if ( $match_no = ( m%<h3 class="post-title">% ... m%</h3>% ) ) {
      next if $match_no == 1;
      $subject = $_ and next if /\w/;
      $text = $subject and next if $match_no =~ y/E//;
    }

    if ( $match_no = ( m%<div class="post-body">% ... m%</div>% ) ) {
      next if $match_no == 1;
      last if $match_no =~ y/E//;
      $text .= "$_\n";  # why chomp earlier? are you getting
                        # rid of DOS eol sequences?
    }
  }
  close $in or die "Failed to close $input_file: $!";

  $text =~ s,(\r|</?p>|^\s*),,gm;

  return ( $time, $text );
}

sub print_out {
  my ( $text, $filename ) = @_;
  open my $out, ">:encoding(iso-8859-15)", $filename
    or die "Cannot write to $filename: $!";
  print $out $text;
  close $out;
}

__END__

TIMTOWTDI-ly,

the lowliest monk

Re^2: Hints for getting this perly...
by hynek (Novice) on May 11, 2005 at 13:37 UTC
    Wow.

    Thanks for taking the time to recode it completely. It's really inspiring and admittedly much cleaner with respect to Perl (e.g. using BEGIN, whose use I still haven't grokked).

      In this case the BEGIN block causes the line initializing $output_dir to be executed at compile time, and therefore before the sub that uses it gets called; without it, you'd have to rearrange the source code so that the initialization happened lexically before the first call to the sub. This is no big deal, at least in this case, but I like the freedom to arrange my code that the BEGIN trick allows. Combined with the enclosing block (around both the BEGIN block and the definition of process), it gives you something similar to (in fact better than) a C static variable. This is discussed at length here [1].

      [1] NB: in that thread I advocated using INIT blocks instead of BEGIN blocks, a position that was strongly challenged by ihb and ikegami. Since then I have learned that INIT blocks are broken, so I no longer recommend them.
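
The static-variable idiom described above can be seen in isolation in a minimal sketch like the following. The names next_id and $counter are hypothetical, chosen only for illustration; they are not part of the script above. As in that script, the calls appear in the source before the block that sets up the variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# These calls come lexically *before* the block that declares and
# initializes $counter, yet they already see the initialized value.
print next_id(), "\n";
print next_id(), "\n";

{
    my $counter;                 # private to this block: only next_id can see it
    BEGIN { $counter = 100; }    # executed at compile time, before any call above

    # Hypothetical helper: returns the current value, then increments it.
    sub next_id { return $counter++; }
}
```

Without the BEGIN, the assignment would only run when execution reached the block, which here is after both calls, so next_id would start from an undefined value. The enclosing braces are what keep $counter invisible to the rest of the file.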

      the lowliest monk