Re: Hints for getting this perly...

Here's how I would write your script (in my admittedly idiosyncratic, but hopefully sufficiently Perl-ish, style). You probably won't like all of it, or even most of it; take what you like.

(Thoroughly untested.)

#!/usr/bin/perl

use strict;
use warnings;
# use diagnostics;

use File::Find;
use Date::Parse;

find \&process, 'tmp';
exit 0;

{
  my $output_dir;
  BEGIN { $output_dir = '/home/hynek/old-blog/new'; }

  sub process {
    if ( -f $_ && /\.html$/ ) {
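      # find() has chdir'd into the file's directory, so $_ (the basename)
      # is the path that opens; the relative $File::Find::name would not.
      my ( $time, $text ) = parse( $_ );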
      ( my $output_file = "$output_dir/$_" ) =~ s/html$/txt/;
      print_out( $text, $output_file );
      utime $time, $time, $output_file;
    }
  }
}

sub parse {
  my $input_file = shift;
  open my $in, "<:utf8", $input_file or die "Can't read $input_file: $!";
  my ( $time, $subject, $text );
  while ( <$in> ) {
    $time = Date::Parse::str2time( $1 ) and next
      if !$time && /\w+, (\w+ \d\d, \d\d\d\d)/;
    chomp;                 # is this *really* necessary for all lines?
    my $match_no;
    if ( $match_no = ( m%<h3 class="post-title">% ... m%</h3>% ) ) {
      next if $match_no == 1;
      # Test for the closing line first: "</h3>" itself matches /\w/.
      $text    = $subject and next if $match_no =~ y/E//;
      $subject = $_       and next if /\w/;
    }
    if ( $match_no = ( m%<div class="post-body">% ... m%</div>% ) ) {
      next if $match_no == 1;
      last if $match_no =~ y/E//;
      $text .= "$_\n";          # why chomp earlier?  are you getting
                                # rid of DOS eol sequences?
    }
  }
  close $in or die "Failed to close $input_file: $!";
  $text =~ s,(\r|</?p>|^\s*),,gm;
  return ( $time, $text );
}

sub print_out {
  my ( $text, $filename ) = @_;
  open my $out, ">:encoding(iso-8859-15)", $filename or
    die "Cannot write to $filename: $!";
  print $out $text;
  close $out or die "Failed to close $filename: $!";
}
__END__
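
The two `...` tests in parse() rely on the range (flip-flop) operator in scalar context returning a sequence number rather than a plain boolean: 1 on the line where the left pattern matches, then 2, 3, and so on, with "E0" appended on the line where the right pattern matches. Here is a minimal sketch of that behaviour. The sample lines are made up for illustration and are not taken from the real blog files.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for one blog page.
my @lines = (
  'before',
  '<h3 class="post-title">',
  'My title',
  '</h3>',
  'after',
);

my @seen;
for ( @lines ) {
  # False outside the range; 1, 2, ... inside it; "E0" marks the last line.
  my $n = ( m%<h3 class="post-title">% ... m%</h3>% );
  push @seen, $n ? $n : '-';
}
print join( ' ', @seen ), "\n";   # prints: - 1 2 3E0 -
```

That is why `$match_no == 1` picks out the opening line and `$match_no =~ y/E//` picks out the closing one.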

TIMTOWTDI-ly,

the lowliest monk


Re^2: Hints for getting this perly... by hynek (Novice) on May 11, 2005 at 13:37 UTC
Wow. Thanks for taking the time to recode it completely. It's really inspiring and admittedly much cleaner with respect to Perl (e.g. using BEGIN, whose use I still haven't grokked).
Re^3: Hints for getting this perly... by tlm (Prior) on May 11, 2005 at 15:11 UTC
In this case the `BEGIN` causes the line initializing `$output_dir` to run at compile time, before the sub that uses it can be called; without it, you'd have to rearrange the source code so that the initialization happened lexically before the first call to the sub. This is no big deal, at least in this case, but I like the freedom to arrange my code that the `BEGIN` trick allows. Combined with the enclosing block (around both the `BEGIN` block and the definition of `process`), it gives you something similar to (in fact better than) a C static variable. This is discussed at length here¹.

¹NB: in that thread I advocated using `INIT` blocks instead of `BEGIN` blocks, a position that was strongly challenged by ihb and ikegami. Since then I have learned that `INIT` blocks are broken, so I no longer recommend them.

the lowliest monk
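
To make the static-variable idea concrete, here is a small sketch along the same lines. The names `next_id` and `$counter` are hypothetical, chosen only for this illustration; the pattern is the same as with `$output_dir` and `process` above.

```perl
use strict;
use warnings;

# The calls come lexically before the block that defines next_id().
my @ids = ( next_id(), next_id(), next_id() );
print "@ids\n";                 # prints: 100 101 102

{
  my $counter;                  # visible only inside this block
  BEGIN { $counter = 100; }     # runs at compile time, before the calls above

  # The sub closes over $counter, so the value persists between calls.
  sub next_id { return $counter++; }
}
```

With a plain `my $counter = 100;` in place of the `BEGIN`, the assignment would not have run yet when the calls at the top execute, so the block would have to be moved above them.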