mw has asked for the wisdom of the Perl Monks concerning the following question:

Monks, Apprentices, Journeymen, be so good as to lend me your ear.

Recently I stepped out on the path of upgrading my brane from Korn Shell and associated utilities to Perl. My learning was greatly assisted by this document, which I can recommend up to the point of OO.

One of the constructs I often use in ksh is:
remove_cruft() {
    ... sed, sed, awk! awk! awk!
}

cat $HTML_FILE | remove_cruft >$OUTPUT_FILE

I have a library of HTML tools that create stylesheets, change all H1's to H2's, remove unwanted tags and so on, which I'm now porting to perl. I'd like to know the preferred way to filter the output of one perl sub through another perl sub. Or maybe the problem is approached in a completely different manner.

Any words of wisdom greatly appreciated.

Replies are listed 'Best First'.
Re: Filters within a filter...
by Tanktalus (Canon) on Feb 04, 2005 at 16:29 UTC

    As a former ksh-(ab)user, who has many utilities in perl ... let's just say that, in and of itself, there is little reason to move working programs over ... the only reasons to do this, IMO, are either to add new features which are getting prohibitively expensive (in time) to add in shell, or for the sake of learning perl. I'm going to assume one of these is the case. I still have many shell scripts floating around alongside my perl scripts - and perl scripts that generate shell code which I can eval - and I still write new shell code from time to time. Right tool for the right job.

    Personally, I would do one of the following - depending on the rest of the code. Remember: TMTOWTDI. Not all of them are great, but often more than one is sufficient or even desirable.

    Option one:

    sub remove_cruft {
        my @lines = @_;
        # do stuff to @lines;
        return @lines;
    }

    # use as:
    my @no_cruft = remove_cruft(@lines);
    # or:
    my @no_cruft = remove_cruft(<FH>);

    Option two

    # Using a reference:
    sub remove_cruft {
        my $lines = shift;
        # do stuff to @$lines;
    }

    # use as:
    remove_cruft(\@lines);

    # Note that the following won't work:
    # remove_cruft(<FH>);
    # you need to use a temporary variable

    Option Three

    # Using prototypes (which are considered "evil" by some)
    sub remove_cruft(\@) {
        my $lines = shift;   # receives \@lines thanks to the \@ prototype
        # do stuff to @$lines
    }

    # use as:
    remove_cruft(@lines);

    # this still doesn't work:
    # remove_cruft(<FH>);

    Option Four

    use IO::File;

    sub remove_cruft {
        my @lines;
        if (@_ > 1) {              # lines were passed in.
            @lines = @_;
        }
        # we either got a filename or a filehandle.
        elsif (ref $_[0]) {        # assume the object is a filehandle.
            my $fh = shift;        # copy to a plain scalar; <$_[0]> would be parsed as a glob
            @lines = <$fh>;
        }
        else {                     # must be a filename.
            my $fh = IO::File->new(shift, 'r');
            @lines = <$fh>;
        }
        # do stuff to @lines;
        @lines;
    }

    I realise that's not the most efficient way to do that last one, but I'm too lazy to do all of the work in this little textarea box... :-)

      * mw nods

      Well, there are two reasons why I'm porting these scripts over: first, as an exercise to learn perl, and second, because I'm hoping for increased speed. I suppose most of the HTML files I'm likely to see here will fit in memory without too many problems, so I think I'll standardise on arrays of lines slurped therefrom.
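      As a minimal sketch of that slurp-then-filter approach (the file names and the two toy filter subs below are only placeholders, not anyone's actual library):

      use strict;
      use warnings;

      # toy filters that take a list of lines and return a new list (cf. Option one above)
      sub remove_cruft    { grep { !/<font\b/i } @_ }                            # drop <font> lines
      sub demote_headings { map { (my $l = $_) =~ s/<(\/?)h1>/<$1h2>/gi; $l } @_ }

      open my $in, '<', 'page.html' or die "can't open page.html: $!";
      my @lines = <$in>;
      close $in;

      @lines = demote_headings(remove_cruft(@lines));

      open my $out, '>', 'page.clean.html' or die "can't write page.clean.html: $!";
      print $out @lines;
      close $out;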

      Thanks!

        I knew I missed an option or so ...

        Option Five

        sub remove_cruft {
            my $line = shift;
            # do stuff to one $line here.
            $line;
        }

        # use as:
        while (my $l = <$fh>) {
            $l = remove_cruft($l);
            $l = remove_other_cruft($l);
            $l = remove_yet_more_cruft($l);

            # or ...
            $l = $_->($l) foreach (\&remove_cruft, \&remove_other_cruft, \&remove_yet_more_cruft);

            # or ...
            $l = remove_yet_more_cruft(remove_other_cruft(remove_cruft($l)));
            # on second thought, don't do that last one :-)
        }

        Option Six

        sub remove_cruft {
            # do stuff to single line $_[0];
        }

        # use as:
        while (my $l = <$fh>) {
            remove_cruft($l);
            remove_other_cruft($l);
            remove_yet_more_cruft($l);

            # or ...
            $_->($l) foreach (\&remove_cruft, \&remove_other_cruft, \&remove_yet_more_cruft);
        }

        The options are endless. What I highly discourage you from doing is writing shell script in perl. I've seen that so many times that it makes me cringe each time. Whether that is to write my $data = `grep blah $filename` rather than open my $fh, $filename; my $data = join '', grep { /blah/ } <$fh>; (and this is just the least perlish of the not-shell-script options), or it's system("mkdir $dir"); rather than mkdir $dir ... there are some really nifty perl idioms that take care of these things for you. They say you can write ForTran in any language. Same is true of shell scripts :-)
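        To make that contrast concrete, here is a hedged sketch of the perlish versions, with error checking added ($filename and $dir are placeholders):

        use strict;
        use warnings;

        my ($filename, $dir) = ('data.txt', 'new_dir');   # placeholders

        # instead of: my $data = `grep blah $filename`
        open my $fh, '<', $filename or die "can't open $filename: $!";
        my $data = join '', grep { /blah/ } <$fh>;
        close $fh;

        # instead of: system("mkdir $dir")
        mkdir $dir or die "can't mkdir $dir: $!";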

Re: Filters within a filter...
by fauria (Deacon) on Feb 04, 2005 at 16:39 UTC
    Hi!

    When using perl, you can pipe input into your script and read it like this:

    while (<STDIN>) {
        print "Current line is stored in: " . $_;
        chomp;
        print "Current line without the newline character: " . $_ . "\n";
    }

    For example: cat /etc/services | perl foo.pl

    STDIN can also be swapped for any other filehandle you have opened. Without piped input, it reads from the terminal, much like read in bash (I think it's the same in ksh).

    There is also the "diamond operator", written "<>", which reads from whatever files are passed as arguments, or from whatever is piped in:

    while (<>) {
        chomp;
        print "I have $_ as input\n";
    }

    For example: perl foo.pl /etc/services /etc/hosts

    From here, you can operate on $_ (the current line) in any way you like, using regexes, awk-like operations, etc. You can save your converted text by redirecting output to a file from the command line, or from inside your perl script.
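    For example, a minimal sketch of doing that inside the script (the output file name and the substitution are only placeholders):

    use strict;
    use warnings;

    open my $out, '>', 'clean.html' or die "can't write clean.html: $!";
    while (<>) {
        s/<(\/?)h1>/<$1h2>/gi;    # example: demote H1 headings to H2
        print $out $_;
    }
    close $out;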

    Tip: the regex syntax used in sed is very similar to the one used in perl.
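    For instance, the same substitution in both (pattern and file name are only illustrative):

    # sed:
    sed 's/<H1>/<H2>/g' input.html

    # perl:
    perl -pe 's/<H1>/<H2>/g' input.html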

    For a translation from awk to perl, you can try "a2p" (man a2p).
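    Typical usage is simply (the script name is a placeholder):

    a2p old_filter.awk > old_filter.pl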