http://qs1969.pair.com?node_id=607885

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Masters of the Script-fu! I am a noob with Perl, and wrote the following code to parse through an HTML or JSP file and remove all code blocks and extraneous whitespace, leaving just the text of the page. I know there are probably easier ways to do this on a one-off basis, but I had a couple hundred files to do this to. I also needed an excuse to start learning Perl, and I wrote this using PerlMonks & the camel book. This script works, but man is it ugly.
use strict; if($#ARGV < 0){ die "You did not specify any files to process! : $!\n"; } if($#ARGV > 2){ die "This is only going to work on the first one in the list, I th +ink. Please try again. : $!\n"; } my $onceonly=0; while(<>) { my $infname = $ARGV; my $outfname = $infname.".txt"; my $inputtxt = ""; my $outputtxt = ""; local $/=undef; open (INPUT, "$infname") or die "$infname could not be opened.: $! +\n"; while (<INPUT>){ $inputtxt .= $_; } open (OUTPUT, ">>$outfname") or die "$outfname could not be opened +.: $!\n"; while($onceonly==0) { my $bracecounter = 0; my $startbrace = '<'; my $stopbrace = '>'; for (my $i=length($inputtxt); $i>0 && length($inputtxt)>0; $i- +=1) { #going to count the number of open braces I find, #subtracting the number of close braces I find, #storing the data into outputtxt when counter==0 #looks funny because we're working backwards. my $testchar= chop ($inputtxt); if ($bracecounter == 0 && $testchar ne "<" && $testchar ne + ">"){ $outputtxt = $testchar.$outputtxt; $outputtxt =~ s/&nbsp;/ /g; $outputtxt =~ s/^\s\s//g; } elsif ($testchar eq $stopbrace){ $bracecounter+=1; } elsif ($testchar eq $startbrace){ $bracecounter-=1; } } print OUTPUT $outputtxt; #print $outputtxt; $onceonly+=1; } close $infname; close $outfname; }
I have the following issues that I can't seem to get a clear explanation of:

1. Placing more than one file for an argument results in new files ending in .txt, as expected, but they are all blank.
2. I had an issue that I resolved with the 'onceonly' var, where the file would have multiple iterations of the text in the file - somewhere around 50-60 times, but never the same number twice. I checked this with line number counts, and yes, I cleared the files between each test.
3. I know I can probably do this with modules, but as I'm still green with Perl anyway, I thought to do this in the basic syntax without clouding problems with module interactions. This is the next project, creating this script in a module form.
4. I had originally tried using a regex for this, but I couldn't get one to do nested tags.
Thanks for any suggestions!