
Hello Masters of the Script-fu! I am a noob with Perl, and wrote the following code to parse through an HTML or JSP file and remove all code blocks and extraneous whitespace, leaving just the text of the page. I know there are probably easier ways to do this on a one-off basis, but I had a couple hundred files to do this to. I also needed an excuse to start learning Perl, and I wrote this using PerlMonks & the camel book. This script works, but man is it ugly.

use strict;

if($#ARGV < 0){
    die "You did not specify any files to process! : $!\n";
}
if($#ARGV > 2){
    die "This is only going to work on the first one in the list, I think. Please try again. : $!\n";
}

my $onceonly=0;
while(<>)
{
    my $infname = $ARGV;
    
    my $outfname = $infname.".txt";
    
    my $inputtxt = "";
    my $outputtxt = "";
    
    local $/=undef;
    open (INPUT, "$infname") or die "$infname could not be opened.: $!\n";
    while (<INPUT>){
        $inputtxt .= $_;
    }
    open (OUTPUT, ">>$outfname") or die "$outfname could not be opened.: $!\n";
 
    while($onceonly==0)
    {

        my $bracecounter = 0;
        my $startbrace = '<';
        my $stopbrace = '>';
        
        for (my $i=length($inputtxt); $i>0 && length($inputtxt)>0; $i-=1)
        {
            #going to count the number of open braces I find,
            #subtracting the number of close braces I find,
            #storing the data into outputtxt when counter==0
            #looks funny because we're working backwards.
            
            my $testchar= chop ($inputtxt);
            if ($bracecounter == 0 && $testchar ne "<" && $testchar ne ">"){
                $outputtxt = $testchar.$outputtxt;
                $outputtxt =~ s/&nbsp;/ /g;
                $outputtxt =~ s/^\s\s//g;
                
            }
            elsif ($testchar eq $stopbrace){
                $bracecounter+=1;
            }
            elsif ($testchar eq $startbrace){
                $bracecounter-=1;
            }
            
        }
        print OUTPUT $outputtxt;
        #print $outputtxt;
        $onceonly+=1;
    }

    close $infname;
    close $outfname;
}

I have the following issues that I can't seem to get a clear explanation of:

1. Placing more than one file for an argument results in new files ending in .txt, as expected, but they are all blank.
2. I had an issue that I resolved with the 'onceonly' var, where the file would have multiple iterations of the text in the file - somewhere around 50-60 times, but never the same number twice. I checked this with line number counts, and yes, I cleared the files between each test.
3. I know I can probably do this with modules, but as I'm still green with Perl anyway, I thought to do this in the basic syntax without clouding problems with module interactions. This is the next project, creating this script in a module form.
4. I had originally tried using a regex for this, but I couldn't get one to do nested tags.
Thanks for any suggestions!
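
For reference, issues 1 and 2 both look like side effects of the outer while(<>) loop: it runs its body once per input line rather than once per file, so without $onceonly the whole file is re-processed and appended for every line, and with $onceonly (which is never reset) only the first file gets any output. Below is a minimal sketch of one way to restructure this, not a definitive fix. It assumes each file fits in memory and that every '>' closes a tag; the strip_tags helper and the whitespace cleanup rules are illustrative and not part of the script above. It loops over @ARGV directly, slurps each file once, and keeps the same brace-counting idea but scans forward.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop everything between '<' and '>' by tracking
# nesting depth, so a tag that contains another tag is removed whole.
sub strip_tags {
    my ($html) = @_;
    my $depth = 0;
    my $text  = '';
    for my $char (split //, $html) {
        if    ($char eq '<') { $depth++ }
        elsif ($char eq '>') { $depth-- if $depth > 0 }
        elsif ($depth == 0)  { $text .= $char }
    }
    $text =~ s/&nbsp;/ /g;      # same entity replacement as the original
    $text =~ s/[ \t]+/ /g;      # collapse runs of spaces and tabs
    $text =~ s/^ +| +$//mg;     # trim leading/trailing spaces on each line
    $text =~ s/\n{2,}/\n/g;     # squeeze blank lines
    $text =~ s/\A\n+//;         # drop newlines at the very start
    return $text;
}

# Process every file named on the command line exactly once.
for my $infname (@ARGV) {
    open my $in, '<', $infname
        or die "$infname could not be opened: $!\n";
    my $html = do { local $/; <$in> };   # slurp the whole file
    close $in;

    my $outfname = "$infname.txt";
    open my $out, '>', $outfname         # '>' overwrites instead of appending
        or die "$outfname could not be opened: $!\n";
    print {$out} strip_tags($html);
    close $out or die "$outfname could not be closed: $!\n";
}
```

The depth counter is what handles a tag inside a tag, such as a JSP expression inside an attribute value. One known limitation of this sketch: a literal '>' inside a scriptlet (for example a comparison) will be mistaken for the end of the tag, so real-world pages may be better served by a proper HTML parser module.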

In reply to Simplify parsing a file by Anonymous Monk
