Simplify parsing a file

by Anonymous Monk
on Apr 02, 2007 at 17:28 UTC (#607885=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Masters of the Script-fu! I am a noob with Perl, and wrote the following code to parse through an HTML or JSP file and remove all code blocks and extraneous whitespace, leaving just the text of the page. I know there are probably easier ways to do this on a one-off basis, but I had a couple hundred files to do this to. I also needed an excuse to start learning Perl, and I wrote this using PerlMonks & the camel book. This script works, but man is it ugly.
use strict;

if($#ARGV < 0){
    die "You did not specify any files to process! : $!\n";
}
if($#ARGV > 2){
    die "This is only going to work on the first one in the list, I think. Please try again. : $!\n";
}
my $onceonly=0;
while(<>) {
    my $infname = $ARGV;
    my $outfname = $infname.".txt";
    my $inputtxt = "";
    my $outputtxt = "";
    local $/=undef;
    open (INPUT, "$infname") or die "$infname could not be opened.: $!\n";
    while (<INPUT>){
        $inputtxt .= $_;
    }
    open (OUTPUT, ">>$outfname") or die "$outfname could not be opened.: $!\n";
    while($onceonly==0) {
        my $bracecounter = 0;
        my $startbrace = '<';
        my $stopbrace = '>';
        for (my $i=length($inputtxt); $i>0 && length($inputtxt)>0; $i-=1) {
            #going to count the number of open braces I find,
            #subtracting the number of close braces I find,
            #storing the data into outputtxt when counter==0
            #looks funny because we're working backwards.
            my $testchar= chop ($inputtxt);
            if ($bracecounter == 0 && $testchar ne "<" && $testchar ne ">"){
                $outputtxt = $testchar.$outputtxt;
                $outputtxt =~ s/&nbsp;/ /g;
                $outputtxt =~ s/^\s\s//g;
            }
            elsif ($testchar eq $stopbrace){
                $bracecounter+=1;
            }
            elsif ($testchar eq $startbrace){
                $bracecounter-=1;
            }
        }
        print OUTPUT $outputtxt;
        #print $outputtxt;
        $onceonly+=1;
    }
    close $infname;
    close $outfname;
}
I have the following issues that I can't seem to get a clear explanation of:

1. Placing more than one file for an argument results in new files ending in .txt, as expected, but they are all blank.
2. I had an issue that I resolved with the 'onceonly' var, where the file would have multiple iterations of the text in the file - somewhere around 50-60 times, but never the same number twice. I checked this with line number counts, and yes, I cleared the files between each test.
3. I know I can probably do this with modules, but as I'm still green with Perl anyway, I thought to do this in the basic syntax without clouding problems with module interactions. This is the next project, creating this script in a module form.
4. I had originally tried using a regex for this, but I couldn't get one to do nested tags.
Thanks for any suggestions!

Replies are listed 'Best First'.
Re: Simplify parsing a file
by valdez (Monsignor) on Apr 02, 2007 at 17:36 UTC

    For the parsing part I would use HTML::TokeParser::Simple, written by our brother Ovid; here is an example from the documentation:

    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new( $somefile );

    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next unless $token->is_text;
        print $token->as_is;
    }

    Nice, isn't it? HTML parsing is not as easy as it may seem; relying on a well-written module is not a sin :)

    Ciao, Valerio

      Wow. That is sweet. I'll post my implementation of it when I get a chance. Still, any thoughts on the multiple-file issue, or do you think this will address that too?
Re: Simplify parsing a file
by saintly (Scribe) on Apr 02, 2007 at 18:24 UTC
    • Hmm... instead of using <> to run through the files, you could do something like:
      foreach my $in_fname (@ARGV) { ... }
    • As for a regex, you could do:
      $input_txt =~ s/\<script.*?\<\/script\>//igs;   # remove code
      $input_txt =~ s/\<.*?\>//gs;                     # remove html tags
      $input_txt =~ s/\&nbsp;/ /gs;                    # &nbsp -> space
      $input_txt =~ s/\s{2,}/ /gs;                     # Remove extra spaces
      print $input_txt;

      It's not valid HTML to have <> elements inside of other HTML tags; they should be entity-encoded, as in:
      <a href="" onClick="alert('&lt; foo! &gt;');">
      not:
      <a href="" onClick="alert('< foo! >');">    Invalid HTML!
      However, if they're there, then you can do the regex inside a while loop:
      while( $input_txt =~ s/\<.*?\>//gs ) {};
    • Using modules is probably better. Every time you reinvent the wheel, somewhere, a monk sheds a tear...
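
The per-file loop and the substitutions from the bullets above can be combined into one rough sketch. The sub name `strip_markup` and the variable names are illustrative, not from the thread. Two deliberate changes are assumptions on my part: tags are removed innermost-first with `s/<[^<>]*>//g` repeated until nothing matches (so a tag nested inside another tag, as in JSP, is consumed before its parent), and the output file is opened with `>` rather than `>>` so reruns do not append. This is still a regex heuristic, not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: strip markup from a string with regexes.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/<script.*?<\/script>//igs;   # remove script blocks first
    1 while $text =~ s/<[^<>]*>//g;         # remove tags, innermost first
    $text =~ s/&nbsp;/ /g;                  # &nbsp; -> space
    $text =~ s/\s{2,}/ /g;                  # collapse runs of whitespace
    return $text;
}

# Process each file named on the command line exactly once.
foreach my $in_fname (@ARGV) {
    open(my $in, '<', $in_fname)
        or die "$in_fname could not be opened: $!\n";
    my $input_txt = do { local $/; <$in> };  # slurp the whole file
    close $in;

    my $out_fname = "$in_fname.txt";
    open(my $out, '>', $out_fname)
        or die "$out_fname could not be opened: $!\n";
    print {$out} strip_markup($input_txt);
    close $out;
}
```

Because the loop is over `@ARGV` rather than `<>`, each file is opened and written once, no matter how many lines it has.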
      "instead of using <> to run through the files, you could do something like:"

      That's excellent. I'll give that a shot. But can you tell me why the camel book says to do it this way if it doesn't really work?

      For the regex part, you're right that that would work for standard HTML, but I have to parse JSP pages too, and they are allowed to nest tags like that. I'll try your while loop and see how that works, though.

      As for the sadness of monks, I apologize. The reason I did it this way was to familiarize myself with syntax, standard perl functions, and little gotchas that are inherent in any language. Once I feel proficient with this skill, I'll be happy to keep you guys smiling :)
        The syntax
        while( <> ) { }
        will 'automagically' run through every line of each of the files the user specified on the command line. It is roughly equivalent to:
        foreach $ARGV (@ARGV) {
            open( THISFILE, $ARGV ) || next;
            while( $_ = <THISFILE> ) {
            }
        }
        The loop body runs once for every line read, which is why your script was executing many times per file. UNIX programs traditionally don't automatically create new files from their input; they just slurp in all the input files they're given and then print output directly to STDOUT so it can be redirected. Consider:
        $ grep 'foo' thisfile.txt thisOtherFile.txt
        Which will run through all the files it's given and print matching lines. Perl allows the programmer to do this kind of task easily with the <> syntax.
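
To see this behaviour directly, here is a small sketch that counts how many times the body of a `while (<>)` loop runs for each file. The sub name `count_iterations` is hypothetical; it localizes `@ARGV` so that `<>` reads the files passed to the sub instead of the real command line.

```perl
use strict;
use warnings;

# Hypothetical helper: count loop-body executions per input file.
sub count_iterations {
    my @files = @_;
    return {} unless @files;      # with an empty @ARGV, <> would read STDIN
    my %runs;
    local @ARGV = @files;         # <> takes its file list from @ARGV
    while (my $line = <>) {
        $runs{$ARGV}++;           # $ARGV holds the name of the file being read
    }
    return \%runs;
}

# Report one count per file named on the command line.
my $runs = count_iterations(@ARGV);
print "$_: $runs->{$_} iterations\n" for sort keys %$runs;
```

With the default line-at-a-time input separator, the count for each file matches its number of lines, so any per-file work placed inside the loop body is repeated that many times.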
Re: Simplify parsing a file
by betterworld (Curate) on Apr 02, 2007 at 18:37 UTC
    On a side note, $! is only meaningful after doing something with the operating system, such as opening files. In code like this:
    if($#ARGV < 0){ die "You did not specify any files to process! : $!\n"; }
    there won't be any useful information in $!.
      Oh, I thought that would give me info about what caused it to die. Now that I've looked it up again, do you think $@ is really what I should use?

        No. Why would it be? There's no system error. There's merely a usage error.

        You didn't make a system call that failed ($!) and there's no caught exception ($@). Your program didn't get called with any arguments, which your code detects itself, and there's no more information to provide about the source of the error--you know exactly what the source of the error is!

        Now I prefer to write that code as:

        die "No files to process\n" unless @ARGV;

        ... mostly because I prefer the boolean check of the number of elements in @ARGV to checking the number of the last array index.
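
The three cases discussed above can be put side by side in a short sketch. The sub names `check_args`, `try_open`, and `usage_error_message` are made up for illustration: a usage error carries its own message, `$!` is read immediately after a failed system call, and `$@` is read only after an `eval` has caught an exception.

```perl
use strict;
use warnings;

# Usage error: no system call failed, so the message stands on its own.
sub check_args {
    my @args = @_;
    die "No files to process\n" unless @args;
    return scalar @args;
}

# System error: $! is meaningful right after the failed open.
sub try_open {
    my ($path) = @_;
    open(my $fh, '<', $path) or return "cannot open $path: $!";
    close $fh;
    return "ok";
}

# Caught exception: $@ holds whatever the eval block died with.
sub usage_error_message {
    my @args = @_;
    my $ok = eval { check_args(@args); 1 };
    return $ok ? '' : $@;
}

print usage_error_message();   # reports the usage error without touching $!
```

Checking the return value of `eval` (the trailing `1`) rather than testing `$@` directly is a common defensive habit, since `$@` can be cleared or overwritten before it is inspected.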

Re: Simplify parsing a file
by myrrdyn (Novice) on Apr 02, 2007 at 17:34 UTC
    whoops, got logged out as I was writing this.
      My new source is below. This works fine on HTML, but not so great on the JSP. I'll get to work on that part myself. Thanks to everyone!
      use strict;
      use HTML::TokeParser::Simple;

      if($#ARGV < 0){
          die "You did not specify any files to process! : $@\n";
      }
      foreach my $infname (@ARGV) {
          my $outfname = $infname.".txt";
          my $inputtxt = HTML::TokeParser::Simple->new($infname);
          my $outputtxt = "";
          #this section removes the code
          while(my $token = $inputtxt->get_token){
              next unless $token->is_text;
              $outputtxt.= $token->as_is;
          }
          #this section removes whitespace
          $outputtxt =~ s/&nbsp;/ /g; #HTML special space char
          $outputtxt =~ s/\s\s\s//mg; #tabs (mostly) and newlines
          open (OUTPUT, ">>$outfname") or die "$outfname could not be opened.: $@ *_* $!\n";
          print OUTPUT $outputtxt;
          close $infname;
          close $outfname;
      }

Node Type: perlquestion [id://607885]
Approved by Old_Gray_Bear