PerlMonks  

Hello Masters of the Script-fu! I am new to Perl and wrote the following code to parse an HTML or JSP file and remove all code blocks and extraneous whitespace, leaving just the text of the page. I know there are probably easier ways to do this on a one-off basis, but I had a couple hundred files to process. I also needed an excuse to start learning Perl, and I wrote this using PerlMonks and the Camel book. The script works, but man, is it ugly.
use strict;

if($#ARGV < 0){
    die "You did not specify any files to process! : $!\n";
}
if($#ARGV > 2){
    die "This is only going to work on the first one in the list, I think. Please try again. : $!\n";
}
my $onceonly=0;
while(<>)
{
    my $infname = $ARGV;
    my $outfname = $infname.".txt";
    my $inputtxt = "";
    my $outputtxt = "";
    local $/=undef;
    open (INPUT, "$infname") or die "$infname could not be opened.: $!\n";
    while (<INPUT>){
        $inputtxt .= $_;
    }
    open (OUTPUT, ">>$outfname") or die "$outfname could not be opened.: $!\n";
    while($onceonly==0)
    {
        my $bracecounter = 0;
        my $startbrace = '<';
        my $stopbrace = '>';
        for (my $i=length($inputtxt); $i>0 && length($inputtxt)>0; $i-=1)
        {
            #going to count the number of open braces I find,
            #subtracting the number of close braces I find,
            #storing the data into outputtxt when counter==0
            #looks funny because we're working backwards.
            my $testchar= chop ($inputtxt);
            if ($bracecounter == 0 && $testchar ne "<" && $testchar ne ">"){
                $outputtxt = $testchar.$outputtxt;
                $outputtxt =~ s/&nbsp;/ /g;
                $outputtxt =~ s/^\s\s//g;
            }
            elsif ($testchar eq $stopbrace){
                $bracecounter+=1;
            }
            elsif ($testchar eq $startbrace){
                $bracecounter-=1;
            }
        }
        print OUTPUT $outputtxt;
        #print $outputtxt;
        $onceonly+=1;
    }
    close $infname;
    close $outfname;
}
I have the following issues that I can't seem to get a clear explanation of:

1. Passing more than one file as arguments creates new files ending in .txt, as expected, but they are all blank.
2. Before I added the 'onceonly' var, the output file contained the text of the input file repeated many times: somewhere around 50-60 times, but never the same number twice. I checked this with line number counts, and yes, I cleared the files between each test.
3. I know I can probably do this with modules, but as I'm still green with Perl anyway, I thought to do this in the basic syntax without clouding problems with module interactions. This is the next project, creating this script in a module form.
4. I had originally tried using a regex for this, but I couldn't get one to do nested tags.
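
For comparison, here is a minimal sketch of a per-file version. It rests on an assumption about issues 1 and 2: that `while(<>)` runs its body once per input line rather than once per file, so the file gets reprocessed for every line, and `$onceonly` then blocks every pass after the first. The helper name `strip_tags` and the forward scan with a depth counter are illustrative choices, not part of the script above, and the whitespace cleanup is simplified.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only characters that sit outside <...> pairs.
# The depth counter lets a tag that contains another tag be skipped whole.
sub strip_tags {
    my ($text) = @_;
    my $depth = 0;
    my $out   = '';
    for my $char (split //, $text) {
        if    ($char eq '<') { $depth++ }
        elsif ($char eq '>') { $depth-- if $depth > 0 }
        elsif ($depth == 0)  { $out .= $char }
    }
    $out =~ s/&nbsp;/ /g;      # turn non-breaking-space entities into spaces
    $out =~ s/^\s+//mg;        # drop leading whitespace and blank lines
    $out =~ s/[ \t]{2,}/ /g;   # squeeze runs of spaces and tabs
    return $out;
}

# One pass per file name, instead of one pass per input line.
for my $infname (@ARGV) {
    my $outfname = "$infname.txt";
    open my $in, '<', $infname or die "$infname could not be opened: $!\n";
    my $inputtxt = do { local $/; <$in> };   # slurp the whole file
    close $in;
    open my $out, '>', $outfname or die "$outfname could not be opened: $!\n";
    print {$out} strip_tags($inputtxt);
    close $out;
}
```

Opening the output with '>' rather than '>>' means a rerun overwrites the old result instead of appending to it. The counter still assumes every '<' and '>' belongs to a tag, so a bare comparison such as `a > b` inside a JSP scriptlet would throw the depth off.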
Thanks for any suggestions!

In reply to Simplify parsing a file by Anonymous Monk
