
Re: parsing of large files

by arturo (Vicar)
on Mar 19, 2003 at 17:40 UTC (#244395)

in reply to parsing of large files

You don't need to have your script "remember" all the lines in the current section (which I take it is the issue you want to solve), as long as you are judicious in your use of filehandles. My initial thought is along the following lines:

my $outfilename = "outfilename";
open INFILE, "$input_file" or die "Can't open $input_file: $!\n";
while (<INFILE>) {
    if ( /section-end-marker/ ) {
        close OUTFILE;
        next;
    }
    if ( /section-start-marker/ ) {
        # generate the new $outfilename however
        open OUTFILE, "> $outfilename" or die "$!\n";
        next;
    }
    print OUTFILE;
}

That's very basic, but the idea is that you print each line to the currently open filehandle, with two exceptions: when you find the section start marker, you open the output file (on that filehandle), and when you find the section end marker, you close the currently open filehandle.


Update: OK, two people have failed to notice that the code is not to be used "as is": it is a skeleton upon which to build a functioning script. I left this implicit by putting comments where there would, in an actual script, be functioning code. I now make that implicit warning explicit.
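For readers who want something runnable, here is one way the skeleton might be filled in. It is only a sketch: the marker patterns are the placeholders from the code above, and the `section-NNN.txt` naming scheme, the `split_sections` name, and the choice to silently skip lines outside any section are all assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Split the lines read from $in into one file per section under $dir.
# ASSUMPTIONS: the marker regexes and the "section-NNN.txt" naming scheme
# are placeholders chosen for illustration.
sub split_sections {
    my ($in, $dir) = @_;
    my ($out, $count, @files) = (undef, 0);
    while (my $line = <$in>) {
        if ($line =~ /section-end-marker/) {
            if ($out) { close $out or die "Can't close output: $!\n"; }
            undef $out;                  # no section is open any more
            next;
        }
        if ($line =~ /section-start-marker/) {
            # A fresh name per section, so earlier output is not overwritten.
            my $name = sprintf "%s/section-%03d.txt", $dir, ++$count;
            open $out, '>', $name or die "Can't open $name: $!\n";
            push @files, $name;
            next;
        }
        print {$out} $line if $out;      # skip lines outside any section
    }
    if ($out) { close $out or die "Can't close output: $!\n"; }
    return @files;                       # names of the files written
}
```

Called as `split_sections($fh, $dir)` with an open input filehandle and an existing directory, it keeps only one line in memory at a time, which is the point of the approach.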

If not P, what? Q maybe?
"Sidney Morgenbesser"

Replies are listed 'Best First'.
Re: Re: parsing of large files
by JaWi (Hermit) on Mar 19, 2003 at 18:14 UTC
    Hi arturo,

    The solution you presented will overwrite the output file on each occurrence of the section start. Furthermore, writing to closed filehandles isn't a very clean solution, IMHO.

    You could try to use the magic '..' operator:

    open OUT, '>&STDOUT' or die;
    while ( <DATA> ) {
        print OUT if /start-marker/../end-marker/ and !/(start-marker|end-marker)/;
    }
    close OUT or die;
    __DATA__
    a
    b
    c
    start-marker
    d
    e
    end-marker
    f
    g
    start-marker
    h
    i
    end-marker
    j
    k

    This will print the lines within the markers (thus: d, e, h and i in my example) but skips the marker lines themselves.


    (update: fixed some layout issues and used the actual '..' operator instead of the '...' one!).
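The difference between '..' and '...' mentioned in the update only shows up when a single line satisfies both the start and the end test. A small sketch, assuming a made-up `---` separator that both opens and closes a block (this data is not from the thread): with '..' the right-hand test is tried on the same line that turned the range on, while '...' waits for the next line.

```perl
use strict;
use warnings;

# Hypothetical input: the same "---" line opens and closes a block.
my @lines = ("a", "---", "b", "c", "---", "d");

# '..' tests the end condition on the very line that started the range,
# so each "---" forms a one-line range of its own.
my @two = grep { /---/ .. /---/ } @lines;      # ("---", "---")

# '...' does not test the end condition until the following line,
# so the range runs from the first "---" through the second.
my @three = grep { /---/ ... /---/ } @lines;   # ("---", "b", "c", "---")

print "two dots:   @two\n";
print "three dots: @three\n";
```

With distinct start and end markers, as in the example above, the two operators select the same lines.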

    -- JaWi

    "A chicken is an egg's way of producing more eggs."

      First, thanks!

      And: there is no "end-section" marker. A section ends where the next one starts, and a "start-section" is one of 5-6 different tags that implement some kind of hierarchy between sections.
      I could still split the big file into separate files, but in any case I need to keep a data structure with the file names, section headers, etc., so it seems like double work.

      I finally solved it with the Tie::File module, reading "line by line" from the tied array and saving, for each section, its name, its place in the hierarchy, and its start and end index.
      I ended up with one read of the whole file, and then direct access to each section by its name.

      However, I was worried about the size of the tied array, but I guess it won't be bigger than 4 or 8 bytes multiplied by the number of lines. I can live with that (and correct me if I'm wrong :-)).
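The indexing pass described above might look roughly like the following sketch. The real tags were not posted, so the header format here is an assumption: one to six `=` signs followed by the section name, with the number of signs giving the hierarchy level. The `index_sections` name and the hash layout are likewise hypothetical, and section names are assumed to be unique.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Build a name => { level, start, end } index over an array of lines.
# ASSUMPTION: a header is 1-6 '=' signs plus a name; the real tags differ.
# A section ends on the line before the next header (there is no end marker).
sub index_sections {
    my ($lines) = @_;                 # reference to the (tied) array
    my $last = $#$lines;              # index of the final line
    my (%index, $current);
    for my $i (0 .. $last) {
        my ($marks, $name) = $lines->[$i] =~ /^(={1,6})\s*(.+?)\s*$/
            or next;                  # not a header line
        $index{$current}{end} = $i - 1 if defined $current;
        $current = $name;
        $index{$name} = {
            level => length($marks),  # depth in the hierarchy
            start => $i + 1,          # first body line
            end   => $last,           # corrected when the next header is seen
        };
    }
    return \%index;
}
```

After `tie my @lines, 'Tie::File', $path, mode => O_RDONLY or die "Can't tie $path: $!\n";` and `my $idx = index_sections(\@lines);`, the body of a section can be fetched directly with `@lines[ $idx->{$name}{start} .. $idx->{$name}{end} ]`. An empty section has `start` greater than `end`, so that slice is empty.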

      thanks again!
