MadraghRua has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,
I want to open one file, look for repeated multi-line blocks of text, and write each block to a separate file. So if we have four blocks of text, each starting with the word 'Experiment' and ending with the words 'Reagent Lot', I would like to write the four blocks to four separate files.

What I want to do is

  1. open the file
  2. look for my first repeated block
  3. save this first repeated block to a variable (which is best?)
  4. open the first output file
  5. write the code to this file and close it
  6. repeat to end of file
Does anyone have a suggestion as to how best to go about this?

Thank you for your help

MadraghRua
yet another biologist hacking perl....

Re: matching and writing multiple line blocks
by merlyn (Sage) on Sep 01, 2000 at 10:00 UTC
    Perhaps something as simple as this:
    my $name = "file000";
    open STDOUT, ">>file000" or die "cannot append to file000: $!";
    while (<>) {
        if (/Experiment/) {
            open STDOUT, ">>".(++$name) or die "Cannot append to $name: $!";
        }
        print;
        if (/Reagent Lot/) {
            open STDOUT, ">>file000" or die "Cannot re-append to file000: $!";
        }
    }
    This puts each new section into file001, file002, file003, and so on, and anything outside those sections into file000.

    Will that do, or are there other secret requirements? {grin}

    -- Randal L. Schwartz, Perl hacker


    Update: oops, I had an off-by-one error, and it's fixed now. Maybe I need to read up on pre-increment vs post-increment again. {grin}
      You can increment a string like that? Wow!
        From perldoc perlop:
        Auto-increment and Auto-decrement

        "++" and "--" work as in C. That is, if placed before a variable, they increment or decrement the variable before returning the value, and if placed after, increment or decrement the variable after returning the value.

        The auto-increment operator has a little extra builtin magic to it. If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has been used in only string contexts since it was set, and has a value that is not the empty string and matches the pattern /^[a-zA-Z]*[0-9]*$/, the increment is done as a string, preserving each character within its range, with carry:

            print ++($foo = '99');   # prints '100'
            print ++($foo = 'a0');   # prints 'a1'
            print ++($foo = 'Az');   # prints 'Ba'
            print ++($foo = 'zz');   # prints 'aaa'

        The auto-decrement operator is not magical.

        -- Randal L. Schwartz, Perl hacker
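The magic increment quoted above is what turns "file000" into "file001" in the earlier snippet. A minimal sketch of that behaviour, assuming the counter is only ever used as a string; `next_names` is a hypothetical helper invented for illustration, not something from the thread:

```perl
use strict;
use warnings;

# next_names is a hypothetical helper: it returns the next $count names
# after $start, produced by Perl's magic string increment.
sub next_names {
    my ($start, $count) = @_;
    my $name = $start;    # never used as a number, so ++ increments it as a string
    my @names;
    push @names, ++$name for 1 .. $count;
    return @names;
}

print join(", ", next_names("file000", 3)), "\n";   # file001, file002, file003
print join(", ", next_names("file009", 2)), "\n";   # file010, file011
```

Using the variable in numeric context (for example `$name + 0`) would switch it to ordinary numeric increment, so the sketch keeps it strictly as a string.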

Re: matching and writing multiple line blocks
by Adam (Vicar) on Sep 01, 2000 at 04:28 UTC
    I'm a bit confused by your post. Do you already know what the repeating block looks like, or are you trying to detect repeated blocks? And if you are looking for repeated blocks, do you know the end points? Are they always "Experiment" and "Reagent Lot"?

    # 1. open the file
    open FILEHANDLE, $filename or die $!;

    # 2. look for my first repeated block
    # (read line by line, so the regex only ever sees one line at a time)
    while( <FILEHANDLE> ) {
        if( /(some-regex-for-the-repeat-block)/ ) {
            # 3. save this first repeated block to a variable (which is best?)
            #    here the capture variable $1 holds the match
            # 4. open the first output file
            open TEMP, "> $outfile" or die $!;
            # 5. write the code to this file and close it
            print TEMP $1, $/;
            close TEMP or die $!;
        }
        # 6. repeat to end of file
    }
    Another idea might be to undef $/ so that you can read the infile as a scalar and apply your regex to the whole thing:
    {   # note the {} brackets. This concerns the scope of the next line.
        local $/;    # sets $/ = undef for this block only.
        open FH, $infile or die $!;
        $_ = <FH>;   # Reads the entire file into memory.
        close FH or die $!;
        @_ = m/(some-regex-for-the-block)/gm;
    }
    # now @_ contains an array of matches for the regex.
    # write each one to a different file:
    my $filenum = 0;
    for( @_ ) {
        ++$filenum;
        open FH, ">MATCH$filenum" or die "Failed to open MATCH$filenum, $!";
        print FH $_, $/;
        close FH or die $!;
    }
Re: matching and writing multiple line blocks
by chromatic (Archbishop) on Sep 01, 2000 at 06:20 UTC
    Something like the following might make a good regex. Slurp the file as Adam suggests:

    @_ = m/(Experiment.*?Reagent Lot)/gs;

    It's unclear how you decide the resulting filenames, though. If you have a base name, and just increment a counter, you could do:

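    my $i = 0;
    foreach my $block (@_) {
        $i++;    # bump the counter so each block gets its own file
        # the leading ">" opens the file for writing rather than reading
        open(OUTPUT, ">$file$i") or do { warn "Can't open file: $!"; next };
        print OUTPUT $block;
        close OUTPUT;
    }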
    Two caveats. First, if your input file is large, slurp mode will take up a lot of memory, and you'll have to process the file line by line instead. Search for the start token, then keep reading lines in an inner loop, looking for the end token. For every line that isn't the end token, push it on an array or append it to a string to save. When you hit the end token, drop out of the inner loop.

    Second, if there are many lines between the starting and ending tokens, or if the input file is large, the regex will be very slow. (It has to backtrack quite a ways.) You might be better off going line-by-line if you're dealing with moderately large data files.
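The line-by-line approach described in the first caveat can be sketched as follows. This is only one way to do it, under two assumptions: the start and end tokens appear at the beginning of a line, and the end-token line belongs to the block. `extract_blocks` is a hypothetical helper and the sample data is invented.

```perl
use strict;
use warnings;

# extract_blocks is a hypothetical helper: it reads lines from a filehandle
# and returns every block from a line matching $start through the next
# line matching $end, inclusive.
sub extract_blocks {
    my ($fh, $start, $end) = @_;
    my @blocks;
    my $current;                          # undef while we are outside a block
    while (my $line = <$fh>) {
        if (!defined $current) {
            next unless $line =~ $start;  # skip lines until the start token
            $current = '';
        }
        $current .= $line;                # append every line inside the block
        if ($line =~ $end) {              # end token: the block is complete
            push @blocks, $current;
            undef $current;
        }
    }
    return @blocks;
}

# Demo on an in-memory file with invented sample data.
my $data = join "\n",
    "header",
    "Experiment: A", "Probe: x", "Reagent Lot: 1",
    "",
    "Experiment: B", "Probe: y", "Reagent Lot: 2",
    "";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @blocks = extract_blocks($in, qr/^Experiment/, qr/^Reagent Lot/);
close $in;
print scalar(@blocks), " blocks found\n";   # 2 blocks found
```

Only one block is held in memory at a time while reading, so this avoids both the memory cost of slurping and the long regex scan over a large string. Each returned block could then be written to its own file with an incrementing name.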

Re: matching and writing multiple line blocks
by MadraghRua (Vicar) on Sep 01, 2000 at 20:41 UTC
    I have three sets of repeating blocks of data. Each one has a start word and an end word. I'm showing these and the sizes below:
    Block 1
    Start: Experiment 
    End  : Reagent Lot
    Size : 10 lines followed by one blank line
    Block 2
    Start : Algorithm Parameters
    End   : FCMax
    Size  : 5 lines followed by one blank line
    Block 3
    Start : Experiment
    End   : Pixels
    Size  : 8 lines followed by one blank line
    

    So for four sets of data, there will be four sets of block 1, followed by four sets of block 2, followed by four sets of block 3, followed by more stuff I was going to try and figure out before bothering you kind monks again :).

    I had already thought out a short cut - all blocks are the same size, so I simply needed to know how many experiments there were, say 4, and then count down through each of the blocks as I already know the order and size of each. So some simple addition and division could get me where I needed to be.

    What I was pretty clueless about was how to match the first block and somehow write it to an output file, repeat for all the Block 1s, then do the same for all the Block 2, etc. I do want to learn how to do multiple line extractions of text - it is a good general technique and I can use it for other stuff - Thank you!

    Now I'll go play with your suggestions...
    MadraghRua
    yet another biologist hacking perl....
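Since every block described above is followed by a blank line, one possible approach is Perl's paragraph mode (`$/ = ""`), which reads one blank-line-separated chunk per record. The sketch below classifies each chunk by its first and last lines. It assumes the start and end words begin those lines, which the post does not confirm. `classify`, `split_blocks`, `write_blocks`, the output naming scheme and the shortened sample data are all invented for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Start and end words for each block type, taken from the post above.
my @types = (
    [ block1 => qr/^\s*Experiment/,           qr/^\s*Reagent Lot/ ],
    [ block2 => qr/^\s*Algorithm Parameters/, qr/^\s*FCMax/       ],
    [ block3 => qr/^\s*Experiment/,           qr/^\s*Pixels/      ],
);

# classify (hypothetical): name the block type of one paragraph by checking
# its first and last non-blank lines; returns undef if no type fits.
sub classify {
    my ($para) = @_;
    my @lines = grep { /\S/ } split /\n/, $para;
    return unless @lines;
    for my $type (@types) {
        my ($name, $start, $end) = @$type;
        return $name if $lines[0] =~ $start && $lines[-1] =~ $end;
    }
    return;
}

# split_blocks (hypothetical): read paragraphs and group them by type.
sub split_blocks {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: records end at one or more blank lines
    my %found;
    while (my $para = <$fh>) {
        my $name = classify($para);
        push @{ $found{$name} }, $para if defined $name;
    }
    return \%found;
}

# write_blocks (hypothetical): write each paragraph to its own file,
# e.g. block1_1.txt. The naming scheme is an assumption.
sub write_blocks {
    my ($found, $dir) = @_;
    my @written;
    for my $name (sort keys %$found) {
        my $n = 0;
        for my $para (@{ $found->{$name} }) {
            my $path = sprintf "%s/%s_%d.txt", $dir, $name, ++$n;
            open my $out, '>', $path or die "Cannot write $path: $!";
            print {$out} $para;
            close $out or die "Cannot close $path: $!";
            push @written, $path;
        }
    }
    return @written;
}

# Demo with invented, shortened sample data written to a temporary directory.
my $sample = join "\n",
    "Experiment: A", "Probe Set: demo", "Reagent Lot: 1", "",
    "Experiment: B", "Probe Set: demo", "Reagent Lot: 2", "",
    "Algorithm Parameters", "Alpha: 0.1", "FCMax: 3", "",
    "Experiment: A", "Scan: 1", "Pixels: 24", "", "";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $found = split_blocks($in);
close $in;
my @files = write_blocks($found, tempdir(CLEANUP => 1));
printf "%s: %d\n", $_, scalar @{ $found->{$_} } for sort keys %$found;
print scalar(@files), " files written\n";
```

Checking both the first and the last line is what separates block 1 from block 3, since both start with "Experiment". Paragraphs that match none of the three types are skipped rather than written anywhere.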