matching and writing multiple line blocks

MadraghRua has asked for the wisdom of the Perl Monks concerning the following question:

Re: matching and writing multiple line blocks by merlyn (Sage) on Sep 01, 2000 at 10:00 UTC
Perhaps something as simple as this:

```perl
my $name = "file000";
open STDOUT, ">>file000" or die "cannot append to file000: $!";
while (<>) {
    if (/Experiment/) {
        open STDOUT, ">>".(++$name) or die "Cannot append to $name: $!";
    }
    print;
    if (/Reagent Lot/) {
        open STDOUT, ">>file000" or die "Cannot re-append to file000: $!";
    }
}
```

This puts each new section into `file001`, `file002`, `file003`, and so on, and anything outside those sections into `file000`. Will that do, or are there other secret requirements? {grin}

-- Randal L. Schwartz, Perl hacker

Update: oops, I had an off-by-one error, and it's fixed now. Maybe I need to read up on pre-increment vs post-increment again. {grin}
RE: Re: matching and writing multiple line blocks by Adam (Vicar) on Sep 01, 2000 at 19:40 UTC
You can increment a string like that? Wow!
RE: RE: Re: matching and writing multiple line blocks by merlyn (Sage) on Sep 01, 2000 at 19:55 UTC
From `perldoc perlop`:

Auto-increment and Auto-decrement

"++" and "--" work as in C. That is, if placed before a variable, they increment or decrement the variable before returning the value, and if placed after, increment or decrement the variable after returning the value.

The auto-increment operator has a little extra builtin magic to it. If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has been used in only string contexts since it was set, and has a value that is not the empty string and matches the pattern `/^[a-zA-Z]*[0-9]*$/`, the increment is done as a string, preserving each character within its range, with carry:

```perl
print ++($foo = '99');   # prints '100'
print ++($foo = 'a0');   # prints 'a1'
print ++($foo = 'Az');   # prints 'Ba'
print ++($foo = 'zz');   # prints 'aaa'
```

The auto-decrement operator is not magical.

-- Randal L. Schwartz, Perl hacker
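To see how the magic increment applies to filenames like the ones in merlyn's snippet, here is a minimal sketch. The helper name `next_names` is made up for illustration; only the `++` behaviour comes from the perlop text quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the next $count names
# after $start, relying on Perl's magical string increment.
sub next_names {
    my ($start, $count) = @_;
    my $name = $start;    # only ever used as a string, so ++ stays "magic"
    my @names;
    push @names, ++$name for 1 .. $count;
    return @names;
}

print join(", ", next_names("file000", 3)), "\n";   # file001, file002, file003
print join(", ", next_names("file009", 2)), "\n";   # file010, file011
```

One thing to watch: the carry continues into the letters, so incrementing `file999` gives `filf000` rather than `file1000`. With three digits that only matters once there are more than 999 sections.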
Re: matching and writing multiple line blocks by Adam (Vicar) on Sep 01, 2000 at 04:28 UTC
I'm a bit confused by your post. Do you already know what the repeating block looks like? Or are you trying to detect repeat blocks? And if you are looking for repeat blocks, do you know the end points? Are they always "Experiment" and "Reagent Lots"?

```perl
# 1. open the file
open FILEHANDLE, $filename or die $!;
# 2. look for my first repeated block
while( <FILEHANDLE> ) {
    if( /(some-regex-for-the-repeat-block)/ ) {
        # 3. save this first repeated block to a variable (which is best?)
        # 4. open the first output file
        open TEMP, "> $outfile" or die $!;
        # 5. write the code to this file and close it
        print TEMP $1, $/;
        close TEMP or die $!;
        # 6. repeat to end of file
    }
}
```

Another idea might be to undef `$/` so that you can read the infile as a scalar and apply your regex to the whole thing:

```perl
{ # note the {} brackets. This concerns the scope of the next line.
    local $/; # sets $/ = undef for this block only.
    open FH, $infile or die $!;
    $_ = <FH>; # Reads the entire file into memory.
    close FH or die $!;
    @_ = m/(some-regex-for-the-block)/gm;
}
# now @_ contains an array of matches for the regex.
# write each one to a different file:
my $filenum = 0;
for( @_ ) {
    ++$filenum;
    open FH, ">MATCH$filenum" or die "Failed to open MATCH$filenum, $!";
    print FH $_, $/;
    close FH or die $!;
}
```
Re: matching and writing multiple line blocks by chromatic (Archbishop) on Sep 01, 2000 at 06:20 UTC
Something like the following might make a good regex. Slurp the file as Adam suggests: `@_ = m/(Experiment.*?Reagent Lot)/gs;`

It's unclear how you decide the resulting filenames, though. If you have a base name, and just increment a counter, you could do:

```perl
my $i = 0;
foreach my $block (@_) {
    $i++;
    # open for writing (note the ">"); skip this block if the open fails
    open(OUTPUT, ">$file$i") or do { warn "Can't open file: $!"; next };
    print OUTPUT $block;
    close OUTPUT;
}
```

Two caveats.

First, if your input file is large, slurp mode will take up a lot of memory, and you'll have to do line-by-line processing. Search for the first token, then keep reading through the file looking for the end token. For every line that isn't the end token, push it on an array or append it to a string to save. When you hit the end token, drop out of the inner loop.

Second, if there are many lines between the starting and ending tokens, or if the input file is large, the regex will be very slow. (It has to backtrack quite a ways.) You might be better off going line-by-line if you're dealing with moderately large data files.
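The line-by-line approach described in the first caveat could look roughly like this. It is a sketch, not tested against the real data: the helper name `extract_blocks`, the anchored patterns, and the sample text are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $fh and return every block that
# starts at a line matching $start_re and ends at a line matching $end_re.
sub extract_blocks {
    my ($fh, $start_re, $end_re) = @_;
    my @blocks;
    my $current;                            # undef while outside a block
    while (my $line = <$fh>) {
        if (!defined $current) {
            next unless $line =~ $start_re; # still looking for the start token
            $current = '';
        }
        $current .= $line;                  # save the line, start and end included
        if ($line =~ $end_re) {             # the end token closes the block
            push @blocks, $current;
            undef $current;
        }
    }
    return @blocks;
}

# Usage, with an in-memory filehandle standing in for the real input file.
my $sample = "junk\nExperiment A\nvalue 1\nReagent Lot 7\n\n"
           . "Experiment B\nReagent Lot 8\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my @blocks = extract_blocks($in, qr/^Experiment/, qr/^Reagent Lot/);
close $in;
printf "%d blocks found\n", scalar @blocks;
```

Only one block is held in memory while it is being collected, and no backtracking is involved, which is the point of both caveats. A block that never reaches its end token is silently dropped here. Whether that is the right behaviour depends on the data.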
Re: matching and writing multiple line blocks by MadraghRua (Vicar) on Sep 01, 2000 at 20:41 UTC
I have three sets of repeating blocks of data. Each one has a start word and an end word. I'm showing these and the sizes below:

Block 1
Start: Experiment
End: Reagent Lot
Size: 10 lines followed by one blank line

Block 2
Start: Algorithm Parameters
End: FCMax
Size: 5 lines followed by one blank line

Block 3
Start: Experiment
End: Pixels
Size: 8 lines followed by one blank line

So what happens is for four sets of data, there will be four sets of block 1, followed by four sets of block 2, followed by four sets of block 3, followed by more stuff I was going to try and figure out before bothering you kind monks again :).

I had already thought out a shortcut: every block of a given kind is the same size, so I simply needed to know how many experiments there were, say 4, and then count down through each of the blocks, as I already know the order and size of each. So some simple addition and division could get me where I needed to be. What I was pretty clueless about was how to match the first block and somehow write it to an output file, repeat for all the Block 1s, then do the same for all the Block 2s, etc.

I do want to learn how to do multiple line extractions of text. It is a good general technique and I can use it for other stuff. Thank you! Now I'll go play with your suggestions...

MadraghRua
yet another biologist hacking perl....
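Since every block in this description is followed by a blank line, one option is Perl's paragraph mode (`$/ = ""`), which reads a whole blank-line-delimited chunk at a time. The sketch below rests on assumptions the thread does not confirm: that the start word begins the first line of a block, that the end word begins its last line, and that the blank lines are truly empty. The labels and helper names are invented for illustration.

```perl
use strict;
use warnings;

# Assumed block definitions, taken from the description above:
# [label, start word, end word]. The labels are hypothetical.
my @BLOCK_TYPES = (
    [ 'block1', 'Experiment',           'Reagent Lot' ],
    [ 'block2', 'Algorithm Parameters', 'FCMax'       ],
    [ 'block3', 'Experiment',           'Pixels'      ],
);

# Hypothetical helper: classify one paragraph by its first and last lines.
sub classify_block {
    my ($para) = @_;
    my @lines = grep { /\S/ } split /\n/, $para;
    return unless @lines;
    for my $type (@BLOCK_TYPES) {
        my ($label, $start, $end) = @$type;
        return $label
            if $lines[0] =~ /^\s*\Q$start\E/ && $lines[-1] =~ /^\s*\Q$end\E/;
    }
    return;    # not one of the known blocks
}

# Hypothetical helper: read $fh in paragraph mode and group blocks by label.
sub split_blocks {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my %found;
    while (my $para = <$fh>) {
        my $label = classify_block($para) or next;
        push @{ $found{$label} }, $para;
    }
    return \%found;
}

# Usage, with made-up sample text in place of the real data file.
my $sample = "Experiment: one\nfoo\nReagent Lot: 1\n\n"
           . "Algorithm Parameters\nx\nFCMax: 3\n\n"
           . "Experiment: one\nPixels: 9\n\nmore stuff\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $found = split_blocks($in);
close $in;
printf "%s: %d\n", $_, scalar @{ $found->{$_} } for sort keys %$found;
```

Blocks 1 and 3 share the start word "Experiment", so the end word is what tells them apart. Checking both ends of each paragraph handles that without counting lines. Each entry in the returned hash could then be written to its own output file.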