Update:

I guess I didn't make the idea very clear. There are, of course, many ways to tackle this problem, and as the OP suggested, I could have coded a set-a-flag or retain-a-token solution and posted it there.

The idea came to me that if you want to extract a range of lines from a file, you are often advised to use the flip-flop operator

perl -nle "100 .. 200 and print;"

Neat, quick and simple.
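The same idiom works with regex endpoints as well as line numbers. A minimal, self-contained sketch (the BEGIN/END markers here are made up purely for illustration):

```perl
use strict;
use warnings;

my @lines = ( "intro", "BEGIN", "one", "two", "END", "outro" );
my @captured;

for (@lines) {
    # the range operator keeps its own state between iterations:
    # it flips true on /^BEGIN$/ and flops back false after /^END$/
    push @captured, $_ if /^BEGIN$/ .. /^END$/;
}

print "@captured\n";    # BEGIN one two END
```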

This particular problem seemed analogous to me. You want to start capturing at some point and stop at another. I had envisaged that something like

for (@data) {
    /^HEADING/ ... !/^HEADING/ and push @headings, $_;
    /^TITLE/   ... !/^TITLE/   and push @titles,   $_;
    /^COMPND/  ... !/^COMPND/  and push @cmpnds,   $_;
}

might be possible. And that this could be generalised into

$re_types = qr/(HEADING|TITLE|COMPND)/;
for (@data) {
    /^$re_types(.*)$/ ... !/^$1/ and push @{$type{$1}}, $2;
}

Which I think would be useful and usable.

The OP's requirement to concatenate the lines complicates this somewhat, but the idea still struck me as interesting. However, I still have problems using the bi-stable operator with anything other than constant operands, and thought that by posting it here in its half-finished state, someone might see how to move the idea on from where I came unstuck.

So, the following code isn't attempting to solve the OP's original problem, nor is it presented as a working or usable piece of code; hence the "needs work" in the title. Its purpose is purely to serve as a demonstrator for an idea and the problems encountered trying to implement it.

Maybe it's simply a solution looking for a problem that can be solved in other ways, but I like TIMTOWTDI ;)

</update>

This started off as a tentative, generalised solution to Finding first block of contiguous elements in an array, but it's way too messy and inscrutable to post there. Still, I think it has potential.

It is currently, as Aristotle has a habit of saying about my kludges--Yuk!

If you're one of those people... (shades of "It'll be Alright on the Night" :^)... who have done all their Xmas shopping and are sitting at home on a wet weekend with nothing to do, maybe you can see how to generalise and tidy this up.

In particular, if anyone can explain why the last (compound) statement in the loop acts differently from the first two such statements, I'd love to know, because it's got me foxed.

#! perl -slw
use strict;

my @data = <DATA>;
my( @headers, @titles, @compnds );

for (@data) {
    (/^HEADER (?:\d+ )?(.*?)$/ and push( @headers, '' )) ... !/^HEADER (?:\d+ )?(.*?)$/
        and $1 and $headers[-1] .= $1, next;

    (/^TITLE (?:\d+ )?(.*?)$/ and push( @titles, '' )) ... !/^TITLE (?:\d+ )?(.*?)$/
        and $1 and $titles[-1] .= $1, next;

    (/^COMPND (?:\d+ )?(.*?)$/ and push( @compnds, '' )) ... !/^COMPND (?:\d+ )?(.*?)$/
        and $1 and $compnds[-1] .= $1;
}

print for @headers;
print for @titles;
print for @compnds;

__DATA__
HEADER Header 1 stuff
TITLE Title 1 stuff
TITLE 2 more title 1 stuff
COMPND complicated stuff 1
COMPND 2 continued complicated stuff 1
HEADER Header 2 stuff
TITLE Title 2 stuff
TITLE 2 more title 2 stuff
COMPND complicated stuff 2
COMPND 2 continued complicated stuff 2
HEADER Header 3 stuff
TITLE Title 3 stuff
TITLE 2 more title 3 stuff
COMPND complicated stuff 3
COMPND 2 continued complicated stuff 3
COMPND 3 continued complicated stuff 3
HEADER Header 4 stuff
TITLE Title 4 stuff
TITLE 2 more title 4 stuff
COMPND complicated stuff 4
COMPND 2 continued complicated stuff 4
HEADER Header 5 stuff
TITLE Title 5 stuff
TITLE 2 more title 5 stuff
COMPND complicated stuff 5
COMPND 2 continued complicated stuff 5
HEADER Header 6 stuff
TITLE Title 6 stuff
TITLE 2 more title 6 stuff
COMPND complicated stuff 6
COMPND 2 continued complicated stuff 6

Outputs

C:\test>221570
Header 1 stuff
Header 2 stuff
Header 3 stuff
Header 4 stuff
Header 5 stuff
Header 6 stuff
Title 1 stuff 2 more title 1 stuff
Title 2 stuff 2 more title 2 stuff
Title 3 stuff 2 more title 3 stuff
Title 4 stuff 2 more title 4 stuff
Title 5 stuff 2 more title 5 stuff
Title 6 stuff 2 more title 6 stuff
complicated stuff 1 2 continued complicated stuff 1 complicated stuff 2 2 continued complicated stuff 2 complicated stuff 3 2 continued complicated stuff 3 3 continued complicated stuff 3
complicated stuff 4 2 continued complicated stuff 4 complicated stuff 5 2 continued complicated stuff 5 complicated stuff 6 2 continued complicated stuff 6

C:\test>

I've broken the last part of the output up for posting (not very tidily) to show the problem with the last statement in the loop: the COMPND entries get run together instead of appearing one per record.


Examine what is said, not who speaks.

Re: An (almost) useful idiom; needs work.
by tachyon (Chancellor) on Dec 21, 2002 at 12:53 UTC

    That is really ugly code. Here is a short and very generic solution:

    my %hash;
    my $current_token = '';

    while (<DATA>) {
        my( $token, $value ) = $_ =~ m/^([^\s]+)\s+(.*)/;
        next unless $token;
        if ( $token eq $current_token ) {
            ${$hash{$token}}[-1] .= ' ' . $value;
        }
        else {
            $current_token = $token;
            push @{$hash{$token}}, $value;
        }
    }

    use Data::Dumper;
    print Dumper \%hash;

    __DATA__
    HEADER Header 1 stuff
    TITLE Title 1 stuff
    TITLE 2 more title 1 stuff
    COMPND complicated stuff 1
    COMPND 2 continued complicated stuff 1
    HEADER Header 2 stuff
    TITLE Title 2 stuff
    TITLE 2 more title 2 stuff
    COMPND complicated stuff 2
    COMPND 2 continued complicated stuff 2
    HEADER Header 3 stuff
    TITLE Title 3 stuff
    TITLE 2 more title 3 stuff
    COMPND complicated stuff 3
    COMPND 2 continued complicated stuff 3
    COMPND 3 continued complicated stuff 3
    HEADER Header 4 stuff
    TITLE Title 4 stuff
    TITLE 2 more title 4 stuff
    COMPND complicated stuff 4
    COMPND 2 continued complicated stuff 4
    HEADER Header 5 stuff
    TITLE Title 5 stuff
    TITLE 2 more title 5 stuff
    COMPND complicated stuff 5
    COMPND 2 continued complicated stuff 5
    HEADER Header 6 stuff
    TITLE Title 6 stuff
    TITLE 2 more title 6 stuff
    COMPND complicated stuff 6
    COMPND 2 continued complicated stuff 6
    __END__
    $VAR1 = {
              'HEADER' => [
                            'Header 1 stuff',
                            'Header 2 stuff',
                            'Header 3 stuff',
                            'Header 4 stuff',
                            'Header 5 stuff',
                            'Header 6 stuff'
                          ],
              'TITLE' => [
                           'Title 1 stuff 2 more title 1 stuff',
                           'Title 2 stuff 2 more title 2 stuff',
                           'Title 3 stuff 2 more title 3 stuff',
                           'Title 4 stuff 2 more title 4 stuff',
                           'Title 5 stuff 2 more title 5 stuff',
                           'Title 6 stuff 2 more title 6 stuff'
                         ],
              'COMPND' => [
                            'complicated stuff 1 2 continued complicated stuff 1',
                            'complicated stuff 2 2 continued complicated stuff 2',
                            'complicated stuff 3 2 continued complicated stuff 3 3 continued complicated stuff 3',
                            'complicated stuff 4 2 continued complicated stuff 4',
                            'complicated stuff 5 2 continued complicated stuff 5',
                            'complicated stuff 6 2 continued complicated stuff 6'
                          ]
            };

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: An (almost) useful idiom; needs work.
by Aristotle (Chancellor) on Dec 21, 2002 at 12:31 UTC

    Yuk! ;^)

    Btw, I write plenty of Yuk!TM code myself; it's just that I don't start coding before I have an approach I like, and then I don't post code I'm not yet satisfied with, so you rarely get to see the yucky stuff. :) You'll notice that you post a fair bit more code than I do.

    What I dislike here are two things: copy-paste and parallel arrays.

    I'm not sure what to propose to improve this, because my preferences would require quite a large reorganization and you'd end up with something very different that doesn't relate to the original issues much if at all anymore.

    Compare my approach at Re: Finding first block of contiguous elements in an array. I chose to write a central loop that untangles multiline tags first, so that is taken care of centrally. A bunch of little handlers can then do whatever they like with the collected data.
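    A rough sketch of the shape Aristotle describes (the handlers and sample data here are invented for illustration; his actual code is at the linked node). A central loop joins continuation lines into complete records, then dispatches each record to a per-tag handler:

```perl
use strict;
use warnings;

# per-tag handlers receive one fully assembled record each
my %handler = (
    HEADER => sub { print "header: $_[0]\n" },
    TITLE  => sub { print "title: $_[0]\n" },
);

my ( $tag, $text ) = ( '', '' );
for my $line ( "HEADER one", "TITLE two", "TITLE 2 more two" ) {
    my ( $t, $v ) = $line =~ /^(\S+)\s+(.*)/ or next;
    if ( $t eq $tag ) { $text .= " $v"; next }    # continuation line
    $handler{$tag}->($text) if $tag and $handler{$tag};
    ( $tag, $text ) = ( $t, $v );                 # start a new record
}
$handler{$tag}->($text) if $tag and $handler{$tag};  # flush the last record
```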

    Makeshifts last the longest.

Re: An (almost) useful idiom; needs work.
by tachyon (Chancellor) on Dec 21, 2002 at 22:40 UTC

    In reference to your update: the example posted is already generic and does exactly what you suggest, with a minute change. Have a look at the Data::Dumper output. You have a hash of arrays keyed on the HEADER|TITLE|COMPND tokens. Each element in the arrays is the concatenated contiguous lines. You would reference one array as @{$hash{HEADER}} or a single element as $hash{HEADER}->[2]

    my %hash;
    my $current_token = '';
    $re_types = qr/(HEADER|TITLE|COMPND)/;

    while (<DATA>) {
        my( $token, $value ) = $_ =~ m/^$re_types\s+(.*)/;
        next unless $token;
        if ( $token eq $current_token ) {
            ${$hash{$token}}[-1] .= ' ' . $value;
        }
        else {
            $current_token = $token;
            push @{$hash{$token}}, $value;
        }
    }

    use Data::Dumper;
    print Dumper \%hash;

    __DATA__
    blah
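    To show the hash-of-arrays access in action, here is a self-contained version of the same loop run over a small in-memory sample instead of <DATA> (the sample lines are made up for illustration):

```perl
use strict;
use warnings;

my %hash;
my $current_token = '';
my $re_types = qr/(HEADER|TITLE|COMPND)/;

for my $line ( "HEADER Header 1 stuff",
               "TITLE Title 1 stuff",
               "TITLE 2 more title 1 stuff" ) {
    my ( $token, $value ) = $line =~ m/^$re_types\s+(.*)/;
    next unless $token;
    if ( $token eq $current_token ) {
        # contiguous line with the same token: append to the last record
        $hash{$token}[-1] .= ' ' . $value;
    }
    else {
        # new token: start a new record under that key
        $current_token = $token;
        push @{ $hash{$token} }, $value;
    }
}

print $hash{TITLE}[0], "\n";    # Title 1 stuff 2 more title 1 stuff
```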

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      You still miss the point! Which was to use the inherent state memory of the bistable operator, in one of its forms '..' or '...', to replace the need for a variable and an if statement to control the capture.

      Saying that you can do it "this other way", doesn't progress the idea.


      Examine what is said, not who speaks.

        Oh well, if 12 lines of code that provide an efficient, elegant (if I do say so myself) and generic solution to the problem are worse than a couple of dozen lines that don't work, use cut and paste and parallel arrays, and would require symbolic refs to access in any sort of scalable fashion.....

        The single variable $token is a metastable operator (if you want to nitpick) and performs far more than a simple flag function, as it is also the index key to the hash of arrays.

        In my world problem->solution == happiness

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: An (almost) useful idiom; needs work.
by Aristotle (Chancellor) on Dec 22, 2002 at 20:39 UTC

    I just tried to give it a long, hard look to see if maybe I could make sense of what you're trying to do. I don't think I understand how it's supposed to work. You'll need to explain your thoughts on how it's all supposed to go together.

    I have a vague idea of why it breaks though:

    /^$re_types(.*)$/ ... !/^$1/ and push @{$type{$1}}, $2;

    As long as the left side evaluates true, the right side will never be looked at before the next evaluation, which is on the next iteration. I get the vague feeling that you're being surprised by which $1 is used when. At least if my impression is correct that whichever test you put last in the loop will be the one that breaks.
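    That timing difference can be seen in isolation with a contrived sketch (not the OP's code): with '...', once the left side fires, the right side isn't even looked at until a later iteration, so anything it depends on (such as $1) can be stale by then:

```perl
use strict;
use warnings;

my ( @two_dots, @three_dots );
for my $n ( 1 .. 4 ) {
    # '..' tests its right operand on the SAME iteration the left fires;
    # '...' defers that test until the NEXT iteration
    push @two_dots,   $n if ( $n == 2 ) ..  ( $n == 2 );
    push @three_dots, $n if ( $n == 2 ) ... ( $n == 2 );
}

print "@two_dots\n";      # 2
print "@three_dots\n";    # 2 3 4
```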

    I don't know though. I'm really quite confused as to how your code works (not) at all, and wondering when this is supposed to be a useful trick.

    Makeshifts last the longest.

Moleman2 for reading PDB files - Re: An (almost) useful idiom; needs work.
by metadoktor (Hermit) on Dec 23, 2002 at 23:39 UTC
    I realize that your question is about how to solve collecting multiple lines but since I notice that you're trying to read PDB files then you might find this program useful. Of course, if you just posted this as an exercise in theory and have no interest in the file you're using as an example then ignore this post.

    Moleman2

    metadoktor

    "The doktor is in."