regex question: store multiple lines as a string

Re: regex question: store multiple lines as a string by johngg (Canon) on Oct 12, 2010 at 09:59 UTC
However, if I want to read another text file line by line in the same perl script, redefining $/ = "\n" doesn't work. It will depend on whether you are reading the other file in the same lexical scope. If you are not then you can `local`ise the scope of your redefine of `$/`; you should be doing this anyway as a habit to avoid the side effects you describe. `{ local $/ = q{}; # paragraph mode in this scope while ( <$file1FH> ) { # do something with the multi-line record # from file1 ... } } ... # $/ now back to normal while ( <$file2FH> ) { # do something with a line from file2 ... }` [download] I hope this is helpful. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: regex question: store multiple lines as a string by nurulnad (Acolyte) on Oct 12, 2010 at 11:26 UTC
thanks a lot :D. this helps a bunch!
Re: regex question: store multiple lines as a string by moritz (Cardinal) on Oct 12, 2010 at 08:41 UTC
Just use split with a separator `/\n\n+/`, and store the result in an array. See also: perlintro, perlretut. Perl 6 - links to (nearly) everything that is Perl 6.
Re^2: regex question: store multiple lines as a string by nurulnad (Acolyte) on Oct 12, 2010 at 09:18 UTC
I tried: `#!/usr/bin/perl open (data, 'data.txt') or die "die"; @words = split (/\n\n+/, <data>); print $words[0]; exit;` [download] and the output is only `line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3` can you tell me what I'm doing wrong?	[reply] [d/l] [select]
Re^3: regex question: store multiple lines as a string by moritz (Cardinal) on Oct 12, 2010 at 09:29 UTC
split evaluates its argument in scalar context, so you only get one line. Splitting one line by two newlines doesn't make much sense. `use strict; use warnings; use autodie; open my $f, '<', 'data.txt'; my @words = split /\n\n+/, do { local $/; <$f> }; close $f;` [download] Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l]
Re: regex question: store multiple lines as a string by kcott (Archbishop) on Oct 12, 2010 at 08:52 UTC
#!perl use 5.12.0; use warnings; while (<DATA>) { chomp; if (m{ \A line=ULMNm }msx && $. > 1) { print qq{\n}; } print; } __DATA__ line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~? line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~? [download] Outputs: $ multiline_join.pl line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A + MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 +0.7442 0.1108 -16.917 -91.429 -35.632 D 4 +7 SER A 57 SER.? D 48 THR + A 56 THR.? D 165 LYS A 33 +LYS~? line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A + MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 - +0.8462 0.3266 52.913 23.262 25.449 A 16 +9 TYR A 41 TYR~? A 172 HIS + A 95 HIS^? A 267 ASP A 98 +ASP~? [download] -- Ken	[reply] [d/l] [select]
Re^2: regex question: store multiple lines as a string by nurulnad (Acolyte) on Oct 12, 2010 at 09:10 UTC
sorry, could you please explain? if possible could you tell me how to store these as variables?
Re^3: regex question: store multiple lines as a string by kcott (Archbishop) on Oct 12, 2010 at 09:52 UTC
This concatenates your multiple lines and stores the single string in an array element: #!perl use 5.12.0; use warnings; my @joined = (); my $index = 0; while (<DATA>) { chomp; if (m{ \A line=ULMNm }msx && $. > 1) { ++$index; } $joined[$index] .= $_; } for (@joined) { say } __DATA__ line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~? line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~? [download] Outputs: $ multiline_join_array.pl line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A + MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 +0.7442 0.1108 -16.917 -91.429 -35.632 D 4 +7 SER A 57 SER.? D 48 THR + A 56 THR.? D 165 LYS A 33 +LYS~? line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A + MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 - +0.8462 0.3266 52.913 23.262 25.449 A 16 +9 TYR A 41 TYR~? A 172 HIS + A 95 HIS^? A 267 ASP A 98 +ASP~? [download] I don't know what subsequent processing you want to do. I've just output each array element to the screen (`say` just tags on a newline). -- Ken	[reply] [d/l] [select]
Re: regex question: store multiple lines as a string by ig (Vicar) on Oct 12, 2010 at 13:38 UTC
You can change $/ at any time - even alternating as you read a single file. This example demonstrates the flexibility you have, but note that each I/O operation could be on a different file handle as easily as on the same one. `use strict; use warnings; for (0..2) { my $line1 = do { local $/ = "\n\n"; <DATA> }; print "got a line1: \"$line1\"\n" if(defined($line1)); my $line2 = do { local $/ = "paragraph"; <DATA> }; print "got a line2: \"$line2\"\n" if(defined($line2)); } __DATA__ This is a paragraph with two lines. This is another paragraph with two lines. This is a third paragraph. This paragraph has three lines.` [download]	[reply] [d/l]
Re: regex question: store multiple lines as a string by ww (Archbishop) on Oct 12, 2010 at 15:17 UTC
Am I missing something when I interpret OP's spec, "I'd like to store everything starting from 'line=ULMNm' till before the next 'line=ULMNm' as one string", as meaning the sample data should be divided into elements, each with a single element begining with "line=" and ending with the first instance of two newlines? Missing something or not, that's how I read it in writing this to satisfy my understanding of the spec: #!/usr/bin/perl use strict; use warnings; # 864768 my @words = split /(line=)/, do { local $/="\n\n"; <DATA> }; # a v +ariant of moritz' advice for my $words(@words) { chomp $words; if ($words eq "line=") { print $words; }else{ print "$words \n -------\n"; # the dashes visually separa +te the output records } } exit; __DATA__ line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~? line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~? line=ULMNm 3 4fdy_07 P-HYDROOXIDE user 1 +3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 PQR A 33 PRQ~? line=ULMNm 3 5tmd_00 BAZ Blivitz user 1 3 + RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 XYZ A 98 XYZ~? [download] and we see this, upon execution: F:\_wo\pl_test>perl 864768.pl ------- line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user + 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~? 
------- line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user + 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~? ------- line=ULMNm 3 4fdy_07 P-HYDROOXIDE user 1 +3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0 +.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 PQR A 33 PRQ~? ------- line=ULMNm 3 5tmd_00 BAZ Blivitz user 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0 +.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 XYZ A 98 XYZ~? ------- F:\_wo\pl_test> [download] Note the empty record that is the first output. Not good... hence, I'd welcome comments on my algorithm/code AND any comments rebutting my interpretation of the spec. Belated addition, 2125 EDT (U.S., roughly 10 hours later): Re OP's question about storing the munged data in variables. Whilst working this out, I used Data::Dumper to try to ascertain why an earlier iteration didn't work... and after fixing my foolishness but before removing D::D from the code, observed that D::D's list of vars had "line=" (see split at line 10) in Var2, Var4... and the rest of each munged data section in Var3, Var5, ....	[reply] [d/l] [select]
Re: regex question: store multiple lines as a string by locked_user sundialsvc4 (Abbot) on Oct 12, 2010 at 13:36 UTC
I wonder whether the `/s` and/or `/m` modifiers might be useful here. If the file is “quite large,” as I assume it is, one strategy might be to take each line that is read and, first, concatenate it (newline included) to a buffer string. Then, repeatedly match a regex against that string using the `/m` and `/p` modifiers. Each time the string matches, extract the matched portion using `${^MATCH}` (“what matched”), then assign `${^POSTMATCH}` (“what follows”) back to the buffer. Repeat this until the pattern no longer matches. Something like this: `my $buffer = ""; my $line; do { $line = <$fh>; $buffer .= $line if defined($line); while ($buffer =~ /$pattern/mp) { process(${^MATCH}); $buffer = ${^POSTMATCH}; } } while (defined($line));` You need to be sure that your pattern is set up so that it is not “greedy.” By default, a regex quantifier will match as much of the string as it can ... “always taking the biggest possible piece of the pie,” if you will. But you don’t want that to happen. If, at any time, the buffer contains more than one complete occurrence of whatever it is that you are looking for, you want to grab each one in turn. Let me explain... Let’s say that you want to find whatever is between `BEGIN` and `END` in some string. And let’s say that our test-string, just for fun, consists of: “`BEGIN FOO END BEGIN BAR END`.” A “greedy” pattern, such as (say...) `/BEGIN(.*)END/`, would grab the longest possible substring that still permits the entire pattern to match, viz: `FOO END BEGIN BAR`. Because the regex went for the longest string, it grabbed everything that it found between the first occurrence of `BEGIN` and the last occurrence of `END`. This is obviously not what we want. But if we add the '?' modifier directly after the quantifier, it now grabs the shortest possible match. A pattern such as `/BEGIN(.*?)END/` would now match: `FOO` the first time. `BAR` the second. (Caution: extemporaneous coding. There might be syntax errors. Do not try this at home.)
Re: regex question: store multiple lines as a string by ig (Vicar) on Oct 12, 2010 at 14:15 UTC
If your question is: how to get good at regex... Remember that regex is a separate language and you have to learn this language. You don't have to learn all of it at once. You can learn a bit at a time but to get good you must keep studying. You can read the manual pages: perlretut and perlre are good places to start. There are some tutorials here: Pattern Matching, Regular Expressions, and Parsing. O'Reilly has Mastering Regular Expressions. Most of all, you must practice. There are lots of questions and answers here that you can study. Try searching for 'regex' or 'regular expression' in Super Search. Try writing your own solutions to the problems you find. In the beginning, some of the problems will be too difficult for you, but if you keep reading and keep practicing, you too can be good at it.
Re^2: regex question: store multiple lines as a string by planetscape (Chancellor) on Oct 13, 2010 at 00:11 UTC
See also: My Favourite Regex Tools HTH, planetscape

However, if I want to read another text file line by line in the same perl script, redefining $/ = "\n" doesn't work.

It will depend on whether you are reading the other file in the same lexical scope. If you are not then you can localise the scope of your redefine of $/; you should be doing this anyway as a habit to avoid the side effects you describe.

{
    local $/ = q{}; # paragraph mode in this scope
    while ( <$file1FH> )
    {
        # do something with the multi-line record
        # from file1
        ...
    }
}
...
# $/ now back to normal
while ( <$file2FH> )
{
    # do something with a line from file2
    ...
}

I hope this is helpful.

Cheers,

JohnGG
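
For anyone who wants to run this, here is a minimal sketch of the scoping advice above. The in-memory filehandles and the sample strings are assumptions made up for illustration; they stand in for the real file1 and file2.

```perl
use strict;
use warnings;

# Hypothetical in-memory "files" so the sketch is self-contained.
my $file1 = "rec1 line1\nrec1 line2\n\nrec2 line1\nrec2 line2\n";
my $file2 = "a\nb\nc\n";

open my $file1FH, '<', \$file1 or die "Cannot open file1: $!";
open my $file2FH, '<', \$file2 or die "Cannot open file2: $!";

my @records;
{
    local $/ = q{};    # paragraph mode, only inside this block
    while ( my $record = <$file1FH> ) {
        chomp $record;    # in paragraph mode chomp strips all trailing newlines
        push @records, $record;
    }
}

# $/ is back to "\n" here, so this loop reads one line at a time
my @lines;
while ( my $line = <$file2FH> ) {
    chomp $line;
    push @lines, $line;
}

printf "%d records, %d lines\n", scalar @records, scalar @lines;
```

Because the change is made with local inside a bare block, nothing has to be restored by hand; leaving the block puts $/ back to whatever it was before.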


thanks a lot :D. this helps a bunch!


Just use split with a separator /\n\n+/, and store the result in an array.

See also: perlintro, perlretut.

Perl 6 - links to (nearly) everything that is Perl 6.


I tried:

#!/usr/bin/perl
open (data, 'data.txt') or die "die";

@words = split (/\n\n+/, <data>); 
print $words[0];
exit;

and the output is only

line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3

can you tell me what I'm doing wrong?


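split evaluates its string argument in scalar context, so <data> returns only one line. Splitting a single line on two newlines doesn't make much sense.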

use strict;
use warnings;
use autodie;

open my $f, '<', 'data.txt';
my @words = split /\n\n+/, do { local $/; <$f> };
close $f;

Perl 6 - links to (nearly) everything that is Perl 6.
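
To see the scalar-context point in action, here is a small sketch. The sample text and the in-memory filehandle are hypothetical stand-ins for data.txt, not the OP's real data.

```perl
use strict;
use warnings;

# Hypothetical sample data: two records separated by a blank line.
my $text = "line=A\n  x\n  y\n\nline=B\n  z\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $first = <$fh>;                      # scalar context: exactly one line
my $rest  = do { local $/; <$fh> };     # $/ undefined: everything that is left
close $fh;

# Splitting the whole text on runs of blank lines gives one element per record.
my @words = split /\n\n+/, $first . $rest;

print "first read:  $first";
print "record count: ", scalar @words, "\n";
```

The first read is what split received in the earlier attempt: a single line with no blank line in it, so there was nothing to split on.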


#!perl

use 5.12.0;
use warnings;

while (<DATA>) {
    chomp;
    if (m{ \A line=ULMNm }msx && $. > 1) {
        print qq{\n};
    }
    print;
}

__DATA__
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3                                    
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632  
                   D  47    SER       A  57    SER.?     
                   D  48    THR       A  56    THR.?     
                   D 165    LYS       A  33    LYS~?     
 
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3                                    
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449  
                   A 169    TYR       A  41    TYR~?     
                   A 172    HIS       A  95    HIS^?     
                   A 267    ASP       A  98    ASP~?

Outputs:

$ multiline_join.pl
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3                                      RMSD =     1.06 A
+  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  
+0.7442  0.1108   -16.917   -91.429   -35.632                     D  4
+7    SER       A  57    SER.?                        D  48    THR    
+   A  56    THR.?                        D 165    LYS       A  33    
+LYS~?      
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3                                      RMSD =     1.15 A
+  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -
+0.8462  0.3266    52.913    23.262    25.449                     A 16
+9    TYR       A  41    TYR~?                        A 172    HIS    
+   A  95    HIS^?                        A 267    ASP       A  98    
+ASP~?

-- Ken


sorry, could you please explain? if possible could you tell me how to store these as variables?


This concatenates your multiple lines and stores the single string in an array element:

#!perl

use 5.12.0;
use warnings;

my @joined = ();
my $index = 0;

while (<DATA>) {
    chomp;
    if (m{ \A line=ULMNm }msx && $. > 1) {
        ++$index;
    }
    $joined[$index] .= $_;
}

for (@joined) { say }

__DATA__
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3                                    
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632  
                   D  47    SER       A  57    SER.?     
                   D  48    THR       A  56    THR.?     
                   D 165    LYS       A  33    LYS~?     
 
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3                                    
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449  
                   A 169    TYR       A  41    TYR~?     
                   A 172    HIS       A  95    HIS^?     
                   A 267    ASP       A  98    ASP~?

Outputs:

$ multiline_join_array.pl
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3                                      RMSD =     1.06 A
+  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  
+0.7442  0.1108   -16.917   -91.429   -35.632                     D  4
+7    SER       A  57    SER.?                        D  48    THR    
+   A  56    THR.?                        D 165    LYS       A  33    
+LYS~?      
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3                                      RMSD =     1.15 A
+  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -
+0.8462  0.3266    52.913    23.262    25.449                     A 16
+9    TYR       A  41    TYR~?                        A 172    HIS    
+   A  95    HIS^?                        A 267    ASP       A  98    
+ASP~?

I don't know what subsequent processing you want to do. I've just output each array element to the screen (say just tags on a newline).

-- Ken


You can change $/ at any time - even alternating as you read a single file. This example demonstrates the flexibility you have, but note that each I/O operation could be on a different file handle as easily as on the same one.

use strict;
use warnings;

for (0..2) {
    my $line1 = do { local $/ = "\n\n"; <DATA> };
    print "got a line1: \"$line1\"\n" if(defined($line1));
    my  $line2 = do { local $/ = "paragraph"; <DATA> };
    print "got a line2: \"$line2\"\n" if(defined($line2));
}
__DATA__
This is a paragraph
with two lines.

This is another paragraph
with two lines.

This is a third paragraph.
This paragraph
has three lines.

Am I missing something when I interpret OP's spec, "I'd like to store everything starting from 'line=ULMNm' till before the next 'line=ULMNm' as one string", as meaning the sample data should be divided into elements, each beginning with "line=" and ending at the first instance of two newlines?

Missing something or not, that's how I read it in writing this to satisfy my understanding of the spec:

#!/usr/bin/perl
use strict;
use warnings;
# 864768

my @words = split /(line=)/, do { local $/="\n\n"; <DATA> };     # a variant of moritz' advice

for my $words(@words) {
        chomp $words;
        if ($words eq "line=") {
            print $words;    
        }else{
            print "$words \n -------\n";  # the dashes visually separate the output records
        }
}

exit;

__DATA__ 
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3                                                     
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632  
                   D  47    SER       A  57    SER.?     
                   D  48    THR       A  56    THR.?     
                   D 165    LYS       A  33    LYS~?     
 
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3                                                     
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449  
                   A 169    TYR       A  41    TYR~?     
                   A 172    HIS       A  95    HIS^?     
                   A 267    ASP       A  98    ASP~?     
 
line=ULMNm  3  4fdy_07      P-HYDROOXIDE           user   1           
+3                                                     
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632  
                   D  47    SER       A  57    SER.?     
                   D  48    THR       A  56    THR.?     
                   D 165    PQR       A  33    PRQ~?     
 
line=ULMNm  3  5tmd_00      BAZ    Blivitz        user   1           3
+                                                     
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449  
                   A 169    TYR       A  41    TYR~?     
                   A 172    HIS       A  95    HIS^?     
                   A 267    XYZ       A  98    XYZ~?

and we see this, upon execution:

F:\_wo\pl_test>perl 864768.pl

 -------
line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user  
+ 1           3
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632
                   D  47    SER       A  57    SER.?
                   D  48    THR       A  56    THR.?
                   D 165    LYS       A  33    LYS~?

 -------
line=ULMNm  3  2tmd_00      TRIMETHYLAMINE DEHYDROGENASE        user  
+ 1           3
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449
                   A 169    TYR       A  41    TYR~?
                   A 172    HIS       A  95    HIS^?
                   A 267    ASP       A  98    ASP~?

 -------
line=ULMNm  3  4fdy_07      P-HYDROOXIDE           user   1           
+3
  RMSD =     1.06 A
  MATRIX:   -0.3862 -0.2080 -0.8987  0.6457  0.6347 -0.4244 -0.6587  0
+.7442  0.1108   -16.917   -91.429   -35.632
                   D  47    SER       A  57    SER.?
                   D  48    THR       A  56    THR.?
                   D 165    PQR       A  33    PRQ~?

 -------
line=ULMNm  3  5tmd_00      BAZ Blivitz        user   1           3
  RMSD =     1.15 A
  MATRIX:    0.9011 -0.4313  0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0
+.8462  0.3266    52.913    23.262    25.449
                   A 169    TYR       A  41    TYR~?
                   A 172    HIS       A  95    HIS^?
                   A 267    XYZ       A  98    XYZ~?
 -------

F:\_wo\pl_test>

Note the empty record that is the first output. Not good... hence, I'd welcome comments on my algorithm/code AND any comments rebutting my interpretation of the spec.

Belated addition, 2125 EDT (U.S., roughly 10 hours later): Re OP's question about storing the munged data in variables. Whilst working this out, I used Data::Dumper to try to ascertain why an earlier iteration didn't work... and after fixing my foolishness but before removing D::D from the code, observed that D::D's list of vars had "line=" (see split at line 10) in Var2, Var4... and the rest of each munged data section in Var3, Var5, ....
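
One possible way around the empty first record, offered as a sketch rather than a definitive fix: split on a zero-width lookahead instead of capturing "line=". perldoc -f split documents that a zero-width match at the start of the string never produces an empty leading field, and each record keeps its own "line=" prefix, so there is no separate "line=" element to glue back on. The shortened sample data below is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample data in the same shape as the OP's file.
my $text = join '',
    "line=ULMNm  3  1fdy_07\n",
    "  RMSD =     1.06 A\n",
    " \n",
    "line=ULMNm  3  2tmd_00\n",
    "  RMSD =     1.15 A\n";

# Split just before every "line=" that starts a line; nothing is consumed.
my @records = split /^(?=line=)/m, $text;

# Trim trailing whitespace (including the separator line) from each record.
s/\s+\z// for @records;

for my $record (@records) {
    print "$record\n -------\n";
}
```

Each element of @records is then one multi-line string per "line=" block, which may be closer to what the OP asked for.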


I wonder whether the /s and/or /m modifiers might be useful here.

If the file is “quite large,” as I assume it is, one strategy might be to take each line that is read and, first, concatenate it (newline included) to a buffer string. Then, repeatedly match a regex against that string using the /m and /p modifiers. Each time the string matches, extract the matched portion using ${^MATCH} (“what matched”), then assign ${^POSTMATCH} (“what follows”) back to the buffer. Repeat this until the pattern no longer matches. Something like this:

  my $buffer = "";
  my $line;                              # declared out here so the loop condition can see it
  do {
    $line = <$fh>;
    $buffer .= $line if defined($line);  # I.E. NOT END-OF-FILE; $line keeps its own newline
    while ($buffer =~ /$pattern/mp) {    # /p (Perl 5.10+) sets ${^MATCH} and ${^POSTMATCH}
      process(${^MATCH});
      $buffer = ${^POSTMATCH};
    }
  } while (defined($line));              # I.E. UNTIL END-OF-FILE.
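
Here is a self-contained sketch of that buffering idea, under stated assumptions: the BEGIN/END delimiters, the sample input, and the @found array are all hypothetical, and the records are collected into an array instead of being passed to process().

```perl
use strict;
use warnings;

# Hypothetical input: two BEGIN/END blocks with a stray line between them.
my $input = "BEGIN\nfoo\nEND\nnoise\nBEGIN\nbar\nbaz\nEND\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

# Non-greedy body, so each match stops at the first END line it can reach.
my $pattern = qr/^BEGIN\n(.*?)^END\n/msp;

my @found;
my $buffer = '';

while ( defined( my $line = <$fh> ) ) {
    $buffer .= $line;    # $line still carries its newline
    while ( $buffer =~ /$pattern/p ) {
        push @found, $1;            # the text between the delimiters
        $buffer = ${^POSTMATCH};    # keep only what follows the match
    }
}
close $fh;

printf "%d blocks found\n", scalar @found;
```

The buffer only ever holds the text since the last complete block, which is the point of the approach for large files.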

You need to be sure that your pattern is set up so that it is not “greedy.” By default, a regex quantifier will match as much of the string as it can ... “always taking the biggest possible piece of the pie,” if you will. But you don’t want that to happen. If, at any time, the buffer contains more than one complete occurrence of whatever it is that you are looking for, you want to grab each one in turn. Let me explain...

Let’s say that you want to find whatever is between BEGIN and END in some string. And let’s say that our test-string, just for fun, consists of:
“BEGIN FOO END BEGIN BAR END.”

A “greedy” pattern, such as (say...) /BEGIN(.*)END/, would grab the longest possible substring that still permits the entire pattern to match, viz:
FOO END BEGIN BAR.

Because the regex went for the longest string, it grabbed everything that it found between the first occurrence of BEGIN and the last occurrence of END. This is obviously not what we want. But if we add the '?' modifier directly after the quantifier, it now grabs the shortest possible match. A pattern such as /BEGIN(.*?)END/ would now match:

FOO the first time.
BAR the second.

(Caution: extemporaneous coding. There might be syntax errors. Do not try this at home.)
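
A quick way to check the greedy versus non-greedy behaviour described above. This is only a sketch using the test string from the post; the variable names are mine. It also shows that the position of the '?' matters: written after the closing parenthesis it makes the group optional and leaves the quantifier greedy.

```perl
use strict;
use warnings;

my $string = 'BEGIN FOO END BEGIN BAR END';

# Greedy: .* runs to the last END it can reach.
my ($greedy) = $string =~ /BEGIN(.*)END/;

# Non-greedy: .*? stops at the first END; /g collects every match.
my @lazy = $string =~ /BEGIN(.*?)END/g;

# '?' outside the group: the group is optional, but .* is still greedy.
my ($optional_group) = $string =~ /BEGIN(.*)?END/;

print "greedy:         [$greedy]\n";
print "lazy:           [$_]\n" for @lazy;
print "optional group: [$optional_group]\n";
```

Note that the captures include the spaces around FOO and BAR, since the pattern does not exclude them.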

If your question is: how to get good at regex...

Remember that regex is a separate language and you have to learn this language. You don't have to learn all of it at once. You can learn a bit at a time but to get good you must keep studying.

You can read the manual pages: perlretut and perlre are good places to start.

There are some tutorials here: Pattern Matching, Regular Expressions, and Parsing.

O'Reilly has Mastering Regular Expressions.

Most of all, you must practice. There are lots of questions and answers here that you can study. Try searching for 'regex' or 'regular expression' in Super Search. Try writing your own solutions to the problems you find. In the beginning, some of the problems will be too difficult for you, but if you keep reading and keep practicing, you too can be good at it.
