nurulnad has asked for the wisdom of the Perl Monks concerning the following question:

I have data like this:
line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3
RMSD = 1.06 A
MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
D 47 SER A 57 SER.?
D 48 THR A 56 THR.?
D 165 LYS A 33 LYS~?

line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3
RMSD = 1.15 A
MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
A 169 TYR A 41 TYR~?
A 172 HIS A 95 HIS^?
A 267 ASP A 98 ASP~?
I'd like to store everything starting from 'line=ULMNm' till before the next 'line=ULMNm' as one string. Previously, I always used to define the input line separator ($/) like this:
$/ = " ";
to make the script read multiple lines instead of line by line.

However, if I want to read another text file line by line in the same perl script, redefining $/ = "\n" doesn't work. I'm really bad with regex (and have no idea how exactly to get good at it, it baffles me how you guys can be so good). Could you tell me how to do this?

Re: regex question: store multiple lines as a string
by johngg (Canon) on Oct 12, 2010 at 09:59 UTC
    However, if I want to read another text file line by line in the same perl script, redefining $/ = "\n" doesn't work.

    It will depend on whether you are reading the other file in the same lexical scope. If you are not then you can localise the scope of your redefine of $/; you should be doing this anyway as a habit to avoid the side effects you describe.

    {
        local $/ = q{};    # paragraph mode in this scope
        while ( <$file1FH> ) {
            # do something with the multi-line record
            # from file1
            ...
        }
    }
    ...
    # $/ now back to normal
    while ( <$file2FH> ) {
        # do something with a line from file2
        ...
    }
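
    For anyone who wants to try this without creating files first, here is a minimal self-contained sketch of the same idea. The sample strings and variable names are made up for illustration, and in-memory filehandles stand in for the two real files.

```perl
use strict;
use warnings;

# Hypothetical sample inputs, standing in for the two files.
my $file1 = "line=A\nfoo\nbar\n\nline=B\nbaz\n";
my $file2 = "one\ntwo\nthree\n";

open my $fh1, '<', \$file1 or die "Cannot open in-memory file1: $!";
open my $fh2, '<', \$file2 or die "Cannot open in-memory file2: $!";

my @records;
{
    local $/ = q{};    # paragraph mode, only inside this block
    while ( my $record = <$fh1> ) {
        chomp $record;    # in paragraph mode chomp strips all trailing newlines
        push @records, $record;
    }
}

# $/ has been restored to "\n" here, so this loop reads line by line.
my @lines;
while ( my $line = <$fh2> ) {
    chomp $line;
    push @lines, $line;
}

printf "%d records, %d lines\n", scalar @records, scalar @lines;
```

    Because the change to $/ is made with local inside a bare block, it is undone automatically when the block exits, which is why the second loop needs no explicit reset.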

    I hope this is helpful.

    Cheers,

    JohnGG

      thanks a lot :D. this helps a bunch!
Re: regex question: store multiple lines as a string
by moritz (Cardinal) on Oct 12, 2010 at 08:41 UTC
    Just use split with a separator /\n\n+/, and store the result in an array.

    See also: perlintro, perlretut.

    Perl 6 - links to (nearly) everything that is Perl 6.
      I tried:
      #!/usr/bin/perl
      open (data, 'data.txt') or die "die";
      @words = split (/\n\n+/, <data>);
      print $words[0];
      exit;

      and the output is only

      line=ULMNm  3  1fdy_07      N-ACETYLNEURAMINATE LYASE           user   1           3

      can you tell me what I'm doing wrong?

        split evaluates its second argument in scalar context, so <data> returns only the first line. Splitting a single line on two newlines doesn't make much sense.
        use strict;
        use warnings;
        use autodie;
        open my $f, '<', 'data.txt';
        my @words = split /\n\n+/, do { local $/; <$f> };
        close $f;
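
        A runnable sketch of the slurp-then-split approach, for experimenting without data.txt. The sample text is invented and an in-memory filehandle stands in for the file; it assumes, as above, that records are separated by one or more blank lines.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for data.txt.
my $text = "line=A\nfoo\n\nline=B\nbar\n\n\nline=C\nbaz\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# With $/ undefined, a single scalar-context read returns the whole file.
my $contents = do { local $/; <$fh> };
close $fh;

# Split on runs of two or more newlines; each element is one record.
my @records = split /\n\n+/, $contents;

print scalar(@records), " records\n";
print "First record:\n$records[0]\n";
```

        The do block matters here: it limits the undefined $/ to that one read, so any later reads in the script still work line by line.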
Re: regex question: store multiple lines as a string
by kcott (Archbishop) on Oct 12, 2010 at 08:52 UTC
    #!perl
    use 5.12.0;
    use warnings;

    while (<DATA>) {
        chomp;
        if (m{ \A line=ULMNm }msx && $. > 1) {
            print qq{\n};
        }
        print;
    }

    __DATA__
    line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3
    RMSD = 1.06 A
    MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
    D 47 SER A 57 SER.?
    D 48 THR A 56 THR.?
    D 165 LYS A 33 LYS~?

    line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3
    RMSD = 1.15 A
    MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
    A 169 TYR A 41 TYR~?
    A 172 HIS A 95 HIS^?
    A 267 ASP A 98 ASP~?

    Outputs:

    $ multiline_join.pl
    line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~?
    line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~?

    -- Ken

      sorry, could you please explain? if possible could you tell me how to store these as variables?

        This concatenates your multiple lines and stores the single string in an array element:

        #!perl
        use 5.12.0;
        use warnings;

        my @joined = ();
        my $index = 0;

        while (<DATA>) {
            chomp;
            if (m{ \A line=ULMNm }msx && $. > 1) {
                ++$index;
            }
            $joined[$index] .= $_;
        }

        for (@joined) { say }

        __DATA__
        line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3
        RMSD = 1.06 A
        MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
        D 47 SER A 57 SER.?
        D 48 THR A 56 THR.?
        D 165 LYS A 33 LYS~?

        line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3
        RMSD = 1.15 A
        MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
        A 169 TYR A 41 TYR~?
        A 172 HIS A 95 HIS^?
        A 267 ASP A 98 ASP~?

        Outputs:

        $ multiline_join_array.pl
        line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3 RMSD = 1.06 A MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632 D 47 SER A 57 SER.? D 48 THR A 56 THR.? D 165 LYS A 33 LYS~?
        line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3 RMSD = 1.15 A MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449 A 169 TYR A 41 TYR~? A 172 HIS A 95 HIS^? A 267 ASP A 98 ASP~?

        I don't know what subsequent processing you want to do. I've just output each array element to the screen (say just tags on a newline).

        -- Ken

Re: regex question: store multiple lines as a string
by ig (Vicar) on Oct 12, 2010 at 13:38 UTC

    You can change $/ at any time - even alternating as you read a single file. This example demonstrates the flexibility you have, but note that each I/O operation could be on a different file handle as easily as on the same one.

    use strict;
    use warnings;

    for (0..2) {
        my $line1 = do { local $/ = "\n\n"; <DATA> };
        print "got a line1: \"$line1\"\n" if(defined($line1));
        my $line2 = do { local $/ = "paragraph"; <DATA> };
        print "got a line2: \"$line2\"\n" if(defined($line2));
    }

    __DATA__
    This is a paragraph
    with two lines.

    This is another paragraph
    with two lines.

    This is a third paragraph.

    This paragraph
    has three
    lines.
Re: regex question: store multiple lines as a string
by ww (Archbishop) on Oct 12, 2010 at 15:17 UTC

    Am I missing something when I interpret OP's spec, "I'd like to store everything starting from 'line=ULMNm' till before the next 'line=ULMNm' as one string", as meaning the sample data should be divided into elements, each beginning with "line=" and ending with the first instance of two newlines?

    Missing something or not, that's how I read it in writing this to satisfy my understanding of the spec:

    #!/usr/bin/perl
    use strict;
    use warnings;
    # 864768

    my @words = split /(line=)/, do { local $/="\n\n"; <DATA> };   # a variant of moritz' advice

    for my $words(@words) {
        chomp $words;
        if ($words eq "line=") {
            print $words;
        }else{
            print "$words \n -------\n";   # the dashes visually separate the output records
        }
    }
    exit;

    __DATA__
    line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3
    RMSD = 1.06 A
    MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
    D 47 SER A 57 SER.?
    D 48 THR A 56 THR.?
    D 165 LYS A 33 LYS~?
    line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3
    RMSD = 1.15 A
    MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
    A 169 TYR A 41 TYR~?
    A 172 HIS A 95 HIS^?
    A 267 ASP A 98 ASP~?
    line=ULMNm 3 4fdy_07 P-HYDROOXIDE user 1 3
    RMSD = 1.06 A
    MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
    D 47 SER A 57 SER.?
    D 48 THR A 56 THR.?
    D 165 PQR A 33 PRQ~?
    line=ULMNm 3 5tmd_00 BAZ Blivitz user 1 3
    RMSD = 1.15 A
    MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
    A 169 TYR A 41 TYR~?
    A 172 HIS A 95 HIS^?
    A 267 XYZ A 98 XYZ~?

    and we see this, upon execution:

    F:\_wo\pl_test>perl 864768.pl

     -------
    line=ULMNm 3 1fdy_07 N-ACETYLNEURAMINATE LYASE user 1 3
    RMSD = 1.06 A
    MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
    D 47 SER A 57 SER.?
    D 48 THR A 56 THR.?
    D 165 LYS A 33 LYS~?
     -------
    line=ULMNm 3 2tmd_00 TRIMETHYLAMINE DEHYDROGENASE user 1 3
    RMSD = 1.15 A
    MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
    A 169 TYR A 41 TYR~?
    A 172 HIS A 95 HIS^?
    A 267 ASP A 98 ASP~?
     -------
    line=ULMNm 3 4fdy_07 P-HYDROOXIDE user 1 3
    RMSD = 1.06 A
    MATRIX: -0.3862 -0.2080 -0.8987 0.6457 0.6347 -0.4244 -0.6587 0.7442 0.1108 -16.917 -91.429 -35.632
    D 47 SER A 57 SER.?
    D 48 THR A 56 THR.?
    D 165 PQR A 33 PRQ~?
     -------
    line=ULMNm 3 5tmd_00 BAZ Blivitz user 1 3
    RMSD = 1.15 A
    MATRIX: 0.9011 -0.4313 0.0445 -0.1032 -0.3130 -0.9441 -0.4211 -0.8462 0.3266 52.913 23.262 25.449
    A 169 TYR A 41 TYR~?
    A 172 HIS A 95 HIS^?
    A 267 XYZ A 98 XYZ~?
     -------

    F:\_wo\pl_test>

    Note the empty record that is the first output. Not good... hence, I'd welcome comments on my algorithm/code AND any comments rebutting my interpretation of the spec.

    Belated addition, 2125 EDT (U.S., roughly 10 hours later): Re OP's question about storing the munged data in variables. Whilst working this out, I used Data::Dumper to try to ascertain why an earlier iteration didn't work... and after fixing my foolishness but before removing D::D from the code, observed that D::D's list of vars had "line=" (see split at line 10) in Var2, Var4... and the rest of each munged data section in Var3, Var5, ....
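
    On the empty first record: one possible way around it, sketched here with invented sample text, is to split on a zero-width lookahead instead of on a captured "line=". split does not produce an empty leading field for a zero-width match at the start of the string, and the marker stays attached to its record. This assumes every record starts with "line=" at the beginning of a line.

```perl
use strict;
use warnings;

# Hypothetical sample text; records are not separated by blank lines here.
my $text = "line=A 1\nfoo\nline=B 2\nbar\nline=C 3\nbaz\n";

# Split at each position where a line starts with "line=", without
# consuming the marker, so every element begins with "line=".
my @records = split /^(?=line=)/m, $text;
chomp @records;

print scalar(@records), " records\n";
print "$_\n -------\n" for @records;
```

    Any text before the first "line=" would still come back as its own leading element, so this is only a fit when the data starts with a record.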

Re: regex question: store multiple lines as a string
by locked_user sundialsvc4 (Abbot) on Oct 12, 2010 at 13:36 UTC

    I wonder whether the /s and/or /m modifiers might be useful here.

    If the file is “quite large,” as I assume it is, one strategy might be to take each line that is read and, first, concatenate it (and a newline) to a buffer string.   Then, repeatedly match that string against the regex using the /m and /p modifiers.   Each time the string matches, extract the matched portion using ${^MATCH} (“what matched”), then assign the string to be ${^POSTMATCH} (“what follows”).   Repeat this until the pattern no longer matches. Something like this:

    my $buffer = "";
    my $line;    # declared outside the loop so the final test can see it
    do {
        $line = <$fh>;
        $buffer .= "\n$line" if defined($line);    # I.E. NOT END-OF-FILE
        while ($buffer =~ /$pattern/mp) {
            process(${^MATCH});
            $buffer = ${^POSTMATCH};
        }
    } while (defined($line));    # I.E. LOOP UNTIL END-OF-FILE
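
    A runnable sketch of the buffer idea, under stated assumptions: the sample input is invented, an in-memory filehandle stands in for the real file, and the record pattern (from one "line=" up to the next) is my guess at what $pattern would need to be for this data. It is not a tested solution for the real file.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for the file.
my $input = "line=A\nfoo\nline=B\nbar\nline=C\nbaz\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

# Assumed record pattern: from "line=" at the start of a line up to, but not
# including, the next line that starts with "line=". The ? after .* makes the
# match non-greedy; /p makes ${^MATCH} and ${^POSTMATCH} available.
my $pattern = qr/^line=.*?(?=^line=)/msp;

my $buffer = '';
my @records;
while ( defined( my $line = <$fh> ) ) {
    $buffer .= $line;

    # Pull out every complete record currently sitting in the buffer.
    while ( $buffer =~ $pattern ) {
        my ( $record, $rest ) = ( ${^MATCH}, ${^POSTMATCH} );
        push @records, $record;
        $buffer = $rest;
    }
}

# Whatever is left at end-of-file is the final record.
push @records, $buffer if length $buffer;

print scalar(@records), " records\n";
```

    A record only counts as complete once the start of the next one has been read, which is why the last record has to be collected after the loop.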

    You need to be sure that your pattern is set up so that it is not “greedy.”   By default, a regex will match as much of the string as it can ...   “always taking the biggest possible piece of the pie,” if you will.   But you don’t want that to happen.   If, at any time, the buffer contains more than one complete occurrence of whatever it is that you are looking for, you want to grab each one in turn.   Let me explain...

    Let’s say that you want to find whatever is between BEGIN and END in some string.   And let’s say that our test-string, just for fun, consists of:
    “BEGIN FOO END BEGIN BAR END.”

    A “greedy” pattern, such as (say...) /BEGIN(.*)END/, would grab the longest possible substring that still permits the entire pattern to match, viz:
    FOO END BEGIN BAR.

    Because the regex went for the longest string, it grabbed everything that it found between the first occurrence of BEGIN and the last occurrence of END.   This is obviously not what we want.   But, if we put the '?' modifier directly after the quantifier, it now grabs the shortest possible match.   A pattern such as /BEGIN(.*?)END/ would now match:

    • FOO the first time.
    • BAR the second.

    (Caution: extemporaneous coding.   There might be syntax errors.   Do not try this at home.)
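
    A small sketch of the greedy versus non-greedy difference, using the test string from above. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $string = 'BEGIN FOO END BEGIN BAR END.';

# Greedy: .* runs on to the last END in the string.
my ($greedy) = $string =~ /BEGIN(.*)END/;

# Non-greedy: .*? stops at the nearest END; with /g in list context
# the match returns the capture from every occurrence.
my @lazy = $string =~ /BEGIN(.*?)END/g;

print "greedy: [$greedy]\n";
print "lazy:   [$_]\n" for @lazy;
```

    Note that the captures include the spaces around FOO and BAR, since the pattern itself does not consume them.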

Re: regex question: store multiple lines as a string
by ig (Vicar) on Oct 12, 2010 at 14:15 UTC

    If your question is: how to get good at regex...

    Remember that regex is a separate language and you have to learn this language. You don't have to learn all of it at once. You can learn a bit at a time but to get good you must keep studying.

    You can read the manual pages: perlretut and perlre are good places to start.

    There are some tutorials here: Pattern Matching, Regular Expressions, and Parsing.

    O'Reilly has Mastering Regular Expressions.

    Most of all, you must practice. There are lots of questions and answers here that you can study. Try searching for 'regex' or 'regular expression' in Super Search. Try writing your own solutions to the problems you find. In the beginning, some of the problems will be too difficult for you, but if you keep reading and keep practicing, you too can be good at it.