danj35 has asked for the wisdom of the Perl Monks concerning the following question:

Good day Monks!

I have a minor problem with a regular expression I'm using that needs tweaking. What I have is a variable holding a page of text with lots of different paragraphs. I need to take chunks of that text and store them in separate variables. The variable looks like this (the content is not so important, it's mainly the headings that are):

=Title of Page=
A general introduction to the topic.
=Literature=
Information here refers to the literature surrounding the topic.
===Comments===
User comments are added here. A user may write whatever they may wish.
=Microarray Data=
Information surrounding the topic related to microarrays.
===Comments===
Comments are related to the microarray data here.
=Pathway Information=
Information related to pathways for the topic is found here.
===Comments===
Comments related to the pathway information here.
=Aditional Info=
Any additional information can be found here.

I am trying to store all the text below each comments (===Comments===) header to separate variables. So what this means is that I should have 3 new variables, one for each comments header. The problem I have had is being able to use regular expressions to take this information, as each Comments header I would use as the start tag is exactly the same (===Comments===). I had solved this by using the following regular expression:

# $page refers to the variable containing the text
my @comments = Dumper($page) =~ m/[=]+Comments[=]+.*?\n(.*?)[=]+/gs;

I have now run into a problem with this: if an '=' is encountered before the end of a comments section, the text after it is no longer stored. I assume the way to solve this would be to write 3 separate statements, one for each comments section, differentiated by their relevant end tags (=Microarray Data=, =Pathway Information= and =Aditional Info=).

Any help with this would be great, as I'm close to having this finished and I'm excited to see it working :)

Cheers.

P.s. I should mention that the comments sections do vary and they can be as long as you like. There may be numerous newline characters here.
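
One possible way to address the stray '=' problem described above is to end the capture with a lookahead that only matches at the start of a line. This is a minimal sketch, not the approach from any reply in this thread. It assumes each heading sits on its own line, and the sample text in `$page` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample; headings sit on their own lines, as the regex assumes.
my $page = <<'END';
=Literature=
Intro text.
===Comments===
A comment with x = y in the middle.
Second line.
=Microarray Data=
Array text.
===Comments===
Another comment.
=Aditional Info=
Closing text.
END

my @comments = $page =~ m/
    ^===Comments===\n   # the comments header on its own line
    (.*?)               # everything after it, as little as possible
    (?= ^= | \z )       # stop before the next line starting with '=' or at end of text
/gmsx;

print "--\n$_" for @comments;
```

With /m the `^` anchors at every line start, so an '=' in the middle of a line no longer ends the capture. A comment line that itself begins with '=' would still be treated as a heading.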

Replies are listed 'Best First'.
Re: Regular Expressions Challenge
by moritz (Cardinal) on May 18, 2010 at 07:10 UTC
    Just split at two newlines in a row, and post-process the result list. Much easier than trying to come up with a regex for the whole thing.

      Sorry forgot to mention that there may be numerous newline characters within the comments sections.

        So? You can join the lines again.
Re: Regular Expressions Challenge
by JavaFan (Canon) on May 18, 2010 at 07:29 UTC
    Why use a complicated regular expression?
    my @section;
    my $i = -1;
    foreach (split /\n/, $str) {
        if (/^===Comments===$/) {
            ++$i;
            next;
        }
        next unless $i >= 0;
        $section[$i] .= "$_\n";
    }
    Now @section will contain all the text after ===Comments===, one chapter at a time.
      This code is all well and good, except it takes the information after the end of each comments section also. I need to take only the information within each comments section (e.g. between ===Comments=== and =Microarray Data=). Sorry if that wasn't clear.
        my @data = split /\n/, $str;
        my @section;
        my $i = 0;
        while (@data) {
            $_ = shift @data;
            if (/^===Comments===$/) {
                while (@data) {
                    $_ = shift @data;
                    last if /^=/;
                    $section[$i] .= "$_\n";
                }
                $i++;
            }
        }
Re: Regular Expressions Challenge
by danj35 (Sexton) on May 18, 2010 at 08:25 UTC

    In the absence of being able to come up with a single line of code for regular expressions I've combined the split suggestions with a regular expression to achieve the desired result. Here is the working code:

    my @section;
    my $i = -1;
    foreach (split /\n/, $str) {
        if (/^===Comments===$/) {
            ++$i;
            next;
        }
        next unless $i >= 0;
        $section[$i] .= "$_\n";
    }

    my $comments1;
    my $comments2;
    my $comments3;
    if ($section[0] =~ /\n(.*?)=Microarray Data=/gs){
        $comments1 = $1;
    }
    if ($section[1] =~ /\n(.*?)=Pathway Information=/gs){
        $comments2 = $1;
    }
    if ($section[2] =~ /\n(.*?)=Aditional Info=/gs){
        $comments3 = $1;
    }
    print $comments1, $comments2, $comments3;

    Thanks for all pointers. If anyone can think of a way of making this neater and simpler that would be great. But otherwise it's working so I'm happy!

    Thanks.

      Not sure if this is neater or simpler, but certainly shorter - maybe you would like one of the following:

      my @sections = $str =~ m/^===Comments===\n(.*?)\n=.*?=$/gms;
      print Dumper(\@sections);
      print "\n\nOR\n\n";

      my %sections = reverse( $str =~ m/^===Comments===\n(.*?)\n=(.*?)=$/gms );
      print Dumper(\%sections);
      print "\n\nOR\n\n";

      while($str =~ m/^===Comments===\n(.*?)\n=(.*?)=$/gms) {
          print " $2 : $1\n";
      }

      Which produces:

      $VAR1 = [
                ' User comments are added here. A user may write whatever they may wish. ',
                ' Comments are related to the microarray data here. ',
                ' Comments related to the pathway information here. '
              ];

      OR

      $VAR1 = {
                'Microarray Data' => ' User comments are added here. A user may write whatever they may wish. ',
                'Pathway Information' => ' Comments are related to the microarray data here. ',
                'Aditional Info' => ' Comments related to the pathway information here. '
              };

      OR

      Microarray Data : User comments are added here. A user may write whatever they may wish.
      Pathway Information : Comments are related to the microarray data here.
      Aditional Info : Comments related to the pathway information here.

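      Update: now that I've posted it, I see the last two are wrong: each comment block ends up keyed by the heading that follows it, not by the section it belongs to. This version keys them by their own section: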

      my %x = map {
          /^=([^\n]*)=\n.*\n===Comments===\n(.*)$/s ? ( $1 => $2 ) : ()
      } split(/(?=^=[^=])/m, $str);
      print Dumper(\%x);

      gives

      $VAR1 = {
                'Microarray Data' => ' Comments are related to the microarray data here. ',
                'Pathway Information' => ' Comments related to the pathway information here. ',
                'Literature' => ' User comments are added here. A user may write whatever they may wish. '
              };
Re: Regular Expressions Challenge
by LanX (Saint) on May 18, 2010 at 09:51 UTC
    Here is a regex solution for splitting at headers in multiline strings:

    $slurp .= $_ while <DATA>;
    @array = split /^=+([\w\s]+)=+$/ms, $slurp;
    use Data::Dumper;
    print Dumper \@array;
    __DATA__
    {{{your data}}}

    You can easily extend it to split successively at different header levels. (Take care: the first element is always the text preceding the first header.)

    When processing large texts, consider parsing line by line with the flip-flop operator instead of splitting the whole text.

    Cheers Rolf

      So here is an example using the flip-flop operator - however I am not certain what the OP is actually looking for:
      use strict;
      use warnings;
      use Data::Dumper;

      my @sections;
      while (<DATA>) {
          chomp;
          if (/===Comments===/../^[^=]/) {
              if (/===Comments===/) {
                  push @sections, []
              }
              else {
                  push @{$sections[-1]}, $_ if $_
              }
          }
      }
      print Dumper(\@sections);
      Which for the supplied data gives:
      $VAR1 = [
                [
                  'User comments are added here. A user may write whatever they may wish.'
                ],
                [
                  'Comments are related to the microarray data here.'
                ],
                [
                  'Comments related to the pathway information here.'
                ]
              ];
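
      The flip-flop above closes its range at the first non-empty line that does not begin with '=', so a comment block spanning several lines would keep only its first line. A possible variant, sketched here under the assumption that every comments block is ended by a line starting with '=', uses the three-dot form and the operator's sequence number to skip both delimiter lines. The sample text and the in-memory filehandle are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample text; the thread reads its data from a file or <DATA>.
my $text = <<'END';
=Literature=
Intro text.
===Comments===
First comment, line one.
It contains an = sign mid-line.
=Microarray Data=
Array text.
===Comments===
Second comment.
=Aditional Info=
Closing text.
END

my @sections;
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
while (my $line = <$fh>) {
    # Three-dot flip-flop: true from the Comments header up to and including
    # the next line that starts with '='. The end test is not run on the
    # header line itself.
    my $seq = ($line =~ /^===Comments===$/) ... ($line =~ /^=/);
    next unless $seq;          # outside any comments block
    if ($seq == 1) {           # the ===Comments=== header line
        push @sections, '';
        next;
    }
    next if $seq =~ /E0$/;     # the heading that closes the block
    $sections[-1] .= $line;
}
close $fh;

print "[$_]\n" for @sections;
```

      One limitation of this sketch: if a comments block is closed directly by another ===Comments=== line, that second header is consumed as the end marker and the block after it is missed.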
      generalized solution to parse level headers:
      $slurp .= $_ while <DATA>;
      @data = split /^(=+)([\w\s]+)\1$/m, $slurp;
      unshift @data, '', '<Filename>';
      while ( ($level, $header, $text, @data) = @data ) {
          print " " x length($level), $header, "\n";
          # print $text,"\n\n";
      }
      __DATA__
      OUTPUT
      <Filename>
       Title of Page
       Literature
         Comments
       Microarray Data
         Comments
       Pathway Information
         Comments
       Aditional Info

      Cheers Rolf

Re: Regular Expressions Challenge
by danj35 (Sexton) on May 18, 2010 at 11:42 UTC

    Thanks for all solutions and knowledge from people on this query. All very useful. Have a great day!

Re: Regular Expressions Challenge
by dineed (Scribe) on May 18, 2010 at 17:25 UTC

    While certainly not shorter than other posts, I believe the following provides what the OP is looking for.


    #!c:\perl\bin\perl.exe
    use strict;
    use warnings;

    my @comments;
    my $i;
    my $j = 0;
    my $found_comment = 0;
    my $input = 'c:\yourdirectoryhere\sample_perl_data1.txt';
    my $line;

    open(IN, $input) or die("Can't open file - $input: $!");

    while($line = <IN>) {
        if($line =~ m/^=/) {
            if($line =~ m/^===Comments===/) {
                $found_comment = 1;
                if(! defined $i) {
                    $i = 0;
                }
                else {
                    $i++;
                }
                $comments[$i] = "";  # initialize new comments array elements
            }
            else {
                $found_comment = 0;
            }
        }
        else {
            if($found_comment == 1) {
                $comments[$i] = $comments[$i] . $line;
            }
        }
    }

    close(IN) or die("Unable to close file - $input: $!");

    #print output with a separator line between each element
    for($j=0; $j<=$i; $j++) {
        print("comments[$j] = $comments[$j]");
        print("------------separator-------------------\n");
    }

    Update:

    Sample output below:

    comments[0] = User comments are added here. A user may write whatever they may wish.
    ------------separator-------------------
    comments[1] = Comments are related to the microarray data here.
    ------------separator-------------------
    comments[2] = Comments related to the pathway information here.
    ------------separator-------------------