joec_ has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a file in a specific format (see below), with line breaks and tabs in specific places. What I need to do is group everything up to 'M END' and then append $$$$ after 'M END' to make individual elements. I'm not bothered about the stuff after M END; that can be discarded. I gather I can use a regex for this, but I keep hitting problems with line breaks.

File:

-OECHEM 658567-

1 2 0000 V2000
4 \t 5 8.7 7.655 3
2 \t 55 6 4 5
M END
> <compound id>
665765765
> <source>
db1

$$$$
-OECHEM 35343-

3 6 0000 V2000
1 \t 7 6 4.6 9
2 \t 45 0 3 5
M END
> <compound id>
3546789
> <source>
db1

$$$$

Any ideas appreciated.

TIA - Joe

Please note that between M and END there are two spaces.

Re: Regex for matching and appending
by moritz (Cardinal) on Dec 10, 2008 at 09:09 UTC
    So what have you tried so far? Show us some code, it's likely that it needs only minor corrections.

    It would also help to see what you want the output to be, because your verbal description isn't very exact (at least not to me).

      Hi,

      I would like my output to be:

      $individual[0]=
      -OECHEM 658567-

      1 2 0000 V2000
      4 \t 5 8.7 7.655 3
      2 \t 55 6 4 5
      M END
      $$$$

      $individual[1]=
      -OECHEM 35343-

      3 6 0000 V2000
      1 \t 7 6 4.6 9
      2 \t 45 0 3 5
      M END
      $$$$

      i.e. having 2 separate regex groups (or array elements), each containing just everything up to M END, with $$$$ added on the next line.

      I have so far tried:

      my @individual; @individual = split (/\$\$\$\$/,$file);

      The split works but doesn't include the token used to split on, i.e. I end up with

      $individual[0] =
      -OECHEM 658567-

      1 2 0000 V2000
      4 \t 5 8.7 7.655 3
      2 \t 55 6 4 5
      M END
      > <compound id>
      665765765
      > <source>
      db1

      $individual[1] =
      -OECHEM 35343-

      3 6 0000 V2000
      1 \t 7 6 4.6 9
      2 \t 45 0 3 5
      M END
      > <compound id>
      3546789
      > <source>
      db1

      I tried a regex like /(.*)?M END/ig. I may have to do it in two steps: grouping first, then substitution.

      Thanks.

        split removes the token it matched, but since you know what it is (here: $$$$), you can simply add it again.
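        A minimal sketch of that idea, assuming the whole file is already in a string. The name split_and_reappend is made up for illustration and is not from the thread. It splits on the $$$$ line, keeps each chunk only up to its "M END" line (one or more spaces allowed between M and END), and puts the delimiter back:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split on the $$$$ delimiter,
# keep each record only up to and including its "M END" line, then
# re-append the delimiter that split removed.
sub split_and_reappend {
    my ($file) = @_;
    my @individual;
    for my $chunk (split /^\$\$\$\$\n/m, $file) {
        # The capturing group makes split return the matched "M END" line
        # as the second field, so its original spacing is preserved.
        my ($head, $m_end) = split /^(M +END\n)/m, $chunk, 2;
        next unless defined $m_end;    # skip chunks without an "M END" line
        push @individual, $head . $m_end . '$$$$' . "\n";
    }
    return @individual;
}

# Sample input modelled on the file in the question (shortened).
my $file = join "\n",
    '-OECHEM 658567-', '', '1 2 0000 V2000', 'M  END',
    '> <compound id>', '665765765', '', '$$$$',
    '-OECHEM 35343-', '', '3 6 0000 V2000', 'M  END',
    '> <source>', 'db1', '', '$$$$', '';

my @individual = split_and_reappend($file);
print "--- record ---\n$_" for @individual;
```

        Note that '$$$$' is written in single quotes on purpose: inside double quotes Perl would interpolate $$ as the process ID.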

        Instead of slurping the whole file and then splitting, you can set the input record separator accordingly:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my @records;
        {
            # Read one record at a time, each terminated by the $$$$ line.
            local $/ = '$$$$' . "\n";
            while (my $record = <DATA>) {
                # Keep only the part before the "M END" line.
                my ($stripped) = split /\nM END\n/, $record, 2;
                push @records, "$stripped\nM END\n$/";
            }
        }
        print Dumper \@records;

        __DATA__
        -OECHEM 658567-

        1 2 0000 V2000
        4 \t 5 8.7 7.655 3
        2 \t 55 6 4 5
        M END
        > <compound id>
        665765765
        > <source>
        db1

        $$$$
        -OECHEM 35343-

        3 6 0000 V2000
        1 \t 7 6 4.6 9
        2 \t 45 0 3 5
        M END
        > <compound id>
        3546789
        > <source>
        db1

        $$$$
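        On the line-break problem with the regex attempt /(.*)?M END/ig: by default . does not match a newline, so (.*) can never span several lines. A rough sketch of a pure-regex alternative, assuming the file is slurped into a string; extract_records is a hypothetical name, not something from the thread. The /s modifier lets . match newlines, /m lets ^ match at the start of each line, and the non-greedy .*? stops at the first "M END" line:

```perl
use strict;
use warnings;

# Hypothetical helper: capture each record from its start through the
# "M END" line, then skip (without capturing) everything up to and
# including the following $$$$ line.
sub extract_records {
    my ($file) = @_;
    my @individual;
    while ($file =~ /(.*?^M +END\n).*?^\$\$\$\$\n/msg) {
        # $1 holds the record up to "M END"; add the delimiter back.
        push @individual, $1 . '$$$$' . "\n";
    }
    return @individual;
}

# Sample input modelled on the file in the question (shortened).
my $sample = join "\n",
    '-OECHEM 658567-', '', '1 2 0000 V2000', 'M  END',
    '> <compound id>', '665765765', '', '$$$$',
    '-OECHEM 35343-', '', '3 6 0000 V2000', 'M  END',
    '> <source>', 'db1', '', '$$$$', '';

print "--- record ---\n$_" for extract_records($sample);
```

        This assumes every record contains an "M END" line; a record without one would be merged into the next match, so the record-separator approach above is the more robust choice for large or messy files.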