in reply to Edit complex file

You can do this by extracting the multi-line data items in one go, then replacing all but the last newline in each, using regular expressions. Like this:

use strict;
use warnings;

# Slurp data.
#
{
    local $/ = undef;
    $_ = <DATA>;
}

# Construct regular expression to pull out each data
# item; use extended syntax and single-line matching
# (i.e. a . matches newline).
#
my $rxExtract = qr{(?xs)
    (.+?)           # capture one or more characters
                    # with non-greedy matching
    (?=             # zero-width look ahead
        (?:         # alternation group, either
            \d+\t   # digits then a tab
            |       # or
            \z      # end of string
        )           # close alternation
    )               # close look-ahead
};

# Global match to pull data items out of string. Then
# replace all but the last newline in each data item
# with tabs. Print items out.
#
my @items = /$rxExtract/g;
s/\n(?=.)/\t/g for @items;
print @items;

__END__
1	John Doe, Joe Bloggs
title
journal
Animal Female Protein
2	Mary Clary
title
magazine
Fish Nor Fowl
3	Charley Farley, Piggy Malone
title
book
The Phantom Raspberry Blower

This produces

1 John Doe, Joe Bloggs title journal Animal Female Protein
2 Mary Clary title magazine Fish Nor Fowl
3 Charley Farley, Piggy Malone title book The Phantom Raspberry Blower

I hope this is of use.

Cheers,

JohnGG

Re^2: Edit complex file
by Anonymous Monk on May 03, 2006 at 16:37 UTC
    Lots of new regex stuff... no idea about some of it, but this certainly helps a lot. Thanks.
      I'll have a go at explaining some of it. The qr{ ... } construct allows you to pre-compile a regular expression and store it in a variable; it can then be used later wherever you would normally place a regular expression. The (?xs) at the beginning of the expression says two things: we want extended regular expression syntax (the 'x'), which allows whitespace and comments to be embedded for greater readability, and we want single-line matching (the 's'), so that the '.' metacharacter also matches a newline. Thus, on the next line, the (.+?) will match and remember (because of the '(' and ')') one or more (the '+') of any character (the '.'), including newline, but the '?' makes the match non-greedy, so it stops at the earliest point that lets the rest of the pattern succeed. For example, see these two one-liners:

      $ perl -e '($s)="aaabbaabbb"=~/(a.*)b/;print "$s\n";'
      aaabbaabb
      $ perl -e '($s)="aaabbaabbb"=~/(a.*?)b/;print "$s\n";'
      aaa

      The next construct is the tricky bit. The (?= ... ) is called a zero-width positive look-ahead assertion; I think I've got that right. Basically, the regular expression engine keeps track of where it has reached in the string; the look-ahead says to the engine, staying where you are, look further along from this point to see if you can find whatever. In our case we are looking for one of two things; one or more digits followed by a tab (the \d+\t) or the end of the string (the \z), in effect EOF. The (?: ... ) uses the '(' and ')' to group the alternations ('|' is the regular expression or) and the ?: switches off regular expression memory because we aren't interested in what the look-ahead finds, only that it has found it.
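      To see the "staying where you are" behaviour in isolation, here is a minimal sketch. The sample string and variable names are made up for illustration and are not part of the script above. It matches some letters only where digits and a tab follow, then reports where the engine stopped.

```perl
use strict;
use warnings;

# Hypothetical sample string, not taken from the data file.
my $text = "abc42\tdef";

my ( $letters, $end_pos );

# Match letters only where digits then a tab follow. The look-ahead
# checks for "42\t" but does not make it part of the match.
if ( $text =~ /([a-z]+)(?=\d+\t)/g ) {
    $letters = $1;            # what the capturing parentheses remembered
    $end_pos = pos($text);    # offset where the match ended
}

print "matched '$letters', engine stopped at offset $end_pos\n";
```

      The match ends at offset 3, just after "abc", even though the look-ahead had to inspect the digits and the tab to succeed. That is why the next data item's leading number is still available to start the following match.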

      The line

      my @items = /$rxExtract/g;

      does a couple of things. It uses our previously constructed regular expression and matches it against $_ which is the default behaviour. The thing to note is that the match is done globally with the / ... /g flag. Because of global, the expression keeps going along the string finding matches and because we have used regular expression memory, what it matches is assigned to the @items list, all in one fell swoop.
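      A small sketch of that list-context behaviour, using a made-up string rather than the real data: with /g and one pair of capturing parentheses, the match returns every captured piece as a list.

```perl
use strict;
use warnings;

# Hypothetical input; assigned to $_ so the match can use the default.
local $_ = "x=1;y=22;z=333";

# List context plus /g: every capture from every match is returned.
my @numbers = /=(\d+)/g;

print "@numbers\n";    # prints "1 22 333"
```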

      As an aside, if we had slurped the file into a lexical variable (with $/ still undefined, as in the slurp block of the script) like this

      my $string = <DATA>;

      we couldn't rely on the default matching against $_, so we would bind the pattern to the variable explicitly, like this

      my @items = $string =~ /$rxExtract/g;
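      Here is a hedged sketch of that variation. Note that a plain scalar read from a filehandle returns a single line unless $/ has been undefined first, so the sketch localises $/ inside a do block. The in-memory filehandle and the sample words are stand-ins chosen so the example runs on its own; the original script reads from DATA instead.

```perl
use strict;
use warnings;

# Stand-in for the data file: an in-memory filehandle (an assumption
# made for this example only).
my $sample = "alpha\nbeta\ngamma\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# Localising $/ inside the do block makes this one read return the
# whole file instead of a single line.
my $string = do { local $/; <$fh> };
close $fh;

# The pattern is bound to $string explicitly with =~ .
my @items = $string =~ /(\w+)\n/g;

print scalar(@items), " words: @items\n";
```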

      We now have each data item in its own element in the list, but the items still contain the unwanted newlines that you wish to turn into tabs. We can again use a look-ahead assertion, this time in a substitution. We want to replace a newline only if it is followed by another character; it doesn't matter what character. We don't want to touch the last newline in the data item, as we want that in our modified data file, and that will not be followed by anything else. The \n(?=.) says a newline followed by some single character, and because the look-ahead consumes no characters, leaving the pointer just after the newline, only the newline gets replaced. The

      s/\n(?=.)/\t/g for @items;

      iterates over @items aliasing each element in turn to $_ and then doing a global substitution of any newline in the middle of the data item with a tab.
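      The same statement can be tried on a couple of made-up records (hypothetical values, not the real data) to confirm that the elements are changed in place and that each final newline survives.

```perl
use strict;
use warnings;

# Hypothetical records: newlines inside each one and one at the end.
my @items = ( "1\tA\nB\nC\n", "2\tD\nE\n" );

# Without the 's' flag, '.' does not match a newline, and the final
# newline has nothing after it, so the look-ahead fails there.
s/\n(?=.)/\t/g for @items;

print @items;
```

      Because for aliases $_ to each element rather than copying it, the substitution modifies @items directly. Each record comes out as a single tab-separated line.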

      I hope this makes things clearer for you.

      Cheers,

      JohnGG

        thanks very much for going through it all. much appreciated