Re: Edit complex file

You can do this by extracting each multi-line data item in one go and then replacing all but the last newline in each item with a tab, using regular expressions. Like this:

use strict;
use warnings;

# Slurp data.
#
{
    local $/ = undef;
    $_ = <DATA>;
}

# Construct regular expression to pull out each data
# item; use extended syntax and single-line matching
# (i.e. a . matches newline).
#
my $rxExtract = qr{(?xs)
   (.+?)          # capture one or more characters
                  #   with non-greedy matching
   (?=            # zero-width look ahead
      (?:         # alternation group, either
         \d+\t    # digits then a tab
         |        # or
         \z       # end of string
      )           # close alternation
   )              # close look-ahead
   };

# Global match to pull data items out of string. Then
# replace all but the last newline in each data item
# with tabs. Print items out.
#
my @items = /$rxExtract/g;
s/\n(?=.)/\t/g for @items;
print @items;

__END__
1    John Doe, Joe Bloggs    title    journal    Animal
Female
Protein
2    Mary Clary    title    magazine    Fish
Nor
Fowl
3    Charley Farley, Piggy Malone    title    book    The
Phantom
Raspberry
Blower

This produces

1    John Doe, Joe Bloggs    title    journal    Animal    Female    Protein
2    Mary Clary    title    magazine    Fish    Nor    Fowl
3    Charley Farley, Piggy Malone    title    book    The    Phantom    Raspberry    Blower
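
A possible refinement, offered only as a sketch: if the record numbers can run to two or more digits, the look-ahead in `$rxExtract` can also succeed in the middle of a number such as 10 (after the "1", what follows is still digits then a tab), which would split that record. Anchoring the look-ahead to the start of a line with `^` under the `m` modifier avoids this. The sample data and the name `$rxAnchored` below are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample data: two records, the second with a two-digit number.
my $string = "9\tfoo\nbar\n10\tbaz\nqux\n";

# Same idea as $rxExtract, but the look-ahead only succeeds at the
# start of a line (the 'm' modifier lets ^ match after a newline).
my $rxAnchored = qr{(?xsm)
   (.+?)          # capture one or more characters, non-greedy
   (?=            # zero-width look ahead for either
      ^\d+\t      #   digits then a tab at the start of a line
      |           #   or
      \z          #   end of string
   )              # close look-ahead
   };

# Pull the items out, then turn embedded newlines into tabs.
my @items = $string =~ /$rxAnchored/g;
s/\n(?=.)/\t/g for @items;
print @items;
```

With this data the match should yield two items, one per record, rather than breaking the second record after its first digit.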

I hope this is of use.

Cheers,

JohnGG

Re^2: Edit complex file by Anonymous Monk on May 03, 2006 at 16:37 UTC
Lots of new regex stuff... no idea about some of it, but this certainly helps a lot. Thanks.
Re^3: Edit complex file by johngg (Canon) on May 03, 2006 at 19:15 UTC
I'll have a go at explaining some of it.

The `qr{ ... }` construct allows you to pre-compile a regular expression and store it in a variable; it can be used later wherever you would normally place a regular expression. The `(?xs)` at the beginning of the expression says that we want extended regular expression syntax (the 'x'), which allows whitespace and comments to be embedded for greater readability, and single-line matching (the 's'), so that the '.' metacharacter also matches a newline. Thus, on the next line, the `(.+?)` will match and remember (because of the '(' and ')') one or more (the '+') of any character (the '.'), including newline, but the '?' makes the match non-greedy. For example, compare these two one-liners:

    $ perl -e '($s)="aaabbaabbb"=~/(a.+)b/;print "$s\n";'
    aaabbaabb
    $ perl -e '($s)="aaabbaabbb"=~/(a.+?)b/;print "$s\n";'
    aaa

The next construct is the tricky bit. The `(?= ... )` is called a zero-width positive look-ahead assertion; I think I've got that right. Basically, the regular expression engine keeps track of where it has reached in the string; the look-ahead says to the engine, staying where you are, look further along from this point to see if you can find whatever. In our case we are looking for one of two things: one or more digits followed by a tab (the `\d+\t`) or the end of the string (the `\z`), in effect EOF. The `(?: ... )` uses the '(' and ')' to group the alternatives ('|' is the regular expression "or"), and the `?:` switches off regular expression memory because we aren't interested in what the look-ahead finds, only that it has found it.

The line `my @items = /$rxExtract/g;` does a couple of things. It uses our previously constructed regular expression and matches it against `$_`, which is the default behaviour. The thing to note is that the match is done globally, because of the `g` flag in `/ ... /g`. Because it is global, the expression keeps going along the string finding matches and, because we have used regular expression memory, what it matches is assigned to the `@items` list, all in one fell swoop.

As an aside, if we had slurped the file into a lexical variable like this, `my $string = <DATA>;`, we couldn't rely on the default matching against `$_`, so we would do this instead: `my @items = $string =~ /$rxExtract/g;`

We now have each data item in its own element in the list, but the items still contain the unwanted newlines that you wish to turn into tabs. We can again use a look-ahead assertion, this time in a substitution. We want to replace a newline only if it is followed by another character; it doesn't matter what character. We don't want to touch the last newline in the data item, as we want that in our modified data file, and that will not be followed by anything else. The `\n(?=.)` says a newline followed by some single character and, because the look-ahead consumes no characters and leaves the engine's position just after the newline, only the newline gets replaced.

The line `s/\n(?=.)/\t/g for @items;` iterates over `@items`, aliasing each element in turn to `$_`, and then does a global substitution of any newline in the middle of the data item with a tab.

I hope this makes things clearer for you.

Cheers,

JohnGG
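
To see the substitution's look-ahead in isolation, here is a minimal sketch. The data item and the variable names `$item` and `$joined` are made up for illustration; only the `s/\n(?=.)/\t/g` part comes from the script.

```perl
use strict;
use warnings;

# Made-up data item: two embedded newlines and one trailing newline.
my $item = "3\tThe\nPhantom\nRaspberry\n";

# Copy the item, then replace a newline only when another character
# follows it. Without the 's' modifier '.' does not match a newline,
# and nothing follows the final newline, so that one is left alone.
(my $joined = $item) =~ s/\n(?=.)/\t/g;

print $joined;
```

The copy-then-substitute idiom keeps `$item` unchanged, so the two strings can be compared side by side.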
Re^4: Edit complex file by Anonymous Monk on May 03, 2006 at 21:16 UTC
Thanks very much for going through it all. Much appreciated.