Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am having problems editing a fairly complex file, an Endnote output file. The file contains a set of publication identifiers, e.g. author, title, journal etc. I am having problems with the keyword field. Each of the fields is separated by a tab, allowing for import into a mysql table. However, the keyword field contains multiple entries which are separated by a carriage return. Simple example:
1	John Doe, joe blogs,	title	journal	Animal
Female
Protein
2	Fred.......
The keyword field starts at Animal, followed by a carriage return, then Female, then Protein. I need to replace these carriage returns with tabs while retaining the carriage return separating the entries, i.e. between Protein and 2. Any suggestions would be great.

Replies are listed 'Best First'.
Re: Edit complex file
by lima1 (Curate) on May 03, 2006 at 14:23 UTC
    the keyword lines do not contain tabs, right? then:
    my @fields;
    my $i = 0;
    while ( my $line = <$FILE> ) {
        chomp $line;
        if ( $line =~ /\t/ ) {
            if ( $i > 0 ) {
                # do smth with last @fields array
            }
            @fields = split "\t", $line;
            $i++;
        }
        else {
            push @fields, $line;
        }
    }
    UPDATE: after the while loop finishes, @fields still contains the last record, so it needs to be processed there as well.
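A minimal sketch of how the loop above might be completed, assuming the goal is one tab-joined line per record. The sample data, the in-memory filehandle and the @records array are made up for illustration and are not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical sample input: tabs separate fields, newlines separate keywords.
my $input = "1\tJohn Doe\ttitle\tjournal\tAnimal\nFemale\nProtein\n"
          . "2\tMary Clary\ttitle\tmagazine\tFish\nNor\nFowl\n";

# Read from the string as if it were a file.
open my $FILE, '<', \$input or die "Cannot open in-memory file: $!";

my @records;    # finished records, one tab-joined string each
my @fields;     # fields of the record currently being assembled

while ( my $line = <$FILE> ) {
    chomp $line;
    if ( $line =~ /\t/ ) {
        # A line containing a tab starts a new record: flush the previous one.
        push @records, join( "\t", @fields ) if @fields;
        @fields = split /\t/, $line;
    }
    else {
        # A line without tabs is a continuation keyword.
        push @fields, $line;
    }
}

# Flush the final record, which the loop leaves in @fields.
push @records, join( "\t", @fields ) if @fields;

print "$_\n" for @records;
```

Testing `if @fields` replaces the `$i` counter from the reply; both only serve to skip the flush before the first record has been read.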
      thanks for that...think i've corrected the file! cheers
Re: Edit complex file
by johngg (Canon) on May 03, 2006 at 15:10 UTC
    You can do this by extracting the multi-line data items in one go then replacing all but the last newline in each by using regular expressions. Like this

    use strict;
    use warnings;

    # Slurp data.
    #
    {
        local $/ = undef;
        $_ = <DATA>;
    }

    # Construct regular expression to pull out each data
    # item; use extended syntax and single-line matching
    # (i.e. a . matches newline).
    #
    my $rxExtract = qr{(?xs)
        (.+?)            # capture one or more characters
                         # with non-greedy matching
        (?=              # zero-width look ahead
            (?:          # alternation group, either
                \d+\t    # digits then a tab
                |        # or
                \z       # end of string
            )            # close alternation
        )                # close look-ahead
    };

    # Global match to pull data items out of string. Then
    # replace all but the last newline in each data item
    # with tabs. Print items out.
    #
    my @items = /$rxExtract/g;
    s/\n(?=.)/\t/g for @items;
    print @items;

    __END__
    1	John Doe, Joe Bloggs	title	journal	Animal
    Female
    Protein
    2	Mary Clary	title	magazine	Fish
    Nor
    Fowl
    3	Charley Farley, Piggy Malone	title	book	The
    Phantom
    Raspberry
    Blower

    This produces

    1	John Doe, Joe Bloggs	title	journal	Animal	Female	Protein
    2	Mary Clary	title	magazine	Fish	Nor	Fowl
    3	Charley Farley, Piggy Malone	title	book	The	Phantom	Raspberry	Blower

    I hope this is of use.

    Cheers,

    JohnGG

      lots of new regex stuff..no idea on some of them but this certainly helps a lot thanks
        I'll have a go at explaining some of it. The qr{ ... } construct allows you to pre-compile a regular expression and store it in a variable; it can be used later where you would normally place a regular expression. The (?xs) at the beginning of the expression says that we want to use extended regular expression syntax (the 'x') which allows whitespace and comments to be embedded for greater readability and we want to do single-line matching (the 's') so that the '.' metacharacter matches a newline. Thus, on the next line the (.+?) will match and remember (because of the '(' and ')') one or more (the '+') of any character (the '.') including newline but the '?' makes the match non-greedy, for example see these two one-liners

        $ perl -e '($s)="aaabbaabbb"=~/(a.*)b/;print "$s\n";'
        aaabbaabb
        $ perl -e '($s)="aaabbaabbb"=~/(a.*?)b/;print "$s\n";'
        aaa

        The next construct is the tricky bit. The (?= ... ) is called a zero-width positive look-ahead assertion; I think I've got that right. Basically, the regular expression engine keeps track of where it has reached in the string; the look-ahead says to the engine, staying where you are, look further along from this point to see if you can find whatever. In our case we are looking for one of two things; one or more digits followed by a tab (the \d+\t) or the end of the string (the \z), in effect EOF. The (?: ... ) uses the '(' and ')' to group the alternations ('|' is the regular expression or) and the ?: switches off regular expression memory because we aren't interested in what the look-ahead finds, only that it has found it.
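A small sketch of how the look-ahead behaves on record-like data. The sample string and the anchored variant are illustrative assumptions, not part of the original script. Because `\d+\t` is not tied to the start of a line, a two-digit record number is found twice, once at each of its digits, so the match yields an extra item. The pieces still concatenate back to the original text, which also shows that the look-ahead itself consumes nothing.

```perl
use strict;
use warnings;

# Hypothetical data: two records, the second with a two-digit record number.
my $data = "9\tDoe\tAnimal\nFemale\n10\tClary\tFish\n";

# Look-ahead as described above: digits then a tab, or end of string.
my @loose = $data =~ /(.+?)(?=(?:\d+\t|\z))/gs;

# Assumed variant: ^ under /m lets the look-ahead succeed only at a line start.
my @anchored = $data =~ /(.+?)(?=(?:^\d+\t|\z))/gms;

print scalar(@loose),    "\n";    # 3: the "1" of "10" becomes an item of its own
print scalar(@anchored), "\n";    # 2: one item per record

# Nothing was consumed by the look-ahead, so the pieces rejoin exactly.
print "round trip ok\n" if join( '', @loose ) eq $data;
```

Since the script prints the items back to back, the extra split does not change its output; the anchored form only matters if you need exactly one list element per record.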

        The line

        my @items = /$rxExtract/g;

        does a couple of things. It uses our previously constructed regular expression and matches it against $_ which is the default behaviour. The thing to note is that the match is done globally with the / ... /g flag. Because of global, the expression keeps going along the string finding matches and because we have used regular expression memory, what it matches is assigned to the @items list, all in one fell swoop.
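To make the context point concrete, here is a short sketch on made-up data (the string and variable names are only for illustration). In list context a /g match returns every capture at once; in scalar context the same match finds one occurrence per evaluation and has to be driven by a loop.

```perl
use strict;
use warnings;

# Hypothetical input, matched against $_ by default.
$_ = "a=1;b=22;c=333;";

# List context with /g: all captures from all matches are returned in one go.
my @values = /=(\d+);/g;

# Scalar context with /g: one match per evaluation, continuing where the last ended.
my @looped;
while (/=(\d+);/g) {
    push @looped, $1;
}

print "@values\n";    # 1 22 333
print "@looped\n";    # 1 22 333
```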

        As an aside, if we had slurped the file into a lexical variable like this

        my $string = <DATA>;

        we couldn't rely on the default matching against $_, so we would bind the match to the variable explicitly like this

        my @items = $string =~ /$rxExtract/g;
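One detail worth spelling out: reading a filehandle in scalar context only returns the whole file if the input record separator $/ is undefined at that point, otherwise it returns a single line. A minimal sketch, using an in-memory filehandle and made-up data in place of DATA, with a compact form of the same pattern:

```perl
use strict;
use warnings;

# Hypothetical data standing in for the DATA section.
my $data = "1\tA\tx\ny\n2\tB\tz\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Undefine $/ only inside the do block, so the read returns the whole
# file as one string and $/ is restored afterwards.
my $string = do { local $/; <$fh> };
close $fh;

# Same pattern as in the script, written compactly with the /s flag.
my $rxExtract = qr{(.+?)(?=(?:\d+\t|\z))}s;

# Bind the match to the lexical with =~ rather than relying on $_.
my @items = $string =~ /$rxExtract/g;

print scalar(@items), " items\n";    # 2 items
```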

        We now have each data item in its own element in the list, but the items still contain the unwanted newlines that you wish to turn into tabs. We can again use a look-ahead assertion, this time in a substitution. We want to replace a newline only if it is followed by another character; it doesn't matter which character, so long as it isn't itself a newline, because without the 's' flag the '.' does not match one. We don't want to touch the last newline in the data item as we want that in our modified data file, and that will not be followed by anything else. The \n(?=.) says a newline followed by some single character and, because the look-ahead consumes no characters, leaving the pointer just after the newline, only the newline gets replaced. The statement

        s/\n(?=.)/\t/g for @items;

        iterates over @items aliasing each element in turn to $_ and then doing a global substitution of any newline in the middle of the data item with a tab.
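A short sketch of that statement on made-up items (the data is only for illustration). It also shows the one edge case: since there is no 's' flag on the substitution, '.' does not match a newline, so a newline that is immediately followed by another newline is left alone.

```perl
use strict;
use warnings;

# Hypothetical items, as the global match might have produced them.
my @items = (
    "1\tDoe\tAnimal\nFemale\nProtein\n",
    "2\tClary\tFish\nNor\nFowl\n",
);

# for aliases each element to $_ in turn, so @items is modified in place.
s/\n(?=.)/\t/g for @items;

print @items;

# Edge case: the first newline is followed by a newline, so only the second
# one (followed by "b") is replaced.
my $blank = "a\n\nb\n";
( my $fixed = $blank ) =~ s/\n(?=.)/\t/g;
print "edge case as expected\n" if $fixed eq "a\n\tb\n";
```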

        I hope this makes things clearer for you.

        Cheers,

        JohnGG