Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends, I have just got a job that requires me to sort through a tagged text file. The file is in an Adobe Pagemaker 6.5 tagged text file format. I need some help with doing the pattern matching and string manipulation.
Heres an example of the file
]a[1 ]b[FORTUNE BAY]1[R74]c[(9)]d[58 ]e[b g 9 Little Brown Jug-Gemfly (by Golden Elder (GB))]f[(BLINKERS) ]g[A W & Mrs P I Reynolds, T A Robinson, D J & Mrs V G Stuart & A R Wa +ghorn ]h[Gay Stuart, Riverton ]i[Emerald green, navy blue braces & diamond armbands, emerald green c +ap ]j[04640]w[CANT 13 Aug H Rating 76 1600m 11 of 16 Tailend inr,btld clo +ser inr T C Newton 58 (6) 15/15 Wnr: BRIGHTON PIER (AUS) 1:42.69 13.8 +L ]k[ :(46-2-7-6-6-$21145) ]m[ F(0-0-0) G(17-1-5) D(7-0-1) S(15-1-6) H(7-0-1) C(1-0-0) Dst(2-0-0) ]a[2 ]b[LUCKY LAD]1[R72]c[(10)]d[57 ]e[ch g 6 Shinko King (IRE)-Argyll Dazzler (by Tarrago (ITY))]f[(BLINK +ERS) ]g[D H F Green ]h[Kevin Hughes, Riccarton Park ]i[Emerald green & light blue diagonal stripes, striped sleeves & cap ]j[X3333]w[CANT 10 Aug H Rating 76 1800m 3 of 15 Trld ldr,2nd tn,fght +mid N G Harris 57 (4) 1/1 Wnr: SHRKAN 2:01.69 4.5L ]k[ :(28-1-4-7-2-$12688) ]m[ F(0-0-0) G(13-0-4) D(7-0-3) S(6-1-3) H(2-0-1) C(0-0-0) Dst(3-0-2) ]a[3
--------------------- The tags are the flagged using ] and [ eg ]a[

Heres the structure of the program I am wanting
my $a = '\[a\]'; my $b = '\[b\]'; open(HANDLE, "$dataFile") or warn "Error: Unable to open data file"; my @raw_data=<HANDLE>; foreach my $line (@raw_data) { if ($line =~ /$a/ ) { # Process } if ($line =~ /$b/) { # Process } } close HANDLE;
What I then want to do is remove the tag from the line. Notice how with the ]b[ tag line there is more than one tag
eg ]b[LUCKY LAD]1[R72]c[(10)]d[57
What would be a technique to remove the tags and sort into independant variables.

Replies are listed 'Best First'.
Re: Tag pattern matching
by graff (Chancellor) on Sep 06, 2005 at 03:36 UTC
    If the data file is always consistent with the example you gave, in the following respects:
    • record blocks are always separated by one or more blank lines
    • each record block always begins with "]a["
    • the "field delimiter" pattern within each record is always three characters: open-sq-bracket, single-letter-or-digit, close-sq-bracket
    then here is a way that would let you read the file one whole record at a time, and load fields into a hash, keyed by the tag name:
    open( IN, $filename ) or die "$filename: $!"; { local $/ = ''; # cf. perldoc perlvar about $/ and "paragraph mode +" while (<IN>) # read a whole record into $_ { @fields = split( /\](\w)\[/ ); # use parens to capture the let +ters shift @fields; # split puts an empty element before ']a[', so +drop that my %record = @fields; # convert array to key=>value hash # you now have tags as hash keys, strings as hash values: # $record{"a"}==1, $record{"b"}=="FORTUNE BAY", etc. # to use as you see fit. } }
    If you want to have all records in memory at once (not just one record at a time), you can simply declare an array before the first while loop, and then after the fields are loaded into %record just  push @array, { %record }; to build an array of hashes (AoH), and get to individual fields of a record like this:  $array[0]{"a"}

    I wouldn't use "$a" or "$b" as names for scalar variables like you suggested -- this can get messed up if you use the "sort" function in the same scope as these variables.

    If the input data varies from file to file or record to record regarding the features listed above, you'll need to tweak this approach (or tweak the data before using it).

    (update: I simplified the split regex so that it's easier to read -- and it accepts a wider range of tags than appears in the OP data, which probably won't cause a problem. (ahem...) Then I updated again to get the bracketing right in that regex.)

Re: Tag pattern matching
by GrandFather (Saint) on Sep 06, 2005 at 03:10 UTC

    I'm not sure what you mean by sort into independant variables, but the code below should point you in the right direction.

    use strict; use warnings; my $a = ']a\['; my $b = ']b\['; #open(HANDLE, "$dataFile") or die "Error: Unable to open data file"; while (my $line = <DATA>) {#(<HANDLE>) { chomp $line; if ($line =~ /$a/ ) { my @fields = split /\].*?\[/, $line; print ((join "]a[", @fields). "\n"); } elsif ($line =~ /$b/) { my @fields = split /\].*?\[/, $line; print ((join "]b[", @fields). "\n"); } } #close HANDLE;

    Prints:

    ]a[1 ]b[FORTUNE BAY]b[R74]b[(9)]b[58 ]a[2 ]b[LUCKY LAD]b[R72]b[(10)]b[57 ]a[3

    Perl is Huffman encoded by design.
Re: Tag pattern matching
by BrowserUk (Patriarch) on Sep 06, 2005 at 05:24 UTC

    Rather than a big block of if then elsifs, you might consider using a dispatch table:

    #! perl -slw use strict; sub other{ print "Processing '$_[0]' with arg: '$_[1]'" } my %dispatch = map{ $_ => \&other } 'b'..'z'; $dispatch{a} = sub { print "\nNew section\nProcessing 'a' with arg '$_[1]'"; }; while( <DATA> ) { chomp; $dispatch{$1}->( $1, $2 ) while m/ \] ( [a-z] ) \[ ( [^]]+ ) /xg; } __DATA__ ...
    P:\test>489332 New section Processing 'a' with arg '1' Processing 'b' with arg: 'FORTUNE BAY' Processing 'c' with arg: '(9)' Processing 'd' with arg: '58' Processing 'e' with arg: 'b g 9 Little Brown Jug-Gemfly (by Golden Eld +er (GB))' Processing 'f' with arg: '(BLINKERS)' Processing 'g' with arg: 'A W & Mrs P I Reynolds, T A Robinson, D J & +Mrs V G Stuart & A R Waghorn' Processing 'h' with arg: 'Gay Stuart, Riverton' Processing 'i' with arg: 'Emerald green, navy blue braces & diamond ar +mbands, emerald green cap' Processing 'j' with arg: '04640' Processing 'w' with arg: 'CANT 13 Aug H Rating 76 1600m 11 of 16 Taile +nd inr,btld closer inr TC Newton 58 (6) 15/15 Wnr: BRIGHTON PIER (AUS +) 1:42.69 13.8L Processing 'k' with arg: ' :(46-2-7-6-6-$21145)' Processing 'm' with arg: ' F(0-0-0) G(17-1-5) D(7-0-1) S(15-1-6) H(7-0 +-1) C(1-0-0) Dst(2-0-0)' New section Processing 'a' with arg '2' Processing 'b' with arg: 'LUCKY LAD' Processing 'c' with arg: '(10)' Processing 'd' with arg: '57' Processing 'e' with arg: 'ch g 6 Shinko King (IRE)-Argyll Dazzler (by +Tarrago (ITY))' Processing 'f' with arg: '(BLINKERS)' Processing 'g' with arg: 'D H F Green' Processing 'h' with arg: 'Kevin Hughes, Riccarton Park' Processing 'i' with arg: 'Emerald green & light blue diagonal stripes, + striped sleeves & cap' Processing 'j' with arg: 'X3333' Processing 'w' with arg: 'CANT 10 Aug H Rating 76 1800m 3 of 15 Trld l +dr,2nd tn,fght mid N G Harris 57 (4) 1/1 Wnr: SHRKAN 2:01.69 4.5L' Processing 'k' with arg: ' :(28-1-4-7-2-$12688)' Processing 'm' with arg: ' F(0-0-0) G(13-0-4) D(7-0-3) S(6-1-3) H(2-0- +1) C(0-0-0) Dst(3-0-2)' New section Processing 'a' with arg '3'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Ooh I like that!

      Thx,
      ~Diz

Re: Tag pattern matching
by albert (Monk) on Sep 06, 2005 at 05:08 UTC
    Looking at this if one assumes that each record starts with ]a[, and that you want to group all the records by the value in this key field, you can collect the records as listed below. This caputures one or more record on a line, though the solution assumes that each record is never spans multiple lines. I also assume that the markers for each field is always a single character except for 'a' (case insensitive).

    -albert

    #!/usr/bin/perl -w my $key; my $records; while(<DATA>){ ($key) = /\][Aa]\[(.*?)(?=[\]\n])/ if /\][Aa]\[/; my @items = /\]([b-zB-Z0-9])\[(.*?)(?=[\]\n])/g if /\][b-zB-Z0-9]\ +[/; foreach (my $i = 0; $i < @items; $i += 2){ $records->{$key}->{$items[$i]} = $items[$i+1]; } } foreach my $k(sort keys %$records){ foreach my $item (sort keys %{$records->{$k}}){ # Do stuff with found records here print join("\t", $k, $item, $records->{$k}->{$item}), "\n"; } } __DATA__ ]a[1 ]b[FORTUNE BAY]1[R74]c[(9)]d[58 ]e[b g 9 Little Brown Jug-Gemfly (by Golden Elder (GB))]f[(BLINKERS) ]g[A W & Mrs P I Reynolds, T A Robinson, D J & Mrs V G Stuart & A R Wa +ghorn ]h[Gay Stuart, Riverton ]i[Emerald green, navy blue braces & diamond armbands, emerald green c +ap ]j[04640]w[CANT 13 Aug H Rating 76 1600m 11 of 16 Tailend inr,btld clo +ser inr T C Newton 58 (6) 15\ /15 Wnr: BRIGHTON PIER (AUS) 1:42.69 13.8L ]k[ :(46-2-7-6-6-$21145) ]m[ F(0-0-0) G(17-1-5) D(7-0-1) S(15-1-6) H(7-0-1) C(1-0-0) Dst(2-0-0) ]a[2 ]b[LUCKY LAD]1[R72]c[(10)]d[57 ]e[ch g 6 Shinko King (IRE)-Argyll Dazzler (by Tarrago (ITY))]f[(BLINK +ERS) ]g[D H F Green ]h[Kevin Hughes, Riccarton Park ]i[Emerald green & light blue diagonal stripes, striped sleeves & cap ]j[X3333]w[CANT 10 Aug H Rating 76 1800m 3 of 15 Trld ldr,2nd tn,fght +mid N G Harris 57 (4) 1/1 Wn\ r: SHRKAN 2:01.69 4.5L ]k[ :(28-1-4-7-2-$12688) ]m[ F(0-0-0) G(13-0-4) D(7-0-3) S(6-1-3) H(2-0-1) C(0-0-0) Dst(3-0-2)
Re: Tag pattern matching
by nedals (Deacon) on Sep 06, 2005 at 05:33 UTC

    I'm guessing you wanted to split the data on more than just 'a' and 'b'.

    This will split on each of the tags, 'a'..'m', and put each bit of text into seperate elements of an AoA (array of arrays)

    use strict; use Data::Dumper; ## Seperate each element of each group into an AoA my @AoA = (); my $i = 0; while (<DATA>) { chomp; ## split each line so each 'element' is processed as seperate line my @lines = split(']',$_); for (@lines) { if ($_) { ## Skip the blank lines created bt 'split' my $line = $_; $line =~ s/\w\[//; ## Get rid of the 'x[' push @{$AoA[$i]},$line; } } $i++ if (@lines[1] =~ /m\[/); } print Dumper(\@AoA); exit; __DATA__ ]a[1 ]b[FORTUNE BAY]1[R74]c[(9)]d[58 ]e[b g 9 Little Brown Jug-Gemfly (by Golden Elder (GB))]f[(BLINKERS) ]g[A W & Mrs P I Reynolds, T A Robinson, D J & Mrs V G Stuart & A R Wa +ghorn ]h[Gay Stuart, Riverton ]i[Emerald green, navy blue braces & diamond armbands, emerald green c +ap ]j[04640]w[CANT 13 Aug H Rating 76 1600m 11 of 16 Tailend inr,btld clo +ser inr T C Newton 58 (6) 15/15 Wnr: BRIGHTON PIER (AUS) 1:42.69 13.8 +L ]k[ :(46-2-7-6-6-$21145) ]m[ F(0-0-0) G(17-1-5) D(7-0-1) S(15-1-6) H(7-0-1) C(1-0-0) Dst(2-0-0) ]a[2 ]b[LUCKY LAD]1[R72]c[(10)]d[57 ]e[ch g 6 Shinko King (IRE)-Argyll Dazzler (by Tarrago (ITY))]f[(BLINK +ERS) ]g[D H F Green ]h[Kevin Hughes, Riccarton Park ]i[Emerald green & light blue diagonal stripes, striped sleeves & cap ]j[X3333]w[CANT 10 Aug H Rating 76 1800m 3 of 15 Trld ldr,2nd tn,fght +mid N G Harris 57 (4) 1/1 Wnr: SHRKAN 2:01.69 4.5L ]k[ :(28-1-4-7-2-$12688) ]m[ F(0-0-0) G(13-0-4) D(7-0-3) S(6-1-3) H(2-0-1) C(0-0-0) Dst(3-0-2)
Re: Tag pattern matching
by Dizzley (Novice) on Sep 06, 2005 at 04:43 UTC
    Err...
    Your data file has tags matching e.g. ']a['.
    ]a[1 ]b[FORTUNE BAY]1[R74]c[(9)]d[58 ]e[b g 9 Little Brown Jug-Gemfly (by Golden Elder (GB))]f[(BLINKERS)
    Your own code matches e.g. '[a]'.
    my $a = '\[a\]'; my $b = '\[b\]';

    hth,
    Diz

Re: Tag pattern matching
by Anonymous Monk on Sep 06, 2005 at 05:13 UTC
    Thanks guys. Both examples are exactly what I wanted although did take me sometime to digest. So lucky to have such a good perl community. Sorry about the typo. (hey no ones perfect).