Character Text Delimiters

Ninthwave has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that is generated from a database. The string contains multiple records. The program that generates this string appends a two character code to the front of each string.

Example:

AA(<text values>), AB(<text values>), AC(<text values>)

Note the parentheses, commas and spaces are not part of the text values, but the text values themselves may contain commas and parentheses.

I am looking for the elegant way to capture each value, noting that the order is important as AA lines up with another dataset. Also note that so far I have seen this string with as many as 400+ values scrolling into QR and above though I have not seen if the system has any option for when the value exceeds ZZ so I used ZZ as my upper limit.

Also the reason I have an array for the sting value is the data field that this string comes from is an array though so far only the first element has been populated. I need more examples to determine if I will be able to change the section with @{$strings} to just $string in the future.

My solution - improvements or better algorithms sought

    my $start = q{AA};
    my $next  = q{AB};
    my $count = 0;
    my $value;
    foreach my $entry ( @{$strings} ) {
        if ( $count == $#$strings ) {
            $next = q{$};
        }
        else {
            $next = q{\,\ } . _alpha_add($start);
        }
        $affiliation_string =~ m/$start\((.*?)\)($next)/s;
        $value                     = $1;
        $affiliation_key->{$entry} = $value;
        $start                     = _alpha_add($start);
        $next                      = _alpha_add($next);
        $count++;
    }
[download]

And the _alpha_add sub

sub _alpha_add {
    my $string = shift;
    if ( substr( $string, 1, 1 ) eq q{Z} ) {
        my $first_character  = substr( $string, 0, 1 );
        my $second_character = substr( $string, 1, 1 );
        $first_character++;
        $second_character = q{A};
        $string = $first_character . $second_character;
    }
    else {
        $string++;
    }
    return $string;
}
[download]

UPDATE: Please note as a delimiter is most likely the wrong term, as the format prepends the data and is absent at the end of the string.

"No matter where you go, there you are." BB

Comment on Character Text Delimiters Select or Download Code

Replies are listed 'Best First'.
Re: Character Text Delimiters by AnomalousMonk (Archbishop) on Oct 26, 2011 at 00:22 UTC
Others have noted this is an inherently fragile data format. (An example, I think, of the Semipredicate problem.) See what happens when records in the test data below are swapped, or if 'AE(foo)' in record AD is changed to '(fubar),AE(foo)'. However, one possible way: >perl -wMstrict -le "my $s = 'AA(Acme Widgets. 123 Coyote St. AZ(Ariz.),USA) ,' . 'AB(Your Name. 99 Some St. HI(Hawaii), USA), ' . 'AC(Dep deAstro. Uni de Val. C/Dr. M 50, 461 Bur (Val), Sp),' . 'AD(AE(foo), approaching breaking point AD(bar)) , ' . 'AE(optional trailing comma, spaces on last record)' ; ;; my $tag = 'AA'; my $stop = 'ZZ'; ;; EXTRACT: for (++(my $after = $tag); $tag le $stop; ++$tag, ++$after) { my $pre = qr{ \G $tag [(] }xms; my $post = qr{ [)] (?: \s* , \s* (?= $after) \| \s* ,? \s* \z) }xms; ;; last EXTRACT unless $s =~ m{ $pre (.?) $post }xmsg; my $extract = $1; print qq{'$tag': [[$extract]]}; } " 'AA': [[Acme Widgets. 123 Coyote St. AZ(Ariz.),USA]] 'AB': [[Your Name. 99 Some St. HI(Hawaii), USA]] 'AC': [[Dep deAstro. Uni de Val. C/Dr. M 50, 461 Bur (Val), Sp]] 'AD': [[AE(foo), approaching breaking point AD(bar)]] 'AE': [[optional trailing comma, spaces on last record]] [download] Update:* Enhanced discussion, improved 'robustness' of extraction (for some definition of robust), added stress-test data records to example data.	[reply] [d/l]
Re: Character Text Delimiters by RichardK (Parson) on Oct 25, 2011 at 16:31 UTC
Oh - an interesting problem ;) You don't really need _alpha_add as perl is smart enough to do that for you `my $c = 'AA'; $c++; say $c;` [download] will output AB I don't think I can do any better for the rest of your code, unless you know what format the text entries take?	[reply] [d/l]
Re^2: Character Text Delimiters by Ninthwave (Chaplain) on Oct 25, 2011 at 16:38 UTC
Wow I just tested it I really expected ++ to go from AZ to Aa or the next character after Z thanks that gets rid of some mess in the code. I am always amazed by Perl's ability to know what I want it to do. Thank you. "No matter where you go, there you are." BB	[reply]
Re: Character Text Delimiters by Sue D. Nymme (Monk) on Oct 25, 2011 at 18:39 UTC
Have you considered splitting the string on "), XX("? Something like: `@chunks = split /\), ([A-Z][A-Z])$/, $input;` [download] That'll give you all the values interspersed with the prefixes (with the first and last elements being wacky because you matched on the stuff that goes between each value). In fact, you could make the split pattern a bit more complex, to separate out the first "AA(" and the last ")": `@chunks = split /(?: (?:^\|$,\s) ([A-Z][A-Z])$ \| $$)/x, $input; shift @chunks; # remove extra blank element at front` [download] Then you can use `part` from `List::MoreUtils` to partition the list into prefixes and values, or you could just assign it to a hash and use it directly: `use List::MoreUtils qw(part); my @chunks = ...; # as above shift @chunks; # as above my $ix = 0; my ($prefixes,$texts) = part {++$ix % 2} @chunks; # or my %text = @chunks; say "Text value AF is $text{AF}";` [download]	[reply] [d/l] [select]
Re^2: Character Text Delimiters by Ninthwave (Chaplain) on Oct 25, 2011 at 19:45 UTC
I did try something similar but on the last pattern the text value in one case was AC(Departament d'Astronomia i Astrofísica. Universitat de Valčncia. C/Dr. Moliner 50, 46100 Burjassot (Valčncia), Spain) And was having problems with the (Valencia) portion and if I used the \| was getting too much but I will try this or variant to see if it can be consistent. Sorry I don't have the exact pattern from the previous attempt that was similar to this split. "No matter where you go, there you are." BB	[reply]
Re: Character Text Delimiters by johngg (Canon) on Oct 26, 2011 at 10:52 UTC
An alternative to the various regular expression approaches would be to use index and substr to work along the text. This method makes no attempt at coping with the potential fragilities of the data format so YMMV. use strict; use warnings; use 5.010; my $text = q{xxxAAsgdhsdgABudhwuACedAGuwydADwieuhwiudAEudwdfuw}; my $tag = shift \|\| q{AA}; my $limit = q{ZZ}; my @records = (); my $prevPosn = index $text, $tag, 0; die qq{ERROR: First record '$tag' not found in text\n} if $prevPosn == -1; warn qq{WARNING: First record '$tag' not at start of text\n} unless $prevPosn == 0; while ( ( my $posn = index $text, ++ $tag, $prevPosn + 2 ) != -1 ) { push @records, substr $text, $prevPosn, $posn - $prevPosn; $prevPosn = $posn; last if $tag eq q{ZZ}; } push @records, substr $text, $prevPosn; say qq{>$_<} for @records; [download] The output. `>AAsgdhsdg< >ABudhwu< >ACedAGuwyd< >ADwieuhwiud< >AEudwdfuw<` [download] I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]