rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I'm hoping someone can help me with, or teach me about, regexes. I can write basic ones, but I'm trying to do something that has me quite stumped.
The best way to describe it is with an example, so here we go:
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = (
    "Smith, John", "Smith, John Ronald", "Johnson, James",
    "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah"
);
my @authors = ( "Smith J", "Jackson J", "James RJ", "Van der Burg J", "O'Neil S" );

# Results should be:
# Smith J = Smith, John
# Jackson J = Jackson J
# James RJ = James, Ray Jack
# Van der Burg J = Van der Burg, Jon
# O'Neil S = O'Neil, Sarah

foreach my $author (@authors) {
    print "$author = ";

    # Regex rules
    # Last ' ' before all-uppercase word should become ', '
    # Every singular or grouped capital letter
    # (i.e. F or RJ) should become F(.*) or R(.*) J(.*)

    # What I have so far
    $author =~ s/ (\w+?)\p{IsUpper}/, $1\(\.*\)/;
    print "[ $author ] : ";
    if ( my ($match) = ( grep $_ =~ /$author/, @full_authors ) ) {
        $author = $match;
        @full_authors = grep { $_ ne $match } @full_authors;
    } # end-if
    print "$author\n";
} # end-foreach
exit;
Any help anyone could provide would be much appreciated.

Cheers,
Reagen

Replies are listed 'Best First'.
Re: regex help
by dragonchild (Archbishop) on Dec 09, 2004 at 15:03 UTC
    No need for a regex. Just create a hash where the keys are your full author names collapsed and the values are the complete full author names. Then just do lookups in your hash. If you find something, great! If you don't, oh well.
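A minimal sketch of the hash lookup described above, using the data from the original question. The helper name collapse() and the hash name %full_for are assumptions made for illustration, not names from the thread, and the sketch assumes every full name has the form "Surname, Given Names".

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = (
    "Smith, John", "Smith, John Ronald", "Johnson, James",
    "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah",
);
my @authors = ( "Smith J", "Jackson J", "James RJ", "Van der Burg J", "O'Neil S" );

# collapse() is a hypothetical helper: "James, Ray Jack" -> "James RJ".
sub collapse {
    my ($full) = @_;
    my ( $surname, $given ) = split /,\s*/, $full, 2;
    return $full unless defined $given;    # no comma: leave the name unchanged
    my $initials = join '', map { substr( $_, 0, 1 ) } split ' ', $given;
    return "$surname $initials";
}

# Key: collapsed name. Value: full name. The first full name seen wins.
my %full_for;
for my $full (@full_authors) {
    my $key = collapse($full);
    $full_for{$key} = $full unless exists $full_for{$key};
}

# Plain hash lookup; an abbreviation with no entry is printed unchanged.
for my $author (@authors) {
    my $match = exists $full_for{$author} ? $full_for{$author} : $author;
    print "$author = $match\n";
}
```

Two full names that collapse to the same key (say "Smith, John" and "Smith, Jack") would need a policy of their own; this sketch simply keeps the first one.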

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Hey Dragonchild,
      What do you mean by "full authors collapsed"?

      Update: I think I have understood. You mean collapse the full author names to the abbreviated form, like Smith, James -> Smith J?
        Exactly. Instead of making things hard for yourself, break the problem down to its component parts and solve it the easy way. :-)


Re: regex help
by gopalr (Priest) on Dec 10, 2004 at 06:21 UTC

    Hi Reagen,

    Here is a simple regexp. Let me know if there is any mistake.

    There is no need to use the second array, i.e. @authors.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @full_authors = (
        "Smith, John", "Smith, John Ronald", "Johnson, James",
        "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah"
    );

    foreach my $author (@full_authors) {
        my $fullauthorname = $author;
        # Repeatedly strip the lowercase tail of each given name, keeping its initial
        while ( $author =~ s#(, .*?[A-Z])[a-z]+\s*#$1# ) { }
        $author =~ s#,##g;
        print "\n$author = $fullauthorname";
    }

    Regards,

    Gopal.R.

Re: regex help
by Animator (Hermit) on Dec 09, 2004 at 15:42 UTC

    You can use grep.

    Some working code: (all you need to do to let this suit your needs is check if @x has elements)

    If you have 'Smith, John' and 'Smith, Jack' in your @full_authors array and you are searching for 'Smith J', then the @x array will contain both of these elements.

    Update: added the note about duplicates.
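The code from this reply is not preserved here, so the following is only a sketch of one way the grep approach could look. The helper pattern_for() is an assumption made for illustration: it turns an abbreviation into an anchored pattern in which each initial must start one given name. The array name @x follows the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = (
    "Smith, John", "Smith, John Ronald", "Johnson, James",
    "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah",
);
my @authors = ( "Smith J", "Jackson J", "James RJ", "Van der Burg J", "O'Neil S" );

# pattern_for() is a hypothetical helper:
# "James RJ" becomes a pattern equivalent to /\AJames, R\S* J\S*\z/.
sub pattern_for {
    my ($short) = @_;
    # Split at the last space: surname on the left, initials on the right.
    my ( $surname, $initials ) = $short =~ /\A(.*) (\p{IsUpper}+)\z/
        or return qr/\A\Q$short\E\z/;    # no initials found: match literally
    my $given = join ' ', map { $_ . '\S*' } split //, $initials;
    return qr/\A\Q$surname\E, $given\z/;
}

for my $author (@authors) {
    my $re = pattern_for($author);
    my @x  = grep { $_ =~ $re } @full_authors;    # every full name that fits
    print "$author = ", ( @x ? join( ' | ', @x ) : $author ), "\n";
}
```

Because each initial is followed by \S*, 'Smith J' does not match 'Smith, John Ronald', but it would still match both 'Smith, John' and 'Smith, Jack' if both were in the list.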

      To suit your other needs (posted in the reply to the previous message):

      Make a hash of the full authors, with their literal name.
      Code you can use for that (I would advise you to try it yourself first before looking at my solution):

      If you've done that, then you can replace @full_authors in the grep method with keys %my_own_hash, and then remove all the elements that were found (or just the first one) from that hash using the delete function.
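The original code for this step is also missing here, so this is a hedged sketch of the idea rather than the poster's solution. %my_own_hash is the name used in the text; pattern_for() is a hypothetical helper that builds an anchored pattern from an abbreviation, and sorting the keys is an assumption added only to make the result repeatable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = (
    "Smith, John", "Smith, John Ronald", "Johnson, James",
    "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah",
);
my @authors = ( "Smith J", "Jackson J", "James RJ", "Van der Burg J", "O'Neil S" );

# Hypothetical helper: "James RJ" -> pattern like /\AJames, R\S* J\S*\z/.
sub pattern_for {
    my ($short) = @_;
    my ( $surname, $initials ) = $short =~ /\A(.*) (\p{IsUpper}+)\z/
        or return qr/\A\Q$short\E\z/;
    my $given = join ' ', map { $_ . '\S*' } split //, $initials;
    return qr/\A\Q$surname\E, $given\z/;
}

# Hash of the full authors, keyed by their literal name.
my %my_own_hash = map { $_ => 1 } @full_authors;

my @result;
for my $author (@authors) {
    my $re = pattern_for($author);
    # grep over the keys that are still available; take the first hit only.
    my ($match) = grep { $_ =~ $re } sort keys %my_own_hash;
    if ( defined $match ) {
        delete $my_own_hash{$match};    # a found name cannot be matched again
        push @result, "$author = $match";
    }
    else {
        push @result, "$author = $author";
    }
}
print "$_\n" for @result;
```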

      Two other techniques to prevent an author from appearing twice in your final result are:

      • Use a hash in which the key is the author's name; each time you find an author you increment its value (and check that value before accepting a match)
      • Build a hash in which the key is the author's full name and the value is its index in the array; when you find the author's name, use that hash to look up the index and set the array element at that index to a symbol that can't occur in your data (a number, a semicolon, a colon, ...)
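Both techniques can be sketched roughly as follows. All sub names here (pattern_for, resolve_with_seen, resolve_with_index) are hypothetical, the ';' placeholder is just one symbol assumed never to occur in the data, and the second sub assumes the full names are unique.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: "James RJ" -> pattern like /\AJames, R\S* J\S*\z/.
sub pattern_for {
    my ($short) = @_;
    my ( $surname, $initials ) = $short =~ /\A(.*) (\p{IsUpper}+)\z/
        or return qr/\A\Q$short\E\z/;
    my $given = join ' ', map { $_ . '\S*' } split //, $initials;
    return qr/\A\Q$surname\E, $given\z/;
}

# Technique 1: count each full name as it is used and skip used ones.
sub resolve_with_seen {
    my ( $authors, $full ) = @_;
    my ( %seen, @out );
    for my $author (@$authors) {
        my $re = pattern_for($author);
        my ($match) = grep { !$seen{$_} && $_ =~ $re } @$full;
        $seen{$match}++ if defined $match;
        push @out, defined $match ? $match : $author;
    }
    return @out;
}

# Technique 2: map full name -> array index, then overwrite the used slot.
sub resolve_with_index {
    my ( $authors, $full ) = @_;
    my @pool     = @$full;    # work on a copy so the caller's array is untouched
    my %index_of = map { $pool[$_] => $_ } 0 .. $#pool;
    my @out;
    for my $author (@$authors) {
        my $re = pattern_for($author);
        my ($match) = grep { $_ =~ $re } @pool;
        $pool[ $index_of{$match} ] = ';' if defined $match;    # can't match again
        push @out, defined $match ? $match : $author;
    }
    return @out;
}

my @full  = ( "Smith, John", "Smith, Jack", "James, Ray Jack" );
my @short = ( "Smith J", "Smith J", "James RJ", "Jackson J" );
print join( ' / ', resolve_with_seen( \@short, \@full ) ),  "\n";
print join( ' / ', resolve_with_index( \@short, \@full ) ), "\n";
```

In both subs the second 'Smith J' falls through to 'Smith, Jack' because 'Smith, John' has already been used.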

      I guess that both of these techniques will be faster, but I'm not really sure since I'm too lazy to benchmark them :) (so if you want to be sure, you should benchmark them yourself)

      Thanks heaps Animator. Very descriptive.
      Now I will try to modify it a little to cope with "Smith J Jr" and see if I've learnt something :)
      Cheers,
      Reagen
Re: regex help
by sasikumar (Monk) on Dec 09, 2004 at 15:07 UTC
    Do it with a hash table; that is the best way.