Re: regex help

You can use grep.

Some working code (to adapt it to your needs, all you have to do is check whether @x has any elements):

#!/usr/bin/perl -l

use strict;
use warnings;

my @full_authors = ( "Smith, John", "Smith, John Ronald", "Johnson, James", "James, Ray Jack", "Van der Burg, Jon", "O'Neil, Sarah" );
my @authors = ( "Smith J", "Jackson J", "James RJ", "Van der Burg J", "O'Neil S" );

$, = " & ";


foreach my $a (@authors) {
  # This regex puts everything before the last space in $f, and everything after that space in $s
  # Example: if $a is 'James RJ', then $f is 'James' and $s is 'RJ'
  my ($f, $s) = $a =~ m/(.*)\s+(\w+)$/;

  my @g = split //, $s; # Split the initials into single characters, so that each element of @g is one letter

  $f = quotemeta($f); # Escape all special characters in $f, like a . and a space (the space needs escaping because the /x modifier is in use)
  

  my $p = join('\w+\s+', @g);
  # Join the letters with the regex fragment \w+\s+ between them, meaning that RJ becomes R\w+\s+J (which is used in the regex below)

  my @x = grep (m/
    ^      # Match start of line 
    $f     # Match the last name
    \s*    # Match some optional whitespace
    ,      # Match a comma
    \s+    # Match some whitespace (not optional)
    $p     # Match the second part of the name
    \w+    # Match the remaining word characters of this name
    \s*    # Match some optional whitespace
    $      # Match the end
   /xi, @full_authors);

  print $a, @x;
}

If you have both 'Smith, John' and 'Smith, Jack' in your @full_authors array and you search for 'Smith J', then the @x array will contain both of these elements.
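To make the "check whether @x has elements" step concrete, here is a minimal sketch. It wraps the same matching logic in a sub and reports whether a lookup found nothing, one name, or several. The sub name find_full_names and the sample data are my own assumptions for illustration, not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every full name that fits an abbreviated one.
sub find_full_names {
    my ($short, @full_authors) = @_;

    # Last name is everything before the final space, the initials follow it.
    # Return an empty list if the abbreviated name has no such shape.
    my ($last, $initials) = $short =~ m/(.*)\s+(\w+)$/
        or return;

    $last = quotemeta $last;                          # escape regex metacharacters
    my $given = join '\w+\s+', split //, $initials;   # RJ becomes R\w+\s+J

    return grep { m/^$last\s*,\s+$given\w+\s*$/i } @full_authors;
}

my @full_authors = ("Smith, John", "Smith, Jack", "James, Ray Jack");

foreach my $short ("Smith J", "James RJ", "Jackson J") {
    my @found = find_full_names($short, @full_authors);

    if    (!@found)     { print "$short: no match\n" }
    elsif (@found == 1) { print "$short: $found[0]\n" }
    else                { print "$short: ambiguous (", join("; ", @found), ")\n" }
}
```

With this sample data 'Smith J' is reported as ambiguous, 'James RJ' resolves to a single name, and 'Jackson J' has no match. What you do in the ambiguous case depends on your data.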

Update: added the note about duplicates.

Re^2: regex help by Animator (Hermit) on Dec 09, 2004 at 15:49 UTC
To suit your other needs (posted in the reply to the previous message): make a hash of the full authors, keyed on their literal names. Code you can use for that (I would advise you to try it yourself first before looking at my solution): Read more... (176 Bytes) Once you've done that, you can replace the `@full_authors` in the grep call with `keys %my_own_hash`, and then remove all the elements that were found (or just the first one) from that hash using the `delete` function.
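A rough sketch of how the hash-and-delete idea could look. This is not the hidden code from the post, only one possible reading of the description. The names %remaining and @result and the sample data are assumptions, and the regex is the one-line form of the pattern from the parent post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = ("Smith, John", "Smith, Jack", "James, Ray Jack");
my @authors      = ("Smith J", "Smith J", "James RJ");

# Each full name is a key; the value is not used.
my %remaining = map { $_ => 1 } @full_authors;

my @result;
foreach my $short (@authors) {
    my ($last, $initials) = $short =~ m/(.*)\s+(\w+)$/ or next;
    $last = quotemeta $last;
    my $given = join '\w+\s+', split //, $initials;

    # keys() returns no fixed order, so sort to keep the outcome predictable.
    my @found = grep { m/^$last\s*,\s+$given\w+\s*$/i } sort keys %remaining;
    next unless @found;

    # Claim only the first hit and delete it, so it cannot be matched again.
    delete $remaining{ $found[0] };
    push @result, "$short => $found[0]";
}

print "$_\n" for @result;
```

Here the second 'Smith J' can no longer pick the name the first one claimed, because that key is already gone from the hash. To remove every hit instead of only the first, a hash slice does it in one go: delete @remaining{@found}.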
Re^2: regex help by Animator (Hermit) on Dec 09, 2004 at 17:41 UTC
Two other techniques to prevent an author from appearing twice in your final result:

- Use a hash in which the key is the author's name; each time you find an author, you increment its value (and compare it).
- Build a hash with the author's full name as the key and the index in the array as the value. When you find the author's name, use that hash to look up the index and set the array element at that index to a symbol that can't occur in your data (a number, a semicolon, a colon, ...).

I guess that both of these techniques will be faster, but I'm not really sure of this since I'm too lazy to benchmark it :) (so if you want to be sure, you should benchmark it yourself).
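Both techniques can be sketched in a few lines. To keep it short, the array @hits below is made-up data standing in for the names that grep returns over several lookups; %seen, %index_of and @pool are names I picked for illustration, and the semicolon is assumed never to occur in the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @full_authors = ("Smith, John", "Smith, Jack", "James, Ray Jack");

# Stand-in for the full names found across several lookups (made-up data).
my @hits = ("Smith, John", "James, Ray Jack", "Smith, John");

# Technique 1: count how often each full name was found,
# and keep a name only the first time its counter reaches 1.
my %seen;
my @unique = grep { ++$seen{$_} == 1 } @hits;

# Technique 2: map each full name to its index in the array,
# then overwrite a used slot with a marker that cannot occur in the data.
my %index_of;
@index_of{@full_authors} = 0 .. $#full_authors;

my @pool = @full_authors;    # work on a copy so the original list survives
for my $name (@unique) {
    $pool[ $index_of{$name} ] = ';';
}

print "unique: ", join(" & ", @unique), "\n";
print "pool:   ", join(" | ", @pool), "\n";
```

After this runs, @unique holds each found name once, and every used slot in @pool contains the marker, so a later grep over @pool cannot match it again.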
Re^2: regex help by rsiedl (Friar) on Dec 09, 2004 at 16:06 UTC
Thanks heaps Animator. Very descriptive. Now I will try to modify it a little to cope with "Smith J Jr" and see if I've learnt something :) Cheers, Reagen