in reply to Re^2: Fine tuning a reg exp
in thread Fine tuning a reg exp

Is there anything (punctuation, perhaps? placement with other words and terms?) that will consistently distinguish a name from any other proper noun in your text? For example, how can your script consistently distinguish between "Ibrahim Ali Muhammad" and "Grand Trunk Road" and "Pushtoon Garhi Pabbi", since all use the same capitalization scheme? You might have to define some more complicated criteria for recognizing names. Or will names only be in the headings of each entry, i.e. toward the beginning?

In general, you would want:

$line =~ s{($regexp)}{<name>$1</name>}g;

The 'g' flag may or may not be needed, depending on what you're doing. If there's more than one name in a line, that would catch it. If there's only one name, you don't need it. The parentheses () match the name in your line and place it in $1, so you can put the tags around it in your replacement expression. Using curly brackets {} instead of / to mark your regexp avoids having to escape your slashes ("leaning toothpick syndrome," I think someone called it -- it can get confusing!). Any other characters could be used to delimit your regexp if you'd prefer. What I have above is equivalent to this:

$line =~ s/($regexp)/<name>$1<\/name>/g;
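
To make that concrete, here is a minimal sketch. The pattern stored in $regexp and the sample line are placeholders I made up for illustration (a run of capitalized words); it is not a real name detector, and as written it tags place names too.

```perl
use strict;
use warnings;

# Hypothetical pattern: two or more capitalized words in a row.
# This is only a stand-in for whatever $regexp you end up with.
my $regexp = qr/[A-Z][a-z]+(?: [A-Z][a-z]+)+/;

# Made-up sample line
my $line = "Ibrahim Ali Muhammad was seen near Grand Trunk Road";

# Curly-bracket delimiters: no need to escape the slash in </name>
(my $curly = $line) =~ s{($regexp)}{<name>$1</name>}g;

# Slash delimiters: the slash in the closing tag must be escaped
(my $slash = $line) =~ s/($regexp)/<name>$1<\/name>/g;

# Without the 'g' flag, only the first match on the line is tagged
(my $first_only = $line) =~ s{($regexp)}{<name>$1</name>};

print "$curly\n$first_only\n";
```

Both delimiter styles produce the same string, and the version without 'g' leaves "Grand Trunk Road" untagged. That the road gets tagged at all with 'g' shows why capitalization alone is not enough.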

Re^4: Fine tuning a reg exp
by markjrouse (Initiate) on Feb 23, 2012 at 20:05 UTC
    Yes, names should only be at the beginning of each line. Yeah, it gets tricky because names are: , names space|comma, but then so are other elements that are not names.

      Are you only going to be concerned with the names of people marked "individual"? And does "[SDGT]", or something similar, mark the end of each record? If you're doing this line by line, it might be more effective to read in the whole file, re-break it at the end of records, and then cut out the line breaks, so you can deal with whole, unbroken records (and look ahead for markers like "individual").

      Simply:

      undef $/;   # This sets your input record separator
                  # (i.e. $INPUT_RECORD_SEPARATOR if you 'use English')
                  # to undef. Usually it's "\n", which is what makes <FILE>
                  # read in a line at a time. You could even set this to
                  # "[SDGT]\n" if you wanted, and read in a whole actual
                  # record at a time, if that was going to be a consistent
                  # marker of the end of a record.
      my $stream = <FILE>;   # Read the whole file into one scalar variable

      # Split the stream into records, based on a regexp, if you could
      # figure out a regexp that would consistently mark the end of a record.
      # This should match any [...] marker at the end of a record.
      # Note that split takes the pattern first, then the string.
      my @records = split(m{\[[^\x5d]+\]\n}, $stream);

      foreach (@records) {
          s{\n}{ }g;   # Replace line breaks with spaces
          # You could also do your matching and marking up in this loop
      }

      I'm also intrigued by the idea of using $INPUT_RECORD_SEPARATOR to mark the end of your records. But I don't think you can set that to a regexp. It would only work if that literal were consistently the end of the record. (And I suspect that's something like a source citation, correct?)
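
      If the literal "[SDGT]\n" really does end every record, reading a record at a time might look roughly like this. It is only a sketch: the sample data is made up and abbreviated, and I read from an in-memory filehandle so it runs on its own. With a real file you would open that instead.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real file; the record
# contents and the "[SDGT]\n" terminator are assumptions.
my $data = "ABU BAKR, Ibrahim Ali Muhammad\n(individual) [SDGT]\n"
         . "AFGHAN SUPPORT COMMITTEE (ASC)\nPeshawar, Pakistan [SDGT]\n";

# Open the string as an in-memory file so the sketch is self-contained
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @records;
{
    # 'local' restores the old separator when this block ends
    local $/ = "[SDGT]\n";    # a literal string, not a regexp
    while (my $record = <$fh>) {
        chomp $record;                     # chomp strips the current $/ value
        $record =~ s{\n}{ }g;              # join the physical lines
        push @records, $record . "[SDGT]"; # put the marker back
    }
}
close $fh;

print "$_\n" for @records;
```

      This never holds more than one record in memory, which may matter for large files. It stops working as soon as a record ends with some other marker.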

      The cool thing about this approach is that if you can get at a whole record and remove the line breaks, you can have your script look for that "individual" marker (or other markers) at the end of records, to know more specifically what to do with them.

      Also, this method is probably not advisable if you're dealing with really huge files, since it reads the whole file into memory at once. But it would work if you could break them up.

      Update: Eck, but using split to split apart records in that way would cause you to lose your "[SDGT]". You would have to split it another way, probably with regexps to match each whole record.

      Something like this might work, but it would probably need tweaking, and I don't have enough sample of your file to test it very well:

      my @records = ($stream =~ m/^(.*?\[[^\x5d]+\])\n/gms);

      Running on your little snippet here, this gives me (line breaks and indenting mine, for viewability):

      @records = (
        'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)
             (individual) [SDGT]',
        'AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS;
             a.k.a. JAMIAT AYAT-UR-RHAS AL ISLAMIA;
             a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA;
             a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA)
             Grand Trunk Road, near Pushtoon Garhi Pabbi, Peshawar, Pakistan;
             Cheprahar Hadda, Mia Omar Sabaqah School, Jalalabad, Afghanistan
             [SDGT]'
      );

      Given something like this, you should be much better able to recognize names and place names, depending on the context and capitalization and other markers.
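
      As a rough sketch of what using that context could look like: the two rules below (only "(individual)" records hold a person, and the name is everything before the first " (") are my assumptions from your snippet, not something I have checked against the whole file. The second record is abbreviated.

```perl
use strict;
use warnings;

# Two joined-up records; the second is shortened from the sample output
my @records = (
    'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT]',
    'AFGHAN SUPPORT COMMITTEE (ASC) Grand Trunk Road, Peshawar, Pakistan [SDGT]',
);

my @tagged;
foreach my $record (@records) {
    my $copy = $record;    # work on a copy, leave @records untouched

    # Assumption: only records carrying "(individual)" describe a person
    if ($copy =~ m{\(individual\)}) {
        # Assumption: the name is everything before the first " ("
        $copy =~ s{^([^(]+?)(?= \()}{<name>$1</name>};
    }
    push @tagged, $copy;
}

print "$_\n" for @tagged;
```

      Here the first record gets its leading "ABU BAKR, Ibrahim Ali Muhammad" wrapped in name tags, and the committee record passes through unchanged. The a.k.a. names inside the parentheses would need their own rule.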