Re^4: Fine tuning a reg exp

Replies are listed 'Best First'.

Re^5: Fine tuning a reg exp
by LonelyPilgrim (Beadle) on Feb 23, 2012 at 20:25 UTC

Are you only going to be concerned with the names of people marked "individual"? And does "[SDGT]", or something similar, mark the end of each record? If you're doing this line by line, it might be more effective to read in the whole file, re-break it at the end of records, and then cut out the line breaks, so you can deal with whole, unbroken records (and look ahead for markers like "individual").

Simply:

undef $/;    # Unset the input record separator
             # ($INPUT_RECORD_SEPARATOR if you 'use English').
             # It is usually "\n", which is what makes <FILE> read
             # one line at a time. You could even set it to "[SDGT]\n"
             # and read in a whole actual record at a time, if that
             # is going to be a consistent marker of the end of a record.

my $stream = <FILE>;   # Read the whole file into one scalar variable

my @records = split(m{\[[^\x5d]+\]\n}, $stream);
   # Split the stream into records, based on a regexp, if you can
   # figure out a regexp that consistently marks the end of a record.
   # This one should match any [.*] marker at the end of a record.

foreach (@records) {
    s{\n}{ }g;   # Replace line breaks with spaces
    # You could also do your matching and marking up in this loop
}

I'm also intrigued by the idea of using $INPUT_RECORD_SEPARATOR to mark the end of your records. But you can't set that to a regexp; its value is always treated as a literal string. So it would only work if that literal were consistently the end of the record. (And I suspect that's something like a source citation, correct?)
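To make the literal-separator idea concrete, here is a minimal sketch. It assumes every record really does end in "[SDGT]\n", and it reads from an in-memory filehandle over made-up sample text rather than your actual file, so treat the data and variable names as placeholders.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real file layout is an assumption.
my $data = <<'END';
ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI,
Abd al-Muhsin) (individual) [SDGT]
AFGHAN SUPPORT COMMITTEE (ASC) Grand Trunk Road,
Peshawar, Pakistan [SDGT]
END

# Open a read handle on the string so the sketch is self-contained.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = "[SDGT]\n";          # a literal string, not a regexp
    while (my $record = <$fh>) {
        chomp $record;              # chomp strips the current value of $/
        $record =~ s/\n/ /g;        # join the wrapped lines
        $record =~ s/\s+$//;        # drop the space left before the marker
        push @records, "$record [SDGT]";   # put the marker back
    }
}
close $fh;

print "$_\n" for @records;
```

Because this reads one record per iteration, it never holds the whole file in memory, which matters if the file is large.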

The cool thing about this approach is that if you can get at a whole record and remove the line breaks, you can have your script look for that "individual" marker (or other markers) at the end of records, to know more specifically what to do with them.
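As a rough illustration of checking for that marker, the sketch below sorts already-joined records into people and everything else. It assumes "(individual)" always sits directly before the closing bracketed marker; the second record is shortened from your sample, and the array names are mine.

```perl
use strict;
use warnings;

# Records as they would look after the line breaks are removed.
# The second one is abbreviated for the example.
my @records = (
    'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT]',
    'AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS) Grand Trunk Road, Peshawar, Pakistan [SDGT]',
);

my (@individuals, @others);
for my $record (@records) {
    # "(individual)" immediately before the final [MARKER] means a person.
    if ($record =~ m{\(individual\)\s*\[[^\]]+\]\s*\z}) {
        push @individuals, $record;
    }
    else {
        push @others, $record;
    }
}

printf "%d individual(s), %d other record(s)\n",
    scalar @individuals, scalar @others;
```

Anchoring the test to the end of the record keeps it from firing on the word "individual" appearing somewhere inside an address or alias.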

Also, slurping like this is probably not advisable if you're dealing with really huge files, since the whole file ends up in memory at once. But it would work if you could break them up.

Update: Eck, but using split to split apart records in that way would cause you to lose your "[SDGT]". You would have to split it another way, probably with regexps to match each whole record.

Something like this might work, but it would probably need tweaking, and I don't have enough sample of your file to test it very well:

my @records = ($stream =~ m/^(.*?\[[^\x5d]+\])\n/gms);

Running on your little snippet here, this gives me (line breaks and indenting mine, for viewability):

@records = (
'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT]',
'AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a.
     JAMIAT AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA;
     a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near Pushtoon Garhi
     Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar Sabaqah School, Jalalabad,
     Afghanistan [SDGT]'
);

Given something like this, you should be much better able to recognize names and place names, depending on the context and capitalization and other markers.
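For instance, once a record is on one line, something along these lines could pull out the primary name, the aliases, and the bracketed marker. This is only a sketch: parse_record is a helper name I made up, and the patterns assume the name comes before the first parenthesis and that each "a.k.a." entry ends at a ';' or ')', which may not hold across your whole file.

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper, not something from your script.
sub parse_record {
    my ($record) = @_;

    # Primary name: everything before the first opening parenthesis.
    my ($name) = $record =~ m{\A([^(]+?)\s*\(};

    # Aliases: each "a.k.a." entry runs up to the next ';' or ')'.
    my @aliases = $record =~ m{a\.k\.a\.\s+([^;)]+)}g;

    # Bracketed marker at the very end of the record, e.g. SDGT.
    my ($tag) = $record =~ m{\[([^\]]+)\]\s*\z};

    return { name => $name, aliases => \@aliases, tag => $tag };
}

my $parsed = parse_record(
    'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT]'
);

print "Name:  $parsed->{name}\n";
print "Alias: $_\n" for @{ $parsed->{aliases} };
print "Tag:   $parsed->{tag}\n";
```

Records with no parenthesis at all would leave the name undefined, so real code would want to check for that before printing.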
