Re^5: Fine tuning a reg exp
by LonelyPilgrim (Beadle) on Feb 23, 2012 at 20:25 UTC
Are you only going to be concerned with the names of people marked "individual"? And does "[SDGT]", or something similar, mark the end of each record? If you're doing this line by line, it might be more effective to read in the whole file, re-break it at the end of records, and then cut out the line breaks, so you can deal with whole, unbroken records (and look ahead for markers like "individual"). Simply:
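A minimal sketch of that idea, assuming every record ends with the literal "[SDGT]" followed by a newline. The sample data, the in-memory filehandle, and the variable names are made up for illustration and are not taken from your file:

```perl
use strict;
use warnings;

# Hypothetical sample data; the layout of the real file is an assumption.
my $data = join '',
    "DOE, John (a.k.a. DOE, Johnny), 1 Main Street,\n",
    "Springfield; DOB 1970 (individual) [SDGT]\n",
    "ACME TRADING CO., 2 High Street,\n",
    "Shelbyville [SDGT]\n";

# Read the whole "file" at once. An in-memory filehandle stands in for
# a real file so the sketch runs on its own.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $stream = do { local $/; <$fh> };    # undef $/ means slurp mode
close $fh;

# Re-break at the end-of-record marker. split consumes its delimiter,
# so the "[SDGT]" tag itself is dropped from each record.
my @records = split /\s*\[SDGT\]\n/, $stream;

# Cut out the remaining line breaks so each record is one unbroken string.
s/\s*\n\s*/ /g for @records;

print "$_\n" for @records;
```

Each element of @records is then a single line that can be inspected as a whole, for example for a trailing "(individual)".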
I'm also intrigued by the idea of using $INPUT_RECORD_SEPARATOR to mark the end of your records, but I don't think you can set that to a regexp. It would only work if that literal were consistently the end of the record. (And I suspect that's something like a source citation, correct?)
The cool thing about this approach is that if you can get at a whole record and remove the line breaks, you can have your script look for that "individual" marker (or other markers) at the end of records, to know more specifically what to do with them. Also, since it reads the whole file into memory, this method is probably not advisable if you're dealing with really huge files. But it would work if you could break them up.
Update: Eck, but using split to split apart records in that way would cause you to lose your "[SDGT]". You would have to split it another way, probably with regexps to match each whole record. Something like this might work, but it would probably need tweaking, and I don't have enough of a sample of your file to test it very well:
    my @records = ($stream =~ m/^(.*?\[[^\x5d]+\])\n/gms);
Running on your little snippet here, this gives me (line breaks and indenting mine, for viewability):
Given something like this, you should be much better able to recognize names and place names, depending on the context and capitalization and other markers.
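To compare the two ways of keeping the trailing tag, here is a rough sketch that runs both a literal $/ (the short name of $INPUT_RECORD_SEPARATOR) and the regexp from the update over the same text. The sample data is invented, and the assumption that "individual" appears as a whole word inside the record is mine:

```perl
use strict;
use warnings;

# Hypothetical sample data; the layout of the real file is an assumption.
my $data = join '',
    "DOE, John (a.k.a. DOE, Johnny), 1 Main Street,\n",
    "Springfield; DOB 1970 (individual) [SDGT]\n",
    "ACME TRADING CO., 2 High Street,\n",
    "Shelbyville [SDGT]\n";

# Approach 1: a literal $/. It is compared as a plain string, never as
# a regexp, so this only works if every record ends in exactly "[SDGT]\n".
my @by_separator;
{
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = "[SDGT]\n";
    while (my $record = <$fh>) {
        $record =~ s/\n\z//;    # drop only the final newline, keep the tag
        push @by_separator, $record;
    }
    close $fh;
}

# Approach 2: the regexp from the update, which accepts any bracketed tag.
my @by_regexp = ($data =~ m/^(.*?\[[^\x5d]+\])\n/gms);

# Join wrapped lines so each record is one unbroken string.
for my $record (@by_separator, @by_regexp) {
    $record =~ s/\s*\n\s*/ /g;
}

# Use the "individual" marker to decide how to treat each record.
for my $record (@by_regexp) {
    my $kind = $record =~ /\bindividual\b/ ? 'person' : 'other';
    print "$kind: $record\n";
}
```

On this sample both approaches yield the same two records. The regexp version is the more forgiving one if the tag in brackets varies from record to record.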