in reply to Fine tuning a reg exp

Before proceeding further with this, I think it should be noted that "SDGT" is an abbreviation for "Specially Designated Global Terrorist".

Al-Libi, Abd al-Muhsin or Ibrahim Ali Muhammad is on the "Most Wanted Terrorist" list.

Normally I would help anybody with anything related to Perl.

However, in this case I would like to hear more about who you are and why you are doing this, and why you do not have access to the more easily parseable databases.

I hope that you realize that parsing a terrorist list is a very touchy subject.

Update: If you are getting this info from a public URL, then post that URL.
Posting anything like this from a US government internal database, even just a short excerpt, is not appropriate here.

Re^2: Fine tuning a reg exp
by choroba (Cardinal) on Feb 23, 2012 at 22:05 UTC
      I think the OP should post the URL that he is working from.

      Working from the whole list will make it easier to understand and parse out what he needs. If the info is public, I have no problem with making it "easier to understand" via re-formatting. And I would help with that.

      I personally feel "very uncomfortable" if the full info is not available to the general public and the OP's info looks more specific than what I could find. Something like this has not come up before in my time on Monks.

      Your URL comes up with:

      ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin; a.k.a. SABRI, Abdel Ilah; a.k.a. TANTOUCHE, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Ali Abu Bakr; a.k.a. "'ABD AL-MUHSI"; a.k.a. "'ABD AL-RAHMAN"; a.k.a. "ABU ANAS"); DOB 1966; alt. DOB 27 Oct 1969; nationality Libya; Passport 203037 (Libya) (individual) [SDGT]
      Fine. Yes. I know this guy is on the terrorist list.
      But it doesn't show all of the info that the OP had although it shows additional information.

      I think my pointing out that this is a terrorist list was appropriate. Let's see what the OP has to say and go from there.

        These are public files: http://www.treasury.gov/resource-center/sanctions/SDN-List/Pages/archive.aspx. Yes, these are OFAC files, but they are the archive of changes, which is in an unstructured format, and I'm looking for a way to parse them out.
        I was trying to show the possible source. In no way was I suggesting it was wrong to point out what really was in the list. Helping a hacker might be cool; helping a terrorist, not really.
        Marshall, thanks for that nice code. How would I extend it so that to each record I can apply various tags to identify elements, so for each surname something like this:
        s/^(([A-Z]+\s[A-Z]+,|[A-Z]+-[A-Z]+,|[A-Z]+\s[A-Z]+\s[A-Z]+,)|([A-Z]+,) +)/\<surname\>$1\<\/surname\>/;
        Would I put this in the my @records bit as a map, or in the foreach (@records) bit? I would like to apply a range of markup tags to each record, but I'm just not sure how to tell Perl to do this. I did try putting the above regexp substitution as a map in the my @records bit, but it does not always work.
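For what it's worth, here is a minimal sketch of one way the map-versus-foreach question could be handled. Everything in it is an assumption made for illustration: the tag_record helper, the <dob> tag, and the sample records, which are only shaped like the entry quoted above. The surname pattern is a simplified stand-in for the regex in the question (it leaves the comma outside the tag and does not cover quoted aliases such as "'ABD AL-MUHSI"), and it is not taken from Marshall's code. The idea is to collect all the substitutions in one sub that works on a copy of the record and returns it, so that either map or foreach can call it.

```perl
use strict;
use warnings;

# Illustrative records shaped like the entry quoted above (not real input lines).
my @records = (
    'ABU BAKR, Ibrahim Ali Muhammad; DOB 1966; alt. DOB 27 Oct 1969; nationality Libya (individual) [SDGT]',
    'AL-LIBI, Abd al-Muhsin; DOB 1966 (individual) [SDGT]',
);

# Hypothetical helper: apply every tagging rule to one record and
# return the tagged copy.
sub tag_record {
    my ($rec) = @_;    # a copy, so the caller's string is left untouched

    # Leading surname: upper-case words joined by a single space or hyphen,
    # ending at the first comma (the comma stays outside the tag).
    $rec =~ s{^([A-Z]+(?:[ -][A-Z]+)*),}{<surname>$1</surname>,};

    # Every date of birth: a bare year or "DD Mon YYYY".
    $rec =~ s{\bDOB (\d{4}|\d{1,2} [A-Z][a-z]{2} \d{4})}{DOB <dob>$1</dob>}g;

    return $rec;
}

# map builds a new list from whatever the block returns.
my @tagged = map { tag_record($_) } @records;

# foreach aliases $_ to each element, so assigning to $_ edits the array in place.
my @in_place = @records;
$_ = tag_record($_) for @in_place;

print "$_\n" for @tagged;
```

A likely reason the earlier map attempt misbehaved: a block like map { s/.../.../ } @records returns the result of the substitution (a success count or an empty string) rather than the modified text, while also changing the original elements through the $_ alias. Returning the string explicitly, as tag_record does, avoids both surprises.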