I think the OP should post the URL that he is working from.

Working from the whole list will make it easier to understand and parse out what he needs. If the info is public, I have no problem with making it "easier to understand" via re-formatting. And I would help with that.

I personally feel "very uncomfortable" if the full info is not available to the general public and the OP's info looks more specific than what I could find. Something like this has not come up before in my time on Monks.

Your URL comes up with:

ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin; a.k.a. SABRI, Abdel Ilah; a.k.a. TANTOUCHE, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Abubaker; a.k.a. TANTOUSH, Ibrahim Ali Abu Bakr; a.k.a. "'ABD AL-MUHSI"; a.k.a. "'ABD AL-RAHMAN"; a.k.a. "ABU ANAS"); DOB 1966; alt. DOB 27 Oct 1969; nationality Libya; Passport 203037 (Libya) (individual) [SDGT]
Fine. Yes. I know this guy is on the terrorist list.
But it doesn't show all of the info that the OP had, although it shows additional information.

I think my pointing out that this is a terrorist list was appropriate. Let's see what the OP has to say and go from there.

Re^4: Fine tuning a reg exp
by markjrouse (Initiate) on Feb 23, 2012 at 23:30 UTC
    These are public files: http://www.treasury.gov/resource-center/sanctions/SDN-List/Pages/archive.aspx Yes, these are OFAC files, but they are the archive of changes, which is in an unstructured format, and I'm looking for a way to parse them out.
      I don't have a problem with making it easier for terrorists to know that we are looking for them - once we've said that publicly.

      Update: Often when parsing data, there is not a well defined specification. We have to make guesses or ad-hoc rules based upon what we see. That is just the "real world" and how it is. To the best of my knowledge, the below code runs with the complete downloaded file as well as with my DATA segment.

      The desired search term is at the beginning of the record, terminated by "," or "(" with trailing spaces removed.
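A minimal sketch of that rule, assuming the record has already been squeezed onto one line. The helper name search_term is hypothetical and not part of the posted code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first "," or "("
# in a record, with trailing whitespace removed.
sub search_term {
    my ($record) = @_;
    my ($term) = $record =~ /^([^,(]*)/;   # everything before the first , or (
    $term =~ s/\s+$//;                      # drop trailing spaces
    return $term;
}

for my $record (
    'ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin)',
    '17 NOVEMBER (a.k.a. EPANASTATIKI ORGANOSI 17 NOEMVRI) [FTO] [SDGT]',
) {
    print search_term($record), "\n";
}
```

For the two sample records this yields "ABU BAKR" and "17 NOVEMBER". Records with other layouts may need a different rule.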

      #!/usr/bin/perl -w
      use strict;

      # to process the file from:
      # http://www.treasury.gov/ofac/downloads/sdnlist.txt
      # this is about a 93K line file
      # that means that it easily fits into memory
      #
      # to get the valid "records"
      # (1) separate the records based upon them having
      #     an extra \n between them
      #     The records are "paragraphs".
      # (2) "squeeze" the lines together so that hyphenated
      #     names will get put "back together"
      #     This is needed so that simple searches will work.
      # (3) Apply heuristics to get rid of the extraneous
      #     records, here a "valid input record":
      #     (a) can't start with [ and must
      #     (b) have a comma or 'a.k.a' in the first 50 characters
      #     (c) get rid of leading ' if it is there
      #         cannot get rid of ' globally because there are
      #         records where this does have meaning.

      my @records = map  { s/^'//; $_ }          # another heuristic
                    grep { !/^\s*\[/ and         # heuristic (rule-of-thumb)
                           substr($_,0,50) =~ /,|\Qa.k.a.\E/ }
                    map  { s/\n//g; $_ }         # squeeze lines back together
                    do   { local $/ = "\n\n"; (<DATA>) };

      # at this point, there are <12K records
      # from the 93K lines that we started with

      foreach (@records)
      {
         # your regex to select a record could maybe go here..
         # also possible to make a translation table
         # of any name back to one of these records
         print "$_\n";
      }
      __DATA__
      Output from: http://www.treasury.gov/ofac/downloads/sdnlist.txt goes here...

      ALPHABETICAL LISTING OF SPECIALLY DESIGNATED NATIONALS AND BLOCKED PERSONS ("SDN List"):

      This publication of Treasury's Office of Foreign Assets Control ("OFAC") is designed as a reference tool providing actual notice of actions by OFAC with respect to Specially Designated Nationals and ...blah...

      17 NOVEMBER (a.k.a. EPANASTATIKI ORGANOSI 17 NOEMVRI; a.k.a. REVOLUTIONARY ORGANIZATION 17 NOVEMBER) [FTO] [SDGT]

      32 COUNTY SOVEREIGNTY COMMITTEE (a.k.a. 32 COUNTY SOVEREIGNTY MOVEMENT; a.k.a. IRISH REPUBLICAN PRISONERS WELFARE ASSOCIATION; a.k.a. REAL IRA; a.k.a. REAL IRISH REPUBLICAN ARMY; a.k.a. REAL OGLAIGH NA HEIREANN; a.k.a. RIRA) [FTO] [SDGT]

      32 COUNTY SOVEREIGNTY MOVEMENT (a.k.a. 32 COUNTY SOVEREIGNTY COMMITTEE; a.k.a. IRISH REPUBLICAN PRISONERS WELFARE ASSOCIATION; a.k.a. REAL IRA; a.k.a. REAL IRISH REPUBLICAN ARMY; a.k.a. REAL OGLAIGH NA HEIREANN; a.k.a. RIRA) [FTO] [SDGT]

      101 DAYS CAMPAIGN (a.k.a. CHARITY COALITION; a.k.a. COALITION OF GOOD; a.k.a. ETELAF AL-KHAIR; a.k.a. ETILAFU EL-KHAIR; a.k.a. I'TILAF AL-KHAIR; a.k.a. I'TILAF AL-KHAYR; a.k.a. UNION OF GOOD), P.O. Box 136301, Jeddah 21313, Saudi Arabia [SDGT]

      ..etc..
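The "translation table" idea mentioned in the loop comment could look roughly like the sketch below. It assumes that aliases always follow "a.k.a." and end at ";" or ")", which is a guess from the sample records rather than a documented format. The names names_in and %record_for are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list the primary name plus every a.k.a. alias.
sub names_in {
    my ($record) = @_;
    my @names;
    my ($primary) = $record =~ /^([^,(]*)/;   # text before first , or (
    $primary =~ s/\s+$//;
    push @names, $primary if length $primary;
    # assumption: each alias runs from "a.k.a." up to the next ";" or ")"
    while ($record =~ /a\.k\.a\.\s*([^;)]+)/g) {
        my $alias = $1;
        $alias =~ s/^["\s]+|["\s]+$//g;       # trim quotes and spaces
        push @names, $alias;
    }
    return @names;
}

my @records = (
    '17 NOVEMBER (a.k.a. EPANASTATIKI ORGANOSI 17 NOEMVRI; a.k.a. REVOLUTIONARY ORGANIZATION 17 NOVEMBER) [FTO] [SDGT]',
    '32 COUNTY SOVEREIGNTY COMMITTEE (a.k.a. 32 COUNTY SOVEREIGNTY MOVEMENT; a.k.a. REAL IRA; a.k.a. RIRA) [FTO] [SDGT]',
);

# name => first record that mentions it (later duplicates are ignored)
my %record_for;
for my $rec (@records) {
    $record_for{$_} //= $rec for names_in($rec);
}

print "$_\n" for sort keys %record_for;
```

Because several records list each other as aliases, the first record seen wins here; storing an array of records per name would be the alternative if every match is wanted.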
Re^4: Fine tuning a reg exp
by choroba (Cardinal) on Feb 24, 2012 at 00:08 UTC
    I was trying to show the possible source, in no way was I suggesting it was wrong to point out what really was in the list. Helping a hacker might be cool, helping a terrorist—not really.
      I posted some code to show how to parse the data from the URL that you found. If these guys know that they are "hunted" so much the better. I hope that they are very afraid.

      Know that you are a "hunted man", cut off your ability to transmit to the outside world and hunker down. I'm fine with that outcome.

      Update: This is not about political beliefs or which political party I most favor in my particular country. If you are an international criminal, I don't want to help you. I will help with better understanding of public data - it is after all "public".

Re^4: Fine tuning a reg exp
by markjrouse (Initiate) on Feb 26, 2012 at 13:13 UTC
    Marshall, thanks for that nice code. How would I extend it so that to each record I can apply various tags to identify elements, so for each surname something like this:
    s/^(([A-Z]+\s[A-Z]+,|[A-Z]+-[A-Z]+,|[A-Z]+\s[A-Z]+\s[A-Z]+,)|([A-Z]+,) +)/\<surname\>$1\<\/surname\>/;
    Would I put this in the my @records bit as a map, or in the foreach (@records) bit? I would like to apply to each record a range of markup tags, but I'm just not sure how to tell Perl to do this. I did try putting the above regexp sub as a map in the my @records bit, but it does not always work.
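    One possible way to do this, as a sketch only: keep the tagging rules in a list and apply them to each record inside the foreach loop. The rule table, the tag_record name and the simplified patterns below are assumptions for illustration, not the regex from the question. One thing that may explain the map problem: s/// returns the number of substitutions (or an empty string on failure), not the modified string, so a map block needs to end with `; $_` as in the posted code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rule table: [ tag name, pattern with exactly one capture ].
my @rules = (
    [ surname     => qr/^([A-Z]+(?:[ -][A-Z]+)*)(?=,)/ ],  # capitals before first comma
    [ nationality => qr/\bnationality ([^;]+)/ ],
);

# Wrap the captured part of each matching rule in <tag>...</tag>.
sub tag_record {
    my ($record) = @_;
    for my $rule (@rules) {
        my ($tag, $re) = @$rule;
        next unless $record =~ $re;
        # @- and @+ hold the start and end offsets of capture group 1
        my ($start, $len) = ($-[1], $+[1] - $-[1]);
        substr($record, $start, $len, "<$tag>$1</$tag>");
    }
    return $record;
}

my @records = (
    'ABU BAKR, Ibrahim Ali Muhammad; DOB 1966; nationality Libya; Passport 203037 (Libya) (individual) [SDGT]',
    '17 NOVEMBER (a.k.a. REVOLUTIONARY ORGANIZATION 17 NOVEMBER) [FTO] [SDGT]',
);

foreach (@records) {
    print tag_record($_), "\n";
}
```

    Records that match no rule pass through unchanged, so new tags can be added to the table one at a time and checked against the full file.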