in reply to Re^2: Reg Exp to handle variations in the matched pattern
in thread Reg Exp to handle variations in the matched pattern

a space, followed by a dash, followed by a carriage return OR a colon, followed by a carriage return

So far that's simple: / -[:\r]\r/

BUT NOT a colon, followed by carriage return

If you're looking for two carriage returns in a row, then you'll never find something where the first carriage return is followed by a colon (because then it's not two carriage returns in a row, d'oh), so I don't see why you emphasize it like that.

followed by carriage return, followed by a digit, or a letter.
\r\w
One of the text files is actually located here: http://www.treasury.gov/resource-center/sanctions/SDN-List/Documents/sdnew02.txt

The pattern you describe matches nowhere in that file; in fact, I can't find a single occurrence of a carriage return in that file.
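
A quick way to check this yourself is to count the CR bytes in the raw data. The following is only a minimal sketch: the helper names count_carriage_returns and slurp_raw, and the two sample strings, are made up for illustration and are not taken from the file.

```perl
use strict;
use warnings;

# Count CR (0x0D) characters in a string of raw file contents.
sub count_carriage_returns {
    my ($data) = @_;
    my $count = () = $data =~ /\r/g;   # list assignment in scalar context yields the match count
    return $count;
}

# Read a whole file without any line-ending translation (hypothetical helper).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                          # slurp mode: read everything at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Made-up samples standing in for Unix-style and DOS-style copies of the text.
my $unix_text = "2002:\n\n01/09/02: ... -\n\n";
my $dos_text  = "2002:\r\n\r\n01/09/02: ... -\r\n\r\n";

printf "unix: %d, dos: %d\n",
    count_carriage_returns($unix_text),
    count_carriage_returns($dos_text);   # prints "unix: 0, dos: 4"
```

For a real file, something like count_carriage_returns(slurp_raw('sdnew02.txt')) should tell you whether your copy contains any carriage returns at all.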

If you describe what information you want to extract from that file, we might be able to help you. But right now it seems that you don't have a clear mental image yourself, so it's pretty hard to help you.

Re^4: Reg Exp to handle variations in the matched pattern
by markjrouse (Initiate) on Feb 22, 2012 at 16:32 UTC

    Ultimately, I'm looking to ascertain how Perl could parse this file to turn it into a structured format. My first thought is to use Perl to update the file, breaking out the elements that would each constitute a line, and then import these lines into a database to extract the various fields, unless Perl is able to do this better and more efficiently. I'm still learning Perl, so I'm sure there is a better way of doing it in Perl.

    If you look at that file, from line 27:

    Licensing at 202/622-2480. The following changes have occurred with respect to the Office of Foreign Assets Control Listing of Specially Designated Nationals and Blocked Persons since January 1, 2002: 01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -

    There are two distinct patterns that I'm trying to match here, hence my original regexp (\s-\r)|(:\r). After the "January 1, 2002:" text comes a carriage return plus line feed, twice (hex values 0D 0A 0D 0A). I'm looking to insert a string between the ":" and the carriage return, so the first pattern is /(:)\r\n\r\n/. Therefore, my substitution code is this:

    s/(:)\r\n\r\n/\1\$\$\n/g

    but of course this insertion is not working.

    It may be my hex/text editor, but it tells me there are lots of carriage returns in this data.

    The second pattern is after the "01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -" text, where the dash at the end is preceded by a space and followed by a carriage return plus line feed, twice, so my match regexp is /(\s-)\r\n\r\n/. Therefore, my substitution code is this:

    s/(\s-)\r\n\r\n/\1\$\$\n/g

    but of course this insertion is not working either.

    The subsequent result would be:

    Licensing at 202/622-2480. The following changes have occurred with respect to the Office of Foreign Assets Control Listing of Specially Designated Nationals and Blocked Persons since January 1, 2002:$$ 01/09/02: The following have been named as "Specially Designated Global Terrorists" [SDGTs] -$$

    Sorry that this isn't clearer. It's a bit difficult to explain.

      I think this does what you want, except that I've put in "<stuff>" where you had "$$". When I use Perl to tag text, I tend to put in HTML- or XML-like tags and then use an XML or HTML parser to extract a data structure to stick into a database or whatever.

      open(MYINPUTFILE, '<', 'sdnew02.txt') or die "Cannot open sdnew02.txt: $!";
      while (<MYINPUTFILE>) {
          s/(\s-)$/$1<stuff>/;   # tag a line ending in " -"
          s/(:)$/$1<stuff>/;     # tag a line ending in ":"
          print;
      }
      close(MYINPUTFILE);

      You seem to have gotten hung up on worrying about the returns or newlines, when you should have recognized that you needed the end of line anchor. If you want to make the replacement more robust you could put in some matches to arbitrary amounts of whitespace before and after the "-" or ":", but before the $ anchor.
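
      The more robust variant mentioned above might look roughly like this. It is a sketch under assumptions: the function name tag_line and the sample lines are invented for illustration, and it also swallows an optional carriage return in case a copy of the file has DOS line endings.

```perl
use strict;
use warnings;

# Append $marker after a trailing " -" or ":" at the end of a line.
# Trailing spaces/tabs and an optional carriage return before the newline
# are tolerated and removed by the substitution.
sub tag_line {
    my ($line, $marker) = @_;
    $line =~ s/(\s-|:)[ \t]*\r?$/$1$marker/;
    return $line;
}

# Made-up sample lines modelled on the excerpt quoted in this thread.
my @lines = (
    "Licensing at 202/622-2480.\n",
    "since January 1, 2002: \r\n",
    "\"Specially Designated Global Terrorists\" [SDGTs] -\n",
);

print tag_line($_, '<stuff>') for @lines;
```

      The first sample line passes through unchanged, because neither alternative sits at the end of the line; the colon in the middle of a line such as "01/09/02: The following ..." is likewise left alone, since the $ anchor cannot match there.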

      From what you describe, Perl would probably do all the text munging you need. Databases are great for randomly accessing data based on whatever relationships you want to select on, but Perl is hard to beat for dismantling text. Most of what I use Perl for is taking apart text and sticking it into databases for other purposes. Friedl's book "Mastering Regular Expressions" is still a great place to start. There are probably free tutorials floating around the web, but MRE gives clear explanations and gets you up to speed fast.
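
      As one illustration of that kind of munging, here is a rough sketch that uses Perl's paragraph mode to turn blank-line-separated blocks into one record per block. It assumes that entries in the file are separated by blank lines, which may not hold for every section; the function name paragraphs_to_records and the sample text (including the placeholder entry) are made up for the example.

```perl
use strict;
use warnings;

# Split text into blank-line-separated paragraphs and flatten each one
# into a single line with runs of whitespace collapsed to one space.
sub paragraphs_to_records {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;                  # normalize DOS line endings, if any
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "";                         # paragraph mode: records end at blank lines
    my @records;
    while (my $para = <$fh>) {
        $para =~ s/\s+/ /g;                # collapse newlines and spaces
        $para =~ s/^ | $//g;               # trim leading/trailing space
        push @records, $para if length $para;
    }
    close $fh;
    return @records;
}

# Hypothetical sample; the last block is a placeholder, not real data.
my $sample = "since January 1, 2002:\n\n"
           . "01/09/02: The following have been\nnamed as \"SDGTs\" -\n\n"
           . "PLACEHOLDER ENTRY, some\naddress line.\n";

print join("|", paragraphs_to_records($sample)), "\n";
```

      Each record could then be written out as one line of a delimited file or handed to a database loader; which delimiter is safe depends on what characters actually occur in the data.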

        Thanks for this. This is a great help. Do you happen to have an example of code that you would use to tag a text file? I like the idea of tagging with HTML/XML-style tags, but I don't have time to build something, so maybe I'll use Perl to convert this text file to a delimited file and use a db to extract the text.
Re^4: Reg Exp to handle variations in the matched pattern
by markjrouse (Initiate) on Feb 22, 2012 at 17:02 UTC
    Hi Moritz, yes, you're right. I've just re-downloaded the file and there are no carriage returns. I'll try this again.