Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

We're migrating some legacy flat files to databases. The structure of the flat files is like that in the __DATA__ block shown below. Tags are short uppercase strings followed by an equals sign and whitespace. Data fields are mostly unstructured text (we have verified that uppercase TAGn strings never occur in the data). I want to capture both the tags and the data to build hashes.

Ideally, I want to use method 1 and create the regex on the fly from tags read from other files. As you can see, my attempt fails miserably (does not split where I want to split).

So I tried method 2 and made the regex manually. That one almost works, except that I end up with an extra (empty) value in $stuff2[0], which makes a subsequent %hash = @stuff2 break.

OK, the input is not that ugly, but I thought the solution would be easier. What am I missing?

use 5.010;
use strict;
use warnings;

$/ = undef;
my $data = <DATA>;

# method 1: build regex on the fly (read tags from files)
say '---------- method 1 ----------';
my @tags = qw( TAG1 TAG2 TAG3 TAG4 );
my $tags_re = join "|", @tags;
$tags_re = qr{ $tags_re };
say $tags_re;
my @stuff = split /($tags_re)=\s*/, $data;
say "#$_#" for @stuff;

# method 2: static regex
say '---------- method 2 ----------';
my @stuff2 = split /(TAG1|TAG2|TAG2|TAG4)=\s*/, $data;
say "#$_#" for @stuff2;

__DATA__
TAG1= data
TAG2= more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
TAG4= still more

OUTPUT:
---------- method 1 ----------
(?-xism: TAG1|TAG2|TAG3|TAG4 )
#TAG1= data
#
#TAG2#
#more data
#
#TAG3#
#even more data that sometimes has = and
runs on to more
than one line
TAG4= still more
#
---------- method 2 ----------
##
#TAG1#
#data
#
#TAG2#
#more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
#
#TAG4#
#still more
#

Replies are listed 'Best First'.
Re: problems splitting ugly input data
by BrowserUk (Patriarch) on Dec 23, 2010 at 00:06 UTC
    That one almost works, except that I end up with an extra (empty) value in $stuff2[0],

    That's because using split there will always be an implied empty field preceding the first tag. You could just shift it off the array before building your hash.
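
    A minimal sketch of that split-and-shift approach, assuming the sample data from the question (held in a heredoc here rather than __DATA__). It also builds the tag pattern on the fly, as method 1 attempted. Two things stand out in the posted code: without /x the spaces inside qr{ $tags_re } are literal, so that pattern only matches " TAG1" and "TAG4 " with the spaces attached, and method 2 lists TAG2 twice where TAG3 was presumably intended. The name @fields is illustrative.

```perl
use strict;
use warnings;

my $data = <<'END';
TAG1= data
TAG2= more data
TAG3= even more data that sometimes has = and
runs on to more
than one line
TAG4= still more
END

# Build the alternation without stray spaces; quotemeta guards against
# regex metacharacters in tag names.
my @tags    = qw( TAG1 TAG2 TAG3 TAG4 );
my $tags_re = join '|', map { quotemeta } @tags;
$tags_re    = qr{$tags_re};

# The capture group makes split return the tags as well as the data.
my @fields = split /($tags_re)=\s*/, $data;

# The input starts with a delimiter, so the first field is empty.
shift @fields if @fields && $fields[0] eq '';

my %hash = @fields;
print "$_ => $hash{$_}" for sort keys %hash;
```

    The values keep their trailing newlines, as in the posted output; chomp them afterwards if that matters.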

    Personally, I think I'd use m/// for this:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    $Data::Dump::WIDTH = 50;

    my %hash = do{ local $/; <DATA> } =~ m[(TAG\d=)\s+(.+?)(?=TAG|\Z)]gsm;

    pp \%hash;

    __DATA__
    TAG1= data
    TAG2= more data
    TAG3= even more data that sometimes has = and
    runs on to more
    than one line
    TAG4= still more

    Produces:

    c:\test>junk15
    {
      "TAG1=" => "data\n",
      "TAG2=" => "more data\n",
      "TAG3=" => "even more data that sometimes has = and\nruns on to more\nthan one line\n",
      "TAG4=" => "still more",
    }
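
    The one-liner works because m//g in list context returns every capture from every match as one flat list, so two capture groups per match give alternating keys and values. A small sketch of just that mechanism, using a made-up two-tag string (the tags A and B and the name @pairs are illustrative only):

```perl
use strict;
use warnings;

my $text = "A= one\nB= two\n";

# In list context, m//g returns all captures from all matches
# as a single flat list: (key, value, key, value, ...).
my @pairs = $text =~ m[([A-Z])=\s+(.+?)(?=[A-Z]=|\z)]gs;

# Assigning that list to a hash pairs the items up.
my %hash = @pairs;

print scalar(@pairs), " items, ", scalar(keys %hash), " keys\n";
```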

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Nice. Thank you.
      That's because using split there will always be an implied empty field preceding the first tag.
      Yes, I have re-learned that many times.

      Would it be OK if I up the ante a bit? The tags are really not so well structured. They are things like HOSTNAME, CONTACT, .... And my mojo ain't workin' quite well enough to see how to adapt your solution to 10-15 tags of that ilk (at least I don't see how to do it in a nice, tidy way).

        Would it be OK if I up the ante a bit? The tags are really not so well structured. They are things like HOSTNAME, CONTACT, ....

        Sure:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];
        $Data::Dump::WIDTH = 50;

        my $reTags = join '|', map quotemeta, qw[ HOSTNAME CONTACT TAG1 TAG2 TAG3 TAG4 ];
        $reTags = qr[$reTags];

        my %hash = do{ local $/; <DATA> } =~ m[($reTags)=\s+(.+?)(?=$reTags|\Z)]gsm;

        pp \%hash;

        __DATA__
        TAG1= data
        TAG2= more data
        HOSTNAME= fred
        TAG3= even more data that sometimes has = and
        runs on to more
        than one line
        CONTACT= Wiley Coyote
        Hiesenberg Road
        The Desert
        TAG4= still more

        Produces:

        c:\test>junk15
        {
          CONTACT => "Wiley Coyote\nHiesenberg Road\nThe Desert\n",
          HOSTNAME => "fred\n",
          TAG1 => "data\n",
          TAG2 => "more data\n",
          TAG3 => "even more data that sometimes has = and\nruns on to more\nthan one line\n",
          TAG4 => "still more",
        }
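
        If the trailing newlines are unwanted in the database, one possible variation is to trim them inside the match. The sketch below is not part of the solution above. It assumes that every tag starts a line (true of the sample data, but not stated in the question) and uses a shortened copy of that data in a heredoc.

```perl
use strict;
use warnings;

my $data = <<'END';
TAG1= data
HOSTNAME= fred
CONTACT= Wiley Coyote
Hiesenberg Road
The Desert
TAG4= still more
END

my $reTags = join '|', map { quotemeta } qw( HOSTNAME CONTACT TAG1 TAG4 );
$reTags = qr{$reTags};

# ASSUMPTION: every tag starts a line, so ^ (with /m) can anchor it.
# Requiring "=" in the lookahead as well means only a full "TAG=" at the
# start of a line ends a field. The \s* before the lookahead drops
# trailing whitespace from each value.
my %hash = $data =~ m{^($reTags)=\s*(.*?)\s*(?=^(?:$reTags)=|\z)}gsm;

print "$_ => [$hash{$_}]\n" for sort keys %hash;
```

        Interior newlines in multi-line values such as CONTACT are kept; only whitespace at the end of each value is dropped.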
