gbwien has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am getting back into Perl programming and I would appreciate your help with the following issue. I have a file which contains many lines of text such as:

THREAD_ID:1bf1d698 CDR_TYPE:AO     SUB_TIME:240815144127   DEL_TIME:240815144127   OA_ADDR:5.0.OTSDC   PRE_TRANS_OA:5.0.OTSDC  DA_ADDR:1.1.966555696176   PRE_TRANS_DA:1.1.966555696176   ORIG_LOCN:10.100.80.7/7220      ORIG_IDNT:OTS A2P       DEST_LOCN:173.209.195.44/8341   DEST_IDNT:Syniverse A2P I_ERR:0.0       PPS_ID: PPS_PROFILE:AO Submission - OA charged  PPS_ERR:1.0     O_ERR:0.0       SILO:   MSG_LEN:22      SEG_NUM:1 of 1  DLV_ATT:0      END_POINT:ESME   FINAL_STATE:DELIVERED   REG_DEL:1

I would like to post-process these lines of text so that I have an array of elements which I can later write to a result file. The post-processed elements should look like this:
THREAD_ID:1bf1d698 CDR_TYPE:AO SUB_TIME:240815144127 DEL_TIME:240815144127 OA_ADDR:5.0.OTSDC PRE_TRANS_OA:5.0.OTSDC DA_ADDR:1.1.966555696176 PRE_TRANS_DA:1.1.966555696176 ORIG_LOCN:10.100.80.7/7220 ORIG_IDNT:OTS A2P DEST_LOCN:173.209.195.44/8341 DEST_IDNT:Syniverse A2P I_ERR:0.0 PPS_ID: PPS_PROFILE:AO Submission - OA charged PPS_ERR:1.0 O_ERR:0.0 SILO: MSG_LEN:22 SEG_NUM:1 of 1 DLV_ATT:0 END_POINT:ESME FINAL_STATE:DELIVERED REG_DEL:1
Thanks for your help

Re: Parsing file in Perl post processing
by NetWallah (Canon) on Sep 08, 2015 at 22:38 UTC
    A Perl programmer would prefer a HASH to represent the parsed data, rather than an array. Assuming that this is what you want, here is a one-liner to get you started:
    >perl -E "my %h = map {split ':',$_} split /\s+/, $ARGV[0]; say qq|$_\t$h{$_}\n| for sort keys %h" "THREAD_ID:1bf1d698 CDR_TYPE:AO SUB_TIME:240815144127 DEL_TIME:240815144127 OA_ADDR:5.0.OTSDC PRE_TRANS_OA:5.0.OTSDC DA_ADDR:1.1.966555696176 PRE_TRANS_DA:1.1.966555696176"
    The data does have some inconsistencies that are not handled by the regex; this is just to get you started. You will need to develop the regex to handle the pathological data.

    Update: If you run into trouble parsing the pathological data, please post the code you tried here, and explain your problems.
    Monks here will gladly explain and help correct code, provided you display some effort.

            Software efficiency halves every 18 months, thus compensating for Moore's Law.

Re: Parsing file in Perl post processing
by Laurent_R (Canon) on Sep 09, 2015 at 08:11 UTC
    The immediate idea would be to split the data on spaces, but that does not work entirely because your last two "fields" have embedded spaces:
    SEG_NUM:1 of 1 DLV_ATT:0 END_POINT:ESME FINAL_STATE:DELIVERED REG_DEL:1
    So I would probably try to first process these two last fields with a regular expression, something like:
    @endfields = /(SEG_NUM.+?)\s+(END_POINT.+?)/;
    Then remove them from the string, use something like split /\s+/ on the rest of the string, and finally reassemble the array in the proper order.

    Update: I did not originally notice, but it appears that at least two other fields have embedded spaces:

    DEST_IDNT:Syniverse A2P I_ERR:0.0 PPS_ID: PPS_PROFILE:AO Submission - OA charged
    So splitting on spaces becomes harder to use, at least for about the last half of the original string. Although I don't like the idea too much, perhaps a long regex naming each field key is the only solution, at least for the last eight fields or so.
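
    A possible middle ground, sketched here as an illustration rather than as part of the original reply: split only on whitespace that is immediately followed by something that looks like a key. This assumes every key is a run of capital letters and underscores ending in a colon, and that no value contains such a token. The sample record is shortened from the one in the question.

```perl
use strict;
use warnings;

# Shortened sample record; field names and values come from the question.
my $line = "ORIG_IDNT:OTS A2P\tDEST_IDNT:Syniverse A2P I_ERR:0.0\tPPS_ID: PPS_PROFILE:AO Submission - OA charged\tSILO:\tSEG_NUM:1 of 1  REG_DEL:1";

# Split on whitespace only when a KEY: follows. The lookahead consumes
# nothing, so each key stays attached to its own value, and spaces inside
# values such as "OTS A2P" or "1 of 1" are left alone.
my @fields = split /\s+(?=[A-Z_]+:)/, $line;

print "$_\n" for @fields;
```

    Each element of @fields is then one "KEY:value" string, with empty values (PPS_ID:, SILO:) kept as their own elements.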
Re: Parsing file in Perl post processing
by GotToBTru (Prior) on Sep 09, 2015 at 13:31 UTC

    I am thinking there are no embedded spaces in the field names, judging by the otherwise consistent use of underscores. But lines 12 and 13 of the example output do confuse things. Can you clarify?

    This almost works (can't get value of last key):

    use strict;
    use warnings;
    use Data::Dumper;

    my $string = 'THREAD_ID:1bf1d698 CDR_TYPE:AO SUB_TIME:240815144127 DEL_TIME:240815144127 OA_ADDR:5.0.OTSDC PRE_TRANS_OA:5.0.OTSDC DA_ADDR:1.1.966555696176 PRE_TRANS_DA:1.1.966555696176 ORIG_LOCN:10.100.80.7/7220 ORIG_IDNT:OTS A2P DEST_LOCN:173.209.195.44/8341 DEST_IDNT:Syniverse A2P I_ERR:0.0 PPS_ID: PPS_PROFILE:AO Submission - OA charged PPS_ERR:1.0 O_ERR:0.0 SILO: MSG_LEN:22 SEG_NUM:1 of 1 DLV_ATT:0 END_POINT:ESME FINAL_STATE:DELIVERED REG_DEL:1';

    # Collect every key: a run of capitals and underscores followed by a colon.
    my (@keys) = ($string =~ m/([A-Z_]+):/g);

    # A value ends where the next key starts, or at the end of the string.
    my $z = qr{(?:[A-Z_]+:|$)};

    # For each key, capture the text between "KEY:" and the next key.
    my %hash = map { $_, ($string =~ m/$_:(.+?)\s*$z/)} @keys;

    print Dumper(\%hash);

    Output:

    $VAR1 = {
              'PPS_ID' => ' ',
              'THREAD_ID' => '1bf1d698 ',
              'DEST_IDNT' => 'Syniverse A2P ',
              'CDR_TYPE' => 'AO ',
              'ORIG_LOCN' => '10.100.80.7/7220 ',
              'REG_DEL' => '1',
              'SILO' => ' ',
              'DEST_LOCN' => '173.209.195.44/8341 ',
              'O_ERR' => '0.0 ',
              'OA_ADDR' => '5.0.OTSDC ',
              'PRE_TRANS_DA' => '1.1.966555696176 ',
              'PPS_PROFILE' => 'AO Submission - OA charged ',
              'I_ERR' => '0.0 ',
              'DLV_ATT' => '0 ',
              'ORIG_IDNT' => 'OTS A2P ',
              'DA_ADDR' => '1.1.966555696176 ',
              'MSG_LEN' => '22 ',
              'FINAL_STATE' => 'DELIVERED ',
              'SEG_NUM' => '1 of 1 ',
              'SUB_TIME' => '240815144127 ',
              'DEL_TIME' => '240815144127 ',
              'PPS_ERR' => '1.0 ',
              'END_POINT' => 'ESME ',
              'PRE_TRANS_OA' => '5.0.OTSDC '
            };

    Update: with help of MidLifeXis, corrected regex in map to work even for last key:value pair in list. Update 2: changed final \s in map to \s*. Thanks to NetWallah and poj.
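
    For comparison, the same idea can be written as a single global match that returns key/value pairs directly. This is only a sketch of a variation, not the code posted above: it swaps (.+?) for (.*?) so that empty values come back as empty strings, and it uses a lookahead so the next key is not consumed. The test string is a shortened, reordered excerpt of the sample data.

```perl
use strict;
use warnings;

# Shortened excerpt of the sample record, including an empty value (SILO:).
my $string = 'MSG_LEN:22 SILO: PPS_PROFILE:AO Submission - OA charged   SEG_NUM:1 of 1  REG_DEL:1';

# Each match captures a key and the shortest text that is followed by
# optional whitespace and then either the next KEY: or the end of the string.
# In list context with /g, the captures come back as a flat key, value list.
my %hash = $string =~ /([A-Z_]+):(.*?)\s*(?=[A-Z_]+:|$)/g;

printf "%-12s => '%s'\n", $_, $hash{$_} for sort keys %hash;
```

    Because the whitespace is matched outside the capture group, the values carry no trailing spaces.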

    Dum Spiro Spero
      Great solution (++).

      Adding a '+' to the regex eliminates trailing spaces in the values:

      my %hash = map { $_, ($string =~ m/$_:(.+?)\s+$z/)} @keys;

              Software efficiency halves every 18 months, thus compensating for Moore's Law.

      Sorry, with my limited exposure to Perl I am trying to understand what you are doing in these lines of code:

      my $z = qr{(?:[A-Z_]+:|$)};
      my %hash = map { $_, ($string =~ m/$_:(.+?)\s*$z/)} @keys;

      How does qr work? Could you please explain what you are doing?

      How is the hash created? I don't understand map, $_, and the $string part.

      Thanks Tom

        By using qr{} I am telling Perl the string inside will be used in a regex. See http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators. The most common reason to do this is to save time when you are using the same regex over and over; I use it because it often seems to make sure Perl interprets the regex in the way I expect.
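
        A small illustration of qr{}, written for this explanation rather than taken from the solution (the variable names $key, $has_key and $value are made up here): the compiled pattern is an ordinary value that can be matched directly or interpolated into a bigger regex.

```perl
use strict;
use warnings;

# qr{} compiles a pattern once and returns it as a value.
my $key = qr{[A-Z_]+:};

# The compiled pattern can be matched on its own ...
my $has_key = 'SILO: MSG_LEN:22' =~ $key ? 1 : 0;

# ... or interpolated into a larger regex, the way $z is used in the solution.
my ($value) = 'MSG_LEN:22 SEG_NUM:1 of 1' =~ /MSG_LEN:(.+?)\s*(?:$key|$)/;

print "has_key=$has_key value=$value\n";
```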

        map {block} @array returns a list created by executing {block} once for each element of @array, each time assigning one value to $_.

        @a=(1,2,3); @a_plus_1 = map { $_ + 1 } @a;

        The block, which in this example is $_ + 1, will be executed 3 times, once for each value in @a. The results are also collected in an array, so the values in @a_plus_1 will be (2,3,4).

        In my solution, I have the map return two values separated by a comma. This is one way to define a hash. You can see this using the debugger:

        perl -d -e 1

        Loading DB routines from perl5db.pl version 1.28
        Editor support available.

        Enter h or `h h' for help, or `man perldebug' for more help.

        main::(-e:1):   1
          DB<1> %h=(1,2,3,4)

          DB<2> print $h{1}
        2
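
        The same thing as a short script, as a sketch with made-up sample data (@pairs is hypothetical, with values borrowed from the question): a map block that returns two values per element produces a flat key, value, key, value list, which Perl pairs up when it is assigned to a hash.

```perl
use strict;
use warnings;

my @pairs = ('MSG_LEN:22', 'DLV_ATT:0', 'END_POINT:ESME');

# The block returns two values per element (key, value); assigning the
# flattened list to a hash pairs them up.
my %h = map { split /:/, $_, 2 } @pairs;

print "$_ => $h{$_}\n" for sort keys %h;
```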
        Dum Spiro Spero
Re: Parsing file in Perl post processing
by u65 (Chaplain) on Sep 09, 2015 at 10:48 UTC

    Regarding the data format, it looks to me to be a string of keys (\w+\:) followed by their values consisting of all characters up to the next key, and the value may be empty. How any subkeys are to be assembled with their parent keys in the output is another matter that needs defining.

    UPDATE: Unless we are told otherwise, I see we have been given the subkey arrangement in the example output.
