monty77 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a file to disassemble and I'm thinking Perl might be the way forward.

Sample data:

0004$ADAM0002*330004%19770004$BOB 0002*430004%1967

There are 3 data types in here that I need to split into separate files. Each data type is prefixed by the record length as the length may vary (but doesn't in this example)

So ... 1st record type '$' is 0004 length so it's value is 'ADAM', 2nd record type is '*' which is 0002 length so it's value is '33' and so on.

The record types may not appear in this order also.

End desired result is 3 files looking like:

dollar.txt ADAM BOB asterisk.txt 33 43 percent.txt 1977 1967

I'm a complete novice when it comes to Perl so some code to get me started or links to similar examples would be massively appreciated!

Cheers, Monty

Replies are listed 'Best First'.
Re: Splitting file into separate files based on record lengths and identifiers
by ikegami (Patriarch) on Aug 25, 2010 at 23:16 UTC
    You could do it using unpack, but it leaves it up to you to group back into pairs.
    $ perl -E'say for unpack "(x4 A X5 A4x/A)*", $ARGV[0]' \ '0004$ADAM0002*330004%19770004$BOB 0002*430004%1967' $ ADAM * 33 % 1977 $ BOB * 43 % 1967
      Shouldn't it be /a instead of /A to include the space after BOB?
        Probably. "A" will trim trailing spaces. "a" won't. User's choice.
Re: Splitting file into separate files based on record lengths and identifiers
by JavaFan (Canon) on Aug 25, 2010 at 22:12 UTC
    use strict; use warnings; use autodie; $_ = '0004$ADAM0002*330004%19770004$BOB 0002*430004%1967'; my %fh; open $fh{'$'}, ">", "dollar.txt"; open $fh{'*'}, ">", "asterisk.txt"; open $fh{'%'}, ">", "percent.txt"; print {$fh{$2}} $3, "\n" while /([0-9]+)(.)((??{".{$1}"}))/g; __END__
Re: Splitting file into separate files based on record lengths and identifiers
by kennethk (Abbot) on Aug 25, 2010 at 21:21 UTC
    What have you tried? What didn't work? Demonstrated effort is generally appreciated around the monastery - see How do I post a question effectively?.

    You could do this quite compactly using embedded Perl code in a regular expression, but given that you are a "complete novice", a simpler algorithm of wrapping a substitution in a while loop would likely be sufficient. I assume you know how to open files and print output, so I will give you a simple demonstration of how you might implement your algorithm using regular expressions.

    #!/usr/bin/perl use strict; use warnings; my $input = '0004$ADAM0002*330004%19770004$BOB 0002*430004%1967'; while( length $input ) { unless ($input =~ s/^(\d+)(.)//) { die "Input misformatted: $input"; } my $len = $1; my $type = $2; unless ($input =~ s/^(.{$len})//) { die "Input misformatted: $len, $type, $input"; } my $record = $1; print "Type:\t<$type>\nRecord:\t<$record>\n\n"; }

    outputs:

    Type: <$> Record: <ADAM> Type: <*> Record: <33> Type: <%> Record: <1977> Type: <$> Record: <BOB > Type: <*> Record: <43> Type: <%> Record: <1967>

    If you have questions about how this works, I'd be happy to expound, though you should be able to find any answer in perlre and/or perlretut.

      Oh and I'd love you to break out the logic of your script, would much rather understand than blindly copy!

      Thanks!

        The most fundamental difference between my code and that posted in Re: Splitting file into separate files based on record lengths and identifiers is the use of ^ in my regular expressions. ^ requires that the match start at the beginning of the string. Unlike some regular expression implementations (Java comes to mind), without that anchor Perl can begin a match anywhere in the string. The likely issue you are having with your posted code is that your order of operations is getting messed up because you may start in the middle of your string. My approach requires that you process the string in order.
Re: Splitting file into separate files based on record lengths and identifiers
by Generoso (Prior) on Aug 25, 2010 at 22:55 UTC

    Hope this will help you. Uses Hash.

    #!/usr/bin/perl -w # use warnings; use strict; my %hash = (); while (<DATA>) { chomp($_); my $n = length($_); my $i = 0; while ($i<$n) { my $long = substr($_,$i,4); $i += 4; my $delim = substr($_,$i,1); $i += 1; my $val = substr($_,$i,$long); $i += $long; # print $long,' ',$delim, ' ', $val,"\n"; $hash{ $delim } .= $val.','; } } s/,\z// for values %hash; while ( my ($key, $value) = each(%hash) ) { print "$key => $value\n"; } print "size of hash: " . keys( %hash ) . ".\n"; print '-' x (60),"\n"; __DATA__ 0004$ADAM0002*330004%19770004$BOB 0002*430004%1967 0003$XDA0002*440004%22220003$XOB0002*990004%3333

    Results.

    Process started >>> $ => ADAM,BOB ,XDA,XOB % => 1977,1967,2222,3333 * => 33,43,44,99 size of hash: 3. ------------------------------------------------------------ <<< Process finished.

      The hash solution intrigued me, but it seems to break when I make the strings longer, can someone enlighten me as to why? Sample data:

      0100$THIS IS A 100 CHAR FIELD £$%£%$£                                                               0030*THIS IS A 30 CHAR ONE

      Thanks!

        It is working fin for me. The £ sign comes out funny because I an in Latinamerica and using the standard character set.

        DATA

        0004$ADAM0002*330004%19770004$BOB 0002*430004%1967 0003$XDA0002*440004%22220003$XOB0002*990004%3333 0100$THIS IS A 100 CHAR FIELD £$%£%$£ + 12345 0030*THIS IS A 30 CHAR ONE

        Result

        Process started >>> $ => ADAM,BOB ,XDA,XOB,THIS IS A 100 CHAR FIELD ú$%ú%$ú + 12345 % => 1977,1967,2222,3333 * => 33,43,44,99,THIS IS A 30 CHAR ONE size of hash: 3. ------------------------------------------------------------ <<< Process finished.
Re: Splitting file into separate files based on record lengths and identifiers
by monty77 (Initiate) on Aug 25, 2010 at 21:31 UTC

    Thank-you, I will try that.

    This is what I've tried to date, with help to build the if clauses with expressions, but as soon as I removed the first $ record it broke, and I'm still none the wiser as to why

    #!/usr/bin/perl use strict; use warnings; open my $data_fh, '<', 'data.txt' or die "could not open data.txt: $!\ +n"; open my $dollar_fh, '>', 'dollar.txt' or die "could not open dollar.tx +t: $!\n"; open my $star_fh, '>', 'asterisk.txt' or die "could not open asterisk. +txt: $!\n"; open my $percent_fh, '>', 'percent.txt' or die "could not open dollar. +txt: $!\n"; while( <$data_fh> ){ chomp; while( length ){ if( s{ (\d+) \$ }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $dollar_fh "$data\n" or die "could not print to dollar.txt +: $!\n"; } if( s{ (\d+) \* }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $star_fh "$data\n" or die "could not print to asterisk.txt +: $!\n"; } if( s{ (\d+) \% }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $percent_fh "$data\n" or die "could not print to percent.t +xt: $!\n"; } } } close $dollar_fh or die "could not close dollar.txt: $!\n"; close $star_fh or die "could not close asterisk.txt: $!\n"; close $percent_fh or die "could not close percent.txt: $!\n"; close $data_fh or die "could not close data.txt: $!\n";
    Thanks!
      Thank you for posting code - now I better understand your coding abilities and can give you specific critiques. I ran your code, as posted, against your posted sample input and it seems to have functioned as per spec. What makes you think it broke?

      As a side note, there is no need to explicitly close your files. Since you've used Indirect Filehandles, Perl with automatically clean those up once the variables got out of scope.

        This is the data that breaks my original script:

        0002*330004%19770004$BOB 0002*430004%1967

        ...it results in spitting this into the dollar.txt file and nothing into the other two.

        0002*330004%BOB 0002*430004%1967

        Thanks!

Re: Splitting file into separate files based on record lengths and identifiers
by umasuresh (Hermit) on Aug 25, 2010 at 21:47 UTC
    Try something like this and save the output in 3 files:
    use strict; my (@year, @name, @asterisk); while(<DATA>) { chomp; my $line = $_; while ($line =~ m/%(\d\d\d\d)/g) { push @year, $1; } while ($line =~m/\$([A-Z]+)/g) { push @name, $1; } while ($line =~m/\*([^000]+)/g) { push @asterisk, $1; } } print join("\n", @year); print "\n"; print join("\n", @name); print "\n"; print join("\n", @asterisk); print "\n"; __DATA__ 0004$ADAM0002*330004%19770004$BOB 0002*430004%1967 0007$AD0002*110004%19750004$BOB 0001*440004%1961
Re: Splitting file into separate files based on record lengths and identifiers
by psini (Deacon) on Aug 25, 2010 at 21:40 UTC

    You could probably use a regex like /^(\d{4}[$*%].*?)+$/ to split your input into "records" and /(\d{4})(.)(.*)/ to split each record into its fields

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Splitting file into separate files based on record lengths and identifiers
by Marshall (Canon) on Aug 27, 2010 at 12:45 UTC
    Update: I thought I was just adding an additional note to the choir here, but.......evidently not as new requirements surfaced below.

    I think Javafan has it in terms of brevity. But here is yet another way. I changed the data a bit in an attempt to cover more cases. Worthy of note here is that some solutions, mine included essentially just "throw away" the field length information.

    #!/usr/bin/perl -w use strict; use Data::Dumper; my %xlatetype = ('$' => 'string', '*' => 'number', '%' => 'year', ); my %results; my $data = '0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967'; my @tokens = split(/\d{4}([\$\*\%])/, $data); shift @tokens; #there is a "dead" field at the beginning while (@tokens) { my ($type, $value) = splice(@tokens,0,2); my $type_text = $xlatetype{$type}; push @{$results{$type_text}}, $value } print Dumper \%results; __END__ $VAR1 = { 'number' => [ '333', '4' ], 'string' => [ 'ADAM SMITH', 'BOB ' ], 'year' => [ '1977', '1967' ] };

      Morning all!

      While I massively appreciate the scripts here I just feel like I'm not learning anything, which is not good for anybody :) I have in my head the logic (probably flawed at this stage) I want to use and I've started from scratch on something really simple so perhaps people could help me go down that path?

      #!/usr/bin/perl use strict; use warnings; my $string = '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ +'; my $len1 = substr $string, 0, 4; my $type = substr $string, $len1, 1; my $fragment = substr $string, $len1+1, $len1; my $nextpos = $len1+5; print " string: $string\n"; print " 1st data length: $len1\n"; print " data: $fragment\n"; print " type: $type\n"; print " next record lenth pos: $nextpos\n";
      string: 0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ 1st data length: 0004 data: ADAM type: $ next record lenth pos: 9

      The way I'd like to approach this is to start at beginning of the line and use the length string in each case to extract the string. So in the example above I use the '0004' to identify the data in the first field and then to identify where the next length marker for record 2 begins

      What I'm stuck with is putting this into a loop so that it'll continue down the string, using each record length field as it goes, to strip out the data

      Thanks!

        #!/usr/bin/perl -w use strict; use Data::Dumper; my @input=( '0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967', '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ' ); foreach my $data (@input) { print "\nSTARTING NEW LINE ......\ndata = $data\n"; while ($data) { print "\n"; my $len = substr($data,0,4); my $type = substr($data,4,1); my $stuff= substr($data,5,$len); substr($data,0,$len+5,''); #delete this "set of data" print "data length = $len\n"; print "data = $stuff\n"; print "type = $type\n"; } } __END__ STARTING NEW LINE ...... data = 0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967 data length = 0010 data = ADAM SMITH type = $ data length = 0003 data = 333 type = * data length = 0004 data = 1977 type = % data length = 0004 data = BOB type = $ data length = 0001 data = 4 type = * data length = 0004 data = 1967 type = % STARTING NEW LINE ...... data = 0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ data length = 0004 data = ADAM type = $ data length = 0002 data = 33 type = * data length = 0005 data = TESTL type = L data length = 0005 data = TESTS type = S data length = 0005 data = TESTJ type = J data length = 0005 data = TESTZ type = Z

        Update .. got this far, but it just spits out the same data over and over again so I've not got the while loop quite right, any ideas?

        #!/usr/bin/perl use strict; use warnings; my $string = '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ +'; my $currentpos = 0; while ($string) { my $recordlength = (substr $string, $currentpos, 4)+5; my $recordtype = substr $string, $currentpos+4, 1; my $fragment = substr $string, $currentpos+5, $recordlength-5; my $nextstartpos = $recordlength+1; print " string: $string\n"; print " record type: $recordtype\n"; print " data: $fragment\n"; print " tot record length: $recordlength\n"; print " next start pos: $nextstartpos\n"; my $currentpos = $currentpos + $recordlength+1 }

        Thanks!