Splitting file into separate files based on record lengths and identifiers

monty77 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Splitting file into separate files based on record lengths and identifiers by ikegami (Patriarch) on Aug 25, 2010 at 23:16 UTC
You could do it using unpack, but it leaves it up to you to group back into pairs. `$ perl -E'say for unpack "(x4 A X5 A4x/A)", $ARGV[0]' \ '0004$ADAM0002330004%19770004$BOB 0002430004%1967' $ ADAM 33 % 1977 $ BOB * 43 % 1967` [download]	[reply] [d/l]
Re^2: Splitting file into separate files based on record lengths and identifiers by choroba (Cardinal) on Aug 30, 2010 at 16:04 UTC
Shouldn't it be `/a` instead of `/A` to include the space after BOB?	[reply] [d/l] [select]
Re^3: Splitting file into separate files based on record lengths and identifiers by ikegami (Patriarch) on Aug 30, 2010 at 20:29 UTC
Probably. "A" will trim trailing spaces. "a" won't. User's choice.	[reply]
Re: Splitting file into separate files based on record lengths and identifiers by JavaFan (Canon) on Aug 25, 2010 at 22:12 UTC
`use strict; use warnings; use autodie; $_ = '0004$ADAM0002330004%19770004$BOB 0002430004%1967'; my %fh; open $fh{'$'}, ">", "dollar.txt"; open $fh{'*'}, ">", "asterisk.txt"; open $fh{'%'}, ">", "percent.txt"; print {$fh{$2}} $3, "\n" while /([0-9]+)(.)((??{".{$1}"}))/g; __END__` [download]	[reply] [d/l]
Re: Splitting file into separate files based on record lengths and identifiers by kennethk (Abbot) on Aug 25, 2010 at 21:21 UTC
What have you tried? What didn't work? Demonstrated effort is generally appreciated around the monastery - see How do I post a question effectively?. You could do this quite compactly using embedded Perl code in a regular expression, but given that you are a "complete novice", a simpler algorithm of wrapping a substitution in a while loop would likely be sufficient. I assume you know how to open files and print output, so I will give you a simple demonstration of how you might implement your algorithm using regular expressions. `#!/usr/bin/perl use strict; use warnings; my $input = '0004$ADAM0002330004%19770004$BOB 0002430004%1967'; while( length $input ) { unless ($input =~ s/^(\d+)(.)//) { die "Input misformatted: $input"; } my $len = $1; my $type = $2; unless ($input =~ s/^(.{$len})//) { die "Input misformatted: $len, $type, $input"; } my $record = $1; print "Type:\t<$type>\nRecord:\t<$record>\n\n"; }` [download] outputs: `Type: <$> Record: <ADAM> Type: <> Record: <33> Type: <%> Record: <1977> Type: <$> Record: <BOB > Type: <> Record: <43> Type: <%> Record: <1967>` [download] If you have questions about how this works, I'd be happy to expound, though you should be able to find any answer in perlre and/or perlretut.	[reply] [d/l] [select]
Re^2: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 25, 2010 at 22:14 UTC
Oh and I'd love you to break out the logic of your script, would much rather understand than blindly copy! Thanks!	[reply]
Re^3: Splitting file into separate files based on record lengths and identifiers by kennethk (Abbot) on Aug 25, 2010 at 22:31 UTC
The most fundamental difference between my code and that posted in Re: Splitting file into separate files based on record lengths and identifiers is the use of `^` in my regular expressions. `^` requires that the match start at the beginning of the string. Unlike some regular expression implementations (Java comes to mind), without that anchor Perl can begin a match anywhere in the string. The likely issue you are having with your posted code is that your order of operations is getting messed up because you may start in the middle of your string. My approach requires that you process the string in order.	[reply] [d/l] [select]
Re: Splitting file into separate files based on record lengths and identifiers by Generoso (Prior) on Aug 25, 2010 at 22:55 UTC
Hope this will help you. Uses Hash. #!/usr/bin/perl -w # use warnings; use strict; my %hash = (); while (<DATA>) { chomp($_); my $n = length($_); my $i = 0; while ($i<$n) { my $long = substr($_,$i,4); $i += 4; my $delim = substr($_,$i,1); $i += 1; my $val = substr($_,$i,$long); $i += $long; # print $long,' ',$delim, ' ', $val,"\n"; $hash{ $delim } .= $val.','; } } s/,\z// for values %hash; while ( my ($key, $value) = each(%hash) ) { print "$key => $value\n"; } print "size of hash: " . keys( %hash ) . ".\n"; print '-' x (60),"\n"; __DATA__ 0004$ADAM0002330004%19770004$BOB 0002430004%1967 0003$XDA0002440004%22220003$XOB0002990004%3333 [download] Results. `Process started >>> $ => ADAM,BOB ,XDA,XOB % => 1977,1967,2222,3333 * => 33,43,44,99 size of hash: 3. ------------------------------------------------------------ <<< Process finished.` [download]	[reply] [d/l] [select]
Re^2: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 26, 2010 at 21:53 UTC
The hash solution intrigued me, but it seems to break when I make the strings longer, can someone enlighten me as to why? Sample data: `0100$THIS IS A 100 CHAR FIELD Ł$%Ł%$Ł 0030*THIS IS A 30 CHAR ONE` Thanks!	[reply] [d/l]
Re^3: Splitting file into separate files based on record lengths and identifiers by Generoso (Prior) on Aug 27, 2010 at 17:35 UTC
It is working fin for me. The Ł sign comes out funny because I an in Latinamerica and using the standard character set. DATA `0004$ADAM0002330004%19770004$BOB 0002430004%1967 0003$XDA0002440004%22220003$XOB0002990004%3333 0100$THIS IS A 100 CHAR FIELD Ł$%Ł%$Ł + 12345 0030THIS IS A 30 CHAR ONE` [download] Result `Process started >>> $ => ADAM,BOB ,XDA,XOB,THIS IS A 100 CHAR FIELD ú$%ú%$ú + 12345 % => 1977,1967,2222,3333 => 33,43,44,99,THIS IS A 30 CHAR ONE size of hash: 3. ------------------------------------------------------------ <<< Process finished.` [download]	[reply] [d/l] [select]
Re: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 25, 2010 at 21:31 UTC
Thank-you, I will try that. This is what I've tried to date, with help to build the if clauses with expressions, but as soon as I removed the first $ record it broke, and I'm still none the wiser as to why #!/usr/bin/perl use strict; use warnings; open my $data_fh, '<', 'data.txt' or die "could not open data.txt: $!\ +n"; open my $dollar_fh, '>', 'dollar.txt' or die "could not open dollar.tx +t: $!\n"; open my $star_fh, '>', 'asterisk.txt' or die "could not open asterisk. +txt: $!\n"; open my $percent_fh, '>', 'percent.txt' or die "could not open dollar. +txt: $!\n"; while( <$data_fh> ){ chomp; while( length ){ if( s{ (\d+) \$ }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $dollar_fh "$data\n" or die "could not print to dollar.txt +: $!\n"; } if( s{ (\d+) \* }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $star_fh "$data\n" or die "could not print to asterisk.txt +: $!\n"; } if( s{ (\d+) \% }{}msx ){ my $len = $1; my $data = substr( $_, 0, $len, '' ); print $percent_fh "$data\n" or die "could not print to percent.t +xt: $!\n"; } } } close $dollar_fh or die "could not close dollar.txt: $!\n"; close $star_fh or die "could not close asterisk.txt: $!\n"; close $percent_fh or die "could not close percent.txt: $!\n"; close $data_fh or die "could not close data.txt: $!\n"; [download] Thanks!	[reply] [d/l]
Re^2: Splitting file into separate files based on record lengths and identifiers by kennethk (Abbot) on Aug 25, 2010 at 21:58 UTC
Thank you for posting code - now I better understand your coding abilities and can give you specific critiques. I ran your code, as posted, against your posted sample input and it seems to have functioned as per spec. What makes you think it broke? As a side note, there is no need to explicitly close your files. Since you've used Indirect Filehandles, Perl with automatically clean those up once the variables got out of scope.	[reply]
Re^3: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 25, 2010 at 22:09 UTC
This is the data that breaks my original script: `0002330004%19770004$BOB 0002430004%1967` [download] ...it results in spitting this into the dollar.txt file and nothing into the other two. `0002330004%BOB 0002430004%1967` [download] Thanks!	[reply] [d/l] [select]
Re^4: Splitting file into separate files based on record lengths and identifiers by kennethk (Abbot) on Aug 25, 2010 at 22:41 UTC
Re: Splitting file into separate files based on record lengths and identifiers by umasuresh (Hermit) on Aug 25, 2010 at 21:47 UTC
Try something like this and save the output in 3 files: `use strict; my (@year, @name, @asterisk); while(<DATA>) { chomp; my $line = $_; while ($line =~ m/%(\d\d\d\d)/g) { push @year, $1; } while ($line =~m/\$([A-Z]+)/g) { push @name, $1; } while ($line =~m/\([^000]+)/g) { push @asterisk, $1; } } print join("\n", @year); print "\n"; print join("\n", @name); print "\n"; print join("\n", @asterisk); print "\n"; __DATA__ 0004$ADAM0002330004%19770004$BOB 0002430004%1967 0007$AD0002110004%19750004$BOB 0001*440004%1961` [download]	[reply] [d/l]
Re: Splitting file into separate files based on record lengths and identifiers by psini (Deacon) on Aug 25, 2010 at 21:40 UTC
You could probably use a regex like `/^(\d{4}[$%].?)+$/` to split your input into "records" and `/(\d{4})(.)(.*)/` to split each record into its fields Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply] [d/l] [select]
Re: Splitting file into separate files based on record lengths and identifiers by Marshall (Canon) on Aug 27, 2010 at 12:45 UTC
Update: I thought I was just adding an additional note to the choir here, but.......evidently not as new requirements surfaced below. I think Javafan has it in terms of brevity. But here is yet another way. I changed the data a bit in an attempt to cover more cases. Worthy of note here is that some solutions, mine included essentially just "throw away" the field length information. #!/usr/bin/perl -w use strict; use Data::Dumper; my %xlatetype = ('$' => 'string', '' => 'number', '%' => 'year', ); my %results; my $data = '0010$ADAM SMITH00033330004%19770004$BOB 000140004%1967'; my @tokens = split(/\d{4}([\$\\%])/, $data); shift @tokens; #there is a "dead" field at the beginning while (@tokens) { my ($type, $value) = splice(@tokens,0,2); my $type_text = $xlatetype{$type}; push @{$results{$type_text}}, $value } print Dumper \%results; __END__ $VAR1 = { 'number' => [ '333', '4' ], 'string' => [ 'ADAM SMITH', 'BOB ' ], 'year' => [ '1977', '1967' ] }; [download]	[reply] [d/l]
Re^2: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 27, 2010 at 17:05 UTC
Morning all! While I massively appreciate the scripts here I just feel like I'm not learning anything, which is not good for anybody :) I have in my head the logic (probably flawed at this stage) I want to use and I've started from scratch on something really simple so perhaps people could help me go down that path? `#!/usr/bin/perl use strict; use warnings; my $string = '0004$ADAM0002330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ +'; my $len1 = substr $string, 0, 4; my $type = substr $string, $len1, 1; my $fragment = substr $string, $len1+1, $len1; my $nextpos = $len1+5; print " string: $string\n"; print " 1st data length: $len1\n"; print " data: $fragment\n"; print " type: $type\n"; print " next record lenth pos: $nextpos\n";` [download] `string: 0004$ADAM0002330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ 1st data length: 0004 data: ADAM type: $ next record lenth pos: 9` [download] The way I'd like to approach this is to start at beginning of the line and use the length string in each case to extract the string. So in the example above I use the '0004' to identify the data in the first field and then to identify where the next length marker for record 2 begins What I'm stuck with is putting this into a loop so that it'll continue down the string, using each record length field as it goes, to strip out the data Thanks!	[reply] [d/l] [select]
Re^3: Splitting file into separate files based on record lengths and identifiers by Marshall (Canon) on Aug 27, 2010 at 19:05 UTC
#!/usr/bin/perl -w use strict; use Data::Dumper; my @input=( '0010$ADAM SMITH00033330004%19770004$BOB 000140004%1967', '0004$ADAM0002330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ' ); foreach my $data (@input) { print "\nSTARTING NEW LINE ......\ndata = $data\n"; while ($data) { print "\n"; my $len = substr($data,0,4); my $type = substr($data,4,1); my $stuff= substr($data,5,$len); substr($data,0,$len+5,''); #delete this "set of data" print "data length = $len\n"; print "data = $stuff\n"; print "type = $type\n"; } } __END__ STARTING NEW LINE ...... data = 0010$ADAM SMITH00033330004%19770004$BOB 000140004%1967 data length = 0010 data = ADAM SMITH type = $ data length = 0003 data = 333 type = data length = 0004 data = 1977 type = % data length = 0004 data = BOB type = $ data length = 0001 data = 4 type = * data length = 0004 data = 1967 type = % STARTING NEW LINE ...... data = 0004$ADAM0002330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ data length = 0004 data = ADAM type = $ data length = 0002 data = 33 type = data length = 0005 data = TESTL type = L data length = 0005 data = TESTS type = S data length = 0005 data = TESTJ type = J data length = 0005 data = TESTZ type = Z [download]	[reply] [d/l]
Re^3: Splitting file into separate files based on record lengths and identifiers by monty77 (Initiate) on Aug 27, 2010 at 17:49 UTC
Update .. got this far, but it just spits out the same data over and over again so I've not got the while loop quite right, any ideas? #!/usr/bin/perl use strict; use warnings; my $string = '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ +'; my $currentpos = 0; while ($string) { my $recordlength = (substr $string, $currentpos, 4)+5; my $recordtype = substr $string, $currentpos+4, 1; my $fragment = substr $string, $currentpos+5, $recordlength-5; my $nextstartpos = $recordlength+1; print " string: $string\n"; print " record type: $recordtype\n"; print " data: $fragment\n"; print " tot record length: $recordlength\n"; print " next start pos: $nextstartpos\n"; my $currentpos = $currentpos + $recordlength+1 } [download] Thanks!	[reply] [d/l]