Re: Splitting file into separate files based on record lengths and identifiers
by ikegami (Patriarch) on Aug 25, 2010 at 23:16 UTC
|
You could do it using unpack, but it leaves it up to you to group back into pairs.
$ perl -E'say for unpack "(x4 A X5 A4x/A)*", $ARGV[0]' \
'0004$ADAM0002*330004%19770004$BOB 0002*430004%1967'
$
ADAM
*
33
%
1977
$
BOB
*
43
%
1967
| [reply] [d/l] |
|
|
Shouldn't it be /a instead of /A to include the space after BOB?
| [reply] [d/l] [select] |
|
|
Probably. "A" will trim trailing spaces. "a" won't. User's choice.
| [reply] |
Re: Splitting file into separate files based on record lengths and identifiers
by JavaFan (Canon) on Aug 25, 2010 at 22:12 UTC
|
use strict;
use warnings;
use autodie;
$_ = '0004$ADAM0002*330004%19770004$BOB 0002*430004%1967';
my %fh;
open $fh{'$'}, ">", "dollar.txt";
open $fh{'*'}, ">", "asterisk.txt";
open $fh{'%'}, ">", "percent.txt";
print {$fh{$2}} $3, "\n" while /([0-9]+)(.)((??{".{$1}"}))/g;
__END__
| [reply] [d/l] |
Re: Splitting file into separate files based on record lengths and identifiers
by kennethk (Abbot) on Aug 25, 2010 at 21:21 UTC
|
What have you tried? What didn't work? Demonstrated effort is generally appreciated around the monastery - see How do I post a question effectively?.
You could do this quite compactly using embedded Perl code in a regular expression, but given that you are a "complete novice", a simpler algorithm of wrapping a substitution in a while loop would likely be sufficient. I assume you know how to open files and print output, so I will give you a simple demonstration of how you might implement your algorithm using regular expressions.
#!/usr/bin/perl
use strict;
use warnings;
my $input = '0004$ADAM0002*330004%19770004$BOB 0002*430004%1967';
while( length $input ) {
unless ($input =~ s/^(\d+)(.)//) {
die "Input misformatted: $input";
}
my $len = $1;
my $type = $2;
unless ($input =~ s/^(.{$len})//) {
die "Input misformatted: $len, $type, $input";
}
my $record = $1;
print "Type:\t<$type>\nRecord:\t<$record>\n\n";
}
outputs:
Type: <$>
Record: <ADAM>
Type: <*>
Record: <33>
Type: <%>
Record: <1977>
Type: <$>
Record: <BOB >
Type: <*>
Record: <43>
Type: <%>
Record: <1967>
If you have questions about how this works, I'd be happy to expound, though you should be able to find any answer in perlre and/or perlretut. | [reply] [d/l] [select] |
|
|
| [reply] |
|
|
The most fundamental difference between my code and that posted in Re: Splitting file into separate files based on record lengths and identifiers is the use of ^ in my regular expressions. ^ requires that the match start at the beginning of the string. Unlike some regular expression implementations (Java comes to mind), without that anchor Perl can begin a match anywhere in the string. The likely issue you are having with your posted code is that your order of operations is getting messed up because you may start in the middle of your string. My approach requires that you process the string in order.
| [reply] [d/l] [select] |
Re: Splitting file into separate files based on record lengths and identifiers
by Generoso (Prior) on Aug 25, 2010 at 22:55 UTC
|
#!/usr/bin/perl -w
#
use warnings;
use strict;
my %hash = ();
while (<DATA>) {
chomp($_);
my $n = length($_);
my $i = 0;
while ($i<$n) {
my $long = substr($_,$i,4);
$i += 4;
my $delim = substr($_,$i,1);
$i += 1;
my $val = substr($_,$i,$long);
$i += $long;
# print $long,' ',$delim, ' ', $val,"\n";
$hash{ $delim } .= $val.',';
} }
s/,\z// for values %hash;
while ( my ($key, $value) = each(%hash) ) {
print "$key => $value\n";
}
print "size of hash: " . keys( %hash ) . ".\n";
print '-' x (60),"\n";
__DATA__
0004$ADAM0002*330004%19770004$BOB 0002*430004%1967
0003$XDA0002*440004%22220003$XOB0002*990004%3333
Results.
Process started >>>
$ => ADAM,BOB ,XDA,XOB
% => 1977,1967,2222,3333
* => 33,43,44,99
size of hash: 3.
------------------------------------------------------------
<<< Process finished.
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] |
|
|
0004$ADAM0002*330004%19770004$BOB 0002*430004%1967
0003$XDA0002*440004%22220003$XOB0002*990004%3333
0100$THIS IS A 100 CHAR FIELD £$%£%$£
+ 12345
0030*THIS IS A 30 CHAR ONE
Result
Process started >>>
$ => ADAM,BOB ,XDA,XOB,THIS IS A 100 CHAR FIELD ú$%ú%$ú
+ 12345
% => 1977,1967,2222,3333
* => 33,43,44,99,THIS IS A 30 CHAR ONE
size of hash: 3.
------------------------------------------------------------
<<< Process finished.
| [reply] [d/l] [select] |
Re: Splitting file into separate files based on record lengths and identifiers
by monty77 (Initiate) on Aug 25, 2010 at 21:31 UTC
|
Thank-you, I will try that.
This is what I've tried to date, with help to build the if clauses with expressions, but as soon as I removed the first $ record it broke, and I'm still none the wiser as to why
#!/usr/bin/perl
use strict;
use warnings;
open my $data_fh, '<', 'data.txt' or die "could not open data.txt: $!\
+n";
open my $dollar_fh, '>', 'dollar.txt' or die "could not open dollar.tx
+t: $!\n";
open my $star_fh, '>', 'asterisk.txt' or die "could not open asterisk.
+txt: $!\n";
open my $percent_fh, '>', 'percent.txt' or die "could not open dollar.
+txt: $!\n";
while( <$data_fh> ){
chomp;
while( length ){
if( s{ (\d+) \$ }{}msx ){
my $len = $1;
my $data = substr( $_, 0, $len, '' );
print $dollar_fh "$data\n" or die "could not print to dollar.txt
+: $!\n";
}
if( s{ (\d+) \* }{}msx ){
my $len = $1;
my $data = substr( $_, 0, $len, '' );
print $star_fh "$data\n" or die "could not print to asterisk.txt
+: $!\n";
}
if( s{ (\d+) \% }{}msx ){
my $len = $1;
my $data = substr( $_, 0, $len, '' );
print $percent_fh "$data\n" or die "could not print to percent.t
+xt: $!\n";
}
}
}
close $dollar_fh or die "could not close dollar.txt: $!\n";
close $star_fh or die "could not close asterisk.txt: $!\n";
close $percent_fh or die "could not close percent.txt: $!\n";
close $data_fh or die "could not close data.txt: $!\n";
Thanks! | [reply] [d/l] |
|
|
| [reply] |
|
|
0002*330004%19770004$BOB 0002*430004%1967
...it results in spitting this into the dollar.txt file and nothing into the other two.
0002*330004%BOB 0002*430004%1967
Thanks! | [reply] [d/l] [select] |
|
|
Re: Splitting file into separate files based on record lengths and identifiers
by umasuresh (Hermit) on Aug 25, 2010 at 21:47 UTC
|
Try something like this and save the output in 3 files:
use strict;
my (@year, @name, @asterisk);
while(<DATA>)
{
chomp;
my $line = $_;
while ($line =~ m/%(\d\d\d\d)/g)
{
push @year, $1;
}
while ($line =~m/\$([A-Z]+)/g)
{
push @name, $1;
}
while ($line =~m/\*([^000]+)/g)
{
push @asterisk, $1;
}
}
print join("\n", @year);
print "\n";
print join("\n", @name);
print "\n";
print join("\n", @asterisk);
print "\n";
__DATA__
0004$ADAM0002*330004%19770004$BOB 0002*430004%1967
0007$AD0002*110004%19750004$BOB 0001*440004%1961
| [reply] [d/l] |
Re: Splitting file into separate files based on record lengths and identifiers
by psini (Deacon) on Aug 25, 2010 at 21:40 UTC
|
You could probably use a regex like /^(\d{4}[$*%].*?)+$/ to split your input into "records" and /(\d{4})(.)(.*)/ to split each record into its fields
Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
| [reply] [d/l] [select] |
Re: Splitting file into separate files based on record lengths and identifiers
by Marshall (Canon) on Aug 27, 2010 at 12:45 UTC
|
Update: I thought I was just adding an additional note to the choir here, but.......evidently not as new requirements surfaced below.
I think Javafan has it in terms of brevity. But here is yet another way. I changed the data a bit in an attempt to cover more cases. Worthy of note here is that some solutions, mine included essentially just "throw away" the field length information.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my %xlatetype = ('$' => 'string',
'*' => 'number',
'%' => 'year',
);
my %results;
my $data = '0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967';
my @tokens = split(/\d{4}([\$\*\%])/, $data);
shift @tokens; #there is a "dead" field at the beginning
while (@tokens)
{
my ($type, $value) = splice(@tokens,0,2);
my $type_text = $xlatetype{$type};
push @{$results{$type_text}}, $value
}
print Dumper \%results;
__END__
$VAR1 = {
'number' => [
'333',
'4'
],
'string' => [
'ADAM SMITH',
'BOB '
],
'year' => [
'1977',
'1967'
]
};
| [reply] [d/l] |
|
|
Morning all!
While I massively appreciate the scripts here I just feel like I'm not learning anything, which is not good for anybody :) I have in my head the logic (probably flawed at this stage) I want to use and I've started from scratch on something really simple so perhaps people could help me go down that path?
#!/usr/bin/perl
use strict;
use warnings;
my $string = '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ
+';
my $len1 = substr $string, 0, 4;
my $type = substr $string, $len1, 1;
my $fragment = substr $string, $len1+1, $len1;
my $nextpos = $len1+5;
print " string: $string\n";
print " 1st data length: $len1\n";
print " data: $fragment\n";
print " type: $type\n";
print " next record lenth pos: $nextpos\n";
string: 0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ
1st data length: 0004
data: ADAM
type: $
next record lenth pos: 9
The way I'd like to approach this is to start at beginning of the line and use the length string in each case to extract the string. So in the example above I use the '0004' to identify the data in the first field and then to identify where the next length marker for record 2 begins
What I'm stuck with is putting this into a loop so that it'll continue down the string, using each record length field as it goes, to strip out the data
Thanks! | [reply] [d/l] [select] |
|
|
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my @input=(
'0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967',
'0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ'
);
foreach my $data (@input)
{
print "\nSTARTING NEW LINE ......\ndata = $data\n";
while ($data)
{
print "\n";
my $len = substr($data,0,4);
my $type = substr($data,4,1);
my $stuff= substr($data,5,$len);
substr($data,0,$len+5,''); #delete this "set of data"
print "data length = $len\n";
print "data = $stuff\n";
print "type = $type\n";
}
}
__END__
STARTING NEW LINE ......
data = 0010$ADAM SMITH0003*3330004%19770004$BOB 0001*40004%1967
data length = 0010
data = ADAM SMITH
type = $
data length = 0003
data = 333
type = *
data length = 0004
data = 1977
type = %
data length = 0004
data = BOB
type = $
data length = 0001
data = 4
type = *
data length = 0004
data = 1967
type = %
STARTING NEW LINE ......
data = 0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ
data length = 0004
data = ADAM
type = $
data length = 0002
data = 33
type = *
data length = 0005
data = TESTL
type = L
data length = 0005
data = TESTS
type = S
data length = 0005
data = TESTJ
type = J
data length = 0005
data = TESTZ
type = Z
| [reply] [d/l] |
|
|
#!/usr/bin/perl
use strict;
use warnings;
my $string = '0004$ADAM0002*330005LTESTL0005STESTS0005JTESTJ0005ZTESTZ
+';
my $currentpos = 0;
while ($string) {
my $recordlength = (substr $string, $currentpos, 4)+5;
my $recordtype = substr $string, $currentpos+4, 1;
my $fragment = substr $string, $currentpos+5, $recordlength-5;
my $nextstartpos = $recordlength+1;
print " string: $string\n";
print " record type: $recordtype\n";
print " data: $fragment\n";
print " tot record length: $recordlength\n";
print " next start pos: $nextstartpos\n";
my $currentpos = $currentpos + $recordlength+1
}
Thanks! | [reply] [d/l] |