Re: Seeking an Enlightened Path (Parsing, Translating, Translocating)
by BrowserUk (Patriarch) on Mar 10, 2008 at 20:21 UTC
perl -ple"$_ = join'',(unpack 'A10 A4 A6 A6', $_)[ 2,1,0,3 ]" infile >outfile
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Unpack and then join. Very nice solution. I am getting my field definition before I would actually look at the file, so this is a very viable solution.
Is it possible to make this dynamically? In other words, would it be possible to define a string and feed that into unpack, then define another string and feed it into the second part of join? Also, if it is possible to make it dynamic in this way, would it hinder performance?
Perl is a "dynamic language", and generally encourages late, lazy, and dynamic programming. (Sorry, I could not find a better introductory platitude.)
Typically, built-in functions do not care (or are unaware) whether they are passed constants or variables. The only exception I can think of is a regular expression that interpolates a variable inside a loop, which may be recompiled on each pass.
So the advice is to go ahead and feed in constructed arguments. Benchmark it if you have doubts.
"As you get older three things happen. The first is your memory goes, and I can't remember the other two... "
- Sir Norman Wisdom
#!/usr/bin/perl
use strict;
use warnings;
#----------- define field mappings ---------------------
# hash to map from field i in input to field j in output
# field i is the key to hash, field j is the value of hash
my %from_to = (
    1 => 4,
    2 => 2,
    3 => 1,
    4 => 3,
);
# field length hash...key is the field number, value of hash
# is the length in characters. Presumes the length of the
# input and output fields are the same.
my %field_len = (
    1 => 10,
    2 => 4,
    3 => 6,
    4 => 6,
);
#---------- setup decode string --------
# sort numerically so that field 10 does not sort before field 2
my $decode_string = "";
foreach my $num ( sort { $a <=> $b } keys %from_to ) {
    $decode_string .= 'A' . $field_len{$num} . ' ';
}
#-------------- process files ----------
while ( my $in_record = <DATA> ) {
    chomp($in_record);
    print $in_record . " ---> ";
    my @input = unpack $decode_string, $in_record;
    my @output;
    # copy each input field to its mapped output position
    foreach my $index ( keys %from_to ) {
        $output[ $from_to{$index} - 1 ] = $input[ $index - 1 ];
    }
    my $out_record = join "", @output;
    print $out_record . "\n";
}
exit(0);
__END__
AAAAAAAAAA1111BBBBBB222222
BBBBBBBBBB2222CCCCCC333333
CCCCCCCCCC3333DDDDDD444444
I used two hashes to hold the mapping of input field to output field (%from_to) and the field lengths (%field_len). You could, of course, use whatever strategy you want. I also left print statements in so you can see how things work. I had tried to make it more compact by building the array slice dynamically as part of the join/unpack line, but I couldn't figure out (or remember) how to do that.
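The dynamic array slice mentioned above can be derived from the same %from_to mapping: sort the input field numbers by the output position they map to, then subtract one to get zero-based indexes. This is only a sketch, assuming the two hashes from the script above; the helper names build_template, build_slice and reorder are made up for this illustration.

```perl
use strict;
use warnings;

# Same mappings as in the script above: input field => output position,
# and field number => width in characters.
my %from_to   = ( 1 => 4,  2 => 2, 3 => 1, 4 => 3 );
my %field_len = ( 1 => 10, 2 => 4, 3 => 6, 4 => 6 );

# Hypothetical helper: unpack template with fields in numeric input order.
sub build_template {
    my ($len) = @_;
    return join ' ', map { 'A' . $len->{$_} } sort { $a <=> $b } keys %$len;
}

# Hypothetical helper: zero-based input indexes listed in output order,
# found by sorting the input fields on the output position they map to.
sub build_slice {
    my ($map) = @_;
    return map { $_ - 1 } sort { $map->{$a} <=> $map->{$b} } keys %$map;
}

my $template = build_template( \%field_len );
my @slice    = build_slice( \%from_to );

# Hypothetical helper: unpack one record and join the fields in output order.
sub reorder {
    my ($record) = @_;
    return join '', ( unpack $template, $record )[@slice];
}

print reorder('AAAAAAAAAA1111BBBBBB222222'), "\n";
# prints BBBBBB1111222222AAAAAAAAAA
```

One caveat: the A format strips trailing spaces when unpacking, so records with space-padded fields would come out shorter. Using a instead of A in the template keeps the padding intact.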
In other words, would it be possible to define a string and feed that into unpack, then define another string and feed it into the second part of join?
You'd have to explain that a little better, but something like this might be close, depending on where and how you want to obtain those strings. The following takes comma-separated arguments to construct the unpack template and the field ordering:
perl
-e"BEGIN{$T=join'',map{qq[A$_ ]}split',',shift;@F=split',',shift}"
-ple"$_ = join'',(unpack $T,$_)[@F]"
"10,4,6,6" "2,1,0,3" infile >outfile
But note I've had to split the "one-liner" over several lines for posting. Once they start getting this long, writing a proper script is more convenient if you are going to reuse it.
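For reuse, the same idea might look like this as a small script. It is a sketch only: the script name reorder.pl and the helper make_reorderer are invented for this example, and the two arguments follow the one-liner's convention of comma-separated widths and zero-based field indexes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a closure that reorders fixed-width fields.
#   $widths: comma-separated field widths, e.g. "10,4,6,6"
#   $order : comma-separated zero-based field indexes in output order
sub make_reorderer {
    my ( $widths, $order ) = @_;
    my $template = join ' ', map { 'A' . $_ } split /,/, $widths;
    my @fields   = split /,/, $order;
    return sub { join '', ( unpack $template, $_[0] )[@fields] };
}

# Usage (assumed script name):
#   perl reorder.pl 10,4,6,6 2,1,0,3 infile > outfile
if ( @ARGV >= 2 ) {
    my ( $widths, $order ) = splice @ARGV, 0, 2;
    my $reorder = make_reorderer( $widths, $order );
    while ( my $line = <> ) {    # remaining arguments are input files
        chomp $line;
        print $reorder->($line), "\n";
    }
}
```

The template and index list are built once, before the read loop, so the per-record work is the same unpack, slice and join as in the one-liner.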
Re: Seeking an Enlightened Path (Parsing, Translating, Translocating)
by olus (Curate) on Mar 10, 2008 at 19:08 UTC
use strict;
use warnings;
while(<DATA>) {
$_ =~ s/(\D*)(\d*)(\D*)(\d*)/$3$2$4$1/g;
print $_;
}
__DATA__
AAAAAAAAAA1111BBBBBB222222
BBBBBBBBBB2222CCCCCC333333
CCCCCCCCCC3333DDDDDD444444
outputs:
BBBBBB1111222222AAAAAAAAAA
CCCCCC2222333333BBBBBBBBBB
DDDDDD3333444444CCCCCCCCCC
$_ =~ s/(\D[1-10])(\D[11-16]) ... /$3$4$1$2/g;
That would work for input. Could I configure the second part of the search dynamically?
From what I understood from your example, you wanted to swap sequences of letters with sequences of digits. In the example solution I gave you, the regular expression looks for four such sequences regardless of the number of characters in each (provided the sequences alternate). That regexp does not look at the positions of the characters in the line.
In your example I saw a sequence of letters, or non-digits, so I used \D, which matches any non-digit. Then there is a sequence of digits, and \d matches a digit. Since the sequences have to switch positions, they need to be captured with () for later use.
The example as explained above does not know the number of characters in each sequence, but if you do want to do the rearrangement based on particular positions in the line, there are alternatives that take that into account (besides the excellent one BrowserUk showed).
If you say you have 10 characters, then 4, then 6 and finally 6 more, we can write such a regexp. For that we use . (match any character except newline), {n} (match exactly n times) and the grouping (). The regexp would be:
$_ =~ s/(.{10})(.{4})(.{6})(.{6})/$3$2$4$1/g;
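As for configuring both halves dynamically: the pattern can be assembled from a list of widths, but the replacement side ($3$2$4$1) cannot be interpolated from a string without resorting to eval. One way around that, sketched here under assumptions, is to capture into an array and join a slice of it. The names @widths, @order and rearrange are made up for this example.

```perl
use strict;
use warnings;

my @widths = ( 10, 4, 6, 6 );    # field widths in input order
my @order  = ( 3, 2, 4, 1 );     # capture groups in output order (1-based)

# Build the equivalent of /^(.{10})(.{4})(.{6})(.{6})/ once, outside any loop.
my $pattern = join '', map { '(.{' . $_ . '})' } @widths;
my $regex   = qr/^$pattern/;

# Hypothetical helper: reorder the captured fields of one line.
sub rearrange {
    my ($line) = @_;
    my @captured = $line =~ $regex
        or return $line;                 # leave non-matching lines alone
    return join( '', @captured[ map { $_ - 1 } @order ] )
        . substr( $line, $+[0] );        # keep anything after the fields
}

print rearrange("AAAAAAAAAA1111BBBBBB222222\n");
# prints BBBBBB1111222222AAAAAAAAAA
```

Compiling the pattern with qr// before the loop avoids rebuilding it for every record.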
Re: Seeking an Enlightened Path (Parsing, Translating, Translocating)
by igelkott (Priest) on Mar 10, 2008 at 20:14 UTC
...under half an hour per day
With the assumption that the pattern suggested by olus is applicable to your real record format, I ran a quick test for the other part of your question -- how long this might take.
On my modest hardware (Intel Core 2 6600), 10 million records took about 37 seconds to process. Assuming a reasonable overhead from other operations, there should be no problem keeping within the time constraints.
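A timing comparison of this kind can be reproduced with the core Benchmark module. The sketch below is not the test that was run for the figure above; it simply compares the unpack approach with the fixed-position regex on one sample record. The sub names via_unpack and via_regex are invented here, and the results will vary with hardware and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $record = 'AAAAAAAAAA1111BBBBBB222222';

# Candidate 1: unpack into fields, then join a slice in output order.
sub via_unpack {
    return join '', ( unpack 'A10 A4 A6 A6', $_[0] )[ 2, 1, 3, 0 ];
}

# Candidate 2: fixed-position captures rearranged by substitution.
sub via_regex {
    ( my $copy = $_[0] ) =~ s/^(.{10})(.{4})(.{6})(.{6})/$3$2$4$1/;
    return $copy;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    unpack_slice => sub { via_unpack($record) },
    regex_subst  => sub { via_regex($record) },
} );
```

Both subs produce the same output for this record, so the table compares like with like.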
Re: Seeking an Enlightened Path (Parsing, Translating, Translocating)
by Roy Johnson (Monsignor) on Mar 10, 2008 at 19:05 UTC
Fair enough. Let me describe a little more.
I have a fixed-length file. The length of each record should not change. I have a field definition table within Oracle that I can configure if input or output requirements change.
The data within each field is alphanumeric. My post showed data that was numeric for one field, then alpha the next. That was confusing. Sorry. The fields can consist of any combination of letters, numbers or spaces (since everything will be left-padded).
Since I have the field definitions, I know that characters 1-10 will be field 1 in the input, and that it has to map to field 4 in the output (same length). Field 2 in the input (let's say 6 characters) has to map to field 1 in the output. Field 3 in the input will actually have to switch values. It may be 10 characters long in input, but I have to match this value against a list of values (the key of a hash table) and have it be 4 characters in the output (the value of the same hash table).
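The requirement described above (move fields and translate one of them through a hash) could be sketched with unpack for reading and pack for writing. Everything specific here is an assumption for illustration: the four-field layout, the destination of fields 3 and 4, the %translate table and its values, the '????' placeholder, and the name convert. It also assumes fields are padded with trailing spaces.

```perl
use strict;
use warnings;

# Assumed layout: input fields of 10, 6, 10 and 6 characters.
# Output order: field 2, translated field 3 (4 chars), field 4, field 1.
my $in_template  = 'A10 A6 A10 A6';
my $out_template = 'A6 A4 A6 A10';   # pack 'A' pads with spaces or truncates

# Hypothetical lookup table: input value => 4-character output value.
my %translate = (
    'WIDGET'   => 'WDGT',
    'SPROCKET' => 'SPRK',
);

# Hypothetical helper: convert one input record to the output layout.
sub convert {
    my ($record) = @_;
    my ( $f1, $f2, $f3, $f4 ) = unpack $in_template, $record;
    # Fall back to a visible placeholder when the value is not in the table.
    my $code = exists $translate{$f3} ? $translate{$f3} : '????';
    return pack $out_template, $f2, $code, $f4, $f1;
}

my $sample = sprintf '%-10s%-6s%-10s%-6s', 'NAME', 'ABC', 'WIDGET', 'XYZ';
print '[', convert($sample), "]\n";
# prints [ABC   WDGTXYZ   NAME      ]
```

Because pack pads or truncates each value to the declared width, every output record has the same length regardless of what the lookup returns.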
Re: Seeking an Enlightened Path (Parsing, Translating, Translocating)
by ack (Deacon) on Mar 10, 2008 at 19:21 UTC
|
I, too (as Roy noted), can't tell what you're trying to do. Looks like you're trying to ...well, upon second look, I can't tell what you're trying to do. A subsequent node seems to have a regex that does, at least, what your example suggests. I don't know how to help without a better understanding of what the pattern transformation is trying to accomplish.