Re: Parsing Header Last Field
by toolic (Bishop) on Jul 05, 2011 at 15:04 UTC
|
I think you have an off-by-one error. Change:
for ( my $i = 0 ; $i <= @parts ; $i++ ) {
to:
for ( my $i = 0 ; $i < @parts ; $i++ ) {
You probably need to chomp @Proto, then add the newline back to your @Assembly.
After using perltidy, here are the changes:
use warnings;
use strict;
use Text::ParseWords;
my @Proto = <DATA>;
chomp @Proto;
my $Header = shift @Proto;
my @Assembly;
my $Nr = &parsingHeader( $Header, "Number" );
my $Nr_1 = &parsingHeader( $Header, "Number_1" );
my $name = &parsingHeader( $Header, "Name" );
foreach (@Proto) {
my @fields = quotewords( ';', 0, $_ );
$fields[$Nr] =~ tr/0123456789/9876543210/ if defined $Nr;
$fields[$Nr_1] =~ tr/0123456789/9876543210/ if defined $Nr_1;
$fields[$name] =~ tr/A-Za-z/B-Wzb-w/ if defined $name;
push @Assembly, join( ";", @fields ), "\n";
}
unshift @Assembly, $Header;
print @Assembly;
sub parsingHeader {
my $x;
my $Row = shift @_;
my $Column = shift @_;
my @parts = split( /;/, $Row );
for ( my $i = 0 ; $i < @parts ; $i++ ) {
if ( $parts[$i] eq $Column ) {
$x = $i;
last;
}
}
return $x;
}
__DATA__
Number;Name;Age;Gender;Number_1;Dummy
123;Andrew;54;m;123;AAA
234;John;43;m;234;JJJ
345;Helen;23;w;345;HHH
| [reply] [d/l] [select] |
|
|
for my $i (0 .. $#parts) {
which is clearer, safer and, as icing on the cake, is more succinct too.
True laziness is hard work
| [reply] [d/l] |
Re: Parsing Header Last Field
by Tux (Canon) on Jul 05, 2011 at 15:07 UTC
|
Looking at the data, wouldn't it be much easier to use a CSV based module like Text::CSV_XS, Text::CSV or DBD::CSV?
These all support semi-colon as separator character.
Enjoy, Have FUN! H.Merijn
| [reply] |
Re: Parsing Header Last Field
by Marshall (Canon) on Jul 05, 2011 at 16:01 UTC
|
Your code (after the off by one error is fixed) does work, but you aren't getting much "mileage" out of ParseWords. I recoded without that below. A few comments:
-don't push when you could print.
-there is no need to save all the output in an array.
-using a hash allowed me to get rid of variables like $Nr_1 and put code in same lingo as the file itself.
-this kind of "anonimization" is easily cracked - not so anon.
-you will need a better scheme if this data is sensitive.
#!/usr/bin/perl -w
use strict;
my $header_line = <DATA>;
print $header_line;
$header_line =~ s/\s*$//; # no trailing blanks or \n
# a bit better than chomp() here
# (chomp only gets rid of \n)
# since words will be hash keys
+
my %index; # field name to index number
my $i=0;
foreach ( split /;/,$header_line )
{
$index{$_}=$i++;
}
while(<DATA>)
{
next if /^\s*$/; # skip blank lines
chomp;
my @parts = split(/;/,$_);
$parts[$index{Number}] =~ tr/0123456789/9876543210/;
$parts[$index{Number_1}] =~ tr/0123456789/9876543210/;
$parts[$index{Name}] =~ tr/A-Za-z/B-Wzb-w/;
print join (";",@parts),"\n";
}
=output
Number;Name;Age;Gender;Number_1
876;Brhviw;54;m;876;AAA
765;Kslr;43;m;765;JJJ
654;Iipir;23;w;654;HHH
=cut
__DATA__
Number;Name;Age;Gender;Number_1
123;Andrew;54;m;123;AAA
234;John;43;m;234;JJJ
345;Helen;23;w;345;HHH
| [reply] [d/l] |
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 15:23 UTC
|
I have to anonymize several files (semicolon-separated). The following script does the job. ... Where is the mistake?
There are several.
A simple substitution cipher (like rot 13 ) doesn't make the data anonymous, and its easily reversed.
You're not chomping your input, so parsingHeader( $Header, "Number_1" ) returns undef
You have a useless while loop :)
| [reply] |
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 15:55 UTC
|
Dear All,
thank you very much!
After chomp'ing the input and changing from $i <= @parts to $i < @parts it works now!
I also changed
unshift @Assembly, $Header;
to
unshift @Assembly, $Header."\n";
since the first row sticks on the header otherwise.
However I could not do it without while (<DATA>)-loop. I am a newbie thus please do not beat me, but the code of toolic works by me only after I returned the while(<DATA>)-loop back.
In fact, this loop is necessary anyway since I process the files not the DATA-snippet, this was just an example.
You are right, this anonymizing is very primitive, but this is just a fool-proofing and is sufficient in this case, it is not necessary to protect this data from the hacker attack.
And: unfortunately I can work with the standard installation of perl only, I cannot install the other modules - it is just a policy here. Too bad.
Thank you all very much again. If you find another mistakes in this snippet of code it would be grate.
VE
| [reply] [d/l] [select] |
|
|
You're welcome.
And: unfortunately I can work with the standard installation of perl only, I cannot install the other modules - it is just a policy here. Too bad.
Are you sure about that? See: Yes, even you can use CPAN
| [reply] |
|
|
Thank you again, I will try it.
| [reply] |
|
|
However I could not do it without while (<DATA>)-loop
Sure you can, simply shift the first, empty (all whitespace) line before the header -- you don't need a loop for that
<DATA> is readline DATA, and readline in scalar context reads one line
| [reply] |
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 16:02 UTC
|
... and:
toolic, thank you for link to perltidy, I did not know about the tool. | [reply] |
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 16:09 UTC
|
Thank you Marshall, this is great! | [reply] |
Re: Parsing Header Last Field
by Anonymous Monk on Jul 08, 2011 at 07:37 UTC
|
Dear Monks,
I would like to give a feedback.
It works great now (the hash solution twice so fast as a solution with a sub).
Still I have a problem :-)
There are many files which I process. Neither of files have every field in the header, which I modify. Some of files have even no such fields in their header. I noticed that e.g.
$index{"NAME"};
returns undef in these cases which is interpreted as 0. Therefore the first column in these files is erroneously modified.
I did the following thing:
if ($index{"NAME"} =~/\d+/) {
$parts[$index{"NAME"}] = "";
}
and this works. I need to repeat this preamble each time however. Is this a workaround or the solution (i.e. can this be done better)? Please note that the different fields are processed differently (some become "", some become replacement of digits etc.).
Thanks!
VE | [reply] [d/l] [select] |
|
|
Sounds like you need something like this:
$parts[$index{Name}] =~ tr/A-Za-z/B-Wzb-w/ if defined ($index{Name});
Checking whether or not the hash key is defined or exists is faster and more clear than running a regex to see if it contains a digit.
Perhaps you need something similar to this on other replacements also? | [reply] [d/l] |