Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have to anonymize several files (semicolon-separated). The following script does the job. But: if the field "Number_1" is the last field, then it stays unchanged. With the dummy field (see DATA) it works. Where is the mistake? Thank you very much! VE
use strict; use Text::ParseWords; while(<DATA>) { my @Proto = <DATA>; my $Header = shift @Proto; my @Assembly; my $Nr = &parsingHeader($Header, "Number"); my $Nr_1 = &parsingHeader($Header, "Number_1"); my $name = &parsingHeader($Header, "Name"); foreach(@Proto) { my @fields = quotewords( ';', 0, $_); $fields[$Nr] =~tr/0123456789/9876543210/ if defined $Nr; $fields[$Nr_1] =~tr/0123456789/9876543210/ if defined $Nr_1; $fields[$name] =~tr/A-Za-z/B-Wzb-w/ if defined $name; push @Assembly, join(";",@fields); } unshift @Assembly, $Header; print @Assembly; } sub parsingHeader { my $x; my $Row = shift @_; my $Column = shift @_; my @parts = split(/;/, $Row); for(my $i = 0; $i <= @parts; $i++) { if($parts[$i] eq $Column) { $x = $i; last; }; } return $x; } __DATA__ Number;Name;Age;Gender;Number_1;Dummy 123;Andrew;54;m;123;AAA 234;John;43;m;234;JJJ 345;Helen;23;w;345;HHH

Replies are listed 'Best First'.
Re: Parsing Header Last Field
by toolic (Bishop) on Jul 05, 2011 at 15:04 UTC
    I think you have an off-by-one error. Change:
    for ( my $i = 0 ; $i <= @parts ; $i++ ) {
    to:
    for ( my $i = 0 ; $i < @parts ; $i++ ) {

    You probably need to chomp @Proto, then add the newline back to your @Assembly.

    After using perltidy, here are the changes:

    use warnings; use strict; use Text::ParseWords; my @Proto = <DATA>; chomp @Proto; my $Header = shift @Proto; my @Assembly; my $Nr = &parsingHeader( $Header, "Number" ); my $Nr_1 = &parsingHeader( $Header, "Number_1" ); my $name = &parsingHeader( $Header, "Name" ); foreach (@Proto) { my @fields = quotewords( ';', 0, $_ ); $fields[$Nr] =~ tr/0123456789/9876543210/ if defined $Nr; $fields[$Nr_1] =~ tr/0123456789/9876543210/ if defined $Nr_1; $fields[$name] =~ tr/A-Za-z/B-Wzb-w/ if defined $name; push @Assembly, join( ";", @fields ), "\n"; } unshift @Assembly, $Header; print @Assembly; sub parsingHeader { my $x; my $Row = shift @_; my $Column = shift @_; my @parts = split( /;/, $Row ); for ( my $i = 0 ; $i < @parts ; $i++ ) { if ( $parts[$i] eq $Column ) { $x = $i; last; } } return $x; } __DATA__ Number;Name;Age;Gender;Number_1;Dummy 123;Andrew;54;m;123;AAA 234;John;43;m;234;JJJ 345;Helen;23;w;345;HHH

      Even better would be to use a Perl for loop:

      for my $i (0 .. $#parts) {

      which is clearer, safer and, as icing on the cake, is more succinct too.

      True laziness is hard work
Re: Parsing Header Last Field
by Tux (Canon) on Jul 05, 2011 at 15:07 UTC

    Looking at the data, wouldn't it be much easier to use a CSV based module like Text::CSV_XS, Text::CSV or DBD::CSV?

    These all support semi-colon as separator character.


    Enjoy, Have FUN! H.Merijn
Re: Parsing Header Last Field
by Marshall (Canon) on Jul 05, 2011 at 16:01 UTC
    Your code (after the off by one error is fixed) does work, but you aren't getting much "mileage" out of ParseWords. I recoded without that below.

    A few comments:
    -don't push when you could print.
    -there is no need to save all the output in an array.
    -using a hash allowed me to get rid of variables like $Nr_1 and put code in same lingo as the file itself.
    -this kind of "anonimization" is easily cracked - not so anon.
    -you will need a better scheme if this data is sensitive.

    #!/usr/bin/perl -w use strict; my $header_line = <DATA>; print $header_line; $header_line =~ s/\s*$//; # no trailing blanks or \n # a bit better than chomp() here # (chomp only gets rid of \n) # since words will be hash keys + my %index; # field name to index number my $i=0; foreach ( split /;/,$header_line ) { $index{$_}=$i++; } while(<DATA>) { next if /^\s*$/; # skip blank lines chomp; my @parts = split(/;/,$_); $parts[$index{Number}] =~ tr/0123456789/9876543210/; $parts[$index{Number_1}] =~ tr/0123456789/9876543210/; $parts[$index{Name}] =~ tr/A-Za-z/B-Wzb-w/; print join (";",@parts),"\n"; } =output Number;Name;Age;Gender;Number_1 876;Brhviw;54;m;876;AAA 765;Kslr;43;m;765;JJJ 654;Iipir;23;w;654;HHH =cut __DATA__ Number;Name;Age;Gender;Number_1 123;Andrew;54;m;123;AAA 234;John;43;m;234;JJJ 345;Helen;23;w;345;HHH
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 15:23 UTC

    I have to anonymize several files (semicolon-separated). The following script does the job. ... Where is the mistake?

    There are several.

    A simple substitution cipher (like rot 13 ) doesn't make the data anonymous, and its easily reversed.

    You're not chomping your input, so parsingHeader( $Header, "Number_1" ) returns undef

    You have a useless while loop :)

Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 15:55 UTC
    Dear All, thank you very much!
    After chomp'ing the input and changing from $i <= @parts to $i < @parts it works now!

    I also changed

     unshift @Assembly, $Header;
    to
    unshift @Assembly, $Header."\n";
    since the first row sticks on the header otherwise.
    However I could not do it without while (<DATA>)-loop. I am a newbie thus please do not beat me, but the code of toolic works by me only after I returned the while(<DATA>)-loop back.

    In fact, this loop is necessary anyway since I process the files not the DATA-snippet, this was just an example.

    You are right, this anonymizing is very primitive, but this is just a fool-proofing and is sufficient in this case, it is not necessary to protect this data from the hacker attack.

    And: unfortunately I can work with the standard installation of perl only, I cannot install the other modules - it is just a policy here. Too bad.

    Thank you all very much again. If you find another mistakes in this snippet of code it would be grate.

    VE
      You're welcome.
      And: unfortunately I can work with the standard installation of perl only, I cannot install the other modules - it is just a policy here. Too bad.
      Are you sure about that? See: Yes, even you can use CPAN
        Thank you again, I will try it.

      However I could not do it without while (<DATA>)-loop

      Sure you can, simply shift the first, empty (all whitespace) line before the header -- you don't need a loop for that

      <DATA> is readline DATA, and readline in scalar context reads one line

Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 16:02 UTC
    ... and: toolic, thank you for link to perltidy, I did not know about the tool.
Re: Parsing Header Last Field
by Anonymous Monk on Jul 05, 2011 at 16:09 UTC
    Thank you Marshall, this is great!
Re: Parsing Header Last Field
by Anonymous Monk on Jul 08, 2011 at 07:37 UTC
    Dear Monks,
    I would like to give a feedback.
    It works great now (the hash solution twice so fast as a solution with a sub).
    Still I have a problem :-)
    There are many files which I process. Neither of files have every field in the header, which I modify. Some of files have even no such fields in their header. I noticed that e.g.
     $index{"NAME"}; returns undef in these cases which is interpreted as 0. Therefore the first column in these files is erroneously modified.
    I did the following thing:
    if ($index{"NAME"} =~/\d+/) { $parts[$index{"NAME"}] = ""; }
    and this works. I need to repeat this preamble each time however. Is this a workaround or the solution (i.e. can this be done better)? Please note that the different fields are processed differently (some become "", some become replacement of digits etc.). Thanks! VE
      Sounds like you need something like this:
      $parts[$index{Name}] =~ tr/A-Za-z/B-Wzb-w/ if defined ($index{Name});
      Checking whether or not the hash key is defined or exists is faster and more clear than running a regex to see if it contains a digit.

      Perhaps you need something similar to this on other replacements also?