Remove unwanted chars and maintain table integrity

Replies are listed 'Best First'.
Re: Remove unwanted chars and maintain table integrity by jethro (Monsignor) on May 19, 2010 at 10:22 UTC
To learn, you need to solve the problem yourself, so I'll list only some code fragments; you can fill in the rest.

An AoA is a very good idea. You need an array for each column that could have multiple items in it. If this is only the Pro_ID column, a simple array will do the trick.

To collect the data in your while loop, just ignore the 'W':

if ($array[2]!='W') { $length= $array[2] }
if ($array[3]!='W') { push @id, $array[3] }

You should be able to add the lines for count and character too.

If you have read a line that belongs to a new character, just call a subroutine that prints out the previous character with all the information in $count, $length and @id in a loop. You can use Perl formats or roll your own formatting.

Don't forget to call the subroutine one last time after your while loop to print the last character.
Re^2: Remove unwanted chars and maintain table integrity by Anonymous Monk on May 19, 2010 at 12:42 UTC
Would you mind elaborating a little more? I could not follow the lines of code you posted. Besides, the '!=' confuses me, since it is supposed to be used on numerical values and not on non-numerical ones, right?
Re^3: Remove unwanted chars and maintain table integrity by jethro (Monsignor) on May 19, 2010 at 14:31 UTC
You are right, '!=' should have been 'ne'.

The code I posted simply stores/remembers any value that isn't 'W'. Because I assumed that the length column has only one value to remember (per character), I used a variable instead of an array for the length column.

Whenever you encounter a new character name in the loop, call a subroutine to output the previous character, i.e. add the following before the other lines I posted:

if ($array[0] ne 'W') {
  OutputCharacter($char,$count,$length,@id);
  @id=();
}

The subroutine OutputCharacter just prints one line per value in @id, and in the first line also the name, count and length. One way to do that is to print the first line separately and then the rest in a loop.
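For readers who want to see the fragments above in one piece, here is a minimal sketch. It is not jethro's code: it assumes the fields are separated by runs of two or more spaces (so a plain split is enough), the body of OutputCharacter is a hypothetical implementation of the routine he describes, and the @out buffer and the shortened sample rows are only there to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows in the thread's layout (shortened); fields are assumed
# to be separated by runs of two or more spaces.
my @lines = (
    'Timothy Watson 12 Medulla    W     W      W',
    'W                            W     W      ID:10',
    'W                            W     W      ID:11',
    'W                            5     W      W',
    'W                            W     16     W',
    'Maya Alabina 5 Exo           W     W      W',
    'W                            W     W      ID:28',
    'W                            1     W      W',
    'W                            W     11     W',
);

my @out;    # collected output lines, so the result is easy to inspect

# Hypothetical body for OutputCharacter: the first line carries name,
# count and length; every further id gets a line of its own.
sub OutputCharacter {
    my ($char, $count, $length, @id) = @_;
    return unless defined $char;    # nothing collected yet
    push @out, sprintf('%-29s%-6s%-7s%s',
        $char, $count // '', $length // '', shift(@id) // '');
    push @out, sprintf('%-42s%s', '', $_) for @id;
}

my ($char, $count, $length, @id);
for my $line (@lines) {
    my @array = split /\s{2,}/, $line;
    if ($array[0] ne 'W') {         # a new character starts here
        OutputCharacter($char, $count, $length, @id);
        ($char, $count, $length, @id) = ($array[0]);
    }
    if ($array[1] ne 'W') { $count  = $array[1] }
    if ($array[2] ne 'W') { $length = $array[2] }
    if ($array[3] ne 'W') { push @id, $array[3] }
}
OutputCharacter($char, $count, $length, @id);    # flush the last character

print "$_\n" for @out;
```

The string comparisons use 'ne' throughout, as agreed above. The final call after the loop is the "don't forget the last character" step.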
Re: Remove unwanted chars and maintain table integrity by wfsp (Abbot) on May 19, 2010 at 12:21 UTC
Apart from the headers, are those fixed-width fields? If they are, unpack can be pressed into service to break the records into their respective fields.

This uses an array of hashes and, for the pro id fields, an AoHoA. I've used printf to reassemble the data. You may have to adjust the field widths to suit.

The debugging prints are left in if you want to see what is going on. You might want to consider what error checking you need.

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

my $headers = <DATA>;

my (%field, @db);
while (my $record = <DATA>){
  chomp $record;
  my ($character, $count, $length, $pro_id) = unpack(q{A29A6A7A*}, $record);
  #print qq{*$_* } for $character, $count, $length, $pro_id;
  #print qq{\n};

  if ($character ne q{W}){
    push @db, {%field} if exists $field{character};
    %field = ();
    $field{character} = $character;
  }
  elsif ($count ne q{W}){
    $field{count} = $count;
  }
  elsif ($length ne q{W}){
    $field{length} = $length;
  }
  elsif ($pro_id ne q{W}){
    push @{$field{pro_id}}, $pro_id;
  }
}
push @db, {%field};

#print Dumper \@db;

for my $record (@db){
  printf(
     qq{%-29s%-6s%-7s%s\n},
     $record->{character},
     $record->{count},
     $record->{length},
     $record->{pro_id}[0],
  );
  for (1..$#{$record->{pro_id}}){
    printf(
      qq{%47s\n},
      $record->{pro_id}[$_],
    );
  }
}

# 29 6 7 5
__DATA__
Character                        Count Length Pro_ID
Timothy Watson 12 Medulla    W     W      W
W                            W     W      ID:10
W                            W     W      ID:11
W                            W     W      ID:12
W                            W     W      ID:13
W                            W     W      ID:14
W                            5     W      W
W                            W     16     W
Maya Alabina 5 Exo           W     W      W
W                            W     W      ID:28
W                            W     W      ID:30
W                            1     W      W
W                            W     11     W

Output:

Timothy Watson 12 Medulla    5     16     ID:10
                                          ID:11
                                          ID:12
                                          ID:13
                                          ID:14
Maya Alabina 5 Exo           1     11     ID:28
                                          ID:30
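In case the unpack template is unfamiliar, this small sketch shows what it does to a single record. The record here is built with sprintf purely for illustration; the widths 29, 6 and 7 are the ones used in the post above.

```perl
use strict;
use warnings;

# One record in the thread's fixed-width layout: 29 + 6 + 7 columns,
# then whatever is left for the last field.
my $record = sprintf '%-29s%-6s%-7s%s', 'W', 'W', 'W', 'ID:10';

# 'A' reads space-padded text and strips the trailing padding;
# 'A*' takes the remainder of the string.
my @fields = unpack 'A29 A6 A7 A*', $record;

print join('|', @fields), "\n";    # W|W|W|ID:10
```

Whitespace inside the template is ignored, so 'A29 A6 A7 A*' and 'A29A6A7A*' are equivalent.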
Re: Remove unwanted chars and maintain table integrity by choroba (Cardinal) on May 19, 2010 at 10:08 UTC
First load the file into an array, detecting the longest string in each column. Then use printf to print the output, setting each column's width to the detected maximum.
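A rough sketch of that two-pass idea, assuming the rows have already been parsed into fields. The data and variable names are illustrative, not taken from the original poster's file.

```perl
use strict;
use warnings;

# Rows already split into fields (illustrative data).
my @rows = (
    [ 'Timothy Watson 12 Medulla', 5, 16, 'ID:10' ],
    [ 'Maya Alabina 5 Exo',        1, 11, 'ID:28' ],
);

# Pass 1: find the longest string in each column.
my @width;
for my $row (@rows) {
    for my $i (0 .. $#$row) {
        my $len = length $row->[$i];
        $width[$i] = $len if !defined $width[$i] || $len > $width[$i];
    }
}

# Pass 2: build a printf format from the widths,
# with one explicit space between columns.
my $format = join(' ', map { "%-${_}s" } @width) . "\n";
printf($format, @$_) for @rows;
```

The format string is generated once and reused for every row, so the columns line up regardless of how long the longest name turns out to be.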
Re: Remove unwanted chars and maintain table integrity by biohisham (Priest) on May 19, 2010 at 13:47 UTC
I think another Monk can take it from here in a TIMTOWTDI. I was trying my luck (and depleted my resources too...). My code still needs improvement and doesn't exactly achieve the goal, but I thought I would post it for other Monks to look at and provide guidance. It only removes the Ws from the data, but with untamed indiscrimination ('Watson' becomes 'atson')...

Interesting problem indeed...

use strict;

my @array;
while(@array = <DATA>){
    chomp;
    for(my $i =0;$i<=scalar(@array);$i++){
        my @data = split(//,$array[$i]);
        foreach my $element (@data){
            next if $element eq 'W' || !$element ;
            print $element;
        }
    };

    #  print "@data";

}

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
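One possible way around the 'Watson' becomes 'atson' problem is to remove a W only when it stands alone as a field. The sketch below is an assumption about what was intended, not a fix from the thread; it blanks out placeholder Ws with a lookaround-guarded substitution and does nothing about merging the rows.

```perl
use strict;
use warnings;

my $line = 'Timothy Watson 12 Medulla    W     W      W';

# Replace a W only when it has no non-space neighbour on either side.
# Substituting a space (rather than nothing) keeps the column positions.
(my $clean = $line) =~ s/(?<!\S)W(?!\S)/ /g;

print "$clean\n";
```

The W in 'Watson' is followed by a non-space character, so the lookahead rejects it and the name survives intact.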
Re: Remove unwanted chars and maintain table integrity by doug (Pilgrim) on May 19, 2010 at 16:34 UTC
You are going to have to parse the data and put them into columns. This basic idea has been described above by other posters. One thing you might want to consider is to avoid the printf() call by using a format. Read perlform if this doesn't mean anything to you. Formats were a big deal in the early days of Perl, and are fairly powerful when dealing with columnar data. They've fallen out of vogue in the past decade, but might be the right tool for this particular problem.

- doug
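For anyone who has not met formats before, a minimal sketch of the idea follows. The field widths and the sample rows are assumptions chosen to resemble the thread's table; see perlform for the full picture-line syntax.

```perl
use strict;
use warnings;

# Package variables: the format refers to them by name and reads
# their current values each time write() is called.
our ($name, $count, $length, $id);

format STDOUT =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<< @<<<<< @<<<<<
$name,                        $count, $length, $id
.

my @rows = (
    [ 'Timothy Watson 12 Medulla', 5,  16, 'ID:10' ],
    [ '',                          '', '', 'ID:11' ],
    [ 'Maya Alabina 5 Exo',        1,  11, 'ID:28' ],
);

for my $row (@rows) {
    ($name, $count, $length, $id) = @$row;
    write;    # emits one line laid out by the STDOUT format
}
```

Each '@<<<' field is left-justified and as wide as the picture itself, so the layout is visible directly in the source.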
Re: Remove unwanted chars and maintain table integrity by Marshall (Canon) on May 21, 2010 at 09:08 UTC
I took a stab at this (below). You actually have a "pretty well behaved" input format. The basic job of the main loop is to assemble a "%record", which is all the stuff that relates to the last line seen that contained some "name" info.

Many of these types of parsing problems have issues with the first record, the last record or both, i.e. these situations are often handled just slightly differently than all the stuff in between. Below, I try to print if I see a line with the name info. That will happen for the first real "data" line, but the print routine won't actually do anything because it will figure out that there is no "real" data there yet!

I use a regex to parse the input line and I go straight into application-specific $variables without using any $1,$2,$3,$4 stuff. Those numbers don't have any application-specific meaning and are just "clutter". I used a special switch on the regex (/x) so that I could line up the variable names with what is being captured in the regex by adding spaces. The next lines are very regular in appearance and function: if the $var isn't a "W", then something is done with it. The ids are an array. Use split when the separator is very regular. Think regex when this simple idea doesn't work.

The output subroutine uses a "formatted print". This is ancient stuff (predates even 'C') and Perl supports this functionality. You can specify whether things are left or right justified and how wide the field is. For most report generation applications, it is not necessary to "find the longest line" and then adjust things based upon that. In fact that is often the wrong thing to do! I advocate a nice solution for the 99% case and let the other 1% go into some "unaligned, wacko looking case". If you get too much space in between the columns, this degrades the ability to read the report easily, so go for "99% always looks nice" over "sometimes 100% hard to read"! You can read the Perl docs for how to use printf and adjust spacings accordingly. Always put at least one explicit space between fields (so that 2 fields don't ever "run together")!

Note that I call output again after the main loop to take care of the "last record" special case. Hope this additional explanation helps you. You said that you were new, and that often triggers me to at least try to explain more.

Have fun!

#!/usr/bin/perl -w
use strict;

<DATA>; #throws away first line, no need for an lvalue

my %record =();

while (<DATA>)
{
   next if /^\s*$/;                 # skip blank lines
   output_record() if (!/^W\s/);   # just an "attempt to print"

   my (   $name,      $count,     $length,       $id) =
      (m/^(.*?)\s{2,} (\S+)  \s+  (\S+)    \s+   (\S+)/x);

   $record{'name'}  = $name     if $name   !~ /^W\s*$/;
   $record{'count'} = $count    if $count  !~ /^W\s*$/;
   $record{'length'}= $length   if $length !~ /^W\s*$/;
   push (@{$record{'id'}},$id)  if $id     !~ /^W\s*$/;
}

output_record();

sub output_record
{
   if (!exists($record{'name'})) { return }

   printf "%-30s  %-3s  %-3s  %s\n", $record{'name'},
                                     $record{'count'},
                                     $record{'length'},
                                     shift @{$record{'id'}};

   foreach my $id ( @{$record{'id'}} )
   {
      printf "%47s\n", $id;
   }

   print "\n";     #blank as spacer before next record

   %record=();     #record dumped, so delete it!
   return;
}

=CODE PRINTS:
Timothy Watson 12 Medulla       5    16   ID:10
                                          ID:11
                                          ID:12
                                          ID:13
                                          ID:14

Maya Alabina 5 Exo              1    11   ID:28
                                          ID:30
=cut


__DATA__
Character                        Count Length Pro_ID
Timothy Watson 12 Medulla    W     W      W
W                            W     W      ID:10
W                            W     W      ID:11
W                            W     W      ID:12
W                            W     W      ID:13
W                            W     W      ID:14
W                            5     W      W
W                            W     16     W
Maya Alabina 5 Exo           W     W      W
W                            W     W      ID:28
W                            W     W      ID:30
W                            1     W      W
W                            W     11     W
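To isolate the regex technique described above (captures assigned straight to named variables, with /x used for layout), here is a small standalone sketch on a single line of the thread's data. It is a reduced illustration, not a replacement for the full program.

```perl
use strict;
use warnings;

my $line = 'W                            W     W      ID:10';

# In list context a match returns its captures, so they can be named
# directly. With /x, whitespace inside the pattern is ignored and only
# serves to line things up.
my ( $name,    $count,     $length,     $id ) =
   $line =~ m/^ (.*?) \s{2,} (\S+) \s+ (\S+) \s+ (\S+) /x;

print "id=$id\n" if defined $id && $id ne 'W';    # id=ID:10
```

If the line does not match, the list is empty and all four variables stay undefined, which is why the example checks defined before using $id.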
