Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks.
I have a task to do and I was wondering if this can be done or not.
Suppose you have 50 file all named RESULT_FILE.1, RESULT_FILE.2, RESULT_FILE.3 ... until RESULT_FILE_50.
In each file you have 4 names with 4 values:
Kostas [NUMBER1] Maria [NUMBER2] Georgia [NUMBER3] Nikos [NUMBER4]
What I wanted to do is to construct a tab-separated file which would look like:
NAME RESULTS_FILE1 RESULTS_FILE2 RESULTS_FILE3 Kostas [NUMBER1] [NUMBER1] [NUMBER1] Maria [NUMBER2] [NUMBER2] [NUMBER2] Georgia [NUMBER3] [NUMBER3] [NUMBER3] Nikos [NUMBER4] [NUMBER4] [NUMBER4]
What I thought unfortunately didn't work out. I used find as Unix command to find all RESULTS_FILE, then I isolated the number after each RESULTS_FILE (i.e. 1 for RESULTS_FILE1, 2 for RESULTS_FILE2 etc) and created small sub-files for each RESULTS_FILE where I printed the numbers for each name and before each number I printed as many TABS as the number of the RESULTS_FILE i, so as, when I concatanate all these subfiles, I would end up with a large one and the numbers for each person would be printed out in the correct column ( I want to paste the data in a spreadsheet)... But, when I contatanated all these small sub-files, I realised that the contents of each sub-file where pasted UNDER the contents of the other (and not side-by-side as I hoped...
Any ideas on how to do it correctly???

Replies are listed 'Best First'.
Re: Is this possible
by pc88mxer (Vicar) on Mar 24, 2008 at 02:12 UTC
    Untested, but this should work:
    my @files = @ARGV; my %table; while (<>) { my ($name, $value) = split(' ', $_); $table{$name}->{$ARGV} = $value; } print join("\t", "NAME", @files), "\n"; for my $n (keys %table) { print $n; for my $f (@files) { print "\t", $table{$name}->{$f}; } print "\n"; }
    Then invoke it with:
    perl script-name RESULTS_FILE*
      It seems to work but it gets a bit messy with the filenames (it prints one over the other, not correctly separated with tabs)...
      The numbers seem to print ok though...
        Are you sure you have this line entered correctly?
        print join("\t", "NAME", @files), "\n";
        That's the code responsible for printing the header line, so that's where I'd look. This will print tab characters in between each file name. What you are seeing might be an artifact of your text editor or the way your are viewing the file.
Re: Is this possible
by davido (Cardinal) on Mar 24, 2008 at 03:20 UTC

    There's not really any "correct" way. ...or maybe it's more accurate to say, there's no wrong way as long as it works and doesn't gobble all your memory in the process. ;)

    Here is yet another way to do it:

    use strict; use warnings; my @files = @ARGV; my %table; foreach my $filename (@files) { open my $infile, '<', $filename or die $!; while( <$infile> ) { chomp; next unless $_; my( $name, $val ) = split; push( @{$table{ $name }}, $val ); } } my @names = sort keys %table; my $output = join "\t", "NAME\t", @files; foreach my $entry ( @names ) { $output .= join "\t\t", "\n$entry", @{$table{$entry}}; } $output .= map { local $" = "\t\t"; "@{$table{$_}}\n"; } @names; print $output, "\n";

    Usage: perl scriptname filename1 filename2

    It doesn't bother to check to ensure that all files contain the proper number of entries. That could cause your columns to become misaligned, if for example, the second file doesn't contain an entry for Georgia.


    Dave

Re: Is this possible
by NetWallah (Canon) on Mar 24, 2008 at 03:22 UTC
    MOre untested:
    use strict; my %info; for (1..50){ process_file($_); } print "NAME\t"; print "$_\t" for (1..50); print "\n"; for my $k (sort keys %info){ print "$k\t"; print "$info{$k}->[$_]\t" for (1..50); print "\n"; } sub process_file{ my $filenbr = shift; open my $f, "<" , "RESULT_FILE$filenbr" or die "Cannot read file $_\ +n$!"; while(<$f>){ my ($name,$val) = split; $info{$name}->[$filenbr] = $val; } close $f; }

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

Re: Is this possible
by jwkrahn (Abbot) on Mar 24, 2008 at 06:28 UTC

    The main problem with the first two solutions presented is that they rely on the glob sort order so that 'RESULTS_FILE2' will be between 'RESULTS_FILE19' and 'RESULTS_FILE20', etc.

    This will glob the files in the correct order:

    #!/usr/bin/perl use warnings; use strict; @ARGV = map glob( 'RESULTS_FILE' . ( '[0-9]' x $_ ) ), 1 .. 4; my @headers = ( 'NAME', @ARGV ); my %data; while ( <> ) { my ( $name, $value ) = split; push @{ $data{ $name } }, $value; } print join( "\t", @headers ), "\n"; for my $name ( keys %data ) { print join( "\t", $name, @{ $data{ $name } } ), "\n"; }
      Good point - depending on the OP's needs I can see where this would be useful.

      One issue I have with the solutions which use push is that if a file doesn't contain a name (or contains a name not found in other files), then the outputted data is not going to be lined up correctly. When implemented with a hash, the script can be used on non-uniform data files, and it makes it easy to spot data files which don't have the same names as the others (i.e. missing data.)

Re: Is this possible
by BrowserUk (Patriarch) on Mar 24, 2008 at 08:20 UTC
      You don't need a seperate -e for the BEGIN (saves 5 chars)
Re: Is this possible
by locked_user sundialsvc4 (Abbot) on Mar 24, 2008 at 15:54 UTC

    For the record, if you do have enormous amounts of information that you need to, say, “cross-tabulate” in this way, you can accomplish the task for arbitrary amounts of data by:

    1. Dump the data into a temporary flat-file, as tuples like {'Kostas', 'RESULTS_FILE2', 3.1415926}.
    2. Sort the file, using an external disk sort such as the venerable sort command or an equivalent CPAN module. Sort it by two keys: first by the 'Kostas' field and then (i.e. within 'Kostas') by 'RESULTS_FILE2.'
    3. Now process the sorted file. Everything for 'Kostas' will be together. Everything for Kostas' columns will be already-arranged in ascending column order. Anything that's not in the right place does not exist, period. Virtually no memory is required.

    Sorting is an “unexpectedly fast” algorithm that, for large amounts of data, consistently beats-the-pants off of any random-access algorithm. When you saw all those tapes spinning back and forth (actually, if you watched very closely, the tapes never changed direction...) that's what the computers were doing. Back before computers existed, that's also how it was done ... with punched cards. The same principles still apply. They still work like nothing else does.

Re: Is this possible
by radiantmatrix (Parson) on Mar 24, 2008 at 15:11 UTC

    Well, you can get a lot of this done using File::Find and Text::CSV_XS. Some example code:

    use strict; use warnings; use Text::CSV_XS; use IO::File; use File::Find; my (%data, @file); my $in_csv = Text::CSV_XS->new({ sep_char => " " }); #space-sep in my $out_csv = Text::CSV_XS->new({ sep_char => "\x09" }); #tab-sep out my $out = IO::File->new('report.tab','>') or die "Can't open report.tab for writing: $!\n"; find(\&parse_file, '.'); #find files, running parse_file() for each $out_csv->print($out, [ 'Name', @file ]); ## write out report foreach my $name (sort keys %data) { my @row = ($name); foreach (@file) { push @row, $data{$name}{$_}; } $out_csv->print($out, \@row); } sub parse_file { # only deal with *files* that match naming convention return unless -f $_ && m/RESULTS_FILE\.(\d+)/; my $file_number = $1; # captured number from match my $filename = $_; push @file, $filename; my $io = IO::File->new($filename,'<') or do { warn "Can't open $filename: $!\n"; return; }; while (my $row = $in_csv->getline($io)) { # e.g. first row in your example results in: # $data{Kostas}{RESULT_FILE.1} = Number1 $data{ $row->[0] }{ $filename } = $row->[1]; } }

    I think you'll find that very fast and maintainable.

    <radiant.matrix>
    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet