in reply to Binary file handling

The simple combination of File::ReadBackwards, split, join and reverse does it off disk, fast and efficiently. The split assumes whitespace-separated data, but you could modify this if required. HTH

C:\>type transpose.pl
#!/usr/bin/perl -w
use strict;
use File::ReadBackwards;

transpose( "c:/data.txt", "c:/data-transpose.txt" );

sub transpose {
    my ( $infile, $outfile ) = @_;
    tie *BW, 'File::ReadBackwards', $infile
        or die "Can't read $infile $!\n";
    open OUT, ">$outfile" or die "Can't write $outfile $!\n";
    # read the lines last-to-first and reverse the fields within each line
    while( <BW> ) {
        chomp;
        # parenthesise the join so "\n" is not joined in with a tab
        print OUT join( "\t", reverse(split ' ') ), "\n";
    }
    close BW;
    close OUT;
}

C:\>type data.txt
11 12 13
21 22 23
31 32 33

C:\>transpose.pl

C:\>type data-transpose.txt
33 32 31
23 22 21
13 12 11

C:\>

cheers

tachyon

Re:^2 Binary file handling
by Hena (Friar) on Mar 18, 2004 at 14:15 UTC
    Hum. The transposed matrix (unless I'm doing something wrong) in this case should be:
    11	21	31
    12	22	32
    13	23	33
    

      Is this what you want? It flips on the diagonal. The main requirement is that you have as much free disk space for the temp files as the total file size. You will be limited in the number of columns you can transpose by the number of open file descriptors your Perl will let you have. It is very easy to hack the logic to do N columns per pass at the expense of 1 full read of the input file per extra pass. Alternatively you could use a DBM or tie a hash to a file and use the keys as pseudo file handles and just append data to the values. Although there is more I/O with a multipass approach, it is very vanilla I/O, which perl does really fast.

      Update

      See this article for info on how to up the number of available file descriptors (probably 1024/process) on a Linux based system. No idea how it is dealt with on other systems.

      It should be really fast as we make a single pass through the input data and then effectively just write it out (each temp file has one full line in it).

      C:\>type transpose.pl
      #!/usr/bin/perl -w
      use strict;

      transpose90( "c:/data.txt", "c:/data-transpose.txt" );

      sub transpose90 {
          my ( $infile, $outfile, $tmp ) = @_;
          $tmp ||= 'c:/tmp/temp';
          open IN, $infile or die "Can't read $infile $!\n";
          # find number of columns and open a temp file for each
          local $_ = <IN>;
          chomp;
          my @data = split ' ';
          my $num_cols = $#data;
          my @fhs;
          for( 0..$num_cols ) {
              open $fhs[$_], ">$tmp$_.txt" or die "Can't create temp file $tmp$_ $!\n";
              print {$fhs[$_]} $data[$_], "\t";
          }
          while( <IN> ) {
              chomp;
              @data = split ' ';
              print {$fhs[$_]} $data[$_], "\t" for 0..$num_cols;
          }
          close IN;
          open OUT, ">$outfile" or die "Can't write $outfile $!\n";
          for ( 0.. $num_cols ) {
              close $fhs[$_]; # close the temp file
              open IN, "$tmp$_.txt" or die "Can't read temp file $tmp$_ $!\n";
              print OUT scalar(<IN>), "\n";
              close IN;
              unlink "$tmp$_.txt";
          }
          close OUT;
      }

      C:\>type data.txt
      11 12 13
      21 22 23
      31 32 33

      C:\>transpose.pl

      C:\>type data-transpose.txt
      11 21 31
      12 22 32
      13 23 33

      C:\>
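
The "N columns per pass" variant mentioned above could look roughly like the sketch below. It is untested against large data and makes a few assumptions that are not in the code above: the name transpose_multipass, the $max_open parameter and its default of 500 are made up for illustration, and it uses anonymous temp files (open with an undef filename) instead of the named temp files, so there is nothing to unlink afterwards.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: transpose a whitespace-separated matrix while keeping
# at most $max_open temp files open, re-reading the input once per pass.
sub transpose_multipass {
    my ( $infile, $outfile, $max_open ) = @_;
    $max_open ||= 500;    # assumed default, pick something below your fd limit

    # Count the columns from the first line.
    open( my $probe, '<', $infile ) or die "Can't read $infile: $!\n";
    my $first = <$probe>;
    close $probe;
    die "$infile is empty\n" unless defined $first;
    my @first_row = split ' ', $first;
    my $num_cols  = @first_row;

    open( my $out, '>', $outfile ) or die "Can't write $outfile: $!\n";

    for ( my $start = 0; $start < $num_cols; $start += $max_open ) {
        my $end = $start + $max_open - 1;
        $end = $num_cols - 1 if $end > $num_cols - 1;

        # One anonymous read/write temp file per column in this batch.
        my @fhs;
        for my $i ( 0 .. $end - $start ) {
            open( $fhs[$i], '+>', undef )
                or die "Can't create temp file: $!\n";
        }

        # Full read of the input, keeping only this batch's columns.
        open( my $in, '<', $infile ) or die "Can't read $infile: $!\n";
        my $seen_row = 0;
        while ( my $line = <$in> ) {
            my @data = split ' ', $line;
            next unless @data;    # skip blank lines
            my $sep = $seen_row ? "\t" : '';
            for my $i ( 0 .. $end - $start ) {
                my $value = $data[ $start + $i ];
                $value = '' unless defined $value;    # short row
                print { $fhs[$i] } $sep, $value;
            }
            $seen_row = 1;
        }
        close $in;

        # Each temp file now holds one output row; append them in order.
        for my $fh (@fhs) {
            seek( $fh, 0, 0 ) or die "Can't rewind temp file: $!\n";
            my $row = <$fh>;
            $row = '' unless defined $row;
            print {$out} $row, "\n";
            close $fh;
        }
    }
    close $out or die "Can't finish writing $outfile: $!\n";
    return;
}
```

Called as transpose_multipass( 'data.txt', 'data-transpose.txt', 1000 ), a file with 6000 columns would need six reads of the input, which is the trade-off described above.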

      cheers

      tachyon

        See this article for info on how to up the number of available file descriptors (probably 1024/process) on a Linux based system. No idea how it is dealt with on other systems.

        It mentioned that there is a limit of 1024 in 2.2 kernels, so I'm not worried about that. There are limits in filesystems though, but in reiserfs it's 2^31 per directory and 2^32 per filesystem (can't find info on ext2/3).

        I thought about something like this. I'd have to figure out some limit on opened filehandles (~100k or 1M or something) I guess, just to make sure I don't kill off the filesystem :).
        I actually did something very similar here, and tested it with a 6000*2000 matrix. Here's the code snippet:
        open (INM,"$in_matrix") or die "Unable to open 'in_matrix': $!";
        my @row=split (/\t/,readline(INM));
        my @colfiles=();
        my $max=$#row;
        foreach (0 .. $max) {
          open ($colfiles[$_],">$out_matrix.$_") or die "Failed on opening tempfile $_: $!";
          print {$colfiles[$_]} shift (@row);
        }
        Message was
        Failed on opening tempfile 1019: Too many open files at ../bin/transposematrix.pl line 62, <INM> line 1.
        
        So is that error from perl or from the shell I'm executing the script in? I think it's from perl, since it tells me the line (correctly) from the script. So it seems to me that I need to do this in groups of 1000 files if I do it this way.
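
On the question of where the message comes from: the "Too many open files" part is the operating system's text for the EMFILE error, which perl exposes through $!, while the "at ... line 62, <INM> line 1." part is what die appends. A small sketch to see this, assuming a system where the core Errno module defines EMFILE (the exact wording of the text depends on the platform and locale):

```perl
use strict;
use warnings;
use Errno qw(EMFILE);

# Set errno by hand to the value a failed open() leaves behind when the
# per-process file descriptor limit has been reached.
$! = EMFILE;

my $os_text   = "$!";                   # the system's text for that errno
my $is_emfile = $!{EMFILE} ? 1 : 0;     # %! lets you test for a specific errno

print "errno text: $os_text\n";
print "is EMFILE:  $is_emfile\n";
```

Testing $!{EMFILE} right after a failed open would be one way to notice the limit and close a batch of column files before carrying on.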

        Another thing which I thought about this morning is combining your original way and the idea of splitting the matrix. Now if I split it into 4 pieces, each piece needs to be transposed 180 degrees, in which case your original code would do the trick. I figure it should work, even if the split isn't exact (e.g. the original matrix has 183 rows and 235 columns).

      Ah, that is a different problem. That function does a 180 degree transposition. You did not specify what you were after so I guessed :-) What is the type of the data? Is it int, long, double, string?

      cheers

      tachyon

        Basically doubles. Although the strings [+-]inf or nan are allowed as well.