in reply to Re:^2 Binary file handling
in thread Binary file handling

Is this what you want? It flips on the diagonal. The main requirement is that you have as much free disk space for the temp files as the total file size. You will be limited in the number of columns you can transpose by the number of open file descriptors your Perl will let you have. It is very easy to hack the logic to do N columns per pass at the expense of one full read of the input file per extra pass. Alternatively you could use DBM or tie a hash to a file, use the keys as pseudo file handles, and just append data to the values. Although there is more I/O with a multipass approach, it is very vanilla I/O, which Perl does really fast.
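The "N columns per pass" variant can be sketched like this. This is an untested sketch, not the code below: it buffers each batch of columns in memory instead of using temp files at all, and it assumes whitespace-separated input like the example data (the batch size and file paths are illustrative only).

#!/usr/bin/perl -w
use strict;

# Re-read the input once per batch of columns and buffer only that
# batch in memory, so no extra file descriptors (or temp files) are
# needed. Cost: one full read of the input per extra pass.
sub transpose_multipass {
    my ( $infile, $outfile, $batch ) = @_;
    $batch ||= 500;    # columns handled per pass

    open OUT, ">$outfile" or die "Can't write $outfile $!\n";
    my $first = 0;
    while ( 1 ) {
        my @rows;      # output rows for this batch of input columns
        open IN, $infile or die "Can't read $infile $!\n";
        while ( <IN> ) {
            chomp;
            my @data = split ' ';
            my $last = $first + $batch - 1;
            $last = $#data if $last > $#data;   # clamp to real width
            push @{ $rows[ $_ - $first ] }, $data[$_] for $first .. $last;
        }
        close IN;
        last unless @rows;    # past the last column: done
        print OUT join( "\t", @$_ ), "\n" for @rows;
        $first += $batch;
    }
    close OUT;
}

transpose_multipass( "c:/data.txt", "c:/data-transpose.txt", 500 );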

Update

See this article for info on how to up the number of available file descriptors (probably 1024/process) on a Linux based system. No idea how it is dealt with on other systems.
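On a typical Linux system you can inspect and (up to the hard limit) raise the per-process descriptor limit from the shell before running the script. The exact limits and whether you need root vary by system; the 4096 below is just an example value:

```shell
# Show the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this shell and its children; it cannot
# exceed the hard limit unless you are root. Adjust 4096 to taste.
ulimit -n 4096 2>/dev/null || echo "hard limit too low, ask root"

# Run the transpose in the same shell so the new limit applies:
# perl transpose.pl
```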

It should be really fast as we make a single pass through the input data and then effectively just write it out (each temp file has one full line in it).

C:\>type transpose.pl
#!/usr/bin/perl -w
use strict;

transpose90( "c:/data.txt", "c:/data-transpose.txt" );

sub transpose90 {
    my ( $infile, $outfile, $tmp ) = @_;
    $tmp ||= 'c:/tmp/temp';
    open IN, $infile or die "Can't read $infile $!\n";
    # find number of columns and open a temp file for each
    local $_ = <IN>;
    chomp;
    my @data = split ' ';
    my $num_cols = $#data;
    my @fhs;
    for ( 0..$num_cols ) {
        open $fhs[$_], ">$tmp$_.txt" or die "Can't create temp file $tmp$_ $!\n";
        print {$fhs[$_]} $data[$_], "\t";
    }
    while ( <IN> ) {
        chomp;
        @data = split ' ';
        print {$fhs[$_]} $data[$_], "\t" for 0..$num_cols;
    }
    close IN;
    open OUT, ">$outfile" or die "Can't write $outfile $!\n";
    for ( 0..$num_cols ) {
        close $fhs[$_];    # close the temp file
        open IN, "$tmp$_.txt" or die "Can't read temp file $tmp$_ $!\n";
        print OUT scalar(<IN>), "\n";
        close IN;
        unlink "$tmp$_.txt";
    }
    close OUT;
}

C:\>type data.txt
11 12 13
21 22 23
31 32 33

C:\>transpose.pl

C:\>type data-transpose.txt
11 21 31
12 22 32
13 23 33

C:\>

cheers

tachyon

Replies are listed 'Best First'.
Re:^4 Binary file handling
by Hena (Friar) on Mar 19, 2004 at 08:59 UTC
    See this article for info on how to up the number of available file descriptors (probably 1024/process) on a Linux based system. No idea how it is dealt with on other systems.

    The article mentions that there is a limit of 1024 in 2.2 kernels, so I'm not worried about that. There are limits in filesystems too, but in ReiserFS it's 2^31 files per directory and 2^32 per filesystem (I can't find the info for ext2/3).

    I thought about something like this. I'll have to figure out some limit on open filehandles (~100k or 1M or something), I guess, just to make sure I don't kill the filesystem :).
Re:^4 Binary file handling
by Hena (Friar) on Mar 19, 2004 at 11:40 UTC
    I actually did something very similar here, and tested it with a 6000x2000 matrix. Here's the code snippet:
    open (INM,"$in_matrix") or die "Unable to open 'in_matrix': $!";
    my @row=split (/\t/,readline(INM));
    my @colfiles=();
    my $max=$#row;
    foreach (0 .. $max) {
        open ($colfiles[$_],">$out_matrix.$_")
            or die "Failed on opening tempfile $_: $!";
        print {$colfiles[$_]} shift (@row);
    }
    Message was
    Failed on opening tempfile 1019: Too many open files at ../bin/transposematrix.pl line 62, <INM> line 1.
    
    So is that error from Perl, or from the shell I'm executing the script in? I think it's from Perl, since it reports the line (correctly) from the script. So it seems I need to do this in groups of 1000 files if I do it this way.

    Another thing I thought about this morning is combining your original approach with the idea of splitting the matrix. If I split it into 4 pieces, each piece needs to be transposed 180 degrees, in which case your original code would do the trick. I figure it should work even if the split isn't exact (e.g. the original matrix has 183 rows and 235 columns).

      It is Perl reporting an OS message. Consider open F, $foo or die $!: say you get "permission denied" — that is an OS message delivered via Perl.

      You don't have to use real file descriptors. You could simply open a DBM database or a tied hash, pretend that the keys are the file descriptors and just append to the values. It actually simplifies the code, but it will be a big speed hit.

      Here is one of the examples from Re: Binary file handling converted to use a hash tied to a file.

      sub rotate_minus90 {
          my ( $infile, $outfile, $tmp ) = @_;
          $tmp ||= 'c:/tmp/temp';
          open IN, $infile or die "Can't read $infile $!\n";
          # find the number of columns
          chomp( local $_ = <IN> );
          my @data = split ' ';
          my $num_cols = $#data;
          # one DBM key per column stands in for a temp file handle
          dbmopen( my %fhs, $tmp, 0666 ) or die "dbmopen can't grok $tmp $!\n";
          $fhs{$_} = "$data[$_]\t" for 0..$num_cols;
          while ( <IN> ) {
              chomp;
              @data = split ' ';
              $fhs{$_} .= "$data[$_]\t" for 0..$num_cols;
          }
          close IN;
          open OUT, ">$outfile" or die "Can't write $outfile $!\n";
          for ( reverse 0..$num_cols ) {
              print OUT $fhs{$_}, "\n";
          }
          dbmclose( %fhs );
          close OUT;
      }

      cheers

      tachyon