Reading records of various lengths

CalebH has asked for the wisdom of the Perl Monks concerning the following question:

I have recently been trying to put the finishing touches on a script and am having a huge problem.

The purpose of the script is to take a torrent file, parse the file list from it, find files of the same size in a folder, and check the piece hashes against the data to see whether any pieces are missing.

The script works fine with single files, but with multiple files it hits a roadblock: the last piece may not be the same size as the others, and may span multiple files. When this happens, the script computes bad hashes and reports a failed hash check even when the data is actually fine.
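To put numbers on the short last piece, here is a minimal sketch of how its size follows from the total size and the piece length. The helper name expected_piece_size and the file lengths are made up for illustration; they are not from my script.

```perl
use strict;
use warnings;

# Hypothetical helper: size in bytes of piece number $index (0-based),
# given the summed length of all files and the torrent's piece length.
sub expected_piece_size {
    my ($index, $total_bytes, $piece_length) = @_;
    my $start = $index * $piece_length;
    return 0 if $start >= $total_bytes;             # past the end of the data
    my $remaining = $total_bytes - $start;
    return $remaining < $piece_length ? $remaining : $piece_length;
}

# Made-up numbers: three files totalling 1000 bytes, 256-byte pieces.
my @lengths = (300, 500, 200);
my $total   = 0;
$total += $_ for @lengths;

# Pieces 0..2 are full (256 bytes); piece 3 holds the remaining 232 bytes.
printf "piece %d: %d bytes\n", $_, expected_piece_size($_, $total, 256) for 0 .. 3;
```

So the buffer hashed for the last piece should hold only the leftover bytes, not a full piece length.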

Below are two examples of the code I am using (the first is my modified version, and the second is the original).

sub filegrabber {
($filed) = @_;
print Dumper($filed);
print "We Are In FileGrabber Loop now \n\n";
    my $file_index = 0;
    my $this_path  = undef;
    my $cnt_good   = 0;
    my $cnt_bad    = 0;
    my $badfiles   = {};
  my $goodfiles  = {};
    for(my $i=0; $i<length($sha); $i+=SHASIZE) {
        my $this_sha   = substr($sha,$i,SHASIZE );
        my $this_piece = $i / SHASIZE;
        my $need_bytes = $piece_length;
        my $pbuffer    = '';

        
        while($need_bytes > 0) {
            my $src_ref = $files->[$file_index] or last; # got all pieces

        if($src_ref->{path} ne $this_path) { 
                close(FH); 
              $this_path  = $src_ref->{path};


        open(FH, "<", $filed) or die "Couldn't open $filed because $!\n";
            }

            my $buff        = '';
            my $got_bytes   = sysread(FH,$buff,$need_bytes);
               $pbuffer    .= $buff;
               $need_bytes -= $got_bytes;
        print "\n";
        print "Got Bytes: " . $got_bytes . "\n";
        print "Got Bytes Length: " . length($got_bytes) . "\n";
        print "Need Bytes: " . $need_bytes . "\n";
        print "This Sha (size) " . length($this_sha) . "\n";
        print "This Sha (unpack): " . unpack("H*", $this_sha) . "\n";
        print "This Sha (shahex): " . sha1_hex($pbuffer) . "\n";
        print "This Piece: " . $this_piece . "\n";
        print "Piece Size: $piece_length\n";
        $file_index++ if $got_bytes < 1;


  if($got_bytes != $piece_length) {
    print "Pbuffer Size: " . length($pbuffer) . "\n";
    print "Byte Mismatch \nGot: $got_bytes\nP-Len: $piece_length\n";
    next;
    }

        }
    
  

        if( unpack("H*",$this_sha) eq sha1_hex($pbuffer)) {
            $cnt_good++;
      $goodfiles->{$this_path}++;
        }
   
        else {
            $cnt_bad++;
            $badfiles->{$this_path}++;
      print "Bad: $this_piece\n";
      print "Got_bytes = $got_bytes\n";
      print "BadPiece : $this_piece - " . unpack("H*", $this_sha) . "\n";
      print "This Sha : " . length($this_sha) . "\n";
      print "Pbuffer length: " . length($pbuffer) . "\n";
      print "File index is $file_index\n";
      print "BadPiece2 : " . sha1_hex($pbuffer) . "\n";
        }
    
    }
    
    print "\r".(" " x 32 );
    print "\rfound $cnt_bad bad piece(s)\n";
print "\rfound $cnt_good good piece(s)\n";

    foreach my $this_bad (keys(%$badfiles)) {
        printf("%-64s : %d bad bytes (%d pieces)\n", $this_bad,
            get_sized($badfiles->{$this_bad}*$piece_length), $badfiles->{$this_bad});
    }
        
}

Below is the original code, which I modified a bit to get the version above.

sub verify_torrent {
    my($filename, $basepath) = @_;
    
    my $ref   = _slurp($filename);
    my $plen  = $ref->{info}->{'piece length'};
    my $sha   = $ref->{info}->{pieces};
    my $files = [];
    
    if(ref($ref->{info}->{files}) eq 'ARRAY') {
        foreach my $fref (@{$ref->{info}->{files}}) {
            push(@$files, {path=>join("/", @{$fref->{path}}), length=>$fref->{length}});
        }
    }
    else {
        push(@$files, {path=>$ref->{info}->{name}, length=>$ref->{info}->{length}});
    }
    
    
    my $file_index = 0;
    my $this_path  = undef;
    my $cnt_good   = 0;
    my $cnt_bad    = 0;
    my $badfiles   = {};
    
    for(my $i=0; $i<length($sha); $i+=SHASIZE) {
        my $this_sha   = substr($sha,$i,SHASIZE);
        my $this_piece = $i / SHASIZE;
        my $need_bytes = $plen;
        my $pbuffer    = '';
        
        while($need_bytes > 0) {
            my $src_ref = $files->[$file_index] or last; # got all pieces
            if($src_ref->{path} ne $this_path) { # new file -> must update FH
                close(FH);
                $this_path  = $src_ref->{path};
                my $vfs     = join("/",$basepath, $this_path);
                open(FH, "<", join("/",$vfs)) or warn "Could not open: $vfs\n";
            }
            my $buff        = '';
            my $got_bytes   = sysread(FH,$buff,$need_bytes);
               $pbuffer    .= $buff;
               $need_bytes -= $got_bytes;
            $file_index++ if $got_bytes < 1;
        }
        
        if( unpack("H*",$this_sha) eq sha1_hex($pbuffer) ) {
            $cnt_good++;
        }
        else {
            $cnt_bad++;
            $badfiles->{$this_path}++;
        }
        
        print "\rpiece=$this_piece, ok=$cnt_good, bad=$cnt_bad" if $this_piece % 4 == 0;
    }
    
    print "\r".(" " x 32 );
    print "\rfound $cnt_bad bad piece(s)\n";
    foreach my $this_bad (keys(%$badfiles)) {
        printf("%-64s : %d bad bytes (%d pieces)\n", $this_bad, $badfiles->{$this_bad}*$plen, $badfiles->{$this_bad});
    }
    
    
}

In the first example, the hash printed with unpack("H*", $this_sha) (the expected value taken from the torrent) is always what it should be. The hash computed with sha1_hex($pbuffer) is not. All hashes match up until the final piece, which is split across multiple files and is not the same size as $need_bytes. For that piece the two values differ, and the only one with correct data seems to be the unpack one.

Am I missing an easier way to make the sha1_hex value match the unpack one, so that the last piece checks out correctly? I have racked my brain over this, trying various methods with no success. One idea I had was to make the got_bytes loop in my code keep adding data together until the end of the last file, but again, I couldn't figure out how to pull it off.
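Roughly, the idea of adding the data together until the files run out could be sketched like this. It is only a sketch under assumptions: make_piece_reader is a hypothetical name, sha1_hex is assumed to come from Digest::SHA, and the two temp files stand in for the real download.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use File::Temp qw(tempdir);

# Hypothetical helper: returns a closure that yields one piece per call.
# It keeps reading from the current file and moves on to the next file
# whenever one runs out, so the final piece is simply whatever is left.
sub make_piece_reader {
    my ($piece_length, @paths) = @_;
    my $fh;
    return sub {
        my $piece = '';
        while (length($piece) < $piece_length) {
            if (!$fh) {
                my $path = shift @paths;
                last unless defined $path;              # no files left
                open($fh, '<:raw', $path) or die "Couldn't open $path: $!\n";
            }
            my $got = read($fh, my $buf, $piece_length - length($piece));
            die "Read error: $!\n" unless defined $got;
            if ($got == 0) { close($fh); undef $fh; next; }   # EOF: next file
            $piece .= $buf;
        }
        return length($piece) ? $piece : undef;
    };
}

# Stand-in data: a 10-byte file and a 7-byte file, read as 8-byte pieces.
my $dir      = tempdir(CLEANUP => 1);
my @contents = ('a' x 10, 'b' x 7);
my @paths;
for my $i (0 .. $#contents) {
    my $path = "$dir/part$i";
    open(my $out, '>:raw', $path) or die "Couldn't write $path: $!\n";
    print {$out} $contents[$i];
    close($out);
    push @paths, $path;
}

my $next = make_piece_reader(8, @paths);
while (defined(my $piece = $next->())) {
    printf "%2d bytes  %s\n", length($piece), sha1_hex($piece);
}
```

The second piece here straddles both files and the third is a short leftover, which is the case that trips up my script. Each digest would then be compared against unpack("H*", $this_sha) as before.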

Maybe a more seasoned guru can help, since I have an idea in mind and no idea how to implement it.

Thanks in advance, Monks!


Re: Reading records of various lengths by Anonymous Monk on Jun 09, 2016 at 14:03 UTC
I think it would be a good idea to verify the file sizes before doing SHA digests. Truncated file, garbage at the end - either could be the cause of the symptoms you're seeing.
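A minimal sketch of such a size check, assuming the same $files array of {path, length} hashes and the $basepath that verify_torrent builds. The name size_mismatches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compares each file's size on disk with the length
# recorded in the torrent and returns a list of problem descriptions.
sub size_mismatches {
    my ($files, $basepath) = @_;
    my @problems;
    for my $f (@$files) {
        my $full   = join('/', $basepath, $f->{path});
        my $actual = (stat $full)[7];       # size in bytes; undef if stat fails
        if (!defined $actual) {
            push @problems, "$f->{path}: missing or unreadable";
        }
        elsif ($actual != $f->{length}) {
            push @problems,
                "$f->{path}: expected $f->{length} bytes, found $actual";
        }
    }
    return @problems;
}
```

Calling it before the hashing loop, for example print "$_\n" for size_mismatches($files, $basepath), would flag a truncated or padded file up front instead of leaving it to show up as a bad last piece.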