Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Any ideas how to stitch together a text file recovered by photorec/testdisk?

Originally it's a 13MB file; photorec gave me 40 thousand KB-sized text files, 400MB worth.

I have about 5MB of the original file; about 8MB are missing (NULs).

Other similarly sized files that don't need recovery are also referenced, but in itty bitty chunks.

https://notepad-plus-plus.org/community/topic/13302/fix-corrupted-txt-file-null/30 describes a similar situation; one poster says they used "recuva", but it didn't work for me.

I've poked around the files a bit; the data is there, but what goes where? qphotorec.log lists these two files side by side, and both have the target data:

recup_dir.79/f93737696.txt 155170319-155170358
recup_dir.79/f93737736.txt 155170359-155170366

Problem is, the first one is the last 16610 bytes of the original file, but the second one is 134 bytes from about 2MB into the original file (6 lines, 2 of them partial).

The second one seems to be duplicated a bunch of times:

recup_dir.4/f8584568.txt
recup_dir.47/f80423368.txt
recup_dir.5/f13045792.txt
recup_dir.57/f85016144.txt
recup_dir.63/f88799480.txt
recup_dir.68/f89877496.txt
recup_dir.71/f90986416.txt
recup_dir.71/f91006272.txt
recup_dir.79/f93737736.txt

So, something clever with md5sums that doesn't require me to read 13MB worth of file, ugh :)

Thanks

Re: stitch together text file recovered by photorec/testdisk?
by Corion (Patriarch) on Sep 10, 2018 at 13:27 UTC

    Most likely what you have are parts of "older" versions of that file. This isn't bad, as there is likely still a difference to you between an old version and nothing at all.

    My approach to attempt to reconstruct the file would be to do it in several steps:

    1. Read the "restored" file containing the nulls
    2. Read the partial files
    3. Eliminate all partial files that occur completely in the good parts of the restored file
    4. Try to find a partial overlap of the end of a readable part of the restored file with at least one partial file
    5. Repeat with the concatenation of the good file and the partial file until you've exhausted all partial files
    6. If you find multiple partial files that match, flag those for manual user review. Maybe the longest overlap is better, or maybe the shortest overlap is better.

    That should give you one potential version of your file, with fewer missing parts than before.

    You could also try the same with your partial files, and/or try to find the overlaps between different partial files to piece those together.
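
    A rough sketch of step 4 (all names and strings here are illustrative, nothing is tied to the recovered files): given the readable tail of the restored file and one candidate chunk, look for the longest suffix of the good part that is also a prefix of the candidate.

    #!/usr/bin/perl
    # Sketch of step 4: longest suffix of $good that is also a prefix of
    # $partial. Toy data only; cap the scan so huge chunks stay cheap.
    use strict;
    use warnings;

    sub overlap_length {
        my ( $good, $partial, $max ) = @_;
        $max ||= 4096;                      # never scan more than this
        my $limit = length($good) < length($partial)
                  ? length($good) : length($partial);
        $limit = $max if $limit > $max;
        for my $len ( reverse 1 .. $limit ) {
            return $len if substr( $good, -$len ) eq substr( $partial, 0, $len );
        }
        return 0;
    }

    # Toy example: the candidate overlaps the good tail by 7 bytes ("fox jum")
    my $good    = "the quick brown fox jum";
    my $partial = "fox jumps over the lazy dog";
    my $n = overlap_length( $good, $partial );
    print "overlap: $n bytes\n";
    print $good . substr( $partial, $n ), "\n" if $n;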

      This narrows the list to 68.6 MB by eliminating subsequent duplicates:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Path::Tiny qw/ path /;
      use File::Find::Rule qw/ find /;
      use autodie;
      use Digest::MD5 qw( md5_hex );

      my $qphotoreclog = 'qphotorec.log';
      $qphotoreclog = path( $qphotoreclog )->realpath;
      chdir path( $qphotoreclog )->parent;

      # Walk the log, md5 each recovered file, and remember how many times
      # each digest has been seen so far.
      my $log = path( $qphotoreclog )->slurp_raw;
      my @files;
      my %seen;
      while( $log =~ m{^(.*?)[\r\n]*$}mg ){
          my $line = $1;
          next if not $line =~ /recup_dir/;
          my( $filename, $blocks ) = split ' ', $line, 2;
          my $md5 = md5_hex( path( $filename )->slurp_raw );
          push @{ $seen{$md5} }, $filename;
          push @files, [ $filename, $blocks, $md5, int @{ $seen{$md5} } ];
      }
      undef $log;
      # dd( \@files );

      use constant FILENAME => 0;
      use constant SEEN     => 3;

      # Keep only the first occurrence of each digest (SEEN == 1).
      print "Files before ", int @files, "\n";
      @files = map { $_->[FILENAME] } grep { $_->[SEEN()] == 1 } @files;
      print "Files after ", int @files, "\n";
      # dd( \@files );

      path('myfinalrecup')->mkpath;
      for my $filename ( @files ){
          path( $filename )->copy( 'myfinalrecup/' );
      }
      __END__
      Files before 40330
      Files after 6341

      Can't really see a relationship between the blocks and the filenames; probably there isn't one.

      [
        "recup_dir.3/f8580464.txt",
        "70013087-70013102",
        "2615e08f437222995c7aab0569f015f3",
        1,
      ],
      [
        "C:/undelet/testdisk-7.0.win/recup_dir.3/f8580480.txt",
        "70013103-70013110",
        "6fb0dd36db299c9b713d5c622bf5b499",
        1,
      ],
      ...
      [
        "recup_dir.3/f8583480.txt",
        "70016103-70016118",
        "2615e08f437222995c7aab0569f015f3",
        2,
      ],
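
      For what it's worth, a quick way to eyeball whether adjacent block ranges land in related files is to sort the log entries by starting block; a sketch only, with the log name and "filename start-end" line format assumed from the excerpts above:

      #!/usr/bin/perl
      # Sketch: sort qphotorec.log entries by starting block so gaps and
      # adjacencies between ranges are easy to eyeball. The line format
      # ("filename  start-end") is assumed from the excerpts above.
      use strict;
      use warnings;
      use Path::Tiny qw/ path /;

      my @entries;
      for my $line ( path('qphotorec.log')->lines_raw ) {
          while ( $line =~ m{(\S*recup_dir\S*)\s+(\d+)-(\d+)}g ) {
              push @entries, [ $1, $2, $3 ];   # filename, first block, last block
          }
      }

      for my $e ( sort { $a->[1] <=> $b->[1] } @entries ) {
          printf "%12d-%-12d  %s\n", $e->[1], $e->[2], $e->[0];
      }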

      Most likely what you have are parts of "older" versions of that file. This isn't bad, as there is likely still a difference to you between an old version and nothing at all.

      Luckily I included timestamps in the files and they're mostly sequential ... grepping for the last timestamp, it appears I've lost at most 4 hours.
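
      (One way to exploit that ordering, sketch only, is to sort the candidate chunks by the first timestamp each one contains; the timestamp pattern below is a guess, not the real format.)

      #!/usr/bin/perl
      # Sketch: order candidate chunks by the first timestamp they contain.
      # The regex is a placeholder (ISO-ish "YYYY-MM-DD HH:MM:SS"); swap in
      # whatever format the real file uses. Directory name assumed.
      use strict;
      use warnings;
      use Path::Tiny qw/ path /;

      my @ordered;
      for my $file ( glob 'myfinalfiles/*' ) {
          my $text = path( $file )->slurp_raw;
          next unless $text =~ /(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/;
          push @ordered, [ $1, $file ];
      }

      # A plain string sort works because the assumed format is big-endian.
      for my $pair ( sort { $a->[0] cmp $b->[0] } @ordered ) {
          print "$pair->[0]  $pair->[1]\n";
      }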

      This next script eliminates more stuff I know for sure I already have:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Path::Tiny qw/ path /;
      use File::Find::Rule qw/ find /;
      use Data::Dump qw/ dd /;
      use autodie;
      use Digest::MD5 qw( md5_hex );

      my $qphotoreclog = 'qphotorec.log';
      $qphotoreclog = path( $qphotoreclog )->realpath;
      chdir path( $qphotoreclog )->parent;

      # Concatenate every known-good text file still on D:/ ; anything that
      # already appears in here doesn't need recovering.
      my $notit = '';
      for my $unwanted ( find( file => maxdepth => 1, in => 'D:/' ) ){
          next if not -T $unwanted;
          $notit .= path( $unwanted )->slurp_raw;
      }

      # Split the candidates into "already have it" and "maybe it".
      my @files = sort glob 'myfinalfiles/*';
      my @maybeit;
      my @notit;
      for my $file ( @files ){
          my $isit = path( $file )->slurp_raw;
          if( $notit =~ /\Q$isit\E/ ){
              push @notit, $file;
          } else {
              push @maybeit, $file;
          }
      }
      dd( 'notit', @notit );
      dd( 'maybeit', @maybeit );
      dd( 'files', int @files );
      dd( 'notit', int @notit );
      dd( 'maybeit', int @maybeit );

      path('myfinalmaybeit')->mkpath;
      for my $file ( @maybeit ){
          path( $file )->copy( 'myfinalmaybeit/' );
      }
      __END__
      ...
      ("files", 6341)
      ("notit", 4294)
      ("maybeit", 2047)

      18M myfinalmaybeit

      Randomly viewing a few files, I found one that seems to be a mix of wanted and unwanted data, ugh.


        Grepping from the mixed file, I found two more files that match it, and with no mixing.

        myfinalmaybeit\f67507544.txt is contained inside the longer file myfinalmaybeit\f9151816.txt, and neither file has the unwanted mix from myfinalmaybeit\f93596792.txt.

        Not seeing any matches among the files already rejected (from myfinalfiles).

        Also not really seeing how to program my way out of looking at these 2k files; I've got overlapping-logic fatigue, but hooray, only 2k files.
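
        One more pass that might shrink the pile before eyeballing, sketch only with the directory name assumed from above: drop any candidate whose contents appear verbatim inside a longer candidate, the way f67507544.txt sits inside f9151816.txt.

        #!/usr/bin/perl
        # Sketch: drop candidates wholly contained in a longer candidate.
        # O(n^2) over ~2k files, but each comparison is a plain index().
        use strict;
        use warnings;
        use Path::Tiny qw/ path /;

        # Longest first, so a container is always seen before its fragments.
        my @files = sort { -s $b <=> -s $a } glob 'myfinalmaybeit/*';

        my @keep;
        FILE: for my $file ( @files ) {
            my $text = path( $file )->slurp_raw;
            for my $kept ( @keep ) {
                next FILE if index( $kept->[1], $text ) >= 0;   # contained, skip
            }
            push @keep, [ $file, $text ];
        }

        print "kept ", scalar @keep, " of ", scalar @files, " files\n";
        print "$_->[0]\n" for @keep;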