knight.of.ni has asked for the wisdom of the Perl Monks concerning the following question:

Hello my friends,

I'd like to extract a PNG file out of a huge file such as a memory dump. In this example the PNG file sits at the very beginning, so I just have to search for the end marker (the "IEND" string).

My idea is to read the file char-by-char, then read 4 bytes ahead and check these 4 bytes for the matching "IEND"-string, then put the pointer back to the old position and write the char into the new file.

Unfortunately it only works when the matching string is located not far from the beginning of the file. My test PNG is 5.5 kB, and the code exits after 193 bytes, in the middle of nowhere.

What am I doing wrong?

#!/usr/bin/env perl
use strict;
use warnings;

my $dateiread  = "/home/ni/OUTPUT/test.png";
my $dateiwrite = "/home/ni/OUTPUT/test2.png";
my $buffer;
my $zeichen;
my $bytes = 4;

open (my $handle1, "<", $dateiread);
open (my $handle2, ">", $dateiwrite);

while (my $zeichen = getc($handle1)) {
    seek($handle1, -1, 1);
    read($handle1, $buffer, $bytes);
    if ($buffer =~ /IEND/g) {
        print $handle2 $buffer;
        exit;
    }
    seek($handle1, -3, 1);
    print $handle2 $zeichen;
}

close ($handle1);
close ($handle2);

Another question:
Is my method suitable for handling large dump files (2 GB, for example), or is there a faster one?

Sincerely,
Ni

Replies are listed 'Best First'.
Re: File extraction 2nd try
by tangent (Parson) on Jan 19, 2016 at 03:33 UTC
    What am I doing wrong?
    while (my $zeichen = getc($handle1))
    This will fail if the character you are reading is "0" or some other "false" value.

    Try:

    while (defined ( my $zeichen = getc($handle1) ) )
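    To see the difference in isolation, here is a small sketch that runs both loop conditions over the same data. The helper name count_chars and the in-memory filehandle are only for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Count how many characters a getc loop consumes before it stops.
# $use_defined selects the defined() test instead of plain truthiness.
sub count_chars {
    my ($data, $use_defined) = @_;

    # Read from an in-memory filehandle so no real file is needed.
    open my $fh, '<', \$data or die "Can't open in-memory file: $!\n";

    my $count = 0;
    if ($use_defined) {
        # Stops only at end of file, where getc returns undef.
        while (defined(my $c = getc $fh)) { $count++ }
    }
    else {
        # Stops at end of file OR at the first character "0" (byte 0x30),
        # because the string "0" is false in Perl.
        while (my $c = getc $fh) { $count++ }
    }

    close $fh;
    return $count;
}

print count_chars("ab0cd", 0), "\n";   # truthiness test: stops at the "0"
print count_chars("ab0cd", 1), "\n";   # defined() test: reads all five characters
```

    With the input "ab0cd" the first call counts 2 characters and the second counts 5. In binary data the byte 0x30 can turn up anywhere, which would explain an early stop in the middle of a file.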
    Also, although 'IEND' does signify the end marker of the PNG file it does so as part of a "chunk", and a chunk always ends with a CRC - from the PNG specs:
    CRC
    A 4-byte CRC (Cyclic Redundancy Check) calculated on the preceding bytes in the chunk, including the chunk type code and chunk data fields, but not including the length field. The CRC is always present, even for chunks containing no data.
    So you will need to read/write a further 4 bytes after 'IEND'. I'm no expert on PNG files and can't say this is the correct way to do it but adding these two lines worked for me:
    if ($buffer =~ /IEND/) {
        print $handle2 $buffer;
        read($handle1, $buffer, $bytes);   ### 1
        print $handle2 $buffer;            ### 2
        exit;
    }
Re: File extraction 2nd try
by AnomalousMonk (Archbishop) on Jan 19, 2016 at 03:11 UTC

    Your "1st" try seems to be discussed here. Do a Super Search on "sliding window". This is the approach you're using in the OPed code, although the size of the window you're using is microscopic; it will work a lot better with a 100 MB - 1GB window, assuming 4GB system RAM.
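    A minimal sketch of a sliding-window search, assuming the handle is positioned at the start of the data and opened in raw mode. The name find_marker and its parameters are made up for this example. The important detail is the overlap: the last length(marker) - 1 bytes of each window are kept, so a marker that straddles two reads is still found.

```perl
use strict;
use warnings;

# Return the offset (relative to where $fh started) of the first occurrence
# of $marker, or -1 if it is not found. Reads $chunk_size bytes at a time.
sub find_marker {
    my ($fh, $marker, $chunk_size) = @_;

    my $overlap = length($marker) - 1;   # bytes to carry over between reads
    my $window  = '';
    my $base    = 0;                     # offset of the first byte in $window

    while (1) {
        my $got = read $fh, my $chunk, $chunk_size;
        die "Read error: $!\n" unless defined $got;
        last if $got == 0;               # end of file

        $window .= $chunk;
        my $pos = index $window, $marker;
        return $base + $pos if $pos >= 0;

        # Drop everything except the tail that could begin a split marker.
        if (length($window) > $overlap) {
            my $drop = length($window) - $overlap;
            substr($window, 0, $drop, '');
            $base += $drop;
        }
    }
    return -1;
}
```

    For a PNG at the start of the dump, the bytes to copy would then run from offset 0 through the returned offset plus 8 (the four bytes of 'IEND' and the four CRC bytes). The chunk size is a tuning knob; even a few megabytes avoids most of the per-byte overhead.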

    The simplest approach is to do a regex search on the entire file slurped into system memory, but this does not scale well for large files, "large" being in the range you mention, although this depends on your hardware; it would be a trivial approach on a system with, say, 8GB of system RAM.
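    As a rough sketch of the slurp approach, assuming the PNG starts at offset 0 and the whole file fits in memory (slurp_raw and png_prefix are hypothetical names for this example):

```perl
use strict;
use warnings;

# Read a whole file into memory as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!\n";
    local $/;                      # undef record separator: read everything
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Return everything from the start of $dump through the first 'IEND'
# plus the 4 CRC bytes that follow it, or undef if there is no match.
sub png_prefix {
    my ($dump) = @_;
    return $dump =~ /\A(.*?IEND.{4})/s ? $1 : undef;
}
```

    Usage would look like my $png = png_prefix(slurp_raw($path)). Note that this stops at the first 'IEND' it sees, which is not guaranteed to be the real end chunk if those four bytes happen to occur inside the image data.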


    Give a man a fish:  <%-{-{-{-<

Re: File extraction 2nd try
by Athanasius (Archbishop) on Jan 19, 2016 at 04:16 UTC

    Hello knight.of.ni,

    tangent has identified the logic error in your code, and AnomalousMonk has outlined a better approach. I just want to point out that if you exit from the while loop you will never reach the close statements after the loop ends. Use last instead.

    A few other points:

    • The first my $zeichen is masked by the declaration of my $zeichen within the while loop condition.
    • The /g modifier on the regex does nothing in this case.
    • In fact, a regex isn’t really needed here at all: if ($buffer eq 'IEND') is clearer and more efficient.
    • You should either test the open and close statements for failure, or use autodie.
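    Putting these points together with the fixes already suggested in this thread, the original loop might look like the sketch below. It keeps the byte-by-byte approach of the original script, so it is still slow on big files. The sub name copy_through_iend is made up for this example, and it takes already-opened handles so the caller can check open and close for failure.

```perl
use strict;
use warnings;

# Copy bytes from $in to $out up to and including the first 'IEND'
# and the 4-byte CRC that follows it. Returns 1 if 'IEND' was found.
sub copy_through_iend {
    my ($in, $out) = @_;
    my $found = 0;

    while (defined(my $zeichen = getc $in)) {    # defined(): "0" is a valid byte
        seek $in, -1, 1 or die "seek failed: $!\n";
        my $got = read $in, my $buffer, 4;       # peek at the next 4 bytes
        die "read failed: $!\n" unless defined $got;

        if ($buffer eq 'IEND') {                 # plain string comparison, no regex
            my $n = read $in, my $crc, 4;        # the chunk ends with a 4-byte CRC
            die "Truncated CRC after IEND\n" unless defined $n && $n == 4;
            print {$out} $buffer, $crc;
            $found = 1;
            last;                                # leave the loop, do not exit
        }

        # Rewind so the next getc reads the byte after $zeichen.
        seek $in, 1 - $got, 1 or die "seek failed: $!\n";
        print {$out} $zeichen;
    }
    return $found;
}
```

    The caller would open both files in raw mode with "or die" (or under autodie), call the sub, and then close both handles, which is now reachable because the loop ends with last.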

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thanks to all of you for your answers. Especially to tangent and Athanasius. You two helped me a lot!
Re: File extraction 2nd try
by GrandFather (Saint) on Jan 19, 2016 at 05:14 UTC

    If the PNG starts at the start of the dump then Re: Weird file extraction problem still applies - parse the PNG. It'll be much faster than chewing your way through the dump a byte at a time and more reliable than searching a large buffer for a string that may exist in multiple places if you are unlucky with the image.

    Sometimes biting the bullet, learning some new stuff and doing the hard yards is what you have to do!

    Premature optimization is the root of all job security
      You are right, chewing bytes is time consuming. But my code is to be applied on files that are larger than my system memory. Maybe reading larger chunks into memory would be faster. I'll test different solutions since that is the best way to get experience.

        Parsing a .PNG file really isn't hard, especially if you don't really care about the content. Here's something to get you started:

        use strict;
        use warnings;

        my $file = '...';

        open my $pngIn, '<:raw', $file or die "Can't open '$file': $!\n";
        read $pngIn, my $header, 8;

        my ($prefix, $png, $crlf, $ctrlZ, $lf) = unpack "aa3a2aa", $header;

        die "Bad header prefix\n"  if ord $prefix != 0x89;
        die "Bad header type\n"    if $png ne 'PNG';
        die "Bad header crlf\n"    if $crlf ne "\r\n";
        die "Bad header Ctrl-Z\n"  if ord $ctrlZ != 0x1a;
        die "Bad header newline\n" if $lf ne "\n";

        while (!eof $pngIn) {
            read $pngIn, my $chunkHeader, 8 or last;

            my ($length, $type) = unpack "Na4", $chunkHeader;

            read $pngIn, my $body, $length + 4 or die "Truncated chunk\n";

            my ($payload, $crc) = unpack "a$length N", $body;

            last if $type eq 'IEND';
        }

        Tested on Windows. Probably OK on a big-endian machine, but those are harder to come by now that Apple have gone Intel. It doesn't do anything with the chunk data other than skipping through the chunk headers. It also doesn't try to be smart about excessively large chunks.
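        Building on the same chunk walk, here is a hedged sketch that copies the PNG to an output handle and refuses chunks above a size limit. The name extract_png and the $max_chunk parameter (with its arbitrary 64 MB default) are assumptions for this example, and it does not verify the CRCs.

```perl
use strict;
use warnings;

my $PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

# Copy one PNG from $in (positioned at its first byte) to $out by walking
# the chunk headers. Returns the number of bytes copied; dies on bad input.
sub extract_png {
    my ($in, $out, $max_chunk) = @_;
    $max_chunk //= 64 * 1024 * 1024;    # arbitrary guard against bogus lengths

    my $got = read $in, my $sig, 8;
    die "Bad PNG signature\n"
        unless defined $got && $got == 8 && $sig eq $PNG_SIGNATURE;
    print {$out} $sig;
    my $total = 8;

    while (1) {
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        $got = read $in, my $header, 8;
        die "Truncated chunk header\n" unless defined $got && $got == 8;

        my ($length, $type) = unpack 'N a4', $header;
        die "Chunk '$type' too large ($length bytes)\n" if $length > $max_chunk;

        $got = read $in, my $body, $length + 4;    # data plus CRC
        die "Truncated chunk '$type'\n"
            unless defined $got && $got == $length + 4;

        print {$out} $header, $body;
        $total += 8 + $length + 4;

        return $total if $type eq 'IEND';          # last chunk of the image
    }
}
```

        Because only one chunk is held in memory at a time, this works on dumps larger than system RAM, and the size limit keeps a corrupt length field from triggering a huge read.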

        Premature optimization is the root of all job security