knight.of.ni has asked for the wisdom of the Perl Monks concerning the following question:

Hello my friends,
please have patience with me. I'm a bloody perl-beginner :)

My problem:
I have a big file and inside there is an html-file hidden.

My quest:
I want to extract only this html-file from the big file. So I guess I have to search the big file for strings like "<html>" and "</html>".

So far:
I did a little test with a png-file. Here is my code:

#!/usr/bin/env perl
use strict;
use warnings;

open(INPUT, "<test_in.png") or die $!;
open(OUTPUT, ">test_out.png") or die $!;
while(<INPUT>) {
    last if /END/;
    print OUTPUT $_;
};
close INPUT;
close OUTPUT;

Weird result:
Inside every png-file there is an "END" string pretty much at the end. But this code doesn't output the data from the beginning up to that word "END". It stops a few bytes before. The same code works with a text-file where I put the word "END" somewhere in between.

I hope you can clear the clouds in my brain...

Sincerely,
Ni

Replies are listed 'Best First'.
Re: Weird file extraction problem
by GrandFather (Saint) on Jan 13, 2016 at 09:13 UTC

    If the file is a PNG image file (see the Wikipedia article on Portable Network Graphics), then a better technique is probably to actually parse the file, find the chunk of interest, and simply extract that chunk. The Wikipedia article gives all the information you need to understand the file format, and a little playing around with read, unpack and seek should give you the main tools you need to write the code.
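
    A minimal sketch of that approach, assuming the standard PNG layout (an 8-byte signature, then chunks made of a 4-byte big-endian length, a 4-byte type, the data and a 4-byte CRC). The name list_png_chunks is made up for illustration, and the CRC is skipped rather than checked:

```perl
use strict;
use warnings;

# Hypothetical helper: walk the chunks of a PNG stream and return
# one [type, data_length, data_offset] triple per chunk.
sub list_png_chunks {
    my ($fh) = @_;
    binmode $fh;    # never let newline translation touch the bytes

    # Every PNG starts with the same 8-byte signature.
    my $got = read($fh, my $sig, 8);
    die "short read on signature" unless defined $got && $got == 8;
    die "not a PNG stream" unless $sig eq "\x89PNG\r\n\x1a\n";

    my @chunks;
    while ((read($fh, my $header, 8) // 0) == 8) {
        # 'N' is a 32-bit big-endian unsigned integer, 'a4' four raw bytes.
        my ($length, $type) = unpack 'N a4', $header;
        push @chunks, [$type, $length, tell($fh)];

        # Skip the chunk data plus the 4-byte CRC (whence 1 = relative seek).
        seek($fh, $length + 4, 1) or die "seek failed: $!";
        last if $type eq 'IEND';
    }
    return @chunks;
}
```

    To try it on a real file, open it with open my $fh, '<:raw', 'test_in.png' and print the type and length of each returned triple. With the data offset and length in hand, a seek followed by a read pulls out exactly the chunk you want.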

    Premature optimization is the root of all job security
Re: Weird file extraction problem
by Laurent_R (Canon) on Jan 13, 2016 at 07:28 UTC
    Hi Ni,

    Try perhaps this:

    print OUTPUT $1 and last if /(.*)END/;
Re: Weird file extraction problem
by Ratazong (Monsignor) on Jan 13, 2016 at 07:20 UTC

    Hello Ni

    Try using a text file containing the line "This is the END", and you will observe that the words "This is the" won't be printed either. So you get the same weird result as with a .png-file.

    The reason is that you process the file line-by-line - and don't print the line containing the word END.
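
    One way to keep the start of that last line is sketched here. The sub name copy_until_marker and its arguments are made up for illustration; it still reads line by line, but prints the part of the matching line that comes before the marker:

```perl
use strict;
use warnings;

# Hypothetical helper: copy everything that precedes the first occurrence
# of $marker from $in to $out, including the start of the marker's own line.
sub copy_until_marker {
    my ($in, $out, $marker) = @_;
    while (my $line = <$in>) {
        my $pos = index($line, $marker);
        if ($pos >= 0) {
            # Print only the bytes in front of the marker, then stop.
            print {$out} substr($line, 0, $pos);
            last;
        }
        print {$out} $line;
    }
    return;
}
```

    For a binary file such as a .png, open both handles with the :raw layer ('<:raw' and '>:raw') so the bytes pass through unchanged.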

    HTH, Rata
Re: Weird file extraction problem
by hotchiwawa (Scribe) on Jan 13, 2016 at 10:23 UTC
    An example of png file ending:
    { IENDŽB`‚
    With null bytes between { and I.
    You should treat your file as binary and compare its raw bytes if you want to check for them.
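
    For the original quest (an html-file hidden inside a big file), a byte-oriented sketch could look like the following. The names slurp_raw and extract_between are made up, and it assumes the file fits in memory and that the markers appear exactly as "<html>" and "</html>" (no attributes, same case):

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                      # undef record separator: read it all
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Hypothetical helper: return the first span that starts with $start and
# ends with $end (both included), or nothing if either marker is missing.
sub extract_between {
    my ($data, $start, $end) = @_;
    my $from = index($data, $start);
    return if $from < 0;
    my $to = index($data, $end, $from + length($start));
    return if $to < 0;
    return substr($data, $from, $to - $from + length($end));
}
```

    Something like my $html = extract_between(slurp_raw('big_file'), '<html>', '</html>') would then give you the embedded document, where 'big_file' is a placeholder name. Write the result through a handle opened with '>:raw'.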