dimar has asked for the wisdom of the Perl Monks concerning the following question:

The Background

MSFT Internet Explorer has a somewhat useful feature called MHT Web Archive. It saves an entire web page, including images, to a single file. An added bonus is that the MHT format apparently uses straightforward, non-proprietary base64 encoding to handle the images.

The Question

This useful feature is apparently no longer supported. Now, with a bunch of MHT files sitting around that are no longer readable, it is time for Perl to step in and convert all those MHT files into something else. The question is, what should that something else be, and do I have to code it myself or has someone else already gone down this road?

Absent any insight to the contrary, it looks like I will have to slap together a Perl script to grab each 'region' from each MHT file and export the regions as HTML, GIF, JPG (etc.) files. This ruins the single-file-encapsulation feature, but at least makes the content readable again.

Any better ideas out there greatly appreciated.

Replies are listed 'Best First'.
Re: MSFT Explorer MHT File Munging
by iburrell (Chaplain) on Jul 23, 2004 at 20:36 UTC
    My impression is that the .mht files are MIME messages of type multipart/related. A MIME parsing module, like MIME::Parser, should work to split them apart. I don't know what content encoding they used; the binary files may not be encoded.
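
    A minimal sketch of that idea, assuming the file really is a well-formed multipart/related message and that MIME::Parser (from the MIME-tools distribution on CPAN, not a core module) is installed. The sub name mht_parts is made up for illustration. The parser undoes whatever Content-Transfer-Encoding each part declares, so it should not matter whether the images turn out to be base64 or unencoded.

```perl
use strict;
use warnings;
use MIME::Parser;    # from MIME-tools on CPAN (not core)

# Hypothetical helper: parse one .mht file and return a list of
# [content_location, mime_type, decoded_bytes] for every leaf part.
# Bodies are kept in memory, which assumes the archives are small.
sub mht_parts {
    my ($mht_path) = @_;

    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);    # no temp files on disk

    my $top = $parser->parse_open($mht_path);

    my @parts;
    for my $entity ($top->parts_DFS) {
        next if $entity->is_multipart;    # containers have no body
        my $body = $entity->bodyhandle or next;

        my $location = $entity->head->get('Content-Location');
        $location = '' unless defined $location;
        $location =~ s/\s+\z//;           # header values keep their newline

        # as_string returns the body with the transfer encoding removed.
        push @parts, [ $location, $entity->mime_type, $body->as_string ];
    }
    return @parts;
}

# When run with file names, list what each archive contains.
for my $file (@ARGV) {
    printf "%-50s %-12s %d bytes\n", $_->[0], $_->[1], length $_->[2]
        for mht_parts($file);
}
```

    Run against a saved page, this should print one line per embedded resource, which is a quick way to check what the archives actually hold before deciding on an output format.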

    The two options I see for storing the files are to put them in a directory, or to put them in an archive of another format; .zip would probably be best. The advantage of a directory is that the files should be viewable as they are. The advantage of a zip archive is that you get one file that normal tools can handle, though browsers can't view it directly.
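
    The directory option could look roughly like the following, using only core modules. It assumes the parts have already been decoded into [file_name, bytes] pairs; that shape and the sub name write_parts are illustrative, not taken from any module.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: write each [file_name, bytes] pair into $dir,
# creating the directory if needed. Returns the paths written.
sub write_parts {
    my ($dir, @parts) = @_;
    make_path($dir);

    my @written;
    for my $part (@parts) {
        my ($name, $bytes) = @$part;
        my $path = File::Spec->catfile($dir, $name);

        open my $out, '>', $path or die "Cannot write $path: $!";
        binmode $out;    # images must not get newline translation
        print {$out} $bytes;
        close $out or die "Cannot close $path: $!";

        push @written, $path;
    }
    return @written;
}
```

    Zipping the resulting directory afterwards with an ordinary tool would give the single-file variant without changing this step.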

Re: MSFT Explorer MHT File Munging
by Eimi Metamorphoumai (Deacon) on Jul 23, 2004 at 17:44 UTC
    I've seen some browsers offer similar features. About the best way, I would think, would be to create a directory and put all of the extracted files into it, rewriting the html so that it points to them (using relative paths). It's not a single file, but it should be accessible in any browser, portable (you can treat the directory as a unit), and if you really need a single file you can tar or zip it up.
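
    A rough sketch of the rewriting step, assuming each extracted part is identified by the absolute URL in its Content-Location header and that the HTML refers to it by exactly that URL in a quoted attribute. The names local_name and rewrite_links are invented for this example, and the regex is only there to keep it short; a real HTML parser would cope better with unquoted or unusual markup.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a safe local file name from a part's URL.
# %$seen keeps two parts from colliding on the same name.
sub local_name {
    my ($location, $seen) = @_;

    # Last path segment, ignoring any query string or fragment.
    my ($name) = $location =~ m{([^/\\?#]+)(?:[?#].*)?\z};
    $name = 'part' unless defined $name && length $name;
    $name =~ s/[^A-Za-z0-9._-]/_/g;

    my ($candidate, $n) = ($name, 1);
    $candidate = $n++ . "_$name" while $seen->{$candidate};
    $seen->{$candidate} = 1;
    return $candidate;
}

# Hypothetical helper: replace every quoted src/href/background value
# found in %$map (original URL => local file name) with the local name.
# Values not in the map are left alone.
sub rewrite_links {
    my ($html, $map) = @_;
    $html =~ s{\b((?:src|href|background)\s*=\s*)(["'])(.*?)\2}
              { my $url = $3;
                $1 . $2 . (exists $map->{$url} ? $map->{$url} : $url) . $2 }gei;
    return $html;
}

# Usage: build the map from the part URLs, then rewrite the page.
my %seen;
my %map = map { $_ => local_name($_, \%seen) }
          ('http://example.com/img/logo.gif');
print rewrite_links(
    '<img src="http://example.com/img/logo.gif">', \%map), "\n";
```

    Because the rewritten page and the extracted files sit side by side, bare file names are enough as relative paths, and the directory can be moved or archived as a unit.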