oakb has asked for the wisdom of the Perl Monks concerning the following question:

I manage many clients' firewalls, and regularly generate pcap packet trace files of traffic passing through these firewalls. I do a lot of searching, matching, and extracting of data from these files, and often use Net::TcpDumpLog to automate the process.

I find myself stymied, however, by HTML that has been optimized for download speed through the use of chunked Transfer-Encoding and gzip Content-Encoding. Since the text HTML has been turned into binary data, I can't automate the parsing process and systematically extract interesting information.

Is there a relatively simple way to decompress and decode this data so that it can be manipulated automatically in my program?

Here is what I have so far:

#!/usr/bin/perl
use strict;
use Net::TcpDumpLog;

my $log = Net::TcpDumpLog->new();
$log->read( "/my/tracefile.pcap" );
my $maxindex = $log->maxindex();
my $gzip = 0;

foreach my $index ( 0..$maxindex ) {
    my ( $length_orig, $length_incl, $drops, $seconds, $milliseconds )
        = $log->header( $index );
    my $data = $log->data( $index );
    if (( $data =~ /Transfer-Encoding: chunked/g ) && ( $data =~ /Content-Encoding: gzip/g )) {
        $gzip++;
        print $index + 1 . "\t$length_orig\t$length_incl\t$seconds\t$milliseconds\n";
        print "\t$data\n\n";
    }
}

print "$gzip chunked-gzip packets.";

Re: Extract chunked/gzip data from pcap file (OT: Regex Usage)
by AnomalousMonk (Archbishop) on Dec 28, 2009 at 15:12 UTC

    Note that in a statement like
        if ($data =~ /some text/g) { ... }
    the /g regex modifier does nothing. (Update: See ikegami's correction.) In boolean context, you only care whether the pattern matched or did not match. Given the regex of the example, you cannot distinguish whether the regex matched once or more than once (although a different regex could determine this).

    See g in the Modifiers section of perlre.

      If it did nothing, it would be harmless. Quite the opposite, it's a bug to use it there.
      >perl -le"for (1..2) { print 'ab' =~ /a/g ? 'match' : 'no match' }"
      match
      no match
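
The reason the second attempt fails: in scalar (boolean) context, a /g match starts searching at pos() of the string, which the previous successful match left just past its own end. A failed match then resets pos(), so the attempt after that starts from the beginning again. Here is a minimal sketch of the difference; the sample string and the variable names are made up for illustration and are not taken from the original program.

```perl
use strict;
use warnings;

# Illustrative header line, not real capture data.
my $data = "Content-Encoding: gzip\r\n";

my ( @with_g, @without_g );

for my $pass ( 1 .. 2 ) {
    # With /g in scalar context the search resumes at pos($data),
    # so the second pass finds no further "gzip" and fails.
    push @with_g, ( $data =~ /gzip/g ? 'match' : 'no match' );
}

for my $pass ( 1 .. 2 ) {
    # Without /g every attempt starts at the beginning of the string.
    push @without_g, ( $data =~ /gzip/ ? 'match' : 'no match' );
}

print "with /g:    @with_g\n";
print "without /g: @without_g\n";
```

Run as written, this prints "match no match" for the /g version and "match match" for the plain version.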

        Thank you for pointing out my error, AnomalousMonk and ikegami. I was misapplying /g because I mistakenly thought it was necessary when matching against multi-line data. I have removed /g from my program.

Re: Extract chunked/gzip data from pcap file
by Anonymous Monk on Dec 28, 2009 at 09:04 UTC

      I have the PerlIO::gzip module installed, and I've tried using it. However, there's more to the puzzle than just using that module.

      When I look at these packets in Wireshark, a packet can be viewed in three different forms: (1) full frame, with the binary data intact and still encoded; (2) "De-chunked entity body", which exhibits the majority of the binary data intact -- but which has removed the "chunked encapsulation" (for lack of a better term); and (3) "Uncompressed entity body", which shows no binary data, just the decoded HTML text.

      This leads me to believe that there is some intermediate step that is required to remove the "chunked encapsulation", before being able to hand the "clean" compressed data to PerlIO::gzip as a stream for decompression.
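
For what it's worth, that intermediate step can be sketched in a few lines. The sketch below assumes you already have the complete HTTP body as one string: TCP segments reassembled in order and the HTTP headers stripped off. That is the hard part, and it is not shown here. dechunk() is a hypothetical helper name, chunk extensions and trailers are simply ignored, and the sketch uses the core IO::Uncompress::Gunzip module rather than PerlIO::gzip. The test body is built in the script itself, not taken from a real capture.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: strip HTTP/1.1 chunked framing from a complete body.
# Each chunk is: hex size, optional ";extension", CRLF, data, CRLF.
sub dechunk {
    my ($body) = @_;
    my $out = '';
    while ( $body =~ /\G([0-9A-Fa-f]+)[^\r\n]*\r\n/gc ) {
        my $size = hex $1;
        last if $size == 0;    # a zero-size chunk ends the body
        my $start = pos $body;
        die "truncated chunk\n" if $start + $size > length $body;
        $out .= substr $body, $start, $size;
        pos($body) = $start + $size;    # skip over the chunk data
        $body =~ /\G\r\n/gc or die "missing CRLF after chunk data\n";
    }
    return $out;
}

# Build a small gzip-compressed, chunked test body.
my $html = "<html><body>Hello, monks!</body></html>";
gzip( \$html => \my $gz ) or die "gzip failed: $GzipError\n";

my $half    = int( length($gz) / 2 );
my $chunked = '';
for my $piece ( substr( $gz, 0, $half ), substr( $gz, $half ) ) {
    $chunked .= sprintf( "%x\r\n", length $piece ) . $piece . "\r\n";
}
$chunked .= "0\r\n\r\n";

# Reverse the two encodings: de-chunk first, then decompress.
my $compressed = dechunk($chunked);
gunzip( \$compressed => \my $decoded ) or die "gunzip failed: $GunzipError\n";
print "$decoded\n";
```

The order matters: the chunk framing wraps the compressed stream, so it has to come off before the gzip data is handed to the decompressor. This matches the two views Wireshark shows.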

        Most likely, you will need to reassemble the TCP frames into the complete TCP message, then parse the HTTP response from that, and then decode the payload of the HTTP response. This is something I've done with Sniffer::HTTP, which gives you an HTTP::Response object for each completed request. You can then call the ->decoded_content method of HTTP::Response to get the uncompressed data out.

        Note that I have at least one report of Sniffer::HTTP having a memory leak, so be sure to test the memory requirements before rolling it out on a large scale. Unfortunately, Net::Pcap doesn't currently build for me on Win32, so I can't conveniently replicate the environment to actually test things.