Extract chunked/gzip data from pcap file

oakb has asked for the wisdom of the Perl Monks concerning the following question:

I manage many clients' firewalls, and regularly generate pcap packet trace files of traffic passing through these firewalls. I do a lot of searching, matching, and extracting of data from these files, and often use Net::TcpDumpLog to automate the process.

I find myself stymied, however, by HTML that has been optimized for download speed through the use of chunked Transfer-Encoding and gzip Content-Encoding. Since the text HTML has been turned into binary data, I can't automate the parsing process and systematically extract interesting information.

Is there a relatively simple way to decompress and decode this data so that it can be manipulated automatically in my program?

Here is what I have so far:

#!/usr/bin/perl


use strict;
use Net::TcpDumpLog;

my $log                        =  Net::TcpDumpLog->new();
                                  $log->read( "/my/tracefile.pcap" );
my $maxindex                   =  $log->maxindex();
my $gzip                       =  0;

foreach my $index ( 0..$maxindex ) {
    my ( $length_orig, $length_incl, $drops, $seconds, $milliseconds )  =  $log->header( $index );
    my $data                   =  $log->data( $index );
    
    if (( $data  =~ /Transfer-Encoding: chunked/g ) && ( $data  =~ /Content-Encoding: gzip/g )) {
        $gzip++;
        print $index + 1 . "\t$length_orig\t$length_incl\t$seconds\t$milliseconds\n";
        print "\t$data\n\n";
    }
}

print "$gzip chunked-gzip packets.";


Replies are listed 'Best First'.
Re: Extract chunked/gzip data from pcap file (OT: Regex Usage) by AnomalousMonk (Archbishop) on Dec 28, 2009 at 15:12 UTC
Note that in a statement like `if ($data =~ /some text/g) { ... }` the `/g` regex modifier ~~does nothing~~ Update: See ikegami's correction. In boolean context, you only care whether the pattern matched or did not match. Given the regex of the example, you cannot distinguish whether the regex matched once or more than once (although a different regex could determine this). See g in the Modifiers section of perlre.
Re^2: Extract chunked/gzip data from pcap file (OT: Regex Usage) by ikegami (Patriarch) on Dec 28, 2009 at 15:27 UTC
If it did nothing, it would be harmless. Quite the opposite, it's a bug to use it there.

>perl -le"for (1..2) { print 'ab' =~ /a/g ? 'match' : 'no match' }"
match
no match
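To connect this to the original program: a scalar-context `/g` match leaves `pos()` set on the target string, and the next `/g` match against that same string starts searching from there, even when the pattern is different. The sketch below is only an illustration; the `$headers` string is made up, and it assumes the `Content-Encoding` header happens to appear before `Transfer-Encoding`.

```perl
use strict;
use warnings;

# Made-up header block; Content-Encoding comes first on purpose.
my $headers = "Content-Encoding: gzip\r\nTransfer-Encoding: chunked\r\n";

# With /g in scalar context, the first match leaves pos($headers) just
# past "chunked", so the second match only searches the text after it.
my $with_g = ( $headers =~ /Transfer-Encoding: chunked/g
            && $headers =~ /Content-Encoding: gzip/g ) ? 1 : 0;

# Clear any leftover match position before the comparison run.
pos($headers) = undef;

# Without /g, every match starts at the beginning of the string.
my $without_g = ( $headers =~ /Transfer-Encoding: chunked/
               && $headers =~ /Content-Encoding: gzip/ ) ? 1 : 0;

print "with /g: $with_g, without /g: $without_g\n";
```

So with `/g` the test in the original loop can silently miss packets depending on header order, which is a different failure from the alternating match/no match shown in the one-liner.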
Re^3: Extract chunked/gzip data from pcap file (OT: Regex Usage) by oakb (Scribe) on Jan 04, 2010 at 16:04 UTC
Thank you for pointing out my error, AnomalousMonk and ikegami. I was misapplying /g because I mistakenly thought it was necessary when matching against multi-line data. I have removed /g from my program.
Re: Extract chunked/gzip data from pcap file by Anonymous Monk on Dec 28, 2009 at 09:04 UTC
use PerlIO::gzip;
Re^2: Extract chunked/gzip data from pcap file by oakb (Scribe) on Dec 28, 2009 at 14:17 UTC
I have the PerlIO::gzip module installed, and I've tried using it. However, there's more to the puzzle than just using that module. When I look at these packets in Wireshark, a packet can be viewed in three different forms: (1) full frame, with the binary data intact and still encoded; (2) "De-chunked entity body", which exhibits the majority of the binary data intact -- but which has removed the "chunked encapsulation" (for lack of a better term); and (3) "Uncompressed entity body", which shows no binary data, just the decoded HTML text. This leads me to believe that some intermediate step is required to remove the "chunked encapsulation" before the "clean" compressed data can be handed to PerlIO::gzip as a stream for decompression.
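That intermediate step could be sketched roughly as follows. This is a minimal sketch, not a tested solution for real captures: the helper names `dechunk` and `decode_chunked_gzip` are hypothetical, it uses the core module IO::Uncompress::Gunzip rather than PerlIO::gzip, it ignores chunk extensions and trailers, and it assumes `$body` already holds the complete HTTP message body (everything after the blank line that ends the headers), which a single packet's data usually will not.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: strip the chunked Transfer-Encoding framing
# from a complete message body and return the raw payload bytes.
sub dechunk {
    my ($body) = @_;
    my $out    = '';
    my $offset = 0;
    while (1) {
        # Each chunk starts with a hex size line terminated by CRLF.
        my $eol = index( $body, "\r\n", $offset );
        die "truncated chunk header\n" if $eol < 0;
        my $line = substr( $body, $offset, $eol - $offset );
        my ($hex) = $line =~ /^\s*([0-9A-Fa-f]+)/
            or die "bad chunk size line\n";
        my $size = hex $hex;
        $offset = $eol + 2;
        last if $size == 0;    # a zero-size chunk ends the body
        die "truncated chunk data\n" if $offset + $size > length $body;
        $out .= substr( $body, $offset, $size );
        $offset += $size + 2;  # skip the data and its trailing CRLF
    }
    return $out;
}

# Hypothetical helper: de-chunk first, then gunzip the result.
sub decode_chunked_gzip {
    my ($body) = @_;
    my $compressed = dechunk($body);
    my $html;
    gunzip( \$compressed => \$html )
        or die "gunzip failed: $GunzipError\n";
    return $html;
}
```

The order mirrors the Wireshark views: the de-chunked entity body is the output of `dechunk`, and the uncompressed entity body is what `gunzip` produces from it.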
Re^3: Extract chunked/gzip data from pcap file by Corion (Patriarch) on Dec 28, 2009 at 14:28 UTC
Most likely, you will need to reassemble the TCP frames into the complete TCP message, then parse the HTTP response from that, and then decode the payload of the HTTP response. This is something I've done with Sniffer::HTTP, which gives you an HTTP::Response object for each completed request. You can then query the `->decoded_content` method of HTTP::Response to get the uncompressed data out. Note that I have at least one report of Sniffer::HTTP having a memory leak, so be sure to test the memory requirements before rolling it out on a large scale. Unfortunately, Net::Pcap doesn't currently build for me on Win32, so I can't conveniently replicate the environment to actually test things.
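The last step, decoding the payload, can be illustrated in isolation. The sketch below does not use Sniffer::HTTP or read a capture; it assumes the HTTP::Response module (from the HTTP::Message distribution) is installed and builds a response by hand from a body gzipped on the spot, standing in for one recovered from a trace. It assumes any chunked framing has already been removed, since `decoded_content` undoes the Content-Encoding, not the Transfer-Encoding.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Stand-in for a body recovered from a capture: gzip some HTML ourselves.
my $html = "<html><body>example</body></html>";
gzip( \$html => \my $gz ) or die "gzip failed: $GzipError\n";

# Build the response object by hand, with the headers a server would send.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html', 'Content-Encoding' => 'gzip' ],
    $gz,
);

# charset => 'none' skips character-set decoding and returns the raw
# uncompressed bytes; decoded_content returns undef if decoding fails.
my $decoded = $res->decoded_content( charset => 'none' );
die "could not decode body\n" unless defined $decoded;
print $decoded, "\n";
```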
Re^4: Extract chunked/gzip data from pcap file by oakb (Scribe) on Jan 04, 2010 at 16:19 UTC
Re^5: Extract chunked/gzip data from pcap file by Corion (Patriarch) on Jan 04, 2010 at 16:22 UTC