in reply to Testing Page Size with HTML 4/CSS

Everything is possible. Never give up. That said, this isn't that easy: you might well think that the following is too convoluted to bother with, and that using some external agent (such as Internet Explorer driven by Perl) to do the task would be easier. Anyway, if you're doing it without external aid, there are two distinct steps to the task:

  1. Parse all the CSS that pertains to the page, following <link>ed stylesheets, @import rules etc, as you've described, to find all the rules that refer to a background property
  2. Find which of these rules have a selector which addresses a part of the HTML document to which the CSS is being applied
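
As a rough illustration of the "following @import rules" part of step 1, here is a minimal sketch in plain Perl. The helper name find_imports is made up for this example, and the regex only covers the common @import forms (quoted string or url(...)); it is not a full CSS parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return the stylesheet
# URLs named by @import rules in a chunk of CSS text.
sub find_imports {
    my ($css_text) = @_;
    my @urls;

    # Drop comments first so commented-out imports are not followed.
    ( my $text = $css_text ) =~ s{/\*.*?\*/}{}gs;

    while (
        $text =~ m{
            \@import \s+
            (?: url\( \s* )?     # optional url wrapper
            (["']?)              # optional opening quote
            ([^"'()\s;]+)        # the URL itself
            \1                   # matching closing quote, if any
        }gix
      )
    {
        push @urls, $2;
    }
    return @urls;
}

my $sheet = <<'CSS';
@import "base.css";
@import url(/styles/print.css) print;
/* @import "ignored.css"; */
p { color: black }
CSS

print "$_\n" for find_imports($sheet);
```

Each URL found this way would still have to be resolved against the URL of the stylesheet that contains it, fetched, and scanned again for further imports.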

For step one, as already suggested, you might use the CSS package. This allows you to gather up lots of bits of CSS into a single ruleset by repeatedly calling the read_string method. You can then look at the aggregate ruleset to find which rules set either the background-image or the background property (the latter is a shorthand that can include a background-image specification).

    use CSS;

    my $css = CSS->new();
    $css->read_string('div#foo p.bar { background-image : url(/foo/bar.gif) }');
    # $css->read_string('table { border: 1px solid #FF0000 }');
    # etc...

    my %bg_selectors;

    # for some reason, CSS doesn't supply accessor methods...
    foreach my $rule ( @{ $css->{'styles'} } ) {
        foreach my $prop ( @{ $rule->{'properties'} } ) {
            if ( $prop->{'property'} =~ /^background(?:-image)?$/ ) {
                foreach my $selector ( @{ $rule->{'selectors'} } ) {
                    $bg_selectors{$selector->{'name'}} = $prop->{'simple_value'};
                }
            }
        }
    }

You should then have a hash keyed on CSS selectors, whose values are the corresponding background property values. For step 2, you need to find out whether the HTML document contains elements to which each rule applies. One way to do this would be to parse the document into a tree, then use XPath expressions generated from the selectors to test the document.

I'm working with XML::XPath, which means that you'll need your source document to be well-formed XHTML. If it's not, there are a few ways to get there, such as using the htmltidy utility with the appropriate options to convert the document, or possibly using the experimental XML methods on a parse tree generated by HTML::TreeBuilder.

To test the document for the existence of the elements, you'll need to convert the CSS selectors into XPath expressions. Here's a very limited example, which only deals with CSS tag, containment (descendant), class and id selectors. It's also not well tested:

    sub selector_to_xpath {
        my $selector = shift;
        my $xpath = '';
        foreach my $token ( split(/\s/, $selector) ) {
            if ( $token =~ /(\w+)? (?: \#(\w+) | \.(\w+) )?/x ) {
                $xpath .= '//';
                my ( $tag, $id, $class ) = ( $1, $2, $3 );
                if ( $tag ) {
                    $xpath .= $tag;
                }
                if ( $id ) {
                    $xpath .= "*" unless $tag;
                    $xpath .= "[\@id='$id']";
                }
                if ( $class ) {
                    $xpath .= "*" unless $tag;
                    $xpath .= "[\@class='$class']";
                }
            }
        }
        return $xpath;
    }
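
One limit of the [@class='...'] predicate above is that it only matches when the class attribute holds exactly one name, so class="bar baz" would be missed. A common XPath 1.0 idiom for matching one name in a space-separated list is sketched below; class_predicate is a hypothetical helper, and I'm assuming the XPath engine supports the standard contains, concat and normalize-space functions.

```perl
use strict;
use warnings;

# Hypothetical helper: build an XPath 1.0 predicate that matches one class
# name anywhere inside a space-separated class attribute.
sub class_predicate {
    my ($class) = @_;
    return "[contains(concat(' ', normalize-space(\@class), ' '), ' $class ')]";
}

# Prints the expression for CSS "p.bar"
print '//p' . class_predicate('bar'), "\n";
```

Padding both the attribute value and the class name with spaces is what stops "bar" from matching inside "foobar".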

Now, rolling this all together....

    use strict;
    use warnings;
    use CSS;
    use XML::XPath;
    use Data::Dumper;

    sub selector_to_xpath {
        my $selector = shift;
        my $xpath = '';
        # doesn't deal with much of the CSS spec ...
        foreach my $token ( split(/\s/, $selector) ) {
            if ( $token =~ /(\w+)? (?: \#(\w+) | \.(\w+) )?/x ) {
                $xpath .= '//';
                my ( $tag, $id, $class ) = ( $1, $2, $3 );
                if ( $tag ) {
                    $xpath .= $tag;
                }
                if ( $id ) {
                    $xpath .= "*" unless $tag;
                    $xpath .= "[\@id='$id']";
                }
                if ( $class ) {
                    $xpath .= "*" unless $tag;
                    $xpath .= "[\@class='$class']";
                }
            }
        }
        return $xpath;
    }

    my $css = CSS->new();
    # this rule matches an element in our doc
    $css->read_string('div#foo p.bar { background-image : url(/foo/bar.gif) }');
    # this doesn't match an element in our doc
    $css->read_string('div#foo p.qux { background-image : url(/foo/qux.gif) }');
    # nor does this
    $css->read_string('div#baz p.bar { background-image : url(/foo/baz.gif) }');
    # but this does
    $css->read_string('div { background-image : url(/foo/div.gif) }');

    # gather up all rules talking about backgrounds
    my %bg_rules;
    foreach my $rule ( @{ $css->{'styles'} } ) {
        foreach my $prop ( @{ $rule->{'properties'} } ) {
            if ( $prop->{'property'} =~ /^background(?:-image)?$/ ) {
                foreach my $selector ( @{ $rule->{'selectors'} } ) {
                    $bg_rules{$selector->{'name'}} = $prop->{'simple_value'};
                }
            }
        }
    }

    # slurp up the XML and parse for XPath-ery
    my $xml;
    {
        local $/;
        $xml = XML::XPath->new(ioref => *DATA);
    }

    # go through our list of CSS rules seeing which ones apply
    my @used_images;
    while ( my ( $sel, $propvalue ) = each %bg_rules ) {
        my $xpath = selector_to_xpath($sel);
        push(@used_images, $propvalue) if $xml->exists($xpath);
    }

    # let's see what we got ...
    warn Dumper \@used_images;

    __END__
    <html>
      <head>
      </head>
      <body>
        <div id="foo">
          <div>
            <p class="bar">This one</p>
          </div>
        </div>
        <div id="qux">
          <p class="bar">Not me</p>
        </div>
      </body>
    </html>

Obviously, there's still a bit of work to be done to retrieve the image URLs from the CSS property values, and also LOTS of work to implement as much of the CSS selector spec as you need, but hopefully it might get you started. Or dissuade you from the whole idea ;)
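
For the "retrieve the image URLs" part, something along these lines might do as a starting point. This is only a sketch: extract_urls is a name I've made up, and it just pulls out whatever sits inside url(...), quoted or not, without resolving relative URLs against the stylesheet's location.

```perl
use strict;
use warnings;

# Hypothetical helper: pull every url(...) target out of a CSS property value.
sub extract_urls {
    my ($value) = @_;
    my @urls;
    while ( $value =~ m{ url\( \s* (["']?) (.*?) \1 \s* \) }gix ) {
        push @urls, $2;    # $1 is the optional quote, $2 the URL
    }
    return @urls;
}

# A background shorthand value with a quoted URL
print "$_\n" for extract_urls('#fff url("/foo/bar.gif") no-repeat top left');
```

A value such as "none" or a plain colour simply yields an empty list, so the result can be pushed straight onto the list of images to fetch.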

Cheers
ViceRaid

Re^2: Testing Page Size with HTML 4/CSS
by adrianh (Chancellor) on Jun 08, 2004 at 16:37 UTC
    Then, you can look at the aggregate ruleset to find which rules have either background-image or background rules (the latter shorthand notation can include background-image specifications.)

    If you want to catch everything you'd also have to keep an eye on generated content (the :before and :after pseudo-elements could cause other URLs to be downloaded) and list-style-image.
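
A minimal sketch of widening the property check along those lines, in plain Perl. The helper name may_fetch_url is hypothetical, and the property list is an assumption: background, background-image, list-style-image and content come from this thread, while the list-style shorthand and cursor are additions that CSS 2 also allows to carry a url(...) value. The list is probably not exhaustive.

```perl
use strict;
use warnings;

# Assumed, non-exhaustive list of CSS 2 properties whose value may carry url(...)
my %fetching_property = map { $_ => 1 } qw(
    background background-image
    list-style list-style-image
    content cursor
);

# Hypothetical helper: true (1) if this property/value pair could trigger a download.
sub may_fetch_url {
    my ( $property, $value ) = @_;
    return $fetching_property{ lc $property } && $value =~ /url\(/i ? 1 : 0;
}

print may_fetch_url( 'list-style-image', 'url(dot.png)' ), "\n";
print may_fetch_url( 'content',          '"» "' ),         "\n";
```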

    Also, depending on what numbers you're interested in, you might want to consider:

    • The weight of the HTTP headers sent. This can easily end up being several kilobytes.
    • Browsers/servers that support compressed content, which obviously affects the amount of data that flows over the wire.
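
To get a feel for how much these two factors move the numbers, here is a rough sketch using the core IO::Compress::Gzip module. The helper name wire_size, the sample headers and the sample body are all invented for illustration; a real server's compression settings and header set will differ.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: bytes on the wire for one response, assuming the
# body is sent gzip-compressed and the headers are sent uncompressed.
sub wire_size {
    my ( $headers, $body ) = @_;
    my $compressed;
    gzip( \$body => \$compressed ) or die "gzip failed: $GzipError";
    return length($headers) + length($compressed);
}

# Made-up sample response
my $headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Encoding: gzip\r\n\r\n";
my $body    = '<p class="bar">This one</p>' x 200;

printf "raw: %d bytes, on the wire: %d bytes\n",
    length($headers) + length($body), wire_size( $headers, $body );
```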

    Rather than emulating the browser you might want to consider automating one. Write a Perl W3 proxy that keeps track of the size of content that flows over it and point MSIE to it. Drive MSIE with Perl and then look at what the proxy fetched. Just a thought.