Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on a script to test the total sizes of web pages.

It's easy enough to do in a fairly straightforward way, based on old-fashioned HTML. Here's what you do:
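Roughly, something like this (an untested sketch; LWP::UserAgent and HTML::TokeParser are just one choice of modules): fetch the page, pull out every IMG and SCRIPT src, fetch each of those too, and add up the byte counts.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TokeParser;
    use URI;

    my $url = shift || 'http://www.example.com/';
    my $ua  = LWP::UserAgent->new;

    # the page itself
    my $page = $ua->get($url);
    die "Can't fetch $url: ", $page->status_line unless $page->is_success;
    my $html  = $page->content;
    my $total = length $html;

    # every IMG and SCRIPT the page references
    my $p = HTML::TokeParser->new( \$html );
    while ( my $tag = $p->get_tag( 'img', 'script' ) ) {
        my $src = $tag->[1]{src} or next;
        my $res = $ua->get( URI->new_abs( $src, $url ) );
        $total += length $res->content if $res->is_success;
    }

    print "Total weight: $total bytes\n";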

So now I'm trying to do it in the modern world of CSS.

My first problem is that there are two kinds of CSS files which might be imported into a page.

One is via the LINK REL tag, which is easy enough to find with a parser, but the other is via the @import url(URLGOESHERE) statement.

I don't think there's a parser which will read that as a link, is there?

Never mind, it's easy enough to parse for in a regex (I know, I know).
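Something like this is the kind of match I have in mind (untested, and it will happily match inside CSS comments; $css_text is assumed to hold the stylesheet's source):

    # naively catch both the @import url(foo.css) and @import "foo.css" forms
    my @imports = $css_text =~ /\@import\s+(?:url\(\s*)?["']?([^"'()\s;]+)/gi;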

But linked files can have nested linked files, that is, a CSS file that you import can contain one or more further @import url(URLGOESHERE) statements, so I'll have to do that recursively, but that's not the problem because...

...the true weight of an HTML 4/CSS page is the weight of the page, any CSS-tag files, any SCRIPT-tag files, any IMG-tag files, and any images referenced in the CSS files which have to be loaded.

For instance if there's a DIV with the ID "foo" with a P inside it with the class "bar", and somewhere in one of the CSS files there's a declaration which includes DIV#foo P.bar and sets a background image, that image should be counted toward the total.

But how will I know, without parsing the HTML and the CSS as well, which images are being loaded for that particular page?

The CSS file, if everything's going well, will be shared between multiple pages. Some of those pages will call on some of the images, but unless I parse the DOM and relate it to the CSS, I'm never going to know which images are being loaded on this particular page.

So, is this at all possible? Should I give up now..?



($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print

Re: Testing Page Size with HTML 4/CSS
by PodMaster (Abbot) on Jun 08, 2004 at 10:04 UTC
    ... @import url(URLGOESHERE) statement. I don't think there's a parser which will read that as a link, is there?
    Sure, why not => the CSS module on CPAN

    update: And let's not forget any images pulled in by those SCRIPT files :)

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Testing Page Size with HTML 4/CSS
by Anonymous Monk on Jun 08, 2004 at 11:58 UTC
    You also seem to be ignoring caching - is that deliberate? A lot of pages might have a lot of images, big CSS files, etc. related to them, but those objects only get pulled the first time the page loads. And then there are DOM-modifying scripts to take account of too... Maybe you could use HTTP::Recorder or some other proxy and simply record the byte counts when a real browser fetches the page?
Re: Testing Page Size with HTML 4/CSS
by hakkr (Chaplain) on Jun 08, 2004 at 12:00 UTC
    Well, you need to somehow emulate the equivalent of a web browser's File->Save As, and then stat the contents of the downloaded files. Sounds possible, but if you only want rendered images then you need to write or call a browser to simulate rendering the page. Maybe you could do this by looking at the browser cache or calling some external page-caching software?

    You probably also want to look at the webserver access log, as that will give you all the file hits needed for each page view. Easier than writing your own browser to parse CSS :)
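    For instance, something along these lines (untested, and it assumes a combined-format log read on STDIN; the field positions are guesses you'd need to adjust) would total up bytes served, crediting each hit to the page named in its Referer header:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # rough sketch: sum response sizes per page from an access log
    my %weight;
    while (<>) {
        # ... "GET /style.css HTTP/1.1" 200 4321 "http://site/page.html" "Mozilla/..."
        next unless /"\S+ (\S+) [^"]*" \d{3} (\d+|-) "([^"]*)"/;
        my ( $path, $bytes, $referer ) = ( $1, $2, $3 );
        next if $bytes eq '-';
        my $page = ( $referer && $referer ne '-' ) ? $referer : $path;
        $weight{$page} += $bytes;
    }
    printf "%10d  %s\n", $weight{$_}, $_
        for sort { $weight{$b} <=> $weight{$a} } keys %weight;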

Re: Testing Page Size with HTML 4/CSS
by ViceRaid (Chaplain) on Jun 08, 2004 at 13:21 UTC

    Everything is possible. Never give up. That said, this isn't that easy: you might well think that the following is just too convoluted to bother with, and that using some external agent (such as Internet Explorer driven by Perl) to do the task might be easier. Anyway, if you're doing it without external aid, there are two distinct steps to the task:

    1. Parse all the CSS that pertains to the page, following <link>ed stylesheets, @import rules etc, as you've described, to find all the rules that refer to a background property
    2. Find which of these rules have a selector which addresses a part of the HTML document to which the CSS is being applied

    For step one, as already suggested, you might use the CSS package. This allows you to gather up lots of bits of CSS into a single ruleset, by repeatedly using the read_string method. Then, you can look at the aggregate ruleset to find which rules have either background-image or background properties (the latter shorthand notation can include background-image specifications).

    use CSS;

    my $css = CSS->new();
    $css->read_string('div#foo p.bar { background-image : url(/foo/bar.gif) }');
    # $css->read_string('table { border: 1px solid #FF0000 }');
    # etc...

    my %bg_selectors;

    # for some reason, CSS doesn't supply accessor methods...
    foreach my $rule ( @{ $css->{'styles'} } ) {
        foreach my $prop ( @{ $rule->{'properties'} } ) {
            if ( $prop->{'property'} =~ /^background(?:-image)?$/ ) {
                foreach my $selector ( @{ $rule->{'selectors'} } ) {
                    $bg_selectors{ $selector->{'name'} } = $prop->{'simple_value'};
                }
            }
        }
    }

    You should then have a hash keyed on CSS selectors whose values are the relevant background property value. For step 2, you need to find out whether the HTML document contains elements to which the rule should be applied. One way to do this would be to parse the document into a tree, then use XPath generated from the selectors to test the document.

    I'm working with XML::XPath, which means that you'll need your source document to be valid XHTML. If it's not, there are a few ways to get there, such as using the htmltidy utility with the appropriate options to convert the document, or possibly using the experimental XML methods on a parse tree generated by HTML::TreeBuilder.
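    If you go the HTML::TreeBuilder route, the conversion might be as simple as this (untested; $html is the raw page source):

    use HTML::TreeBuilder;

    # build a tree from the tag soup, then serialise it with the
    # (experimental) as_XML method
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $xhtml = $tree->as_XML;
    $tree->delete;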

    To test the document for the existence of the elements, you'll need to convert the CSS rules into XPath expressions. Here's a very limited example, which only deals with CSS tag, descendant (containment), class and id selectors. It's also not much tested:

    sub selector_to_xpath {
        my $selector = shift;
        my $xpath    = '';

        foreach my $token ( split(/\s/, $selector) ) {
            if ( $token =~ /(\w+)? (?: \#(\w+) | \.(\w+) )?/x ) {
                $xpath .= '//';
                my ( $tag, $id, $class ) = ( $1, $2, $3 );
                if ( $tag )   { $xpath .= $tag; }
                if ( $id )    { $xpath .= "*" unless $tag; $xpath .= "[\@id='$id']"; }
                if ( $class ) { $xpath .= "*" unless $tag; $xpath .= "[\@class='$class']"; }
            }
        }
        return $xpath;
    }
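    For example, selector_to_xpath('div#foo p.bar') should produce //div[@id='foo']//p[@class='bar'], which mirrors the descendant relationship in the CSS selector.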

    Now, rolling this all together....
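    Untested, and it cheerfully ignores the XHTML namespace issue, but with %bg_selectors and selector_to_xpath() from above (and page.xhtml standing in for the tidied document), the last step might look like:

    use XML::XPath;

    # load the (already tidied) XHTML version of the page
    my $xp = XML::XPath->new( filename => 'page.xhtml' );

    # keep only the background rules whose selectors match something on this page
    my %applied;
    while ( my ( $selector, $value ) = each %bg_selectors ) {
        my $xpath = selector_to_xpath($selector);
        next unless $xpath;
        $applied{$selector} = $value if $xp->findnodes($xpath)->size;
    }

    # %applied now holds the background property values this page actually uses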

    Obviously, there's still a bit of work to be done to retrieve the image urls from the CSS properties, and also LOTS of work to implement as much of the CSS selector spec as you need, but hopefully it might get you started. Or dissuade you from the whole idea ;)

    Cheers
    ViceRaid

      Then, you can look at the aggregate ruleset to find which rules have either background-image or background rules (the latter shorthand notation can include background-image specifications.)

      If you want to catch everything you'd also have to keep an eye on generated content (the :before and :after pseudo-elements could cause other URLs to be downloaded) and list-style-image.

      Also, depending on what numbers you're interested in, you might want to consider:

      • The weight of the HTTP headers sent. This can easily end up being several Kb.
      • Browsers/servers that support compressed content, which obviously affects the amount of data that flows over the wire.

      Rather than emulating the browser you might want to consider automating one. Write a Perl web proxy that keeps track of the size of content that flows over it and point MSIE at it. Drive MSIE with Perl and then look at what the proxy fetched. Just a thought.
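      For what it's worth, the counting proxy could be quite small with HTTP::Proxy (an untested sketch; mime => undef is what makes it see non-text responses too, and maxconn just gives start() a way to return):

      use HTTP::Proxy;
      use HTTP::Proxy::BodyFilter::simple;

      my $total = 0;
      my $proxy = HTTP::Proxy->new( port => 3128, maxconn => 100 );

      # count every response body byte that flows through the proxy
      $proxy->push_filter(
          mime     => undef,    # all content types, not just text/*
          response => HTTP::Proxy::BodyFilter::simple->new(
              sub {
                  my ( $self, $dataref ) = @_;
                  $total += length $$dataref if defined $$dataref;
              }
          ),
      );

      $proxy->start;    # point MSIE at localhost:3128 and browse the page
      print "Fetched $total bytes\n";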