Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on a script to test the total sizes of web pages.

It's easy enough to do in a fairly straightforward way, based on old-fashioned HTML. Here's what you do:
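Roughly, something like this (an untested sketch; LWP::UserAgent and HTML::TokeParser are just one choice of modules): fetch the page, pull out every IMG and SCRIPT src, fetch each of those too, and add up the byte counts.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TokeParser;
    use URI;

    my $url = shift || 'http://www.example.com/';
    my $ua  = LWP::UserAgent->new;

    # the page itself
    my $page = $ua->get($url);
    die "Can't fetch $url: ", $page->status_line unless $page->is_success;
    my $html  = $page->content;
    my $total = length $html;

    # every IMG and SCRIPT the page references
    my $p = HTML::TokeParser->new( \$html );
    while ( my $tag = $p->get_tag( 'img', 'script' ) ) {
        my $src = $tag->[1]{src} or next;
        my $res = $ua->get( URI->new_abs( $src, $url ) );
        $total += length $res->content if $res->is_success;
    }

    print "Total weight: $total bytes\n";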

So now I'm trying to do it in the modern world of CSS.

My first problem is that there are two kinds of CSS files which might be imported into a page.

One is via the LINK REL tag, which is easy enough to find with a parser, but the other is via the @import url(URLGOESHERE) statement.

I don't think there's a parser which will read that as a link, is there?

Never mind, it's easy enough to parse for in a regex (I know, I know).
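Something like this is the kind of match I have in mind (untested, and it will happily match inside CSS comments; $css_text is assumed to hold the stylesheet's source):

    # naively catch both the @import url(foo.css) and @import "foo.css" forms
    my @imports = $css_text =~ /\@import\s+(?:url\(\s*)?["']?([^"'()\s;]+)/gi;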

But linked files can have nested linked files, that is, a CSS file that you import can contain one or more further @import url(URLGOESHERE) statements, so I'll have to do that recursively, but that's not the problem because...

...the true weight of an HTML 4/CSS page is the weight of the page, any CSS-tag files, any SCRIPT-tag files, any IMG-tag files, and any images referenced in the CSS files which have to be loaded.

For instance if there's a DIV with the ID "foo" with a P inside it with the class "bar", and somewhere in one of the CSS files there's a declaration which includes DIV#foo P.bar and sets a background image, that image should be counted toward the total.

But how will I know, without parsing the HTML and the CSS as well, which images are being loaded for that particular page?

The CSS file, if everything's going well, will be shared between multiple pages. Some of those pages will call on some of the images, but unless I parse the DOM and relate it to the CSS, I'm never going to know which images are being loaded on this particular page.

So, is this at all possible? Should I give up now..?



($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print

Re: Testing Page Size with HTML 4/CSS
by PodMaster (Abbot) on Jun 08, 2004 at 10:04 UTC
    ... @import url(URLGOESHERE) statement. I don't think there's a parser which will read that as a link, is there?
    Sure, why not => the CSS module on CPAN

    update: And let's not forget any images pulled in by those SCRIPT files :)

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Testing Page Size with HTML 4/CSS
by Anonymous Monk on Jun 08, 2004 at 11:58 UTC
    You also seem to be ignoring caching - is that deliberate? A lot of pages might have a lot of images, big CSS files, etc. related to them, but those objects only get pulled the first time the page loads. And then there are DOM-modifying scripts to take account of too... Maybe you could use HTTP::Recorder or some other proxy and simply record the byte counts when a real browser fetches the page?
Re: Testing Page Size with HTML 4/CSS
by hakkr (Chaplain) on Jun 08, 2004 at 12:00 UTC
    Well, you need to somehow emulate the equivalent of a web browser's File->Save As, and then stat the contents of the downloaded files. Sounds possible, but if you only want rendered images then you need to write or call a browser to simulate rendering the page. Maybe you could do this by looking at the browser cache or calling some external page-caching software?

    You probably also want to look at the webserver access log, as that will give you all the file hits needed for each page view. Easier than writing your own browser to parse CSS :)
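    For instance, something along these lines (untested, and it assumes a combined-format log read on STDIN; the field positions are guesses you'd need to adjust) would total up bytes served, crediting each hit to the page named in its Referer header:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # rough sketch: sum response sizes per page from an access log
    my %weight;
    while (<>) {
        # ... "GET /style.css HTTP/1.1" 200 4321 "http://site/page.html" "Mozilla/..."
        next unless /"\S+ (\S+) [^"]*" \d{3} (\d+|-) "([^"]*)"/;
        my ( $path, $bytes, $referer ) = ( $1, $2, $3 );
        next if $bytes eq '-';
        my $page = ( $referer && $referer ne '-' ) ? $referer : $path;
        $weight{$page} += $bytes;
    }
    printf "%10d  %s\n", $weight{$_}, $_
        for sort { $weight{$b} <=> $weight{$a} } keys %weight;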

Re: Testing Page Size with HTML 4/CSS
by ViceRaid (Chaplain) on Jun 08, 2004 at 13:21 UTC

    Everything is possible. Never give up. That said, this isn't that easy: you might well think that the following is just too convoluted to bother with, and that using some external agent (such as Internet Explorer driven by Perl) to do the task might be easier. Anyway, if you're doing it without external aid, there are two distinct steps to the task:

    1. Parse all the CSS that pertains to the page, following <link>ed stylesheets, @import rules etc, as you've described, to find all the rules that refer to a background property
    2. Find which of these rules have a selector which addresses a part of the HTML document to which the CSS is being applied

    For step one, as already suggested, you might use the CSS package. This allows you to gather up lots of bits of CSS into a single ruleset, by repeatedly using the read_string method. Then, you can look at the aggregate ruleset to find which rules have either background-image or background properties (the latter shorthand notation can include background-image specifications).

    use CSS;

    my $css = CSS->new();
    $css->read_string('div#foo p.bar { background-image : url(/foo/bar.gif) }');
    # $css->read_string('table { border: 1px solid #FF0000 }');
    # etc...

    my %bg_selectors;

    # for some reason, CSS doesn't supply accessor methods...
    foreach my $rule ( @{ $css->{'styles'} } ) {
        foreach my $prop ( @{ $rule->{'properties'} } ) {
            if ( $prop->{'property'} =~ /^background(?:-image)?$/ ) {
                foreach my $selector ( @{ $rule->{'selectors'} } ) {
                    $bg_selectors{ $selector->{'name'} } = $prop->{'simple_value'};
                }
            }
        }
    }

    You should then have a hash keyed on CSS selectors whose values are the relevant background property value. For step 2, you need to find out whether the HTML document contains elements to which the rule should be applied. One way to do this would be to parse the document into a tree, then use XPath generated from the selectors to test the document.

    I'm working with XML::XPath, which means that you'll need your source document to be valid XHTML. If it's not, there are a few ways to get there, such as using the htmltidy utility with the appropriate options to convert the document, or possibly using the experimental XML methods on a parse tree generated by HTML::TreeBuilder.
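    If you go the HTML::TreeBuilder route, the conversion might be as simple as this (untested; $html is the raw page source):

    use HTML::TreeBuilder;

    # build a tree from the tag soup, then serialise it with the
    # (experimental) as_XML method
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $xhtml = $tree->as_XML;
    $tree->delete;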

    To test the document for the existence of the elements, you'll need to convert the CSS rules into XPath expressions. Here's a very limited example, which only deals with CSS tag, descendant (containment), class and id selectors. It's also not much tested:

    sub selector_to_xpath {
        my $selector = shift;
        my $xpath    = '';

        foreach my $token ( split(/\s/, $selector) ) {
            if ( $token =~ /(\w+)? (?: \#(\w+) | \.(\w+) )?/x ) {
                $xpath .= '//';
                my ( $tag, $id, $class ) = ( $1, $2, $3 );
                if ( $tag )   { $xpath .= $tag; }
                if ( $id )    { $xpath .= "*" unless $tag; $xpath .= "[\@id='$id']"; }
                if ( $class ) { $xpath .= "*" unless $tag; $xpath .= "[\@class='$class']"; }
            }
        }
        return $xpath;
    }
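    For example, selector_to_xpath('div#foo p.bar') should produce //div[@id='foo']//p[@class='bar'], which mirrors the descendant relationship in the CSS selector.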

    Now, rolling this all together....
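    Untested, and it cheerfully ignores the XHTML namespace issue, but with %bg_selectors and selector_to_xpath() from above (and page.xhtml standing in for the tidied document), the last step might look like:

    use XML::XPath;

    # load the (already tidied) XHTML version of the page
    my $xp = XML::XPath->new( filename => 'page.xhtml' );

    # keep only the background rules whose selectors match something on this page
    my %applied;
    while ( my ( $selector, $value ) = each %bg_selectors ) {
        my $xpath = selector_to_xpath($selector);
        next unless $xpath;
        $applied{$selector} = $value if $xp->findnodes($xpath)->size;
    }

    # %applied now holds the background property values this page actually uses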

    Obviously, there's still a bit of work to be done to retrieve the image urls from the CSS properties, and also LOTS of work to implement as much of the CSS selector spec as you need, but hopefully it might get you started. Or dissuade you from the whole idea ;)

    Cheers
    ViceRaid

      Then, you can look at the aggregate ruleset to find which rules have either background-image or background rules (the latter shorthand notation can include background-image specifications.)

      If you want to catch everything you'd also have to keep an eye on generated content (the :before and :after pseudo-elements could cause other URLs to be downloaded) and list-style-image.

      Also, depending on what numbers you're interested in, you might want to consider:

      • The weight of the HTTP headers sent. This can easily end up being several Kb.
      • Browsers/servers that support compressed content, which obviously affects the amount of data that flows over the wire.

      Rather than emulating the browser you might want to consider automating one. Write a Perl web proxy that keeps track of the size of content that flows over it and point MSIE at it. Drive MSIE with Perl and then look at what the proxy fetched. Just a thought.
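      For what it's worth, the counting proxy could be quite small with HTTP::Proxy (an untested sketch; mime => undef is what makes it see non-text responses too, and maxconn just gives start() a way to return):

      use HTTP::Proxy;
      use HTTP::Proxy::BodyFilter::simple;

      my $total = 0;
      my $proxy = HTTP::Proxy->new( port => 3128, maxconn => 100 );

      # count every response body byte that flows through the proxy
      $proxy->push_filter(
          mime     => undef,    # all content types, not just text/*
          response => HTTP::Proxy::BodyFilter::simple->new(
              sub {
                  my ( $self, $dataref ) = @_;
                  $total += length $$dataref if defined $$dataref;
              }
          ),
      );

      $proxy->start;    # point MSIE at localhost:3128 and browse the page
      print "Fetched $total bytes\n";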