ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to write a simple CDN script, which will take the values from an HTML page and pass them to the CDN network.

However, to do this I need to extract all the JS/CSS/Image URLs from the HTML page

Is there a simple way to do this?

I looked at HTML::LinkExtor, but couldn't quite get it to do what I want

Anyone got any suggestions?

TIA!

Andy
  • Comment on Extract CSS + JS + Image URLs from a HTML page?

Replies are listed 'Best First'.
Re: Extract CSS + JS + Image URLs from a HTML page?
by tmharish (Friar) on Jan 27, 2011 at 19:47 UTC

    HTML::Miner provides a way to extract Image URLs.

    If you take a look at the 'get_meta_elements' function you will see it pulls out RSS Feeds. Maybe you could make a couple of changes there to extract the CSS and JS URLs.

    DISCLOSURE: I maintain that module.
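
    For what it's worth, HTML::LinkExtor (which the OP mentioned) can also be made to do this: it already knows which tag/attribute pairs carry URLs, so a callback that filters on tag name is enough, and passing a base URL to new() makes it return absolute URI objects. A minimal sketch, assuming the page's base URL is known (extract_asset_urls is just an illustrative name, not part of any module):

    ```perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Collect script/img src and link href URLs from an HTML string.
    # With a base URL passed to new(), HTML::LinkExtor hands the
    # callback absolutized URI objects instead of raw strings.
    sub extract_asset_urls {
        my ( $html, $base ) = @_;
        my @assets;

        my $parser = HTML::LinkExtor->new(
            sub {
                my ( $tag, %attr ) = @_;
                push @assets, $attr{src}  if $tag eq 'script' && $attr{src};
                push @assets, $attr{src}  if $tag eq 'img'    && $attr{src};
                push @assets, $attr{href} if $tag eq 'link'   && $attr{href};
            },
            $base,
        );
        $parser->parse( $html );
        $parser->eof;

        return map { "$_" } @assets;    # stringify the URI objects
    }
    ```

    Note that this picks up every <link> element, not just stylesheets, so you may still want to check the rel attribute if that matters.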

      Hi,

      BTW, are you aware in your documents you have:
      my $foo = HTML::Miner->new ( CURRENT_URL => 'www.perl.org' , CURRENT_URL_HTML => $html );
      Surely it should be $html_miner, not $foo ? (I was getting errors under "strict")

      :)

      Cheers

      Andy

        My bad!! Will fix in next update.

        Maybe I should add JS and CSS Extraction as well in the next update ...

        Thank You!

      Hi,

      I'm just having a look at the "relative" URL stuff, and can't seem to get it working right :(

      I'm trying:
      sub get_js {

          my $tmp = shift;

          my $self;
          my $url;
          my $html;
          my @result_arr;

          my $user_agent = "Html_Miner/0.01";
          my $timeout    = 60;
          my $domain;

          ## First extract all required information.
          if( UNIVERSAL::isa( $tmp, 'HTML::Miner' ) ) {
              $self   = $tmp;
              $url    = $self->{ CURRENT_URL };
              $html   = $self->{ CURRENT_URL_HTML };
              $domain = $self->{ _BASE_DOMAIN };
          } else {
              $url = $tmp;

              ## Check for validity of url!
              my ( $tmp, $protocol, $domain, $uri ) = _convert_to_valid_url( $url );
              $url = $tmp;

              my @params = @_;
              my $html_has_been_passed = @params;

              if( $html_has_been_passed ) {
                  $html = shift;
              } else {
                  ## Need to retrieve html
                  eval {
                      require LWP::UserAgent;
                      require HTTP::Request;
                  };
                  croak( "LWP::UserAgent and HTTP::Request are required if the url is to be fetched!" ) if( $@ );
                  $html = _get_url_html( $url, $user_agent, $timeout );
              } ## HTML Not passed
          } ## Not called on Object.

          while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ) {
              my $url = $1;
              if( $url !~ /^https?:\/\// ) {
                  $url = HTML::Miner::get_absolute_url( $url );
              }
              push( @result_arr, "$url.js" );
          }

          return \@result_arr;
      }
      ..with this being the bit in question:
      if ($url !~ /^https?:\/\//) { $url = HTML::Miner::get_absolute_url($url); }
      ..but I keep getting this error:
      A fatal error has occured: URL - http:///dev/static/utils/ - Malformed! Sorry I tried to fix it but could not! at /var/home/linkssql/ultradev.com/cgi-bin/dev/admin/Plugins/CDN.pm line 59
      Please enable debugging in setup for more details.


      Any suggestions as to what I'm doing wrong? =)

      TIA!

      Andy

        It turns out that get_absolute_url takes two arguments, the first is the page the relative URL was found on and the second is the ( possibly ) relative URL.

        This should work:
        while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ) {
            my $js_url = $1;
            if( $js_url !~ /^https?:\/\// ) {
                $js_url = HTML::Miner::get_absolute_url( $url, $js_url );
            }
            push( @result_arr, "$js_url.js" );
        }

        return \@result_arr;

        I just posted HERE about the HTML::Miner V0.05 that I uploaded, which has the option to pull out CSS and JS. It also handles relative URLs.

        Finally it may not matter here but if there is a non-css <link /> or a file like 'blah.js?something' then this kind of RegEx might fail, I used:

        $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis
        $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis
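
        A small self-contained sketch of how those two patterns could be wrapped up, capturing the URL (group 2) from each match; extract_css_js is just an illustrative name, and the patterns are taken verbatim from above:

        ```perl
        use strict;
        use warnings;

        # Pull .css and .js URLs out of an HTML string, including
        # ones with query strings such as 'blah.js?something'.
        sub extract_css_js {
            my ( $html ) = @_;
            my ( @css, @js );

            while ( $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis ) {
                push @css, $2;
            }
            while ( $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis ) {
                push @js, $2;
            }

            return ( \@css, \@js );
        }
        ```

        Any relative URLs in the results would still need to be absolutized, e.g. with get_absolute_url as discussed above.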

      Hi,

      Thanks - will check that module out. Looks very promising :)

      Cheers

      Andy
Re: Extract CSS + JS + Image URLs from a HTML page?
by wjw (Priest) on Jan 27, 2011 at 20:07 UTC
    Perhaps this will help get you started... Here
    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion
      Hi,

      Thanks for the reply wjw, but unfortunately people don't always keep their meta-stuff in the <head></head> part, so it may end up missing bits :(

      Cheers

      Andy
Re: Extract CSS + JS + Image URLs from a HTML page?
by tmharish (Friar) on Jan 28, 2011 at 19:02 UTC

    Hi Andy

    Turned out I had some time today, just uploaded HTML::Miner V0.05 which includes functionality to extract JS and CSS.

    It might take a couple of hours to turn up on CPAN mirrors.

    Looks like the misspelling of 'Relative' will have to wait for another revision :-(