in reply to Extract CSS + JS + Image URLs from a HTML page?

HTML::Miner provides a way to extract Image URLs.

If you take a look at the 'get_meta_elements' function you will see it pulls out RSS Feeds. Maybe you could make a couple of changes there to extract the CSS and JS URLs.

DISCLOSURE: I maintain that module.

Re^2: Extract CSS + JS + Image URLs from a HTML page?
by ultranerds (Hermit) on Jan 28, 2011 at 08:21 UTC
    Hi,

    BTW, are you aware that in your documentation you have:
    my $foo = HTML::Miner->new ( CURRENT_URL => 'www.perl.org' , CURRENT_URL_HTML => $html );
    Surely it should be $html_miner, not $foo? (I was getting errors under "strict")

    :)

    Cheers

    Andy

      My bad!! Will fix in next update.

      Maybe I should add JS and CSS Extraction as well in the next update ...

      Thank You!

        Not the prettiest regexes, but they seem to work =)
        while ($string =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis) {
            print "FOO: $1.js \n";
        }
        while ($string =~ m/\<link .*? href=\"(.+?)\.css\" \/?\>/gis) {
            print "FOO: $1.css \n";
        }
        Cheers

        Andy
Re^2: Extract CSS + JS + Image URLs from a HTML page?
by ultranerds (Hermit) on Jan 28, 2011 at 17:15 UTC
    Hi,

    I'm just having a look at the "relative" URL stuff, and can't seem to get it working right :(

    I'm trying:
    sub get_js {

        my $tmp = shift ;

        my $self ;
        my $url ;
        my $html ;
        my @result_arr ;

        my $user_agent = "Html_Miner/0.01" ;
        my $timeout = 60 ;
        my $domain ;

        ## First extract all required information.
        if( UNIVERSAL::isa( $tmp, 'HTML::Miner' ) ) {
            $self = $tmp ;
            $url = $self->{ CURRENT_URL } ;
            $html = $self->{ CURRENT_URL_HTML } ;
            $domain = $self->{ _BASE_DOMAIN } ;
        } else {
            $url = $tmp ;

            ## Check for validity of url!
            my ( $tmp, $protocol, $domain, $uri ) = _convert_to_valid_url( $url ) ;
            $url = $tmp ;

            my @params = @_ ;
            my $html_has_been_passed = @params ;
            if( $html_has_been_passed ) {
                $html = shift ;
            } else {
                ## Need to retrieve html
                eval {
                    require LWP::UserAgent ;
                    require HTTP::Request ;
                };
                croak( "LWP::UserAgent and HTTP::Request are required if the url is to be fetched!" ) if( $@ );
                $html = _get_url_html( $url, $user_agent, $timeout ) ;
            } ## HTML Not passed
        } ## Not called on Object.

        while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ){
            my $url = $1;
            if ($url !~ /^https?:\/\//) {
                $url = HTML::Miner::get_absolute_url($url);
            }
            push( @result_arr, "$url.js" );
        }

        return \@result_arr;
    }
    ..with this being the bit in question:
    if ($url !~ /^https?:\/\//) { $url = HTML::Miner::get_absolute_url($url); }
    ..but I keep getting this error:
    A fatal error has occured:
    URL - http:///dev/static/utils/ - Malformed! Sorry I tried to fix it but could not! at /var/home/linkssql/ultradev.com/cgi-bin/dev/admin/Plugins/CDN.pm line 59
    Please enable debugging in setup for more details.


    Any suggestions as to what I'm doing wrong? =)

    TIA!

    Andy

      It turns out that get_absolute_url takes two arguments: the first is the URL of the page the relative URL was found on, and the second is the (possibly) relative URL.

      This should work:
      while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ){
          my $js_url = $1;
          if ($js_url !~ /^https?:\/\//) {
              $js_url = HTML::Miner::get_absolute_url($url, $js_url);
          }
          push( @result_arr, "$js_url.js" );
      }

      return \@result_arr;
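
The base-plus-relative idea behind that two-argument call can be tried out on its own. The following is only a sketch: it uses the URI module's new_abs method as a stand-in, not HTML::Miner's own get_absolute_url, and the page URL and script paths are made-up examples.

```perl
use strict;
use warnings;
use URI;    # CPAN module, assumed to be installed

# Hypothetical page the script tags were found on.
my $page_url = 'http://www.example.com/dev/index.html';

# Hypothetical src values: path-relative, root-relative, already absolute.
my @found = (
    'static/utils/foo.js',
    '/js/app.js?v=3',
    'http://cdn.example.com/lib.js',
);

# Resolve each (possibly) relative URL against the page it came from.
my @absolute = map { URI->new_abs( $_, $page_url )->as_string } @found;

print "$_\n" for @absolute;
```

An already absolute URL passes through unchanged, so with this approach the separate /^https?:\/\// check is not strictly needed.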

      I just posted about HTML::Miner V0.05, which I uploaded and which has the option to pull out CSS and JS. It also provides for relative URLs.

      Finally, it may not matter here, but if there is a non-CSS <link /> or a file like 'blah.js?something', then this kind of regex might fail. I used:

      $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis
      $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis
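
As a quick check of those two patterns, here is a minimal sketch that runs them over a small made-up HTML snippet. The snippet, the variable names and the loop around each match are assumptions for illustration only. The URL itself sits in the second capture group, so the loops collect $2.

```perl
use strict;
use warnings;

# Made-up sample page: one stylesheet, one non-CSS <link />,
# one script with a query string, one plain script.
my $html = <<'HTML';
<link rel="stylesheet" type="text/css" href="/css/main.css" />
<link rel="alternate" type="application/rss+xml" href="/feed.xml" />
<script type="text/javascript" src="js/app.js?v=3"></script>
<script src="http://cdn.example.com/lib.js"></script>
HTML

my ( @css, @js );

# $1 is the tag up to the closing quote of the attribute, $2 is the URL.
while ( $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis ) {
    push @css, $2;
}
while ( $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis ) {
    push @js, $2;
}

print "CSS: $_\n" for @css;
print "JS:  $_\n" for @js;
```

On this sample the RSS <link /> is skipped because its href contains no '.css', and 'js/app.js?v=3' keeps its query string.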

        Hi,

        Thanks - will have a play with that now :)

        Cheers!

        Andy
Re^2: Extract CSS + JS + Image URLs from a HTML page?
by ultranerds (Hermit) on Jan 28, 2011 at 08:10 UTC
    Hi,

    Thanks - will check that module out. Looks very promising :)

    Cheers

    Andy