in reply to Re: Extract CSS + JS + Image URLs from a HTML page?

Hi,

BTW, are you aware that your documentation has:
my $foo = HTML::Miner->new ( CURRENT_URL => 'www.perl.org' , CURRENT_URL_HTML => $html );
Surely that should be $html_miner, not $foo? (I was getting errors under "use strict".)

:)

Cheers

Andy

Re^3: Extract CSS + JS + Image URLs from a HTML page?
by tmharish (Friar) on Jan 28, 2011 at 09:13 UTC

    My bad!! Will fix in next update.

    Maybe I should add JS and CSS Extraction as well in the next update ...

    Thank You!

      Not the prettiest regexes, but they seem to work =)
      while ( $string =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ) {
          print "FOO: $1.js \n";
      }
      while ( $string =~ m/\<link .*? href=\"(.+?)\.css\" \/?\>/gis ) {
          print "FOO: $1.css \n";
      }
      Cheers

      Andy
        Hi,

        Not sure if you want to use my version, but this seems to work for CSS (JS should be a simple tweak):
        sub get_css {
            my $tmp = shift ;
            my $self ;
            my $url ;
            my $html ;
            my @result_arr ;
            my $user_agent = "Html_Miner/0.01" ;
            my $timeout = 60 ;
            my $domain ;

            ## First extract all required information.
            if( UNIVERSAL::isa( $tmp, 'HTML::Miner' ) ) {
                $self = $tmp ;
                $url = $self->{ CURRENT_URL } ;
                $html = $self->{ CURRENT_URL_HTML } ;
                $domain = $self->{ _BASE_DOMAIN } ;
            } else {
                $url = $tmp ;

                ## Check for validity of url!
                my ( $tmp, $protocol, $domain, $uri ) = _convert_to_valid_url( $url ) ;
                $url = $tmp ;

                my @params = @_ ;
                my $html_has_been_passed = @params ;
                if( $html_has_been_passed ) {
                    $html = shift ;
                } else {
                    ## Need to retrieve html
                    eval {
                        require LWP::UserAgent ;
                        require HTTP::Request ;
                    };
                    croak( "LWP::UserAgent and HTTP::Request are required if the url is to be fetched!" ) if( $@ );
                    $html = _get_url_html( $url, $user_agent, $timeout ) ;
                } ## HTML Not passed
            } ## Not called on Object.

            while( $html =~ m/\<link .*? href=\"(.+?)\.css\" \/?\>/gis ){
                push( @result_arr, "$1.css" );
            }

            return \@result_arr;
        }
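        For the JS tweak, here is a rough standalone sketch. It assumes the HTML is already in a string, and extract_js_urls is just a name I made up for illustration (it is not part of HTML::Miner). The regex is also looser than the ones above: it allows src to be the first attribute and accepts single or double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Miner: returns an array ref holding
# the src value of every <script> tag whose src ends in ".js".
sub extract_js_urls {
    my ($html) = @_;
    my @result_arr;

    # [^>]*? skips any attributes before src (possibly none);
    # the leading \s keeps attributes such as data-src from matching.
    while ( $html =~ m/<script\b[^>]*?\ssrc\s*=\s*["']([^"']+?\.js)["']/gis ) {
        push @result_arr, $1;
    }

    return \@result_arr;
}

# Small made-up sample page to try it on.
my $sample = join "\n",
    q{<script type="text/javascript" src="/js/site.js"></script>},
    q{<script src='lib/jquery.min.js'></script>},
    q{<link rel="stylesheet" href="/css/main.css" />};

print "$_\n" for @{ extract_js_urls($sample) };
```

        On the sample above this should print /js/site.js and lib/jquery.min.js and skip the stylesheet. It will still miss a src with a query string after .js (such as site.js?v=2), so a real HTML parser is probably the safer route for anything serious.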
        Cheers

        Andy