ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to write a simple CDN script, which will take the values from an HTML page and pass them to the CDN network.

However, to do this I need to extract all the JS/CSS/Image URLs from the HTML page

Is there a simple way to do this?

I looked at HTML::LinkExtor, but couldn't quite get it to do what I want

Anyone got any suggestions?

TIA!

Andy
  • Comment on Extract CSS + JS + Image URLs from a HTML page?

Replies are listed 'Best First'.
Re: Extract CSS + JS + Image URLs from a HTML page?
by tmharish (Friar) on Jan 27, 2011 at 19:47 UTC

    HTML::Miner provides a way to extract Image URLs.

    If you take a look at the 'get_meta_elements' function you will see it pulls out RSS Feeds. Maybe you could make a couple of changes there to extract the CSS and JS URLs.

    DISCLOSURE: I maintain that module.
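
    For what it's worth, HTML::LinkExtor (which the OP mentioned) can also be made to do this: it already knows which tag/attribute pairs carry URLs, so a callback that filters on tag name is enough, and passing a base URL to new() makes it return absolute URI objects. A minimal sketch, assuming the page's base URL is known (extract_asset_urls is just an illustrative name, not part of any module):

    ```perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Collect script/img src and link href URLs from an HTML string.
    # With a base URL passed to new(), HTML::LinkExtor hands the
    # callback absolutized URI objects instead of raw strings.
    sub extract_asset_urls {
        my ( $html, $base ) = @_;
        my @assets;

        my $parser = HTML::LinkExtor->new(
            sub {
                my ( $tag, %attr ) = @_;
                push @assets, $attr{src}  if $tag eq 'script' && $attr{src};
                push @assets, $attr{src}  if $tag eq 'img'    && $attr{src};
                push @assets, $attr{href} if $tag eq 'link'   && $attr{href};
            },
            $base,
        );
        $parser->parse( $html );
        $parser->eof;

        return map { "$_" } @assets;    # stringify the URI objects
    }
    ```

    Note that this picks up every <link> element, not just stylesheets, so you may still want to check the rel attribute if that matters.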

      Hi,

      BTW, are you aware in your documents you have:
      my $foo = HTML::Miner->new ( CURRENT_URL => 'www.perl.org' , CURRENT_URL_HTML => $html );
      Surely it should be $html_miner, not $foo ? (I was getting errors under "strict")

      :)

      Cheers

      Andy

        My bad!! Will fix in next update.

        Maybe I should add JS and CSS Extraction as well in the next update ...

        Thank You!

      Hi,

      I'm just having a look at the "relative" URL stuff, and can't seem to get it working right :(

      I'm trying:
      sub get_js {

          my $tmp = shift;

          my $self;
          my $url;
          my $html;
          my @result_arr;

          my $user_agent = "Html_Miner/0.01";
          my $timeout    = 60;
          my $domain;

          ## First extract all required information.
          if( UNIVERSAL::isa( $tmp, 'HTML::Miner' ) ) {
              $self   = $tmp;
              $url    = $self->{ CURRENT_URL };
              $html   = $self->{ CURRENT_URL_HTML };
              $domain = $self->{ _BASE_DOMAIN };
          } else {
              $url = $tmp;

              ## Check for validity of url!
              my ( $tmp, $protocol, $domain, $uri ) = _convert_to_valid_url( $url );
              $url = $tmp;

              my @params = @_;
              my $html_has_been_passed = @params;

              if( $html_has_been_passed ) {
                  $html = shift;
              } else {
                  ## Need to retrieve html
                  eval {
                      require LWP::UserAgent;
                      require HTTP::Request;
                  };
                  croak( "LWP::UserAgent and HTTP::Request are required if the url is to be fetched!" ) if( $@ );
                  $html = _get_url_html( $url, $user_agent, $timeout );
              } ## HTML Not passed
          } ## Not called on Object.

          while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ) {
              my $url = $1;
              if( $url !~ /^https?:\/\// ) {
                  $url = HTML::Miner::get_absolute_url( $url );
              }
              push( @result_arr, "$url.js" );
          }

          return \@result_arr;
      }
      ..with this being the bit in question:
      if ($url !~ /^https?:\/\//) { $url = HTML::Miner::get_absolute_url($url); }
      ..but I keep getting this error:
      A fatal error has occured: URL - http:///dev/static/utils/ - Malformed! Sorry I tried to fix it but could not! at /var/home/linkssql/ultradev.com/cgi-bin/dev/admin/Plugins/CDN.pm line 59
      Please enable debugging in setup for more details.


      Any suggestions as to what I'm doing wrong? =)

      TIA!

      Andy

        It turns out that get_absolute_url takes two arguments, the first is the page the relative URL was found on and the second is the ( possibly ) relative URL.

        This should work:
        while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ) {
            my $js_url = $1;
            if( $js_url !~ /^https?:\/\// ) {
                $js_url = HTML::Miner::get_absolute_url( $url, $js_url );
            }
            push( @result_arr, "$js_url.js" );
        }

        return \@result_arr;

        I just posted HERE about the HTML::Miner V0.05 that I uploaded, which has the option to pull out CSS and JS. It also handles relative URLs.

        Finally it may not matter here but if there is a non-css <link /> or a file like 'blah.js?something' then this kind of RegEx might fail, I used:

        $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis
        $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis
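
        A small self-contained sketch of how those two patterns could be wrapped up, capturing the URL (group 2) from each match; extract_css_js is just an illustrative name, and the patterns are taken verbatim from above:

        ```perl
        use strict;
        use warnings;

        # Pull .css and .js URLs out of an HTML string, including
        # ones with query strings such as 'blah.js?something'.
        sub extract_css_js {
            my ( $html ) = @_;
            my ( @css, @js );

            while ( $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis ) {
                push @css, $2;
            }
            while ( $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis ) {
                push @js, $2;
            }

            return ( \@css, \@js );
        }
        ```

        Any relative URLs in the results would still need to be absolutized, e.g. with get_absolute_url as discussed above.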

      Hi,

      Thanks - will check that module out. Looks very promising :)

      Cheers

      Andy
Re: Extract CSS + JS + Image URLs from a HTML page?
by wjw (Priest) on Jan 27, 2011 at 20:07 UTC
    Perhaps this will help get you started... Here
    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion
      Hi,

      Thanks for the reply wjw, but unfortunately people don't always keep their meta-stuff in the <head></head> part, so it may end up missing bits :(

      Cheers

      Andy
Re: Extract CSS + JS + Image URLs from a HTML page?
by tmharish (Friar) on Jan 28, 2011 at 19:02 UTC

    Hi Andy

    Turned out I had some time today, just uploaded HTML::Miner V0.05 which includes functionality to extract JS and CSS.

    It might take a couple of hours to turn up on CPAN mirrors.

    Looks like the misspelling of 'Relative' will have to wait for another revision :-(