Re^2: Extract CSS + JS + Image URLs from a HTML page?

Hi,

I'm just having a look at the "relative" URL stuff, and can't seem to get it working right :(

I'm trying:

sub get_js { 

    my $tmp = shift  ;

    my $self         ;
    my $url          ;
    my $html         ;

    my @result_arr   ;

    my $user_agent = "Html_Miner/0.01" ;
    my $timeout    = 60                ; 

    my $domain       ;
    
    ## First extract all required information.

    if( UNIVERSAL::isa( $tmp, 'HTML::Miner' )  ) { 

    $self = $tmp                        ;

    $url     =  $self->{ CURRENT_URL      } ;
    $html    =  $self->{ CURRENT_URL_HTML } ;
    $domain  =  $self->{ _BASE_DOMAIN     } ;

    } else { 
    
    $url = $tmp                         ;

    ## Check for validity of url! 
    my ( $tmp, $protocol, $domain, $uri ) =  
        _convert_to_valid_url( $url )   ;
    $url = $tmp                         ;

    my @params               = @_       ;
    my $html_has_been_passed = @params  ;

    
    if( $html_has_been_passed ) { 
        $html = shift                   ;
    } else { 

        ## Need to retrieve html 
    
        eval { 
        require LWP::UserAgent      ;
        require HTTP::Request       ;
        }; 
        croak( "LWP::UserAgent and HTTP::Request are required if the url is to be fetched!" ) 
        if( $@ );


        $html = _get_url_html( $url, $user_agent, $timeout )   ;
        
    } ## HTML Not passed


    }     ## Not called on Object.

    while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ){
        
        my $url = $1;

        if ($url !~ /^https?:\/\//) {
            $url = HTML::Miner::get_absolute_url($url);
        }

        push( @result_arr, "$url.js" );
    }

    return \@result_arr;

}

..with this being the bit in question:

        if ($url !~ /^https?:\/\//) {
            $url = HTML::Miner::get_absolute_url($url);
        }

..but I keep getting this error:

A fatal error has occured:

    URL - http:///dev/static/utils/ - Malformed! Sorry I tried to fix it but could not!
     at /var/home/linkssql/ultradev.com/cgi-bin/dev/admin/Plugins/CDN.pm line 59

Please enable debugging in setup for more details.

Any suggestions as to what I'm doing wrong? =)

TIA!

Andy


Re^3: Extract CSS + JS + Image URLs from a HTML page? by tmharish (Friar) on Jan 28, 2011 at 19:15 UTC
It turns out that get_absolute_url takes two arguments: the first is the page the relative URL was found on, and the second is the (possibly) relative URL. This should work:

    while( $html =~ m/\<script .*? src=\"(.+?)\Q.js"><\/script>\E/gis ){
        my $js_url = $1;
        if ($js_url !~ /^https?:\/\//) {
            $js_url = HTML::Miner::get_absolute_url($url, $js_url);
        }
        push( @result_arr, "$js_url.js" );
    }

    return \@result_arr;

I just posted HERE about HTML::Miner V0.05, which I uploaded and which has the option to pull out CSS and JS. It also handles relative URLs.

Finally, it may not matter here, but if there is a non-CSS <link /> or a file like 'blah.js?something', then this kind of regex might fail. I used:

    $html =~ m/(<link [^<]*?href=\"([^\"]+?\.css[^"]*?)\")/gis
    $html =~ m/(<script [^<]*?src=\"([^\"]+?\.js[^"]*?)\")/gis
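For anyone who wants to see the two points in this reply working together without installing anything, here is a minimal sketch in core Perl. It assumes a hypothetical helper, to_absolute, that takes the page URL first and the (possibly) relative link second, mirroring the argument order described above. It is a simplified stand-in and not HTML::Miner's own code: it only handles absolute, root-relative and same-directory links. The extract_js wrapper, the example.com URLs and the sample HTML are likewise made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for a two-argument resolver.
# NOT HTML::Miner's implementation: it ignores "../" segments,
# protocol-relative "//host/..." links and query-only links.
sub to_absolute {
    my ( $page_url, $link ) = @_;

    return $link if $link =~ m{^https?://}i;      # already absolute

    # Split the page URL into "scheme://host" and its path.
    my ( $origin, $path ) = $page_url =~ m{^(https?://[^/?#]+)([^?#]*)}i
        or die "Cannot parse page URL: $page_url";

    return $origin . $link if $link =~ m{^/};     # root-relative link

    $path =~ s{[^/]*$}{};                         # keep the directory part only
    $path = '/' if $path eq '';
    return $origin . $path . $link;               # same-directory link
}

# Collect script URLs with a pattern that tolerates extra attributes
# before src and a query string after ".js".
sub extract_js {
    my ( $page_url, $html ) = @_;
    my @found;
    while ( $html =~ m/<script [^<]*?src="([^"]+?\.js[^"]*?)"/gis ) {
        push @found, to_absolute( $page_url, $1 );
    }
    return \@found;
}

# Made-up page URL and markup, purely for demonstration.
my $page = 'http://www.example.com/dev/index.html';
my $html = <<'HTML';
<script type="text/javascript" src="/dev/static/utils/utils.js"></script>
<script src="menu.js?v=2"></script>
<script src="http://cdn.example.com/lib.js"></script>
HTML

print "$_\n" for @{ extract_js( $page, $html ) };
```

With the sample input above, this should print http://www.example.com/dev/static/utils/utils.js, http://www.example.com/dev/menu.js?v=2 and http://cdn.example.com/lib.js, one per line. Passing only the relative link, as in the original call, leaves the resolver with no host to build on, which is consistent with the "http:///dev/static/utils/" in the error message. For real pages, a proper resolver (HTML::Miner's own, or the URI module from CPAN) is a safer choice than this sketch.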
Re^4: Extract CSS + JS + Image URLs from a HTML page? by ultranerds (Hermit) on Jan 31, 2011 at 16:51 UTC
Hi,

Thanks - will have a play with that now :)

Cheers!

Andy