Trace On has asked for the wisdom of the Perl Monks concerning the following question:

Hi wiseguys,

Can I (or better: how can I) use a regex with Web::Scraper?

My HTML looks like this:

# <div class="ereignis " style="55;" data-type="link" data-content="/ajax/ereignis/185" rel="tooltip">
#   <span class="point">
#     &nbsp;
#   </span>
# </div>

My scraper looks (so far) like this:

my $scraper = scraper {
    "styles[]" => scraper {
        process 'div[data-content[contains("/ajax/ereignis")]]', "styles[]" => scraper {
            process 'span', "ereignis" => '@class';
            process 'div[style]', "zeit" => '@style';
        }
};
my $res = $scraper->scrape($html);
print Dumper $res;

The "contains"-part is wrong... What would work is:

div[data-content="/ajax/ereignis/185"]

But that would only give me a single "ereignis" and not all.

I am grateful for any ideas!

Replies are listed 'Best First'.
Re: web::scraper and regex
by Corion (Patriarch) on Sep 07, 2015 at 08:23 UTC

    Web::Scraper, or rather the underlying HTML::Selector::XPath, doesn't understand XPath regular expressions. They were introduced in XPath 2.0, which it doesn't support. The correct syntax for a contains() query would be:

    div[contains(@data-content, "/ajax/ereignis")]

    If that helps you already then that's it, otherwise you'll have to add some postprocessing in Perl.
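A minimal sketch of how the contains() query and the Perl post-processing could fit together in Web::Scraper. The sample HTML, the key names (events, url, zeit) and the numeric-id regex are made up for illustration. As far as I know, Web::Scraper only treats a selector as XPath when it starts with "/", so the expression is written with a leading "//" here; treat that as an assumption to verify against your installed version.

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up sample input modelled on the markup in the question.
my $html = <<'HTML';
<body>
<div class="ereignis" style="55;" data-type="link" data-content="/ajax/ereignis/185"><span class="point">&nbsp;</span></div>
<div class="ereignis" style="70;" data-type="link" data-content="/ajax/ereignis/4"><span class="point">&nbsp;</span></div>
<div class="ereignis" style="90;" data-type="link" data-content="/ajax/ereignis/archiv"><span class="point">&nbsp;</span></div>
<div class="other" style="10;" data-type="link" data-content="/ajax/termin/7"></div>
</body>
HTML

my $scraper = scraper {
    # Leading "//" so the selector is handled as XPath rather than CSS.
    # XPath 1.0 contains() does a plain substring match on the attribute.
    process '//div[contains(@data-content, "/ajax/ereignis")]',
        'events[]' => { url => '@data-content', zeit => '@style' };
};

my $res = $scraper->scrape(\$html);

# Post-processing in Perl: a real regex narrows the substring matches
# down to entries whose path ends in a numeric id.
my @numeric = grep { $_->{url} =~ m{^/ajax/ereignis/\d+$} }
              @{ $res->{events} || [] };

printf "%s (style %s)\n", $_->{url}, $_->{zeit} for @numeric;
```

The XPath step collects every div whose data-content contains the substring, and the grep then applies the stricter pattern that XPath 1.0 cannot express.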

        I'm sorry I was unclear - I didn't expect regular expressions but most of the time I'm content with matching substrings.

        I have to reinvestigate how to use Perl regular expressions in HTML::Selector::XPath and how these could be passed on downwards to HTML::TreeBuilder::XPath, thanks!

Re: web::scraper and regex
by Anonymous Monk on Sep 07, 2015 at 07:56 UTC

    Hi wiseguys, can I (better: how can I) use regex with web::scraper?

    What do the docs say?

    The "contains"-part is wrong... What would work is: But that would only give me a single "ereignis" and not all.

    What? What is your input?

      OK, so given this input, matching on an attribute that contains a substring seems to work :)

      $ cat 2
      <body>
      <div data-content="/ajax/ereignis/4" data-type="link"> 4</div>
      <div data-content="/ajax/ereignis/185" data-type="link"> 185</div>
      <div data-content="wowee" data-type="link"> wowee</div>
      </body>
      $ xmllint.exe --xpath " //div[ contains( @data-content, '/ereignis' ) ] " 2
      <div data-content="/ajax/ereignis/4" data-type="link"> 4</div><div data-content="/ajax/ereignis/185" data-type="link"> 185</div>
      $ xmllint.exe --xpath " //div[ @data-content ] " 2
      <div data-content="/ajax/ereignis/4" data-type="link"> 4</div><div data-content="/ajax/ereignis/185" data-type="link"> 185</div><div data-content="wowee" data-type="link"> wowee</div>
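For comparison, the same two queries can probably be run from Perl with HTML::TreeBuilder::XPath, the tree library mentioned earlier in the thread. This is only a sketch: it reuses the three-div test file from the xmllint session as an inline string, and the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Same test input as the xmllint session, inlined for convenience.
my $html = <<'HTML';
<body>
<div data-content="/ajax/ereignis/4" data-type="link"> 4</div>
<div data-content="/ajax/ereignis/185" data-type="link"> 185</div>
<div data-content="wowee" data-type="link"> wowee</div>
</body>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Divs whose data-content attribute contains the substring.
my @matching = map { $_->attr('data-content') }
    $tree->findnodes(q{//div[contains(@data-content, '/ereignis')]});

# Divs that have a data-content attribute at all.
my @all = map { $_->attr('data-content') }
    $tree->findnodes(q{//div[@data-content]});

print "substring match:  @matching\n";
print "any data-content: @all\n";

# Free the tree explicitly, as recommended for HTML::TreeBuilder objects.
$tree->delete;
```

The first query should return the two "ereignis" divs and the second all three, mirroring the xmllint output.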