web::scraper and regex

Trace On has asked for the wisdom of the Perl Monks concerning the following question:

Hi wiseguys,

can I (better: how can I) use regex with web::scraper?

My HTML looks like this:

#    <div class="ereignis " style="55;" data-type="link" data-content="/ajax/ereignis/185" rel="tooltip">
#        <span class="point">
#            &nbsp;
#        </span>
#    </div>

My scraper looks like this so far:
my $scraper = scraper {
"styles[]" => scraper {
    process 'div[data-content[contains("/ajax/ereignis")]]', "styles[]" => scraper {
        process 'span', "ereignis" => '@class';
        process 'div[style]', "zeit" => '@style';
    }
};
my $res = $scraper->scrape($html);
print Dumper $res;

The "contains"-part is wrong... What would work is:

div[data-content="/ajax/ereignis/185"]

But that would only give me a single "ereignis" and not all.

I am grateful for any ideas!


Replies are listed 'Best First'.
Re: web::scraper and regex by Corion (Patriarch) on Sep 07, 2015 at 08:23 UTC
Web::Scraper, or rather the underlying HTML::Selector::XPath, doesn't understand XPath regular expressions. They came in with XPath version 2.0, which it doesn't support. The correct syntax for a `contains()` query would be:

div[contains(@data-content, "/ajax/ereignis")]

If that already helps, then that's it; otherwise you'll have to add some postprocessing in Perl.
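As a minimal sketch of both suggestions (substring match in the selector, then regex postprocessing in Perl), something like the following should work. The sample HTML, the key names `zeit` and `url`, and the `\d+` pattern are made up for illustration. The expression is written with a leading `//` on the assumption that Web::Scraper passes selectors starting with `/` through as XPath and treats everything else as a CSS selector.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample input, modelled on the markup in the question.
my $html = <<'HTML';
<body>
<div class="ereignis" style="55;" data-type="link" data-content="/ajax/ereignis/4"><span class="point">&nbsp;</span></div>
<div class="ereignis" style="70;" data-type="link" data-content="/ajax/ereignis/185"><span class="point">&nbsp;</span></div>
<div class="ereignis" style="80;" data-type="link" data-content="/ajax/ereignis/about"><span class="point">&nbsp;</span></div>
<div class="other" style="90;" data-type="link" data-content="wowee"></div>
</body>
HTML

my $scraper = scraper {
    # XPath 1.0 contains() does a plain substring match on the attribute.
    # The hash ref collects several attributes of each matched div.
    process '//div[contains(@data-content, "/ajax/ereignis")]',
        'styles[]' => { zeit => '@style', url => '@data-content' };
};

my $res = $scraper->scrape($html);

# Postprocessing in Perl: keep only entries whose URL ends in a numeric id.
my @hits = grep { $_->{url} =~ m{^/ajax/ereignis/\d+$} } @{ $res->{styles} || [] };

print "$_->{url} => $_->{zeit}\n" for @hits;
```

If a substring match is all that is needed, the CSS attribute selector `div[data-content*="/ajax/ereignis"]` may be an alternative, provided the installed HTML::Selector::XPath supports the `*=` operator.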
Re^2: web::scraper and regex by Anonymous Monk on Sep 07, 2015 at 08:34 UTC
"Web::Scraper, or rather the underlying HTML::Selector::XPath, doesn't understand XPath regular expressions. They came in with XPath version 2.0, which it doesn't support. The correct syntax for a contains() query would be:"

:) FWIW/AFAIK, contains() (http://www.w3.org/TR/xpath/#function-contains) doesn't take a regex, it only takes strings, but I could be reading that wrong.

OTOH, Perl regexes are supported; see "HTML::TreeBuilder::XPath and regular expressions".
Re^3: web::scraper and regex by Corion (Patriarch) on Sep 07, 2015 at 08:39 UTC
I'm sorry I was unclear - I didn't expect regular expressions but most of the time I'm content with matching substrings. I have to reinvestigate how to use Perl regular expressions in HTML::Selector::XPath and how these could be passed on downwards to HTML::TreeBuilder::XPath, thanks!
Re: web::scraper and regex by Anonymous Monk on Sep 07, 2015 at 07:56 UTC
"Hi wiseguys, can I (better: how can I) use regex with web::scraper?"

What do the docs say?

"The "contains"-part is wrong... What would work is: But that would only give me a single "ereignis" and not all."

What? What is your input?
Re^2: web::scraper and regex ( xpath ) by Anonymous Monk on Sep 07, 2015 at 08:19 UTC
Ok so given input of matching on an attribute that contains a substring, seems to work :)

$ cat 2
<body>
<div data-content="/ajax/ereignis/4" data-type="link"> 4</div>
<div data-content="/ajax/ereignis/185" data-type="link"> 185</div>
<div data-content="wowee" data-type="link"> wowee</div>
</body>

$ xmllint.exe --xpath " //div[ contains( @data-content, '/ereignis' ) ] " 2
<div data-content="/ajax/ereignis/4" data-type="link"> 4</div><div data-content="/ajax/ereignis/185" data-type="link"> 185</div>

$ xmllint.exe --xpath " //div[ @data-content ] " 2
<div data-content="/ajax/ereignis/4" data-type="link"> 4</div><div data-content="/ajax/ereignis/185" data-type="link"> 185</div><div data-content="wowee" data-type="link"> wowee</div>
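The same check can probably be done from Perl without xmllint. This is only a sketch, assuming HTML::TreeBuilder::XPath is installed; the input is the same three-div sample as in the `cat 2` output, and the variable names are arbitrary.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Same sample input as the xmllint test.
my $html = <<'HTML';
<body>
<div data-content="/ajax/ereignis/4" data-type="link"> 4</div>
<div data-content="/ajax/ereignis/185" data-type="link"> 185</div>
<div data-content="wowee" data-type="link"> wowee</div>
</body>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Substring match on the attribute value, as in the first xmllint call.
my @matching = $tree->findnodes(q{//div[contains(@data-content, '/ereignis')]});

# Every div that has the attribute at all, as in the second xmllint call.
my @all = $tree->findnodes(q{//div[@data-content]});

my @urls = map { $_->attr('data-content') } @matching;
print "$_\n" for @urls;
printf "%d of %d divs matched\n", scalar @matching, scalar @all;

$tree->delete;    # free the parse tree
```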