vskatusa has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need some help with a regular expression match. For example, say I go to finance.yahoo.com, search for DODIX, and view the source. I can search for the following string and I ALWAYS get a match:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">
However, when I use this in code as follows, it does not work:
if (string =~ m/<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">/) { print "match\n"; } else print "no match"; }
I tried escaping the following characters:
( ) . -
and tried the following:
if (string =~ m/<span class="Trsdu\(0\.3s\) Fw\(b\) Fz\(36px\) Mb\(\-4px\) D\(ib\)" data\-reactid="33">/) { print "match\n"; } else print "no match"; }
I am sure I am doing something wrong, and I know REs sometimes get complicated. Can the monks educate me?

Replies are listed 'Best First'.
Re: Regular Expression Help
by marto (Cardinal) on Apr 24, 2020 at 17:26 UTC

    Doing such things with regex is no fun, and easily done wrong. Using Mojo::DOM:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;
    use feature 'say';

    my $html = '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">';
    my $dom  = Mojo::DOM->new( $html );

    # Select the first <span> whose class attribute starts with "Trsdu"
    say $dom->at('span[class^=Trsdu]')->attr->{'data-reactid'};

    Output:

    33
Re: Regular Expression Help
by haukex (Archbishop) on Apr 24, 2020 at 17:24 UTC

    When I fix the syntax errors (and write $string instead of string - always Use strict and warnings!), your second piece of code works for me, as in it prints "match". But you should not be Parsing HTML/XML with Regular Expressions! For one, the order of items in an HTML class attribute can change. Use something like Mojo::DOM, which supports selectors, instead.

    use warnings;
    use strict;
    use Mojo::DOM;

    my $html = <<'END_HTML';
    <span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">a</span>
    <span class="D(ix) Mb(-4px) Fz(36px) Fw(b) Trsdu(0.3s)" data-reactid="34">b</span>
    <span class="Mb(-4px) Fz(36px) D(ib) Fw(b) Trsdu(0.3s)" data-reactid="35">c</span>
    END_HTML

    my $dom = Mojo::DOM->new($html);
    my $spans = $dom->find('span[class~="D(ib)"]')->each( sub { print "==> $_ <==\n" } );

    __END__
    ==> <span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">a</span> <==
    ==> <span class="Mb(-4px) Fz(36px) D(ib) Fw(b) Trsdu(0.3s)" data-reactid="35">c</span> <==

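    As for the regex itself, here is a minimal sketch of the corrected literal match, assuming the page source is held in a variable named $string (the sample value below is made up for illustration). Instead of escaping each metacharacter by hand, \Q...\E can escape the whole literal in one go. It remains subject to the caveats above about attribute order.

```perl
use strict;
use warnings;

# Hypothetical sample input; in a real script this would be the fetched page source.
my $string = 'foo<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">bar';

# The literal text to look for, single-quoted so nothing is interpolated.
my $literal = '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">';

# \Q...\E escapes every regex metacharacter in $literal,
# so ( ) . and - are all matched literally.
my $found = $string =~ m/\Q$literal\E/ ? 1 : 0;

print $found ? "match\n" : "no match\n";
```

    Without \Q...\E, the parentheses in the pattern would be treated as capture groups and would not match the literal "(" and ")" in the text.
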
      Hi haukex, thank you for your insights. I used the DOM and I get the following results for https://finance.yahoo.com/quote/XOM?p=XOM&.tsrc=fin-srch
      ==> <span class="Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)" data-reactid="14">43.88</span> <==
      ==> <span class="Trsdu(0.3s) Fw(500) Fz(14px) C($positiveColor)" data-reactid="16">+0.15 (+0.34%)</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="68">43.73</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="73">43.59</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="78">43.79 x 900</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="83">43.80 x 900</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="96">19,803,615</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="101">38,560,761</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="109">185.631B</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="114">1.27</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="119">13.07</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="124">3.36</span> <==
      ==> <span class="Trsdu(0.3s) " data-reactid="143">47.13</span> <==
      My questions:
      sub { print "==> $_ <==\n" }
      How do I capture $_ in a variable? For example, if I do this:
      sub {push @myArray, $_; }
      It fails! I am trying to understand how I can capture the output in your sub into an array.
        sub {push @myArray, $_; } It fails! I am trying to understand how I can capture the output in your sub into an array.

        Pushing the results onto an array works fine for me, so you'd have to show an SSCCE of how it's failing for you. However, note that you don't need to create a new array - the return value of the ->find method is a Mojo::Collection object, which is basically just a fancy array reference. In other words, you can do my $c = $dom->find(...) and then @$c is the array of results, that is, an array of Mojo::DOM objects. Just a guess, but perhaps you need to look at the Mojo::DOM and Mojo::Collection docs to see what you can do with such objects, for example calling their ->all_text method to get their contents. (You can use Perl's grep, map, and other array operations on @$c, or you can use Mojo::Collection's methods such as $c->grep(...) and $c->map(...), which have a different API and return Mojo::Collection objects. ->each is basically the object's version of Perl's foreach.)
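        A minimal sketch of the options described above, assuming Mojolicious is installed. The HTML is made-up sample data modelled on the output earlier in the thread, and the variable names (@elements, @my_array, @texts) are only illustrative.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample markup, modelled on the output shown earlier in the thread.
my $html = <<'END_HTML';
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)" data-reactid="14">43.88</span>
<span class="Trsdu(0.3s) " data-reactid="68">43.73</span>
<span class="Other" data-reactid="99">ignored</span>
END_HTML

my $dom = Mojo::DOM->new($html);

# ->find returns a Mojo::Collection, which is a blessed array reference.
my $c = $dom->find('span[class~="Trsdu(0.3s)"]');

# Option 1: dereference the collection to get a plain array of Mojo::DOM objects.
my @elements = @$c;

# Option 2: push from inside ->each; the current element is available as $_.
my @my_array;
$c->each( sub { push @my_array, $_ } );

# Option 3: keep only the text content of each element.
my @texts = map { $_->all_text } @$c;

print scalar(@elements), " elements found\n";
print "$_\n" for @texts;
```

        Note that @my_array is declared outside the callback. If push appears to "fail" under strict, an undeclared array is one possible cause, though that is only a guess without seeing the failing code.
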

Re: Regular Expression Help
by davido (Cardinal) on Apr 24, 2020 at 18:57 UTC

    I would absolutely use a DOM parser for this such as Mojo::DOM, as has already been recommended. In fact, it's exactly the one I would use, though there are many alternatives also on CPAN. If I were looking at this with only regular expressions in my tool belt I would alter your regex as follows:

    if ($string =~ m/<\s*span\s+class\s*=\s*"Trsdu\(0\.3s\)\s+Fw\(b\)\s+Fz\(36px\)...............

    In other words, because HTML allows whitespace just about everywhere, you have to allow for whitespace to show up just about anywhere in your pattern. But you can't use this either, because the order of attributes in a span tag is not set in stone. 'class' and 'data-reactid' can come in any order, so you would also need to deal with that. By the time you've dealt with these realities, you've gotten a pretty good start at writing a really fragile and specialized tool, for a job that would be better served by a DOM parser.
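    To illustrate the point, here is a rough sketch of one way such a whitespace-tolerant pattern could be assembled. It is not a completion of the elided regex above: the token list, the fixed attribute order, and the sample strings are all assumptions made for this example.

```perl
use strict;
use warnings;

# Class tokens from the original question, escaped and joined with \s+
# so that any run of whitespace between them is accepted.
my @tokens   = ('Trsdu(0.3s)', 'Fw(b)', 'Fz(36px)', 'Mb(-4px)', 'D(ib)');
my $class_re = join '\s+', map { quotemeta($_) } @tokens;

# Hypothetical full pattern: optional whitespace around '=' and before '>'.
# It still assumes 'class' comes before 'data-reactid'.
my $span_re = qr/<\s*span\s+class\s*=\s*"$class_re"\s+data-reactid\s*=\s*"33"\s*>/;

# Made-up inputs: the original tag, a variant with extra whitespace,
# and a variant with the two attributes in the opposite order.
my @samples = (
    [ 'original'         => '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">' ],
    [ 'extra whitespace' => '<span  class = "Trsdu(0.3s)  Fw(b) Fz(36px)  Mb(-4px) D(ib)"   data-reactid="33" >' ],
    [ 'swapped order'    => '<span data-reactid="33" class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)">' ],
);

my %matched;
for my $sample (@samples) {
    my ($label, $html) = @$sample;
    $matched{$label} = $html =~ $span_re ? 1 : 0;
    printf "%-18s %s\n", $label, $matched{$label} ? 'match' : 'no match';
}
```

    The pattern copes with the extra whitespace but not with the swapped attribute order, which is the kind of fragility described above.
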


    Dave

Re: Regular Expression Help
by hippo (Archbishop) on Apr 24, 2020 at 17:28 UTC

    If you are trying to match an exact substring then there is usually little point in using regex for that. I would use index instead.

    use strict;
    use warnings;
    use Test::More tests => 1;

    my $corpus  = q{foo<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">bar};
    my $lookfor = q{<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">};
    my $res = index ($corpus, $lookfor) > -1 ? 1 : 0;
    ok $res, 'Substring found';

    Update: Replacing the ternary with a simple increment gives us more info in $res with the same testing ability. Adding a negative test too gives the better example:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $corpus  = q{foo<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">bar};
    my $lookfor = q{<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="33">};
    my $res = index ($corpus, $lookfor) + 1;
    ok $res, 'Substring found';

    $corpus = 'Something else';
    $res = index ($corpus, $lookfor) + 1;
    ok !$res, 'Substring not found';