Re: Parsing/regex help required

by Fletch (Bishop)
on Sep 27, 2021 at 13:21 UTC


in reply to Parsing/regex help required

First, if you have HTML, be sure you're using an HTML parser, not a regex, to extract your lines.

Presuming the numbering isn't generated by, say, an <ol>, and you've actually pulled the text out of whatever nodes hold it (using, say, HTML::TreeBuilder or Mojo::DOM), then you could use something like:

my( $num, $text1, $text2 ) = $line_from_html =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;

Edit: Tweaked.
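For what it's worth, here is a rough sketch of the parse-first, regex-second approach described above. It assumes Mojolicious is installed (for Mojo::DOM), and the sample markup and the use of <p> elements are made up for illustration; the real page structure isn't shown in this thread.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;    # not core Perl; assumes Mojolicious is installed

# Hypothetical sample markup; the real page structure is an assumption.
my $html = <<'HTML';
<div>
  <p>1. First title - first description</p>
  <p>2. Second title - second description</p>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Let the parser pull the text of every <p>, then split each line with the regex.
my @rows;
for my $line ( $dom->find('p')->map('all_text')->each ) {
    my ( $num, $text1, $text2 ) =
      $line =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x
      or next;    # skip lines that don't fit the pattern
    push @rows, [ $num, $text1, $text2 ];
}

say join '|', @$_ for @rows;
```

The regex only ever sees plain text that the parser has already extracted, so changes to tags or attributes shouldn't break the pattern.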

The cake is a lie.
The cake is a lie.
The cake is a lie.

Re^2: Parsing/regex help required
by Anonymous Monk on Sep 27, 2021 at 13:49 UTC
    Each paragraph's text is captured using mojo->all_text, so that's all good. Running that code:
    my $entry = "123. The Quick brown fox – jumped over";
    my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
    say "$num|$text1|$text2";
    gives
    Use of uninitialized value $num in concatenation (.) or string at ./test.pl line 10.
    Use of uninitialized value $text1 in concatenation (.) or string at ./test.pl line 10.
    Use of uninitialized value $text2 in concatenation (.) or string at ./test.pl line 10.
    ||

      The problem is that your dash is a Unicode en dash (U+2013), not a plain "-" character, so my naïve attempt doesn't match. I had to do some monkeying with Encode after cutting and pasting your sample (which I don't think you'd need with Mojo when you're actually fetching your real results), but then I was able to get this to match.

      ## I set $_ to your sample string cut-n-pasted, then ran it through decode
      DB<33> $_ = Encode::decode( q{UTF-8}, $_ )
      ## Afterwards this worked (U+2013 is EN DASH); if you're not interested in what
      ## the separator was you can of course change that bit to non-capturing
      DB<38> x m{ ^ (\d+) \. \s+ (.*?) \s+(-|\N{EN DASH}|\N{EM DASH})\s+ (.*?) $}x
      0  123
      1  'The Quick brown fox'
      2  '\x{2013}'
      3  'jumped over'
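The same idea as a standalone script, in case it helps. This is only a sketch: split_entry is a hypothetical helper name, and instead of the \N{...} alternation it uses a character class with the code points spelled out (U+2013 EN DASH, U+2014 EM DASH), which should behave the same for this input. The sample is written as raw UTF-8 bytes to imitate a cut-and-paste, which is an assumption about how the string arrived.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(decode);

# split_entry is a hypothetical helper name, not something from the thread.
# It accepts "-", EN DASH (U+2013), or EM DASH (U+2014) as the separator and
# returns ($num, $text1, $text2), or an empty list when nothing matches.
sub split_entry {
    my ($entry) = @_;
    return $entry =~ m{^ (\d+) \. \s+ (.*?) \s+ [-\x{2013}\x{2014}] \s+ (.*?) $}x;
}

# The sample as raw UTF-8 bytes, as it might arrive from a file or a paste.
my $bytes = "123. The Quick brown fox \xE2\x80\x93 jumped over";

# Undecoded, the en dash is three separate bytes, so the pattern fails.
my @raw = split_entry($bytes);

# Decoded, it is the single character U+2013 and the pattern matches.
my @decoded = split_entry( decode( 'UTF-8', $bytes ) );

say @raw ? 'raw: matched' : 'raw: no match';
say join '|', @decoded;
```

The separator is left uncaptured here, so only the number and the two text parts come back.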

      This is what I get:

      Win8 Strawberry 5.30.3.1 (64) Mon 09/27/2021 15:56:45
      C:\@Work\Perl\monks
      >perl -Mstrict -Mwarnings -Mfeature=say
      my $entry = "123. The Quick brown fox - jumped over";
      my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
      say "$num|$text1|$text2";
      ^Z
      123|The Quick brown fox|jumped over
      Are you sure the code you posted is really the code you're running?


      Give a man a fish:  <%-{-{-{-<
