*2 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks! How to use a line of regular code to get this matter?
use warnings;
use strict;

my $t = <<'EOF';
<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
    <script src="CssScriptLoader.js"></script>
    <script src="XZClass.js"></script>
</head>
<div id="224">
    <p>aaa</p>
    <p>aaa</p>
    <p>axxxdsfosdaa</p>
    <p>aaa</p>
</div>
<div id="724">
    <p>aaa22</p>
    <p>22</p>
    <p>22</p>
    <p>aaa22</p>
    <p>aaa22</p>
    <p>aafsdfsdfa22</p>
</div>
<div id="284">
    <p>aaa33</p>
    <p>aaa33</p>
    <p>aaa33sdfsdfaom</p>
    <p>aaa33</p>
    <p>aaa33</p>
    <p>aaa33</p>
</div>
</html>
EOF

if ($t =~ /<div id="724">(.*?)<\/div>/sg) {
    for my $m ($1 =~ /<p>(.+?)<\/p>/g) {
        print $m, "\n"
    }
}
else {
    print 'match fail!', "\n"
}

Replies are listed 'Best First'.
Re: A line of code matches the question
by SuicideJunkie (Vicar) on Aug 10, 2017 at 15:07 UTC

    Presumably you mean a "regular expression"?

    The most basic thing is that you probably want multiline mode, with /m.
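
    For reference, here is a small sketch of what the modifiers involved actually change: /m only affects where ^ and $ anchor, while /s lets "." match a newline (the code in the question already uses /s). The sample string and variable names are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical three-line sample, not the document from the question.
    my $text = "<div>\n<p>one</p>\n</div>\n";

    # Without /s, "." stops at a newline, so a match spanning lines fails.
    my $plain  = $text =~ /<div>(.*?)<\/div>/  ? 'match' : 'no match';

    # /m alone does not help: it only changes ^ and $, not ".".
    my $multi  = $text =~ /<div>(.*?)<\/div>/m ? 'match' : 'no match';

    # With /s, "." also matches "\n", so the whole element is captured.
    my $dotall = $text =~ /<div>(.*?)<\/div>/s ? 'match' : 'no match';

    # What /m is for: ^ now anchors at the start of every line.
    my @line_starts = $text =~ /^(<\/?\w+>)/mg;

    print "no modifier: $plain\n";
    print "with /m:     $multi\n";
    print "with /s:     $dotall\n";
    print "line starts: @line_starts\n";
    ```

    So for a pattern that has to cross line breaks, /s is the modifier that matters; /m becomes useful once the pattern relies on ^ or $.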

    And odds are you're going to want an XML parser to make your life easier when someone changes minor details in the file. I recommend XML::Tiny, which is great for most use cases and has no dependencies.

      Yes, I would like to match all the contents of <div id="724">. I have not used XML::Tiny, so I am not sure whether it can match this content. I cannot use XML, because I cannot change the content I am matching, so I want to try to get the match with a regular expression. I tried /m, but it does not seem to work. What should I do?
        You should NOT USE REGULAR EXPRESSIONS to parse HTML. Just don't do it. If you have a good-enough solution that uses two regular expressions, you should NOT try to combine it into one regular expression. Regexes are bad at parsing structured data like HTML, and rapidly become incomprehensible and unmaintainable when you try.
Re: A line of code matches the question
by Athanasius (Archbishop) on Aug 10, 2017 at 16:02 UTC

    Hello *2, and welcome to the Monastery!

    First, please note that the /g modifier on the first regex (the one in the if statement) does nothing, because the regex is called only once, in scalar context. If there were two or more <div id="724"> elements, only the first would be printed. You can fix this easily by changing the if into a while loop:

    while ($t =~ /<div id="724">(.*?)<\/div>/sg) {
        print "$_\n" for $1 =~ /<p>(.+?)<\/p>/g;
    }
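
    To make the difference concrete, here is a minimal sketch comparing the two forms on an input that contains two matching elements. The input string and the names @with_if and @with_while are assumptions for illustration only.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input with two <div id="724"> elements.
    my $html = join "\n",
        '<div id="724">', '<p>first</p>',  '</div>',
        '<div id="724">', '<p>second</p>', '</div>';

    # "if" evaluates the match once: only the first element is seen.
    my @with_if;
    if ($html =~ /<div id="724">(.*?)<\/div>/sg) {
        my $body = $1;                          # copy before matching again
        push @with_if, $body =~ /<p>(.+?)<\/p>/g;
    }

    # A successful scalar /g match leaves pos($html) set; reset it so the
    # loop below starts from the beginning of the string again.
    pos($html) = 0;

    # "while" repeats the scalar /g match until it fails: every element is seen.
    my @with_while;
    while ($html =~ /<div id="724">(.*?)<\/div>/sg) {
        my $body = $1;
        push @with_while, $body =~ /<p>(.+?)<\/p>/g;
    }

    print "if:    @with_if\n";
    print "while: @with_while\n";
    ```

    The explicit pos() reset is only needed here because both forms run against the same string in one script.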

    However, as SuicideJunkie says, you’ll be much better off using a dedicated XML parser. But note that your XML is not well-formed, because the <meta charset="UTF-8"> tag has no corresponding closing tag. When this is fixed, parsing is straightforward:

    use strict;
    use warnings;
    use XML::LibXML;

    my $t = <<'EOF';
    ...
    <meta charset="UTF-8" />
    ...
    EOF

    my $dom = XML::LibXML->load_xml(string => $t);
    print $_->to_literal . "\n" for $dom->findnodes('//div[@id="724"]/p');

    Output:

    1:59 >perl 1798_SoPW.pl
    aaa22
    22
    22
    aaa22
    aaa22
    aafsdfsdfa22

    1:59 >

    Hope that helps,

    Athanasius <°(((>< contra mundum    Iustus alius egestas vitae, eros Piratica,

      I have just done some testing and found that XML::LibXML is very strict about whether the HTML is well-formed; it does not seem to tolerate any mistakes. It does not seem quite suitable for this task, and maybe a regular expression is better suited to my current job. :)
      Wow, XML::LibXML is really powerful! It solved my problem, and your attention to detail is something I should learn from!
Re: A line of code matches the question
by hippo (Archbishop) on Aug 10, 2017 at 16:01 UTC

    When I run your code as it stands I get this:

    $ perl 1197154.pl
    aaa22
    22
    22
    aaa22
    aaa22
    aafsdfsdfa22

    How does this output differ from what you expect/want?

    If you want the same output but by some other means, then you would need to be a lot more specific about what those other means might be.
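
    If "some other means" simply means a single statement, one possibility is to keep both regexes but chain them in list context. This is only a sketch under that assumption; the shortened input and the name @paragraphs are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical, shortened input; the document in the question is longer.
    my $t = join "\n",
        '<div id="224">', '<p>aaa</p>',   '</div>',
        '<div id="724">', '<p>aaa22</p>', '<p>22</p>', '</div>',
        '<div id="284">', '<p>aaa33</p>', '</div>';

    # In list context the outer /g match returns every captured body of
    # <div id="724">; map then applies the inner match to each body in $_,
    # again in list context, and collects the <p> contents.
    my @paragraphs = map { /<p>(.+?)<\/p>/g } $t =~ /<div id="724">(.*?)<\/div>/sg;

    print "$_\n" for @paragraphs;
    ```

    This is still two regular expressions, just written as one statement, and it has the same fragility as any regex-based approach to HTML.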

      The output is fine, I'm sure. If you have other methods, I look forward to your reply. I like seeing different ways to solve the same problem! Thanks again to the monks!
Re: A line of code matches the question
by *2 (Novice) on Aug 10, 2017 at 17:49 UTC
    First, thank you to the monks! Using the XML package mentioned above and referring to the code Athanasius gave, I made some changes that I find more satisfactory, as follows:
    use strict;
    use warnings;
    use XML::LibXML;
    use Data::Dumper;

    my $t = <<'EOF';
    <!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
        <script src="CssScriptLoader.js"></script>
        <script src="XZClass.js"></script>
    </head>
    <div id="224">
        <p>aaa</p>
        <p>aaa</p>
        <p>axxxdsfosdaa</p>
        <p>aaa</p>
    </div>
    <div id="724">
        <p>aaa22</p>
        <p>22</p>
        <p>22</p>
        <p>aaa22</p>
        <p>aaa22</p>
        <p>aafsdfsdfa22</p>
    </div>
    <div id="284">
        <p>aaa33</p>
        <p>aaa33</p>
        <p>aaa33sdfsdfaom</p>
        <p>aaa33</p>
        <p>aaa33</p>
        <p>aaa33</p>
    </div>
    </html>
    EOF

    my $dom = XML::LibXML->load_html(
        string          => \$t,
        recover         => 1,
        suppress_errors => 1,
    );

    my $xpath = '//div[@id="724"]/p';
    print "$_\n" foreach $dom->findnodes($xpath)->to_literal_list;
Re: A line of code matches the question
by Anonymous Monk on Aug 10, 2017 at 14:24 UTC
    How to use a line of regular code to get this matter?
    Why?
      Sorry, English is not my mother tongue, so it is hard for me to describe. I will try to make the problem clearer.