Saket has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am new to regular expressions. I want to extract the text that appears between HTML tags. For example:

<title>
This is a very good site for regular Expressions.
Very helpful.
Thanks.
</title>
How can I extract the 3 lines of text using a regex? Does anyone have an answer to this?

Replies are listed 'Best First'.
Re: How to extract text present in 3 lines within the HTML tags
by Nikhil Jain (Monk) on May 17, 2011 at 07:24 UTC

    It is good practice to parse HTML with an HTML parser such as HTML::Parser.

    The same advice is given in perlfaq6 - How do I match XML, HTML, or other nasty, ugly things with a regex?

    If you really want a regex for it, then try:

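    my $str = "<title> This is a very good site for regular Expressions. Very helpful. Thanks. </title>";
    #print "$str\n";

    # /i ignores tag case; /s lets . match newlines, so a multi-line title also matches.
    # In list context the match returns the captured group, or nothing on failure.
    my ($matched_output) = $str =~ m#<title>(.*?)</title>#is;
    #print "\n$matched_output\n" if defined $matched_output;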
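    Since the question is about a title spread over 3 lines, here is a minimal sketch of the same regex idea applied to multi-line input. The helper name extract_title is hypothetical, the sample string is an assumed reconstruction of the poster's example, and a regex like this only copes with simple, well-behaved markup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between <title> and </title>,
# or undef if no title element is found.
sub extract_title {
    my ($html) = @_;

    # /s lets . match newlines, /i ignores tag case,
    # and .*? stops at the first closing </title>.
    if ( $html =~ m{<title\b[^>]*>(.*?)</title>}is ) {
        my $text = $1;
        $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        return $text;
    }
    return;
}

# Assumed sample input: the title text on three separate lines.
my $html = "<title>\nThis is a very good site for regular Expressions.\nVery helpful.\nThanks.\n</title>";

my $title = extract_title($html);
print defined $title ? "$title\n" : "no title found\n";
```

    The inner newlines are kept, so the result can be split on "\n" if the three lines are needed separately.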
Re: How to extract text present in 3 lines within the HTML tags
by moritz (Cardinal) on May 17, 2011 at 07:28 UTC
        And since that snippet is valid XML, you can use XML::Twig.

      moritz wrote: Let somebody else write the regexes for you:

      He kinda did, by asking perlmonks... :)

Re: How to extract text present in 3 lines within the HTML tags
by Anonymous Monk on May 17, 2011 at 07:13 UTC
Re: How to extract text present in 3 lines within the HTML tags
by ambrus (Abbot) on May 17, 2011 at 15:04 UTC

    Come on, just use a real full HTML parser.

    use warnings;
    use XML::Twig;

    our $doc = q(
    <title> This is a very good site for regular Expressions. Very helpful. Thanks. </title>
    <p> Some other text we don't want to extract.
    );

    my $twig = XML::Twig->new;
    $twig->parse_html($doc);    # parse as HTML rather than strict XML

    # Take the first <title> element and print its whitespace-trimmed text.
    my($title_elt) = $twig->findnodes("//title");
    my $title = $title_elt->trimmed_text;
    print "$title\n";
    __END__
Re: How to extract text present in 3 lines within the HTML tags
by wind (Priest) on May 17, 2011 at 15:58 UTC

    HTML::Parser comes with an example specifically for that: htitle

    However, I'd probably go with HTML::TreeBuilder::XPath:

    use HTML::TreeBuilder::XPath;
    use strict;
    use warnings;

    my $data = do {local $/; <DATA>};    # slurp the whole DATA section

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($data);
    $tree->eof;    # signal end of input so the tree is complete

    print $tree->findvalue('//title');

    __DATA__
    <html>
    <head>
    <title> This is a very good site for regular Expressions. Very helpful. Thanks. </title>
    </head>
    <body>
    <p>Hello world</p>
    </body>
    </html>
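
    For comparison, the plain HTML::Parser route can be sketched roughly as below. This is not the bundled htitle script, only a minimal illustration of the same event-handler idea; the helper name title_text and the sample input are assumptions.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect the text found inside <title> elements.
sub title_text {
    my ($html) = @_;
    my $in_title = 0;
    my $title    = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Track whether the parser is currently inside <title>.
        start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
        end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
        # Text may arrive in several chunks, so append each one.
        text_h  => [ sub { $title .= $_[0] if $in_title }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text

    $title =~ s/^\s+|\s+$//g;    # trim both ends
    $title =~ s/\s+/ /g;         # collapse newlines and runs of spaces
    return $title;
}

# Assumed sample input with the title spread over several lines.
print title_text("<html><head><title>\nVery helpful.\nThanks.\n</title></head></html>"), "\n";
```

    Unlike the regex, this leaves tag matching, case folding, and entity decoding to the parser.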