Which one is the better Regex?

Rufnex has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm looking for a regular expression to extract the title from an HTML document. I have the following:

/<title>([^<]*)/i
/(?i)<title>(.*?)<\/title>?/i

Which one is better, or do you have another one? Next, I have a regular expression to get rid of all HTML tags:

s/<[^>]+>//g;
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

Same question as above: which one is better, or do you have a better one for this topic ;o)?

Thanks a lot,
Rufnex


Re: Which one is the better Regex?
by broquaint (Abbot) on Feb 27, 2003 at 13:42 UTC

/me suggests an alternative solution

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new( *DATA );
while(my $t = $p->get_token()) {
  print $p->get_token()->return_text() and last
    if $t->is_start_tag('title');
}

__DATA__
<html>
 <head>
  <title>Nothing to see here</title>
 </head>
</html>

HTML::TokeParser::Simple

_________ broquaint

Re: Re: Which one is the better Regex?

by IlyaM (Parson) on Feb 27, 2003 at 15:25 UTC

/me suggests another alternative solution


use XML::LibXML;
use FileHandle;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_html_fh( \*DATA );
print $doc->findvalue('//title');

__DATA__
<html>
 <head>
  <title>Nothing to see here</title>
 </head>
</html>

--
Ilya Martynov, ilya@iponweb.net
CTO IPonWEB (UK) Ltd
Quality Perl Programming and Unix Support UK managed @ offshore prices - http://www.iponweb.net
Personal website - http://martynov.org

Re: Which one is the better Regex?
by Abigail-II (Bishop) on Feb 27, 2003 at 13:49 UTC

If we look at the first set, they are both wrong. Sure, in many cases, they will extract the title, but in some cases, they will not. For instance, in the first regex, you are assuming that any < starts a tag. This is not the case however. Furthermore, both regexes assume comments do not exist. Or CDATA marked sections. Note also that the latter one uses both (?i) and /i. One can be omitted.
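To make the two failure modes above concrete, here is a minimal sketch that runs both title regexes from the question against two inputs. The sample HTML strings and the helper name extract_title are made up for this illustration; they are not from the original post.

```perl
use strict;
use warnings;

# The two title regexes from the question, unchanged.
my @regexes = (
    qr/<title>([^<]*)/i,
    qr/(?i)<title>(.*?)<\/title>?/i,
);

# Hypothetical inputs, invented for this illustration.
my %samples = (
    'literal <'     => '<title>a < b</title>',
    'commented out' => '<!-- <title>Old</title> --><title>New</title>',
);

# Hypothetical helper: return the captured title, or undef on no match.
sub extract_title {
    my ($re, $html) = @_;
    return $html =~ $re ? $1 : undef;
}

for my $name (sort keys %samples) {
    for my $i (0 .. $#regexes) {
        my $got = extract_title($regexes[$i], $samples{$name});
        printf "%-14s regex %d: [%s]\n",
            $name, $i + 1, defined $got ? $got : 'no match';
    }
}
```

On the "literal <" input the first regex stops at the stray < and returns only "a ", while on the "commented out" input both regexes return "Old", the title inside the comment, rather than "New".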

As for the last set of regexes, both are so horribly wrong that talking about which one is better carries no meaning at all. It's like asking "What's better to eat with fries: yellow or Wednesday?".
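As a rough illustration of how the tag-stripping regexes go wrong, the sketch below applies both to two inputs. The sample strings and the names strip_simple and strip_quoted are invented for this example, and the second regex is assumed to be a substitution (written with a leading s), since that appears to be what the question intended.

```perl
use strict;
use warnings;

# First regex from the question: anything between < and the next >.
sub strip_simple {
    my ($html) = @_;
    $html =~ s/<[^>]+>//g;
    return $html;
}

# Second regex from the question (assumed to be s///): also skips
# over quoted attribute values that may contain >.
sub strip_quoted {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical inputs, invented for this illustration.
my $attr    = '<img alt="a > b" src="x.png">text';
my $comment = '<!-- a > b -->kept';

for my $html ($attr, $comment) {
    printf "input:  [%s]\n", $html;
    printf "simple: [%s]\n", strip_simple($html);
    printf "quoted: [%s]\n\n", strip_quoted($html);
}
```

The first regex breaks on a > inside a quoted attribute and leaves ` b" src="x.png">text` behind. The second handles that case but, like the first, treats the > inside the comment as the end of a tag, so both leave ` b -->kept`.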

As the FAQ says, if you want to extract elements, or remove tags, PARSE the HTML, don't use trivial regexes.

Abigail


Re: Re: Which one is the better Regex?

by Rufnex (Novice) on Feb 27, 2003 at 14:11 UTC

thx

Re: Which one is the better Regex?

by Abigail-II (Bishop) on Feb 27, 2003 at 14:30 UTC

Abigail

Re: Re: Re: Which one is the better Regex?

by PodMaster (Abbot) on Feb 28, 2003 at 07:45 UTC

#!/usr/bin/perl

use YAPE::HTML;
use Data::Dumper;
use warnings;
use strict;

my $content = "
    <html>
        <title>
            yes a title
        </title>
        <body>
            yes a body
        </body>
    </html>
";

my $parser = YAPE::HTML->new($content);
my $extor = $parser->extract( 'title' => []);

while (my $chunk = $extor->()) {
    print Dumper $chunk;
    print $/,'>>>>',$chunk->text()->[0]->string(),'<<<<',$/x5;
}
__END__

$VAR1 = bless( {
                 'TYPE' => 'tag',
                 'ATTR' => {},
                 'TAG' => 'title',
                 'TEXT' => [
                             bless( {
                                      'TYPE' => 'text',
                                      'TEXT' => '
            yes a title
        '
                                    }, 'YAPE::HTML::text' )
                           ],
                 'IMPLIED' => '',
                 'CLOSED' => 1
               }, 'YAPE::HTML::tag' );

>>>>
            yes a title
        <<<<

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Which one is the better Regex?
by dws (Chancellor) on Feb 27, 2003 at 19:40 UTC

The "pragmatic" answer may be different. 99.9+% of the time, extracting a title using a regex works just fine. It's not the bullet-proof way, but in my experience you'll see far fewer real problems than you'll see theoretical ones. If you're not sensitive to the risk, and want to avoid taking on the overhead of using a parser, something like this will do:

  if ( $html =~ m|<title>(.*?)</title>|is ) {
      $title = $1;
  } else {
      # note that you failed to find a title
  }

If you're extracting much more than the title, though, I'd go with a parser.
