Rufnex has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm looking for a regular expression to extract the title from an HTML document. I have the following:

/<title>([^<]*)/i
/(?i)<title>(.*?)<\/title>?/i

Which one is better, or do you have another one? Next, I have regular expressions to get rid of all HTML tags:

s/<[^>]+>//g;
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

Same as above: which one is better, or do you have the best one for this topic ;o)?

thx a lot
Rufnex

Re: Which one is the better Regex?
by broquaint (Abbot) on Feb 27, 2003 at 13:42 UTC
    /me suggests an alternative solution
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new( *DATA );
    while ( my $t = $p->get_token() ) {
        print $p->get_token()->return_text() and last
            if $t->is_start_tag('title');
    }
    __DATA__
    <html>
    <head>
    <title>Nothing to see here</title>
    </head>
    </html>

    See Ovid's HTML::TokeParser::Simple for more info on this groovy little module.
    HTH

    _________
    broquaint

      /me suggests another alternative solution
      use XML::LibXML;
      use FileHandle;

      my $parser = XML::LibXML->new;
      my $doc    = $parser->parse_html_fh( \*DATA );
      print $doc->findvalue('//title');
      __DATA__
      <html>
      <head>
      <title>Nothing to see here</title>
      </head>
      </html>

      XML::LibXML and XPath rock!

      --
      Ilya Martynov, ilya@iponweb.net
      CTO IPonWEB (UK) Ltd
      Quality Perl Programming and Unix Support UK managed @ offshore prices - http://www.iponweb.net
      Personal website - http://martynov.org

Re: Which one is the better Regex?
by Abigail-II (Bishop) on Feb 27, 2003 at 13:49 UTC
    It depends on how you define "better".

    If we look at the first set, they are both wrong. Sure, in many cases, they will extract the title, but in some cases, they will not. For instance, in the first regex, you are assuming that any < starts a tag. This is not the case however. Furthermore, both regexes assume comments do not exist. Or CDATA marked sections. Note also that the latter one uses both (?i) and /i. One can be omitted.
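
    To illustrate the comment case, here is a minimal sketch. The input document is made up for the example, and the second regex is written without the redundant (?i) and the stray "?" after the closing tag; everything else is plain core Perl.

```perl
use strict;
use warnings;

# Hypothetical document: a commented-out title precedes the real one.
my $html = join "\n",
    '<html><head>',
    '<!-- <title>Old title</title> -->',
    '<title>Real title</title>',
    '</head></html>';

# Both regexes from the question, applied to the same input.
my ($first)  = $html =~ /<title>([^<]*)/i;
my ($second) = $html =~ /<title>(.*?)<\/title>/i;

# Neither regex knows about comments, so both report the dead title.
print "first regex:  $first\n";     # first regex:  Old title
print "second regex: $second\n";    # second regex: Old title
```

    A parser would skip the comment and report "Real title" instead.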

    As for the last set of regexes, both are so horribly wrong, that talking about which one is better carries no meaning at all. It's like asking "what's better to eat with fries? Yellow or wednesday?".
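
    A small sketch of how the tag-stripping substitutions can go wrong. Both inputs are invented for the example, and these are only two of the possible failure cases, not a complete list.

```perl
use strict;
use warnings;

# Hypothetical input 1: a '>' inside a quoted attribute value.
my $link = '<a title="a > b" href="x.html">link</a>';
(my $stripped_link = $link) =~ s/<[^>]+>//g;

# The first substitution stops at the '>' inside the quotes and
# leaves half of the tag behind.
print "[$stripped_link]\n";    # [ b" href="x.html">link]

# Hypothetical input 2: plain text where '<' and '>' are not tags at all.
my $text = '1 < 2 and 3 > 2';
(my $simple = $text) =~ s/<[^>]+>//g;
(my $quoted = $text) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Both substitutions treat "< 2 and 3 >" as a tag and delete it.
print "[$simple]\n";           # [1  2]
print "[$quoted]\n";           # [1  2]
```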

    As the FAQ says, if you want to extract elements, or remove tags, PARSE the HTML, don't use trivial regexes.

    Abigail

      OK ... I have to write a better regex ;o) BTW, I'm a newbie to this topic. Do you have an example for both things?

      thx

        As I said, you have to PARSE the HTML text - you shouldn't attempt to solve it with a single regex.

        Abigail

        Check out YAPE::HTML -- it is pure Perl (i.e. regexes)
        #!/usr/bin/perl
        use YAPE::HTML;
        use Data::Dumper;
        use warnings;
        use strict;

        my $content = "
        <html>
        <title> yes a title </title>
        <body> yes a body </body>
        </html>
        ";

        my $parser = YAPE::HTML->new($content);
        my $extor  = $parser->extract( 'title' => [] );

        while ( my $chunk = $extor->() ) {
            print Dumper $chunk;
            print $/, '>>>>', $chunk->text()->[0]->string(), '<<<<', $/ x 5;
        }
        __END__
        $VAR1 = bless( {
            'TYPE' => 'tag',
            'ATTR' => {},
            'TAG' => 'title',
            'TEXT' => [
                bless( {
                    'TYPE' => 'text',
                    'TEXT' => ' yes a title '
                }, 'YAPE::HTML::text' )
            ],
            'IMPLIED' => '',
            'CLOSED' => 1
        }, 'YAPE::HTML::tag' );

        >>>> yes a title <<<<



Re: Which one is the better Regex?
by dws (Chancellor) on Feb 27, 2003 at 19:40 UTC
    As others have said above, the "right" answer to this is to parse the HTML. Unless the HTML is malformed, you'll be able to reliably extract titles.

    The "pragmatic" answer may be different. 99.9+% of the time, extracting a title using a regex works just fine. It's not the bullet-proof way, but in my experience you'll see far fewer real problems than you'll see theoretical ones. If you're not sensitive to the risk, and want to avoid taking on the overhead of using a parser,

    if ( $html =~ m|<title>(.*?)</title>|is ) {
        $title = $1;
    }
    else {
        # note that you failed to find a title
    }
    will work just fine. You may still need to strip leading or trailing whitespace, and convert any newlines to spaces.
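
    The whitespace cleanup mentioned above could look roughly like this. It is a sketch only: the sample document is invented, and collapsing every run of whitespace to a single space is one possible reading of "convert any newlines to spaces".

```perl
use strict;
use warnings;

# Hypothetical document with a title spread over several lines.
my $html = "<html><head><TITLE>\n  Nothing to\n  see here \n</TITLE></head></html>";

my $title;
if ( $html =~ m|<title>(.*?)</title>|is ) {
    $title = $1;
    $title =~ s/\s+/ /g;         # newlines and runs of whitespace become one space
    $title =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
}

print defined $title ? "[$title]\n" : "no title found\n";    # [Nothing to see here]
```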

    If you're extracting much more than the title, though, I'd go with a parser.