Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a variable with this info:
<title>stuff</title> <br>Header</br> <tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr> <br>Trailing Info
How would I create a regular expression to delete everything before the first <tr> and after the last </tr>? Essentially, I want to crop out all the HTML except the table part. After processing, the variable should look like:
<tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr>


Here's my try at it but I'm off:
$var =~ s/^.*<tr>//mg;
$var =~ s/<\/tr>.*$//mg;

Replies are listed 'Best First'.
Re: Removing leading and ending text?
by McDarren (Abbot) on Jan 29, 2006 at 04:49 UTC
    Try this:
    #!/usr/bin/perl -w
    use strict;

    my $string = <<"EOT";
    <title>stuff</title> <br>Header</br> <tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr> <br>Trailing Info
    EOT

    my ($wanted) = $string =~ /(<tr>.*<\/tr>)/s;
    print "$wanted\n";
    Which gives:
    <tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr>

    The key here is the "s" modifier in the pattern match, which causes the whole string to be treated as a single line. See perlre for more info.
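
    As a minimal sketch of what the "s" modifier changes, the snippet below runs the same pattern with and without it. The multi-line sample string and the variable names ($one_line, $all_rows) are made up for illustration and are not from the original post.

    ```perl
    use strict;
    use warnings;

    # Hypothetical multi-line input, for illustration only.
    my $string = "<br>Header</br>\n<tr> <td>1</td> </tr>\n<tr> <td>2</td> </tr>\n<br>Trailing Info\n";

    # Without /s, "." does not match a newline, so the match cannot leave
    # the line on which the first <tr> is found.
    my ($one_line) = $string =~ /(<tr>.*<\/tr>)/;

    # With /s, "." also matches newlines, so the greedy .* can run from the
    # first <tr> all the way to the last </tr>.
    my ($all_rows) = $string =~ /(<tr>.*<\/tr>)/s;

    print "without /s: $one_line\n";
    print "with /s:    $all_rows\n";
    ```

    Without the modifier only the first row is captured; with it, both rows (and the newline between them) are captured.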

    Update: I guess I should point out that if you happened to be parsing some HTML which had multiple tables, then this would fail, because the .* would eat up everything between the first <tr> in the first table and the final </tr> in the last table (but that is what you asked for). For anything other than basic parsing of HTML, you're much better off going with one of the many modules available.
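
    To make the multiple-table caveat concrete, here is a small sketch. The two-table input and the variable names ($html, $greedy) are hypothetical and chosen only to show the effect.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: two separate tables with unrelated text between them.
    my $html = "<table><tr><td>1</td></tr></table> <p>between</p> <table><tr><td>2</td></tr></table>";

    # The greedy .* spans from the first <tr> to the last </tr>, so the markup
    # between the two tables ends up inside the capture as well.
    my ($greedy) = $html =~ /(<tr>.*<\/tr>)/s;

    print "$greedy\n";
    ```

    The capture includes the closing </table>, the <p>between</p> text and the second <table> tag, which is rarely what you want once more than one table is involved.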

    Cheers,
    Darren :)

Re: Removing leading and ending text?
by wfsp (Abbot) on Jan 29, 2006 at 09:25 UTC
    I would echo McDarren's sentiment above. If you can't be absolutely sure what your HTML will look like (and, in my experience, you can't), it is usually best to use an HTML parser.

    This uses HTML::TokeParser::Simple.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Slurp everything after __DATA__ into one string.
    my $html = do { local $/; <DATA>; };
    my $p = HTML::TokeParser::Simple->new(\$html);

    my ($table, $start);
    while (my $t = $p->get_token){
        # Start collecting at the first <tr>.
        $start++ if $t->is_start_tag('tr');
        next unless $start;
        # Stop at the first <br> seen after collecting has begun.
        last if $t->is_start_tag('br');
        $table .= $t->as_is;
    }
    print "*$table*\n";

    __DATA__
    <title>stuff</title> <br>Header</br> <tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr> <br>Trailing Info
    Output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    *<tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr> *
    > Terminated with exit code 0.

Re: Removing leading and ending text?
by talexb (Chancellor) on Jan 29, 2006 at 04:47 UTC
      Here's my try at it but I'm off:

    How far off are you? Are all of the readers of this post going to have to copy and paste this chunk of HTML and code into an editor and run it?

    Give us the complete script, and the result you got, and then we'll be in a better position to help you out.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Removing leading and ending text?
by smokemachine (Hermit) on Jan 29, 2006 at 07:40 UTC
    $a="<title>stuff</title> <br>Header</br> <tr> <td>1</td> <td>2</td> </tr> <tr> <td>3</td> <td>4</td> </tr> <br>Trailing Info";
    print $a if $a=~s#\A.*?(<tr>.*<\/tr>).*?\Z#$1#smi;