Sharky_The_Dog has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to search a user-given string in an html file that is enclosed in TR tags on multiple lines. Every entry that is searchable has its own set of TR tags. So here's a simplified sample file:
<TR>
<p> FOO BAR </p>
<a href=www.yahoo.com>yahoo</a> </TR>
This format follows throughout the file, and only the line that says FOO BAR is searchable, but I want to return that line and the line after (and hypothetically, the line after that, and after that and so on...) when a search is confirmed. Here's my problem: I can find the line with the "hit" just fine, but for the life of me, I can't make perl notice the </TR> tag. Here's a sample of the code I am trying to make work:
my $trString = "</TR>";
while ( <DATAFILE> )
{
	chomp;
	if ( /$content/ )
	{
		$searchString .= $_;
		until ( /$trString/ )
		{
			chomp;
			$searchString .= $_;
		}
		print "<TR>$searchString";
	}
}
I have tried numerous things, messing with flags, etc, and I couldn't get it to work.. can someone help? Worth noting: I'm a newbie at perl and have programmed in java/c/basic my entire life, so please be patient if this looks like garbage to you. Thanks!

Replies are listed 'Best First'.
Re: File/String search...
by maverick (Curate) on Sep 07, 2000 at 01:31 UTC
    Unless you can guarantee that your HTML is going to be really consistent and simple, you're eventually going to run into headaches trying to parse it correctly. Consider that:
    <table> <tr><td>you're likely to see</td> <tr><td>html done like this</td> </table>
    I'd suggest checking out a module called HTML-Parser (http://www.cpan.org/modules/by-module/HTML/HTML-Parser-3.11.tar.gz )

    I'd also suggest building an index of key words and which files contain them if you're going to be doing a lot of searching. I had mentioned this once before for a similar type question. The node is Re: Search Algorithm
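A minimal sketch of the keyword index suggested above, assuming the documents are already in memory as name/text pairs. The helper name build_index and the sample file names are made up for illustration, and the tag stripping is deliberately crude:

```perl
use strict;
use warnings;

# build_index() is a hypothetical helper: it maps each lowercased word
# to the set of document names whose text contains that word.
sub build_index {
    my (%docs) = @_;                      # name => text
    my %index;
    while (my ($name, $text) = each %docs) {
        $text =~ s/<[^>]*>/ /g;           # crude tag stripping, good enough for a sketch
        $index{ lc($_) }{$name} = 1 for $text =~ /(\w+)/g;
    }
    return \%index;
}

my $index = build_index(
    'a.html' => '<TR><p> FOO BAR </p></TR>',
    'b.html' => '<TR><p> BAR BAZ </p></TR>',
);

# List the documents that contain the word "bar".
print join(', ', sort keys %{ $index->{bar} }), "\n";
```

A search then becomes a hash lookup instead of a scan of every file; only the files listed under the keyword need to be opened.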

    Hope this helps...

    /\/\averick

Re: File/String search...
by jreades (Friar) on Sep 07, 2000 at 03:47 UTC

    Before I get into anything else, have you considered that your search string uses double quotes and your '/' may need to be escaped?

    If the table method is the only one available to you then I'll outline a possible approach below, but first may I suggest that you dynamically generate the table used on the page (assuming that this is possible).

    Consider a file that looked like this:

    FOO\twww.yahoo.com/yahoo\tsomething else\tand another
    BAR\twww.altavista.com\tsomething else\

    Obviously the tab character is a physical tab and not the literal '\t'.

    Then your code can be something like:

    while (<FILE>) {
        my @line = split /\t/;
        # do_something...
    }

    This would allow you to format the pages as you liked (and pretty damned easily) and would minimize the opportunity for inconsistent coding...

    And now on to the approach...

    According to what you've said, and factoring in what others have pointed out by way of problems, the only thing we can 'reliably' count on is a /<tr/i starting our table row and, thus, a new search term. I'd suggest avoiding /<tr>/ as it's entirely possible that someone will add bgcolor or valign or some other piece of amusing code...

    So here's a shot based on what you already have. It does have the side effect of stripping out the <tr>/</tr> (assuming that these are on separate lines and none of the following occur):

    1. <tr><p>Some content</p> # will screw up because it skips to the next line
    2. <p>Some content</p></tr> # will also screw up because it doesn't take the last line

    Anyway...

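    open (FILE, "<$read") or die ("Couldn't open file to read: $!");
    while (my $line = <FILE>) {
        next unless ($line =~ /<tr/i);
        my $match = '';
        # Stop at the next <tr> or </tr>, or at end of file (without the
        # defined() check the loop never ends once <FILE> returns undef).
        while (defined(my $additional_lines = <FILE>)) {
            last if $additional_lines =~ /<[\/]{0,1}tr/i;
            $match .= $additional_lines;
        }
        print "MATCH: " . $match . "\n\n";
    }
    close FILE;
    exit 0;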

    The best way around the limitations of the previous code would be to undef $/ and slurp your file into a scalar as follows (this is just pseudo code as I couldn't get it working quite the way I wanted):

    undef $/;
    my $read = 'test.html';
    open (FILE, "<$read") or die ("Couldn't open file to read: $!");
    my $file = <FILE>;
    close FILE;
    my @possible_search_terms = split /<tr[^>]*?>/i, $file;
    foreach (@possible_search_terms) {
        next if $_ =~ /^\s*$/;
        print "WORKING: '" . $_ . "'\n";
    }

      Before I get into anything else, have you considered that your search string uses double quotes and your '/' may need to be escaped?

      You don't need to escape / inside double quotes. There are really very few characters that require escaping inside of (Perl) double quotes. Namely, \, @, $, and the delimiter character (which is " for "this" and . for qq.this., etc.).
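A small sketch of the quoting point above, using the </TR> string from the original question. The variable names are only for illustration:

```perl
use strict;
use warnings;

my $plain   = "</TR>";      # no backslash needed: / is not special in a string
my $escaped = "<\/TR>";     # harmless: \/ is just / inside double quotes
my $qq_dot  = qq.</TR>.;    # with qq.., the dot is the delimiter instead

print "same\n" if $plain eq $escaped && $plain eq $qq_dot;

# The slash only matters as a regex delimiter in the source text.
# Interpolating a variable that contains / into /.../ is fine,
# because Perl finds the closing delimiter before it interpolates.
print "found\n" if "yahoo</a> </TR>" =~ /$plain/;
```

So the "</TR>" string in the question is not what keeps the until loop from seeing the closing tag.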

      Having heard something like this twice in as many days, I felt it was important to comment on this.

              - tye (but my friends call me "Tye")
Re: File/String search...
by Adam (Vicar) on Sep 07, 2000 at 01:25 UTC
    There is a module for parsing html, Parse::Html I think. Check CPAN. Also, you might want to look into the scalar .. (range) operator.
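For what it's worth, here is a rough sketch of how the scalar .. operator could pick out the lines from a hit through the closing </TR>. It assumes the simple, machine-generated layout from the question; the sample data and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $content = 'FOO BAR';    # the user's search text (illustrative value)
my @sample  = (
    "<TR>\n",
    "<p> FOO BAR </p>\n",
    "<a href=www.yahoo.com>yahoo</a> </TR>\n",
    "<TR>\n",
    "<p> BAZ </p>\n",
    "<a href=www.altavista.com>altavista</a> </TR>\n",
);

my @range_hits;
my $row = '';
for my $line (@sample) {
    chomp(my $text = $line);
    # In scalar context .. is true from a line matching the search text
    # through the next line containing </TR>. \Q...\E keeps the user's
    # input from being treated as a regex.
    if (my $n = ($text =~ /\Q$content\E/ .. $text =~ m{</TR>}i)) {
        $row .= $text;
        # The returned sequence number has "E0" appended on the last line.
        if ($n =~ /E0$/) {
            push @range_hits, $row;
            $row = '';
        }
    }
}
print "<TR>$_\n" for @range_hits;
```

The operator remembers whether it is inside a range, so no separate flag variable is needed. It will misbehave in the same way as any line-based approach if a row is never closed.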

    Ah, jcwren has the right module name.

      HTML::Parser, actually.

      Be aware, though, that this module may be a little difficult to wrap your brain around, if you're new to Perl. If you're not comfortable with sub-classing and basic OO, it may be a little overwhelming. That said, there are some examples, and if you dig in the docs, you'll be able to find something that you should be able to carve up for your purposes. However, it's not something you'll be able to do in 20 minutes...

      I can't tell exactly what you're trying to do from what you've provided, but there is a module called HTML::TableParser. Since you're using <TR>/</TR> tags, this indicates table rows. HTML::TableParser is useful for yanking the data out of tables. The problem is that if you need the HREF tag info, HTML::TableParser won't give it to you. In the luke_repwalker.pl script, I had a similar problem. At the bottom of the code is a package you may be able to extract, and with a little tinkering would allow you to extract the table text and the HREF links.

      Unless I'm making it more complicated than what you're trying to do, this may be of help. If you need some additional assistance with getting that working, drop me a /msg or an e-mail and we'll see what we can get going.

      Using regexps to extract HTML *can* work, but it's not the best idea. Certain tags aren't balanced pairs, which can really mess you up. Also, there are some places where people will render the starting tags, but not the ending tags. Most browsers, trying to be the accommodating beasts they are, don't care about the end tags. This is particularly true of table rows and data. As such, unless you can be assured that the HTML is DTD spec HTML, using regexps is risky business.

      This code was something I came up with, based on a /msg from Sharky_The_Dog. I realize that it could be collapsed into one statement, but that wasn't the point (and, dang it, tilly, I know $filename and $match could be 'use vars'!). It's also based on the fact that Sharky says his HTML is machine generated, and legal.
      #!/usr/local/bin/perl -w

      use strict;

      my $filename = 'filename with HTML';
      my $match = 'your criteria here';

      {
         my $data;

         #
         #  Use braces to localize the $/ assignment, so we don't get bitten later.
         #
         {
            local $/ = undef;
            open (FH, "<$filename") || die;
            $data = <FH>;
            close FH;
         }

         #
         #  @list will contain all the <tr>/</tr> pairs
         #
         my @list = $data =~ m/<tr>(.*?)<\/tr>/igs;

         #
         #  @newlist will contain all the <tr>/</tr> pairs that match our search criteria
         #
         my @newlist = grep { /$match/i } @list;

         #
         #  Display the number of <tr>/</tr> pairs total, and the number that matched the search criteria
         #
         print "Items total : ", scalar @list, "\n";
         print "Items found : ", scalar @newlist, "\n";
      }
      --Chris

      e-mail jcwren
Re: File/String search...
by Fastolfe (Vicar) on Sep 07, 2000 at 01:50 UTC
    The problem is that your until() loop can never finish: nothing inside it reads a new line, so $_ never changes the way it does on each pass through the while() loop, and unless the matching line itself contains </TR> the test stays false forever. If you want to insist on doing it this way (the other posters already give you enough advice), you may have to do something like set a flag when you see the 'start' pattern, append each subsequent line while your flag is set, and clear the flag when your 'end' pattern is reached.
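A minimal sketch of that flag approach, assuming the simple layout from the question. The helper name collect_rows and the sample lines are made up for illustration; in the real script the lines would come from <DATAFILE>:

```perl
use strict;
use warnings;

# collect_rows() is a hypothetical helper: given the search text and a list
# of lines, it returns one joined string per matched row.
sub collect_rows {
    my ($content, @lines) = @_;
    my @hits;
    my $collecting = 0;     # flag: are we inside a matched row?
    my $row = '';

    for my $line (@lines) {
        chomp(my $text = $line);

        # Set the flag on a line that contains the search text.
        # \Q...\E treats the user's input as literal text, not a regex.
        $collecting = 1 if !$collecting && $text =~ /\Q$content\E/;
        next unless $collecting;

        $row .= $text;

        # Clear the flag once the closing tag has been seen.
        if ($text =~ m{</TR>}i) {
            push @hits, $row;
            ($collecting, $row) = (0, '');
        }
    }
    return @hits;
}

my @sample = (
    "<TR>\n",
    "<p> FOO BAR </p>\n",
    "<a href=www.yahoo.com>yahoo</a> </TR>\n",
    "<TR>\n",
    "<p> BAZ </p>\n",
    "<a href=www.altavista.com>altavista</a> </TR>\n",
);
print "<TR>$_\n" for collect_rows('FOO BAR', @sample);
```

Every line is read by the one outer loop, so $_ (here $line) really does change on each pass, which is what the until() version was missing.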