File/String search...

Sharky_The_Dog has asked for the wisdom of the Perl Monks concerning the following question:

Re: File/String search... by maverick (Curate) on Sep 07, 2000 at 01:31 UTC
Unless you can guarantee that your HTML is going to be really consistent and simple, you're eventually going to run into headaches trying to parse it correctly. Consider that you're likely to see HTML like this, with the closing `</tr>` tags left out: `<table> <tr><td>you're likely to see</td> <tr><td>html done like this</td> </table>` I'd suggest checking out a module called HTML-Parser (http://www.cpan.org/modules/by-module/HTML/HTML-Parser-3.11.tar.gz). I'd also suggest building an index of key words and which files contain them if you're going to be doing a lot of searching. I mentioned this once before for a similar type of question; the node is Re: Search Algorithm. Hope this helps... /\/\averick
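The keyword index suggested above could look roughly like the following. This is only a minimal sketch: the names `build_index` and `files_for` are made up for illustration, the file contents are passed in as strings rather than read from disk, and the tag stripping is a crude regex that assumes very simple markup.

```perl
use strict;
use warnings;

# Hypothetical helper: map each lowercased word to the set of files containing it.
# $docs is a hash reference of filename => file contents.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $file (keys %$docs) {
        # Crude tag stripping; only reasonable for very simple markup.
        (my $text = $docs->{$file}) =~ s/<[^>]*>/ /g;
        $index{lc $1}{$file} = 1 while $text =~ /(\w+)/g;
    }
    return \%index;
}

# Hypothetical helper: sorted list of the files that contain a word.
sub files_for {
    my ($index, $word) = @_;
    return sort keys %{ $index->{lc $word} || {} };
}

my $index = build_index({
    'a.html' => '<tr><td>Perl search</td></tr>',
    'b.html' => '<tr><td>search engines</td></tr>',
});
print join(', ', files_for($index, 'search')), "\n";
```

Building the index once and looking words up in the hash avoids re-reading every file for each search, which is the point of the suggestion.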
Re: File/String search... by jreades (Friar) on Sep 07, 2000 at 03:47 UTC
Before I get into anything else, have you considered that your search string uses double quotes and your '/' may need to be escaped?
If the table method is the only one available to you then I'll outline a possible approach below, but first may I suggest that you dynamically generate the table used on the page (assuming that this is possible). Consider a file that looked like this:
`FOO\twww.yahoo.com/yahoo\tsomething else\tand another
BAR\twww.altavista.com\tsomething else\`
Obviously the tab character is a physical tab and not the literal '\t'. Then your code can be something like:
`while (<FILE>) {
    my @line = split /\t/;
    # do_something...
}`
This would allow you to format the pages as you liked (and pretty damned easily) and would minimize the opportunity for inconsistent coding...
And now on to the approach... According to what you've said, and factoring in what others have pointed out by way of problems, the only thing we can 'reliably' count on is a `/<tr/i` starting our table row and, thus, a new search term. I'd suggest avoiding `/<tr>/` as it's entirely possible that someone will add bgcolor or valign or some other piece of amusing code...
So here's a shot based on what you already have. It does have the side effect of stripping out the <tr>/</tr> lines. It assumes that these tags are on separate lines and that neither of the following occurs:
`1. <tr><p>Some content</p>   # will screw up because it skips to the next line
2. <p>Some content</p></tr>  # will also screw up because it doesn't take the last line`
Anyway...
`open (FILE, "<$read") or die ("Couldn't open file to read: $!");
while (my $line = <FILE>) {
    next unless ($line =~ /<tr/i);
    my $match = '';
    # Gather lines until the next <tr or </tr; also stop at end of file.
    while (defined(my $additional_lines = <FILE>)) {
        last if $additional_lines =~ /<\/?tr/i;
        $match .= $additional_lines;
    }
    print "MATCH: " . $match . "\n\n";
}
close FILE;
exit 0;`
The best way around the limitations of the previous code would be to undef $/ and slurp your file into a scalar as follows (this is just a rough sketch, as I couldn't get it working quite the way I wanted):
`undef $/;
my $read = 'test.html';
open (FILE, "<$read") or die ("Couldn't open file to read: $!");
my $file = <FILE>;
close FILE;
# [^>]* allows for any attributes inside the <tr ...> tag
my @possible_search_terms = split /<tr\b[^>]*>/i, $file;
foreach (@possible_search_terms) {
    next if $_ =~ /^\s*$/;    # skip chunks that are empty or only whitespace
    print "WORKING: '" . $_ . "'\n";
}`
RE: Re: File/String search... by tye (Sage) on Sep 07, 2000 at 19:22 UTC
"Before I get into anything else, have you considered that your search string uses double quotes and your '/' may need to be escaped?" You don't need to escape / inside double quotes. There are really very few characters that require escaping inside of (Perl) double quotes. Namely, \, @, $, and the delimiter character (which is " for "this" and . for qq.this., etc.). Having heard something like this twice in as many days, I felt it was important to comment on this. - tye (but my friends call me "Tye")
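A short illustration of the escaping rules described above. The host name comes from the earlier example in this thread; the mail address and the variable names are placeholders invented for this sketch.

```perl
use strict;
use warnings;

my $host = 'www.yahoo.com';

# A forward slash needs no backslash inside double quotes.
my $url = "http://$host/yahoo";

# The sigils $ and @ do need one when they are meant literally.
my $literal = "cost: \$5, mail: user\@example.com";

# With qq and another delimiter, a literal " needs no escaping either.
my $tag = qq{<a href="$url">Yahoo</a>};

# Inside m/.../ the slash is the delimiter, so it must be escaped there,
# unless a different delimiter is chosen.
my $closes_row = ('</tr>' =~ m/<\/tr>/) && ('</tr>' =~ m{</tr>});

print "$url\n$literal\n$tag\n";
```

The slash only becomes special when it is the delimiter of the surrounding construct, which is probably where the idea that it always needs escaping comes from.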
Re: File/String search... by Adam (Vicar) on Sep 07, 2000 at 01:25 UTC
There is a module for parsing html, Parse::Html I think. Check CPAN. Also, you might want to look into the scalar .. (range) operator. Ah, jcwren has the right module name.
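For reference, a minimal sketch of how the scalar range operator might be used here. The helper name `rows_from_lines` is invented for illustration, and the sketch assumes every row is properly closed with `</tr>`; the operator remembers its state between lines, so an unclosed row would leak into whatever follows.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines of each <tr>...</tr> block.
sub rows_from_lines {
    my @rows;
    my $current = '';
    for my $line (@_) {
        # In scalar context, .. is false until the left pattern matches, then
        # stays true through the line on which the right pattern matches.
        if (my $seq = ($line =~ /<tr\b/i .. $line =~ m{</tr>}i)) {
            $current .= $line;
            # The sequence number has "E0" appended on the last line of a range.
            if ($seq =~ /E0$/) {
                push @rows, $current;
                $current = '';
            }
        }
    }
    return @rows;
}

my @lines = (
    "<table>\n",
    "<tr>\n", "<td>one</td>\n", "</tr>\n",
    "<tr><td>two</td></tr>\n",
    "</table>\n",
);
my @rows = rows_from_lines(@lines);
print scalar(@rows), " rows found\n";
```

Because `..` tests its right operand on the same line that turned it on, a row that opens and closes on one line is handled as well.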
(jcwren) RE: Re: File/String search... by jcwren (Prior) on Sep 07, 2000 at 01:29 UTC
HTML::Parser, actually. Be aware, though, that this module may be a little difficult to wrap your brain around, if you're new to Perl. If you're not comfortable with sub-classing and basic OO, it may be a little overwhelming. That said, there are some examples, and if you dig in the docs, you'll be able to find something that you should be able to carve up for your purposes. However, it's not something you'll be able to do in 20 minutes...
I can't tell exactly what you're trying to do from what you've provided, but there is a module called HTML::TableParser. Since you're using <TR>/</TR> tags, this indicates table rows. HTML::TableParser is useful for yanking the data out of tables. The problem is that if you need the HREF info, HTML::TableParser won't give it to you. In the luke_repwalker.pl script, I had a similar problem. At the bottom of the code is a package you may be able to extract, and with a little tinkering it would allow you to extract the table text and the HREF links. Unless I'm making it more complicated than what you're trying to do, this may be of help. If you need some additional assistance with getting that working, drop me a /msg or an e-mail and we'll see what we can get going.
Using regexps to extract HTML can work, but it's not the best idea. Certain tags aren't balanced pairs, which can really mess you up. Also, there are some places where people will write the starting tags, but not the ending tags. Most browsers, trying to be the accommodating beasts they are, don't care about the end tags. This is particularly true of table rows and data. As such, unless you can be assured that the HTML is DTD spec HTML, using regexps is risky business.
This code was something I came up with, based on a /msg from Sharky_The_Dog. I realize that it could be collapsed into one statement, but that wasn't the point (and, dang it, tilly, I know $filename and $match could be 'use vars'!). It's also based on the fact that Sharky says his HTML is machine generated, and legal.
`#!/usr/local/bin/perl -w
use strict;

my $filename = 'filename with HTML';
my $match    = 'your criteria here';

{
    my $data;
    #
    # Use braces to localize the $/ assignment, so we don't get bitten later.
    #
    {
        local $/ = undef;
        open (FH, "<$filename") || die "Couldn't open $filename: $!";
        $data = <FH>;
        close FH;
    }
    #
    # @list will contain the contents of all the <tr>/</tr> pairs
    #
    my @list = $data =~ m/<tr>(.*?)<\/tr>/igs;
    #
    # @newlist will contain all the <tr>/</tr> pairs that match our search criteria
    #
    my @newlist = grep { /$match/i } @list;
    #
    # Display the number of <tr>/</tr> pairs total, and the number that matched the search criteria
    #
    print "Items total : ", scalar @list, "\n";
    print "Items found : ", scalar @newlist, "\n";
}`
--Chris
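A hedged variation on the script above, for the case where the opening `<tr>` carries attributes (bgcolor, valign and so on) and the HREF values are wanted too. It rests on the same assumption of machine-generated, well-formed HTML; `extract_rows` and `extract_links` are names invented for this sketch, and only double-quoted href values are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: contents of every <tr ...>...</tr> pair, attributes allowed.
sub extract_rows {
    my ($html) = @_;
    return $html =~ m{<tr\b[^>]*>(.*?)</tr>}gis;
}

# Hypothetical helper: every double-quoted href value found in a chunk of HTML.
sub extract_links {
    my ($chunk) = @_;
    return $chunk =~ m{<a\b[^>]*\bhref\s*=\s*"([^"]*)"}gi;
}

my $html = '<table><tr bgcolor="#ffffff"><td><a href="http://www.yahoo.com/yahoo">Yahoo</a></td></tr>'
         . '<TR><TD>no link here</TD></TR></table>';

for my $row (extract_rows($html)) {
    print "ROW: $row\n";
    print "  LINK: $_\n" for extract_links($row);
}
```

The earlier warning still applies: rows with missing end tags or nested tables will defeat these patterns, which is where a real parser earns its keep.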
Re: File/String search... by Fastolfe (Vicar) on Sep 07, 2000 at 01:50 UTC
The problem is that your until() block will always fail, since nothing inside it changes $_ the way each iteration of the while() loop does. If you want to insist on doing it this way (the other posters have already given you enough advice), you may have to do something like set a flag when you see the 'start' pattern, append each subsequent line while your flag is set, and clear the flag when your 'end' pattern is reached.
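The flag approach described above might be sketched like this. The helper name `collect_blocks` and its arguments are invented for illustration, and the sketch assumes the 'start' and 'end' patterns sit on lines of their own.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the lines between each 'start' line and 'end' line.
sub collect_blocks {
    my ($start, $end, @lines) = @_;
    my (@blocks, $in_block);
    my $buffer = '';
    for my $line (@lines) {
        if (!$in_block) {
            # Set the flag on a 'start' line; the start line itself is not kept.
            $in_block = 1 if $line =~ $start;
            next;
        }
        if ($line =~ $end) {
            # Clear the flag and emit what was gathered.
            push @blocks, $buffer;
            ($in_block, $buffer) = (0, '');
            next;
        }
        $buffer .= $line;    # flag is set: append the line
    }
    return @blocks;
}

my @lines = (
    "<tr>\n", "a\n", "b\n", "</tr>\n",
    "outside\n",
    "<tr bgcolor=red>\n", "c\n", "</tr>\n",
);
my @blocks = collect_blocks(qr/<tr\b/i, qr{</tr>}i, @lines);
print "BLOCK: $_" for @blocks;
```

Every line is read by the single outer loop, so the current line always changes between tests, which is exactly what the original until() block was missing.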