Re: How do I extract text from an HTML page?

Assuming that the format for every grade is '<b>NAME</b>, grade: GRADE<br>' and that there are no lines that begin incidentally with the same format (Well actually you don't have if you use the regex I wrote. At least I think so...), you can use this little snippet that uses regex to create this HTML page (You can easily alter it as you wish to just print it and not create an HTML file):

open (FILE, "+<$path/finalgrades.html") or die "Can't open file: $!"; 
# where $path is the full path to the directory where the file resides.
while (<FILE>) {
  if (m/^<b>[^<>]</b>, grade: (\d+)<br>$/) {
    my $name = $1; my $grade = $2;
    open (HTML, "+>>$path/$name.html") or die "$!";
    print HTML "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head><body><h1>$name</h1> has just received "$grade" as a grade for this seminar. Congratulations!!!<br></body></html>
  }
}
close (FILE) or die "$!";

Note that you may want to use CGI.pm to print the HTML tags instead of printing them directly. See http://search.cpan.org/author/LDS/CGI.pm-2.93/CGI.pm for more information on the CGI module.

--------------------------
Live fat, die young


Replies are listed 'Best First'.
Re: Re: How do I extract text from an HTML page? by ido50 (Scribe) on Aug 03, 2003 at 17:57 UTC
A few corrections:
1. Replace "(Well actually you don't have if you..." with "(Well actually you don't have to worry about it if you...".
2. After the print HTML "bla bla" statement I forgot to include a ";", and you should add a "close HTML" statement after it too.
Re: Re: Re: How do I extract text from an HTML page? by ido50 (Scribe) on Aug 03, 2003 at 18:00 UTC
Last two corrections (not my day today):
1. I also forgot a terminating double quote in the print HTML "bla bla" statement.
2. Replace the "`[^<>]`" in the regex with "`([^<>])`".
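
For readers who want the corrections from this thread collected in one place, here is a minimal sketch of how the snippet might look with them applied. It goes slightly beyond what was posted, so treat these points as assumptions rather than the author's code: the name capture is written as `([^<>]+)`, because a single-character class would only match one-letter names; the files are opened with lexical handles, reading with `<` and overwriting with `>` instead of `+<` and `+>>`; and the subroutine names `parse_grade_line`, `grade_page`, and `write_grade_pages` are hypothetical helpers introduced only for illustration. Names are used as file names unchanged, which is only safe if they contain no path separators.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, grade) for a line of the form
# '<b>NAME</b>, grade: GRADE<br>', or nothing if the line does not match.
sub parse_grade_line {
    my ($line) = @_;
    if ($line =~ m{^<b>([^<>]+)</b>, grade: (\d+)<br>\s*$}) {
        return ($1, $2);
    }
    return;
}

# Hypothetical helper: builds the congratulation page as one string.
# The inner double quotes around the grade are escaped.
sub grade_page {
    my ($name, $grade) = @_;
    return "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head>"
         . "<body><h1>$name</h1> has just received \"$grade\" as a grade for this seminar. "
         . "Congratulations!!!<br></body></html>\n";
}

# Reads $path/finalgrades.html and writes one $path/NAME.html per grade line.
# Returns the number of pages written.
sub write_grade_pages {
    my ($path) = @_;
    open(my $in, '<', "$path/finalgrades.html") or die "Can't open file: $!";
    my $count = 0;
    while (my $line = <$in>) {
        my ($name, $grade) = parse_grade_line($line);
        next unless defined $name;
        # Assumption: overwrite any existing page instead of appending to it.
        open(my $out, '>', "$path/$name.html") or die "Can't write $name.html: $!";
        print {$out} grade_page($name, $grade);
        close($out) or die "$!";
        $count++;
    }
    close($in) or die "$!";
    return $count;
}

# Run only when a directory is given on the command line.
write_grade_pages($ARGV[0]) if @ARGV;
```

Splitting the match and the page text into their own subroutines keeps the file handling short and makes it easy to print the page to STDOUT instead of a file, as the original post suggests.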
Re: How do I extract text from an HTML page? by jeffa (Bishop) on Aug 03, 2003 at 19:02 UTC
Two suggestions: test before you post (and then test some more), and don't use regexes to parse HTML. Granted, this is trivial HTML to parse, but the more you use parsers, the better you get at using them. Also, the more you use templating modules, the better you get at them.

I hate to just outright solve the problem, but I did. Here is a link that you will have to click to see my solution, so any readers have been warned. `<blink>` WARNING SPOILERS CLICK AT OWN RISK! `</blink>` It's a lot more code than you posted, but it does a lot more as well. ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

