Re: How do I extract text from an HTML page?

by ido50 (Scribe)
on Aug 03, 2003 at 17:53 UTC (#280471)

in reply to How do I extract text from an HTML page?

Assuming that every grade line has the format '<b>NAME</b>, grade: GRADE<br>' and that no other lines happen to begin with the same format (well, actually you don't have to worry about that if you use the regex I wrote, at least I think so...), you can use this little snippet, which uses a regex to create an HTML page for each name (you can easily alter it to just print the result instead of creating an HTML file):

    open(FILE, '<', "$path/finalgrades.html") or die "Can't open file: $!";
    # where $path is the full path to the directory where the file resides.
    while (<FILE>) {
        # Capture NAME into $1 and GRADE into $2.
        if (m/^<b>([^<>]+)<\/b>, grade: (\d+)<br>$/) {
            my $name  = $1;
            my $grade = $2;
            open(HTML, '>>', "$path/$name.html") or die "$!";
            print HTML "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head><body><h1>$name</h1> has just received \"$grade\" as a grade for this seminar. Congratulations!!!<br></body></html>";
            close(HTML) or die "$!";
        }
    }
    close(FILE) or die "$!";
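If you want to check the matching step on its own before touching any files, something like the following might do. It is only a sketch: parse_grade_line is a hypothetical helper name, and the sample lines are made up to fit the assumed '<b>NAME</b>, grade: GRADE<br>' format.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, grade) for a line in the assumed
# '<b>NAME</b>, grade: GRADE<br>' format, or an empty list otherwise.
sub parse_grade_line {
    my ($line) = @_;
    # m{...} delimiters avoid having to escape the slash in </b>.
    return $line =~ m{^<b>([^<>]+)</b>, grade: (\d+)<br>$} ? ($1, $2) : ();
}

# Made-up sample input, for illustration only.
my @lines = (
    "<b>Alice</b>, grade: 95<br>\n",
    "<p>Final grades</p>\n",
    "<b>Bob Smith</b>, grade: 78<br>\n",
);

for my $line (@lines) {
    # Skip any line that does not match the expected format.
    my ($name, $grade) = parse_grade_line($line) or next;
    print "$name => $grade\n";
}
```

Only the two lines that fit the format produce output; the paragraph line is skipped.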

Note that you may want to use the CGI module to generate the HTML tags instead of printing them directly. See the CGI module's documentation for more info.
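As a rough idea of what that could look like, here is a minimal sketch using CGI.pm's function-oriented interface. It assumes CGI.pm is installed (it has not been bundled with Perl since 5.22), and grade_page is a hypothetical helper name, not something from the snippet above.

```perl
use strict;
use warnings;
use CGI qw(:standard);

# Hypothetical helper: builds the congratulation page for one participant
# with CGI.pm's tag-generating functions instead of hand-written tags.
sub grade_page {
    my ($name, $grade) = @_;
    return join '',
        start_html(-title => "CONGRATULATIONS TO THE PARTICIPANT $name"),
        h1($name),
        p("$name has just received \"$grade\" as a grade for this seminar. Congratulations!!!"),
        end_html();
}

# Made-up sample values, for illustration only.
print grade_page('Alice', 95), "\n";
```

The returned string can be printed to a per-participant filehandle in exactly the same way as the hand-written HTML.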

Live fat, die young

Replies are listed 'Best First'.
Re: Re: How do I extract text from an HTML page?
by ido50 (Scribe) on Aug 03, 2003 at 17:57 UTC
    A few corrections:
    1. Replace "(Well actually you don't have if you..." with "(Well actually you don't have to worry about it if you...".
    2. After the print HTML "bla bla" statement I forgot to include a ";", and you should add after it a "close HTML" statement too.

    Live fat, die young
      Last two corrections (Not my day today):
      1. I also forgot the terminating double quote in the print HTML "bla bla" statement.
      2. Replace the "[^<>]" in the regex with "([^<>]+)", so that the whole name is captured into $1.

      Live fat, die young
        Two suggestions: test before you post (and then test some more), and don't use regexes to parse HTML. Granted, this is trivial HTML to parse, but the more you use parsers, the better you get at it. Also, the more you use templating modules, the better you get at them. I hate to just outright solve the problem, but I did. Here is a link that you will have to click to see my solution, so any readers have been warned.


        It's a lot more code than you posted, but it does a lot more as well. ;)


        (the triplet paradiddle with high-hat)
