zzspectrez has asked for the wisdom of the Perl Monks concerning the following question:

Happy Turkey Day fellow monks!

I am currently trying to learn Spanish. I found a website with a page that displays a new Spanish word each day.

I would like to write a script for cron to run that will download the web page, parse out the Spanish word for the day, and email it to me so that I can download it to my Palm Pilot.

Not a difficult task, I know. The only part I'm not too sure about is whether I'm parsing the data properly. I am inexperienced with regular expressions, so I would like my fellow monks' suggestions.

An example of the HTML source:

.... SNIP .....
if (day == 22) document.write("<p><font size='+2'><font color='#cc0000'><i><b>acongojar:</b></i></font> to sadden or grieve. <i>Mi hermano estaba muy acongojado cuando muri&oacute\; su esposa</i>, my brother was very sad when his wife died.</font>");
if (day == 23) document.write("<p><font size='+2'><font color='#cc0000'><i><b>contestar:</b></i></font> to answer. <i>Sus oraciones fueron contestadas</i>, her prayers were answered.</font>");
.... SNIP ....

I have written a quick script to download and print out the data but am unsure if I should do things differently. It appears to work.

Test perl script

#!/usr/bin/perl -Tw
use strict;
use LWP::Simple;

my $url = 'http://spanish.about.com/homework/spanish/blword.htm';
die "Unable to download Spanish word of the day." unless (defined(my $web = get ($url)));
my $today = ((localtime)[3]);
print "\nToday: $today\n";
my ($word, $def, $sent, $trans) = $web =~ m[if \(day == $today\).+?<b>(\w+):</b></i></font>(.+?)<i>(.+?)</i>,(.+?).</font>]s;
print "Word: $word\nDefinition: $def\nSentence: $sent\nTranslation: $trans\n";

Thanks for any suggestions!
zzSPECTREz

Re: Pattern matching html.
by davorg (Chancellor) on Nov 24, 2000 at 01:23 UTC

    Generally speaking, parsing HTML files using regular expressions is a dangerous business. There's always just one more complication to take into account.

    A far better approach is to use a proper HTML parser. The HTML::Parser module is available from CPAN, but it seems to me that one of its subclasses, HTML::TokeParser or HTML::TreeBuilder, might be more appropriate in this case. There's a particularly good article on HTML::TreeBuilder in the current Perl Journal (issue 19).

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      I read the article in the Perl Journal, but I'm not sure its solution would work well in my situation. The data I'm looking for is not embedded within HTML tokens but within JavaScript inside the HTML source. Unless I am mistaken, HTML::TreeBuilder will not be able to help me with this problem. The data I am looking for is within JavaScript calls like if (day == 10) document.write("The data I want")

      I have not used these modules, so maybe I am missing something. About all I can see using them for is to grab all the functions within the <script LANGUAGE="JavaScript"> .... blah blah ... </script> tags. But even this doesn't seem straightforward, since there are three different script sections, and once I get the data I still have to parse it with a regular expression, right?

      Thanks!
      zzSPECTREz
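
The two-step idea raised in this exchange (first isolate the JavaScript string for the day, then pick the fields out of it) could be sketched roughly as below. This is only an illustration under assumptions: the sample input is copied from the snippet in the question rather than fetched, the helper names entry_for_day and parse_entry are hypothetical, and the patterns assume the markup stays as shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input copied from the question; a real run would fetch the page.
my $html = <<'END_HTML';
if (day == 22) document.write("<p><font size='+2'><font color='#cc0000'><i><b>acongojar:</b></i></font> to sadden or grieve. <i>Mi hermano estaba muy acongojado cuando muri&oacute\; su esposa</i>, my brother was very sad when his wife died.</font>");
if (day == 23) document.write("<p><font size='+2'><font color='#cc0000'><i><b>contestar:</b></i></font> to answer. <i>Sus oraciones fueron contestadas</i>, her prayers were answered.</font>");
END_HTML

# Step 1 (hypothetical helper): return the string handed to
# document.write() for the given day, or undef if there is none.
sub entry_for_day {
    my ($source, $day) = @_;
    my ($entry) = $source =~ /if \s* \( \s* day \s* == \s* \Q$day\E \s* \) \s*
                              document\.write \s* \( " (.*?) " \) ;/xs;
    return $entry;
}

# Step 2 (hypothetical helper): use the surrounding tags as delimiters.
# Returns a hash reference, or nothing if the layout is not recognised.
sub parse_entry {
    my ($entry) = @_;
    my ($word, $def, $sentence, $trans) = $entry =~
        m{<b>\s*(.+?)\s*:\s*</b>(?:</\w+>)*\s*(.+?)\s*<i>\s*(.+?)\s*</i>\s*,\s*(.+?)\s*</font>}s
        or return;
    return {
        word        => $word,
        definition  => $def,
        sentence    => $sentence,
        translation => $trans,
    };
}

my $entry  = entry_for_day($html, 23) or die "No entry for that day\n";
my $fields = parse_entry($entry)      or die "Entry format not recognised\n";
print "$_: $fields->{$_}\n" for qw(word definition sentence translation);
```

Keeping the steps separate means a failure can be reported as either "no entry for today" or "entry layout changed", which is easier to diagnose than one large pattern that silently fails.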

Re: Pattern matching html.
by rpc (Monk) on Nov 24, 2000 at 01:35 UTC
    Hi there, I agree with davorg. If they change the format of their page, the script will break horribly. Regardless, I wrote a test script that does the same thing as yours, pretty much. I just broke it up into smaller steps.
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use constant URL => 'http://spanish.about.com/homework/spanish/blword.htm';

    my $today = (localtime)[3];
    my $page = get(URL) or die "can't download page.\n";

    # grab today's entry.
    my($entry) = $page =~ m/if \(day == $today\) [^\(]+\(\"([^\"\);]+)/;

    # remove markup.
    $entry =~ s/<[^>]+>//g;

    my($word,$def, $sentence, $trans) = $entry =~ m/([^:]+):([^\.]+\.)([^,]+),(.*)/;
    print "word: $word\n";
    print "definition: $def\n";
    print "sentence: $sentence\n";
    print "translation: $trans\n";

      I downloaded this code, and it did not work properly. When run, it printed the following:

      word:
      definition:
      sentence:
      translation:

      zzSPECTREz

        As we discussed earlier, at least today (the 25th) they changed the format of the definition: in what I downloaded, there was no space after the colon following the word.

        I changed the code slightly and removed the hardcoded spaces in the regex.

        thanks for pointing that out :)
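
As a rough illustration of the fix described above, the sketch below replaces literal spaces with \s* so that both spacings match, and checks whether the match succeeded instead of printing empty fields. The two sample strings and the helper name split_entry are assumptions made for the example; they stand in for a tag-stripped entry and are not taken from the live page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical variants of a tag-stripped entry: with and without a
# space after the colon that follows the word.
my @entries = (
    'contestar: to answer. Sus oraciones fueron contestadas, her prayers were answered.',
    'contestar:to answer. Sus oraciones fueron contestadas, her prayers were answered.',
);

# split_entry (hypothetical helper) returns word, definition, sentence
# and translation, or an empty list when the text does not fit the layout.
# Whitespace around the separators is optional, so neither spacing is
# hardcoded into the pattern.
sub split_entry {
    my ($text) = @_;
    my @fields = $text =~ /^\s*([^:]+?)\s*:\s*([^.]+\.)\s*([^,]+?)\s*,\s*(.+?)\s*$/s
        or return;
    return @fields;
}

for my $entry (@entries) {
    my @fields = split_entry($entry);
    if (@fields) {
        print join(' | ', @fields), "\n";
    }
    else {
        # Report the problem rather than printing blank fields.
        warn "Unrecognised entry format: $entry\n";
    }
}
```

The pattern still assumes the definition is a single sentence and that the example sentence contains no comma, so it remains fragile in the ways already noted in this thread.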

Re: Pattern matching html.
by Anonymous Monk on Nov 24, 2000 at 00:29 UTC
    Happy Monk Day, fellow turkeys.