I have two problems actually.
The first I described in my first post: The output isn't the IPA-encoded stuff I'm expecting. The suggestion that I find UTF-8 codes for each IPA symbol is good -- I can do that easily, and 'write' each symbol's UTF-8 hexadecimal code to Excel as needed. But as my html source doesn't give the UTF-8 hexadecimal, I'd need some method of encoding the html source's IPA symbols as UTF-8 hexadecimal before doing my Excel write.
For this task, I might try to make sense of the post at http://www.dev411.com/blog/2006/09/29/perl-getting-a-unicode-characters-hex-codepoint
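A minimal sketch of that idea, assuming the page arrives as UTF-8 bytes that first have to be decoded into Perl characters. The helper name codepoints_hex and the sample bytes are made up for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the hex code point of every character
# in an already-decoded string.
sub codepoints_hex {
    my ($chars) = @_;
    return map { sprintf "U+%04X", ord($_) } split //, $chars;
}

# Sample UTF-8 bytes, standing in for a bit of IPA taken from the page.
my $bytes = "h\xC9\x99\xCB\x88lo\xCA\x8A";

# Turn the raw bytes into characters before inspecting them.
my $chars = decode( 'UTF-8', $bytes );

print join( ' ', codepoints_hex($chars) ), "\n";
```

Calling ord on undecoded bytes would instead report each byte separately, so the decode step matters here.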
The second problem is that any given web page's html source contains at least two pairings of what I'm calling 'surrounding strings' in the code below. Because regex is greedy, I now assume it's matching the 'htmlchunk' sitting between the final pairing found. I'd like to get the chunk between the first pairing.
Have a look at the html source for http://dictionary.reference.com/browse/hello and you'll see what I mean. Scan for instances of "prondelim" and you'll see four (i.e., two pairings).
Luckily, dictionary.com's pronunciation delimiters for showing IPA (/ and /) differ from those used for showing spelled pronunciation ([ and ]). So I know my regex can distinguish between these similar-looking surrounding strings and that I'm matching the desired IPA-encoded source.
However, 'hello' has just one dictionary entry.
The word 'fly', for instance, has four entries on the same webpage. Each entry offers its own IPA-between-pronunciation-delimiters html. In this simple case, each pronunciation is exactly the same, so I don't really care whether I'm matching on the first, second, third, or fourth bit of IPA.
But other words may have different pronunciations for different meanings. Again, I'd like to match on and extract merely the first -- not last -- pronunciation. So what do I do? Would a 'lookahead' regex be the proper way to proceed here? Reversing the entire html source and then reversing my regular expression (so as to greedily match the original first expression instance) seems an awful hack.
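One way to check which pairing actually gets matched is a small test against made-up input. This is only a sketch: the helper name first_between and the two-entry sample html are hypothetical, not dictionary.com's real markup. It relies on two things I'm fairly sure of: a Perl regex reports the leftmost match, and a non-greedy (.*?) stops at the nearest closing string.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first $open and the
# nearest $close that follows it, or undef if there is no such pairing.
sub first_between {
    my ( $source, $open, $close ) = @_;
    # \Q...\E quotes regex metacharacters; /s lets . match newlines too.
    return $source =~ /\Q$open\E(.*?)\Q$close\E/s ? $1 : undef;
}

# Made-up source with two pronunciation pairings on separate lines.
my $html = join "\n",
    '<span class="prondelim">/</span><span class="pron">first</span><span class="prondelim">/</span>',
    '<span class="prondelim">/</span><span class="pron">second</span><span class="prondelim">/</span>';

my $chunk = first_between(
    $html,
    '"prondelim">/</span><span class="pron"',
    '/span><span class="prondelim"',
);

print "$chunk\n";    # the chunk still carries the stray '>' and '<'
```

If this prints the first entry's chunk, no lookahead or string reversal should be needed to get the first pronunciation.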
Here's my failing code. Don't laugh if it looks newbie-ish -- it is. I really appreciate all of your help. You guys are much better at this stuff than I am. --Cypress
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use Spreadsheet::WriteExcel;
use FileHandle;
use strict;
# create useragent, open an excel workbook and sheet
my $ua = LWP::UserAgent -> new;
my $workbook = Spreadsheet::WriteExcel -> new ( "IPA.xls" );
my $sheet = $workbook -> add_worksheet ( );
$sheet -> set_column ( 0, 0, 100 );
# get html source and parse
my $address = "http://dictionary.reference.com/browse/hello";
my $request = HTTP::Request -> new ( GET => $address );
my $response = $ua -> request ( $request );
my $htmlsource;
my $writestring;
if ( $response -> is_success ) {
$htmlsource = $response -> content;
$writestring = parse( $htmlsource );
}
# write to spreadsheet, close excel
$sheet -> write ( 0, 0, $writestring );
$workbook -> close ( );
sub parse {
my $source = shift;
my $htmlchunk;
my $ipa;
# select from html source the chunk of html which contains IPA-
# encoded symbols
# this chunk will still contain html tags that need to be removed
# i'll find it between the *first* (but perhaps not last) pairing
# of these two surrounding strings:
# "prondelim">/</span><span class="pron"
# /span><span class="prondelim"
if ( $source =~ /"prondelim">\/<\/span><span class="pron"(.*?)\/span><span class="prondelim"/ ) {
$htmlchunk = $1;
}
# get rid of leading html tags, save the IPA or English bits
# between '>' and '<',
# and continue doing the same over the remaining chunk
while ( $htmlchunk =~ /(.*?)>(.*?)<(.*)/ ) {
$ipa = $ipa . $2;
$htmlchunk = $3;
}
return $ipa;
}
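The tag-stripping loop in parse could probably be collapsed into one global match. A sketch, with text_between_tags as a hypothetical name; note that [^<]* also crosses newlines, which the .*? in the loop does not.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text sitting between a '>' and the
# next '<', i.e. drop the tags and join what is left.
sub text_between_tags {
    my ($chunk) = @_;
    # In list context, //g returns every captured group in order.
    return join '', $chunk =~ />([^<]*)</g;
}

# Made-up chunk shaped like the one parse extracts (leading '>' and
# trailing '<' included).
my $sample = '>h<span class="x">e</span>l<';
print text_between_tags($sample), "\n";
```

Initialising the result this way also avoids appending to an undefined $ipa on the first pass.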
Holy cow! Just noticed to whom I'm replying! How do I prostrate myself before you over the internet?