Re: Writing International Phonetic Alphabet symbols to Excel?

Re^2: Writing International Phonetic Alphabet symbols to Excel? by cypress (Beadle) on Sep 25, 2009 at 01:52 UTC
I have two problems, actually. The first I described in my first post: the output isn't the IPA-encoded stuff I'm expecting. The suggestion that I find UTF-8 codes for each IPA symbol is good -- I can do that easily, and 'write' each symbol's UTF-8 hexadecimal code to Excel as needed. But as my html source doesn't give the UTF-8 hexadecimal, I'd need some method of encoding the html source's IPA symbols as UTF-8 hexadecimal before doing my Excel write. For this task, I might try to make sense of the post at http://www.dev411.com/blog/2006/09/29/perl-getting-a-unicode-characters-hex-codepoint

The second problem is that any given web page's html source contains at least two pairings of what I'm calling 'surrounding strings' in the code below. Because regex is greedy, I now assume it's matching the 'htmlchunk' sitting between the final pairing found. I'd like to get the chunk between the first pairing. Have a look at the html source for http://dictionary.reference.com/browse/hello and you'll see what I mean. Scan for instances of "prondelim" and you'll see four (i.e., two pairings). Luckily, dictionary.com's pronunciation delimiters for showing IPA (/ and /) differ from those used for showing spelled pronunciation ([ and ]). So I know my regex can distinguish between these similar-looking surrounding strings and that I'm matching the desired IPA-encoded source.

However, 'hello' has just one dictionary entry. The word 'fly', for instance, has four entries on the same webpage. Each entry offers its own IPA-between-pronunciation-delimiters html. In this simple case, each pronunciation is exactly the same, so I don't really care whether I'm matching on the first, second, third, or fourth bit of IPA. But other words may have different pronunciations for different meanings. Again, I'd like to match on and extract merely the first -- not last -- pronunciation. So what do I do? Would a 'lookahead' regex be the proper way to proceed here?
Reversing the entire html source and then reversing my regular expression (so as to greedily match the original first expression instance) seems an awful hack.

Here's my failing code. Don't laugh if it looks newbie-ish -- it is. I really appreciate all of your help. You guys are much better at this stuff than I am.

--Cypress

```perl
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use Spreadsheet::WriteExcel;
use FileHandle;
use strict;

# create useragent, open an excel workbook and sheet
my $ua = LWP::UserAgent -> new;
my $workbook = Spreadsheet::WriteExcel -> new ( "IPA.xls" );
my $sheet = $workbook -> add_worksheet ( );
$sheet -> set_column ( 0, 0, 100 );

# get html source and parse
my $address = "http://dictionary.reference.com/browse/hello";
my $request = HTTP::Request -> new ( GET => $address );
my $response = $ua -> request ( $request );
my $htmlsource;
my $writestring;
if ( $response -> is_success ) {
    $htmlsource = $response -> content;
    $writestring = parse( $htmlsource );
}

# write to spreadsheet, close excel
$sheet -> write ( 0, 0, $writestring );
$workbook -> close ( );

sub parse {
    my $source = shift;
    my $htmlchunk;
    my $ipa;

    # select from html source the chunk of html which contains IPA-
    # encoded symbols
    # this chunk will still contain html tags that need to be removed
    # i'll find it between the first (but perhaps not last) pairing
    # of these two surrounding strings:
    #   "prondelim">/</span><span class="pron"
    #   /span><span class="prondelim"
    if ( $source =~ /"prondelim">\/<\/span><span class="pron"(.*?)\/span><span class="prondelim"/ ) {
        $htmlchunk = $1;
    }

    # get rid of leading html tags, save the IPA or English bits
    # between '>' and '<',
    # and continue doing the same over the remaining chunk
    while ( $htmlchunk =~ /(.*?)>(.*?)<(.*)/ ) {
        $ipa = $ipa . $2;
        $htmlchunk = $3;
    }
    return $ipa;
}
```

Holy cow! Just noticed to whom I'm replying! How do I prostrate myself before you over the internet?
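On the first problem (finding the hexadecimal code for each IPA symbol), here is a minimal sketch using only the core Encode module. It assumes the input is a string of raw UTF-8 bytes, as an undecoded HTTP body would be. The helper name `codepoints_hex` and the sample string are made up for illustration and are not taken from dictionary.com.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the code point (as U+XXXX) of every
# non-ASCII character found in a string of raw UTF-8 bytes.
sub codepoints_hex {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes in, characters out
    return map  { sprintf 'U+%04X', ord($_) }
           grep { ord($_) > 127 }
           split //, $text;
}

# Made-up sample: "h", schwa (U+0259) as two UTF-8 bytes, "l".
my @found = codepoints_hex("h\xC9\x99l");
print "@found\n";    # prints: U+0259
```

The important step is the `decode` call: without it, `ord` would see the two bytes 0xC9 and 0x99 separately instead of the single character U+0259.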
Re^3: Writing International Phonetic Alphabet symbols to Excel? by graff (Chancellor) on Sep 25, 2009 at 02:47 UTC
I tried that version of your script myself, and the problem that shows up in the Excel file appears to be the result of "double encoding" into utf8. In other words, data that is already utf8 encoded gets treated as if it were plain-old single-byte Latin1, and gets encoded into utf8 again.

There might be better ways to fix this besides the following, but the following will work (at least, it did for me):

```perl
use Encode;  # add this near the top, with the other "use" statements
...
# write to spreadsheet, close excel
$sheet -> write ( 0, 0, decode( 'utf8', $writestring ));  # add the "decode()" call
```
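To make the "double encoding" effect concrete, here is a minimal sketch using only the core Encode module. The byte string is a hand-made example (the UTF-8 bytes of U+0259, schwa), not data from the actual page, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+0259 (schwa) as raw UTF-8 bytes, the way an undecoded HTTP body holds it.
my $bytes = "\xC9\x99";

# Treating those two bytes as two Latin-1 characters and encoding them
# again is the "double encoding": each byte becomes two bytes.
my $double = encode('UTF-8', $bytes);

# Decoding first gives Perl one character instead of two bytes.
my $chars = decode('UTF-8', $bytes);

printf "raw bytes:      %d\n", length($bytes);     # 2
printf "double-encoded: %d\n", length($double);    # 4
printf "decoded chars:  %d (U+%04X)\n", length($chars), ord($chars);    # 1 (U+0259)
```

A writer that encodes its input to UTF-8 will produce the 4-byte garbage if handed `$bytes`, and the correct 2-byte sequence if handed `$chars`.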
Re^3: Writing International Phonetic Alphabet symbols to Excel? by jmcnamara (Monsignor) on Sep 25, 2009 at 08:44 UTC
You are very close, and well done for showing a detailed example.

The main (Unicode) problem is that Perl doesn't know that the strings that you are extracting from the HTML source are UTF-8. You can either explicitly convert them, as graff shows, or better still use `decoded_content()` instead of `content()` in your LWP code:

```perl
...
if ( $response -> is_success ) {
    $htmlsource = $response -> decoded_content();
    $writestring = parse( $htmlsource );
}
...
```

This will get you most of the way there if you view the output file. However, you will notice that the backquote-like (inflection?) character doesn't display in the default Arial font (the other Unicode characters do). The solution in this case is to switch to a full Unicode font in Excel such as 'Arial Unicode MS':

```perl
...
my $arial_unicode = $workbook -> add_format(font => 'Arial Unicode MS');
$sheet -> write ( 0, 0, $writestring, $arial_unicode );
...
```

-- John.
Re^3: Writing International Phonetic Alphabet symbols to Excel? by cypress (Beadle) on Sep 25, 2009 at 02:08 UTC
Hmm. My paragraph above beginning 'The second problem...' could have been worded a bit better. What I meant is that a simple word like 'hello' has two pairings of the "prondelim" etc. strings. The first pairing (the one I'm actually matching) sets off the Show IPA html code. The second pairing sets off the Show Spelled Pronunciation html code. A word with multiple entries at dictionary.com will have multiple pairings, both for IPA and for spelled pronunciation. I don't want to match the greedy-final pairing -- I want to extract whatever IPA comes up first.

As for the meaning of 'English bits' in my code comments, try the word 'of' at dictionary.com. I'll be matching on stuff like 'unstressed' and 'especially before consonants,' too.
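For the "first pairing, not the last" requirement, here is a minimal sketch of non-greedy matching. It assumes the two surrounding strings quoted in the code comments earlier in the thread. The helper name `first_between` and the two-line sample HTML are made up for illustration and are not real dictionary.com markup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the FIRST pair of delimiters.
sub first_between {
    my ($source, $open, $close) = @_;
    # .*? is non-greedy, so it stops at the earliest $close after the
    # first $open; /s lets . cross newlines in multi-line HTML;
    # \Q...\E quotes the delimiters so '/' and '"' need no escaping.
    return $source =~ /\Q$open\E(.*?)\Q$close\E/s ? $1 : undef;
}

# Made-up stand-in for a page with two entries.
my $html = join "\n",
    '<span class="prondelim">/</span><span class="pron">first</span><span class="prondelim">/</span>',
    '<span class="prondelim">/</span><span class="pron">second</span><span class="prondelim">/</span>';

my $chunk = first_between(
    $html,
    '"prondelim">/</span><span class="pron"',
    '/span><span class="prondelim"',
);
print "$chunk\n";    # prints: >first<
```

With a greedy `(.*)` in place of `(.*?)`, the same pattern would run on to the last closing delimiter and capture both entries. No lookahead or string reversal is needed for this case.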