I have two problems actually.

The first I described in my first post: The output isn't the IPA-encoded stuff I'm expecting. The suggestion that I find UTF-8 codes for each IPA symbol is good -- I can do that easily, and 'write' each symbol's UTF-8 hexadecimal code to Excel as needed. But as my html source doesn't give the UTF-8 hexadecimal, I'd need some method of encoding the html source's IPA symbols as UTF-8 hexadecimal before doing my Excel write.

For this task, I might try to make sense of the post at http://www.dev411.com/blog/2006/09/29/perl-getting-a-unicode-characters-hex-codepoint
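
A minimal sketch of what getting the hex codes could look like, assuming the page bytes are UTF-8 and using only the core Encode module. The helper name describe_chars and the sample bytes are made up for illustration. With LWP, $response->decoded_content (rather than content) should hand back already-decoded characters, in which case the decode step here would not be needed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: for each character of a decoded string, report
# its Unicode code point and its UTF-8 bytes in hex.
sub describe_chars {
    my ($text) = @_;
    my @out;
    for my $char ( split //, $text ) {
        my $codepoint = sprintf( "U+%04X", ord($char) );
        my $utf8_hex  = uc( unpack( "H*", encode( "UTF-8", $char ) ) );
        push @out, "$codepoint $utf8_hex";
    }
    return @out;
}

# Sample raw bytes (an assumption): the UTF-8 encoding of the IPA
# schwa followed by the primary stress mark.
my $bytes = "\xC9\x99\xCB\x88";
my $text  = decode( "UTF-8", $bytes );

print "$_\n" for describe_chars($text);
```

The point of decoding first is that ord then sees whole characters instead of individual bytes.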

The second problem is that any given web page's html source contains at least two pairings of what I'm calling 'surrounding strings' in the code below. Because regex is greedy, I now assume it's matching the 'htmlchunk' sitting between the final pairing found. I'd like to get the chunk between the first pairing.

Have a look at the html source for http://dictionary.reference.com/browse/hello and you'll see what I mean. Scan for instances of "prondelim" and you'll see four (i.e., two pairings).

Luckily, dictionary.com's pronunciation delimiters for showing IPA (/ and /) differ from those used for showing spelled pronunciation ([ and ]). So I know my regex can distinguish between these similar-looking surrounding strings and that I'm matching the desired IPA-encoded source.

However, 'hello' has just one dictionary entry.

The word 'fly', for instance, has four entries on the same webpage. Each entry offers its own IPA-between-pronunciation-delimiters html. In this simple case, each pronunciation is exactly the same, so I don't really care whether I'm matching on the first, second, third, or fourth bit of IPA.

But other words may have different pronunciations for different meanings. Again, I'd like to match on and extract merely the first -- not last -- pronunciation. So what do I do? Would a 'lookahead' regex be the proper way to proceed here? Reversing the entire html source and then reversing my regular expression (so as to greedily match the original first expression instance) seems an awful hack.
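
For comparison, a small sketch of how first-versus-last matching behaves. Perl's regex engine always looks for the leftmost match, and a non-greedy (.*?) stops at the nearest closing string, so no reversing should be needed. The HTML fragment below is invented for illustration and is not dictionary.com's actual markup.

```perl
use strict;
use warnings;

# Made-up fragment with two pronunciation pairings. This is an assumption
# for illustration, not dictionary.com's actual markup.
my $entry_a = '<span class="prondelim">/</span><span class="pron">first</span><span class="prondelim">/</span>';
my $entry_b = '<span class="prondelim">/</span><span class="pron">second</span><span class="prondelim">/</span>';
my $html    = $entry_a . '<p>other stuff</p>' . $entry_b;

# The two surrounding strings, escaped so they match literally.
my $open  = quotemeta '"prondelim">/</span><span class="pron"';
my $close = quotemeta '/span><span class="prondelim"';

# Non-greedy, no /g: only the leftmost (first) pairing is captured.
my ($first) = $html =~ /$open(.*?)$close/s;

# Non-greedy with /g in list context: one capture per pairing, in order.
my @all = $html =~ /$open(.*?)$close/gs;

# Greedy: still starts at the first opening string, but runs on to the
# last closing string, so it swallows everything in between.
my ($greedy) = $html =~ /$open(.*)$close/s;

print "first:  $first\n";
print "all:    @all\n";
print "greedy: $greedy\n";
```

If a later pronunciation is ever wanted instead, the list from the /g form can simply be indexed, e.g. $all[1].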

Here's my failing code. Don't laugh if it looks newbie-ish -- it is. I really appreciate all of your help. You guys are much better at this stuff than I am. --Cypress

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use Spreadsheet::WriteExcel;
use FileHandle;
use strict;

# create useragent, open an excel workbook and sheet
my $ua = LWP::UserAgent -> new;
my $workbook = Spreadsheet::WriteExcel -> new ( "IPA.xls" );
my $sheet = $workbook -> add_worksheet ( );
$sheet -> set_column ( 0, 0, 100 );

# get html source and parse
my $address = "http://dictionary.reference.com/browse/hello";
my $request = HTTP::Request -> new ( GET => $address );
my $response = $ua -> request ( $request );
my $htmlsource;
my $writestring;
if ( $response -> is_success ) {
    $htmlsource = $response -> content;
    $writestring = parse( $htmlsource );
}

# write to spreadsheet, close excel
$sheet -> write ( 0, 0, $writestring );
$workbook -> close ( );

sub parse {
    my $source = shift;
    my $htmlchunk;
    my $ipa;

    # select from html source the chunk of html which contains IPA-
    # encoded symbols
    # this chunk will still contain html tags that need to be removed
    # i'll find it between the *first* (but perhaps not last) pairing
    # of these two surrounding strings:
    #   "prondelim">/</span><span class="pron"
    #   /span><span class="prondelim"
    if ( $source =~ /"prondelim">\/<\/span><span class="pron"(.*?)\/span><span class="prondelim"/ ) {
        $htmlchunk = $1;
    }

    # get rid of leading html tags, save the IPA or English bits
    # between '>' and '<',
    # and continue doing the same over the remaining chunk
    while ( $htmlchunk =~ /(.*?)>(.*?)<(.*)/ ) {
        $ipa = $ipa . $2;
        $htmlchunk = $3;
    }
    return $ipa;
}
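
The tag-stripping loop at the end of parse could perhaps be collapsed into a single global match. This is only a sketch: the helper name text_between_tags and the sample chunk are made up, and it assumes the chunk begins with '>' and ends with '<' the way the captured htmlchunk would.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text sitting between '>' and '<',
# concatenated in order.
sub text_between_tags {
    my ($chunk) = @_;
    return '' unless defined $chunk;    # nothing was matched upstream
    # In list context, /g returns every capture; join glues them together.
    return join '', $chunk =~ />([^<]*)</g;
}

# Made-up sample chunk, not real page source.
print text_between_tags('>h<b>el</b>lo<'), "\n";
```

Returning an empty string for an undefined chunk also avoids writing an undefined value to the spreadsheet when the page has no pronunciation.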

Holy cow! Just noticed to whom I'm replying! How do I prostrate myself before you over the internet?


In reply to Re^2: Writing International Phonetic Alphabet symbols to Excel? by cypress
in thread Writing International Phonetic Alphabet symbols to Excel? by cypress
