in reply to Re: Copy html tag and replace umlauts with alternate spellings
in thread Copy html tag and replace umlauts with alternate spellings

The HTML::Parser solution worked, but it had an unexpected side effect involving the inflection data (the "infl=" attribute), which I did not notice before posting. Below is the result for one entry created with the Parser solution. I now realize that I need to create the main headword with the alternate spelling but leave out the inflection data for it, because the generated inflections are nonsensical words.

<idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern" infl="abaendere,abaenderen,abaenderest,abaenderet,abaendern,abaenderst,abaendert,abaenderte,abaenderten,abaendertest,abaendertet,abgeaendert,abzuaendern"/><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>&#139;sw. V.; hat&#155;: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

I know it is a lot to ask, but can anyone suggest a change to the HTML::Parser script above that prevents the inflection data from being produced for the alternate spelling? My desired result is below.

<idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern"><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>&#139;sw. V.; hat&#155;: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

Re^3: Copy html tag and replace umlauts with alternate spellings
by graff (Chancellor) on Mar 30, 2011 at 21:11 UTC
    PLEASE DO NOT USE <pre>...</pre> (or <tt>...</tt>) when posting at perlmonks -- always use "<c>...</c>" for code and data.

    Now, if you really are so severely unfamiliar with Perl that you don't see the easy solution, you really should consider looking things up... find a copy of "Learning Perl", look through online tutorials (here at perlmonks and elsewhere), etc.

    The easy solution involves adding one line to the "if(...)" block in the "fix_umlaut" subroutine:

    if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
        $text =~ s/\s+infl="[^"]+"//;  #<-- add this line
        $text =~ s/([$um])/$replace{$1}/g;
        $output .= $text;  # repeat the tag with modified umlauts
    }

    If the tag does not contain an "infl" attribute, the added line does nothing; if the "infl" is present, it will be deleted (along with its full value) before appending the tag to the output.
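
As a minimal standalone sketch of the two substitutions described above: the `%replace` hash, the `$um` character class and the `alternate_orth` helper below are hypothetical stand-ins for whatever the original script defines (its full source is not shown here), and the sample tag is shortened from the posted entry.

```perl
use strict;
use warnings;
use utf8;                                  # the source below contains literal umlauts
binmode STDOUT, ':encoding(UTF-8)';

# Assumed mapping; the original script may define %replace and $um differently.
my %replace = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
);
my $um = join '', keys %replace;           # characters for the [$um] class

# Hypothetical helper: build the alternate-spelling copy of an idx:orth tag.
sub alternate_orth {
    my ($text) = @_;
    $text =~ s/\s+infl="[^"]+"//;          # drop the infl attribute, if present
    $text =~ s/([$um])/$replace{$1}/g;     # replace each umlaut with its digraph
    return $text;
}

my $tag = '<idx:orth value="abändern" infl="abändere,abänderst,abgeändert"/>';
print alternate_orth($tag), "\n";          # prints <idx:orth value="abaendern"/>
```

The order matters only for tidiness: removing the attribute first means the umlaut substitution never touches the inflected forms that are about to be discarded.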

    If you run into more problems, try working them out yourself first -- then if you still need help, show us what you tried. (And sign up for a user account.)

Re^3: Copy html tag and replace umlauts with alternate spellings
by Anonymous Monk on Mar 27, 2011 at 20:36 UTC
    Reposting the new desired headword, as I left off the closing slash: <idx:orth value="abaendern"/>