Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Forgive my ignorance, as I am a complete newbie to Perl, and I'm afraid my desired task takes the skills of a master. So I humbly present my problem here in hopes that one of the many masters will help a new initiate. Thanks in advance.

I need to loop through a large HTML file, find umlauted characters in a specific tag, copy the tag, alter the copy by replacing each umlauted character with two plain letters, and append the copy next to the original tag. I feel like I can use a regular expression (perlre) to find the pattern, and then the tag will have to be copied, altered, and placed correctly next to the original tag.

1. The script must copy a tag, replace the umlauted letter with oe, ue, or ae, and append the copy at the same location. Below are two examples.

<idx:orth value="aufhören"/> <idx:orth value="Förderverein"/>

What we need is two tags like this:

<idx:orth value="aufhören"/><idx:orth value="aufhoeren"/>

or

<idx:orth value="Förderverein"/><idx:orth value="Foerderverein"/>

2. The case of the headword should be retained. If it is capitalized then the new tag should also be capitalized. An example is presented above and below.

<idx:orth value="Öl"/><idx:orth value="Oel"/>

The script will have to identify every use of an umlaut in this tag set, copy the tag, and write it next to the original tag with the spelling change.

Below is an example of the full entry:

<idx:entry name="dic" scriptable="yes" wild="yes" spell="yes"><idx:short><div height="4"><a name="51707"/><div><idx:orth value="Förderverein"/><b>F</b><font color="#000000"><betonung/><b><b>ö</b></b></font><b>r</b>·<b>der</b>·<b>ver</b>·<b>ein, </b>der: </div><blockquote>zur <a href="#51698"><font size="+1"><b><img hspace="0" align="middle" hisrc="bbm/t2i-ie/U8593.1NI48A-h.gif"/></b></font> Förderung (1)</a> einer bestimmten Sache gegründeter Verein.</blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

A corrected version of the html would look like this:

<idx:entry name="dic" scriptable="yes" wild="yes" spell="yes"><idx:short><div height="4"><a name="51707"/><div><idx:orth value="Förderverein"/><idx:orth value="Foerderverein"/><b>F</b><font color="#000000"><betonung/><b><b>ö</b></b></font><b>r</b>·<b>der</b>·<b>ver</b>·<b>ein, </b>der: </div><blockquote>zur <a href="#51698"><font size="+1"><b><img hspace="0" align="middle" hisrc="bbm/t2i-ie/U8593.1NI48A-h.gif"/></b></font> Förderung (1)</a> einer bestimmten Sache gegründeter Verein.</blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

Replies are listed 'Best First'.
Re: Copy html tag and replace umlauts with alternate spellings
by graff (Chancellor) on Mar 26, 2011 at 16:27 UTC
    This seems to be a good example for showing how a "simple regex solution" by itself just won't work -- you have to parse the data before doing anything with regexes to fix the spellings.

    Here's a minimal solution using HTML::Parser. It would be worthwhile and instructive to use Unicode::Normalize as well, but if we're just twiddling umlauts, this is good enough. (Still, you'll want to check the output carefully...):

    #!/usr/bin/perl
    use strict;
    use HTML::Parser;

    # set up a hash containing the umlauted characters and their replacements:
    my %replace = (
        "\xC4" => 'Ae', "\xCF" => 'Ie', "\xD6" => 'Oe', "\xDC" => 'Ue',
        "\xE4" => 'ae', "\xEF" => 'ie', "\xF6" => 'oe', "\xFC" => 'ue',
    );
    my $um = join '', keys %replace;

    binmode STDIN,  ':utf8';
    binmode STDOUT, ':utf8';
    $/ = undef;          # slurp mode: read the whole input at once
    my $input  = <>;
    my $output = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&fix_umlaut, 'tagname, attr, text' ],
        default_h   => [ \&copy, 'text' ],
    );
    $p->empty_element_tags( 1 );
    $p->parse( $input );
    $p->eof;             # flush any text the parser is still buffering
    print $output;

    sub fix_umlaut {
        my ( $tagname, $attr, $text ) = @_;
        $output .= $text;
        if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
            $text =~ s/([$um])/$replace{$1}/g;
            $output .= $text;  # repeat the tag with modified umlauts
        }
    }

    sub copy { $output .= $_[0]; }

    That's set up to work as a "stdin - stdout filter" -- in other words, it's strictly a command line process, and the usage is supposed to be:  script_name < input.html > output.html

    The HTML::Parser man page is well worth studying.
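
As a rough illustration of the Unicode::Normalize idea mentioned above, here is a minimal sketch that is not part of the posted script. The helper name `expand_umlauts` is hypothetical, and the sketch assumes only a, o and u (either case) need handling. Decomposing with NFD first means that a precomposed "ö" and an "o" followed by a combining diaeresis (U+0308) are treated the same way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                 # this source contains literal UTF-8 text
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper (an assumption, not from the script above):
# rewrite a/o/u + diaeresis as ae/oe/ue, keeping the case of the base letter.
sub expand_umlauts {
    my ($word) = @_;
    my $decomposed = NFD($word);          # "ö" becomes "o" followed by U+0308
    $decomposed =~ s/([AOUaou])\x{0308}/${1}e/g;
    return NFC($decomposed);              # recompose whatever was left untouched
}

binmode STDOUT, ':encoding(UTF-8)';
print expand_umlauts($_), "\n" for qw(aufhören Förderverein Öl);
```

With the three sample headwords from the question, this should print aufhoeren, Foerderverein and Oel. Other accented letters pass through unchanged, so the lookup table in the script above would still be needed for anything beyond a, o and u.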

      The HTML::Parser solution worked, but it had an unexpected side effect involving inflection data (identified by "infl="), which I did not notice before posting. Below is the result for an entry created with the Parser solution. I now realize that I need to create the main headword with an alternate spelling but leave out the inflection data, so that no nonsensical inflected forms are generated for the alternate spelling.

      <idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern" infl="abaendere,abaenderen,abaenderest,abaenderet,abaendern,abaenderst,abaendert,abaenderte,abaenderten,abaendertest,abaendertet,abgeaendert,abzuaendern"/><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>&#139;sw. V.; hat&#155;: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

      I know it is a lot to ask, but can anyone suggest a change to the HTML::Parser script above to prevent the inflectional data from being produced? My desired result is below.

      <idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern"><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>&#139;sw. V.; hat&#155;: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>
        PLEASE DO NOT USE <pre>...</pre> (or <tt>...</tt>) when posting at perlmonks -- always use "<c>...</c>" for code and data.

        Now, if you really are so severely unfamiliar with Perl that you don't see the easy solution, you really should consider looking things up... find a copy of "Learning Perl", look through online tutorials (here at perlmonks and elsewhere), etc.

        The easy solution involves adding one line to the "if(...)" block in the "fix_umlaut" subroutine:

        if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
            $text =~ s/\s+infl="[^"]+"//;  #<-- add this line
            $text =~ s/([$um])/$replace{$1}/g;
            $output .= $text;  # repeat the tag with modified umlauts
        }

        If the tag does not contain an "infl" attribute, the added line does nothing; if the "infl" is present, it will be deleted (along with its full value) before appending the tag to the output.
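
To see the effect of the two substitutions in isolation, here is a small standalone sketch that needs no HTML::Parser. The helper name `alternate_tag` and the shortened sample tags are assumptions made for illustration; `$text` stands in for the raw start-tag text that the parser would hand to `fix_umlaut`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                 # this source contains literal UTF-8 text

# Same idea as the %replace table in the script, reduced to German umlauts.
my %replace = (
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
);
my $um = join '', keys %replace;

# Hypothetical helper: build the alternate-spelling copy of one start tag.
sub alternate_tag {
    my ($text) = @_;
    $text =~ s/\s+infl="[^"]+"//;         # drop the infl attribute, if present
    $text =~ s/([$um])/$replace{$1}/g;    # respell the remaining umlauts
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print alternate_tag('<idx:orth value="abändern" infl="abändere,abänderen"/>'), "\n";
print alternate_tag('<idx:orth value="Öl"/>'), "\n";
```

The order of the two substitutions matters only for efficiency: removing the attribute first means the umlaut substitution never touches the inflected forms that are about to be discarded.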

        If you run into more problems, try working them out yourself first -- then if you still need help, show us what you tried. (And sign up for a user account.)

        Reposting the new desired headword, as I left off the closing slash: <idx:orth value="abaendern"/>
Re: Copy html tag and replace umlauts with alternate spellings
by moritz (Cardinal) on Mar 26, 2011 at 14:46 UTC