Re: Copy html tag and replace umlauts with alternate spellings

This seems to be a good example for showing how a "simple regex solution" by itself just won't work -- you have to parse the data before doing anything with regexes to fix the spellings.

Here's a minimal solution using HTML::Parser. It would be worthwhile and instructive to use Unicode::Normalize as well, but if we're just twiddling umlauts, this is good enough. (Still, you'll want to check the output carefully...):

#!/usr/bin/perl

use strict;
use HTML::Parser;

# set up a hash containing the umlauted characters and their replacements:
my %replace = (
    "\xC4" => 'Ae', "\xCF" => 'Ie', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xEF" => 'ie', "\xF6" => 'oe', "\xFC" => 'ue',
);
my $um = join '', keys %replace;

binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
$/ = undef;
my $input = <>;
my $output = '';

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [ \&fix_umlaut, 'tagname, attr, text' ],
                           default_h => [ \&copy, 'text' ],
    );
$p->empty_element_tags( 1 );
$p->parse( $input );

print $output;

sub fix_umlaut
{
    my ( $tagname, $attr, $text ) = @_;
    $output .= $text;
    if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
        $text =~ s/([$um])/$replace{$1}/g;
        $output .= $text;  # repeat the tag with modified umlauts
    }
}

sub copy
{
    $output .= $_[0];
}

That's set up to work as a "stdin - stdout filter" -- in other words, it's strictly a command line process, and the usage is supposed to be: script_name < input.html > output.html

The HTML::Parser man page is well worth studying.
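
For anyone curious about the Unicode::Normalize route mentioned above, here is a rough sketch of the idea, not something wired into the script. It assumes the input has already been decoded to Perl characters, and the helper name transliterate_umlauts is made up for illustration. Decomposing to NFD turns each umlauted vowel into a base letter plus COMBINING DIAERESIS (U+0308), so a single substitution covers both precomposed and already-decomposed input:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper (not part of the script above): replace
# A/O/U/I plus diaeresis with the base letter followed by "e".
sub transliterate_umlauts {
    my ($str) = @_;

    # NFD splits e.g. U+00E4 into "a" followed by U+0308
    my $nfd = NFD($str);

    # swap the combining diaeresis for a plain "e"
    $nfd =~ s/([AOUIaoui])\x{0308}/$1e/g;

    # recompose whatever other accented characters remain
    return NFC($nfd);
}

print transliterate_umlauts("ab\x{E4}ndern"), "\n";   # prints: abaendern
```

The letter list mirrors the keys of the %replace hash in the script; other diaeresis-bearing letters are left alone.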

Re^2: Copy html tag and replace umlauts with alternate spellings by Anonymous Monk on Mar 27, 2011 at 20:30 UTC
The HTML::Parser solution worked, but it had an unexpected side effect involving inflection data (identified by "infl="), which I had not noticed before posting. Below is an entry produced by the Parser solution. I now realize that I need to create the main headword with the alternate spelling, but without inflection data for it, because the copied inflections are nonsensical words.

<idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern" infl="abaendere,abaenderen,abaenderest,abaenderet,abaendern,abaenderst,abaendert,abaenderte,abaenderten,abaendertest,abaendertet,abgeaendert,abzuaendern"/><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>sw. V.; hat: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>

I know it is a lot to ask, but can anyone suggest a change to the HTML::Parser script above that prevents the inflection data from being copied into the new tag?

My desired result is below.

<idx:short><div height="4"><a name="83"/><div><idx:orth value="abändern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuändern"/><idx:orth value="abaendern"><betonung/><b><b>a</b></b><b>b</b>·<b>än</b>·<b>dern </b>sw. V.; hat: </div><blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>1.</B> ein wenig, in Teilen ändern: <i>das Testament, den Antrag, Beschluss, das Programm a. </i> </div></blockquote><blockquote><div width="-70"><img hspace="0" vspace="0" align="middle" hisrc="bbm/rectangle-php/40-1-h.gif" src="bbm/rectangle-php/40-1-m.gif"/><B>2.</B> (BIOL.) (durch Mutation od. Umwelt) in den Artmerkmalen variieren, sich wandeln: <i>die Farben der Blüten ändern stark ab.</i> </div></blockquote></blockquote></div></idx:short></idx:entry><div height="10" align="center"><img hspace="0" vspace="0" align="middle" losrc="bbm/rectangle-php/150-1-U35555555-l.gif" hisrc="bbm/rectangle-php/520-4-U35555555-h.gif" src="bbm/rectangle-php/200-1-U35555555-m.gif"/><br/></div>
Re^3: Copy html tag and replace umlauts with alternate spellings by graff (Chancellor) on Mar 30, 2011 at 21:11 UTC
PLEASE DO NOT USE <pre>...</pre> (or <tt>...</tt>) when posting at perlmonks -- always use "<c>...</c>" for code and data.

Now, if you really are so severely unfamiliar with Perl that you don't see the easy solution, you really should consider looking things up... find a copy of "Learning Perl", look through online tutorials (here at perlmonks and elsewhere), etc.

The easy solution involves adding one line to the "if(...)" block in the "fix_umlaut" subroutine:

if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
    $text =~ s/\s+infl="[^"]+"//;  #<-- add this line
    $text =~ s/([$um])/$replace{$1}/g;
    $output .= $text;  # repeat the tag with modified umlauts
}

If the tag does not contain an "infl" attribute, the added line does nothing; if the "infl" is present, it will be deleted (along with its full value) before appending the tag to the output.

If you run into more problems, try working them out yourself first -- then if you still need help, show us what you tried. (And sign up for a user account.)
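
To see what those two substitutions do without running the whole parser, here is a small stand-alone sketch. The sub name alternate_tag and the shortened sample tag are made up for the demonstration, and %replace is trimmed to four lowercase entries; the two s/// lines are the same ones used in the "if" block:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# trimmed copy of the replacement hash from the main script
my %replace = (
    "\xE4" => 'ae', "\xEF" => 'ie', "\xF6" => 'oe', "\xFC" => 'ue',
);
my $um = join '', keys %replace;

# Hypothetical helper: takes the raw text of a start tag and
# returns the alternate-spelling copy without its infl attribute.
sub alternate_tag {
    my ($text) = @_;
    $text =~ s/\s+infl="[^"]+"//;         # drop infl and its full value
    $text =~ s/([$um])/$replace{$1}/g;    # respell the umlauts
    return $text;
}

# shortened sample tag; \xE4 is a-umlaut
my $tag = qq{<idx:orth value="ab\xE4ndern" infl="ab\xE4ndere,ab\xE4nderen"/>};
print alternate_tag($tag), "\n";   # prints: <idx:orth value="abaendern"/>
```

The order matters only for tidiness: stripping "infl" first means the umlaut substitution has less text to walk through.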
Re^3: Copy html tag and replace umlauts with alternate spellings by Anonymous Monk on Mar 27, 2011 at 20:36 UTC
Reposting the new desired headword, as I left off the closing slash mark: <idx:orth value="abaendern"/>