Limit substitution in html parsing

pcerda has asked for the wisdom of the Perl Monks concerning the following question:

With the help of some community members I have successfully used a script to parse a huge html file and add alternate spellings. However the substitution was too greedy and gave me unuseable inflection data "infl=". I need to limit the copy to only the value attribute (head word of a dictionary entry). Below is the the original version, the version created by the script, the desired output, and the script. Is there anyone who can suggest a change to this script that will prevent it from creating incorrect inflection data?

Original version

     
<idx:short><div height="4"><a name="83"/><div><idx:orth value="abänder
+n" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,a
+bändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuän
+dern"/>
[download]

Result given from the script

    
  <idx:short><div height="4"><a name="83"/><div><idx:orth value="abänd
+ern" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst
+,abändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzu
+ändern"/>
<idx:orth value="abaendern" infl="abaendere,abaenderen,abaenderest,aba
+enderet,abaendern,abaenderst,abaendert,abaenderte,abaenderten,abaende
+rtest,abaendertet,abgeaendert,abzuaendern"
[download]

Desired output

   
<idx:short><div height="4"><a name="83"/><div><idx:orth value="abänder
+n" infl="abändere,abänderen,abänderest,abänderet,abändern,abänderst,a
+bändert,abänderte,abänderten,abändertest,abändertet,abgeändert,abzuän
+dern"/><idx:orth value="abaendern"/>
[download]

Script as it currently is

#!/usr/bin/perl

use strict;
use HTML::Parser;

# set up a hash containing the umlauted characters and their replaceme
+nts:
my %replace = (
    "\xC4" => 'Ae', "\xCF" => 'Ie', "\xD6" => 'Oe', "\xDC" => 'Ue',
    "\xE4" => 'ae', "\xEF" => 'ie', "\xF6" => 'oe', "\xFC" => 'ue',
);
my $um = join '', keys %replace;

binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
$/ = undef;
my $input = <>;
my $output = '';

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [ \&fix_umlaut, 'tagname, attr, 
+text' ],
                           default_h => [ \&copy, 'text' ],
    );
$p->empty_element_tags( 1 );
$p->parse( $input );

print $output;

sub fix_umlaut
{
    my ( $tagname, $attr, $text ) = @_;
    $output .= $text;
    if ( $tagname eq 'idx:orth' and $$attr{value} =~ /[$um]/ ) {
        $text =~ s/([$um])/$replace{$1}/g;
        $output .= $text;  # repeat the tag with modified umlauts
    }
}

sub copy
{
    $output .= $_[0];
}
[download]

Comment on Limit substitution in html parsing Select or Download Code

Replies are listed 'Best First'.
Re: Limit substitution in html parsing by jethro (Monsignor) on Mar 28, 2011 at 00:38 UTC
I'm not sure you'll find many people kind enough to wade through those unreadable text blobs to find the differences between them. Naturally if anyone really wants to do it he would copy the html texts into two files and run 'diff' on them. But I think you would get much more cooperation if formatted the html and pruned down the example to the important parts or alternatively marked them. UPDATE: Instead of using $text you might construct the tag yourself: `sub fix_umlaut { my ( $tagname, $attr, $text ) = @_; $output .= $text; if ( $tagname eq 'idx:orth' and $$attr{value} =~ s/[$um]/$replace{ +$1}/g ) { delete $$attr{'infl'}; $output .= '<idx:orth ' . map( " $_=\"".$$attr{$_}.'"',keys(%$ +attr)) . '/>'; # repeat the tag with modified umlauts } }` [download]	[reply] [d/l]
Re: Limit substitution in html parsing by wind (Priest) on Mar 28, 2011 at 09:03 UTC
It looks like you just want second idx:orth tag to have the non-special character value and no additional attributes. Therefore just output the tag explicitly instead of taking the previous text and modifying it since that will contain the potential additional attributes. `sub fix_umlaut { my ( $tagname, $attr, $text ) = @_; $output .= $text; if ( $tagname eq 'idx:orth' && $attr->{value} =~ s/([$um])/$replac +e{$1}/g ) { $output .= qq{<idx:orth value="$attr->{value}"/>}; # repeat t +he tag with modified umlauts } }` [download]	[reply] [d/l]