kcinmd has asked for the wisdom of the Perl Monks concerning the following question:

I am about to implement a preprocessing one-liner to remove invalid ASCII character representations that fall outside the range of our decoding ability. Our data feed is a drop directory of XML that is ultimately loaded into a data warehouse. Prior to the load, I will have the ETL issue the following pre-command:

perl -pi -e 's/^&#x.+;//g' ./*.xml

Testing has produced the desired result. I figured I would throw this out there for opinions, just in case I am missing something.

Re: Sanity Check
by choroba (Cardinal) on Jan 28, 2015 at 12:45 UTC
    Do you really want to get nothing for the following line?
      Some content  
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      As far as I can tell, these out-of-range ASCII characters are introduced by users copying from MS Word into a long description. For example:

      60 TON AIR-COOLED ROTARY SCROLL CHILLER Carrier Model no.30RAP060 Chiller Manufactured in Charlotte, NC ò AHRI Performance: 56.0 Tons, Full Load û 10.2 EER, IPLV &#xFB; 14.5 EER ò Electrical Requirements for 208/3: MCA û 279.7, MOCP &#xFB; 300, Rec $37,855.00 $37,855.00 Quote ref: TII/CA/1214/2895 Page 1 of 5 Fuse Size û 300 ò Electrical Requirements for 460/3: MCA û 134.6, MOCP &#xFB; 150, Rec Fuse Size û 150 ò AHRI STANDARD 550/590 CERTIFIED ò ASHRAE 90.1 COMPLIANT ò 7.5 HP Constant speed, single-pump package (61.3Æ ext. head @ 134 GPM)

      I do not care about the content of the long description. I would much rather have a fully automated solution that keeps a human out of the loop. Without the pre-command, my ETL (Informatica) cannot digest the unknown character and aborts, and then I get a phone call!

        I see. What about the entities in non-starting positions, then?
        Fuse Size û 150
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Perhaps it would be better to just skip the lines with encoded chars:
        perl -ne 'print unless /&#x/'
        Stuff like "134 GPM)" is on a separate line; I don't know if that's a problem.