AnishaM has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

Is there a way to remove the BOM ÿþ from a string? I last saw these characters while reading a file, and you all helped me by pointing to an encoding problem. But is there a way to remove ÿþ from a string? I tried s/^\N{ZERO WIDTH NO-BREAK SPACE}//; and s/\xFE\xFF/\x{FFFD}/g but neither resolved the issue. Could you please give me some hints on how to solve this? I have been stuck on it for 2 days.

Thanks in advance.

Regards, AnishaM

Re: Remove ÿþ from a string
by Athanasius (Archbishop) on Sep 17, 2016 at 12:39 UTC

    Hello AnishaM,

    A quick search of CPAN turns up the module String::BOM, which has a strip_bom_from_string function:

        use strict;
        use warnings;
        use utf8;
        use String::BOM qw( strip_bom_from_string );

        my $string = 'ÿþThe quick brown fox jumped over the unfortunate dog.';
        print ">$string<\n";
        $string = strip_bom_from_string($string);
        print ">$string<\n";

    Output:

        22:35 >perl 1697_SoPW.pl
        >ÿþThe quick brown fox jumped over the unfortunate dog.<
        >The quick brown fox jumped over the unfortunate dog.<

        22:35 >

    Note that this will remove the BOM from the beginning of the string only — which is the only place a BOM is supposed to occur within a text stream (see Byte order mark).
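
If installing a module is not an option, the same idea can be sketched with core Perl only. This is a minimal sketch, not String::BOM's actual implementation: strip_byte_bom is a hypothetical name, and it assumes the string holds raw, undecoded bytes (which is where a UTF-16LE BOM shows up as the two bytes FF FE, displayed as ÿþ in Latin-1).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of String::BOM): remove one leading BOM
# from a string of raw, undecoded bytes.
sub strip_byte_bom {
    my ($bytes) = @_;
    # Longer signatures come first so that a UTF-32LE BOM (FF FE 00 00)
    # is not mistaken for a UTF-16LE BOM (FF FE).
    $bytes =~ s/\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\xEF\xBB\xBF|\xFE\xFF|\xFF\xFE)//;
    return $bytes;
}

my $raw   = "\xFF\xFE" . "abc";      # FF FE followed by three more bytes
my $clean = strip_byte_bom($raw);
printf "%d bytes before, %d bytes after\n", length($raw), length($clean);
```

Stripping the BOM this way only removes the marker; if the rest of the data is UTF-16, it still needs to be decoded before it is treated as text.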

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Remove ÿþ from a string
by shmem (Chancellor) on Sep 17, 2016 at 13:13 UTC

    Since the BOM is a Unicode character (U+FEFF), you have to match it as a Unicode code point to remove it:

    s/\x{feff}//;

    But if you use vim, removing the BOM is as simple as loading the file, typing

    :set nobomb

    and writing the file back to disk.
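
The s/\x{feff}// substitution earlier in this reply only matches once the data has been decoded into Perl characters; on raw UTF-8 bytes the BOM is still the three bytes EF BB BF. Here is a minimal sketch of that, assuming the input is UTF-8 (the thread does not confirm the actual encoding) and using the core Encode module. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xEF\xBB\xBF" . "hello";     # UTF-8 BOM followed by ASCII text
my $text  = decode('UTF-8', $bytes);      # decoded: U+FEFF followed by "hello"

# Anchored, since a BOM is only meaningful at the very start.
(my $stripped = $text) =~ s/\A\x{FEFF}//;

printf "%d characters decoded, %d after stripping\n",
    length($text), length($stripped);
```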

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Remove ÿþ from a string
by kroach (Pilgrim) on Sep 17, 2016 at 11:35 UTC
    If you want to remove particularly the character sequence "ÿþ", it's quite simple:
        use utf8;
        my $string = 'AÿþBÿþÿþC';
        $string =~ s/ÿþ//g;
        print $string;
    Result:
    ABC
Re: Remove ÿþ from a string
by hippo (Archbishop) on Sep 17, 2016 at 14:10 UTC
    I tried this: s/^\N{ZERO WIDTH NO-BREAK SPACE}//;

    Works for me (unanchored):

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Test::More tests => 2;

        my $src = "foo\x{feff}bar";
        my $res = $src;
        $res =~ s/\N{ZERO WIDTH NO-BREAK SPACE}//;
        isnt ($res, $src, 'Output differs from src');
        is ($res, 'foobar', 'BOM stripped');

    You can always add the anchor in if required (as the BOM should only occur at the start as Athanasius explained).

Re: Remove ÿþ from a string
by haukex (Archbishop) on Sep 17, 2016 at 14:31 UTC

    Hi AnishaM,

    Personally I would try to diagnose the issue a little further, since even a simple print of the variable can give misleading results when it comes to diagnosing encoding problems.

    If the data is coming from a file, I suggest you try the commands hexdump -C FILENAME or od -Ax -tx1z FILENAME and show us the first few lines.

    If the data is coming from some other source and you just have a Perl string, try the code use Devel::Peek; Dump( $string ); and show us that output.
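
If Devel::Peek output is more than you need, a quick look at the first few code points can already tell raw BOM bytes apart from a decoded U+FEFF. This is a rough sketch only; hex_prefix is a hypothetical helper name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first few characters of a string
# as space-separated hex code points.
sub hex_prefix {
    my ($string, $count) = @_;
    $count //= 4;
    return join ' ',
        map { sprintf('%02X', ord($_)) }
        split //, substr($string, 0, $count);
}

print hex_prefix("\xFF\xFE" . "a\x00"), "\n";   # raw UTF-16LE bytes: FF FE 61 00
print hex_prefix("\x{FEFF}" . "abc"), "\n";     # decoded text:       FEFF 61 62 63
```

Seeing FF FE as two separate values suggests undecoded UTF-16LE data, while a single FEFF means the string was decoded and still carries the BOM character.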

    Regards,
    -- Hauke D

Re: Remove ÿþ from a string
by kcott (Archbishop) on Sep 17, 2016 at 16:01 UTC
      This is an example from some code I wrote to read in a MS file that had a superfluous BOM.
          my @file;
          {
              my $lh;
              open($lh, "<:utf8:crlf", $LogFn) || do {
                  Pe "\nlogfile <%s> not found", $LogFn;
                  help;
              };
              @file = grep { s/([^\r]+)$/$1/; m{^\s*$} ? undef : $_; } <$lh>;
              close $lh;
          }
          $file[0] =~ s/^\N{U+FEFF}//;    # UTF-8 BOM
      The stuff immediately after the open was to remove the carriage returns, so I could have normal unix line endings.
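
The ÿþ in the original question is what the bytes FF FE look like when shown as Latin-1, and FF FE is the UTF-16LE byte order mark. If the file really is UTF-16 (an assumption, since the thread never shows a hex dump), an alternative to stripping by hand is to let the I/O layer decode it: the UTF-16 encoding in Encode uses the BOM to pick the byte order and does not pass it on. A self-contained sketch, which writes its own temporary sample file so it can be run as is:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file: UTF-16LE BOM, then "hi\n" encoded as UTF-16LE.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xFF\xFE" . "h\x00" . "i\x00" . "\n\x00";
close $out or die "close failed: $!";

# Read it back through the decoding layer; the BOM selects the byte order.
open my $in, '<:encoding(UTF-16)', $path or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "read %d characters\n", length($line);
```

Here $line should hold just the two characters of "hi", with no BOM left to remove.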

      "Pe", BTW is part of "P". From what I understand, in newer perls, a new keyword, "err" is weaker version of the same (doesn't allow a format statement).

      Hope this is of use.

Re: Remove ÿþ from a string
by AnishaM (Acolyte) on Sep 18, 2016 at 06:36 UTC
    Thank you so much, everyone, for the help. I was able to resolve the issue with all your help and hints. Thanks a ton :) For a newbie programmer like me, it feels great to see so many different ways to solve an issue. Great learning for me :)