RT::Client turns occasional binary characters in to wide characters

wardmw has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a Perl script to retrieve attachments from an RT instance and write them out to individual files. I did this three years ago with a previous incarnation of the RT::Client libraries and, with assistance from your clergy, got it working. I have since upgraded to a newer revision of Perl (5.16.3) and a newer version of the RT Perl libraries (0.52). I am having to rewrite this old code and have run into a new, different problem.

All I want to do is take the attachments from the tickets and write them out to files, in whatever format they were stored in (png, xlsx, txt, whatever). You'd think this was a simple task, but I can't make it happen.

If I save a file using the RT web interface and hex dump the first few chunks it looks like this:

0000000 50 4b 03 04 14 00 09 00 08 00 67 8d 25 46 00 00
0000010 00 00 00 00 00 00 00 00 00 00 1c 00 00 00 73 63
0000020 72 65 65 6e 73 68 6f 74 2d 31 37 32 20 32 31 20
0000030 32 34 32 20 36 34 2e 7a 69 70 32 9e 8a fc b5 15

Yet if I read the attachment into a string using the RT API and hex dump that string, I see:

00000000  50 4B 03 04 14 00 09 00 - 08 00 67 FFFD 25 46 00 00  PK........g.%F..
00000010  00 00 00 00 00 00 00 00 - 00 00 1C 00 00 00 73 63  ..............sc
00000020  72 65 65 6E 73 68 6F 74 - 2D 31 37 32 20 32 31 20  reenshot-172 21
00000030  32 34 32 20 36 34 2E 7A - 69 70 32 FFFD FFFD FFFD FFFD 15  242 64.zip2.....

Note the four-digit hex chars on lines 1 and 4. The code I used to generate the above is as follows:

#!/usr/bin/env perl
#
# wctest.pl - A test to see where the wide characters come from.

use strict;
use warnings;
use RT::Client::REST;
use RT::Client::REST::Ticket;
use Data::HexDump;

my $user='user';
my $pass='pass';

my $rt = RT::Client::REST->new(
    server  => ('https://rt.local'),
    basic_auth_cb => ( sub { return ($user, $pass); } )

);

$rt->login( username=> $user, password=> $pass,);
my $ticket_ptr = RT::Client::REST::Ticket->new(rt => $rt);

my $results = $ticket_ptr->search( limits => [ { attribute => 'id', operator => '=', value => '51447' }, ],);

my $iterator = $results->get_iterator;
my ($ticket, $attachments);
while ($ticket = &$iterator) {
    $attachments = $ticket->attachments;

    # Store attachments
    my $atch_ater = $attachments->get_iterator;
    while (my $att = &$atch_ater) {
        next if ($att->file_name eq '');
        print HexDump substr($att->content, 0, 64);
    }
}

I've tried adding

use Encode;

and

use utf8;

both together and alone, but neither made any difference: the hex dump still shows multibyte characters.

I have even tried just writing the data out to files, in case the problem was with the HexDump module, but the files still were not created in their native format. The output file usually contains almost twice as many characters as the attachment string.

I would appreciate any help or guidance you might be able to provide. I'm banging my head against the wall here.

|\/|artin


Replies are listed 'Best First'.
Re: RT::Client turns occasional binary characters in to wide characters by BillKSmith (Monsignor) on Oct 02, 2018 at 22:14 UTC
I do not think that there is anything wrong with your string, only with the utility you use to display it. I have recreated your file (named it 1223420.png), displayed it with xxd to show I got it right (ignore the Windows CRLF at the end), read that file into a string and converted the first 16 characters of that string to hex with unpack. The content of the string is correct.

$xxd 1223420.png
00000000: 504b 0304 1400 0900 0800 678d 2546 0000  PK........g.%F..
00000010: 0000 0000 0000 0000 0000 1c00 0000 7363  ..............sc
00000020: 7265 656e 7368 6f74 2d31 3732 2032 3120  reenshot-172 21
00000030: 3234 3220 3634 2e7a 6970 329e 8afc b515  242 64.zip2.....
00000040: 0d0a                                     ..
$type 1223420.pl
use strict;
use warnings;
my $string = <>;
print unpack( 'H32', $string );
$perl 1223420.pl 1223420.png
504b0304140009000800678d25460000

Also note in your output that, although the offending bytes are printed as four hex characters, the address of the following bytes is correct.

Bill
Re^2: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 03, 2018 at 10:50 UTC
Thanks for responding, Bill. I haven't tried unpack on the string yet; I will give that a go and see what it returns. You are right that the address of the following bytes is correct. The Perl hexdump output shows four hex characters where the Linux hexdump output shows only two, so something is translating the data internally, as @Veltro suggests below.
Re: RT::Client turns occasional binary characters in to wide characters by haukex (Archbishop) on Oct 04, 2018 at 08:01 UTC
I also suspect that this is some encoding problem, but my suspicion is that the problem happens before RT::Client::REST even hands you the data, and the display of the data with Data::HexDump is just a symptom. Use Devel::Peek to really see what Perl is storing internally; maybe you could show that here. I haven't had a chance to look at the issue in detail myself, partly because I haven't yet found an RT ticket with a binary attachment to play with. I see you reported #127288, but did you have a look at #90112, which seems to discuss a similar issue? Just an idea: perhaps you could attach an example file that reproduces the issue to #127288? (Or maybe someone knows of an RT issue with a binary attachment off the top of their head.)
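For anyone following along, here is a minimal sketch of the Devel::Peek suggestion. The two strings are made-up stand-ins (not data from the ticket): one holds the raw bytes from the original file, the other holds what the hex dump in the question implies, with U+FFFD in place of the 0x8D byte. Dump writes to STDERR; the thing to look for is the UTF8 flag in the FLAGS line and the length of the PV.

```perl
use strict;
use warnings;
use Devel::Peek;

# Stand-in data, assumed for illustration only.
my $bytes = "g\x8d%F";         # four raw bytes, as in the original file
my $chars = "g\x{FFFD}%F";     # four characters, one of them U+FFFD

# Dump prints Perl's internal view of each scalar to STDERR.
Dump($bytes);
Dump($chars);

printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

Running the same Dump call on the first few characters of `$att->content` would show whether the replacement characters are already in the string that the library returns.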
Re^2: RT::Client turns occasional binary characters in to wide characters by Veltro (Hermit) on Oct 04, 2018 at 09:28 UTC
... problem happens before RT::Client::REST even hands you the data...

Because of this comment I looked a little at the source of RT::Client::REST and noticed two debug/logger lines that could be of interest, since they let you see both the request and the response. These lines are located in the _submit method:

# Then we send the request and parse the response.
$self->logger->debug('request: ', $req->as_string);
my $res = $self->_ua->request($req);
$self->logger->debug('response: ', $res->as_string);

To use them you need to create a custom logger object that provides the methods debug, warn, info and error, pass it in the logger function, and write the debug method to print the output somewhere. I think this may help to pinpoint where the problem is really located (server side or client side). If the problem is indeed on the server side then I'm afraid there is not much you can do.
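A minimal sketch of such a logger object, assuming only what is described above: the object needs debug, warn, info and error methods. The package name My::DebugLogger is hypothetical, and the commented-out constructor call at the end is an assumption about how the object would be handed to RT::Client::REST, so check the module's documentation before relying on it.

```perl
use strict;
use warnings;

package My::DebugLogger;    # hypothetical name, not part of RT::Client::REST

sub new {
    my ($class) = @_;
    return bless { lines => [] }, $class;
}

# Shared helper: remember the message and print it to STDERR.
sub _log {
    my ($self, $level, @parts) = @_;
    my $line = "[$level] " . join('', @parts);
    push @{ $self->{lines} }, $line;
    print STDERR $line, "\n";
    return;
}

sub debug { my $self = shift; return $self->_log('debug', @_) }
sub info  { my $self = shift; return $self->_log('info',  @_) }
sub warn  { my $self = shift; return $self->_log('warn',  @_) }
sub error { my $self = shift; return $self->_log('error', @_) }

package main;

my $logger = My::DebugLogger->new;

# Called the same way the _submit method calls it: a label, then a string.
$logger->debug('request: ', 'example request text');

# Assumed usage, not run here:
# my $rt = RT::Client::REST->new( server => 'https://rt.local', logger => $logger );
```

Keeping the messages in an array as well as printing them makes it easy to inspect the raw response afterwards, for example with a hex dump of the body.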
Re^3: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 09, 2018 at 11:03 UTC
OK, so I started using the logger object, which provided more information but nothing that I found useful. For future readers: the Log::Log4perl module works perfectly for this.

I went back over my code and switched from using the `ticket->attachments` pointer (which doesn't reference the undecoded option as far as I can tell) to the `get_attachment_ids` / `get_attachment` loop. I also created a one-line text file, zipped it up and added it to a new RT ticket on my system so that I had a 75 byte gzip file to test with. This worked: I was able to download, unzip and read the file. So I went back and tried the original test ticket, which also worked. I will post the working code below.

Having got this to work I went back over my old code to see what happened before and the answer is, as always, a simple one. Instead of using the `undecoded => 1` parameter I had `uudecoded => 1`. This seemed reasonable to my eye as I scoured the code for errors because I Know Stuff (tm): I knew that uuen/decoding was a valid way of coding characters, so I didn't question it. It looked right.

So the fault was all mine, a simple typo, although I will add an RFE to the RT::Client::REST developers to add warnings if unknown options are specified, as this would have saved me and you a few weeks' worth of debugging.

The final, working code looks like this:

#!/usr/bin/env perl
#
# wctest.pl - A test to retrieve and save an attachment.

use strict;
use warnings;
use RT::Client::REST;
use RT::Client::REST::Ticket;
use Log::Log4perl;

my $user='xxxx';
my $pass='yyyy';

my $rt = RT::Client::REST->new(
    server  => ('https://rt.local'),
    basic_auth_cb => ( sub { return ($user, $pass); } ),
);

$rt->login( username=> $user, password=> $pass,);

#
# Get attachments using get_attachment
#
my @results = $rt->search( type => 'ticket', query => "id=51447" );
my ($id, @atch_ids, $atch_id, $atch);
for $id (@results) {
    @atch_ids = $rt->get_attachment_ids( id => $id);
    for $atch_id (@atch_ids) {
        $atch = $rt->get_attachment (parent_id => $id, id => $atch_id, undecoded => 1);
        next if (! defined($atch->{'Filename'}) );
        next if ($atch->{'Filename'} eq '');
        open(FH, ">", $atch->{'Filename'}) || die "Can't open file: $!";
        syswrite(FH, $atch->{'Content'});
        close FH;
    }
}

Many thanks to you for your patience, help and assistance.
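To illustrate the kind of warning the RFE asks for, here is a rough sketch of checking named options against a list of known names. The helper check_options is hypothetical and is not part of RT::Client::REST; the option names are the ones used in the get_attachment call above.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical helper: warn about any option name that is not in the known list
# and return the unknown names, sorted.
sub check_options {
    my ($known, %opts) = @_;
    my %is_known = map { $_ => 1 } @$known;
    my @unknown  = sort grep { !$is_known{$_} } keys %opts;
    carp "Unknown option(s): @unknown" if @unknown;
    return @unknown;
}

# The typo from this thread: uudecoded instead of undecoded.
my @bad = check_options(
    [qw(parent_id id undecoded)],
    parent_id => 51447,
    id        => 1,
    uudecoded => 1,
);

print "rejected: @bad\n";    # prints "rejected: uudecoded"
```

A wrapper like this around the call site would have flagged the misspelt key on the first run, without waiting for a change in the library.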
Re: RT::Client turns occasional binary characters in to wide characters by Veltro (Hermit) on Oct 03, 2018 at 09:11 UTC
It seems to me that something is trying to convert your attachment to readable text and is replacing extended ASCII characters with a replacement character (the FFFD values in your dump are U+FFFD, the Unicode replacement character). I don't know much about RT; however, a quick look at RT::Client::REST::Attachment shows some attributes that could be of interest to you: content_type and content_encoding. Maybe you can try and see what these attributes return.
Re: RT::Client turns occasional binary characters in to wide characters by johngg (Canon) on Oct 02, 2018 at 22:02 UTC
I can't offer any solution but it might be helpful if you could provide a link to the earlier thread in which other Monks helped you. It might also be useful to know what versions of the perl interpreter and modules were used in your working system, and perhaps the platform as well.

Cheers,
JohnGG
Re^2: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 03, 2018 at 08:53 UTC
Thanks for responding, John. The old post is Accessing attachments using RT::Client::REST, although I cannot now set the uudecoded flag since I am not retrieving the attachments in that manner any more.
Re: RT::Client turns occasional binary characters in to wide characters by cavac (Prior) on Oct 03, 2018 at 11:24 UTC
Tried this?

use Encode qw[encode_utf8 is_utf8];
...
if(is_utf8($mystring)) {
    $mystring = encode_utf8($mystring);
}

This should give you a string with only bytes 0x00 to 0xFF. In all likelihood, the module you are using is treating incoming data as UTF8 and decodes that into characters.

"For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it gets food on the table."
Re^2: RT::Client turns occasional binary characters in to wide characters by Anonymous Monk on Oct 03, 2018 at 22:10 UTC
Please do not propagate the trap of using is_utf8 for Perl code. It does not indicate if the string you have is UTF-8 encoded bytes. It is only an internal flag for Perl's own use and XS code. It is possible, especially after people try hacks like this, or write incomplete XS code, to have byte-strings where is_utf8 is true, and character strings where is_utf8 is false. I would link to some RT bugs for more reading about the issue, but the website doesn't allow me to post them.
Re^3: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 08, 2018 at 15:53 UTC
Thanks for the response. Given that the string I am retrieving is actually the contents of a binary file, I should be OK to ignore anything to do with UTF8, since my source code has no eight-bit or wider characters. Working from that, I removed every reference to UTF8 subroutines from my code, but I still get this wide character complaint when I try to write the string contents out to a binary (or any) file. So I have removed one potential issue (UTF8) but there is still a problem. While I take you at your word that this is not a UTF8 problem (as I understand it), it's odd that running `encode('UTF-8'...` against the string and writing the results out does not generate this wide character warning.
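The behaviour described here can be reproduced in isolation. The sketch below uses a made-up four-character string containing U+FFFD rather than real attachment data, writes it to a temporary file once as characters and once after `encode('UTF-8', ...)`, and counts the warnings. The helper write_raw is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Hypothetical helper: write data to a raw temporary file, return the file size.
sub write_raw {
    my ($data) = @_;
    my ($fh, $path) = tempfile( UNLINK => 1 );
    binmode $fh;
    print {$fh} $data;
    close $fh or die "close failed: $!";
    return -s $path;
}

my $chars = "g\x{FFFD}%F";    # stand-in: four characters, one above 0xFF

my $size_chars     = write_raw($chars);                    # character string
my $warned_chars   = scalar @warnings;
my $size_encoded   = write_raw( encode('UTF-8', $chars) ); # byte string
my $warned_encoded = scalar(@warnings) - $warned_chars;

print "characters: $size_chars bytes written, $warned_chars warning(s)\n";
print "encoded:    $size_encoded bytes written, $warned_encoded warning(s)\n";
```

Both files end up with the same six bytes; only the first write triggers "Wide character in print". The warning means the string contains a code point above 0xFF, so Perl had to pick an encoding itself. It goes away after encoding, but the data written is still not the original bytes.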
Re^4: RT::Client turns occasional binary characters in to wide characters by haukex (Archbishop) on Oct 08, 2018 at 16:13 UTC
Re^2: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 03, 2018 at 15:56 UTC
Thanks for that. According to is_utf8() the string is in UTF8; however, running encode_utf8() doesn't resolve the problem. It does remove the four-character hex values, but it doesn't put the data back to what it was originally:

encode_utf8() version:
00000000  50 4B 03 04 14 00 09 00 08 00 67 EF BF BD 25 46  PK........g...%F
Original version:
0000000 50 4b 03 04 14 00 09 00 08 00 67 8d 25 46 00 00

I took a look at the attributes of the file, as @Veltro suggested, and got the following:

content_type is: application/octet-stream
content_encoding is: none
file_name is: screenshot-172 21 242 64.zip
headers is: Content-Type: application/octet-stream; name="screenshot-172 21 242 64.zip"
Content-Disposition: attachment; filename="screenshot-172 21 242 64.zip"
Content-Transfer-Encoding: base64
Content-Length: 460749

That "base64" string in the headers section looked interesting, although the string does not seem to be encoded, insofar as it has characters in it that do not match the Base64 character set (A-Za-z0-9+/=). I tried encoding and decoding using the MIME functions but to no avail. The content length stated is the exact size of the actual binary file (460749 bytes) but the string provided by the RT libraries is different (442958 bytes). I would be willing to believe that the missing 17791 characters are included in the wide characters in the RT string, that is to say that I expect there to be 17791 wide characters in the octet stream.
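The EF BF BD sequence in the dump above is what you get when binary data is decoded as UTF-8 and then encoded again. A small sketch, using four bytes taken from the original dump (67 8d 25 46) and assuming the decode happens somewhere upstream of the script:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $original = "g\x8d%F";     # bytes 67 8D 25 46 from the original file

# Decoding binary data as UTF-8: the invalid byte 0x8D becomes U+FFFD.
my $copy    = $original;      # decode on a copy, leaving $original untouched
my $decoded = decode('UTF-8', $copy);

# Encoding again turns U+FFFD into the three bytes EF BF BD.
my $reencoded = encode('UTF-8', $decoded);

print 'original:  ', uc join(' ', unpack('(H2)*', $original)),  "\n";
print 'reencoded: ', uc join(' ', unpack('(H2)*', $reencoded)), "\n";
# original:  67 8D 25 46
# reencoded: 67 EF BF BD 25 46
```

Every invalid byte maps to the same replacement character, so the original value cannot be recovered from the decoded string. The fix has to happen before the decode, which is what the `undecoded => 1` option does in the working code.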
Re^3: RT::Client turns occasional binary characters in to wide characters by Anonymous Monk on Oct 03, 2018 at 22:14 UTC
This is another reason why is_utf8 is a trap. It does not indicate the string is "in UTF-8". It is an internal flag that describes how Perl is internally storing the string. utf8::upgrade and utf8::downgrade enable and disable this flag respectively without any change to the string (as used in Perl code) (as long as the string can be represented in your native encoding, otherwise utf8::downgrade will croak). So in fact, the only sure thing you can determine from is_utf8 is that every Perl string with codepoints above U+FF must have it enabled (but not the other way around).
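A short sketch of the point about the flag, using made-up strings. The same string compares equal before and after utf8::upgrade while the flag changes, and utf8::downgrade dies on a string that cannot be represented in eight bits.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";     # four characters, all below U+0100
my $upgraded = $plain;
utf8::upgrade($upgraded);     # changes internal storage only

printf "plain flag: %s, upgraded flag: %s, strings equal: %s\n",
    utf8::is_utf8($plain)    ? 'on' : 'off',
    utf8::is_utf8($upgraded) ? 'on' : 'off',
    $plain eq $upgraded      ? 'yes' : 'no';

# A code point above U+FF cannot be downgraded, so utf8::downgrade dies.
my $wide = "\x{263A}";
my $downgraded_ok = eval { utf8::downgrade($wide); 1 } ? 1 : 0;
print "downgrade of wide string succeeded: $downgraded_ok\n";
```

So a true is_utf8 says nothing about whether the string started life as UTF-8 encoded bytes; it only tells you how Perl currently stores it.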