PerlMonks  

How can I serialize a file for use with XML?

by Lamont85 (Novice)
on Jan 31, 2012 at 22:11 UTC

Lamont85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am attempting to serialize several files (created by a Win32 program called Farm Works Site Mate), add the serialized files to a hash, convert the hash to XML, print the XML to a socket, then, on the other end of the socket, convert the XML back to a Perl hash and deserialize the files. Here's an example of what I have so far.

#!/usr/bin/perl -w
use strict;
use File::Slurp qw(read_file);
use XML::Dumper qw(pl2xml);

my @DIRECTORIES = (
    ['01-09-2012','/home/user/test/01-09-2012'],
    ['01-11-2012','/home/user/test/01-11-2012'],
    ['12-13-2011','/home/user/test/12-13-2011'],
);

my %DATES;
for my $dir (@DIRECTORIES) {
    $DATES{$dir->[0]} = _slurp_directory($dir->[1]);
}

my $request = {
    TYPE => 'UPLOAD',
    DATA => \%DATES,
};

my $xml = pl2xml($request);
print $xml;
exit 0;

sub _slurp_directory {
    my $directory = shift;
    my %DIR;
    if (opendir my $dh, $directory) {
        my @files = readdir $dh;
        closedir $dh;
        for my $file (@files) {
            next if $file eq "." || $file eq "..";
            $DIR{$file} = read_file "$directory/$file";
        }
    }
    return \%DIR;
}

This would work fine, but not all of the files I am working with are ASCII. I get the following error when I try it with non-ASCII files:

not well-formed (invalid token) at line 10, column 39, byte 813 at /usr/lib/perl5/XML/Parser.pm line 187

Here are some of the files:

79342 SH_Grid.smb
    SMB1
    ;
    CMD_SCOUT_OPENLOG "Z:\home\user\SiteMate\01-11-2012\79342 SH.fgp"
    ;
    CMD_SCOUT_OPENBK "Z:\home\user\SiteMate\01-11-2012\79342 SH_GridPt.shp" WGS84 LATLON 101
    CMD_SCOUT_OPENBK "Z:\home\user\SiteMate\01-11-2012\79342 SH_GridLn.shp" WGS84 LATLON 101
    ;

79342 SH.mid
    SMB1
    ;
    CMD_SCOUT_OPENLOG "Z:\home\user\SiteMate\01-11-2012\79342 SH.fgp"
    ;
    CMD_SCOUT_OPENBK "Z:\home\user\SiteMate\01-11-2012\79342 SH_GridPt.shp" WGS84 LATLON 101
    CMD_SCOUT_OPENBK "Z:\home\user\SiteMate\01-11-2012\79342 SH_GridLn.shp" WGS84 LATLON 101
    ;

79342 SH PNTS.mif
    VERSION 300
    CHARSET "WindowsLatin1"
    DELIMITER ","
    COORDSYS Earth Projection 1,104
    COLUMNS 2
      __SAMPLEID char(40)
      _ELEVATION float
    DATA
    POINT -89.38364924 41.11561035
    POINT -89.38364294 41.11651634
    POINT -89.38364006 41.11741473
    POINT -89.38364654 41.11832219
    POINT -89.38484210 41.11832490
    POINT -89.38484192 41.11741636
    POINT -89.38484605 41.11650997
    POINT -89.38483886 41.11560669
    POINT -89.38605241 41.11560940
    POINT -89.38604611 41.11651268
    POINT -89.38604899 41.11741690
    POINT -89.38604683 41.11832382
    POINT -89.38724527 41.11831962
    POINT -89.38724041 41.11741744
    POINT -89.38725120 41.11651485
    POINT -89.38724706 41.11560452
    POINT -89.38844640 41.11561306
    POINT -89.38844964 41.11651431
    POINT -89.38845665 41.11741798
    POINT -89.38844712 41.11832056

There are several other file types (.fdt, .fpg, .gpe, .shp, .shx, and many more), but my text editor isn't able to recognize their encodings. I could archive the files into a ZIP file, but I end up with the same error in that case. So, with that in mind: how can I serialize any file in a way that is compatible with XML?

Thanks,
-cory-

Replies are listed 'Best First'.
Re: How can I serialize a file for use with XML?
by repellent (Priest) on Feb 01, 2012 at 06:18 UTC
    The prevalent XML 1.0 spec forbids most "control" characters (the range #x00 - #x1F, except tab, line feed, and carriage return), and that makes it a terrible format for shipping binary data. Even if your API starts out needing only printable text, you will slam hard into this limitation as soon as you demand more from your XML.
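
    To make the restriction concrete, here is a minimal sketch that tests a string for the forbidden control characters. The helper name has_xml_forbidden_chars is made up for illustration, and the check covers only the C0 control range; bytes that are invalid in the document's declared encoding will also be rejected by a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes contains a C0 control character
# that XML 1.0 does not allow. Tab (0x09), line feed (0x0A), and
# carriage return (0x0D) are the only permitted ones below 0x20.
sub has_xml_forbidden_chars {
    my ($bytes) = @_;
    return $bytes =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/ ? 1 : 0;
}

# Plain text with ordinary line endings passes the check
print has_xml_forbidden_chars("POINT -89.38 41.11\r\n") ? "unsafe\n" : "safe\n";

# Typical binary content does not
print has_xml_forbidden_chars("\x00\x01binary\xFF") ? "unsafe\n" : "safe\n";
```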

    The workaround is to encode your binary data (e.g. Base64) into an XML-compatible string that you can embed in the XML. This introduces an extra decoding step when you want to get back at your binary data.
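
    A minimal sketch of that round trip, using the core MIME::Base64 module. The sample data is made up; in the original script the value would come from read_file on the sending side.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Stand-in for the raw contents of a binary file such as a .shp
my $binary = join '', map { chr } 0 .. 255;

# Sender: encode before storing the value in the hash that becomes XML.
# The empty second argument suppresses the line breaks that
# encode_base64 otherwise inserts every 76 characters.
my $encoded = encode_base64($binary, '');

# The result uses only A-Z, a-z, 0-9, '+', '/', and '=' padding
print "XML-safe\n" if $encoded =~ m{\A[A-Za-z0-9+/=]*\z};

# Receiver: decode after converting the XML back to a hash
my $decoded = decode_base64($encoded);
print "round trip ok\n" if $decoded eq $binary;
```

    The cost is size: Base64 output is roughly a third larger than the input.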

      ... add the serialized files to a hash, convert the hash to XML, print the XML to a socket, then, on the other end of the socket, convert the XML back to a Perl hash and deserialize the files.

    Consider that what you want is to get that Perl hash safely across. Why not just ship binary data across using Storable?

    use Data::Dumper;
    use Storable qw(nfreeze thaw);

    my $request = {
        TYPE => 'UPLOAD',
        DATA => { file1 => 'content1', file2 => 'content2' }
    };

    my $serialized = nfreeze $request;    # send this binary string
    my $deserialized = thaw $serialized;  # round trip!
    print Dumper $deserialized;

    Or use a format like JSON? See JSON::XS.
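
    A sketch of the JSON route follows. It uses the core JSON::PP module rather than JSON::XS so that it runs without extra installs; both export encode_json and decode_json. JSON strings are defined as Unicode text, so this sketch Base64-encodes the file contents as well. The file names and contents are made up.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);
use MIME::Base64 qw(encode_base64 decode_base64);

# Made-up file contents standing in for the slurped directory
my %files = (
    'a.shp' => "\x00\x01\x02\xFF",
    'b.mif' => "VERSION 300\n",
);

# Sender: Base64-encode each file, then serialize the request
my $request = {
    TYPE => 'UPLOAD',
    DATA => { map { $_ => encode_base64($files{$_}, '') } keys %files },
};
my $json = encode_json($request);

# Receiver: parse the JSON, then decode each file
my $received = decode_json($json);
my %restored = map { $_ => decode_base64($received->{DATA}{$_}) }
               keys %{ $received->{DATA} };

print "round trip ok\n" if $restored{'a.shp'} eq $files{'a.shp'};
```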

    If all you care about is to ship the files across, then consider using Archive::Tar with compression as a way of packaging the files. The tar object can read/write from filehandles and would work with sockets.
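
    A sketch of the Archive::Tar route. To stay self-contained it writes to a temporary file instead of a socket, and it skips compression; the member names and contents are made up. Per the Archive::Tar documentation, write accepts a filehandle in place of the file name and COMPRESS_GZIP as a second argument when zlib support is available.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Temp qw(tempfile);

# Made-up file contents standing in for the slurped directory
my %files = (
    '01-11-2012/a.shp' => "\x00\x01\x02\xFF",
    '01-11-2012/b.mif' => "VERSION 300\n",
);

# Sender: build the archive in memory, then write it out
my $tar = Archive::Tar->new;
$tar->add_data($_, $files{$_}) for sort keys %files;

my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;
$tar->write($path) or die $tar->error;

# Receiver: read the archive back and pull out each member
my $in = Archive::Tar->new;
$in->read($path) or die $in->error;
for my $name (sort $in->list_files) {
    printf "%s: %d bytes\n", $name, length $in->get_content($name);
}
```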
Re: How can I serialize a file for use with XML?
by InfiniteSilence (Curate) on Jan 31, 2012 at 23:22 UTC
    You could also try:

    use MIME::Base64;
    ...
    for my $file (@files) {
        next if $file eq "." || $file eq "..";
        $DIR{$file} = encode_base64(read_file "$directory/$file");
    }
    ...

    Celebrate Intellectual Diversity

      This approach works perfectly! Thanks for the help!
Re: How can I serialize a file for use with XML?
by tobyink (Canon) on Jan 31, 2012 at 23:08 UTC

    Figure out what encoding your files are in: ISO-8859-1, UTF-8, UTF-16, whatever. Then use File::Slurp::Unicode to read them instead of File::Slurp.

    When writing to the network, you may have to convert Perl's internal Unicode representation to UTF-8 bytes (see Encode). And when reading from the socket, do the inverse.
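
    A sketch of those conversions with the core Encode module. The sample string is made up, and cp1252 is assumed to be the Encode name matching the CHARSET "WindowsLatin1" header in the .mif file. This applies only to files that really are text in a known encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would be read from a Windows Latin-1 text file
my $file_bytes = "caf\xE9";

# Reading the file: bytes -> Perl character string
my $text = decode('cp1252', $file_bytes);

# Writing to the socket: characters -> UTF-8 bytes
my $wire = encode('UTF-8', $text);

# Reading from the socket: UTF-8 bytes -> characters
my $received = decode('UTF-8', $wire);

printf "%d file bytes, %d wire bytes, %d characters\n",
    length $file_bytes, length $wire, length $received;
```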
