Byte-level file inspection?

princepawn has asked for the wisdom of the Perl Monks concerning the following question:

I was using XML::Simple to parse a document and was told "junk after document element at line 77, column 0, byte 2910". So I wrote the following program to help me take a look at the data in the file around that byte position.

Does anyone have any input on whether this is the right way to do byte-level analysis of a file for bad data? Program output *precedes* the source code.

Program Output Starts on Next Line

< 47    /
< 99    c
< 100   d
< 62    >
* 10

> 60    <
> 99    c
> 100   d
> 62    >

Program

#!/usr/bin/perl

use strict;

my $f = shift || die "must supply filename";

open F, $f    or die "couldn't open $f: $!";

my $bytepos = shift || die "must supply bytepos";

my $offset  = 4;

my @range   = ($bytepos-$offset .. $bytepos+$offset);
my $range   =  @range;
my $text    =   join '', <F>;

my @substr  = ($range[0]-1, $range);
warn "substr @substr";
#my $chunk   = substr $text, @substr;
my $chunk   = substr $text, $range[0]-1, $range;
warn substr $chunk, 0, 20;

my $format  = "C" . $range;
my @unpack  = unpack $format, $chunk;

my $normal  = $range[0];
my @chunk   = split //, $chunk;
for (@range) {

  my $arydex = $_-$normal;

  if ($_ < $bytepos) {
    print '< ';
  }
  if ($_ > $bytepos) {
    print '> ';
  }
  if ($_ == $bytepos) {
    print '* ';
  }

  print unpack 'C', $chunk[$arydex];

  print "\t";

  print             $chunk[$arydex];

  print "\n";

}


Replies are listed 'Best First'.
Re: Byte-level file inspection? by chipmunk (Parson) on Feb 16, 2001 at 01:34 UTC
Instead of reading in the whole file and using substr(), you might want to use seek and read to read in just the part of the file you're interested in: `my $bytepos = shift || die "must supply bytepos"; my $offset = 4; my $startpos = $bytepos - $offset; $startpos = 0 if $startpos < 0; seek(F, $startpos, 0) or die "Can't seek: $!"; read(F, $chunk, $offset * 2) or die "Can't read: $!";` This will be a big win on especially large files.
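A self-contained sketch of the seek/read idea, for anyone who wants to try it outside the original script. The helper name `read_window` is hypothetical, and the sketch assumes the byte position is 1-based, as the original program treats it. It reads `2 * $offset + 1` bytes so that the window matches the nine-line output shown in the question, and it treats a zero-byte read as end of file, not as an error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the bytes within
# $offset of the 1-based position $bytepos, together with the 1-based
# position of the first byte returned.
sub read_window {
    my ($file, $bytepos, $offset) = @_;

    open my $fh, '<', $file or die "couldn't open $file: $!";
    binmode $fh;                       # raw bytes, no newline translation

    my $start = $bytepos - $offset;    # 1-based position of the first byte
    $start = 1 if $start < 1;          # clamp at the start of the file

    # seek() wants a 0-based offset; whence 0 means "from start of file".
    seek($fh, $start - 1, 0) or die "can't seek: $!";

    my $want  = $bytepos + $offset - $start + 1;
    my $chunk = '';
    my $got   = read($fh, $chunk, $want);
    die "can't read: $!" unless defined $got;   # 0 just means end of file

    close $fh;
    return ($chunk, $start);
}
```

Calling `my ($chunk, $start) = read_window('foo.xml', 2910, 4);` would fetch at most nine bytes, however large the file is.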
(tye)Re: Byte-level file inspection? by tye (Sage) on Feb 16, 2001 at 01:30 UTC
I'll skip all of the minor changes I would make and just mention that you probably want to do `binmode(F)` before you read. - tye (but my friends call me "Tye")
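To illustrate where the `binmode` call goes, here is a minimal sketch that slurps a file as raw bytes. The helper name `slurp_raw` is hypothetical. The point is that on platforms that translate line endings in text mode, skipping `binmode` can make byte offsets in the string disagree with byte offsets in the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the whole file as an
# untranslated byte string.
sub slurp_raw {
    my ($file) = @_;

    open my $fh, '<', $file or die "couldn't open $file: $!";
    binmode $fh;          # must come before the first read

    local $/;             # undefined record separator: read it all at once
    my $bytes = <$fh>;
    close $fh;

    return defined $bytes ? $bytes : '';
}
```

With the file read this way, `length` of the result should equal the file size reported by `-s`.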
Re: Byte-level file inspection? by MeowChow (Vicar) on Feb 16, 2001 at 01:49 UTC
For the sake of cuteness, this: `if ($_ < $bytepos) { print '< '; } if ($_ > $bytepos) { print '> '; } if ($_ == $bytepos) { print '* '; }` can become: `print qw(* > <)[$_ <=> $bytepos];` Whoosh, that was fun :-) You should also be careful about feeding a negative index into substr, which could happen with a low enough `bytepos`. update: deleted a bit of misguided foolishness. MeowChow s aamecha.s a..a\u$&owag.print
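Both points can be sketched in a few lines. The helper names `marker` and `window` are hypothetical, and positions are assumed to be 1-based as in the original program. The list-slice trick works because `<=>` yields -1, 0 or 1, and index -1 selects the last element of the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: choose the marker for $pos relative to $bytepos.
sub marker {
    my ($pos, $bytepos) = @_;
    # index 0 => '*', index 1 => '>', index -1 => last element, '<'
    return (qw(* > <))[ $pos <=> $bytepos ];
}

# Hypothetical helper: extract the bytes around the 1-based $bytepos,
# clamping the start so substr never receives a negative index.
sub window {
    my ($text, $bytepos, $offset) = @_;

    my $start = $bytepos - $offset - 1;   # 0-based index of first byte
    $start = 0 if $start < 0;             # a negative index would count
                                          # from the END of the string
    my $end = $bytepos + $offset - 1;     # 0-based index of last byte

    return substr $text, $start, $end - $start + 1;
}
```

Without the clamp, a call such as `substr $text, -3, 9` would silently return the last three bytes of the file instead of the first few.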
Re: Byte-level file inspection? by dws (Chancellor) on Feb 16, 2001 at 01:39 UTC
Looks to me like you're whacking a gnat with a baseball bat. Unless XML::Simple is counting lines wrong and you need a script that miscounts in the same way, you should be able to use your favorite text editor to get you to a specific line and column. `vi +77 foo.xml` gets you most of the way there. And if you see an obvious problem, you can fix it on the spot! Use the tools within reach before you burn up time building new ones.
Re: Byte-level file inspection? by mirod (Canon) on Feb 16, 2001 at 11:15 UTC
I am afraid that you have not understood what `"junk after document element"` means. It does not mean that there are weird characters in the document. An XML document must have one, and only one, root element, so anything after the end of that first element should not be there and is reported as junk. It looks like your document is missing a wrapping tag, the "one to tie them all". This is not an XML document: `<cd id="cd1">...</cd> <cd id="cd2">...</cd> <cd id="cd3">...</cd>` You need this: `<cd_list> <cd id="cd1">...</cd> <cd id="cd2">...</cd> <cd id="cd3">...</cd> </cd_list>` And you don't need to do byte-level file inspection at all. By the way, you asked basically the same question on c.l.p.m a little while ago and got 2 answers which explained the exact same thing, so I expect you have already fixed your problem. There was no need to tie your way of examining a file to the "junk" message, which might mislead other people having the same problem.