gav^ has asked for the wisdom of the Perl Monks concerning the following question:

However much I like to encourage people to use the HTML and XML parsing modules, I think I've found a problem they can't solve. I have these XMLish data files which are not XML compliant in the slightest and seem to make everything choke. They are 5+ meg files in the format:
<data>
  <r>
    <rec1>data</rec1>
    <something>else</something>
  </r>
</data>
Where there are many <r> tags in the file. I want to be able to parse it record by record, with each record ending up in a hash. The main problem is that the data contains unescaped HTML and a variety of random characters. Luckily I know what columns I'm expecting, so I can just skip over the HTML. At the moment my code looks like:
while ($html =~ m/<(\/?r)>/g) {
    $r_pos = pos($html) if $1 eq 'r';
    if ($1 eq '/r') {
        $r = substr($html, $r_pos, pos($html) - $r_pos - length('</r>'));
        _extract_tags($r);
        $rec_count++;
    }
}

sub _extract_tags {
    my $html = $_[0];
    my $tag_pos;
    my $curr_tag;
    my %data;
    while ($html =~ m/<(\/?[\w-]+)>/g) {
        my $tag   = $1;
        my $n_tag = substr($tag, 1);    # a 'naked' slashless tag
        if ($tag =~ /^\// && $n_tag eq $curr_tag) {
            # get the text inside the tag
            my $text = substr($html, $tag_pos,
                              pos($html) - $tag_pos - length($tag) - 2);
            $text =~ s/\r\n/\n/g;
            $data{$n_tag} = $text;
        }
        else {
            # only get tags specified
            if (defined $columns{$tag}) {
                $tag_pos  = pos($html);
                $curr_tag = $tag;
            }
        }
    }
    # pass off to subroutine
    $handle_data->(\%data);
}
I was wondering if anyone has any suggestions to improve this, it looks a bit messy to me and I can't seem to think of anything cleaner.

gav^

Replies are listed 'Best First'.
Re: parsing XMLish data
by mirod (Canon) on Feb 12, 2002 at 22:20 UTC

    In such a case I would use CDATA sections. First wrap the content of the dodgy elements in <![CDATA[ ... ]]>, then you can parse them without problem:

    #!/bin/perl -w
    use strict;
    use XML::Twig;
    use Data::Dumper;

    # generate a file where the content of rec1 and something is
    # stuck in CDATA sections
    my $tmp = "tmp";
    open( TMP, ">$tmp") or die "$0 cannot open $tmp: $!";
    while( <DATA>) {
        s{<(rec1|something)>}{<$1><![CDATA[}g;
        s{</(rec1|something)>}{]]></$1>}g;
        print TMP $_;
    }
    close TMP;

    # sorry, I could not help but use XML::Twig for this
    my %data;
    my $t = XML::Twig->new(
        twig_handlers => {
            r => sub {
                $data{$_->field( 'key')} = {
                    rec1      => $_->field( 'rec1'),
                    something => $_->field( 'something'),
                };
                $_[0]->purge;    # I like to save memory
            },
        },
    );
    $t->parsefile( $tmp);
    print Dumper( %data);

    __DATA__
    <data>
      <r>
        <key>k1</key>
        <rec1>data</rec1>
        <something>else</something>
      </r>
      <r>
        <key>k2</key>
        <rec1>includes <br> and non UTF-8 chars like é, or nasties like <</rec1>
        <something>else, <p>ugly <i>too<b>isn't</i> it</b> & all</p></something>
      </r>
    </data>

    The only restrictions are: if the "embedded" data contains ]]> then you must break the CDATA section, and of course the data must not contain </rec1> or </something>.
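    A minimal sketch of that workaround, applied while wrapping: when the raw text contains ]]>, close the CDATA section just before the > and immediately reopen it. The $content variable here is hypothetical and stands for the text of one dodgy element:

```perl
# Break any ]]> in the raw text by closing and reopening the CDATA
# section, then wrap the whole thing. $content is a stand-in for the
# raw text being wrapped.
my $content = 'tricky ]]> data';
$content =~ s/\]\]>/]]]]><![CDATA[>/g;
my $wrapped = "<![CDATA[$content]]>";
print $wrapped, "\n";   # <![CDATA[tricky ]]]]><![CDATA[> data]]>
```

    An XML parser reads that back as two adjacent CDATA sections whose contents concatenate to the original string.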

Re: parsing XMLish data
by asiufy (Monk) on Feb 12, 2002 at 22:06 UTC
    That looks like something XML::Parser would be able to grok...

    I've just finished a quick project with it, and while it's intimidating at first, it's quite powerful. It has a few different parsing "styles". I'd suggest you use Subs, where you can have a sub for each tag in your XML, and since you know exactly which tags to expect, it makes it easy to write the subs to assign the data values.

    Or, you can use the Stream style, that will provide more flexibility, as it calls pre-defined subs on specific events (start tag, end tag, text, start document, etc), and from there you can verify which is the tag that is being parsed.
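    A hedged sketch of that Stream style, assuming well-formed input (the sub names StartTag, Text and EndTag are the events the Stream style looks for; %record, @records and $current are names invented for this example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Stream style: XML::Parser calls these subs in the current package
# as it hits each event.
my (%record, @records, $current);

sub StartTag {
    my ($expat, $tag) = @_;
    %record  = ()   if $tag eq 'r';                        # new record
    $current = $tag unless $tag eq 'r' || $tag eq 'data';  # remember column
}

sub Text {
    # the text itself arrives in $_
    $record{$current} .= $_ if defined $current;
}

sub EndTag {
    my ($expat, $tag) = @_;
    push @records, {%record} if $tag eq 'r';   # record complete
    undef $current;
}

my $p = XML::Parser->new(Style => 'Stream');
$p->parse('<data><r><rec1>data</rec1><something>else</something></r></data>');
print "parsed ", scalar @records, " record(s)\n";   # parsed 1 record(s)
```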

    There's some more information here and here.
Re: parsing XMLish data
by mirod (Canon) on Feb 13, 2002 at 08:41 UTC

    I am afraid gav^ is right: you can't use XML::Parser, or any XML module for that matter, if your data is not well-formed XML.

    Which gives you 3 choices:

    • use a custom parser that deals with the data you actually have. Just do not call it XML; believe me, it will save you tons of problems in the long run, when you want to use the parser on real XML,
    • use a 2-step process: turn your data into valid XML, either using CDATA sections or by replacing < and & in the content of the "elements" that contain HTML. BTW you also need to convert the characters to UTF-8, or maybe to add an encoding declaration; my previous code works by pure accident, and if you add a comma after the é the XML parser will complain (loudly!). Then, and then only, can you use XML tools,
    • a variant would be to use a custom parser that generates SAX events; then you can use XML SAX tools to process the data, and if you need to use real XML (or CSV, or any other format for which a SAX parser exists) you can just use the appropriate parser. Note that in order to generate SAX events you still need to escape & and <, and probably to pass properly encoded strings to the SAX processor. Kip Hampton wrote an excellent column about this on xml.com: Writing SAX Drivers for Non-XML Data.
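    The escaping half of the second option can be sketched in a few lines, assuming each HTML-carrying element opens and closes on the same line and never contains its own end tag:

```perl
# Make the content of rec1/something well-formed XML by escaping
# & and < (a bare > is legal in XML content, so it can stay).
my $line = '<rec1>a <br> b & c</rec1>';
$line =~ s{<(rec1|something)>(.*?)</\1>}{
    my ($tag, $text) = ($1, $2);
    $text =~ s/&/&amp;/g;    # & first, or the &lt; below gets mangled
    $text =~ s/</&lt;/g;
    "<$tag>$text</$tag>";
}ge;
print $line, "\n";   # <rec1>a &lt;br> b &amp; c</rec1>
```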
Re: parsing XMLish data
by steves (Curate) on Feb 13, 2002 at 09:47 UTC

    Another technique I've used (though only for outputting XML, with XML::Writer) is to filter the data. In my case I tied the handle XML::Writer was using, and that tie filtered the data to make it XML compliant.

    So how would that work here, you ask?

    Loosely, I think you'd write a tie or come up with your own IO class to filter the input. You'd then have XML::Parser read from this handle. Using XML::Parser handlers, you'd recognize when you were in and out of your data tags and make a call to the tie or custom handle to tell it to filter the data. That filter becomes pretty trivial, I think: you convert the angle brackets and ampersands to XML character entities, and you do whatever you need to with 8-bit data that doesn't fit.
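    Loosely sketched, such a tie might look like this (the FilterHandle class name, the filter_on/filter_off switches, and the one-line escape are all invented for illustration; a real version would flip the switch from XML::Parser handlers):

```perl
package FilterHandle;
# A tied read handle that, when switched on, escapes & and < so the
# downstream XML parser sees well-formed markup.
use strict;
use warnings;

sub TIEHANDLE { my ($class, $fh) = @_; bless { fh => $fh, on => 0 }, $class }
sub filter_on  { $_[0]{on} = 1 }
sub filter_off { $_[0]{on} = 0 }

sub READLINE {
    my $self = shift;
    my $line = readline($self->{fh});
    return $line unless defined $line && $self->{on};
    $line =~ s/&/&amp;/g;
    $line =~ s/</&lt;/g;
    return $line;
}

package main;
# demo: read from an in-memory handle through the filter
open my $raw, '<', \"plain & <dirty> text\n" or die $!;
tie *IN, 'FilterHandle', $raw;
(tied *IN)->filter_on;
my $clean = <IN>;
print $clean;   # plain &amp; &lt;dirty> text
```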

    The power here (IMO) is that you're separating the filtering of HTML from the parsing of XML. You can blow up that HTML parsing independently as needed, again using existing tools like HTML::Parser. The problem with rolling your own is that it seems simple until you hit all the exceptions. Read all the Perl docs on why not to parse your own HTML as an example. Anything beyond the character-by-character filtering I describe above will fail miserably as the data changes to have tags spanning lines, nested tags, etc.

    Contrary to something you said earlier, you don't necessarily need to know all your tag names to do this. But there has to be some predictability to your documents for you to write any parser. So don't confuse what you think XML::Parser needs with this general requirement -- I think you'll go about as far there rolling your own as you can with an existing, robust parser.

    Filtering and using standard interfaces is an approach I prefer. It fits that UNIX-like philosophy of not reinventing the wheel and using existing tools as filters to coerce things into models that are predictable.

    In short, leverage the work of others into the problem at hand by focusing only on the exceptions unique to your case.

Re: parsing XMLish data
by gav^ (Curate) on Feb 13, 2002 at 03:42 UTC
    I like asiufy's idea and I'll definitely look into XML::Parser's Subs style in the future. I don't think this is really the best way to go here, though, as I don't know the names of the tags I'm looking for until runtime. On my test data file there are around 350,000 records, which makes me wonder whether firing a start tag, end tag, etc. handler for each one is a good idea.

    For the same reasons I was a bit scared off by mirod's idea: there are anywhere between 10 and 30 columns I'm looking for, so going through the data file and adding the CDATA sections seems a bit inefficient. I will bear this idea in mind, though; I'm tempted to try and change the data files to use these.

    The best I can come up with is a single loop:

    while (/<([^>]+)>/g) {
        if ($1 eq '/r') {
            print Dumper(\%temp);
            undef %temp;
        }
        else {
            next unless $wanted{$1};
            if (substr($1, 0, 1) eq '/') {
                my $text = substr($_, $pos, pos() - $pos - length($1) - 2);
                $temp{substr($1, 1)} = $text;
            }
            else {
                $pos = pos();
            }
        }
    }

    gav^

Re: parsing XMLish data
by Anonymous Monk on Feb 13, 2002 at 04:43 UTC
    There is no excuse: go to Tutorials and read the XML::Parser Tutorial.

    Otherwise, go to Why I like functional programming by tilly, and use his parser (the module in the thread) to build a custom one.

    tilly's code is a perfect example of why people shouldn't roll their own parser. It is a complete (and fairly complex) one, and most people fail to cover all the bases.

    Please do not try to roll your own if you wish to have a working solution very soon (unless you like doing things the hard way)

    This has been a test of the emergency perl system, you must read pod now!

      You can't use XML::Parser to parse everything, especially stuff that isn't really XML, like my data. For example, running the code in Re: Is a file XML? proves it.

      gav^