Regular Expression XML Searching Help

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have XML files in a directory, and I am trying to keep things simple by just opening each file and checking that the first and last elements, <order> and </order>, are present. The goal is to keep badly formed XML out of my other Perl code, and eventually to delete the badly formatted file(s). I can't get my regular expression to match, though. Can anyone help me with that?

my $xml_dir = "c://apache//htdocs//xml";

opendir(DIR, $xml_dir);

my @files = grep { /\.xml$/ } readdir(DIR);
closedir(DIR);

foreach my $file (@files) {

if($file =~/<order>(.*)<\/order>/g){
                        print "FILE::: $file\n";
} 

}

XML Sample File 1:

<order>
 <customer>
  <name>Coyote, Ltd.</name>
  <shipping_info>
    <address>1313 Desert Road</address>
    <city>Nowheresville</city>
    <state>AZ</state>
    <zip>90210</zip>
  </shipping_info>
 </customer>
 <item>
  <product id="1111">Acme Rocket Jet Pack</product>
  <quantity type="each">1</quantity>
 </item>
 <item>
  <product id="2222">Roadrunner Chow</product>
  <quantity type="bag">10</quantity>
 </item>
 </order>

XML Sample File 2 - This file is not valid:

<order>
 <customer>
  <name>Coyote, Ltd.</name>
  <shipping_info>
    <address>1313 Desert Road</address>
    <city>Nowheresville</city>
    <state>AZ</state>
    <zip>90210</zip>
  </shipping_info>
 </customer>
 <item>
  <product id="1111">Acme Rocket Jet Pack</product>
  <quantity type="each">1</quantity>
 </item>
 <item>
  <product id="2222">Roadrunner Chow</product>
  <quantity type="bag">10</quantity>
 </item>

Thanks!!!


Re: Regular Expression XML Searching Help
by Your Mother (Archbishop) on Jun 16, 2008 at 17:03 UTC

I'm guessing your real intention is to check for well-formedness. The question also implies you're doing XML parsing in a fragile way elsewhere. If I'm right, the following might be a good start to doing things in a way that's more bomb-proof and easier to extend. See XML::LibXML for more.

use XML::LibXML;

local $/ = "::FILE::";

my $parser = XML::LibXML->new();
# $parser->recover(1); <-- turn on to "save" many bad docs.

while ( my $xml = <DATA> )
{
    chomp($xml);
    my $doc = eval { $parser->parse_string($xml) };
    if ( $doc )
    {
        print "File $. is valid.\n";
        # Do whatever you want with your valid $doc here.
    }
    else
    {
        print "File $. is NOT valid.\n";
        # Deal with bad docs here...
    }
}

__DATA__
<order>
 <customer>
  <name>Coyote, Ltd.</name>
  <shipping_info>
    <address>1313 Desert Road</address>
    <city>Nowheresville</city>
    <state>AZ</state>
    <zip>90210</zip>
  </shipping_info>
 </customer>
 <item>
  <product id="1111">Acme Rocket Jet Pack</product>
  <quantity type="each">1</quantity>
 </item>
 <item>
  <product id="2222">Roadrunner Chow</product>
  <quantity type="bag">10</quantity>
 </item>
 </order>

::FILE::

<order>
 <customer>
  <name>Coyote, Ltd.</name>
  <shipping_info>
    <address>1313 Desert Road</address>
    <city>Nowheresville</city>
    <state>AZ</state>
    <zip>90210</zip>
  </shipping_info>
 </customer>
 <item>
  <product id="1111">Acme Rocket Jet Pack</product>
  <quantity type="each">1</quantity>
 </item>
 <item>
  <product id="2222">Roadrunner Chow</product>
  <quantity type="bag">10</quantity>
 </item>


Re^2: Regular Expression XML Searching Help

by Anonymous Monk on Jun 16, 2008 at 17:07 UTC


Re^2: Regular Expression XML Searching Help

by Anonymous Monk on Jun 16, 2008 at 18:57 UTC

I can't install the module XML::LibXML on this windows box, is there any other Perl module that would work with this code example?


Re^3: Regular Expression XML Searching Help

by Your Mother (Archbishop) on Jun 16, 2008 at 19:12 UTC

Not directly, no. But you could try to install XML::Twig (or one of the other good ones) and try to adapt the recipe. Even an eval around an XML::Simple::XMLin() might work. I don't recommend the module but if you've got it already...

I'm no expert on Win installs but you could try to install the C lib for libxml before trying to install the Perl modules. Might be the only problem. There is some really good work lately with Strawberry Perl to make Perl behave more like it does on other OSes. If you don't have it, try it maybe(?).


Re: Regular Expression XML Searching Help
by ikegami (Patriarch) on Jun 16, 2008 at 16:48 UTC

That's not valid XML. It's not even well-formed XML. An XML validator would take care of that. Searching on CPAN for "xml validator" finds some Perl solutions, and Google finds non-Perl solutions. No need to reinvent the wheel.


Re: Regular Expression XML Searching Help
by mirod (Canon) on Jun 16, 2008 at 17:31 UTC

In case you ever have other types of searches to perform on the XML, you can try xml_grep2, which you will find in my tool box. Or of course xml_grep, which comes with XML::Twig (and I am sure Jenda will have a similar tool based on XML::Rules in the next 5 minutes ;--)


Re: Regular Expression XML Searching Help
by pc88mxer (Vicar) on Jun 16, 2008 at 17:10 UTC

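You need to read each file's contents (you are matching against the file name) and use the /s modifier: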

use File::Slurp;
...
for my $file (@files) {
  my $content = read_file($file);

  unless ($content =~ m{<order>(.*)</order>}s) {
    # bad file
  } else {
    # found <order>...</order>
  }
}

With /s, . also matches newlines.


Re: Regular Expression XML Searching Help
by ikegami (Patriarch) on Jun 16, 2008 at 16:52 UTC

About the g modifier:

if (/.../g) is almost guaranteed to be a bug.
while (/.../g) would make sense.
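
A minimal sketch of why the g matters, using a made-up string rather than the poster's files: in scalar context a /g match remembers where it stopped (see pos), so repeating the same test on the same variable resumes from there and can fail even though the text matches. In a while loop, that same behaviour is what walks through every match.

```perl
use strict;
use warnings;

# Hypothetical sample text, not one of the files from the question.
my $xml = "<order><item>a</item><item>b</item></order>";

# Scalar-context /g: each test resumes after the previous match,
# and a failed match resets the position to the start.
my @seen;
for my $try (1 .. 3) {
    push @seen, ( $xml =~ /<order>/g ? "match" : "no match" );
}
print join(", ", @seen), "\n";    # prints "match, no match, match"

# Clear the remembered position so the next scan starts at the beginning.
pos($xml) = undef;

# while + /g: visit every <item> in turn.
my $items = 0;
$items++ while $xml =~ /<item>/g;
print "items: $items\n";          # prints "items: 2"
```

The second test fails only because the first one already consumed the single <order>, which is the kind of surprise an if with /g tends to produce.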


Re^2: Regular Expression XML Searching Help

by Anonymous Monk on Jun 16, 2008 at 17:04 UTC

I tried with and without the "/g", and it still didn't work. I also tried using a "while" instead of an "if". No luck!


Re^3: Regular Expression XML Searching Help

by ikegami (Patriarch) on Jun 16, 2008 at 17:33 UTC

I didn't say it would work without the 'g' or with a while. I said it was wrong as is.


Re: Regular Expression XML Searching Help
by Jenda (Abbot) on Jun 16, 2008 at 17:25 UTC

You do not need /g, but /s.

if($file =~/<order>(.*)<\/order>/s){
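
To illustrate the difference, here is a small sketch on an in-memory string. The string is made up; the real check has to run on the file's contents, which means reading each file first, since $file only holds the name. Without /s the dot stops at newlines, so a multi-line document never matches.

```perl
use strict;
use warnings;

# Hypothetical multi-line document held in a string.
my $content = "<order>\n <item>1</item>\n</order>\n";

# Same pattern, with and without the /s modifier.
my $without_s = $content =~ /<order>(.*)<\/order>/  ? 1 : 0;
my $with_s    = $content =~ /<order>(.*)<\/order>/s ? 1 : 0;

print "without /s: $without_s\n";    # prints "without /s: 0"
print "with /s:    $with_s\n";       # prints "with /s:    1"
```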

Jenda
Support Denmark!
Defend the free world!


Re: Regular Expression XML Searching Help
by Jenda (Abbot) on Jul 07, 2008 at 23:50 UTC

Thinking about this one more time ... in case the question actually is not "is the XML valid", but rather "is the XML already complete, assuming it's being uploaded from a valid source", then the solutions based on XML parser modules are overkill. For a quick check with a minimal memory footprint, something like this may be better:

sub XMLisComplete {
    my $file = shift();
    # If I can't open it, it's probably locked. Therefore it's not complete.
    open my $IN, '<', $file or return;
    my $main_tag;
    read $IN, $main_tag, 1024;
    if ($main_tag =~ m{<(\w+)}) {
        $main_tag = $1;
    } else {
        return; # there's not even the opening tag!
    }
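    # seek takes (handle, offset, whence); whence 2 means "from the end of the file".
    # A file shorter than 100 bytes can't seek that far back, so rewind instead.
    seek($IN, -100, 2) or seek($IN, 0, 0);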
    my $end = do {local $/; <$IN>};
    close $IN;
    if ($end =~ m{</$main_tag>\s*$}s) {
        return 1;
    } else {
        return;
    }
}

It's most likely not 100% standard proof (e.g. I bet \w+ doesn't match all allowed tag names), but it works for me to test whether I can already start parsing the uploaded file or whether to wait a bit more. The actual parsing would of course be best left to a module.
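
The tail check depends on seek receiving the offset first and the whence value second. Here is a minimal sketch of just that step, assuming a hypothetical helper name (read_tail) and a temporary file standing in for an upload; it falls back to the start of the file when the file is shorter than the requested tail.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);
use File::Temp qw(tempfile);

# read_tail is a hypothetical helper: return up to $bytes bytes
# from the end of the file at $path, or nothing if it can't be opened.
sub read_tail {
    my ($path, $bytes) = @_;
    open my $in, '<', $path or return;
    # Seeking to before the start of the file fails, so fall back to offset 0.
    seek($in, -$bytes, SEEK_END) or seek($in, 0, SEEK_SET);
    my $tail = do { local $/; <$in> };
    close $in;
    return $tail;
}

# Write a small sample document to a temporary file.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "<order>\n <item>1</item>\n</order>\n";
close $fh;

my $tail = read_tail($path, 10);
print "complete\n" if $tail =~ m{</order>\s*$};
```

Using the SEEK_END and SEEK_SET constants from Fcntl instead of bare 2 and 0 makes it harder to swap the two arguments by accident.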

Jenda


Re: Regular Expression XML Searching Help
by richb (Scribe) on Jun 17, 2008 at 18:05 UTC

You are matching against the file name, not the file's contents:

my $xml_dir = "c://temp";

opendir(DIR, $xml_dir);

my @files = grep { /\.xml$/ } readdir(DIR);
closedir(DIR);

foreach my $file (@files) {
    
    # Three-argument open, read-only, so odd file names can't change the mode.
    open my $fh, '<', "$xml_dir/$file" or die "can't open $file: $!";
    local $/;
    my $contents = <$fh>;
    close $fh;
    
    print "$file is ";
    if($contents !~ /<order>(.*)<\/order>/s){
        print "NOT ";
    }
    print "valid\n";
}

invalid.xml is NOT valid #this is your sample file 2

valid.xml is valid #this is your sample file 1
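
One caveat on the pattern: unanchored, it accepts any file that contains <order>...</order> somewhere, even with junk around it. A slightly stricter sketch anchors both tags to the ends of the content. It assumes the documents have nothing before the root element (no XML declaration or comments, as in the sample files), looks_complete is a hypothetical name, and this still only tests the outer tags, not well-formedness.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $content starts with <order> and
# ends with </order>, ignoring leading and trailing whitespace.
sub looks_complete {
    my ($content) = @_;
    return $content =~ m{\A\s*<order>.*</order>\s*\z}s ? 1 : 0;
}

print looks_complete("<order>\n <item/>\n</order>\n"), "\n";    # prints 1
print looks_complete("<order>\n <item/>\n"), "\n";              # prints 0
print looks_complete("junk <order></order>"), "\n";             # prints 0
```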
