Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I need to check that the start of the line contains the following data:

<?xml version="1.0" encoding="UTF-8"?><sample>
I have written the regular expression like this:

    if ($content =~ /\?XML\s+version=[\'\"]1.0[\'\"](\s+encoding=[\'\"][^\'\"]*[\'\"])?/) {
        print "Input file is valid";
    } else {
        print "File is invalid";
    }

Is there a better, easier way to match this?
Thanks,
Raj.

Re: Regular expression to check XML type
by ikegami (Patriarch) on Jun 27, 2009 at 17:42 UTC
    The following are all equivalent and valid:
    • \xEF\xBB\xBF<?xml version="1.0" encoding="UTF-8"?>
    • <?xml version="1.0" encoding="utf-8"?>
    • <?xml version='1.0' encoding='UTF-8'?>
    • <?xml version="1.0"?>
    • <?xml encoding="UTF-8" version="1.0"?>
    • <?xml version="&#x31;&#x2E;&#x30;" encoding="&#x55;&#x54;&#x46;&#x2D;&#x38;"?>

    And that's assuming you're just supporting UTF-8!

    This isn't the job of a regex. But if you insist on using a regex match, here you go:

    use XML::LibXML qw( );

    local our $doc;
    $content =~ /(?(?{ !eval { $doc = XML::LibXML->new()->parse_string($_) } })(?!))/;

    The match will succeed if the parsing succeeds (and set $doc to a XML::LibXML::Document object) and fail if the parsing fails.
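
The same check can be written without a regex at all by wrapping the parse in eval. This is only a sketch, not ikegami's code: it assumes XML::LibXML is installed, and the helper name parse_or_undef is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the parsed document, or undef if parsing dies.
sub parse_or_undef {
    my ($xml) = @_;
    my $doc = eval { XML::LibXML->new->parse_string($xml) };
    return $doc;
}

my $good = q{<?xml version="1.0" encoding="UTF-8"?><sample></sample>};
my $bad  = q{<?xml version="1.0" encoding="UTF-8"?><sample>};   # element never closed

print defined(parse_or_undef($good)) ? "valid\n" : "invalid\n";
print defined(parse_or_undef($bad))  ? "valid\n" : "invalid\n";
```

Returning the document rather than a plain true/false keeps the parsed result available to the caller, much as $doc is in the regex version.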

Re: Regular expression to check XML type
by Your Mother (Archbishop) on Jun 27, 2009 at 17:45 UTC

    Even for something as seemingly simple as that, you should still reach for a parser. You can certainly make do with regular expressions but a parser buys you much more, will fall down less, is easier to write and read, and won't give you false positives. Your sample XML, for example, is broken but your regex wouldn't tell you that.

    use strict;
    use warnings;
    use XML::LibXML;

    my $doc = XML::LibXML->new->parse_fh(\*DATA);

    print "All about me-\n",
        " Version: ", $doc->version, "\n",
        " Encoding: ", $doc->encoding, "\n",
        " Standalone: ", $doc->standalone, "\n";

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?><sample></sample>

    All about me-
     Version: 1.0
     Encoding: UTF-8
     Standalone: -2

Re: Regular expression to check XML type
by Corion (Patriarch) on Jun 27, 2009 at 17:22 UTC
    $content = qq(?XML version="1.0');
    if ($content =~ /\?XML\s+version=[\'\"]1.0[\'\"](\s+encoding=[\'\"][^\'\"]*[\'\"])?/) {
        print "Input file is valid";
    } else {
        print "File is invalid";
    }
    __END__
    Input file is valid

    Maybe you want to look at perlop for string equality, or maybe you want to use a real XML parser?
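
When the prefix is fixed and known, plain string comparison may be enough. The sketch below assumes the expected declaration is exactly the one from the question. The helper name starts_with is hypothetical.

```perl
use strict;
use warnings;

# Exact text the line is expected to begin with (taken from the question).
my $expected = q{<?xml version="1.0" encoding="UTF-8"?>};

# Hypothetical helper: true if $string begins with $prefix, compared with eq.
sub starts_with {
    my ($string, $prefix) = @_;
    return substr($string, 0, length $prefix) eq $prefix;
}

my $good = q{<?xml version="1.0" encoding="UTF-8"?><sample>};
my $bad  = q{?XML version="1.0'};   # the string used in the test above

print starts_with($good, $expected) ? "valid\n" : "invalid\n";
print starts_with($bad,  $expected) ? "valid\n" : "invalid\n";
```

This accepts only one exact spelling of the declaration, so single quotes or a missing encoding would be rejected. That is the trade-off against a real parser.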

Re: Regular expression to check XML type
by Bloodnok (Vicar) on Jun 27, 2009 at 17:25 UTC
    One thing jumps out (at me anyway:-)...

    Your RE contains XML, the string you want to match contains xml and you haven't used the i (ignore case) modifier (see perlre).
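
A small sketch of the effect, using only core Perl. The pattern is a shortened stand-in for the one in the question, and the variable names are chosen for illustration. Note that the XML declaration itself is case-sensitive, so writing lowercase xml in the pattern is arguably a tighter fix than adding /i.

```perl
use strict;
use warnings;

my $content = q{<?xml version="1.0" encoding="UTF-8"?><sample>};

# Shortened version of the pattern from the question (uppercase XML, no /i).
my $re_strict = qr/\?XML\s+version=['"]1\.0['"]/;

# Same pattern with the /i modifier, so XML also matches xml.
my $re_nocase = qr/\?XML\s+version=['"]1\.0['"]/i;

my $strict_ok = $content =~ $re_strict ? 1 : 0;
my $nocase_ok = $content =~ $re_nocase ? 1 : 0;

print "without /i: ", ($strict_ok ? "match" : "no match"), "\n";
print "with /i:    ", ($nocase_ok ? "match" : "no match"), "\n";
```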

    A user level that continues to overstate my experience :-))