TheBigAmbulance has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that loops through each file a couple of times, looking for the values stored in @search:

#!/usr/bin/perl -w
use warnings;
use strict;

my @search = ("/<ID>(.*?)<\/ID>/",                      # ID
              "/<TimeStamp>(.*?)<\/TimeStamp>/",        # Time Stamp
              "/<IP_Address>(.*?)<\/IP_Address>/",      # IP Address
              "/<Title>(.*?)<\/Title>/",                # Title
              "/<Complainant><Entity>(.*?)<\/Entity>/", # Reporting Entity
              "/<Contact>(.*?)<\/Contact>/",            # Reporting Entity Contact
              "/<Address>(.*?)<\/Address>/",            # Reporting Entity Address
              "/<\/Phone><Email>(.*?)<\/Email>/");      # Reporting Entity Email Address

my $search_count = @search;
my @xmlfiles2 = <*xml>;
foreach my $file (@xmlfiles2) {
    for (my $i = 0; $i <= $search_count; $i++) {
        open FILE, $file or die "Could not read from $file, program halting.";
        while (<FILE>) {
            my $line = "$_";
            if ( $line =~ $search[$i]) {
                my $Var = $1;
                print "$Var\n";
            }
        }
        close FILE;
    }
}

When I run this, it generates errors whenever it matches data.

dpich@m6400-vb:~/Documents$ perl 2time.pl
Use of uninitialized value within @search in regexp compilation at 2time.pl line 21, <FILE> line 1.
Use of uninitialized value $Var in concatenation (.) or string at 2time.pl line 23, <FILE> line 1.
Use of uninitialized value within @search in regexp compilation at 2time.pl line 21, <FILE> line 2.
Use of uninitialized value $Var in concatenation (.) or string at 2time.pl line 23, <FILE> line 2.
Use of uninitialized value within @search in regexp compilation at 2time.pl line 21, <FILE> line 3.
Use of uninitialized value $Var in concatenation (.) or string at 2time.pl line 23, <FILE> line 3.
Use of uninitialized value within @search in regexp compilation at 2time.pl line 21, <FILE> line 4.
Use of uninitialized value $Var in concatenation (.) or string at 2time.pl line 23, <FILE> line 4.
dpich@m6400-vb:~/Documents$

With my variables declared, I am at a loss to see why this is generating errors. Can anyone point me in the proper direction?

Replies are listed 'Best First'.
Re: I don't understand why I'm getting an "Use of uninitialized value" error
by jwkrahn (Abbot) on Nov 03, 2011 at 18:32 UTC

    Print out the contents of @search and you will see that the back-slashes have been removed by interpolation.

    You also have an off-by-one error in the line:

    for (my $i = 0; $i <= $search_count; $i++) {
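
    A minimal sketch of why that loop bound matters, assuming a three-element stand-in for @search (the patterns are made up for illustration). In scalar context @search yields the element count, so the condition $i <= $search_count visits one index past the end of the array, where $search[$i] is undef. That is consistent with the "uninitialized value within @search" warning.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the OP's @search list.
my @search = ( qr{<ID>(.*?)</ID>}, qr{<Title>(.*?)</Title>}, qr{<Email>(.*?)</Email>} );
my $search_count = @search;    # scalar context: number of elements

# Indexes visited by the original loop condition.
my @visited;
for ( my $i = 0; $i <= $search_count; $i++ ) {
    push @visited, $i;
}

# Indexes visited by the corrected condition.
my @valid;
for ( my $i = 0; $i < $search_count; $i++ ) {
    push @valid, $i;
}

print "original loop:  @visited\n";
print "corrected loop: @valid\n";
print "element $search_count is ",
    ( defined $search[$search_count] ? "defined" : "undef" ), "\n";

# Iterating over the elements directly avoids index arithmetic altogether.
for my $regexp (@search) {
    print "pattern: $regexp\n";
}
```

    Looping over the elements themselves, as in the last loop, is usually the simplest way to rule out this kind of bound error.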

      So how does one maintain the integrity of the backslashes? I am not sure what you mean. Abbreviated Script:

      #!/usr/bin/perl -w
      use warnings;
      use strict;

      my @search = ("/<ID>(.*?)<\/ID>/",                      # ID
                    "/<TimeStamp>(.*?)<\/TimeStamp>/",        # Time Stamp
                    "/<IP_Address>(.*?)<\/IP_Address>/",      # IP Address
                    "/<Title>(.*?)<\/Title>/",                # Title
                    "/<Complainant><Entity>(.*?)<\/Entity>/", # Reporting Entity
                    "/<Contact>(.*?)<\/Contact>/",            # Reporting Entity Contact
                    "/<Address>(.*?)<\/Address>/",            # Reporting Entity Address
                    "/<\/Phone><Email>(.*?)<\/Email>/");      # Reporting Entity Email Address
      print @search;

      dpich@m6400-vb:~/Documents$ perl 2time.pl
      /<ID>(.*?)</ID>//<TimeStamp>(.*?)</TimeStamp>//<IP_Address>(.*?)</IP_Address>//<Title>(.*?)</Title>//<Complainant><Entity>(.*?)</Entity>//<Contact>(.*?)</Contact>//<Address>(.*?)</Address>//</Phone><Email>(.*?)</Email>/dpich@m6400-vb:~/Documents$

        This should work:

        my @search = (
            qr[<ID>(.*?)</ID>],                      # ID
            qr[<TimeStamp>(.*?)</TimeStamp>],        # Time Stamp
            qr[<IP_Address>(.*?)</IP_Address>],      # IP Address
            qr[<Title>(.*?)</Title>],                # Title
            qr[<Complainant><Entity>(.*?)</Entity>], # Reporting Entity
            qr[<Contact>(.*?)</Contact>],            # Reporting Entity Contact
            qr[<Address>(.*?)</Address>],            # Reporting Entity Address
            qr[</Phone><Email>(.*?)</Email>],        # Reporting Entity Email Address
            );
Re: I don't understand why I'm getting an "Use of uninitialized value" error
by GrandFather (Saint) on Nov 03, 2011 at 20:03 UTC

    This looks like you are parsing XML. You should really consider using a module like XML::Twig to do the heavy lifting for you:

    #!/usr/bin/perl -w
    use warnings;
    use strict;
    use XML::Twig;

    my $xml = <<XML;
    <head><ID>This is an id</ID>
    <Title>
    Title stuff
    </Title>
    <Title>Another title</Title>
    </head>
    XML

    my $twig = XML::Twig->new(
        twig_roots => {
            ID          => \&dump,
            TimeStamp   => \&dump,
            IP_Address  => \&dump,
            Title       => \&dump,
            Complainant => \&dump,
        }
    );

    $twig->parse($xml);

    sub dump {
        my ($t, $elt) = @_;
        (my $text = $elt->text()) =~ s/^\s+|\s+$//g;
        print "$text\n";
    }

    Prints:

    This is an id
    Title stuff
    Another title
    True laziness is hard work

      So how would you specify a certain tag if there is more than one? For example, I have two <Contact> tags in my XML. The only way I might be able to sort through them is to pick out the higher branch, say <Source> for one contact and <SourceB> for the other. So what I'm asking for is a way to return <Source><Contact>.

        Exactly that is demonstrated in the example for the section Processing just parts of an XML document. One way of achieving that is:

        #!/usr/bin/perl -w
        use warnings;
        use strict;
        use XML::Twig;

        my $xml = <<XML;
        <head><ID>This is an id</ID>
        <Book1>
        <Title>
        Title stuff
        </Title>
        </Book1>
        <Book2><Title>Another title</Title></Book2>
        </head>
        XML

        my $twig = XML::Twig->new(
            twig_roots => {
                ID            => \&dump,
                TimeStamp     => \&dump,
                IP_Address    => \&dump,
                'Book1/Title' => sub {title (1, @_);},
                'Book2/Title' => sub {title (2, @_);},
                Complainant   => \&dump,
            }
        );

        $twig->parse($xml);

        sub dump {
            my ($t, $elt) = @_;
            (my $text = $elt->text()) =~ s/^\s+|\s+$//g;
            print "$text\n";
        }

        sub title {
            my ($type, $t, $elt) = @_;
            (my $title = $elt->text()) =~ s/^\s+|\s+$//g;
            print "Title $type: $title\n";
        }

        Prints:

        This is an id
        Title 1: Title stuff
        Title 2: Another title

        Take this code. Make sure you have XML::Twig installed. Play with it until you have some understanding of how it works.

      It does involve XML. I have a source XML file. What I'm trying to accomplish is to open the XML once, and have some form of Perl 'scripting thingy' run through it multiple times to pick out the tags.

      I.E. If there is a tag in the xml like '<ID>test</ID>', set the variable $ID to 'test'. If there is a timestamp '<TimeStamp>2011-09-24T21:38:11Z</TimeStamp>', pull out '2011-09-24T21:38:11Z'. I'm a novice when it comes to perl/xml interactions. Just trying to get the information out.

        Yes, but did you take note of what I said? Parsing XML is hard. Especially if you are new to this stuff: don't reinvent the wheel (see Re: Reinventing the wheel and Re: Reinventing the wheel). Try running the code I supplied. Try running it against a sample of your real data. Skim-read the XML::Twig documentation. I know it looks really intimidating, but it contains really good examples, and it will save you much more time debugging your own hand-written XML parsing code than you spend figuring out how to use the module.

        On a more general programming style note: "run through the ... multiple times" is almost always a red flag in programming (and most other places too). The more times you have to perform a slow operation, the slower the overall job will be. Turn your inner two loops inside out so you read the file once and perform multiple matches per line. Compared with accessing information in memory, accessing data from disk is very slow.
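
        A minimal sketch of that inside-out arrangement, assuming qr// patterns like the ones posted elsewhere in this thread. The sample data and the helper name extract_fields are made up for illustration, and an in-memory filehandle stands in for a real file. Each line is read once and every pattern is tried against it.

```perl
use strict;
use warnings;

# Patterns of interest; a subset of the OP's list, written with qr//.
my @regexps = (
    qr{<ID>(.*?)</ID>},
    qr{<TimeStamp>(.*?)</TimeStamp>},
    qr{<Title>(.*?)</Title>},
);

# Read each line once from the given filehandle and try every pattern on it.
# extract_fields is a hypothetical helper name, not part of any module.
sub extract_fields {
    my ($fh) = @_;
    my @found;
    while ( my $line = <$fh> ) {
        for my $regexp (@regexps) {
            push @found, $1 if $line =~ $regexp;
        }
    }
    return @found;
}

# Made-up sample data; an in-memory filehandle stands in for a file on disk.
my $sample = <<'XML';
<ID>test</ID>
<TimeStamp>2011-09-24T21:38:11Z</TimeStamp>
<Title>Example</Title>
XML

open my $fh, '<', \$sample or die "Could not open in-memory sample: $!";
print "$_\n" for extract_fields($fh);
close $fh;
```

        Note that values now come out in the order they appear in the file rather than in the order of the pattern list, and that this is still line-based matching, not real XML parsing.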
Re: I don't understand why I'm getting an "Use of uninitialized value" error
by ikegami (Patriarch) on Nov 03, 2011 at 18:51 UTC

    To answer your direct question, you backslash them. The string literal «"foo \\ bar"» produces the string «foo \ bar».

    But that's the least of your problems: /.../ is the match operator, so it's Perl code. If you want to compile and run Perl code, you'd have to use eval EXPR.

    But if @search contained regular expressions instead of Perl code, you wouldn't need to go to such drastic measures.

    my @patterns = (
        "<ID>(.*?)</ID>",                      # ID
        "<TimeStamp>(.*?)</TimeStamp>",        # Time Stamp
        "<IP_Address>(.*?)</IP_Address>",      # IP Address
        "<Title>(.*?)</Title>",                # Title
        "<Complainant><Entity>(.*?)</Entity>", # Reporting Entity
        "<Contact>(.*?)</Contact>",            # Reporting Entity Contact
        "<Address>(.*?)</Address>",            # Reporting Entity Address
        "</Phone><Email>(.*?)</Email>",        # Reporting Entity Email
    );

    my @xml_files = <*xml>;
    for my $file (@xml_files) {
        for my $pattern (@patterns) {
            ...
            if ($line =~ $pattern) {
                ...
            }
        }
    }

    Furthermore, it would be beneficial to use qr//.

    my @regexps = (
        qr{<ID>(.*?)</ID>},                      # ID
        qr{<TimeStamp>(.*?)</TimeStamp>},        # Time Stamp
        qr{<IP_Address>(.*?)</IP_Address>},      # IP Address
        qr{<Title>(.*?)</Title>},                # Title
        qr{<Complainant><Entity>(.*?)</Entity>}, # Reporting Entity
        qr{<Contact>(.*?)</Contact>},            # Reporting Entity Contact
        qr{<Address>(.*?)</Address>},            # Reporting Entity Address
        qr{</Phone><Email>(.*?)</Email>},        # Reporting Entity Email
    );

    my @xml_files = <*xml>;
    for my $file (@xml_files) {
        for my $regexp (@regexps) {
            ...
            if ($line =~ $regexp) {
                ...
            }
        }
    }
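
    A small sketch comparing the three forms discussed above, using a made-up one-line input. It suggests that a double-quoted "/.../" string loses its backslashes and keeps the outer slashes as literal characters of the pattern, whereas a bare pattern string or a qr// object matches as intended.

```perl
use strict;
use warnings;

my $line = "<ID>test</ID>";    # made-up sample input

# Double-quoted string: "\/" collapses to "/", and the outer slashes are
# ordinary characters that become part of the pattern.
my $as_string = "/<ID>(.*?)<\/ID>/";
print "string holds: $as_string\n";
print "string with slashes matches: ", ( $line =~ $as_string ? "yes" : "no" ), "\n";

# A plain pattern string without the delimiters works as a regular expression.
my $pattern = "<ID>(.*?)</ID>";
print "plain pattern captured: $1\n" if $line =~ $pattern;

# qr{} compiles the pattern up front, and "/" needs no escaping inside {}.
my $regexp = qr{<ID>(.*?)</ID>};
print "qr// captured: $1\n" if $line =~ $regexp;
```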
Re: I don't understand why I'm getting an "Use of uninitialized value" error
by graff (Chancellor) on Nov 04, 2011 at 04:36 UTC
    The problem you asked about has been solved by jwkrahn and ikegami above, and GrandFather has recommended his favorite XML parsing method. That leaves it to me to point out two other things:
    1. Your algorithm is inefficient to the point of being somewhat embarrassing: given 8 tags of interest, you read every line of every file 8 times, looking for each tag in turn. That does not scale well when dealing with more tags and/or more (and/or larger) files.
    2. If you think XML::Twig's mile-long manual is overkill, I agree, but you should still be using some sort of XML parsing approach, because That's The Right Way To Do It (and There's More Than One Way To Do It with XML parsing).

    Here's a simple way, which I've tested on a directory containing a couple of XML files that probably have enough in common with the ones you have. Reading the first 250 lines of the XML::Parser manual was sufficient to know how to write this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my ( $tagname, $tagtext );  # "globals" used in callback subs

    my @target_tags = qw(ID TimeStamp IP_Address Title Complainant Contact Address Email);
    my $target_regex = join( '|', @target_tags );

    my $parser = XML::Parser->new( Handlers => { Start => \&get_tagname,
                                                 Char  => \&get_tagtext,
                                                 End   => \&print_tagdata,
                                               } );
    for my $xmlfile ( <*.xml> ) {
        $parser->parsefile( $xmlfile );
    }

    sub get_tagname {
        $tagname = $_[1];
        $tagtext = '';
    }

    sub get_tagtext {
        $tagtext .= $_[1];
    }

    sub print_tagdata {
        if ( $_[1] =~ /$target_regex/ ) {
            print "$_[1] = $tagtext\n";
        }
    }
    (updated to fix typo in target_tags list)

    One notable difference between this and the OP code is that this will print tag labels and their contents in the order in which they occur in the XML files. If that's okay, then there's nothing more to worry about.

    (But if you need to control tag order and it varies from one xml file to the next, you just need to add a global hash for storing tag values, then print the hash contents in the desired order after parsing each file. -- update: and don't forget to assign "()" to the hash, i.e. empty it, before parsing each file.)
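
    A rough sketch of that hash-based variant. To keep it free of non-core modules, the parser events are simulated by calling the handlers directly. The handler names, the tag order, and the sample events are all assumptions made for illustration. The handlers take the same argument shapes as XML::Parser's Start, Char and End handlers, so they could be registered with the parser in place of the simulated calls.

```perl
use strict;
use warnings;

my @tag_order = qw(ID TimeStamp Title);    # desired output order (assumed)
my %is_target = map { $_ => 1 } @tag_order;

my %tag_value;       # "global" hash filled while one file is parsed
my $tagtext = '';    # text collected for the element currently open

# First argument is the parser object, second the tag name or text chunk.
sub start_tag { $tagtext = ''; }
sub char_data { $tagtext .= $_[1]; }
sub end_tag   { $tag_value{ $_[1] } = $tagtext if $is_target{ $_[1] }; }

# Return the collected values as lines in the fixed order, then empty the
# hash so nothing leaks into the next file.
sub flush_values {
    my @lines = map { "$_ = $tag_value{$_}" }
                grep { exists $tag_value{$_} } @tag_order;
    %tag_value = ();
    return @lines;
}

# Simulated events for one file whose tags arrive out of order.
# With XML::Parser these calls would come from $parser->parsefile($xmlfile).
for my $event ( [ Title => 'Example' ], [ ID => 'test' ] ) {
    my ( $tag, $text ) = @$event;
    start_tag( undef, $tag );
    char_data( undef, $text );
    end_tag( undef, $tag );
}
print "$_\n" for flush_values();
```

    Calling flush_values after each file both prints in the chosen order and performs the reset mentioned in the update above.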

    XML::Parser is the surprisingly simple foundation on which many "higher-level" parsing modules are built. I'm actually surprised at how many CPAN modules have been created that are layers around XML::Parser, considering how easy and efficient this module is.

    For relatively simple tasks like yours, the logic involved in using XML::Parser is pretty trivial, and when you use it, you really save a lot of effort, and end up with code that is simpler, more coherent, more robust, and easier to maintain.