rsriram has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am writing a script to search through an ASCII file for the pattern <BOXIND NUM="x" ID="BX1.xx.xxx"/> and push the contents of the ID attribute onto an array.

while ($file =~ /<BOXIND NUM=\"(.+)\" ID=\"(.*?)\"\/>/g) {push(@ind,$2);}

@ind is the array to which I need to add the ID values. If there are two BOXIND elements on the same line, this regex returns only the last element found on that line. Can someone let me know what is wrong with this pattern?

Sriram

Replies are listed 'Best First'.
Re: Using a regex search pattern
by Corion (Patriarch) on Jun 21, 2006 at 09:34 UTC

    In your regex, /<BOXIND NUM=\"(.+)\" ID=\"(.*?)\"\/>/g, the .+ part is greedy and gobbles up too much. You want .+?, or better, [^"]+ in there to stop the regex from matching outside of the attribute field.

    Maybe you want to look at XPath and/or an HTML parser instead, at least if your task is anything beyond simple extraction ;)
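
To make the greediness point concrete, here is a minimal sketch comparing the original pattern with the [^"]+ variant. The sample line and its ID values are made up for illustration, and the ID capture is also written as [^"]* here, which is an assumption beyond what was suggested above.

```perl
use strict;
use warnings;

# Hypothetical sample line with two BOXIND elements (IDs are invented).
my $file = '<BOXIND NUM="1" ID="BX1.01.001"/> text <BOXIND NUM="2" ID="BX1.01.002"/>';

# Original pattern: the greedy .+ runs on to the last `" ID="` in the line,
# so both elements are consumed by one match and only the last ID is captured.
my @greedy;
while ($file =~ /<BOXIND NUM="(.+)" ID="(.*?)"\/>/g) { push @greedy, $2; }

# [^"]+ cannot cross a double quote, so each element matches on its own.
my @ind;
while ($file =~ /<BOXIND NUM="([^"]+)" ID="([^"]*)"\/>/g) { push @ind, $2; }

print "greedy: @greedy\n";   # greedy: BX1.01.002
print "fixed:  @ind\n";      # fixed:  BX1.01.001 BX1.01.002
```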

Re: Using a regex search pattern
by jwkrahn (Abbot) on Jun 21, 2006 at 10:11 UTC
    You probably want something like this:
    push @ind, $file =~ /<BOXIND NUM=".+?" ID="(.*?)"\/>/g;
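
The one-liner above relies on a /g match in list context returning the captured text from every match at once, so no while loop is needed. A small sketch, assuming a made-up two-element input line:

```perl
use strict;
use warnings;

# Hypothetical input; the ID values are invented for the example.
my $file = '<BOXIND NUM="1" ID="BX1.01.001"/><BOXIND NUM="2" ID="BX1.01.002"/>';

my @ind;
# In list context, m//g returns the contents of the single capture group
# for every match, and push appends the whole list.
push @ind, $file =~ /<BOXIND NUM=".+?" ID="(.*?)"\/>/g;

print scalar(@ind), " IDs: @ind\n";   # 2 IDs: BX1.01.001 BX1.01.002
```

Note that this only works so neatly because the pattern has exactly one capture group; with two groups the returned list would interleave NUM and ID values.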
Re: Using a regex search pattern
by leocharre (Priest) on Jun 21, 2006 at 13:57 UTC
    while ( $file =~/<BOXIND\s+NUM\s*=\s*"(\d+)"\s+ID\s*=\s*"([\w\.]+)"\s*\/>/ig) { push(@ind,$2); }

    The .+ in your pattern matches as much as possible.
    \s+ means there must be at least one whitespace char (space, tab, newline, carriage return).
    \s* means there might be whitespace.
    The part [\w\.]+ means match at least one character, each being either a "word character" (a-z upper or lower case, 0-9 and _) or a dot. If the ID contained a parenthesis or an ampersand, the match would fail.
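
A quick way to check those claims is to run the stricter pattern against a few inputs. This is only a sketch: the three sample strings are invented, and the whitespace before /> is written as a plain \s*.

```perl
use strict;
use warnings;

# Whitespace-tolerant, case-insensitive pattern; NUM must be digits and
# ID may contain only word characters and dots.
my $re = qr/<BOXIND\s+NUM\s*=\s*"(\d+)"\s+ID\s*=\s*"([\w.]+)"\s*\/>/i;

# Hypothetical test inputs.
my @samples = (
    '<BOXIND NUM="1" ID="BX1.01.001"/>',          # compact form
    '<boxind  NUM = "2"   ID = "BX1.01.002" />',  # extra spaces, lower case
    '<BOXIND NUM="3" ID="BX1&01"/>',              # ampersand in the ID
);

my @results;
for my $s (@samples) {
    push @results, ($s =~ $re) ? "match: $2" : "no match";
}
print "$_\n" for @results;
# match: BX1.01.001
# match: BX1.01.002
# no match
```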

    Look into lookahead and lookbehind too..

    (And .* means "how hard can I make life for myself") :)