ant has asked for the wisdom of the Perl Monks concerning the following question:


A practical pattern-matching query seems to be catching me out.

I have a large text file (over a gigabyte in size). Each line has <CS_REFCLT>12526489</CS_REFCLT> in it somewhere. I would like to get at the number between the tags, but as it is not at a fixed position in the line I can't use substr to get at it. I have got around this by using split, like below.
    while (my $line = <SESAME>) {
        my ($tempa, $tempb) = split(/<CS_REFCLT>/, $line);
        my ($value, $tempc) = split(/<\/CS_REFCLT>/, $tempb);
    }
However, I'd like this also as a pattern match so that I can compare speeds and speed up the program, as I think a regular expression will be quicker.

Therefore a pattern match snippet of code for this would be much appreciated.

Thanks in Advance


Replies are listed 'Best First'.
Re: Accessing data between two tags
by Skeeve (Parson) on Nov 09, 2006 at 13:23 UTC
    The others already gave regexes. But how about this split approach:
    while (my $line = <SESAME>) {
        my ($tempa, $value, $tempb) = split m#</?CS_REFCLT>#, $line, 3;
    }

    Update: Thanks to jdporter for telling me about my mistake, using 2 instead of 3 above. Fixed...

    Of course this will also find other constructs like </CS_REFCLT>dfasdf</CS_REFCLT>.
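The difference can be seen with a small sketch that runs the split above and an anchored regex on a well-formed and a malformed line. The sample lines and the helper names `via_split` and `via_regex` are made up for illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub via_split {
    my ($line) = @_;
    # Splits on either the opening or the closing tag, so it cannot
    # tell a proper open/close pair from two closing tags.
    my (undef, $value) = split m#</?CS_REFCLT>#, $line, 3;
    return $value;
}

sub via_regex {
    my ($line) = @_;
    # Requires an opening tag, then digits, then a closing tag.
    my ($value) = $line =~ m#<CS_REFCLT>(\d+)</CS_REFCLT>#;
    return $value;
}

my $good = 'abc <CS_REFCLT>12526489</CS_REFCLT> def';
my $bad  = 'abc </CS_REFCLT>dfasdf</CS_REFCLT> def';

for my $line ($good, $bad) {
    my $s = via_split($line);
    my $r = via_regex($line);
    printf "split: %-10s regex: %s\n",
        defined $s ? $s : '(none)',
        defined $r ? $r : '(none)';
}
```

On the malformed line the split still returns the text between the two closing tags, while the regex returns nothing.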

    But maybe, if the data is XML, a real XML parser like XML::Twig is something that would help you best here...
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => { CS_REFCLT => \&cs_refclt },
    );
    my @numbers;

    $twig->parsefile('filename');
    # here you will have all numbers in @numbers.

    sub cs_refclt {
        my ($t, $elt) = @_;
        push @numbers, $elt->text();
    }

Re: Accessing data between two tags
by prasadbabu (Prior) on Nov 09, 2006 at 11:21 UTC

    Hi ant,

    Are you looking for something like this?

    my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

    or

    my (@value) = $line =~ m|<CS_REFCLT>((?:(?!</CS_REFCLT>).)*)</CS_REFCLT>|g;

    Also take a look at perlre.
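Inside the read loop from the question, the first pattern might be used roughly as follows. This is a sketch only: the sample data and the in-memory filehandle stand in for the real file, and it assumes at most one tag pair per line.

```perl
use strict;
use warnings;

# Stand-in data; in the real program this would be the large file.
my $data = <<'END';
foo <CS_REFCLT>12526489</CS_REFCLT> bar
a line without the tag
<CS_REFCLT>42</CS_REFCLT> trailing text
END

# An in-memory filehandle keeps the sketch self-contained.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @values;
while (my $line = <$fh>) {
    # In scalar context the match is true only if the tag pair was found;
    # $1 then holds the digits between the tags.
    if ($line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|) {
        push @values, $1;
    }
}
close $fh;

print "$_\n" for @values;
```

Reading line by line like this keeps memory use flat, which matters for a file of that size.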


      ITYM my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g; as you are trying to pull out a scalar not an array.




        I am getting an array as output. With your version, only one value is captured even when the 'g' modifier is used.

        use strict;
        use warnings;

        my $line = 'some text <CS_REFCLT>12121</CS_REFCLT> then some text <CS_REFCLT>4654</CS_REFCLT> here';
        my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
        my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
        $" = "\t";
        print "Array: @value\n";
        print "Scalar: $value\n";

        prints:
        -------
        Array: 12121	4654
        Scalar: 12121
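For the speed comparison the original question asks about, the core Benchmark module can time a split and a regex on the same input. This is only a sketch: the sample line, the sub names `with_split` and `with_regex`, and the iteration count are arbitrary choices, and real timings will depend on the actual data and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary sample line; real lines may be much longer.
my $line = 'some text <CS_REFCLT>12526489</CS_REFCLT> more text';

# Hypothetical helper: the single-split approach from this thread.
sub with_split {
    my (undef, $value) = split m#</?CS_REFCLT>#, $line, 3;
    return $value;
}

# Hypothetical helper: the capturing regex from this thread.
sub with_regex {
    my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|;
    return $value;
}

# Run each sub 500_000 times and print a comparison table.
cmpthese(500_000, {
    'split' => \&with_split,
    'regex' => \&with_regex,
});
```

Measuring on a sample of the real file is more telling than a single synthetic line, since line length and tag position affect both approaches.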


Re: Accessing data between two tags
by planetscape (Chancellor) on Nov 10, 2006 at 00:55 UTC

    In general, parsing tag-delimited data with a regex is fraught with peril, and can cause all manner of interesting failures, like segfaults. While I do not know with certainty the format you are parsing, I would strongly recommend you use a parser built and tested to work with the kind of data you are processing, one such as XML::Twig, XML::TreeBuilder, or HTML::TreeBuilder, for example.


Re: Accessing data between two tags
by Jenda (Abbot) on Dec 29, 2006 at 17:57 UTC
    use XML::Rules;

    my @numbers;
    my $parser = XML::Rules->new(
        rules => [
            '_default'  => '',    # not interested in most tags
            'CS_REFCLT' => sub { push @numbers, $_[1]->{_content}; return },
        ],
    );
    $parser->parsefile($filename);

    This way you don't have to worry whether there's just one <CS_REFCLT> on a line or whether there are more, etc.