
Accessing data between two tags

by ant (Scribe)
on Nov 09, 2006 at 11:13 UTC

ant has asked for the wisdom of the Perl Monks concerning the following question:


A practical pattern-matching query seems to be catching me out.

I have a large text file (over a gigabyte in size), and each line has <CS_REFCLT>12526489</CS_REFCLT> in it somewhere. I would like to get at the number between the tags, but as it is not at a fixed position in the line I can't use substr to get at it. I have got around this by using split, as below.

    while (my $line = <SESAME>) {
        my ($tempa, $tempb) = split(/<CS_REFCLT>/, $line);
        my ($value, $tempc) = split(/<\/CS_REFCLT>/, $tempb);
    }

However, I'd like this also as a pattern match so that I can compare speeds and speed up the program, as I think a regular expression will be quicker.

Therefore a pattern match snippet of code for this would be much appreciated.

Thanks in Advance


Replies are listed 'Best First'.
Re: Accessing data between two tags
by Skeeve (Parson) on Nov 09, 2006 at 13:23 UTC
    The others already gave regexes. But how about this split approach:

        while (my $line = <SESAME>) {
            my ($tempa, $value, $tempb) = split m#</?CS_REFCLT>#, $line, 3;
        }

    Update: Thanks to jdporter for telling me about my mistake, using 2 instead of 3 above. Fixed...

    Of course this will also find other constructs like </CS_REFCLT>dfasdf</CS_REFCLT>.
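
    To make that caveat concrete, here is a minimal sketch comparing the split above with a stricter capturing regex. The two sample lines are made up for illustration; they are not taken from the real file.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my $good = 'abc <CS_REFCLT>12526489</CS_REFCLT> xyz';
my $bad  = 'abc </CS_REFCLT>dfasdf</CS_REFCLT> xyz';

# Split on either tag with a limit of 3: text before, value, text after.
my (undef, $from_good) = split m#</?CS_REFCLT>#, $good, 3;
my (undef, $from_bad)  = split m#</?CS_REFCLT>#, $bad,  3;

# A regex that insists on an opening tag, digits, then a closing tag.
my ($strict_good) = $good =~ m#<CS_REFCLT>(\d+)</CS_REFCLT>#;
my ($strict_bad)  = $bad  =~ m#<CS_REFCLT>(\d+)</CS_REFCLT>#;

print "split, good line: $from_good\n";
print "split, bad line:  $from_bad\n";
print "regex, good line: $strict_good\n";
print "regex, bad line:  ", (defined $strict_bad ? $strict_bad : '(no match)'), "\n";
```

    The split happily returns 'dfasdf' for the malformed line, while the stricter regex returns nothing for it. Which behaviour is wanted depends on how much the input can be trusted.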

    But maybe, if the data is XML, a real XML parser like XML::Twig is something that would help you best here...

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        my $twig = XML::Twig->new(
            twig_handlers => {
                CS_REFCLT => \&cs_refclt,
            },
        );
        my @numbers;
        $twig->parsefile('filename');
        # here you will have all numbers in @numbers.

        # Handler called for each CS_REFCLT element: collect its text.
        sub cs_refclt {
            my ($t, $elt) = @_;
            push @numbers, $elt->text();
        }

Re: Accessing data between two tags
by prasadbabu (Prior) on Nov 09, 2006 at 11:21 UTC

    Hi ant,

    Are you looking for something like this?

        my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

    or

        my (@value) = $line =~ m|<CS_REFCLT>((?:(?!</CS_REFCLT>).)*)</CS_REFCLT>|g;

    Also take a look at perlre.
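
    Since the question is about comparing speeds, a rough sketch using the core Benchmark module could look like the following. The sample line and the sub names by_split and by_regex are made up for illustration, and timings on one short line say little about a gigabyte file, so treat the result as a hint only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; real timings depend on the actual data.
my $line = 'some text <CS_REFCLT>12526489</CS_REFCLT> more text';

# The two-step split from the original question.
sub by_split {
    my (undef, $rest) = split /<CS_REFCLT>/, $_[0];
    my ($value)       = split /<\/CS_REFCLT>/, $rest;
    return $value;
}

# A single capturing match.
sub by_regex {
    my ($value) = $_[0] =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|;
    return $value;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'split' => sub { by_split($line) },
    'regex' => sub { by_regex($line) },
});
```

    Running it against a few thousand real lines from the file instead of one invented line would give a more trustworthy comparison.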


      ITYM my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g; as you are trying to pull out a scalar not an array.




        I am getting an array as output. With your solution, we get only one value even with the 'g' modifier.

            use strict;
            use warnings;

            my $line = 'some text <CS_REFCLT>12121</CS_REFCLT> then some text <CS_REFCLT>4654</CS_REFCLT> here';
            my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
            my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
            $" = "\t";
            print "Array: @value\n";
            print "Scalar: $value\n";

            prints:
            -------
            Array: 12121 4654
            Scalar: 12121


Re: Accessing data between two tags
by planetscape (Chancellor) on Nov 10, 2006 at 00:55 UTC

    In general, parsing tag-delimited data with a regex is fraught with peril, and can cause all manner of interesting failures, like segfaults. While I do not know with certainty the format you are parsing, I would strongly recommend you use a parser built and tested to work with the kind of data you are processing, one such as XML::Twig, XML::TreeBuilder, or HTML::TreeBuilder, for example.
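
    As a small illustration of how brittle a fixed pattern can be, here is a sketch that runs the strict regex from elsewhere in this thread over a few variations of the same element. The variations are hypothetical; I do not know whether the real file ever contains them.

```perl
use strict;
use warnings;

my $re = qr|<CS_REFCLT>(\d+)</CS_REFCLT>|;

# Hypothetical variations, all of them legal XML for the same element.
my @samples = (
    '<CS_REFCLT>12526489</CS_REFCLT>',          # the form shown in the question
    '<CS_REFCLT id="1">12526489</CS_REFCLT>',   # an attribute added
    '<CS_REFCLT >12526489</CS_REFCLT >',        # whitespace inside the tags
    '<CS_REFCLT> 12526489 </CS_REFCLT>',        # whitespace around the value
);

# Collect the captured number for each sample, or undef when nothing matches.
my @results;
for my $sample (@samples) {
    push @results, ($sample =~ $re ? $1 : undef);
}

for my $i (0 .. $#samples) {
    printf "%-42s => %s\n", $samples[$i],
        defined $results[$i] ? $results[$i] : '(no match)';
}
```

    Only the first form matches. A parser treats all four as the same element, which is the main argument for using one when the format is not fully under your control.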


Re: Accessing data between two tags
by Jenda (Abbot) on Dec 29, 2006 at 17:57 UTC

        use XML::Rules;

        my @numbers;
        my $parser = XML::Rules->new(
            rules => [
                '_default'  => '',  # not interested in most tags
                'CS_REFCLT' => sub { push @numbers, $_[1]->{_content}; return },
            ],
        );
        $parser->parse($filename);

    This way you don't have to worry whether there's just one <CS_REFCLT> on a line or whether there are more, etc.
