in reply to Regex keep matching the last possible match (but should get all)

You could try making all your gobbling matches less greedy: change .+ to .+?. Even better would be [^>]+ for stuff within HTML tags and [^<]+ for stuff that is supposed to capture (text) content.
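
As a rough illustration of what "less greedy" changes, here is a minimal sketch. The sample string is made up for this example and is not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input, only for illustration.
my $html = '<b>one</b> and <b>two</b>';

# Greedy: .+ runs to the end of the string and backtracks only as far as
# the last </b>, so a single match swallows both elements.
my @greedy = $html =~ /<b>(.+)<\/b>/g;

# Non-greedy: .+? stops at the first </b> it can reach, so each element
# is matched on its own.
my @lazy = $html =~ /<b>(.+?)<\/b>/g;

print scalar(@greedy), " greedy match(es): @greedy\n";
print scalar(@lazy), " non-greedy match(es): @lazy\n";
```

The greedy version finds one match spanning both elements, while the non-greedy version finds two.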

Better still, use a real HTML parser, for example HTML::TreeBuilder::XPath.
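
A minimal sketch of the parser route, assuming HTML::TreeBuilder::XPath is installed from CPAN. The HTML fragment and the XPath expression //td are invented for this example, since the original file is not shown here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module, not part of core Perl

# Hypothetical input; replace with the content of the real file.
my $html = '<table><tr><td class="a">foo</td><td>bar</td></tr></table>';

# Parse the markup into a tree, then ask for the text of every <td>.
my $tree  = HTML::TreeBuilder::XPath->new_from_content($html);
my @cells = $tree->findvalues('//td');

print join(', ', @cells), "\n";

# Free the tree explicitly, as recommended for HTML::TreeBuilder objects.
$tree->delete;
```

The parser deals with attributes, nesting and odd whitespace, so no greediness tuning is needed.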

Re^2: Regex keep matching the last possible match (but should get all)
by Anonymous Monk on May 18, 2015 at 09:48 UTC
Re^2: Regex keep matching the last possible match (but should get all)
by Anonymous Monk on May 18, 2015 at 10:57 UTC

    Dear Perl-Monks

    I will have a look at the links provided on the page in time, but in the meantime I created a small test case to verify that my code indeed yields the results I expect.

    Consider the following file (target.txt):

    blabla:(123):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0815))ding knickknack boing 44 nothing here
    blabla:(123):falleriefallera dingdong moep blubb 471 dingdong blob))hop((gob))sob((0815))ding knickknack boing 45 nothing here too
    blabla:(1344):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0815))ding knickknack boing 46 nothing again
    blabla:(123):falleriefallera dingdong moep blubb 4711 dingdong blob))hop((gob))sob((0825))ding knickknack boing 47

    Processing it with the following Perl script:

    use strict;
    use warnings;

    # 1. get file and stuff it into an array
    # that is what it will be in the target code
    open FILE, 'target.txt' or die "nope dude: $!";
    my @stuff;
    while(<FILE>){
        chomp $_;
        push @stuff, $_;
    }
    print "reading done\n";

    # 2. make a long line out of it
    # because I still have problems using an array for this :(
    my $longline;
    foreach my $x (@stuff){
        $longline .= $x;
    }

    # 3. get all matches and place them in an array (of arrays)
    my @super;
    while ($longline =~ /\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)/g){
        my @sub = ($1, $2, $3, $4);
        push @super, \@sub;
    }

    # 4. we should have four entries in that @super
    print scalar @super, "\n";

    will yield this (at least the debugger thinks so):

    0  ARRAY(0x1f08820)
       0  123
       1  4711
       2  0815
       3  44
    1  ARRAY(0x2199678)
       0  123
       1  471
       2  0815
       3  45
    2  ARRAY(0x21994e0)
       0  1344
       1  4711
       2  0815
       3  46
    3  ARRAY(0x219f128)
       0  123
       1  4711
       2  0825
       3  47

    So it works the way I hoped, IF I can ever create a valid regex for this. For now I'm busy looking into these walkthroughs.

    By the way, using .+? didn't make the regex work, and I don't understand how [^>] should be used to help me in my case :( Because... I do find the correct piece of plain text in my file, so how should I include "no >" and "no <" inside?

    Greetings, a random visitor

      Your example is far more restricted, because a character matched by \D (a non-digit) can never be matched by \d (a digit), and vice versa. The pattern therefore cannot run past the boundary of a number, no matter how greedy it is.
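
      To make the contrast concrete, here is a minimal sketch that runs the same four-number pattern twice, once with \D+ and once with .+ as the filler. The sample string is a shortened, made-up version of the test data, with only two records.

      ```perl
      use strict;
      use warnings;

      # Made-up, shortened sample: two records with four numbers each.
      my $line = 'a:(123):b 4711 c((0815))d 44 x a:(123):b 471 c((0815))d 45';

      # \D+ can never run into a digit, so each \d+ gets its whole number
      # and the /g loop advances record by record.
      my @strict;
      while ($line =~ /\D+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)/g) {
          push @strict, [$1, $2, $3, $4];
      }

      # .+ may match digits too, so the first .+ swallows as much as it can.
      # Only one match is found, and its captures are leftover digits from
      # the end of the string.
      my @loose;
      while ($line =~ /.+(\d+).+(\d+).+(\d+).+(\d+)/g) {
          push @loose, [$1, $2, $3, $4];
      }

      print scalar(@strict), " match(es) with \\D+\n";
      print scalar(@loose), " match(es) with .+\n";
      ```

      With \D+ both records are found; with .+ only a single match near the end of the string survives, which is the "last possible match" symptom.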

      This is why I suggested that you could use [^>]+ for characters within tags and [^<]+ for characters outside of tags. Each matches only ordinary characters and stops before the > that closes a tag or the < that opens the next one.
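
      A minimal sketch of how the two classes could be combined. The HTML fragment and the td element are assumptions for this example, since the real file is not shown in the thread.

      ```perl
      use strict;
      use warnings;

      # Hypothetical HTML fragment, only for illustration.
      my $html = '<tr><td class="a">foo</td><td>bar</td></tr>';

      my @cells;
      # <td[^>]*> : the opening tag; its attributes may be anything but ">"
      # ([^<]+)   : the text content, which ends before the next "<"
      while ($html =~ /<td[^>]*>([^<]+)<\/td>/g) {
          push @cells, $1;
      }

      print join(', ', @cells), "\n";
      ```

      Because neither class can cross a tag boundary, every cell is captured on its own, even though + is greedy.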