PerlMonks  

regex - problem with the loop I believe or maybe the regex itself ?

by trummelbummel (Initiate)
on Feb 12, 2014 at 15:53 UTC

trummelbummel has asked for the wisdom of the Perl Monks concerning the following question:

Hi!
I am new to Perl and I have to get a script ready for my research. I have to analyse a huge string and extract the substring between two keywords: dIonly and workset followed by two parentheses. Right now I am stuck at the foreach loop. Everything else seems to be running and the string gets slurped. I have to reverse the words, but not the characters, in the string because there are a lot of unwanted worksets, and I just want to go from dIonly to the first workset(( including some text and then two parentheses at the end )). Hence, starting from workset((, it is like a nested regex: first I want everything from dIonly to workset((, including what comes right after.
I would really appreciate help!
Thank you very much!
This is an example of my string :
originally it would have:
workset((ab;joiret;garg)) c wasdobao; erhgahufdgah; c workset((adsghlia) c aghaoeriarg;oi c aasdfgohaerg c workset(empty) c ah;sorguiaerg c aoi;hgruio;ghaer c playA c dIonly
but I am reversing it for the above reasons to:
dIonly c ....................................workset((.......)) and I need the items in the workset as well.
<code>
#!/usr/bin/perl
# perl -d ./perl_debugger.pl
use strict;
use Data::Dumper qw(Dumper);
use File::Slurp;

my @a_linesorig;
my @a_out;
my @a_str;
my $line;
my $reversedline;
my @a_linesrev;
my @reversedarray;
my $reversedline;
my $str;

open(my $fh, "<", "data.txt") or die "cannot open < data.txt: $!";
my $line = read_file('data.txt');
@a_linesorig = split(' ', $line);
@a_linesrev = reverse(@a_linesorig);
$reversedline = join(' ', @a_linesrev);  # joins the reversed list to a single string again
@reversedarray = split( /solution/, $reversedline );  # should split huge string into a list from one solution to next

foreach $str (@reversedarray) {
    if ($str =~ /\bdIonly:\b(.*?)\bworkset\b\\(\\(/g);
    print (@a_out, "$str");
}
close $fh or die "can't close file: $!";

open(my $fh, ">", "output.txt") or die "cannot open > output.txt: $!";
foreach $str (@a_out) {
    print ($fh "$str\n");
}
close $fh or die "can't close file: $!";
</code>

Replies are listed 'Best First'.
Re: regex - problem with the loop I believe or maybe the regex itself ?
by AnomalousMonk (Archbishop) on Feb 12, 2014 at 16:36 UTC
    I don't entirely understand the OP, but will something like this serve (sorry for all the wraparound)?

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'workset((ab;joiret;garg)) c wasdobao; erhgahufdgah; c workset((adsghlia) c aghaoeriarg;oi c aasdfgohaerg c workset(empty) c ah;sorguiaerg c aoi;hgruio;ghaer c playA c dIonly';
     print qq{'$s'};
     ;;
     my $rev_dIonly  = qr{ ylnoId }xms;
     my $rev_workset = qr{ \(\( teskrow }xms;
     ;;
     my $rs = reverse $s;
     my ($capture) = $rs =~ m{ ($rev_dIonly .*? $rev_workset) }xms;
     print qq{'$capture'};
     $capture = reverse $capture;
     print qq{'$capture'};
    "
    'workset((ab;joiret;garg)) c wasdobao; erhgahufdgah; c workset((adsghlia) c aghaoeriarg;oi c aasdfgohaerg c workset(empty) c ah;sorguiaerg c aoi;hgruio;ghaer c playA c dIonly'
    'ylnoId c Ayalp c reahg;oiurgh;ioa c greaiugros;ha c )ytpme(teskrow c greahogfdsaa c io;graireoahga c )ailhgsda((teskrow'
    'workset((adsghlia) c aghaoeriarg;oi c aasdfgohaerg c workset(empty) c ah;sorguiaerg c aoi;hgruio;ghaer c playA c dIonly'

    See perlre, perlretut, perlrequick.

    (And the proper closing tag for a code block opened with  <c> is a  </c> tag: note the  / forward-slash.)

      Thank you, but unfortunately that does not help.
      So firstly, one question: if I put the string into an array, split on a space delimiter, will reversing it reverse the characters themselves as well? I thought this only happens with a string, not with a list. Hence, if I print the array it is fine, and we do not have to reverse to search for e.g. ylnoId.
      Secondly, this was just an example of the string, maybe a bad one. But the point is, I do not want to change much of the script, as it seems to work fine, except that I am not able to parse the part in which the if statement is used. The right string is fed into the loop, but then the if statement does not execute. If you use
      print Dumper \@reversedarray;
      it works fine until I add the if statement. If you could help me with that I would be more than grateful.
      If you have any further questions, please do not hesitate to ask!
        So firstly, one question: if I put the string into an array, split on a space delimiter, will reversing it reverse the characters themselves as well? I thought this only happens with a string, not with a list.

        See reverse. In scalar context, reverse reverses the characters of a string; in list context, it reverses the order of a list. If you want both, use

        @yra = reverse map { scalar reverse $_ } @ary;
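To make the context difference concrete, here is a minimal sketch. The sample string and the variable names (@words, $by_word, $by_char, @both) are made up for illustration and are not taken from the OP's data file.

```perl
use strict;
use warnings;

# Hypothetical sample; the real input would come from data.txt.
my @words = split ' ', 'workset((a;b)) c playA c dIonly';

# List context: only the order of the elements changes,
# the characters inside each word stay as they are.
my $by_word = join ' ', reverse @words;

# Scalar context: the string is reversed character by character.
my $by_char = scalar reverse 'dIonly';

# Both at once, as in the one-liner above.
my @both = reverse map { scalar reverse $_ } @words;

print "$by_word\n";   # word order reversed
print "$by_char\n";   # characters reversed
print "@both\n";      # both reversed
```

So splitting on whitespace, reversing the list, and joining again leaves each word readable, which appears to be what the original script relies on.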

        update: if you want to match parens, don't escape the backslash.

        - if ($str =~ /\bdIonly:\b(.*?)\bworkset\b\\(\\(/g);
        + if ($str =~ /\bdIonly:\b(.*?)\bworkset\b\(\(/g);
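A hedged sketch of how the loop might look with the parentheses escaped correctly. It makes a few assumptions that go beyond the diff above: that the intent was to collect matches into an array (so push instead of print), that the if needs a block instead of a trailing semicolon, and that the colon after dIonly is dropped because the sample data shows none. The @segments array is a hypothetical stand-in for @reversedarray.

```perl
use strict;
use warnings;

# Hypothetical stand-in for @reversedarray.
my @segments = (
    'dIonly c playA c workset((ab;joiret;garg)) c',
    'no marker in this segment',
);

my @out;
for my $str (@segments) {
    # \( matches a literal opening parenthesis. Writing \\( instead means
    # "a literal backslash, then the start of a capture group".
    if ($str =~ /\bdIonly\b(.*?)\bworkset\(\(/) {
        push @out, $1;    # text between 'dIonly' and 'workset(('
    }
}

print scalar(@out), " match(es)\n";
```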
        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Can you share an actual line or two that you want to parse, and show what you want extracted? I'm not sure that I understand the need to reverse the components of each line.

Re: regex - problem with the loop I believe or maybe the regex itself ?
by kcott (Archbishop) on Feb 13, 2014 at 06:48 UTC

    G'day trummelbummel,

    Welcome to the monastery.

    It's much easier for us to help if you provide a short, representative example of your initial data and a clear indication of your expected output. A written description of the data is rarely, if ever, useful. The data should be shown in <code>...</code> tags. I realise this is your first post: please just keep this in mind for future reference.

    I'm wondering if your question falls into the "XY Problem" category: you've focussed more on a specific solution rather than on the actual problem. Your problem seems to be that you have a text file and you need to extract some data from it. Your current solution involves manipulating the data with split(), reverse() and join(): you've said this is necessary — I'm not convinced that it is.

    If you'd care to consider another solution, take a look at the following (then the Notes at the end).

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $sep = '-' x 60 . "\n";
    my ($start, $end) = ('workset((', 'dIonly');

    my $all_to_end_incl;
    {
        local $/ = $end;
        $all_to_end_incl = <DATA>;
    }

    print $sep, "Up to and including first '$end':\n", $all_to_end_incl;

    my $start_end_incl = substr $all_to_end_incl, index $all_to_end_incl, $start;

    print $sep, "From '$start' to '$end' inclusive:\n", $start_end_incl;

    my $start_end_excl = substr $start_end_incl, length($start), -length($end);

    print $sep, "From '$start' to '$end' exclusive:\n", $start_end_excl;

    my $paren_group_re;
    $paren_group_re = qr{
        \(
        (?:
            (?> [^()]+ )
        |
            (??{ $paren_group_re })
        )*
        \)
    }x;

    my $workset_re = qr{
        (
            workset\(\(
            (?:
                [^(]+
                (??{ $paren_group_re })
                [, ]*
            )*
            \)\)
        )
    }x;

    $start_end_excl =~ $workset_re;

    print $sep, "Wanted extract:\n", $1;

    __DATA__
    ... other data before 'workset' found ...
    workset(( RiskCA(cA, 3) RiskCB(cB, 2)) c
    workset((RiskCA(cA, 3), RiskCB(cB, 2), totPaycA(cA, 7), totPaycB(cB, 6)))
    *********** trial #682
    ceq pAAr(rA, cA, P1) c pAAc(rA, cA, P2) c ineqAA(rA, cA, P3) = (pAAc(rA, cA, ...
    rl dec c cognum(X2) c watch(X1) c worklist(L) c workset((S, maxtotIneqC(cB, X)))
    => watch(X1 + 1) c cognum(X2 + 1) c worklist(nil) c workset(empty) c playA c dIonly
    [label avoidMaxI] .
    X2 --> 3
    X1 --> 3

    That code has output at each stage: partly for demo purposes and partly because I'm not entirely sure which parts you want. Only the last part, which I'm certain you wanted, is displayed; the full output is in the spoiler.

    Wanted extract:
    workset((RiskCA(cA, 3), RiskCB(cB, 2), totPaycA(cA, 7), totPaycB(cB, 6)))

    Notes:

    • The $paren_group_re regex may look a little daunting; however, it's almost a verbatim copy of the code example in "perlre: Extended Patterns", so you can find a discussion there.
    • The data is read in blocks up to, and including, 'dIonly' with 'local $/ = $end;'. See "perlvar: Variables related to filehandles" for more details. I've only read the filehandle (DATA) once; you can use a while loop and handle each instance of "workset((...dIonly" separately. You haven't given enough information about the input file for me to know, but maybe dealing with multiple shorter strings would be useful.
    • If your "workset((RiskCA(cA, 3), ..., totPaycB(cB, 6)))" data is split across multiple lines, the code I've posted will handle this without modification. If you want to get that back into one line and lose the additional whitespace, consider using y/// something like this: 'y/\n / /s' ("perlop: Quote-Like Operators" has more complete details).
    • Everything else should be fairly straightforward, but do ask if there's something you don't understand.
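The while-loop variant mentioned in the notes could be sketched roughly as follows. This is an illustration under assumptions, not the posted solution: the input is a made-up string read through an in-memory filehandle, and @records is a hypothetical name. Each read returns everything up to and including the next 'dIonly', and tr (the same operator as y) with the s modifier squeezes runs of newlines and spaces into a single space.

```perl
use strict;
use warnings;

# Hypothetical input; real data would be read from a file instead.
my $input = "x workset((A(a, 1),\n   B(b, 2))) c playA c dIonly\n"
          . "y workset((C(c, 3))) c\nplayA c dIonly\ntrailer\n";

open my $fh, '<', \$input or die "cannot open in-memory handle: $!";

my @records;
{
    local $/ = 'dIonly';                    # a "line" now ends at 'dIonly'
    while (my $record = <$fh>) {
        next unless $record =~ /dIonly\z/;  # skip trailing text with no terminator
        $record =~ tr/\n / /s;              # squeeze newlines and spaces to one space
        push @records, $record;
    }
}
close $fh;

print "$_\n" for @records;
```

Each element of @records can then be searched on its own, which keeps every regex working on a short string rather than on the whole file.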

    -- Ken

Re: regex - problem with the loop I believe or maybe the regex itself ?
by tangent (Parson) on Feb 12, 2014 at 18:52 UTC
    I just want to go from dIonly to the first workset(( including some text and then two parenthesis at the end ))
    Using your original code this works for me - note that I took out the colon (:) from dIonly.
    foreach $str (@reversedarray) {
        if ($str =~ /\bdIonly\b.*?\bworkset\b\(\(([^)]*)/ ) {
            print "content of workset before dIonly: $1\n";
        }
    }
      Thanks, that is a good start. Could you please tell me how I can extend the content that it returns? Right now it firstly does not return the whole content of the workset, only a subpart. Secondly, even if I apply the global modifier, it only does it for one part, but there are several dIonly in the string, so I want to extract a number of these worksets.
        If I understand correctly, and ignoring the reverse line stuff, what you want to extract is everything between the last occurrence of 'workset((..' up to 'dIonly', and that there are a number of these in each line. If that is the case I would forget about reversing and just split each line on 'dIonly', then find the substring for each segment:
        my @a_out;
        my $line = read_file('data.txt');
        if ($line =~ /\bdIonly\b/) {
            # remove everything after last 'dIonly'
            $line =~ s/(.*)\bdIonly\b.*?$/$1/;
            my @segments = split(/\bdIonly\b/,$line);
            for my $str (@segments) {
                if ($str =~ /.*(\bworkset\b\(\(.*)/ ) {
                    my $workset = $1;
                    print "workset content: $workset\n\n";
                    push(@a_out,$workset);
                }
                else {
                    print "no workset\n";
                }
            }
        }
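On the earlier point that the global modifier seemed to return only one match: inside an if, a match with g is evaluated in scalar context and yields one match per evaluation. A small sketch, assuming a made-up reversed-word string with two dIonly markers (the names $str, @found and @all are illustrative only), shows two common ways to get every match:

```perl
use strict;
use warnings;

# Hypothetical reversed-word string containing two 'dIonly' markers.
my $str = 'dIonly c playA c workset((a;b)) c x dIonly c workset((c;d)) c y';

# Scalar context: each evaluation resumes where the last match ended,
# so a while loop walks through all matches.
my @found;
while ($str =~ /\bdIonly\b.*?\bworkset\b\(\(([^)]*)/g) {
    push @found, $1;
}

# List context: the same pattern returns every capture in one go.
my @all = $str =~ /\bdIonly\b.*?\bworkset\b\(\(([^)]*)/g;

print "@found\n";
print "@all\n";
```

Note that [^)]* stops at the first closing parenthesis, so nested content such as RiskCA(cA, 3) would still be cut short; that case needs a pattern that understands balanced parentheses.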
Re: regex - problem with the loop I believe or maybe the regex itself ?
by 2teez (Vicar) on Feb 12, 2014 at 18:43 UTC

    hi trummelbummel,

    ..but I am reversing it for the above reasons to: dIonly c ....................................workset((.......)) and I need the items in the workset as well...

    Maybe something like this could help:

    use warnings;
    use strict;

    my $node =
      'workset((ab;joiret;garg)) c wasdobao; erhgahufdgah; c workset((adsghlia) c aghaoeriarg;oi c aasdfgohaerg c workset(empty) c ah;sorguiaerg c aoi;hgruio;ghaer c playA c dIonly ';

    my $modified = join q{ } => reverse split /\s+/ => $node;

    # you could print modified to see the reversed string
    if ( my @dat = $modified =~ /\(\(?(.+?)\)/g ) {
        print "@dat\n";    # prints empty adsghlia ab;joiret;garg
    }
    You might have to look into the documentation mentioned by AnomalousMonk.
    Hope that helps.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
