Perolizador has asked for the wisdom of the Perl Monks concerning the following question:

#!/bin/perl $FILEIN = shift (); $FILEOUT = shift (); open(FILE, "<$FILEIN"); open(SAIDA, ">$FILEOUT"); @arquivo = <FILE>; # VETOR #@arquivo = sort(@arquivo); foreach $linha (@arquivo) { if($linha=~/(^.+<HR>Campo_Identificacao\()[^0-9]*<b>([0-9]{5,7}\-[ +^\,]*)</b>(\,.+)/) { <b>$NumProc = $2;</b> #$linhaTemp = $1."\#\$".$3; } #push @novo_array, <b>$NumProc</b> if not <b>$NumProc_anterior{ +$linha}++;</b> } print SAIDA @novo_array; close (SAIDA); close (FILE);

Hi, i would like to know how i verify multiple lines, and if the code finds that $NumProc String exists at the exact same position in the next line it would wipe out the next line, keeping only the first line found. The commented #push @novo_array line verified if the lines were exactly the same, but i want to go further and verify only if the line has the same number at the same position, hence repeated useless information.<\p>

<DI:"def.def"><RD:Item><PS:Item>Campo_Ementa ###HERE GOES THE TEXT THA +T I DONT WANT TO VERIFY[a-zA-Zetcetc]*### <HR>Campo_Identificacao(Rec +urso Inominado nº <b>0702995-42.2017.8.01.0002</b>, 1ª Turma Recursal + dos Juizados Especiais/AC, Rel. Maria Rosinete dos Reis Silva. j. 11 +.07.2018 , Publ. 20.07.<FD:AnoJulg>2018</FD:AnoJulg>).<HR><PS:Identif +icacao>

---->0702995-42.2017.8.01.0002 being the number i want to use to batch clean the document, keeping only the first time it found, completely wiping out the rest. It will be mounted into a .NFO afterwards, so i would need to keep the line intact.

Replies are listed 'Best First'.
Re: Array Exact Position Filter
by hippo (Archbishop) on Aug 02, 2018 at 16:06 UTC
    if the code finds that $NumProc String exists at the exact same position in the next line it would wipe out the next line, keeping only the first line found.

    Taking you at your word, here's an SSCCE:

    use strict; use warnings; use Test::More tests => 1; my @in = split (/\n/, <<EOT); aaa Foo: 1234 xyz bbb Foo: 1234 yzx ccc Foo: 1234 yxz ddd Foo: 2345 zyx EOT my @want = split (/\n/, <<EOT); aaa Foo: 1234 xyz ddd Foo: 2345 zyx EOT my $number; my $index; my %seen; my @out; for my $line (@in) { if ($line =~ /Foo: ([\d]{4})/) { $number = $1; $index = index ($number, $line); } else { $number = ''; $index = -1; } push @out, $line unless $number ne '' && $seen{$number}++ && index + ($number, $line) == $index; } is_deeply (\@out, \@want);

    There's a good few assumptions in there and I've entirely ignored that your data looks like XML (which, if it is, is probably a better route for your specific needs) but it should give you an idea of one way you might proceed.

Re: Array Exact Position Filter
by AnomalousMonk (Archbishop) on Aug 02, 2018 at 18:33 UTC
    ---->0702995-42.2017.8.01.0002 being the number i want to use to batch clean the document, keeping only the first time it found, completely wiping out the rest. [underlined emphases added]

    I'm also confused by your OPed statement of requirements. I will concentrate on the above quoted section therefrom, so maybe something like:

    c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my @lines = ( 'line 1 this line has no Foo number: keep', 'line 2 Foo something 1234-12.3 other 123 first valid keep', 'TOSS 3 Foo toss this 1234-12.3 line 123 DELETE', 'line 4 Foo not same 1234-12.3 offset 123 line keep', 'line 5 Foox not tag 1234-12.3 same offset 123 keep', 'line 6 has no Foo number: keep', 'line 7 Foo yyyy xxx 1234-12.3 this line keep', 'line 8 Foo xxxx zzzz 1234-12.4 different n keep', 'TOSS 9 Foo kill this 1234-12.3 one DELETE', 'line 10 no Foo number: keep', ); dd \@lines; print qq{\n}; ;; my $strange_n = qr{ (?<! \d) \d+ - \d+ (?: [.] \d+)* (?! \d) }xms; my $toss_tag = qr{ \b Foo \b }xms; ;; my $rx_can_toss = qr{ \A .+ $toss_tag \D* ($strange_n) }xms; ;; my ($valid_was_seen, $sn, $sn_offset); ;; my @keep_lines; LINE: for my $i (0 .. $#lines) { my ($sn_seen, $this_sn, $this_sn_offset) = map { $_ ? (1, $1, $-[1]) : () } $lines[$i] =~ $rx_can_toss ; if ($sn_seen) { if ($valid_was_seen) { next LINE if $sn eq $this_sn and $sn_offset eq $this_sn_offset; } else { ($valid_was_seen, $sn, $sn_offset) = (1, $this_sn, $this_sn_off +set); } } push @keep_lines, $lines[$i]; } dd \@keep_lines; " [ "line 1 this line has no Foo number: keep", "line 2 Foo something 1234-12.3 other 123 first valid keep", "TOSS 3 Foo toss this 1234-12.3 line 123 DELETE", "line 4 Foo not same 1234-12.3 offset 123 line keep", "line 5 Foox not tag 1234-12.3 same offset 123 keep", "line 6 has no Foo number: keep", "line 7 Foo yyyy xxx 1234-12.3 this line keep", "line 8 Foo xxxx zzzz 1234-12.4 different n keep", "TOSS 9 Foo kill this 1234-12.3 one DELETE", "line 10 no Foo number: keep", ] [ "line 1 this line has no Foo number: keep", "line 2 Foo something 1234-12.3 other 123 first valid keep", "line 4 Foo not same 1234-12.3 offset 123 line keep", "line 5 Foox not tag 1234-12.3 same offset 123 keep", "line 6 has no Foo number: keep", "line 7 Foo yyyy xxx 1234-12.3 this line keep", "line 8 Foo xxxx zzzz 1234-12.4 different n keep", "line 10 no Foo number: keep", ]
    (And in addition to an SSCCE, in future please also consider How to ask better questions using Test::More and sample data.)


    Give a man a fish:  <%-{-{-{-<

Re: Array Exact Position Filter
by Perolizador (Initiate) on Aug 02, 2018 at 22:12 UTC

    I had a hard time trying to explain this, i will look into your examples and try to implement it into my code. Everything after Campo_Identificacao is fixed, only changing the values in ###VALUE### - and the description of the process comes before Campo_Identificacao

    <HR>Campo_Identificacao(###NAME OF THE PROCESS### nº ###NUMBER OF THE +PROCESS###, ###THE PLACE THAT JUDGED IT###/AC, Rel. ###NAME OF THE JU +DGE###. j. ###JUDGEMENT DAY### , Publ. ###PUBLICATION DAY### .<FD:AnoJulg>2018</FD:AnoJulg>).<HR><PS:Id +entificacao>

    The problem i had was that i wanted to use the NUMBER OF THE PROCESS to filter and look for duplicates, because there could be repeated process descriptions, but the number of the process itself could never repeat, biggest problem was that the number could be inserted into the description, so i would need to pinpoint its exact position BETWEEN NAME OF THE PROCESS nº and , THE PLACE THAT JUDGED so icould verify it and delete the entire line if it was found afterwards. Many thanks guys, have a great day!!!

      I still don't really understand your requirement, but here's a different approach. In my previous solution, the absolute offset of the start of the number field in the line had always to be the same if subsequent lines were to be rejected once a valid number line was detected. In this solution, the offset between the end of the  $toss_tag substring and the start of the number field must remain the same, wherever the toss tag happens to be in the string. So in the following string:

      +----------------------------------- any number of any characters | before toss-tag field | | +---------------------------- toss-tag field (defined | | string, constant width) | | | | +--------------------- any number of any non-digits | | | between end of toss-tag | | | and start of number field. | | | characters can vary after | | | first field seen, but width | | | cannot vary once set. | | | | | | +----------- number field. number cannot | | | | vary once set. | | | | | | | | +---- any number of any characters | | | | | after number field | | | | | /---+----\/+\/----+----\/---+---\/-+-----------------------\ 'line 2 xx Foo NineChars 1234-12.3 other 123 first valid keep'
      I hope that makes sense. (I also slightly simplified some of the field capture logic, but that shouldn't matter.) Tested under Perl version 5.8.9, so it should work under any later version. I hope this will be of some help.


      Give a man a fish:  <%-{-{-{-<

      Now that we know that the exact position of the key field does in fact change, that restriction is removed and all you need to do is find a suitably unique regex to capture it in the first place. Looks to me from

      <DI:"def.def"><RD:Item><PS:Item>Campo_Ementa ###HERE GOES THE TEXT THA +T I DONT WANT TO VERIFY[a-zA-Zetcetc]*###<HR>Campo_Identificacao(Recu +rso Inominado nº <b>0702995-42.2017.8.01.0002</b>, 1ª Turma Recursal +dos Juizados Especiais/AC, Rel. Maria Rosinete dos Reis Silva. j. 11. +07.2018 , Publ. 20.07.<FD:AnoJulg>2018</FD:AnoJulg>).<HR><PS:Identifi +cacao>

      that it exists between "nº " and ", " and if that's the same for all of them then it is a simple switch:

      use strict; use warnings; use utf8; use Test::More tests => 1; my @in = split (/\n/, <<EOT); aaa nº 1234, xyz bbb nº 1234, yzx ccc nº 1234, yxz ddd nº 2345, zyx EOT my @want = split (/\n/, <<EOT); aaa nº 1234, xyz ddd nº 2345, zyx EOT my $number; my %seen = ( '' => 1 ); my @out; for my $line (@in) { if ($line =~ /nº ([\d]*),/) { $number = $1; } else { $number = ''; } push @out, $line unless $seen{$number}++; } is_deeply (\@out, \@want);

      If that logic doesn't match your requirements, simply change the regex until it does. Obviously here, I'm just matching integers for the numbers and yours is more complex but the principle should still fit.