Re: Perl - Remove duplicate based on substring and check on delimiters

by Marshall (Canon)
on May 18, 2016 at 23:39 UTC ( [id://1163389]=note )


in reply to Perl - Remove duplicate based on substring and check on delimiters

Another way without using substr (which is actually seldom used in Perl) is to use split, like a simple CSV file would be parsed, except with 'x' instead of ','.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while (my $line = <DATA>)
{
    chomp $line;
    print "line = $line\n";
    # List assignment in scalar context returns the number of fields.
    # The -1 limit keeps trailing empty fields.
    my $tokens = (my $first, my @rest) = split 'x', $line, -1;
    print "num tokens is: $tokens\n";
    print Dumper $first, \@rest;
    print "\n";
}

=prints
line = 1212123x534534534534xx4545454x232322xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
          '534534534534',
          '',
          '4545454',
          '232322',
          '',
          ''
        ];

line = 0901001x876879878787xx0909918x212245xx
num tokens is: 7
$VAR1 = '0901001';
$VAR2 = [
          '876879878787',
          '',
          '0909918',
          '212245',
          '',
          ''
        ];

line = 1212123x534534534534xx4545454x232323xx
num tokens is: 7
$VAR1 = '1212123';
$VAR2 = [
          '534534534534',
          '',
          '4545454',
          '232323',
          '',
          ''
        ];

line = 1212133x534534534534xx4549454x232322xx
num tokens is: 7
$VAR1 = '1212133';
$VAR2 = [
          '534534534534',
          '',
          '4549454',
          '232322',
          '',
          ''
        ];

line = 4352342xx23232xxx345545x45454x23232xxx
num tokens is: 11
$VAR1 = '4352342';
$VAR2 = [
          '',
          '23232',
          '',
          '',
          '345545',
          '45454',
          '23232',
          '',
          '',
          ''
        ];

=cut

__DATA__
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Replies are listed 'Best First'.
Re^2: Perl - Remove duplicate based on substring and check on delimiters
by AnomalousMonk (Archbishop) on May 19, 2016 at 01:54 UTC

    That gives an off-by-one $tokens value (it's actually counting the stuff "around" the tokens (update: and it requires creation of an otherwise unused array to hold most of that stuff)), but that's easy to fix:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $t = 'x';
     ;;
     for my $line (qw(
       1212123x534534534534xx4545454x232322xx
       0901001x876879878787xx0909918x212245xx
       1212123x534534534534xx4545454x232323xx
       1212133x534534534534xx4549454x232322xx
       4352342xx23232xxx345545x45454x23232xxx
       )) {
       my $tokens = my ($first, @rest) = split $t, $line, -1;
       $tokens -= 1;
       print qq{'$line': num '$t' tokens is: $tokens};
       dd ($first, \@rest);
       }
    "
    '1212123x534534534534xx4545454x232322xx': num 'x' tokens is: 6
    (1212123, [534534534534, "", 4545454, 232322, "", ""])
    '0901001x876879878787xx0909918x212245xx': num 'x' tokens is: 6
    ("0901001", [876879878787, "", "0909918", 212245, "", ""])
    '1212123x534534534534xx4545454x232323xx': num 'x' tokens is: 6
    (1212123, [534534534534, "", 4545454, 232323, "", ""])
    '1212133x534534534534xx4549454x232322xx': num 'x' tokens is: 6
    (1212133, [534534534534, "", 4549454, 232322, "", ""])
    '4352342xx23232xxx345545x45454x23232xxx': num 'x' tokens is: 10
    (
      4352342,
      ["", 23232, "", "", 345545, 45454, 23232, "", "", ""],
    )
    (But I don't really see anything wrong with using good old tr/// for counting and poor old substr for fixed-field extraction.)
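
For comparison, here is a minimal sketch of the tr/// and substr approach mentioned above. The field width of 7 is an assumption read off the sample data in this thread, not something the original poster's format is known to guarantee.

```perl
use strict;
use warnings;

my $line = '1212123x534534534534xx4545454x232322xx';

# tr/// with an empty replacement list and no /d modifier leaves the
# string unchanged and returns the number of characters it matched.
my $x_count = ($line =~ tr/x//);

# substr with offset 0 and length 7 pulls out the leading field.
# The width of 7 is assumed from the sample lines.
my $key = substr $line, 0, 7;

print "key = $key, x count = $x_count\n";   # key = 1212123, x count = 6
```

The count of 6 agrees with the corrected $tokens value printed by the one-liner above for the same line.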

    Update: This gets rid of @rest and the $tokens -= 1; statement for all you one-liner addicts out there:
        my $tokens = (my ($first) = split $t, $line, -1) - 1;
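
The one-liner appears to rely on a list assignment in scalar context returning the number of elements on its right-hand side, not the number of variables actually assigned. The sketch below spells that out in two steps. The variable $fields is introduced here only for illustration. Note that the explicit -1 matters: perldoc -f split says that when LIMIT is omitted in a list assignment, split uses a limit one larger than the number of variables, which would change the count.

```perl
use strict;
use warnings;

my $t    = 'x';
my $line = '4352342xx23232xxx345545x45454x23232xxx';

# The list assignment, evaluated in scalar context, yields the number of
# fields that split produced, even though only $first is stored.
my $fields = (my ($first) = split $t, $line, -1);

# There is always one more field than there are delimiters.
my $tokens = $fields - 1;

print "first = $first, fields = $fields, delimiters = $tokens\n";
# first = 4352342, fields = 11, delimiters = 10
```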


    Give a man a fish:  <%-{-{-{-<

      I think we are splitting hairs here. I count $first as the first token, you don't. Or you figure that the final empty token shouldn't be counted? Either way not a significant problem in my mind.

      Yes, tr is the fastest and best way to do a simple count of the x's. And yes, substr is the fastest way to get a fixed-length thing at the beginning. The reason that I demo'd split was to show: a) how to get a non-fixed-length thing at the beginning, and b) how to access the other variable-length fields "between the x's". I'm sure that they have some meaning.

      Update: I almost never use the -1 limit on split. I saw an opportunity to play with this and remind myself of how it worked. Once I had done that, I impulsively posted my "play". Wasn't meant to be "earth shattering" stuff, just an example of a not so common usage that is often forgotten.
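
As a reminder of what the -1 limit changes, here is a small sketch using one of the sample lines from this thread. By default split drops trailing empty fields; a negative limit keeps them. Empty fields in the middle of the line are kept either way.

```perl
use strict;
use warnings;

my $line = '1212123x534534534534xx4545454x232322xx';

# Default limit: the two trailing empty fields are dropped.
my @default = split /x/, $line;

# Negative limit: every field is kept, including trailing empty ones.
my @all = split /x/, $line, -1;

printf "default: %d fields, with -1: %d fields\n",
    scalar @default, scalar @all;
# default: 5 fields, with -1: 7 fields
```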

        ... do a simple count of the x's. ... get a fixed length thing at the beginning.

        But I understood that to be what the OP was asking for, at least as a starting point for a larger application. (bopibopi actually seemed to have the counting and extracting part under control, and was asking for help with the subsequent pieces.) Using split may be a good example of doing something slightly different. We may not be so much splitting hairs here as comparing apples and oranges — or perhaps tangerines and oranges since we're not really all that far apart.

        Update:

        ... the -1 limit on split ... an example of a not so common usage ...
        As someone addicted to "not so common usage" myself, I can sympathize. (But I have it under control; I haven't used uncommonly in ages!)


        Give a man a fish:  <%-{-{-{-<

Re^2: Perl - Remove duplicate based on substring and check on delimiters
by johngg (Canon) on May 19, 2016 at 11:17 UTC
    without using substr (which is actually seldom used in Perl)

    Surely, you jest?!?!

    Cheers,

    JohnGG

      Sorry for the controversy - not my intent. I should've said something different or omitted that entirely.

      I often use Perl to process all kinds of text reports. Far and away the most common tools that I use are: a) split and b) match global, combined with c) regex. In my typical application, speed doesn't matter, but flexibility does. It is very seldom that I encounter a fixed-column report where substr would be appropriate.

      That doesn't mean that I don't use substr, just that in my personal experience, with the types of text reports that I process, it doesn't come up. Mileage Varies! Processing a binary header, say like that found in a .WAV file, is a whole different critter; substr is definitely the right tool for that job. I am talking about text reports.

      Just yesterday, a file that I've been processing since 2011 changed its format. Oops. The same info is there, but it got moved around. The 2016 format is different and I have no control over that change. But this change was easy for me to adapt to and was something like this: (split ' ',$line)[1,7,3] to (split ' ',$line)[1,4,-2]. If I had used substr(), then this would have been a bigger deal. Changing something that has been working for 5 years comes up all the time. Such is the nature of using ad hoc methods to parse reports that you have no control over.
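
The list-slice idiom described above can be sketched as follows. The two report lines are hypothetical stand-ins, since the real files are not shown in the post; only the slice indexes come from it.

```perl
use strict;
use warnings;

# Hypothetical report lines standing in for the old and new formats.
my $old_line = 'a0 a1 a2 a3 a4 a5 a6 a7';
my $new_line = 'b0 b1 b2 b3 b4 b5 b6';

# Old layout: whitespace-separated fields 1, 7 and 3.
my @old = (split ' ', $old_line)[1, 7, 3];

# New layout: fields 1 and 4, plus the second field from the end.
my @new = (split ' ', $new_line)[1, 4, -2];

print "@old\n";   # a1 a7 a3
print "@new\n";   # b1 b4 b5
```

Because the fields are addressed by position in the list and not by character offset, a change in column widths needs no edit at all, and a change in field order needs only new indexes.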
