PerlMonks  

Parse file and creating hashes

by PerlMonger79 (Sexton)
on Jan 08, 2022 at 13:56 UTC ( [id://11140267]=perlquestion )

PerlMonger79 has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have a text file which I've parsed into three elements per line. The following is a sample snippet of the data; the actual output is thousands of lines.

702005010554291,5016554291,7020000023F22
702005010524898,5016524898,70200000441E0
702005010660208,5016660208,7020000033FD0
702005010509777,5016509777,7020000033FF0
702005010633781,5016633781,7020000024092
702005010616472,5016616472,7020000043FE2
310005010601516,5016601516,7020000044201
702005010526097,5016526097,7020000013EB1
702005010681238,5016681238,7020000044052
702005010551103,5016551103,7020000023F12
702005010625010,5016625010,7020000023F51

I would like to be able to parse the data and create an array/hash for each unique element in the third column, e.g. an array/hash for 7020000023F22, then 70200000441E0, etc., each created once and not repeated. Additionally, I would like to sort and store the element in the second column in the corresponding array/hash: if the array/hash for the third-column element has already been created, then the second-column element is stored in that array/hash. In the end I'd like to have all the elements in the second column stored in the array/hash that corresponds to their third element. Not sure if I explain myself properly; I feel like I'm just confusing myself. LOL. I would really appreciate any help I can get. So far I have this piece of code, but now I need help to sort the elements and store them into their respective arrays/hashes.
    # Open Data File and parse each line
    open my $DF, '<', $DFile or die "Can't open $DFile $!";
    foreach $_ (<$DF>){
        chomp( $_ );
        if((length $_ > 0)&&($_ =~ /^\d{15}/)){
            $_ =~ s/\s+/;/g;
            my($imsi,$mdn,$sec) = (quotewords('[\t;]+|,\s', 0, $_))[0,1,3];
            if((($imsi =~ /^\d{15}/)&&($imsi =~ /^(702|310)/))&&(($mdn =~ /^\d{10}/)&&($mdn =~ /^(501)/))){
                print $imsi.",".$mdn.",".$sec."\n";
            }
        }
    }
    close $DF;

Replies are listed 'Best First'.
Re: Parse file and creating hashes
by talexb (Chancellor) on Jan 08, 2022 at 15:31 UTC

    I'm also not certain what you need, but here's what I pulled together from your description.

        #!/usr/bin/perl

        use strict;
        use warnings;

        # 2022-0108: Read a CSV line w/ three elements, building a hash using
        # the third element and populated with the second element.

        use Data::Dumper;

        {
            my %hash;
            while (<DATA>) {
                s/\s+$//;
                my @w = split(/,/);
                push ( @{ $hash{ $w[2] } }, $w[1] );
            }
            print Dumper ( \%hash );
        }

        __DATA__
        702005010554291,5016554291,7020000023F22
        702005010524898,5016524898,70200000441E0
        702005010660208,5016660208,7020000033FD0
        702005010509777,5016509777,7020000033FF0
        702005010633781,5016633781,7020000024092
        702005010616472,5016616472,7020000043FE2
        310005010601516,5016601516,7020000044201
        702005010526097,5016526097,7020000013EB1
        702005010681238,5016681238,7020000044052
        702005010551103,5016551103,7020000023F12
        702005010625010,5016625010,7020000023F51

    This produces the following results:
        $VAR1 = {
                  '7020000023F51' => [ '5016625010' ],
                  '7020000044052' => [ '5016681238' ],
                  '7020000033FD0' => [ '5016660208' ],
                  '7020000044201' => [ '5016601516' ],
                  '7020000023F22' => [ '5016554291' ],
                  '7020000043FE2' => [ '5016616472' ],
                  '7020000024092' => [ '5016633781' ],
                  '7020000013EB1' => [ '5016526097' ],
                  '7020000023F12' => [ '5016551103' ],
                  '7020000033FF0' => [ '5016509777' ],
                  '70200000441E0' => [ '5016524898' ]
                };

    Let us know if this is what you wanted.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Hello and thank you very much for your time dedicated to my plea for help. I would like a script that would use the following data and output the following.
          702005010683593,5016683593,7020000024140
          702005010640383,5016640383,7020000024150
          310005010532143,5016532143,7020000034001
          702005010637702,5016637702,7020000034001
          702005010608274,5016608274,7020000034013
          702005010608274,5016608274,7020000034013
          310005010609604,5016609604,7020000034013
          702005010510869,5016510869,7020000034013
          702005010551513,5016551513,7020000034130
          702005010551513,5016551513,7020000034130
          702005010679719,5016679719,7020000034222
          702005010527052,5016527052,7020000034222
          702005010645458,5016645458,7020000034222
      Logical processing of the info ...
          $VAR1 = {
                    '7020000024140' => [ '5016683593' ],
                    '7020000024150' => [ '5016640383' ],
                    '7020000034001' => [ '5016532143','5016637702' ],
                    '7020000034013' => [ '5016608274','5016608274','5016609604','5016510869' ],
                    '7020000034130' => [ '5016551513','5016551513' ],
                    '7020000034222' => [ '5016679719','5016527052','5016645458' ]
                  };
      As I mentioned earlier, the data is thousands of lines; above is the logical processing of the info as I would treat it. In the end, ONLY the arrays with numbers repeating more than 30 times would be printed as output. For this sample, I'm using arrays with numbers repeating more than once.
          $VAR1 = {
                    '7020000034013' => [ '5016608274' ],
                    '7020000034130' => [ '5016551513' ]
                  };
      I apologize if I'm not clear. I'm really trying to be as clear as I can with my explanation, since I really need the help.
            #!/usr/bin/perl

            use strict; # https://perlmonks.org/?node_id=11140267
            use warnings;

            my %answer;
            while( <DATA> )
              {
              my (undef, $two, $three) = split /,|\n/;
              $answer{$three}{$two}++;
              }

            my $thirty = 1; # FIXME 1 for testing, should be 30

            for my $href ( values %answer)
              {
              delete @{$href}{ grep $href->{$_} <= $thirty, keys %$href }
              }
            delete @answer{ grep 0 == %{ $answer{$_} }, keys %answer };
            $_ = [ keys %$_ ] for values %answer;

            use Data::Dump 'dd'; dd \%answer;

            __DATA__
            702005010683593,5016683593,7020000024140
            702005010640383,5016640383,7020000024150
            310005010532143,5016532143,7020000034001
            702005010637702,5016637702,7020000034001
            702005010608274,5016608274,7020000034013
            702005010608274,5016608274,7020000034013
            310005010609604,5016609604,7020000034013
            702005010510869,5016510869,7020000034013
            702005010551513,5016551513,7020000034130
            702005010551513,5016551513,7020000034130
            702005010679719,5016679719,7020000034222
            702005010527052,5016527052,7020000034222
            702005010645458,5016645458,7020000034222

        Outputs:

        { "7020000034013" => [5016608274], "7020000034130" => [5016551513] }

        Here's another way...

            #!/usr/bin/perl

            use strict; # https://perlmonks.org/?node_id=11140267
            use warnings;

            my $file = <<END;
            702005010683593,5016683593,7020000024140
            702005010640383,5016640383,7020000024150
            310005010532143,5016532143,7020000034001
            702005010637702,5016637702,7020000034001
            702005010608274,5016608274,7020000034013
            702005010608274,5016608274,7020000034013
            310005010609604,5016609604,7020000034013
            702005010510869,5016510869,7020000034013
            702005010551513,5016551513,7020000034130
            702005010551513,5016551513,7020000034130
            702005010679719,5016679719,7020000034222
            702005010527052,5016527052,7020000034222
            702005010645458,5016645458,7020000034222
            END

            open my $fh, '<', \$file or die; # FIXME change for your file

            my $thirty = 1; # FIXME 1 for testing, should be 30

            my (%answer, %lines);
            $lines{ /,(.+)/ ? $1 : die "bad data <$_>" }++ while <$fh>;
            for ( keys %lines )
              {
              my ($value, $key) = split ',';
              $lines{$_} > $thirty and push @{ $answer{$key} }, $value;
              }

            use Data::Dump 'dd'; dd \%answer;

        This is the script I have so far.
            #!/usr/bin/perl -w
            use strict;
            use warnings;
            use Text::ParseWords;
            use POSIX qw/strftime/;
            use Excel::Writer::XLSX;

            my $dir = "/home/vlr/archive/";
            my $date = "20211201";

            # Check directory for relevant files
            opendir(DIR, $dir) or die "Could not open ".$dir."\n";
            my @dir = grep(/^USRINF.*.$date.*.1700.txt$/, readdir DIR);
            closedir DIR;

            my %btssec;

            foreach $_(sort @dir){
                my $DFile = $dir.$_;
                print "Processing: ".$DFile."\n";

                # Open Data File and parse each line
                open my $DF, '<', $DFile or die "Can't open $DFile $!";
                foreach $_ (<$DF>){
                    chomp( $_ );
                    if((length $_ > 0)&&($_ =~ /^\d{15}/)){
                        $_ =~ s/\s+/;/g;
                        my($imsi,$mdn,$sec) = (quotewords('[\t;]+|,\s', 0, $_))[0,1,3];
                        if((($imsi =~ /^\d{15}/)&&($imsi =~ /^(702|310)/))&&(($mdn =~ /^\d{10}/)&&($mdn =~ /^(501)/))){
                            #print $imsi.",".$mdn.",".$sec."\n";
            =pod
            # The part below is the part of the script I need help with to process the parsed data whose output I'm posted as the snippet to process.
                            if(!exists($btssec{$sec})){
                                print "New key:" . $sec ."\n";
                                $btssec{$sec} = $mdn;
                            }elsif(exists($btssec{$sec})){
                                #push( @{$btssec{$sec}}, $mdn );
                            }
            =cut
                        }
                    }
                }
                close $DF;
                print "$_\n" for keys %btssec;
            }
        For now, the part of the script I have in POD sucks; it's not doing what I thought it would do. (sigh)
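A note on the POD section above: for a new key it stores `$mdn` as a plain string (`$btssec{$sec} = $mdn`), and the `push` for existing keys is commented out. Uncommenting that `push` would most likely die under `use strict`, because `@{$btssec{$sec}}` would then try to dereference a string as an array. The following is a minimal sketch of one way around this, assuming each key should hold an array reference from the start; `add_record` is a hypothetical helper name, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %btssec;

# Hypothetical helper: record one MDN under its sector key.
# push autovivifies $btssec{$sec} as an array reference, so no
# exists() check is needed for keys seen for the first time.
sub add_record {
    my ( $sec, $mdn ) = @_;
    push @{ $btssec{$sec} }, $mdn;
    return;
}

# Sample values taken from the data posted earlier in the thread.
add_record( '7020000034001', '5016532143' );
add_record( '7020000034001', '5016637702' );
add_record( '7020000024140', '5016683593' );

# Print each key with its sorted list of second-column values.
for my $sec ( sort keys %btssec ) {
    print "$sec: ", join( ', ', sort @{ $btssec{$sec} } ), "\n";
}
```

In the posted script, the body of `add_record` would presumably replace the whole `if/elsif` inside the POD block.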
Re: Parse file and creating hashes
by kcott (Archbishop) on Jan 09, 2022 at 03:42 UTC

    G'day PerlMonger79,

    As others have already indicated, there are various aspects of your post that make it difficult to know exactly what you want; I won't repeat those here. The following is, I believe, the guts of what you're after.

    I've used the second set of example data that you posted (in "Re^2: Parse file and creating hashes"):

        $ cat pm_11140267_input.txt
        702005010683593,5016683593,7020000024140
        702005010640383,5016640383,7020000024150
        310005010532143,5016532143,7020000034001
        702005010637702,5016637702,7020000034001
        702005010608274,5016608274,7020000034013
        702005010608274,5016608274,7020000034013
        310005010609604,5016609604,7020000034013
        702005010510869,5016510869,7020000034013
        702005010551513,5016551513,7020000034130
        702005010551513,5016551513,7020000034130
        702005010679719,5016679719,7020000034222
        702005010527052,5016527052,7020000034222
        702005010645458,5016645458,7020000034222

    Here's a working demo script:

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        my $infile = 'pm_11140267_input.txt';
        my $min_to_keep = 2;  # 30 for production
        my %data;

        {
            open my $fh, '<', $infile;

            while (<$fh>) {
                chomp;
                my (undef, $col2, $col3) = unpack 'A15xA10xA13';
                push @{$data{$col3}}, $col2;
            }
        }

        # For demo only:
        print "Interim results:\n";
        use Data::Dump;
        dd \%data;

        print "\nWanted output data:\n";

        for (sort keys %data) {
            next if @{$data{$_}} < $min_to_keep;
            print "$_: ", join(', ', @{$data{$_}}), "\n";
        }

    This outputs:

        Interim results:
        {
          "7020000024140" => [5016683593],
          "7020000024150" => [5016640383],
          "7020000034001" => [5016532143, 5016637702],
          "7020000034013" => [5016608274, 5016608274, 5016609604, 5016510869],
          "7020000034130" => [5016551513, 5016551513],
          "7020000034222" => [5016679719, 5016527052, 5016645458],
        }

        Wanted output data:
        7020000034001: 5016532143, 5016637702
        7020000034013: 5016608274, 5016608274, 5016609604, 5016510869
        7020000034130: 5016551513, 5016551513
        7020000034222: 5016679719, 5016527052, 5016645458
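The demo above keeps every key that has at least `$min_to_keep` values, whereas the expected output posted earlier in the thread keeps only the numbers that themselves repeat under a key. A possible variation is sketched below, assuming comma-separated three-field lines; `repeated_by_key` and its `$threshold` parameter are hypothetical names introduced for illustration. It also sorts the kept values, since sorting was part of the original request.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for each third-column key, return the sorted
# second-column values that occur more than $threshold times.
sub repeated_by_key {
    my ( $threshold, @lines ) = @_;
    my %count;
    for my $line (@lines) {
        chomp $line;
        my ( undef, $col2, $col3 ) = split /,/, $line;
        next unless defined $col3;    # skip malformed lines
        $count{$col3}{$col2}++;       # count each (key, value) pair
    }
    my %result;
    for my $key ( keys %count ) {
        my @kept = grep { $count{$key}{$_} > $threshold }
            keys %{ $count{$key} };
        @kept = sort @kept;
        $result{$key} = \@kept if @kept;    # drop keys with nothing left
    }
    return \%result;
}

# Sample data copied from the thread; threshold 1 is for testing only
# (the thread asks for 30 in production).
my @sample = split /\n/, <<'END';
702005010683593,5016683593,7020000024140
702005010640383,5016640383,7020000024150
310005010532143,5016532143,7020000034001
702005010637702,5016637702,7020000034001
702005010608274,5016608274,7020000034013
702005010608274,5016608274,7020000034013
310005010609604,5016609604,7020000034013
702005010510869,5016510869,7020000034013
702005010551513,5016551513,7020000034130
702005010551513,5016551513,7020000034130
702005010679719,5016679719,7020000034222
702005010527052,5016527052,7020000034222
702005010645458,5016645458,7020000034222
END

my $result = repeated_by_key( 1, @sample );
for my $key ( sort keys %$result ) {
    print "$key: @{ $result->{$key} }\n";
}
```

With the threshold set to 1, this should leave only 7020000034013 (with 5016608274) and 7020000034130 (with 5016551513), which matches the expected sample output posted earlier in the thread.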

    — Ken

Re: Parse file and creating hashes
by LanX (Saint) on Jan 08, 2022 at 14:40 UTC
    > Not sure if I explain myself properly,

    Unfortunately, not really.

    It would be easier if you showed us a dump of the expected outcome.

    see also SSCCE

    Two more recommendations:

    • The code you've shown is using unknown subs like quotewords without explanation
    • you should have put the sample input into <code> tags too

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I apologize; quotewords and the other unknowns are part of the bigger script. The script is not complete, as I intend to print the output into an Excel file as well for a report.
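For readers unfamiliar with it, `quotewords` comes from the core module Text::ParseWords, which the fuller script in this thread loads. A minimal sketch of how it behaves follows, assuming a plain comma delimiter rather than the more involved pattern used in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# quotewords($delim_pattern, $keep_quotes, @lines) splits each line on
# the delimiter pattern, ignoring delimiters that sit inside quotes.
my $line = '702005010554291,5016554291,7020000023F22';
my ( $imsi, $mdn, $sec ) = quotewords( ',', 0, $line );
print "imsi=$imsi mdn=$mdn sec=$sec\n";

# With $keep_quotes false, the quotes are removed and the quoted comma
# stays inside its field.
my @fields = quotewords( ',', 0, 'a,"b,c",d' );
print scalar(@fields), " fields: ", join( ' | ', @fields ), "\n";
```

For data with no quoted fields, such as the sample in this thread, a plain `split /,/` gives the same fields.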

Node Type: perlquestion [id://11140267]
Approved by LanX