PerlMonks  

How to walk through convoluted data?

by perl-diddler (Chaplain)
on Jul 19, 2021 at 19:20 UTC ( [id://11135180]=perlquestion )

perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

Posting a question here, because even if I get the code to work, it looks ugly/feels icky.

The code is walking a data structure, which has some indirections in it due to multiple versions of the source data...I'll try to explain. I'm D/L rpm files associated with a 'Repo'. The Repo (a rolling release called tumbleweed) is released as often as daily. Each release has a primary.xml file that contains all the files in the release. Some of those rpms may be new, others may be the same as in previous releases.

As you download releases, you only need to d/l the new rpms. I am trying to keep the past 4 d/l's, which may not be from consecutive releases. As I d/l new file lists (from primary.xml), I eventually end up with unreferenced rpm's on disk. I want a way to expire the old releases, either as I download new ones (ideal) or by running an expiration script after I download a new version.

My data structures have changed more than a few times as I have been evolving the program from one that downloads 1 release into one that downloads multiple releases and expires old files.

The part I'm questioning right now is when I got to:

$p->{Vcpeid}{$cpeid} = RepoVData->new() unless blessed $p->{Vcpeid}{$cpeid};
$p->{Vcpeid}{$cpeid}{RepoDB} = RepoDB->new() unless blessed $p->{Vcpeid}{$cpeid}{RepoDB};
$p->{Vcpeid}{$cpeid}{RepoDB}->{$nam} = VR->new() unless blessed $p->{Vcpeid}{$cpeid}{RepoDB}->{$nam};
and realized I was hitting some repetition. So I extracted some similar code to continue:
sub mkBless($;$) {
    my $targ = shift;
    if (@_) {
        ${$targ} = shift unless blessed $targ;
    }
    $targ
}
sub RepoDB($;$) {
    my $p = shift;
    my $nam = shift;
    my $cpeid;
    die P "RepoDB: can't access Vdata w/o cpeid" unless
        $cpeid=$p->cpeid and $cpeid =~ /^\d{8}$/;
    mkBless(\$p->{Vcpeid}{$cpeid}, RepoVData->new);
    mkBless(\$p->{Vcpeid}{$cpeid}{RepoDB}, RepoDB->new());
    mkBless(\$p->{Vcpeid}{$cpeid}{RepoDB}{$nam}, VR->new());
Realizing I needed to handle storing the optional param, and not liking all that growing repetition in each line, my latest code looks more like:
sub mkBless($;$) {
    my $targ = shift;
    if (@_) {
        ${$targ} = shift unless blessed $targ;
    }
    \$targ
}
sub RepoDB($;$) {
    my $p = shift;
    my $nam = shift;
    my $cpeid;
    die P "RepoDB: can't access Vdata w/o cpeid" unless
        $cpeid=$p->cpeid and $cpeid =~ /^\d{8}$/;
    my $obj = mkBless(\$p->{Vcpeid}{$cpeid}, RepoVData->new);
    $obj = mkBless(\$obj->{RepoDB}, RepoDB->new());
    $obj = mkBless(\$obj->{$nam}, VR->new());
    if (@_) {
        $$obj = shift;
    }
    $obj
}
Since I'm going deeper and deeper in the data structure, I figure I might as well store some of that walk into '$obj' so I can re-use it and not spell it out each time. But this feels like a kludge/hack. Maybe that's what it will take/be, but I thought I'd put it here to see if anyone had done similar and how they did it (preferably without throwing out the baby w/the bath-water, i.e. w/o a complete rewrite of 1600-1700 lines of code).

Sigh.

FWIW -- the "cpeid" is my daily version (20210719), so I'm trying to create a single "name" database (RepoDB) for all the versions, and off of that a "version-release" (VR) entry for each version of each name, with the VR pointing to the rest of the info for a given rpm (including which cpeid(s) it was in). Theoretically, on the last pass through the loop, if I find a "VR" that isn't tagged, it can be removed.
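The name → VR → cpeid layout described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's code: the plain-hash layout `$db{$name}{$vr}{cpeids}{$cpeid}`, the helpers `tag_rpm` and `expire_old`, and the package names `foo` and `bar` are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical layout: $db{$name}{$vr}{cpeids}{$cpeid} = 1
# One entry per rpm name, one sub-entry per version-release (VR),
# tagged with every release (cpeid) that listed it.
my %db;

sub tag_rpm {
    my ($name, $vr, $cpeid) = @_;
    $db{$name}{$vr}{cpeids}{$cpeid} = 1;
}

# Keep only the newest $keep cpeids and drop every VR that none of
# them reference. Returns the "name-vr" strings that were expired.
sub expire_old {
    my ($keep) = @_;
    my %seen;
    for my $vrs (values %db) {
        $seen{$_} = 1 for map { keys %{ $_->{cpeids} } } values %$vrs;
    }
    my @sorted = sort { $b <=> $a } keys %seen;    # cpeids are yyyymmdd numbers
    my %kept   = map { $_ => 1 } grep { defined } @sorted[0 .. $keep - 1];

    my @expired;
    for my $name (sort keys %db) {
        for my $vr (sort keys %{ $db{$name} }) {
            my $tags = $db{$name}{$vr}{cpeids};
            delete @{$tags}{ grep { !$kept{$_} } keys %$tags };
            next if %$tags;                        # still referenced by a kept release
            delete $db{$name}{$vr};
            push @expired, "$name-$vr";
        }
        delete $db{$name} unless %{ $db{$name} };
    }
    return @expired;
}

# Made-up sample data: five releases seen, keep the newest four.
tag_rpm('foo', '1.0-1', $_) for 20210710, 20210712;
tag_rpm('foo', '1.1-1', $_) for 20210715, 20210717, 20210719;
tag_rpm('bar', '2.0-3', 20210710);
my @gone = expire_old(4);
```

Here `bar-2.0-3` is the only VR referenced solely by the oldest release, so it is the only one reported as expired; the corresponding file on disk would be the candidate for deletion.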

At 1700+ lines, there's quite a bit of code not shown. Sadly, since this script has morphed from a simple D/L script into whatever I will end up with, due to the need to "expire" disconnected RPM versions, there's been a great deal of code refactoring on the fly.

mkBless is a routine that "ensures" there is a Blessed obj, like my mkARRAY/mkHASH ensure there is an ARRAY or HASH where it is pointing to and if not, then do something about it. They are not usually needed due to autovivification, but there are a few places where auto-vivification doesn't work (that need helper routines).
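To illustrate the distinction drawn above, here is a small sketch of where autovivification helps and where it does not. The helper `ensure_hash` is a hypothetical stand-in written for this example; it is not the poster's mkHASH, whose exact behavior is not shown in the thread.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

my %h;
$h{a}{b}{c} = 1;              # lvalue use: intermediate hashes spring into existence
my $plain = ref $h{a}{b};     # a plain 'HASH', not blessed into any class

# Rvalue dereference of an undefined value dies under "strict refs".
my $undef_ref;
my $died = !eval { my @copy = @$undef_ref; 1 };

# Hypothetical helper in the spirit of mkHASH: make sure the slot
# (passed by reference) holds a hash ref, creating one if needed.
sub ensure_hash {
    my ($slot) = @_;
    $$slot = {} unless ref($$slot) eq 'HASH';
    return $$slot;
}

my $slot;
my %copy = %{ ensure_hash(\$slot) };    # safe: the slot is a hash ref by now
```

Autovivified levels are always plain hashes or arrays, which is why a separate "ensure it is blessed" step is needed when each level should be an object.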

I'm not specifically trying to solve some problem (that I know of, yet), but wondering if others have had to follow a bunch of nested data and how that worked out (or if you have any, *cough*, pointers....?)

Replies are listed 'Best First'.
Re: How to walk through convoluted data?
by bliako (Monsignor) on Jul 20, 2021 at 07:27 UTC

    the following part of your code has "class-factory" written all over it in big flashing red letters:

    # called with : mkBless(\$p->{Vcpeid}{$cpeid}, RepoVData->new);
    sub mkBless($;$) {
        my $targ = shift;
        if (@_) {
            ${$targ} = shift unless blessed $targ;
        }
        \$targ
    }

    (btw i think you have errors in above code: shouldn't condition be: blessed $$targ and return $$targ ?)

    Nothing fancy though:

    # notice @ params: contains optional class name and any optional params to new()
    sub mkBless($;@) {
        my $targ = shift;
        if (@_) {
            ${$targ} = shift->new(@_) unless blessed $$targ;
        }
        # you don't need to return anything, but if you must:
        $$targ;
        #\$targ
    }
    # example call:
    my $obj = mkBless(\$someotherobj, 'RepoVData', 1,2,3);

    another observation: your code calls mkBless() with a newly created object which you discard if the 1st param is already blessed. That's a waste.

    bw, bliako
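A variation on the point about discarded objects: instead of a class name, the helper can take a code ref that is only called when the slot is empty. This is a sketch under assumptions, not code from the thread; `ensure_obj` and the counting class `Counted` are hypothetical names used only to make the laziness visible.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in class that counts how often its constructor runs.
package Counted {
    our $built = 0;
    sub new { my $class = shift; $built++; return bless { @_ }, $class }
}

# Fill the slot only when it does not already hold an object. The builder
# code ref runs only in that case, so nothing is built and thrown away.
sub ensure_obj {
    my ($slot, $builder) = @_;
    $$slot = $builder->() unless blessed $$slot;
    return $$slot;
}

my %tree;
my $first  = ensure_obj(\$tree{node}, sub { Counted->new(id => 1) });
my $second = ensure_obj(\$tree{node}, sub { Counted->new(id => 2) });
```

The second call returns the object created by the first, and the second constructor is never invoked.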

        ... majorly helpful, open electronic form.

        perlfan's link is a free (but not public domain) download direct from Dominus.


        Give a man a fish:  <%-{-{-{-<

        My bible!, or one of them anyway, Great inspiration for my P module or Types::Core, maybe even Xporter in CPAN. Some invaluable modules, though needing a bit of update. someday, at some point. But HoP...any particular chapter or page? I only have a paper copy, no majorly helpful, open electronic form.

        Don't even find 'Class Factory' in its Index, so maybe under a different conceptual idea or term?

        Maybe it's not a good book for this, as I might never get the program working -- but might spend another few months re-factor/re-writing it again :^)

        Yeah, since it's a rolling release, I could express it as some type of infinite series, though am not sure what formula I could apply to the input to get useful output...

        Given the discrete inputs by various authors for package releases, I'd have to allow for a callback at each stepping of the release to allow for new output. Certainly no lossy algorithm would work -- it would just get everyone p'd off at me in exponential time. Must be in the nature of linux distros...*sigh* Maybe I'll try a different part of book for inspiration, though I'm not sure how quickly such an endeavor will result in any convergence to the actual output...

        (only 1/2**(# times attempted) for actual applicability).

        Maybe I need a better paradigm as I sense only divergence down this route....

        I hope someone reading this has a sense of humor or a lot more insight!

        -l

      May be a class factory; remember this is a rewrite of a previous version that only read the xml files as a guide to what rpms to download. That there are errors in it is unsurprising, since, while I started with working code, I'm nowhere close to the new code that needs to go into its place. I'll have to reread some source(s) about class facs.

      As for throwing away the output of mkBless...that's because it is equivalent to:

      sub new() {
          my $p = shift; my $c=ref $p||$p;
          my $arghash = @_ ? shift : {};
          blessed $p? $p->SUPER::new($arghash) : ($p = $c->new($arghash));
          $p
      }
      except that there was a chain of included blessed objects before the final output -- that's what I tried to shorten with mkBless (which I just cooked up for this example, actually, so I can't really defend its usage much, it being my first experience with it).

      So bugs? ya ya!, plenty. It was meant to be example code to give an idea of what I was having to do to generate / access data; it's definitely not working code! ;-( :-) :-)

      -l

      Not to dispute anything you are suggesting, but more to explain how I got there. Before mkBless (something I came up with on the fly while writing there -- not always my best work) -- BREAK -- Just as an FYI on why I seem to go away for a while in the middle of conversations: my typing speed has decayed something fierce over the years...like down to 20-30% of faster times. No way can my typing keep up with my thoughts these days, so sometimes blocks of explanatory text are skipped -- confusing readers and myself. But sigh.... In case someone doesn't know, use mem allows me to mostly easily include a bunch of packages in 1 file.
      somewhere further above:
      { package RPM_data; #{{{
        use constant D => q(-);   #D for Dash
        use constant o => q(.);   #o for dot
        use Data::Vars [qw(N V R A reldir reponame cpeids size)],
                       {cpeids=>sub{ {} } };  # N-keys of this is ref-cnt
        sub cpeids() { my $p = shift; scalar keys %{$p->{cpeids}} }
        sub cpeid(;$) {
          return q(cpeid) unless @_;
          my $p = shift;
          my $cpeid = shift;
          mkARRAY $p->cpeids;
          push @{$p->cpeids}, $cpeid;
          my $cpeids = ErV $p, cpeids;
          push @{$p->cpeids}, $cpeid unless ErV $cpeids, $cpeid;
          return $p->cpeids($cpeid);
        }
        sub new($) { my $p = shift; my $c = ref $p || $p;...}
        sub new_from_file($) { my $p = shift; my $fname = shift; ...
        sub new_from_path($) { my $p = shift; my $pthnam = shift;...
        sub VR () { my $p = shift; $p->{V} .D. $p->{R} }
        sub NVR () { my $p=shift; $p->{N} .D. $p->VR }
        sub NVRA () { my $p=shift; $p->NVR .o. $p->{A} }
        sub relpth() {my $p = shift; pathcat($p->reldir, $p->NVRA .o."rpm")}
      }#####
      { package VRs; #{{{
        use Types::Core qw(blessed LongSub);
        use Data::Vars [qw(VRs)], {VRs=>sub{ {} }};
        use P;
        use Dbg(1,1,1);
        sub vr($;$) {
          my $p = shift; my $c = ref $p || $p;
          my $argp = shift;
          my $vr = ErV $argp, VR;
          my $vrp;
          if (@_) {
            $p->{VR}{$vr} = RPM_data->new(shift);
          }
          $p->{VR}{$vr};
        }
      1;} #}}}
      ######
      (somewhere above:
      { package RepoVData; #{{{
        use strict; use warnings; use mem;
        use Data::Vars [qw(RepoDB RepoXMLs )],  #RDFile_inf
          { RepoXMLs => sub() { {} },  # HASH{type => RDFile_inf}
            RepoDB => sub() { {} }, }; # nam=>rpmdata

      And then a bunch of these progressive assignments checking to see if each section is blessed. This started getting very repetitive and ugly looking.

      $p->{Vcpeid}{$cpeid} = RepoVData->new()
          unless blessed $p->{Vcpeid}{$cpeid};
      $p->{Vcpeid}{$cpeid}{RepoDB} = RepoDB->new()
          unless blessed $p->{Vcpeid}{$cpeid}{RepoDB};
      $p->{Vcpeid}{$cpeid}{RepoDB}{$nam} = VR->new()
          unless blessed $p->{Vcpeid}{$cpeid}{RepoDB}{$nam};
      $p->{Vcpeid}{$cpeid}{RepoDB}{$nam}{vr($V, $R)} = RPM_data->new({...})
          unless blessed $p->{Vcpeid}{$cpeid}{RepoDB}{$nam}{vr($V, $R)} ....

      This is where I saw I had:

      lvalue = new addon unless blessed lvalue;

      That's where mkBless came in -- to check if the lvalue was blessed, and if not, assign the new object to it so more of the data-to-follow could hang off it.

      I had it return a value so I could skip some repetition at the beginning of each line.

      So, call that a class factory if you want, I'm not sure it is, but at least wanted you to see how it "fell out of" progressively adding on more data to walk through this DB.
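The "lvalue = new addon unless blessed lvalue" pattern repeated per level can also be folded into one loop over (key, class) pairs. The following is only a sketch under assumptions: `walk_or_make` is a hypothetical helper, and the three packages are empty stand-ins for the real RepoVData, RepoDB and VR classes, which do more than this.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in classes; the real ones carry data and methods.
package RepoVData { sub new { return bless {}, shift } }
package RepoDB    { sub new { return bless {}, shift } }
package VR        { sub new { return bless {}, shift } }

# Walk from $node through (key => class) pairs, creating each level with
# class->new only when the slot does not yet hold a blessed object.
# Returns the object found or created at the end of the path.
sub walk_or_make {
    my ($node, @path) = @_;
    while (my ($key, $class) = splice @path, 0, 2) {
        $node->{$key} = $class->new unless blessed $node->{$key};
        $node = $node->{$key};
    }
    return $node;
}

my ($cpeid, $nam) = (20210719, '2048-cli');
my $p = { Vcpeid => {} };
my @path = ($cpeid => 'RepoVData', RepoDB => 'RepoDB', $nam => 'VR');

my $vr    = walk_or_make($p->{Vcpeid}, @path);   # creates all three levels
my $again = walk_or_make($p->{Vcpeid}, @path);   # finds the existing objects
```

Passing class names rather than ready-made objects means a constructor runs only for levels that are actually missing, and the path is spelled out once instead of on every line.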

      (Is there any way to insert a picture or diagram in this chaos? It might help me clarify where I'm going....the old pic is worth a thousand words thing.)

      Thanks for the feedback so far. I won't be annoyed if anyone drops out of this mess; I sorta wish I could, but it's stuff I need to get done to manage my system, UG.

      cheers!...oh and the bit about doing that "blessed test or return less work" bit, I probably would have gotten there eventually, in some later cleanup -- I put focus first on getting something to work -- and yes, I am one of those who keeps working at things to clean them up and make them better, because I know I need to do that for my "future self" (if no one else) to be able to reuse the code I write today.

Re: How to walk through convoluted data?
by perlfan (Vicar) on Jul 20, 2021 at 22:31 UTC
    You might get more responses if you provide some realistic XML files.
      How/where should I post the files? I need to heavily trim them, but the two of interest would be the repomd.xml, which has the cpeid in it and the names of the other xml files of the group, and the 'primary.xml', which has a list of all the rpms in the release.

      Out of the 4 repos released per day (oss/non-oss/src-oss/src-non-oss), I've been using src-non-oss for recent test runs since it's the shortest, with repomd.xml at 8869 bytes and primary.xml at 41033 bytes.

      Vs. for 'oss' (repomd's are about the same), primary.xml varies a lot depending on an individual update, but say, with the same date as src-non-oss, it's 162MB. That primary.xml has 3.2M lines and 67370 different rpm descriptions.

      I'll list here the beginning of repomd.xml through its cpeid entry, plus the listing for the primary.xml file:

      <?xml version="1.0" encoding="UTF-8"?>
      <repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
        <revision>1625990264</revision>
        <tags>
          <content>pool</content>
          <content>gpg-pubkey-3dbdc284-53674dd4.asc?fpr=22C07BA534178CD02EFE22AAB88B2FD43DBDC284</content>
          <content>gpg-pubkey-39db7c82-5f68629b.asc?fpr=FEAB502539D846DB2C0961CA70AF9E8139DB7C82</content>
          <content>gpg-pubkey-307e3d54-5aaa90a5.asc?fpr=4E98E67519D98DC7362A5990E3A5C360307E3D54</content>
          <repo>obsproduct://build.opensuse.org/openSUSE:Factory/openSUSE/20210710/i586</repo>
          <repo>obsproduct://build.opensuse.org/openSUSE:Factory/openSUSE/20210710/x86_64</repo>
          <distro cpeid="cpe:/o:opensuse:opensuse:20210710">openSUSE Tumbleweed</distro>
        </tags>
        <data type="primary">
          <checksum type="sha256">60ac248489df31c61277a6872279561730d27d51b3bb7d15368d75b69d1ac80c</checksum>
          <open-checksum type="sha256">d101bad38f3a987c9a790f927031cfcc68c1598b4d6f329447c6fe338cfb7128</open-checksum>
          <location href="repodata/60ac248489df31c61277a6872279561730d27d51b3bb7d15368d75b69d1ac80c-primary.xml.gz"/>
          <timestamp>1625990264</timestamp>
          <size>18659084</size>
          <open-size>171435824</open-size>
        </data>

      That gives me my distro version or date (the cpeid number) and the location of the first primary.xml file of rpms that have changed since "yesterday" (previous release).
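As a rough illustration of pulling those two values out of repomd.xml, here is a dependency-free sketch. The function names `cpeid_from_repomd` and `primary_href_from_repomd` are hypothetical, and the regexes assume the layout of the excerpt above; a proper XML parser would be more robust against formatting changes.

```perl
use strict;
use warnings;

# Pull the 8-digit release date out of the <distro cpeid="..."> attribute.
sub cpeid_from_repomd {
    my ($xml) = @_;
    my ($id) = $xml =~ /<distro\s+cpeid="cpe:[^"]*:(\d{8})"/;
    return $id;    # undef when no 8-digit cpeid is found
}

# Find the location href inside the <data type="primary"> element.
sub primary_href_from_repomd {
    my ($xml) = @_;
    my ($href) = $xml =~ m{<data\s+type="primary">.*?<location\s+href="([^"]+)"}s;
    return $href;
}

# Trimmed copy of the excerpt quoted in the post.
my $repomd = <<'XML';
<tags>
  <distro cpeid="cpe:/o:opensuse:opensuse:20210710">openSUSE Tumbleweed</distro>
</tags>
<data type="primary">
  <location href="repodata/60ac248489df31c61277a6872279561730d27d51b3bb7d15368d75b69d1ac80c-primary.xml.gz"/>
  <timestamp>1625990264</timestamp>
</data>
XML

my $cpeid = cpeid_from_repomd($repomd);
my $href  = primary_href_from_repomd($repomd);
```

The extracted cpeid satisfies the same `/^\d{8}$/` check the RepoDB routine in the original question applies before touching the per-release data.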

      The header and 1st package of a primary for an oss release are below:

      <?xml version="1.0" encoding="UTF-8"?>
      <metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="66746">
      <package type="rpm">
        <name>2048-cli</name>
        <arch>i586</arch>
        <version epoch="0" ver="0.9.1+git.20181118" rel="1.11"/>
        <checksum type="sha256" pkgid="YES">310f3c8e912923da08eab8debafd6fc03afe9e1ae97304bcd029658959e099d0</checksum>
        <summary>A CLI version of the "2048" game</summary>
        <description>2048 is a mathematics-based puzzle game where the player has to slide
          tiles on a grid to combine them and create a tile with the number 2048.
          The player has to merge the similar number tiles (2n) by moving the arrow
          keys in four different directions. When two tiles with the same number
          touch, they will merge into one.</description>
        <packager>https://bugs.opensuse.org</packager>
        <url>https://github.com/tiehuis/2048-cli</url>
        <time file="1616702669" build="1616702650"/>
        <size package="20045" installed="26081" archive="27080"/>
        <location href="i586/2048-cli-0.9.1+git.20181118-1.11.i586.rpm"/>
        <format>
          <rpm:license>MIT</rpm:license>
          <rpm:vendor>openSUSE</rpm:vendor>
          <rpm:group>Amusements/Games/Strategy/Other</rpm:group>
          <rpm:buildhost>lamb25</rpm:buildhost>
          <rpm:sourcerpm>2048-cli-0.9.1+git.20181118-1.11.src.rpm</rpm:sourcerpm>
          <rpm:header-range start="5096" end="9153"/>
          <rpm:provides>
            <rpm:entry name="2048-cli" flags="EQ" epoch="0" ver="0.9.1+git.20181118" rel="1.11"/>
            <rpm:entry name="2048-cli(x86-32)" flags="EQ" epoch="0" ver="0.9.1+git.20181118" rel="1.11"/>
          </rpm:provides>
          <rpm:requires>
            <rpm:entry name="libncurses.so.6"/>
            <rpm:entry name="libncurses.so.6(NCURSEST6_5.7.20081102)"/>
            <rpm:entry name="libtinfo.so.6"/>
            <rpm:entry name="libtinfo.so.6(NCURSES6_TINFO_5.0.19991023)"/>
            <rpm:entry name="libtinfo.so.6(NCURSES6_TINFO_5.7.20081102)"/>
            <rpm:entry name="libc.so.6(GLIBC_2.7)"/>
          </rpm:requires>
          <file>/usr/bin/2048-cli</file>
        </format>
      </package>

      I'm NOT including most fields -- only the ones I need for downloading and sorting.

      I'm also only downloading archs useful to me, as determined by my constants section:

      use constant RepoNames => qw(oss non-oss src-oss src-non-oss);
      use constant ArchNames => qw(noarch nosrc src x86_64);
      use constant RepoMDFile => 'repomd.xml';
      use constant Wanted_Names =>
          {qw(susedata 1 appdata 1 other 1 filelists 1 primary 1 appdata-icons 1)};
      use constant RType => { map { $_ => $_ } @{[RepoNames]} };
      use constant Archt => { map { $_ => $_ } @{[ArchNames]} };
      sub Repo_valid($) { my $p = shift if HASH $_[0]; ErV RType, shift }
      sub Arch_valid($) { my $p = shift if HASH $_[0]; ErV Archt, shift; }
      our @EXPORT;
      use mem(@EXPORT = (qw( RType Archt Repo_valid Arch_valid
                             RepoMDFile Wanted_Names ) ) );
      use Xporter;
      Hopefully that gives at least a bit more context. Can add more later if wanted, but already feel like I'm overwhelming....
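To make the "only the fields I need, only the archs I want" step concrete, here is a small dependency-free sketch. It is an assumption-laden illustration, not the poster's parser: `packages_from_primary` and `nvra` are hypothetical names, the regexes assume the attribute order seen in the sample (`ver` before `rel`), and the second, x86_64 package in the sample data is made up for the example.

```perl
use strict;
use warnings;

# Architectures worth downloading, mirroring the ArchNames constant above.
my %wanted_arch = map { $_ => 1 } qw(noarch nosrc src x86_64);

# Extract name, arch, version, release and location from each <package>.
sub packages_from_primary {
    my ($xml) = @_;
    my @pkgs;
    while ($xml =~ m{<package\s+type="rpm">(.*?)</package>}sg) {
        my $body = $1;
        my %rpm;
        ($rpm{N})    = $body =~ m{<name>([^<]+)</name>};
        ($rpm{A})    = $body =~ m{<arch>([^<]+)</arch>};
        @rpm{qw(V R)} = $body =~ m{<version\b[^>]*\bver="([^"]+)"\s+rel="([^"]+)"};
        ($rpm{href}) = $body =~ m{<location\s+href="([^"]+)"};
        push @pkgs, \%rpm;
    }
    return @pkgs;
}

# Name-Version-Release.Arch, the same shape RPM_data's NVRA method builds.
sub nvra { my ($r) = @_; return "$r->{N}-$r->{V}-$r->{R}.$r->{A}" }

my $primary = <<'XML';
<package type="rpm">
  <name>2048-cli</name>
  <arch>i586</arch>
  <version epoch="0" ver="0.9.1+git.20181118" rel="1.11"/>
  <location href="i586/2048-cli-0.9.1+git.20181118-1.11.i586.rpm"/>
</package>
<package type="rpm">
  <name>2048-cli</name>
  <arch>x86_64</arch>
  <version epoch="0" ver="0.9.1+git.20181118" rel="1.11"/>
  <location href="x86_64/2048-cli-0.9.1+git.20181118-1.11.x86_64.rpm"/>
</package>
XML

my @all  = packages_from_primary($primary);
my @keep = grep { $wanted_arch{ $_->{A} } } @all;
print nvra($_), "\n" for @keep;
```

In the sample, the location href is just the arch directory plus the NVRA string and an `.rpm` suffix, so the name-to-path mapping can be checked against the parsed fields.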

Node Type: perlquestion [id://11135180]
Approved by johngg