oko1 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Perlmonks! I come bearing a conceptual problem. Frankly, it's not something that requires Perl - besides the fact that I'll be coding the solution in Perl - but there's a bunch of smart people here, a number of whom are smart in exactly the ways that this problem needs. I'd be grateful for any help.

Here's the problem: I have a rather large (2MB+) cross-reference stored in a text file that's TSV-formatted and arranged in a grid - labels across the top, cross-referenced numbers across each line. The first column is more or less the reference column: that's what was originally used for looking up the data in the other columns - and the number in that first column can be repeated on several lines if there's more than one cross-referenced product for any given vendor.

The way things used to be done was pretty simple: find the product in the first column, look up the x-refs for all the other vendors. What I'd like to do is to expand that: I want to be able to either 1) select any vendor, enter that vendor's part number, and get all the x-refs from all the vendors for that part, and 2) enter just the part number and be presented with a list of all matches, with vendor names, at which point I'd go back to #1.

I can do a single instance of any of these fairly easily - loop over, build a hash, etc. What I'm having trouble doing is coming up with 1) an efficient way of storing the data (should I reformat/reorganize it for easier lookups, and if so, how? Should I stuff it into a database? Why?) and 2) a usable interface (this latter particularly makes me feel stupid; I'm usually very good at interfaces.) I guess it's only partially a programming problem; mostly, I'm just frustrated at being stuck, and need a push to get past the point of stuckness.

Ideas to get my brain rolling again would be highly and gratefully welcomed. Thanks in advance to all who try.


--
"Language shapes the way we think, and determines what we can think about."
-- B. L. Whorf
  • Comment on Organizing and presenting a cross-reference

Replies are listed 'Best First'.
Re: Organizing and presenting a cross-reference
by BrowserUk (Patriarch) on Sep 24, 2008 at 01:44 UTC

    A picture paints a thousand words. Or, a clearly labelled sample of the data, 2 or 3 rows, will be a far clearer explanation than:

    The first column is more or less the reference column: that's what was originally used for looking up the data in the other columns - and the number in that first column can be repeated on several lines if there's more than one cross-referenced product for any given vendor.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      OK. I hesitated to show the actual data because it's so broad, but I'll take a reasonable-width snip of it (whitespace reduced for better fit.)

      NGK_STK. NGK_P/N NGK ALT ACCEL AC_DELCO AUT +OLITE BECK/ARNLEY 1010 A6 N/A N/A C86 375 + N/A 1010 A6 N/A N/A C86 269 +7 N/A 1010 A6 N/A N/A C86 375 + N/A 1010 A6 N/A N/A C86 377 + N/A 1010 A6 N/A N/A C86 375 + N/A 1010 A6 N/A N/A C86 386 + N/A 1010 A6 N/A N/A C86S* 386 + N/A 1010 A6 N/A N/A S85F 386 + N/A 1010 A6 N/A N/A C87 277 +5 N/A 1010 A6 N/A N/A C86 277 +5 N/A 1010 A6 N/A N/A C86 303 +5 N/A 1010 A6 N/A N/A C86 303 +5 N/A 1010 A6 N/A N/A C86 303 +5 N/A 1010 A6 N/A N/A C86 303 +5 N/A 1010 A6 N/A N/A C86S* 313 +6 N/A 1010 A6 N/A N/A C86 376 + N/A 1010 A6 N/A N/A C85S 379 + N/A 1010 A6 N/A N/A C86 375 + N/A 1010 A6 N/A N/A C86 386 + N/A 1010 A6 N/A N/A 18A* 379 + N/A 1010 A6 N/A N/A C86,M8 311 +6 N/A 1010 A6 N/A N/A C87 303 +5 N/A 1010 A6 N/A N/A C87 376 + N/A 1010 A6 N/A N/A C86,M8 313 +6 N/A 1011 B7EB* 5122 142 R42XL* 403 + N/A 1024 AR6FS-11* N/A N/A R83T 124 + N/A 1024 AR6FS-11* N/A N/A R83T 804 + N/A 1024 AR6FS-11* N/A N/A R83T 124 + N/A 1024 AR6FS-11* N/A N/A R83T 584 + N/A 1024 AR6FS-11* N/A N/A R83T 124 + N/A 1027 AP9FS N/A N/A 84TS 32 + 176-5178 1027 AP9FS N/A N/A 84TS 32 + N/A 1029 BPMR6A-10 N/A N/A CS42S 297 +4 N/A 1030 DPR8EV-9* 2872 N/A N/A 416 +3 N/A 1030 DPR8EV-9* 2872 N/A N/A 416 +3 N/A 1030 DPR8EV-9* 2872 N/A N/A 416 +3 N/A 1034 BP7ES N/A 113 R41XLS 53 + 176-5075 1034 BP7ES N/A 113 R41XLS 425 +2 N/A 1034 BP7ES N/A 113 R41XLS 425 +2 176-5075 1034 BP7ES N/A 113 R41XLS 62 + 176-5075 1034 BP7ES N/A 113 R41XLS 52 + 176-5075 1034 BP7ES N/A 113 41XLS* 425 +2 N/A 1034 BP7ES N/A 113 41XLS* 425 +2 N/A 1041 ZFR6A-11 N/A N/A N/A 522 +4 176-5204 1043 BR8EVX SOLID* 6747 N/A N/A 406 +3 N/A 1049 B8EFS N/A N/A N/A AR4 +74 N/A 1052 B6HS-10 N/A 156 42F 409 +3 176-5006 1059 R217-10 N/A N/A N/A N/A + N/A

      --
      "Language shapes the way we think, and determines what we can think about."
      -- B. L. Whorf

        Despite my "a picture paints a thousand words", I'm still having trouble understanding your data.

        For example, in the following two lines, only the presence of a part no for the last manufacturer distinguishes them. Doesn't that make the second a duplicate of the first with missing information? Ie. redundant?

        NGK_STK. NGK_P/N NGK ALT ACCEL AC_DELCO AUTOLITE BECK/ARNLEY 1027 AP9FS N/A N/A 84TS 32 176-5178 1027 AP9FS N/A N/A 84TS 32 N/A

        There appears to be a significant column of information missing from the above table?

        I could imagine that the above data represents the recommendations by the different manufacturers for the plugs in their range that would be applicable to two difference vehicles. Say the first is the normally aspirated version of some mark, and the second is the turbo-charged version. And whilst most of the manufacturers recommend the same plug for both, the BECK/ARNLEY plug is unsuitable for the latter variant. And, they have no suitable alternative in their range.

        My point is, that whilst plugs from different manufacturers may be interchangable for a given vehicle, each manufacturers plugs have different ranges of operating parameters, which means that plugs from two different manufacturers are not interchangable for all applications.

        The upshot is, the reason you are having so much trouble coming up with a normalisation schema, is because the key field--the vehicle--is missing from your table. Any attempt at normalisation based upon grouping of part numbers without taking the vehicle into consideration is at best doing your customers a dis-service. It could pursuade them to purchase plugs that are unsuitable for their particular vehicle, that might have limited life due to (say) overheating. Or worse, that could damage their engines by (say) holing their pistons by burning too hot.

        At worst, it could be dangerous.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Organizing and presenting a cross-reference
by dragonchild (Archbishop) on Sep 24, 2008 at 01:48 UTC
    You would have a products table and a vendors table. You would then have a product_x_vendor table which would be the xref. If each vendor had their own unique part number for a given product, the part number would go in the xref table.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Unfortunately, they're not unique - in fact, many vendors use the same number for a given part. I've seen the scheme you describe before (believe it or not, I was a database admin for a couple of years...), but I couldn't figure out how to make it work here.


      --
      "Language shapes the way we think, and determines what we can think about."
      -- B. L. Whorf
        Keep the vendor name and part num separate (cols), but define a compound key on the table (vendor,part_num).

        I think what you are searching for is database design or more specific database normalization.

        Really you should probably put the product serial, price, anything else specific to that vendors offer on the product into a seperate table. Lets name it vendoroffer. Then have another table with the vendor, product, and vendoroffer keys having all your "cross-references" but none of the data. This is an example of normalization.

        By keys I mean internally created numbers (i.e. autoincrement integers) that only have meaning in the database and not something like serial numbers.

        IANADBA, but perhaps you should store all the values with their vendor-label prefix. Then at least you will eliminate duplicate part numbers for different parts.
Re: Organizing and presenting a cross-reference
by apl (Monsignor) on Sep 24, 2008 at 01:12 UTC
    Stick it in a database. Let SQL and the DB Server do the heavy lifting...

      ...which unfortunately puts me right back where I started. It's not that I don't have tools; I know how to write an SQL statement, and I can crank out reams and oodles of Perl code. The problem is that I can't figure out how to shape this problem so that I can apply the tools. :\

      If I stuff this info into a database (how do I organize the data? What's the guiding principle for storing it in one structure vs. another?), how do I retrieve it to serve the goals that I stated? That's my problem - it seems like the interface would determine the structure, and the structure would determine the interface. Meanwhile, here I am starving to death between two piles of hay.


      --
      "Language shapes the way we think, and determines what we can think about."
      -- B. L. Whorf