in reply to Re: Re: Re: Re: many to many join on text files
in thread many to many join on text files

He specified many to many. Using a hash like that keeps only one entry per key, so it cuts the many down to just 1 and loses data.

(I would assume that in the full application he will do something more interesting than just print a message.)

Re^6: many to many join on text files
by hv (Prior) on Apr 15, 2004 at 01:57 UTC

    Well, it doesn't have to lose any data you need to keep. I originally wrote it as $hash->{$key} = 1, and changed it to an increment to preserve the count; it could just as easily store an arrayref of line numbers or file positions for each key. The core concept, though, is that the really slow operation here is searching through a file repeatedly for matches on a key. If there is any way to turn that into a hash lookup, it is likely to improve things for as long as there is memory left.
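A minimal sketch of the "arrayref of file positions for each key" idea, not the code from the earlier post. It assumes tab-separated lines with the join key in the first field; the sample data and the names build_offset_index and fetch_lines are made up for illustration. Only the keys and byte offsets live in the hash, and the full lines are re-read with seek when needed.

```perl
use strict;
use warnings;

# Hypothetical sample data: tab-separated lines, join key in the first field.
my $data = join '', map { "$_\n" } (
    "k1\talpha",
    "k2\tbeta",
    "k1\tgamma",
);

# Build key => [byte offsets]. Duplicate keys keep every offset,
# so no rows are lost, and only keys plus offsets are held in memory.
sub build_offset_index {
    my ($fh) = @_;
    my %index;
    while (1) {
        my $pos  = tell $fh;          # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        my ($key) = split /\t/, $line, 2;
        push @{ $index{$key} }, $pos;
    }
    return \%index;
}

# Fetch the full lines for a key by seeking to each stored offset.
sub fetch_lines {
    my ($fh, $index, $key) = @_;
    my @lines;
    for my $pos (@{ $index->{$key} || [] }) {
        seek $fh, $pos, 0;            # 0 = offset from start of file
        my $line = <$fh>;
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# An in-memory filehandle stands in for a real file here.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $index = build_offset_index($fh);
my @k1    = fetch_lines($fh, $index, 'k1');
print "$_\n" for @k1;
```

With a real file the open call would take a path instead of a scalar reference. The trade-off is one seek and read per matching row in exchange for a much smaller hash.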

    I think we'd need to see some information about the actual task at hand to know whether there's some reason an in-core hash cannot be used, but unless memory is unusually restricted for a modern-day computer it seems unlikely.

    Hugo

      Two files with a million rows each, and dozens of fields per row, suggest that we are likely to be throwing around a few million objects, many of which are going to take up a few hundred bytes.

      Even with modern-day computers, this is starting to push what I feel comfortable with assuming is available.

      As for not having to lose data you intend to keep, I agree that it is possible to avoid that. I was just pointing out that your code as posted would not keep it in the circumstance described by the original question.

        I'm sorry if I was unclear: I was not at any point suggesting putting the whole records (or whole objects derived from those records) in the hash.

        Hugo

Re: Re: Re: Re: Re: Re: many to many join on text files
by aquarium (Curate) on Apr 15, 2004 at 01:56 UTC
    The nitty-gritty of the processing after the full outer join is trivial compared to the join itself. Both tables may also contain duplicate rows, which need to match up in full outer join fashion, and that breaks the logic of my program for doing a proper full outer join. I've just tried M$Access and can't find a full outer join there. It does have a nice text import wizard, though, so columns automatically get field names. I will try MySQL quickly next.
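For reference, a rough sketch of a many-to-many full outer join done with a hash of arrayrefs. It assumes SQL-like semantics (every matching left/right pair is emitted, including duplicates, and unmatched rows from either side are padded with an undefined value). The rows, the [key, payload] layout and the name full_outer_join are hypothetical, and the sketch holds one side fully in memory, which may not suit the real files.

```perl
use strict;
use warnings;

# Hypothetical rows as [key, payload]; note the duplicate keys on both sides.
my @left  = (['a', 'L1'], ['a', 'L2'], ['b', 'L3']);
my @right = (['a', 'R1'], ['a', 'R2'], ['c', 'R3']);

sub full_outer_join {
    my ($left, $right) = @_;

    # Group the right-hand rows by key, keeping every duplicate.
    my %right_by_key;
    push @{ $right_by_key{ $_->[0] } }, $_ for @$right;

    my (@out, %matched);
    for my $l (@$left) {
        my $key = $l->[0];
        if (my $rs = $right_by_key{$key}) {
            # Many to many: pair this left row with every matching right row.
            $matched{$key} = 1;
            push @out, [$key, $l->[1], $_->[1]] for @$rs;
        }
        else {
            # Left row with no partner on the right.
            push @out, [$key, $l->[1], undef];
        }
    }

    # Right rows whose key never appeared on the left.
    for my $r (@$right) {
        push @out, [$r->[0], undef, $r->[1]] unless $matched{ $r->[0] };
    }
    return @out;
}

my @joined = full_outer_join(\@left, \@right);
print join("\t", map { defined $_ ? $_ : 'NULL' } @$_), "\n" for @joined;
```

For the sample rows this yields four rows for key a (two left rows times two right rows), one row for b with no right side, and one row for c with no left side.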