in reply to Re: How can I keep the first occurrence from duplicated strings?
in thread How can I keep the first occurrence from duplicated strings?

You could just reverse the order of the lines in the file...

use strict;
use warnings;

my @lines = reverse(<DATA>);

This will read the entire file into RAM. No problem for 100 kBytes, big trouble for big files (larger than free RAM). The solutions from choroba, Grandfather, eyepopslikeamosquito, and the second solution from tybalt89 do not suffer from that problem, because they all read only one line at a time.
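For reference, a minimal sketch of the line-at-a-time approach (reading from DATA here; any filehandle works). A %seen hash remembers which lines have already appeared, so each line is printed only on its first occurrence:

use strict;
use warnings;

# Print only the first occurrence of each line, one line at a time.
my %seen;
while (my $line = <DATA>) {
    print $line unless $seen{$line}++;
}

__DATA__
apple
banana
apple
cherry
banana

Note that %seen still grows with the number of distinct lines, so the memory saving is largest when the file contains many duplicates.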

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^3: How can I keep the first occurrence from duplicated strings?
by haukex (Archbishop) on Aug 30, 2023 at 10:54 UTC

    See also File::ReadBackwards (however, its documentation doesn't mention file encodings, so it might blow up on UTF-8).
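    A minimal sketch of its use, assuming a plain bytes-oriented file (the filename is made up):

    use strict;
    use warnings;
    use File::ReadBackwards;

    # Read lines last-to-first. File::ReadBackwards works on raw bytes,
    # so multi-byte encodings such as UTF-8 may need extra care, per the caveat above.
    my $bw = File::ReadBackwards->new('data.txt')
        or die "Can't open data.txt: $!";
    while (defined(my $line = $bw->readline)) {
        print $line;
    }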

Re^3: How can I keep the first occurrence from duplicated strings?
by Bod (Parson) on Aug 30, 2023 at 11:10 UTC
    This will read the entire file into RAM. No problem for 100 kBytes, big trouble for big files

    Yes, of course that's true.

    But given the apparent nature of the data, I feel it's safe to assume that the file size will be small relative to available RAM and paging files. If it were more than a few lines then processing it using Perl is almost certainly the wrong approach. For a file large enough to be a problem, Perl should be reading in one line at a time and loading it into a database, at which point getting the desired result of the first occurrence becomes trivial.

    So for a truly huge file, the question needs asking on SQLMonks (wishful thinking...)
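    For what it's worth, a hedged sketch of that database route using DBI with DBD::SQLite; the database file, table name, and input filename are made up for illustration. Declaring the line as the PRIMARY KEY makes "keep the first occurrence" a property of the insert itself:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=lines.db', '', '',
        { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS seen (line TEXT PRIMARY KEY)');

    # INSERT OR IGNORE silently drops duplicates, keeping the first occurrence.
    my $sth = $dbh->prepare('INSERT OR IGNORE INTO seen (line) VALUES (?)');

    open my $fh, '<', 'data.txt' or die "Can't open data.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $sth->execute($line);
    }
    $dbh->commit;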

      For a file large enough to be a problem, Perl should be reading in one line at a time and loading it into a database

      IME you are seriously underestimating the time it would take to perform those insertions. I would not take this approach but rather would use the all-in-Perl approach as proposed by other respondents. It will be more robust, quicker to develop and faster to run to completion.

      There are of course different tasks where the time penalty of loading into a database will be outweighed by other advantages, but a single pass through the data while discarding a majority of rows, like this one, isn't one of them.


      🦛

      For a file large enough to be a problem, Perl should be reading in one line at a time and loading it into a database

      As usual, you'd need to benchmark the specific application to know which is faster.
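      One way to do that with the core Benchmark module; the synthetic data and the two pure-Perl candidate strategies here are made up for illustration (substitute the real contenders, such as a database-backed version, for your own data):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Synthetic input: 50_000 lines drawn from 5_000 distinct keys.
      my @lines = map { "key" . int(rand 5_000) . "\n" } 1 .. 50_000;

      cmpthese(-2, {
          # Idiomatic one-liner: %seen keeps each line's first occurrence, in order.
          grep_seen => sub {
              my %seen;
              my @first = grep { !$seen{$_}++ } @lines;
          },
          # Explicit loop doing the same job with an exists() check.
          loop_exists => sub {
              my (%seen, @first);
              for my $line (@lines) {
                  next if exists $seen{$line};
                  $seen{$line} = 1;
                  push @first, $line;
              }
          },
      });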

      In this thread, where some suggested using an external database instead of a gigantic Perl hash, I remember this quote from BrowserUk:

      I've run Perl's hashes up to 30 billion keys/2 terabytes (ram) and they are 1 to 2 orders of magnitude faster, and ~1/3rd the size of storing the same data (64-bit integers) in an sqlite memory-based DB. And the performance difference increases as the size grows. Part of the difference is that however fast the C/C++ DB code is, calling into it from Perl, adds a layer of unavoidable overhead that Perl's built-in hashes do not have.

      In Re: Fastest way to lookup a point in a set, when asked if he tried a database, erix replied: "I did. It was so spectacularly much slower that I didn't bother posting it".

      In Re: Memory efficient way to deal with really large arrays? by Tux, Perl benchmarked way faster than every database tried (SQLite, Pg, MySQL, MariaDB).

      With memory relentlessly getting bigger and cheaper (a DDR4 DIMM can hold up to 64 GB, while DDR5 octuples that to 512 GB), doing everything in memory with huge arrays and hashes is becoming more practical over time.

      "If it were more than a few lines then processing it using Perl is almost certainly the wrong approach."

      Why?

      "For a file large enough to be a problem, Perl should be reading in one line at a time and loading it into a database when the desired result of getting the first occurrence becomes trivial."

      False: for this sort of task, Perl is capable of reading a line at a time and generating the required output without first inserting each line into a database and then querying it.

        False: for this sort of task, Perl is capable of reading a line at a time and generating the required output without first inserting each line into a database and then querying it

        Well said. Curiously, the format of the data in this long thread is spookily similar to Long list is long by long lost Chuma ... whose return we are longing for. :)