I think this approach is fine for a small data set, but it becomes inefficient once the input runs to millions of lines: your algorithm requires the entire file to be read into memory (what if there are 200 million lines?).
To be more efficient, you should look for an algorithm with a smaller, predictable memory footprint; reading the entire data file into memory is not ideal.
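
As a rough illustration of what I mean, here is a minimal Python sketch of processing a file one line at a time. I don't know exactly what your per-line work is, so `summarize` (counting lines and tracking the longest one) is only a hypothetical stand-in, and both function names are made up for this example. The point is that only the current line and a couple of counters are in memory at any time, however large the file is.

```python
from typing import Iterable, Iterator, Tuple


def stream_lines(path: str) -> Iterator[str]:
    """Yield lines one at a time instead of loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        # Iterating a file object reads it lazily, line by line.
        for line in handle:
            yield line.rstrip("\n")


def summarize(lines: Iterable[str]) -> Tuple[int, int]:
    """Hypothetical per-line work: count lines and track the longest length."""
    count = 0
    longest = 0
    for line in lines:
        count += 1
        longest = max(longest, len(line))
    return count, longest
```

You would call it as `summarize(stream_lines("data.txt"))`, where `data.txt` is a placeholder path. This only works if your algorithm can be expressed as a single pass that keeps a bounded amount of state; if it needs to see all lines at once (a full sort, for example), you would need a different technique, such as processing in chunks on disk.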