Thank you for taking the time, Marshall. Let me try to add some clarity and a little feedback based on a couple of quick tests.
I think that you want separate lines based upon the node and the audit information?
As much as it can be defined, the issue is that I need to increase throughput: the script, as it started, could only handle a fraction of what it needed to. The results need each multiline event folded into a single line when reading STDIN, where the lines are intermixed between nodes.
From what I understand, it could be that a node will spew out interleaved audit info representing 2 different events although your example data does not show that?
Your understanding is correct. Hundreds of machines send syslog messages to a syslog server, but its output stream is serial, and that feeds the STDIN of this script.
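For what it's worth, the folding step can be sketched with a per-node buffer that flushes when an event ends. This is only an illustration under made-up assumptions: the line layout ("node payload") and the end-of-event marker (type=EOE) are hypothetical, not the actual audit format.

```perl
use strict;
use warnings;

# Fold interleaved multiline audit events into one line per event.
# Assumed (hypothetical) line format: "nodeN payload", with "type=EOE"
# marking the end of that node's current event.
sub fold_events {
    my @out;
    my %buf;    # partial event text, keyed by node
    for my $line (@_) {
        my ($node, $rest) = split ' ', $line, 2;
        $buf{$node} .= ($buf{$node} ? ' ' : '') . $rest;
        if ($rest =~ /type=EOE/) {        # event complete: emit folded line
            push @out, "$node $buf{$node}";
            delete $buf{$node};           # reclaim the buffer
        }
    }
    return @out;
}

# Lines from two nodes interleaved, as they would arrive on STDIN.
my @folded = fold_events(
    'nodeA type=SYSCALL a=1',
    'nodeB type=SYSCALL a=2',
    'nodeA type=PATH p=/x',
    'nodeA type=EOE',
    'nodeB type=EOE',
);
print "$_\n" for @folded;
```

Each node's fragments stay separate in %buf no matter how the nodes interleave, which is the core of the single-pass approach.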
I was looking at your code and my first thought was, "hey, run the regex engine once!" instead of multiple times.
Both you and BrowserUk pointed this out in one way or another, and you opened my eyes to ways of thinking about the data so that it is needed minimally or not at all (index was new to me).
...This is like a "garbage collection" process that happens occasionally
Yeah, it really is much like that. In the collection I was trying to avoid using time(), as it had a slight impact when processing 1.5M records; while light, why use it if the data's already available? I was using a line count as a trigger to perform the collection, though I could just as easily have used the extracted time (if it's extracted) and added N to it to determine when to collect again.
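The line-count trigger amounts to something like the sketch below. All the names and thresholds here are hypothetical, and the staleness test (lines since a node was last seen) is just one plausible choice, not the actual script's logic.

```perl
use strict;
use warnings;

# Sweep stale per-node buffers every N lines instead of calling time()
# on every record. %last_seen holds the line number at which each node
# last produced data (all values below are made up for the example).
my %buf       = (nodeA => 'partial event A', nodeB => 'partial event B');
my %last_seen = (nodeA => 100,               nodeB => 9_500);
my $lines     = 10_000;    # lines read so far
my $interval  = 5_000;     # run a sweep every 5k lines
my $max_idle  = 2_000;     # flush nodes idle for 2k+ lines

if ($lines % $interval == 0) {
    for my $node (keys %buf) {
        next if $lines - $last_seen{$node} < $max_idle;
        # Node has gone quiet: emit what we have and reclaim the memory.
        print "FLUSH $node: $buf{$node}\n";
        delete $buf{$node};
        delete $last_seen{$node};
    }
}
```

With the numbers above, nodeA (idle for 9,900 lines) gets flushed while nodeB (idle for 500) is kept, and no time() call is ever made.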
I am very skeptical that multi-threads or multi-processes can help in this situation. At the end of the day, you are writing to a single disk file.
The reason I thought about threading is that the need to "dedupe" the data (remove the multiples of the same field from the flattened record) was handled by a single regex that seemed to be as good as it gets, yet it accounted for a 100-200% increase in processing time. If I could offload that to a thread (or two) while the main loop goes on reading data, there would be a benefit, and I think I was able to achieve that. The other benefit is that writing to a single file is not a requirement; each thread could write independently to separate files. For most of the testing I performed, I left off the disk writes for now.
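As an aside, the dedupe step can also be done with a seen-hash rather than one large regex; whether that beats the regex would need benchmarking. The record below is invented for illustration, and this keeps the first occurrence of each repeated field.

```perl
use strict;
use warnings;

# Remove repeats of identical key=value fields from a flattened record,
# keeping the first occurrence (the record format here is hypothetical).
sub dedupe_fields {
    my ($record) = @_;
    my %seen;
    return join ' ', grep { !$seen{$_}++ } split ' ', $record;
}

my $flat = 'uid=0 pid=42 uid=0 comm=cron pid=42';
print dedupe_fields($flat), "\n";    # uid=0 pid=42 comm=cron
```

Either way, the dedupe is self-contained per record, which is what makes it a candidate for handing off to a worker thread.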
The code you provided performed very well:
$ wc -l audit-day.log
1622199 audit-day.log
$ ls -lh audit-day.log
-rw-r-----. 1 xxxxxxx xxxx 264M Jan 29 11:02 audit-day.log

Original code you provided (thank you!):

$ time cat audit-day.log | ./test-alt.pl >/dev/null
real 0m3.07s
user 0m3.02s
sys 0m0.17s

After removing the call to time(), the related assignment and delete:

$ time cat audit-day.log | ./test-alt.pl >/dev/null
real 0m2.67s
user 0m2.62s
sys 0m0.14s

And a further enhancement from appending the data to the end of the hash value instead of using push/join:

$ time cat audit-day.log | ./test-alt.pl >/dev/null
real 0m2.46s
user 0m2.41s
sys 0m0.14s

Without the garbage collection this still is not realistic, but it helps in judging the impact of particular choices, which is what I have been doing a lot of over the last couple of days.
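For anyone following along, the push/join-versus-append change amounts to the difference below. The hash names and fragment data are made up; the point is only that one string append per line replaces an array push plus a join at flush time.

```perl
use strict;
use warnings;

my %ev_push;      # collect fragments in an array, join at flush time
my %ev_append;    # append directly onto a growing string

for my $frag (qw(type=SYSCALL type=PATH type=EOE)) {
    push @{ $ev_push{node1} }, $frag;    # array op now, join later
    $ev_append{node1} .= ' ' . $frag;    # single string append
}

my $joined   = join ' ', @{ $ev_push{node1} };
my $appended = substr $ev_append{node1}, 1;    # drop the leading space
print "$joined\n$appended\n";                  # identical results
```

Both produce the same folded line; the append form simply avoids building and walking an intermediate array for every event, which matches the ~0.2s improvement measured above.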
In reply to Re^2: Multi-CPU when reading STDIN and small tasks
by bspencer
in thread Multi-CPU when reading STDIN and small tasks
by bspencer