
Please feel free to respond away with solutions...

Short version: a J script runs in ~5.7 sec using ~650 MB and produces exactly the same output, which makes it the fastest of all solutions so far.

(The caveat: the "words" are assumed to be not too different in length. They are temporarily padded to the length of the longest one while the program runs (i.e. not padded at all with the current test sample). Expect deterioration if, say, their lengths differ by 2 or more orders of magnitude, especially if just a few are very long. I didn't check.)
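
To make the caveat concrete, here is a minimal J sketch of how padding to the longest item inflates a character table. It is an illustration only: the names `short` and `long` are made up, and it uses `>` (open) rather than the `;.0` cut the script uses, on the assumption that both fill shorter items the same way.

```j
NB. Three short "words" as boxed strings.
short =: 'a' ; 'bb' ; 'ccc'
echo $ > short                 NB. shape of the padded table: 3 rows, 3 columns

NB. Add one very long "word": every row is now padded to its length.
long  =: short , < 1000 # 'x'
echo $ > long                  NB. shape of the padded table: 4 rows, 1000 columns
```

A single outlier multiplies the storage of every row, which is why a few very long words would be expected to hurt.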

###################### (Prose and previous results, can be skipped)

I was shocked to notice that this thread is almost a month old already. I'm in no hurry and have been pursuing what follows at leisure and only rarely (kind of "for dessert"), but it's better to publish this at long last, whether or not it is the "final" optimized version (I'm sure it can be improved a lot), before the thread goes dead cold and the participants have to make an effort to read their own code because of the time elapsed.

As a frame of reference, here is an assortment of results from previous solutions, on my hardware:

llil2grt.pl (see 11148713) (fastest "pure core-only Perl"), Windows:

llil2grt start
get_properties : 16 secs
sort + output  : 31 secs
total          : 47 secs
Peak working set (memory): 2,744,200K

Same code, data and PC, on Linux (I didn't investigate the reason for the difference):

llil2grt start
get_properties : 11 secs
sort + output  : 20 secs
total          : 31 secs
2,152,848 Kbytes of RAM were used

I assume that Judy (see 11148585) is the best in both speed and memory among non-parallel Perl solutions, with the same caveat as at the very top: "words" are temporarily padded to a fixed length, here e.g. to 10 bytes:

my_Judy_test start
get_properties : 13 secs
sort + output  : 7 secs
total          : 20 secs
349,124 Kbytes of RAM were used

Being a lazy bum, I didn't compile the C++ solutions (nor do I code in C++). Here is a copy-paste from 11148969, which I assume is the best result so far among C++ and all others (on my PC, I expect the time to be worse):

llil2grt start
get_properties      CPU time : 4.252 secs
emplace set sort    CPU time : 1.282 secs
write stdout        CPU time : 1.716 secs
total               CPU time : 7.254 secs
total        wall clock time : 7 secs
Memory use (Windows Private Bytes): 1,626,728K

###################### (End of prose and previous results)

The code below prints the following message. RAM usage was taken from Windows Task Manager (to be on par with how it was measured for Perl) while the script paused waiting for a key after finishing (Ctrl-D on Linux or Ctrl-Z + Enter on Windows, as usual):

Read and parse input:    1.636
Classify, sum, sort:     2.206
Format and write output: 1.895
Total time:              5.737

Finished. Waiting for a key...

Peak working set (memory): 657,792K

The injection of CR into the output lines is only needed on Windows (actually, it is not needed at all), to ensure there is no difference from the Perl output when the two are compared later. The "magic constant" 3 for the number width can be any value, and is only used in an intermediate step.
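
As a rough illustration of that intermediate step, the following toy J sketch formats counts at a fixed width, stitches them to a padded word table, and then strips the padding. The three-row data is made up, the CR is left out, and a width of 3 is simply wide enough for these toy counts.

```j
NB. Toy inputs: padded word table and matching counts.
words =: > 'foxtrot' ; 'alpha' ; 'bravo'
nums  =: 12 3 2

NB. Format every count in a field of width 3, then cut the resulting
NB. string into non-overlapping rows of 3 characters (one row per count).
cells =: _3 ]\ 3 ": nums

NB. Stitch word, TAB, count and LF column-wise, then ravel to one string.
text  =: , words ,. TAB ,. cells ,. LF

NB. Drop every blank: this removes the padding of both words and counts.
text  =: (#~ ~:&' ') text
echo text
```

Because all blanks are removed at the end, the field width only matters while the counts are held as a rectangular table.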

I had to make this code slightly less readable than it was during development by somewhat aggressively re-using the same variable names for words and nums over and over, as the data are processed and modified while the script progresses. They had different "self-explanatory" names at each stage, but because the arrays are huge, it's better to overwrite a variable immediately on successive assignments to conserve memory. "Erasing" throw-away helper arrays (similar to undef in Perl) serves the same purpose.
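
The pattern looks roughly like this in isolation. This is a sketch with hypothetical names (`data`, `helper`), not an excerpt from the script; `erase` is the same standard-library verb the script calls.

```j
NB. A large array and a throw-away intermediate derived from it.
data   =: i. 1000000
helper =: 2 * data

NB. Re-assign to the same name instead of introducing a new one,
NB. so the previous value of data is no longer referenced.
data   =: helper + 1

NB. Drop the helper once it is no longer needed.
erase 'helper'
echo 4!:0 < 'helper'           NB. name class; _1 means the name is unused
```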

Actually, during development I was playing with this toy dataset. Here are the original data and the result:

   text =: noun define
tango	1
charlie	2
bravo	1
foxtrot	4
alpha	3
foxtrot	1
bravo	1
foxtrot	7
)

    NB. Do work here...
    
    ] text
foxtrot	12
alpha	3
bravo	2
charlie	2
tango	1
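
For readers who want to try the "Do work here" part on the toy data, here is a minimal sketch. It starts from already parsed words and counts, and it is deliberately simpler than the full script: it sorts all unique words alphabetically and then relies on grade-down being stable, whereas the script sorts only the runs of equal counts. The name `uniq` and `sums` are illustrative.

```j
NB. Toy data, already parsed: padded word table plus one count per row.
words =: > 'tango';'charlie';'bravo';'foxtrot';'alpha';'foxtrot';'bravo';'foxtrot'
nums  =: 1 2 1 4 3 1 1 7

NB. Classify and sum: key (/.) groups nums by identical rows of words.
sums  =: words +//. nums
uniq  =: ~. words

NB. First order the unique words alphabetically...
ord   =: /: uniq
uniq  =: ord { uniq
sums  =: ord { sums

NB. ...then grade down by count; ties keep their alphabetical order.
ord   =: \: sums
uniq  =: ord { uniq
sums  =: ord { sums

NB. Show word, TAB, count per row (padding is left in for display).
echo uniq ,. TAB ,. ": ,. sums
```

On the toy data this should give the same order as the result above: foxtrot, alpha, bravo, charlie, tango. Sorting everything twice is presumably more work on the big inputs than sorting only the tied runs.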

The script:

NB. -----------------------------------------------------------
NB. --- This file is "llil.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------

args   =: 2 }. ARGV
fn_out =: {: args
fn_in  =: }: args

NUM_LENGTH =: 3
PAD_CHAR   =: ' '

make_sel  =: [: (1 2 0& |:) @ ,: ([ ,. ] - [: , [)
sort_some =: ([: /:~ {)`[`] }

text    =: , freads " 0 fn_in
lf_pos  =: I. text = LF
tab_pos =: I. text = TAB

words =: ((0 , >: }: lf_pos) make_sel tab_pos) ];.0 text
nums  =: 0&". (tab_pos make_sel lf_pos) ; @: (<;.0) text
erase 'text' ; 'lf_pos' ; 'tab_pos'

t1 =: (6!:1) ''         NB. time since engine start

nums  =: words +//. nums
words =: ~. words
'words nums' =: (\: nums)& { &.:>"_1 words ; nums

starts =: I. ~: nums
ranges =: starts ,. (}. starts , # nums) - starts
count  =: # starts

sort_words =: monad define
   'ranges words' =. y
   range  =. ({. + i. @ {:) {. ranges
   (}. ranges) ; range sort_some words
)

words =: > {: sort_words ^: count ranges ; words
erase 'starts' ; 'ranges'

t2 =: (6!:1) ''         NB. time since engine start

nums =: (- NUM_LENGTH) ]\ NUM_LENGTH ": nums
text =: , words ,. TAB ,. (nums ,. CR) ,. LF
erase 'words' ; 'nums'

text =: (#~ ~: & PAD_CHAR) text
text fwrite fn_out
erase < 'text'

t3 =: (6!:1) ''         NB. time since engine start

echo 'Read and parse input:    ' , ": t1
echo 'Classify, sum, sort:     ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time:              ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0

In reply to Re: Rosetta Code: Long List is Long (faster) by Anonymous Monk
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito
