The good news is... The bad news is...
The good news is that I bring only good news! :) The modified J script is faster, more versatile, and uses significantly less RAM. It has also been tested with the 9.04 engine, parallelizing the obvious low-hanging fruit for an additional speed boost.
NB. -----------------------------------------------------------
NB. --- This file is "llil4.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
pattern =: 0 1
NB. ========> This line has a star in its right margin =======> NB. *
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
NB. PAD_CHAR =: ' '
filter_CR =: #~ ~: & CR
make_more_space =: ' ' I. @ ((LF = ]) +. (TAB = ])) } ]
find_spaces =: I. @: = & ' '
read_file =: {{
'fname pattern' =. y
text =. make_more_space filter_CR fread fname
selectors =. (|.!.0 , {:) >: find_spaces text
width =. # pattern
height =. width <. @ %~ # selectors
append_diffs =. }: , 2& (-~/\)
shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
selectors =. append_diffs selectors
selectors =. shuffle_dims selectors
literal =. < @: (}:"1) @: (];. 0) & text "_1
numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
extract =. pattern & {
using =. 1 & \
or_maybe =. `
,(extract literal or_maybe numeric) using selectors
}}
read_many_files =: {{
'fnames pattern' =. y
,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames NB. *
}}
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
'words nums' =: (~. words) ; words +//. nums NB. *
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
words =: ; nums < @ /:~/. words
t2 =: (6!:1) '' NB. time since engine start
text =: , words ,. TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
The code above doesn't (yet) include any 9.04 features and runs fine with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, the opposite of what I saw with 9.03 (where the script ran faster on Linux). Let's shrug that off as a consequence of 9.04's beta status and/or my antique PC. The results below are for the 9.04 beta on Windows 10 (RAM usage taken from Windows Task Manager):
> jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input: 1.501
Classify, sum, sort: 2.09
Format and write output: 1.318
Total time: 4.909
Finished. Waiting for a key...
Peak working set (memory): 376,456K
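For anyone who wants to see what the "Classify, sum, sort" stage does, here is a toy run of those three lines. It is only a sketch: the sample data is made up, and it assumes the words are held as boxed strings, which may differ from the representation the full script builds when it parses the files.

```j
NB. Toy data (hypothetical), standing in for the parsed input files.
words =: 'pear' ; 'plum' ; 'lime' ; 'plum' ; 'kiwi'
nums  =: 2 1 3 4 5

NB. Classify and sum: +//. adds up the nums that share the same word.
'words nums' =: (~. words) ; words +//. nums

NB. Reorder both lists by descending count.
'words nums' =: (\: nums)&{ &.:> "_1 words ; nums

NB. Within each group of equal counts, sort the words ascending.
words =: ; nums <@/:~/. words

echo words ,. <"0 nums   NB. kiwi 5, plum 5, lime 3, pear 2
```

The last step is what puts kiwi ahead of plum: both end up with a count of 5, so only the tie-break sort decides their order.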
There are 3 star-marked lines. To enable parallelization with the new 9.04 features, replace them with these counterparts:
{{ for. i. 3 do. 0 T. 0 end. }} ''
,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
'words nums' =: (~.t.'' words) , words +//. t.'' nums
As you can see, the 1st line replaces a comment, while the 2nd and 3rd lines need just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line parallelizes filtering for unique words and summing the numbers according to the word classification. That was kind of redundant double work even before, as I see it. The 1st line starts 3 additional worker threads. My CPU doesn't have more cores anyway, and this script has no work that is easily dispatched to more workers. Then:
Read and parse input: 0.992
Classify, sum, sort: 1.849
Format and write output: 1.319
Total time: 4.16
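The same two primitives can be tried in isolation. The snippet below is a minimal sketch that needs J 9.04 or later; sum_row is a hypothetical stand-in for a slow verb and is not part of llil4.ijs.

```j
NB. Requires J 9.04 or later.
{{ for. i. 2 do. 0 T. 0 end. }} ''   NB. start 2 worker threads

sum_row =: +/                        NB. hypothetical stand-in for real work
pyxes =: sum_row t. '' "1 i. 3 4     NB. one task per row, one box per task
echo > pyxes                         NB. opening a box waits for its result
```

Each task hands back a box right away, and the caller only waits when it opens that box. This is why the patched 3rd line joins its two results with , instead of ; as in the original: both sides are already boxed.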
I would call my parallelization attempt, crude as it was, a success. Next is the output for the second "official" dataset in this thread:
> jconsole.exe llil4.ijs long1.txt long2.txt long3.txt out.txt
Read and parse input: 1.329
Classify, sum, sort: 0.149
Format and write output: 0.009
Total time: 1.487
########################################################
These are my results for the latest C++ solution (compiled with g++), for comparison with my efforts:
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec start
get_properties CPU time : 3.41497 secs
emplace set sort CPU time : 1.04229 secs
write stdout CPU time : 1.31578 secs
total CPU time : 5.77311 secs
total wall clock time : 5 secs
$ ./llil2vec_11149482 long1.txt long2.txt long3.txt >vec.tmp
llil2vec start
get_properties CPU time : 1.14889 secs
emplace set sort CPU time : 0.057158 secs
write stdout CPU time : 0.003307 secs
total CPU time : 1.20943 secs
total wall clock time : 2 secs
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 2.43187 secs
emplace set sort CPU time : 0.853877 secs
write stdout CPU time : 1.33636 secs
total CPU time : 4.62217 secs
total wall clock time : 5 secs
I noticed that the new C++ code, which is supposed to be faster, is actually slower than llil2grt on the "long" dataset, one of the two "official" datasets used in this thread.