Re^3: Rosetta Code: Long List is Long (faster - vec

Thanks to the excellent feedback from marioroy I've been able to improve the code as follows:

Compile MAX_STR_LEN_L version with clang++, otherwise g++
For MAX_STR_LEN_L version, reduced size of short strings by one char by not null-terminating them
Use good old printf to write to stdout (seems a bit faster than streaming to std::cout)
Minor code cleanups (especially around creating the inverted std::set)

New Timings for Long String Version

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties      CPU time : 3.06708 secs
emplace set sort    CPU time : 0.903617 secs
write stdout        CPU time : 1.23459 secs
total               CPU time : 5.20544 secs
total        wall clock time : 5 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties      CPU time : 3.4832 secs
emplace set sort    CPU time : 1.10209 secs
write stdout        CPU time : 0.939805 secs
total               CPU time : 5.52519 secs
total        wall clock time : 6 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties      CPU time : 3.42605 secs
emplace set sort    CPU time : 0.916222 secs
write stdout        CPU time : 1.00049 secs
total               CPU time : 5.343 secs
total        wall clock time : 5 secs

> diff f.tmp vec.tmp
[download]

This is an average time of 5.35 secs (compared to 5.5 secs in the original post).

New Timings for Short String (MAX_STR_LEN_L=6) Version

> For some reason, short string version is faster when compiled with c
+lang++
> (while long string version is not)
> clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties      CPU time : 1.87217 secs
emplace set sort    CPU time : 0.589238 secs
write stdout        CPU time : 0.842179 secs
total               CPU time : 3.30369 secs
total        wall clock time : 3 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties      CPU time : 1.95909 secs
emplace set sort    CPU time : 0.610479 secs
write stdout        CPU time : 0.959859 secs
total               CPU time : 3.52965 secs
total        wall clock time : 4 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties      CPU time : 1.86549 secs
emplace set sort    CPU time : 0.608097 secs
write stdout        CPU time : 0.857176 secs
total               CPU time : 3.33087 secs
total        wall clock time : 3 secs

> diff f.tmp vec.tmp
[download]

This is an average time of 3.4 secs (compared to 4.3 secs in the original post).

New version of llil2vec.cpp

// llil2vec.cpp.
// Vector version of llil2grt.pl.
// g++ compile on Linux:
//    g++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
// Example run: llil2vec big1.txt big2.txt big3.txt >vec.tmp

// Experiment with fast_io.h (see [id://11149504] by [marioroy])
#if 0
#include <fast_io.h>
#else
#include <cstdio>
#endif

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>
#include <set>

#include <algorithm>
#include <utility>
#include <iterator>

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// -------------------------------------------------------------------
+---------

// Crude hack to see Windows Private Bytes in Task Manager by sleeping
+ at
// program end (see also sleep hack at end of main)
//    #include <chrono>
//    #include <thread>

// -------------------------------------------------------------------
+---------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l
+ength
// To use (limited length) fixed length strings uncomment the next lin
+e
// big.txt  max word length is 6
// long.txt max word length is 208
// Based on rough benchmarking, this short fixed string hack
// is only worth trying for MAX_STR_LEN_L up to about 64
// #define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_type         = std::array<char, MAX_STR_LEN_L+1>;
#else
using str_type         = std::string;
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using int_str_type     = std::pair<llil_int_type, str_type>;

using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;

// Mimic the Perl get_properties subroutine --------------------------
+--

// Limit line length and use ANSI C functions to try to boost performa
+nce
#define MAX_LINE_LEN_L 255

static void get_properties(
   int                nfiles,      //  in: the number of input files
   char*              fname[],     //  in: the input file names
   vec_int_str_type&  vec_ret)     // out: a vector of properties
{
   FILE* fh;
   char line[MAX_LINE_LEN_L+1];
   char* word;
   llil_int_type count;

   for (int i = 0; i < nfiles; ++i) {
      fh = ::fopen(fname[i], "r");
      if (fh == NULL) {
         std::cerr << "Error opening '" << fname[i] << "' : errno=" <<
+ errno << "\n";
         continue;
      }
      while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
         word  = ::strtok(line, "\t");
         count = ::atoll( ::strtok(NULL, "\n") );
#ifdef MAX_STR_LEN_L
         // str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0', '
+\0' } };
         str_type fixword;
         ::memset( fixword.data(), '\0', MAX_STR_LEN_L+1 );
         ::memcpy( fixword.data(), word, strlen(word) );
         vec_ret.emplace_back( -count, fixword );
#else
         vec_ret.emplace_back( -count, word );
#endif
      }
      ::fclose(fh);
   }

   // Needs to be sorted by word for later sum of adjacent count field
+s to work
   std::sort( vec_ret.begin(), vec_ret.end(),
      [](const int_str_type& left, const int_str_type& right) { return
+ left.second < right.second; }
   );
}

// -------------------------------------------------------------------
+--

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil2vec file1 file2 ... >out.txt\n";
      return 1;
   }

#ifdef MAX_STR_LEN_L
   std::cerr << "llil2vec (fixed string length=" << MAX_STR_LEN_L << "
+) start\n";
#else
   std::cerr << "llil2vec start\n";
#endif
   time_t  tstart1 = ::time(NULL);
   clock_t cstart1 = ::clock();

   // Create the vector of properties
   vec_int_str_type vec_ret;
   get_properties(argc - 1, &argv[1], vec_ret);

   clock_t cend1  = ::clock();
   double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SE
+C;
   std::cerr << "get_properties      CPU time : " << ctaken1 << " secs
+\n";

   clock_t cstart2 = ::clock();

   // To avoid calling sort(), create an inverted std::set container
   // Note: negative count gives desired ordering
   set_int_str_type myset;
   auto it = vec_ret.cbegin();
   str_type name_last  = it->second;
   llil_int_type count = it->first;
   for (++it; it != vec_ret.cend(); ++it) {
      if ( it->second == name_last ) {
         count += it->first;
      }
      else {
         myset.emplace_hint( myset.end(), count, name_last );
         name_last = it->second;
         count     = it->first;
      }
   }
   myset.emplace_hint( myset.end(), count, name_last );

   clock_t cend2s = ::clock();

   // Output the (already sorted) std::set - no sort() function requir
+ed
   // Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
   // If name is not NULL-terminated:
   //   for ( auto const& n : myset ) ::printf( "%.*s\t%lld\n", MAX_ST
+R_LEN_L, n.second.data(), -n.first );
   // If name is NULL-terminated:
   //   for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second
+.data(), -n.first );
   for ( auto const& n : myset ) std::cout << n.second.data() << '\t' 
+<< -n.first << '\n';
#else
   //   for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second
+.c_str(), -n.first );
   for ( auto const& n : myset ) std::cout << n.second << '\t' << -n.f
+irst << '\n';
#endif

   clock_t cend2 = ::clock();
   time_t  tend2 = ::time(NULL);

   long    ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0
+.5);
   double  ctaken   = (double) (cend2 - cstart1) / (double)CLOCKS_PER_
+SEC;

   double  ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER
+_SEC;
   double  ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER
+_SEC;

   std::cerr << "emplace set sort    CPU time : " << ctaken2s << " sec
+s\n";
   std::cerr << "write stdout        CPU time : " << ctaken2o << " sec
+s\n";

   std::cerr << "total               CPU time : " << ctaken   << " sec
+s\n";
   std::cerr << "total        wall clock time : " << ttaken   << " sec
+s\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment nex
+t line so process doesn't exit too quickly)
   //   std::this_thread::sleep_for(std::chrono::milliseconds(90000000
+));

   return 0;
}
[download]

Interim OpenMP Versions

Note: The interim versions below are just early (conservative) openmp-safe versions made thread-safe by expunging the dreaded strtok function.

The one with the most potential for multi-threading seems to be llil3vec.cpp because it uses vectors which can be safely used lock-free in thread-local vec_loc vectors, as shown by marioroy in llil4vec.cpp.

While replacing std::sort and std::stable_sort with __gnu_parallel versions to sort std::vectors is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below) because I think you'd need some sort of locking to safely allow multiple threads update a shared global std::map. Ideas welcome.

Interim version of llil3vec.cpp

// llil3vec.cpp.
// Vector version of llil2grt.pl.
// g++ compile on Linux:
//    g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
//    g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
// Example run: llil3vec big1.txt big2.txt big3.txt >vec.tmp

#include <cstdio>

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>
#include <set>

#include <utility>
#include <iterator>

#if defined(_OPENMP)
#include <omp.h>
#include <parallel/algorithm>
#else
#include <algorithm>
#include <execution>
#endif

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// -------------------------------------------------------------------
+---------

// Crude hack to see Windows Private Bytes in Task Manager by sleeping
+ at
// program end (see also sleep hack at end of main)
//    #include <chrono>
//    #include <thread>

// -------------------------------------------------------------------
+---------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l
+ength
// To use (limited length) fixed length strings uncomment the next lin
+e
// big.txt  max word length is 6
// long.txt max word length is 208
// Based on rough benchmarking, this short fixed string hack
// is only worth trying for MAX_STR_LEN_L up to about 64
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_type         = std::array<char, MAX_STR_LEN_L>;
#else
using str_type         = std::string;
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using int_str_type     = std::pair<llil_int_type, str_type>;

using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;

// Mimic the Perl get_properties subroutine --------------------------
+--

// Limit line length and use ANSI C functions to try to boost performa
+nce
#define MAX_LINE_LEN_L 255

static void get_properties(
   int                nfiles,      //  in: the number of input files
   char*              fname[],     //  in: the input file names
   vec_int_str_type&  vec_ret)     // out: a vector of properties
{
   FILE* fh;
   char line[MAX_LINE_LEN_L+1];
   char* found;
   llil_int_type count;

   for (int i = 0; i < nfiles; ++i) {
      fh = ::fopen(fname[i], "r");
      if (fh == NULL) {
         std::cerr << "Error opening '" << fname[i] << "' : errno=" <<
+ errno << "\n";
         continue;
      }
      while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
         found = ::strchr(line, '\t');
         count = ::atoll( &line[found - line + 1] );
         line[found - line] = '\0';  // word
#ifdef MAX_STR_LEN_L
         str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } };
         ::memcpy( fixword.data(), line, found - line );
         vec_ret.emplace_back( -count, fixword );
#else
         vec_ret.emplace_back( -count, line );
#endif
      }
      ::fclose(fh);
   }

   // Needs to be sorted by word for later sum of adjacent count field
+s to work
#if defined(_OPENMP)
   __gnu_parallel::sort(
#else
   std::sort(
   // std::sort(std::execution::par_unseq,
#endif
      vec_ret.begin(), vec_ret.end(),
      [](const int_str_type& left, const int_str_type& right) { return
+ left.second < right.second; }
   );
}

// -------------------------------------------------------------------
+--

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil3vec file1 file2 ... >out.txt\n";
      return 1;
   }

#ifdef MAX_STR_LEN_L
   std::cerr << "llil3vec (fixed string length=" << MAX_STR_LEN_L << "
+) start\n";
#else
   std::cerr << "llil3vec start\n";
#endif
   time_t  tstart1 = ::time(NULL);
   clock_t cstart1 = ::clock();

   // Create the vector of properties
   vec_int_str_type vec_ret;
   get_properties(argc - 1, &argv[1], vec_ret);

   clock_t cend1  = ::clock();
   double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SE
+C;
   std::cerr << "get_properties      CPU time : " << ctaken1 << " secs
+\n";

   clock_t cstart2 = ::clock();

   // To avoid calling sort(), create an inverted std::set container
   // Note: negative count gives desired ordering
   set_int_str_type myset;
   auto it = vec_ret.cbegin();
   str_type name_last  = it->second;
   llil_int_type count = it->first;
   for (++it; it != vec_ret.cend(); ++it) {
      if ( it->second == name_last ) {
         count += it->first;
      }
      else {
         myset.emplace_hint( myset.end(), count, name_last );
         name_last = it->second;
         count     = it->first;
      }
   }
   myset.emplace_hint( myset.end(), count, name_last );

   clock_t cend2s = ::clock();

   // Output the (already sorted) std::set - no sort() function requir
+ed
   // Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
   // If name is not NULL-terminated:
   for ( auto const& n : myset ) ::printf( "%.*s\t%lld\n", MAX_STR_LEN
+_L, n.second.data(), -n.first );
   // If name is NULL-terminated:
   // for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second.d
+ata(), -n.first );
   // for ( auto const& n : myset ) std::cout << n.second.data() << '\
+t' << -n.first << '\n';
#else
   // Can try printf vs std::cout to see which is faster
   for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second.c_st
+r(), -n.first );
   // for ( auto const& n : myset ) std::cout << n.second << '\t' << -
+n.first << '\n';
#endif

   clock_t cend2 = ::clock();
   time_t  tend2 = ::time(NULL);

   long    ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0
+.5);
   double  ctaken   = (double) (cend2 - cstart1) / (double)CLOCKS_PER_
+SEC;

   double  ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER
+_SEC;
   double  ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER
+_SEC;

   std::cerr << "emplace set sort    CPU time : " << ctaken2s << " sec
+s\n";
   std::cerr << "write stdout        CPU time : " << ctaken2o << " sec
+s\n";

   std::cerr << "total               CPU time : " << ctaken   << " sec
+s\n";
   std::cerr << "total        wall clock time : " << ttaken   << " sec
+s\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment nex
+t line so process doesn't exit too quickly)
   //   std::this_thread::sleep_for(std::chrono::milliseconds(90000000
+));

   return 0;
}
[download]

Timings of llil3vec

> g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt  >f.tmp
llil3vec (fixed string length=6) start
get_properties      CPU time : 1.47895 secs
emplace set sort    CPU time : 0.592783 secs
write stdout        CPU time : 0.832071 secs
total               CPU time : 2.90392 secs
total        wall clock time : 3 secs

real    0m3.217s
user    0m2.806s
sys     0m0.412s

> diff f.tmp vec.tmp

> g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp

> time ./llil3vec big1.txt big2.txt big3.txt  >f.tmp
llil3vec (fixed string length=6) start
get_properties      CPU time : 2.5809 secs
emplace set sort    CPU time : 0.793355 secs
write stdout        CPU time : 0.860855 secs
total               CPU time : 4.23521 secs
total        wall clock time : 3 secs

real    0m2.673s
user    0m4.073s
sys     0m0.465s

> diff f.tmp vec.tmp
[download]

For the -fopenmp version note that the Unix real time is lower, but the user time is higher.

Interim version of llil3grt.cpp

// llil3grt.cpp.
// Inspired by llilgrt.pl: improve sort performance via a negative cou
+nt.
// g++ compile on Linux:
//    g++ -o llil3grt -std=c++20 -Wall -O3 llil3grt.cpp
//    g++ -o llil3grt -std=c++20 -fopenmp -Wall -O3 llil3grt.cpp
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
// Example run: llil3grt tt1.txt tt2.txt tt3.txt >out.txt

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>

#include <string>
#include <array>
#include <vector>
#include <set>
#include <map>

#include <algorithm>
#include <utility>

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// -------------------------------------------------------------------
+---------

// Crude hack to see Windows Private Bytes in Task Manager by sleeping
+ at
// program end (see also sleep hack at end of main)
//    #include <chrono>
//    #include <thread>

// -------------------------------------------------------------------
+---------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l
+ength
// To use (limited length) fixed length strings uncomment the next lin
+e
// big.txt  max word length is 6
// long.txt max word length is 208
// Based on rough benchmarking, this short fixed string hack
// is only worth trying for MAX_STR_LEN_L up to about 64
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_type         = std::array<char, MAX_STR_LEN_L>;
#else
using str_type         = std::string;
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using int_str_type     = std::pair<llil_int_type, str_type>;

using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;
using map_str_int_type = std::map<str_type, llil_int_type>;

// Mimic the Perl get_properties subroutine --------------------------
+--

// Limit line length and use ANSI C functions to try to boost performa
+nce
#define MAX_LINE_LEN_L 255

static void get_properties(
   int                nfiles,      //  in: the number of input files
   char*              fname[],     //  in: the input file names
   map_str_int_type&  hash_ret)    // out: a hash of properties
{
   FILE* fh;
   char line[MAX_LINE_LEN_L+1];
   char* found;
   llil_int_type count;

   for (int i = 0; i < nfiles; ++i) {
      fh = ::fopen(fname[i], "r");
      if (fh == NULL) {
         std::cerr << "Error opening '" << fname[i] << "' : errno=" <<
+ errno << "\n";
         continue;
      }
      while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
         found = ::strchr(line, '\t');
         count = ::atoll( &line[found - line + 1] );
         line[found - line] = '\0';  // word
#ifdef MAX_STR_LEN_L
         str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } };
         ::memcpy( fixword.data(), line, found - line );
         hash_ret[fixword] -= count;
#else
         hash_ret[line] -= count;
#endif
      }
      ::fclose(fh);
   }
}

// -------------------------------------------------------------------
+--

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil3grt file1 file2 ... >out.txt\n";
      return 1;
   }

#ifdef MAX_STR_LEN_L
   std::cerr << "llil3grt (fixed string length=" << MAX_STR_LEN_L << "
+) start\n";
#else
   std::cerr << "llil3grt start\n";
#endif
#if defined(_OPENMP)
   std::cerr << "   - openmp version\n";
#endif
   time_t  tstart1 = ::time(NULL);
   clock_t cstart1 = ::clock();

   // Create the hash of properties
   map_str_int_type hash_ret;
   get_properties(argc - 1, &argv[1], hash_ret);

   clock_t cend1  = ::clock();
   double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SE
+C;
   std::cerr << "get_properties      CPU time : " << ctaken1 << " secs
+\n";

   clock_t cstart2 = ::clock();

   // To avoid calling sort(), try creating an inverted std::set conta
+iner
   // Note: negative count gives desired ordering (just like Perl GRT 
+sort :)
   set_int_str_type myset;
   for ( auto const& kv : hash_ret ) {
      // Note: emplace_hint() was a lot faster than emplace()
      myset.emplace_hint( myset.end(), kv.second, kv.first );
   }

   clock_t cend2s = ::clock();

   // Output the (already sorted) std::set - no sort() function requir
+ed!
   // Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
   // If name is not NULL-terminated:
   for ( auto const& n : myset ) ::printf( "%.*s\t%lld\n", MAX_STR_LEN
+_L, n.second.data(), -n.first );
   // If name is NULL-terminated:
   // for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second.d
+ata(), -n.first );
   // for ( auto const& n : myset ) std::cout << n.second.data() << '\
+t' << -n.first << '\n';
#else
   // Can try printf vs std::cout to see which is faster
   // for ( auto const& n : myset ) ::printf( "%s\t%lld\n", n.second.c
+_str(), -n.first );
   for ( auto const& n : myset ) std::cout << n.second << '\t' << -n.f
+irst << '\n';
#endif

   clock_t cend2 = ::clock();
   time_t  tend2 = ::time(NULL);

   long    ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0
+.5);
   double  ctaken   = (double) (cend2 - cstart1) / (double)CLOCKS_PER_
+SEC;

   double  ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER
+_SEC;
   double  ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER
+_SEC;

   std::cerr << "emplace set sort    CPU time : " << ctaken2s << " sec
+s\n";
   std::cerr << "write stdout        CPU time : " << ctaken2o << " sec
+s\n";

   std::cerr << "total               CPU time : " << ctaken   << " sec
+s\n";
   std::cerr << "total        wall clock time : " << ttaken   << " sec
+s\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment nex
+t line so process doesn't exit too quickly)
   //   std::this_thread::sleep_for(std::chrono::milliseconds(90000000
+));

   return 0;
}
[download]

Interim version of llil3m.cpp

// llil3m.cpp. C++ 11 version of Perl llil.pl.
// C++ version inspired by Mario's famous dualvar two-sort Perl soluti
+on.
// g++ compile on Linux:
//    g++ -o llil3m -std=c++20 -Wall -O3 llil3m.cpp
//    g++ -o llil3m -std=c++20 -fopenmp -Wall -O3 llil3m.cpp
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
// Example run: llil3m tt1.txt tt2.txt tt3.txt >out.txt

// Uncomment next line to mimic marioroy two sort trick (needs stable 
+sort)
#define MARIOROY_TWO_SORT_TRICK_L 1

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>

#include <string>
#include <array>
#include <vector>
#include <set>
#include <map>

#include <utility>
#include <iterator>

#if defined(_OPENMP)
#include <omp.h>
#include <parallel/algorithm>
#else
#include <algorithm>
#include <execution>
#endif

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// -------------------------------------------------------------------
+---------

// Crude hack to see Windows Private Bytes in Task Manager by sleeping
+ at
// program end (see also sleep hack at end of main)
//    #include <chrono>
//    #include <thread>

// For some performance hacks to speed up C++ I/O see:
//   https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fast
+_stdinstdout_io_suitable_for/
// The only one we use here is to prefer "\n" to std::endl to reduce s
+tdout flushing

// -------------------------------------------------------------------
+---------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l
+ength
// To use (limited length) fixed length strings uncomment the next lin
+e
// big.txt  max word length is 6
// long.txt max word length is 208
// Based on rough benchmarking, this short fixed string hack
// is only worth trying for MAX_STR_LEN_L up to about 64
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_type         = std::array<char, MAX_STR_LEN_L>;
#else
using str_type         = std::string;
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using int_str_type     = std::pair<llil_int_type, str_type>;

using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;
using map_str_int_type = std::map<str_type, llil_int_type>;

// Mimic the Perl get_properties subroutine --------------------------
+--

// Limit line length and use lower level ANSI C functions to try to bo
+ost I/O performance
#define MAX_LINE_LEN_L 255

static void get_properties(
   int                nfiles,      //  in: the number of input files
   char*              fname[],     //  in: the input file names
   map_str_int_type&  hash_ret)    // out: a hash of properties
{
   FILE* fh;
   char line[MAX_LINE_LEN_L+1];
   char* found;
   llil_int_type count;

   for (int i = 0; i < nfiles; ++i) {
      fh = ::fopen(fname[i], "r");
      if (fh == NULL) {
         std::cerr << "Error opening '" << fname[i] << "' : errno=" <<
+ errno << "\n";
         continue;
      }
      while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
         found = ::strchr(line, '\t');
         count = ::atoll( &line[found - line + 1] );
         line[found - line] = '\0';  // word
#ifdef MAX_STR_LEN_L
         str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } };
         ::memcpy( fixword.data(), line, found - line );
         hash_ret[fixword] -= count;
#else
         hash_ret[line] -= count;
#endif
      }
      ::fclose(fh);
   }
}

// -------------------------------------------------------------------
+--

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil3m file1 file2 ... >out.txt\n";
      return 1;
   }

#ifdef MAX_STR_LEN_L
   std::cerr << "llil3m (fixed string length=" << MAX_STR_LEN_L << ") 
+start\n";
#else
   std::cerr << "llil3m start\n";
#endif
#if defined(_OPENMP)
   std::cerr << "   - openmp version\n";
#endif
   time_t  tstart1 = ::time(NULL);
   clock_t cstart1 = ::clock();

   // Create the hash of properties
   map_str_int_type hash_ret;
   get_properties(argc - 1, &argv[1], hash_ret);

   clock_t cend1  = ::clock();
   double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SE
+C;
   std::cerr << "get_properties     CPU  time : " << ctaken1 << " secs
+\n";

   clock_t cstart2 = ::clock();
   // Create a vector to sort (can't sort a std::map directly)
   vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
   clock_t cend2v = ::clock();

#ifdef MARIOROY_TWO_SORT_TRICK_L

   // std::map is already sorted by string key so just stable_sort by 
+value
#if defined(_OPENMP)
   __gnu_parallel::stable_sort(
#else
   std::stable_sort(
#endif
      v.begin(), v.end(),
      [](const str_int_type& left, const str_int_type& right) { return
+ left.second < right.second; }
   );

#else

   // Sort descending by value, i.e. mimic this Perl code in C++:
   //   sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href
+}
   // No surprise that string key comparison is way slower than numeri
+c key comparison in std::sort()
#if defined(_OPENMP)
   __gnu_parallel::sort(
#else
   std::sort(
#endif
      v.begin(), v.end(),
      [](const str_int_type& left, const str_int_type& right) { return
+ left.second != right.second ? left.second < right.second : left.firs
+t < right.first; }
   );

#endif

   clock_t cend2s = ::clock();

#ifdef MAX_STR_LEN_L
   // If name is not NULL-terminated:
   for ( auto const& n : v ) ::printf( "%.*s\t%lld\n", MAX_STR_LEN_L, 
+n.first.data(), -n.second );
   // If name is NULL-terminated:
   // for ( auto const& n : v ) ::printf( "%s\t%lld\n", n.first.data()
+, -n.second );
   // for ( auto const& n : v ) std::cout << n.first.data() << '\t' <<
+ -n.second << '\n';
#else
   // Can try printf vs std::cout to see which is faster
   for ( auto const& n : v ) ::printf( "%s\t%lld\n", n.first.c_str(), 
+-n.second );
   // for ( auto const& n : v ) std::cout << n.first << '\t' << -n.sec
+ond << '\n';
#endif

   clock_t cend2    = ::clock();
   time_t  tend2    = ::time(NULL);

   long    ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0
+.5);
   double  ctaken   = (double) (cend2 - cstart1) / (double)CLOCKS_PER_
+SEC;

   double  ctaken2v = (double) (cend2v - cstart2) / (double)CLOCKS_PER
+_SEC;
   double  ctaken2s = (double) (cend2s - cend2v)  / (double)CLOCKS_PER
+_SEC;
   double  ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER
+_SEC;

   std::cerr << "vector copy        CPU  time : " << ctaken2v << " sec
+s\n";
#ifdef MARIOROY_TWO_SORT_TRICK_L
   std::cerr << "vector stable sort CPU  time : " << ctaken2s << " sec
+s\n";
#else
   std::cerr << "vector sort        CPU  time : " << ctaken2s << " sec
+s\n";
#endif
   std::cerr << "write  stdout      CPU  time : " << ctaken2o << " sec
+s\n";

   std::cerr << "total              CPU  time : " << ctaken  << " secs
+\n";
   std::cerr << "total             wall clock : " << ttaken  << " secs
+\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment nex
+t line so process doesn't exit too quickly)
   //   std::this_thread::sleep_for(std::chrono::milliseconds(90000000
+));

   return 0;
}
[download]

Timings

llil3grt (fixed string length=6) start
   - openmp version
get_properties      CPU time : 2.34487 secs
emplace set sort    CPU time : 0.942098 secs
write stdout        CPU time : 0.855157 secs
total               CPU time : 4.14223 secs
total        wall clock time : 4 secs

real    0m4.752s
user    0m4.241s
sys     0m0.511s
[download]

$ time ./llil3m big1.txt big2.txt big3.txt  >f.tmp
llil3m (fixed string length=6) start
   - openmp version
get_properties     CPU  time : 2.40089 secs
vector copy        CPU  time : 0.544294 secs
vector stable sort CPU  time : 5.84981 secs
write  stdout      CPU  time : 0.805509 secs
total              CPU  time : 9.6006 secs
total             wall clock : 4 secs

real    0m4.713s
user    0m9.238s
sys     0m0.679s
[download]

llil3m (fixed string length=6) start
   - openmp version
get_properties     CPU  time : 2.3031 secs
vector copy        CPU  time : 0.544613 secs
vector sort        CPU  time : 2.64047 secs
write  stdout      CPU  time : 0.791186 secs
total              CPU  time : 6.27946 secs
total             wall clock : 4 secs

real    0m4.182s
user    0m6.128s
sys     0m0.473s
[download]

$ time ./llil3m big1.txt big2.txt big3.txt  >f.tmp
llil3m (fixed string length=6) start
get_properties     CPU  time : 2.28072 secs
vector copy        CPU  time : 0.471632 secs
vector stable sort CPU  time : 0.735042 secs
write  stdout      CPU  time : 0.67514 secs
total              CPU  time : 4.16263 secs
total             wall clock : 4 secs

real    0m4.470s
user    0m4.159s
sys     0m0.311s

$ time ./llil3m big1.txt big2.txt big3.txt  >f.tmp
llil3m (fixed string length=6) start
get_properties     CPU  time : 2.30618 secs
vector copy        CPU  time : 0.473185 secs
vector sort        CPU  time : 1.13081 secs
write  stdout      CPU  time : 0.668702 secs
total              CPU  time : 4.57897 secs
total             wall clock : 4 secs

real    0m4.889s
user    0m4.558s
sys     0m0.331s

$ time ./llil3vec big1.txt big2.txt big3.txt  >f.tmp
llil3vec (fixed string length=6) start
get_properties      CPU time : 1.46864 secs
emplace set sort    CPU time : 0.630914 secs
write stdout        CPU time : 0.847912 secs
total               CPU time : 2.9476 secs
total        wall clock time : 3 secs

real    0m3.233s
user    0m2.852s
sys     0m0.381s

$ time ./llil3grt big1.txt big2.txt big3.txt  >f.tmp
llil3grt (fixed string length=6) start
get_properties      CPU time : 2.34418 secs
emplace set sort    CPU time : 0.901415 secs
write stdout        CPU time : 0.90025 secs
total               CPU time : 4.14595 secs
total        wall clock time : 5 secs

real    0m4.784s
user    0m4.232s
sys     0m0.551s
[download]

References Added Later

OpenMP Little Book

Updated: The source code for llil2vec.cpp was updated after the node was first created based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements added). Plus interim versions of llil3grt.cpp and llil3m.cpp. Added a few missing #include <array> and reference for OpenMP Little Book (thanks marioroy).

Comment on Re^3: Rosetta Code: Long List is Long (faster - vec - openmp) Select or Download Code

Replies are listed 'Best First'.
Re^4: Rosetta Code: Long List is Long (faster - vec - fast_io) by marioroy (Prior) on Jan 10, 2023 at 17:25 UTC
I'm not seeing a noticeable difference between the initial llil2vec demonstration and this one w.r.t. total time. So, I searched the web and tried fast_io with success. It requires a C++20 capable compiler. `git clone --depth 1 https://github.com/cppfastio/fast_io` [download] Add the header file above other includes and comment out the cstdio include line. `#include <fast_io.h> //#include <cstdio>` [download] Specify the path to the include folder. Also C++20. `clang++ -o llil2vec -std=c++20 -Wall -O3 llil2vec.cpp -I./fast_io/incl +ude` [download] Before C++11 results: `llil2vec (fixed string length=6) start get_properties CPU time : 1.62356 secs emplace set sort CPU time : 0.451718 secs write stdout CPU time : 0.923496 secs total CPU time : 2.99881 secs total wall clock time : 3 secs` [download] C++20 + fast_io results: `llil2vec (fixed string length=6) start get_properties CPU time : 1.50394 secs emplace set sort CPU time : 0.430272 secs write stdout CPU time : 0.889239 secs total CPU time : 2.82348 secs total wall clock time : 3 secs` [download]	[reply] [d/l] [select]
Re^5: Rosetta Code: Long List is Long (faster - vec - fast_io) by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 11:12 UTC
Thanks so much for finding this github `fast_io` library and for posting such clear instructions on how to use it! This made it ridiculously easy for me to play around with it. So far I haven't found any statistically significant speed-ups from employing this library (hardly surprising given the mega man years of effort poured into `g++` and `clang++`), but I will keep my eye on it.	[reply] [d/l] [select]
Re^6: Rosetta Code: Long List is Long (faster - vec - OpenMP) by marioroy (Prior) on Jan 11, 2023 at 16:51 UTC
So far I haven't found any statistically significant speed-ups from employing this library ... After closer look, it was the -std=c++20 language mode that enables faster vectors by 0.2 ~ 0.4 seconds versus -std=c++11. Update 1: Support variable length words. Update 2: Enable parallel sort. See Using Parallel Mode. I tried OpenMP. Unfortunately, strtok is not thread-safe e.g. strtok(NULL, "\n") causing segfault. So I factored out strtok. The OpenMP result improved by 0.1 seconds. That's because the actual reading is already fast. It takes 2 threads minimally to run faster than non-OpenMP results due to populating vec_rec from local copies. Building: `clang++ -o llil4vec -std=c++11 -Wall -O3 llil4vec.cpp clang++ -o llil4vec -std=c++20 -Wall -O3 llil4vec.cpp # faster # enable parallel via -fopenmp clang++ -o llil4vec-omp -std=c++11 -fopenmp -Wall -O3 llil4vec.cpp clang++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp # +faster` [download] Running - Real time results: `$ time ./llil4vec big1.txt big2.txt big3.txt >out.txt std:c++11: 2.901 secs std:c++20: 2.850 secs $ time OMP_NUM_THREADS=3 ./llil4vec-omp big1.txt big2.txt big3.txt >ou +t.txt std:c++11: 2.308 secs std:c++20: 2.267 secs` [download] llil4vec.cpp modification, OpenMP-aware: #if defined(_OPENMP) #include <omp.h> #include <parallel/algorithm> #else #include <algorithm> #endif ... static void get_properties( int nfiles, // in: the number of input files char* fname[], // in: the input file names vec_int_str_type& vec_ret) // out: a vector of properties { #if defined(_OPENMP) omp_set_dynamic(0); omp_set_max_active_levels(1); #pragma omp parallel { vec_int_str_type vec_loc; // thread local copy #pragma omp for nowait schedule(static,1) #endif for (int i = 0; i < nfiles; ++i) { char line[MAX_LINE_LEN_L+1]; char* found; llil_int_type count; FILE* fh = ::fopen(fname[i], "r"); if (fh == NULL) { std::cerr << "Error opening '" << fname[i] << "' : errno=" + << errno << "\n"; continue; } while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) { found = ::strchr(line, '\t'); count = ::atoll( &line[found - line + 1] ); line[found - line] = '\0'; // word #ifdef MAX_STR_LEN_L str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } +}; ::memcpy( fixword.data(), line, found - line ); #if defined(_OPENMP) vec_loc.emplace_back( -count, fixword ); #else vec_ret.emplace_back( -count, fixword ); #endif #else #if defined(_OPENMP) vec_loc.emplace_back( -count, line ); #else vec_ret.emplace_back( -count, line ); #endif #endif } ::fclose(fh); } #if defined(_OPENMP) #pragma omp critical vec_ret.insert(vec_ret.end(), vec_loc.begin(), vec_loc.end()); } #endif // Needs to be sorted by word for later sum of adjacent count field +s to work #if defined(_OPENMP) __gnu_parallel::sort( vec_ret.begin(), vec_ret.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; } ); #else std::sort( vec_ret.begin(), vec_ret.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; } ); #endif } [download]	[reply] [d/l] [select]
Re^7: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) by marioroy (Prior) on Jan 12, 2023 at 09:55 UTC
Re^8: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) by marioroy (Prior) on Jan 12, 2023 at 11:59 UTC
Re^8: Rosetta Code: Long List is Long (faster - vec - OpenMP code - tbb) by eyepopslikeamosquito (Archbishop) on Jan 17, 2023 at 03:23 UTC
Some notes below your chosen depth have not been shown here
Re^7: Rosetta Code: Long List is Long (faster - vec - OpenMP) by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 20:59 UTC
Re^8: Rosetta Code: Long List is Long (faster - vec - OpenMP) by marioroy (Prior) on Jan 11, 2023 at 22:07 UTC
Re^4: Rosetta Code: Long List is Long (faster - llil4map/2 - OpenMP code) by marioroy (Prior) on Jan 17, 2023 at 14:28 UTC
I tried the following for the interim version of llil3m.cpp. The overhead for running parallel is greater than I hoped for using std::map. Then, I found parallel hashmap. From the link, "By using parallel SSE2 instructions, these hashmaps are able to look up items by checking 16 slots in parallel." Running: $ time ./llil2grt big1.txt big2.txt big3.txt >out.txt llil2grt start get_properties CPU time : 3.55473 secs emplace set sort CPU time : 0.998423 secs write stdout CPU time : 0.854453 secs total CPU time : 5.40764 secs total wall clock time : 5 secs real 0m5.973s user 0m5.697s sys 0m0.292s $ NUM_THREADS=1 ./llil4map-no6-uno big1.txt big2.txt big3.txt >out.txt llil4map start use std::unordered_map don't use OpenMP use boost sort get properties 3.368 secs finish merging 1.202 secs vector stable sort 0.898 secs write stdout 0.261 secs total time 5.730 secs $ NUM_THREADS=1 ./llil4map-no6-std big1.txt big2.txt big3.txt >out.txt llil4map start use std::map don't use OpenMP use boost sort get properties 3.328 secs finish merging 0.657 secs vector stable sort 0.816 secs write stdout 0.268 secs total time 5.069 secs $ NUM_THREADS=1 ./llil4map-no6-flat big1.txt big2.txt big3.txt >out.tx +t llil4map start use phmap::parallel_flat_hash_map don't use OpenMP use boost sort get properties 1.565 secs map to vector 0.200 secs vector stable sort 0.665 secs write stdout 0.220 secs total time 2.652 secs [download] llil4map.cc Read more... (20 kB)	[reply] [d/l] [select]
Re^5: Rosetta Code: Long List is Long (faster - llil4map - OpenMP code) by eyepopslikeamosquito (Archbishop) on Feb 22, 2023 at 10:10 UTC
Nice work Mario! As you might expect, I couldn't restrain myself from playing around with a few minor changes: I gave boost::hash_range a try because there's less syntactic claptrap around the specialization of `std::hash` -- I've `#if 0`'ed it out though because it seems to be a touch slower than good old `std::hash`. Generally, I pull a face whenever I see a function updating a non-const global variable, so I changed the `get_properties` interface to pass `map_upd` as a function parameter. My usual simplification of parsing in `get_properties` by assuming the input file/s match the llil spec `^[a-z]+\t\d+$` (and without any long names when using `MAX_STR_LEN` :). // llil4map.cpp // https://perlmonks.com/?node_id=11149643 // OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book // // C++ version inspired by Mario's famous dualvar two-sort Perl soluti +on. // based on llil3m.cpp https://perlmonks.com/?node_id=11149482 // 1. Run get_properties in parallel. // 2. Capture time and diff via chrono. // 3. Threads: Flush local vector periodically; speed-up for std::m +ap. // 4. Key words null-terminated for MAX_STR_LEN_L. // 5. Concat strings using fast_io::concatln during output. // 6. Support NUM_THREADS environment variable. // 7. Added define for using boost's parallel sort. // 8. Removed parallel get_properties, not helpful due to overhead. // 9. Removed MAX_STR_LEN_L for simply std::map. // A. Replaced atoll with fast_atoll64. // B. Fast vector sorting - from llil5vec-tbb. // C. Re-enabled parallel get_properties. // D. Exit early if no work; fast_io tweak writing to output; // limit to 8 threads max for sorting. // E. Switch to parallel hashmap; using SSE2 instructions. // F. Support std::map, std::unordered_map, or parallel-hashmap via + define. // G. Refactored OpenMP code to merge immediately. // H. Improved get_properties; set precision for timings. // I. Support fixed-length key words using basic_static_string. // J. Remove support for std::map and std::unordered_map. // K. Opt-in to use (limited length) fixed length strings. // Uncomment line #define MAX_STR_LEN_L and update value. // L. Improved the fixed length path with custom default_hash/eq fu +nctions. // M. Define str_type struct as std::array, overriding == and < ope +rators. // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Obtain the parallel hashmap library (required dependency): // git clone --depth=1 https://github.com/greg7mdp/parallel-hashmap // // g++ compile on Linux: (boost header may need the -Wno-stringop-over +flow gcc option) // g++ -o llil4map -std=c++20 -Wall -O3 llil4map.cpp -I ./fast_io/i +nclude -I ./parallel-hashmap // g++ -o llil4map-omp -std=c++20 -fopenmp -Wall -O3 llil4map.cpp - +I ./fast_io/include -I ./parallel-hashmap // // This g++ command also works with mingw C++ compiler (https://source +forge.net/projects/mingw-w64) // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex +e). // // clang++ compile: same args, without the -Wno-stringop-overflow opti +on // Seems to run slightly faster when compiled with clang++ instead of +g++ // // Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/ +?node_id=11148681 // perl gen-llil.pl big1.txt 200 3 1 // perl gen-llil.pl big2.txt 200 3 1 // perl gen-llil.pl big3.txt 200 3 1 // perl gen-long-llil.pl long1.txt 600 // perl gen-long-llil.pl long2.txt 600 // perl gen-long-llil.pl long3.txt 600 // // To make random input, obtain shuffle.pl from https://perlmonks.com/ +?node_id=11149800 // perl shuffle.pl big1.txt >tmp && mv tmp big1.txt // perl shuffle.pl big2.txt >tmp && mv tmp big2.txt // perl shuffle.pl big3.txt >tmp && mv tmp big3.txt // // Example run: llil4map big1.txt big2.txt big3.txt >out.txt // NUM_THREADS=3 llil4map-omp ... // Note: Lesser NUM_THREADS is preferred, else merge time incr +eases // Start with MIN( CPU cores, 1 + log2( number of input +files ) ) // ------------------------------------------------------------------- +--------- // Specify 0/1 to use boost's parallel sorting algorithm; faster than +__gnu_parallel::sort. // https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/paral +lel.html // This requires the boost header files: e.g. devpkg-boost bundle on C +lear Linux. // Note: Another option is downloading and unpacking Boost locally. // (no need to build it because the bits we use are header file only) #define USE_BOOST_PARALLEL_SORT 1 #include <chrono> #include <thread> // The fast_io header must come after chrono, else build error: // "no member named 'concatln' in namespace 'fast_io'" #include <fast_io.h> #include <parallel_hashmap/phmap.h> #include <cstdio> #include <cstddef> #include <cstdint> #include <cstdlib> #include <cstring> #include <ctime> #include <string> #include <array> #include <vector> #include <utility> #include <iterator> #include <execution> #include <algorithm> #include <boost/container_hash/hash.hpp> #if USE_BOOST_PARALLEL_SORT > 0 #include <boost/sort/sort.hpp> #endif #ifdef _OPENMP #include <omp.h> #endif #include <iostream> #include <iomanip> #include <fstream> #include <sstream> static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne +ed a 64-bit compile"); // ------------------------------------------------------------------- +--------- typedef long long llil_int_type; // All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length. // big.txt max word length is 6 // long.txt max word length is 208 // // Based on rough benchmarking, the short fixed string hack below is o +nly // worth trying for MAX_STR_LEN_L up to about 22. // See also https://backlinko.com/google-keyword-study // // To use (limited length) fixed length strings uncomment the next lin +e. #define MAX_STR_LEN_L 6 #ifdef MAX_STR_LEN_L struct str_type : std::array<char, MAX_STR_LEN_L> { bool operator==( const str_type& o ) const { return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0; } bool operator<( const str_type& o ) const { return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0; } }; #else struct str_type : std::basic_string<char> { bool operator==( const str_type& o ) const { return ::strcmp(this->data(), o.data()) == 0; } bool operator<( const str_type& o ) const { return ::strcmp(this->data(), o.data()) < 0; } }; #endif using str_int_type = std::pair<str_type, llil_int_type>; using vec_str_int_type = std::vector<str_int_type>; // inject specialization of std::hash for str_type into namespace std namespace std { template<> struct hash<str_type> { std::size_t operator()( str_type const& v ) const noexcept { #if 0 return boost::hash_range( v.cbegin(), v.cend() ); #else std::basic_string_view<char> bv { reinterpret_cast<const char>(v.data()), v.size() sizeof +(char) }; return std::hash<std::basic_string_view<char>>()(bv); #endif } }; } // create the parallel_flat_hash_map without internal mutexes using map_str_int_type = phmap::parallel_flat_hash_map< str_type, llil_int_type, phmap::priv::hash_default_hash<str_type>, phmap::priv::hash_default_eq<str_type>, phmap::priv::Allocator<phmap::priv::Pair<const str_type, llil_int_t +ype>>, 8, phmap::NullMutex >; // fast_atoll64 // https://stackoverflow.com/questions/16826422/ // c-most-efficient-way-to-convert-string-to-int-faster-than-atoi inline int64_t fast_atoll64( const char* str ) { int64_t val = 0; // int sign = 0; // if ( str == '-' ) { // sign = 1, ++str; // } uint8_t digit; while ((digit = uint8_t(str++ - '0')) <= 9) val = val * 10 + digit +; // return sign ? -val : val; return val; } // Mimic the Perl get_properties subroutine -------------------------- +-- // Limit line length and use ANSI C functions to try to boost performa +nce #define MAX_LINE_LEN_L 255 #define MAX_THREADS 32 static void get_properties( const char* fname, // in : the input file name map_str_int_type& map_upd // inout: the map to be updated ) { FILE* fh; std::array<char, MAX_LINE_LEN_L + 1> line; char* found; llil_int_type count; fh = ::fopen(fname, "r"); if ( fh == NULL ) { std::cerr << "Error opening '" << fname << "' : " << strerror(er +rno) << "\n"; return; } while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh +) != NULL ) { found = std::find( line.begin(), line.end(), '\t' ); count = fast_atoll64( found + 1 ); // Note: using emplace() is faster than map_upd[fixword] += coun +t; #ifdef MAX_STR_LEN_L str_type fixword {}; // {} initializes all elements of fixword +to '\0' std::copy( line.begin(), found, fixword.begin() ); const auto [it, success] = map_upd.emplace(fixword, count); #else found = '\0'; const auto [it, success] = map_upd.emplace(str_type{ line.data() + }, count); #endif if ( !success ) it->second += count; } ::fclose(fh); } typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) 1e-3; } // ------------------------------------------------------------------- +-- static map_str_int_type Map[MAX_THREADS]; int main(int argc, char* argv[]) { if (argc < 2) { std::cerr << "usage: llil4map file1 file2 ... >out.txt\n"; return 1; } std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); #ifdef MAX_STR_LEN_L std::cerr << "llil4map (fixed string length=" << MAX_STR_LEN_L << " +) start\n"; #else std::cerr << "llil4map start\n"; #endif #ifdef _OPENMP std::cerr << "use OpenMP\n"; #else std::cerr << "don't use OpenMP\n"; #endif #if USE_BOOST_PARALLEL_SORT == 0 std::cerr << "don't use boost sort\n"; #else std::cerr << "use boost sort\n"; #endif time_point cstart1, cend1, cstart2, cend2, cstart3, cend3s, cend3; cstart1 = high_resolution_clock::now(); #ifdef _OPENMP // Determine the number of threads. const char* env_nthrs = std::getenv("NUM_THREADS"); int nthrs = ( env_nthrs && strlen(env_nthrs) ) ? ::atoi(env_nthrs) : std::thread::hardware_concurrency(); if ( nthrs > MAX_THREADS ) nthrs = MAX_THREADS; omp_set_dynamic(false); omp_set_num_threads(nthrs); #else int nthrs = 1; #endif int nthrs_sort = ( std::thread::hardware_concurrency() < 12 ) ? std::thread::hardware_concurrency() : 12; // Get the list of input files from the command line int nfiles = argc - 1; char fname = &argv[1]; int merge_id = 0; // Run parallel, depending on the number of threads if ( nthrs == 1 \|\| nfiles == 1 ) { for ( int i = 0; i < nfiles; ++i ) { get_properties(fname[i], Map[0]); } cend1 = high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "get properties " << std::setw(8) << ctaken1 < +< " secs\n"; cstart2 = high_resolution_clock::now(); } #ifdef _OPENMP else { merge_id = -1; int nfiles_processed = 0; int gettime_rendered = 0; #pragma omp parallel { int tid = omp_get_thread_num(); #pragma omp for nowait schedule(static, 1) for ( int i = 0; i < nfiles; ++i ) { get_properties(fname[i], Map[tid]); #pragma omp atomic nfiles_processed += 1; } #pragma omp critical { // report time for get properties if ( nfiles_processed == nfiles && !gettime_rendered ) { cend1 = high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "get properties " << std::setw(8) << +ctaken1 << " secs\n"; cstart2 = high_resolution_clock::now(); gettime_rendered = 1; } // merge array if ( Map[tid].size() > 0 ) { if ( merge_id < 0 ) { // container id other threads merge to merge_id = tid; } else { for ( auto const& x : Map[tid] ) { // Map[merge_id][x.first] += x.second; const auto [it, success] = Map[merge_id].emplace( +x.first, x.second); if ( !success ) it->second += x.second; } Map[tid].clear(); } } } } } #endif if ( merge_id < 0 \|\| Map[merge_id].size() == 0 ) { std::cerr << "No work, exiting...\n"; return 1; } // Store the properties into a vector vec_str_int_type propvec( Map[merge_id].begin(), Map[merge_id].end( +) ); Map[merge_id].clear(); cend2 = high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "finish merging " << std::setw(8) << ctaken2 << " + secs\n"; cstart3 = high_resolution_clock::now(); // Sort the vector by (count) in reverse order, (name) in lexical o +rder #if USE_BOOST_PARALLEL_SORT == 0 std::sort( // Standard sort propvec.begin(), propvec.end(), [](const str_int_type& left, const str_int_type& right) { return left.second != right.second ? left.second > right.second : left.first < right.first; } ); #else boost::sort::block_indirect_sort( // Parallel sort propvec.begin(), propvec.end(), [](const str_int_type& left, const str_int_type& right) { return left.second != right.second ? left.second > right.second : left.first < right.first; }, nthrs_sort ); #endif cend3s = high_resolution_clock::now(); // Output the sorted vector #ifdef MAX_STR_LEN_L for ( auto const& n : propvec ) ::print(fast_io::concatln(fast_io::mnp::os_c_str(n.first.data(), + MAX_STR_LEN_L), "\t", n.second)); #else for ( auto const& n : propvec ) ::print(fast_io::concatln(n.first, "\t", n.second)); #endif cend3 = high_resolution_clock::now(); double ctaken = elaspe_time(cend3, cstart1); double ctaken3s = elaspe_time(cend3s, cstart3); double ctaken3o = elaspe_time(cend3, cend3s); std::cerr << "vector stable sort " << std::setw(8) << ctaken3s << +" secs\n"; std::cerr << "write stdout " << std::setw(8) << ctaken3o << +" secs\n"; std::cerr << "total time " << std::setw(8) << ctaken << +" secs\n"; // Hack to see Private Bytes in Windows Task Manager // (uncomment next line so process doesn't exit too quickly) // std::this_thread::sleep_for(milliseconds(9000)); return 0; } [download] Parallel HashMap References Added Later** Solving the Long List is Long challenge, finally? by marioroy (July 2023) Re^12: Risque Romantic Rosetta Roman Race - All in One - OpenMP by marioroy (May 2023) Re^2: Rosetta Code: Long List is Long - JudySL summary by marioroy (Jan 2023) Comprehensive C++ Hashmap Benchmarks 2022 by Martin Leitner (ankerl) emhash7::HashMap by ktprime (hyb) parallel-hashmap question from mario with replies from Gregory Popovitch (June 2023) 2024 update: marioroy noted that `std::mutex` is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web.	[reply] [d/l] [select]
Re^6: Rosetta Code: Long List is Long (faster - llil4map - OpenMP code) by marioroy (Prior) on Mar 09, 2023 at 08:40 UTC
I took a break since you had posted. Recently, I revised the map implementation to handle hundreds of files with lesser memory consumption. The llil4map variant may handle Chuma's use-case; see my reply to Chuma. There is also the mcesort implementation, a GNU parallel (parsort variant). It includes a mini-MCE (less than 1,150 lines) integrated into the script. The application itself is around 500 lines. It runs on Unix only.	[reply]


Perl Monk, Perl Meditation
	PerlMonks

Re^3: Rosetta Code: Long List is Long (faster - vec - openmp)