Java 8 times faster with arrays than std::vector in C++. What did I do wrong? Java 8 times faster with arrays than std::vector in C++. What did I do wrong? arrays arrays

Java 8 times faster with arrays than std::vector in C++. What did I do wrong?


Yep, the cache in the c++ version takes a hammering. It seems the JIT is better equipped to handle this.

If you change the outer for in isUpdateNeeded() to shorter snippets. The difference goes away.

The sample below produces a 4x speedup.

void isUpdateNeeded() {    for (int i = 0; i < numberOfCells; ++i) {        h[i] =  h[i] + 1;        floodedCells[i] =  !floodedCells[i];        floodedCellsTimeInterval[i] =  !floodedCellsTimeInterval[i];        qInflow[i] =  qInflow[i] + 1;        qStartTime[i] =  qStartTime[i] + 1;        qEndTime[i] =  qEndTime[i] + 1;    }    for (int i = 0; i < numberOfCells; ++i) {        lowerFloorCells[i] =  lowerFloorCells[i] + 1;        cellLocationX[i] =  cellLocationX[i] + 1;        cellLocationY[i] =  cellLocationY[i] + 1;        cellLocationZ[i] =  cellLocationZ[i] + 1;        levelOfCell[i] =  levelOfCell[i] + 1;        valueOfCellIds[i] =  valueOfCellIds[i] + 1;        h0[i] =  h0[i] + 1;        vU[i] =  vU[i] + 1;        vV[i] =  vV[i] + 1;        vUh[i] =  vUh[i] + 1;        vVh[i] =  vVh[i] + 1;    }    for (int i = 0; i < numberOfCells; ++i) {        vUh0[i] =  vUh0[i] + 1;        vVh0[i] =  vVh0[i] + 1;        ghh[i] =  ghh[i] + 1;        sfx[i] =  sfx[i] + 1;        sfy[i] =  sfy[i] + 1;        qIn[i] =  qIn[i] + 1;        for(int j = 0; j < nEdges; ++j) {            neighborIds[i * nEdges + j] = neighborIds[i * nEdges + j] + 1;        }        for(int j = 0; j < nEdges; ++j) {            typeInterface[i * nEdges + j] = typeInterface[i * nEdges + j] + 1;        }    }}

This shows to a reasonable degree that cache misses are the reason for the slowdown. It is also important to note that the variables are not dependent so a threaded solution is easily created.

Order restored

As per stefans comment I tried grouping them in a struct using the original sizes. This removes the immediate cache pressure in a similar fashion. The result is that the c++ (CCFLAG -O3) version is about 15% faster than the java version.

Varning neither short nor pretty.

#include <vector>#include <cmath>#include <iostream>   class FloodIsolation {    struct item{      char floodedCells;      char floodedCellsTimeInterval;      double valueOfCellIds;      double h;      double h0;      double vU;      double vV;      double vUh;      double vVh;      double vUh0;      double vVh0;      double sfx;      double sfy;      double qInflow;      double qStartTime;      double qEndTime;      double qIn;      double nx;      double ny;      double ghh;      double floorLevels;      int lowerFloorCells;      char flagInterface;      char floorCompletelyFilled;      double cellLocationX;      double cellLocationY;      double cellLocationZ;      int levelOfCell;    };    struct inner_item{      int typeInterface;      int neighborIds;    };    std::vector<inner_item> inner_data;    std::vector<item> data;public:    FloodIsolation() :            numberOfCells(20000), inner_data(numberOfCells * nEdges), data(numberOfCells)   {    }    ~FloodIsolation(){    }     void isUpdateNeeded() {        for (int i = 0; i < numberOfCells; ++i) {            data[i].h = data[i].h + 1;            data[i].floodedCells = !data[i].floodedCells;            data[i].floodedCellsTimeInterval = !data[i].floodedCellsTimeInterval;            data[i].qInflow = data[i].qInflow + 1;            data[i].qStartTime = data[i].qStartTime + 1;            data[i].qEndTime = data[i].qEndTime + 1;            data[i].lowerFloorCells = data[i].lowerFloorCells + 1;            data[i].cellLocationX = data[i].cellLocationX + 1;            data[i].cellLocationY = data[i].cellLocationY + 1;            data[i].cellLocationZ = data[i].cellLocationZ + 1;            data[i].levelOfCell = data[i].levelOfCell + 1;            data[i].valueOfCellIds = data[i].valueOfCellIds + 1;            data[i].h0 = data[i].h0 + 1;            data[i].vU = data[i].vU + 1;            data[i].vV = data[i].vV + 1;            data[i].vUh = data[i].vUh + 1;            data[i].vVh = data[i].vVh + 1;            data[i].vUh0 = data[i].vUh0 + 1;            data[i].vVh0 = data[i].vVh0 + 1;            data[i].ghh = data[i].ghh + 1;            data[i].sfx = data[i].sfx + 1;            data[i].sfy = data[i].sfy + 1;            data[i].qIn = data[i].qIn + 1;            for(int j = 0; j < nEdges; ++j) {                inner_data[i * nEdges + j].neighborIds = inner_data[i * nEdges + j].neighborIds + 1;                inner_data[i * nEdges + j].typeInterface = inner_data[i * nEdges + j].typeInterface + 1;            }        }     }     static const int nEdges;private:     const int numberOfCells;}; const int FloodIsolation::nEdges = 6;int main() {    FloodIsolation isolation;    clock_t start = clock();    for (int i = 0; i < 4400; ++i) {        if(i % 100 == 0) {            std::cout << i << "\n";        }        isolation.isUpdateNeeded();    }    clock_t stop = clock();    std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";}                                                                              

My result differs slightly from Jerry Coffins for the original sizes. For me the differences remains. It might well be my java version, 1.7.0_75.


Here is the C++ version with the per-node data gathered into a structure, and a single vector of that structure used:

#include <vector>#include <cmath>#include <iostream>class FloodIsolation {public:  FloodIsolation() :      numberOfCells(20000),      data(numberOfCells)  {  }  ~FloodIsolation(){  }  void isUpdateNeeded() {    for (int i = 0; i < numberOfCells; ++i) {       data[i].h = data[i].h + 1;       data[i].floodedCells = !data[i].floodedCells;       data[i].floodedCellsTimeInterval = !data[i].floodedCellsTimeInterval;       data[i].qInflow = data[i].qInflow + 1;       data[i].qStartTime = data[i].qStartTime + 1;       data[i].qEndTime = data[i].qEndTime + 1;       data[i].lowerFloorCells = data[i].lowerFloorCells + 1;       data[i].cellLocationX = data[i].cellLocationX + 1;       data[i].cellLocationY = data[i].cellLocationY + 1;       data[i].cellLocationZ = data[i].cellLocationZ + 1;       data[i].levelOfCell = data[i].levelOfCell + 1;       data[i].valueOfCellIds = data[i].valueOfCellIds + 1;       data[i].h0 = data[i].h0 + 1;       data[i].vU = data[i].vU + 1;       data[i].vV = data[i].vV + 1;       data[i].vUh = data[i].vUh + 1;       data[i].vVh = data[i].vVh + 1;       data[i].vUh0 = data[i].vUh0 + 1;       data[i].vVh0 = data[i].vVh0 + 1;       data[i].ghh = data[i].ghh + 1;       data[i].sfx = data[i].sfx + 1;       data[i].sfy = data[i].sfy + 1;       data[i].qIn = data[i].qIn + 1;      for(int j = 0; j < nEdges; ++j) {        data[i].flagInterface[j] = !data[i].flagInterface[j];        data[i].typeInterface[j] = data[i].typeInterface[j] + 1;        data[i].neighborIds[j] = data[i].neighborIds[j] + 1;      }    }  }private:  const int numberOfCells;  static const int nEdges = 6;  struct data_t {    bool floodedCells = 0;    bool floodedCellsTimeInterval = 0;    double valueOfCellIds = 0;    double h = 0;    double h0 = 0;    double vU = 0;    double vV = 0;    double vUh = 0;    double vVh = 0;    double vUh0 = 0;    double vVh0 = 0;    double ghh = 0;    double sfx = 0;    double sfy = 0;    double qInflow = 0;    double qStartTime = 0;    double qEndTime = 0;    double qIn = 0;    double nx = 0;    double ny = 0;    double floorLevels = 0;    int lowerFloorCells = 0;    bool floorCompleteleyFilled = 0;    double cellLocationX = 0;    double cellLocationY = 0;    double cellLocationZ = 0;    int levelOfCell = 0;    bool flagInterface[nEdges] = {};    int typeInterface[nEdges] = {};    int neighborIds[nEdges] = {};  };  std::vector<data_t> data;};int main() {  std::ios_base::sync_with_stdio(false);  FloodIsolation isolation;  clock_t start = clock();  for (int i = 0; i < 400; ++i) {    if(i % 100 == 0) {      std::cout << i << "\n";    }    isolation.isUpdateNeeded();  }  clock_t stop = clock();  std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";}

live example

The time is now 2x the speed of the Java version. (846 vs 1631).

Odds are the JIT noticed the cache burning of accessing data all over the place, and transformed your code into a logically similar but more efficient order.

I also turned off stdio synchronization, as that is only needed if you mix printf/scanf with C++ std::cout and std::cin. As it happens, you only print out a few values, but C++'s default behavior for printing is overly paranoid and inefficient.

If nEdges is not an actual constant value, then the 3 "array" values will have to be stripped out of the struct. That shouldn't cause a huge performance hit.

You might be able to get another performance boost by sorting the values in that struct by decreasing size, thus reducing memory footprint (and sorting access as well when it doesn't matter). But I am unsure.

A rule of thumb is that a single cache miss is 100x more expensive than an instruction. Arranging your data to have cache coherency has lots of value.

If rearranging the data into a struct is infeasible, you can change your iteration to be over each container in turn.

As an aside, note that the Java and C++ versions had some subtle differences in them. The one I spotted was that the Java version has 3 variables in the "for each edge" loop, while the C++ one only had 2. I made mine match the Java. I don't know if there are others.


As @Stefan guessed in a comment on @CaptainGiraffe's answer, you gain quite a bit by using a vector of structs instead of a struct of vectors. Corrected code looks like this:

#include <vector>#include <cmath>#include <iostream>#include <time.h>class FloodIsolation {public:    FloodIsolation() :            h(0),            floodedCells(0),            floodedCellsTimeInterval(0),            qInflow(0),            qStartTime(0),            qEndTime(0),            lowerFloorCells(0),            cellLocationX(0),            cellLocationY(0),            cellLocationZ(0),            levelOfCell(0),            valueOfCellIds(0),            h0(0),            vU(0),            vV(0),            vUh(0),            vVh(0),            vUh0(0),            vVh0(0),            ghh(0),            sfx(0),            sfy(0),            qIn(0),            typeInterface(nEdges, 0),            neighborIds(nEdges, 0)    {    }    ~FloodIsolation(){    }    void Update() {        h =  h + 1;        floodedCells =  !floodedCells;        floodedCellsTimeInterval =  !floodedCellsTimeInterval;        qInflow =  qInflow + 1;        qStartTime =  qStartTime + 1;        qEndTime =  qEndTime + 1;        lowerFloorCells =  lowerFloorCells + 1;        cellLocationX =  cellLocationX + 1;        cellLocationY =  cellLocationY + 1;        cellLocationZ =  cellLocationZ + 1;        levelOfCell =  levelOfCell + 1;        valueOfCellIds =  valueOfCellIds + 1;        h0 =  h0 + 1;        vU =  vU + 1;        vV =  vV + 1;        vUh =  vUh + 1;        vVh =  vVh + 1;        vUh0 =  vUh0 + 1;        vVh0 =  vVh0 + 1;        ghh =  ghh + 1;        sfx =  sfx + 1;        sfy =  sfy + 1;        qIn =  qIn + 1;        for(int j = 0; j < nEdges; ++j) {            ++typeInterface[j];            ++neighborIds[j];        }           }private:    static const int nEdges = 6;    bool floodedCells;    bool floodedCellsTimeInterval;    std::vector<int> neighborIds;    double valueOfCellIds;    double h;    double h0;    double vU;    double vV;    double vUh;    double vVh;    double vUh0;    double vVh0;    double ghh;    double sfx;    double sfy;    double qInflow;    double qStartTime;    double qEndTime;    double qIn;    double nx;    double ny;    double floorLevels;    int lowerFloorCells;    bool flagInterface;    std::vector<int> typeInterface;    bool floorCompleteleyFilled;    double cellLocationX;    double cellLocationY;    double cellLocationZ;    int levelOfCell;};int main() {    std::vector<FloodIsolation> isolation(20000);    clock_t start = clock();    for (int i = 0; i < 400; ++i) {        if(i % 100 == 0) {            std::cout << i << "\n";        }        for (auto &f : isolation)            f.Update();    }    clock_t stop = clock();    std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";}

Compiled with the compiler from VC++ 2015 CTP, using -EHsc -O2b2 -GL -Qpar, I get results like:

0100200300Time: 0.135

Compiling with g++ produces a result that's slightly slower:

0100200300Time: 0.156

On the same hardware, using the compiler/JVM from Java 8u45, I get results like:

0100200300Time: 181

This is around 35% slower than the version from VC++, and about 16% slower than the version from g++.

If we increase the number of iterations to the desired 2000, the difference drops to only 3%, suggesting that part of the advantage of C++ in this case is simply faster loading (a perennial problem with Java), not really in execution itself. This doesn't strike me as surprising in this case--the computation being measured (in the posted code) is so trivial that I doubt most compilers can do a whole lot to optimize it.