Performance problems can rarely be solved with "technology" alone; they are usually a compound problem of architecture, testing, assumptions, and so on. Still, an engineer must start small and knock out the "obvious" little problems first. Otherwise, small things add up to big ones: as the proverb goes, a dam of a thousand miles can be breached by a single ant hole.
Why does C++ consistently rank behind C in performance (see the latest results from sites such as http://benchmarksgame.alioth.debian.org/u32/performance.php?test=binarytrees)? I see three reasons:
1) The C++ compilers used for the tests do not employ the latest optimization techniques.
2) The tests do not account for the added value C++ provides.
3) The application-level "subtlety" of C++ (see my other blog posts on C++) often pushes ordinary programmers away from the "textbook use cases", so some side effects are never eliminated at the application level.
I remember that more than ten years ago, while developing at Microsoft, I consulted Stan Lippman, author of the earliest C++ compiler (then the Microsoft VC++ architect), about a series of C++ performance problems in our group. With his help, we applied techniques such as inlining and RVO in key places, which completely solved the performance problems and also uncovered several major bugs in VC++. I realized that most C++ performance problems stem from our shallow understanding of C++, and most of them are not hard to solve.
Let’s use an example to compare and see how some subtle details affect program performance.
struct intPair
{
int ip1;
int ip2;
intPair(int i1, int i2) : ip1(i1), ip2(i2) {}
intPair(int i1) : ip1(i1), ip2(i1) {}
};
// Calc sum (using value semantics)
int
Sum1(intPair p)
{
    return p.ip1 + p.ip2;
}
// Calc sum (using reference semantics)
int
Sum2(intPair &p)
{
    return p.ip1 + p.ip2;
}
// Calc sum (using const reference semantics)
int
Sum3(const intPair& p)
{
    return p.ip1 + p.ip2;
}
The simple struct above has three Sum functions that do exactly the same thing, but is the performance the same? We use the following program to test:
double Sum(int t, int loop)
{
    using namespace std;
    if (t == 1)
    {
        clock_t begin = clock();
        int x = 0;
        for (int i = 0; i < loop; ++i) { x += Sum1(intPair(1, 2)); }
        clock_t end = clock();
        return double(end - begin) / CLOCKS_PER_SEC;
    }
    else if (t == 2)
    {
        clock_t begin = clock();
        int x = 0;
        intPair p(1, 2);
        for (int i = 0; i < loop; ++i) { x += Sum1(p); }
        clock_t end = clock();
        return double(end - begin) / CLOCKS_PER_SEC;
    }
    else if (t == 3)
    {
        clock_t begin = clock();
        int x = 0;
        intPair p(1, 2);
        for (int i = 0; i < loop; ++i) { x += Sum2(p); }
        clock_t end = clock();
        return double(end - begin) / CLOCKS_PER_SEC;
    }
    else if (t == 4)
    {
        clock_t begin = clock();
        int x = 0;
        intPair p(1, 2);
        for (int i = 0; i < loop; ++i) { x += Sum3(p); }
        clock_t end = clock();
        return double(end - begin) / CLOCKS_PER_SEC;
    }
    else if (t == 5)
    {
        clock_t begin = clock();
        int x = 0;
        for (int i = 0; i < loop; ++i) { x += Sum3(10); }
        clock_t end = clock();
        return double(end - begin) / CLOCKS_PER_SEC;
    }
    return 0;
}

We use five cases: Sum1 and Sum3 are each called in two ways, Sum2 in one way. We timed 100,000 calls of each:

double sec = Sum(1, 100000);
printf("Sum1 (use ctor) time: %f\n", sec);
sec = Sum(2, 100000);
printf("Sum1 (use no c'tor) time: %f\n", sec);
sec = Sum(3, 100000);
printf("Sum2 time: %f\n", sec);
sec = Sum(4, 100000);
printf("Sum3 without conversion time: %f\n", sec);
sec = Sum(5, 100000);
printf("Sum3 with conversion time: %f\n", sec);

Testing in Visual Studio 2010, the results were:

Case 1: 18ms
Case 2: 9ms
Case 3: 6ms
Case 4: 7ms
Case 5: 12ms

That is, cases 1 and 5 are the slowest; the others are essentially indistinguishable. A careful reader will notice:
1) Case 5's performance problem arises because Sum3 uses C++'s implicit conversion, silently turning the integer into a temporary intPair. This is an application-level issue: if we must make that conversion, we must pay its performance cost.
2) Case 1's problem is similar to case 5's: both must create a temporary on every call. Of course, forcing the constructor inline can lower the cost of generating the temporary.
3) Case 2 invokes the compiler-generated copy constructor before each call, but because an intPair object is tiny, the impact is negligible.
4) Case 3's performance is stable, but it uses "indirection" (see my blog post on references), so it generates two more instructions than case 2. The performance impact is nonetheless small, presumably thanks to Intel's L1/L2 caches. *Note that an OOP member function that only accesses data through this can generally exploit the cache well, unless the object is too large.
5) Case 4 generates exactly the same code as case 3, so there should be no difference. const matters only at compile time; the generated code is identical with or without it.

There is far more to say about performance; this article only scratches the surface, but it has already touched C++'s two biggest performance traps:
a) temporary variables
b) implicit conversion (silent conversion)

2014-6-20, Seattle

I don't have time to write the whole program for you, but once I give you the idea it should click.
First, copy the three sorting examples from the book into three functions. In each function, take the system time once at the start and once at the end, subtract the two, and print the elapsed time. I modified a program found online and integrated it; take a look:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define M 50
#define MAX 100000
typedef struct
{
int weight;//Node weight
int parent,lchild,rchild;
}HTNODE,*HUFFMANTREE;
typedef char** HUFFMANCODE;//Dynamically allocate array to store Huffman coding table
typedef struct
{
int key; /* Keywords*/
}RecordNode; /*Type of sorting node*/
typedef struct
{
RecordNode *record;
int n; /*Size of sorting object*/
}SortObject; //Sequence to be sorted
HUFFMANTREE huffmantree(int n,int weight[])//Construct Huffman tree
{
int m1,m2,k;
int i,j,x1,x2;
HUFFMANTREE ht;
ht=(HUFFMANTREE)malloc((2*n)*sizeof(HTNODE));
for(i=1;i<(2*n);i++)//Initialize each node of the Huffman tree; fields without an initial value are set to 0
{
ht[i].parent=ht[i].lchild =ht[i].rchild=0;
if(i<=n)
ht[i].weight=weight[i];
else
ht[i].weight=0 ;
}
for(i=1;i<n;i++)//Merge the two smallest unparented nodes, n-1 times
{
m1=m2= MAX;
x1=x2=0;
for(j=1;j<(n+i);j++)
{
if((ht[j].weight<m1)&&(ht[j].parent==0))
{
m2=m1;
x2=x1;
m1=ht[j].weight;
x1=j;
}
else if((ht[j].weight<m2)&&(ht[j].parent==0))
{
m2=ht[j].weight;
x2=j;
}
}
k=n+i;
ht[x1].parent=ht[x2].parent=k;
ht[k].weight =m1+m2;
ht[k].lchild=x1;
ht[k].rchild=x2;
}
return ht;
}
void huffmancoding(int n,HUFFMANCODE hc,HUFFMANTREE ht,char str[])
{
int i,start,child,father;
char *cd;
hc=(HUFFMANCODE)malloc(( n+1)*sizeof(char*));//Allocate the head pointer of n character encoding
cd=(char*)malloc(n*sizeof(char));//Allocate the working space for encoding
cd[n-1]='\0';//Encoding terminator
f...