Difference between revisions of "Performance Test"

From Yade


Revision as of 14:33, 28 February 2014

Introduction

This page summarises some results of the YADE performance test (yade --performance) on multicore machines. It should give an idea of how well YADE scales.

Test 1

Two versions of YADE are compared to each other and two different machines are used. The test was conducted on the computing grid of the University of Newcastle by Klaus.

YADE versions:

Machines:

  • AMD: AMD Opteron(tm) Processor 6282 SE (64 cores)
  • Intel: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (16 cores)

System:

  • Red Hat Enterprise Linux Server release 6.4 (Santiago)

Other:

  • Number of tests per point in plots: 10
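The page does not say how the 10 tests per point were combined. A minimal sketch, assuming each plotted point is the mean of the repeated runs, with the sample standard deviation as a measure of spread; the function name and the numbers are hypothetical and not taken from the actual test:

```python
from statistics import mean, stdev

def summarise_runs(iter_per_sec):
    """Return (mean, sample standard deviation) of repeated measurements."""
    if len(iter_per_sec) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return mean(iter_per_sec), stdev(iter_per_sec)

# Hypothetical iteration rates from 10 repeated runs (made-up numbers,
# not results from the test described on this page).
runs = [101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 101.0, 99.0, 100.0, 100.0]
avg, spread = summarise_runs(runs)
print(f"mean = {avg:.1f} it/s, std = {spread:.2f} it/s")
```

Reporting the spread alongside the mean helps to judge whether a difference between two thread counts is larger than the run-to-run noise.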

Performance of Parallel Collider

Fig. 1: Scaling on Intel machine (left) and AMD machine (right).

Fig. 1 shows how version2 (parallel collider) scales in relation to version1 (NOTE: the ratio version1/version2 indicates how much faster the parallel collider is). For simulations with fewer than 100000 particles the scaling hardly depends on the number of threads and is only slightly greater than one. For simulations with more than 100000 particles the picture is different.
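The version1/version2 ratio can be sketched as follows. This assumes the ratio is formed from wall-clock times of the same test (the page does not state which quantity was divided); the function name and all timings are hypothetical:

```python
def collider_speedup(time_v1, time_v2):
    """Ratio version1/version2 of wall-clock times; > 1 means version2 is faster."""
    if time_v2 <= 0:
        raise ValueError("time_v2 must be positive")
    return time_v1 / time_v2

# Hypothetical wall-clock times in seconds, keyed by particle count
# (made-up numbers, not measurements from this page).
times_v1 = {10000: 12.0, 100000: 150.0, 500000: 1200.0}
times_v2 = {10000: 10.0, 100000: 100.0, 500000: 300.0}

for n in sorted(times_v1):
    ratio = collider_speedup(times_v1[n], times_v2[n])
    print(f"{n:>7} particles: version1/version2 = {ratio:.2f}")
```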

Using all cores of the machine is not recommended: e.g. -j12 (and probably -j14) scales better than -j16 on the Intel, and -j24 scales better than -j32 on the AMD.
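One way to act on this observation is to measure a few thread counts and pick the fastest one, rather than defaulting to the core count. A minimal sketch; the helper name and the iteration rates are hypothetical and only chosen to mimic the trend described above:

```python
def best_thread_count(rates):
    """Return the thread count with the highest measured iteration rate.

    rates: dict mapping thread count (the value passed to -j) to
    iterations per second.
    """
    if not rates:
        raise ValueError("no measurements given")
    return max(rates, key=rates.get)

# Hypothetical iteration rates on a 16-core machine (made-up numbers).
rates = {1: 10.0, 4: 30.0, 8: 48.0, 12: 60.0, 14: 59.0, 16: 52.0}
print(f"fastest run used -j{best_thread_count(rates)}")
```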

Fig. 2: Number of particles vs. iterations per second for Intel
Fig. 3: Number of particles vs. iterations per second for AMD

Figs. 2-3 show the absolute speed in iterations per second for both versions on both machines. The lines for the parallel collider (right) are almost parallel shifts of one another, whereas the lines for version1 converge at the 500000-particle point. This means that the parallel collider not only scales better (i.e. calculates faster) but also allows more particles to be used in a simulation.

Comparison AMD/Intel

Fig. 4: Comparison of Intel vs. AMD for both versions

Fig. 4 shows the difference between running version1 and version2 on an Intel or AMD machine. The AMD is generally slower (Intel/AMD>1).

Conclusions

The new parallel collider scales well for the --performance test with more than 100000 particles. For 500000 particles the scaling is particularly good: -j12 gives a factor of 6 on both machines. Intel machines perform better (similar observations have been made here [1]). Finally, there appears to be an optimum number of threads per simulation: more cores do not always mean a much faster run, so use your resources wisely.
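To put the factor of 6 at -j12 into perspective, it can be expressed per thread. The page does not state the baseline of this factor; the sketch below assumes, purely for illustration, that it is read as a speedup relative to a single-thread run. The function name is hypothetical:

```python
def parallel_efficiency(speedup, threads):
    """Speedup divided by thread count; 1.0 would be ideal linear scaling."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup / threads

# Factor 6 at -j12 as quoted above; the single-thread baseline is an assumption.
eff = parallel_efficiency(6.0, 12)
print(f"efficiency at -j12: {eff:.2f}")
```

Under that assumption each thread contributes about half of what ideal scaling would give, which is consistent with the advice that more threads are not automatically better.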

Test 2

TODO: add a more realistic example; feel free to add something...