Commit 0d42924a by Francois Gygi

cleaned up notes


git-svn-id: http://qboxcode.org/svn/qb/trunk@662 cba15fb0-1239-40c8-b417-11db7ca47a34
parent 693f1d67
@@ -662,127 +662,3 @@ SlaterDet can exist on any context.
Developing a Wavefunction that can hold any number of SlaterDets for
arbitrary nspin and nkpoints.
--------------------------------------------------------------------------------
Examples of sizes for H2O calculations
--------------------------------------
512 molecules (1536 atoms), cell: 46.912^3 Bohr, ecut=85Ry, FTGrid 280x280x280
nst = 2048
BlacsContext (nprow = 128, npcol = 128), total of 16384 nodes
16 states/column
wfbasis.size() = 690000
vbasis.size() = 5480000
global matrix of coefficients: 690000 x 2048 complex<double>,
(or 1380000 x 2048 double)
wfbasis.localsize() = 5400
vbasis.localsize() = 42800
local size of coefficients: 5400 x 16 x sizeof(complex<double>) = 1.4 MB
local size of FTGrid: 280x280x280/128 x sizeof(complex<double>) = 2.7 MB
local number of grid points: 280x280x280/128 = 172000
keeping the real-space copies of the wavefunctions takes 16 real grid
functions per column, i.e. the equivalent of 8 complex grids, or
2.7 * 8 = 22 MB, and is therefore possible.
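A minimal sketch of this size arithmetic in C++ (the program and its variable
names are illustrative only; the input numbers are the ones quoted above):

  // Per-task memory estimate for the 512-molecule case (illustrative only).
  #include <cstdio>
  #include <complex>

  int main()
  {
    const double nprow = 128, npcol = 128;             // BlacsContext 128 x 128
    const double nst = 2048;                           // number of states
    const double wfbasis_size = 690000;                // wfbasis.size()
    const double np0 = 280, np1 = 280, np2 = 280;      // FT grid
    const double cplx = sizeof(std::complex<double>);  // 16 bytes

    const double nst_loc   = nst / npcol;                   // 16 states/column
    const double ngw_loc   = wfbasis_size / nprow;          // ~5400 pw/row
    const double coeff_loc = ngw_loc * nst_loc * cplx;      // ~1.4 MB
    const double grid_loc  = np0*np1*np2 / nprow * cplx;    // ~2.7 MB
    const double wf_rspace = nst_loc * grid_loc / 2.0;      // 16 real grids, ~22 MB

    std::printf("coeff_loc = %.1f MB\n", coeff_loc / 1.0e6);
    std::printf("grid_loc  = %.1f MB\n", grid_loc  / 1.0e6);
    std::printf("wf_rspace = %.1f MB\n", wf_rspace / 1.0e6);
    return 0;
  }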
The number of FFTs needed to compute the density is 8 (complex)
on each column. Considering the timing of a 280^3 transform with 128
tasks on Frost (0.24 s.), the 8 transforms take about 8 x 0.24 = 2.0 seconds.
Accumulation of the charge density is an MPI_Allreduce on the
rows of the BlacsContext.
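A hedged sketch of that step using plain MPI (the process-grid mapping, the use
of MPI_Comm_split and all names are assumptions for illustration; Qbox itself
goes through the BlacsContext):

  #include <mpi.h>
  #include <vector>

  // Sketch: sum the partial charge density over the rows of the process grid.
  // Each task holds its column's contribution in rho_loc; the row communicator
  // is built with MPI_Comm_split purely for illustration.
  void accumulate_rho(std::vector<double>& rho_loc, int nprow)
  {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // assume tasks sharing the same row index form one row of the process grid
    const int myrow = rank % nprow;
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &row_comm);

    // in-place sum of the local slab of rho over all columns of this row
    MPI_Allreduce(MPI_IN_PLACE, rho_loc.data(), (int)rho_loc.size(),
                  MPI_DOUBLE, MPI_SUM, row_comm);

    MPI_Comm_free(&row_comm);
  }

  int main(int argc, char** argv)
  {
    MPI_Init(&argc, &argv);
    std::vector<double> rho_loc(172000, 0.0);  // ~280^3/128 local grid points
    accumulate_rho(rho_loc, 128);              // nprow = 128
    MPI_Finalize();
    return 0;
  }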
NonLocalPotential:
aux[ia][ig] for s-projector: distributed across rows and columns:
512/128 = 4 atoms/column, 5400 plane-waves/row:
local size: 4 * 5400 * sizeof(complex<double>) = 0.3 MB
Fnl[n][ia]: 2048 x 512 double: distributed over rows and columns
localsize: 16 x 4 double
Replicating aux[ia][ig] on each column:
localsize = 512 x 5400 x sizeof(complex) = 44 MB
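The two aux layouts compared in a quick sketch (same caveats as above; only the
quoted numbers come from these notes):

  // Sketch: per-task storage of aux[ia][ig] for the s-projector,
  // 512-molecule case, distributed over rows and columns vs. replicated
  // on each column.
  #include <cstdio>
  #include <complex>

  int main()
  {
    const double natoms  = 512;
    const double npcol   = 128;                          // columns of the BlacsContext
    const double ngw_loc = 5400;                         // plane waves per row
    const double cplx    = sizeof(std::complex<double>); // 16 bytes

    const double aux_dist = (natoms / npcol) * ngw_loc * cplx;  // ~0.3 MB
    const double aux_repl =  natoms          * ngw_loc * cplx;  // ~44 MB

    std::printf("aux distributed: %.1f MB\n", aux_dist / 1.0e6);
    std::printf("aux replicated : %.1f MB\n", aux_repl / 1.0e6);
    return 0;
  }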
LocalPotential
Real space grid, replicated over columns: 1.4 MB/node
Summary of local sizes:
3 copies of wf coefficients:   33   MB
1 real-space copy of wf's      22   MB
2 complex FTGrid's              5.4 MB
1 real rho grid                 1.4 MB
1 real vloc grid                1.4 MB
3 real grad_rho grids           4.2 MB
1 aux array                     2.7 MB
                             --------
                               70.1 MB
Using replicated aux[ia][ig] arrays in NonLocalPotential adds 44 MB.
Note that having p-projectors would multiply the size of aux by 4, thus
making it difficult to replicate projectors on nodes.
Distributing projectors across rows and columns is better suited for
simulations involving d-projectors, or semi-local potentials (factors
of 10-20 in the number of projectors).
(sizes are OK for BGL, assuming 256 MB/node)
--------------------------------------------------------------------------------
2048 molecules (6144 atoms), cell 46x92x92 Bohr, ecut=85Ry, grid 280x560x560
nst=8192
Use a BlacsContext (nprow=256, npcol=256), total of 65536 nodes
32 states / column
Basis.size() = 22000000
global matrix of coefficients: 22000000 x 8192 complex<double>,
(or 44000000 x 8192 double)
Basis.localsize() ~= 85600
local size of coefficients: 85600 x 32 x sizeof(complex<double>) = 44 MB
local size of FTGrid: 280x560x560/256 x sizeof(complex<double>) = 5.4 MB
(i.e. sizes are OK for BGL)
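The same per-task estimate for this larger case, again as an illustrative
sketch with hypothetical names:

  // Sketch: per-task sizes for the 2048-molecule case on a 256 x 256
  // process grid.
  #include <cstdio>
  #include <complex>

  int main()
  {
    const double nprow = 256, npcol = 256;
    const double nst = 8192;
    const double basis_size = 22.0e6;                   // Basis.size()
    const double np0 = 280, np1 = 560, np2 = 560;       // FT grid
    const double cplx = sizeof(std::complex<double>);   // 16 bytes

    const double coeff_loc = (basis_size / nprow) * (nst / npcol) * cplx; // ~44 MB
    const double grid_loc  = np0 * np1 * np2 / nprow * cplx;              // ~5.5 MB

    std::printf("coeff_loc = %.1f MB\n", coeff_loc / 1.0e6);
    std::printf("grid_loc  = %.1f MB\n", grid_loc  / 1.0e6);
    return 0;
  }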
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
FTGrid timings
--------------------------------------------------------------------------------
(using blue, esslsmp)
file           size          tasks  nodes  fwd/bwd time
----           ----          -----  -----  ------------
ftest.o45511   264x264x264       2      2  11.0
ftest.o45507   264x264x264       4      4  5.8
ftest.o45506   264x264x264       8      8  3.0
ftest.o45505   264x264x264      16     16  1.6
ftest.o45504   264x264x264      32     32  0.9-1.1
ftest.o43442   264x264x264      64     64  0.5-0.6
ftest.o43879   264x264x264     128     64  1.0-2.9
ftest.o43431   264x264x264     256     64  1.9-10.5
ftest.o43423   264x264x264     128     32  1.5-4.5
ftest.o39910   264x264x264      64     16  1.6-4.4
ftest.o41371   264x264x264      32      8  2.2-4.6
ftest.o10483   264x264x264      64      8  1.6-7.8 (snow)
ftest.o10481   264x264x264      32      8  1.8-3.7 (snow)
ftest.o3426    256x256x256     128     32  0.7-0.8
ftest.o1065    256x256x256      64     16  1.0
ftest.o1067    512x512x512     128     32  4.0-4.5
--------------------------------------------------------------------------------
Note: these results were obtained with esslsmp. Better timings
can be obtained with essl and using 1 task per CPU.
The timings with essl and multiple tasks per node also appear
to be very homogeneous from node to node. It is likely that
the use of esslsmp causes delays in the MPI calls, and is
the source of the very inhomogeneous timings observed.
--------------------------------------------------------------------------------
Timings with essl
file  size          tasks  nodes  fwd/bwd time
----  ----          -----  -----  ------------
      264x264x264      16      1  1.0/1.0 (Frost)
      264x264x264      16      4  3.6/3.7 (Blue)
      140x140x140       4      1  1.2/1.3 (Blue)
--------------------------------------------------------------------------------
Using 4 nodes on Blue for the (264)^3 case, it is better to use 16 tasks
(4 tasks/node) with essl (3.6 s.) than 4 tasks (1 task/node) with
esslsmp (5.8 s.).