Database Layout and Repacking
Instaseis supports a number of different database layouts, each with its own advantages and disadvantages, as well as repacking functionality to convert from some layouts to others.
Available Layouts
From a high-level view, Instaseis supports two database layouts, referred to below as the multi file layout and the merged layout. Both support two-component (horizontal + vertical) as well as single-component databases. Forward databases are also supported by both layouts.
Multi File Layout
This layout consists of two NetCDF files. One of them can optionally be omitted, in which case Instaseis will only be able to extract horizontal or vertical seismograms. It stores the snapshots in time for two (vertical) or three (horizontal) components per GLL point. GLL points shared between two elements are stored only once, which significantly reduces the total file size (only about 16 GLL points need to be stored per element instead of 25).
The big downside is that, to extract three-component seismograms, Instaseis needs to read all 25 GLL points of a single element from all 3 + 2 displacement snapshot arrays. Instaseis batches adjacent reads, but in the worst case 125 single read accesses across two different files are needed to get the data for a single element. For forward databases this grows to up to 250 read operations across four files; the merged layout reduces these to a single read operation.
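The read counts quoted above follow from simple arithmetic. The sketch below only reproduces that arithmetic; the variable names are illustrative and not part of Instaseis, and the component counts for forward databases are taken from the merged layout listing further down.

```python
# Worst-case number of single reads per element in the multi file layout.
GLL_POINTS_PER_ELEMENT = 5 * 5  # 25 GLL points per element

HORIZONTAL_COMPONENTS = 3  # disp_s, disp_p, disp_z
VERTICAL_COMPONENTS = 2    # disp_s, disp_z
# Forward databases: MZZ (2) + MXX+MYY (2) + MXZ/MYZ (3) + MXY/MXX-MYY (3)
FORWARD_COMPONENTS = 2 + 2 + 3 + 3

reads_three_component = GLL_POINTS_PER_ELEMENT * (
    HORIZONTAL_COMPONENTS + VERTICAL_COMPONENTS
)
reads_forward = GLL_POINTS_PER_ELEMENT * FORWARD_COMPONENTS

# Rough size penalty of storing all 25 points per element instead of
# the roughly 16 unique ones (an approximation, as in the text).
duplication_factor = 25 / 16

print(reads_three_component)  # 125
print(reads_forward)          # 250
```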
Expected NetCDF file locations:
For the horizontal data file:
ROOT/.../PX/../ordered_output.nc4
or ROOT/../PX/.../axisem_output.nc4
For the vertical data file:
ROOT/.../PZ/../ordered_output.nc4
or ROOT/.../PZ/.../axisem_output.nc4
For the forward database files:
ROOT/.../[MZZ,MXX_P_MYY,MXZ_MYZ,MXY_MXX_M_MZZ]/../ordered_output.nc4
or ROOT/.../[MZZ,MXX_P_MYY,MXZ_MYZ,MXY_MXX_M_MZZ]/../axisem_output.nc4
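As a rough illustration of the locations listed above, the following sketch searches a root folder for a data file that sits somewhere below a given component directory. `find_data_file` is a hypothetical helper written for this example and is not Instaseis' own lookup code; checking `ordered_output.nc4` before `axisem_output.nc4` is an assumption.

```python
from pathlib import Path

# File names taken from the expected locations listed above.
FILENAMES = ("ordered_output.nc4", "axisem_output.nc4")


def find_data_file(root, component_dir):
    """Return the first data file below `root` that has `component_dir`
    (e.g. "PX", "PZ" or "MZZ") among its parent folders, or None.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    for name in FILENAMES:  # assumption: ordered_output.nc4 is preferred
        for path in sorted(root.rglob(name)):
            parent_folders = path.relative_to(root).parts[:-1]
            if component_dir in parent_folders:
                return path
    return None
```

For the example database shown in the repacking section, `find_data_file("input_database", "PX")` would return `input_database/PX/Data/ordered_output.nc4`.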
File layout (in a commented representation based on the output from ncdump):
dimensions:
gllpoints_all = 160692 ;
snapshots = 370 ;
// required global attributes (i.e. these are used by Instaseis):
* "dump type (displ_only, displ_velo, fullfields)" (string)
* "excitation_type" (string)
* "source type" (string)
* "background model" (string)
* "external model name" (string)
* "git commit hash" (string)
* "datetime" (string)
* "compiler brand" (string)
* "compiler version" (string)
* "user name" (string)
* "host name" (string)
* "time scheme" (string)
* "source time function" (string)
* "npol" (int)
* "file version" (int)
* "number of strain dumps" (int)
* "scalar source magnitude" (float/double/real)
* "strain dump sampling rate in sec" (float/double/real)
* "source shift factor in sec" (float/double/real)
* "source shift factor for deltat_coarse" (int)
* "npoints" (int)
* "attenuation" (int - 1 is true/0 is false)
* "planet radius" (float/double/real)
* "dominant source period" (float/double/real)
* "kernel wavefield rmin" (float/double/real)
* "kernel wavefield rmax" (float/double/real)
* "kernel wavefield colatmin" (float/double/real)
* "kernel wavefield colatmax" (float/double/real)
* "source depth in km" (float/double/real)
* "nelem_kwf_global" (int)
group: Snapshots {
variables:
# Note that these can exist in both versions - this one and the
# transposed one. If you use this one, make sure the chunking is set
# up in a way that the snapshots for a single GLL point can be read
# at once - otherwise performance will be abysmal. The transposed
# version does not have this problem.
float disp_s(snapshots, gllpoints_all) ;
# Not needed for vertical databases.
float disp_p(snapshots, gllpoints_all) ;
float disp_z(snapshots, gllpoints_all) ;
# These two can (for legacy reasons) also be part of a top level
# "Surface" group.
float stf_dump(snapshots) ;
float stf_d_dump(snapshots) ;
}
group: Mesh {
dimensions:
elements = 9856 ;
control_points = 4 ;
npol = 5 ;
variables:
int midpoint_mesh(elements) ;
int eltype(elements) ;
int axis(elements) ;
int fem_mesh(elements, control_points) ;
int sem_mesh(elements, npol, npol) ;
float mp_mesh_S(elements) ;
float mp_mesh_Z(elements) ;
double G0(npol) ;
double G1(npol, npol) ;
double G2(npol, npol) ;
double gll(npol) ;
double glj(npol) ;
float mesh_S(gllpoints_all) ;
float mesh_Z(gllpoints_all) ;
float mesh_vp(gllpoints_all) ;
float mesh_vs(gllpoints_all) ;
float mesh_rho(gllpoints_all) ;
float mesh_lambda(gllpoints_all) ;
float mesh_mu(gllpoints_all) ;
float mesh_xi(gllpoints_all) ;
float mesh_phi(gllpoints_all) ;
float mesh_eta(gllpoints_all) ;
float mesh_Qmu(gllpoints_all) ;
float mesh_Qka(gllpoints_all) ;
}
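The chunking note in the Snapshots group can be made concrete by looking at where consecutive time samples of one GLL point end up on disk. The sketch below is a simplified model that assumes an uncompressed, unchunked row-major array of 4-byte floats; `byte_offset` is an illustrative helper, not an Instaseis or NetCDF function.

```python
ITEMSIZE = 4  # bytes per float value


def byte_offset(shape, index):
    """Byte offset of `index` in a row-major array with the given `shape`."""
    flat = 0
    for size, i in zip(shape, index):
        flat = flat * size + i
    return flat * ITEMSIZE


snapshots, gllpoints_all = 370, 160692  # dimensions from the listing above
gll = 42  # arbitrary GLL point

# disp_s(snapshots, gllpoints_all): distance between two consecutive
# time samples of the same GLL point.
shape = (snapshots, gllpoints_all)
step_original = byte_offset(shape, (1, gll)) - byte_offset(shape, (0, gll))

# Transposed layout (gllpoints_all, snapshots): the samples are adjacent.
shape_t = (gllpoints_all, snapshots)
step_transposed = byte_offset(shape_t, (gll, 1)) - byte_offset(shape_t, (gll, 0))

print(step_original)    # 642768
print(step_transposed)  # 4
```

In this model, reading one full time series from the original layout touches 370 widely separated locations, whereas the transposed layout holds it in one contiguous run of 370 values.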
Merged File Layout
This layout, in contrast to the multi file layout, stores everything in a single array holding 2 to 10 displacement components (labelled 2D to 10D below), so the data from one element can be accessed with a single read command. The downside is that many GLL points are duplicated, which increases the file size. On the other hand, this layout can easily improve performance by more than an order of magnitude, so depending on the use case it is the way to go. Turning on compression can save quite a lot of space here but comes at the expense of some speed. Make sure to set the chunking so that each chunk corresponds to all the data from a single element.
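The idea of gathering everything per element can be sketched in plain Python with a toy mesh. This is not the repacking code used by Instaseis: `merge_snapshots` is a hypothetical function, and it assumes that the mesh connectivity holds 0-based global GLL point indices and that its two per-element axes map to (jpol, ipol).

```python
def merge_snapshots(components, sem_mesh):
    """Gather per-point snapshots into a per-element structure.

    components: components[var][snapshot][gll_point]
    sem_mesh:   sem_mesh[element][jpol][ipol] -> global GLL point index
    returns:    merged[element][var][jpol][ipol][snapshot]
    """
    return [
        [
            [[[snap[gll] for snap in comp] for gll in row] for row in element]
            for comp in components
        ]
        for element in sem_mesh
    ]


# Toy mesh: two elements with 2 x 2 points each, sharing points 1 and 4.
sem_mesh = [[[0, 1], [3, 4]], [[1, 2], [4, 5]]]
# Two components, three snapshots, six global GLL points.
components = [
    [[100 * c + 10 * t + g for g in range(6)] for t in range(3)]
    for c in range(2)
]

merged = merge_snapshots(components, sem_mesh)
print(merged[1][0][0][0])  # [1, 11, 21]
```

The shared point 1 now appears in both elements (`merged[0][0][0][1]` and `merged[1][0][0][0]`), which is the duplication that makes the merged file larger but lets one element be read in one go.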
Expected NetCDF file locations: ROOT/.../merged_output.nc4
File layout (in a commented representation based on the output from ncdump):
# Global attributes and mesh the same as above!
dimensions:
gllpoints_all = 160692 ;
snapshots = 370 ;
ipol = 5 ;
jpol = 5 ;
nvars = 5 ;
elements = 9856 ;
variables:
float stf_dump(snapshots) ;
float stf_d_dump(snapshots) ;
float MergedSnapshots(elements, nvars, jpol, ipol, snapshots) ;
The second dimension in the MergedSnapshots variable corresponds to the displacement in the various directions. In terms of the multi file layout, Instaseis assumes the following order:
5D => horizontal and vertical database:
disp_s horizontal
disp_p horizontal
disp_z horizontal
disp_s vertical
disp_z vertical
3D => horizontal only database:
disp_s horizontal
disp_p horizontal
disp_z horizontal
2D => vertical only database:
disp_s vertical
disp_z vertical
10D => forward database:
disp_s MZZ
disp_z MZZ
disp_s MXX+MYY
disp_z MXX+MYY
disp_s MXZ/MYZ
disp_p MXZ/MYZ
disp_z MXZ/MYZ
disp_s MXY/MXX-MYY
disp_p MXY/MXX-MYY
disp_z MXY/MXX-MYY
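The ordering above can be written down as a small lookup table keyed by the size of the second dimension. The names `MERGED_COMPONENT_ORDER` and `component_index` are made up for this sketch and do not exist in Instaseis; only the ordering itself is taken from the list above.

```python
# Order of the second MergedSnapshots dimension, keyed by its size (nvars).
MERGED_COMPONENT_ORDER = {
    5: [
        ("disp_s", "horizontal"), ("disp_p", "horizontal"),
        ("disp_z", "horizontal"),
        ("disp_s", "vertical"), ("disp_z", "vertical"),
    ],
    3: [
        ("disp_s", "horizontal"), ("disp_p", "horizontal"),
        ("disp_z", "horizontal"),
    ],
    2: [("disp_s", "vertical"), ("disp_z", "vertical")],
    10: [
        ("disp_s", "MZZ"), ("disp_z", "MZZ"),
        ("disp_s", "MXX+MYY"), ("disp_z", "MXX+MYY"),
        ("disp_s", "MXZ/MYZ"), ("disp_p", "MXZ/MYZ"), ("disp_z", "MXZ/MYZ"),
        ("disp_s", "MXY/MXX-MYY"), ("disp_p", "MXY/MXX-MYY"),
        ("disp_z", "MXY/MXX-MYY"),
    ],
}


def component_index(nvars, variable, source):
    """Index along the second dimension for a given variable and source."""
    return MERGED_COMPONENT_ORDER[nvars].index((variable, source))


print(component_index(5, "disp_z", "vertical"))  # 4
print(component_index(10, "disp_p", "MXZ/MYZ"))  # 5
```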
Repacking Script
Instaseis can convert databases from the multi file layout (also in the form that AxiSEM produces directly) to:
The same layout (the repack method). This sometimes improves compatibility. Additionally, compression settings can be changed.
A transposed version of the same layout (the transpose method). This might improve performance. Running it more than once will keep transposing the data arrays.
The merged layout (the merge method). Conversion can take a very long time. Compression can save quite a bit of space.
Where to execute this?
Make sure to execute this at the top level of all the NetCDF files. For example, the script (assuming the merge method is chosen) will transform this
input_database
├── PX
│   └── Data
│       └── ordered_output.nc4
└── PZ
    └── Data
        └── ordered_output.nc4
into this
merged_database
└── merged_output.nc4
Usage
$ python -m instaseis.scripts.repack_db --help
Usage: repack_db.py [OPTIONS] INPUT_FOLDER OUTPUT_FOLDER

Options:
  --contiguous                    Write a contiguous array - will turn off
                                  chunking and compression
  --compression_level INTEGER RANGE
                                  Compression level from 1 (fast) to 9 (slow).
  --method [transpose|repack|merge]
                                  `transpose` will transpose the data arrays
                                  which oftentimes results in faster
                                  extraction times. `repack` will just repack
                                  the data and solve some compatibility
                                  issues. `merge` will create a single much
                                  larger file which is much quicker to read
                                  but will take more space.  [required]
  --help                          Show this message and exit.
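If the repacking is driven from a Python script, the command line can be assembled from the options shown in the help text. The sketch below only builds the argument list; `build_repack_command` is a hypothetical helper for this example and not part of Instaseis.

```python
import sys

METHODS = ("transpose", "repack", "merge")  # choices from the help text


def build_repack_command(input_folder, output_folder, method,
                         compression_level=None, contiguous=False):
    """Assemble the argument list for the repack_db script.

    Hypothetical helper; uses only options listed in the help output.
    """
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}")
    cmd = [sys.executable, "-m", "instaseis.scripts.repack_db",
           "--method", method]
    if contiguous:
        cmd.append("--contiguous")
    if compression_level is not None:
        cmd += ["--compression_level", str(compression_level)]
    cmd += [str(input_folder), str(output_folder)]
    return cmd


cmd = build_repack_command("input_database", "merged_database",
                           method="merge", compression_level=4)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keep in mind that, according to the help text, `--contiguous` turns off chunking and compression.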
Comparing Databases
If you don’t trust the repacking script, don’t fret - there is another script that compares two or more databases to make sure they produce the same waveforms:
$ python -m instaseis.scripts.compare_dbs --help
Usage: compare_dbs.py [OPTIONS] [DATABASES]...

  Pass a list of databases to assert that they produce the same seismograms.
  The first one will be treated as the reference.

Options:
  --seed INTEGER  Optionally pass a seed number to make it reproducible.
  --help          Show this message and exit.
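A comparison run after repacking might be scripted along these lines. `build_compare_command` is a hypothetical helper for this example; it uses only the `--seed` option and the positional database arguments from the help text, and the requirement of at least two databases is this sketch's own check.

```python
import sys


def build_compare_command(databases, seed=None):
    """Assemble the argument list for the compare_dbs script.

    The first database is treated as the reference by the script.
    """
    if len(databases) < 2:
        raise ValueError("pass at least two databases to compare")
    cmd = [sys.executable, "-m", "instaseis.scripts.compare_dbs"]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # fixed seed for a reproducible run
    cmd += [str(db) for db in databases]
    return cmd


compare_cmd = build_compare_command(
    ["input_database", "merged_database"], seed=1234
)
print(" ".join(compare_cmd[1:]))
```

As with the repacking command, the list can be handed to `subprocess.run(compare_cmd, check=True)`.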