Data Structures¶
There are a number of data structure formats used within the
PeleAnalysis suite of codes. Some are basic AMReX types, such as
MultiFab or plotfiles, while others are specific to the diagnostics
and cannot be represented naturally in AMReX’s containers, which were
generally created for block-structured data. Here, we
provide an overview of the data structures used by the various tools,
both how they are (currently) written to disk and how they are used in
the analysis tools.
Plotfiles¶
Plotfiles are the standard format for reading data from a Pele
simulation. Their format is discussed in the AMReX documentation. For a
multi-level AMR calculation, a plotfile contains an ASCII Header
file and one subfolder for each refinement level. There may also be a
folder containing multilevel particle data, and other
application-specific files (build information, typical state values,
run input parameters, etc.). Each refinement level contains one or more
MultiFab file sets (including an ASCII header file and a number of
data files, numbered sequentially). The main Header file includes
a list of variable names written to the structure, some details about
the domain and refinement data and specific run status, and a mapping
for which MultiFab files in the level subfolders contain each of the
plotfile variables.
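As a rough illustration of the kind of metadata carried by the main Header file, the sketch below parses only its leading fields from a text stream. The field order used here (a version string, the component count, one variable name per line, then the spatial dimension, the time and the finest level) is an assumption of this sketch and should be checked against the AMReX documentation; the names HeaderPrefix and readHeaderPrefix are hypothetical and are not part of AMReX or PeleAnalysis.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Leading fields of a plotfile Header, in the order ASSUMED by this sketch.
struct HeaderPrefix {
    std::string version;
    std::vector<std::string> varNames;
    int spaceDim = 0;
    double time = 0.0;
    int finestLevel = 0;
};

// Hypothetical helper: reads only the first few Header entries.
HeaderPrefix readHeaderPrefix(std::istream& is)
{
    HeaderPrefix h;
    std::getline(is, h.version);            // file format/version string

    int nComp = 0;
    is >> nComp;                            // number of plotfile variables
    assert(nComp >= 0);
    is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    for (int n = 0; n < nComp; ++n) {       // one variable name per line
        std::string name;
        std::getline(is, name);
        h.varNames.push_back(name);
    }

    is >> h.spaceDim >> h.time >> h.finestLevel;
    return h;
}
```

The remaining Header entries (domain extents, refinement ratios, per-level grid lists, and the MultiFab file mapping) are deliberately left unparsed here.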
The analysis codes read plotfiles from disk into memory using one of
two AMReX-provided C++ classes: PlotFileData or AmrData. The
former is the more recently developed of the two, and has a simple
interface for reading and interpolating data from the plotfiles into
local temporaries used to build diagnostics. The AmrData object
is somewhat more limited in terms of interface to interpolations, etc.,
but has the significant advantage of allowing for demand-driven data
reads - that is, only the plotfile metadata is read from disk when the
AmrData object is instantiated; the data is read only as needed to
satisfy a FillVar request - and then it is read at the granularity
of component and box. A FlushGrids operation is available to
dynamically manage memory used by the AmrData object. This
functionality can be critical when processing extremely large datasets
on massively parallel machines. In the set of analysis codes in this
repository, either may be used depending on the application for which
the tool was developed. A tool written against one class can also be
converted to the other rather trivially, as needed.
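The demand-driven idea can be illustrated without AMReX at all. The sketch below is a toy cache keyed on (component, box) that calls a user-supplied loader only on first access and can later release the data for a component. It is not the AmrData interface: LazyFieldCache, get and flush are hypothetical names chosen for this example, standing in loosely for the roles that FillVar and FlushGrids play.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Toy illustration only: none of these names are AMReX API.
class LazyFieldCache {
public:
    using Key = std::pair<int, int>;  // (component index, box index)
    using Loader = std::function<std::vector<double>(int comp, int box)>;

    explicit LazyFieldCache(Loader loader) : m_loader(std::move(loader)) {}

    // Return the data for (comp, box), reading it on first use only.
    const std::vector<double>& get(int comp, int box) {
        const Key key{comp, box};
        auto it = m_cache.find(key);
        if (it == m_cache.end()) {
            ++m_reads;  // count how often the (expensive) loader runs
            it = m_cache.emplace(key, m_loader(comp, box)).first;
        }
        return it->second;
    }

    // Drop every cached block of one component to release memory.
    void flush(int comp) {
        for (auto it = m_cache.begin(); it != m_cache.end();) {
            if (it->first.first == comp) it = m_cache.erase(it);
            else ++it;
        }
    }

    std::size_t cachedBlocks() const { return m_cache.size(); }
    int reads() const { return m_reads; }

private:
    Loader m_loader;
    std::map<Key, std::vector<double>> m_cache;
    int m_reads = 0;
};
```

Constructing the cache touches no data; only the blocks actually requested are ever loaded, which is the property that matters for very large datasets.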
Note that many of the processing tools read plotfiles as input, derive fields that live on the grid structure of the plotfile, and write results out as new plotfiles. There are a number of tools one can use to subset plotfiles in space and component indices and to join together multiple plotfiles (with some limitations on the compatibility of the level and grid structures).
FArrayBox and MultiFab¶
All of the usual AMReX data structures are, of course, available in
the analysis tools, and are used extensively. These structures are
documented in the AMReX documentation.
Normally, these structures are used as recommended in the AMReX
documentation and application codes. However, as discussed below,
there are operations that these structures support which we have
“hijacked” for our own purposes. In particular, FArrayBox and
MultiFab can be read and written to disk in a format that is
somewhat “self-describing”. Floating point data is managed in a
portable way and the commands to interact with the IO are particularly
convenient/simple. Thus, when we need to read or write data, and it
happens to be structured in a way that makes use of these containers,
we use them…even if we have to “cheat” to do it!
MEF - Marc’s Element Format¶
An MEF file is an example of (ab)using the AMReX IO functions for
the FArrayBox. An MEF file is the on-disk representation of
an unstructured data set, which is inherently not supported in the
native AMReX data structures. An unstructured dataset contains a
number of named components at a set of implicitly numbered “nodes”,
and then a set of integer sets that identify an oriented list of nodes
to connect for each “element”. An MEF file makes use of the IO
operations of FArrayBox in order to collect together into a single
file all the data associated with the unstructured data. Examples
where an MEF structure is used include triangulated surfaces (such
as those that result from the isovalue contouring operations in 3D)
and polylines (or segment sets that represent contours of 2D data). A
limitation of the MEF format is that all elements must have the
same number of nodes, and all nodes must have the same number of
components (in the same ordering). Also, we typically assume the
first AMREX_SPACEDIM components contain the spatial coordinates.
On disk, the MEF file is a concatenated set of information containing:
a label, the variable names, the number of elements, the number of
nodes per element, followed by a dump of the FArrayBox used to
store the nodes, followed by a write of the elements (the integer
node indices, written as raw binary in the snippet later in this
section). At the
moment, the IO of the unstructured datasets is done explicitly by each
tool that needs them (we should probably encapsulate this into a class
that manages this part…TODO). Two other notes are that the node
numbering in an MEF file starts at 1 (rather than 0) by convention,
and the FArrayBox used to manage the nodes is built on an AMReX
Box object that has its first index direction running from 0 to
Nnodes-1, and its 1:AMREX_SPACEDIM-1 extents running from 0 to
0. Thus, one does not want to treat this special Box or
FArrayBox in the normal AMReX way (intersections in index space
will NOT give the user what is expected!). Visualization of these data
structures using standard AMReX tools will also give perhaps
unexpected results (there are conversion tools to generate VTK
VTP-format files from MEF files, e.g., for reading into ParaView;
see the figure below). The special Box and FArrayBox objects
used with MEF files are typically created on-the-fly for the sole
purpose of IO operations. Outside of IO, the data is typically moved
into structures that more clearly indicate usage.
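To make the node layout concrete, here is a minimal stand-in for the node container, written without AMReX. It assumes the storage order of an FArrayBox built on a Box of extent Nnodes x 1 x 1, that is, the node index varies fastest and the component index slowest, and it shows where the 1-based MEF node ids differ from the 0-based Box index. NodeTable and its members are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the node FArrayBox: nNodes x 1 x 1 cells, nComp components.
// ASSUMPTION: node index varies fastest, component index slowest.
struct NodeTable {
    int nNodes = 0;
    int nComp = 0;
    std::vector<double> data;

    NodeTable(int nodes, int comps)
        : nNodes(nodes), nComp(comps),
          data(static_cast<std::size_t>(nodes) * comps, 0.0) {}

    // Zero-based node index, matching a Box that runs from 0 to Nnodes-1.
    double& at(int node0, int comp) {
        assert(node0 >= 0 && node0 < nNodes);
        assert(comp >= 0 && comp < nComp);
        return data[static_cast<std::size_t>(comp) * nNodes + node0];
    }

    // One-based node id, as stored in the MEF element list.
    double& atMefId(int nodeId, int comp) { return at(nodeId - 1, comp); }
};
```

Keeping the 1-based to 0-based shift in a single accessor avoids scattering off-by-one corrections through the code that walks the element list.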
Many of the analysis routines that interact with the unstructured
datasets need to work in parallel (with MPI). Typically, when that
is done, each rank has a local node numbering. However, the MEF
format does not have a parallel counterpart, so all IO is typically
done via the IO rank, and thus requires an explicit aggregation and
node number rationalization step (which, again, is done explicitly
in the codes, and probably ought to be encapsulated into a class
or something to ease use - TODO).
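The node number rationalization step can be sketched as follows, under simplifying assumptions: the pieces have already been gathered onto the IO rank in rank order, and nodes are simply concatenated, so a node shared by two ranks would appear twice (the actual tools may handle this differently). RankPiece and mergeConnectivity are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Data gathered from one rank: its node count and its element list,
// with node ids that are 1-based and local to that rank.
struct RankPiece {
    int nNodes;
    std::vector<int> elements;
};

// Concatenate the pieces in rank order, shifting each local id by the
// number of nodes contributed by all preceding ranks.
std::vector<int> mergeConnectivity(const std::vector<RankPiece>& pieces)
{
    std::vector<int> global;
    int offset = 0;
    for (const RankPiece& p : pieces) {
        for (int id : p.elements) {
            assert(id >= 1 && id <= p.nNodes);
            global.push_back(id + offset);  // still 1-based after the shift
        }
        offset += p.nNodes;
    }
    return global;
}
```

The node data itself would be concatenated in the same rank order so that the shifted ids point at the right entries.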
Several of the diagnostic tools interact with MEF structures. They
are used to represent isosurfaces of 3D data, polylines that are
isolines of 2D plotfile data, and isolines of 3D surface data. Although
they are generally used for node-centered data, the structures can
be (ab)used to represent data that is element-centered. This is
done by duplicating the data at the nodes of each triangle, including
its position, but setting the values of the other components for all
nodes in the element to the elemental value. This is the strategy
used, for example, to represent an element-averaged value of a quantity
(or even values of integrated quantities over stream tubes - discussed
below). The (brute force) writing of an MEF to a std::ofstream os
looks like this:
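os << label << std::endl;                        // descriptive label
os << vars << std::endl;                         // variable names
os << nElts << " " << nodesPerElt << std::endl;  // element count, nodes per element
nodefab.writeOn(os);                             // node data, via FArrayBox IO
os.write((char*)eltVec.dataPtr(),sizeof(int)*eltVec.size());  // raw element connectivity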
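The element-centered trick described above (duplicating nodes so that every node of an element carries the elemental value) might look like the following for a triangulated surface with one extra component. This is a sketch under stated assumptions, not the code used by the tools: the types Point3 and ElementCenteredMesh and the function duplicateNodes are hypothetical, and node ids follow the 1-based MEF convention.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };

struct ElementCenteredMesh {
    std::vector<std::array<double, 4>> nodes;  // x, y, z, elemental value
    std::vector<std::array<int, 3>> elements;  // 1-based node ids
};

// Give every triangle its own three nodes: positions are copied from the
// shared nodes, and the fourth component is set to the triangle's value.
ElementCenteredMesh duplicateNodes(const std::vector<Point3>& pts,
                                   const std::vector<std::array<int, 3>>& tris,
                                   const std::vector<double>& eltValue)
{
    assert(tris.size() == eltValue.size());
    ElementCenteredMesh out;
    for (std::size_t e = 0; e < tris.size(); ++e) {
        std::array<int, 3> newTri = {{0, 0, 0}};
        for (int k = 0; k < 3; ++k) {
            const Point3& p = pts[tris[e][k] - 1];  // MEF ids start at 1
            const std::array<double, 4> node = {{p.x, p.y, p.z, eltValue[e]}};
            out.nodes.push_back(node);
            newTri[k] = static_cast<int>(out.nodes.size());
        }
        out.elements.push_back(newTri);
    }
    return out;
}
```

The cost is a node count of three times the element count, since no node is shared between triangles after the duplication.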
StreamData¶
StreamData is a class whose design loosely follows the MEF
ideas. The format was generated to represent “streamline data” in 2-
and 3-space, and is fundamentally unstructured in nature. Imagine you
have a cloud of point locations that you would like to use to seed the
generation of integral paths along a vector field. For example,
imagine the points are the nodes of an isotherm that defines a “flame”,
and that from each node we construct a path along the
integral curves of the temperature gradient, into both the hot and
cold sides of the surface. The connectivity of the triangles on the
seed surface can be used to define a connectivity of prism-shaped
elements that tile a subregion of the domain between the hot and cold
ends of the integral curves.
The representation of this data in memory is quite complex. Each streamline consists of a set of points, and each point has a location and any number of quantities that have been interpolated from the source plotfile data. Additionally, we want to preserve the connectivity of the surface that implies the connectivity of the curved prism-shaped elements that tile the volume of space surrounding the seed surface.
Because stream data can be quite large, the structures are inherently
parallel, and make extensive use of the MultiFab and its parallel IO
capabilities. Each line contains the original seed point, which falls
into the valid region of a box from the finest level of the plotfile
that contains that point. The Box
associated with this region of
space and refinement level is deemed the “owner” of the stream
line. The data associated with the stream is stored in an
FArrayBox associated with the MultiFab at the level of the
owning Box. The special Box of the owning FArrayBox is
created over the bounds (0:Nlocal,-Npts:Npts,0), where Nlocal
is the number of seed points owned by this box, and Npts is the
number of points on the stream line towards each direction from the
surface (j=0 is on the seed surface). The FArrayBox created
on this Box object has Ncomp components, including position
coordinates and any number of fields interpolated from the source
plotfile. The data is distributed with the same distribution map used
to distribute the field data when the plotfile is read (determined by
the analysis code, NOT the original simulation). Any Box in the
BoxArray at each level in the stream data that does not contain
stream lines is set to a default (invalid) size, signaling to the
analysis code that there are no stream lines there to process.
Much like the temporaries used in IO of MEF data, the MultiFab
structures associated with stream data should not be treated like normal
AMReX data structures - visualization and manipulation of the data
requires detailed knowledge of their layout.
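As an aid to reading that layout, the sketch below maps a (stream line, point along the line, component) triple to a flat offset and pulls out one stream line. It assumes the usual FArrayBox storage order (first index fastest, component slowest) and takes the bounds of the first index as explicit inclusive limits rather than hard-coding them. StreamLayout and extractLine are hypothetical names, not part of the StreamData class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index layout of one stream-data FArrayBox, as ASSUMED by this sketch.
struct StreamLayout {
    int iLo, iHi;  // inclusive bounds of the stream line index i
    int nPts;      // points on each side of the seed surface (j = -nPts..nPts)
    int nComp;     // components stored per point

    int ni() const { return iHi - iLo + 1; }
    int nj() const { return 2 * nPts + 1; }
    std::size_t size() const {
        return static_cast<std::size_t>(ni()) * nj() * nComp;
    }
    // i varies fastest, then j, then the component.
    std::size_t offset(int i, int j, int comp) const {
        assert(i >= iLo && i <= iHi && j >= -nPts && j <= nPts);
        assert(comp >= 0 && comp < nComp);
        const std::size_t jc = static_cast<std::size_t>(j + nPts)
                             + static_cast<std::size_t>(nj()) * comp;
        return static_cast<std::size_t>(i - iLo)
             + static_cast<std::size_t>(ni()) * jc;
    }
};

// Copy one component along stream line i, ordered from j = -nPts to +nPts,
// so that entry nPts of the result lies on the seed surface (j = 0).
std::vector<double> extractLine(const StreamLayout& lay,
                                const std::vector<double>& fab,
                                int i, int comp)
{
    assert(fab.size() == lay.size());
    std::vector<double> line;
    for (int j = -lay.nPts; j <= lay.nPts; ++j) {
        line.push_back(fab[lay.offset(i, j, comp)]);
    }
    return line;
}
```

Note that consecutive points of one stream line are strided by ni() in memory, which is one reason generic AMReX tools do not present this data in a meaningful way.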
On disk, the StreamData object looks much like a plotfile. There
is an ASCII Header file, and subfolders for each AMR level.
Within the subfolders, there are MultiFab files associated with
the stream line data, possibly written in parallel across multiple
data files, etc. Additionally, there is a text file that specifies
the connectivity of the elements. Presently, these structures are
written, brute force, by the analysis codes (see the function
write_ml_streamline_data in stream.cpp, for example). The
functionality has been lifted into a StreamData class, but the
analysis tools haven’t yet been ported to use this class - TODO.
N-dimensional bins¶
Many of the analysis tools generate bins of data. These bins
typically are used to create joint probability density functions
(jPDFs) in 1, 2 or higher dimensions. They are also used to condition
statistics as an intermediate step to generating jPDFs. 2D jPDFs are
somewhat special in that we typically assume constant bin widths in
each coordinate so that an FArrayBox is a natural container to use
to hold the result. Also, because of the IO capabilities of this
class, it is a natural candidate for a format on disk. However,
an FArrayBox is a simple container, and has no notion of axes
labels, variable names, bin sizes, etc. Thus, whenever we are
generating this type of data, there is an inherent complexity in how
to represent the final output data to enable plotting and
interpretation of the results. Note that the analysis tools here DO
NOT include plotting routines, so there has to be an understanding
about how to communicate all these details to the end user’s plotting
package (such as xmgrace or matlab, etc.). To date, we have not come up with a
sufficiently flexible, self-describing way to convey all this
information, so the tools typically dump everything one needs and the
person orchestrating the plots must manually assemble the necessary
information.
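For reference, a 2D jPDF on constant-width bins can be sketched as below. The dense result is stored with the first bin index varying fastest, which is how it would sit in a single-component 2D container. The sketch also makes plain which metadata (axis ranges and bin counts) has to travel alongside the raw array for anyone to plot it. Axis and jointPdf are hypothetical names, and the weighting and normalization choices are assumptions of this example rather than a description of any particular tool.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One binning axis with constant bin width.
struct Axis {
    double lo, hi;
    int nBins;
    double width() const { return (hi - lo) / nBins; }
    // Bin index, or -1 when v lies outside [lo, hi).
    int bin(double v) const {
        if (v < lo || v >= hi) return -1;
        const int b = static_cast<int>(std::floor((v - lo) / width()));
        return b < nBins ? b : nBins - 1;  // guard against round-off at hi
    }
};

// Weighted joint PDF of (x, y); entry (i, j) is stored at j*ax.nBins + i.
// Samples outside either axis range are dropped before normalization.
std::vector<double> jointPdf(const Axis& ax, const Axis& ay,
                             const std::vector<double>& x,
                             const std::vector<double>& y,
                             const std::vector<double>& weight)
{
    assert(x.size() == y.size() && x.size() == weight.size());
    std::vector<double> pdf(static_cast<std::size_t>(ax.nBins) * ay.nBins, 0.0);
    double total = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        const int i = ax.bin(x[n]);
        const int j = ay.bin(y[n]);
        if (i < 0 || j < 0) continue;
        pdf[static_cast<std::size_t>(j) * ax.nBins + i] += weight[n];
        total += weight[n];
    }
    if (total > 0.0) {
        // Scale so that the sum of pdf * bin area equals one.
        const double norm = total * ax.width() * ay.width();
        for (double& p : pdf) p /= norm;
    }
    return pdf;
}
```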
A particularly noteworthy case is the binMEF tool, which bins the
data in an arbitrary number of coordinates. For each coordinate, the
user specifies the min, max and number of bins; the user also supplies
the input MEF file that represents a surface to be chopped up. The code
proceeds through each coordinate, and each bit of area landing in a
particular bin for coordinate n is then chopped up into bins of
coordinate n+1. This can be used to generate an area-weighted
jPDF in multiple coordinates, but can also be used as a conditioning
tool to exclude parts of the surface satisfying certain criteria
(falling outside the bins defined for that coordinate). Given the
array of bins, the result can be represented as a floating point
number (the area) and an array of integers, one for each of the
binning coordinates. Just like the simpler 2D jPDF example above, the
end user plotting or analyzing the results of this tool must assemble
all the bin info in their plotting package of choice. For the
N-dimensional case however, it is rarely useful to store the data as a
dense N-space container. The results are written to the screen in
their naturally sparse format. We haven’t yet developed a
standardized way to communicate all these details, so the process can
be tedious, but it is unavoidable.
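The sparse representation described above (an area plus one integer bin index per coordinate) maps naturally onto an associative container keyed by the index tuple. The sketch below is a deliberate simplification: it assigns the whole area of a surface piece to one bin using a single representative value per coordinate, whereas the binMEF tool chops the surface across bin boundaries. BinSpec, SparseBins and accumulate are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Range and bin count for one binning coordinate.
struct BinSpec { double lo, hi; int nBins; };

// Sparse N-dimensional bins: only bins that received area are stored.
using SparseBins = std::map<std::vector<int>, double>;

// Add `area` to the bin addressed by `values` (one value per coordinate).
// Returns false, adding nothing, when any value falls outside its range;
// this is the conditioning behavior described in the text.
bool accumulate(SparseBins& bins, const std::vector<BinSpec>& specs,
                const std::vector<double>& values, double area)
{
    assert(values.size() == specs.size());
    std::vector<int> key(specs.size());
    for (std::size_t d = 0; d < specs.size(); ++d) {
        const BinSpec& s = specs[d];
        if (values[d] < s.lo || values[d] >= s.hi) return false;
        const double width = (s.hi - s.lo) / s.nBins;
        const int b = static_cast<int>((values[d] - s.lo) / width);
        key[d] = (b < s.nBins) ? b : s.nBins - 1;
    }
    bins[key] += area;
    return true;
}
```

Iterating over the map yields exactly the (index tuple, area) pairs of the sparse output, with no storage spent on empty bins.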