Data Files¶
The core class in pygeostat is the DataFile class, which contains a pandas DataFrame with the data values and column names, in addition to metadata such as the names of the x, y and z coordinate columns or the grid definition.
DataFile Class¶
-
class pygeostat.data.data.DataFile(flname=None, readfl=None, fltype=None, dftype=None, data=None, columns=None, null=None, title='data', griddef=None, dh=None, x=None, y=None, z=None, ifrom=None, ito=None, weights=None, cat=None, catdict=None, variables=None, notvariables=None, delimiter='\s+', headeronly=False, h5path=None, h5datasets=None, nreals=-1, tmin=None)¶
This class stores geostatistical data values and metadata.
DataFile classes may be created on initialization, or generated using pygeostat functions. This is the primary class for pygeostat and is used for reading and writing GSLIB, CSV, VTK, and HDF5 file formats.
- Parameters
flname (str) – Path (or name) of file to read
readfl (bool) – True if the data file should be read on class initialization
fltype (str) – Type of data file: either csv, gslib, hdf5 or gsb
dftype (str) – Data file type as either ‘point’ or ‘grid’ used for writing out VTK files for visualization
data (pandas.DataFrame) – Pandas dataframe containing array of data values
dicts (List[dict] or dict) – List of dictionaries or dictionary for converting alphanumeric to numeric data
null (float) – Null value for missing values in the data file
title (str) – Title, or name, of the data file
griddef (pygeostat.GridDef) – Grid definition for a gridded data file
dh (str) – Name of drill hole variable
x (str) – Name of X coordinate column
y (str) – Name of Y coordinate column
z (str) – Name of Z coordinate column
ifrom (str) – Name of ‘from’ columns
ito (str) – Name of ‘to’ columns
weights (str or list) – Name of declustering weight column(s)
cat (str) – Name of categorical (e.g., rock type or facies) column
catdict (dict) – Set a dictionary for the categories, which should be formatted as:
catdict = {catcode:catname}
variables (str or list) – Name of continuous variable(s), which if unspecified, are the columns not assigned to the above attributes (via kwargs or inference)
notvariables (str or list) – Name of column(s) to exclude from variables
delimiter (str) – Delimiter used in the data file (i.e., comma or space)
headeronly (bool) – True to read only the header plus one line of the data file, which is useful for getting the column numbers of large files. If reading an HDF5 file, only the HDF5 store information is read.
h5path (str) – Forward-slash (/) delimited path through the group hierarchy from which to read the dataset(s) specified by the h5datasets argument. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
columns (list) – List of column labels to use for the resulting data pd.DataFrame
nreals (int) – Number of realizations to read in; -1 will read all
tmin (float) – If a number is provided, values less than this number (e.g., trimmed or null values) are converted to NaN. May be useful since NaN's are more easily handled within python, matplotlib and pandas. Set to None to disable.
Examples
Quickly reading in a GeoEAS data file:
data_file = gs.DataFile(flname='../data/oilsands.dat')
To read in a GeoEAS data file and assign attributes:
# Point Data Example
data_file = gs.DataFile(flname='../data/oilsands.dat', readfl=True, dh='Drillhole Number',
                        x='East', y='North', z='Elevation')
# Gridded Data Example
griddef = gs.GridDef('''10 0.5 1
10 0.5 1
10 0.5 1''')
data_file = gs.DataFile(flname='../data/3DDecor.dat', griddef=griddef)
# To view the grid definition string
print(data_file.griddef)
# Access some grid definition attributes
data_file.griddef.count()    # returns number of blocks in the grid
data_file.griddef.extents()  # returns an array of the extents for all directions
data_file.griddef.nx         # returns number of blocks in the x direction
HDF5
The HDF5 file format has its own advantages. For one, it reads and writes much faster than the ASCII format. Attributes (like the grid definition) can also be saved within the file, and all files for a single project can be stored in the same file. Please refer to the introduction on HDF5 files for more information.
This class currently only searches for and loads a grid definition.
Examples
HDF5 file simple read example:
data_file = gs.DataFile(flname='../data/oilsands_out.hdf5')
To view the HDF5 header information (tables stored in the file):
data_file.store
If you have an HDF5 file with multiple tables and only want to see which tables are in the file, along with any attributes saved to it, you can do a header-only read:
data_file = gs.DataFile(flname='../data/oilsands_out.hdf5', dftype='hdf5', headeronly=True)
Then to see what tables are written in the hdf5 file:
data_file.store
DataFile Attributes¶
Attributes of a datafile
object are accessed with datafile.<attribute>
.
Columns¶
Access the columns of the datafile. Wrapper for datafile.data.columns.
Num Variables¶
Access the nvar of the datafile, e.g., len(datafile.variables).
Locations¶
Access the locations stored in the datafile. Wrapper for datafile[datafile.xyz].
Example:
>>> datafile = gs.DataFile("somefile.out") # this file has an x, y[, z] attribute that is found
>>> datafile.locations
... dataframe of x, y, z locations
Shape¶
Access the shape of the data stored in the datafile. Wrapper for datafile.data.shape.
Example:
>>> datafile = gs.DataFile("somefile.out")
>>> datafile.shape
... shape of datafile.data
Rename Columns¶
-
DataFile.
rename
(columns)¶ Applies a dictionary to alter self.DataFrame column names. This applies the DataFrame.rename function, but updates any special attributes (dh, x, y, etc.) with the new name, if previously set to the old name. Users should consider using the self.columns property if changing all column names.
- Parameters
columns (dict) – formatted as {oldname1: newname1, oldname2: newname2}, etc., where the old and new names are strings. The old names must be present in data.columns.
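The column/attribute update described above can be sketched in plain Python. This is a hypothetical illustration only (rename_columns and its arguments are invented names, not the pygeostat implementation):

```python
# Sketch: apply a {oldname: newname} mapping to a column list, and update
# any special attributes (dh, x, y, ...) that referenced an old name.
def rename_columns(columns, special_attrs, mapping):
    missing = [old for old in mapping if old not in columns]
    if missing:
        raise KeyError(f"old names not in columns: {missing}")
    new_columns = [mapping.get(c, c) for c in columns]
    new_attrs = {attr: mapping.get(name, name)
                 for attr, name in special_attrs.items()}
    return new_columns, new_attrs

cols, attrs = rename_columns(
    ['East', 'North', 'Bitumen'],   # data columns
    {'x': 'East', 'y': 'North'},    # special attributes
    {'East': 'X', 'North': 'Y'})    # rename mapping
# cols  -> ['X', 'Y', 'Bitumen']
# attrs -> {'x': 'X', 'y': 'Y'}
```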
Drop Columns¶
-
DataFile.
drop
(columns)¶ This applies the DataFrame.drop function, where axis=1, inplace=True and columns is used in place of the labels. It also updates any special attributes (dh, x, y, etc.), setting them to None if dropped. Similarly, if any variables are dropped, they are removed from self.variables.
- Parameters
columns (str or list) – column names to drop
Check for Duplicate Columns¶
-
DataFile.
check_for_duplicate_cols
()¶ Run a quick check on the column names to see if any of them are duplicated. If any are, print a warning and rename the duplicated columns.
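The detection part of this check can be sketched with the standard library (find_duplicate_cols is a hypothetical helper, not the pygeostat code):

```python
from collections import Counter

# Sketch: report any column name that appears more than once.
def find_duplicate_cols(columns):
    counts = Counter(columns)
    return sorted(name for name, n in counts.items() if n > 1)

dupes = find_duplicate_cols(['dh', 'East', 'Au', 'Au'])
# dupes -> ['Au']
```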
Set Columns¶
-
DataFile.
setcol
(colattr, colname=None)¶ Set a specialized column attribute (dh, ifrom, ito, x, y, z, cat or weights) for the DataFile, where DataFile.data must be initialized. If colname is None, then the attribute will be set if a common name for it is detected in DataFile.data.columns (e.g., if colattr='dh' and colname=None, and 'DHID' is found in DataFile.data, then DataFile.dh='DHID'). The attribute will be None if none of the common names are detected. If colname is not None, then the provided string will be assigned to the attribute, e.g., DataFile.colattr=colname. Note, however, that an error will be thrown if colname is not None and colname is not in DataFile.data.columns. This is used on DataFile initialization, but may also be useful for calling after specialized columns are altered.
- Parameters
colattr (str) – must match one of: 'dh', 'ifrom', 'ito', 'x', 'y', 'z', 'cat' or 'weights'
colname (str or list) – if not None, must be the name(s) of a column in DataFile.data. A list is only valid if colattr='weights'
Examples
Set the x attribute (dat.x) based on a specified value:
>>> dat.setcol('x', 'Easting')
Set the x attribute (dat.x), where the function checks common names for x:
>>> dat.setcol('x')
Set Variable Columns¶
-
DataFile.
setvarcols
(variables=None, notvariables=None)¶ Set the variables for the DataFile. If provided, the function checks that the variables are present in the DataFrame. If not provided, the function assigns as variables the columns that are not specialized attributes (dh, x, y, z, rt, weights) and are not in the user-specified notvariables list. This is used on DataFile initialization, but may also be useful for calling after variables are added or removed.
- Parameters
variables (list or str) – list of strings
notvariables (list or str) – list of strings
Examples
Set the variables based on a specified list:
>>> dat.setvarcols(variables=['Au', 'Carbon'])
Set the variables based on the function excluding specialized columns (dh, x, y, etc.):
>>> dat.setvarcols()
Set the variables based on the function excluding specialized columns (dh, x, y, etc.), as well as a user specified list of what is not a variable:
>>> dat.setvarcols(notvariables=['Data Spacing', 'Keyout'])
Set Categorical Dictionary¶
-
DataFile.
setcatdict
(catdict)¶ Set a dictionary for the categories, which should be formatted as:
>>> catdict = {catcode:catname}
Example
>>> catdict = {0: "Mudstone", 1: "Sandstone"}
>>> self.setcatdict(catdict)
Check DataFile¶
-
DataFile.
check_datafile
(flname, variables, sep, fltype)¶ Run some quick checks on the DataFile before writing, and grab info if not provided
Add Coord¶
-
DataFile.
addcoord
()¶ Only use on DataFile classes containing GSLIB style gridded data.
If x, y, or z coordinate column(s) do not exist, they are created. If the created or current columns only have null values, they are populated based on the GridDef class passed to the DataFile class.
Note
A griddef must be assigned to the DataFile class, either at read-in:
>>> data_file = gs.DataFile(flname='test.out', griddef=grid)
Or manually assigned later:
>>> data_file.griddef = gs.GridDef(gridstr=my_grid_str)
Apply Dictionary¶
-
DataFile.
applydict
(origvar, outvar, mydict)¶ Applies a dictionary to the original variable to get a new variable.
This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.
- Parameters
origvar (str) – Name of original variable.
outvar (str) – Name of output variable.
mydict (dict) – Dictionary of values to apply.
Examples
>>> data_file.applydict('Drillhole', 'Drillhole-mod', mydict)
Describe DataFile¶
-
DataFile.
describe
(variables=None)¶ Describe a data set using pandas describe(), but exclude special variables.
- Keyword Arguments
variables (List(str)) – List of variables to describe.
- Returns
Pandas description of variables.
- Return type
self.data[variables].describe()
Examples
Describe all non-special variables in the DataFrame (columns set as the drill hole ID, coordinate columns, etc. are excluded):
>>> data_file.describe()
Or describe specific variables
>>> data_file.describe(['Bitumen', 'Fines'])
Infer Grid Definition¶
-
DataFile.
infergriddef
(blksize=None, databuffer=5, nblk=None)¶ Infer a grid definition with the specified dimensions to cover the set of data values. The function operates with two primary options:
Provide a block size (node spacing), the function infers the required number of blocks (grid nodes) to cover the data
Provide the number of blocks, the function infers the required block size
A data buffer may be used to expand the grid beyond the data extents. Basic integer rounding is also applied in an attempt to provide a 'nice' grid in terms of the origin alignment.
- Parameters
blksize (float or 3-tuple) – provides (xsiz, ysiz, zsiz). If blksize is not None, nblk must be None. Set zsiz None if the grid is 2-D. A float may also be provided, where xsiz = ysiz = zsiz = float is assumed.
databuffer (float or 3-tuple) – buffer between the data and the edge of the model, optionally for each direction
nblk (int or 3-tuple) – provides (nx, ny, nz). If blksize is not None, nblk must be None. Set nz to None or 1 if the grid is 2-D. An int may also be provided, where nx = ny = nz = int is assumed.
- Returns
this function returns the grid definition object as well as assigns the griddef to the current gs.DataFile
- Return type
griddef (GridDef)
Note
this function assumes things are either 3D or 2D along the xy plane. If nx == 1 or ny == 1, nonsense will result!
Usage:
First, import a datafile using gs.DataFile(), make sure to assign the correct columns to x, y and z:
>>> datfl = gs.DataFile('test.dat',x='x',y='y',z='z')
Now create the griddef from the data contained within the dataframe:
>>> blksize = (100, 50, 1)
>>> databuffer = (10, 25, 0)  # buffer in the x, y and z directions
>>> griddef = datfl.infergriddef(blksize, databuffer)
Check by printing out the resulting griddef:
>>> print(griddef)
Examples
For 3D data, infergriddef() returns a 3D grid definition even if zsiz is given as None or 0 or 1:
df3d = gs.ExampleData("point3d_ind_mv")
a = df3d.infergriddef(blksize=[50, 60, 1])
b = df3d.infergriddef(blksize=[50, 60, None])
c = df3d.infergriddef(blksize=[50, 60, 0])
# a, b and c are returned as Pygeostat GridDef:
# 20 135.0 50.0
# 19 1230.0 60.0
# 82 310.5 1.0
For 3D data, nz given as None or 0 or 1 returns a 2D grid that covers the vertical extent of the 3D data:
d = df3d.infergriddef(nblk=[50, 60, 1])
e = df3d.infergriddef(nblk=[50, 60, None])
f = df3d.infergriddef(nblk=[50, 60, 0])
# d, e and f are returned as Pygeostat GridDef:
# 50 119.8 19.6
# 60 1209.1 18.2
# 1 350.85 81.7
Where xsiz = ysiz = zsiz, a float can also be provided, or where nx = ny = nz, an int can also be provided:
df3d.infergriddef(blksize=75)
df3d.infergriddef(blksize=[75, 75, 75])  # returns the same as the line above
df3d.infergriddef(nblk=60)
df3d.infergriddef(nblk=[60, 60, 60])     # returns the same as the line above
If the data is 2-D, zsiz or nz must be provided as None. Otherwise an exception is raised:
df2d = gs.ExampleData("point2d_ind")
df2d.infergriddef(nblk=[60, 60, None])
df2d.infergriddef(blksize=[50, 60, None])
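The blksize-to-nblk inference can be sketched with simple arithmetic for one axis. This is an illustration of the idea only (infer_axis is a hypothetical helper; pygeostat also applies the origin rounding mentioned above, which this sketch omits):

```python
import math

# Sketch: given the data extents, a buffer and a block size along one axis,
# infer the number of blocks and the centre of the first block (GSLIB grids
# are defined by the first block centre).
def infer_axis(vmin, vmax, buffer, size):
    span = (vmax - vmin) + 2.0 * buffer
    n = max(1, math.ceil(span / size))
    origin = (vmin - buffer) + size / 2.0
    return n, origin

n, origin = infer_axis(0.0, 990.0, 10.0, 100.0)
# span of 1010 at 100 per block -> n = 11, origin = 40.0
```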
File Name String¶
-
DataFile.
__str__
()¶ Return the name of the data file if asked to ‘print’ the data file… or use the datafile in a string!
Generate Dictionary¶
-
DataFile.
gendict
(var, outvar=None)¶ Generates a dictionary with unique IDs from alphanumeric IDs. This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.
- Parameters
var (str) – Variable to generate a dictionary for
- Keyword Arguments
outvar (str) – Variable to generate using generated dictionary.
- Returns
Dictionary of alphanumerics to numeric ids.
- Return type
newdict (dict)
Examples
A simple call
>>> data_file.gendict('Drillhole')
OR
>>> dh_dict = data_file.gendict('Drillhole')
GSLIB Column¶
-
DataFile.
gscol
(variables, string=True)¶ Returns the GSLIB (1-ordered) column given a (list of) variable(s).
- Parameters
variables (str or List(str)) – Variable name or list of variable names.
- Keyword Arguments
string (bool) – If True returns the columns as a string.
- Returns
GSLIB 1-ordered column(s).
- Return type
cols (int or List(int) or string)
Note
None input returns a 0, which may be necessary, for example, with 2-D data:
>>> data.xyz
... ['East', 'North', None]
>>> data.gscol(data.xyz)
... '2 3 0'
Examples
Some simple calls
>>> data_file.gscol('Bitumen')
... 5
>>> data_file.gscol(['Bitumen', 'Fines'])
... [5, 6]
>>> data_file.gscol(['Bitumen', 'Fines'], string=True)
... '5 6'
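The 1-ordered lookup can be sketched in plain Python (gslib_columns is a hypothetical helper for illustration, not the pygeostat implementation):

```python
# Sketch: map variable names to GSLIB's 1-ordered column numbers,
# with None mapping to 0 (e.g., no z column for 2-D data).
def gslib_columns(columns, variables, string=False):
    cols = [0 if v is None else columns.index(v) + 1 for v in variables]
    return ' '.join(str(c) for c in cols) if string else cols

cols = gslib_columns(['dh', 'East', 'North', 'Bitumen'],
                     ['East', 'North', None])
# cols -> [2, 3, 0]
```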
Truncate NaN’s¶
-
DataFile.
truncatenans
(variable)¶ Returns a truncated list with nans removed for a variable.
- Parameters
variable (str) – Name of original variable.
- Returns
Truncated values.
- Return type
truncated (values)
Examples
A simple call that will return the list
>>> data_file.truncatenans('Bitumen')
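Conceptually this is a NaN filter over one variable's values; a standard-library sketch of that behaviour (truncate_nans is a hypothetical helper, not the pygeostat method):

```python
import math

# Sketch: return the values of one variable with NaN entries removed.
def truncate_nans(values):
    return [v for v in values
            if not (isinstance(v, float) and math.isnan(v))]

vals = truncate_nans([1.5, float('nan'), 2.0])
# vals -> [1.5, 2.0]
```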
Unique Categories¶
-
DataFile.
unique_cats
(variable, truncatenans=False)¶ Returns a sorted list of the unique categories given a variable.
- Parameters
variable (str) – Name of original variable.
- Keyword Arguments
truncatenans (bool) – Truncates missing values if True.
- Returns
Sorted list of the unique categories.
- Return type
unique_cats (List(object))
Examples
A simple call:
>>> data_file.unique_cats('Drillhole')
Or to save the list
>>> unique_dh_list = data_file.unique_cats('Drillhole')
Write file¶
-
DataFile.
write_file
(flname, title=None, variables=None, fmt=None, sep=None, fltype=None, data=None, h5path=None, griddef=None, null=None, tvar=None, nreals=1)¶ Writes out a GSLIB-style, VTK, CSV, Excel (XLSX) or HDF5 data file.
- Parameters:
flname (str): Path (or name) of file to write out.
- Keyword Args:
title (str): Title for output file.
variables (List(str)): List of variables to write out if only a subset is desired.
fmt (str): Format to use for floating point numbers.
sep (str): Delimiter to use for file output, generally don't need to change.
fltype (str): Type of file to write: either gslib, vtk, csv, xlsx or hdf5.
data (str): Subset of data to write out - cannot be used with the variables option!
h5path (str): The h5 group path to write data to (H5 filetype)
griddef (obj): a gslib griddef object
tvar (str): Name of variable to use for compression when NaNs exist within it
nreals (int): number of realizations you are writing out (needed for GSB)
null (float): If a number is provided, NaN numbers are converted to this value prior to writing. May be useful since NaN's are more easily handled within python and pandas than null values, but are not valid in GSLIB. Set to None to disable (but NaN's must be handled prior to this function call if so).
- Note:
pygeostat.write_file is kept for backwards compatibility or as an overloaded class method. The current write functions can be called separately with the functions listed below:
>>> import pygeostat as gs
>>> import pandas as pd
>>> gs.write_gslib(gs.DataFile or pd.DataFrame)
>>> gs.write_csv(gs.DataFile or pd.DataFrame)
>>> gs.write_hdf5(gs.DataFile or pd.DataFrame)
>>> gs.write_vtk(gs.DataFile or pd.DataFrame)
>>> gs.write_gsb(gs.DataFile or pd.DataFrame)
Note: The GSB format is not specifically intended for general users of pygeostat. Some CCG programs use GSB, a compressed GSLIB-like binary data format that greatly reduces the computational expense.
The following calls:
>>> data_file.write_file('testgslib.out')
>>> data_file.write_file('testgsb.gsb')
are equivalent to:
>>> gs.write_gslib(data_file, 'testgslib.out')
>>> gs.write_gsb(data_file, 'testgsb.gsb')
and similar to:
>>> gs.write_gslib(data_file.data, 'testgslib.out')
>>> gs.write_gsb(data_file.data, 'testgsb.gsb')
Data Spacing¶
-
DataFile.
spacing
(n_nearest, var=None, inplace=True, dh=None, x=None, y=None)¶ Calculates data spacing in the xy plane, based on the average distance to the nearest n_nearest neighbours. The x, y coordinates of 3-D data may be provided in combination with a dh (drill hole or well), in which case the mean x, y of each dh is calculated before performing the calculation. If a dh is not provided in combination with 3-D xy's, then the calculation is applied to all data and may create memory issues if more than ~5000-10000 records are provided. A var specifier allows the calculation to only be applied where var is not NaN.
If inplace==True: The output is concatenated as a 'Data Spacing ({Parameters['plotting.unit']})' column (or 'Data Spacing' if Parameters['plotting.unit'] is None). If var is used, then the calculation is only performed where DataFile[var] is not NaN, and the output is concatenated as '{var} Data Spacing ({Parameters['plotting.unit']})'.
If inplace==False: The function returns dspace as a numpy array if dspace.shape[0] is equal to DataFile.shape[0], meaning that the dh and var functionality was not used, or did not lead to differences in the length of dspace and DataFile (so that the x and y in DataFile can be used for plotting dspace in map view). The function returns a tuple of the form (dspace, dh, x, y) if dh is not None and dspace.shape[0] is not equal to DataFile.shape[0]. The function returns a tuple of the form (dspace, x, y) if dh is None, var is not None and dspace.shape[0] is not equal to DataFile.shape[0].
- Parameters
n_nearest (int) – number of nearest neighbours to consider in data spacing calculation
var (str) – variable for calculating data spacing, where the calculation is only applied to locations where var is not NaN. If None, the calculation is to all locations.
inplace (bool) – if True, the output data spacing is concatenated
dh (str) – dh name, which can override self.dh
x (str) – x coordinate name, which can override self.x
y (str) – y coordinate name, which can override self.y
Examples
Calculate data spacing without consideration of underlying variables, based on the nearest 8 neighbours.
>>> dat.spacing(8)
Output as a numpy array rather than concatenating a column:
>>> dspace = dat.spacing(8, inplace=False)
Only consider values where Au is not NaN for the calculation:
>>> (dspace, x, y) = dat.spacing(8, inplace=False, var='Au')
Example Data¶
-
pygeostat.data.data.
ExampleData
(testfile, griddef=None, **kwargs)¶ Get an example pygeostat DataFile
- Parameters
testfile (str) – one of the available pygeostat test files, listed below
Test files available in pygeostat include:
“point2d_ind”: 2d indicator dataset
“point2d_surf”: 2d point dataset sampling a surface
“grid2d_surf”: ‘Thickness’ from ‘point2d_surf’ interpolated on the grid
“point3d_ind_mv”: 3d multivariate and indicator dataset
“oilsands”: 3D Oil sands data set
“accuracy_plot”: Simulated realizations to test accuracy plot
“location_plot”: 2D data set to test location plot
“3d_grid”: 3D gridded data set
“point2d_mv” : 2D multivariate data set
“cluster”: GSLIB datafile (data with declustering weights)
“97data”: GSLIB datafile (the first 97 rows of cluster datafile)
“data”: GSLIB datafile (2D data set of primary and secondary variable)
“parta”: GSLIB datafile (small 2D dataset part A)
“partb”: GSLIB datafile (small 2D dataset part B)
“partc”: GSLIB datafile (small 2D dataset part C)
“true”: GSLIB datafile (Primary secondary data pairs)
“ydata”: GSLIB datafile (2D spatial secondary data with some primary data)
Input/Output Tools¶
iotools.py: Contains input/output utilities and functions for pygeostat, many of which are based on pandas built-in functions.
Read File¶
-
pygeostat.data.iotools.
read_file
(flname, fltype=None, headeronly=False, delimiter='\\s*', h5path=None, h5datasets=None, columns=None, ireal=1, griddef=None, tmin=None)¶ Reads in GSLIB-style Geo-EAS, CSV, GSB or HDF5 data files.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
fltype (str) – Type of file to read: either csv, gslib or hdf5.
headeronly (bool) – If True, only reads in the first line from the data file, which is useful for just getting column numbers or testing. Alternatively, it allows you to open an HDF5 object with pandas HDFStore functionality.
delimiter (str) – Delimiter specified instead of sniffing
h5path (str) – Forward-slash (/) delimited path through the group hierarchy from which to read the dataset(s) specified by the h5datasets argument. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
columns (list) – List of column labels to use for the resulting frame
ireal (int) – Number of realizations in the file
griddef (GridDef) – griddef for the realization
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Note
The read functions can also be called separately with the following code:
>>> data.data = pygeostat.read_gslib(flname)
>>> data.data = pygeostat.read_csv(flname)
>>> data.data = pygeostat.read_h5(flname, h5path='')
>>> data.data = pygeostat.read_gsb(flname)
>>> data.data = pygeostat.open_hdf5(flname)
Examples
>>> data.data = gs.read_gsb('testgsb.gsb')
>>> data = gs.DataFile('testgsb.gsb')
Read CSV¶
-
pygeostat.data.iotools.
read_csv
(flname, headeronly=False, tmin=None)¶ Reads in a GSLIB-style CSV data file.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
delimiter (str) – Delimiter specified instead of sniffing
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Read GSLIB Python¶
-
pygeostat.data.iotools.
read_gslib
(flname, headeronly=False, delimiter='\\s*', tmin=None)¶ Reads in a GSLIB-style Geo-EAS data file
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
delimiter (str) – Delimiter specified instead of sniffing
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
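For reference, the Geo-EAS layout being read is: a title line, the number of columns, one column name per line, then whitespace-delimited values. A minimal standard-library sketch of parsing that layout (parse_geoeas is a hypothetical helper, not the pygeostat reader, which also handles trimming and delimiter sniffing):

```python
import io

# Sketch: parse a Geo-EAS / GSLIB-style file into (title, columns, rows).
def parse_geoeas(text):
    lines = io.StringIO(text)
    title = next(lines).rstrip('\n')
    ncol = int(next(lines).split()[0])  # extra tokens on this line are ignored
    columns = [next(lines).strip() for _ in range(ncol)]
    rows = [[float(tok) for tok in ln.split()] for ln in lines if ln.strip()]
    return title, columns, rows

sample = "data\n2\nEast\nBitumen\n100.0 8.2\n150.0 6.9\n"
title, columns, rows = parse_geoeas(sample)
# -> ('data', ['East', 'Bitumen'], [[100.0, 8.2], [150.0, 6.9]])
```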
Fortran Compile for GSB¶
-
pygeostat.data.iotools.
compile_pygsb
()¶ Compiles ‘pygeostat/fortran/src/pygsb.f90’ using ‘pygeostat/fortran/compile.py’ and tries to import pygsb.pyd
Note
How to install a gfortran compiler:
Install chocolatey from:
chocolatey.org/install
(chocolatey is a package manager that lets you install software using the command prompt and PowerShell)
After installing chocolatey, install the GNU Fortran compiler by running the following in PowerShell:
choco install mingw --version 8.1
choco install visualstudio2019community
choco install visualstudio2019-workload-vctools
When installing mingw through chocolatey, ensure that the path of the mingw bin folder is added to the PATH environment variable.
Read GSB¶
-
pygeostat.data.iotools.
read_gsb
(flname, ireal=-1, tmin=None, null=None)¶ Reads in a CCG GSB (GSLIB-Binary) file.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
ireal (int) – 1-indexed realization number to read (reads 1 at a time), -1 to read all
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
null (float) – when the gsb array has a keyout, this value fills the array in keyed-out locations on reconstruction. If None, taken from Parameters['data.null']
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Code author: Jared Deutsch 2016-02-19
Write GSLIB Python¶
-
pygeostat.data.iotools.
write_gslib
(data, flname, title=None, variables=None, fmt=None, sep=' ', null=None)¶ Writes out a GSLIB-style data file.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
- Keyword Arguments
title (str) – Title for output file.
variables (List(str)) – List of variables to write out if only a subset is desired.
fmt (str) – Format to use for floating point numbers.
sep (str) – Delimiter to use for file output, generally don’t need to change.
null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters['data.null'].
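The layout produced mirrors the Geo-EAS read layout: title, column count, column names, then delimited rows with NaN replaced by the null value. A minimal standard-library sketch (format_gslib is a hypothetical helper, not the pygeostat writer):

```python
import math

# Sketch: format rows as GSLIB-style text, substituting null for NaN.
def format_gslib(title, columns, rows, null=-999.0, fmt='%.5f'):
    out = [title, str(len(columns))] + list(columns)
    for row in rows:
        out.append(' '.join(fmt % (null if math.isnan(v) else v) for v in row))
    return '\n'.join(out) + '\n'

text = format_gslib('data', ['East', 'Bitumen'], [[100.0, float('nan')]])
# text -> "data\n2\nEast\nBitumen\n100.00000 -999.00000\n"
```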
Write CSV¶
-
pygeostat.data.iotools.
write_csv
(data, flname, variables=None, fmt='%.5f', sep=', ', fltype='csv', null=None)¶ Writes out a CSV or Excel (XLSX) data file.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
- Keyword Arguments
variables (List(str)) – List of variables to write out if only a subset is desired.
fmt (str) – Format to use for floating point numbers.
sep (str) – Delimiter to use for file output, generally don’t need to change.
fltype (str) – Type of file to write: either csv or xlsx.
null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters['data.null'].
Write GSB¶
-
pygeostat.data.iotools.
write_gsb
(data, flname, tvar=None, nreals=1, variables=None, griddef=None, fmt=0)¶ Writes out a GSB (GSLIB-Binary) style data file. NaN values of tvar are compressed in the output, so no tmin needs to be provided.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
tvar (str) – Variable to trim by or None for no trimming. Note that all variables are trimmed in the data file (for compression) when this variable is trimmed.
nreals (int) – number of realizations in data
- Keyword Arguments
griddef (pygeostat.griddef.GridDef) – This is required if the data is gridded and you want other gsb programs to read it
fmt (int) – if 0, all variables are written out as float64. Otherwise, should be a list with a length equal to the number of variables, using the following format codes: 1=int32, 2=float32, 3=float64
variables (List(str)) – List of variables to write out if only a subset is desired.
Code author: Jared Deutsch 2016-02-19, modified by Ryan Barnett 2018-04-12
Write VTK¶
-
pygeostat.data.iotools.
write_vtk
(data, flname, dftype=None, x=None, y=None, z=None, variables=None, griddef=None, null=None, vdtype=None, cdtype=None)¶ Writes out an XML VTK data file. A required dependency is pyevtk, which may be installed using the following command:
>>> pip install pyevtk
Users are also recommended to install the latest Paraview, as versions from 2017 were observed to have odd precision bugs with the XML format.
- Parameters
data (pygeostat.DataFile) – data to write out
flname (str) – Path (or name) of file to write out (without extension)
- Keyword Arguments
dftype (str) – type of datafile: grid or point, which if None, is drawn from data.dftype
x (str) – name of the x-coordinate, which is used if point. Drawn from data.x if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
y (str) – name of the y-coordinate, which is used if point. Drawn from data.y if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
z (str) – name of the z-coordinate, which is used if point. Drawn from data.z if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
griddef (pygeostat.GridDef) – grid definition, which is required if grid. Drawn from data.griddef if the kwarg=None.
variables (list or str) – List or string of variables to write out. If None, then all columns aside from coordinates are written out by default.
null (float) – NaNs are converted to this value prior to writing. If None, set to pygeostat.Parameters['data.null_vtk'].
vdtype (dict(str)) – Dictionary of the format {'varname': dtype}, where dtype is a numpy data format. May be used for reducing file size, by specifying int, float32, etc. If a format string is provided instead of a dictionary, that format is applied to all variables. This is not applied to coordinate variables (if applicable). If None, the value is drawn from Parameters['data.write_vtk.vdtype'].
cdtype (str) – Numpy format to use for the output of coordinates, where valid formats are float64 (default) and float32. The latter is recommended for reducing file sizes, but may not provide the requisite precision for UTM coordinates. If None, the value is drawn from Parameters['data.write_vtk.cdtype'].
dftype should be one of:
‘point’ (irregular points) where data.x, data.y and data.z are columns in data.data
‘grid’ (regular or rectilinear grid) where data.griddef must be initialized
‘sgrid’ (structured grid) where data.x, data.y and data.z are columns in data.data. data.griddef should also be initialized, although only griddef.nx, griddef.ny and griddef.nz are utilized (since the grid is assumed to not be regular)
Write HDF5 VTK¶
-
pygeostat.data.iotools.
write_hvtk
(data, flname, griddef, variables=None)¶ Writes out an H5 file and corresponding xdmf file that Paraview can read. Currently only supports 3D gridded datasets. This function will fail if the length of the DataFile or DataFrame does not equal
griddef.count()
. The extension xdmf is silently enforced; any other extension passed is replaced.
- Parameters
data (pd.DataFrame) – The DataFrame to write out
flname (str) – Path (or name) of file to write out.
griddef (GridDef) – Grid definitions for the realizations to be written out
variables (str or list) – optional set of variables to write out from the DataFrame
Count Lines in File¶
-
pygeostat.data.iotools.
file_nlines
(flname)¶ Open a file and get the total number of lines. Adapted from Stack Overflow: http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
- Parameters
flname (str) – Name of the file to read
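A sketch of the chunked counting approach the linked Stack Overflow answer describes (an illustrative re-implementation, not pygeostat's exact source):

```python
from functools import partial

def count_lines(flname):
    """Count lines by scanning fixed-size binary chunks, so the whole
    file is never held in memory at once."""
    with open(flname, 'rb') as f:
        return sum(chunk.count(b'\n')
                   for chunk in iter(partial(f.read, 1 << 20), b''))
```

Note that this counts newline characters, so a file without a trailing newline would report one fewer than the number of text lines.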
Write CCG GMM¶
-
pygeostat.data.iotools.
writeout_gslib_gmm
(gmm, outfile)¶ Write out a fitted Gaussian mixture to the format consistent with gmmfit from the CCG Knowledge Base. Assumes gmm is an sklearn.mixture.GaussianMixture class fitted to data.
Note
In scikit-learn, GMM was recently replaced with GaussianMixture, and there are subtle differences in attributes between the two versions.
- Parameters
gmm (GaussianMixture) – a fitted mixture model
outfile (str) – the output file
HDF5 I/O¶
Write HDF5¶
-
pygeostat.data.h5_io.
write_h5
(data, flname, h5path=None, datasets=None, dtype=None, gridstr=None, trim_variable=None, var_min=-998.0)¶ Write data to an HDF5 file using the Python package h5py. The file is appended to; if a dataset already exists, it is overwritten.
- Parameters
data – A 1-D np.array/pd.Series or a pd.DataFrame containing different variables as columns
flname (str) – Path of the HDF5 file you wish to write to or create
h5path (str) – Forward slash (/) delimited path through the group hierarchy into which the dataset(s) specified by the argument datasets are placed. The dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
datasets (str or list) – Name of the dataset(s) to write out. If a pd.DataFrame is passed, the values passed by the argument datasets must match the DataFrame’s columns.
dtype (str) – The data type to write. Currently, only the following values are permitted: ['int32', 'float32', 'float64']. If a pd.DataFrame is passed and this argument is left to its default value of None, the DataFrame’s dtypes must be of the types listed above.
gridstr (str) – Grid definition string that is saved to the HDF5 file as an attribute of the group defined by the parameter h5path.
trim_variable (str) – Variable to use for trimming the data. An index is written to the HDF5 file and is used to rebuild the dataset, while only non-trimmed data is written out.
var_min (float) – Minimum trimming limit used if trim_variable is passed.
Examples
Write a single pd.Series or np.array to an HDF5 file:
>>> gs.write_h5(array, 'file.h5', h5path='Modeled/Var1', datasets='Realization_0001')
Write a whole pd.DataFrame to the group (folder) ‘OriginalData’, which will contain a dataset for every column in the pd.DataFrame:
>>> gs.write_h5(DataFrame, 'file.h5', h5path='OriginalData')
Read HDF5¶
-
pygeostat.data.h5_io.
read_h5
(flname, h5path=None, datasets=None, fill_value=-999)¶ Return a 1-D array from an HDF5 file, or build a pd.DataFrame from a list of datasets in a single group.
The argument h5path must be a path to a group. If one or more specific variables are to be loaded, pass a list to datasets to specify which to read.
- Parameters
flname (str) – Path of the HDF5 you wish to write to or create
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which the dataset(s) specified by the argument datasets are read. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None reads from the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
fill_value (float or np.NaN) – Value to fill in the grid with if trimmed data was written out. Default is -999.
- Returns
DataFrame containing one or more columns, each containing a single 1-D array of a variable.
- Return type
data (pd.DataFrame)
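The group/dataset layout these two functions write and read can be mimicked directly with h5py. This round-trip sketch assumes h5py is installed and does not call pygeostat itself:

```python
import h5py
import numpy as np
import pandas as pd

df = pd.DataFrame({'Var1': np.arange(5.0), 'Var2': np.arange(5.0) * 2.0})

# Write each DataFrame column as a dataset under the group 'OriginalData',
# mirroring the layout write_h5 produces for a DataFrame.
with h5py.File('example.h5', 'w') as f:
    grp = f.require_group('OriginalData')
    for col in df.columns:
        grp.create_dataset(col, data=df[col].to_numpy())

# Rebuild a DataFrame from every dataset in the group, as read_h5 does
# when `datasets` is None.
with h5py.File('example.h5', 'r') as f:
    grp = f['OriginalData']
    back = pd.DataFrame({name: grp[name][:] for name in grp})
```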
Is HDF5¶
-
pygeostat.data.h5_io.
ish5dataset
(h5fl, dataset, h5path=None)¶ Check to see if a dataset exists within an HDF5 file
The argument h5path must be a path to a group and cannot contain the dataset name. Only one dataset can be checked at a time.
- Parameters
h5fl (str) – Path of the HDF5 file you wish to check
h5path (str) – Forward slash (/) delimited path through the group hierarchy in which to check for the specified dataset. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None checks the root directory of the HDF5 file.
dataset (str) – Name of the dataset to check for in the group specified by h5path.
- Returns
Indicator if the specified dataset exists
- Return type
exists (bool)
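A minimal re-implementation sketch of this check using h5py (hypothetical; pygeostat's actual implementation may differ):

```python
import h5py

def dataset_exists(h5fl, dataset, h5path=None):
    # Open read-only, walk to the group (root if h5path is None), and
    # test that the name is present and is a dataset rather than a group.
    with h5py.File(h5fl, 'r') as f:
        grp = f[h5path] if h5path else f
        return dataset in grp and isinstance(grp[dataset], h5py.Dataset)
```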
Combine Datasets from Multiple Paths¶
-
pygeostat.data.h5_io.
h5_combine_data
(flname, h5paths, datasets=None)¶ Combine data into one DataFrame from multiple paths in an HDF5 file.
- Parameters
flname (str) – Path of the HDF5 you wish to read from
h5paths (list) – A list of h5paths to combine. Each is a forward slash (/) delimited path through the group hierarchy from which the dataset(s) specified by the argument datasets are read. The dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None reads from the root directory of the HDF5 file.
datasets (list of lists) – If only a specific set of datasets from each path is desired, pass a list of lists of equal length to the h5paths list. An empty list within the list causes all datasets in the corresponding path to be read in.
- Returns
DataFrame
Example:
>>> flname = 'drilldata.h5'
>>> h5paths = ['/Orig_data/series4870/', 'NS/Declus/series4870/']
>>> datasets = [['LOCATIONX', 'LOCATIONY', 'LOCATIONZ'], []]
>>> data = gs.h5_combine_data(flname, h5paths, datasets=datasets)
Pygeostat HDF5 Class¶
-
class
pygeostat.data.h5_io.
H5Store
(flname, replace=False)¶ A simple class within pygeostat to manage and use HDF5 files.
- Variables
flname (str) – Path to a HDF5 file to create or use
h5data (h5py.File) – h5py File object
paths (dict) – Dictionary containing all of the groups found in the HDF5 file that contain datasets
- Parameters
flname (str) – Path to a HDF5 file to create or use
Usage:
Write a np.array or pd.Series to the HDF5 file:
>>> H5Store['Group1/Group2/Var1'] = np.array()
Write all the columns in a pd.DataFrame to the HDF5 file:
>>> H5Store['Group1/Group2'] = pd.DataFrame()
Retrieve a single 1-D array:
>>> array = H5Store['Group1/Group2/Var1']
Retrieve a single 1-D array within the root directory of the HDF5 file:
>>> array = H5Store['Var1']
Retrieve the first value from the array:
>>> value = H5Store['Var1', 0]
Retrieve a slice of values from the array:
>>> values = H5Store['Var1', 10:15]
Write Data¶
-
H5Store.
__setitem__
(key, value)¶ Write to the HDF5 file using the self[key] notation.
If a pd.Series or np.array is passed, the last entry in the path is used as the dataset name. If a pd.DataFrame is passed, all columns are written to the specified path as datasets, with their names retrieved from the pd.DataFrame’s columns. If more flexible usage is required, please use gs.write_h5().
.Example
Write a np.array or pd.Series to the HDF5 file:
>>> H5Store['Group1/Group2/Var1'] = np.array()
Write all the columns in a pd.DataFrame to the HDF5 file:
>>> H5Store['Group1/Group2'] = pd.DataFrame()
Read Data¶
-
H5Store.
__getitem__
(key)¶ Retrieve an array using the self[key] notation. The passed key is the path to the desired array, including any groups traversed and the dataset name. The array may be selectively queried, allowing a specific value or range of values to be loaded into memory rather than the whole array.
Example
Retrieve a single 1-D array:
>>> array = H5Store['Group1/Group2/Var1']
Retrieve a single 1-D array within the root directory of the HDF5 file:
>>> array = H5Store['Var1']
Retrieve the first value from the array:
>>> value = H5Store['Var1', 0]
Retrieve a slice of values from the array:
>>> values = H5Store['Var1', 10:15]
Print Contents of HDF5 File¶
-
H5Store.
__str__
()¶ Print a list of groups and the datasets found within them, using the variable self.paths.
Example
Print any groups found within the HDF5 file and the datasets within:
>>> print(H5Store)
Datasets in H5 Store¶
-
H5Store.
datasets
(h5path=None)¶ Return the datasets found in the specified group.
- Keyword Arguments
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which you wish to retrieve the list of datasets. A dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None retrieves datasets from the root directory of the HDF5 file.
- Returns
List of the datasets found within the specified h5path
- Return type
datasets (list)
Generate Iterator¶
-
H5Store.
iteritems
(h5path=None, datasets=None, wildcard=None)¶ Produces an iterator that can be used to iterate over HDF5 datasets.
The parameter h5path indicates which group to retrieve the datasets from. If a set of specific datasets is required, the parameter datasets restricts the iterator to those. The parameter wildcard allows a string wild-card value to restrict which datasets are iterated over.
- Keyword Arguments
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which you wish to retrieve datasets. A dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None retrieves datasets from the root directory of the HDF5 file.
datasets (list) – List of specific dataset names found within the specified group to iterate over
wildcard (str) – String to search for within the names of the datasets found within the specified group to iterate over
Examples
Load an HDF5 file into pygeostat:
>>> data = gs.H5Store('data.h5')
Iterate over all datasets within the root directory of a HDF5 file:
>>> for dataset in data.iteritems():
...     gs.histplt(dataset)
Iterate over the datasets within a specific group that are realizations:
>>> for dataset in data.iteritems(h5path='Simulation/NS_AU', wildcard='Realization'):
...     gs.histplt(dataset)
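An iterator of this shape can be sketched with h5py directly (a hypothetical re-implementation; the wildcard here is a plain substring match, which may differ from pygeostat's behavior):

```python
import h5py

def iter_datasets(flname, h5path=None, wildcard=None):
    # Yield (name, array) pairs for each dataset in the group,
    # optionally filtered by a substring wildcard.
    with h5py.File(flname, 'r') as f:
        grp = f[h5path] if h5path else f
        for name in grp:
            if isinstance(grp[name], h5py.Dataset):
                if wildcard is None or wildcard in name:
                    yield name, grp[name][:]
```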