Data Files¶
The core class in pygeostat is the DataFile class, which contains a pandas DataFrame with the data values and column names, in addition to metadata such as the names of the x, y and z coordinate columns or the grid definition.
DataFile Class¶
-
class pygeostat.data.data.DataFile(flname=None, readfl=None, fltype=None, dftype=None, data=None, columns=None, null=None, title='data', griddef=None, dh=None, x=None, y=None, z=None, ifrom=None, ito=None, weights=None, cat=None, catdict=None, variables=None, notvariables=None, delimiter='\s+', headeronly=False, h5path=None, h5datasets=None, nreals=-1, tmin=None)¶
This class stores geostatistical data values and metadata.
DataFile classes may be created on initialization, or generated using pygeostat functions. This is the primary class for pygeostat and is used for reading and writing GSLIB, CSV, VTK, and HDF5 file formats.
- Parameters
flname (str) – Path (or name) of file to read
readfl (bool) – True if the data file should be read on class initialization
fltype (str) – Type of data file: either csv, gslib, hdf5 or gsb
dftype (str) – Data file type as either ‘point’ or ‘grid’ used for writing out VTK files for visualization
data (pandas.DataFrame) – Pandas dataframe containing array of data values
dicts (List[dict] or dict) – List of dictionaries or dictionary for converting alphanumeric to numeric data
null (float) – Null value for missing values in the data file
title (str) – Title, or name, of the data file
griddef (pygeostat.GridDef) – Grid definition for a gridded data file
dh (str) – Name of drill hole variable
x (str) – Name of X coordinate column
y (str) – Name of Y coordinate column
z (str) – Name of Z coordinate column
ifrom (str) – Name of ‘from’ columns
ito (str) – Name of ‘to’ columns
weights (str or list) – Name of declustering weight column(s)
cat (str) – Name of categorical (e.g., rock type or facies) column
catdict (dict) – Set a dictionary for the categories, which should be formatted as:
catdict = {catcode:catname}
variables (str or list) – Name of continuous variable(s), which if unspecified, are the columns not assigned to the above attributes (via kwargs or inference)
notvariables (str or list) – Name of column(s) to exclude from variables
delimiter (str) – Delimiter used in the data file (i.e., comma or space)
headeronly (bool) – True to read only the header plus one line of the data file, which is useful for getting the column numbers of large files. If reading an HDF5 file, only the HDF5 store information is read.
h5path (str) – Forward-slash (/) delimited path through the group hierarchy from which to read the dataset(s) specified by the h5datasets argument. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
columns (list) – List of column labels to use for the resulting data pd.DataFrame
nreals (int) – Number of realizations to read in; -1 will read all
tmin (float) – If a number is provided, values less than this number (e.g., trimmed or null values) are converted to NaN. May be useful since NaN's are more easily handled within python, matplotlib and pandas. Set to None to disable.
Examples
Quickly reading in a GeoEAS data file:
data_file = gs.DataFile(flname='../data/oilsands.dat')
To read in a GeoEAS data file and assign attributes:
# Point Data Example
data_file = gs.DataFile(flname='../data/oilsands.dat', readfl=True, dh='Drillhole Number',
                        x='East', y='North', z='Elevation')
# Gridded Data Example
griddef = gs.GridDef('''10 0.5 1
10 0.5 1
10 0.5 1''')
data_file = gs.DataFile(flname='../data/3DDecor.dat', griddef=griddef)
# To view the grid definition string
print(data_file.griddef)
# Access some grid definition attributes
data_file.griddef.count()    # returns number of blocks in the grid
data_file.griddef.extents()  # returns an array of the extents for all directions
data_file.griddef.nx         # returns number of blocks in the x direction
HDF5
The HDF5 file format has its own advantages. For one, it reads and writes much faster than the ASCII format. Attributes (like the grid definition) can also be saved within the file, and all files for a single project can be stored in the same file. Please refer to the introduction on HDF5 files for more information.
This class currently only searches for and loads a grid definition.
Examples
HDF5 file simple read example:
data_file = gs.DataFile(flname='../data/oilsands_out.hdf5')
To view the HDF5 header information (tables stored in the file):
data_file.store
If you have an HDF5 file with multiple tables and only want to see which tables are in the file, along with any attributes saved to it, you can do a header-only read:
data_file = gs.DataFile(flname='../data/oilsands_out.hdf5', dftype='hdf5', headeronly=True)
Then to see what tables are written in the hdf5 file:
data_file.store
DataFile Attributes¶
Attributes of a datafile
object are accessed with datafile.<attribute>
.
Columns¶
Access the columns of the datafile. Wrapper for datafile.data.columns.
Num Variables¶
Access the nvar of the datafile, e.g., len(datafile.variables).
Locations¶
Access the locations stored in the datafile. Wrapper for datafile[datafile.xyz].
Example:
>>> datafile = gs.DataFile("somefile.out") # this file has an x, y[, z] attribute that is found
>>> datafile.locations
... dataframe of x, y, z locations
Shape¶
Access the shape of the data stored in the datafile. Wrapper for datafile.data.shape.
Example:
>>> datafile = gs.DataFile("somefile.out")
>>> datafile.shape
... shape of datafile.data
Rename Columns¶
-
DataFile.
rename
(columns)¶ Applies a dictionary to alter self.DataFrame column names. This applies the DataFrame.rename function, but updates any special attributes (dh, x, y, etc.) with the new name, if previously set to the old name. Users should consider using the self.columns property if changing all column names.
- Parameters
columns (dict) – formatted as {oldname1: newname1, oldname2: newname2}, etc., where the old and new names are strings. The old names must be present in data.columns.
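The column/attribute update described above can be sketched in plain Python. This is a hypothetical illustration only (rename_columns and its arguments are invented names, not the pygeostat implementation):

```python
# Sketch: apply a {oldname: newname} mapping to a column list, and update
# any special attributes (dh, x, y, ...) that referenced an old name.
def rename_columns(columns, special_attrs, mapping):
    missing = [old for old in mapping if old not in columns]
    if missing:
        raise KeyError(f"old names not in columns: {missing}")
    new_columns = [mapping.get(c, c) for c in columns]
    new_attrs = {attr: mapping.get(name, name)
                 for attr, name in special_attrs.items()}
    return new_columns, new_attrs

cols, attrs = rename_columns(
    ['East', 'North', 'Bitumen'],   # data columns
    {'x': 'East', 'y': 'North'},    # special attributes
    {'East': 'X', 'North': 'Y'})    # rename mapping
# cols  -> ['X', 'Y', 'Bitumen']
# attrs -> {'x': 'X', 'y': 'Y'}
```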
Drop Columns¶
-
DataFile.
drop
(columns)¶ This applies the DataFrame.drop function, where axis=1, inplace=True and columns is used in place of the labels. It also updates any special attributes (dh, x, y, etc.), setting them to None if dropped. Similarly, if any variables are dropped, they are removed from self.variables.
- Parameters
columns (str or list) – column names to drop
Check for Duplicate Columns¶
-
DataFile.
check_for_duplicate_cols
()¶ Run a quick check on the column names to see if any of them are duplicated. If any are, print a warning and rename the duplicated columns.
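The detection part of this check can be sketched with the standard library (find_duplicate_cols is a hypothetical helper, not the pygeostat code):

```python
from collections import Counter

# Sketch: report any column name that appears more than once.
def find_duplicate_cols(columns):
    counts = Counter(columns)
    return sorted(name for name, n in counts.items() if n > 1)

dupes = find_duplicate_cols(['dh', 'East', 'Au', 'Au'])
# dupes -> ['Au']
```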
Set Columns¶
-
DataFile.
setcol
(colattr, colname=None)¶ Set a specialized column attribute (dh, ifrom, ito, x, y, z, cat or weights) for the DataFile, where DataFile.data must be initialized. If colname is None, then the attribute will be set if a common name for it is detected in DataFile.data.columns (e.g., if colattr='dh' and colname=None, and 'DHID' is found in DataFile.data, then DataFile.dh='DHID'). The attribute will be None if none of the common names are detected. If colname is not None, then the provided string will be assigned to the attribute, e.g., DataFile.colattr=colname. Note, however, that an error will be thrown if colname is not None and colname is not in DataFile.data.columns. This is used on DataFile initialization, but may also be useful for calling after specialized columns are altered.
- Parameters
colattr (str) – must match one of: 'dh', 'ifrom', 'ito', 'x', 'y', 'z', 'cat' or 'weights'
colname (str or list) – if not None, must be the name(s) of a column in DataFile.data. A list is only valid if colattr='weights'
Examples
Set the x attribute (dat.x) based on a specified value:
>>> dat.setcol('x', 'Easting')
Set the x attribute (dat.x), where the function checks common names for x:
>>> dat.setcol('x')
Set Variable Columns¶
-
DataFile.
setvarcols
(variables=None, notvariables=None)¶ Set the variables for the DataFile. If provided, the function checks that the variables are present in the DataFrame. If not provided, the function assigns as variables the columns that are not specialized attributes (dh, x, y, z, rt, weights) and are not in the user-specified notvariables list. This is used on DataFile initialization, but may also be useful for calling after variables are added or removed.
- Parameters
variables (list or str) – list of strings
notvariables (list or str) – list of strings
Examples
Set the variables based on a specified list:
>>> dat.setvarcols(variables=['Au', 'Carbon'])
Set the variables based on the function excluding specialized columns (dh, x, y, etc.):
>>> dat.setvarcols()
Set the variables based on the function excluding specialized columns (dh, x, y, etc.), as well as a user specified list of what is not a variable:
>>> dat.setvarcols(notvariables=['Data Spacing', 'Keyout'])
Set Categorical Dictionary¶
-
DataFile.
setcatdict
(catdict)¶ Set a dictionary for the categories, which should be formatted as:
>>> catdict = {catcode:catname}
Example
>>> catdict = {0: "Mudstone", 1: "Sandstone"}
>>> self.setcatdict(catdict)
Check DataFile¶
-
DataFile.
check_datafile
(flname, variables, sep, fltype)¶ Run some quick checks on the DataFile before writing, and grab info if not provided
Add Coord¶
-
DataFile.
addcoord
()¶ Only use on DataFile classes containing GSLIB style gridded data.
If x, y, or z coordinate column(s) do not exist, they are created. If the created or current columns only have null values, they are populated based on the GridDef class passed to the DataFile class.
Note
A griddef must be assigned to the DataFile class, either at read-in:
>>> data_file = gs.DataFile(flname='test.out', griddef=grid)
Or manually assigned later:
>>> data_file.griddef = gs.GridDef(gridstr=my_grid_str)
Apply Dictionary¶
-
DataFile.
applydict
(origvar, outvar, mydict)¶ Applies a dictionary to the original variable to get a new variable.
This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.
- Parameters
origvar (str) – Name of original variable.
outvar (str) – Name of output variable.
mydict (dict) – Dictionary of values to apply.
Examples
>>> data_file.applydict('Drillhole', 'Drillhole-mod', mydict)
Describe DataFile¶
-
DataFile.
describe
(variables=None)¶ Describe a data set using pandas describe(), but exclude special variables.
- Keyword Arguments
variables (List(str)) – List of variables to describe.
- Returns
Pandas description of variables.
- Return type
self.data[variables].describe()
Examples
Describe all non-special variables in the DataFrame (columns set as the drill hole ID, coordinate columns, etc. are excluded):
>>> data_file.describe()
Or describe specific variables
>>> data_file.describe(['Bitumen', 'Fines'])
Infer Grid Definition¶
-
DataFile.
infergriddef
(blksize=None, databuffer=5, nblk=None)¶ Infer a grid definition with the specified dimensions to cover the set of data values. The function operates with two primary options:
Provide a block size (node spacing), the function infers the required number of blocks (grid nodes) to cover the data
Provide the number of blocks, the function infers the required block size
A data buffer may be used to expand the grid beyond the data extents. Basic integer rounding is also applied in an attempt to provide a 'nice' grid in terms of the origin alignment.
- Parameters
blksize (float or 3-tuple) – provides (xsiz, ysiz, zsiz). If blksize is not None, nblk must be None. Set zsiz None if the grid is 2-D. A float may also be provided, where xsiz = ysiz = zsiz = float is assumed.
databuffer (float or 3-tuple) – buffer between the data and the edge of the model, optionally for each direction
nblk (int or 3-tuple) – provides (nx, ny, nz). If blksize is not None, nblk must be None. Set nz to None or 1 if the grid is 2-D. An int may also be provided, where nx = ny = nz = int is assumed.
- Returns
this function returns the grid definition object as well as assigns the griddef to the current gs.DataFile
- Return type
griddef (GridDef)
Note
this function assumes things are either 3D or 2D along the xy plane. If nx == 1 or ny == 1, nonsense will result!
Usage:
First, import a datafile using gs.DataFile(), make sure to assign the correct columns to x, y and z:
>>> datfl = gs.DataFile('test.dat',x='x',y='y',z='z')
Now create the griddef from the data contained within the dataframe:
>>> blksize = (100, 50, 1)
>>> databuffer = (10, 25, 0)  # buffer in the x, y and z directions
>>> griddef = datfl.infergriddef(blksize, databuffer)
Check by printing out the resulting griddef:
>>> print(griddef)
Examples
For 3D data, infergriddef() returns a 3D grid definition even if zsiz is given as None or 0 or 1:
df3d = gs.ExampleData("point3d_ind_mv")
a = df3d.infergriddef(blksize=[50, 60, 1])
b = df3d.infergriddef(blksize=[50, 60, None])
c = df3d.infergriddef(blksize=[50, 60, 0])
# a, b and c are returned as Pygeostat GridDef:
# 20 135.0 50.0
# 19 1230.0 60.0
# 82 310.5 1.0
For 3D data, nz given as None or 0 or 1 returns a 2D grid that covers the vertical extent of the 3D data:
d = df3d.infergriddef(nblk=[50, 60, 1])
e = df3d.infergriddef(nblk=[50, 60, None])
f = df3d.infergriddef(nblk=[50, 60, 0])
# d, e and f are returned as Pygeostat GridDef:
# 50 119.8 19.6
# 60 1209.1 18.2
# 1 350.85 81.7
Where xsiz = ysiz = zsiz, a float can also be provided, or where nx = ny = nz, an int can also be provided:
df3d.infergriddef(blksize=75)
df3d.infergriddef(blksize=[75, 75, 75])  # returns the same as the line above
df3d.infergriddef(nblk=60)
df3d.infergriddef(nblk=[60, 60, 60])     # returns the same as the line above
If the data is 2-D, zsiz or nz must be provided as None. Otherwise an exception is raised:
df2d = gs.ExampleData("point2d_ind")
df2d.infergriddef(nblk=[60, 60, None])
df2d.infergriddef(blksize=[50, 60, None])
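The blksize-to-nblk inference can be sketched with simple arithmetic for one axis. This is an illustration of the idea only (infer_axis is a hypothetical helper; pygeostat also applies the origin rounding mentioned above, which this sketch omits):

```python
import math

# Sketch: given the data extents, a buffer and a block size along one axis,
# infer the number of blocks and the centre of the first block (GSLIB grids
# are defined by the first block centre).
def infer_axis(vmin, vmax, buffer, size):
    span = (vmax - vmin) + 2.0 * buffer
    n = max(1, math.ceil(span / size))
    origin = (vmin - buffer) + size / 2.0
    return n, origin

n, origin = infer_axis(0.0, 990.0, 10.0, 100.0)
# span of 1010 at 100 per block -> n = 11, origin = 40.0
```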
File Name String¶
-
DataFile.
__str__
()¶ Return the name of the data file if asked to ‘print’ the data file… or use the datafile in a string!
Generate Dictionary¶
-
DataFile.
gendict
(var, outvar=None)¶ Generates a dictionary with unique IDs from alphanumeric IDs. This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.
- Parameters
var (str) – Variable to generate a dictionary for
- Keyword Arguments
outvar (str) – Variable to generate using generated dictionary.
- Returns
Dictionary of alphanumerics to numeric ids.
- Return type
newdict (dict)
Examples
A simple call
>>> data_file.gendict('Drillhole')
OR
>>> dh_dict = data_file.gendict('Drillhole')
GSLIB Column¶
-
DataFile.
gscol
(variables, string=True)¶ Returns the GSLIB (1-ordered) column given a (list of) variable(s).
- Parameters
variables (str or List(str)) – Variable name or list of variable names.
- Keyword Arguments
string (bool) – If True returns the columns as a string.
- Returns
GSLIB 1-ordered column(s).
- Return type
cols (int or List(int) or string)
Note
None input returns a 0, which may be necessary, for example, with 2-D data:
>>> data.xyz
... ['East', 'North', None]
>>> data.gscol(data.xyz)
... '2 3 0'
Examples
Some simple calls
>>> data_file.gscol('Bitumen')
... 5
>>> data_file.gscol(['Bitumen', 'Fines'])
... [5, 6]
>>> data_file.gscol(['Bitumen', 'Fines'], string=True)
... '5 6'
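The 1-ordered lookup can be sketched in plain Python (gslib_columns is a hypothetical helper for illustration, not the pygeostat implementation):

```python
# Sketch: map variable names to GSLIB's 1-ordered column numbers,
# with None mapping to 0 (e.g., no z column for 2-D data).
def gslib_columns(columns, variables, string=False):
    cols = [0 if v is None else columns.index(v) + 1 for v in variables]
    return ' '.join(str(c) for c in cols) if string else cols

cols = gslib_columns(['dh', 'East', 'North', 'Bitumen'],
                     ['East', 'North', None])
# cols -> [2, 3, 0]
```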
Truncate NaN’s¶
-
DataFile.
truncatenans
(variable)¶ Returns a truncated list with nans removed for a variable.
- Parameters
variable (str) – Name of original variable.
- Returns
Truncated values.
- Return type
truncated (values)
Examples
A simple call that will return the list
>>> data_file.truncatenans('Bitumen')
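Conceptually this is a NaN filter over one variable's values; a standard-library sketch of that behaviour (truncate_nans is a hypothetical helper, not the pygeostat method):

```python
import math

# Sketch: return the values of one variable with NaN entries removed.
def truncate_nans(values):
    return [v for v in values
            if not (isinstance(v, float) and math.isnan(v))]

vals = truncate_nans([1.5, float('nan'), 2.0])
# vals -> [1.5, 2.0]
```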
Unique Categories¶
-
DataFile.
unique_cats
(variable, truncatenans=False)¶ Returns a sorted list of the unique categories given a variable.
- Parameters
variable (str) – Name of original variable.
- Keyword Arguments
truncatenans (bool) – Truncates missing values if True.
- Returns
Sorted list of the unique categories.
- Return type
unique_cats (List(object))
Examples
A simple call:
>>> data_file.unique_cats('Drillhole')
Or to save the list
>>> unique_dh_list = data_file.unique_cats('Drillhole')
Write file¶
-
DataFile.
write_file
(flname, title=None, variables=None, fmt=None, sep=None, fltype=None, data=None, h5path=None, griddef=None, null=None, tvar=None, nreals=1)¶ Writes out a GSLIB-style, VTK, CSV, Excel (XLSX) or HDF5 data file.
- Parameters:
flname (str): Path (or name) of file to write out.
- Keyword Args:
title (str): Title for output file.
variables (List(str)): List of variables to write out if only a subset is desired.
fmt (str): Format to use for floating point numbers.
sep (str): Delimiter to use for file output, generally don't need to change.
fltype (str): Type of file to write: either gslib, vtk, csv, xlsx or hdf5.
data (str): Subset of data to write out - cannot be used with the variables option!
h5path (str): The h5 group path to write data to (H5 filetype)
griddef (obj): a gslib griddef object
tvar (str): Name of variable to use for compression when NaNs exist within it
nreals (int): number of realizations you are writing out (needed for GSB)
null (float): If a number is provided, NaN numbers are converted to this value prior to writing. May be useful since NaN's are more easily handled within python and pandas than null values, but are not valid in GSLIB. Set to None to disable (but NaN's must be handled prior to this function call if so).
- Note:
pygeostat.write_file is kept for backwards compatibility or as an overloaded class method. The current write functions can be called separately with the functions listed below:
>>> import pygeostat as gs
>>> import pandas as pd
>>> gs.write_gslib(gs.DataFile or pd.DataFrame)
>>> gs.write_csv(gs.DataFile or pd.DataFrame)
>>> gs.write_hdf5(gs.DataFile or pd.DataFrame)
>>> gs.write_vtk(gs.DataFile or pd.DataFrame)
>>> gs.write_gsb(gs.DataFile or pd.DataFrame)
Note: The GSB format is not specifically intended for general users of pygeostat. Some CCG programs use GSB, a compressed GSLIB-like binary data format that greatly reduces the computational expense.
The following calls:
>>> data_file.write_file('testgslib.out')
>>> data_file.write_file('testgsb.gsb')
are equivalent to:
>>> gs.write_gslib(data_file, 'testgslib.out')
>>> gs.write_gsb(data_file, 'testgsb.gsb')
and similar to:
>>> gs.write_gslib(data_file.data, 'testgslib.out')
>>> gs.write_gsb(data_file.data, 'testgsb.gsb')
Data Spacing¶
-
DataFile.
spacing
(n_nearest, var=None, inplace=True, dh=None, x=None, y=None)¶ Calculates data spacing in the xy plane, based on the average distance to the nearest n_nearest neighbours. The x, y coordinates of 3-D data may be provided in combination with a dh (drill hole or well), in which case the mean x, y of each dh is calculated before performing the calculation. If a dh is not provided in combination with 3-D xy's, then the calculation is applied to all data and may create memory issues if more than ~5000-10000 records are provided. A var specifier allows the calculation to only be applied where var is not NaN.
If inplace==True: The output is concatenated as a 'Data Spacing ({Parameters['plotting.unit']})' column (or 'Data Spacing' if Parameters['plotting.unit'] is None). If var is used, then the calculation is only performed where DataFile[var] is not NaN, and the output is concatenated as '{var} Data Spacing ({Parameters['plotting.unit']})'.
If inplace==False: The function returns dspace as a numpy array if dspace.shape[0] is equal to DataFile.shape[0], meaning that the dh and var functionality was not used, or did not lead to differences in the length of dspace and DataFile (so that the x and y in DataFile can be used for plotting dspace in map view). The function returns a tuple of the form (dspace, dh, x, y) if dh is not None and dspace.shape[0] is not equal to DataFile.shape[0]. The function returns a tuple of the form (dspace, x, y) if dh is None, var is not None and dspace.shape[0] is not equal to DataFile.shape[0].
- Parameters
n_nearest (int) – number of nearest neighbours to consider in data spacing calculation
var (str) – variable for calculating data spacing, where the calculation is only applied to locations where var is not NaN. If None, the calculation is to all locations.
inplace (bool) – if True, the output data spacing is concatenated
dh (str) – dh name, which can override self.dh
x (str) – x coordinate name, which can override self.x
y (str) – y coordinate name, which can override self.y
Examples
Calculate data spacing without consideration of underlying variables, based on the nearest 8 neighbours.
>>> dat.spacing(8)
Output as a numpy array rather than concatenating a column:
>>> dspace = dat.spacing(8, inplace=False)
Only consider values where Au is not NaN for the calculation:
>>> (dspace, x, y) = dat.spacing(8, inplace=False, var='Au')
Example Data¶
-
pygeostat.data.data.
ExampleData
(testfile, griddef=None, **kwargs)¶ Get an example pygeostat DataFile
- Parameters
testfile (str) – one of the available pygeostat test files, listed below
Test files available in pygeostat include:
“point2d_ind”: 2d indicator dataset
“point2d_surf”: 2d point dataset sampling a surface
“grid2d_surf”: ‘Thickness’ from ‘point2d_surf’ interpolated on the grid
“point3d_ind_mv”: 3d multivariate and indicator dataset
“oilsands”: 3D Oil sands data set
“accuracy_plot”: Simulated realizations to test accuracy plot
“location_plot”: 2D data set to test location plot
“3d_grid”: 3D gridded data set
“point2d_mv” : 2D multivariate data set
“cluster”: GSLIB datafile (data with declustering weights)
“97data”: GSLIB datafile (the first 97 rows of cluster datafile)
“data”: GSLIB datafile (2D data set of primary and secondary variable)
“parta”: GSLIB datafile (small 2D dataset part A)
“partb”: GSLIB datafile (small 2D dataset part B)
“partc”: GSLIB datafile (small 2D dataset part C)
“true”: GSLIB datafile (Primary secondary data pairs)
“ydata”: GSLIB datafile (2D spatial secondary data with some primary data)
Input/Output Tools¶
iotools.py: Contains input/output utilities and functions for pygeostat, many of which are based on pandas built-in functions.
Read File¶
-
pygeostat.data.iotools.
read_file
(flname, fltype=None, headeronly=False, delimiter='\\s*', h5path=None, h5datasets=None, columns=None, ireal=1, griddef=None, tmin=None)¶ Reads in GSLIB-style Geo-EAS, CSV, GSB or HDF5 data files.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
fltype (str) – Type of file to read: either csv, gslib or hdf5.
headeronly (bool) – If True, only reads in the first line from the data file, which is useful for just getting column numbers or testing. Alternatively, it allows you to open an HDF5 object with pandas HDFStore functionality.
delimiter (str) – Delimiter specified instead of sniffing
h5path (str) – Forward-slash (/) delimited path through the group hierarchy from which to read the dataset(s) specified by the h5datasets argument. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
columns (list) – List of column labels to use for the resulting frame
ireal (int) – Number of realizations in the file
griddef (GridDef) – griddef for the realization
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Note
The read functions can also be called separately with the following code:
>>> data.data = pygeostat.read_gslib(flname)
>>> data.data = pygeostat.read_csv(flname)
>>> data.data = pygeostat.read_h5(flname, h5path='')
>>> data.data = pygeostat.read_gsb(flname)
>>> data.data = pygeostat.open_hdf5(flname)
Examples
>>> data.data = gs.read_gsb('testgsb.gsb')
>>> data = gs.DataFile('testgsb.gsb')
Read CSV¶
-
pygeostat.data.iotools.
read_csv
(flname, headeronly=False, tmin=None)¶ Reads in a GSLIB-style CSV data file.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
delimiter (str) – Delimiter specified instead of sniffing
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Read GSLIB Python¶
-
pygeostat.data.iotools.
read_gslib
(flname, headeronly=False, delimiter='\\s*', tmin=None)¶ Reads in a GSLIB-style Geo-EAS data file
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
delimiter (str) – Delimiter specified instead of sniffing
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
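For reference, the Geo-EAS layout being read is: a title line, the number of columns, one column name per line, then whitespace-delimited values. A minimal standard-library sketch of parsing that layout (parse_geoeas is a hypothetical helper, not the pygeostat reader, which also handles trimming and delimiter sniffing):

```python
import io

# Sketch: parse a Geo-EAS / GSLIB-style file into (title, columns, rows).
def parse_geoeas(text):
    lines = io.StringIO(text)
    title = next(lines).rstrip('\n')
    ncol = int(next(lines).split()[0])  # extra tokens on this line are ignored
    columns = [next(lines).strip() for _ in range(ncol)]
    rows = [[float(tok) for tok in ln.split()] for ln in lines if ln.strip()]
    return title, columns, rows

sample = "data\n2\nEast\nBitumen\n100.0 8.2\n150.0 6.9\n"
title, columns, rows = parse_geoeas(sample)
# -> ('data', ['East', 'Bitumen'], [[100.0, 8.2], [150.0, 6.9]])
```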
Fortran Compile for GSB¶
-
pygeostat.data.iotools.
compile_pygsb
()¶ Compiles ‘pygeostat/fortran/src/pygsb.f90’ using ‘pygeostat/fortran/compile.py’ and tries to import pygsb.pyd
Note
How to install a gfortran compiler:
Install chocolatey from:
chocolatey.org/install
(chocolatey is a package manager that lets you install software using the command prompt and PowerShell)
After installing chocolatey, install the GNU Fortran compiler by running the following in PowerShell:
choco install mingw --version 8.1
choco install visualstudio2019community
choco install visualstudio2019-workload-vctools
When installing mingw through chocolatey, ensure that the path of the mingw bin folder is added to the PATH environment variable.
Read GSB¶
-
pygeostat.data.iotools.
read_gsb
(flname, ireal=-1, tmin=None, null=None)¶ Reads in a CCG GSB (GSLIB-Binary) file.
- Parameters
flname (str) – Path (or name) of file to read.
- Keyword Arguments
ireal (int) – 1-indexed realization number to read (reads 1 at a time), -1 to read all
tmin (float) – values less than this number are converted to NaN, since NaN's are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters['data.tmin'].
null (float) – when the gsb array has a keyout, this value fills the array in keyed-out locations on reconstruction. If None, taken from Parameters['data.null']
- Returns
Pandas DataFrame object with input data.
- Return type
data (pandas.DataFrame)
Code author: Jared Deutsch 2016-02-19
Write GSLIB Python¶
-
pygeostat.data.iotools.
write_gslib
(data, flname, title=None, variables=None, fmt=None, sep=' ', null=None)¶ Writes out a GSLIB-style data file.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
- Keyword Arguments
title (str) – Title for output file.
variables (List(str)) – List of variables to write out if only a subset is desired.
fmt (str) – Format to use for floating point numbers.
sep (str) – Delimiter to use for file output, generally don’t need to change.
null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters['data.null'].
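The layout produced mirrors the Geo-EAS read layout: title, column count, column names, then delimited rows with NaN replaced by the null value. A minimal standard-library sketch (format_gslib is a hypothetical helper, not the pygeostat writer):

```python
import math

# Sketch: format rows as GSLIB-style text, substituting null for NaN.
def format_gslib(title, columns, rows, null=-999.0, fmt='%.5f'):
    out = [title, str(len(columns))] + list(columns)
    for row in rows:
        out.append(' '.join(fmt % (null if math.isnan(v) else v) for v in row))
    return '\n'.join(out) + '\n'

text = format_gslib('data', ['East', 'Bitumen'], [[100.0, float('nan')]])
# text -> "data\n2\nEast\nBitumen\n100.00000 -999.00000\n"
```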
Write CSV¶
-
pygeostat.data.iotools.
write_csv
(data, flname, variables=None, fmt='%.5f', sep=', ', fltype='csv', null=None)¶ Writes out a CSV or Excel (XLSX) data file.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
- Keyword Arguments
variables (List(str)) – List of variables to write out if only a subset is desired.
fmt (str) – Format to use for floating point numbers.
sep (str) – Delimiter to use for file output, generally don’t need to change.
fltype (str) – Type of file to write: either csv or xlsx.
null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters['data.null'].
Write GSB¶
-
pygeostat.data.iotools.
write_gsb
(data, flname, tvar=None, nreals=1, variables=None, griddef=None, fmt=0)¶ Writes out a GSB (GSLIB-Binary) style data file. NaN values of tvar are compressed in the output, so no tmin needs to be provided.
- Parameters
data (pygeostat.DataFile or pandas.DataFrame) – data to write out
flname (str) – Path (or name) of file to write out.
tvar (str) – Variable to trim by or None for no trimming. Note that all variables are trimmed in the data file (for compression) when this variable is trimmed.
nreals (int) – number of realizations in data
- Keyword Arguments
griddef (pygeostat.griddef.GridDef) – This is required if the data is gridded and you want other gsb programs to read it
fmt (int) – if 0, all variables are written out as float64. Otherwise, should be a list with a length equal to the number of variables, using the following format codes: 1=int32, 2=float32, 3=float64
variables (List(str)) – List of variables to write out if only a subset is desired.
Code author: Jared Deutsch 2016-02-19, modified by Ryan Barnett 2018-04-12
Write VTK¶
-
pygeostat.data.iotools.
write_vtk
(data, flname, dftype=None, x=None, y=None, z=None, variables=None, griddef=None, null=None, vdtype=None, cdtype=None)¶ Writes out an XML VTK data file. A required dependency is pyevtk, which may be installed using the following command:
>>> pip install pyevtk
Users are also recommended to install the latest Paraview, as versions from 2017 were observed to have odd precision bugs with the XML format.
- Parameters
data (pygeostat.DataFile) – data to write out
flname (str) – Path (or name) of file to write out (without extension)
- Keyword Arguments
dftype (str) – type of datafile: grid or point, which if None, is drawn from data.dftype
x (str) – name of the x-coordinate, which is used if point. Drawn from data.x if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
y (str) – name of the y-coordinate, which is used if point. Drawn from data.y if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
z (str) – name of the z-coordinate, which is used if point. Drawn from data.z if the kwarg=None. If not provided by these means for sgrid, calculated via sim.griddef.get_coordinates().
griddef (pygeostat.GridDef) – grid definition, which is required if grid. Drawn from data.griddef if the kwarg=None.
variables (list or str) – List or string of variables to write out. If None, then all columns aside from coordinates are written out by default.
null (float) – NaNs are converted to this value prior to writing. If None, set to pygeostat.Parameters['data.null_vtk'].
vdtype (dict(str)) – Dictionary of the format {'varname': dtype}, where dtype is a numpy data format. May be used for reducing file size, by specifying int, float32, etc. If a format string is provided instead of a dictionary, that format is applied to all variables. This is not applied to coordinate variables (if applicable). If None, the value is drawn from Parameters['data.write_vtk.vdtype'].
cdtype (str) – Numpy format to use for the output of coordinates, where valid formats are float64 (default) and float32. The latter is recommended for reducing file sizes, but may not provide the requisite precision for UTM coordinates. If None, the value is drawn from Parameters['data.write_vtk.cdtype'].
dftype should be one of:
‘point’ (irregular points) where data.x, data.y and data.z are columns in data.data
‘grid’ (regular or rectilinear grid) where data.griddef must be initialized
‘sgrid’ (structured grid) where data.x, data.y and data.z are columns in data.data. data.griddef should also be initialized, although only griddef.nx, griddef.ny and griddef.nz are utilized (since the grid is assumed to not be regular)
Write HDF5 VTK¶
-
pygeostat.data.iotools.
write_hvtk
(data, flname, griddef, variables=None)¶ Writes out an H5 file and corresponding xdmf file that Paraview can read. Currently only supports 3D gridded datasets. This function will fail if the length of the DataFile or DataFrame does not equal
griddef.count()
. The extension xdmf is silently enforced; any other extension passed is replaced.
- Parameters
data (pd.DataFrame) – The DataFrame to write out
flname (str) – Path (or name) of file to write out.
griddef (GridDef) – Grid definitions for the realizations to be written out
variables (str or list) – optional set of variables to write out from the DataFrame
Count Lines in File¶
-
pygeostat.data.iotools.
file_nlines
(flname)¶ Open a file and get the total number of lines. Adapted from Stack Overflow: http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
- Parameters
flname (str) – Name of the file to read
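A sketch of the chunked counting approach the linked Stack Overflow answer describes (an illustrative re-implementation, not pygeostat's exact source):

```python
from functools import partial

def count_lines(flname):
    """Count lines by scanning fixed-size binary chunks, so the whole
    file is never held in memory at once."""
    with open(flname, 'rb') as f:
        return sum(chunk.count(b'\n')
                   for chunk in iter(partial(f.read, 1 << 20), b''))
```

Note that this counts newline characters, so a file without a trailing newline would report one fewer than the number of text lines.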
Write CCG GMM¶
-
pygeostat.data.iotools.
writeout_gslib_gmm
(gmm, outfile)¶ Write out a fitted Gaussian mixture to the format consistent with gmmfit from the CCG Knowledge Base. Assumes gmm is an sklearn.mixture.GaussianMixture class fitted to data.
Note
In scikit-learn, GMM was recently replaced with GaussianMixture, and there are subtle differences in attributes between the two versions.
- Parameters
gmm (GaussianMixture) – a fitted mixture model
outfile (str) – the output file
HDF5 I/O¶
Write HDF5¶
-
pygeostat.data.h5_io.
write_h5
(data, flname, h5path=None, datasets=None, dtype=None, gridstr=None, trim_variable=None, var_min=-998.0)¶ Write data to an HDF5 file using the Python package h5py. The file is appended to; if a dataset already exists, it is overwritten.
- Parameters
data – A 1-D np.array/pd.Series or a pd.DataFrame containing different variables as columns
flname (str) – Path of the HDF5 file you wish to write to or create
h5path (str) – Forward slash (/) delimited path through the group hierarchy into which the dataset(s) specified by the argument datasets are placed. The dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
datasets (str or list) – Name of the dataset(s) to write out. If a pd.DataFrame is passed, the values passed by the argument datasets must match the DataFrame’s columns.
dtype (str) – The data type to write. Currently, only the following values are permitted: ['int32', 'float32', 'float64']. If a pd.DataFrame is passed and this argument is left to its default value of None, the DataFrame’s dtypes must be of the types listed above.
gridstr (str) – Grid definition string that is saved to the HDF5 file as an attribute of the group defined by the parameter h5path.
trim_variable (str) – Variable to use for trimming the data. An index is written to the HDF5 file and is used to rebuild the dataset, while only non-trimmed data is written out.
var_min (float) – Minimum trimming limit used if trim_variable is passed.
Examples
Write a single pd.Series or np.array to an HDF5 file:
>>> gs.write_h5(array, 'file.h5', h5path='Modeled/Var1', datasets='Realization_0001')
Write a whole pd.DataFrame to the group (folder) ‘OriginalData’, which will contain a dataset for every column in the pd.DataFrame:
>>> gs.write_h5(DataFrame, 'file.h5', h5path='OriginalData')
Read HDF5¶
-
pygeostat.data.h5_io.
read_h5
(flname, h5path=None, datasets=None, fill_value=-999)¶ Return a 1-D array from an HDF5 file, or build a pd.DataFrame from a list of datasets in a single group.
The argument h5path must be a path to a group. If one or more specific variables are to be loaded, pass a list to datasets to specify which to read.
- Parameters
flname (str) – Path of the HDF5 you wish to write to or create
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which the dataset(s) specified by the argument datasets are read. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None reads from the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
fill_value (float or np.NaN) – Value to fill in the grid with if trimmed data was written out. Default is -999.
- Returns
DataFrame containing one or more columns, each containing a single 1-D array of a variable.
- Return type
data (pd.DataFrame)
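The group/dataset layout these two functions write and read can be mimicked directly with h5py. This round-trip sketch assumes h5py is installed and does not call pygeostat itself:

```python
import h5py
import numpy as np
import pandas as pd

df = pd.DataFrame({'Var1': np.arange(5.0), 'Var2': np.arange(5.0) * 2.0})

# Write each DataFrame column as a dataset under the group 'OriginalData',
# mirroring the layout write_h5 produces for a DataFrame.
with h5py.File('example.h5', 'w') as f:
    grp = f.require_group('OriginalData')
    for col in df.columns:
        grp.create_dataset(col, data=df[col].to_numpy())

# Rebuild a DataFrame from every dataset in the group, as read_h5 does
# when `datasets` is None.
with h5py.File('example.h5', 'r') as f:
    grp = f['OriginalData']
    back = pd.DataFrame({name: grp[name][:] for name in grp})
```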
Is HDF5¶
-
pygeostat.data.h5_io.
ish5dataset
(h5fl, dataset, h5path=None)¶ Check to see if a dataset exists within an HDF5 file
The argument h5path must be a path to a group and cannot contain the dataset name. Only one dataset can be checked at a time.
- Parameters
h5fl (str) – Path of the HDF5 file you wish to check
h5path (str) – Forward slash (/) delimited path through the group hierarchy in which to check for the specified dataset. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None checks the root directory of the HDF5 file.
dataset (str) – Name of the dataset to check for in the group specified by h5path.
- Returns
Indicator if the specified dataset exists
- Return type
exists (bool)
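A minimal re-implementation sketch of this check using h5py (hypothetical; pygeostat's actual implementation may differ):

```python
import h5py

def dataset_exists(h5fl, dataset, h5path=None):
    # Open read-only, walk to the group (root if h5path is None), and
    # test that the name is present and is a dataset rather than a group.
    with h5py.File(h5fl, 'r') as f:
        grp = f[h5path] if h5path else f
        return dataset in grp and isinstance(grp[dataset], h5py.Dataset)
```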
Combine Datasets from Multiple Paths¶
-
pygeostat.data.h5_io.
h5_combine_data
(flname, h5paths, datasets=None)¶ Combine data into one DataFrame from multiple paths in an HDF5 file.
- Parameters
flname (str) – Path of the HDF5 you wish to read from
h5paths (list) – A list of h5paths to combine. Each is a forward slash (/) delimited path through the group hierarchy from which the dataset(s) specified by the argument datasets are read. The dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None reads from the root directory of the HDF5 file.
datasets (list of lists) – If only a specific set of datasets from each path is desired, pass a list of lists of equal length to the h5paths list. An empty list within the list causes all datasets in the corresponding path to be read in.
- Returns
DataFrame
Example:
>>> flname = 'drilldata.h5'
>>> h5paths = ['/Orig_data/series4870/', 'NS/Declus/series4870/']
>>> datasets = [['LOCATIONX', 'LOCATIONY', 'LOCATIONZ'], []]
>>> data = gs.h5_combine_data(flname, h5paths, datasets=datasets)
Pygeostat HDF5 Class¶
-
class
pygeostat.data.h5_io.
H5Store
(flname, replace=False)¶ A simple class within pygeostat to manage and use HDF5 files.
- Variables
flname (str) – Path to a HDF5 file to create or use
h5data (h5py.File) – h5py File object
paths (dict) – Dictionary containing all of the groups found in the HDF5 file that contain datasets
- Parameters
flname (str) – Path to a HDF5 file to create or use
Usage:
Write a np.array or pd.Series to the HDF5 file:
>>> H5Store['Group1/Group2/Var1'] = np.array()
Write all the columns in a pd.DataFrame to the HDF5 file:
>>> H5Store['Group1/Group2'] = pd.DataFrame()
Retrieve a single 1-D array:
>>> array = H5Store['Group1/Group2/Var1']
Retrieve a single 1-D array within the root directory of the HDF5 file:
>>> array = H5Store['Var1']
Retrieve the first value from the array:
>>> value = H5Store['Var1', 0]
Retrieve a slice of values from the array:
>>> values = H5Store['Var1', 10:15]
Write Data¶
-
H5Store.
__setitem__
(key, value)¶ Write to the HDF5 file using the self[key] notation.
If a pd.Series or np.array is passed, the last entry in the path is used as the dataset name. If a pd.DataFrame is passed, all columns are written to the specified path as datasets, with their names retrieved from the pd.DataFrame’s columns. If more flexible usage is required, please use gs.write_h5().
.Example
Write a np.array or pd.Series to the HDF5 file:
>>> H5Store['Group1/Group2/Var1'] = np.array()
Write all the columns in a pd.DataFrame to the HDF5 file:
>>> H5Store['Group1/Group2'] = pd.DataFrame()
Read Data¶
-
H5Store.
__getitem__
(key)¶ Retrieve an array using the self[key] notation. The passed key is the path to the desired array, including any groups traversed and the dataset name. The array may be selectively queried, allowing a specific value or range of values to be loaded into memory rather than the whole array.
Example
Retrieve a single 1-D array:
>>> array = H5Store['Group1/Group2/Var1']
Retrieve a single 1-D array within the root directory of the HDF5 file:
>>> array = H5Store['Var1']
Retrieve the first value from the array:
>>> value = H5Store['Var1', 0]
Retrieve a slice of values from the array:
>>> values = H5Store['Var1', 10:15]
Print Contents of HDF5 File¶
-
H5Store.
__str__
()¶ Print a list of groups and the datasets found within them, using the variable self.paths.
Example
Print any groups found within the HDF5 file and the datasets within:
>>> print(H5Store)
Datasets in H5 Store¶
-
H5Store.
datasets
(h5path=None)¶ Return the datasets found in the specified group.
- Keyword Arguments
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which you wish to retrieve the list of datasets. A dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None retrieves datasets from the root directory of the HDF5 file.
- Returns
List of the datasets found within the specified h5path
- Return type
datasets (list)
Generate Iterator¶
-
H5Store.
iteritems
(h5path=None, datasets=None, wildcard=None)¶ Produces an iterator that can be used to iterate over HDF5 datasets.
The parameter h5path indicates which group to retrieve the datasets from. If a set of specific datasets is required, the parameter datasets restricts the iterator to those. The parameter wildcard allows a string wild-card value to restrict which datasets are iterated over.
- Keyword Arguments
h5path (str) – Forward slash (/) delimited path through the group hierarchy from which you wish to retrieve datasets. A dataset name cannot be passed using this argument; it is interpreted as a group name. A value of None retrieves datasets from the root directory of the HDF5 file.
datasets (list) – List of specific dataset names found within the specified group to iterate over
wildcard (str) – String to search for within the names of the datasets found within the specified group to iterate over
Examples
Load an HDF5 file into pygeostat:
>>> data = gs.H5Store('data.h5')
Iterate over all datasets within the root directory of a HDF5 file:
>>> for dataset in data.iteritems():
...     gs.histplt(dataset)
Iterate over the datasets within a specific group that are realizations:
>>> for dataset in data.iteritems(h5path='Simulation/NS_AU', wildcard='Realization'):
...     gs.histplt(dataset)
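An iterator of this shape can be sketched with h5py directly (a hypothetical re-implementation; the wildcard here is a plain substring match, which may differ from pygeostat's behavior):

```python
import h5py

def iter_datasets(flname, h5path=None, wildcard=None):
    # Yield (name, array) pairs for each dataset in the group,
    # optionally filtered by a substring wildcard.
    with h5py.File(flname, 'r') as f:
        grp = f[h5path] if h5path else f
        for name in grp:
            if isinstance(grp[name], h5py.Dataset):
                if wildcard is None or wildcard in name:
                    yield name, grp[name][:]
```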