Data Files

The core class in pygeostat is the DataFile class which contains a Pandas DataFrame with the data values and column names in addition to metadata, such as the name of the x, y and z coordinates or grid definition.

DataFile Class

class pygeostat.data.data.DataFile(flname=None, readfl=None, fltype=None, dftype=None, data=None, columns=None, null=None, title='data', griddef=None, dh=None, x=None, y=None, z=None, ifrom=None, ito=None, wts=None, cat=None, catdict=None, variables=None, notvariables=None, delimiter='\s+', headeronly=False, h5path=None, h5datasets=None, nreals=-1, tmin=None)

This class stores geostatistical data values and metadata.

DataFile classes may be created on initialization, or generated using pygeostat functions. This is the primary class for pygeostat and is used for reading and writing GSLIB, CSV, VTK, and HDF5 file formats.

Parameters:
  • flname (str) – Path (or name) of file to read
  • readfl (bool) – True if the data file should be read on class initialization
  • fltype (str) – Type of data file: either csv, gslib, hdf5, or gsb
  • dftype (str) – Data file type as either ‘point’ or ‘grid’ used for writing out VTK files for visualization
  • data (pandas.DataFrame) – Pandas dataframe containing array of data values
  • dicts (List[dict] or dict) – List of dictionaries or dictionary for converting alphanumeric to numeric data
  • null (float) – Null value for missing values in the data file
  • title (str) – Title, or name, of the data file
  • griddef (pygeostat.GridDef) – Grid definition for a gridded data file
  • dh (str) – Name of drill hole variable
  • x (str) – Name of X coordinate column
  • y (str) – Name of Y coordinate column
  • z (str) – Name of Z coordinate column
  • ifrom (str) – Name of ‘from’ columns
  • ito (str) – Name of ‘to’ columns
  • wts (str or list) – Name of declustering weight column(s)
  • cat (str) – Name of categorical (e.g., rock type or facies) column
  • catdict (dict) – Set a dictionary for the categories, which should be formatted as: catdict = {catcode:catname}
  • variables (str or list) – Name of continuous variable(s), which if unspecified, are the columns not assigned to the above attributes (via kwargs or inference)
  • notvariables (str or list) – Name of column(s) to exclude from variables
  • delimiter (str) – Delimiter used in data file (ie: comma or space)
  • headeronly (bool) – True to read only the header plus one line of the data file, which is useful for getting column numbers of large files. If reading an HDF5 file, only the HDF5 store information is read
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
  • h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
  • columns (list) – List of column labels to use for the resulting data pd.DataFrame
  • nreals (int) – number of realizations to read in. -1 will read all
  • tmin (float) – If a number is provided, values less than this number (e.g., trimmed or null values) are converted to NaN. May be useful since NaN’s are more easily handled within python, matplotlib and pandas. Set to None to disable.

Examples

Quickly reading in a GeoEAS data file:

>>> datafl = gs.DataFile(flname='../data/oilsands.dat')

To read in a GeoEAS data file and assign attributes:

>>> # Point Data Example
>>> datafl = gs.DataFile(flname='../data/oilsands.dat',readfl=True,dh='Drillhole Number',
>>>                      x='East',y='North',z='Elevation')
>>> # Gridded Data Example
>>> griddef = gs.GridDef('''10 0.5 1
>>> 10 0.5 1
>>> 10 0.5 1''')
>>> datafl = gs.DataFile(flname='../data/3DDecor.dat', griddef=griddef)
>>> # To view the grid definition string
>>> print(datafl.griddef)
>>> # Access some grid definition attributes
>>> datafl.griddef.count() # returns the number of blocks in the grid
>>> datafl.griddef.extents() # returns an array of the extents for all directions
>>> datafl.griddef.nx # returns the number of blocks in the x direction

HDF5

The HDF5 file format has several advantages. It reads and writes much faster than the ASCII format, attributes (like the grid definition) can be saved within the file, and all files for a single project can be stored in the same file. Please refer to the introduction on HDF5 files for more information.

This class currently only searches for and loads a grid definition.

Examples

HDF5 file simple read example:

>>> datafl = gs.DataFile(flname='../data/oilsands_out.hdf5')

To view the HDF5 header information (tables stored in the file):

>>> datafl.store

If you have a HDF5 file with multiple tables and you just want to read in the file information to view what tables are in the file and any attributes saved to the file you can do a header style only read:

>>> datafl = gs.DataFile(flname='../data/oilsands_out.hdf5', dftype='hdf5',
>>>                      headeronly=True)

Then to see what tables are written in the hdf5 file:

>>> datafl.store

Code author: Jared Deutsch 2014-04-03

DataFile Attributes

Attributes of a datafile object are accessed with datafile.<attribute>.

Columns

Access the columns of the datafile. Wrapper for datafile.data.columns.

Num Variables

Access the number of variables (nvar) of the datafile, i.e., len(datafile.variables).

Locations

Access the locations stored in the datafile. Wrapper for datafile[datafile.xyz].

Example:

>>> datafile = gs.DataFile("somefile.out")  # this file has an x, y[, z] attribute that is found
>>> datafile.locations
... dataframe of x, y, z locations

Shape

Access the shape of the data stored in the datafile. Wrapper for datafile.data.shape

Example:

>>> datafile = gs.DataFile("somefile.out")
>>> datafile.shape
... shape of datafile.data

Rename Columns

DataFile.rename(columns)

Applies a dictionary to alter the DataFrame column names. This applies the DataFrame.rename function, but also updates any special attributes (dh, x, y, etc.) with the new name if they were previously set to the old name. Users should consider using the self.columns property if changing all column names.

Parameters:columns (dict) – formatted as {oldname1: newname1, oldname2:newname2}, etc, where the old and new names are strings. The old names must be present in data.columns.
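As a hedged illustration of the behaviour described above (not pygeostat's actual implementation), renaming columns while keeping special attributes in sync can be sketched in pure pandas; the column names here are invented for the example:

```python
import pandas as pd

# Apply DataFrame.rename, then update any special attribute (x, y, dh, ...)
# that pointed at an old column name -- mirroring DataFile.rename.
df = pd.DataFrame({'Easting': [100.0], 'Northing': [200.0]})
special = {'x': 'Easting', 'y': 'Northing'}

columns = {'Easting': 'X', 'Northing': 'Y'}
df = df.rename(columns=columns)
special = {attr: columns.get(name, name) for attr, name in special.items()}
```

After this, df.columns is ['X', 'Y'] and the x attribute follows the rename to 'X'.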

Code author: Ryan Barnett 2018-04-02

Drop Columns

DataFile.drop(columns)

This applies the DataFrame.drop function, where axis=1, inplace=True and columns is used in place of the labels. It also updates any special attributes (dh, x, y, etc.), setting them to None if dropped. Similarly, if any variables are dropped, they are removed from self.variables.

Parameters:columns (str or list) – column names to drop

Code author: Ryan Barnett 2018-04-16

Check for Duplicate Columns

DataFile.check_for_duplicate_cols()

Run a quick check on the column names to see if any of them are duplicated. If any are duplicated, a warning is printed and the columns are renamed.
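A minimal sketch of one way such a check can work (the renaming scheme here is an assumption, not pygeostat's actual code): later occurrences of a repeated label get a numeric suffix.

```python
# Flag repeated column labels and rename the later occurrences.
cols = ['Au', 'Cu', 'Au']
seen, renamed = {}, []
for name in cols:
    if name in seen:
        seen[name] += 1
        renamed.append('{}_{}'.format(name, seen[name]))
    else:
        seen[name] = 0
        renamed.append(name)
```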

Code author: Tyler Acorn 2017-03-06

Set Columns

DataFile.setcol(colattr, colname=None)

Set a specialized column attribute (dh, ifrom, ito, x, y, z, cat or wts) for the DataFile, where DataFile.data must be initialized. If colname is None, the attribute is set if a common name for it is detected in DataFile.data.columns (e.g., if colattr='dh' and colname=None, and 'DHID' is found in DataFile.data, then DataFile.dh='DHID'). The attribute is None if none of the common names are detected. If colname is not None, the provided string is assigned to the attribute, e.g., DataFile.colattr=colname; note, however, that an error is thrown if colname is not in DataFile.data.columns. This is used on DataFile initialization, but may also be useful for calling after specialized columns are altered.

Parameters:
  • colattr (str) – must match one of: 'dh', 'ifrom', 'ito', 'x', 'y', 'z', 'cat' or 'wts'
  • colname (str or list) – if not None, must be the name(s) of a column in DataFile.data. List is only valid if colattr=wts

Examples

Set the x attribute (dat.x) based on a specified value:

>>> dat.setcol('x', 'Easting')

Set the x attribute (dat.x), where the function checks common names for x:

>>> dat.setcol('x')

Code author: Ryan Barnett 2018-03-22

Set Variable Columns

DataFile.setvarcols(variables=None, notvariables=None)

Set the variables for the DataFile. If variables is provided, the function checks that they are present in the DataFrame. If not provided, the function assigns as variables the columns that are not specialized columns (dh, x, y, z, cat, wts) and not in the user-specified notvariables list.

This is used on DataFile initialization, but may also be useful for calling after variables are added or removed.

Parameters:
  • variables (list or str) – list of strings
  • notvariables (list or str) – list of strings

Examples

Set the variables based on a specified list:

>>> dat.setvarcols(variables=['Au', 'Carbon'])

Set the variables based on the function excluding specialized columns (dh, x, y, etc.):

>>> dat.setvarcols()

Set the variables based on the function excluding specialized columns (dh, x, y, etc.), as well as a user specified list of what is not a variable:

>>> dat.setvarcols(notvariables=['Data Spacing', 'Keyout'])

Code author: Ryan Barnett 2018-03-19

Set Categorical Dictionary

DataFile.setcatdict(catdict)

Set a dictionary for the categories, which should be formatted as:

>>> catdict = {catcode:catname}

Example

>>> catdict = {0: "Mudstone", 1: "Sandstone"}
>>> self.setcatdict(catdict)

Code author: Ryan Barnett 2018-04-23

Check DataFile

DataFile.check_datafile(flname, variables, sep, fltype)

Run some quick checks on the DataFile before writing, and grab info if not provided.

Code author: Tyler Acorn 2015-10-05

Add Coord

DataFile.addcoord()

Only use on DataFile classes containing GSLIB style gridded data.

If x, y, or z coordinate column(s) do not exist, they are created. If the created or current columns only have null values, they are populated based on the GridDef class passed to the DataFile class.

Note

A griddef must be assigned to the DataFile class, either at read time:

>>> datafl = gs.DataFile(flname='test.out', griddef=grid)

or it can be manually assigned later:

>>> datafl.griddef = gs.GridDef(gridstr=my_grid_str)

Code author: Warren E. Black 2015-10-26

Apply Dictionary

DataFile.applydict(origvar, outvar, mydict)

Applies a dictionary to the original variable to get a new variable.

This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.

Parameters:
  • origvar (str) – Name of original variable.
  • outvar (str) – Name of output variable.
  • mydict (dict) – Dictionary of values to apply.

Examples

>>> datafl.applydict('Drillhole', 'Drillhole-mod', mydict)
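Conceptually, this is a dictionary lookup over one column; a pure-pandas sketch of the same operation (drill hole IDs invented for the example):

```python
import pandas as pd

# Map a dictionary over the original column to build the output column,
# as applydict does for a DataFile.
df = pd.DataFrame({'Drillhole': ['DH-A', 'DH-B', 'DH-A']})
mydict = {'DH-A': 1, 'DH-B': 2}
df['Drillhole-mod'] = df['Drillhole'].map(mydict)
```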

Code author: Jared Deutsch 2014-04-03

Describe DataFile

DataFile.describe(variables=None)

Describe a data set using pandas describe(), but exclude special variables.

Keyword Arguments:
 variables (List(str)) – List of variables to describe.
Returns:Pandas description of variables.
Return type:self.data[variables].describe()

Examples

Describe all non-special variables in the DataFrame (excludes columns set as the drill hole ID, coordinate columns, etc.):

>>> datafl.describe()

Or describe specific variables

>>> datafl.describe(['Bitumen', 'Fines'])

Code author: Jared Deutsch 2015-08-01

Infer Grid Definition

DataFile.infergriddef(blksize=None, databuffer=5, nblk=None)

Infer a grid definition with the specified dimensions to cover the set of data values. The function operates with two primary options:

  1. Provide a block size (node spacing), the function infers the required number of blocks (grid nodes) to cover the data
  2. Provide the number of blocks, the function infers the required block size

A data buffer may be used for expanding the grid beyond the data extents. Basic integer rounding is also used for attempting to provide a ‘nice’ grid in terms of the origin alignment.

Parameters:
  • blksize (float or 3-tuple) – provides (xsiz, ysiz, zsiz). If blksize is not None, nblk must be None. Set zsiz None if the grid is 2-D. A float may also be provided, where xsiz = ysiz = zsiz = float is assumed.
  • databuffer (float or 3-tuple) – buffer between the data and the edge of the model, optionally for each direction
  • nblk (int or 3-tuple) – provides (nx, ny, nz). If nblk is not None, blksize must be None. Set nz to None or 1 if the grid is 2-D. An int may also be provided, where nx = ny = nz = int is assumed.
Returns:

This function returns the grid definition object and also assigns the griddef to the current gs.DataFile.

Return type:

griddef (GridDef)

Note

This function lazily assumes grids are either 3-D, or 2-D along the xy plane. If nx == 1 or ny == 1, nonsense will result!

Usage:

First, import a datafile using gs.DataFile(), make sure to assign the correct columns to x, y and z:

>>> datfl = gs.DataFile('test.dat',x='x',y='y',z='z')

Now create the griddef from the data contained within the dataframe:

>>> blksize = (100, 50, 1)
>>> databuffer = (10, 25, 0) # buffer in the x, y and z directions
>>> griddef = datfl.infergriddef(blksize, databuffer)

Check by printing out the resulting griddef:

>>> print(griddef)
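The block-size option reduces to simple arithmetic per direction; a hedged sketch for the x direction (the extents below are invented, and pygeostat's actual rounding for a 'nice' origin may differ):

```python
import math

# Option 1 of infergriddef, sketched for one direction: given a block
# size and a data buffer, infer the number of blocks covering the data.
xmin, xmax = 12.3, 487.6
databuffer, xsiz = 10.0, 50.0
lo, hi = xmin - databuffer, xmax + databuffer
nx = math.ceil((hi - lo) / xsiz)   # number of blocks needed
xmn = lo + xsiz / 2.0              # GSLIB convention: origin at the first block centre
```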

Code author: Ryan Martin - 2016-05-27

File Name String

DataFile.__str__()

Return the name of the data file if asked to ‘print’ the data file… or use the datafile in a string!

Generate Dictionary

DataFile.gendict(var, outvar=None)

Generates a dictionary with unique IDs from alphanumeric IDs. This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.

Parameters:var (str) – Variable to generate a dictionary for
Keyword Arguments:
 outvar (str) – Variable to generate using generated dictionary.
Returns:Dictionary of alphanumerics to numeric ids.
Return type:newdict (dict)

Examples

A simple call

>>> datafl.gendict('Drillhole')

OR

>>> dh_dict = datafl.gendict('Drillhole')
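The essence of gendict is assigning sequential numeric IDs to the unique alphanumeric values; a sketch (the IDs and sorted ordering here are assumptions, and pygeostat's actual ordering may differ):

```python
import pandas as pd

# Build a dictionary of alphanumeric IDs to numeric IDs, as gendict does.
dh = pd.Series(['DH-B', 'DH-A', 'DH-B', 'DH-C'])
newdict = {name: i + 1 for i, name in enumerate(sorted(dh.unique()))}
```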

Code author: Jared Deutsch 2014-04-03

GSLIB Column

DataFile.gscol(variables, string=True)

Returns the GSLIB (1-ordered) column given a (list of) variable(s).

Parameters:variables (str or List(str)) – Name(s) of the variable(s) to look up.
Keyword Arguments:
 string (bool) – If True returns the columns as a string.
Returns:GSLIB 1-ordered column(s).
Return type:cols (int or List(int) or string)

Note

None input returns a 0, which may be necessary, for example, with 2-D data:

>>> data.xyz
... ['East', 'North', None]
>>> data.gscol(data.xyz)
... '2 3 0'

Examples

Some simple calls

>>> datafl.gscol('Bitumen')
... 5
>>> datafl.gscol(['Bitumen', 'Fines'])
... [5, 6]
>>> datafl.gscol(['Bitumen', 'Fines'], string=True)
... '5 6'
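The 1-ordered lookup itself is just the 0-based column position plus one, with None mapping to 0; a sketch against an invented set of columns:

```python
import pandas as pd

# GSLIB column numbers are 1-ordered; None (e.g. a missing z column in
# 2-D data) maps to 0, as the note above describes.
df = pd.DataFrame(columns=['East', 'North', 'Bitumen', 'Fines'])

def gscol(variables):
    return [0 if v is None else df.columns.get_loc(v) + 1 for v in variables]
```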

Code author: Jared Deutsch 2014-04-03

Set NaN

DataFile.setnan(variables=None, tmin=None, tmax=None)

Sets missing values (defined by a lower trimming limit and upper trimming limit) to np.nan in the DataFrame.

Keyword Arguments:
 
  • variables (List(str)) – List of variables to apply trimming.
  • tmin (float) – Lower trimming limit. Values less than this value are trimmed.
  • tmax (float) – Upper trimming limit. Values greater than or equal to this value are trimmed.

Examples

A simple call

>>> datafl.setnan('Bitumen', -999.0)
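An equivalent pandas sketch of the trimming (the data values and the limit are invented; per the convention above, values less than tmin are trimmed):

```python
import numpy as np
import pandas as pd

# Set values below the lower trimming limit to NaN, as setnan does.
df = pd.DataFrame({'Bitumen': [12.1, -999.0, 8.4]})
tmin = -998.0
df.loc[df['Bitumen'] < tmin, 'Bitumen'] = np.nan
```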

Code author: Jared Deutsch 2015-08-01

Truncate NaN’s

DataFile.truncatenans(variable)

Returns a truncated list with nans removed for a variable.

Parameters:variable (str) – Name of original variable.
Returns:Truncated values.
Return type:truncated (values)

Examples

A simple call that will return the list

>>> datafl.truncatenans('Bitumen')
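In pandas terms this amounts to a dropna on the single column; a sketch with invented values:

```python
import numpy as np
import pandas as pd

# Return a variable's values with NaNs removed, as truncatenans does.
s = pd.Series([1.2, np.nan, 3.4, np.nan])
truncated = s.dropna().values
```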

Code author: Jared Deutsch 2014-04-03

Unique Categories

DataFile.unique_cats(variable, truncatenans=False)

Returns a sorted list of the unique categories given a variable.

Parameters:variable (str) – Name of original variable.
Keyword Arguments:
 truncatenans (bool) – Truncates missing values if True.
Returns:Sorted list of the unique values.
Return type:unique_cats (List(object))

Examples

A simple call:

>>> datafl.unique_cats('Drillhole')

Or to save the list

>>> unique_dh_list = datafl.unique_cats('Drillhole')
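A pandas sketch of the same operation (data invented; note that a NaN in the column promotes integers to floats):

```python
import numpy as np
import pandas as pd

# Sorted unique categories, with missing values dropped when
# truncatenans=True.
s = pd.Series([3, 1, 3, np.nan, 2])
cats = sorted(s.dropna().unique())
```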

Code author: Jared Deutsch 2014-04-03

Writefile

DataFile.writefile(flname, title=None, variables=None, fmt='%.5f', fmt_string='g16.8', sep=None, fltype=None, data=None, h5path=None, griddef=None, tvar=None, nreals=1, null=None)

Writes out a GSLIB-style, VTK, CSV, Excel (XLSX), or HDF5 data file.

Parameters:

flname (str) – Path (or name) of file to write out.

Keyword Arguments:
 
  • title (str) – Title for output file.
  • variables (List(str)) – List of variables to write out if only a subset is desired.
  • fmt (str) – Format to use for floating point numbers or a GSB precision code.
  • fmt_string (str) – Format string to use for writing out with the GSLIB fortran module
  • sep (str) – Delimiter to use for file output, generally don’t need to change.
  • fltype (str) – Type of file to write either gslib, vtk, csv, xlsx, or hdf5.
  • data (pandas.DataFrame) – Subset of data to write out - cannot be used with variables option!
  • h5path (str) – The h5 group path to write data to (H5 filetype)
  • griddef (obj) – a gslib griddef object
  • tvar (str) – Name of variable to use for compression when NaNs exist within it
  • nreals (int) – number of realizations you are writing out (needed for GSB)
  • null (float) – If a number is provided, NaN numbers are converted to this value prior to writing. May be useful since NaN’s are more easily handled within python and pandas than null values, but are not valid in GSLIB. Set to None to disable (but NaN’s must be handled prior to this function call if so).

Note

pygeostat.writefile is retained for backwards compatibility or as an overloaded class method. Current write functions can be called separately with the functions listed below:

>>> import pygeostat as gs
>>> import pandas as pd
>>> gs.write_gslib(gs.DataFile or pd.DataFrame)
>>> gs.write_csv(gs.DataFile or pd.DataFrame)
>>> gs.write_hdf5(gs.DataFile or pd.DataFrame)
>>> gs.write_vtk(gs.DataFile or pd.DataFrame)
>>> gs.write_gsb(gs.DataFile or pd.DataFrame)

The following calls are equivalent:

>>> datafl.writefile('testgslib.out')

is equivalent to:

>>> gs.write_gslib(datafl, 'testgslib.out')

and similar to:

>>> gs.write_gslib(datafl.data, 'testgslib.out')

Code author: Jared Deutsch 2014-04-03

Data Spacing

DataFile.spacing(n_nearest, var=None, inplace=True, dh=None, x=None, y=None)

Calculates data spacing in the xy plane, based on the average distance to the nearest n_nearest neighbours. The x, y coordinates of 3-D data may be provided in combination with a dh (drill hole or well), in which case the mean x, y of each dh is calculated before performing the calculation. If a dh is not provided in combination with 3-D xy’s, the calculation is applied to all data and may create memory issues if more than ~5000-10000 records are provided. A var specifier allows the calculation to be applied only where var is not NaN.

If inplace==True:

The output is concatenated as a ‘Data Spacing ({gsParams[‘plotting.unit’]})’ column (or ‘Data Spacing’ if gsParams[‘plotting.unit’] is None). If var is used, then the calculation is only performed where DataFile[var] is not NaN, and the output is concatenated as ‘{var} Data Spacing ({gsParams[‘plotting.unit’]})’.

If inplace==False:

The function returns dspace as a numpy array if dspace.shape[0] equals DataFile.shape[0], meaning that the dh and var functionality was not used, or did not lead to differences in the length of dspace and DataFile (so that the x and y in DataFile can be used for plotting dspace in map view). The function returns a tuple of the form (dspace, dh, x, y) if dh is not None and dspace.shape[0] is not equal to DataFile.shape[0]. The function returns a tuple of the form (dspace, x, y) if dh is None, var is not None, and dspace.shape[0] is not equal to DataFile.shape[0].
Parameters:
  • n_nearest (int) – number of nearest neighbours to consider in data spacing calculation
  • var (str) – variable for calculating data spacing, where the calculation is only applied to locations where var is not NaN. If None, the calculation is to all locations.
  • inplace (bool) – if True, the output data spacing is concatenated
  • dh (str) – dh name, which can override self.dh
  • x (str) – x coordinate name, which can override self.x
  • y (str) – y coordinate name, which can override self.y

Examples

Calculate data spacing without consideration of underlying variables, based on the nearest 8 neighbours.

>>> dat.spacing(8)

Output as a numpy array rather than concatenating a column:

>>> dspace = dat.spacing(8, inplace=False)

Only consider values where Au is non NaN for the calculation:

>>> (dspace, x, y) = dat.spacing(8, inplace=False, var='Au')
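The core of the calculation can be sketched in plain Python: for each 2-D location, the mean distance to its n_nearest neighbours (coordinates are invented; pygeostat builds this from DataFile.x and DataFile.y, and likely uses a spatial index rather than this brute-force loop):

```python
import math

# Mean distance to the n_nearest neighbours for each location.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
n_nearest = 2

dspace = []
for i, (xi, yi) in enumerate(points):
    dists = sorted(math.hypot(xi - xj, yi - yj)
                   for j, (xj, yj) in enumerate(points) if j != i)
    dspace.append(sum(dists[:n_nearest]) / n_nearest)
```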

Code author: Ryan Barnett - 2018-03-25

Example Data

pygeostat.data.data.ExampleData(testfile)

Get an example pygeostat DataFile

Parameters:testfile (str) – one of the available pygeostat test files, listed below

Test files available in pygeostat include:

  • “point2d_ind”: 2d indicator dataset
  • “point2d_surf”: 2d point dataset sampling a surface
  • “grid2d_surf”: ‘Thickness’ from ‘point2d_surf’ interpolated on the grid
  • “point3d_ind_mv”: 3d multivariate and indicator dataset

Input/Output Tools

iotools.py: Contains input/output functions for pygeostat, many of which are based on Pandas built-in functions.

Read File

pygeostat.data.iotools.readfile(flname, fltype=None, headeronly=False, delimiter='\\s*', h5path=None, h5datasets=None, columns=None, ireal=1, griddef=None, tmin=None)

Reads in GSLIB-style Geo-EAS, CSV, GSB, or HDF5 data files.

Parameters:

flname (str) – Path (or name) of file to read.

Keyword Arguments:
 
  • fltype (str) – Type of file to read: either csv, gslib, gsb, or hdf5.
  • headeronly (bool) – If True, only reads in the first line from the data file, which is useful for getting column numbers or testing. For HDF5 files, it opens the file with Pandas HDFStore functionality
  • delimiter (str) – Delimiter specified instead of sniffing
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
  • h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
  • columns (list) – List of column labels to use for resulting frame
  • ireal (int) – Number of realizations in the file
  • griddef (GridDef) – griddef for the realization
  • tmin (float) – values less than this number are converted to NaN, since NaN’s are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.gsParams[‘data.tmin’].
Returns:

Pandas DataFrame object with input data.

Return type:

data (pandas.DataFrame)

Note

Functions can also be called separately with the following code:

>>> data.data = pygeostat.read_gslib(flname)
>>> data.data = pygeostat.read_csv(flname)
>>> data.data = pygeostat.read_h5(flname, h5path='')
>>> data.data = pygeostat.read_gsb(flname, ireal=1)
>>> data.data = pygeostat.open_hdf5(flname)

Code author: Jared Deutsch 2014-08-25

Read CSV

pygeostat.data.iotools.read_csv(flname, headeronly=False, tmin=None)

Reads in a GSLIB-style CSV data file.

Parameters:

flname (str) – Path (or name) of file to read.

Keyword Arguments:
 
  • headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
  • delimiter (str) – Delimiter specified instead of sniffing
  • tmin (float) – values less than this number are converted to NaN, since NaN’s are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.gsParams[‘data.tmin’].
Returns:

Pandas DataFrame object with input data.

Return type:

data (pandas.DataFrame)

Code author: Jared Deutsch 2014-08-25

Read GSLIB Fortran (Fast)

pygeostat.data.iotools.read_gslib_f(flname, griddef=None, gridsize=None, num_rlztns=1, rlzt_to_read=0, double=True, integer_columns=None, tmin=None)

Reads in a GSLIB-style Geo-EAS data file using ‘pure’ Fortran; the resulting dataframe is floats only. The fastest read option is for gridded data, where the grid information can be passed as either a griddef or a gridsize along with the number of realizations. The default is to use the point reader function, which searches the file first to determine the size of the data and then reads it (the slower option).

Parameters:
  • flname (str) – Path (or name) of file to read.
  • griddef (pygeostat.GridDef) – Grid definition for a gridded data file.
  • gridsize (int) – Alternatively can use gridsize (number of cells in grid) for gridded data
  • num_rlztns (int) – Number of realizations in the gridded data file
  • rlzt_to_read (int) – 0 will read in all realizations; a number greater than zero will read in that specific realization. Only working for gridded data right now
  • double (bool) – Read in the Data as double precision (True) or single precision (False)
  • integer_columns (list) – Names of columns to convert to integer. This can be important for categorical data due to precision errors, since 1 != 1.000000000001
  • tmin (float) – values less than this number are converted to NaN, since NaN’s are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.gsParams[‘data.tmin’].
Returns:

Pandas DataFrame object with input data.

Return type:

data (pandas.DataFrame)

Code author: Ryan Martin and Tyler Acorn, modified from the read_gslib

Read GSLIB Python (Slower)

pygeostat.data.iotools.read_gslib(flname, headeronly=False, delimiter='\\s*', fortran_read=True, tmin=None)

Reads in a GSLIB-style Geo-EAS data file or CSV data file.

Parameters:

flname (str) – Path (or name) of file to read.

Keyword Arguments:
 
  • headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing
  • delimiter (str) – Delimiter specified instead of sniffing
  • fortran_read (bool) – Indicate if fortran should be used to read the file if available
  • tmin (float) – values less than this number are converted to NaN, since NaN’s are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.gsParams[‘data.tmin’].
Returns:

Pandas DataFrame object with input data.

Return type:

data (pandas.DataFrame)

Code author: Jared Deutsch 2014-08-25

Read GSB

pygeostat.data.iotools.read_gsb(flname, ireal=-1, tmin=None)

Reads in a CCG GSB (GSLIB-Binary) file.

Parameters:

flname (str) – Path (or name) of file to read.

Keyword Arguments:
 
  • ireal (int) – 1-indexed realization number to read (reads 1 at a time), 0 to read all
  • tmin (float) – values less than this number are converted to NaN, since NaN’s are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.gsParams[‘data.tmin’].
Returns:

Pandas DataFrame object with input data.

Return type:

data (pandas.DataFrame)

Code author: Jared Deutsch 2016-02-19

Write GSLIB Fortran (Fast)

pygeostat.data.iotools.write_gslib_f(data, flname, title=None, variables=None, fmt_string='g16.8', sep=' ', null=None)

Use the fast Fortran subroutine to write out the data.

Parameters:
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out
  • flname (str) – Path (or name) of file to write out.
Keyword Arguments:
 
  • title (str) – Title for output file.
  • variables (List(str)) – List of variables to write out if only a subset is desired.
  • fmt_string (str) – Format string to use for floating point numbers.
  • sep (str) – Delimiter to use for file output, generally don’t need to change.
  • null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.gsParams[‘data.null’].

Code author: Ryan Martin 2014-04-03 Modified to use the fortran subroutine 3.14.2016

Write GSLIB Python (Slower)

pygeostat.data.iotools.write_gslib(data, flname, title=None, variables=None, fmt='%.5f', sep=' ', null=None)

Writes out a GSLIB-style data file.

Parameters:
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out
  • flname (str) – Path (or name) of file to write out.
Keyword Arguments:
 
  • title (str) – Title for output file.
  • variables (List(str)) – List of variables to write out if only a subset is desired.
  • fmt (str) – Format to use for floating point numbers.
  • sep (str) – Delimiter to use for file output, generally don’t need to change.
  • null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.gsParams[‘data.null’].

Code author: Jared Deutsch 2014-04-03

Write CSV

pygeostat.data.iotools.write_csv(data, flname, variables=None, fmt='%.5f', sep=', ', fltype='csv', null=None)

Writes out a CSV or Excel (XLSX) data file.

Parameters:
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out
  • flname (str) – Path (or name) of file to write out.
Keyword Arguments:
 
  • variables (List(str)) – List of variables to write out if only a subset is desired.
  • fmt (str) – Format to use for floating point numbers.
  • sep (str) – Delimiter to use for file output, generally don’t need to change.
  • fltype (str) – Type of file to write either csv or xlsx.
  • null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.gsParams[‘data.null’].

Code author: Jared Deutsch 2014-04-03

Write GSB

pygeostat.data.iotools.write_gsb(data, flname, tvar, nreals=1, variables=None, griddef=None, fmt=0)

Writes out a GSB (GSLIB-Binary) style data file. NaN values of tvar are compressed in the output; a tmin parameter is no longer provided.

Parameters:
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out
  • flname (str) – Path (or name) of file to write out.
  • tvar (str) – Variable to trim by or None for no trimming. Note that all variables are trimmed in the data file (for compression) when this variable is trimmed.
  • nreals (int) – number of realizations in data
Keyword Arguments:
 
  • griddef (pygeostat.griddef.GridDef) – This is required if the data is gridded and you want other gsb programs to read it
  • fmt (int) – if 0 then will write out all variables as float 64. Otherwise should be an list with a length equal to number of variables and with the following format codes 1=int32, 2=float32, 3=float64
  • variables (List(str)) – List of variables to write out if only a subset is desired.

Code author: Jared Deutsch 2016-02-19, modified by Ryan Barnett 2018-04-12

Is Binary

pygeostat.data.iotools.isbinary(file)

Based on http://stackoverflow.com/a/7392391/5545005. Checks whether a file appears to be binary. HDF5 has a handy check of its own, but this fills the gap for GSB files when attempting an ASCII read.
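The linked Stack Overflow answer uses a simple heuristic, sketched below: read a chunk of the file and call it binary if it contains bytes outside the typical text range (pygeostat's wrapper around this may differ in detail):

```python
# Bytes considered "texty": common control characters plus the printable
# range, excluding DEL (0x7F).
textchars = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F})

def looks_binary(chunk: bytes) -> bool:
    # If anything survives after deleting all text characters, the chunk
    # contains non-text bytes and the file is treated as binary.
    return bool(chunk.translate(None, textchars))
```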

Write VTK

pygeostat.data.iotools.write_vtk(data, flname, dftype=None, x=None, y=None, z=None, variables=None, griddef=None, null=None, vdtype=None, cdtype=None)

Writes out an XML VTK data file. A required dependency is pyevtk, which may be installed using the following command:

>>> pip install pyevtk

Users are also recommended to install the latest Paraview, as versions from 2017 were observed to have odd precision bugs with the XML format.

Parameters:
  • data (pygeostat.DataFile) – data to write out
  • flname (str) – Path (or name) of file to write out (without extension)
Keyword Arguments:
 
  • dftype (str) – type of datafile: ‘point’, ‘grid’ or ‘sgrid’. If None, drawn from data.dftype
  • x (str) – name of the x-coordinate, which is used if point or sgrid. Drawn from data.x if the kwarg=None. If not provided by these means for `sgrid`, calculated via data.griddef.gridcoord().
  • y (str) – name of the y-coordinate, which is used if point or sgrid. Drawn from data.y if the kwarg=None. If not provided by these means for `sgrid`, calculated via data.griddef.gridcoord().
  • z (str) – name of the z-coordinate, which is used if point or sgrid. Drawn from data.z if the kwarg=None. If not provided by these means for `sgrid`, calculated via data.griddef.gridcoord().
  • griddef (pygeostat.GridDef) – grid definition, which is required if grid or sgrid. Drawn from data.griddef if the kwarg=None.
  • variables (list or str) – List or string of variables to write out. If None, then all columns aside from coordinates are written out by default.
  • null (float) – NaNs are converted to this value prior to writing. If None, set to pygeostat.gsParams[‘data.null_vtk’].
  • vdtype (dict(str)) – Dictionary of the format {‘varname’: dtype}, where dtype is a numpy data format. May be used for reducing file size, by specifying int, float32, etc. If a format string is provided instead of a dictionary, that format is applied to all variables. This is not applied to coordinate variables (if applicable). If None, the value is drawn from gsParams[‘data.write_vtk.vdtype’].
  • cdtype (str) – Numpy format to use for the output of coordinates, where valid formats are float64 (default) and float32. The latter is recommended for reducing file sizes, but may not provide the requisite precision for UTM coordinates. If None, the value is drawn from gsParams[‘data.write_vtk.cdtype’].

dftype should be one of:

  1. ‘point’ (irregular points) where data.x, data.y and data.z are columns in data.data
  2. ‘grid’ (regular or rectilinear grid) where data.griddef must be initialized
  3. ‘sgrid’ (structured grid) where data.x, data.y and data.z are columns in data.data. data.griddef should also be initialized, although only griddef.nx, griddef.ny and griddef.nz are utilized (since the grid is assumed to not be regular)

Code author: Ryan Barnett 2017-11-30
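The three dftype cases can be summarized as a prerequisite check. The helper below is hypothetical (not part of pygeostat) and simply restates the requirements listed above:

```python
def check_vtk_inputs(dftype, has_coords, has_griddef):
    """Validate the inputs required for each write_vtk dftype."""
    needs = {'point': (True, False),   # x/y/z columns only
             'grid': (False, True),    # full grid definition only
             'sgrid': (True, True)}    # coords plus griddef nx/ny/nz
    if dftype not in needs:
        raise ValueError("dftype must be 'point', 'grid' or 'sgrid'")
    need_coords, need_griddef = needs[dftype]
    if need_coords and not has_coords:
        raise ValueError('x, y and z coordinate columns are required')
    if need_griddef and not has_griddef:
        raise ValueError('a grid definition is required')
```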

Write HDF5 VTK

pygeostat.data.iotools.write_hvtk(data, flname, griddef, variables=None)

Writes out an H5 file and corresponding xdmf file that Paraview can read. Currently only supports 3D gridded datasets. This function will fail if the length of the DataFile or DataFrame does not equal griddef.count().

The extension xdmf is silently enforced. Any other extension passed is replaced.

Parameters:
  • data (pd.DataFrame) – The DataFrame to write out
  • flname (str) – Path (or name) of file to write out.
  • griddef (GridDef) – Grid definitions for the realizations to be written out
  • variables (str or list) – optional set of variables to write out from the DataFrame

Code author: Ryan Martin - 2016-09-02

Count Lines in File

pygeostat.data.iotools.file_nlines(flname)

Open a file and count the total number of lines. Reasonably fast. Adapted from http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python

Parameters:flname (str) – Name of the file to read
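The linked approach counts newline bytes in fixed-size buffered reads, which avoids building line objects. This is an illustrative reimplementation, not the pygeostat source:

```python
def count_lines(flname):
    """Count newline bytes in a file using buffered binary reads."""
    n = 0
    with open(flname, 'rb') as f:
        while True:
            chunk = f.read(1024 * 1024)   # 1 MB at a time
            if not chunk:
                break
            n += chunk.count(b'\n')
    return n
```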

Write CCG GMM

pygeostat.data.iotools.writeout_gslib_gmm(gmm, outfile)

Write out a fitted Gaussian mixture in the format consistent with gmmfit from the CCG Knowledge Base. Assumes gmm is an sklearn.mixture.GaussianMixture class fitted to data

Note

In recent versions of scikit-learn, GMM was replaced with GaussianMixture, and there are subtle differences in attributes between the two versions.

Parameters:
  • gmm (GaussianMixture) – a fitted mixture model
  • outfile (str) – the output file

Code author: Ryan Martin - 2017-02-13
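GaussianMixture exposes its fitted parameters as weights_, means_ and covariances_; a dump of those arrays to text might look like the sketch below. The exact gmmfit file layout is an assumption here and should be checked against the CCG documentation:

```python
def gmm_lines(weights, means, covariances):
    """Flatten mixture parameters into text lines (assumed layout:
    component count, then one line of weight / mean / flattened
    covariance per component)."""
    lines = [str(len(weights))]
    for w, mean, cov in zip(weights, means, covariances):
        values = [w, *mean, *cov]
        lines.append(' '.join(str(v) for v in values))
    return lines
```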

HDF5 I/O

Write HDF5

pygeostat.data.h5_io.write_h5(data, flname, h5path=None, datasets=None, dtype=None, gridstr=None, trim_variable=None, var_min=None)

Write data to an HDF5 file. The file is appended to and in the case that a dataset already exists, it is overwritten.

The current Fortran implementation of the hdf5_io module only allows writing a 1-D single precision integer or single/double precision float. If a pd.DataFrame is passed, all columns are placed in the same group within the HDF5 file, or as specified by the argument datasets.

The python library h5py is not used for writing data to a HDF5 file as CCG programs may not be able to read data type written by it. By doing this, only the data type supported by the CCG hdf5_io Fortran module are exported. If you wish to have more flexibility, use the python library h5py directly.

Parameters:
  • data – A 1-D np.array/pd.Series or a pd.DataFrame containing different variables as columns
  • flname (str) – Path of the HDF5 you wish to write to or create
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to place the dataset(s) specified by the argument datasets into. The dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
  • datasets (str or list) – Name of the dataset(s) to write out. If a pd.DataFrame is passed, the values passed by the argument datasets must match the DataFrame’s columns.
  • dtype (str) – The data type to write. Currently, only the following values are permitted: ['int32', 'float32', 'float64']. If a pd.DataFrame is passed and this argument is left to its default value of None, the DataFrame’s dtypes must be of the types listed above.
  • gridstr (str) – Grid definition string that is saved to the HDF5 file as an attribute of the group defined by the parameter h5path.

Examples

Write a single pd.Series or np.array to an HDF5 file:

>>> gs.write_h5(array, 'file.h5', h5path='Modeled/Var1', datasets='Realization_0001')

Write a whole pd.DataFrame in group (folder) ‘OriginalData’ that contains a dataset for every column in the pd.DataFrame:

>>> gs.write_h5(DataFrame, 'file.h5', h5path='OriginalData')

Code author: Warren E. Black - 2016-06-09

Write HDF5 Using Python

pygeostat.data.h5_io.write_h5_p(data, flname, h5path=None, datasets=None, dtype=None, gridstr=None, trim_variable=None, var_min=-998.0)

Write data to an HDF5 file using the python package h5py. The file is appended to and in the case that a dataset already exists, it is overwritten.

If a pd.DataFrame is passed, all columns are placed in the same group within the HDF5 file, or as specified by the argument datasets.

Unlike write_h5(), this function uses the python library h5py directly. Note that CCG programs may not be able to read every data type h5py can write, so restrict output to the types supported by the CCG hdf5_io Fortran module if CCG compatibility is required.

Parameters:
  • data – A 1-D np.array/pd.Series or a pd.DataFrame containing different variables as columns
  • flname (str) – Path of the HDF5 you wish to write to or create
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to place the dataset(s) specified by the argument datasets into. The dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
  • datasets (str or list) – Name of the dataset(s) to write out. If a pd.DataFrame is passed, the values passed by the argument datasets must match the DataFrame’s columns.
  • dtype (str) – The data type to write. Currently, only the following values are permitted: ['int32', 'float32', 'float64']. If a pd.DataFrame is passed and this argument is left to its default value of None, the DataFrame’s dtypes must be of the types listed above.
  • gridstr (str) – Grid definition string that is saved to the HDF5 file as an attribute of the group defined by the parameter h5path.
  • trim_variable (str) – Variable to use for trimming the data. An index is written to the HDF5 file so the dataset can be rebuilt, while only non-trimmed data are written out
  • var_min (float) – minimum trimming limit used if trim_variable is passed
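The trimming scheme described above can be illustrated as follows; the index/rebuild logic is inferred from the description, not copied from the pygeostat source:

```python
def trim(values, var_min):
    """Keep only values above var_min, remembering their positions."""
    idx = [i for i, v in enumerate(values) if v > var_min]
    kept = [values[i] for i in idx]
    return idx, kept

def rebuild(idx, kept, n, fill_value=-999):
    """Reconstruct the full array, filling trimmed slots."""
    out = [fill_value] * n
    for i, v in zip(idx, kept):
        out[i] = v
    return out
```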

Examples

Write a single pd.Series or np.array to an HDF5 file:

>>> gs.write_h5_p(array, 'file.h5', h5path='Modeled/Var1', datasets='Realization_0001')

Write a whole pd.DataFrame in group (folder) ‘OriginalData’ that contains a dataset for every column in the pd.DataFrame:

>>> gs.write_h5_p(DataFrame, 'file.h5', h5path='OriginalData')

Code author: Warren E. Black - 2016-06-09

Read HDF5

pygeostat.data.h5_io.read_h5(flname, h5path=None, datasets=None, fill_value=-999)

Return a 1-D array from an HDF5 file or build a pd.DataFrame() from a list of datasets in a single group.

The argument h5path must be a path to a group. If 1 or more specific variables are desired to be loaded, pass a list to datasets to specify which to read.

Parameters:
  • flname (str) – Path of the HDF5 you wish to write to or create
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().
  • datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.
  • fill_value (float or np.NaN) – value used to fill the rebuilt array where data were trimmed on write. Default is -999
Returns:

DataFrame containing one or more columns, each containing a single 1-D array of a variable.

Return type:

data (pd.DataFrame)

Code author: Warren E. Black - 2016-06-09

Is HDF5

pygeostat.data.h5_io.ish5dataset(h5fl, dataset, h5path=None)

Check to see if a dataset exists within an HDF5 file

The argument h5path must be a path to a group and cannot contain the dataset name. Can only check for one dataset at a time.

Parameters:
  • h5fl (str) – Path of the HDF5 file you wish to check
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to check for the specified dataset. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file.
  • dataset (str) – Name of the dataset to check for in the group specified by h5path.
Returns:

Indicator if the specified dataset exists

Return type:

exists (bool)

Code author: Warren E. Black - 07/14/16

Combine Datasets from Multiple Paths

pygeostat.data.h5_io.h5_combine_data(flname, h5paths, datasets=None)

Combine data into one DataFrame from multiple paths in a HDF5 file.

Parameters:
  • flname (str) – Path of the HDF5 you wish to read from
  • h5paths (list) – A list of h5paths to combine. Forward slash (/) delimited path through the group hierarchy you wish to place the dataset(s) specified by the argument datasets into. The dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
  • datasets (list of lists) – If only a specific set of datasets from each path is desired, pass a list of lists of equal length as the h5paths list. An empty list within the list will cause all datasets in the corresponding path to be read in.
Returns:

DataFrame

Example:

>>> flname = 'drilldata.h5'
... h5paths = ['/Orig_data/series4870/', 'NS/Declus/series4870/']
... datasets = [['LOCATIONX', 'LOCATIONY', 'LOCATIONZ'], []]
... data = gs.h5_combine_data(flname, h5paths, datasets=datasets)

Code author: Tyler Acorn - May 2017

Pygeostat HDF5 Class

class pygeostat.data.h5_io.H5Store(flname, replace=False)

A simple class within pygeostat to manage and use HDF5 files.

Variables:
  • flname (str) – Path to a HDF5 file to create or use
  • h5data (h5py.File) – h5py File object
  • paths (dict) – Dictionary containing all of the groups found in the HDF5 file that contain datasets
Parameters:

flname (str) – Path to a HDF5 file to create or use

Usage:

Write a np.array or pd.Series to the HDF5 file:

>>> H5Store['Group1/Group2/Var1'] = np.array()

Write all the columns in a pd.DataFrame to the HDF5 file:

>>> H5Store['Group1/Group2'] = pd.DataFrame()

Retrieve a single 1-D array:

>>> array = H5Store['Group1/Group2/Var1']

Retrieve a single 1-D array within the root directory of the HDF5 file:

>>> array = H5Store['Var1']

Retrieve the first value from the array:

>>> value = H5Store['Var1', 0]

Retrieve a slice of values from the array:

>>> values = H5Store['Var1', 10:15]

Code author: Warren E. Black - 2016-06-09

Write Data

H5Store.__setitem__(key, value)

Write to the HDF5 file using the self[key] notation.

If a pd.Series or np.array is passed, the last entry in the path is used as the dataset name. If a pd.DataFrame is passed, all columns are written to the specified path, with their names retrieved from the pd.DataFrame’s columns. If more flexible usage is required, please use gs.write_h5().

Example

Write a np.array or pd.Series to the HDF5 file:

>>> H5Store['Group1/Group2/Var1'] = np.array()

Write all the columns in a pd.DataFrame to the HDF5 file:

>>> H5Store['Group1/Group2'] = pd.DataFrame()

Code author: Warren E. Black - 2016-06-09

Read Data

H5Store.__getitem__(key)

Retrieve an array using the self[key] notation. The passed key is the path used to access the desired array, including navigation through groups if required, and the dataset name. The array may be selectively queried, allowing a specific value or range of values to be loaded into memory rather than the whole array.

Example

Retrieve a single 1-D array:

>>> array = H5Store['Group1/Group2/Var1']

Retrieve a single 1-D array within the root directory of the HDF5 file:

>>> array = H5Store['Var1']

Retrieve the first value from the array:

>>> value = H5Store['Var1', 0]

Retrieve a slice of values from the array:

>>> values = H5Store['Var1', 10:15]

Code author: Warren E. Black - 2016-06-09

Close the HDF5 File

H5Store.close()

Release the open HDF5 file from python.

Code author: Warren E. Black - 2016-06-09

Datasets in H5 Store

H5Store.datasets(h5path=None)

Return the datasets found in the specified group.

Keyword Arguments:
 h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to retrieve the lists of datasets from. A dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
Returns:List of the datasets found within the specified h5path
Return type:datasets (list)

Code author: Warren E. Black - 2016-07-22

Generate Iterator

H5Store.iteritems(h5path=None, datasets=None, wildcard=None)

Produces an iterator that can be used to iterate over HDF5 datasets.

The parameter h5path indicates which group to retrieve the datasets from. If a set of specific datasets is required, the parameter datasets restricts the iterator to those. The parameter wildcard restricts iteration to datasets whose names contain the wild-card string.

Keyword Arguments:
 
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to retrieve datasets from. A dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.
  • datasets (list) – List of specific dataset names found within the specified group to iterate over
  • wildcard (str) – String to search for within the names of the datasets found within the specified group to iterate over

Examples

Load a HDF5 file to pygeostat:

>>> data = gs.H5Store('data.h5')

Iterate over all datasets within the root directory of a HDF5 file:

>>> for dataset in data.iteritems():
>>>     gs.histplt(dataset)

Iterate over the datasets within a specific group that are realizations:

>>> for dataset in data.iteritems(h5path='Simulation/NS_AU', wildcard='Realization'):
>>>     gs.histplt(dataset)

Code author: Warren E. Black - 2016-07-22

DictFile Class

class pygeostat.data.data.DictFile(flname=None, readfl=False, dictionary={})

Class containing dictionary file information

Read Dictionary

DictFile.read_dict()

Read dictionary information from file

Write Dictionary

DictFile.write_dict()

Write dictionary information to a csv-style dictionary file
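A csv-style dictionary file is assumed here to consist of simple key,value rows (the exact layout written by DictFile may differ); a minimal round-trip sketch:

```python
import csv

def write_dict(dictionary, fl):
    """Write each key/value pair as one csv row."""
    writer = csv.writer(fl)
    for key, value in dictionary.items():
        writer.writerow([key, value])

def read_dict(fl):
    """Read key,value rows back into a dict (keys come back as strings)."""
    return {row[0]: row[1] for row in csv.reader(fl) if row}
```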