Monte Carlo Simulation¶
Note
This page is under development. Needs a section on how it works internally.
Overview¶
OPGEE support for Monte Carlo Simulation includes:
The script opgee/bin/combine_wor_data.py, which reads these input files from the directory opgee/mcs/etc:
    Norway_historical_WOR.csv
    UK_Results.MATLAB.csv
    WOR_observations_long.csv
and combines them to create all_wor.csv in that same directory. All rows in which “WOR” is zero or NaN are deleted. The gensim subcommand randomly draws values for fields’ WOR attributes from the remaining values. Note that this script needs to be run only if/when the input files are updated.
The csv2xml subcommand, which converts a specifically formatted CSV file containing field attributes to the XML representation required for running in OPGEE. The primary source file (opgee/etc/test-fields.csv) is based on data extracted from the OPGEEv3 Excel workbook.
The gensim subcommand, which reads opgee/mcs/etc/parameter_distributions.csv, a file describing probability distributions for model attributes. Gensim then generates a file containing data values drawn from these distributions for a defined number of trials, with one file generated per field explicitly or implicitly identified in the call to gensim.
The runsim subcommand, which runs a simulation by distributing it across multiple processors. Runsim can optionally run trials serially in a single process, which is useful primarily for debugging. Based on options in $HOME/opgee.cfg, the simulation can run on a single multi-processor computer or on a high-performance computing cluster using the SLURM job management system.
These scripts and subcommands are described in more detail below.
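The row-filtering rule described for combine_wor_data.py can be illustrated with a minimal sketch. This is not the script's actual code: the function name combine_wor, the in-memory CSV text, the "field" column, and the assumption that every input exposes a column literally named WOR are all hypothetical, and treating blank cells as missing is likewise an assumption.

```python
import csv
import io
import math


def combine_wor(csv_texts, column="WOR"):
    """Collect usable WOR values from several CSV documents.

    Rows whose WOR value is zero or NaN are dropped, as described above.
    Blank or non-numeric cells are also skipped (an assumption).
    """
    kept = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                continue  # blank or non-numeric cell: treat as missing
            if math.isnan(value) or value == 0.0:
                continue  # drop zero and NaN observations
            kept.append(value)
    return kept


# Hypothetical stand-ins for two of the input files.
norway = "field,WOR\nA,1.5\nB,0\nC,NaN\n"
uk = "field,WOR\nD,\nE,3.25\n"
values = combine_wor([norway, uk])
```

Only the values from rows A and E survive the filter in this example; the rest are zero, NaN, or blank.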
Generating a simulation¶
OPGEE uses a two-step process to run simulations. In the first step, the gensim subcommand creates
a “simulation directory” containing the model XML file, metadata describing the simulation (e.g., number
of trials, which fields are included), and field-specific subdirectories that each contain a file
trial_data.csv.
In the second step, the runsim subcommand creates a software “cluster” (a monitor process communicating
with some number of worker processes) using the dask package and instructs each worker to run a simulation
on a specified field. Results are saved to a file results.csv in each field-specific sub-directory of
the simulation directory.
The gensim subcommand¶
The gensim subcommand currently supports the following distributions, though it is fairly easy to add new ones:
Uniform
Normal
Truncated Normal
Lognormal
Triangular
Binary
Weighted binary
Empirical, in which values are drawn randomly from a specified column in a CSV file.
Other distributions are supported in the code but not by the parameter_distributions.csv file format currently used (more below). These include sequence, which returns a random selection from a given list of values; integers, which returns an integer value between given min and max values; and constant, which always returns the same user-defined value.
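To make the listed distribution types concrete, the sketch below draws values for each of them using Python's standard random module. It is a stand-in, not gensim's implementation: the function draw, its keyword names, the use of rejection sampling for truncation, the 0/1 encoding of binary outcomes, and the 0.5 default probability are all assumptions made for illustration.

```python
import random


def draw(kind, rng, **p):
    """Draw one value from the named distribution (illustrative only)."""
    if kind == "uniform":
        return rng.uniform(p["low"], p["high"])
    if kind == "normal":
        return rng.gauss(p["mean"], p["sd"])
    if kind == "truncated_normal":
        # Simple rejection sampling: redraw until the value is in bounds.
        while True:
            x = rng.gauss(p["mean"], p["sd"])
            if p["low"] <= x <= p["high"]:
                return x
    if kind == "lognormal":
        # mean and sd describe the underlying normal here; whether OPGEE
        # uses this convention is an assumption.
        return rng.lognormvariate(p["mean"], p["sd"])
    if kind == "triangular":
        return rng.triangular(p["low"], p["high"], p["mode"])
    if kind == "binary":
        # Passing prob_of_yes gives the weighted variant; 0.5 otherwise.
        return 1 if rng.random() < p.get("prob_of_yes", 0.5) else 0
    if kind == "empirical":
        # Pick one of the observed values at random.
        return rng.choice(p["values"])
    raise ValueError(f"unknown distribution: {kind!r}")


rng = random.Random(42)  # seeded so repeated runs give the same draws
trial_values = [
    draw("triangular", rng, low=0.0, high=10.0, mode=2.0) for _ in range(1000)
]
```

Each call produces the value for one trial, so a list comprehension over the number of trials yields one column of trial data.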
In the following example invocation of gensim:
the model XML file is set to “/users/joe/models/model1.xml”
the simulation directory is set to “/users/joe/sims/test1”
trial data is generated for 1000 trials
gensim -m /users/joe/models/model1.xml -t 1000 -s /users/joe/sims/test1
A full description of all options is available here.
Distributions file format¶
The parameter distributions file is a CSV file with the following columns:
variable_name – The name of an OPGEE model attribute. These are interpreted as attributes of the Field class unless preceded by a class name and “.”. For example, the variable name ReservoirWellInterface.frac_CO2_breakthrough is interpreted as the frac_CO2_breakthrough attribute of the ReservoirWellInterface process.
distribution_type – One of “Uniform”, “Binary”, “Normal”, “Lognormal”, “Triangular”, “Empirical”.
    “Uniform” distributions require values in the low_bound and high_bound columns.
    “Normal” and “Lognormal” distributions require values in the mean and SD columns. Values in the low_bound and high_bound columns convert this to a truncated normal distribution.
    For “Binary” distributions, a value in the prob_of_yes column converts this to a weighted binary distribution.
    “Triangular” distributions require values in the low_bound, high_bound, and default_value columns.
    “Empirical” distributions require a pathname to a CSV file, which must have a column whose name matches the value in the variable_name column.
mean – Used for “Normal” and “Lognormal” distributions.
SD – Used for “Normal” and “Lognormal” distributions.
low_bound and high_bound – Used to define “Uniform” and “Triangular” distributions and to truncate “Normal” distributions.
prob_of_yes – Used to turn a “Binary” distribution into a weighted binary distribution.
default_value – Defines the mode of a “Triangular” distribution.
pathname – Defines the CSV file to use for empirical data. Note that the same file can be used for multiple parameters provided that there is a column name that matches each parameter.
notes – For documentation only.
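The column rules above can be summarized in a short sketch that interprets one row of the distributions file. None of this is OPGEE's own code: the helper names (split_variable_name, effective_distribution, missing_columns), the lowercase labels they return, and the sample rows (including the attribute names some_flag and some_size) are hypothetical.

```python
# Columns each distribution type requires, per the rules listed above.
REQUIRED = {
    "uniform": ("low_bound", "high_bound"),
    "normal": ("mean", "SD"),
    "lognormal": ("mean", "SD"),
    "binary": (),
    "triangular": ("low_bound", "high_bound", "default_value"),
    "empirical": ("pathname",),
}


def has_value(row, column):
    """True if the row has a non-blank entry in the given column."""
    return str(row.get(column) or "").strip() != ""


def split_variable_name(name, default_class="Field"):
    """Return (class_name, attribute); bare names belong to Field."""
    class_name, dot, attr = name.rpartition(".")
    return (class_name, attr) if dot else (default_class, name)


def effective_distribution(row):
    """Apply the truncation and weighting rules to the declared type."""
    kind = row["distribution_type"].strip().lower()
    bounded = has_value(row, "low_bound") and has_value(row, "high_bound")
    if kind == "normal" and bounded:
        return "truncated normal"
    if kind == "binary" and has_value(row, "prob_of_yes"):
        return "weighted binary"
    return kind


def missing_columns(row):
    """List required columns that are blank for this row's type."""
    kind = row["distribution_type"].strip().lower()
    return [c for c in REQUIRED[kind] if not has_value(row, c)]


# Hypothetical rows, shaped like csv.DictReader output.
rows = [
    {"variable_name": "WOR", "distribution_type": "Empirical",
     "pathname": "all_wor.csv"},
    {"variable_name": "ReservoirWellInterface.frac_CO2_breakthrough",
     "distribution_type": "Normal", "mean": "0.5", "SD": "0.1",
     "low_bound": "0", "high_bound": "1"},
    {"variable_name": "some_flag", "distribution_type": "Binary",
     "prob_of_yes": "0.3"},
    {"variable_name": "some_size", "distribution_type": "Triangular",
     "low_bound": "1", "high_bound": "5"},
]
summary = [
    (split_variable_name(r["variable_name"]),
     effective_distribution(r),
     missing_columns(r))
    for r in rows
]
```

In this example the last row is reported as missing default_value, since a triangular distribution needs a mode in addition to its bounds.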
Running a simulation¶
Before running a simulation, the simulation directory must be created using the gensim
sub-command, described above. The simulation directory must be specified using the -s/--simulation-dir
argument to the runsim subcommand. This directory holds a fully expanded version of the model,
the input data generated from parameter distributions, and after running the simulation, the results
for each field.
The simulation directory contains a sub-directory for each field evaluated, in which the files “results.csv” and “failures.csv” will be written when all trials for the field have been run.
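Because the output files appear only once a field's trials have all run, their presence can serve as a simple completion check. The sketch below is a hypothetical helper, not part of OPGEE: it assumes that field sub-directories can be recognized by the trial_data.csv file that gensim writes, and it builds a throwaway directory tree purely for demonstration.

```python
import tempfile
from pathlib import Path


def field_status(sim_dir):
    """Map each field sub-directory name to 'finished' or 'pending'."""
    status = {}
    for sub in sorted(Path(sim_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip the model XML and other top-level files
        if not (sub / "trial_data.csv").is_file():
            continue  # assumption: field directories hold trial_data.csv
        done = all(
            (sub / name).is_file() for name in ("results.csv", "failures.csv")
        )
        status[sub.name] = "finished" if done else "pending"
    return status


# Build a hypothetical simulation directory with two fields.
with tempfile.TemporaryDirectory() as tmp:
    layout = [("field_a", ["results.csv", "failures.csv"]), ("field_b", [])]
    for field, outputs in layout:
        field_dir = Path(tmp) / field
        field_dir.mkdir()
        for name in ["trial_data.csv", *outputs]:
            (field_dir / name).touch()
    demo_status = field_status(tmp)
```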
The runsim sub-command¶
Note
On some cluster computing systems (e.g., Stanford’s “sherlock”) the runsim subcommand
must be run on a compute node to be able to communicate with worker tasks. Be sure to allocate
enough walltime for runsim to be able to monitor all results.
The runsim subcommand can run simulations in any of three modes:
Serially, in which one model run is executed at a time. This is the slowest method, but often the most convenient to use for debugging. To select serial mode, use the -S/--serial command-line option.
If the -S/--serial option is not used, the simulation mode is determined from the configuration file variable OPGEE.ClusterType, which defaults to local. The other recognized value is slurm.
In local mode, the simulation is run on a single- or multiple-CPU computer. By default, runsim will spawn a process for each available processor. The number of tasks can be controlled by the -n/--ntasks argument to runsim. Each process runs the designated number of trials for a field before moving on to any remaining fields.
If -S/--serial is not used and the value of OPGEE.ClusterType is slurm, the SLURM task management system is used. Note that this option works only on high-performance computing (HPC) clusters that use SLURM. In this mode, runsim submits a designated number of jobs, which are allocated to available compute nodes. Again, each process runs the required trials for one field to completion before starting on any remaining fields. Note that there are several configuration file options controlling behavior on SLURM systems.
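The precedence among the three modes can be written down as a small decision function. This is a sketch of the rules as described on this page, not runsim's code: the function names, the choice to raise an error for an unrecognized cluster type, and the use of os.cpu_count() to stand in for "each available processor" are assumptions.

```python
import os


def select_mode(serial=False, cluster_type="local"):
    """Return 'serial', 'local', or 'slurm' following the rules above.

    serial mirrors the serial command-line option; cluster_type mirrors
    the OPGEE.ClusterType configuration variable (default 'local').
    """
    if serial:
        return "serial"  # the serial option overrides the config setting
    if cluster_type in ("local", "slurm"):
        return cluster_type
    # Assumption: unrecognized values are rejected rather than ignored.
    raise ValueError(f"unrecognized OPGEE.ClusterType: {cluster_type!r}")


def local_task_count(ntasks=None):
    """Number of worker processes to start in local mode."""
    if ntasks is not None:
        return ntasks  # an explicit task count takes priority
    return os.cpu_count() or 1  # otherwise one per available processor
```

For example, select_mode(serial=True, cluster_type="slurm") yields serial mode, because the command-line option is checked before the configuration variable.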
In the example below:
we use the same simulation directory (“/users/joe/sims/test1”) created in the gensim example above
we run only the first 100 trials
we run only field “test_field” (there may be multiple fields defined in the analysis)
runsim -f test_field -t 0-99 -s /users/joe/sims/test1
A full description of all options is available here.
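The -t 0-99 argument in the runsim example selects trials 0 through 99 inclusive, which is why it covers the first 100 trials. The sketch below shows that arithmetic with a hypothetical parser; it handles only the single "first-last" form used in the example, and any other syntax that runsim may accept is outside its scope.

```python
def parse_trial_range(spec):
    """Turn a 'first-last' string into an inclusive range of trial numbers."""
    first_text, sep, last_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected 'first-last', got {spec!r}")
    first, last = int(first_text), int(last_text)
    if last < first:
        raise ValueError(f"range end precedes start in {spec!r}")
    return range(first, last + 1)  # +1 because the last trial is included


trials = parse_trial_range("0-99")
```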