Simulating Genetic Regulatory Networks |
Version
1.2
Richard Scheines and Joseph Ramsey
To simulate data in Tetrad 4, you must specify a Lag graph and then interpret it as a parametric model. There are 4 choices of parametric models, where in each either i) the value of each variable at the current time is a function of its causal parents and an error term, or ii) the probability distribution of the variable is a function of its causal parents.
Understanding the General Model
Consider a discrete time series involving N individuals, each with M variables G1 through Gm. For example, Figure 11 represents a time series for three variables, each measured at 5 times.
Understanding Time Series Graphs
If we assume that the process modelled is stable over time, then we can represent the causal structure of the series with a time series graph that includes the smallest fragment of the series that repeats. The number of temporal slices in the time series graph is the longest lag of direct influence plus one. For example, the time series graph in Figure 33, which represents the series in Figure 11, needs three temporal slices to represent a repeating sequence, because G2 has a direct effect on G3 with a temporal lag of two.
In order to simulate data from this class of models, we need to express each variable (or its probability distribution) at an arbitrary "current time" as a function of its direct causes. To represent only the set of direct causes for each variable, we construct a Lag graph. For example, the Lag graph in Figure 55 expresses the causal information needed to simulate the time series represented in Figure 11 and Figure 33. The Lag graph consists of the 3 variables repeated in temporal slices from lag = mlag (most remote direct influence) to lag=0 (current time), but includes only edges that are into variables at lag = 0, that is, it includes only edges that represent direct influences into the current time.
Note: We connect all pairs of variables at the beginning of the repeating sequence with a double-headed arrow to represent an unconstrained causal connection, so that d-separation applied to this graph does not entail that these variables are independent, contrary to the fact that later versions of the same variable are causally connected in the time series graph
The time lag i in a Lag graph is indexed by ":Li". Thus, variable G1 at the current time (lag of 0) in an Lag graph is G1:L0, and the same variable two time slices in the past G1:L2. The constant mlag is the maximum lag of direct influence.
Undertanding the Data Generated
To simulate data in Tetrad 4, you must specify a Lag graph and then interpret it as a parametric model. There are 4 choices of parametric models, where in each either i) the value of each variable at the current time is a function of its causal parents and an error term, or ii) the probability distribution of the variable is a function of its causal parents. Given an instantiation of the chosen parametric model, and the assumption that the causal relations remain constant over time for each individual, values for N individuals (cells) over T times can be simulated to produce a data cube that is:
N (individuals) x M (variables) x T (times).
Although the cube has three dimensions, we will store it in the standard two,
by repeating M columns for each time slice, as so:
The simulation to produce these data is run in two stages:
an "initialization"
phase and an "update" phase. The initialization phase must assign values
for each individual for at least as many times as the maximum
lag in the time series model, i.e, from time t = 1 until time t = mlag.
After time � mlag, values can be assigned to variables for a given individual
using the given instantiation of the parametric model that interprets the Lag
graph.
Data collection regimes for protein expression involve tissues that contain
thousands of cells, not all of which behave strictly identically and not all
of which can be measured in isolation.
Since current technology cannot perfectly measure the levels, even relatively,
of gene (or mRNA) expression, or levels of protein synthesis in cells, we also
allow the user to specify a measurement model that aggregates cells by dish
and models measurement error.
Using Tetrad: Setting Up the Program
This section will explain how to specify the graph, choose a parametric model
to interpret the graph, instantiate the parameters of the model, pick an initialization
routine, and finally specify the measurement model.
When the program opens, it will display the empty workbench. In order to generate data, you need to specify a graph, a parametric model (PM), an instantiation of this model (IM), and connect them all to a Data modelNode. These objects can be deposited on the workbench by clicking on their icon in the left and clicking where you want them located on the workbench, and then by clicking on the "Flow-Charter" icon on the left and then dragging from one modelNode to another to connect them. The skeleton for a simulation looks like Figure 9.
After double-clicking on the graph modelNode - you will be given a choice (Figure 11) of whether you want to specify a regular graph or a Lag graph.
You will then be prompted for whether you want to specify the graph manually or randomly, with manually the default.
2.1 Manual Graph Specification
2.2 Random Graph Specification
If you choose Random Graph Specification, you will be
confronted with the dialogue box in Figure 15, which asks you to set four parameters.
2.3 Output and Storage
If the graph involves a managable number of vertices
(<30), then the graph editor represents it pictorially, allowing you to edit
the graph by adding or taking away edges. For example, in Figure 17 we show
a Lag graph with five genes and a maxlag = 2 after adding a few other regulating
connections.
Figure 17: Lag Graph with 5 Genes and Mlag=2
If the number of genes is large, e.g. 5,000, then representing the Lag
graph pictorially is a waste. In that case we change to a textual representation
in which each gene is given a row, and the genes that regulate (cause) it are
listed on its row. For example, Figure 19 shows the top 20 or so lines of a
Lag graph
with 500 variables. In this representation, Gene 5 (G005) is regulated by itself
one time step back (G005:1) and by Gene 461 two time steps back (G461:2). This
representation can be saved as a text file to disk and then processed by software
of your choice.
Figure 19: Large Lag Graph Represented Textually
Specifying the Parametric Model
Having specified the Lag
graph, you must now intrepret the graph as a parametric model. Currently,
only one type of parametric model is implemented-- the "Boolean Glass Gene
Parametric Model." When you double-click the PM modelNode at this, point, that
type of parametric model will automatically be chosen for you.
Eventually, four types of parametric models will be implemented, as follows:
These families are not exclusive, but each has advantages.
3.1 Glass Updating
- Implemented.
In both Glass (Edwards & Glass, 2000) and General Updating
parametric models, the value of each gene Gi at the current time (a lag of 0)
Gi:L0 is set by a function with the following general form:
where: is an
"error" term drawn from a given probability distribution (discussed below),
all variables are continuous, and Fi is a function specified by the user. The
difference between Glass and General updating models involves only the form
of the functions Fi.
3.1.1 Preliminary Binary Projection
Even though the variables in this parametric model class are continuous, Glass
functions take boolean valued inputs, so some pre-processing is necessary. The
inputs to the function Fi are the members of the set P = {parents(Gi:L0)/Gi:L1},
that is, the parents of Gi:L0 except for Gi:L1, which is already in the updating
function. The idea behind Glass functions is to simplify the input to whether
a given gene is expressing at "high" or "low" levels. We map each Gp � P to
0 (low, that is, below its average expression level of 0) or 1 (high, that is,
above 0). We do this with the following binary projection BP:
Simulated values in for variables Gi will range mostly over the interval [-2.0,
2.0] and will oscillate above and below 0.0. Since raw microarray data is typically
given as a ratio of microarray spot brightness to average spot intensity, where
this ratio typically ranges from near 0.0 up to about 10.0, with averages around
1.0, it is useful to think of the simulated data as loglinear with respect to
raw data-viz., log(x) of each intensity ratio x is recorded in the simulation
in place of the intensity ratio x.
Note: We thank Stuart Kauffmann for pointing us to the Glass model.
3.1.2 The Function Table for Fi
Since genetic regulators either inhibit or activate their targets, Edwards
and Glass (2000, p.3) set the range of Fi to {-1,1}. To specify a particular
Fi, construct a truth-table with 1 column for each
and one for the output of Fi. Thus the truth table will have 2 to the power
of P rows. For example, if P = {G3:L1, G5:L2}, then the function table for Fi
is:
Filling in the Fi column in this function table specifies
the particular instantiation of Fi used in the update function for variable
Gi, thus this step is left to the Instantiated Model specification below.
By picking the Glass Updating parametric family, you are committing yourself
to the class of models parametrized by the update function above and Glass functions
for the Fi.
3.2 General
Updating - Unimplemented.
Again, in both Glass and General Updating parametric models, the current value
of each variable Gi:0 is set by a function with the following general form.
where: is
an "error" term drawn from a given probability distribution (discussed below),
all variables are continuous, and Fi is a function specified by the user. The
difference between Glass and General updating models involves only the form
of the functions Fi. In General Updating parametric models, the user is free
to specify any function for the Fi, not just Glass functions. Here we will use
Tianjiao's GAM specifier to allow the user to specify, for each i, the function
Fi.
3.3 GRN
GAM - Unimplemented.
In this parametric family, the variables are continuous and the update function
is slightly more general than the ones used in 3.1 and 3.2.
GAM stands for general additive model, so the only constraint on this parametric
model is that the Gi are additive functions. The particular instantiaions of
Fi for each i are fixed when you specify the instantiated model (IM). Here we
will use Tianjiao's GAM specifier to allow the user to specify, for each i,
the function Gi.
3.4 BN SEM
- Unimplemented.
In this parametric family, the variables are discrete and the causal system
is just interpreted as a standard discrete Bayes Network, that is, the update
function is just a conditional probability table. <br>
At the parametric level, you need to specify the number of categories for each
variable, as well as the value of each category. This is identical to the Bayes
Net parametric model in general Tetrad 4 models.
Specifying the Instantiated Model
Once the parametric model has been specified, you need only specfiy values for
the parameters in the corresponding instantiated
model (IM). To do this, double-click on the IM modelNode in the session workbench.
Close the IM editor to finish. Since the parameters depend on the PM chosen,
we cover each of the above classes in turn.
4.1 Glass Updating
Recall that the update
function used in Glass model is:
The free parameters for a Glass model are thus:
By the structure of the Glass function, an unregulated
gene (no other genes besides itself affecting it) will have an equilbirum value
of 0. Rate 1 determines how quickly such genes return to equilbrium. If Rate
1 is set = .1, for example (its default), then in each time step a gene with
no regulators will return 1/10th of the way toward 0, whatever its previous
value, before noise. Rate2 governs the influence of other regulators. The Glass
functions Fi output +1 (promoting expression) or -1 (inhibiting expression).
If Rate 2 is set = .5, then in one time step the contribution to a gene is either
+.5 or -.5.
The error terms are meant to simulate random biological error in the process
of gene expression. We give a default of mean 0 and standard deviation = .05.
The range of Fi is {-1,1}. The number of possible functions for each Fi is ,
where P is the set of parents of Gi. For example, if P = {G3:L1, G5:L2}, then
the uninstantiated function table for Fi is:
which can be filled out in 16 different ways. If P contains 5 parents, then
there are 32 rows in Fi and there are 2^32 different ways you can instantiate
Fi. Thus, we give you a choice of filling in all Fi randomly or manually, with
randomly the default.
Randomly:
5. Initializing the Cells
Given the full parametric model and its instantiation, and assuming that the
same model is used for each individual cell, then values for N individuals over
T times can be simulated in two stages: an "initialization" phase and an "update"
phase. The initialization phase must assign values for all the genes in each
individual for time step 1. After this, the update function specified above
can be used to assign values to variables for a given individual, with the caveat
that if mlag > 1, then not all the "parents" of a variable in time 2 will exist.
In that case the update function will simply use the latest value of that variable
that does exist as input.
Biologically, cells from the same tissue can be starved of nutrients, or manipulated
in some other way and all brought into roughly the same state, that is, synchronized.
They can then be shocked, e.g., exposed to nutrients, and let run.
Further, many genes in a cell are considered "housekeeping genes," which express
protein at some basal rate stably over time. In this version of the simulator
we allow the user to choose among two methods for initializing values: synchronized
or random.
5.1 Synchronized Initialization
a. if Gi is a housekeeping gene, that is, its only parent in the Lag graph is itself, then let Gi:1 = 0,
b. else draw Gi from a standard normal distribution, i.e, N(0,1)
5.2 Random Initialization
a. if Gi is a housekeeping gene, that is, its only parent in the Lag graph is itself, then let Gi:1 = 0,
b. else draw Gi:1 from a standard normal distribution, i.e, N(0,1)
Aggregation and Measurement Error
6.1 Aggregation by Dish
So far, we have modelled the progression of gene activity over time for individual
cells. We cannot currently obtain biological data on this level, however. In
microarray experiments, thousands (or millions) of cells are grown in each of
several dishes. For example, Figure 26 shows two dishes with a million cells
each.
A microarray experiment involves at least the following sequence of steps to
begin the experiment:
Initial Dish to Dish Differences
Although they are intentionally minimized, small variations in nutrient or temperature
make dish to dish variations inevitable, even at the beginning of an experiment.
To simulate these differences, we need to specify how much variation in expression
levels we can expect between dishes. Let that quantity be the standard deviation
of expression level difference due to the dish, which we will call sd(DV),
with default at 10%.
For each dish Dj, we draw Dev(Dj) from a normal distribution with mean 100 and
standard deviation = sd(DV), that is, N(100, sd(DV)).
We then adjust the initial^3 expression levels in all genes in all the cells
in dish Dj by Dev(Dj):
For each dish Dj
For each cell Ck in Dj,
For each gene Gi in Ck at time 1 (Gi:1),Let Gi = Dev(Dj) % of Gi
Dishes and Time
In the usual studies, although several dishes might begin an experiment in a
relatively "synchronized" state, each dish must be "frozen" before it can be
processed for measurement, and thus we cannot use the same dish to measure cells
at two different time points in the experiment. To measure two time points,
we take a sample from one dish at time 1 and a sample from a different dish
at time 2, e.g., Figure 29.
If the goal is to compare the expression levels of a particular gene at two
different times to see if they substantially differ, then this constraint forces
us to compare the level of a sample from one dish against a sample from another.
This constraint will NOT be imposed by the simulator, but rather by the data
analyst after data is generated and stored.
6.2 Measurement Error
After a dish is "frozen" at a time, its RNA is extracted. From this RNA, several
samples can be drawn and each "measured" by exposing it to a microarray chip
(chips can be re-used) and then digitally converting pixel intensities to gene
expression levels (Figure 31).
Chips can be reused approximately four times. Samples from the same dish might
vary slightly, chips definitely vary, the same chip functions differently from
one use to another, and the pixel digitization routine might include error as
well. Each "measurement" therefore, has five sources of potential error:
We discussed how we will model the dish variability above. In this version of
the simulator, we will NOT model dish to dish variability beyond initialization.
To model the other sources of measurement error, we will proceed as follows.
After initializing the cells in an experiment, and adding dish to dish variabiltiy,
we will have N cells partitioned into D dishes. After we run the experiment,
in which each cell obeys the same causal laws but updates independently of each
other, each cell has an expression level at each of t times.
After Running the Experiment
We call this the Raw data from the experiment.
After RNA Extraction
Next, we model the RNA extraction process for each dish by aggregating all the
cells in a dish, averaging the expression levels for each gene, and recording
an average expression level for each gene in each dish at each time (Figure
34):
The average expression level for a gene Gi at time t
in a dish d, is:
Adding Measurement Error There remain 4 sources of measurement error:
sample, chip to chip, chip re-use, and pixel digitization error. In this version
of the simulator - we will NOT model chip re-use variability. We treat each
of the other as additive normal error with mean 0 and standard deviation
Let Sample to Sample Variability error
Let Chip to Chip error
Pixel digitization error
For each sample s taken from a dish d and measured, new values of
are drawn. For each gene in each sample, a new value of
is drawn. Thus the average measured expression level for each gene Gi at time
t is:
If we draw 4 samples from each dish, and don't re-use any chips, then our final
data table would look as follows:
We call this the Measurement Data.
To summarize, the parameters the user must specify for the measurement error
model, with defaults in brackets are:
Running a Simulation
To initiate a data generation process, double click on the red-die on the arrow
from an IM modelNode to a Data modelNode (). The program will only produce data if it
has all the info it needs, otherwise it will prompt you.
Simulation Run Parameters:
After clicking the red die, a dialogue box comes up showing the parameters you
must specify, along with their default values (Figure 38).
We discussed the measurement model parameters in section 6. For the Simulation
Parameters, the number of dishes defaults to 1 and the number of cells per dish
to 100,000. The number of time steps generated defaults at 4, and the user has
the option of when to start storing data (default at time step 1). The user
can also choose how often to store data, with the default every step.
Finally, because there are often thousands of cells and thousands of genes in
each cell, many simulations will involve millions of gene expression levels
at each time. Call this the raw data. The measured, as opposed to raw data,
involves only a few samples from each dish, each of which contains as many measurements
as there are genes. Since storing the raw data is expensive relative to the
aggregated data with measurement error - we give the user a choice as to whether
to store raw data, with the default = no. If the user chooses to store raw data,
both data sets will be contained in the data object on the workbench in Tetrad.
Edwards, R., & Glass L. (2000). Combinatorial explosion in model gene networks., Chaos, V. 10, N. 3, September., pp. 1-14.