`squid.surrogate_zoo`

Module Contents

Classes

`SurrogateBase`	Base class for surrogate model.
`SurrogateLinear`	Module for linear surrogate model (no GE or noise models).
`SurrogateLasso`	Module for linear surrogate model (no GE or noise models) using sklearn Lasso.
`SurrogateRidgeCV`	Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV.
`SurrogateLIME`	Module for linear surrogate model (no GE or noise models) using LIME.
`SurrogateMAVENN`	Module for MAVE-NN surrogate models (optional GE and noise models).

Functions

data_splits(N, test_split, valid_split[, rnd_seed])

Randomly split sequence indices into training, validation, and test sets.

Attributes

`TENSORFLOW_AVAILABLE`
`MAVENN_AVAILABLE`
`SKLEARN_AVAILABLE`

squid.surrogate_zoo.TENSORFLOW_AVAILABLE = True[source]

squid.surrogate_zoo.MAVENN_AVAILABLE = True[source]

squid.surrogate_zoo.SKLEARN_AVAILABLE = True[source]
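
The three flags above appear to record whether each optional dependency can be imported. How the module actually computes them is not shown in this reference, so the following is only a minimal sketch of one common pattern. The helper name `is_available` is hypothetical and is not part of `squid.surrogate_zoo`.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path.

    Hypothetical helper for illustration only. find_spec() locates the
    module without importing it, so a True result does not guarantee that
    the import itself will succeed.
    """
    return importlib.util.find_spec(module_name) is not None


# Flags named after the documented attributes; the values depend on the
# local environment, so no particular result is assumed here.
TENSORFLOW_AVAILABLE = is_available("tensorflow")
MAVENN_AVAILABLE = is_available("mavenn")
SKLEARN_AVAILABLE = is_available("sklearn")
```

Code that depends on an optional backend can check the matching flag before constructing a surrogate class and raise a clear error if the backend is missing.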

class squid.surrogate_zoo.SurrogateBase[source]

Base class for surrogate model.

abstract train(x)[source]

abstract get_params(x)[source]

class squid.surrogate_zoo.SurrogateLinear(input_shape, num_tasks, l1=1e-08, l2=0.0001, alphabet=['A', 'C', 'G', 'T'])[source]

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models).

Parameters:

l1 (float, optional) – L1 regularization penalty
l2 (float, optional) – L2 regularization penalty

Return type:

keras.Model

build(input_shape, num_tasks, l1, l2)[source]: Build linear surrogate model.

train(x, y, learning_rate=0.001, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, rnd_seed=None, save_dir=None, verbose=1)[source]: Train linear surrogate model.

get_params(gauge=None, save_dir=None)[source]

Function to return trained parameters from the linear model.

Parameters:

gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.

Returns:

theta_0 (None) – None, to match output of MAVE-NN get_params()
theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

get_logo(full_length=None, view_window=None)[source]

class squid.surrogate_zoo.SurrogateLasso(input_shape, num_tasks, alpha=1, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using sklearn Lasso. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

Parameters:

alpha (float, default=1.0) – Constant that multiplies the L1 term, controlling regularization strength; alpha must be a non-negative float, i.e., in [0, inf). When alpha = 0, the objective is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Instead, you should use the LinearRegression object.

Return type:

sklearn.Model

dataframe(x, y, alphabet, gpu)[source]

train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]

get_params(gauge=None, save_dir=None)[source]

Function to return trained parameters from the Lasso model.

Parameters:

gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.

Returns:

theta_0 (None) – None, to match output of MAVE-NN get_params()
theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

get_logo(full_length=None, view_window=None)[source]

class squid.surrogate_zoo.SurrogateRidgeCV(input_shape, num_tasks, alphas=[0.001, 0.01, 0.1, 1], cv=5, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

Parameters:

alphas (array-like of shape (n_alphas,), default=[0.001, 0.01, 0.1, 1]) – Array of alpha values to try. Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or LinearSVC. If using Leave-One-Out cross-validation, alphas must be positive.
cv (int, cross-validation generator or an iterable, default=5) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the efficient Leave-One-Out cross-validation
- integer, to specify the number of folds
- CV splitter
- An iterable yielding (train, test) splits as arrays of indices.

Return type:

sklearn.Model

dataframe(x, y, alphabet, gpu)[source]

train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]

get_params(gauge=None, save_dir=None)[source]

Function to return trained parameters from the RidgeCV model.

Parameters:

gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.

Returns:

theta_0 (None) – None, to match output of MAVE-NN get_params()
theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

get_logo(full_length=None, view_window=None)[source]

class squid.surrogate_zoo.SurrogateLIME(input_shape, num_tasks, k=20, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using LIME. For more information, see https://arxiv.org/pdf/1602.04938.pdf and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html

Parameters:

k (int) – The desired number of nonzero weights

Return type:

sklearn.Model

dataframe(x, y, alphabet, gpu)[source]

train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]

get_params(gauge=None, save_dir=None)[source]

Function to return trained parameters from the LIME model.

Parameters:

gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.

Returns:

theta_0 (None) – None, to match output of MAVE-NN get_params()
theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

get_logo(full_length=None, view_window=None)[source]

class squid.surrogate_zoo.SurrogateMAVENN(input_shape, num_tasks, gpmap='additive', regression_type='GE', linearity='nonlinear', noise='SkewedT', noise_order=2, reg_strength=0.1, hidden_nodes=50, alphabet=['A', 'C', 'G', 'T'], deduplicate=True, gpu=True)[source]

Bases: SurrogateBase

Module for MAVE-NN surrogate models (optional GE and noise models).

Parameters:

gpmap (str, {'additive', 'pairwise'}) – MAVE-NN surrogate model used to interpret the deep learning model.
- 'additive' : Assume that each position contributes independently to the latent phenotype.
- 'pairwise' : Assume that every pair of positions contributes to the latent phenotype.
regression_type (str) – Type of regression used for the measurement process.
- 'MPA' : Measurement process agnostic (categorical y-values).
- 'GE' : Global epistasis (continuous y-values).
linearity (str) – Whether to fit an additional nonlinearity to the data.
- 'nonlinear' : Additionally fit the data using the GE nonlinear function.
- 'linear' : Do not apply the GE nonlinearity when fitting the data.
noise (str) – Noise model to use when defining a GE model (no effect on MPA models). See https://mavenn.readthedocs.io/en/latest/math.html for more info.
- 'Gaussian' : Gaussian-based noise model.
- 'Cauchy' : Cauchy-based noise model.
- 'SkewedT' : SkewedT-based noise model.
noise_order (int) – In the GE context, the order of the polynomial(s) used to define noise model parameters. In the linear context, the order is zero by default.
reg_strength (float) – L2 regularization strength for G-P map parameters.
hidden_nodes (int) – Number of hidden nodes (i.e., sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.
alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.
deduplicate (bool) – Remove duplicate sequence-function pairs in dataset (default: True).
gpu (bool) – Enable GPUs (default: True).

Return type:

keras.Model

dataframe(x, y, alphabet, gpu)[source]

train(x, y, learning_rate=0.0005, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, save_dir=None, verbose=1)[source]

get_info(save_dir=None, verbose=1)[source]

Function to return estimated variational information from MAVE-NN model.

Parameters:

save_dir (str) – Directory for saving figures to file.
verbose (bool) – print info

Returns:

I_pred – MAVE-NN estimated variational information (I_pred), in bits.

Return type:

float

get_params(gauge='empirical', save_dir=None)[source]

Function to return trained parameters from MAVE-NN model.

Parameters:

gauge (str) – Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.
- 'uniform' : Hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).
- 'empirical' : Uses an empirical distribution computed from the training data.
- 'consensus' : Wild-type gauge using the training data consensus sequence.
save_dir (str) – Directory for saving figures to file.

Returns:

theta_0 (float) – Constant term in trained parameters.
theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
theta_lclc (numpy.ndarray) – Pairwise terms in trained parameters (shape: (L, C, L, C)), if gpmap is 'pairwise'.

Return type:

tuple
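
To illustrate how the returned parameters can be read, the sketch below scores a sequence under an additive model: the constant term plus one additive term per position, selected by the character at that position. This is a dependency-free illustration using nested lists rather than numpy arrays; the function name `additive_phenotype` and the toy parameter values are hypothetical, and pairwise terms are ignored.

```python
def additive_phenotype(sequence, theta_lc, theta_0=0.0,
                       alphabet=("A", "C", "G", "T")):
    """Sum the constant term and the per-position additive terms.

    Hypothetical helper for illustration only.
    theta_lc is indexed as theta_lc[position][character_index], i.e. it has
    shape (L, C) with C = len(alphabet).
    """
    if len(sequence) != len(theta_lc):
        raise ValueError("sequence length must equal L, the number of rows in theta_lc")
    # Map each character to its column index in theta_lc.
    char_index = {char: i for i, char in enumerate(alphabet)}
    additive_sum = sum(
        theta_lc[pos][char_index[char]] for pos, char in enumerate(sequence)
    )
    return theta_0 + additive_sum


# Toy parameters for a length-3 sequence over the DNA alphabet (made-up values).
toy_theta_lc = [
    [0.5, 0.0, 0.0, 0.0],   # position 0: only 'A' contributes
    [0.0, -1.0, 0.0, 0.0],  # position 1: only 'C' contributes
    [0.0, 0.0, 0.25, 0.0],  # position 2: only 'G' contributes
]
score = additive_phenotype("ACG", toy_theta_lc, theta_0=1.0)
```

For the linear surrogates documented above, theta_0 is returned as None, so a caller following this pattern would substitute 0.0 for the constant term.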

get_logo(full_length=None, view_window=None)[source]

Function to place trained additive parameters into surrounding nonmutated sequence (zeros).

Parameters:

full_length (int) – Full length of sequence.
view_window ([int, int]) – Index of start and stop position along sequence to probe; i.e., [start, stop], where start < stop and both entries satisfy 0 <= int <= L.

Returns:

additive_logo – Additive logo parameters (shape: (full_length, C)).

Return type:

numpy.ndarray
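
The embedding described above can be pictured with a small dependency-free sketch: start from a zero matrix with one row per position of the full sequence, then copy the trained additive terms into the rows covered by the window. This is not the library's implementation. The function name `embed_additive_logo` is hypothetical, nested lists stand in for numpy arrays, and the sketch assumes that `stop` is exclusive and that `stop - start` equals L.

```python
def embed_additive_logo(theta_lc, full_length, view_window):
    """Place (L, C) additive terms into a (full_length, C) matrix of zeros.

    Hypothetical helper for illustration only.
    Assumes view_window = [start, stop] with stop exclusive and
    stop - start == L (the number of rows in theta_lc).
    """
    start, stop = view_window
    if not (0 <= start < stop <= full_length):
        raise ValueError("view_window must satisfy 0 <= start < stop <= full_length")
    if stop - start != len(theta_lc):
        raise ValueError("window width must equal the number of rows in theta_lc")
    num_chars = len(theta_lc[0])
    # Positions outside the window were not mutated, so they stay at zero.
    logo = [[0.0] * num_chars for _ in range(full_length)]
    for offset, row in enumerate(theta_lc):
        logo[start + offset] = list(row)
    return logo


# Toy example: two trained positions placed at indices 1 and 2 of a length-5 sequence.
toy_theta = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 2.0, 0.0, 0.0]]
toy_logo = embed_additive_logo(toy_theta, full_length=5, view_window=[1, 3])
```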

squid.surrogate_zoo.data_splits(N, test_split, valid_split, rnd_seed=None)[source]

Randomly split sequence indices into training, validation, and test sets.

Parameters:

N (int) – Number of data points (sequences).
test_split (float, between 0 and 1) – Fraction of the data to place in the test set.
valid_split (float, between 0 and 1) – Fraction of the data to place in the validation set.
rnd_seed (int) – Random number seed.

Returns:

train_index – array of indices to be included in training set
valid_index – array of indices to be included in validation set
test_index – array of indices to be included in test set
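
As a rough picture of what such a split involves, the sketch below shuffles the indices 0..N-1 with a seeded generator and slices them into three disjoint groups. It is not the library's implementation: the function name `split_indices` is hypothetical, it uses only the standard library (returning lists rather than arrays), and rounding the set sizes to the nearest integer is an assumption.

```python
import random


def split_indices(n, test_split, valid_split, rnd_seed=None):
    """Shuffle range(n) and slice it into train/validation/test index lists.

    Hypothetical helper for illustration only.
    """
    if not (0 <= test_split <= 1 and 0 <= valid_split <= 1):
        raise ValueError("split fractions must lie between 0 and 1")
    if test_split + valid_split > 1:
        raise ValueError("test_split + valid_split must not exceed 1")
    rng = random.Random(rnd_seed)  # a seeded generator gives a reproducible shuffle
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_split)
    n_valid = round(n * valid_split)
    test_index = indices[:n_test]
    valid_index = indices[n_test:n_test + n_valid]
    train_index = indices[n_test + n_valid:]
    return train_index, valid_index, test_index


train_index, valid_index, test_index = split_indices(100, 0.1, 0.1, rnd_seed=0)
```

The return order mirrors the documented one (train, validation, test), and every index lands in exactly one of the three sets.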
