squid.surrogate_zoo
Module Contents
Classes
- SurrogateBase: Base class for surrogate model.
- SurrogateLinear: Module for linear surrogate model (no GE or noise models).
- SurrogateLasso: Module for linear surrogate model (no GE or noise models) using sklearn Lasso.
- SurrogateRidgeCV: Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV.
- SurrogateLIME: Module for linear surrogate model (no GE or noise models) using LIME.
- SurrogateMAVENN: Module for MAVE-NN surrogate models (optional GE and noise models).
Functions
- data_splits: Randomly assign sequence indices to the training, validation, and test sets.
Attributes
- class squid.surrogate_zoo.SurrogateLinear(input_shape, num_tasks, l1=1e-08, l2=0.0001, alphabet=['A', 'C', 'G', 'T'])[source]
Bases: SurrogateBase
Module for linear surrogate model (no GE or noise models).
- Parameters:
- Return type:
keras.Model
- train(x, y, learning_rate=0.001, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, rnd_seed=None, save_dir=None, verbose=1)[source]
Train linear surrogate model.
- get_params(gauge=None, save_dir=None)[source]
Function to return trained parameters from the linear model.
- Parameters:
gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.
- Returns:
- theta_0 (None) – None, to match output of MAVE-NN get_params()
- theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
- theta_lclc (None) – None, to match output of MAVE-NN get_params()
- Return type:
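To make the (L, C) shape of theta_lc concrete, here is a minimal sketch of how additive terms score a one-hot encoded sequence. The helpers `one_hot` and `additive_score` and the toy parameter values are hypothetical illustrations, not part of squid.

```python
import numpy as np

ALPHABET = ["A", "C", "G", "T"]

def one_hot(seq, alphabet=ALPHABET):
    """Hypothetical helper: encode a sequence as an (L, C) one-hot array."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    x = np.zeros((len(seq), len(alphabet)))
    for l, ch in enumerate(seq):
        x[l, index[ch]] = 1.0
    return x

def additive_score(x, theta_lc):
    """Sum the additive term selected by the character at each position."""
    return float(np.sum(theta_lc * x))

# Toy additive parameters for L = 3 positions and C = 4 characters (made up).
theta_lc = np.array([
    [0.5, -0.1, 0.0, 0.2],
    [0.0, 0.3, -0.4, 0.1],
    [-0.2, 0.0, 0.6, 0.0],
])

x = one_hot("ACG")
score = additive_score(x, theta_lc)  # picks 0.5 (A), 0.3 (C), 0.6 (G)
```

Each row of theta_lc corresponds to a sequence position and each column to a character of the alphabet, so a one-hot input selects exactly one term per position.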
- class squid.surrogate_zoo.SurrogateLasso(input_shape, num_tasks, alpha=1, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]
Bases: SurrogateBase
Module for linear surrogate model (no GE or noise models) using sklearn Lasso. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
- Parameters:
alpha (float, default=1.0) – Constant that multiplies the L1 term, controlling regularization strength; alpha must be a non-negative float i.e. in [0, inf). When alpha = 0, the objective is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Instead, you should use the LinearRegression object.
- Return type:
sklearn.Model
- train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]
- get_params(gauge=None, save_dir=None)[source]
Function to return trained parameters from the Lasso model.
- Parameters:
gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.
- Returns:
- theta_0 (None) – None, to match output of MAVE-NN get_params()
- theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
- theta_lclc (None) – None, to match output of MAVE-NN get_params()
- Return type:
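For reference, the role of alpha can be read off the objective that scikit-learn documents for Lasso, shown below. Treating X as the flattened one-hot sequences (one column per position-character pair), so that the fitted weights w reshape to theta_lc, is an assumption about how this class uses the estimator.

```latex
\min_{w} \; \frac{1}{2\, n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

With alpha = 0 the penalty term vanishes and the objective reduces to ordinary least squares, which is why LinearRegression is recommended in that case.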
- class squid.surrogate_zoo.SurrogateRidgeCV(input_shape, num_tasks, alphas=[0.001, 0.01, 0.1, 1], cv=5, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]
Bases: SurrogateBase
Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
- Parameters:
alphas (array-like of shape (n_alphas,), default=[0.001, 0.01, 0.1, 1]) – Array of alpha values to try. Each alpha is a regularization strength and must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or LinearSVC. If using Leave-One-Out cross-validation, alphas must be positive.
cv (int, cross-validation generator or an iterable, default=5) – Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the efficient Leave-One-Out cross-validation
- an integer, to specify the number of folds
- a CV splitter
- an iterable yielding (train, test) splits as arrays of indices
- Return type:
sklearn.Model
- train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]
- get_params(gauge=None, save_dir=None)[source]
Function to return trained parameters from the RidgeCV model.
- Parameters:
gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.
- Returns:
- theta_0 (None) – None, to match output of MAVE-NN get_params()
- theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
- theta_lclc (None) – None, to match output of MAVE-NN get_params()
- Return type:
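The interplay of alphas and an integer cv can be sketched in plain NumPy: fit a ridge model for each candidate alpha on cv - 1 folds, score it on the held-out fold, and keep the alpha with the lowest mean error. This is an illustration of the idea only, not squid's or scikit-learn's implementation; `ridge_fit` and `select_alpha` are hypothetical helpers, the data are synthetic, and no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def select_alpha(X, y, alphas, cv=5, seed=0):
    """Return the alpha with the lowest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), cv)
    mean_errors = []
    for alpha in alphas:
        errors = []
        for k in range(cv):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(cv) if j != k])
            w = ridge_fit(X[train], y[train], alpha)
            errors.append(np.mean((X[test] @ w - y[test]) ** 2))
        mean_errors.append(float(np.mean(errors)))
    return alphas[int(np.argmin(mean_errors))], mean_errors

# Synthetic data: N random sequences of length L over C characters,
# one-hot encoded and flattened to L * C features.
rng = np.random.default_rng(0)
N, L, C = 200, 6, 4
seqs = rng.integers(0, C, size=(N, L))
X = np.eye(C)[seqs].reshape(N, L * C)
true_theta = rng.normal(size=L * C)
y = X @ true_theta + 0.1 * rng.normal(size=N)

alphas = [0.001, 0.01, 0.1, 1]
best_alpha, mean_errors = select_alpha(X, y, alphas, cv=5)

# Reshape the flat weights to the (L, C) layout used for theta_lc.
theta_lc = ridge_fit(X, y, best_alpha).reshape(L, C)
```

Because the one-hot columns at each position sum to one, the unregularized problem is rank deficient; a strictly positive alpha keeps the linear system solvable.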
- class squid.surrogate_zoo.SurrogateLIME(input_shape, num_tasks, k=20, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)[source]
Bases: SurrogateBase
Module for linear surrogate model (no GE or noise models) using LIME. For more information, see https://arxiv.org/pdf/1602.04938.pdf and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html
- Parameters:
k (int) – The desired number of nonzero weights
- Return type:
sklearn.Model
- train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)[source]
- get_params(gauge=None, save_dir=None)[source]
Function to return trained parameters from the LIME model.
- Parameters:
gauge (None) – None, to match output of MAVE-NN get_params()
save_dir (str) – Directory for saving figures to file.
- Returns:
- theta_0 (None) – None, to match output of MAVE-NN get_params()
- theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
- theta_lclc (None) – None, to match output of MAVE-NN get_params()
- Return type:
- class squid.surrogate_zoo.SurrogateMAVENN(input_shape, num_tasks, gpmap='additive', regression_type='GE', linearity='nonlinear', noise='SkewedT', noise_order=2, reg_strength=0.1, hidden_nodes=50, alphabet=['A', 'C', 'G', 'T'], deduplicate=True, gpu=True)[source]
Bases: SurrogateBase
Module for MAVE-NN surrogate models (optional GE and noise models).
- Parameters:
gpmap (string, 'additive' or 'pairwise') – Defines the MAVE-NN surrogate model used to interpret the deep learning model.
- 'additive' : Assume that each position contributes independently to the latent phenotype.
- 'pairwise' : Assume that every pair of positions contributes to the latent phenotype.
regression_type (string) – Type of regression used for the measurement process.
- 'MPA' : measurement process agnostic (categorical y-values).
- 'GE' : global epistasis (continuous y-values).
linearity (string) – Defines whether an additional nonlinearity is used when fitting data.
- 'nonlinear' : Additionally fit data using the GE nonlinear function.
- 'linear' : Do not apply the GE nonlinearity when fitting data.
noise (string) – Noise model to use when defining a GE model (no effect on MPA models). See https://mavenn.readthedocs.io/en/latest/math.html for more info.
- 'Gaussian' : Gaussian-based noise model.
- 'Cauchy' : Cauchy-based noise model.
- 'SkewedT' : SkewedT-based noise model.
noise_order (int) – In the GE context, the order of the polynomial(s) used to define noise model parameters. In the linear context, the order is zero by default.
reg_strength (float) – L2 regularization strength for G-P map parameters.
hidden_nodes (int) – Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.
alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.
deduplicate (boolean) – Remove duplicate sequence-function pairs in dataset (default: True).
gpu (boolean) – Enable GPUs (default: True).
- Return type:
keras.Model
- train(x, y, learning_rate=0.0005, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, save_dir=None, verbose=1)[source]
- get_info(save_dir=None, verbose=1)[source]
Function to return estimated variational information from MAVE-NN model.
- get_params(gauge='empirical', save_dir=None)[source]
Function to return trained parameters from MAVE-NN model.
- Parameters:
gauge (str) – Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.
- 'uniform' : hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).
- 'empirical' : uses an empirical distribution computed from the training data.
- 'consensus' : wild-type gauge using the training data consensus sequence.
save_dir (str) – Directory for saving figures to file.
- Returns:
- theta_0 (float) – Constant term in trained parameters.
- theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).
- theta_lclc (numpy.ndarray) – Pairwise terms in trained parameters (shape: (L, C, L, C)), if gpmap is 'pairwise'.
- Return type:
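As a rough illustration of how the three returned arrays combine, and of what fixing a gauge means, the sketch below evaluates a latent phenotype from theta_0, theta_lc, and an optional theta_lclc, then shifts an additive model into a zero-sum gauge. The helpers `latent_phenotype` and `to_zero_sum_gauge` are hypothetical and not MAVE-NN or squid functions. The sketch assumes that only position pairs with l < l' contribute to the pairwise sum and that every character is observed at every position, in which case the zero-sum gauge is what a uniform distribution would give for an additive model.

```python
import numpy as np

def latent_phenotype(x, theta_0, theta_lc, theta_lclc=None):
    """Evaluate the latent phenotype for a one-hot input x of shape (L, C)."""
    phi = theta_0 + np.sum(theta_lc * x)
    if theta_lclc is not None:
        L = x.shape[0]
        # Contribution of each position pair (l, l'), summed over characters.
        pair = np.einsum("lc,lcmd,md->lm", x, theta_lclc, x)
        # Assumption: keep only pairs with l < l'.
        phi += np.sum(np.triu(np.ones((L, L)), k=1) * pair)
    return float(phi)

def to_zero_sum_gauge(theta_0, theta_lc):
    """Move each position's mean into theta_0 so every row sums to zero."""
    means = theta_lc.mean(axis=1, keepdims=True)
    return theta_0 + float(means.sum()), theta_lc - means

rng = np.random.default_rng(0)
L, C = 5, 4
theta_0 = 0.3
theta_lc = rng.normal(size=(L, C))
theta_lclc = rng.normal(size=(L, C, L, C))
x = np.eye(C)[rng.integers(0, C, size=L)]  # random one-hot sequence

phi_additive = latent_phenotype(x, theta_0, theta_lc)
phi_pairwise = latent_phenotype(x, theta_0, theta_lc, theta_lclc)

# Gauge fixing changes the parameters but not the model's predictions.
fixed_0, fixed_lc = to_zero_sum_gauge(theta_0, theta_lc)
phi_fixed = latent_phenotype(x, fixed_0, fixed_lc)
```

The last three lines show the point of a gauge: many parameter sets give identical predictions, so one representative is chosen by convention before the parameters are plotted or compared.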
- squid.surrogate_zoo.data_splits(N, test_split, valid_split, rnd_seed=None)[source]
Function to randomly assign sequence indices to the training, validation, and test sets.
- Parameters:
- Returns:
train_index – array of indices to be included in training set
valid_index – array of indices to be included in validation set
test_index – array of indices to be included in test set
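A random index split of this kind can be sketched as follows. This is not squid's implementation: `split_indices` is a hypothetical stand-in that mirrors the signature above, and it assumes that test_split and valid_split are fractions of N, which the documentation here does not state.

```python
import numpy as np

def split_indices(N, test_split, valid_split, rnd_seed=None):
    """Shuffle the indices 0..N-1 and cut them into three disjoint sets."""
    rng = np.random.default_rng(rnd_seed)
    shuffled = rng.permutation(N)
    n_test = int(N * test_split)    # assumption: splits are fractions of N
    n_valid = int(N * valid_split)
    test_index = shuffled[:n_test]
    valid_index = shuffled[n_test:n_test + n_valid]
    train_index = shuffled[n_test + n_valid:]
    return train_index, valid_index, test_index

train_index, valid_index, test_index = split_indices(1000, 0.1, 0.1, rnd_seed=0)
```

Passing a fixed rnd_seed makes the split reproducible, which helps when comparing surrogate models trained on the same data.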