squid.surrogate_zoo

Module Contents

Classes

SurrogateBase

Base class for surrogate model.

SurrogateLinear

Module for linear surrogate model (no GE or noise models).

SurrogateLasso

Module for linear surrogate model (no GE or noise models) using sklearn Lasso.

SurrogateRidgeCV

Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV.

SurrogateLIME

Module for linear surrogate model (no GE or noise models) using LIME.

SurrogateMAVENN

Module for MAVE-NN surrogate models (optional GE and noise models).

Functions

data_splits(N, test_split, valid_split[, rnd_seed])

Function to randomly split sequence indices into training, validation, and test sets.

Attributes

TENSORFLOW_AVAILABLE

MAVENN_AVAILABLE

SKLEARN_AVAILABLE

squid.surrogate_zoo.TENSORFLOW_AVAILABLE = True
squid.surrogate_zoo.MAVENN_AVAILABLE = True
squid.surrogate_zoo.SKLEARN_AVAILABLE = True
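
The three attributes above read like optional-dependency flags. This reference does not show how SQUID sets them, so the following is only a sketch of one common pattern, assuming each flag reflects whether the corresponding package can be found. The helper name is_available is hypothetical and not part of squid.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be located without importing it.

    Hypothetical helper for illustration; not part of squid.
    """
    return importlib.util.find_spec(module_name) is not None


# Assumed mapping from flag name to package name.
TENSORFLOW_AVAILABLE = is_available("tensorflow")
MAVENN_AVAILABLE = is_available("mavenn")
SKLEARN_AVAILABLE = is_available("sklearn")

# A caller could branch on a flag before choosing a surrogate class.
if MAVENN_AVAILABLE:
    print("MAVE-NN surrogates can be used")
else:
    print("MAVE-NN not found; consider a linear surrogate instead")
```

In practice, checking squid.surrogate_zoo.MAVENN_AVAILABLE before constructing SurrogateMAVENN may give a clearer error than a failed import deep inside training.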
class squid.surrogate_zoo.SurrogateBase

Base class for surrogate model.

abstract train(x)
abstract get_params(x)
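
As an illustration of the abstract-method contract listed above, here is a minimal sketch using Python's abc module. SurrogateBaseSketch is a stand-in written for this example (the real SurrogateBase is not imported), and MeanSurrogate is a hypothetical toy subclass; neither is part of squid.

```python
from abc import ABC, abstractmethod


class SurrogateBaseSketch(ABC):
    """Stand-in that mirrors the two abstract methods listed above."""

    @abstractmethod
    def train(self, x):
        """Fit the surrogate to data."""

    @abstractmethod
    def get_params(self, x):
        """Return the fitted parameters."""


class MeanSurrogate(SurrogateBaseSketch):
    """Hypothetical toy surrogate: 'training' just stores the mean of x."""

    def train(self, x):
        self.mean_ = sum(x) / len(x)
        return self

    def get_params(self, x=None):
        return self.mean_


model = MeanSurrogate().train([1.0, 2.0, 3.0])
print(model.get_params())

# The abstract base itself cannot be instantiated.
try:
    SurrogateBaseSketch()
except TypeError as err:
    print("cannot instantiate:", type(err).__name__)
```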
class squid.surrogate_zoo.SurrogateLinear(input_shape, num_tasks, l1=1e-08, l2=0.0001, alphabet=['A', 'C', 'G', 'T'])

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models).

Parameters:
  • l1 (float, optional) – L1 regularization penalty

  • l2 (float, optional) – L2 regularization penalty

Return type:

keras.Model

build(input_shape, num_tasks, l1, l2)

Build linear surrogate model.

train(x, y, learning_rate=0.001, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, rnd_seed=None, save_dir=None, verbose=1)

Train linear surrogate model.

get_params(gauge=None, save_dir=None)

Function to return trained parameters from the linear model.

Parameters:
  • gauge (None) – None, to match output of MAVE-NN get_params()

  • save_dir (str) – Directory for saving figures to file.

Returns:
  • theta_0 (None) – None, to match output of MAVE-NN get_params()

  • theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).

  • theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple
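
Because every surrogate's get_params() returns the same three-slot tuple, with None in the unused slots, downstream code can handle all surrogate types through one path. The sketch below shows that idea on stand-in data; summarize_params is a hypothetical helper, and the nested lists stand in for the numpy.ndarray that the real method returns.

```python
def summarize_params(params):
    """Describe a (theta_0, theta_lc, theta_lclc) tuple, tolerating None slots.

    Hypothetical helper for illustration; not part of squid.
    """
    theta_0, theta_lc, theta_lclc = params
    return {
        "has_constant": theta_0 is not None,
        "additive_shape": (len(theta_lc), len(theta_lc[0])),  # (L, C)
        "has_pairwise": theta_lclc is not None,
    }


# Stand-in for the output of a linear surrogate with L=3 positions, C=4 characters.
linear_like = (None, [[0.0] * 4 for _ in range(3)], None)
print(summarize_params(linear_like))
```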

class squid.surrogate_zoo.SurrogateLasso(input_shape, num_tasks, alpha=1, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using sklearn Lasso. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

Parameters:

alpha (float, default=1.0) – Constant that multiplies the L1 term, controlling regularization strength; alpha must be a non-negative float i.e. in [0, inf). When alpha = 0, the objective is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Instead, you should use the LinearRegression object.

Return type:

sklearn.Model
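
To make the role of alpha concrete, the sketch below evaluates the objective that scikit-learn documents for Lasso, (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1, in plain Python. It ignores the intercept for brevity, and lasso_objective is a hypothetical helper, not a function of squid or scikit-learn.

```python
def lasso_objective(X, y, w, alpha):
    """Evaluate the Lasso objective for weights w (no intercept term).

    X is a list of rows, y a list of targets, w a list of weights.
    """
    n_samples = len(y)
    # Squared error between targets and the linear predictions X @ w.
    squared_error = sum(
        (yi - sum(xij * wj for xij, wj in zip(xi, w))) ** 2
        for xi, yi in zip(X, y)
    )
    # L1 penalty: the term that alpha multiplies.
    l1_norm = sum(abs(wj) for wj in w)
    return squared_error / (2 * n_samples) + alpha * l1_norm


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
w = [1.0, 1.0]

print(lasso_objective(X, y, w, alpha=0.0))  # least-squares part only
print(lasso_objective(X, y, w, alpha=0.5))  # with the L1 penalty added
```

With alpha = 0 only the least-squares term remains, which is why the parameter description points to LinearRegression for that case.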

dataframe(x, y, alphabet, gpu)
train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)
get_params(gauge=None, save_dir=None)

Function to return trained parameters from the Lasso model.

Parameters:
  • gauge (None) – None, to match output of MAVE-NN get_params()

  • save_dir (str) – Directory for saving figures to file.

Returns:
  • theta_0 (None) – None, to match output of MAVE-NN get_params()

  • theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).

  • theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

class squid.surrogate_zoo.SurrogateRidgeCV(input_shape, num_tasks, alphas=[0.001, 0.01, 0.1, 1], cv=5, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV. For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

Parameters:
  • alphas (array-like of shape (n_alphas,), default=[0.001, 0.01, 0.1, 1]) – Array of alpha values to try. Each value is a regularization strength and must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or LinearSVC. If using Leave-One-Out cross-validation, alphas must be positive.

  • cv (int, cross-validation generator or an iterable, default=5) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:
    • None, to use the efficient Leave-One-Out cross-validation

    • integer, to specify the number of folds

    • CV splitter

    • An iterable yielding (train, test) splits as arrays of indices.

Return type:

sklearn.Model

dataframe(x, y, alphabet, gpu)
train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)
get_params(gauge=None, save_dir=None)

Function to return trained parameters from the RidgeCV model.

Parameters:
  • gauge (None) – None, to match output of MAVE-NN get_params()

  • save_dir (str) – Directory for saving figures to file.

Returns:
  • theta_0 (None) – None, to match output of MAVE-NN get_params()

  • theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).

  • theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

class squid.surrogate_zoo.SurrogateLIME(input_shape, num_tasks, k=20, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

Bases: SurrogateBase

Module for linear surrogate model (no GE or noise models) using LIME. For more information, see https://arxiv.org/pdf/1602.04938.pdf and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html

Parameters:

k (int) – The desired number of nonzero weights

Return type:

sklearn.Model

dataframe(x, y, alphabet, gpu)
train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)
get_params(gauge=None, save_dir=None)

Function to return trained parameters from the LIME model.

Parameters:
  • gauge (None) – None, to match output of MAVE-NN get_params()

  • save_dir (str) – Directory for saving figures to file.

Returns:
  • theta_0 (None) – None, to match output of MAVE-NN get_params()

  • theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).

  • theta_lclc (None) – None, to match output of MAVE-NN get_params()

Return type:

tuple

class squid.surrogate_zoo.SurrogateMAVENN(input_shape, num_tasks, gpmap='additive', regression_type='GE', linearity='nonlinear', noise='SkewedT', noise_order=2, reg_strength=0.1, hidden_nodes=50, alphabet=['A', 'C', 'G', 'T'], deduplicate=True, gpu=True)

Bases: SurrogateBase

Module for MAVE-NN surrogate models (optional GE and noise models).

Parameters:
  • gpmap (string {'additive' or 'pairwise'}) –

    Define MAVE-NN surrogate model used to interpret deep learning model.
    • 'additive' : Assume that each position contributes independently to the latent phenotype.

    • 'pairwise' : Assume that every pair of positions contributes to the latent phenotype.

  • regression_type (string) –

    Type of regression used for measurement process.
    • 'MPA' : measurement process agnostic (categorical y-values).

    • 'GE' : global epistasis (continuous y-values).

  • linearity (string) –

    Define use of additional nonlinearity for fitting data.
    • 'nonlinear' : Additionally fit data using GE nonlinear function.

    • 'linear' : Do not apply GE nonlinearity for fitting data.

  • noise (string) –

    Noise model to use when defining a GE model (no effect on MPA models). See https://mavenn.readthedocs.io/en/latest/math.html for more info.
    • 'Gaussian' : Gaussian-based noise model.

    • 'Cauchy' : Cauchy-based noise model.

    • 'SkewedT' : SkewedT-based noise model.

  • noise_order (int) – In the GE context, the order of the polynomial(s) used to define noise model parameters. In the linear context, the order is zero by default.

  • reg_strength (float) – L2 regularization strength for G-P map parameters.

  • hidden_nodes (int) – Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model. Has no effect on MPA models.

  • alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.

  • deduplicate (boolean) – Remove duplicate sequence-function pairs in dataset (default: True).

  • gpu (boolean) – Enable GPUs (default: True).

Return type:

keras.Model
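
The constructor, train(), and get_params() signatures documented for this class suggest a construct, train, read-out sequence. The sketch below strings them together using only the keyword arguments listed in this reference. It is not taken from the SQUID documentation: fit_mavenn_surrogate is a hypothetical wrapper, and passing x_mut.shape as input_shape and y_mut.shape[1] as num_tasks are assumptions about what those arguments expect. The import is deferred so that the function can be defined even when squid is not installed.

```python
def fit_mavenn_surrogate(x_mut, y_mut, alphabet=("A", "C", "G", "T")):
    """Hypothetical wrapper: fit an additive GE surrogate and return its parameters.

    x_mut is assumed to be a one-hot array of mutagenized sequences and
    y_mut a 2-D array of model predictions, one column per task.
    """
    # Deferred import: requires SQUID (and MAVE-NN) to be installed.
    from squid import surrogate_zoo

    if not surrogate_zoo.MAVENN_AVAILABLE:
        raise RuntimeError("MAVE-NN is not available in this environment")

    model = surrogate_zoo.SurrogateMAVENN(
        x_mut.shape,               # assumption: input_shape is the array shape
        num_tasks=y_mut.shape[1],  # assumption: one column of y per task
        gpmap="additive",
        regression_type="GE",
        linearity="nonlinear",
        noise="SkewedT",
        noise_order=2,
        reg_strength=0.1,
        hidden_nodes=50,
        alphabet=list(alphabet),
        deduplicate=True,
        gpu=True,
    )

    # Keyword arguments mirror the documented defaults of train().
    model.train(
        x_mut,
        y_mut,
        learning_rate=0.0005,
        epochs=500,
        batch_size=100,
        early_stopping=True,
        patience=25,
        restore_best_weights=True,
        save_dir=None,
        verbose=1,
    )

    # Documented return: (theta_0, theta_lc, theta_lclc).
    return model.get_params(gauge="empirical", save_dir=None)
```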

dataframe(x, y, alphabet, gpu)
train(x, y, learning_rate=0.0005, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, save_dir=None, verbose=1)
get_info(save_dir=None, verbose=1)

Function to return estimated variational information from MAVE-NN model.

Parameters:
  • save_dir (str) – Directory for saving figures to file.

  • verbose (bool) – print info

Returns:

I_pred – MAVE-NN estimated variational information (I_pred), in bits.

Return type:

float

get_params(gauge='empirical', save_dir=None)

Function to return trained parameters from MAVE-NN model.

Parameters:
  • gauge (string) –

    Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.
    • 'uniform' : hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).

    • 'empirical' : uses an empirical distribution computed from the training data.

    • 'consensus' : wild-type gauge using the training data consensus sequence.

  • save_dir (str) – Directory for saving figures to file.

Returns:
  • theta_0 (float) – Constant term in trained parameters.

  • theta_lc (numpy.ndarray) – Additive terms in trained parameters (shape: (L, C)).

  • theta_lclc (numpy.ndarray) – Pairwise terms in trained parameters (shape: (L, C, L, C)), if gpmap is 'pairwise'.

Return type:

tuple
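
To illustrate what fixing a gauge means for the additive terms, the sketch below re-expresses (theta_0, theta_lc) so that each position's row sums to zero, moving the removed row means into theta_0. It assumes an additive model and a uniform distribution over all C characters at every position (that is, every character is observed). It is not MAVE-NN's implementation, and to_uniform_gauge and predict are hypothetical helpers operating on nested lists.

```python
def to_uniform_gauge(theta_0, theta_lc):
    """Shift additive parameters so each position's row has zero mean."""
    new_theta_0 = theta_0
    new_theta_lc = []
    for row in theta_lc:
        row_mean = sum(row) / len(row)
        new_theta_0 += row_mean  # absorb the position's mean into the constant
        new_theta_lc.append([value - row_mean for value in row])
    return new_theta_0, new_theta_lc


def predict(theta_0, theta_lc, char_indices):
    """Additive model output for a sequence given as character indices."""
    return theta_0 + sum(theta_lc[pos][c] for pos, c in enumerate(char_indices))


theta_0 = 1.0
theta_lc = [[1.0, 2.0, 3.0, 6.0], [0.0, 0.0, 4.0, 0.0]]  # L=2, C=4
fixed_0, fixed_lc = to_uniform_gauge(theta_0, theta_lc)

sequence = [3, 2]  # character index at each position
print(predict(theta_0, theta_lc, sequence), predict(fixed_0, fixed_lc, sequence))
```

The two printed predictions agree: a gauge change alters the parameter values but not the model's output, which is why parameters are only comparable across models once the same gauge is chosen.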

Function to place trained additive parameters into surrounding nonmutated sequence (zeros).

Parameters:
  • full_length (int) – Full length of sequence.

  • view_window ([int, int]) – Index of start and stop position along sequence to probe; i.e., [start, stop], where start < stop and both entries satisfy 0 <= int <= L.

Returns:

additive_logo – Additive logo parameters (shape: (full_length, C)).

Return type:

numpy.ndarray
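
The padding described above can be sketched in plain Python as follows. This is an illustration rather than the library's code: pad_additive is a hypothetical helper, nested lists stand in for numpy arrays, and the stop index of view_window is assumed to be exclusive.

```python
def pad_additive(theta_lc, full_length, view_window):
    """Place an (L, C) block of additive terms into a (full_length, C) grid of zeros."""
    start, stop = view_window
    if not 0 <= start < stop <= full_length:
        raise ValueError("view_window must satisfy 0 <= start < stop <= full_length")
    if stop - start != len(theta_lc):
        raise ValueError("view_window length must match the number of rows in theta_lc")

    num_chars = len(theta_lc[0])
    # Positions outside the window were not mutated, so they stay at zero.
    padded = [[0.0] * num_chars for _ in range(full_length)]
    for offset, row in enumerate(theta_lc):
        padded[start + offset] = list(row)
    return padded


window_params = [[1.0, 2.0], [3.0, 4.0]]  # L=2 positions, C=2 characters
for row in pad_additive(window_params, full_length=5, view_window=[1, 3]):
    print(row)
```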

squid.surrogate_zoo.data_splits(N, test_split, valid_split, rnd_seed=None)

Function to randomly split sequence indices into training, validation, and test sets.

Parameters:
  • N (int) – number of data points

  • test_split (float (between 0 and 1)) – fraction of the data to place in the test set

  • valid_split (float (between 0 and 1)) – fraction of the data to place in the validation set

  • rnd_seed (int) – random number seed

Returns:

  • train_index – array of indices to be included in training set

  • valid_index – array of indices to be included in validation set

  • test_index – array of indices to be included in test set
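
For intuition, a random three-way index split with the same inputs and the same return order can be sketched with the standard library alone. This is not the implementation of data_splits: data_splits_sketch is a hypothetical function, and the rounding of split sizes and the sorting of each index list are assumptions made for this example.

```python
import random


def data_splits_sketch(N, test_split, valid_split, rnd_seed=None):
    """Randomly partition range(N) into train, validation, and test indices."""
    rng = random.Random(rnd_seed)  # local generator, so the global state is untouched
    indices = list(range(N))
    rng.shuffle(indices)

    n_test = int(N * test_split)
    n_valid = int(N * valid_split)

    test_index = sorted(indices[:n_test])
    valid_index = sorted(indices[n_test:n_test + n_valid])
    train_index = sorted(indices[n_test + n_valid:])
    return train_index, valid_index, test_index


train_index, valid_index, test_index = data_splits_sketch(
    100, test_split=0.1, valid_split=0.1, rnd_seed=0
)
print(len(train_index), len(valid_index), len(test_index))
```

Fixing rnd_seed makes the partition reproducible, which helps when several surrogate models need to be compared on identical held-out data.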