:py:mod:`squid.surrogate_zoo`
=============================

.. py:module:: squid.surrogate_zoo


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.SurrogateBase
   squid.surrogate_zoo.SurrogateLinear
   squid.surrogate_zoo.SurrogateLasso
   squid.surrogate_zoo.SurrogateRidgeCV
   squid.surrogate_zoo.SurrogateLIME
   squid.surrogate_zoo.SurrogateMAVENN


Functions
~~~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.data_splits


Attributes
~~~~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.TENSORFLOW_AVAILABLE
   squid.surrogate_zoo.MAVENN_AVAILABLE
   squid.surrogate_zoo.SKLEARN_AVAILABLE


.. py:data:: TENSORFLOW_AVAILABLE
   :value: True

.. py:data:: MAVENN_AVAILABLE
   :value: True

.. py:data:: SKLEARN_AVAILABLE
   :value: True

.. py:class:: SurrogateBase

   Base class for surrogate model.

   .. !! processed by numpydoc !!

   .. py:method:: train(x)
      :abstractmethod:

   .. py:method:: get_params(x)
      :abstractmethod:

.. py:class:: SurrogateLinear(input_shape, num_tasks, l1=1e-08, l2=0.0001, alphabet=['A', 'C', 'G', 'T'])

   Bases: :py:obj:`SurrogateBase`

   Module for linear surrogate model (no GE or noise models).

   :param l1: L1 regularization penalty
   :type l1: float, optional
   :param l2: L2 regularization penalty
   :type l2: float, optional

   :rtype: keras.Model

   .. !! processed by numpydoc !!

   .. py:method:: build(input_shape, num_tasks, l1, l2)

      Build linear surrogate model.

      .. !! processed by numpydoc !!

   .. py:method:: train(x, y, learning_rate=0.001, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, rnd_seed=None, save_dir=None, verbose=1)

      Train linear surrogate model.

      .. !! processed by numpydoc !!

   .. py:method:: get_params(gauge=None, save_dir=None)

      Function to return trained parameters from the linear model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns: theta_0 : None
                    None, to match output of MAVE-NN get_params()
                theta_lc : numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc : None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple

      .. !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

.. py:class:: SurrogateLasso(input_shape, num_tasks, alpha=1, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

   Bases: :py:obj:`SurrogateBase`

   Module for linear surrogate model (no GE or noise models) using sklearn Lasso.
   For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

   :param alpha: Constant that multiplies the L1 term, controlling regularization strength;
                 alpha must be a non-negative float, i.e. in [0, inf).
                 When alpha = 0, the objective is equivalent to ordinary least squares,
                 solved by the LinearRegression object. For numerical reasons, using
                 alpha = 0 with the Lasso object is not advised; use the
                 LinearRegression object instead.
   :type alpha: float, default=1.0

   :rtype: sklearn.Model

   .. !! processed by numpydoc !!

   .. py:method:: dataframe(x, y, alphabet, gpu)

   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)

   .. py:method:: get_params(gauge=None, save_dir=None)

      Function to return trained parameters from the Lasso model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns: theta_0 : None
                    None, to match output of MAVE-NN get_params()
                theta_lc : numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc : None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple

      .. !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

.. py:class:: SurrogateRidgeCV(input_shape, num_tasks, alphas=[0.001, 0.01, 0.1, 1], cv=5, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

   Bases: :py:obj:`SurrogateBase`

   Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV.
   For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

   :param alphas: Array of alpha values to try.
                  Regularization strength; must be a positive float. Regularization
                  improves the conditioning of the problem and reduces the variance of
                  the estimates. Larger values specify stronger regularization.
                  Alpha corresponds to 1 / (2C) in other linear models such as
                  LogisticRegression or LinearSVC.
                  If using Leave-One-Out cross-validation, alphas must be positive.
   :type alphas: array-like of shape (n_alphas,), default=[0.001, 0.01, 0.1, 1]
   :param cv: Determines the cross-validation splitting strategy.
              Possible inputs for cv are:

              - None, to use the efficient Leave-One-Out cross-validation
              - integer, to specify the number of folds
              - CV splitter
              - An iterable yielding (train, test) splits as arrays of indices.
   :type cv: int, cross-validation generator or an iterable, default=5

   :rtype: sklearn.Model

   .. !! processed by numpydoc !!

   .. py:method:: dataframe(x, y, alphabet, gpu)

   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)

   .. py:method:: get_params(gauge=None, save_dir=None)

      Function to return trained parameters from the RidgeCV model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns: theta_0 : None
                    None, to match output of MAVE-NN get_params()
                theta_lc : numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc : None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple

      .. !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

.. py:class:: SurrogateLIME(input_shape, num_tasks, k=20, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)

   Bases: :py:obj:`SurrogateBase`

   Module for linear surrogate model (no GE or noise models) using LIME.
   For more information, see https://arxiv.org/pdf/1602.04938.pdf
   and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html

   :param k: The desired number of nonzero weights
   :type k: int

   :rtype: sklearn.Model

   .. !! processed by numpydoc !!

   .. py:method:: dataframe(x, y, alphabet, gpu)

   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)

   .. py:method:: get_params(gauge=None, save_dir=None)

      Function to return trained parameters from the LIME model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns: theta_0 : None
                    None, to match output of MAVE-NN get_params()
                theta_lc : numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc : None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple

      .. !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

.. py:class:: SurrogateMAVENN(input_shape, num_tasks, gpmap='additive', regression_type='GE', linearity='nonlinear', noise='SkewedT', noise_order=2, reg_strength=0.1, hidden_nodes=50, alphabet=['A', 'C', 'G', 'T'], deduplicate=True, gpu=True)

   Bases: :py:obj:`SurrogateBase`

   Module for MAVE-NN surrogate models (optional GE and noise models).

   :param gpmap: Define MAVE-NN surrogate model used to interpret deep learning model.
                 'additive' : Assume that each position contributes independently to the latent phenotype.
                 'pairwise' : Assume that every pair of positions contributes to the latent phenotype.
   :type gpmap: string {'additive' or 'pairwise'}
   :param regression_type: Type of regression used for measurement process.
                           'MPA' : measurement process agnostic (categorical y-values).
                           'GE' : global epistasis (continuous y-values).
   :type regression_type: string
   :param linearity: Define use of additional nonlinearity for fitting data.
                     'nonlinear' : Additionally fit data using GE nonlinear function.
                     'linear' : Do not apply GE nonlinearity for fitting data.
   :type linearity: string
   :param noise: Noise model to use when defining a GE model (no effect on MPA models).
                 See https://mavenn.readthedocs.io/en/latest/math.html for more info.
                 'Gaussian' : Gaussian-based noise model.
                 'Cauchy' : Cauchy-based noise model.
                 'SkewedT' : SkewedT-based noise model.
   :type noise: string
   :param noise_order: In the GE context, the order of the polynomial(s) used to define noise model parameters.
                       In the linear context, the order is zero by default.
   :type noise_order: int
   :param reg_strength: L2 regularization strength for G-P map parameters.
   :type reg_strength: float
   :param hidden_nodes: Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the
                        nonlinearity component of a GE model. Has no effect on MPA models.
   :type hidden_nodes: int
   :param alphabet: The alphabet used to determine the C characters in the logo such that
                    each entry is a string; e.g., ['A','C','G','T'] for DNA.
   :type alphabet: list
   :param deduplicate: Remove duplicate sequence-function pairs in dataset (default: True).
   :type deduplicate: boolean
   :param gpu: Enable GPUs (default: True).
   :type gpu: boolean

   :rtype: keras.Model

   .. !! processed by numpydoc !!

   .. py:method:: dataframe(x, y, alphabet, gpu)

   .. py:method:: train(x, y, learning_rate=0.0005, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, save_dir=None, verbose=1)

   .. py:method:: get_info(save_dir=None, verbose=1)

      Function to return estimated variational information from MAVE-NN model.

      :param save_dir: Directory for saving figures to file.
      :type save_dir: str
      :param verbose: Print info.
      :type verbose: bool

      :returns: **I_pred** -- MAVE-NN estimated variational information (I_pred), in bits.
      :rtype: float

      .. !! processed by numpydoc !!

   .. py:method:: get_params(gauge='empirical', save_dir=None)

      Function to return trained parameters from MAVE-NN model.

      :param gauge: Gauge mode used to fix model parameters.
                    See https://mavenn.readthedocs.io/en/latest/math.html for more info.
                    'uniform' : hierarchical gauge using a uniform sequence distribution over
                    the characters at each position observed in the training set
                    (unobserved characters are assigned probability 0).
                    'empirical' : hierarchical gauge using an empirical distribution computed
                    from the training data.
                    'consensus' : wild-type gauge using the training data consensus sequence.
      :type gauge: str
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns: theta_0 : float
                    Constant term in trained parameters.
                theta_lc : numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc : numpy.ndarray
                    Pairwise terms in trained parameters (shape : (L,C,L,C)), if gpmap is 'pairwise'.
      :rtype: tuple

      .. !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

      Function to place trained additive parameters into surrounding nonmutated sequence (zeros).

      :param full_length: Full length of sequence.
      :type full_length: int
      :param view_window: Index of start and stop position along sequence to probe;
                          i.e., [start, stop], where start < stop and both entries
                          satisfy 0 <= int <= L.
      :type view_window: [int, int]

      :returns: **additive_logo** -- Additive logo parameters (shape : ('full_length',C)).
      :rtype: numpy.ndarray

      .. !! processed by numpydoc !!

.. py:function:: data_splits(N, test_split, valid_split, rnd_seed=None)

   Function to randomly split sequence indices into training, validation, and test sets.

   :param N: Number of data points.
   :type N: int
   :param test_split: Fraction of the data to assign to the test set.
   :type test_split: float (between 0 and 1)
   :param valid_split: Fraction of the data to assign to the validation set.
   :type valid_split: float (between 0 and 1)
   :param rnd_seed: Random number seed.
   :type rnd_seed: int

   :returns: * *train_index* -- array of indices to be included in training set
             * *valid_index* -- array of indices to be included in validation set
             * *test_index* -- array of indices to be included in test set

   .. !! processed by numpydoc !!
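
The following is a minimal sketch of the kind of index split that ``data_splits`` describes. It is not SQUID's implementation: ``split_indices`` is a hypothetical stand-in written only from the documented signature, and the shuffle-then-slice strategy, the rounding of set sizes with ``int``, and the use of ``numpy.random.default_rng`` are all assumptions.

```python
import numpy as np


def split_indices(n, test_split, valid_split, rnd_seed=None):
    """Hypothetical stand-in for data_splits (assumed behavior, not SQUID code)."""
    if not 0 <= test_split + valid_split <= 1:
        raise ValueError("test_split + valid_split must lie in [0, 1]")

    # Shuffle all indices once so the three sets are disjoint by construction.
    rng = np.random.default_rng(rnd_seed)
    shuffled = rng.permutation(n)

    # Assumption: set sizes are truncated to whole numbers of samples.
    n_test = int(n * test_split)
    n_valid = int(n * valid_split)

    test_index = shuffled[:n_test]
    valid_index = shuffled[n_test:n_test + n_valid]
    train_index = shuffled[n_test + n_valid:]  # everything left over
    return train_index, valid_index, test_index


# Same return order as documented: train, validation, test.
train_index, valid_index, test_index = split_indices(1000, 0.1, 0.1, rnd_seed=0)
```

Slicing a single permutation keeps the three index arrays disjoint and jointly exhaustive, and passing a fixed ``rnd_seed`` makes the split reproducible across runs.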