:py:mod:`squid.surrogate_zoo`
=============================

.. py:module:: squid.surrogate_zoo


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.SurrogateBase
   squid.surrogate_zoo.SurrogateLinear
   squid.surrogate_zoo.SurrogateLasso
   squid.surrogate_zoo.SurrogateRidgeCV
   squid.surrogate_zoo.SurrogateLIME
   squid.surrogate_zoo.SurrogateMAVENN


Functions
~~~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.data_splits


Attributes
~~~~~~~~~~

.. autoapisummary::

   squid.surrogate_zoo.TENSORFLOW_AVAILABLE
   squid.surrogate_zoo.MAVENN_AVAILABLE
   squid.surrogate_zoo.SKLEARN_AVAILABLE


.. py:data:: TENSORFLOW_AVAILABLE
   :value: True

   
.. py:data:: MAVENN_AVAILABLE
   :value: True

   
.. py:data:: SKLEARN_AVAILABLE
   :value: True

   
.. py:class:: SurrogateBase


   Base class for surrogate models.


   ..
       !! processed by numpydoc !!
   .. py:method:: train(x)
      :abstractmethod:


   .. py:method:: get_params(x)
      :abstractmethod:


.. py:class:: SurrogateLinear(input_shape, num_tasks, l1=1e-08, l2=0.0001, alphabet=['A', 'C', 'G', 'T'])


   Bases: :py:obj:`SurrogateBase`

   
   Module for linear surrogate model (no GE or noise models).

   :param l1: L1 regularization penalty
   :type l1: float, optional
   :param l2: L2 regularization penalty
   :type l2: float, optional

   :rtype: keras.Model


   ..
       !! processed by numpydoc !!
   .. py:method:: build(input_shape, num_tasks, l1, l2)

      
      Build linear surrogate model.


      ..
          !! processed by numpydoc !!

   .. py:method:: train(x, y, learning_rate=0.001, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, rnd_seed=None, save_dir=None, verbose=1)

      
      Train linear surrogate model.


      ..
          !! processed by numpydoc !!

   .. py:method:: get_params(gauge=None, save_dir=None)

      
      Function to return trained parameters from the linear model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns:

                theta_0     :   None
                    None, to match output of MAVE-NN get_params()
                theta_lc    :   numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc  :   None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple


      ..
          !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)


.. py:class:: SurrogateLasso(input_shape, num_tasks, alpha=1, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)


   Bases: :py:obj:`SurrogateBase`

   
   Module for linear surrogate model (no GE or noise models) using sklearn Lasso.
   For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

   :param alpha: Constant that multiplies the L1 term, controlling regularization strength;
                 alpha must be a non-negative float i.e. in [0, inf). When alpha = 0,
                 the objective is equivalent to ordinary least squares, solved by the
                 LinearRegression object. For numerical reasons, using alpha = 0 with the
                 Lasso object is not advised. Instead, you should use the LinearRegression object.
   :type alpha: float, default=1.0

   :rtype: sklearn.Model


   ..
       !! processed by numpydoc !!
   .. py:method:: dataframe(x, y, alphabet, gpu)


   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)


   .. py:method:: get_params(gauge=None, save_dir=None)

      
      Function to return trained parameters from the Lasso model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns:

                theta_0     :   None
                    None, to match output of MAVE-NN get_params()
                theta_lc    :   numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc  :   None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple


      ..
          !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)


.. py:class:: SurrogateRidgeCV(input_shape, num_tasks, alphas=[0.001, 0.01, 0.1, 1], cv=5, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)


   Bases: :py:obj:`SurrogateBase`

   
   Module for linear surrogate model (no GE or noise models) using sklearn RidgeCV.
   For more information, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

   :param alphas: Array of alpha values to try. Each alpha is a regularization strength and must be a positive float.
                  Regularization improves the conditioning of the problem and reduces the variance of the estimates.
                  Larger values specify stronger regularization. Alpha corresponds to 1 / (2C) in other linear models
                  such as LogisticRegression or LinearSVC. If using Leave-One-Out cross-validation, alphas must be positive.
   :type alphas: array-like of shape (n_alphas,), default=[0.001, 0.01, 0.1, 1]
   :param cv:
              Determines the cross-validation splitting strategy. Possible inputs for cv are:
                  - None, to use the efficient Leave-One-Out cross-validation
                  - integer, to specify the number of folds
                  - CV splitter
                  - An iterable yielding (train, test) splits as arrays of indices.
   :type cv: int, cross-validation generator or an iterable, default=5

   :rtype: sklearn.Model
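
   As a rough illustration of how ``alphas`` and an integer ``cv`` interact, the
   sketch below selects a regularization strength by K-fold cross-validation using
   a closed-form ridge solution in NumPy. It is not the scikit-learn or SQUID
   implementation; the helper names ``ridge_fit`` and ``select_alpha``, the
   synthetic data, and the omission of an intercept are all assumptions made for
   the example.

   .. code-block:: python

      import numpy as np

      def ridge_fit(X, y, alpha):
          """Closed-form ridge solution without an intercept (illustrative only)."""
          n_features = X.shape[1]
          return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

      def select_alpha(X, y, alphas, cv=5, rnd_seed=0):
          """Return the alpha with the lowest mean validation error over cv folds."""
          rng = np.random.default_rng(rnd_seed)
          folds = np.array_split(rng.permutation(len(y)), cv)
          mean_errors = []
          for alpha in alphas:
              fold_errors = []
              for k in range(cv):
                  valid = folds[k]
                  train = np.concatenate([folds[j] for j in range(cv) if j != k])
                  w = ridge_fit(X[train], y[train], alpha)
                  fold_errors.append(np.mean((X[valid] @ w - y[valid]) ** 2))
              mean_errors.append(float(np.mean(fold_errors)))
          return alphas[int(np.argmin(mean_errors))], mean_errors

      # Synthetic regression problem (hypothetical data, not SQUID output).
      rng = np.random.default_rng(1)
      X = rng.normal(size=(200, 8))
      y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

      alphas = [0.001, 0.01, 0.1, 1]
      best_alpha, mean_errors = select_alpha(X, y, alphas, cv=5)

   Each candidate alpha is scored on held-out folds only, so the chosen value
   reflects generalization error rather than training fit.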


   ..
       !! processed by numpydoc !!
   .. py:method:: dataframe(x, y, alphabet, gpu)


   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)


   .. py:method:: get_params(gauge=None, save_dir=None)

      
      Function to return trained parameters from the RidgeCV model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns:

                theta_0     :   None
                    None, to match output of MAVE-NN get_params()
                theta_lc    :   numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc  :   None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple


      ..
          !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)


.. py:class:: SurrogateLIME(input_shape, num_tasks, k=20, deduplicate=True, alphabet=['A', 'C', 'G', 'T'], gpu=True)


   Bases: :py:obj:`SurrogateBase`

   
   Module for linear surrogate model (no GE or noise models) using LIME.
   For more information, see https://arxiv.org/pdf/1602.04938.pdf and
   https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html

   :param k: The desired number of nonzero weights
   :type k: int

   :rtype: sklearn.Model


   ..
       !! processed by numpydoc !!
   .. py:method:: dataframe(x, y, alphabet, gpu)


   .. py:method:: train(x, y, learning_rate=None, epochs=None, batch_size=None, early_stopping=None, patience=None, restore_best_weights=None, rnd_seed=None, save_dir=None, verbose=1)


   .. py:method:: get_params(gauge=None, save_dir=None)

      
      Function to return trained parameters from the LIME model.

      :param gauge: None, to match output of MAVE-NN get_params()
      :type gauge: None
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns:

                theta_0     :   None
                    None, to match output of MAVE-NN get_params()
                theta_lc    :   numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc  :   None
                    None, to match output of MAVE-NN get_params()
      :rtype: tuple


      ..
          !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)


.. py:class:: SurrogateMAVENN(input_shape, num_tasks, gpmap='additive', regression_type='GE', linearity='nonlinear', noise='SkewedT', noise_order=2, reg_strength=0.1, hidden_nodes=50, alphabet=['A', 'C', 'G', 'T'], deduplicate=True, gpu=True)


   Bases: :py:obj:`SurrogateBase`

   
   Module for MAVE-NN surrogate models (optional GE and noise models).

   :param gpmap:
                 Define MAVE-NN surrogate model used to interpret deep learning model.
                     'additive'  :   Assume that each position contributes independently to the latent phenotype.
                     'pairwise'  :   Assume that every pair of positions contributes to the latent phenotype.
   :type gpmap: string {'additive' or 'pairwise'}
   :param regression_type:
                           Type of regression used for measurement process.
                               'MPA'   :   measurement process agnostic (categorical y-values).
                               'GE'    :   global epistasis (continuous y-values).
   :type regression_type: string
   :param linearity:
                     Define use of additional nonlinearity for fitting data.
                         'nonlinear' :   Additionally fit data using GE nonlinear function.
                         'linear'    :   Do not apply GE nonlinearity for fitting data.
   :type linearity: string
   :param noise: Noise model to use when defining a GE model (no effect on MPA models).
                 See https://mavenn.readthedocs.io/en/latest/math.html for more info.
                     'Gaussian'  :   Gaussian-based noise model.
                     'Cauchy'    :   Cauchy-based noise model.
                     'SkewedT'   :   SkewedT-based noise model.
   :type noise: string
   :param noise_order: In the GE context, the order of the polynomial(s) used to define noise model parameters.
                       In the linear context, the order is zero by default.
   :type noise_order: int
   :param reg_strength: L2 regularization strength for G-P map parameters.
   :type reg_strength: float
   :param hidden_nodes: Number of hidden nodes (i.e. sigmoidal contributions) to use when defining the nonlinearity component of a GE model.
                        Has no effect on MPA models.
   :type hidden_nodes: int
   :param alphabet: The alphabet used to determine the C characters in the logo such that
                    each entry is a string; e.g., ['A','C','G','T'] for DNA.
   :type alphabet: list
   :param deduplicate: Remove duplicate sequence-function pairs in dataset (default: True).
   :type deduplicate: boolean
   :param gpu: Enable GPUs (default: True).
   :type gpu: boolean

   :rtype: keras.Model
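
   To make the role of ``hidden_nodes`` concrete, the sketch below evaluates a
   GE-style nonlinearity as a constant plus a sum of sigmoidal (tanh) terms, one
   per hidden node, applied to a scalar latent phenotype ``phi``. This follows the
   general form described in the MAVE-NN documentation as best understood here;
   the function name ``ge_nonlinearity`` and the parameter names ``a``, ``b``,
   ``c``, ``d`` are assumptions for illustration, and the random values stand in
   for trained weights.

   .. code-block:: python

      import numpy as np

      def ge_nonlinearity(phi, a, b, c, d):
          """Evaluate a + sum_k b[k] * tanh(c[k] * phi + d[k]) for each phi."""
          phi = np.asarray(phi, dtype=float)[..., None]  # broadcast over nodes
          return a + np.sum(b * np.tanh(c * phi + d), axis=-1)

      hidden_nodes = 50  # one tanh term per hidden node
      rng = np.random.default_rng(0)
      b, c, d = (rng.normal(size=hidden_nodes) for _ in range(3))

      phi = np.linspace(-3, 3, 7)  # hypothetical latent phenotype values
      yhat = ge_nonlinearity(phi, a=0.0, b=b, c=c, d=d)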


   ..
       !! processed by numpydoc !!
   .. py:method:: dataframe(x, y, alphabet, gpu)


   .. py:method:: train(x, y, learning_rate=0.0005, epochs=500, batch_size=100, early_stopping=True, patience=25, restore_best_weights=True, save_dir=None, verbose=1)


   .. py:method:: get_info(save_dir=None, verbose=1)

      
      Function to return estimated variational information from MAVE-NN model.

      :param save_dir: Directory for saving figures to file.
      :type save_dir: str
      :param verbose: print info
      :type verbose: bool

      :returns: **I_pred** -- MAVE-NN estimated variational information (I_pred), in bits.
      :rtype: float


      ..
          !! processed by numpydoc !!

   .. py:method:: get_params(gauge='empirical', save_dir=None)

      
      Function to return trained parameters from MAVE-NN model.

      :param gauge: Gauge mode used to fix model parameters.
                    See https://mavenn.readthedocs.io/en/latest/math.html for more info.
                    'uniform'   :   hierarchical gauge using a uniform sequence distribution over
                                    the characters at each position observed in the training set
                                    (unobserved characters are assigned probability 0).
                    'empirical' :   uses an empirical distribution computed from the training data.
                    'consensus' :   wild-type gauge using the training data consensus sequence.
      :type gauge: str
      :param save_dir: Directory for saving figures to file.
      :type save_dir: str

      :returns:

                theta_0     :   float
                    Constant term in trained parameters.
                theta_lc    :   numpy.ndarray
                    Additive terms in trained parameters (shape : (L,C)).
                theta_lclc  :   numpy.ndarray
                    Pairwise terms in trained parameters (shape : (L,C,L,C)), if gpmap is 'pairwise'.
      :rtype: tuple
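
      The effect of fixing a gauge can be sketched for an additive model, where a
      prediction is ``theta_0`` plus one ``theta_lc`` entry per position. Shifting
      each row of ``theta_lc`` by a constant and adding the shifts to ``theta_0``
      leaves every prediction unchanged. The example below is a simplified
      illustration, not MAVE-NN code: the helper names are hypothetical, the
      zero-mean variant assumes all characters are observed at every position, and
      ``ref_idx`` stands in for a consensus sequence.

      .. code-block:: python

         import numpy as np

         def predict(theta_0, theta_lc, x_idx):
             """Additive model: theta_0 plus one theta_lc entry per position."""
             return theta_0 + theta_lc[np.arange(len(x_idx)), x_idx].sum()

         def zero_mean_gauge(theta_0, theta_lc):
             """Make each position's parameters average to zero over characters."""
             shift = theta_lc.mean(axis=1, keepdims=True)
             return theta_0 + shift.sum(), theta_lc - shift

         def reference_gauge(theta_0, theta_lc, ref_idx):
             """Make the parameters of a reference sequence exactly zero."""
             shift = theta_lc[np.arange(len(ref_idx)), ref_idx][:, None]
             return theta_0 + shift.sum(), theta_lc - shift

         rng = np.random.default_rng(0)
         theta_0, theta_lc = 0.5, rng.normal(size=(5, 4))  # L=5, C=4
         x_idx = np.array([0, 3, 1, 2, 2])    # a sequence as character indices
         ref_idx = np.array([1, 1, 0, 3, 2])  # hypothetical reference sequence

         u0, u_lc = zero_mean_gauge(theta_0, theta_lc)
         w0, w_lc = reference_gauge(theta_0, theta_lc, ref_idx)

      Both transformed parameter sets give the same prediction for ``x_idx`` as the
      original ones; only the way the value is divided between the constant and
      additive terms differs.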


      ..
          !! processed by numpydoc !!

   .. py:method:: get_logo(full_length=None, view_window=None)

      
      Function to place trained additive parameters into surrounding
      nonmutated sequence (zeros).

      :param full_length: Full length of sequence.
      :type full_length: int
      :param view_window: Index of start and stop position along sequence to probe;
                          i.e., [start, stop], where start < stop and both entries
                          satisfy 0 <= int <= L.
      :type view_window: [int, int]

      :returns: **additive_logo** -- Additive logo parameters (shape : ('full_length',C)).
      :rtype: numpy.ndarray
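
      The padding step described above can be sketched in a few lines of NumPy.
      This is an illustration under stated assumptions rather than the method's
      actual body: the helper name ``place_in_full_length`` is hypothetical, and
      the window is assumed to be exactly as long as the trained parameter matrix.

      .. code-block:: python

         import numpy as np

         def place_in_full_length(theta_lc, full_length, view_window):
             """Embed an (L, C) matrix into zeros of shape (full_length, C)."""
             start, stop = view_window
             L, C = theta_lc.shape
             if stop - start != L:
                 raise ValueError("view_window length must equal theta_lc.shape[0]")
             additive_logo = np.zeros((full_length, C))
             additive_logo[start:stop, :] = theta_lc
             return additive_logo

         theta_lc = np.arange(12, dtype=float).reshape(3, 4)  # toy L=3, C=4
         additive_logo = place_in_full_length(theta_lc, full_length=10, view_window=[4, 7])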


      ..
          !! processed by numpydoc !!


.. py:function:: data_splits(N, test_split, valid_split, rnd_seed=None)

   
   Function to randomly assign sequence indices to the training, validation, and test sets.

   :param N: Number of data points.
   :type N: int
   :param test_split: Fraction of the data to place in the test set.
   :type test_split: float (between 0 and 1)
   :param valid_split: Fraction of the data to place in the validation set.
   :type valid_split: float (between 0 and 1)
   :param rnd_seed: Random number seed.
   :type rnd_seed: int

   :returns: * *train_index* -- array of indices to be included in training set
             * *valid_index* -- array of indices to be included in validation set
             * *test_index* -- array of indices to be included in test set
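
   A minimal sketch of this kind of split is shown below. It is not the
   library's implementation: the function name ``split_indices`` is hypothetical,
   and the rounding of set sizes and the order in which the sets are cut from
   the shuffled indices are assumptions.

   .. code-block:: python

      import numpy as np

      def split_indices(N, test_split, valid_split, rnd_seed=None):
          """Shuffle range(N) and cut it into train, validation, and test parts."""
          rng = np.random.default_rng(rnd_seed)
          shuffled = rng.permutation(N)
          num_test = int(N * test_split)
          num_valid = int(N * valid_split)
          test_index = shuffled[:num_test]
          valid_index = shuffled[num_test:num_test + num_valid]
          train_index = shuffled[num_test + num_valid:]
          return train_index, valid_index, test_index

      train_index, valid_index, test_index = split_indices(1000, 0.1, 0.1, rnd_seed=0)

   Because the three index arrays are slices of one permutation, they are
   disjoint and together cover every data point exactly once.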


   ..
       !! processed by numpydoc !!