:py:mod:`squid.mutagenizer`
===========================

.. py:module:: squid.mutagenizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   squid.mutagenizer.BaseMutagenesis
   squid.mutagenizer.RandomMutagenesis
   squid.mutagenizer.CombinatorialMutagenesis
   squid.mutagenizer.TwoHotMutagenesis


Functions
~~~~~~~~~

.. autoapisummary::

   squid.mutagenizer.apply_mut_by_seq_index
   squid.mutagenizer.twohot2seq
   squid.mutagenizer.seq2twohot
   squid.mutagenizer.get_alternative_bases


Attributes
~~~~~~~~~~

.. autoapisummary::

   squid.mutagenizer.L


.. py:class:: BaseMutagenesis

   Base class for in silico MAVE data generation for a given sequence.

   .. !! processed by numpydoc !!

   .. py:method:: __call__(x, num_sim)
      :abstractmethod:

      Return an in silico MAVE based on mutagenesis of 'x'.

      :param x: One-hot sequence (shape: (L, A)).
      :type x: torch.Tensor
      :param num_sim: Number of sequences to mutagenize.
      :type num_sim: int

      :returns: Batch of one-hot sequences with random augmentation applied.
      :rtype: torch.Tensor

      .. !! processed by numpydoc !!


.. py:class:: RandomMutagenesis(mut_rate, uniform=False, seed=None)

   Bases: :py:obj:`BaseMutagenesis`

   Module for performing random mutagenesis.

   :param mut_rate: Mutation rate for random mutagenesis.
   :type mut_rate: float
   :param uniform: Distribution of the number of mutations per sequence:
                   uniform if True, Poisson if False. (defaults to False)
   :type uniform: bool, optional
   :param seed: Random seed for reproducibility. If None, results will not be
                reproducible. (defaults to None)
   :type seed: int, optional

   :returns: Batch of one-hot sequences with random mutagenesis applied.
   :rtype: numpy.ndarray

   .. !! processed by numpydoc !!

   .. py:method:: __call__(x, num_sim)

      Return an in silico MAVE based on mutagenesis of 'x'.

      :param x: One-hot sequence (shape: (L, A)).
      :type x: torch.Tensor
      :param num_sim: Number of sequences to mutagenize.
      :type num_sim: int

      :returns: Batch of one-hot sequences with random augmentation applied.
      :rtype: torch.Tensor

      .. !! processed by numpydoc !!


.. py:class:: CombinatorialMutagenesis(max_order=-1, mut_window=None, batch_size=256, seed=None)

   Module for performing combinatorial mutagenesis.

   :param max_order: Maximum order of mutations to generate. If -1, generates
                     all possible combinations. If 1, generates only single
                     mutations (all SNVs). If 2, generates single and double
                     mutations, and so on. Must be less than or equal to the
                     sequence length L, or -1 for all combinations.
                     (defaults to -1)
   :type max_order: int, optional
   :param mut_window: Start and stop indices along the sequence to probe for
                      mutations. If provided, mutations are generated only
                      within this window (inclusive on both ends). For
                      example, mut_window=[4,6] generates mutations at
                      positions 4, 5, and 6. (defaults to None, meaning the
                      entire sequence is considered)
   :type mut_window: [int, int], optional
   :param batch_size: Batch size for one-hot encoding conversion. If None,
                      converts all sequences at once. For large sequences, a
                      batch size can help manage memory usage.
                      (defaults to 256)
   :type batch_size: int, optional
   :param seed: Random seed for reproducibility. If None, results will not be
                reproducible. (defaults to None)
   :type seed: int, optional

   :returns: Batch of one-hot sequences with combinatorial mutagenesis applied.

             - For max_order=-1: the number of sequences is A^L.
             - For max_order=k: the number of sequences is
               1 + sum(n_choose_r * (A-1)^r) for r in 1..k.

             where:

             - L is the sequence length
             - A is the alphabet size
             - n_choose_r is the binomial coefficient (L choose r)
             - the leading 1 accounts for the reference sequence
   :rtype: numpy.ndarray

   .. rubric:: Examples

   For L=4, A=4:

   - max_order=1: 1 + C(4,1)*(3^1) = 1 + 12 = 13 sequences
   - max_order=2: 1 + C(4,1)*(3^1) + C(4,2)*(3^2) = 1 + 12 + 54 = 67 sequences

   :raises ValueError: If max_order is greater than the sequence length L or less than -1.

   .. !! processed by numpydoc !!

   .. py:method:: __call__(x, num_sim)


.. py:class:: TwoHotMutagenesis(mut_rate, uniform=False, seed=None)

   Bases: :py:obj:`BaseMutagenesis`

   Module to perform random mutagenesis using two-hot encoding.
   Each nucleotide at a given position is one-hot encoded, and the unphased
   diploid sequence is represented as the sum of the two one-hot encoded
   nucleotides at each position. The sequence "AYCR", for example, would be
   encoded as: [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]].

   :param mut_rate: Mutation rate for random mutagenesis.
   :type mut_rate: float
   :param uniform: Distribution of the number of mutations per sequence:
                   uniform if True, Poisson if False. (defaults to False)
   :type uniform: bool, optional
   :param seed: Random seed for reproducibility. If None, results will not be
                reproducible. (defaults to None)
   :type seed: int, optional

   :returns: Batch of two-hot sequences with random mutagenesis applied, with
             alphabet: {A, C, G, T, R (A/G), Y (C/T), S (C/G), W (A/T),
             K (G/T), M (A/C)}, such that heterozygous positions are
             represented using the IUPAC ambiguity codes.
   :rtype: numpy.ndarray

   .. !! processed by numpydoc !!

   .. py:method:: __call__(x, num_sim)

      Return an in silico MAVE based on mutagenesis of 'x'.

      :param x: One-hot sequence (shape: (L, A)).
      :type x: torch.Tensor
      :param num_sim: Number of sequences to mutagenize.
      :type num_sim: int

      :returns: Batch of one-hot sequences with random augmentation applied.
      :rtype: torch.Tensor

      .. !! processed by numpydoc !!


.. py:function:: apply_mut_by_seq_index(x_index, shape, num_muts)

   Function to perform random mutagenesis.

   :param x_index: Indices of wildtype sequence.
   :type x_index: np.ndarray
   :param shape: Shape of MAVE array; i.e., (num_sim, L, A).
   :type shape: list
   :param num_muts: Number of mutations per sequence.
   :type num_muts: int

   :returns: Batch of one-hot sequences with random mutagenesis applied.
   :rtype: torch.Tensor

   .. !! processed by numpydoc !!


.. py:function:: twohot2seq(one_hot)

   Function to convert two-hot encoding to a DNA sequence.

   :param one_hot: Input two-hot encoding of sequence (shape: (L, C)).
   :type one_hot: numpy.ndarray

   :returns: **seq** -- Sequence with length L.
   :rtype: str

   .. !! processed by numpydoc !!


.. py:function:: seq2twohot(seq)

   Function to convert heterozygous DNA sequence to two-hot encoding.

   :param seq: Input sequence with length L.
   :type seq: str

   :returns: **one_hot** -- Two-hot encoding of the input sequence (shape: (L, C)).
   :rtype: numpy.ndarray

   .. !! processed by numpydoc !!


.. py:function:: get_alternative_bases(ref_base, A)

   Get all possible alternative bases for a given reference base.

   .. !! processed by numpydoc !!


.. py:data:: L
   :value: 10
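
The sequence counts quoted for ``CombinatorialMutagenesis`` can be checked with a few lines of standard-library Python. This is a minimal sketch of the counting formula only; ``num_combinatorial_sequences`` is a hypothetical helper written for this illustration and is not part of ``squid.mutagenizer``.

.. code-block:: python

   from math import comb


   def num_combinatorial_sequences(L, A, max_order=-1):
       """Count the sequences for length L, alphabet size A (hypothetical helper)."""
       if max_order == -1:
           return A ** L  # every possible sequence of length L
       if max_order < -1 or max_order > L:
           raise ValueError("max_order must be -1 or lie between 0 and L")
       # r = 0 contributes the reference sequence; each order r chooses r
       # positions and one of the (A - 1) alternative characters at each.
       return sum(comb(L, r) * (A - 1) ** r for r in range(max_order + 1))


   print(num_combinatorial_sequences(4, 4, max_order=1))   # 13
   print(num_combinatorial_sequences(4, 4, max_order=2))   # 67
   print(num_combinatorial_sequences(4, 4, max_order=-1))  # 256

Setting ``max_order`` equal to L gives the same count as ``max_order=-1``, because summing over every order from 0 to L is the binomial expansion of (1 + (A - 1))^L.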
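
The two-hot scheme described for ``TwoHotMutagenesis`` can be illustrated independently of the library. The sketch below assumes the column order A, C, G, T and uses a hypothetical function, ``two_hot_sketch``; it is not the implementation of ``seq2twohot``, only a reconstruction of the encoding rule from the "AYCR" example.

.. code-block:: python

   import numpy as np

   BASES = "ACGT"  # assumed column order of the encoding

   # Each symbol maps to the two alleles it stands for (IUPAC ambiguity codes).
   IUPAC_ALLELES = {
       "A": "AA", "C": "CC", "G": "GG", "T": "TT",
       "R": "AG", "Y": "CT", "S": "CG",
       "W": "AT", "K": "GT", "M": "AC",
   }


   def two_hot_sketch(seq):
       """Return an (L, 4) integer array summing the one-hot rows of both alleles."""
       encoded = np.zeros((len(seq), len(BASES)), dtype=int)
       for pos, code in enumerate(seq.upper()):
           for allele in IUPAC_ALLELES[code]:
               encoded[pos, BASES.index(allele)] += 1
       return encoded


   print(two_hot_sketch("AYCR").tolist())
   # [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]]

Every row sums to 2: homozygous positions carry a single 2, and heterozygous positions carry two 1s.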
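
To make the roles of ``mut_rate`` and ``uniform`` concrete, here is a rough NumPy sketch of random mutagenesis on a one-hot array. It rests on several assumptions that the reference above does not confirm: that the expected number of mutations is ``mut_rate * L``, that ``uniform=True`` gives every sequence the same number of mutations, and that ``uniform=False`` draws that number from a Poisson distribution. ``random_mutagenesis_sketch`` is a hypothetical name and does not reproduce the internals of ``RandomMutagenesis``.

.. code-block:: python

   import numpy as np


   def random_mutagenesis_sketch(x, num_sim, mut_rate, uniform=False, seed=None):
       """Illustrative only: mutate a one-hot (L, A) array into (num_sim, L, A)."""
       rng = np.random.default_rng(seed)
       L, A = x.shape
       wt_index = x.argmax(axis=1)      # wild-type character at each position
       avg_num_mut = mut_rate * L       # assumed meaning of mut_rate
       if uniform:
           # Assumption: the same number of mutations in every sequence.
           num_muts = np.full(num_sim, int(round(avg_num_mut)))
       else:
           # Assumption: Poisson-distributed number of mutations per sequence.
           num_muts = rng.poisson(avg_num_mut, size=num_sim)
       num_muts = np.minimum(num_muts, L)  # at most L positions can change

       batch = np.tile(wt_index, (num_sim, 1))
       for i, n in enumerate(num_muts):
           positions = rng.choice(L, size=n, replace=False)
           # Shifting by 1..A-1 modulo A guarantees a different character.
           shifts = rng.integers(1, A, size=n)
           batch[i, positions] = (batch[i, positions] + shifts) % A
       return np.eye(A, dtype=x.dtype)[batch]  # one-hot, shape (num_sim, L, A)


   x = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3, 0, 1]]  # toy one-hot sequence, L=10, A=4
   mave = random_mutagenesis_sketch(x, num_sim=5, mut_rate=0.2, uniform=True, seed=0)
   print(mave.shape)  # (5, 10, 4)

Working on integer indices and converting back to one-hot at the end keeps each position valid by construction, since a row can never end up with zero or two active characters.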