squid.mutagenizer
Module Contents
Classes
BaseMutagenesis | Base class for in silico MAVE data generation for a given sequence.
RandomMutagenesis | Module for performing random mutagenesis.
CombinatorialMutagenesis | Module for performing combinatorial mutagenesis.
TwoHotMutagenesis | Module to perform random mutagenesis using two-hot encoding.
Functions
apply_mut_by_seq_index | Function to perform random mutagenesis.
twohot2seq | Function to convert two-hot encoding to a DNA sequence.
seq2twohot | Function to convert heterozygous DNA sequence to two-hot encoding.
Get all possible alternative bases for a given reference base.
Attributes
- class squid.mutagenizer.BaseMutagenesis[source]
Base class for in silico MAVE data generation for a given sequence.
- abstract __call__(x, num_sim)[source]
Return an in silico MAVE based on mutagenesis of ‘x’.
- Parameters:
x (torch.Tensor) – one-hot sequence (shape: (L, A)).
num_sim (int) – Number of sequences to mutagenize.
- Returns:
Batch of one-hot sequences with random augmentation applied.
- Return type:
torch.Tensor
- class squid.mutagenizer.RandomMutagenesis(mut_rate, uniform=False, seed=None)[source]
Bases: BaseMutagenesis
Module for performing random mutagenesis.
- Parameters:
- Returns:
Batch of one-hot sequences with random mutagenesis applied.
- Return type:
numpy.ndarray
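As a rough illustration of what random mutagenesis of a one-hot sequence involves, the sketch below mutates each position independently with probability mut_rate. This is an assumption made for the example, not the library's implementation: the function name random_mutagenesis_sketch is hypothetical, and the uniform flag in the signature suggests squid may draw the number of mutations per sequence in other ways.

```python
import numpy as np

def random_mutagenesis_sketch(x, num_sim, mut_rate, seed=None):
    """Return a (num_sim, L, A) one-hot array of randomly mutated copies of x.

    Hypothetical helper, not part of squid. Assumes each position mutates
    independently with probability mut_rate.
    """
    rng = np.random.default_rng(seed)
    L, A = x.shape
    ref_idx = x.argmax(axis=1)                     # (L,) index of the reference base
    mutate = rng.random((num_sim, L)) < mut_rate   # which positions to mutate
    # Shifting by 1..A-1 (mod A) guarantees a mutated position changes base.
    shift = rng.integers(1, A, size=(num_sim, L))
    new_idx = np.where(mutate, (ref_idx[None, :] + shift) % A, ref_idx[None, :])
    return np.eye(A, dtype=x.dtype)[new_idx]       # back to one-hot

# Example: 100 variants of a 6-nt one-hot sequence at a 10% mutation rate.
x = np.eye(4)[[0, 1, 2, 3, 0, 1]]
x_mut = random_mutagenesis_sketch(x, num_sim=100, mut_rate=0.1, seed=0)
```

Fixing the seed makes the batch reproducible, mirroring the role of the seed parameter in the class signature.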
- class squid.mutagenizer.CombinatorialMutagenesis(max_order=-1, mut_window=None, batch_size=256, seed=None)[source]
Module for performing combinatorial mutagenesis.
- Parameters:
max_order (int, optional) – Maximum order of mutations to generate. If -1, generates all possible combinations. If 1, generates only single mutations (all SNVs). If 2, generates single and double mutations, etc. Must be less than or equal to sequence length L, or -1 for all combinations. (defaults to -1)
mut_window ([int, int], optional) – Index of start and stop position along sequence to probe for mutations. If provided, only generates mutations within this window (inclusive on both ends). For example, mut_window=[4,6] will generate mutations at positions 4, 5, and 6. (defaults to None, which means the entire sequence is considered)
batch_size (int, optional) – Batch size for one-hot encoding conversion. If None, converts all at once. For large sequences, using a batch size can help manage memory usage. (defaults to 256)
seed (int, optional) – Random seed for reproducibility. If None, results will not be reproducible. (defaults to None)
- Returns:
Batch of one-hot sequences with combinatorial mutagenesis applied. The number of sequences depends on max_order:
- For max_order=-1: number of sequences is A^L
- For max_order=k: number of sequences is 1 + sum(n_choose_r * (A-1)^r) for r in 1..k
where:
- L is sequence length
- A is alphabet size
- n_choose_r is the binomial coefficient (L choose r)
- The leading 1 accounts for the reference sequence
- Return type:
numpy.ndarray
Examples
For L=4, A=4:
- max_order=1: 1 + C(4,1)*(3^1) = 1 + 12 = 13 sequences
- max_order=2: 1 + C(4,1)*(3^1) + C(4,2)*(3^2) = 1 + 12 + 54 = 67 sequences
- Raises:
ValueError – If max_order is greater than sequence length L or less than -1
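The counting formula can be checked by brute force. The sketch below enumerates every variant with at most max_order substitutions on a plain string and compares the total with the closed-form count. The helpers expected_count and enumerate_mutants are hypothetical names written for this illustration. They work on strings rather than one-hot arrays and do not reflect how squid generates or batches the sequences.

```python
from itertools import combinations, product
from math import comb

def expected_count(L, A, max_order):
    """Closed-form number of sequences, including the reference."""
    if max_order == -1:
        return A ** L
    return 1 + sum(comb(L, r) * (A - 1) ** r for r in range(1, max_order + 1))

def enumerate_mutants(seq, alphabet="ACGT", max_order=1):
    """Yield the reference, then every variant with 1..max_order substitutions."""
    L = len(seq)
    if max_order < -1 or max_order > L:
        raise ValueError("max_order must be -1 or between 0 and len(seq)")
    if max_order == -1:
        max_order = L          # all combinations: every position may change
    yield seq
    for r in range(1, max_order + 1):
        for positions in combinations(range(L), r):
            # Alternative bases (everything except the reference) per position.
            alts = [[b for b in alphabet if b != seq[p]] for p in positions]
            for choice in product(*alts):
                mutant = list(seq)
                for p, b in zip(positions, choice):
                    mutant[p] = b
                yield "".join(mutant)

n_single = len(list(enumerate_mutants("ACGT", max_order=1)))   # 13
n_double = len(list(enumerate_mutants("ACGT", max_order=2)))   # 67
```

Setting max_order to L reproduces the A^L total, because summing C(L, r) * (A-1)^r over r = 0..L is the binomial expansion of (1 + (A-1))^L.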
- class squid.mutagenizer.TwoHotMutagenesis(mut_rate, uniform=False, seed=None)[source]
Bases: BaseMutagenesis
Module to perform random mutagenesis using two-hot encoding. That is, encode each individual nucleotide at a given position using a one-hot encoding scheme, then represent the unphased diploid sequence as the sum of the two one-hot encoded nucleotides at each position. The sequence "AYCR", for example, would be encoded as: [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]].
- Parameters:
- Returns:
Batch of one-hot sequences with random mutagenesis applied, with alphabet: {A, C, G, T, R (A/G), Y (C/T), S (C/G), W (A/T), K (G/T), M (A/C)}, such that heterozygous positions are represented using the IUPAC ambiguity codes.
- Return type:
numpy.ndarray
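The two-hot scheme described above can be sketched in a few lines of NumPy. The helper encode_two_hot and the IUPAC lookup table are written for this illustration and are not squid's own code. The sketch assumes the column order A, C, G, T, which is what the "AYCR" example implies.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order, consistent with the "AYCR" example

# Each code maps to the two alleles it represents (homozygous codes repeat).
IUPAC = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def encode_two_hot(seq):
    """Sum the one-hot vectors of both alleles at each position (shape (L, 4))."""
    out = np.zeros((len(seq), len(ALPHABET)), dtype=int)
    for i, code in enumerate(seq):
        for base in IUPAC[code]:
            out[i, ALPHABET.index(base)] += 1
    return out

encoded = encode_two_hot("AYCR")
# rows: [2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]
```

Every row sums to 2: a homozygous position puts both counts in one column, and a heterozygous position splits them across two.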
- squid.mutagenizer.apply_mut_by_seq_index(x_index, shape, num_muts)[source]
Function to perform random mutagenesis.
- squid.mutagenizer.twohot2seq(one_hot)[source]
Function to convert two-hot encoding to a DNA sequence.
- Parameters:
one_hot (numpy.ndarray) – Input two-hot encoding of sequence (shape : (L,C))
- Returns:
seq – Decoded DNA sequence with length L.
- Return type:
string
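A minimal sketch of the reverse mapping, from a two-hot array back to a string with IUPAC ambiguity codes, is given below. The helper decode_two_hot and its lookup table are hypothetical and assume the column order A, C, G, T. The library's own twohot2seq may handle edge cases differently.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order

# Allele pairs written in ALPHABET order, mapped to their IUPAC codes.
PAIR_TO_CODE = {
    "AA": "A", "CC": "C", "GG": "G", "TT": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
}

def decode_two_hot(two_hot):
    """Convert an (L, 4) two-hot array into a string of IUPAC codes."""
    seq = []
    for row in np.asarray(two_hot):
        # Expand the counts into an ordered allele pair, e.g. [1, 0, 1, 0] -> "AG".
        pair = "".join(base * int(n) for base, n in zip(ALPHABET, row))
        seq.append(PAIR_TO_CODE[pair])
    return "".join(seq)

decoded = decode_two_hot([[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]])
# decoded == "AYCR"
```

A row that does not sum to 2 has no entry in the table, so this sketch raises a KeyError on malformed input rather than guessing a base.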
- squid.mutagenizer.seq2twohot(seq)[source]
Function to convert heterozygous DNA sequence to two-hot encoding.
- Parameters:
seq (string) – Input sequence with length L.
- Returns:
one_hot – Two-hot encoding of the input sequence (shape : (L,C))
- Return type:
numpy.ndarray