squid.mutagenizer

Module Contents

Classes

BaseMutagenesis

Base class for in silico MAVE data generation for a given sequence.

RandomMutagenesis

Module for performing random mutagenesis.

CombinatorialMutagenesis

Module for performing combinatorial mutagenesis.

TwoHotMutagenesis

Module to perform random mutagenesis using two-hot encoding.

Functions

apply_mut_by_seq_index(x_index, shape, num_muts)

Function to perform random mutagenesis.

twohot2seq(one_hot)

Function to convert two-hot encoding to a DNA sequence.

seq2twohot(seq)

Function to convert heterozygous DNA sequence to two-hot encoding.

get_alternative_bases(ref_base, A)

Get all possible alternative bases for a given reference base.

Attributes

L

class squid.mutagenizer.BaseMutagenesis[source]

Base class for in silico MAVE data generation for a given sequence.

abstract __call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – one-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of one-hot sequences with random augmentation applied.

Return type:

torch.Tensor

class squid.mutagenizer.RandomMutagenesis(mut_rate, uniform=False, seed=None)[source]

Bases: BaseMutagenesis

Module for performing random mutagenesis.

Parameters:
  • mut_rate (float) – Mutation rate for random mutagenesis. Required; the constructor signature defines no default.

  • uniform (bool, optional) – Sets how the number of mutations per sequence is chosen: uniform (True) or Poisson (False). (defaults to False)

  • seed (int, optional) – Random seed for reproducibility. If None, results will not be reproducible. (defaults to None)

Returns:

Batch of one-hot sequences with random mutagenesis applied.

Return type:

numpy.ndarray

__call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – one-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of one-hot sequences with random augmentation applied.

Return type:

torch.Tensor
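The sketch below illustrates one plausible reading of the uniform and Poisson options. It is a self-contained NumPy illustration, not squid's implementation: the helper name sketch_random_mutagenesis, the use of ceil(mut_rate * L) as the expected number of mutations, and the clipping of Poisson draws to [0, L] are all assumptions made for this example.

```python
import numpy as np


def sketch_random_mutagenesis(x, num_sim, mut_rate=0.1, uniform=False, seed=None):
    """Hypothetical helper: mutate a one-hot sequence x of shape (L, A)."""
    rng = np.random.default_rng(seed)
    L, A = x.shape
    # Assumption: the expected number of mutations is ceil(mut_rate * L).
    avg_num_mut = int(np.ceil(mut_rate * L))
    if uniform:
        # Every simulated sequence gets the same number of mutations.
        num_muts = np.full(num_sim, avg_num_mut)
    else:
        # The number of mutations is Poisson-distributed, clipped to [0, L].
        num_muts = np.clip(rng.poisson(avg_num_mut, size=num_sim), 0, L)

    wt_index = x.argmax(axis=1)           # wild-type base index per position
    x_mut = np.tile(x, (num_sim, 1, 1))   # shape (num_sim, L, A)
    for i, n in enumerate(num_muts):
        positions = rng.choice(L, size=n, replace=False)
        for pos in positions:
            # Choose a base that differs from the wild type at this position.
            alternatives = [a for a in range(A) if a != wt_index[pos]]
            new_base = rng.choice(alternatives)
            x_mut[i, pos] = 0
            x_mut[i, pos, new_base] = 1
    return x_mut
```

With uniform=True, every row of the result differs from the input at exactly ceil(mut_rate * L) positions; with uniform=False, the count varies from row to row.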

class squid.mutagenizer.CombinatorialMutagenesis(max_order=-1, mut_window=None, batch_size=256, seed=None)[source]

Module for performing combinatorial mutagenesis.

Parameters:
  • max_order (int, optional) – Maximum order of mutations to generate. If -1, generates all possible combinations. If 1, generates only single mutations (all SNVs). If 2, generates single and double mutations, etc. Must be less than or equal to sequence length L, or -1 for all combinations. (defaults to -1)

  • mut_window ([int, int], optional) – Index of start and stop position along sequence to probe for mutations. If provided, only generates mutations within this window (inclusive on both ends). For example, mut_window=[4,6] will generate mutations at positions 4, 5, and 6. (defaults to None, which means the entire sequence is considered)

  • batch_size (int, optional) – Batch size for one-hot encoding conversion. If None, converts all at once. For long sequences, using a batch size can help manage memory usage. (defaults to 256)

  • seed (int, optional) – Random seed for reproducibility. If None, results will not be reproducible. (defaults to None)

Returns:

Batch of one-hot sequences with combinatorial mutagenesis applied.

  • For max_order=-1: number of sequences is A^L.

  • For max_order=k: number of sequences is 1 + sum(n_choose_r * (A-1)^r) for r in 1..k.

where:

  • L is the sequence length

  • A is the alphabet size

  • n_choose_r is the binomial coefficient (L choose r)

  • The leading 1 accounts for the reference sequence

Return type:

numpy.ndarray

Examples

For L=4, A=4:

  • max_order=1: 1 + C(4,1)*(3^1) = 1 + 12 = 13 sequences

  • max_order=2: 1 + C(4,1)*(3^1) + C(4,2)*(3^2) = 1 + 12 + 54 = 67 sequences
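The counts above can be checked with a few lines of standard-library Python. The function name num_combinatorial_mutants is hypothetical and introduced only for this sketch; it evaluates the documented formula and does not call squid.

```python
from math import comb


def num_combinatorial_mutants(L, A, max_order=-1):
    """Hypothetical helper: number of sequences for a given max_order."""
    if max_order < -1 or max_order > L:
        raise ValueError("max_order must be -1 or between 0 and L")
    if max_order == -1:
        # All possible sequences of length L over an alphabet of size A.
        return A ** L
    # Reference sequence plus all mutants of order 1..max_order.
    return 1 + sum(comb(L, r) * (A - 1) ** r for r in range(1, max_order + 1))


print(num_combinatorial_mutants(4, 4, max_order=1))  # 13
print(num_combinatorial_mutants(4, 4, max_order=2))  # 67
```

Setting max_order=L gives the same count as max_order=-1, since summing over every order recovers A^L by the binomial theorem.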

Raises:

ValueError – If max_order is greater than sequence length L or less than -1
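As a rough illustration of how max_order and mut_window interact, the sketch below enumerates mutants with itertools. It is not the library's code: the helper name enumerate_mutants, the integer-index representation of the sequence, and the choice to always include the reference sequence are assumptions made for this example.

```python
from itertools import combinations, product

import numpy as np


def enumerate_mutants(wt_index, A, max_order, mut_window=None):
    """Hypothetical helper: all mutants up to max_order, as index arrays."""
    wt_index = np.asarray(wt_index)
    L = len(wt_index)
    # mut_window is treated as inclusive on both ends, as documented.
    start, stop = (0, L - 1) if mut_window is None else mut_window
    positions = range(start, stop + 1)

    mutants = [wt_index.copy()]  # the reference sequence comes first
    for r in range(1, max_order + 1):
        for sites in combinations(positions, r):
            # For each chosen site, list the A-1 non-reference bases.
            alt_choices = [
                [a for a in range(A) if a != wt_index[s]] for s in sites
            ]
            for alts in product(*alt_choices):
                seq = wt_index.copy()
                seq[list(sites)] = alts
                mutants.append(seq)
    return np.stack(mutants)


wt = [0, 1, 2, 3]                                # e.g. "ACGT" as indices
all_up_to_2 = enumerate_mutants(wt, A=4, max_order=2)
one_hot = np.eye(4)[all_up_to_2]                 # shape (67, 4, 4)
```

Restricting the window shrinks the count accordingly: under these assumptions, mut_window=[1, 2] with max_order=1 yields 1 + 2*3 = 7 sequences.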

__call__(x, num_sim)[source]

class squid.mutagenizer.TwoHotMutagenesis(mut_rate, uniform=False, seed=None)[source]

Bases: BaseMutagenesis

Module to perform random mutagenesis using two-hot encoding. That is, encode each individual nucleotide at a given position using a one-hot encoding scheme, then represent the unphased diploid sequence as the sum of the two one-hot encoded nucleotides at each position. The sequence “AYCR”, for example, would be encoded as: [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]].
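The encoding described above can be reproduced with a short NumPy sketch. The names ALPHABET, IUPAC, and sketch_seq2twohot are hypothetical and exist only in this example; the column order A, C, G, T is inferred from the "AYCR" example and is otherwise an assumption.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order
# Each code maps to the two nucleotides it represents at a diploid position.
IUPAC = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def sketch_seq2twohot(seq):
    """Hypothetical helper: sum the two one-hot nucleotides per position."""
    two_hot = np.zeros((len(seq), len(ALPHABET)), dtype=int)
    for i, code in enumerate(seq):
        for base in IUPAC[code]:
            two_hot[i, ALPHABET.index(base)] += 1
    return two_hot


print(sketch_seq2twohot("AYCR").tolist())
# [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]]
```

Every row sums to 2, with a single 2 at homozygous positions and two 1s at heterozygous positions.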

Parameters:
  • mut_rate (float) – Mutation rate for random mutagenesis.

  • uniform (bool, optional) – uniform (True), Poisson (False); sets the number of mutations per sequence. (defaults to False)

  • seed (int, optional) – Random seed for reproducibility. If None, results will not be reproducible. (defaults to None)

Returns:

Batch of two-hot sequences with random mutagenesis applied, with alphabet: {A, C, G, T, R (A/G), Y (C/T), S (C/G), W (A/T), K (G/T), M (A/C)}, such that heterozygous positions are represented using the IUPAC ambiguity codes.

Return type:

numpy.ndarray

__call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – one-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of one-hot sequences with random augmentation applied.

Return type:

torch.Tensor

squid.mutagenizer.apply_mut_by_seq_index(x_index, shape, num_muts)[source]

Function to perform random mutagenesis.

Parameters:
  • x_index (np.ndarray) – Indices of wildtype sequence.

  • shape (list) – Shape of MAVE array; i.e., (num_sim,L,A).

  • num_muts (int) – Number of mutations per sequence.

Returns:

Batch of one-hot sequences with random mutagenesis applied.

Return type:

torch.Tensor

squid.mutagenizer.twohot2seq(one_hot)[source]

Function to convert two-hot encoding to a DNA sequence.

Parameters:

one_hot (numpy.ndarray) – Input two-hot encoding of the sequence (shape: (L, C)).

Returns:

seq – Decoded DNA sequence of length L, with heterozygous positions written as IUPAC ambiguity codes.

Return type:

string
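Decoding can be sketched as a lookup from each two-hot row to its IUPAC code. This is an illustration only, not the body of twohot2seq: the names ALPHABET, IUPAC, ROW_TO_CODE, and sketch_twohot2seq are hypothetical, and the A, C, G, T column order is an assumption.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order
IUPAC = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

# Build the reverse lookup: two-hot row (as a tuple) -> IUPAC code.
ROW_TO_CODE = {}
for code, pair in IUPAC.items():
    row = [0, 0, 0, 0]
    for base in pair:
        row[ALPHABET.index(base)] += 1
    ROW_TO_CODE[tuple(row)] = code


def sketch_twohot2seq(two_hot):
    """Hypothetical helper: decode an (L, 4) two-hot array to a string."""
    rows = np.asarray(two_hot)
    return "".join(ROW_TO_CODE[tuple(int(v) for v in row)] for row in rows)


print(sketch_twohot2seq([[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]]))
# AYCR
```

A row that is not a valid two-hot vector raises KeyError in this sketch; how the library handles such input is not stated here.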

squid.mutagenizer.seq2twohot(seq)[source]

Function to convert heterozygous DNA sequence to two-hot encoding.

Parameters:

seq (string) – Input sequence with length L.

Returns:

one_hot – Two-hot encoding of the input sequence (shape: (L, C)).

Return type:

numpy.ndarray

squid.mutagenizer.get_alternative_bases(ref_base, A)[source]

Get all possible alternative bases for a given reference base.

squid.mutagenizer.L = 10[source]