squid.mutagenizer

Module Contents

Classes

BaseMutagenesis

Base class for in silico MAVE data generation for a given sequence.

RandomMutagenesis

Class for performing random mutagenesis.

CombinatorialMutagenesis

Class for performing combinatorial mutagenesis.

TwoHotMutagenesis

Class for performing random mutagenesis using two-hot encoding.

Functions

apply_mut_by_seq_index(x_index, shape, num_muts)

Function to perform random mutagenesis.

twohot2seq(one_hot)

Function to convert two-hot encoding to a DNA sequence.

seq2twohot(seq)

Function to convert heterozygous DNA sequence to two-hot encoding.

class squid.mutagenizer.BaseMutagenesis[source]

Base class for in silico MAVE data generation for a given sequence.

abstract __call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – One-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of one-hot sequences with random augmentation applied.

Return type:

torch.Tensor
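
The callable contract described above (take a sequence of shape (L, A), return a batch of num_sim sequences) can be sketched as follows. This is a minimal stand-in and not SQUID's source: SketchBaseMutagenesis and IdentityMutagenesis are hypothetical names, and NumPy arrays are used here purely for simplicity.

```python
from abc import ABC, abstractmethod

import numpy as np


class SketchBaseMutagenesis(ABC):
    """Hypothetical stand-in for the documented base class."""

    @abstractmethod
    def __call__(self, x, num_sim):
        """Return a (num_sim, L, A) batch derived from x of shape (L, A)."""


class IdentityMutagenesis(SketchBaseMutagenesis):
    """Hypothetical subclass: returns num_sim unmutated copies of x."""

    def __call__(self, x, num_sim):
        return np.repeat(x[np.newaxis, :, :], num_sim, axis=0)


x = np.eye(4)[[0, 1, 2, 3]]  # one-hot "ACGT", shape (4, 4)
batch = IdentityMutagenesis()(x, num_sim=3)  # shape (3, 4, 4)
```

Because __call__ is abstract, the base class itself cannot be instantiated; only subclasses that implement it can.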

class squid.mutagenizer.RandomMutagenesis(mut_rate, uniform=False)[source]

Bases: BaseMutagenesis

Class for performing random mutagenesis.

Parameters:
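  • mut_rate (float) – Mutation rate for random mutagenesis.

  • uniform (bool, optional) – Sets how the number of mutations per sequence is chosen: if True, every sequence receives the same number of mutations; if False, the number is drawn from a Poisson distribution (defaults to False).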

Returns:

Batch of one-hot sequences with random mutagenesis applied.

Return type:

numpy.ndarray

__call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – One-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of one-hot sequences with random mutagenesis applied.

Return type:

torch.Tensor
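
The effect of the uniform flag on the number of mutations per sequence can be illustrated with a short NumPy sketch. It is not taken from SQUID's source: sample_num_muts is a hypothetical helper, and deriving the mean count as ceil(mut_rate * L) is an assumption made for illustration.

```python
import numpy as np


def sample_num_muts(L, mut_rate, num_sim, uniform=False, rng=None):
    """Hypothetical helper: number of mutations for each of num_sim sequences."""
    rng = np.random.default_rng() if rng is None else rng
    avg_num_mut = int(np.ceil(mut_rate * L))  # assumed: rate scaled by length
    if uniform:
        # Every sequence gets the same number of mutations.
        num_muts = np.full(num_sim, avg_num_mut, dtype=int)
    else:
        # Mutation counts vary between sequences (Poisson-distributed).
        num_muts = rng.poisson(avg_num_mut, size=num_sim)
    # A sequence cannot carry more mutations than it has positions.
    return np.clip(num_muts, 0, L)


rng = np.random.default_rng(0)
fixed = sample_num_muts(L=20, mut_rate=0.25, num_sim=5, uniform=True, rng=rng)
varied = sample_num_muts(L=20, mut_rate=0.25, num_sim=1000, uniform=False, rng=rng)
```

With uniform=True every entry of the result is identical; with uniform=False the counts scatter around the same mean, so the resulting library covers a range of mutational distances from the wild type.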

class squid.mutagenizer.CombinatorialMutagenesis[source]

Class for performing combinatorial mutagenesis.

Returns:

Batch of one-hot sequences with combinatorial mutagenesis applied. The number of sequences produced is A^L, where A is the number of characters in the alphabet and L is the length of ‘mut_window’.

Return type:

numpy.ndarray
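
To make the A^L count concrete, the sketch below enumerates every variant inside a window of a one-hot sequence. It is an illustration under stated assumptions rather than SQUID's implementation: combinatorial_one_hot and its start/stop arguments are hypothetical stand-ins for however the library specifies the mutation window.

```python
from itertools import product

import numpy as np


def combinatorial_one_hot(x, start, stop):
    """Hypothetical helper: all variants of x within positions [start, stop)."""
    L, A = x.shape
    window = stop - start
    variants = []
    # Enumerate every assignment of alphabet indices to the window positions.
    for combo in product(range(A), repeat=window):
        mutant = x.copy()
        mutant[start:stop] = np.eye(A, dtype=x.dtype)[list(combo)]
        variants.append(mutant)
    return np.stack(variants)


x = np.eye(4, dtype=int)[[0, 1, 2, 3, 0]]  # one-hot "ACGTA", shape (5, 4)
batch = combinatorial_one_hot(x, start=1, stop=4)  # 3-position window
```

Here A = 4 and the window spans 3 positions, so the batch holds 4^3 = 64 sequences, the wild type among them. The count grows exponentially with window length, which is why this approach is only practical for short windows.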

__call__(x, num_sim)[source]

class squid.mutagenizer.TwoHotMutagenesis(mut_rate, uniform=False)[source]

Bases: BaseMutagenesis

Class for performing random mutagenesis using two-hot encoding. Each nucleotide at a given position is one-hot encoded, and the unphased diploid sequence is represented as the sum of the two one-hot encoded nucleotides at each position. The sequence “AYCR”, for example, would be encoded as: [[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]].

Returns:

Batch of two-hot sequences with random mutagenesis applied. Heterozygous positions are represented using the IUPAC ambiguity codes, giving the alphabet {A, C, G, T, R (A/G), Y (C/T), S (C/G), W (A/T), K (G/T), M (A/C)}.

Return type:

numpy.ndarray

__call__(x, num_sim)[source]

Return an in silico MAVE based on mutagenesis of ‘x’.

Parameters:
  • x (torch.Tensor) – One-hot sequence (shape: (L, A)).

  • num_sim (int) – Number of sequences to mutagenize.

Returns:

Batch of two-hot sequences with random mutagenesis applied.

Return type:

torch.Tensor

squid.mutagenizer.apply_mut_by_seq_index(x_index, shape, num_muts)[source]

Function to perform random mutagenesis.

Parameters:
  • x_index (np.ndarray) – Indices of wildtype sequence.

  • shape (list) – Shape of MAVE array; i.e., (num_sim, L, A).

  • num_muts (int) – Number of mutations per sequence.

Returns:

Batch of one-hot sequences with random mutagenesis applied.

Return type:

torch.Tensor
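
One way to apply mutations to an index-encoded wild-type sequence is sketched below. This is not SQUID's implementation: mutate_by_index is a hypothetical function that mirrors the documented arguments, and two behaviours are assumptions made for illustration, namely that mutated positions within a sequence are distinct and that each mutation always changes the character.

```python
import numpy as np


def mutate_by_index(x_index, shape, num_muts, rng=None):
    """Hypothetical sketch: build a one-hot batch with random substitutions."""
    rng = np.random.default_rng() if rng is None else rng
    num_sim, L, A = shape
    # Start from num_sim copies of the wild-type index sequence.
    seqs = np.tile(x_index, (num_sim, 1))
    for i in range(num_sim):
        # Choose distinct positions to mutate in sequence i.
        positions = rng.choice(L, size=num_muts, replace=False)
        # Adding an offset in 1..A-1 (mod A) guarantees a different character.
        offsets = rng.integers(1, A, size=num_muts)
        seqs[i, positions] = (seqs[i, positions] + offsets) % A
    # Convert indices back to one-hot, shape (num_sim, L, A).
    return np.eye(A, dtype=int)[seqs]


x_index = np.array([0, 1, 2, 3, 0, 1])  # "ACGTAC" as alphabet indices
mave = mutate_by_index(x_index, shape=(10, 6, 4), num_muts=2,
                       rng=np.random.default_rng(0))
```

Working on integer indices and converting to one-hot only at the end keeps the substitution step to simple modular arithmetic.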

squid.mutagenizer.twohot2seq(one_hot)[source]

Function to convert two-hot encoding to a DNA sequence.

Parameters:

one_hot (numpy.ndarray) – Input two-hot encoding of the sequence (shape: (L, C)).

Returns:

seq – Decoded DNA sequence of length L.

Return type:

str
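
Decoding amounts to a lookup from each two-hot row to its IUPAC code. The sketch below is illustrative only: decode_twohot and TWOHOT_TO_CODE are hypothetical names, and the (A, C, G, T) column order is assumed from the “AYCR” example in the TwoHotMutagenesis description.

```python
import numpy as np

# IUPAC codes keyed by the two-hot row in (A, C, G, T) column order (assumed).
TWOHOT_TO_CODE = {
    (2, 0, 0, 0): "A", (0, 2, 0, 0): "C", (0, 0, 2, 0): "G", (0, 0, 0, 2): "T",
    (1, 0, 1, 0): "R", (0, 1, 0, 1): "Y", (0, 1, 1, 0): "S",
    (1, 0, 0, 1): "W", (0, 0, 1, 1): "K", (1, 1, 0, 0): "M",
}


def decode_twohot(two_hot):
    """Hypothetical decoder: (L, 4) two-hot array -> IUPAC string of length L."""
    return "".join(TWOHOT_TO_CODE[tuple(int(v) for v in row)] for row in two_hot)


two_hot = np.array([[2, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 0], [1, 0, 1, 0]])
seq = decode_twohot(two_hot)  # the "AYCR" example from the class description
```

Rows that do not sum to two, or that fall outside the ten listed codes, raise a KeyError in this sketch; how the library handles such input is not specified here.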

squid.mutagenizer.seq2twohot(seq)[source]

Function to convert heterozygous DNA sequence to two-hot encoding.

Parameters:

seq (str) – Input DNA sequence of length L, with heterozygous positions written as IUPAC ambiguity codes.

Returns:

one_hot – Two-hot encoding of the sequence (shape: (L, C)).

Return type:

numpy.ndarray
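
Encoding can be sketched as summing two one-hot vectors per position, one for each allele implied by the IUPAC code. The names encode_twohot, CODE_TO_PAIR and COLUMN are hypothetical, and the (A, C, G, T) column order is an assumption consistent with the “AYCR” example in the TwoHotMutagenesis description; this is not SQUID's source.

```python
import numpy as np

# Each IUPAC code maps to the pair of nucleotides it represents.
CODE_TO_PAIR = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed column order


def encode_twohot(seq):
    """Hypothetical encoder: IUPAC string of length L -> (L, 4) two-hot array."""
    two_hot = np.zeros((len(seq), 4), dtype=int)
    for i, code in enumerate(seq.upper()):
        for base in CODE_TO_PAIR[code]:
            two_hot[i, COLUMN[base]] += 1  # sum of two one-hot vectors
    return two_hot


two_hot = encode_twohot("AYCR")
```

Every row sums to two: homozygous positions hold a single 2, and heterozygous positions hold two 1s.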