squid.utils

Module Contents

Functions

arr2pd(x[, alphabet])

Function to convert a Numpy array to Pandas dataframe with proper column headings.

oh2seq(one_hot[, alphabet])

Function to convert one-hot encoding to a sequence.

seq2oh(seq[, alphabet])

Function to convert a sequence to one-hot encoding.

fix_gauge(x, gauge[, wt, r])

Function to fix the gauge for an attribution matrix.

squid.utils.arr2pd(x, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert a Numpy array to Pandas dataframe with proper column headings.

Parameters:
  • x (numpy.ndarray) – One-hot encoding or attribution map (shape : (L,C)).

  • alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A','C','G','T'] for DNA.

Returns:

x – Dataframe corresponding to the input array.

Return type:

pandas.DataFrame
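As an illustration of the conversion described above, here is a minimal sketch that labels the C columns of an (L,C) array with the alphabet characters. The name arr2pd_sketch and the shape check are hypothetical and written only for this example; they are not taken from the squid source, which may differ in detail.

```python
import numpy as np
import pandas as pd


def arr2pd_sketch(x, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in: wrap an (L, C) array in a labeled DataFrame."""
    x = np.asarray(x)
    # Assumption: one column per alphabet character, one row per position.
    if x.ndim != 2 or x.shape[1] != len(alphabet):
        raise ValueError("expected shape (L, C) with C == len(alphabet)")
    return pd.DataFrame(x, columns=list(alphabet))


# Toy attribution map for a sequence of length L = 2.
attribution = np.array([[0.1, -0.2, 0.0, 0.3],
                        [0.5, 0.0, -0.1, 0.2]])
df = arr2pd_sketch(attribution)
print(df)
```

Rows keep their positional index (0 to L-1), so a single score can be looked up by position and character, for example df.loc[1, "A"].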

squid.utils.oh2seq(one_hot, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert one-hot encoding to a sequence.

Parameters:
  • one_hot (numpy.ndarray) – Input one-hot encoding of sequence (shape : (L,C)).

  • alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A','C','G','T'] for DNA.

Returns:

seq – Input sequence with length L.

Return type:

str
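The decoding step can be sketched as taking the index of the largest entry in each row and mapping it back to a character. The function oh2seq_sketch below is a hypothetical illustration under that assumption, not the squid implementation.

```python
import numpy as np


def oh2seq_sketch(one_hot, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in: decode an (L, C) one-hot array to a string."""
    one_hot = np.asarray(one_hot)
    # Assumption: each row has a single 1, so argmax recovers its column.
    idx = one_hot.argmax(axis=1)
    return "".join(alphabet[i] for i in idx)


one_hot = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1]])
seq = oh2seq_sketch(one_hot)
print(seq)
```

With argmax, a row that is all zeros would silently decode to the first character, so a stricter variant might validate that every row sums to 1 first.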

squid.utils.seq2oh(seq, alphabet=['A', 'C', 'G', 'T'])[source]

Function to convert a sequence to one-hot encoding.

Parameters:
  • seq (str) – Input sequence with length L.

  • alphabet (list) – The alphabet used to determine the C characters in the logo such that each entry is a string; e.g., ['A','C','G','T'] for DNA.

Returns:

one_hot – One-hot encoding corresponding to input sequence (shape : (L,C)).

Return type:

numpy.ndarray
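For illustration, the encoding can be sketched as filling a zero array of shape (L,C) with a single 1 per row. The name seq2oh_sketch, the float32 dtype, and the behavior on characters outside the alphabet are assumptions made for this example and may not match squid.

```python
import numpy as np


def seq2oh_sketch(seq, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in: encode a string as an (L, C) one-hot array."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Assumption: float32 output; the dtype used by squid is not stated here.
    one_hot = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(seq):
        # Raises KeyError for characters that are not in the alphabet.
        one_hot[pos, index[ch]] = 1.0
    return one_hot


one_hot = seq2oh_sketch("ACGT")
print(one_hot)
```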

squid.utils.fix_gauge(x, gauge, wt=None, r=0.1)[source]

Function to fix the gauge for an attribution matrix.

Parameters:
  • x (numpy.ndarray) – Attribution scores for a sequence-of-interest (shape : (L,C)).

  • gauge (str) – Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.

      • 'uniform' : hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).

      • 'empirical' : uses an empirical distribution computed from the training data.

      • 'consensus' : wild-type gauge using the training data consensus sequence.

      • 'default' : default gauge (no change).

  • wt (numpy.ndarray) – Wild-type sequence (one-hot encoding) for 'wildtype' or 'empirical' gauge (shape : (L,C)). Defaults to None.

  • r (float) – For the 'empirical' gauge, the probability of mutation used when generating the in silico MAVE dataset (should match the user-defined 'mut_rate').

Returns:

OH – Gauge-fixed attribution matrix corresponding to the input x (shape : (L,C)).

Return type:

numpy.ndarray
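To make the gauge modes more concrete, the sketch below shows one common way to gauge-fix an additive (L,C) attribution matrix: at each position, subtract the mean attribution under a chosen per-position character distribution p. This is an assumption based on the standard hierarchical-gauge construction for additive models, not a copy of the squid implementation. All function names are hypothetical, the uniform case is simplified to a uniform distribution over all C characters, and the empirical distribution (wild-type character with probability 1 - r, each other character with r / (C - 1)) is likewise assumed.

```python
import numpy as np


def fix_gauge_sketch(x, p):
    """Hypothetical stand-in: subtract the p-weighted mean at each position."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # Mean attribution per position under p, kept as a column for broadcasting.
    pos_mean = (x * p).sum(axis=1, keepdims=True)
    return x - pos_mean


def uniform_probs(L, C):
    """Uniform distribution over all C characters (simplifying assumption)."""
    return np.full((L, C), 1.0 / C)


def empirical_probs(wt_one_hot, r):
    """Assumed distribution: wild type with prob 1 - r, others share r."""
    wt = np.asarray(wt_one_hot, dtype=float)
    C = wt.shape[1]
    return wt * (1.0 - r) + (1.0 - wt) * r / (C - 1)


x = np.array([[1.0, 2.0, 3.0, 6.0],
              [0.0, 4.0, 0.0, 0.0]])
wt = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0]])

x_uniform = fix_gauge_sketch(x, uniform_probs(*x.shape))       # rows sum to 0
x_wildtype = fix_gauge_sketch(x, wt)                           # wild-type entries are 0
x_empirical = fix_gauge_sketch(x, empirical_probs(wt, r=0.1))  # p-weighted rows sum to 0
```

Using a one-hot wild-type matrix as p recovers the wild-type gauge, in which every score is expressed relative to the wild-type character at that position. Differences between characters within a position are unchanged in every case.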