squid.utils
Module Contents
Functions
| arr2pd | Function to convert a Numpy array to Pandas dataframe with proper column headings. |
| oh2seq | Function to convert one-hot encoding to a sequence. |
| seq2oh | Function to convert a sequence to one-hot encoding. |
| fix_gauge | Function to fix the gauge for an attribution matrix. |
- squid.utils.arr2pd(x, alphabet=['A', 'C', 'G', 'T'])[source]
Function to convert a Numpy array to Pandas dataframe with proper column headings.
- Parameters:
x (numpy.ndarray) – One-hot encoding or attribution map (shape : (L,C)).
alphabet (list) – The alphabet used to determine the C characters in the logo (the column headings), where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.
- Returns:
x – Dataframe corresponding to the input array.
- Return type:
pandas.DataFrame
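To illustrate the conversion described above, here is a minimal sketch that builds the same kind of dataframe directly with pandas. The helper name `arr2pd_sketch` and the example values are hypothetical; this is not squid's own implementation, only an approximation of the documented behavior (one column per alphabet character, one row per position).

```python
import numpy as np
import pandas as pd


def arr2pd_sketch(x, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for arr2pd: label the C columns of an (L, C) array."""
    x = np.asarray(x)
    if x.ndim != 2 or x.shape[1] != len(alphabet):
        raise ValueError("x must have shape (L, C) with C == len(alphabet)")
    # Rows keep their positional index 0..L-1; columns are named by character.
    return pd.DataFrame(x, columns=list(alphabet))


# Made-up attribution map for a sequence of length L = 2.
attr = np.array([[0.1, -0.2, 0.0, 0.3],
                 [0.5, 0.0, -0.1, 0.2]])
df = arr2pd_sketch(attr)
```

A dataframe in this layout (positions as rows, characters as columns) is the form typically expected by sequence-logo plotting tools.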
- squid.utils.oh2seq(one_hot, alphabet=['A', 'C', 'G', 'T'])[source]
Function to convert one-hot encoding to a sequence.
- Parameters:
one_hot (numpy.ndarray) – Input one-hot encoding of sequence (shape : (L,C)).
alphabet (list) – The alphabet that defines the C characters (columns) of the one-hot encoding, in order, where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.
- Returns:
seq – Input sequence with length L.
- Return type:
str
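The decoding step can be sketched as taking the index of the largest entry in each row and looking it up in the alphabet. The helper name `oh2seq_sketch` is hypothetical, and the use of `argmax` is an assumption about how such a decoder is commonly written, not a statement about squid's source.

```python
import numpy as np


def oh2seq_sketch(one_hot, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for oh2seq: decode an (L, C) one-hot array."""
    one_hot = np.asarray(one_hot)
    # For each of the L positions, find the column holding the 1.
    idx = one_hot.argmax(axis=1)
    return "".join(alphabet[i] for i in idx)


# Rows encode G, A, T in the order A, C, G, T.
one_hot = np.array([[0, 0, 1, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1]])
seq = oh2seq_sketch(one_hot)  # "GAT"
```

Note that an all-zero row would decode to the first alphabet character under `argmax`, so this sketch assumes every row is a valid one-hot vector.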
- squid.utils.seq2oh(seq, alphabet=['A', 'C', 'G', 'T'])[source]
Function to convert a sequence to one-hot encoding.
- Parameters:
seq (str) – Input sequence with length L.
alphabet (list) – The alphabet that defines the C characters (columns) of the one-hot encoding, in order, where each entry is a string; e.g., ['A', 'C', 'G', 'T'] for DNA.
- Returns:
one_hot – One-hot encoding corresponding to input sequence (shape : (L,C)).
- Return type:
numpy.ndarray
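As a rough illustration of the encoding, the sketch below fills an (L, C) array of zeros with a single 1 per position. The helper name `seq2oh_sketch` is hypothetical, and raising `KeyError` on a character outside the alphabet is a choice made for this sketch; the library may handle such characters differently.

```python
import numpy as np


def seq2oh_sketch(seq, alphabet=("A", "C", "G", "T")):
    """Hypothetical stand-in for seq2oh: encode a string as an (L, C) array."""
    # Map each character to its column index, following alphabet order.
    column = {ch: i for i, ch in enumerate(alphabet)}
    one_hot = np.zeros((len(seq), len(alphabet)), dtype=int)
    for pos, ch in enumerate(seq):
        one_hot[pos, column[ch]] = 1  # KeyError if ch is not in the alphabet
    return one_hot


oh = seq2oh_sketch("ACGT")
```

Because the column order follows the alphabet, the same `alphabet` argument has to be used when decoding the array back into a sequence.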
- squid.utils.fix_gauge(x, gauge, wt=None, r=0.1)[source]
Function to fix the gauge for an attribution matrix.
- Parameters:
x (numpy.ndarray) – Attribution scores for a sequence-of-interest (shape : (L,C)).
gauge (str) – Gauge mode used to fix model parameters. See https://mavenn.readthedocs.io/en/latest/math.html for more info.
'uniform' : hierarchical gauge using a uniform sequence distribution over the characters at each position observed in the training set (unobserved characters are assigned probability 0).
'empirical' : uses an empirical distribution computed from the training data.
'consensus' : wild-type gauge using the training data consensus sequence.
'default' : default gauge (no change).
wt (numpy.ndarray) – Wild-type sequence (one-hot encoding) for 'wildtype' or 'empirical' gauge (shape : (L,C)).
r (float) – For the 'empirical' gauge, the probability of mutation used during generation of the in silico MAVE dataset (should match the user-defined 'mut_rate').
- Returns:
OH – Gauge-fixed attribution matrix corresponding to the input x (shape : (L,C)).
- Return type:
numpy.ndarray
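For an additive (L,C) matrix, fixing the gauge amounts to subtracting a per-position constant, which leaves the differences between characters at each position unchanged. The sketch below shows one way this could look. Everything in it is an assumption for illustration: the helper name `fix_gauge_sketch`, treating 'wildtype' and 'consensus' as the same mode, assuming all characters are observed at every position for 'uniform', and modelling the 'empirical' distribution as probability 1 - r for the wild-type character with r split evenly among the others. It is not squid's implementation.

```python
import numpy as np


def fix_gauge_sketch(x, gauge, wt=None, r=0.1):
    """Hypothetical gauge fixing for an additive (L, C) attribution matrix."""
    x = np.asarray(x, dtype=float)
    L, C = x.shape
    if gauge == "default":
        return x.copy()  # no change
    if gauge == "uniform":
        # Assumption: every character is observed at every position.
        p = np.full((L, C), 1.0 / C)
    elif gauge in ("wildtype", "consensus", "empirical"):
        if wt is None:
            raise ValueError("wt (one-hot, shape (L, C)) is required")
        wt = np.asarray(wt, dtype=float)
        if gauge == "empirical":
            # Assumption: wild type kept with probability 1 - r,
            # each other character drawn with probability r / (C - 1).
            p = wt * (1.0 - r) + (1.0 - wt) * r / (C - 1)
        else:
            p = wt  # all probability mass on the wild-type character
    else:
        raise ValueError(f"unknown gauge: {gauge!r}")
    # Shift each row so that its p-weighted mean becomes zero.
    return x - (x * p).sum(axis=1, keepdims=True)


x = np.array([[1.0, 2.0, 3.0, 6.0],
              [0.0, -1.0, 1.0, 4.0]])
wt = np.array([[1, 0, 0, 0],
               [0, 0, 1, 0]])
x_uniform = fix_gauge_sketch(x, "uniform")        # each row sums to zero
x_wt = fix_gauge_sketch(x, "wildtype", wt=wt)     # wild-type entries are zero
x_emp = fix_gauge_sketch(x, "empirical", wt=wt, r=0.1)
```

In all three modes only a row-wise offset changes, so the relative effect of one character versus another at a given position is the same before and after gauge fixing.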