# Sequencing Errors
## The Sequencing Protocol
Mistakes during DNA synthesis aren’t the only cause of errors. DNA also risks being damaged during certain steps, such as fragmentation.
![[Pasted image 20240630192949.png|725]]
> Source: [Sequencing error profiles of Illumina sequencing instruments](https://academic.oup.com/nargab/article/doi/10.1093/nargab/lqab019/6193612)
---
Originally, I was thinking of integrating the **PHRED quality score** metric into the model. But from what I’ve read, there isn’t actually much association between the PHRED quality score assigned during basecalling and the likelihood of an error occurring during sequencing.
- Mistakes during basecalling/[chromatogram reading](https://biology.unt.edu/~jajohnson/Chromatogram_Interpretation) seem to be an entirely different problem in and of itself
> “Quality score filtering does not improve accuracy if the errors are introduced prior to the sequencing”
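
For reference, the PHRED score is just a log-scaled estimate of the basecalling error probability, $Q = -10 \log_{10} P$. A minimal sketch of the conversion, assuming the Phred+33 ASCII encoding used in current Illumina FASTQ files (the function names and the toy quality string are made up for illustration):

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a PHRED score Q into the estimated basecalling error probability."""
    return 10 ** (-q / 10)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into PHRED scores (Phred+33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]


# Toy quality string, not taken from a real run.
scores = decode_quality_string("II5#")
probs = [phred_to_error_prob(q) for q in scores]
```

The score only describes the basecaller’s uncertainty about the signal it saw, which fits the quote above: an error introduced before sequencing can still be read out with a high Q.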
### Primers, Barcodes, and UMIs (oh my)
- [ ] Index Hopping
- [ ] i5 & i7
- [ ] The demultiplexing process
## Other Considerations
> [!summary] Summary
> We need to have a better idea of the types of errors that are occurring. It’s far more complicated than “Oops, polymerase slipped and caused a transition mutation.”
For the scope of this project, we should focus only on errors specific to the *Illumina MiSeq* platform. The types and occurrences of sequencing errors can vary dramatically due to technological variation, so homing in on one specific procedure should help limit the additional variation.
### Complications/Overlaps With SSM Diversity
> [!summary] Summary
> We want to make sure that there isn’t any correlation between the patterns of the sequencing errors and the patterns of the induced mutations from the engineering process.
(IUPAC codes: N = A/C/G/T, K = G/T, and S = C/G)
Certain **trimer motifs** have been shown to be highly associated with error rates (e.g. GGT).
Considering the lab sometimes uses *NNK-restricted* site saturation mutagenesis (SSM), this could cause a *significant overlap* between the induced mutations and the motifs where errors are most likely.
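
A quick way to see the overlap is to expand the degenerate NNK codon and check it against a set of error-prone trimers. This is only a sketch: `error_motifs` is a hypothetical placeholder holding the one motif named in these notes, and the check ignores motifs that straddle a codon boundary. NNK expands to 32 codons, and GGT is one of them.

```python
from itertools import product

# IUPAC degenerate base codes used in the notes above, plus the plain bases.
IUPAC = {"N": "ACGT", "K": "GT", "S": "CG", "A": "A", "C": "C", "G": "G", "T": "T"}


def expand_degenerate(pattern: str) -> list[str]:
    """Enumerate every concrete sequence matched by an IUPAC pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[ch] for ch in pattern))]


nnk_codons = expand_degenerate("NNK")

# Hypothetical set of error-prone trimers; GGT is the only one named in these notes.
error_motifs = {"GGT"}

# Codons an NNK library can introduce that are themselves error-prone motifs.
overlapping = [codon for codon in nnk_codons if codon in error_motifs]
print(len(nnk_codons), overlapping)
```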
---
“Why can’t we just filter stuff using the base call quality scores?”
- Well first of all, we do…
- Second, there are multiple “points of failure” within the process that introduce these errors. The base quality metric only captures the
- If DNA polymerase incorporated the incorrect base, the emitted signal would technically be “correct”, but
### Relationships Between Datasets & Error Profiles
- [ ] Does the nature of each dataset influence the presence of certain error motifs?
## Error Correction Techniques
> See also:
> - [BayesHammer: Bayesian clustering for error correction in single-cell sequencing](https://www.zotero.org/micheeela/collections/9364QRBW/items/A7IH5WHA/attachment/6UMES3W6)
> - [HIDDEN MARKOV MODELS FOR DNA SEQUENCING](https://www.zotero.org/micheeela/collections/9364QRBW/items/XSMSFKNU/attachment/8MKQB47J)
A lot of error correction approaches rely on some level of redundancy in the data to:
1. First determine average clusters
2. Identify variations within these clusters (that differ from a centroid)
The issue is, if a lot of the mutations being generated in these libraries are
I feel like trying to use these traditional means of error detection would be no different (or very similar) to what we are already doing with our *clustering process*.
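
To make the two steps above concrete, here is a minimal sketch of the redundancy-based idea, assuming the reads in a cluster are already grouped and have equal length. It uses a plain per-position majority vote as the "centroid", which is my simplification and not the method of any of the tools linked above; the reads are toy data.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority-vote consensus of equal-length reads (stand-in for the cluster centroid)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def mismatches(read: str, centroid: str) -> list[tuple[int, str, str]]:
    """Return (position, centroid_base, read_base) for every position that differs."""
    return [(i, c, r) for i, (c, r) in enumerate(zip(centroid, read)) if c != r]


# Toy cluster: made-up reads, not real data.
cluster = ["ACGTGGTA", "ACGTGGTA", "ACGTGCTA", "ACGTGGTA"]
centroid = consensus(cluster)
variants = {read: mismatches(read, centroid) for read in cluster if read != centroid}
```

Note that a sketch like this has no way to tell a sequencing error from a genuine low-frequency variant: both just show up as deviations from the centroid.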
### Reference Alignments
Nearly every existing approach relies on aligning reads to a reference genome.
- This is also the case for most of the benchmarking papers.
One paper that used a TCR-Seq dataset set the value of the `genome_length` parameter to the sum of the lengths of all the TCR sequences.
- [ ] How does this impact the algorithm’s performance? There would be a lot of homology between the TCR genes, which I’d imagine would lead to a huge misalignment rate.
### UMI-Based Clustering
> References:
> - [Correcting PCR amplification errors in unique molecular identifiers to generate absolute numbers of sequencing molecules.](https://www.zotero.org/micheeela/collections/9364QRBW/items/66SLA2B4/attachment/9PIWCIYN)
TODO