libera/#clasp - IRC Chatlog
3:55:52
drmeister
I asked our collaborator about 19012A01 and they said - "Oh yeah that was one of our first sequences and it keeps cropping up - you should see the entire thing 11-11-11-11-11" So I went looking.
3:56:54
drmeister
I'm being kind of sloppy with the codes because the 19012A01-style codes are a stupid scheme developed by some biologist and they take a while to type out.
3:58:21
drmeister
Anyway - I gathered up all the full sequences with the 19012A01 library code and found one sequence...
3:59:48
drmeister
So that means this sample was probably contaminated by a SINGLE strand of DNA that led to this sequence, and it got amplified in the PCR so that we see it 2,775 times in the 48M sequencing reads.
4:02:27
drmeister
I've been trying to figure out the difference between sequencing noise and a low-copy contaminating DNA sequence that got amplified with everything else. This gives me a kind of lower limit on that.
4:04:28
drmeister
The y-axis is the log10 of the number of times a sequence shows up and the x-axis is just the index of the sequence.
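For readers following along, a plot like this can be prepared roughly as follows. This is only a sketch in Python, not the Cando code used here; the function name `log10_abundance` and the toy reads are made up, and sorting by count (most frequent first) is an assumption, since the log only says the x-axis is the sequence index.

```python
import math
from collections import Counter

def log10_abundance(reads):
    """Return (index, log10(count)) pairs, one per distinct sequence.

    Assumption: sequences are indexed from most to least frequent.
    """
    counts = Counter(reads)                       # sequence -> number of occurrences
    ranked = sorted(counts.values(), reverse=True)
    return [(i, math.log10(c)) for i, c in enumerate(ranked)]

# Toy data (hypothetical): one dominant sequence, one rarer one, one singleton.
reads = ["ACGT"] * 100 + ["TTGA"] * 10 + ["GGCC"]
points = log10_abundance(reads)
for index, log_count in points:
    print(index, log_count)
```

On a plot like this, a singleton read sits at 0 on the y-axis and a sequence seen 2,775 times sits at about 3.44.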
4:06:13
drmeister
They have quality data - the "phred" score for each base. I use that to filter out sequences that I consider too noisy to be reliable.
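A Phred score Q encodes a per-base error probability P as Q = -10 * log10(P), so Q20 means a 1% chance the base call is wrong. Below is a minimal Python sketch of a Phred-based filter. It assumes the common Phred+33 ASCII encoding used in FASTQ files; the threshold `min_q=20` and the "every base must pass" rule are illustrative assumptions, not the criteria actually used here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def passes_filter(quality_string, min_q=20):
    """Hypothetical rule: keep a read only if every base has score >= min_q."""
    return all(q >= min_q for q in phred_scores(quality_string))

print(phred_scores("I5#"))      # 'I' -> 40, '5' -> 20, '#' -> 2
print(passes_filter("IIII"))    # every base is Q40
print(passes_filter("II#I"))    # one base is Q2
```

Other reasonable rules exist, such as thresholding the mean score or the expected number of errors per read; which one fits depends on the data.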
4:08:49
drmeister
Each 8-base stretch codes for a number from 1-10 using only 10 sequences chosen carefully from the 4^8=65536 possible sequences.
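One plausible reading of "chosen carefully" is that the 10 codewords are kept far apart in Hamming distance, so that a few sequencing errors cannot turn one code into another. The log does not say how the codes were actually picked, so the sketch below is purely illustrative: the greedy search, the minimum distance of 4, and the resulting codewords are all assumptions, not the real library codes.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_codewords(n_codes=10, length=8, min_dist=4):
    """Greedily pick n_codes sequences that are pairwise >= min_dist apart.

    Candidates are scanned in lexicographic order over the 4**length
    possible sequences; this is a hypothetical selection scheme.
    """
    chosen = []
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if all(hamming(candidate, c) >= min_dist for c in chosen):
            chosen.append(candidate)
            if len(chosen) == n_codes:
                break
    return chosen

codes = greedy_codewords()
min_pairwise = min(hamming(a, b) for a, b in combinations(codes, 2))
print(4 ** 8)                    # 65536 possible 8-base sequences
print(len(codes), min_pairwise)
```

With a minimum distance of 4, any single-base error still leaves a read closer to its true codeword than to any other.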
4:09:45
drmeister
I exposed the SeqAn C++ library to align short sequences to the overlapping 3+8+3 "codons".
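As a rough illustration of the decoding step, here is a Python sketch that swaps in a much simpler technique than SeqAn alignment: ungapped nearest-match lookup by Hamming distance. It assumes each "codon" is a 3-base flank, an 8-base core, and a 3-base flank, with adjacent codons sharing a flank (so windows advance by 11 bases). That layout, the toy codon table, the `max_mismatch` cutoff, and the function name `decode_read` are all assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_read(read, codon_table, flank=3, core=8, max_mismatch=2):
    """Decode a read into numbers by matching each 3+8+3 window to the table.

    Ungapped matching only: unlike a real aligner, this ignores
    insertions and deletions. Windows with no close match decode to None.
    """
    stride = core + flank          # adjacent windows overlap by one flank
    width = flank + core + flank
    numbers = []
    for start in range(0, len(read) - width + 1, stride):
        window = read[start:start + width]
        best = min(codon_table, key=lambda pattern: hamming(window, pattern))
        if hamming(window, best) <= max_mismatch:
            numbers.append(codon_table[best])
        else:
            numbers.append(None)
    return numbers

# Hypothetical two-entry table: flank GGG on both sides of each core.
codon_table = {"GGGAAAAAAAAGGG": 1, "GGGAAAACCCCGGG": 2}
clean_read = "GGG" + "AAAACCCC" + "GGG" + "AAAAAAAA" + "GGG"
noisy_read = "GGG" + "AAAACCTC" + "GGG" + "AAAAAAAA" + "GGG"  # one base changed
print(decode_read(clean_read, codon_table))
print(decode_read(noisy_read, codon_table))
```

Both reads decode to the same pair of numbers here because the single substitution stays within the mismatch cutoff. A real aligner would also handle indels, which this lookup cannot.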
4:11:55
drmeister
It's all running in Cando in a Jupyter notebook. I'm going to show it to our collaborators tomorrow. It kicks ass.
4:12:38
drmeister
They have some R scripts that take 2.5 days to analyze this same data. Mine takes 1-2 hours.