libera/#clasp - IRC Chatlog

3:53:52 drmeister Got called away.

3:54:21 drmeister Our library code is 10922A02 - we see that code 39.6M times.

3:54:43 drmeister Then there is 19012A01 and then a bunch of noise.

3:55:52 drmeister I asked our collaborator about 19012A01 and they said - "Oh yeah that was one of our first sequences and it keeps cropping up - you should see the entire thing 11-11-11-11-11" So I went looking.

3:56:54 drmeister I'm being kind of sloppy with the codes because the 19012A01-style code is a stupid code developed by some biologist and it takes a while to type out.

3:57:20 drmeister Basically you got 1x01-1x10 and 2x01-2x10

3:57:36 drmeister I say "Zero" have you heard of it?

3:57:58 Bike i don't recognize this code, no

3:58:10 Bike i'm used to just the atcggctagc stuff

3:58:21 drmeister Anyway - I gathered up all the full sequences with the 19012A01 library code and found one sequence...

3:58:30 Bike two bits per character is rather suboptimal, i spose

3:58:38 drmeister ((("11012201" "13012401" "15012601" "17012801" "19012A01") . 2775))

3:58:57 drmeister That's the 11-11-11-11-11 code - that's the only one I see.

3:59:48 drmeister So that means this sample was probably contaminated by a SINGLE strand of DNA that led to this sequence and it got amplified in the PCR so that we see it in the 48M sequencing reads 2,775 times.

3:59:53 drmeister That's kinda neat.

4:00:31 Bike yeah, cool. shame it's messing up the read though.

4:01:04 drmeister It's not - it's literally seen in only 2775 of 48000000 reads.
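
For scale, the fractions quoted above can be checked in a few lines of Python. The counts are the ones given in the chat (about 48M total reads, 39.6M reads of the main library code, 2,775 reads of the contaminant); the variable names are only illustrative.

```python
# Counts as quoted in the chat; "48M" and "39.6M" are rounded figures.
total_reads = 48_000_000        # total sequencing reads
library_reads = 39_600_000      # reads carrying library code 10922A02
contaminant_reads = 2_775       # reads of the 11-11-11-11-11 sequence

library_fraction = library_reads / total_reads
contaminant_fraction = contaminant_reads / total_reads
contaminant_per_million = contaminant_fraction * 1_000_000

print(f"library code: {library_fraction:.1%} of reads")
print(f"contaminant:  {contaminant_per_million:.1f} reads per million")
```

With these rounded figures the contaminant works out to roughly 58 reads per million, which is why it does not disturb the main signal.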

4:01:26 drmeister They said that it was a bigger problem a couple of months ago.

4:02:27 drmeister I've been trying to figure out the difference between sequencing noise and a low-copy contaminating DNA sequence that got amplified with everything else. This gives me a kind of lower limit on that.

4:03:31 drmeister Here's a histogram of the number of times different sequences show up...

4:03:49 drmeister https://usercontent.irccloud-cdn.com/file/lSJpe8jG/image.png

4:04:28 drmeister The y-axis is the log10 of the number of times a sequence shows up and the x-axis is just the index of the sequence.
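
The data behind a plot like this can be sketched as follows. This is not the Cando code from the chat, just a minimal Python illustration. It assumes the reads are already available as plain strings, and it orders sequences by descending count, which the chat does not actually state.

```python
import math
from collections import Counter

def log10_count_profile(reads):
    """Return log10(count) per distinct sequence, most frequent first.

    The descending order is an assumption; the chat only says the
    x-axis is "the index of the sequence".
    """
    counts = Counter(reads)
    return [math.log10(n) for _, n in counts.most_common()]

# Tiny made-up example, not real sequencing data.
example_reads = ["ACGT", "TTTT", "ACGT", "GGCC", "ACGT", "TTTT"]
profile = log10_count_profile(example_reads)
```

The resulting list can be handed to any plotting tool, with the list index on the x-axis and the value on the y-axis.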

4:05:03 Bike nice x axis

4:05:07 Bike how long are these sequences?

4:05:14 drmeister 167 bases.

4:05:26 Bike and they're all the same length? i see...

4:05:35 drmeister Yes.

4:06:13 drmeister They have quality data - the "phred" score for each base. I use that to filter out sequences that I consider too noisy to be reliable.
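
A Phred score Q relates to the base-call error probability P by Q = -10 log10(P), so Q = 20 means a 1% chance the base is wrong. The sketch below shows one possible quality filter in Python. The actual filter used in the chat is not described, so the rule here (reject a read if any base falls below a threshold) and the `min_q=20` default are assumptions, as is the Phred+33 ASCII encoding commonly used in FASTQ files.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    offset=33 is the common FASTQ (Phred+33) encoding; this is an
    assumption about the input files.
    """
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def passes_filter(quality_string, min_q=20, offset=33):
    """Hypothetical rule: keep a read only if every base has Q >= min_q."""
    return all(q >= min_q for q in phred_scores(quality_string, offset))
```

A stricter or looser rule (for example a cutoff on the mean score, or only on the bases inside the coding region) would follow the same pattern.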

4:06:49 drmeister Here's how it's organized...

4:06:50 drmeister https://usercontent.irccloud-cdn.com/file/NxYOH0Vz/image.png

4:07:12 drmeister I aligned about 200 of them in emacs and lined up the columns.

4:07:32 drmeister From left to right there's a ~40 base constant/forward-primer.

4:08:04 drmeister Then (3-bases-8-bases)x10-3-bases

4:08:49 drmeister Each 8-base stretch codes for a number from 1-10 using only 10 sequences chosen carefully from the 4^8=65536 possible sequences.

4:09:04 drmeister They are chosen to be at least 3 apart by Hamming distance.
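
Keeping codewords at least 3 apart by Hamming distance means any single-base substitution still leaves a read closer to its true codeword than to any other, so one substitution per 8-base stretch can be corrected. The Python sketch below illustrates the idea. The codewords are made up for the example and are not the real library codes, and `decode` is a hypothetical helper that handles substitutions only, not insertions or deletions.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(codewords):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))

def decode(read, codewords, max_mismatches=1):
    """Return the nearest codeword, or None if it is too far away."""
    best = min(codewords, key=lambda cw: hamming(read, cw))
    return best if hamming(read, best) <= max_mismatches else None

# Made-up 8-base codewords for illustration only.
TOY_CODEWORDS = ["AAAAAAAA", "CCCCAAAA", "AAAACCCC", "GGGGGGGG"]
```

For instance, `decode("AAAATCCC", TOY_CODEWORDS)` recovers `"AAAACCCC"` despite the one mismatch, while a read far from every codeword is rejected.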

4:09:45 drmeister I exposed the SeqAn C++ library to align short sequences to the overlapping 3+8+3 "codons".
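
The overlapping 3+8+3 windows can be pictured with fixed offsets, as in the Python sketch below. It assumes the forward primer is exactly 40 bases (the chat says "~40") and that there are no insertions or deletions; the alignment-based approach with SeqAn described above exists precisely because real reads do not always satisfy that. All names here are illustrative.

```python
PRIMER_LEN = 40   # assumption: the chat only says "~40 base" primer
SPACER_LEN = 3    # the 3-base stretches between codes
CODE_LEN = 8      # each 8-base stretch encodes a number from 1-10
N_CODES = 10      # (3 bases + 8 bases) x 10, then a final 3 bases

def codon_windows(read):
    """Slice a read into 10 overlapping 14-base (3+8+3) windows.

    Consecutive windows share a 3-base spacer, so each window starts
    11 bases after the previous one.
    """
    step = SPACER_LEN + CODE_LEN
    width = SPACER_LEN + CODE_LEN + SPACER_LEN
    windows = []
    for i in range(N_CODES):
        start = PRIMER_LEN + i * step
        windows.append(read[start:start + width])
    return windows
```

Each window's middle 8 bases (`window[3:11]`) are the candidate code, and the flanking spacers give an aligner something constant to anchor on.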

4:10:13 drmeister SeqAn scores mismatches and gaps in a consistent way.

4:10:28 drmeister I end up throwing out about 2/5 of the sequences.

4:10:52 drmeister In the end it gave us two top molecules that we are resynthesizing.

4:11:02 drmeister We have more but there are two that really stand out.

4:11:55 drmeister It's all running in Cando in a Jupyter notebook. I'm going to show it to our collaborators tomorrow. It kicks ass.

4:12:06 drmeister It's also about 50x faster than what they have.

4:12:38 drmeister They have some R scripts that take 2.5 days to analyze this same data. Mine takes 1-2 hours.

4:14:44 drmeister I've got this kind of stuff going in the jupyter notebook:

4:14:46 drmeister https://usercontent.irccloud-cdn.com/file/H2AzF26G/image.png

13:27:36 Bike i can reproduce the compilation issue easily

13:27:45 Bike probably just need to fix cleavir's deletion of cycles...

14:12:22 Bike yeah, fixed.

14:12:28 Bike i mean, i have to push it, but easy fix