Polishing long reads

I’ve been away from this lab notebook for a little while – for good reasons. The pace of the projects I’m involved in has been moving at breakneck speed, and since some of them are publication-facing, I ended up enforcing a bit of a self-imposed embargo on the stuff I’d normally write about. The first of the preprints should be coming out soon, so hopefully I’ll be able to share some interesting stuff here in a couple of weeks’ time.

This aspect of ’embargo before publication’ for an amateur/independent researcher is something that will need to be discussed at length in the future. Amateurs in biology tend to be isolated from the normal academic support infrastructure, like lab meetings and casual conversations with peers within an institution. So an embargo in these situations happens in a far more solitary context, with the amateur often missing out on the crucial peer-to-peer transfer of knowledge that increasingly seems necessary to the development of the scientific craft. Publication culture for amateur and independent researchers will have to be fundamentally different in this regard.

But I digress, onto the topic for today – the long read polishing question.

Long read based assemblies, by the very nature of the technologies used, suffer from indel issues – in almost any scenario, a post-assembly correction step is mandatory. Granted, the improvements I’ve seen in the field between just 2018 and now have been staggering – I’m getting the kind of raw reads and assemblies on today’s casual experiments that would have been unimaginable back when we first started out.

The introduction of HAC + SUP guppy calling and Medaka for polishing really changed the game – to the point that the usual basecalling -> assembly -> polishing cycle for the ONT platform seems to be coalescing into something resembling a standard (guppy SUP basecall -> flye assembly -> medaka polishing, IMHO).

Of the three main steps, post-assembly polishing was always the trickiest of the bunch for me. Sure – Racon, then Medaka, followed by some sort of homopolymer-specific polishing step tends to be the way to go. But should it be 4x Racon, as was suggested in earlier Medaka literature? And which aligner should produce the SAM files fed to Racon for polishing? Super high accuracy lastal? Minimap2? Tons of literature out there also suggest BWA, despite the fact that we’re working solely with long reads. Or should we ignore the manual and follow some well-known bioinformaticians out there, sticking with BWA -> 1x Racon -> Medaka and figuring out the rest as we go along?

While checking for a new Medaka update on my work machine (1.6.0) I was reminded of a passage on their official github (https://github.com/nanoporetech/medaka) and decided to test things out for myself at last.

I’ll process two raw assembly outputs – deino and meta_deino – from the Flye assembler. Deino is a simple --nano-hq mode assembly, and meta_deino is a metaFlye mode assembly (I suspect there’s a prophage in the genome, hence the second data set). For this very informal comparison, we’re just going to look at the BUSCO score of each final genome and the IDEEL indel estimation graph to check for broken or duplicated genes.

The two assemblies used here came from minimally quality-controlled guppy SUP reads. More information below:

#Guppy 6.0.1+652ffd1
guppy_basecaller -i fast5/ -s SUP_fastq -c dna_r9.4.1_450bps_sup.cfg -x 'cuda:0' --gpu_runners_per_device 1 --chunk_size 500 --chunks_per_runner 128

#Filtered data stat
Mean read length:                18,757.6
Mean read quality:                   13.2
Median read length:              11,292.0
Median read quality:                 13.2
Number of reads:                 69,387.0
Read length N50:                 26,341.0
STDEV read length:               20,444.2
Total bases:              1,301,532,662.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:	69387 (100.0%) 1301.5Mb
>Q7:	69387 (100.0%) 1301.5Mb
>Q10:	69387 (100.0%) 1301.5Mb
>Q12:	54104 (78.0%) 963.1Mb
>Q15:	5919 (8.5%) 85.2Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	37.3 (1948)
2:	37.2 (1182)
3:	35.4 (3067)
4:	33.8 (2535)
5:	33.4 (4727)
Top 5 longest reads and their mean basecall quality score
1:	306879 (11.1)
2:	299065 (14.9)
3:	290924 (15.4)
4:	290778 (14.2)
5:	290638 (12.3)
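As a sanity check on stats like the above, the read length N50 can be re-derived from a plain list of read lengths with a short awk pass. This is a hypothetical helper of my own, not the tool that actually produced the numbers:

```shell
# N50: the length at which the cumulative sum of (descending) read
# lengths first crosses half of the total bases.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END { half = total / 2
          for (i = 1; i <= NR; i++) {
            run += len[i]
            if (run >= half) { print len[i]; exit } } }'
}

printf '%s\n' 10 20 30 40 50 | n50   # -> 40
```

Feeding it real read lengths (e.g. from `awk 'NR % 4 == 2 { print length($0) }' reads.fastq`) should reproduce the N50 reported above.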

The commands for the two assemblies and their stats are:

#Deino assembly
flye --nano-hq deino_1k_90.fastq -t 8 -o deino_1k

     Total sequence composition is as follows     
 518903 A
 519556 T
 865827 C
 868228 G
Total gapped sequence length is: 2772514
Total ungapped sequence length is: 2772514
GC content in deino_1k_raw.fasta is 62.54 %                 

busco -m genome -i deino_1k_raw.fasta --auto-lineage-prok -o deino_1k_raw_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#meta_deino assembly
flye --nano-hq deino_1k_90.fastq --meta -t 16 -o meta_deino_1k

     Total sequence composition is as follows     
 548550 A
 548984 T
 908052 C
 909337 G
Total gapped sequence length is: 2914923
Total ungapped sequence length is: 2914923
GC content in meta_deino_raw.fasta is 62.34 %                 

busco -m genome -i meta_deino_raw.fasta --auto-lineage-prok -o meta_deino_raw_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

71% BUSCO recovery for both – not too bad, not too great. It’s interesting that their BUSCO scores are identical despite an almost 150kb size difference.
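As an aside, the ‘Total sequence composition’ and GC content figures quoted throughout this note are just a counting pass over the FASTA. A minimal awk stand-in (my own sketch with a placeholder name, not the exact script from the repo) would look like this:

```shell
# Count A/T/C/G in a FASTA and report GC% over the unambiguous bases.
base_stats() {
  awk '!/^>/ { seq = toupper($0)
         for (i = 1; i <= length(seq); i++) c[substr(seq, i, 1)]++ }
       END { total = c["A"] + c["T"] + c["C"] + c["G"]
             printf "A=%d T=%d C=%d G=%d GC=%.2f%%\n",
                    c["A"], c["T"], c["C"], c["G"],
                    (c["G"] + c["C"]) * 100 / total }' "$1"
}
```

Running `base_stats deino_1k_raw.fasta` would print the counts and GC percentage in one line.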

We’ll process the two assemblies through five different polishing scenarios.

  • Medaka only (relatively recent official ONT recommendation)
  • BWA based SAM, 1x Racon and then Medaka
  • Minimap2 based SAM, 1x Racon and then Medaka
  • Lastal high accuracy mode based SAM, 1x Racon and then Medaka
  • Minimap2 based SAM, 4x Racon and then Medaka

Some data and the processing scripts used for this note are on my github at https://github.com/naturepoker/polishing_long_reads_note.

Let’s look at the Medaka only polishing output:

#Medaka only deino polish
#Medaka 1.6.0
medaka_consensus -i deino_1k_90.fastq -d deino_1k_raw.fasta -o medaka_deino_1k -t 8 -m r941_min_sup_g507

     Total sequence composition is as follows
 518847 A
 519407 T
 865794 C
 868162 G
Total gapped sequence length is: 2772210
Total ungapped sequence length is: 2772210
GC content in consensus.fasta is 62.54 %

busco -m genome -i medaka_deino_1k.fasta --auto-lineage-prok -o medaka_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:76.6%[S:76.6%,D:0.0%],F:12.1%,M:11.3%,n:124    |
        |95     Complete BUSCOs (C)                       |
        |95     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |15     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#Medaka only meta_deino polish
medaka_consensus -i deino_1k_90.fastq -d meta_deino_raw.fasta -o medaka_meta_deino -t 8 -m r941_min_sup_g507

     Total sequence composition is as follows
 548694 A
 549101 T
 908388 C
 909636 G
Total gapped sequence length is: 2915819
Total ungapped sequence length is: 2915819
GC content in consensus.fasta is 62.35 %

busco -m genome -i medaka_meta_deino.fasta --auto-lineage-prok -o medaka_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:75.8%[S:75.8%,D:0.0%],F:12.9%,M:11.3%,n:124    |
        |94     Complete BUSCOs (C)                       |
        |94     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |16     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

Not a drastic change, but an improvement nonetheless. Deino went from 71% to 76.6%, and meta_deino went from 71% to 75.8%.
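Since I’ll be staring at a lot of these BUSCO tables, here’s the kind of one-liner I use to pull the completeness figure out of a short summary file for quick side-by-side comparison (the file layout in the commented loop is a placeholder assumption):

```shell
# Grab the "C:xx.x%" completeness figure from a BUSCO short summary file.
busco_c() {
  grep -o 'C:[0-9.]*%' "$1" | head -1
}

# e.g., assuming one output directory per run:
# for f in *_busco/short_summary*.txt; do
#   printf '%s\t%s\n' "$f" "$(busco_c "$f")"
# done
```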

Let’s look at a BWA based SAM used in a 1x Racon iteration, and then Medaka. The shell script for this and the other polishing processes will be on my github.
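In outline, this scenario ran something like the sketch below. The filenames and the exact BWA preset are reconstructed from memory, so treat the script in the github repo as the authoritative version:

```shell
# Sketch of the BWA -> 1x Racon -> Medaka run (placeholder filenames).
# Written out as a script here; only syntax-checked, not executed.
cat > polish_bwa_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
DRAFT=deino_1k_raw.fasta

bwa index "$DRAFT"
bwa mem -t 8 -x ont2d "$DRAFT" "$READS" > aln.sam   # ONT long-read preset
racon -t 8 "$READS" aln.sam "$DRAFT" > racon_1x.fasta
medaka_consensus -i "$READS" -d racon_1x.fasta \
    -o old_RM_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_bwa_racon_medaka.sh   # syntax check only
```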

#BWA 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518596 A
 519148 T
 865755 C
 868145 G
Total gapped sequence length is: 2771644
Total ungapped sequence length is: 2771644
GC content in old_RM_deino_1k.fasta is 62.55 %

busco -m genome -i old_RM_deino_1k.fasta --auto-lineage-prok -o old_RM_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |21     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#BWA 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548312 A
 548592 T
 908184 C
 909388 G
Total gapped sequence length is: 2914476
Total ungapped sequence length is: 2914476
GC content in old_RM_meta_deino.fasta is 62.36 %

busco -m genome -i old_RM_meta_deino.fasta --auto-lineage-prok -o old_RM_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

So – I don’t know what happened here. The polishing didn’t really improve anything in any obvious manner. If I put the IDEEL graph above next to the raw genome data and squint really hard, I can see decreased instances of partial genes… But the result is overall disappointing.

Maybe this is BWA related? While the suspicion makes sense, it’s a pretty jarring change considering that in the past, the BWA-Racon cycle tended to get us at least some degree of improvement. For an older example of BWA SAM fed polishing of an ONT long-read assembly, check the polishing step in this draft (https://naturepoker.wordpress.com/2021/11/27/genome-assembly-for-artists-draft-part-1/).

Let’s look at the 1x minimap2 SAM fed Racon-Medaka polishing output. Minimap2 is the de facto standard long read mapper out there, so there shouldn’t be any obvious issues.

#minimap2 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518501 A
 519030 T
 865604 C
 867994 G
Total gapped sequence length is: 2771129
Total ungapped sequence length is: 2771129
GC content in new_RM_deino_1k.fasta is 62.55 %

busco -m genome -i new_RM_deino_1k.fasta --auto-lineage-prok -o new_RM_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:75.0%[S:75.0%,D:0.0%],F:13.7%,M:11.3%,n:124    |
        |93     Complete BUSCOs (C)                       |
        |93     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |17     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#minimap2 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548324 A
 548606 T
 908178 C
 909410 G
Total gapped sequence length is: 2914518
Total ungapped sequence length is: 2914518
GC content in new_RM_meta_deino.fasta is 62.36 %

busco -m genome -i new_RM_meta_deino.fasta --auto-lineage-prok -o new_RM_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

The outcome here is also pretty open to interpretation. If I take the IDEEL graphs side by side with the ones from the raw unpolished genomes, there are obvious signs of improvement – a smaller number of partial genes and a smaller number of duplicate genes. And yet according to the more definitive metric of BUSCO scores, the meta assembly from the Deinococcus reads barely improved at all. So far none of the results we’ve been looking at have been as good as just running Medaka straight on a raw assembly – so technically, the added Racon polishing step has been making the assemblies worse.

Maybe this is due to the sometimes noisy alignments from minimap2? Let’s try out lastal, and see what its highest accuracy setting will get us. I’ve used lastal before and have been impressed with the output, especially in the context of subsampling reads against a target genome.
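The LAST-based run went roughly as below. The flags are my best reconstruction of the train-then-align recipe (with read qualities ignored via -Q0), so the repo script remains the reference:

```shell
# Sketch of the lastal -> 1x Racon -> Medaka run (placeholder filenames).
# Written out as a script here; only syntax-checked, not executed.
cat > polish_lastal_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
DRAFT=deino_1k_raw.fasta

lastdb -P8 -uNEAR draft_db "$DRAFT"
last-train -P8 -Q0 draft_db "$READS" > train.par   # learn alignment params
lastal -P8 -Q0 -p train.par draft_db "$READS" | last-split > aln.maf
maf-convert sam aln.maf > aln.sam
racon -t 8 "$READS" aln.sam "$DRAFT" > racon_1x.fasta
medaka_consensus -i "$READS" -d racon_1x.fasta \
    -o racon_medaka_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_lastal_racon_medaka.sh   # syntax check only
```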

#lastal 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518393 A
 518961 T
 865446 C
 867847 G
Total gapped sequence length is: 2770647
Total ungapped sequence length is: 2770647
GC content in racon_medaka_deino_1k.fasta is 62.55 %

busco -m genome -i racon_medaka_deino_1k.fasta --auto-lineage-prok -o racon_medaka_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124    |
        |86     Complete BUSCOs (C)                       |
        |86     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#lastal 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548297 A
 548571 T
 908186 C
 909468 G
Total gapped sequence length is: 2914522
Total ungapped sequence length is: 2914522
GC content in racon_medaka_meta_deino.fasta is 62.36 %

busco -m genome -i racon_medaka_meta_deino.fasta --auto-lineage-prok -o racon_medaka_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124    |
        |86     Complete BUSCOs (C)                       |
        |86     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

Holy cow. I don’t think I’ve ever seen a polishing step (involving two different polishers, no less) actually make an assembly worse across the board. Obviously the polishing step somehow ended up introducing more indels into the genome. Is this a trait of super-high accuracy SAM files fed to Racon? Or just some weird quirk specific to the mapper? Either way, the combination of lastal version 1293, Racon v1.5.0 and Medaka v1.6.0 makes for a strange, unexpected failure state. The old adage of “you can’t overpolish a genome” no longer seems to hold water.

Let’s see what we get if we follow the old school recommendation of 4x Racon iterations before Medaka. We’ll be using minimap2 for the SAM input.
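The iteration itself is just a loop that feeds each Racon output back in as the next round’s target. A sketch with placeholder filenames (the repo script is authoritative):

```shell
# Sketch of the minimap2 -> 4x Racon -> Medaka run.
# Written out as a script here; only syntax-checked, not executed.
cat > polish_4x_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
ASM=deino_1k_raw.fasta

for i in 1 2 3 4; do
  minimap2 -ax map-ont "$ASM" "$READS" > round_${i}.sam
  racon -t 8 "$READS" round_${i}.sam "$ASM" > racon_${i}.fasta
  ASM=racon_${i}.fasta     # next round polishes this output
done
medaka_consensus -i "$READS" -d "$ASM" \
    -o 4x_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_4x_racon_medaka.sh   # syntax check only
```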

#minimap2 4x Racon and Medaka deino polish

     Total sequence composition is as follows
 518037 A
 518693 T
 865063 C
 867714 G
Total gapped sequence length is: 2769507
Total ungapped sequence length is: 2769507
GC content in 4x_deino_1k.fasta is 62.56 %

busco -m genome -i 4x_deino_1k.fasta --auto-lineage-prok -o 4x_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:70.2%[S:70.2%,D:0.0%],F:18.5%,M:11.3%,n:124    |
        |87     Complete BUSCOs (C)                       |
        |87     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#minimap2 4x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 547978 A
 548285 T
 907936 C
 909262 G
Total gapped sequence length is: 2913461
Total ungapped sequence length is: 2913461
GC content in 4x_meta_deino.fasta is 62.37 %

busco -m genome -i 4x_meta_deino.fasta --auto-lineage-prok -o 4x_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |21     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

The polished assemblies are certainly technically improved – but when we consider that just running Medaka on the raw assemblies gives us better results, this is more of a subtle regression in the quality of the genome.

There are some caveats to the results here – I’ve intentionally left out specialized downstream polishers that would improve the results drastically by getting rid of significant chunks of homopolymer indel errors. Polypolish and homopolish come to mind as tools I’ve used before that frequently turn so-so assemblies into amazing ones. And the Deinococcus genome is a repeat- and GC-rich, polyploid landscape of transposons, prophages and homopolymers, so we’re really looking at a pretty tricky genome to begin with.

I’m not sure what to think about the results here though. Either there’s something significantly wrong with what I’ve done (I seriously doubt it based on previous experience, but it’s possible), or later versions of Medaka might actually perform worse when added to the traditional pipeline of Racons and Pilons.

What bothers me the most is this – if I had simply relied on the existing pipelines in my setup (which integrate polypolish/homopolish) I wouldn’t have realized what was happening at all, since fixing homopolymer/indel errors will almost always get me better BUSCO scores compared to the raw inputs.

I’m looking for suggestions and comments on what’s happening here, and on better practices for long-read assembly polishing. As usual, please feel free to leave a comment here or get in touch on twitter @naturepoker1

Edit May 29th 2022:

@gaworj on twitter suggested using subsampled reads to cut down on potential noise. I cut the reads down to about 100x coverage and processed them through the same medaka only vs racon + medaka comparison again, focusing primarily on the BUSCO scores (for now). The behavior described in the body of this lab note persists.
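For reference, the subsampling was along these lines – I’m showing Filtlong as one option (I don’t know what tool @gaworj had in mind), with a target of ~277 Mb, i.e. roughly 100x over the ~2.77 Mb genome:

```shell
# Sketch of subsampling the reads to ~100x coverage with Filtlong.
# Written out as a script here; only syntax-checked, not executed.
cat > subsample_100x.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# ~2.77 Mb genome x 100x coverage ~= 277 Mb of bases to keep,
# keeping the highest-quality reads first.
filtlong --target_bases 277000000 deino_1k_90.fastq > deino_100x.fastq
EOF
bash -n subsample_100x.sh   # syntax check only
```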

@Psy_Fer_ on twitter suggested dropping the --nano-hq argument from the Flye assembly since I’m still using r941 data. Comparing a --nano-hq assembled genome and a --nano-raw assembled genome through the same medaka only vs racon + medaka comparison again replicated the behavior seen in this lab note.

It should be noted that this entire test looks at indel errors rather than any real structural correction of the assemblies (through IDEEL and BUSCO), so I’m missing a whole bunch of stuff that’s happening under the surface. This note is also confined to a single data point: one Deinococcus ONT flowcell output, obtained using the rapid sequencing kit and basecalled using a VERY subpar consumer graphics card (NVIDIA GeForce GT 1030).

Thanks for the suggestions folks! Hope this note’s been useful for you.

Lab note: Feb 11

Continuing on with the Halobacterium long-term culture observations.

Confirmed living as of February 11 2022 

1) 2021/Jan14 Halo R control : doing well, I don't see a difference between this culture and other, far fresher inoculations. Same goes for all other cultures (including the mutant culture in acidic media) inoculated on the same day. 

2) 2020/Oct23 Halo M mutant : surprisingly weak growth - I can see the beginnings of the typical white-pink Halobacterial ring around the rim of the tube, but it's only just started. I'd safely say at least a month's lag time before these cultures thrive again (when re-inoculated).

3) 2020/Apr02 Halo M mutant A/B : weak growth, more or less on par with the 2020/Oct23 culture. Should I say it's doing better than it ought to, considering half a year's time between the Oct23 and the Apr02 A/B cultures?

4) 2020/Mar30 Halo M mutant : it's hard to tell at this stage - the red sedimentation on the bottom of the tube is getting thicker, so we could be seeing a mutant culture with damaged gas vesicle operon, but I'll have to give it another couple of weeks to make sure. 

5) 2020/Mar17 Halo R control : no growth. The media remains completely clear.


So out of the ‘unknown status’ cultures recorded on January 28th, the 2020/Mar17 and 2020/Mar30 cultures remain questionable. I think the Mar30 culture will end up picking up soon, but it’s probably safe at this stage to assume the Mar17 culture is simply dead. I’ll still leave both in the incubator for the next few weeks to see if anything changes.

Now, the long-term Halobacterium growths above are from maintained lines of sealed, liquid media cultures. I also maintain a separate line of aerated cultures that are specifically meant to dry out and crystallize (due to the high salt content of the media, at +25%) in order to track their long term behavior. My previous observations suggested that aerated (sponge capped) and dried cultures of Halobacterium do not fare as well as sealed liquid cultures, but maybe my method for resuscitating them had been wrong all along.

I recently resuscitated a test tube of fully dried and salt-crystallized Halobacterium culture dating back to 2020/Apr25, using a protocol that happened to incorporate a brief heat-shock period – about 1 minute at 60C. For whatever we can tell from a sample size of one, this particular culture seems to be thriving now – far outgrowing even the more recent cultures from 2021.

Again, sample size of one, so this result doesn’t mean anything other than that this particular salt-crystal captured culture seems to be doing very well with simple resuspension in fresh media and perhaps the exposure to heat, meant to help some of the crystals melt faster. Once I can get a decent saturation of this particular culture, I’ll extract its genomic DNA via previously discussed methods and send it out for WGS for possible variant observation.

I’m also planning a far more controlled experiment dealing with the behavior of salt crystal encapsulated Halobacterium cultures, using 28 well plates and 0.5mm glass beads (usually used for cell disruption – the important thing is that they are standardized). I have a hunch that the rough response profiles of a salt crystallized Halobacterium and a sealed long-term liquid culture Halobacterium would remain similar, and I would like to confirm it physically.

Lab note: Jan 17-Jan 28

Before going further – Jan 27th was something of a momentous day. Our Deinococcus radiophilus genome submission finally became the representative genome on Genbank! Now I just need to get the preprint out there.

I’m in the middle of a big re-arrangement of the warehouse lab, finally dusting our bench and wiping down all the grime accrued over the pandemic period. The lab was still being used through those months, but the activity level was nowhere near what we had during the Deinococcus radiophilus project. Such a shame too – we’ve managed to put together a pretty tidy lab and a backlog of surprisingly fancy hardware for a bunch of amateurs.

For the time being, the main lab bench looks like this.

One of the big tasks was to catalog and re-inoculate a bunch of long term Halobacterium salinarum cultures in the lab. I have Halobacterium NRC-1 steady state cultures dating back to January of 2020, and slant agar cultures dating all the way back to 2019 here.

The slant cultures don’t look too bad, as they’re backups stored in the fridge… The fridge itself is a bargain basement unit that should be giving up the ghost one of these days, but for now it keeps on trucking.

As for the larger volume, liquid Halobacterium NRC-1 cultures I’ve been keeping here (intentionally! The goal was always to track mutation profiles over very long term storage and stress, between a mutant and a control strain) – they are in states ranging from absolutely decimated to hanging on for dear life. On the whole, cultures stored in well sealed, limited oxygen environments fare far better than the ones with more aeration, as expected.

What I didn’t expect is the seemingly ravenous efficacy with which Halobacterium cultures clear out everything in their environment – and seemingly themselves – in states of extreme starvation. The pictures above depict two types of sealed cultures kept at a constant 30C under 24/7 grow lights: the plastic bottle is from January of 2021, the glass flasks are from March of 2020 – very unscientifically, a year apart. Even with the same strain of Halobacterium NRC-1 in the same media (ATCC 213), a year’s time is enough to clarify the whole media. A nice relaxing side project for me is to track the length of stasis and the rate of change through all this, and to check for the genomic rearrangements common in these extreme cases.

And it turns out there could be reasons to suspect that this type of long-term starvation in a sealed environment is a kind of evolutionary stimulus Halobacterium has been exposed to throughout its history, as anyone who made salt crystals in their youth might attest. Look at this wholly unintentional creation:

One of the old Halobacterium flasks left out to dry in the back of the lab. It’s hard to see in the picture, but the large salt crystal still contains Halobacterium.

This is a Halobacterium culture left forgotten to dry out at the back of the lab, inoculated on July 9th 2020. It’s almost impossible to tell in the picture, but the large crystal to the right has a red center that still contains a liquid-or-gel like something, which I suspect could be a still-living culture of Halobacterium in stasis. I suspect this because I’ve been through this exact situation before, and have revived a small number of Halobacterium cultures caught up in crystals. So simulating this cycle under far more controlled conditions is something I’m running in the lab – Haloarchaeal culture, heavy dose of light, dry out, crystallization, cracking and sequencing for genomic rearrangements, on a year-length time scale.

As for current long-term Halobacterium cultures that went through recent re-inoculation:

Confirmed living as of Jan 28 2022

1) 2021/Jan21 Halo R control -> 2022/Jan17 Halo R control
2) 2021/Jan21 Halo M mutant -> 2022/Jan17 Halo M mutant

3) 2021/Jan14 Halo R control -> 2022/Jan17 Halo R control
4) 2021/Jan14 Halo M mutant -> 2022/Jan17 Halo M mutant

5) 2020/Oct23 Halo M mutant -> 2022/Jan17 Halo M mutant

6) 2020/Apr02 Halo M mutant B -> 2022/Jan17 Halo M mutant

Unknown status as of Jan 28 2022

2020/Apr02 Halo M mutant A 
2020/Mar17 Halo R control 
2020/Mar30 Halo M mutant 

How the long-term Halobacterium are grown. It’s a little shabby, but it works. I’m working to expand this out into about three sets.

Another big change is moving more of the active Halobacterium NRC-1 mutant/control cultures to the warehouse lab as well. I’m thinking of setting up a rotating culture of 10ml acidic vs control halo media with a 10ul inoculation per generation – for now it looks something like this, with a day-night cycle. Inoculation time for the Halobacterium mutant and control cultures below was 1:45 AM on January 28th, 2022.

Very shabby, but I’d rather get the cultures started sooner rather than later. This whole arrangement will be going through expansion and refinement as well.

ONT long-read assembly note: on an unusually difficult archaeal genome


This particular lab note is from an ongoing project in collaboration with a different lab. Please pardon some of the necessary vagueness. All this is pre-publication data, so I have to blur out some details until we can at least get a preprint out (which should happen soon, fingers crossed!).

I’ve been working on a particularly befuddling genome of a previously unsequenced archaeal microbe for the last couple of months. The key issue with the data has been the presence of spurious contigs that differ drastically in size and structure from assembly to assembly, leaving only the chromosome and one circularized contig as reliable constants across maybe 100 or so different iterations (it got to the point that I’d jokingly refer to the genome as a random number generator). There finally seems to be a light at the end of the tunnel though, so I’m leaving a quick note on the slightly modified process here, hopefully to be merged later into the long-read assembly portion of the “Genome assembly for artists” zine.

Now, contig level variation from assembly to assembly is itself perfectly normal, IMHO, and can stem from myriad different causes, including differences in the assembly process such as the filtering method for raw data, some variation in the version numbers of the programs used, etc. What was puzzling about this particular genome was: (1) the sheer variety of contigs produced through different assembly processes, their sizes ranging from thousands of bases to +600kb, and (2) the said variety of contigs appearing and disappearing from different assembly iterations without any discernible pattern.

The issue was exacerbated by the somewhat unique character of the archaea at hand – it’s an extremophile with particular growth condition requirements that our collaborator didn’t have the infrastructure to handle at the time. That, combined with some unfortunate timing issues, meant the genome extraction had to be performed directly on the shipping samples, making for three separate genome extractions on three physically separate samples. I’ve seen some odd results from similar cases before, due to mutations present in one sample of the same species but not in another, so we were definitely walking into somewhat complex territory (at least for an amateur biologist like myself). One saving grace of the situation was the sheer size of the available data – our collaborator was successful in extracting and sequencing a positively ridiculous amount of DNA from his samples, so we were sitting on raw reads pointing toward at least 1500x combined coverage against the suspected size of the archaeal genome (another reason why exploratory assemblies are so important! We always want to keep an eye on the suspected complete genome size versus the depth of the available data).

So going in, we had three high quality raw read sets: the first from conventional Illumina short read sequencing, the second and third from ONT MinION flowcells (our collaborator used a GridION for the sequencing).

I won’t get into the details of what happened during the first month of the genome assembly experiment – due to timing and computational resource issues I was forced to focus on more stringent filtering of the raw data rather than on the assemblies themselves. The pervading concern at the time was contamination causing fragmentation of the assembly, potentially via cryptic environmental symbionts, prophages and other oddities our archaea might have dragged in from the deep. Since screening all of the raw reads via GUNC and Kraken (as I normally do) proved largely unfeasible, I had to rely on kmer based analysis and short-read based quality filtering. Even running a Unicycler based hybrid assembly (still considered by many to be the most accurate method) became extremely cumbersome, requiring about 12~15 hours of run time with extremely unsatisfying results (granted, unlike the long-read based assemblies, the hybrid assembly continued to give us the same contigs regardless of the filter state of the raw reads – a curious observation for future reference).

With December of 2021 lost among enigmatic genome assemblies, I was finally able to sit down in the new year with a more powerful rig (the System76 Thelio written about in a previous note) and finish the assembly properly, given the quality of the data at hand.

From the get-go it was obvious that Trycycler was the best pipeline for handling such high-depth long-read data, since it allows in-depth subsampling and curation of the many assemblies thrown into the reconciliation step, resulting in a more detailed description of the genome overall. Unfortunately Trycycler wasn’t really an option in the very beginning. Our assembly project originally began with just the short reads and the flowcell 2 data, which ended up being just short of the depth needed to curate reliable subsamples. The lack of depth during those initial stages would simply result in more fragmentation in the assemblies, and I’d be able to rescue only about 2/3 of the expected genome, since reconciliation among the other fragmented contigs would not be possible.

The availability of the 3rd flowcell data (which actually composes the bulk of all available data in our set now) and re-basecalling of the 2nd flowcell data in SUP mode finally allowed me to attempt another Trycycler run – with similarly disappointing results.

By this point it became reasonable to suspect that the chaotic, fragmented state of our assemblies so far was itself a reasonable facsimile of the archaeal genome at the time of extraction and sequencing. Obvious sources of contamination capable of causing misassemblies of this magnitude would have revealed themselves as independent contigs by now, and screening of dozens of assemblies up to this point did not point toward any obviously foreign contig. No mystery rRNA operons were detected in any of the contigs or assemblies either. Based on numerous factors it was only natural to suspect a genuine, perhaps transposon-driven fragmentation of the genome happening during the storage and transfer process the microbes were subjected to prior to genome extraction.

How we approach the post-assembly steps in Trycycler changes if we assume significant pre-existing damage and fragmentation in the input genome, since the goal shifts from processing all of the genomic components in the assembly to choosing the correct subset of components that might be more representative of their unmodified state.

The easiest way to utilize the Trycycler pipeline in this way was to supply it with a curated set of assemblies before the read clustering step. This required two things:

  1. Freshly produced sets of assemblies using minimally filtered reads
  2. Curation criteria for screening the newly assembled genomes

(1) was simple – the top-90%-scoring reads above 1000 base pairs from each of the passed m2 and m3 flowcell sets were isolated into their own read groups, and a third read set was created from the combined passed m2 and m3 reads using the same filter cutoff. This third read set would take the place of the assembly raw reads for all post-assembly downstream steps in the Trycycler pipeline. The filtlong command for this step would look something like this:

filtlong --min_length 1000 --keep_percent 90 pass.fastq.gz > filtered.fastq
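Applied per flowcell and to the combined set, that works out to something like the sketch below – file names are placeholders, and the combined set can be made with a plain cat since concatenated gzip streams are themselves valid gzip:

```shell
# Same filter per flowcell, plus a combined m2+m3 set for downstream Trycycler steps.
filtlong --min_length 1000 --keep_percent 90 m2_pass.fastq.gz > m2_filtered.fastq
filtlong --min_length 1000 --keep_percent 90 m3_pass.fastq.gz > m3_filtered.fastq
cat m2_pass.fastq.gz m3_pass.fastq.gz > m23_pass.fastq.gz
filtlong --min_length 1000 --keep_percent 90 m23_pass.fastq.gz > m23_filtered.fastq
```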

Here are the post-filter raw read plots:

(2) – figuring out the assembly screening criteria – ended up being pretty simple as well. By this point I had a pretty decent idea of some of the more interesting features that must be present in a complete genome of this mystery archaea. I focused on seemingly conserved placements between a certain type of operon cluster and CRISPR spacer sites.

The chromosome is one component consistently recovered through all of the assemblies so far, in many cases with mere bases of difference between assembly iterations. And a consistent feature of all of the chromosomes so far has been the presence of a specific operon some preserved distance away from a CRISPR spacer site consisting of 118 and 13 repeat units. Curiously, the same operon cluster (with enough differences to be a unique instance on its own) shows up on one other contig as well, again a roughly conserved distance away from another CRISPR spacer site consisting of 8, 34 and 38 repeat units. A similar operon to the ones in those two contigs (the chromosome and the mystery contig) shows up on a third contig, this time carrying a transposase in between one of the operon elements.

I know the chromosome is real, as well as the operon and the 118/13 CRISPR repeat units. The other 8/34/38 CRISPR spacer site and its operon seem real, and the third iteration of the operon on the third contig looks unique enough with the conserved transposase site stuck in the middle of the operon pattern. These contig-level characteristics were enough to let me form the initial assembly screening criteria – a more complete representation of the archaeal genome in question should carry three x operons, two of them along with CRISPR spacer sites of the 118/13 and 8/34/38 patterns, and the final operon carrying a transposase insertion site.

With the screening criteria set, I subsampled each of the m2, m3 and m23 read sets into 16 groups, creating four assemblies from each of four different assemblers (flye, miniasm+minipolish, raven and shasta) per read source, ending up with 16×3 = 48 assemblies to screen through. The new assemblies were annotated with prokka and analyzed for the presence of the three marker characteristics described above (no fancy bioinformatics tools available for this step beyond the usual shell utilities and piping – I had to read through the annotation files and make notes with pen and paper). As expected, assemblies with spurious contigs tended to carry duplicates of the marker traits across the genome, such as a third or sometimes up to a fifth CRISPR spacer site following the same 8/34/38 pattern, all carbon copies of each other down to the base level. The same went for operon x – some assemblies carried up to 13 copies, but after the initial three they were all essentially just copies of each other. Curiously enough, the largest group of shared marker traits across the 48 assemblies followed the traits I’ve described already – 3 operons and 2 CRISPR sites with specific spacer repeat units – supporting the possibility that the pattern represents the true layout of the genome. The 18 assemblies that displayed the targeted marker site properties were isolated and used for the downstream Trycycler process, outlined in my previous note starting with the trycycler clustering step. The final genome was polished with the usual racon – medaka – polypolish pipeline as well, using the Illumina reads from the first genome extraction.
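The subsample-and-assemble stage above can be sketched roughly like this – read set names and the subset-to-assembler split are illustrative, and I’m assuming Trycycler and the assemblers are on PATH (trycycler subsample writes sample_01.fastq through sample_16.fastq with --count 16):

```shell
# One read source shown; repeat for m2 and m3. Four subsets go to each of the
# four assemblers, giving 16 assemblies per read source.
trycycler subsample --reads m23_filtered.fastq --out_dir m23_subsets --count 16
for i in 01 02 03 04; do
    flye --nano-hq m23_subsets/sample_"$i".fastq --threads 8 --out-dir m23_flye_"$i"
done
# ...with analogous loops handing sample_05-08 to miniasm+minipolish,
# 09-12 to raven and 13-16 to shasta.
```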

The resulting genome is composed of 6 fully circularized contigs totaling 3965466 bp and 62.18% GC content.

001_chromosome  2767537
002_plasmid      406045
003_plasmid      260739
004_plasmid      123302
006_plasmid      350887
009_plasmid       56956
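Those totals are easy to double-check straight from the assembly fasta with awk (final_asm.fasta being the assumed file name):

```shell
# Sum sequence lengths and G/C counts over all non-header lines.
awk '!/^>/ {len += length($0); gc += gsub(/[GCgc]/, "")}
     END {printf "%d bp, %.2f%% GC\n", len, 100 * gc / len}' final_asm.fasta
```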

Its BUSCO profile seems pretty good, though we’ve always had a good recovery percentage on this one, probably because we’ve always had a pretty decent representation of the chromosome.

busco -m genome -i final_asm.fasta -o final_asm -l archaea_odb10

        |Results from dataset archaea_odb10               |
        |C:99.5%[S:99.5%,D:0.0%],F:0.0%,M:0.5%,n:194      |
        |193    Complete BUSCOs (C)                       |
        |193    Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |0      Fragmented BUSCOs (F)                     |
        |1      Missing BUSCOs (M)                        |
        |194    Total BUSCO groups searched               |

I’ve also run the usual gamut of minimap2 alignment tests on problematic regions and checked for possible duplicates, and the results are extremely promising. More notes on the post-assembly analysis methods later, probably in conjunction with ‘Genome assembly for artists’ part II.

We hope to have a preprint of the results and observations soon, possibly within the next couple of weeks. I’ll make sure to post it here and write up a post-mortem as well.

A couple of takeaways from this particular assembly experience:

  • Assemblers can’t make up for already damaged DNA, even if the damage could have happened through the microbe’s native DNA indel mechanisms.
  • Hybrid assemblies could be relegated to an interesting oddity soon, especially as Q20 ONT reads and SUP basecalling become the norm. All of Trycycler (48 assemblies plus subsampling and clustering) and polishing took maybe four or five hours. Hybrid assembly using the same dataset took 12~15, and still resulted in a fragmented genome.
  • Rapid iteration of assemblies using raw reads filtering and flye can get us results that are about as good as a well-considered Trycycler output, but it’s a gamble.
  • Re-sequencing older genomes with newer technology is definitely a worthy endeavor. I’ve run into numerous assemblies of our archaea’s genus cousins, all of which, in retrospect, seem very incomplete and in dire need of updates.
  • Archaeal plasmids are uncharted territory. My usual HMM/blast tools were not terribly useful this time, since important features often went unannotated.

Genomes in the machine

Our warehouse lab has a new addition – a Thelio desktop computer from System76, fully decked out with a 32-thread AMD Ryzen 9 5950X and 64GB of ECC RAM.

Thelio in the lab! Pardon the mess – I’m still figuring out a decent desktop arrangement

Of course, I couldn’t afford a machine like this with our amateur biologist resources – we were fortunate enough to have been chosen for the “Unleash your potential” program run by System76 earlier this year, aimed at the maker/hacker crowd. I believe we need to update System76 on some of our experiments performed with the machine, so stay tuned for an article that might come out on their end later.

System specification via neofetch

The maxed-out 64GB of RAM should be enough to get us through most microbial genome assembly and analysis needs, though long-read microbial metagenomics might require some tinkering – I’m pretty sure we’ll need at least 75GB of RAM for a decent data set. The Ryzen 9 5950X is still about the best most people will ever use on a consumer-to-semi-pro rig, and it’s been proving its mettle time and time again. From very informal initial tests, the machine seems to assemble/align genomes about eight times faster than I’m used to from another workstation at our lab, running a 32-thread Intel Xeon E5-2670.

The new machine also comes with a discrete GPU (as is usual for most AMD builds I think, since AMD processors tend not to carry an IGP on the package) – alas, GPUs are the main victims of the 2020-2021 cryptocurrency mania. The Nvidia GT1030 (I think we have the 4GB version) is probably close to the lowest-powered graphics card still widely available these days, and was the only desktop GPU available for this model at the time. I do need to upgrade the card at some reasonable time in the future since we’re pretty heavy on Nanopore long-read sequencing and analysis, and the best parts of their processing pipeline more or less require a GPU these days.

This got me thinking – how far can the GT1030 get us in the traditional Nanopore pipeline – say, GPU-based guppy basecalling from raw fast5 data? I decided to re-basecall our Deinococcus radiophilus fast5 data from 2019, based on the RAD-004 rapid sequencing kit and a MinION 106D flowcell. The original data was basecalled using albacore, and the new data will be processed using guppy’s super mode.

Here are the stats from the original sequencing data, from before the era of guppy/medaka and other fancy shenanigans we get to work with these days (long-read oriented tools were in their infancy when this data was produced – most of our favorite assemblers were not yet around, and none of our polishers of choice were available back then).

The key takeaway here is the quality score distribution

The command to run guppy in super basecalling mode is simple – just run guppy_basecaller as usual and call an appropriate config file, like dna_r9.4.1_450bps_sup.cfg. However, running the GPU version of guppy on an anemic graphics card like the GT1030 will require additional tweaking – otherwise the processes will terminate due to lack of memory.

From my experience, the setting below is necessary to run guppy in super quality basecalling mode on an Nvidia GeForce GT1030.

guppy_basecaller -i input_fast5_folder/ -s fastq_output -c dna_r9.4.1_450bps_sup.cfg -x 'cuda:0' --gpu_runners_per_device 1 --chunk_size 500 --chunks_per_runner 128

The above guppy setting with the GT1030 will basecall a 2.45 billion nucleotide fast5 dataset in a little less than 7 days. Now, this might sound like a lot, but remember that super quality basecalling on guppy is very computation intensive. I tried running the same fast5 dataset on all 32 threads of a Xeon E5-2670 workstation before. Running the machine like a space heater for almost a month barely got us through 15%~20% of the whole dataset, and I had to cancel the basecalling process for fear of damaging the CPU. So it takes 7 days for one of the cheapest possible GPUs on the market to finish a job that would take a server-class Xeon processor 4~5 months to wrap up. I’d call that an almost unbelievable degree of improvement, to the point that super mode basecalling on the CPU-only version of guppy shouldn’t even be considered an option for most.

Please do note that all of this applies to guppy’s super quality basecalling mode only – HAC (high accuracy) basecalling mode can be almost as good (I’m not seeing a huge difference between the SUP-basecalled versus HAC-basecalled archaeal genome in our lab right now), and tends to be light enough to run on CPUs if absolutely needed (still not an optimal experience, but the resource requirements are far, far more manageable).

So, what’s the difference between an old albacore basecalled data versus the new guppy super basecalled data? If you’re sitting on an old (pre-2020) Nanopore rapid kit microbial genome, would it be worth re-basecalling with guppy and re-assembling circa 2022? I believe so. Not only is our Deinococcus radiophilus dataset vastly improved, it’s a little different now.

Let’s start with Nanostat summary of the newly basecalled data.

Summary of the newly basecalled Deinococcus radiophilus data

Some differences stand out immediately. For one, we now have fewer bases, down from 2.45 billion to 2.28 billion. This is a behavior we’ve seen in other in-house datasets as well – guppy running fast5 data in super mode will likely result in fewer bases. In return, however, we’re seeing a pretty great increase in quality scores across the board. The top mean basecall quality score in the previous data was 12.8; now it’s 41.7. The lowest mean basecall quality score among the longest reads in the previous data was 4.5; now it’s 9.5. We had 23283 reads over Q10 in the previous dataset; now it’s 183923 reads over Q10. That’s an almost 8x increase in reads that could be classified as high accuracy.
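For anyone wanting to reproduce that kind of count, the per-read mean quality lives in guppy’s sequencing_summary.txt under the mean_qscore_template column, so a Q10 tally is one awk away:

```shell
# Locate the qscore column by name in the header, then count rows at or above Q10.
awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) if ($i == "mean_qscore_template") q = i; next}
            $q >= 10 {n++} END {print n + 0}' sequencing_summary.txt
```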

Let’s look at plots of average read quality against average read length from both datasets:

Please do note that the average read quality scale on the Y axis runs differently for the two graphs. The original data (left, fuchsia) runs from Q2 to Q12+, and the new data (right, green) runs from Q5 to Q20+. The best quality score cluster for the old data sits between Q8 and Q11-12, while for the new data it sits between Q10 and Q15. On the other hand, we can also see that the longest read on the X axis for the old data stretches to 200k+, while the new data tops out somewhere above 150k. Guppy’s super mode could be doing some sort of trimming here. But if the choice is between a couple of <Q7 210kb reads versus a couple of Q14 160kb reads, most would opt for the latter.

Here are the same plots in log length:

To summarize – the GT1030 can do super quality basecalling with guppy under a very specific setting, and a smaller microbial genome’s fast5 data will need about a week’s computing time. Super quality basecalling results are absolutely superior to albacore’s – it might be worth the time and resources to re-basecall and update pre-2020 Nanopore microbial assemblies.

I’m a little swamped finishing our existing Deinococcus radiophilus manuscript at this time, but will be running our assembly pipeline on the newly basecalled data soon. I get a feeling that the assembly level comparisons between the albacore data versus guppy-super data might tell us something interesting.

Methylobacterium update

A quick note on minor housekeeping stuff for the unidentified Methylobacterium species we recovered from the East River (December 25th, 2021).

Some of the culture plates started exhibiting anomalous growth patterns on the PPES-K plates. The obvious culprit is contamination, or latent microbes carried over from the original sampling through the last few subculture runs, but the abnormal growth also displayed the same red carotenoid hue as the original isolated colonies. My instinct is still on contamination, but we’ll have to wait on further analysis for now.

Colonies from both the plates with abnormal growth and the plates that still retain the same phenotype as the original 16S barcode-identified Methylobacterium were isolated and subcultured in fresh PPES-K media. Red plates indicate subcultures from plates without signs of contamination, and black plates indicate possibly contaminated sources.


C-MR-Sep 27
D-MR-Sep 27
E-MR-Sep 27



The new batch of PPES-K was made with the recipe below – I’ll try out a 1/100 methanol addition for test cultures after this one.


0.5g Soy powder
0.5g Milk powder
0.25g Yeast Extract
0.001g Ferric citrate
4g agar

I’d love to move on to growing, analyzing and sequencing these microbes soon, but the recent workload has been a little much. Everything’s been delayed by about a month or so… Hopefully I’ll be able to maintain these previous samples properly.

“Genome assembly for artists” draft part 1

Following is an early work-in-progress draft for an upcoming zine “Genome assembly for artists.” The content here is released under CC BY-NC-SA license.

Normally I work with publicly accessible databases full of assembled genomes and other biological data, like the almost eponymous NCBI GenBank. The tools and methods for accessing these vast repositories can sometimes get a little archaic, but gaining even a rudimentary feel for them can open doors to a lifetime’s worth of study and discoveries for a dedicated amateur researcher. Think of it as gaining the lay of a great library, filled with codices millions of years in the making.

Alas, it looks like modern biology will sooner or later demand some sort of whole genome sequencing of an unsuspecting researcher, usually while lacking a decent reference data point. As I had to dive head first into a couple of these situations recently, I’m writing out a quick note on how an amateur researcher might approach an assembly process circa 2021. Please note that the processes described here are simply what I happened to pick up over time and are in no way indicative of a best practice scenario! And as usual any comments, questions or advice are welcome – you can contact me on twitter @naturepoker1 or leave a comment on this post.

In this particular note, I’ll address dealing with long-read data from the ONT MinION platform using Linux (Ubuntu 20.04 LTS in this case, but the general gist should be universally applicable). I’m also going to assume that the researcher has at least a general idea of the microbe’s taxonomy via traditional 16S Sanger sequencing (barcode sequencing on samples can range from $10~$20 per sample, USD). Another huge assumption is that we have a bioconda environment already installed and set up on the machine (https://bioconda.github.io/user/install.html). The demo data used for figures and tables here come from a combination of our previous and ongoing pre-publication sequencing experiments – as with all data from my lab notebook, please refrain from using them without asking me first!

Let’s start with establishing a base raw data set that will be used for all downstream assemblies and analysis. Working with raw data can get pretty hectic with multiple branches of one thing or another down the road, so it really helps to establish an initial baseline, even if you decide to work with a differently curated and filtered set later.

Nanopore introduced different basecalling tiers somewhat recently – high accuracy (HAC, seemingly the default in the MinION client as of 2021/11/25) and super (SUP, as the name suggests it’s supposed to be more accurate than even the HAC data). Both HAC and SUP basecalling methods can be very computationally intensive, especially on consumer grade machines without a dedicated GPU. If you’re working without a GPU-equipped machine, I’d recommend turning off the live-basecalling feature during sequencing and manually basecalling with an instance of guppy (ONT’s basecaller) once the sequencing run is done.

ONT’s initial sequencing raw data will be in a format called fast5. A local instance of guppy (should it not already be installed on your system – which you can check with guppy_basecaller --version) through a direct binary file download from the ONT website is the simplest way to process your data. The files are located here: https://community.nanoporetech.com/downloads (ONT account required). Simply download a binary appropriate for your hardware (CPU/GPU version) and put it in any directory on your PATH. Make sure all your fast5 files are copied to a separate, single directory for processing as well.

Once everything’s ready, you can check for guppy’s basecalling configuration files suitable for your flowcell and sequencing kit combination via the command below:

guppy_basecaller --print_workflows

This will output a table, listing in order: flowcell model, sequencing kit, corresponding configuration file name, and model name. For our purposes we only need to care about entries in the first three columns.

Here’s what a basecalling command for a Flongle flowcell and the RAD004 rapid sequencing kit would look like:

guppy_basecaller -i your_fast5_folder/ -s output_folder_name --flowcell FLO-FLG001 --kit SQK-RAD004

The above flowcell and sequencing kit combination corresponds to the configuration file dna_r9.4.1_450bps_hac in the table output by the --print_workflows command. So the above command is the same as the one below.

guppy_basecaller -i your_fast5_folder/ -s output_folder_name -c dna_r9.4.1_450bps_hac

For the curious, the configuration file name follows this format:

dna_r9.4.1(flowcell chemistry)_450bps_hac(accuracy mode - in this case referring to high accuracy HAC)

Make sure to check your flowcell chemistry! If in doubt, match against the table from the --print_workflows command by looking up your flowcell and sequencing kit models.
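One quick way to do that lookup without scrolling through the whole table (flowcell/kit names here are the ones from the example above):

```shell
# Filter the workflow table down to one flowcell/kit combination.
guppy_basecaller --print_workflows | grep FLO-FLG001 | grep SQK-RAD004
```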

If you’re interested in super quality basecalling, you need to call the configuration file directly (as of the date of this note), like below:

guppy_basecaller -i your_fast5_folder/ -s output_folder_name -c dna_r9.4.1_450bps_sup.cfg

Eventually you’ll end up with an output folder containing passed vs failed quality fastq reads, along with a sequencing summary file and one or more logs. At this point I usually run NanoStat (https://github.com/wdecoster/nanostat) on the summary file to get a feel for the read quality. You can install it on your system via conda or pip3.

NanoStat --summary sequencing_summary.txt --readtype 1D -n outputfile.txt
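The same top-scoring reads NanoStat reports can also be pulled directly from the summary file – print each read’s mean qscore and length, then sort numerically (the column names are guppy’s own; their positions are looked up from the header):

```shell
# Emit "qscore <tab> length" per read, highest-quality reads first.
awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) {if ($i == "mean_qscore_template") q = i;
            if ($i == "sequence_length_template") l = i} next}
            {print $q "\t" $l}' sequencing_summary.txt | sort -gr | head -5
```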

I like to look at the top 5 highest mean basecall quality scores and their read lengths from the NanoStat output file and use them for the next step – filtering the raw data using filtlong (https://github.com/rrwick/Filtlong). Before the filtering step though, you’ll want to cat all the fastq files you want to use for downstream processes into a single file. I normally just start with the fastq reads in the pass folder of the guppy output, and then go back and include the failed quality raw reads should we run into anomalies or other trouble down the road.

cat *.fastq > all_data.fastq

Now we’re ready to filter the data with filtlong. My personal preference here is to keep as much data as possible while maintaining a minimal degree of quality. The idea is that as long as we can get a structurally sound assembly with decent contiguity, we can make up for possibly low quality reads with k-mer cluster-based assembly and multiple polishing. We’re using a long-read platform, might as well play to its strengths. So I’ll use the shortest read length among the top 5 scoring reads as the minimum length cutoff. Let’s assume it’s 1000bp.

filtlong --min_length 1000 --keep_percent 95 all_data.fastq > filtered_data.fastq

For reference, you can use the command below to check the number of reads in the final filtered read set:

echo $(cat filtered_data.fastq | wc -l)/4 | bc

And here’s a handy way to count the number of bases in your raw read set:

cat filtered_data.fastq | paste - - - - | cut -f2 | tr -d '\n' | wc -c

Okay, now we have the filtered data set to carry us through the rest of the assembly process. In many cases this should be the extent of filtering necessary to get a reasonably complete genome of your microbe. If many of the following processes fail or result in anomalous behavior, there’s a good possibility that you’ll simply have to go back and get more data through another sequencing run. I’ve generally found long-read assembly to be a bit of a binary – either you have enough data for a reasonable assembly, or you don’t. But, of course, your mileage may vary (in fact, we’re currently working on a puzzling assembly of a previously unsequenced microbe where the genome might be complete or about half a million base pairs short. It’s hard to tell without a reference!).

Since we’re assuming a sequencing case without a good reference genome, my first step is to run a quick exploratory assembly and see where the pieces fall. Think of it like a poster study prior to committing to a full painting of the subject – we’re interested in a quick, broader layout of things.

Poster study for a Caravaggio painting. I can’t remember where I found this one…

Of all the long-read assemblers I’ve used so far, flye (https://github.com/fenderglass/Flye) tends to produce the most consistently reliable output. You can install it in a contained environment via conda, or by cloning the git repo and compiling or installing it directly. Here’s a quick sample of what your command might look like with HAC or SUP long reads, but please do experiment for your own genome set!

flye -o output_dir --threads 8 --nano-hq filtered_data.fastq

The output folder will include a whole bunch of items, but for now we’re interested in the assembly_graph.gfa and assembly.fasta files (and possibly the flye.log file if you want to go through the assembly steps). The assembly.fasta file is self-explanatory. The GFA (Graphical Fragment Assembly, https://github.com/GFA-spec/GFA-spec) file is essentially a representation of paths through the reads in the assembly connected via overlaps – non-branching paths are resolved as contigs. The file itself contains sets of sequences and their overlaps – but its key advantage (for an amateur like myself, at least) is allowing rapid visualization of the state of the assembly using a package like Bandage (http://rrwick.github.io/Bandage).
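As an illustration, a toy GFA for a single circularized contig might look like this (the 14 bp sequence is made up) – the S line carries the sequence, and the L line joining contig_1’s end back to its own start is what Bandage draws as a circle:

```text
H	VN:Z:1.0
S	contig_1	GATTACAGATTACA	LN:i:14
L	contig_1	+	contig_1	+	0M
```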

Here’s what a graphical representation of your exploratory assembly might look like. When dealing with a microbial genome, it’s usually pretty safe to assume that the largest circularized contig in an assembly is the chromosome and the rest are extrachromosomal elements like plasmids (there are exceptions – linearized plasmids and chromosomes do happen in nature, DOI: 10.1111/j.1365-2958.1993.tb00963.x). In my experience (again, your mileage may vary), a decent microbial raw data set should give you a decent representation of a chromosome at this stage. If your assembly is failing to give you at least a chromosome here, it’s time to consider going back and obtaining more sequencing data. In the examples below it’s easy to see that the first assembly came out looking pretty great (as far as retrieving broader structural elements is concerned) and the second assembly is going to be a little rocky, though it looks like we recovered the chromosome.

Now, just having one assembly and its graph doesn’t really tell us much about how accurate they are in the real world. It’s fully possible to have ‘assembled a genome’ only to have it turn out to be bogus. In order to get a somewhat clearer picture of what’s happening, we’ll have to gather some external data. Let’s start with assemblies on NCBI Genbank: https://www.ncbi.nlm.nih.gov/assembly/

Depending on the taxon of the organism you’re sequencing (you’ve already done rudimentary 16S barcoding, right?) we can look for additional information by going through https://www.ncbi.nlm.nih.gov/taxonomy – but in this case we’re looking for assembled genomes and CDS of microbes in the general phylogenetic vicinity of what we’re sequencing.

A query on the NCBI Genbank assemblies page for refseq, complete genome (all genomic compartments are closed), and representative genomes (no 8000 different genomes of E. coli K-12 please!) might look like this:

"YourPhylaOrClass"[Organism] AND "latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter]

If you’re working with a particularly unknown phylum or class, you’ll have to remove some filters – maybe complete genome – and tweak around until you have at least 12~20 assemblies. Once you get enough hits, use the “send to” option on the upper right corner of the page and download the table of assembly accession IDs, like in the screenshot below.

And we can filter out the GCF (NCBI refseq) column from the table with the one-liner below, leaving us with just the accessions we need. Make sure to check that you’re getting the expected number of genome accessions by running the follow-up command:

awk '{print $3}' assemblies.txt | tail -n +2 > your_target_gcf.txt

wc -l your_target_gcf.txt 

With the GCF genome accession list ready, let’s start out by building a quick phylogenetic tree. The goal here is to see if our exploratory genome falls within its evolutionarily expected place on the tree of life, and to identify any potential close relatives that could provide us with reference points. My absolute favorite pipeline for this process is GToTree – https://github.com/AstrobioMike/GToTree – which uses single copy gene sets to calculate evolutionary distances between supplied genomes. I can safely say that discovering the package from AstrobioMike got me started on phylogenetics itself. An easy way to install the pipeline is through conda:

conda create -y -n gtotree -c conda-forge -c bioconda -c astrobiomike gtotree

And running the package might look something like this:

conda activate gtotree

GToTree -a your_target_gcf.txt -f name_of_local_fasta_files.txt -H Bacteria -t -L Class,Species,Strain -o output_dir_name

Please note that the number of genomes listed in the target_gcf.txt file seems to count toward your IP’s NCBI download activity – so running overly large lists of genomes over and over again will throttle or cut off your access to NCBI from time to time. It’s also worth noting that the -H flag denotes the HMM file to be used for the phylogenetic analysis (you can check for included files by using the command gtt-hmms), and GToTree allows users to supply their own SCG HMM compilation. I’ve covered a pretty rudimentary way of putting together a custom HMM file in a previous note here (which itself is my study note from AstrobioMike’s material!): https://naturepoker.wordpress.com/2021/06/07/making-use-of-single-copy-gene-hmm-set-in-mostly-bash-way/

GToTree’s output files are extremely detailed and well thought-out, and just describing different ways to analyze them is worth a whole post of its own (which will be posted soon). For now, we’re interested in the GToTree_output.tre file, which is a phylogenetic tree of all the analyzed genomes. https://itol.embl.de and Dendroscope (https://software-ab.informatik.uni-tuebingen.de/download/dendroscope3/welcome.html) are my favorite ways to visualize them right now, though visualization/annotation packages for phylogenetic tree files are a rapidly evolving field. Here’s a phylogenetic tree built from a Deinococcus genome set.

An example phylogenetic tree from GToTree package. Red line marks our test assembly, blue line marks closest phylogenetic neighbor.

In this case we can see that our assembly (marked with the red line) is clearly a Deinococcus species, and that its closest phylogenetic neighbor is Deinococcus proteolyticus (marked with the blue line). The analysis that generated this figure ended up letting us use the complete sequences of the Deinococcus proteolyticus plasmids to align against and verify recovery of plasmids in our own assembly. It turns out the largest plasmids in D. proteolyticus and D. radiophilus share an uncanny degree of sequence homology (data being compiled for preprint use soon!). Finding closely related genomes of our microbe of interest will be necessary for essentially every sort of analysis we can do at this point – general alignment, kmer analysis, and of course, annotation.

Before proceeding further, we’re going to use a tool called ncbi-genome-download (https://github.com/kblin/ncbi-genome-download) to compile a protein fasta of our analysis group (depending on your data set, I’d recommend keeping the protein faa list on the smaller side. NCBI WILL throttle or cut you off if you download things a bit too liberally). Running the script with the list of accessions will look something like this:

ncbi-genome-download bacteria -A your_target_gcf.txt -F protein-fasta -o your_target_gcf_fasta

Once all the sequences are downloaded, we need to gunzip them and cat them together into a single annotation source file. An easy way to do it from the top directory (where all the individual GCF_ folders are located) might look like:

shopt -s globstar 

mkdir faa

cp **/*.gz faa

gunzip faa/*.gz

cat faa/*.faa > reference.faa 

What we’ve done so far is compile a reference protein fasta from a list of microbes phylogenetically related to our microbe of interest. We’re going to use the protein fasta with our flye exploratory assembly and the Prokka annotation pipeline (https://github.com/tseemann/prokka). Prokka is an annotation pipeline that links together a whole bunch of disparate processes, like RNA prediction, CDS prediction, and blast searches against supplied databases. I find the easiest way to get Prokka up and running is a self-contained conda environment dedicated to the pipeline (otherwise you might find yourself with broken packages later – in general, keeping each pipeline in its own environment is good practice with conda).

conda create -n my_prokka prokka

conda activate my_prokka

If you don’t have samtools or pyfaidx installed system-wide (they’re used often enough that I’d recommend installing the latest versions on your base OS rather than as conda packages), now is the time to do so. Once one of the tools is available on your system, let’s split our exploratory assembly.fasta file. I do this because most of the time I want an easier way to see whether any of the recovered contigs are fully featured plasmids/chromosomes.

faidx --split-files assembly.fasta

The above command will create a .fai index file and split the assembly into separate files by contig. If you want to avoid duplicate annotation during the Prokka step, rename the original assembly.fasta file.
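
Since deciding whether a contig could be a complete plasmid or chromosome starts with its size, a quick length-per-contig check is handy here. A small awk sketch over a toy fasta (swap in your own assembly.fasta; the same numbers also sit in the first two columns of the .fai index):

```shell
# Toy fasta standing in for a real assembly
cat > assembly_demo.fasta <<'EOF'
>contig_1
ACGTACGTACGT
ACGT
>contig_2
ACGTACGT
EOF

# Sum sequence lengths per contig header
awk '/^>/{name=substr($0,2); next} {len[name]+=length($0)}
     END{for (n in len) print n, len[n]}' assembly_demo.fasta | sort
# contig_1 16
# contig_2 8
```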

Now, let’s use a quick bash loop to cycle through all fasta files in your working directory (it’s better to always create separate working directories and copy over original files to process) and annotate them using Prokka and the reference protein fasta. Feel free to throw in a couple of closely related genomes found through the GToTree phylogenetic tree step – they’ll act as control for our own data.

for F in *.fasta; do

N=$(basename "$F" .fasta);

prokka --kingdom Bacteria --proteins reference.faa --rfam --locustag "$N" --outdir "$N" --prefix "$N" "$F";

done

We’re interested in gene count and rRNA recovery at this exploratory assembly and analysis stage. The assembly hasn’t been properly polished yet, so the data we have now won’t be fully representative of what we’ll get once we finish everything. Long-read assemblies still suffer from insertion-deletion issues, which often result in over-prediction of CDS in the genome. How does the CDS count in the exploratory assembly compare to the Prokka output of its closest phylogenetic neighbors? What about rRNA count and location? Are they located in clusters on a single contig or spread across multiple contigs? These are all questions worth considering – but there’s also no need to spend too much time here. Remember, we’re still at the exploratory analysis stage – a poster study for the real painting to come. The primary goal is to make sure there aren’t unexplainable anomalies (does your Deinococcus genome have contigs full of Bacillus genes? Ten times the CDS count of the nearest neighbor? No rRNA sites?), and to give ourselves some understanding of what we’re dealing with in a somewhat broader biological context.
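
Prokka writes a plain-text summary for each run (at outdir/prefix.txt) with counts like CDS: and rRNA:, which makes the neighbor comparison easy to script. A sketch using mock summary files standing in for real Prokka output – the directory and genome names here are hypothetical:

```shell
# Mock Prokka summaries – real ones are written to <outdir>/<prefix>.txt
mkdir -p contig_1 neighbor_1
printf 'contigs: 3\nCDS: 3050\nrRNA: 9\n' > contig_1/contig_1.txt
printf 'contigs: 1\nCDS: 2980\nrRNA: 9\n' > neighbor_1/neighbor_1.txt

# Tabulate CDS and rRNA counts across every annotated genome
for T in */*.txt; do
  printf '%s\t%s\t%s\n' "$T" "$(grep -m1 '^CDS:' "$T")" "$(grep -m1 '^rRNA:' "$T")"
done
```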

The final step in exploratory analysis is the simplest one – BUSCO (https://busco.ezlab.org/) analysis of our exploratory assembly against a small set of the closest phylogenetic neighbors identified earlier in the process. BUSCO stands for Benchmarking Universal Single Copy Orthologs, and it runs on the same kind of universally and taxonomically expected single copy gene sets as GToTree from our previous step. Simply put, this analysis tries to see whether the microbe in question has all the genes it needs to live, plus all the genes most members of a particular evolutionary clade tend to have for whatever reason. We’ll be utilizing conda again for this one. Keep in mind that BUSCO needs to download a somewhat sizeable amount of data to its installation location in order to run.

conda create -n my_busco busco

conda activate my_busco

busco -m genome -i assembly.fasta -o assembly_busco -l lineage_data_set        

-m denotes the mode, in our case genome. -o is the output directory. -l should be followed by a lineage dataset name (e.g. aquificae_odb10); the available datasets can be listed by running the command:

busco --list-datasets

Going through the BUSCO output can also yield tons of interesting analysis, but for now we’re just interested in the short summary txt file in the output directory. Here are two example outputs:

        ***** Results: *****

        164     Complete BUSCOs (C)                        
        164     Complete and single-copy BUSCOs (S)        
        0       Complete and duplicated BUSCOs (D)         
        21      Fragmented BUSCOs (F)                      
        9       Missing BUSCOs (M)                         
        194     Total BUSCO groups searched 

        ***** Results: *****
        22    Complete BUSCOs (C) 
        22    Complete and single-copy BUSCOs (S) 
        0    Complete and duplicated BUSCOs (D) 
        61    Fragmented BUSCOs (F) 
        41    Missing BUSCOs (M) 
        124    Total BUSCO groups searched 
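
The counts in these summaries convert to the completeness percentages discussed below with a single awk pass. A sketch fed the first example block (point it at your own short_summary*.txt file in the BUSCO output directory):

```shell
# Copy of the first example summary above, as a demo input file
cat > short_summary_demo.txt <<'EOF'
        164     Complete BUSCOs (C)
        164     Complete and single-copy BUSCOs (S)
        0       Complete and duplicated BUSCOs (D)
        21      Fragmented BUSCOs (F)
        9       Missing BUSCOs (M)
        194     Total BUSCO groups searched
EOF

# Completeness = complete BUSCOs / total groups searched
awk '/Complete BUSCOs \(C\)/{c=$1} /Total BUSCO groups/{t=$1}
     END{printf "complete: %.1f%%\n", 100*c/t}' short_summary_demo.txt
# complete: 84.5%
```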

I’ve seen an unusual lack of BUSCO hits in long-read assemblies happen in two ways – either the raw reads don’t contain the genes in the first place, indicative of low coverage, or the quality of the sequences is low, making the output especially indel prone. Indels in this context can chop up genes by terminating them unexpectedly, or by putting start codons in nonsense locations (the eagle-eyed among you will eventually notice two or more consecutive annotations of the same gene in unpolished long read assemblies. Those cases are largely due to indel errors). BUSCO scores are a pretty simple way to gauge the quality of the data we’re working with, and if you have decent phylogenetically related genomes on the side, they give us a goal to build toward.

Now, the above two BUSCO outputs are especially interesting because they’re more or less on the extreme opposite ends of each other. The top set is from an archaeal assembly, and shows recovery of 164 (84.5%) single copy ortholog genes. The bottom one is from a bacterial assembly and shows a depressing 22 (17.7%) single copy ortholog gene recovery. These two BUSCO scores are from the two GFA example assemblies shown earlier in this note. Which one belongs to which?

It turns out that the assembly that showed clean, well circularized contiguity is the one with 17.7% gene recovery, and the assembly with problematic fragmentation is the one with a relatively impressive 84.5% gene recovery. Neither genome has been polished yet. I can also tell you that among the two examples, the bacterial genome with 17.7% gene recovery turned out to be a very accurate representation of all the genomic compartments present in the microbe, as backed by third party literature and analysis. From my experience, it’s entirely possible to have a genome of disappointing base pair level quality that nevertheless represents an impressive picture of larger structural elements, and another that shows promising base pair level quality yet looks scrambled when analyzed at a larger architectural level.

Alright! Now we’re done with the initial, exploratory genome assembly and (extremely surface level) analysis. By this point we should at least have a vague idea of the following:

1) Do we have enough raw data for a reasonable assembly?
2) Is the sequenced microbe what we think it is?
3) Is the raw data what we think it is?
4) What’s the quality we can reasonably expect out of our final assembly?
5) What other genomes can we look at if we’re feeling lost?

And in case you’re worried about the wall of text so far – everything written here up to this point (outside of basecalling) is one leisurely evening’s work for someone with a New York City public high school education. In fact, it’s so simple that I’ve automated away most of the processes so far, and will be releasing the tool on the Binomica Labs github (https://github.com/Binomica-Labs) pretty soon.

It’s finally time to create a ‘final’ working assembly based on the data we have, starting again armed with the values from our initial, broad strokes study. The main difference here will be the use of multiple assemblers at once, utilizing the trycycler pipeline to sort through potentially spurious contigs and build a consensus assembly from sorted and reconciled clusters of contigs. During the earlier days of learning to put together long-read based assemblies, I noticed that running the same raw data through different assemblers could often produce genomes just different enough from each other to be problematic. Perhaps there’s a nonsense contig that seems oddly like a digital artifact. Maybe one assembly has a plasmid-looking contig precisely double the size of similar ones from other assemblers. Maybe there’s a contig shared across four assemblers, but two of them circularize it and the other two don’t. There wasn’t a great way to fix those cases outside of deleting/ignoring entire contigs or otherwise making a subjective decision call and manipulating them manually, which simply wasn’t a viable option without a reliable, complete reference genome. The newly introduced trycycler pipeline (https://github.com/rrwick/Trycycler) more or less fixes that problem. A detailed description of trycycler and its conceptual basis is laid out excellently on the project’s wiki page (https://github.com/rrwick/Trycycler/wiki), and the author does an exemplary job describing commonly encountered issues in long-read assembly in general. I strongly recommend going through the material yourself.

For now, we just need to know that we’ll be going through the processes of:

1) Subsampling the raw reads to match the number of multiple assembler instances

2) Assembly using multiple assemblers – the more, and the more varied, the better

3) Clustering the contigs based on their mash distance – essentially we’re trying to group structurally meaningful elements together for further downstream processing: all chromosomes from the different assemblers in one group, and so on.

4) Reconciling the contigs within each individual cluster – making sure the sizes of the contigs are similar, that all contigs are circularized if they should be (by comparing end alignments), rotating the contigs so they start from the same position, and aligning them again to make sure they are reasonably similar.

5) Perform multiple sequence alignment on the clustered and reconciled contigs

6) Partitioning the raw reads among the aligned contig clusters based on best alignment – so all the raw reads corresponding to a chromosomal cluster end up there, plasmid-matching reads go to the plasmid clusters, and so on.

7) Putting everything together to create a gap-filled, error-corrected consensus sequence on a per-cluster basis.

As usual, the quickest way to get things up and running is to create an independent conda environment for trycycler.

conda create -n my_trycycler trycycler

conda activate my_trycycler

The process here can get mildly tricky, since we’ll also want to install and set up all of the target assemblers within the same conda environment or on the base OS. In this example I’ll be using Flye, Miniasm+Minipolish, Raven and Unicycler, but your selection can differ. Here, Flye, Raven and Unicycler were installed using conda; Miniasm+Minipolish was installed from the author’s git via the command below:

pip3 install git+https://github.com/rrwick/Minipolish.git

And now to subsample the filtered working raw reads – note that the --count flag sets the number of subsamples we want to create:

trycycler subsample --reads filtered_data.fastq --out_dir minion_subsets --count 16

Once the subsampling is complete, we can run our 16 assemblies. The best way to do this is to write out a script and just automate the process away. You can write something similar to the below, save it as assembly.sh, make it executable with chmod +x assembly.sh, and run it with ./assembly.sh

#!/usr/bin/env bash

threads=8    # set to the number of CPU threads available on your machine

mkdir minion2_assemblies

flye --nano-hq minion_subsets/sample_01.fastq --threads "$threads" --out-dir assembly_01 && cp assembly_01/assembly.fasta minion2_assemblies/assembly_01.fasta && rm -r assembly_01

miniasm_and_minipolish.sh minion_subsets/sample_02.fastq "$threads" > assembly_02.gfa && any2fasta assembly_02.gfa > minion2_assemblies/assembly_02.fasta && rm assembly_02.gfa

raven --threads "$threads" minion_subsets/sample_03.fastq > minion2_assemblies/assembly_03.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_04.fastq --threads "$threads" --out assembly_04 && cp assembly_04/assembly.fasta minion2_assemblies/assembly_04.fasta && rm -r assembly_04

flye --nano-hq minion_subsets/sample_05.fastq --threads "$threads" --out-dir assembly_05 && cp assembly_05/assembly.fasta minion2_assemblies/assembly_05.fasta && rm -r assembly_05

miniasm_and_minipolish.sh minion_subsets/sample_06.fastq "$threads" > assembly_06.gfa && any2fasta assembly_06.gfa > minion2_assemblies/assembly_06.fasta && rm assembly_06.gfa

raven --threads "$threads" minion_subsets/sample_07.fastq > minion2_assemblies/assembly_07.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_08.fastq --threads "$threads" --out assembly_08 && cp assembly_08/assembly.fasta minion2_assemblies/assembly_08.fasta && rm -r assembly_08

flye --nano-hq minion_subsets/sample_09.fastq --threads "$threads" --out-dir assembly_09 && cp assembly_09/assembly.fasta minion2_assemblies/assembly_09.fasta && rm -r assembly_09

miniasm_and_minipolish.sh minion_subsets/sample_10.fastq "$threads" > assembly_10.gfa && any2fasta assembly_10.gfa > minion2_assemblies/assembly_10.fasta && rm assembly_10.gfa

raven --threads "$threads" minion_subsets/sample_11.fastq > minion2_assemblies/assembly_11.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_12.fastq --threads "$threads" --out assembly_12 && cp assembly_12/assembly.fasta minion2_assemblies/assembly_12.fasta && rm -r assembly_12

flye --nano-hq minion_subsets/sample_13.fastq --threads "$threads" --out-dir assembly_13 && cp assembly_13/assembly.fasta minion2_assemblies/assembly_13.fasta && rm -r assembly_13

miniasm_and_minipolish.sh minion_subsets/sample_14.fastq "$threads" > assembly_14.gfa && any2fasta assembly_14.gfa > minion2_assemblies/assembly_14.fasta && rm assembly_14.gfa

raven --threads "$threads" minion_subsets/sample_15.fastq > minion2_assemblies/assembly_15.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_16.fastq --threads "$threads" --out assembly_16 && cp assembly_16/assembly.fasta minion2_assemblies/assembly_16.fasta && rm -r assembly_16
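
Since the script above is the same four commands repeated, it can also be collapsed into a loop that rotates through the assemblers by sample index. A sketch in dry-run form (each command is echoed rather than executed – remove the echo and add back the copy/cleanup steps to run it for real):

```shell
#!/usr/bin/env bash
threads=8

for i in $(seq -w 1 16); do
  case $(( (10#$i - 1) % 4 )) in   # 10# guards against 08/09 being read as octal
    0) echo "flye --nano-hq minion_subsets/sample_$i.fastq --threads $threads --out-dir assembly_$i" ;;
    1) echo "miniasm_and_minipolish.sh minion_subsets/sample_$i.fastq $threads > assembly_$i.gfa" ;;
    2) echo "raven --threads $threads minion_subsets/sample_$i.fastq > minion2_assemblies/assembly_$i.fasta" ;;
    3) echo "unicycler --long minion_subsets/sample_$i.fastq --threads $threads --out assembly_$i" ;;
  esac
done
```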

Once the assemblies are complete, we cluster the contigs together:

trycycler cluster --assemblies minion2_assemblies/*.fasta --reads filtered_data.fastq --out_dir trycycler_workdir

We need to do some manual curation of the clusters here. We could end up with 60 or more clusters, each containing a dozen contigs, depending on the quality of the raw reads, their coverage, and the assembly quality. The .newick file generated during the process is absolutely necessary to progress further – its visual representation on a cladogram suite like itol or Dendroscope is how we’re going to decide which clusters to keep. Here are two examples of what we could see. One is from what ended up being a very successful bacterial assembly, and the other is from a problematic archaeal assembly that likely requires more raw reads.

The difference between the two is relatively straightforward – the eventually successful assembly experiment has cleanly defined cluster groups, with minimal mash distances between the contigs (as reflected in the clade lengths). The problematic genome assembly experiment shows a far more scattershot representation of clusters and contigs. There are far too many separate clusters with too few contigs in them, and they’re nowhere near as neatly grouped as in the graph from the bacterial assembly experiment. A graph like this calls for another decision – depending on the goal of the experiment, we might be better off going back and gathering additional raw data. In this particular case we proceeded, hoping to recover at least a clean picture of the chromosome. Problematic clusters and potentially problematic contigs within the clusters need to be quarantined manually – usually by changing their names so that bulk script processes downstream don’t pick them up anymore.

Once we’ve decided which clusters to keep (or which ones to remove, depending on the quality of the data and assemblies), we can move on to the reconciliation step. Note that reconciliation runs at the per-cluster level – if we’re keeping cluster_001 through cluster_010, we’d have to run the below command ten times, swapping out the cluster directory name.

trycycler reconcile --reads filtered_data.fastq --cluster_dir trycycler_workdir/cluster_001/
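
Rather than retyping the command for each cluster, the per-cluster runs can also be wrapped in a loop. A dry-run sketch over mock directories (drop the echo for the real thing – and note that reconcile output still needs to be inspected cluster by cluster; the glob conveniently skips anything renamed to bad_cluster_*):

```shell
# Mock cluster directories standing in for real trycycler output
mkdir -p trycycler_workdir/cluster_001 trycycler_workdir/cluster_002 \
         trycycler_workdir/bad_cluster_003

# Loops over cluster_* only; bad_cluster_003 is never touched
for C in trycycler_workdir/cluster_*/; do
  echo trycycler reconcile --reads filtered_data.fastq --cluster_dir "$C"
done
```

The same loop pattern works for the later per-cluster msa and consensus steps.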

Depending on the outcome, we’ll have to remove some individual contigs from within the clusters as well. Reasons for removal could be abnormal contig size or mash distance, problems with circularization (when other matching contigs are circularized), or unusually high rates of indel occurrence. Here’s an example from a particularly messy cluster – in the sample below, it would be best to remove contigs G and K before proceeding further.

Maximum insertion/deletion sizes:
  A_contig_19:    0   17   23    2   18  205    7   14  292    2   27   27
  B_utg000001c:  17    0   23   17   12  205   17   14  292   16   21   21
  C_Utg714:      23   23    0   23   23  205   23   23  292   23   21   23
  E_contig_13:    2   17   23    0   17  205    7   14  292    2   26   26
  F_utg000001c:  18   12   23   17    0  205   18   12  292   17    9    9
  G_Utg722:     205  205  205  205  205    0  205  205  292  205  205  205
  I_contig_17:    7   17   23    7   18  205    0   14  292    7   27   27
  J_utg000001c:  14   14   23   14   12  205   14    0  292   13   21   21
  K_Utg724:     292  292  292  292  292  292  292  292    0  292  292  292
  M_contig_9:     2   16   23    2   17  205    7   13  292    0   26   26
  N_utg000001c:  27   21   21   26    9  205   27   21  292   26    0    5
  O_Utg698:      27   21   23   26    9  205   27   21  292   26    5    0

Once the reconciliation step is done, it’s time for multiple sequence alignment (MSA). This step also needs to be run on a per-cluster basis.

trycycler msa --cluster_dir trycycler_workdir/cluster_001

And then we run the partitioning process using the raw read data against the cluster directories. This can be run against all the cluster_* directories at once – the * symbol is a wildcard, meaning the command will run on every directory named in the format cluster_something. Make sure none of the rejected clusters are still around (I normally just rename rejected clusters to bad_cluster_068 and so on, which the wildcard won’t match)!

trycycler partition --reads filtered_data.fastq --cluster_dirs trycycler_workdir/cluster_*

And the consensus building step, to be carried out individually on each of the good clusters.

trycycler consensus --cluster_dir trycycler_workdir/cluster_001

And we’re done-ish! We can now concatenate the finished fasta files together into our final assembly, albeit an unpolished one. The cat command can be used to combine all the individual consensus fasta files together.

cat trycycler_workdir/cluster_*/7_final_consensus.fasta > trycycler_workdir/final_consensus.fasta

‘Polishing’ is a practice more or less introduced with the modern concept of long read assembly – think of it as a dedicated, more involved version of the read partition/consensus step we just went through in the trycycler process (don’t quote me on this though – this is where my understanding of the concept begins to waver). The general agreement as of this writing is that polishing is mandatory for long read based assemblies due to the technology’s inherent indel issue. I can tell you first hand that the difference in accuracy and quality between a polished and an unpolished long read assembly is night and day. And of all the standalone polishers out there, ONT’s own Medaka (https://github.com/nanoporetech/medaka) really changed the game.

I’m going to assume we’re going to follow the usual conda route for Medaka installation – in its own environment as usual.

conda create -n my_medaka medaka

However, medaka’s true strength lies in its machine learning trained models – and it looks like Medaka is trained on datasets that assume prior racon (https://github.com/isovic/racon) polishing under a very specific set of parameters. Personally I’m unsure whether racon pre-polishing is still necessary for Medaka, but we’ll do things by the book for the sake of this introduction.

conda create -n my_racon racon

conda activate my_racon

Using racon requires bwa for indexing and sam (alignment) file generation (you could try other aligners for sam generation as well, but YMMV). Once we have the programs set up on our system we can proceed:

bwa index final_consensus.fasta

bwa mem -t 8 final_consensus.fasta filtered_data.fastq > consensus_data.sam

racon -m 8 -x -6 -g -8 -w 500 -t 8 filtered_data.fastq consensus_data.sam final_consensus.fasta > racon_consensus.fasta

Once racon polishing is done, we can use the output file for the medaka polishing process as such:

conda activate my_medaka

medaka_consensus -i filtered_data.fastq -d racon_consensus.fasta -o final_assembly -t 8 -m r941_min_hac_g507

Note the -m flag, which selects the trained model to be used for the polishing process. You can check the list of available models by typing “medaka tools list_models”. Here, r941 refers to the 9.4.1 chemistry, min stands for MinION flowcell, hac stands for High Accuracy, and g507 refers to the guppy version used. The last part might not line up exactly with what we’ve used before, but in general it should be okay to keep guppy updated and use the highest version number available among the trained models. So, if the raw reads used for the assembly are based on super accuracy (SUP) basecalling, the model called here should be r941_min_sup_g507.
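
Since the model name decomposes into fields, it can also be assembled programmatically – handy when scripting polishing runs across flowcell or basecaller versions. A sketch assuming the chemistry_device_accuracy_guppy pattern described above (always cross-check the composed name against medaka tools list_models before using it):

```shell
chemistry=r941   # flowcell chemistry 9.4.1
device=min       # MinION flowcell
accuracy=sup     # must match the basecalling mode used on the raw reads
guppy=g507       # guppy version the model was trained against

model="${chemistry}_${device}_${accuracy}_${guppy}"
echo "$model"    # r941_min_sup_g507

# the composed name would then feed into medaka, e.g.:
# medaka_consensus -i filtered_data.fastq -d racon_consensus.fasta -o final_assembly -t 8 -m "$model"
```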

If you’re wondering how much of a difference the combination of trycycler and proper medaka polishing can make – here’s a pretty extreme example. Do you remember the earlier example of a bacterial genome assembly with 22 (17.7%) BUSCO single copy ortholog genes? Here’s the result of the same benchmark – same raw reads, but with trycycler assembly and medaka polishing.

        110     Complete BUSCOs (C)
        110     Complete and single-copy BUSCOs (S)
        0       Complete and duplicated BUSCOs (D)
        6       Fragmented BUSCOs (F)
        8       Missing BUSCOs (M)
        124     Total BUSCO groups searched

If you made it this far, congratulations! It’s time to run the GToTree and Prokka annotation pipeline again, and check for improvements in genome integrity.

One more reminder – the steps described here are far from a recommendation of best practice. The goal is to help any interested member of the public assemble their own long-read genome, should they ever want to, for whatever reason.

Also, I’d highly recommend keeping conda usage to a minimum when possible, and keeping each pipeline/package in its own self-contained environment. While conda’s fantastic for getting things started, nothing beats a well crafted and laid out base system.

I’m working on two more follow-ups to this article, all part of the “Genome assembly for artists” zine planned for later this year. The second will deal with short-read based assembly and long-short read hybrid assembly. The third will address extreme low coverage read scenarios, the very, very rudimentary basics of amateur comparative genomic analysis, and a smattering of introductory shell scripting.

Deinococcus note November 2nd

Quick note on what might be a second-to-last test of the HMW Deinococcus genome extraction.

I set up a 24 hour ethanol precipitation reaction following the usual lysis step. The Deinococcus pellet, after the 2×15 minute 37C incubation with lysozyme+TE+RNase, was resuspended in 140ul of the lysis buffer plus 10ul of proteinase K and incubated at 60C for one hour.

800ul of freezer-cooled 100% ethanol was then carefully layered on top of the lysate, and 40ul of 3M sodium acetate was layered into the mixture. The effects were pretty immediate – I observed the usual shimmering as DNA started to precipitate out of the mixture, as well as an almost instantaneous precipitation of some white strands, much like the DNA pellets observed during the ethanol wash steps of previous genome extraction tests.

It’s difficult to see here – the arrow indicates a built-up layer of DNA precipitating out of the Deinococcus lysate

The goal here is simple – now that we have a very reliable lysis protocol, I want to see the difference between genome precipitation reactions based on room temperature isopropanol (cold isopropanol, which some people recommend, stands a very real chance of over-precipitating the salts in the lysate solution) and cold ethanol.

DNA is less soluble in isopropanol, so it’s possible to run a rather quick reaction with room temperature reagents, but isopropanol also has a habit of crashing out the salts in the solution. Ethanol precipitation is slower, but has a higher chance of allowing the background salt content to remain soluble, resulting in a cleaner DNA prep. Prolonged ethanol precipitation does seem to let smaller, spurious DNA fragments precipitate out of the solution as well, but if we have a relatively shear-free genome prep that shouldn’t be a problem in the first place.
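
The reagent math behind the two routes is simple – 1/1 volume for room temperature isopropanol and 1/10 volume of 3M sodium acetate (the ratios used in these notes), with roughly 2 volumes for cold ethanol (a common rule of thumb, not something fixed in these notes). A quick shell arithmetic sketch:

```shell
lysate=600                      # lysate volume in ul

iso=$lysate                     # 1/1 volume room temperature isopropanol
etoh=$(( lysate * 2 ))          # ~2 volumes cold ethanol (common rule of thumb)
naoac=$(( lysate / 10 ))        # 1/10 volume 3M sodium acetate

echo "isopropanol: ${iso}ul  ethanol: ${etoh}ul  NaOAc: ${naoac}ul"
# isopropanol: 600ul  ethanol: 1200ul  NaOAc: 60ul
```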

The precipitation will continue at -20C for the next 24 hours, until 9PM November 3rd.

Deinococcus note November 1st

Continuing on with optimizing the Deinococcus genome extraction for long-read sequencing. While swapping out the Thermo kit lysis buffer for Sambrook bacterial lysis buffer greatly improved the secondary lysis step (Deinococcus absolutely requires double lysis – lysozyme to dissolve the outer envelope, then another lysis step to disrupt the bacterial membrane), it wasn’t quite enough to improve genome yield. Fortunately, some minor changes to the lysate harvesting method finally got me the result I was looking for (granted, still not as good as what we extracted from D. radiophilus – could it really be down to the difference in buffers used?).

Here’s how the Deinococcus genome extract looks as of the 13th run, on October 31st:

Result from the latest extraction along with some tests – from left to right: NEB 1kb plus, 11th run lysate test, 12th run supernatant precipitate 1, 12th run supernatant precipitate 2, 13th run sample 1, 13th run sample 2, NEB 1kb extend

The samples of interest are in the third- and second-to-last lanes on the gel above. Nanodrop readings range from 110ng/ul to 320ng/ul depending on the run, which is similar to how our previous high molecular weight DNA solutions behaved. I’m somewhat impressed by the absolute lack of visible shearing on the gel for both samples – the entire extraction was carried out using regular pipette tips (we normally use wide-bore tips for sensitive experiments, but couldn’t really afford to this time), and I really wasn’t particularly careful with pipetting and handling of the liquid sample, given the prototype nature of this genome extract.

Another gel picture comparing the 26th genome extraction run (3rd in series) to today’s extraction (13th in series):

From left to right, NEB 1kb extend, 13th run sample 1, 13th run sample 2, 3rd run sample 1, 3rd run sample 2

DNA concentrations for the two right-most samples (3rd run samples 1 and 2) are still in the 110ng/ul and 98ng/ul ranges, so the somewhat faded looking bands for the two have more to do with how bright the 13th run extracts are in comparison.

The improved high molecular weight genome extraction protocol for Deinococcus (essentially just using detergents and TE buffer) is as follows:

> Preheat 37C incubation (PCR machine) and 60C incubation baths

> Spin down 1ml of 2-day room temperature culture at max speed for 5 minutes (~14k rpm)

> Completely remove supernatant 

> Resuspend with 100ul of resuspension buffer (TE buffer with RNase) and 50ul of lysozyme (from 50mg/ml stock)

> Vortex and mix vigorously

> Transfer to labeled 0.2ml PCR tubes

> Freeze at -20C for 60 minutes
* It's possible to make multiple 0.2ml aliquots here and store them at -20C for future experiments 

> Incubate at 37C for 15 minutes

> Vortex the tubes vigorously for 5 seconds

> Incubate at 37C for 15 minutes

> Label fresh 1.5ml tubes and add 490ul Sambrook lysis buffer 

> Transfer bacterial sample to the new 1.5ml tubes containing lysis buffer

> Add 10ul of proteinase K to each tube, pipette to mix

> Vortex for 5 seconds

> Incubate lysate at 60C for 1 hour - insert tubes at angle and do not disturb

> Cool the tubes for 1 minute - take care not to disturb the sample, the lysate should be settled and layered

> Carefully transfer supernatant to freshly labeled 1.5ml tubes, throw out the old tubes

> Add 600ul absolute isopropanol (1/1 volume) and incubate for 1 minute at room temperature

> Add 60ul 3M Sodium Acetate (pH 5.50) (1/10 volume) 

> Cap and invert the tube carefully 10 times (DNA should start precipitating at this time)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant - semi opaque DNA pellet should be visible at the bottom of the tubes 

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant 

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant completely

> Dry the pellet completely - timing will vary depending on location

> Add 200ul preheated (~60C) elution buffer or DEPC-treated H2O directly onto the pellet

> Pipette gently to mix the pellet with elution solution

> Incubate at 60C for 30 minutes 

> Incubate at 4C overnight if the pellets are not dissolved, or are too gelatinous 
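The precipitation step above follows simple volume ratios (1/1 volume absolute isopropanol, 1/10 volume 3M sodium acetate), so scaling it to a different lysate volume is just arithmetic. A minimal sketch of that calculation – the function name `precipitation_volumes` is my own illustration, not part of any protocol software:

```python
def precipitation_volumes(lysate_ul):
    """Reagent volumes for the isopropanol precipitation step.

    Ratios from the protocol above: 1/1 volume absolute isopropanol
    and 1/10 volume 3M sodium acetate (pH 5.50).
    """
    return {
        "isopropanol_ul": lysate_ul,             # 1/1 volume
        "sodium_acetate_3M_ul": lysate_ul / 10,  # 1/10 volume
    }

# The protocol treats the lysate as ~600ul (100ul resuspension
# + 50ul lysozyme + 490ul lysis buffer + 10ul proteinase K,
# less pipetting losses):
print(precipitation_volumes(600))
# → {'isopropanol_ul': 600, 'sodium_acetate_3M_ul': 60.0}
```

The same ratios apply to the second protocol iteration further down, since the lysate volume is unchanged.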

I purposefully incubated the lysate with room temperature isopropanol here – since DNA is less soluble in isopropanol and will crash out and pelletize anyway, incubating the sample for too long in a colder solution might also precipitate out all the salts present in the lysate. It would be interesting to experiment with ethanol-based precipitation over 24 hours at -20C, however. I think it might result in a somewhat noisier (more short DNA fragments precipitating out and pelletizing) but cleaner (salts are more likely to stay dissolved in an ethanol solution) extraction overall, which might be safer to load onto our Nanopore flowcell.

Another important note is that the lysate is no longer spun down before the supernatant transfer. Simply allowing the lysate to settle via gravity and carefully skimming off the top layer for the precipitation steps dramatically increases the yield and efficiency of the process. It's certainly not foolproof – the lysed cell layer at the bottom of the sample tube isn't fixed in place and can be disturbed by jostling the sample about. However, DNA is already precipitating out of the lysate at this stage, and spinning the sample down as in a conventional precipitation protocol will trap most of the high molecular weight DNA in the pellet along with the cellular detritus.

Next experiments will revolve around 24 hour ethanol precipitation of the lysates and a double spin-down of the initial cell mass, to see if the current protocol can handle an increased cellular load (I think it can, based on how efficient the digest and lysis processes look).

Deinococcus note October 30th

The surprisingly eventful high molecular weight genome extraction tests for the new Deinococcus species continue. All this is a bit of a pain point inherent to self-funded amateur research, since purchasing specialty kits for different lab processes needs to be avoided whenever possible, both for the sake of knowledge and due to financial constraints.

I’m currently at the 10th iteration of the initial genome extraction protocol, based on our experiments with Halobacterium and Deinococcus radiophilus.

> Preheat 37C incubation (PCR machine) and 60C incubation baths

> Spin down 1ml of 2-day room temperature culture at max speed (~14k rpm) for 5 minutes

> Completely remove supernatant 

> Resuspend with 100ul of resuspension buffer (from miniprep kit, containing RNase) and 50ul of lysozyme (from 50mg/ml stock)

> Vortex and mix vigorously

> Transfer to labeled 0.2ml PCR tubes

> Freeze at -20C for 60 minutes

> Incubate at 37C for 30 minutes

> Label fresh 1.5ml tubes and add 490ul lysis buffer (Sambrook protocol) and 10ul proteinase K

> Transfer bacterial sample to the new 1.5ml tubes

> Vortex and mix

> Incubate lysate at 60C for 1 hour - insert tubes at angle

> Invert lysate 10 times and freeze for 1 minute

> Thaw and equilibrate at room temperature - lysate should be clear pink/red 

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Transfer supernatant to freshly labeled 1.5ml tubes, throw out the old tubes

> Add 600ul absolute isopropanol (1/1 volume) and incubate for 1 minute at room temperature

> Add 60ul 3M Sodium Acetate (pH 5.50) (1/10 volume) 

> Cap and invert the tube carefully 10 times (DNA should start precipitating at this time)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant - semi opaque DNA pellet might be visible at the bottom of the tube, but sometimes they are too small/translucent 

> Add 1000ul of 70% EtOH and pipette up and down carefully - do not dislodge the pellet (setting the pipette to 900ul for washing part helps)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant

> Add 1000ul of 70% EtOH and pipette up and down carefully - do not dislodge the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant fully

> Dry the pellet for 1~3 minutes at 37C - do not overdry

> Add 200ul preheated (~60C) elution buffer or DEPC-treated H2O directly onto the pellet

> Incubate at 60C for 30 minutes 

> Incubate at 4C overnight if the pellets are not dissolved, or are too gelatinous 

One caveat is that I’ve compressed the beginning of the protocol somewhat by preparing a batch of 1ml Deinococcus culture pellets resuspended in 100ul resuspension buffer + 50ul lysozyme and storing them at -20C. So most of my later runs have started with the 30 minute 37C incubation step.

It also turned out that the lysis buffer used for our last extractions did not contain much SDS/Triton at all – I swapped it out for a lysis buffer based on the Sambrook protocol below, usually used for protein extraction from E. coli (the lysis buffer is only responsible for breaking open cells already digested with lysozyme in the previous step).

Tris pH 8.0    10mM
KCl            50mM
EDTA pH 8.0    1mM
Triton-X-100   1.00%
IGEPAL-CA-630  0.50%
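Making up this buffer from common stock solutions is a straightforward C1V1 = C2V2 exercise. A quick sketch of the arithmetic – the stock concentrations below (1M Tris, 1M KCl, 0.5M EDTA, 10% detergent stocks) are my assumptions for illustration, not something specified in the notebook:

```python
def stock_volume_ml(final_conc, stock_conc, final_vol_ml):
    """C1*V1 = C2*V2, solved for the volume of stock to add (V1)."""
    return final_conc * final_vol_ml / stock_conc

# Assumed stock concentrations (not from the notebook): 1M Tris pH 8.0,
# 1M KCl, 0.5M EDTA pH 8.0, 10% Triton X-100, 10% IGEPAL CA-630.
final_vol = 50.0  # ml of lysis buffer to prepare
recipe = {
    "1M Tris pH 8.0":    stock_volume_ml(0.010, 1.0, final_vol),  # 10mM
    "1M KCl":            stock_volume_ml(0.050, 1.0, final_vol),  # 50mM
    "0.5M EDTA pH 8.0":  stock_volume_ml(0.001, 0.5, final_vol),  # 1mM
    "10% Triton X-100":  stock_volume_ml(1.00, 10.0, final_vol),  # 1.00%
    "10% IGEPAL CA-630": stock_volume_ml(0.50, 10.0, final_vol),  # 0.50%
}
for stock, ml in recipe.items():
    print(f"{stock}: {ml:.2f} ml")
print(f"H2O to {final_vol:.0f} ml: {final_vol - sum(recipe.values()):.2f} ml")
```

For 50 ml that works out to 0.5 ml Tris, 2.5 ml KCl, 0.1 ml EDTA, 5 ml Triton and 2.5 ml IGEPAL stocks, topped up with 39.4 ml of water.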

The increased lysozyme digest time and the lysis buffer swap resulted in pretty immediate improvement – here’s the latest DNA pellet, which I’m dissolving for gel verification and Nanodrop quantification tomorrow.