I’ve been away from this lab notebook for a little while – for good reasons. Pace of projects I’m involved in moved at breakneck speeds, and some of them being publication facing projects ended up enforcing a bit of self-imposed embargo for the stuff I’d normally write about. The first of the preprints should be coming out soon, so hopefully I’ll be able to share some interesting stuff here in a couple of week’s time.
This aspect of ’embargo before publication’ for an amateur/independent researcher is something that would need to be discussed at length in the future. Amateurs in biology tend to be isolated from the normal academic support infrastructure like lab meetings and casual conversations with peers within the institution. So an embargo in this these situations happen in far more solitary context, with the amateur often missing out on crucial peer to peer transfer of knowledge that increasingly seems necessary to development of the scientific craft. Publication culture for amateur and independent researchers would have to be fundamentally different in this regard.
But I digress, onto the topic for today – the long read polishing question.
Long read based assemblies by very nature of the technologies used suffer from indel issues – for almost any scenario, post-assembly correction step is mandatory. Granted, improvements I’ve seen in the field between just 2018 and now had been staggering – I’m getting the kind of raw reads and assemblies on today’s casual experiments that would have been unimaginable back when we first started out.
Introduction of HAC + SUP guppy calling and Medaka for polishing really changed the game – to the point that the usual basecalling-> assembly-> polishing cycle for ONT platform seem to be coalescing into something resembling a standardization (consisting of guppy SUP basecall -> flye assembly -> medaka polishing, IMHO).
Of the three main steps, post-assembly polishing was always trickier one of the bunch for me. Sure – Racon, Medaka followed by some sort of homopolymer specific polishing step tend to be the way to go. But should it be 4x Racon as was suggested in earlier Medaka literature? What about aligner to feed SAM files for Racon polishing? Super high accuracy lastal? Minimap? Tons of literature out there also suggest BWA despite the fact that we’re working solely with long reads. Or should we ignore the manual and follow some well-known bioinformaticians out there, sticking with BWA -> 1x Racon, Medaka and figure out the rest as we go along?
While checking for a new Medaka update on my work machine (1.6.0) I was reminded of the below passage on their official github (https://github.com/nanoporetech/medaka) and decided to test things out for myself at last.

I’ll process two raw assembly outputs- deino and meta_deino from Flye assembler. Deino is a simple –nano-hq mode assembly, and the meta_deino is a metaflye mode assembly (I suspect there’s a prophage in the genome, thus the second data set). For this very informal comparison, we’re just going to look at BUSCO score of the final genome and IDEEL indel estimation graph to check for broken or duplicated genes.
Two assemblies used here came from minimally quality controlled guppy SUP reads. More information below:
#Guppy 6.0.1+652ffd1
guppy_basecaller -i fast5/ -s SUP_fastq -c dna_r9.4.1_450bps_sup.cfg -x 'cuda:0' --gpu_runners_per_device 1 --chunk_size 500 --chunks_per_runner 128
#Filtered data stat
Mean read length: 18,757.6
Mean read quality: 13.2
Median read length: 11,292.0
Median read quality: 13.2
Number of reads: 69,387.0
Read length N50: 26,341.0
STDEV read length: 20,444.2
Total bases: 1,301,532,662.0
Number, percentage and megabases of reads above quality cutoffs
>Q5: 69387 (100.0%) 1301.5Mb
>Q7: 69387 (100.0%) 1301.5Mb
>Q10: 69387 (100.0%) 1301.5Mb
>Q12: 54104 (78.0%) 963.1Mb
>Q15: 5919 (8.5%) 85.2Mb
Top 5 highest mean basecall quality scores and their read lengths
1: 37.3 (1948)
2: 37.2 (1182)
3: 35.4 (3067)
4: 33.8 (2535)
5: 33.4 (4727)
Top 5 longest reads and their mean basecall quality score
1: 306879 (11.1)
2: 299065 (14.9)
3: 290924 (15.4)
4: 290778 (14.2)
5: 290638 (12.3)
Command for the two assemblies and their stats are:
#Deino assembly
flye --nano-hq deino_1k_90.fastq -t 8 -o deino_1k
##################################################
Total sequence composition is as follows
--------------------------------------------------
518903 A
519556 T
865827 C
868228 G
--------------------------------------------------
Total gapped sequence length is: 2772514
--------------------------------------------------
Total ungapped sequence length is: 2772514
--------------------------------------------------
GC content in deino_1k_raw.fasta is 62.54 %
##################################################
busco -m genome -i deino_1k_raw.fasta --auto-lineage-prok -o deino_1k_raw_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|22 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#meta_deino assembly
flye --nano-hq deino_1k_90.fastq --meta -t 16 -o meta_deino_1k
##################################################
Total sequence composition is as follows
--------------------------------------------------
548550 A
548984 T
908052 C
909337 G
--------------------------------------------------
Total gapped sequence length is: 2914923
--------------------------------------------------
Total ungapped sequence length is: 2914923
--------------------------------------------------
GC content in meta_deino_raw.fasta is 62.34 %
##################################################
busco -m genome -i meta_deino_raw.fasta --auto-lineage-prok -o meta_deino_raw_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|22 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


71% BUSCO recovery for both – not too bad, not too great. Interesting that their BUSCO scores are the same despite almost 150kb size difference.
We’ll process the two assemblies through five different polishing scenarios.
- Medaka only (relatively recent official ONT recommendation)
- BWA based SAM, 1x Racon and then Medaka
- Minimap2 based SAM, 1x Racon and then Medaka
- Lastal high accuracy mode based SAM, 1x Racon and then Medaka
- Minimap2 based SAM, 4x Racon and then Medaka
Some data and processing scripts used for this note is on my github at https://github.com/naturepoker/polishing_long_reads_note.
Let’s look at the Medaka only polishing output:
#Medaka only deino polish
#Medaka 1.6.0
medaka_consensus -i deino_1k_90.fastq -d deino_1k_raw.fasta -o meda
ka_deino_1k -t 8 -m r941_min_sup_g507
##################################################
Total sequence composition is as follows
--------------------------------------------------
518847 A
519407 T
865794 C
868162 G
--------------------------------------------------
Total gapped sequence length is: 2772210
--------------------------------------------------
Total ungapped sequence length is: 2772210
--------------------------------------------------
GC content in consensus.fasta is 62.54 %
##################################################
busco -m genome -i medaka_deino_1k.fasta --auto-lineage-prok -o medaka_deino_1k_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:76.6%[S:76.6%,D:0.0%],F:12.1%,M:11.3%,n:124 |
|95 Complete BUSCOs (C) |
|95 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|15 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#Medaka only meta_deino polish
medaka_consensus -i deino_1k_90.fastq -d meta_deino_raw.fasta -o me
daka_meta_deino -t 8 -m r941_min_sup_g507
##################################################
Total sequence composition is as follows
--------------------------------------------------
548694 A
549101 T
908388 C
909636 G
--------------------------------------------------
Total gapped sequence length is: 2915819
--------------------------------------------------
Total ungapped sequence length is: 2915819
--------------------------------------------------
GC content in consensus.fasta is 62.35 %
##################################################
busco -m genome -i medaka_meta_deino.fasta --auto-lineage-prok -o medaka_meta_deino_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:75.8%[S:75.8%,D:0.0%],F:12.9%,M:11.3%,n:124 |
|94 Complete BUSCOs (C) |
|94 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|16 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


Not a drastic change, but an improvement nonetheless. Deino went from 71% to 76.6%, meta_deino went from 71% to 75.8%.
Let’s look at BWA based SAM used in 1x Racon iteration, and then Medaka. Shell script for this and other polishing processes will be on my github.
#BWA 1x Racon and Medaka deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
518596 A
519148 T
865755 C
868145 G
--------------------------------------------------
Total gapped sequence length is: 2771644
--------------------------------------------------
Total ungapped sequence length is: 2771644
--------------------------------------------------
GC content in old_RM_deino_1k.fasta is 62.55 %
##################################################
busco -m genome -i old_RM_deino_1k.fasta --auto-lineage-prok -o old_RM_deino_1k_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|21 Fragmented BUSCOs (F) |
|15 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#BWA 1x Racon and Medaka meta_deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
548312 A
548592 T
908184 C
909388 G
--------------------------------------------------
Total gapped sequence length is: 2914476
--------------------------------------------------
Total ungapped sequence length is: 2914476
--------------------------------------------------
GC content in old_RM_meta_deino.fasta is 62.36 %
##################################################
busco -m genome -i old_RM_meta_deino.fasta --auto-lineage-prok -o old_RM_meta_deino_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|22 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


So – I don’t know what happened here. The polishing didn’t really improve anything in obvious manner. If I put above IDEEL graph next to the raw genome data and squint really hard, I can see decreased instances of partial genomes… But the result is overall disappointing.
Maybe this is BWA related? While the suspicion makes sense, it’s a pretty jarring change considering that in the past, BWA-Racon cycle tended to get us at least some degree of improvement. For old BWA SAM fed polishing of ONT long-read assembly, check the polishing step for this draft (https://naturepoker.wordpress.com/2021/11/27/genome-assembly-for-artists-draft-part-1/).
Let’s look at 1x minimap2 SAM fed Racon-Medaka polishing output. Minimap2 is the de-facto standard long read mapper out there, so there shouldn’t be any obvious issues.
#minimap2 1x Racon and Medaka deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
518501 A
519030 T
865604 C
867994 G
--------------------------------------------------
Total gapped sequence length is: 2771129
--------------------------------------------------
Total ungapped sequence length is: 2771129
--------------------------------------------------
GC content in new_RM_deino_1k.fasta is 62.55 %
##################################################
busco -m genome -i new_RM_deino_1k.fasta --auto-lineage-prok -o new_RM_deino_1k_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:75.0%[S:75.0%,D:0.0%],F:13.7%,M:11.3%,n:124 |
|93 Complete BUSCOs (C) |
|93 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|17 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#minimap2 1x Racon and Medaka meta_deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
548324 A
548606 T
908178 C
909410 G
--------------------------------------------------
Total gapped sequence length is: 2914518
--------------------------------------------------
Total ungapped sequence length is: 2914518
--------------------------------------------------
GC content in new_RM_meta_deino.fasta is 62.36 %
##################################################
busco -m genome -i new_RM_meta_deino.fasta --auto-lineage-prok -o new_RM_meta_deino_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|22 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


The outcome here is also pretty open to interpretation. If I take the IDEEL graphs side by side with the one from raw unpolished genomes there are obvious signs of improvements, smaller number of partial genes and smaller number of duplicate genes. And yet according to more definitive metric of BUSCO scores, the meta assembly from Deinococcus reads barely improved at all. So far all of the results we’ve been looking at haven’t been as good as just running Medaka straight on a raw assembly – so technically, added polishing step with Racon have been making the assemblies worse.
Maybe this is due to sometimes noisy alignment from minimap2? Let’s try out lastalign, and see what their highest accuracy setting will get us. I’ve used lastalign before and had been impressed with the output, especially in context of subsampling reads against a target genome.
#lastal 1x Racon and Medaka deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
518393 A
518961 T
865446 C
867847 G
--------------------------------------------------
Total gapped sequence length is: 2770647
--------------------------------------------------
Total ungapped sequence length is: 2770647
--------------------------------------------------
GC content in racon_medaka_deino_1k.fasta is 62.55 %
##################################################
busco -m genome -i racon_medaka_deino_1k.fasta --auto-lineage-prok -o racon_medaka_deino_1k_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124 |
|86 Complete BUSCOs (C) |
|86 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|23 Fragmented BUSCOs (F) |
|15 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#lastal 1x Racon and Medaka meta_deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
548297 A
548571 T
908186 C
909468 G
--------------------------------------------------
Total gapped sequence length is: 2914522
--------------------------------------------------
Total ungapped sequence length is: 2914522
--------------------------------------------------
GC content in racon_medaka_meta_deino.fasta is 62.36 %
##################################################
busco -m genome -i racon_medaka_meta_deino.fasta --auto-lineage-prok -o racon_medaka_meta_deino_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124 |
|86 Complete BUSCOs (C) |
|86 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|23 Fragmented BUSCOs (F) |
|15 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


Holy cow. I don’t think I’ve ever seen a polishing step (involving two different polishers no less) actually make an assembly worse across the board. Obviously the polishing step somehow ended up introducing more indels to the genome. Is this a trait of super-high accuracy SAM files fed to Racon? Or just some weird quirk specific to the mapper? Anyway, combination of lastal version 1293, Racon v.1.5.0 and Medaka v.1.6.0 makes for some strange, unexpected failure state. The old adage of “you can’t overpolish a genome” no longer seems to hold water.
Let’s see what we get if we follow the old school recommendation of 4x Racon iteration before Medaka. We’ll be using minimap2 for SAM input.
#minimap2 4x Racon and Medaka deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
518037 A
518693 T
865063 C
867714 G
--------------------------------------------------
Total gapped sequence length is: 2769507
--------------------------------------------------
Total ungapped sequence length is: 2769507
--------------------------------------------------
GC content in 4x_deino_1k.fasta is 62.56 %
##################################################
busco -m genome -i 4x_deino_1k.fasta --auto-lineage-prok -o 4x_deino_1k_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:70.2%[S:70.2%,D:0.0%],F:18.5%,M:11.3%,n:124 |
|87 Complete BUSCOs (C) |
|87 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|23 Fragmented BUSCOs (F) |
|14 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------
#minimap2 4x Racon and Medaka meta_deino polish
##################################################
Total sequence composition is as follows
--------------------------------------------------
547978 A
548285 T
907936 C
909262 G
--------------------------------------------------
Total gapped sequence length is: 2913461
--------------------------------------------------
Total ungapped sequence length is: 2913461
--------------------------------------------------
GC content in 4x_meta_deino.fasta is 62.37 %
##################################################
busco -m genome -i 4x_meta_deino.fasta --auto-lineage-prok -o 4x_meta_deino_busco
--------------------------------------------------
|Results from dataset bacteria_odb10 |
--------------------------------------------------
|C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124 |
|88 Complete BUSCOs (C) |
|88 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|21 Fragmented BUSCOs (F) |
|15 Missing BUSCOs (M) |
|124 Total BUSCO groups searched |
--------------------------------------------------


The polished assemblies are certainly technically improved- but when we consider that just running Medaka on raw assemblies gives us better results, this is more of a subtle regression in the quality of the genome.
There are some caveats to the results here – I’ve intentionally left out specialized downstream polishers that will improve the results drastically by getting rid of significant chunks of homopolymer indel errors. Polypolish and homopolish come to mind as tools I’ve used before that frequently turns so-so assemblies into amazing ones. Deinococcus genome is a repeat and GC rich, polyploid landscape of transposons, prophages and homopolymers, so we’re really looking at a pretty tricky genome to begin with.
I’m not sure what to think about the results here though. Either there’s something significantly wrong with what I’ve done (I seriously doubt it based on previous experiences, but it’s possible), or later versions of Medaka might actually perform worse when added to the traditional pipeline of Racons and Pilons.
What bothers me the most is – if I simply relied on existing pipelines in my setup (which integrates polypolish/homopolish) I wouldn’t have realized what’s happening at all, since fixed homopolymer/indel errors will almost always get me better BUSCO scores compared to the raw inputs.
I’m looking for suggestions and comments on what’s happening here, and better best practices for long-read assembly polishing. As usual, please feel free to leave a comment here or get in touch at twitter @naturepoker1
Edit May 29th 2022:
@gaworj on twitter suggested using subsampled reads to cut down on potential noise. I cut the reads down to about 100x coverage and processed them through the same medaka only vs racon + medaka comparison again, focusing primarily on BUSCO scores (for now). The behavior described in the body of this lab note persists.
@Psy_Fer_ on twitter suggested dropping the –nano-hq argument from Flye assembly since I’m still using r941 data. Comparing –nano-hq assembled genome and –nano-raw assembled genome through the same medaka only vs racon + medaka comparison resulted in replication of the behavior seen in this lab note.
It should be noted that this entire test is looking at indel errors rather than any real structural correction of the assemblies (through IDEEL and BUSCO), so I’m missing whole bunch of stuff that’s happening under the surface. This note is also confined to a single data point of one Deinococcus ONT flowcell output, obtained using rapid sequencing kit and basecalled using a VERY subpar consumer graphics card (NVIDIA GeForce GT 1030).
Thanks for the suggestions folks! Hope this note’s been useful for you.