Polishing long reads

I’ve been away from this lab notebook for a little while – for good reasons. The pace of the projects I’m involved in has been moving at breakneck speed, and since some of them are publication-facing, I ended up enforcing a bit of a self-imposed embargo on the stuff I’d normally write about. The first of the preprints should be coming out soon, so hopefully I’ll be able to share some interesting stuff here in a couple of weeks’ time.

This aspect of ’embargo before publication’ for an amateur/independent researcher is something that will need to be discussed at length in the future. Amateurs in biology tend to be isolated from the normal academic support infrastructure, like lab meetings and casual conversations with peers within an institution. So an embargo in these situations happens in a far more solitary context, with the amateur often missing out on the crucial peer-to-peer transfer of knowledge that increasingly seems necessary to the development of the scientific craft. Publication culture for amateur and independent researchers will have to be fundamentally different in this regard.

But I digress, onto the topic for today – the long read polishing question.

Long read based assemblies, by the very nature of the technologies used, suffer from indel issues – in almost any scenario, a post-assembly correction step is mandatory. Granted, the improvements I’ve seen in the field between just 2018 and now have been staggering – I’m getting the kind of raw reads and assemblies on today’s casual experiments that would have been unimaginable back when we first started out.

The introduction of HAC + SUP guppy calling and Medaka for polishing really changed the game – to the point that the usual basecalling -> assembly -> polishing cycle for the ONT platform seems to be coalescing into something resembling a standard (guppy SUP basecall -> flye assembly -> medaka polishing, IMHO).

Of the three main steps, post-assembly polishing was always the trickiest of the bunch for me. Sure – Racon, then Medaka, followed by some sort of homopolymer-specific polishing step tends to be the way to go. But should it be 4x Racon, as was suggested in earlier Medaka literature? And which aligner should produce the SAM files fed to Racon for polishing? Super high accuracy lastal? Minimap2? Tons of literature out there also suggest BWA, despite the fact that we’re working solely with long reads. Or should we ignore the manual and follow some well-known bioinformaticians out there, sticking with BWA -> 1x Racon -> Medaka and figuring out the rest as we go along?

While checking for a new Medaka update on my work machine (1.6.0) I was reminded of a passage on their official github (https://github.com/nanoporetech/medaka) and decided to test things out for myself at last.

I’ll process two raw assembly outputs – deino and meta_deino – from the Flye assembler. Deino is a simple --nano-hq mode assembly, and meta_deino is a metaFlye mode assembly (I suspect there’s a prophage in the genome, hence the second data set). For this very informal comparison, we’re just going to look at the BUSCO score of each final genome and the IDEEL indel estimation graph to check for broken or duplicated genes.

The two assemblies used here came from minimally quality-controlled guppy SUP reads. More information below:

#Guppy 6.0.1+652ffd1
guppy_basecaller -i fast5/ -s SUP_fastq -c dna_r9.4.1_450bps_sup.cfg -x 'cuda:0' --gpu_runners_per_device 1 --chunk_size 500 --chunks_per_runner 128

#Filtered data stat
Mean read length:                18,757.6
Mean read quality:                   13.2
Median read length:              11,292.0
Median read quality:                 13.2
Number of reads:                 69,387.0
Read length N50:                 26,341.0
STDEV read length:               20,444.2
Total bases:              1,301,532,662.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:	69387 (100.0%) 1301.5Mb
>Q7:	69387 (100.0%) 1301.5Mb
>Q10:	69387 (100.0%) 1301.5Mb
>Q12:	54104 (78.0%) 963.1Mb
>Q15:	5919 (8.5%) 85.2Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	37.3 (1948)
2:	37.2 (1182)
3:	35.4 (3067)
4:	33.8 (2535)
5:	33.4 (4727)
Top 5 longest reads and their mean basecall quality score
1:	306879 (11.1)
2:	299065 (14.9)
3:	290924 (15.4)
4:	290778 (14.2)
5:	290638 (12.3)
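As a sanity check on stats like the above, the read length N50 can be re-derived from a plain list of read lengths with a short awk pass. This is a hypothetical helper of my own, not the tool that actually produced the numbers:

```shell
# N50: the length at which the cumulative sum of (descending) read
# lengths first crosses half of the total bases.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END { half = total / 2
          for (i = 1; i <= NR; i++) {
            run += len[i]
            if (run >= half) { print len[i]; exit } } }'
}

printf '%s\n' 10 20 30 40 50 | n50   # -> 40
```

Feeding it real read lengths (e.g. from `awk 'NR % 4 == 2 { print length($0) }' reads.fastq`) should reproduce the N50 reported above.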

The commands for the two assemblies and their stats are:

#Deino assembly
flye --nano-hq deino_1k_90.fastq -t 8 -o deino_1k

     Total sequence composition is as follows     
 518903 A
 519556 T
 865827 C
 868228 G
Total gapped sequence length is: 2772514
Total ungapped sequence length is: 2772514
GC content in deino_1k_raw.fasta is 62.54 %                 

busco -m genome -i deino_1k_raw.fasta --auto-lineage-prok -o deino_1k_raw_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#meta_deino assembly
flye --nano-hq deino_1k_90.fastq --meta -t 16 -o meta_deino_1k

     Total sequence composition is as follows     
 548550 A
 548984 T
 908052 C
 909337 G
Total gapped sequence length is: 2914923
Total ungapped sequence length is: 2914923
GC content in meta_deino_raw.fasta is 62.34 %                 

busco -m genome -i meta_deino_raw.fasta --auto-lineage-prok -o meta_deino_raw_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

71% BUSCO recovery for both – not too bad, not too great. It’s interesting that their BUSCO scores are identical despite an almost 150kb size difference.
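As an aside, the ‘Total sequence composition’ and GC content figures quoted throughout this note are just a counting pass over the FASTA. A minimal awk stand-in (my own sketch with a placeholder name, not the exact script from the repo) would look like this:

```shell
# Count A/T/C/G in a FASTA and report GC% over the unambiguous bases.
base_stats() {
  awk '!/^>/ { seq = toupper($0)
         for (i = 1; i <= length(seq); i++) c[substr(seq, i, 1)]++ }
       END { total = c["A"] + c["T"] + c["C"] + c["G"]
             printf "A=%d T=%d C=%d G=%d GC=%.2f%%\n",
                    c["A"], c["T"], c["C"], c["G"],
                    (c["G"] + c["C"]) * 100 / total }' "$1"
}
```

Running `base_stats deino_1k_raw.fasta` would print the counts and GC percentage in one line.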

We’ll process the two assemblies through five different polishing scenarios.

  • Medaka only (relatively recent official ONT recommendation)
  • BWA based SAM, 1x Racon and then Medaka
  • Minimap2 based SAM, 1x Racon and then Medaka
  • Lastal high accuracy mode based SAM, 1x Racon and then Medaka
  • Minimap2 based SAM, 4x Racon and then Medaka

Some data and the processing scripts used for this note are on my github at https://github.com/naturepoker/polishing_long_reads_note.

Let’s look at the Medaka only polishing output:

#Medaka only deino polish
#Medaka 1.6.0
medaka_consensus -i deino_1k_90.fastq -d deino_1k_raw.fasta -o medaka_deino_1k -t 8 -m r941_min_sup_g507

     Total sequence composition is as follows
 518847 A
 519407 T
 865794 C
 868162 G
Total gapped sequence length is: 2772210
Total ungapped sequence length is: 2772210
GC content in consensus.fasta is 62.54 %

busco -m genome -i medaka_deino_1k.fasta --auto-lineage-prok -o medaka_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:76.6%[S:76.6%,D:0.0%],F:12.1%,M:11.3%,n:124    |
        |95     Complete BUSCOs (C)                       |
        |95     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |15     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#Medaka only meta_deino polish
medaka_consensus -i deino_1k_90.fastq -d meta_deino_raw.fasta -o medaka_meta_deino -t 8 -m r941_min_sup_g507

     Total sequence composition is as follows
 548694 A
 549101 T
 908388 C
 909636 G
Total gapped sequence length is: 2915819
Total ungapped sequence length is: 2915819
GC content in consensus.fasta is 62.35 %

busco -m genome -i medaka_meta_deino.fasta --auto-lineage-prok -o medaka_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:75.8%[S:75.8%,D:0.0%],F:12.9%,M:11.3%,n:124    |
        |94     Complete BUSCOs (C)                       |
        |94     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |16     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

Not a drastic change, but an improvement nonetheless. Deino went from 71% to 76.6%, and meta_deino went from 71% to 75.8%.
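Since I’ll be staring at a lot of these BUSCO tables, here’s the kind of one-liner I use to pull the completeness figure out of a short summary file for quick side-by-side comparison (the file layout in the commented loop is a placeholder assumption):

```shell
# Grab the "C:xx.x%" completeness figure from a BUSCO short summary file.
busco_c() {
  grep -o 'C:[0-9.]*%' "$1" | head -1
}

# e.g., assuming one output directory per run:
# for f in *_busco/short_summary*.txt; do
#   printf '%s\t%s\n' "$f" "$(busco_c "$f")"
# done
```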

Let’s look at a BWA based SAM used in a 1x Racon iteration, and then Medaka. The shell script for this and the other polishing processes will be on my github.
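In outline, this scenario ran something like the sketch below. The filenames and the exact BWA preset are reconstructed from memory, so treat the script in the github repo as the authoritative version:

```shell
# Sketch of the BWA -> 1x Racon -> Medaka run (placeholder filenames).
# Written out as a script here; only syntax-checked, not executed.
cat > polish_bwa_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
DRAFT=deino_1k_raw.fasta

bwa index "$DRAFT"
bwa mem -t 8 -x ont2d "$DRAFT" "$READS" > aln.sam   # ONT long-read preset
racon -t 8 "$READS" aln.sam "$DRAFT" > racon_1x.fasta
medaka_consensus -i "$READS" -d racon_1x.fasta \
    -o old_RM_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_bwa_racon_medaka.sh   # syntax check only
```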

#BWA 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518596 A
 519148 T
 865755 C
 868145 G
Total gapped sequence length is: 2771644
Total ungapped sequence length is: 2771644
GC content in old_RM_deino_1k.fasta is 62.55 %

busco -m genome -i old_RM_deino_1k.fasta --auto-lineage-prok -o old_RM_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |21     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#BWA 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548312 A
 548592 T
 908184 C
 909388 G
Total gapped sequence length is: 2914476
Total ungapped sequence length is: 2914476
GC content in old_RM_meta_deino.fasta is 62.36 %

busco -m genome -i old_RM_meta_deino.fasta --auto-lineage-prok -o old_RM_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

So – I don’t know what happened here. The polishing didn’t really improve anything in any obvious manner. If I put the IDEEL graph above next to the raw genome data and squint really hard, I can see decreased instances of partial genes… But the result is overall disappointing.

Maybe this is BWA related? While the suspicion makes sense, it’s a pretty jarring change considering that in the past, the BWA-Racon cycle tended to get us at least some degree of improvement. For an older example of BWA SAM fed polishing of an ONT long-read assembly, check the polishing step in this draft (https://naturepoker.wordpress.com/2021/11/27/genome-assembly-for-artists-draft-part-1/).

Let’s look at the 1x minimap2 SAM fed Racon-Medaka polishing output. Minimap2 is the de facto standard long read mapper out there, so there shouldn’t be any obvious issues.

#minimap2 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518501 A
 519030 T
 865604 C
 867994 G
Total gapped sequence length is: 2771129
Total ungapped sequence length is: 2771129
GC content in new_RM_deino_1k.fasta is 62.55 %

busco -m genome -i new_RM_deino_1k.fasta --auto-lineage-prok -o new_RM_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:75.0%[S:75.0%,D:0.0%],F:13.7%,M:11.3%,n:124    |
        |93     Complete BUSCOs (C)                       |
        |93     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |17     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#minimap2 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548324 A
 548606 T
 908178 C
 909410 G
Total gapped sequence length is: 2914518
Total ungapped sequence length is: 2914518
GC content in new_RM_meta_deino.fasta is 62.36 %

busco -m genome -i new_RM_meta_deino.fasta --auto-lineage-prok -o new_RM_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:17.7%,M:11.3%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |22     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

The outcome here is also pretty open to interpretation. If I take the IDEEL graphs side by side with the ones from the raw unpolished genomes, there are obvious signs of improvement – a smaller number of partial genes and a smaller number of duplicate genes. And yet according to the more definitive metric of BUSCO scores, the meta assembly from the Deinococcus reads barely improved at all. So far none of the results we’ve been looking at have been as good as just running Medaka straight on a raw assembly – so technically, the added Racon polishing step has been making the assemblies worse.

Maybe this is due to the sometimes noisy alignments from minimap2? Let’s try out lastal, and see what its highest accuracy setting will get us. I’ve used lastal before and have been impressed with the output, especially in the context of subsampling reads against a target genome.
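The LAST-based run went roughly as below. The flags are my best reconstruction of the train-then-align recipe (with read qualities ignored via -Q0), so the repo script remains the reference:

```shell
# Sketch of the lastal -> 1x Racon -> Medaka run (placeholder filenames).
# Written out as a script here; only syntax-checked, not executed.
cat > polish_lastal_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
DRAFT=deino_1k_raw.fasta

lastdb -P8 -uNEAR draft_db "$DRAFT"
last-train -P8 -Q0 draft_db "$READS" > train.par   # learn alignment params
lastal -P8 -Q0 -p train.par draft_db "$READS" | last-split > aln.maf
maf-convert sam aln.maf > aln.sam
racon -t 8 "$READS" aln.sam "$DRAFT" > racon_1x.fasta
medaka_consensus -i "$READS" -d racon_1x.fasta \
    -o racon_medaka_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_lastal_racon_medaka.sh   # syntax check only
```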

#lastal 1x Racon and Medaka deino polish

     Total sequence composition is as follows
 518393 A
 518961 T
 865446 C
 867847 G
Total gapped sequence length is: 2770647
Total ungapped sequence length is: 2770647
GC content in racon_medaka_deino_1k.fasta is 62.55 %

busco -m genome -i racon_medaka_deino_1k.fasta --auto-lineage-prok -o racon_medaka_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124    |
        |86     Complete BUSCOs (C)                       |
        |86     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#lastal 1x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 548297 A
 548571 T
 908186 C
 909468 G
Total gapped sequence length is: 2914522
Total ungapped sequence length is: 2914522
GC content in racon_medaka_meta_deino.fasta is 62.36 %

busco -m genome -i racon_medaka_meta_deino.fasta --auto-lineage-prok -o racon_medaka_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:69.4%[S:69.4%,D:0.0%],F:18.5%,M:12.1%,n:124    |
        |86     Complete BUSCOs (C)                       |
        |86     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

Holy cow. I don’t think I’ve ever seen a polishing step (involving two different polishers, no less) actually make an assembly worse across the board. Obviously the polishing step somehow ended up introducing more indels into the genome. Is this a trait of super-high accuracy SAM files fed to Racon? Or just some weird quirk specific to the mapper? Either way, the combination of lastal version 1293, Racon v1.5.0 and Medaka v1.6.0 makes for a strange, unexpected failure state. The old adage of “you can’t overpolish a genome” no longer seems to hold water.

Let’s see what we get if we follow the old school recommendation of 4x Racon iterations before Medaka. We’ll be using minimap2 for the SAM input.
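The iteration itself is just a loop that feeds each Racon output back in as the next round’s target. A sketch with placeholder filenames (the repo script is authoritative):

```shell
# Sketch of the minimap2 -> 4x Racon -> Medaka run.
# Written out as a script here; only syntax-checked, not executed.
cat > polish_4x_racon_medaka.sh <<'EOF'
#!/bin/bash
set -euo pipefail
READS=deino_1k_90.fastq
ASM=deino_1k_raw.fasta

for i in 1 2 3 4; do
  minimap2 -ax map-ont "$ASM" "$READS" > round_${i}.sam
  racon -t 8 "$READS" round_${i}.sam "$ASM" > racon_${i}.fasta
  ASM=racon_${i}.fasta     # next round polishes this output
done
medaka_consensus -i "$READS" -d "$ASM" \
    -o 4x_deino_1k -t 8 -m r941_min_sup_g507
EOF
bash -n polish_4x_racon_medaka.sh   # syntax check only
```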

#minimap2 4x Racon and Medaka deino polish

     Total sequence composition is as follows
 518037 A
 518693 T
 865063 C
 867714 G
Total gapped sequence length is: 2769507
Total ungapped sequence length is: 2769507
GC content in 4x_deino_1k.fasta is 62.56 %

busco -m genome -i 4x_deino_1k.fasta --auto-lineage-prok -o 4x_deino_1k_busco

        |Results from dataset bacteria_odb10              |
        |C:70.2%[S:70.2%,D:0.0%],F:18.5%,M:11.3%,n:124    |
        |87     Complete BUSCOs (C)                       |
        |87     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |23     Fragmented BUSCOs (F)                     |
        |14     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |
#minimap2 4x Racon and Medaka meta_deino polish

     Total sequence composition is as follows
 547978 A
 548285 T
 907936 C
 909262 G
Total gapped sequence length is: 2913461
Total ungapped sequence length is: 2913461
GC content in 4x_meta_deino.fasta is 62.37 %

busco -m genome -i 4x_meta_deino.fasta --auto-lineage-prok -o 4x_meta_deino_busco

        |Results from dataset bacteria_odb10              |
        |C:71.0%[S:71.0%,D:0.0%],F:16.9%,M:12.1%,n:124    |
        |88     Complete BUSCOs (C)                       |
        |88     Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |21     Fragmented BUSCOs (F)                     |
        |15     Missing BUSCOs (M)                        |
        |124    Total BUSCO groups searched               |

The polished assemblies are certainly technically improved – but when we consider that just running Medaka on the raw assemblies gives us better results, this is more of a subtle regression in the quality of the genome.

There are some caveats to the results here – I’ve intentionally left out specialized downstream polishers that would improve the results drastically by getting rid of significant chunks of homopolymer indel errors. Polypolish and homopolish come to mind as tools I’ve used before that frequently turn so-so assemblies into amazing ones. And the Deinococcus genome is a repeat- and GC-rich, polyploid landscape of transposons, prophages and homopolymers, so we’re really looking at a pretty tricky genome to begin with.

I’m not sure what to think about the results here though. Either there’s something significantly wrong with what I’ve done (I seriously doubt it based on previous experience, but it’s possible), or later versions of Medaka might actually perform worse when added to the traditional pipeline of Racons and Pilons.

What bothers me the most is this – if I had simply relied on the existing pipelines in my setup (which integrate polypolish/homopolish) I wouldn’t have realized what was happening at all, since fixing homopolymer/indel errors will almost always get me better BUSCO scores compared to the raw inputs.

I’m looking for suggestions and comments on what’s happening here, and on better practices for long-read assembly polishing. As usual, please feel free to leave a comment here or get in touch on twitter @naturepoker1

Edit May 29th 2022:

@gaworj on twitter suggested using subsampled reads to cut down on potential noise. I cut the reads down to about 100x coverage and processed them through the same medaka only vs racon + medaka comparison again, focusing primarily on the BUSCO scores (for now). The behavior described in the body of this lab note persists.
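For reference, the subsampling was along these lines – I’m showing Filtlong as one option (I don’t know what tool @gaworj had in mind), with a target of ~277 Mb, i.e. roughly 100x over the ~2.77 Mb genome:

```shell
# Sketch of subsampling the reads to ~100x coverage with Filtlong.
# Written out as a script here; only syntax-checked, not executed.
cat > subsample_100x.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# ~2.77 Mb genome x 100x coverage ~= 277 Mb of bases to keep,
# keeping the highest-quality reads first.
filtlong --target_bases 277000000 deino_1k_90.fastq > deino_100x.fastq
EOF
bash -n subsample_100x.sh   # syntax check only
```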

@Psy_Fer_ on twitter suggested dropping the --nano-hq argument from the Flye assembly since I’m still using r941 data. Comparing a --nano-hq assembled genome and a --nano-raw assembled genome through the same medaka only vs racon + medaka comparison again replicated the behavior seen in this lab note.

It should be noted that this entire test looks at indel errors rather than any real structural correction of the assemblies (through IDEEL and BUSCO), so I’m missing a whole bunch of stuff that’s happening under the surface. This note is also confined to a single data point: one Deinococcus ONT flowcell output, obtained using the rapid sequencing kit and basecalled using a VERY subpar consumer graphics card (NVIDIA GeForce GT 1030).

Thanks for the suggestions folks! Hope this note’s been useful for you.

Lab note: Feb 11

Continuing on with the Halobacterium long-term culture observations.

Confirmed living as of February 11 2022 

1) 2021/Jan14 Halo R control : doing well, I don't see a difference between this culture and other, far fresher inoculations. Same goes for all other cultures (including the mutant culture in acidic media) inoculated on the same day. 

2) 2020/Oct23 Halo M mutant : surprisingly weak growth - I can see the beginnings of the typical white-pink Halobacterial ring around the rim of the tube, but it's only just started. I'd safely say at least a month's lag time before these cultures thrive again (when re-inoculated).

3) 2020/Apr02 Halo M mutant A/B : weak growth, more or less on par with the 2020/Oct23 culture. Should I say it's doing better than it ought to, considering half a year's time between the Oct23 and the Apr02 A/B cultures?

4) 2020/Mar30 Halo M mutant : it's hard to tell at this stage - the red sedimentation on the bottom of the tube is getting thicker, so we could be seeing a mutant culture with damaged gas vesicle operon, but I'll have to give it another couple of weeks to make sure. 

5) 2020/Mar17 Halo R control : no growth. The media remains completely clear.


So out of the ‘unknown status’ cultures recorded on January 28th, the 2020/Mar17 and 2020/Mar30 cultures remain questionable. I think the Mar30 culture will end up picking up soon, but it’s probably safe at this stage to assume the Mar17 culture is simply dead. I’ll still leave both in the incubator for the next few weeks to see if anything changes.

Now, the long-term Halobacterium growths above are from maintained lines of sealed, liquid media cultures. I also maintain a separate line of aerated cultures that are specifically meant to dry out and crystallize (due to the high salt content of the media, at +25%) in order to track their long term behavior. My previous observations suggested that aerated (sponge capped) and dried cultures of Halobacterium do not fare as well as sealed liquid cultures, but maybe my method for resuscitating them had been wrong all along.

I recently resuscitated a test tube of fully dried and salt-crystallized Halobacterium culture dating back to 2020/Apr25, using a protocol that happened to incorporate a brief heat-shock period – about 1 minute at 60C. For whatever we can tell from a sample size of one, this particular culture seems to be thriving now – far outgrowing even the more recent cultures from 2021.

Again, sample size of one, so this result doesn’t mean anything other than that this particular salt-crystal captured culture seems to be doing very well with simple resuspension in fresh media and perhaps the exposure to heat, meant to help some of the crystals melt faster. Once I can get a decent saturation of this particular culture, I’ll extract its genomic DNA via previously discussed methods and send it out for WGS for possible variant observation.

I’m also planning a far more controlled experiment dealing with the behavior of salt crystal encapsulated Halobacterium cultures, using 28 well plates and 0.5mm glass beads (usually used for cell disruption – the important thing is that they are standardized). I have a hunch that the rough response profiles of a salt crystallized Halobacterium and a sealed long-term liquid culture Halobacterium would remain similar, and I would like to confirm it physically.

Lab note: Jan 17-Jan 28

Before going further – Jan 27th was something of a momentous day. Our Deinococcus radiophilus genome submission finally became the representative genome on Genbank! Now I just need to get the preprint out there.

I’m in the middle of a big re-arrangement of the warehouse lab, finally dusting our bench and wiping down all the grime accrued over the pandemic period. The lab was still being used through those months, but the activity level was nowhere near what we had during the Deinococcus radiophilus project. Such a shame too – we’ve managed to put together a pretty tidy lab and a backlog of surprisingly fancy hardware for a bunch of amateurs.

For the time being, the main lab bench looks like this.

One of the big tasks was to catalog and re-inoculate a bunch of long term Halobacterium salinarum cultures in the lab. I have Halobacterium NRC-1 steady state cultures dating back to January of 2020, and slant agar cultures dating all the way back to 2019 here.

The slant cultures don’t look too bad, as they’re backups stored in the fridge… The fridge itself is a bargain basement unit that should be giving up the ghost one of these days, but for now it keeps on trucking.

As for the larger volume, liquid Halobacterium NRC-1 cultures I’ve been keeping here (intentionally! The goal was always to track mutation profiles over very long term storage and stress, between a mutant and a control strain) – they are in states ranging from absolutely decimated to hanging on for dear life. On the whole, cultures stored in well sealed, limited oxygen environments fare far better than the ones with more aeration, as expected.

What I didn’t expect is the seemingly ravenous efficacy with which Halobacterium cultures clear out everything in their environment – and seemingly themselves – in states of extreme starvation. The pictures above depict two types of sealed cultures kept at a constant 30C under 24/7 grow lights: the plastic bottle is from January of 2021, the glass flasks are from March of 2020 – very unscientifically, a year apart. Even with the same strain of Halobacterium NRC-1 in the same media (ATCC 213), a year’s time is enough to clarify the whole media. A nice relaxing side project for me is to track the length of stasis and the rate of change through all this, and to check for the genomic rearrangements common in these extreme cases.

And it turns out there could be reasons to suspect that this type of long-term starvation in a sealed environment is a kind of evolutionary stimulus Halobacterium has been exposed to throughout its history, as anyone who made salt crystals in their youth might attest. Look at this wholly unintentional creation:

One of the old Halobacterium flasks left out to dry in the back of the lab. It’s hard to see in the picture, but the large salt crystal still contains Halobacterium.

This is a Halobacterium culture left forgotten to dry out at the back of the lab, inoculated on July 9th 2020. It’s almost impossible to tell in the picture, but the large crystal to the right has a red center that still contains a liquid-or-gel like something, which I suspect could be a still-living culture of Halobacterium in stasis. I suspect this because I’ve been through this exact situation before, and have revived a small number of Halobacterium cultures caught up in crystals. So simulating this cycle under far more controlled conditions is something I’m running in the lab – Haloarchaeal culture, heavy dose of light, dry out, crystallization, cracking and sequencing for genomic rearrangements, on a year-length time scale.

As for current long-term Halobacterium cultures that went through recent re-inoculation:

Confirmed living as of Jan 28 2022

1) 2021/Jan21 Halo R control -> 2022/Jan17 Halo R control
2) 2021/Jan21 Halo M mutant -> 2022/Jan17 Halo M mutant

3) 2021/Jan14 Halo R control -> 2022/Jan17 Halo R control
4) 2021/Jan14 Halo M mutant -> 2022/Jan17 Halo M mutant

5) 2020/Oct23 Halo M mutant -> 2022/Jan17 Halo M mutant

6) 2020/Apr02 Halo M mutant B -> 2022/Jan17 Halo M mutant

Unknown status as of Jan 28 2022

2020/Apr02 Halo M mutant A 
2020/Mar17 Halo R control 
2020/Mar30 Halo M mutant 

How the long-term Halobacterium are grown. It’s a little shabby, but it works. I’m working to expand this out into about three sets.

Another big change is moving more of the active Halobacterium NRC-1 mutant/control cultures to the warehouse lab as well. I’m thinking of setting up a rotating culture of 10ml acidic vs control halo media with a 10ul inoculation per generation – for now it looks something like this, with a day-night cycle. Inoculation time for the Halobacterium mutant and control cultures below was 1:45 AM on January 28th, 2022.

Very shabby, but I’d rather get the cultures started sooner rather than later. This whole arrangement will be going through expansion and refinement as well.

ONT long-read assembly note: on an unusually difficult archaeal genome


This particular lab note is from an ongoing project in collaboration with a different lab. Please pardon some of the necessary vagueness. All this is pre-publication data, so I have to blur out some details until we can at least get a preprint out (which should happen soon, fingers crossed!).

I’ve been working on a particularly befuddling genome of a previously unsequenced archaeal microbe for the last couple of months. The key issue with the data has been the presence of spurious contigs that differ drastically in size and structure from assembly to assembly, leaving only the chromosome and one circularized contig as reliable constants across maybe 100 or so different iterations (it got to the point that I’d jokingly refer to the genome as a random number generator). There finally seems to be a light at the end of the tunnel though, so I’m leaving a quick note on the slightly modified process here, hopefully to be merged later into the long-read assembly portion of the “Genome assembly for artists” zine.

Now, contig level variation from assembly to assembly is itself perfectly normal, IMHO, and can stem from myriad different causes, including differences in the assembly process such as the filtering method for raw data, some variation in the version numbers of the programs used, etc. What was puzzling about this particular genome was: (1) the sheer variety of contigs produced through different assembly processes, their sizes ranging from thousands of bases to +600kb, and (2) the said variety of contigs appearing and disappearing from different assembly iterations without any discernible pattern.

The issue was exacerbated by the somewhat unique character of the archaea at hand – it’s an extremophile with particular growth condition requirements that our collaborator didn’t have the infrastructure to handle at the time. That, combined with some unfortunate timing issues, meant the genome extraction had to be performed directly on the shipping samples, making for three separate genome extractions on three physically separate samples. I’ve seen some odd results from similar cases before, due to mutations present in one sample of the same species but not in another, so we were definitely walking into somewhat complex territory (at least for an amateur biologist like myself). One saving grace of the situation was the sheer size of the available data – our collaborator was successful in extracting and sequencing a positively ridiculous amount of DNA from his samples, so we were sitting on raw reads pointing toward at least 1500x combined coverage against the suspected size of the archaeal genome (another reason why exploratory assemblies are so important! We always want to keep an eye on the suspected complete genome size versus the depth of the available data).

So going in, we had three high quality raw read sets: the first from conventional Illumina short read sequencing, the second and third from ONT MinION flowcells (our collaborator used a GridION for the sequencing).

I won’t get into the details of what happened during the first month of the genome assembly experiment – due to timing and computational resource issues I was forced to focus on more stringent filtering of the raw data rather than on the assemblies themselves. The pervading concern at the time was contamination causing fragmentation of the assembly, potentially via cryptic environmental symbionts, prophages and other oddities our archaea might have dragged in from the deep. Since screening all of the raw reads via GUNC and Kraken (as I normally do) proved largely unfeasible, I had to rely on kmer based analysis and short-read based quality filtering. Even running a Unicycler based hybrid assembly (still considered by many to be the most accurate method) became extremely cumbersome, requiring about 12~15 hours of run time with extremely unsatisfying results (granted, unlike the long-read based assemblies, the hybrid assembly continued to give us the same contigs regardless of the filter state of the raw reads – a curious observation for future reference).

With December of 2021 lost among enigmatic genome assemblies, I was finally able to sit down in the new year with a more powerful rig (the System76 Thelio written about in a previous note) and finish the assembly properly, given the quality of the data at hand.

From the get-go it was obvious that Trycycler was the best pipeline for handling such high-depth long-read data, since it allows in-depth subsampling and curation of the many assemblies thrown into the reconciliation step, resulting in a more detailed description of the genome overall. Unfortunately Trycycler wasn’t really an option in the very beginning. Our assembly project originally began with just the short reads and the flowcell 2 data, which ended up being just short of the depth needed to curate reliable subsamples. The lack of depth during those initial stages would simply result in more fragmentation in the assemblies, and I’d be able to rescue only about 2/3 of the expected genome, since reconciliation among the other fragmented contigs would not be possible.

The availability of the 3rd flowcell data (which actually composes the bulk of all available data in our set now) and re-basecalling of the 2nd flowcell data in SUP mode finally allowed me to attempt another Trycycler run – with similarly disappointing results.

By this point it became reasonable to suspect that the chaotic, fragmented state of our assemblies so far was itself a reasonable facsimile of the archaeal genome at the time of extraction and sequencing. Obvious sources of contamination capable of causing misassemblies of this magnitude would have revealed themselves as independent contigs by now, and screening of dozens of assemblies up to this point did not point toward any obviously foreign contig. No mystery rRNA operons were detected in any of the contigs or assemblies either. Based on numerous factors it was only natural to suspect a genuine, perhaps transposon-driven fragmentation of the genome happening during the storage and transfer process the microbes were subjected to prior to genome extraction.

How we approach the post-assembly steps in Trycycler changes if we assume significant pre-existing damage and fragmentation in the input genome, since the goal shifts from processing all of the genomic components in the assembly to choosing the correct subset of components that might be more representative of their unmodified state.

The easiest way to utilize the Trycycler pipeline in this way was to supply it with a curated set of assemblies before the read clustering step. This required two things:

  1. Freshly produced sets of assemblies using minimally filtered reads
  2. Curation criteria for screening the newly assembled genomes

(1) was simple – the top-90%-scoring reads above 1000 base pairs from each of the passed m2 and m3 flowcell sets were isolated into their own read groups, and a third read set was created from the combined passed m2 and m3 reads using the same filter cutoff. This third read set would take the place of the assembly raw reads for all post-assembly downstream steps in the Trycycler pipeline. The filtlong command for this step would look something like this:

filtlong --min_length 1000 --keep_percent 90 pass.fastq.gz > filtered.fastq
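Applied per flowcell and to the combined set, that works out to something like the sketch below – file names are placeholders, and the combined set can be made with a plain cat since concatenated gzip streams are themselves valid gzip:

```shell
# Same filter per flowcell, plus a combined m2+m3 set for downstream Trycycler steps.
filtlong --min_length 1000 --keep_percent 90 m2_pass.fastq.gz > m2_filtered.fastq
filtlong --min_length 1000 --keep_percent 90 m3_pass.fastq.gz > m3_filtered.fastq
cat m2_pass.fastq.gz m3_pass.fastq.gz > m23_pass.fastq.gz
filtlong --min_length 1000 --keep_percent 90 m23_pass.fastq.gz > m23_filtered.fastq
```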

Here are the post-filter raw read plots:

(2) – figuring out the assembly screening criteria – ended up being pretty simple as well. By this point I had a pretty decent idea of some of the more interesting features that must be present in a complete genome of this mystery archaea. I focused on seemingly conserved placements between a certain type of operon cluster and CRISPR spacer sites.

The chromosome is one component consistently recovered through all of the assemblies so far, in many cases with mere bases of difference between assembly iterations. And a consistent feature of all of the chromosomes so far has been the presence of a specific operon some preserved distance away from a CRISPR spacer site consisting of 118 and 13 repeat units. Curiously, the same operon cluster (with enough differences to be a unique instance on its own) shows up on one other contig as well, again a roughly conserved distance away from another CRISPR spacer site consisting of 8, 34 and 38 repeat units. A similar operon to the ones in those two contigs (the chromosome and the mystery contig) shows up on a third contig, this time carrying a transposase in between one of the operon elements.

I know the chromosome is real, as well as the operon and the 118/13 CRISPR repeat units. The other 8/34/38 CRISPR spacer site and its operon seem real, and the third iteration of the operon on the third contig looks unique enough with the conserved transposase site stuck in the middle of the operon pattern. These contig-level characteristics were enough to let me form the initial assembly screening criteria – a more complete representation of the archaeal genome in question should carry three x operons, two of them along with CRISPR spacer sites of the 118/13 and 8/34/38 patterns, and the final operon carrying a transposase insertion site.

With the screening criteria set, I subsampled each of the m2, m3 and m23 read sets into 16 groups, creating four assemblies from each of four different assemblers (flye, miniasm+minipolish, raven and shasta) per read source, ending up with 16×3 = 48 assemblies to screen through. The new assemblies were annotated with prokka and analyzed for the presence of the three marker characteristics described above (no fancy bioinformatics tools available for this step beyond the usual shell utilities and piping – I had to read through the annotation files and make notes with pen and paper). As expected, assemblies with spurious contigs tended to carry duplicates of the marker traits across the genome, such as a third or sometimes up to a fifth CRISPR spacer site following the same 8/34/38 pattern, all carbon copies of each other down to the base level. The same went for operon x – some assemblies carried up to 13 copies, but after the initial three they were all essentially just copies of each other. Curiously enough, the largest group of shared marker traits across the 48 assemblies followed the traits I’ve described already – 3 operons and 2 CRISPR sites with specific spacer repeat units – supporting the possibility that the pattern represents the true layout of the genome. The 18 assemblies that displayed the targeted marker site properties were isolated and used for the downstream Trycycler process, outlined in my previous note starting with the trycycler clustering step. The final genome was polished with the usual racon – medaka – polypolish pipeline as well, using the Illumina reads from the first genome extraction.
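The subsample-and-assemble stage above can be sketched roughly like this – read set names and the subset-to-assembler split are illustrative, and I’m assuming Trycycler and the assemblers are on PATH (trycycler subsample writes sample_01.fastq through sample_16.fastq with --count 16):

```shell
# One read source shown; repeat for m2 and m3. Four subsets go to each of the
# four assemblers, giving 16 assemblies per read source.
trycycler subsample --reads m23_filtered.fastq --out_dir m23_subsets --count 16
for i in 01 02 03 04; do
    flye --nano-hq m23_subsets/sample_"$i".fastq --threads 8 --out-dir m23_flye_"$i"
done
# ...with analogous loops handing sample_05-08 to miniasm+minipolish,
# 09-12 to raven and 13-16 to shasta.
```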

The resulting genome is composed of 6 fully circularized contigs totaling 3965466 bp and 62.18% GC content.

001_chromosome  2767537
002_plasmid      406045
003_plasmid      260739
004_plasmid      123302
006_plasmid      350887
009_plasmid       56956
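Those totals are easy to double-check straight from the assembly fasta with awk (final_asm.fasta being the assumed file name):

```shell
# Sum sequence lengths and G/C counts over all non-header lines.
awk '!/^>/ {len += length($0); gc += gsub(/[GCgc]/, "")}
     END {printf "%d bp, %.2f%% GC\n", len, 100 * gc / len}' final_asm.fasta
```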

Its BUSCO profile seems pretty good, though we’ve always had a good recovery percentage on this one, probably because we’ve always had a pretty decent representation of the chromosome.

busco -m genome -i final_asm.fasta -o final_asm -l archaea_odb10

        |Results from dataset archaea_odb10               |
        |C:99.5%[S:99.5%,D:0.0%],F:0.0%,M:0.5%,n:194      |
        |193    Complete BUSCOs (C)                       |
        |193    Complete and single-copy BUSCOs (S)       |
        |0      Complete and duplicated BUSCOs (D)        |
        |0      Fragmented BUSCOs (F)                     |
        |1      Missing BUSCOs (M)                        |
        |194    Total BUSCO groups searched               |

I’ve also run the usual gamut of minimap2 alignment tests on problematic regions and checked for possible duplicates, and the results are extremely promising. More notes on the post-assembly analysis methods later, probably in conjunction with ‘Genome assembly for artists’ part II.

We hope to have a preprint of the results and observations soon, possibly within the next couple of weeks. I’ll make sure to post it here and write up a post-mortem as well.

A couple of takeaways from this particular assembly experience:

  • Assemblers can’t make up for already damaged DNA, even if the damage could have happened through the microbe’s native DNA indel mechanisms.
  • Hybrid assemblies could be relegated to an interesting oddity soon, especially as Q20 ONT reads and SUP basecalling become the norm. All of Trycycler (48 assemblies plus subsampling and clustering) and polishing took maybe four or five hours. Hybrid assembly using the same dataset took 12~15, and still resulted in a fragmented genome.
  • Rapid iteration of assemblies using raw reads filtering and flye can get us results that are about as good as a well-considered Trycycler output, but it’s a gamble.
  • Re-sequencing older genomes with newer technology is definitely a worthy endeavor. I’ve run into numerous assemblies of our archaea’s genus cousins, all of which, in retrospect, seem very incomplete and in dire need of updates.
  • Archaeal plasmids are uncharted territory. My usual HMM/blast tools were not terribly useful this time, since important features often went unannotated.

Genomes in the machine

Our warehouse lab has a new addition – a Thelio desktop computer from System76, fully decked out with a 32-thread AMD Ryzen 9 5950X and 64GB of ECC RAM.

Thelio in the lab! Pardon the mess – I’m still figuring out a decent desktop arrangement

Of course, I couldn’t afford a machine like this with our amateur biologist resources – we were fortunate enough to have been chosen for the “Unleash your potential” program run by System76 earlier this year, aimed at the maker/hacker crowd. I believe we need to update System76 on some of our experiments performed with the machine, so stay tuned for an article that might come out on their end later.

System specification via neofetch

The maxed-out 64GB of RAM should be enough to get us through most microbial genome assembly and analysis needs, though long-read microbial metagenomics might require some tinkering – I’m pretty sure we’ll need at least 75GB of RAM for a decent data set. The Ryzen 9 5950X is still about the best most people will ever use on a consumer-to-semi-pro rig, and it’s been proving its mettle time and time again. From very informal initial tests, the machine seems to assemble/align genomes about eight times faster than I’m used to from another workstation at our lab, running a 32-thread Intel Xeon E5-2670.

The new machine also comes with a discrete GPU (as is usual for most AMD builds I think, since AMD processors tend not to carry an IGP on the package) – alas, GPUs are the main victims of the 2020-2021 cryptocurrency mania. The Nvidia GT1030 (I think we have the 4GB version) is probably close to the lowest-powered graphics card still widely available these days, and was the only desktop GPU available for this model at the time. I do need to upgrade the card at some reasonable time in the future since we’re pretty heavy on Nanopore long-read sequencing and analysis, and the best parts of their processing pipeline more or less require a GPU these days.

This got me thinking – how far can the GT1030 get us in the traditional Nanopore pipeline – say, GPU-based guppy basecalling from raw fast5 data? I decided to re-basecall our Deinococcus radiophilus fast5 data from 2019, based on the RAD-004 rapid sequencing kit and a MinION 106D flowcell. The original data was basecalled using albacore, and the new data will be processed using guppy’s super mode.

Here are the stats from the original sequencing data, from before the era of guppy/medaka and other fancy shenanigans we get to work with these days (long-read oriented tools were in their infancy when this data was produced – most of our favorite assemblers were not yet around, and none of our polishers of choice were available back then).

The key takeaway here is the quality score distribution

The command to run guppy in super basecalling mode is simple – just run guppy_basecaller as usual and call an appropriate config file, like dna_r9.4.1_450bps_sup.cfg. However, running the GPU version of guppy on an anemic graphics card like the GT1030 will require additional tweaking – otherwise the processes will terminate due to lack of memory.

From my experience, the setting below is necessary to run guppy in super quality basecalling mode on an Nvidia GeForce GT1030.

guppy_basecaller -i input_fast5_folder/ -s fastq_output -c dna_r9.4.1_450bps_sup.cfg -x 'cuda:0' --gpu_runners_per_device 1 --chunk_size 500 --chunks_per_runner 128

The above guppy setting with the GT1030 will basecall a 2.45 billion nucleotide fast5 dataset in a little less than 7 days. Now, this might sound like a lot, but remember that super quality basecalling on guppy is very computation intensive. I tried running the same fast5 dataset on all 32 threads of a Xeon E5-2670 workstation before. Running the machine like a space heater for almost a month barely got us through 15%~20% of the whole dataset, and I had to cancel the basecalling process for fear of damaging the CPU. So it takes 7 days for one of the cheapest possible GPUs on the market to finish a job that would take a server-class Xeon processor 4~5 months to wrap up. I’d call that an almost unbelievable degree of improvement, to the point that super mode basecalling on the CPU-only version of guppy shouldn’t even be considered an option for most.

Please do note that all of this applies to guppy’s super quality basecalling mode only – HAC (high accuracy) basecalling mode can be almost as good (I’m not seeing a huge difference between the SUP-basecalled versus HAC-basecalled archaeal genome in our lab right now), and tends to be light enough to run on CPUs if absolutely needed (still not an optimal experience, but the resource requirements are far, far more manageable).

So, what’s the difference between an old albacore basecalled data versus the new guppy super basecalled data? If you’re sitting on an old (pre-2020) Nanopore rapid kit microbial genome, would it be worth re-basecalling with guppy and re-assembling circa 2022? I believe so. Not only is our Deinococcus radiophilus dataset vastly improved, it’s a little different now.

Let’s start with Nanostat summary of the newly basecalled data.

Summary of the newly basecalled Deinococcus radiophilus data

Some differences stand out immediately. For one, we now have fewer bases, down from 2.45 billion to 2.28 billion. This is a behavior we’ve seen in other in-house datasets as well – guppy running fast5 data in super mode will likely result in fewer bases. In return, however, we’re seeing a pretty great increase in quality scores across the board. The top mean basecall quality score in the previous data was 12.8; now it’s 41.7. The lowest mean basecall quality score among the longest reads in the previous data was 4.5; now it’s 9.5. We had 23283 reads over Q10 in the previous dataset; now it’s 183923 reads over Q10. That’s an almost 8x increase in reads that could be classified as high accuracy.
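For anyone wanting to reproduce that kind of count, the per-read mean quality lives in guppy’s sequencing_summary.txt under the mean_qscore_template column, so a Q10 tally is one awk away:

```shell
# Locate the qscore column by name in the header, then count rows at or above Q10.
awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) if ($i == "mean_qscore_template") q = i; next}
            $q >= 10 {n++} END {print n + 0}' sequencing_summary.txt
```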

Let’s look at plots of average read quality against average read length from both datasets:

Please do note that the average read quality scale on the Y axis runs differently for the two graphs. The original data (left, fuchsia) runs from Q2 to Q12+, and the new data (right, green) runs from Q5 to Q20+. The best quality score cluster for the old data sits between Q8 and Q11-12, while for the new data it sits between Q10 and Q15. On the other hand, we can also see that the longest read on the X axis for the old data stretches to 200k+, while the new data tops out somewhere above 150k. Guppy’s super mode could be doing some sort of trimming here. But if the choice is between a couple of <Q7 210kb reads versus a couple of Q14 160kb reads, most would opt for the latter.

Here are the same plots in log length:

To summarize – the GT1030 can do super quality basecalling with guppy under a very specific setting, and a smaller microbial genome’s fast5 data will need about a week’s computing time. Super quality basecalling results are absolutely superior to albacore’s – it might be worth the time and resources to re-basecall and update pre-2020 Nanopore microbial assemblies.

I’m a little swamped finishing our existing Deinococcus radiophilus manuscript at this time, but will be running our assembly pipeline on the newly basecalled data soon. I get a feeling that the assembly level comparisons between the albacore data versus guppy-super data might tell us something interesting.

Methylobacterium update

A quick note on minor housekeeping stuff for the unidentified Methylobacterium species we recovered from the East River (December 25th, 2021).

Some of the culture plates started exhibiting anomalous growth patterns on the PPES-K plates. The obvious culprit is contamination, or latent microbes carried over from the original sampling through the last few subculture runs, but the abnormal growth also displayed the same red carotenoid hue as the original isolated colonies. My instinct is still on contamination, but we’ll have to wait on further analysis for now.

Colonies from both the plates with abnormal growth and the plates that still retain the same phenotype as the original 16S barcode-identified Methylobacterium were isolated and subcultured in fresh PPES-K media. Red plates indicate subcultures from plates without signs of contamination, and black plates indicate possibly contaminated sources.


C-MR-Sep 27
D-MR-Sep 27
E-MR-Sep 27



The new batch of PPES-K was made with the recipe below – I’ll try out a 1/100 methanol addition for test cultures after this one.


0.5g Soy powder
0.5g Milk powder
0.25g Yeast Extract
0.001g Ferric citrate
4g agar

I’d love to move on to growing, analyzing and sequencing these microbes soon, but the recent workload has been a little much. Everything’s been delayed by about a month or so… Hopefully I’ll be able to maintain these previous samples properly.

“Genome assembly for artists” draft part 1

Following is an early work-in-progress draft for an upcoming zine “Genome assembly for artists.” The content here is released under CC BY-NC-SA license.

Normally I work with publicly accessible databases full of assembled genomes and other biological data, like the almost eponymous NCBI GenBank. The tools and methods for accessing these vast repositories can sometimes get a little archaic, but gaining even a rudimentary feel for them can open doors to a lifetime’s worth of study and discoveries for a dedicated amateur researcher. Think of it as gaining the lay of a great library, filled with codices millions of years in the making.

Alas, it looks like modern biology will sooner or later demand some sort of whole genome sequencing of an unsuspecting researcher, usually while lacking a decent reference data point. As I had to dive head first into a couple of these situations recently, I’m writing out a quick note on how an amateur researcher might approach an assembly process circa 2021. Please note that the processes described here are simply what I happened to pick up over time and are in no way indicative of a best practice scenario! And as usual any comments, questions or advice are welcome – you can contact me on twitter @naturepoker1 or leave a comment on this post.

In this particular note, I’ll address dealing with long-read data from the ONT MinION platform using Linux (Ubuntu 20.04 LTS in this case, but the general gist should be universally applicable). I’m also going to assume that the researcher has at least a general idea of the microbe’s taxonomy via traditional 16S Sanger sequencing (barcode sequencing on samples can range from $10~$20 per sample, USD). Another huge assumption is that we have a bioconda environment already installed and set up on the machine (https://bioconda.github.io/user/install.html). The demo data used for figures and tables here come from a combination of our previous and ongoing pre-publication sequencing experiments – as with all data from my lab notebook, please refrain from using them without asking me first!

Let’s start with establishing a base raw data set that will be used for all downstream assemblies and analysis. Working with raw data can get pretty hectic with multiple branches of one thing or another down the road, so it really helps to establish an initial baseline, even if you decide to work with a differently curated and filtered set later.

Nanopore introduced different basecalling tiers somewhat recently – high accuracy (HAC, seemingly the default in the MinION client as of 2021/11/25) and super (SUP, as the name suggests it’s supposed to be more accurate than even the HAC data). Both HAC and SUP basecalling methods can be very computationally intensive, especially on consumer grade machines without a dedicated GPU. If you’re working without a GPU-equipped machine, I’d recommend turning off the live-basecalling feature during sequencing and manually basecalling with an instance of guppy (ONT’s basecaller) once the sequencing run is done.

ONT’s initial sequencing raw data will be in a format called fast5. A local instance of guppy (should it not already be installed on your system – which you can check with guppy_basecaller --version) through a direct binary file download from the ONT website is the simplest way to process your data. The files are located here: https://community.nanoporetech.com/downloads (ONT account required). Simply download a binary appropriate for your hardware (CPU/GPU version) and put it in any directory on your PATH. Make sure all your fast5 files are copied to a separate, single directory for processing as well.

Once everything’s ready, you can check for guppy’s basecalling configuration files suitable for your flowcell and sequencing kit combination via the command below:

guppy_basecaller --print_workflows

This will output a table, listing in order: flowcell model, sequencing kit, corresponding configuration file name, and model name. For our purposes we only need to care about entries in the first three columns.

Here’s what a basecalling command for a Flongle flowcell and the RAD004 rapid sequencing kit would look like:

guppy_basecaller -i your_fast5_folder/ -s output_folder_name --flowcell FLO-FLG001 --kit SQK-RAD004

The above flowcell and sequencing kit combination corresponds to the configuration file dna_r9.4.1_450bps_hac in the table output by the --print_workflows command. So the above command is the same as the one below.

guppy_basecaller -i your_fast5_folder/ -s output_folder_name -c dna_r9.4.1_450bps_hac

For the curious, the configuration file name follows this format:

dna_r9.4.1(flowcell chemistry)_450bps_hac(accuracy mode - in this case referring to high accuracy HAC)

Make sure to check your flowcell chemistry! If in doubt, match against the table from the --print_workflows command by looking up your flowcell and sequencing kit models.
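One quick way to do that lookup without scrolling through the whole table (flowcell/kit names here are the ones from the example above):

```shell
# Filter the workflow table down to one flowcell/kit combination.
guppy_basecaller --print_workflows | grep FLO-FLG001 | grep SQK-RAD004
```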

If you’re interested in super quality basecalling, you need to call the configuration file directly (as of the date of this note), like below:

guppy_basecaller -i your_fast5_folder/ -s output_folder_name -c dna_r9.4.1_450bps_sup.cfg

Eventually you’ll end up with an output folder containing passed vs failed quality fastq reads, along with a sequencing summary file and one or more logs. At this point I usually run NanoStat (https://github.com/wdecoster/nanostat) on the summary file to get a feel for the read quality. You can install it on your system via conda or pip3.

NanoStat --summary sequencing_summary.txt --readtype 1D -n outputfile.txt
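The same top-scoring reads NanoStat reports can also be pulled directly from the summary file – print each read’s mean qscore and length, then sort numerically (the column names are guppy’s own; their positions are looked up from the header):

```shell
# Emit "qscore <tab> length" per read, highest-quality reads first.
awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) {if ($i == "mean_qscore_template") q = i;
            if ($i == "sequence_length_template") l = i} next}
            {print $q "\t" $l}' sequencing_summary.txt | sort -gr | head -5
```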

I like to look at the top 5 highest mean basecall quality scores and their read lengths from the NanoStat output file and use them for the next step – filtering the raw data using filtlong (https://github.com/rrwick/Filtlong). Before the filtering step though, you’ll want to cat all the fastq files you want to use for downstream processes into a single file. I normally just start with the fastq reads in the pass folder of the guppy output, and then go back and include the failed quality raw reads should we run into anomalies or other trouble down the road.

cat *.fastq > all_data.fastq

Now we’re ready to filter the data with filtlong. My personal preference here is to keep as much data as possible while maintaining a minimal degree of quality. The idea is that as long as we can get a structurally sound assembly with decent contiguity, we can make up for possibly low quality reads with k-mer cluster-based assembly and multiple polishing. We’re using a long-read platform, might as well play to its strengths. So I’ll use the shortest read length among the top 5 scoring reads as the minimum length cutoff. Let’s assume it’s 1000bp.

filtlong --min_length 1000 --keep_percent 95 all_data.fastq > filtered_data.fastq

For reference, you can use the command below to check the number of reads in the final filtered read set:

echo $(cat filtered_data.fastq | wc -l)/4 | bc

And here’s a handy way to count the number of bases in your raw read set:

cat filtered_data.fastq | paste - - - - | cut -f2 | tr -d '\n' | wc -c

Okay, now we have the filtered data set to carry us through the rest of the assembly process. In many cases this should be the extent of filtering necessary to get a reasonably complete genome of your microbe. If many of the following processes fail or result in anomalous behavior, there’s a good possibility that you’ll simply have to go back and get more data through another sequencing run. I’ve generally found long-read assembly to be a bit of a binary – either you have enough data for a reasonable assembly, or you don’t. But, of course, your mileage may vary (in fact, we’re currently working on a puzzling assembly of a previously unsequenced microbe where the genome might be complete or about half a million base pairs short. It’s hard to tell without a reference!).

Since we’re assuming a sequencing case without a good reference genome, my first step is to run a quick exploratory assembly and see where the pieces fall. Think of it like a poster study prior to committing to a full painting of the subject – we’re interested in a quick, broader layout of things.

Poster study for a Caravaggio painting. I can’t remember where I found this one…

Of all the long-read assemblers I’ve used so far, flye (https://github.com/fenderglass/Flye) tends to produce the most consistently reliable output. You can install it in a contained environment via conda, or by cloning the git repo and compiling or installing it directly. Here’s a quick sample of what your command might look like with HAC or SUP long reads, but please do experiment for your own genome set!

flye -o output_dir --threads 8 --nano-hq filtered_data.fastq

The output folder will include a whole bunch of items, but for now we’re interested in the assembly_graph.gfa and assembly.fasta files (and possibly the flye.log file if you want to go through the assembly steps). The assembly.fasta file is self-explanatory. The GFA (Graphical Fragment Assembly, https://github.com/GFA-spec/GFA-spec) file is essentially a representation of paths through the reads in the assembly connected via overlaps – non-branching paths are resolved as contigs. The file itself contains sets of sequences and their overlaps – but its key advantage (for an amateur like myself, at least) is allowing rapid visualization of the state of the assembly using a package like Bandage (http://rrwick.github.io/Bandage).
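As an illustration, a toy GFA for a single circularized contig might look like this (the 14 bp sequence is made up) – the S line carries the sequence, and the L line joining contig_1’s end back to its own start is what Bandage draws as a circle:

```text
H	VN:Z:1.0
S	contig_1	GATTACAGATTACA	LN:i:14
L	contig_1	+	contig_1	+	0M
```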

Here’s what a graphical representation of your exploratory assembly might look like. When dealing with a microbial genome, it’s usually pretty safe to assume that the largest circularized contig in an assembly is the chromosome and the rest are extrachromosomal elements like plasmids (there are exceptions – linearized plasmids and chromosomes do happen in nature, DOI: 10.1111/j.1365-2958.1993.tb00963.x). In my experience (again, your mileage may vary), a decent microbial raw data set should give you a decent representation of a chromosome at this stage. If your assembly is failing to give you at least a chromosome here, it’s time to consider going back and obtaining more sequencing data. In the examples below it’s easy to see that the first assembly came out looking pretty great (as far as retrieving broader structural elements is concerned) and the second assembly is going to be a little rocky, though it looks like we recovered the chromosome.

Now, just having one assembly and its graph doesn’t really tell us much about how accurate they are in the real world. It’s fully possible to have ‘assembled a genome’ only to have it turn out to be bogus. In order to get a somewhat clearer picture of what’s happening, we’ll have to gather some external data. Let’s start with assemblies on NCBI Genbank: https://www.ncbi.nlm.nih.gov/assembly/

Depending on the taxon of the organism you’re sequencing (you’ve already done rudimentary 16S barcoding, right?) we can look for additional information by going through https://www.ncbi.nlm.nih.gov/taxonomy – but in this case we’re looking for assembled genomes and CDS of microbes in the general phylogenetic vicinity of what we’re sequencing.

A query on the NCBI Genbank assemblies page for refseq, complete genome (all genomic compartments are closed), and representative genomes (no 8000 different genomes of E. coli K-12 please!) might look like this:

"YourPhylaOrClass"[Organism] AND "latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter]

If you’re working with a particularly unknown phylum or class, you’ll have to remove some filters – maybe complete genome – and tweak around until you have at least 12~20 assemblies. Once you get enough hits, use the “send to” option on the upper right corner of the page and download the table of assembly accession IDs, like in the screenshot below.

And we can filter out the GCF (NCBI refseq) column from the table with the one-liner below, leaving us with just the accessions we need. Make sure to check that you’re getting the expected number of genome accessions by running the follow-up command:

awk '{print $3}' assemblies.txt | tail -n +2 > your_target_gcf.txt

wc -l your_target_gcf.txt 

With the GCF genome accession list ready, let’s start out by building a quick phylogenetic tree. The goal here is to see if our exploratory genome falls within its evolutionarily expected place on the tree of life, and to identify any potential close relatives that could provide us with reference points. My absolute favorite pipeline for this process is GToTree – https://github.com/AstrobioMike/GToTree – which uses single copy gene sets to calculate evolutionary distances between supplied genomes. I can safely say that discovering the package from AstrobioMike got me started on phylogenetics itself. An easy way to install the pipeline is through conda:

conda create -y -n gtotree -c conda-forge -c bioconda -c astrobiomike gtotree

And running the package might look something like this:

conda activate gtotree

GToTree -a your_target_gcf.txt -f name_of_local_fasta_files.txt -H Bacteria -t -L Class,Species,Strain -o output_dir_name

Please note that the number of genomes listed in the target_gcf.txt file seems to count toward your IP’s NCBI download activity – so running overly large lists of genomes over and over again will throttle or cut off your access to NCBI from time to time. It’s also worth noting that the -H flag denotes the HMM file to be used for the phylogenetic analysis (you can check for included files by using the command gtt-hmms), and GToTree allows users to supply their own SCG HMM compilation. I’ve covered a pretty rudimentary way of putting together a custom HMM file in a previous note here (which itself is my study note from AstrobioMike’s material!): https://naturepoker.wordpress.com/2021/06/07/making-use-of-single-copy-gene-hmm-set-in-mostly-bash-way/

GToTree’s output files are extremely detailed and well thought-out, and just describing different ways to analyze them is worth a whole post of its own (which will be posted soon). For now, we’re interested in the GToTree_output.tre file, which is a phylogenetic tree of all the analyzed genomes. https://itol.embl.de and Dendroscope (https://software-ab.informatik.uni-tuebingen.de/download/dendroscope3/welcome.html) are my favorite ways to visualize them right now, though visualization/annotation packages for phylogenetic tree files are a rapidly evolving field. Here’s a phylogenetic tree built from a Deinococcus genome set.

An example phylogenetic tree from GToTree package. Red line marks our test assembly, blue line marks closest phylogenetic neighbor.

In this case we can see that our assembly (marked with the red line) is clearly a Deinococcus species, and that its closest phylogenetic neighbor is Deinococcus proteolyticus (marked with the blue line). The analysis that generated this figure ended up letting us use the complete sequences of the Deinococcus proteolyticus plasmids to align against and verify recovery of plasmids in our own assembly. It turns out the largest plasmids in D. proteolyticus and D. radiophilus share an uncanny degree of sequence homology (data being compiled for preprint use soon!). Finding closely related genomes of our microbe of interest will be necessary for essentially every sort of analysis we can do at this point – general alignment, kmer analysis, and of course, annotation.

Before proceeding further, we’re going to use a tool called ncbi-genome-download (https://github.com/kblin/ncbi-genome-download) to compile a protein fasta of our analysis group (depending on your data set, I’d recommend keeping the protein faa list on the smaller side. NCBI WILL throttle or cut you off if you download things a bit too liberally). Running the script with the list of accessions will look something like this:

ncbi-genome-download bacteria -A your_target_gcf.txt -F protein-fasta -o your_target_gcf_fasta

Once all the sequences are downloaded, we need to gunzip them and cat them together into a single annotation source file. An easy way to do it from the top directory (where all the individual GCF_ folders are located) might look like:

shopt -s globstar 

mkdir faa

cp **/*.gz faa

gunzip faa/*.gz

cat faa/*.faa > reference.faa 

What we’ve done so far is compile a reference protein fasta from a list of microbes phylogenetically related to our microbe of interest. We’re going to use the protein fasta with our flye exploratory assembly and the Prokka annotation pipeline (https://github.com/tseemann/prokka). Prokka is an annotation pipeline that links together a whole bunch of disparate processes, like RNA prediction, CDS prediction, and blast searches against supplied databases. I find the easiest way to get Prokka up and running is a self-contained conda environment dedicated to the pipeline (otherwise you might find yourself with broken packages later – in general, keeping each pipeline in its own environment is good practice with conda).

conda create -n my_prokka prokka

conda activate my_prokka

If you don’t have samtools or pyfaidx installed system-wide (they’re used often enough that I’d recommend installing the latest versions on your base OS rather than as conda packages), now is the time to do so. Once one of the tools is available on your system, let’s split our exploratory assembly.fasta file. I do this because most of the time I want an easier way to see whether any of the recovered contigs are fully featured plasmids/chromosomes.

faidx --split-files assembly.fasta

The above command will create a .fai index file and split the assembly into separate files by contig. If you want to avoid duplicate annotation during the Prokka step, rename the original assembly.fasta file.
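
Since deciding whether a contig could be a complete plasmid or chromosome starts with its size, a quick length-per-contig check is handy here. A small awk sketch over a toy fasta (swap in your own assembly.fasta; the same numbers also sit in the first two columns of the .fai index):

```shell
# Toy fasta standing in for a real assembly
cat > assembly_demo.fasta <<'EOF'
>contig_1
ACGTACGTACGT
ACGT
>contig_2
ACGTACGT
EOF

# Sum sequence lengths per contig header
awk '/^>/{name=substr($0,2); next} {len[name]+=length($0)}
     END{for (n in len) print n, len[n]}' assembly_demo.fasta | sort
# contig_1 16
# contig_2 8
```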

Now, let’s use a quick bash loop to cycle through all fasta files in your working directory (it’s better to always create separate working directories and copy over original files to process) and annotate them using Prokka and the reference protein fasta. Feel free to throw in a couple of closely related genomes found through the GToTree phylogenetic tree step – they’ll act as control for our own data.

for F in *.fasta; do

N=$(basename "$F" .fasta);

prokka --kingdom Bacteria --proteins reference.faa --rfam --locustag "$N" --outdir "$N" --prefix "$N" "$F";

done

We’re interested in gene count and rRNA recovery at this exploratory assembly and analysis stage. The assembly hasn’t been properly polished yet, so the data we have now won’t be fully representative of what we’ll get once we finish everything. Long-read assemblies still suffer from insertion-deletion issues, which often result in over-prediction of CDS in the genome. How does the CDS count in the exploratory assembly compare to the Prokka output of its closest phylogenetic neighbors? What about rRNA count and location? Are they located in clusters on a single contig or spread across multiple contigs? These are all questions worth considering – but there’s also no need to spend too much time here. Remember, we’re still at the exploratory analysis stage – a poster study for the real painting to come. The primary goal is to make sure there aren’t unexplainable anomalies (does your Deinococcus genome have contigs full of Bacillus genes? Ten times the CDS count of the nearest neighbor? No rRNA sites?), and to give ourselves some understanding of what we’re dealing with in a somewhat broader biological context.
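
Prokka writes a plain-text summary for each run (at outdir/prefix.txt) with counts like CDS: and rRNA:, which makes the neighbor comparison easy to script. A sketch using mock summary files standing in for real Prokka output – the directory and genome names here are hypothetical:

```shell
# Mock Prokka summaries – real ones are written to <outdir>/<prefix>.txt
mkdir -p contig_1 neighbor_1
printf 'contigs: 3\nCDS: 3050\nrRNA: 9\n' > contig_1/contig_1.txt
printf 'contigs: 1\nCDS: 2980\nrRNA: 9\n' > neighbor_1/neighbor_1.txt

# Tabulate CDS and rRNA counts across every annotated genome
for T in */*.txt; do
  printf '%s\t%s\t%s\n' "$T" "$(grep -m1 '^CDS:' "$T")" "$(grep -m1 '^rRNA:' "$T")"
done
```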

The final step in exploratory analysis is the simplest one – BUSCO (https://busco.ezlab.org/) analysis of our exploratory assembly against a small set of the closest phylogenetic neighbors identified earlier in the process. BUSCO stands for Benchmarking Universal Single Copy Orthologs, and it runs on the same kind of universally and taxonomically expected single copy gene sets as GToTree from our previous step. Simply put, this analysis tries to see whether the microbe in question has all the genes it needs to live, plus all the genes most members of a particular evolutionary clade tend to have for whatever reason. We’ll be utilizing conda again for this one. Keep in mind that BUSCO needs to download a somewhat sizeable amount of data to its installation location in order to run.

conda create -n my_busco busco

conda activate my_busco

busco -m genome -i assembly.fasta -o assembly_busco -l lineage_data_set        

-m denotes the mode, in our case genome. -o is the output directory. -l should be followed by a lineage dataset name (e.g. aquificae_odb10); the available datasets can be listed by running the command:

busco --list-datasets

Going through the BUSCO output can also yield tons of interesting analysis, but for now we’re just interested in the short summary txt file in the output directory. Here are two example outputs:

        ***** Results: *****

        164     Complete BUSCOs (C)                        
        164     Complete and single-copy BUSCOs (S)        
        0       Complete and duplicated BUSCOs (D)         
        21      Fragmented BUSCOs (F)                      
        9       Missing BUSCOs (M)                         
        194     Total BUSCO groups searched 

        ***** Results: *****
        22    Complete BUSCOs (C) 
        22    Complete and single-copy BUSCOs (S) 
        0    Complete and duplicated BUSCOs (D) 
        61    Fragmented BUSCOs (F) 
        41    Missing BUSCOs (M) 
        124    Total BUSCO groups searched 
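
The counts in these summaries convert to the completeness percentages discussed below with a single awk pass. A sketch fed the first example block (point it at your own short_summary*.txt file in the BUSCO output directory):

```shell
# Copy of the first example summary above, as a demo input file
cat > short_summary_demo.txt <<'EOF'
        164     Complete BUSCOs (C)
        164     Complete and single-copy BUSCOs (S)
        0       Complete and duplicated BUSCOs (D)
        21      Fragmented BUSCOs (F)
        9       Missing BUSCOs (M)
        194     Total BUSCO groups searched
EOF

# Completeness = complete BUSCOs / total groups searched
awk '/Complete BUSCOs \(C\)/{c=$1} /Total BUSCO groups/{t=$1}
     END{printf "complete: %.1f%%\n", 100*c/t}' short_summary_demo.txt
# complete: 84.5%
```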

I’ve seen an unusual lack of BUSCO hits in long-read assemblies happen in two ways – either the raw reads don’t contain the genes in the first place, indicative of low coverage, or the quality of the sequences is low, making the output especially indel prone. Indels in this context can chop up genes by terminating them unexpectedly, or by putting start codons in nonsense locations (the eagle-eyed among you will eventually notice two or more consecutive annotations of the same gene in unpolished long read assemblies. Those cases are largely due to indel errors). BUSCO scores are a pretty simple way to gauge the quality of the data we’re working with, and if you have decent phylogenetically related genomes on the side, they give us a goal to build toward.

Now, the above two BUSCO outputs are especially interesting because they’re more or less on the extreme opposite ends of each other. The top set is from an archaeal assembly, and shows recovery of 164 (84.5%) single copy ortholog genes. The bottom one is from a bacterial assembly and shows a depressing 22 (17.7%) single copy ortholog gene recovery. These two BUSCO scores are from the two GFA example assemblies shown earlier in this note. Which one belongs to which?

It turns out that the assembly that showed clean, well circularized contiguity is the one with 17.7% gene recovery, and the assembly with problematic fragmentation is the one with a relatively impressive 84.5% gene recovery. Neither genome has been polished yet. I can also tell you that among the two examples, the bacterial genome with 17.7% gene recovery turned out to be a very accurate representation of all the genomic compartments present in the microbe, as backed by third party literature and analysis. From my experience, it’s entirely possible to have a genome of disappointing base pair level quality that nevertheless represents an impressive picture of larger structural elements, and another that shows promising base pair level quality yet looks scrambled when analyzed at a larger architectural level.

Alright! Now we’re done with the initial, exploratory genome assembly and (extremely surface level) analysis. By this point we should at least have a vague idea of the following:

1) Do we have enough raw data for a reasonable assembly?
2) Is the sequenced microbe what we think it is?
3) Is the raw data what we think it is?
4) What’s the quality we can reasonably expect out of our final assembly?
5) What other genomes can we look at if we’re feeling lost?

And in case you’re worried about the wall of text so far – everything written here up to this point (outside of basecalling) is one leisurely evening’s work for someone with a New York City public high school education. In fact, it’s so simple that I’ve automated away most of the processes so far, and will be releasing the tool on the Binomica Labs github (https://github.com/Binomica-Labs) pretty soon.

It’s finally time to create a ‘final’ working assembly based on the data we have, starting again armed with the values from our initial, broad strokes study. The main difference here will be the use of multiple assemblers at once, utilizing the trycycler pipeline to sort through potentially spurious contigs and build a consensus assembly from sorted and reconciled clusters of contigs. During the earlier days of learning to put together long-read based assemblies, I noticed that running the same raw data through different assemblers could often produce genomes just different enough from each other to be problematic. Perhaps there’s a nonsense contig that seems oddly like a digital artifact. Maybe one assembly has a plasmid-looking contig precisely double the size of similar ones from other assemblers. Maybe there’s a contig shared across four assemblers, but two of them circularize it and the other two don’t. There wasn’t a great way to fix those cases outside of deleting/ignoring entire contigs or otherwise making a subjective decision call and manipulating them manually, which simply wasn’t a viable option without a reliable, complete reference genome. The newly introduced trycycler pipeline (https://github.com/rrwick/Trycycler) more or less fixes that problem. A detailed description of trycycler and its conceptual basis is laid out excellently on the project’s wiki page (https://github.com/rrwick/Trycycler/wiki), and the author does an exemplary job describing commonly encountered issues in long-read assembly in general. I strongly recommend going through the material yourself.

For now, we just need to know that we’ll be going through the processes of:

1) Subsampling the raw reads to match the number of multiple assembler instances

2) Assembly using multiple assemblers – the more, and the more varied, the better

3) Clustering the contigs based on their mash distance – essentially we’re trying to group structurally meaningful elements together for further downstream processing: all chromosomes from the different assemblers in one group, and so on.

4) Reconciling the contigs within each individual cluster – making sure the sizes of the contigs are similar, that all contigs are circularized if they should be (by comparing end alignments), rotating the contigs so they start from the same position, and aligning them again to make sure they are reasonably similar.

5) Perform multiple sequence alignment on the clustered and reconciled contigs

6) Partitioning the raw reads among the aligned contig clusters based on best alignment – so all the raw reads corresponding to a chromosomal cluster end up there, plasmid-matching reads go to the plasmid clusters, and so on.

7) Putting everything together to create a gap-filled, error-corrected consensus sequence on a per-cluster basis.

As usual, the quickest way to get things up and running is to create an independent conda environment for trycycler.

conda create -n my_trycycler trycycler

conda activate my_trycycler

The process here can get mildly tricky, since we’ll also want to install and set up all of the target assemblers within the same conda environment or on the base OS. In this example I’ll be using Flye, Miniasm+Minipolish, Raven and Unicycler, but your selection can differ. Here, Flye, Raven and Unicycler were installed using conda; Miniasm+Minipolish was installed from the author’s git via the command below:

pip3 install git+https://github.com/rrwick/Minipolish.git

And now to subsample the filtered working raw reads – note that the --count flag sets the number of subsamples we want to create:

trycycler subsample --reads filtered_data.fastq --out_dir minion_subsets --count 16

Once the subsampling is complete, we can run our 16 assemblies. The best way to do this is to write out a script and just automate the process away. You can write something similar to the below, save it as assembly.sh, make it executable with chmod +x assembly.sh, and run it with ./assembly.sh

#!/usr/bin/env bash

threads=8    # set to the number of CPU threads available on your machine

mkdir minion2_assemblies

flye --nano-hq minion_subsets/sample_01.fastq --threads "$threads" --out-dir assembly_01 && cp assembly_01/assembly.fasta minion2_assemblies/assembly_01.fasta && rm -r assembly_01

miniasm_and_minipolish.sh minion_subsets/sample_02.fastq "$threads" > assembly_02.gfa && any2fasta assembly_02.gfa > minion2_assemblies/assembly_02.fasta && rm assembly_02.gfa

raven --threads "$threads" minion_subsets/sample_03.fastq > minion2_assemblies/assembly_03.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_04.fastq --threads "$threads" --out assembly_04 && cp assembly_04/assembly.fasta minion2_assemblies/assembly_04.fasta && rm -r assembly_04

flye --nano-hq minion_subsets/sample_05.fastq --threads "$threads" --out-dir assembly_05 && cp assembly_05/assembly.fasta minion2_assemblies/assembly_05.fasta && rm -r assembly_05

miniasm_and_minipolish.sh minion_subsets/sample_06.fastq "$threads" > assembly_06.gfa && any2fasta assembly_06.gfa > minion2_assemblies/assembly_06.fasta && rm assembly_06.gfa

raven --threads "$threads" minion_subsets/sample_07.fastq > minion2_assemblies/assembly_07.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_08.fastq --threads "$threads" --out assembly_08 && cp assembly_08/assembly.fasta minion2_assemblies/assembly_08.fasta && rm -r assembly_08

flye --nano-hq minion_subsets/sample_09.fastq --threads "$threads" --out-dir assembly_09 && cp assembly_09/assembly.fasta minion2_assemblies/assembly_09.fasta && rm -r assembly_09

miniasm_and_minipolish.sh minion_subsets/sample_10.fastq "$threads" > assembly_10.gfa && any2fasta assembly_10.gfa > minion2_assemblies/assembly_10.fasta && rm assembly_10.gfa

raven --threads "$threads" minion_subsets/sample_11.fastq > minion2_assemblies/assembly_11.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_12.fastq --threads "$threads" --out assembly_12 && cp assembly_12/assembly.fasta minion2_assemblies/assembly_12.fasta && rm -r assembly_12

flye --nano-hq minion_subsets/sample_13.fastq --threads "$threads" --out-dir assembly_13 && cp assembly_13/assembly.fasta minion2_assemblies/assembly_13.fasta && rm -r assembly_13

miniasm_and_minipolish.sh minion_subsets/sample_14.fastq "$threads" > assembly_14.gfa && any2fasta assembly_14.gfa > minion2_assemblies/assembly_14.fasta && rm assembly_14.gfa

raven --threads "$threads" minion_subsets/sample_15.fastq > minion2_assemblies/assembly_15.fasta && rm raven.cereal

unicycler --long minion_subsets/sample_16.fastq --threads "$threads" --out assembly_16 && cp assembly_16/assembly.fasta minion2_assemblies/assembly_16.fasta && rm -r assembly_16
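
Since the script above is the same four commands repeated, it can also be collapsed into a loop that rotates through the assemblers by sample index. A sketch in dry-run form (each command is echoed rather than executed – remove the echo and add back the copy/cleanup steps to run it for real):

```shell
#!/usr/bin/env bash
threads=8

for i in $(seq -w 1 16); do
  case $(( (10#$i - 1) % 4 )) in   # 10# guards against 08/09 being read as octal
    0) echo "flye --nano-hq minion_subsets/sample_$i.fastq --threads $threads --out-dir assembly_$i" ;;
    1) echo "miniasm_and_minipolish.sh minion_subsets/sample_$i.fastq $threads > assembly_$i.gfa" ;;
    2) echo "raven --threads $threads minion_subsets/sample_$i.fastq > minion2_assemblies/assembly_$i.fasta" ;;
    3) echo "unicycler --long minion_subsets/sample_$i.fastq --threads $threads --out assembly_$i" ;;
  esac
done
```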

Once the assemblies are complete, we cluster the contigs together:

trycycler cluster --assemblies minion2_assemblies/*.fasta --reads filtered_data.fastq --out_dir trycycler_workdir

We need to do some manual curation of the clusters here. We could end up with 60 or more clusters, each containing a dozen contigs, depending on the quality of the raw reads, their coverage, and the assembly quality. The .newick file generated during the process is absolutely necessary to progress further – its visual representation on a cladogram suite like itol or Dendroscope is how we’re going to decide which clusters to keep. Here are two examples of what we could see. One is from what ended up being a very successful bacterial assembly, and the other is from a problematic archaeal assembly that likely requires more raw reads.

The difference between the two is relatively straightforward – the eventually successful assembly experiment has cleanly defined cluster groups, with minimal mash distances between the contigs (as reflected in the clade lengths). The problematic genome assembly experiment shows a far more scattershot representation of clusters and contigs. There are far too many separate clusters with too few contigs in them, and they’re nowhere near as neatly grouped as in the graph from the bacterial assembly experiment. A graph like this calls for another decision – depending on the goal of the experiment, we might be better off going back and gathering additional raw data. In this particular case we proceeded, hoping to recover at least a clean picture of the chromosome. Problematic clusters and potentially problematic contigs within the clusters need to be quarantined manually – usually by changing their names so that bulk script processes downstream don’t pick them up anymore.

Once we’ve decided which clusters to keep (or which ones to remove, depending on the quality of the data and assemblies), we can move on to the reconciliation step. Note that reconciliation runs at the per-cluster level – if we’re keeping cluster_001 through cluster_010, we’d have to run the below command ten times, swapping out the cluster directory name.

trycycler reconcile --reads filtered_data.fastq --cluster_dir trycycler_workdir/cluster_001/
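
Rather than retyping the command for each cluster, the per-cluster runs can also be wrapped in a loop. A dry-run sketch over mock directories (drop the echo for the real thing – and note that reconcile output still needs to be inspected cluster by cluster; the glob conveniently skips anything renamed to bad_cluster_*):

```shell
# Mock cluster directories standing in for real trycycler output
mkdir -p trycycler_workdir/cluster_001 trycycler_workdir/cluster_002 \
         trycycler_workdir/bad_cluster_003

# Loops over cluster_* only; bad_cluster_003 is never touched
for C in trycycler_workdir/cluster_*/; do
  echo trycycler reconcile --reads filtered_data.fastq --cluster_dir "$C"
done
```

The same loop pattern works for the later per-cluster msa and consensus steps.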

Depending on the outcome, we’ll have to remove some individual contigs from within the clusters as well. Reasons for removal could be abnormal contig size or mash distance, problems with circularization (when other matching contigs are circularized), or unusually high rates of indel occurrence. Here’s an example from a particularly messy cluster – in the sample below, it would be best to remove contigs G and K before proceeding further.

Maximum insertion/deletion sizes:
  A_contig_19:    0   17   23    2   18  205    7   14  292    2   27   27
  B_utg000001c:  17    0   23   17   12  205   17   14  292   16   21   21
  C_Utg714:      23   23    0   23   23  205   23   23  292   23   21   23
  E_contig_13:    2   17   23    0   17  205    7   14  292    2   26   26
  F_utg000001c:  18   12   23   17    0  205   18   12  292   17    9    9
  G_Utg722:     205  205  205  205  205    0  205  205  292  205  205  205
  I_contig_17:    7   17   23    7   18  205    0   14  292    7   27   27
  J_utg000001c:  14   14   23   14   12  205   14    0  292   13   21   21
  K_Utg724:     292  292  292  292  292  292  292  292    0  292  292  292
  M_contig_9:     2   16   23    2   17  205    7   13  292    0   26   26
  N_utg000001c:  27   21   21   26    9  205   27   21  292   26    0    5
  O_Utg698:      27   21   23   26    9  205   27   21  292   26    5    0

Once the reconciliation step is done, it’s time for multiple sequence alignment (MSA). This step also needs to be run on a per-cluster basis.

trycycler msa --cluster_dir trycycler_workdir/cluster_001

And then we run the partitioning process using the raw read data against the cluster directories. This can be run against all the cluster_* directories at once – the * symbol is a wildcard, meaning the command will run on every directory named in the format cluster_something. Make sure none of the rejected clusters are still around (I normally just rename rejected clusters to bad_cluster_068 and so on, which the wildcard won’t match)!

trycycler partition --reads filtered_data.fastq --cluster_dirs trycycler_workdir/cluster_*

And the consensus building step, to be carried out individually on each of the good clusters.

trycycler consensus --cluster_dir trycycler_workdir/cluster_001

And we’re done-ish! We can now concatenate the finished fasta files together into our final assembly, albeit an unpolished one. The cat command can be used to combine all the individual consensus fasta files together.

cat trycycler_workdir/cluster_*/7_final_consensus.fasta > trycycler_workdir/final_consensus.fasta

‘Polishing’ is a practice more or less introduced with the modern concept of long read assembly – think of it as a dedicated, more involved version of the read partition/consensus step we just went through in the trycycler process (don’t quote me on this though – this is where my understanding of the concept begins to waver). The general agreement as of this writing is that polishing is mandatory for long read based assemblies due to the technology’s inherent indel issue. I can tell you first hand that the difference in accuracy and quality between a polished and an unpolished long read assembly is night and day. And of all the standalone polishers out there, ONT’s own Medaka (https://github.com/nanoporetech/medaka) really changed the game.

I’m going to assume we’re going to follow the usual conda route for Medaka installation – in its own environment as usual.

conda create -n my_medaka medaka

However, medaka’s true strength lies in its machine learning trained models – and it looks like Medaka is trained on datasets that assume prior racon (https://github.com/isovic/racon) polishing under a very specific set of parameters. Personally I’m unsure whether racon pre-polishing is still necessary for Medaka, but we’ll do things by the book for the sake of this introduction.

conda create -n my_racon racon

conda activate my_racon

Using racon requires bwa for indexing and sam (alignment) file generation (you could try other aligners for sam generation as well, but YMMV). Once we have the programs set up on our system we can proceed:

bwa index final_consensus.fasta

bwa mem -t 8 final_consensus.fasta filtered_data.fastq > consensus_data.sam

racon -m 8 -x -6 -g -8 -w 500 -t 8 filtered_data.fastq consensus_data.sam final_consensus.fasta > racon_consensus.fasta

Once racon polishing is done, we can use the output file for the medaka polishing process as such:

conda activate my_medaka

medaka_consensus -i filtered_data.fastq -d racon_consensus.fasta -o final_assembly -t 8 -m r941_min_hac_g507

Note the -m flag, which selects the trained model to be used for the polishing process. You can check the list of available models by typing “medaka tools list_models”. Here, r941 refers to the 9.4.1 chemistry, min stands for MinION flowcell, hac stands for High Accuracy, and g507 refers to the guppy version used. The last part might not line up exactly with what we’ve used before, but in general it should be okay to keep guppy updated and use the highest version number available among the trained models. So, if the raw reads used for the assembly are based on super accuracy (SUP) basecalling, the model called here should be r941_min_sup_g507.
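
Since the model name decomposes into fields, it can also be assembled programmatically – handy when scripting polishing runs across flowcell or basecaller versions. A sketch assuming the chemistry_device_accuracy_guppy pattern described above (always cross-check the composed name against medaka tools list_models before using it):

```shell
chemistry=r941   # flowcell chemistry 9.4.1
device=min       # MinION flowcell
accuracy=sup     # must match the basecalling mode used on the raw reads
guppy=g507       # guppy version the model was trained against

model="${chemistry}_${device}_${accuracy}_${guppy}"
echo "$model"    # r941_min_sup_g507

# the composed name would then feed into medaka, e.g.:
# medaka_consensus -i filtered_data.fastq -d racon_consensus.fasta -o final_assembly -t 8 -m "$model"
```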

If you’re wondering how much of a difference the combination of trycycler and proper medaka polishing can make – here’s a pretty extreme example. Do you remember the earlier example of a bacterial genome assembly with 22 (17.7%) BUSCO single copy ortholog genes? Here’s the result of the same benchmark – same raw reads, but with trycycler assembly and medaka polishing.

        110     Complete BUSCOs (C)
        110     Complete and single-copy BUSCOs (S)
        0       Complete and duplicated BUSCOs (D)
        6       Fragmented BUSCOs (F)
        8       Missing BUSCOs (M)
        124     Total BUSCO groups searched

If you made it this far, congratulations! It’s time to run the GToTree and Prokka annotation pipeline again, and check for improvements in genome integrity.

One more reminder – the steps described here are far from a recommendation of best practice. The goal is to help any interested member of the public assemble their own long-read genome, should they ever want to, for whatever reason.

Also, I’d highly recommend keeping conda usage to a minimum when possible, and keeping each pipeline/package in its own self-contained environment. While conda’s fantastic for getting things started, nothing beats a well crafted and laid out base system.

I’m working on two more follow-ups to this article, all part of the “Genome assembly for artists” zine planned for later this year. The second will deal with short-read based assembly and long-short read hybrid assembly. The third will address extreme low coverage read scenarios, the very, very rudimentary basics of amateur comparative genomic analysis, and a smattering of introductory shell scripting.

Deinococcus note November 2nd

Quick note on what might be a second-to-last test of the HMW Deinococcus genome extraction.

I set up a 24 hour ethanol precipitation reaction following the usual lysis step. The Deinococcus pellet, after the 2×15 minute 37C incubation with lysozyme+TE+RNase, was resuspended in 140ul of the lysis buffer plus 10ul of proteinase K and incubated at 60C for one hour.

800ul of freezer-cooled 100% ethanol was then carefully layered on top of the lysate, and 40ul of 3M sodium acetate was layered into the mixture. The effects were pretty immediate – I observed the usual shimmering as DNA started to precipitate out of the mixture, as well as an almost instantaneous precipitation of some white strands, much like the DNA pellets observed during the ethanol wash steps of previous genome extraction tests.

It’s difficult to see here – the arrow indicates a built-up layer of DNA precipitating out of the Deinococcus lysate

The goal here is simple – now that we have a very reliable lysis protocol, I want to see the difference between genome precipitation reactions based on room temperature isopropanol (cold isopropanol, which some people recommend, stands a very real chance of over-precipitating the salts in the lysate solution) and cold ethanol.

DNA is less soluble in isopropanol, so it’s possible to run a rather quick reaction with room temperature reagents, but isopropanol also has a habit of crashing out the salts in the solution. Ethanol precipitation is slower, but has a higher chance of allowing the background salt content to remain soluble, resulting in a cleaner DNA prep. Prolonged ethanol precipitation does seem to let smaller, spurious DNA fragments precipitate out of the solution as well, but if we have a relatively shear-free genome prep that shouldn’t be a problem in the first place.
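
The reagent math behind the two routes is simple – 1/1 volume for room temperature isopropanol and 1/10 volume of 3M sodium acetate (the ratios used in these notes), with roughly 2 volumes for cold ethanol (a common rule of thumb, not something fixed in these notes). A quick shell arithmetic sketch:

```shell
lysate=600                      # lysate volume in ul

iso=$lysate                     # 1/1 volume room temperature isopropanol
etoh=$(( lysate * 2 ))          # ~2 volumes cold ethanol (common rule of thumb)
naoac=$(( lysate / 10 ))        # 1/10 volume 3M sodium acetate

echo "isopropanol: ${iso}ul  ethanol: ${etoh}ul  NaOAc: ${naoac}ul"
# isopropanol: 600ul  ethanol: 1200ul  NaOAc: 60ul
```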

The precipitation will continue at -20C for the next 24 hours, until 9PM November 3rd.

Deinococcus note November 1st

Continuing on with optimizing the Deinococcus genome extraction for long-read sequencing. While swapping out the Thermo kit lysis buffer for Sambrook bacterial lysis buffer greatly improved the secondary lysis step (Deinococcus absolutely requires double lysis – lysozyme to dissolve the outer envelope, then another lysis step to disrupt the bacterial membrane), it wasn’t quite enough to improve genome yield. Fortunately, some minor changes to the lysate harvesting method finally got me the result I was looking for (granted, still not as good as what we extracted from D. radiophilus – could it really be down to the difference in buffers used?).

Here’s how the Deinococcus genome extract looks as of the 13th run, on October 31st:

Result from the latest extraction along with some tests – from left to right: NEB 1kb plus, 11th run lysate test, 12th run supernatant precipitate 1, 12th run supernatant precipitate 2, 13th run sample 1, 13th run sample 2, NEB 1kb extend

The samples of interest are in the third- and second-to-last lanes on the gel above. Nanodrop readings range from 110ng/ul to 320ng/ul depending on the run, which is similar to how our previous high molecular weight DNA solutions behaved. I’m somewhat impressed by the absolute lack of visible shearing on the gel for both samples – the entire extraction was carried out using regular pipette tips (we normally use wide-bore tips for sensitive experiments, but couldn’t really afford to this time), and I really wasn’t particularly careful with pipetting and handling of the liquid sample, given the prototype nature of this genome extract.

Another gel picture comparing the 26th genome extraction run (3rd in series) to today’s extraction (13th in series):

From left to right, NEB 1kb extend, 13th run sample 1, 13th run sample 2, 3rd run sample 1, 3rd run sample 2

DNA concentrations for the two right-most samples (3rd run samples 1 and 2) are still in the 110ng/ul and 98ng/ul ranges, so the somewhat faded looking bands for the two have more to do with how bright the 13th run extracts are in comparison.

The improved high molecular weight genome extraction protocol for Deinococcus (essentially just using detergents and TE buffer) is as follows:

> Preheat 37C incubation (PCR machine) and 60C incubation baths

> Spin down 1ml of 2-day room temperature culture at max speed for 5 minutes (~14k rpm)

> Completely remove supernatant 

> Resuspend with 100ul of resuspension buffer (TE buffer with RNase) and 50ul of lysozyme (from 50mg/ml stock)

> Vortex and mix vigorously

> Transfer to labeled 0.2ml PCR tubes

> Freeze at -20C for 60 minutes
* It's possible to make multiple 0.2ml aliquots here and store them at -20C for future experiments 

> Incubate at 37C for 15 minutes

> Vortex the tubes vigorously for 5 seconds

> Incubate at 37C for 15 minutes

> Label fresh 1.5ml tubes and add 490ul Sambrook lysis buffer 

> Transfer bacterial sample to the new 1.5ml tubes containing lysis buffer

> Add 10ul of proteinase K to each tube, pipette to mix

> Vortex for 5 seconds

> Incubate lysate at 60C for 1 hour - insert tubes at angle and do not disturb

> Cool the tubes for 1 minute - take care not to disturb the sample, the lysate should be settled and layered

> Carefully transfer supernatant to freshly labeled 1.5ml tubes, throw out the old tubes

> Add 600ul absolute isopropanol (1/1 volume) and incubate for 1 minute at room temperature

> Add 60ul 3M Sodium Acetate (pH 5.50) (1/10 volume) 

> Cap and invert the tube carefully 10 times (DNA should start precipitating at this time)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant - semi opaque DNA pellet should be visible at the bottom of the tubes 

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant 

> Add 1000ul of 70% EtOH and pipette up and down to carefully wash the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant completely

> Dry the pellet completely - timing will vary depending on location

> Add 200ul preheated (~60C) elution buffer or DEPC-treated H2O directly onto the pellet

> Pipette gently to mix the pellet with elution solution

> Incubate at 60C for 30 minutes 

> Incubate at 4C overnight if the pellets are not dissolved, or are too gelatinous 
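The precipitation step above follows simple volume ratios (1/1 volume absolute isopropanol, 1/10 volume 3M sodium acetate), so scaling it to a different lysate volume is just arithmetic. A minimal sketch of that calculation – the function name `precipitation_volumes` is my own illustration, not part of any protocol software:

```python
def precipitation_volumes(lysate_ul):
    """Reagent volumes for the isopropanol precipitation step.

    Ratios from the protocol above: 1/1 volume absolute isopropanol
    and 1/10 volume 3M sodium acetate (pH 5.50).
    """
    return {
        "isopropanol_ul": lysate_ul,             # 1/1 volume
        "sodium_acetate_3M_ul": lysate_ul / 10,  # 1/10 volume
    }

# The protocol treats the lysate as ~600ul (100ul resuspension
# + 50ul lysozyme + 490ul lysis buffer + 10ul proteinase K,
# less pipetting losses):
print(precipitation_volumes(600))
# → {'isopropanol_ul': 600, 'sodium_acetate_3M_ul': 60.0}
```

The same ratios apply to the second protocol iteration further down, since the lysate volume is unchanged.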

I purposefully incubated the lysate with room temperature isopropanol here – since DNA is less soluble in isopropanol and will crash out and pelletize anyway, incubating the sample for too long in a colder solution might also precipitate out all the salts present in the lysate. It would be interesting to experiment with ethanol-based precipitation over 24 hours at -20C, however. I think it might result in a somewhat noisier (more short DNA fragments precipitating out and pelletizing) but cleaner (salts are more likely to stay dissolved in an ethanol solution) extraction overall, which might be safer to load onto our Nanopore flowcell.

Another important note is that the lysate is no longer spun down before the supernatant transfer. Simply allowing the lysate to settle via gravity and carefully skimming off the top layer for the precipitation steps dramatically increases the yield and efficiency of the process. It's certainly not foolproof – the lysed cell layer at the bottom of the sample tube isn't fixed in place and can be disturbed by jostling the sample about. However, DNA is already precipitating out of the lysate at this stage, and spinning the sample down as in a conventional precipitation protocol will trap most of the high molecular weight DNA in the pellet along with the cellular detritus.

Next experiments will revolve around 24 hour ethanol precipitation of the lysates and a double spin-down of the initial cell mass, to see if the current protocol can handle an increased cellular load (I think it can, based on how efficient the digest and lysis processes look).

Deinococcus note October 30th

The surprisingly eventful high molecular weight genome extraction tests for the new Deinococcus species continue. All this is a bit of a pain point inherent to self-funded amateur research, since purchasing specialty kits for different lab processes needs to be avoided whenever possible, both for the sake of knowledge and due to financial constraints.

I’m currently at the 10th iteration of the initial genome extraction protocol, based on our experiments with Halobacterium and Deinococcus radiophilus.

> Preheat 37C incubation (PCR machine) and 60C incubation baths

> Spin down 1ml of 2-day room temperature culture at max speed (~14k rpm) for 5 minutes

> Completely remove supernatant 

> Resuspend with 100ul of resuspension buffer (from miniprep kit, containing RNase) and 50ul of lysozyme (from 50mg/ml stock)

> Vortex and mix vigorously

> Transfer to labeled 0.2ml PCR tubes

> Freeze at -20C for 60 minutes

> Incubate at 37C for 30 minutes

> Label fresh 1.5ml tubes and add 490ul lysis buffer (Sambrook protocol) and 10ul proteinase K

> Transfer bacterial sample to the new 1.5ml tubes

> Vortex and mix

> Incubate lysate at 60C for 1 hour - insert tubes at angle

> Invert lysate 10 times and freeze for 1 minute

> Thaw and equilibrate at room temperature - lysate should be clear pink/red 

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Transfer supernatant to freshly labeled 1.5ml tubes, throw out the old tubes

> Add 600ul absolute isopropanol (1/1 volume) and incubate for 1 minute at room temperature

> Add 60ul 3M Sodium Acetate (pH 5.50) (1/10 volume) 

> Cap and invert the tube carefully 10 times (DNA should start precipitating at this time)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant - semi opaque DNA pellet might be visible at the bottom of the tube, but sometimes they are too small/translucent 

> Add 1000ul of 70% EtOH and pipette up and down carefully - do not dislodge the pellet (setting the pipette to 900ul for washing part helps)

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant

> Add 1000ul of 70% EtOH and pipette up and down carefully - do not dislodge the pellet

> Spin down the sample at max speed (~14k rpm) for 5 minutes

> Carefully remove supernatant fully

> Dry the pellet for 1~3 minutes at 37C - do not overdry

> Add 200ul preheated (~60C) elution buffer or DEPC-treated H2O directly onto the pellet

> Incubate at 60C for 30 minutes 

> Incubate at 4C overnight if the pellets are not dissolved, or are too gelatinous 

One caveat is that I’ve compressed the beginning of the protocol somewhat by preparing a batch of 1ml Deinococcus culture pellets resuspended in 100ul resuspension buffer + 50ul lysozyme and storing them at -20C. So most of my later runs have started with the 30 minute 37C incubation step.

It also turned out that the lysis buffer used for our last extractions did not contain much SDS/Triton at all – I swapped it out for a lysis buffer based on the Sambrook protocol below, usually used for protein extraction from E. coli (the lysis buffer is only responsible for breaking open cells already digested with lysozyme in the previous step).

Tris pH 8.0    10mM
KCl            50mM
EDTA pH 8.0    1mM
Triton-X-100   1.00%
IGEPAL-CA-630  0.50%
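Making up this buffer from common stock solutions is a straightforward C1V1 = C2V2 exercise. A quick sketch of the arithmetic – the stock concentrations below (1M Tris, 1M KCl, 0.5M EDTA, 10% detergent stocks) are my assumptions for illustration, not something specified in the notebook:

```python
def stock_volume_ml(final_conc, stock_conc, final_vol_ml):
    """C1*V1 = C2*V2, solved for the volume of stock to add (V1)."""
    return final_conc * final_vol_ml / stock_conc

# Assumed stock concentrations (not from the notebook): 1M Tris pH 8.0,
# 1M KCl, 0.5M EDTA pH 8.0, 10% Triton X-100, 10% IGEPAL CA-630.
final_vol = 50.0  # ml of lysis buffer to prepare
recipe = {
    "1M Tris pH 8.0":    stock_volume_ml(0.010, 1.0, final_vol),  # 10mM
    "1M KCl":            stock_volume_ml(0.050, 1.0, final_vol),  # 50mM
    "0.5M EDTA pH 8.0":  stock_volume_ml(0.001, 0.5, final_vol),  # 1mM
    "10% Triton X-100":  stock_volume_ml(1.00, 10.0, final_vol),  # 1.00%
    "10% IGEPAL CA-630": stock_volume_ml(0.50, 10.0, final_vol),  # 0.50%
}
for stock, ml in recipe.items():
    print(f"{stock}: {ml:.2f} ml")
print(f"H2O to {final_vol:.0f} ml: {final_vol - sum(recipe.values()):.2f} ml")
```

For 50 ml that works out to 0.5 ml Tris, 2.5 ml KCl, 0.1 ml EDTA, 5 ml Triton and 2.5 ml IGEPAL stocks, topped up with 39.4 ml of water.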

The increased lysozyme digest time and the lysis buffer swap resulted in pretty immediate improvement – here’s the latest DNA pellet, which I’m dissolving for gel verification and Nanodrop quantification tomorrow.