Halobacterium high molecular weight genome extraction test 4

The Halobacterium genome extraction optimization for Nanopore Flongle sequencing continues. After a couple of weeks of some strange hectic schedules, I was finally able to sit down for the last couple of days to really think about the protocol at hand.

The 82nd culture of Halobacterium was used a little earlier than I normally process them – these cultures were two days old at the time of extraction. The washing process was improved in subtle ways that are difficult to describe as part of the previous protocol – the gist is, I used the pipettor directly on top of the flattened DNA pellet as if it were a filter, instead of moving the whole pellet around. Resuspension in this context really meant pipetting ~1ml of 70% EtOH back and forth directly through the Halobacterium pellet. It’s relatively easy to check the washing progress with Halobacterium cultures using this method – each washing run gradually gets rid of the carotenoid tint usually present in the samples. By the end of the third washing stage, the pellets are bone white (when hydrated).

The left gel picture is the latest genome extraction, the right is the genome extraction described in the last lab note entry. I think the pictures above speak for the improvements made so far – shearing and lower band signals have gone down significantly. It’s far easier to resolve the presence of more intact individual bands with the latest genome extraction.

I should also note – the DNA from the latest genome extractions is still not fully dissolved despite a couple of hours of shaking and incubation. The DNA concentration in today’s samples is expected to be far higher than the one presented here right now (August 25th, 2021, 11:49PM). I’ll update with an edit sometime tomorrow once the DNA fully dissolves.

The key takeaway from the new process is less shearing of the genome during the extraction process, and cleaner samples by the way of 260/280 and 260/230 reads. I was particularly worried about our Halobacterium genome extract being a little too dirty last time. Luckily it looks like the improved washing method actually works.

The DNA taking a while to dissolve fully is rather annoying, but I doubt this is indicative of any faulty steps during the extraction. Our Deinococcus radiophilus genome extraction went through the same issue – in fact I remember the raw sample being fully usable for sequencing (not quite as viscous) only after a full night’s incubation in the fridge. We’ll find out later tomorrow.

All in all, promising developments – if the current genome preparation seems good enough, we might be sequencing the Halobacterium by late tomorrow evening. Finally, confirmation of a project a year and a half in the making!

Halobacterium high molecular weight genome extraction test 3

A quick update on the third high molecular weight extraction test on our Halobacterium – the culture used for this test is our ~80th subculture. With this we’ve been running continuous acidic media subculture of Halobacterium mutant strain for 551 days, or a little over a year and a half. I can’t wait to wrap up the genome extraction tests and finally sequence this mystery microbe. Could it really have obtained a unique, stable genomic feature that can set itself apart from its peers during the past year and a half?

The third test is a minor spin on the protocol used for the last extraction test – I wasn’t really happy with the 260/230 reads and wanted to see how the sample would fare with a proper lysis step using SDS and EDTA. An ongoing concern for this whole experiment since the beginning has been to create a process that could be replicated with the minimum possible expenditure and infrastructure requirements – hence the earlier attempt at simply using distilled water to lyse the Halobacterium and precipitating out the DNA. Alas, the output from the initial experiment felt a little too ‘dirty’ despite the decent DNA concentration and somewhat decent molecular integrity.

I decided to use a widely available lysis buffer mix for this particular attempt – namely, Edward’s buffer. It’s a lysis buffer normally used for plant genome extraction (we do a lot of plant side projects around here) and contains both EDTA and SDS. The minor NaCl content was a little worrying due to the already high salt concentration in the Halobacterium media, but I’ve been able to clean the extracted DNA samples down to ~1ppt NaCl concentration in previous runs, so maybe there’s not enough reason for concern. And the idea of keeping one pseudo-lysis buffer in the lab capable of producing at least an acceptable level of results with multiple bacteria and plant samples is an intensely attractive one.

Edward's buffer recipe for 100ml 

Reagent   Final concentration   Amount for 100ml
Tris      200mM                 3.15g
EDTA      25mM                  0.93g
NaCl      250mM                 1.46g
SDS       0.5% (w/v)            0.5g
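As a sanity check on the recipe above, here is a minimal sketch of the molarity-to-mass arithmetic (mass = molarity × formula weight × volume). The formula weights are assumptions back-calculated from the listed masses: Tris-HCl at 157.60 g/mol, disodium EDTA dihydrate at 372.24 g/mol and NaCl at 58.44 g/mol. If your reagents come in a different salt form, the numbers will shift accordingly. `grams_for` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Mass (g) needed for a target molarity:
#   grams = (mM / 1000) * formula weight (g/mol) * (volume in ml / 1000)
# Formula weights passed in below are ASSUMED salt forms, not taken from the post.
grams_for() {   # usage: grams_for <mM> <formula weight> <volume in ml>
    LC_ALL=C awk -v mm="$1" -v fw="$2" -v ml="$3" \
        'BEGIN { printf "%.2f\n", mm / 1000 * fw * ml / 1000 }'
}

echo "Tris  $(grams_for 200 157.60 100) g"   # assumed Tris-HCl
echo "EDTA  $(grams_for 25 372.24 100) g"    # assumed disodium EDTA dihydrate
echo "NaCl  $(grams_for 250 58.44 100) g"
```

SDS is given as a weight/volume percentage rather than a molarity, so 0.5% (w/v) in 100ml is simply 0.5g.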

The third run involved resuspension with 100ul of Edward’s buffer instead of dH2O, a streamlined heating/incubation cycle on the PCR machine side to decrease the sample’s exposure to heating and ramping times, the addition of one more 70% EtOH washing step at the end, and decreasing or getting rid of miscellaneous vortexing or incubation times whenever possible.

>Fresh sample preparation 

>Spin down 1ml of the sample for 1 minute at max speed 

>Decant the supernatant 

>Resuspend the sample with 100ul Edward's buffer

>Transfer suspension to PCR tubes, watching out for bubble formation

>Pre-melt NEBuffer3 10x if necessary

>Add 5ul of RNase If and 10ul of NEBuffer3 10x

>Vortex for 1~5 seconds

>Incubate at 37C for 15 minutes

>Add 2ul of Proteinase K

>Vortex for 1 second 

>Incubate at 55C for 1 hour 

>Heat inactivate proteinase K at 95C for 10 min

>Transfer to eppendorf tubes

>Add 10% 3M Sodium Acetate first, 1/1 volume 100% isopropanol second

>Invert 10x to mix

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Spin down at maximum speed for 5 minutes

>Decant and completely dry at 37C for 15 minutes or longer at room temperature

>Add 200ul of sterile dH2O

>Shake and incubate the sample tubes at 37C for 20 minutes or longer if necessary
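For the precipitation step in the list above, the reagent volumes depend on how much liquid is in the tube by that point. A minimal sketch of the arithmetic, assuming the sample volume is the sum of the additions so far (100ul buffer + 5ul RNase If + 10ul NEBuffer3 + 2ul Proteinase K), that "10%" refers to that sample volume, and that "1/1 volume" of isopropanol means the sample plus the sodium acetate. That last reading is an assumption; `precip_volumes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Volumes for the sodium acetate / isopropanol precipitation step.
# ASSUMPTIONS: NaOAc is 10% of the sample volume; isopropanol matches
# the combined volume of sample + NaOAc.
precip_volumes() {  # usage: precip_volumes <sample volume in ul>
    LC_ALL=C awk -v v="$1" 'BEGIN {
        naoac = v * 0.10
        printf "NaOAc %.1f ul, isopropanol %.1f ul\n", naoac, v + naoac
    }'
}

# 100ul buffer + 5ul RNase If + 10ul NEBuffer3 + 2ul Proteinase K
precip_volumes $((100 + 5 + 10 + 2))
```

Evaporation during the 95C step will reduce the real volume somewhat, so treat the result as an upper bound.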

The steps aren’t radically different compared to the last time, but I was definitely able to observe a different physical consistency of the sample starting with the post-proteinase K and RNase treatment. It seems the cellular pellet takes on a more solid, non-sticky gel-like consistency up until the EtOH washing step, and the presence of carotenoid begins to visibly decrease with each washing step, disappearing altogether by the end of the third washing step.

From left to right, NEB 1KB plus ladder, Mutant 1, Mutant 2, Control 1, Control 2, NEB 1KB extended ladder

This is still more shearing than I’d like to see (in our Deinococcus genome extractions we practically have no bands/smears below the largest band on the NEB 1KB extended ladder – no mean feat!), but it’s quite likely workable with good post sequencing quality control of the raw reads.

Nanodrop reads of the new samples are intriguing – it looks like I was able to get a whopping 1009.62ng/ul of DNA on the most concentrated sample, but the 260/280 and 260/230 reads seem to more or less remain the same as the last sample, with the sample containing the highest amount of DNA faring worst. Could they be related? I know extremely high concentration DNA can throw off Nanodrop reads like this, but I doubt this particular sample qualifies as such.

The third extraction test suggests that using some sort of lysis helper is certainly better than just using water, with increased yield overall. I’m not too sure the addition translates to cleaner samples, however. If we do decide to go for sequencing with the current samples though, it’s quite likely to be the M1 and R1 extractions, at DNA concentrations of 831.58ng/ul and 647.77ng/ul, respectively.

Halobacterium high molecular weight genome extraction test 2

The last Halobacterium genome extraction test (for ONT Flongle sequencing) happened without using any real reagents like proteinase K and RNase – both to see how the skeleton protocol fares with the culture (which is a pretty standard way we do things around here) and because we were completely out of said reagents at the time. The results from the initial test were very promising.

The two reagents finally arrived a couple of days ago, so let’s get to testing out what might be the final protocol for extracting Halobacterium’s genomic DNA while maintaining its physical integrity as much as possible.

I used four samples for today’s test run, two from each of Halobacterium mutant and Halobacterium NRC-1 control strain, all of them inoculated on July 31st (this test was performed on August 5th, and I’m writing this post on the night of 5th – early morning of 6th). This isn’t the youngest culture I could have used, but I find 4~5 days turnaround for Halobacterium cultures to be perfectly suited to the condition of the lab and the strains we’re using. Each of the samples from here on will be referred to as M, M1 for the mutant strain cultures, and R, R1 for the NRC-1 control cultures.

Halobacterium cultures

>Spin down 1ml of the sample for 1 minute at max speed

>Decant the supernatant

Note: The high salt content of the media is something to keep in mind through the processes – we need to decant and wash thoroughly whenever we’re given the opportunity, but not so much that it could shear the DNA…

Halobacterium culture after the initial spin down

>Resuspend the sample with 100ul of sterile ddH2O

>Transfer the resuspended sample into labeled PCR tubes  

The resuspension step here could be tricky – Halobacterium will immediately turn thick and sticky on contact with ddH2O. Be careful not to introduce bubbles into the culture during pipetting. It turns out bubbles can shear DNA as well! Attaching images of how the resuspended cells might look, and what you shouldn’t do at this step.

>Pre-melt NEBuffer3 for RNAse treatment step

>Add 5ul of RNase If and 10ul of NEBuffer3 to the PCR tubes containing cell suspensions

>Vortex for 5 seconds

>Incubate at 37C for 15 minutes

>Add 2ul of Proteinase K

>Vortex for 1 second

>Incubate at 55C for 1 hour

>Heat inactivate at 95C for 10 minutes

>Transfer treated samples to pre-labeled eppendorf tubes (1.5ml) for further processing

I should note that there was a pause of 55 minutes between the resuspension step and the RNase/proteinase treatment step due to an unrelated emergency. The samples were kept at 4C during that time. The heating and incubation steps took place in a PCR machine – but if you’re using a heat bath the sample transfer step won’t be necessary. However, I might still recommend using a fresh tube during this or a later EtOH washing step, if only to prevent carrying over the ridiculous amount of salt that was present in the Halobacterium media.

Samples immediately after RNase, proteinase K treatment – they’ve homogenized nicely, and the bubbles disappeared

>Add 10% 3M NaOAc (Sodium Acetate) first, and then add 1:1 volume of 100% isopropanol

>Invert 10x to mix

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Spin down at maximum speed for 5 minutes

>Decant completely 

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Decant and completely dry ethanol

The resuspension of the pellet might be a little tricky for people who’ve not worked with this type of protocol before. The pellet will likely act like a little flake and will not dissolve, and neither should it. The idea is to use ethanol to wash the flake without damaging it or losing it. The drying step in our case took place in a 37C incubator on a clean paper towel, over a period of one hour.

>Add 200ul of 60C sterile ddH2O to each tube

>Incubate the sample at 37C for 5~20 minutes, until the pellet dissolves

Dissolved pellets after 30 minute incubation at 37C – your mileage may vary on the incubation time needed to dissolve your sample

And here are the Nanodrop read and gel picture of the genomic DNA samples made today.

There’s been a lot more shearing than I expected – I think one of the incubation steps could be too long, and the bubble formation at the beginning of the protocol might have really hurt the integrity of the HMW DNA. However, the curves for the samples themselves look quite good – far better than the output from the previous test run. For comparison, here are the DNA concentrations for each of the samples from the previous run and this run.

August 5th genome extraction run

M- 457.3 ng/ul
M1- 712.3 ng/ul 
R- 559.5 ng/ul 
R1- 545 ng/ul 

July 23rd genome extraction run

Newer Halobacterium NRC-1 control - 156.8 ng/ul
Older Halobacterium NRC-1 control - 53.9 ng/ul
Newer Halobacterium acidotolerant mutant - 508.6 ng/ul
Older Halobacterium acidotolerant mutant - 264.1 ng/ul

Granted, the numbers above shouldn’t be taken as absolutely representative of the samples due to the highly concentrated nature of the genomic DNA. Still, the numbers and the curves look quite nice. I’m wondering if I could use samples M1 and R for the Flongle sequencing run without going through further iterations?

As usual inputs and advice are appreciated!

Raw sequencing data archiving test II- with CoLoRd

EDIT: The author of the CoLoRd paper responded with clarification and some parts of the note were corrected with corresponding EDIT tag. His response is copied at the bottom of the post (also in the comments section way below). Thank you!

I recently found out about a new raw sequencing data compression and archiving tool called CoLoRd (reference Kokot M, Gudys A, Li H, Deorowicz S. CoLoRd: Compressing long reads. bioRxiv. 2021 Jan 1. And direct link to the github repository: https://github.com/refresh-bio/CoLoRd). The new tool is developed in collaboration with the inimitable Heng Li (http://www.liheng.org/), the mind behind SAMtools, htslib, BWA short aligner, minimap2, and many other tools that anyone interested in modern sequencing and genomics must have utilized at least once in their career. What a world, for a mere warehouse worker to be able to find out about tools like this in pre-print stages and test them out right away.

From a casual read of the preprint, CoLoRd aims to differentiate itself from other general purpose compression tools used in bioinformatics today by treating the compression object specifically as biological data in the form of fastq raw reads. The tool processes raw fastq data and behaves somewhat like a sloppy mapper, utilizing overlap graphs (I wonder if code from minimap2 went into this part) and then using them as ‘compression anchors’ to arrive at a smaller output than one might get with a general purpose compression algorithm. I should note that cluster based compression of raw sequencing data has been in research/use for quite a while now, as described in the pre-print itself, but CoLoRd is the only one of its kind that utilizes overlap graphs and is designed exclusively for compression of long-read sequencing data.

Since I made a note on raw sequencing data compression on consumer hardware before (https://naturepoker.wordpress.com/2021/05/17/frugal-bioinformatics-genome-archiving-bench-test/), I thought it would be fun to see how CoLoRd compares against other options on relatively anemic hardware.

The test compression sets (using our 4.7GB Deinococcus radiophilus raw reads) were run across two different machines this time – first is my personal laptop, a Thinkpad X201 (released in 2010) with 8GB of ram and 1st generation core i5 M540 processor with four threads. The second is our lab workstation with 128GB of ram and Xeon CPU E5-2670 with 32 threads. Test script was as follows:

#!/usr/bin/env bash

echo "Starting gzip compression"
time gzip gzip/deino_reads

echo "Starting gzip decompression"
time gzip -dk gzip/deino_reads.gz

echo "Starting zstd dictionary compression"
time ./zstd --rm -D 3_deino_sra_dictionary zstd-d/deino_reads

echo "Starting zstd dictionary decompression"
time ./zstd -D 3_deino_sra_dictionary --decompress zstd-d/deino_reads.zst

echo "Starting zstd compression"
time ./zstd --rm zstd-test/deino_reads

echo "Starting zstd decompression"
time ./zstd -d zstd-test/deino_reads.zst

echo "Starting 7zip compression"
time 7z a deino_reads.7z 7z/deino_reads

echo "Starting 7zip decompression"
time 7z e deino_reads.7z
rm deino_reads

echo "Starting colord ONT compression"
time ./colord compress-ont colord-test/deino_reads colord-test/deino_archive

The output from the Thinkpad X201:

Starting gzip compression

real    7m57.347s
user    7m48.162s
sys     0m8.784s
deino_reads 2.3gb

Starting gzip decompression

real    0m52.766s
user    0m45.284s
sys     0m5.900s

Starting zstd dictionary compression
zstd-d/deino_reads   : 48.43%   (  4.62 GiB =>   2.24 GiB, zstd-d/deino_reads.zst) 

real    1m37.813s
user    1m25.838s
sys     0m8.090s
deino_reads 2.3gb

Starting zstd dictionary decompression
zstd-d/deino_reads.zst: 4960501078 bytes                                       

real    0m49.899s
user    0m14.173s
sys     0m6.789s

Starting zstd compression
zstd-test/deino_reads : 48.43%   (  4.62 GiB =>   2.24 GiB, zstd-test/deino_reads.zst) 

real    1m52.149s
user    1m24.952s
sys     0m8.227s
deino_reads 2.3gb

Starting zstd decompression
zstd-test/deino_reads.zst: 4960501078 bytes                                    

real    0m50.021s
user    0m14.618s
sys     0m6.644s

Starting 7zip compression

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5 CPU       M 540  @ 2.53GHz (20652),ASM,AES-NI)

Scanning the drive:
1 file, 4960501078 bytes (4731 MiB)

Creating archive: deino_reads.7z

Items to compress: 1

Files read from disk: 1
Archive size: 2090631247 bytes (1994 MiB)
Everything is Ok

real    60m17.810s
user    172m22.098s
sys     1m11.012s
deino_reads 2.0gb

Starting 7zip decompression

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5 CPU       M 540  @ 2.53GHz (20652),ASM,AES-NI)

Scanning the drive for archives:
1 file, 2090631247 bytes (1994 MiB)

Extracting archive: deino_reads.7z
Path = deino_reads.7z
Type = 7z
Physical Size = 2090631247
Headers Size = 138
Method = LZMA2:24
Solid = -
Blocks = 1

Everything is Ok     

Size:       4960501078
Compressed: 2090631247

real    2m59.132s
user    2m44.109s
sys     0m6.881s

Starting colord ONT compression
time ./colord compress-ont colord-test/deino_reads colord-test/deino_archive
Counting k-mers.
Stage 1: 100%
Stage 2: 100%
Filtering k-mers.
Running compression.
DNA size        : 407936879
Quality size    : 380225121
Header size     : 5579892
Meta size       : 53
Info size       : 127
Total time      : 1869.7s

real    31m9.876s
user    101m16.811s
sys     0m34.989s
deino_reads 757mb

And the output from our workstation with Xeon processor:

Starting gzip compression

real    8m47.465s
user    8m42.241s
sys     0m5.000s
deino_reads 2.3gb

Starting gzip decompression

real    0m55.767s
user    0m50.810s
sys     0m4.947s

Starting zstd dictionary compression
zstd-d/deino_reads   : 48.43%   (4960501078 => 2402284374 bytes, zstd-d/deino_reads.zst) 

real    1m14.469s
user    1m14.618s
sys     0m4.788s
deino_reads 2.3gb

Starting zstd dictionary decompression
zstd-d/deino_reads.zst: 4960501078 bytes                                       

real    0m17.928s
user    0m12.891s
sys     0m5.035s

Starting zstd compression
zstd-test/deino_reads : 48.43%   (4960501078 => 2402284037 bytes, zstd-test/deino_reads.zst) 

real    1m13.985s
user    1m14.303s
sys     0m4.190s
deino_reads 2.3gb

Starting zstd decompression
zstd-test/deino_reads.zst: 4960501078 bytes                                    

real    0m17.841s
user    0m12.848s
sys     0m4.992s

Starting 7zip compression

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,32 CPUs Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (206D7),ASM,AES-NI)

Scanning the drive:
1 file, 4960501078 bytes (4731 MiB)

Creating archive: deino_reads.7z

Items to compress: 1

Files read from disk: 1
Archive size: 2090631247 bytes (1994 MiB)
Everything is Ok

real    6m42.321s
user    139m25.023s
sys     1m22.728s
deino_reads 2.0gb

Starting 7zip decompression

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,32 CPUs Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (206D7),ASM,AES-NI)

Scanning the drive for archives:
1 file, 2090631247 bytes (1994 MiB)

Extracting archive: deino_reads.7z
Path = deino_reads.7z
Type = 7z
Physical Size = 2090631247
Headers Size = 138
Method = LZMA2:24
Solid = -
Blocks = 1

Everything is Ok     

Size:       4960501078
Compressed: 2090631247

real    2m32.988s
user    2m28.492s
sys     0m4.376s

Starting colord ONT compression
time ./colord compress-ont colord-test/deino_reads colord-test/deino_archive
Counting k-mers.
Stage 1: 100%
Stage 2: 100%
Filtering k-mers.
Running compression.
DNA size        : 407936880
Quality size    : 380225120
Header size     : 5579891
Meta size       : 53
Info size       : 127
Total time      : 249.285s

real    4m9.651s
user    88m50.677s
sys     1m4.218s
deino_reads 757mb

Some caveats to note on the results – 7zip by default uses all available cores. Gzip is a single-threaded process; while there is a multithreaded alternative called pigz, I decided to stick with what people are most likely going to have on their machine (if you have a somewhat higher end laptop, I can see the gzip compression time falling to about 40%~ of what we have in the results above).
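One way to see the core usage directly in the logs is to divide the `user` (CPU) time by the `real` (wall clock) time: a value near 1 means the tool ran on a single core, and larger values roughly indicate how many cores were kept busy. Below is a minimal sketch that parses the `XmY.YYYs` format printed by bash’s `time`; `to_seconds` and `cpu_per_wall` are hypothetical helper names, and the ratio is only a rough indicator since it ignores `sys` time and I/O waits.

```shell
#!/usr/bin/env bash
# Convert bash `time` output such as 31m9.876s into seconds.
to_seconds() {  # usage: to_seconds <XmY.YYYs>
    LC_ALL=C awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds with trailing "s"
        sub(/s$/, "", p[2])
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# Ratio of CPU time to wall clock time: a rough "cores kept busy" figure.
cpu_per_wall() {  # usage: cpu_per_wall <real> <user>
    LC_ALL=C awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", u / r }'
}

# 7zip compression, real and user times copied from the two logs above
echo "7zip on the Thinkpad X201: $(cpu_per_wall 60m17.810s 172m22.098s)"
echo "7zip on the Xeon:          $(cpu_per_wall 6m42.321s 139m25.023s)"
```

The same two functions can be pointed at the gzip or CoLoRd timings to compare how each tool scales between the laptop and the workstation.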

I also ran two different zstd processes. Zstd-d is a brief test of their dictionary based compression and decompression function. I created a makeshift dictionary file from three Deinococcus based long-read fastq files (ours plus two from SRA – Deinococcus grandis and Deinococcus sp. D7000), by splitting them into 10MB pieces (there’s a minimum file number requirement for the training set). Alas, the results are nothing much to speak of. Maybe churning through all ONT long-read data and using them as training set will result in a noticeable difference.
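For anyone wanting to repeat the dictionary experiment, the preparation described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact commands used for the post: the function names and file names are hypothetical, and the pieces are cut by byte count, so reads will be split mid-record (which should be harmless for a training set). `zstd --train ... -o` is the zstd command line’s dictionary builder.

```shell
#!/usr/bin/env bash
# Split each input file into fixed-size pieces to use as training samples.
# CHUNK_BYTES defaults to 10 MiB, matching the 10MB pieces described above.
make_chunks() {  # usage: make_chunks <output dir> <fastq file>...
    local outdir=$1; shift
    mkdir -p "$outdir"
    local f
    for f in "$@"; do
        split -b "${CHUNK_BYTES:-10485760}" "$f" "$outdir/$(basename "$f").part_"
    done
}

# Build a dictionary from every piece in the directory.
train_dictionary() {  # usage: train_dictionary <chunk dir> <dictionary file>
    zstd --train "$1"/* -o "$2"
}

# Hypothetical usage (file names are placeholders):
#   make_chunks training_set deino_reads grandis_reads d7000_reads
#   train_dictionary training_set 3_deino_sra_dictionary
```

The resulting dictionary file is then passed to `zstd -D` for both compression and decompression, as in the test script.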

As you can see, zstd remains the compression time leader – about a minute to wrap everything up, compressing 4.7GB of data into a 2.3GB archive. 7zip does a little better in size, with an extra 13%~ compression in return for using all available cores on a machine and a significantly longer running time compared to most options. Zstd can achieve a similar size too by running it with the -19 argument (the highest level of compression without resorting to their ‘ultra’ argument), but the ridiculously increased running time (more so than 7zip) just isn’t a good trade-off in most scenarios.

Now, the most exciting part of the result is final compressed file size from CoLoRd- 4.7GB raw data to 757MB final archive, which is around a mere 16% of the original file size. While it should be noted that the default behavior of CoLoRd is to utilize all available cores on the machine with no option to control the number of threads on the part of the user (EDIT: this part is incorrect – thread control can be turned on with -t argument, as pointed out by the first author of the paper. His additional notes and clarifications are copied at the bottom of this post for future reference), the ridiculous 84% decrease in file size might make it fully worth it for many researchers out there.
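The percentages quoted here can be reproduced from the byte counts in the logs. A minimal sketch, with all figures copied from the output above; the CoLoRd total is taken as the sum of its reported stream sizes (DNA, quality, header, meta, info), which is an assumption about how the archive size is made up, and `percent_of` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Compressed size as a percentage of the original size.
percent_of() {  # usage: percent_of <compressed bytes> <original bytes>
    LC_ALL=C awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", c / o * 100 }'
}

original=4960501078
# Sum of the stream sizes reported by CoLoRd on the Thinkpad run
colord_total=$((407936879 + 380225121 + 5579892 + 53 + 127))

echo "zstd   $(percent_of 2402284037 "$original")% of original"
echo "7zip   $(percent_of 2090631247 "$original")% of original"
echo "colord $(percent_of "$colord_total" "$original")% of original"
```

The zstd line should agree with the 48.43% that zstd itself prints, which is a handy check that the arithmetic is set up correctly.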

I do need to point out one oddity that might matter to some researchers. When using utilities like zstd or gz, the decompressed output from one’s archive (.zst, .gz or otherwise) is expected to be an exact match of the original item. That is not the case for CoLoRd. (EDIT: The author of the paper responded to the post! Please refer to the copied section below for clarification).

When I compare the original deino_reads.fastq file with the decompressed output of a CoLoRd archive (created from the original deino_reads.fastq file, of course) using diff, say, with the script below:

diff deino_reads.fastq color_decompressed_reads.fastq | awk '{print $1}' | grep -c ">"
diff deino_reads.fastq color_decompressed_reads.fastq | awk '{print $1}' | grep -c "<"

The output is 273555- meaning 273555 different lines across both files. If the two .fastq files are perfect copies of one another, the output should be zero/none. From what I can see, CoLoRd does format the input fastq file so that the formatting of the decompressed output will eventually be different from the original, while preserving the information itself. The 273555 number is meaningful in this context since the number of individual fastq reads in a file is derived by:

echo $(cat deino_reads.fastq | wc -l )/4|bc

Simply meaning raw line count divided by 4 – for our fastq file, the number is – you guessed it – 273555 reads. Running the same command on the decompressed CoLoRd archive gives us the same number of reads as well.
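A finer-grained check is to compare each of the four lines of a fastq record separately, which shows where the two files actually diverge. According to the author’s reply copied at the bottom of this post, the DNA and header streams are always lossless and only the quality stream is lossy by default, so one would expect only line 4 of each record to differ. A minimal sketch, assuming strict four-line fastq records (no wrapped sequences); the function names and the toy files are hypothetical.

```shell
#!/usr/bin/env bash
# Print only the Nth line of every 4-line fastq record
# (1 = header, 2 = sequence, 3 = separator, 4 = quality).
fastq_stream() {  # usage: fastq_stream <line in record, 1..4> <fastq file>
    awk -v want="$1" '(NR - 1) % 4 + 1 == want' "$2"
}

# Compare one stream between two files via checksums.
same_stream() {  # usage: same_stream <line in record> <fastq a> <fastq b>
    if [ "$(fastq_stream "$1" "$2" | cksum)" = "$(fastq_stream "$1" "$3" | cksum)" ]; then
        echo "identical"
    else
        echo "differs"
    fi
}

# Toy demonstration: same header and sequence, different quality string
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/a.fastq"
printf '@r1\nACGT\n+\n5555\n' > "$tmp/b.fastq"
for i in 1 2 4; do
    echo "record line $i: $(same_stream "$i" "$tmp/a.fastq" "$tmp/b.fastq")"
done
rm -r "$tmp"
```

Run against the real files, this would be `same_stream 2 deino_reads.fastq color_decompressed_reads.fastq` for the sequences, and likewise with 1 and 4 for headers and quality scores.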

I really don’t think this is a big deal – the eventual assembly from a decompressed CoLoRd archive should be the same as the one you’d get with raw, never-compressed fastq file straight from a sequencer. However researchers looking at downstream processing using the raw fastq data itself might want to keep the difference in mind. For an amateur like me, I might just stick with more conventional compression tools for now… (EDIT: CoLoRd in fact does come with a lossless compression option even for the quality score section of the main data, as pointed out by the author of the paper below this post, anyone interested in using the tool should definitely take the time to check it out!)

#Example command for lossless compression of the fastq raw data
./colord compress-ont -q org test.fq o.3rd

On a side note, I also tried to compile the CoLoRd package for the Raspberry Pi 400 – a small platform I’m working to turn into a mobile bioinformatics terminal with an experimentalist bent. Alas, CoLoRd relies on some x86-64 specific header files for compilation, and replacing them for the ARM7/8 platform seems nontrivial. Minimap2 was the same way as well – I wonder what our future bioinformatics workflow will look like as servers and clusters slowly transition to ARM or any number of other alternate platforms?

EDIT 2021 July 30th:

The first author of the CoLoRd compression paper, Marek Kokot, kindly dropped by and clarified quite a few things regarding both the questions in the post and the function of the tool. I believe his response is going to be valuable to anyone who reads this post and is interested in learning more about CoLoRd. The whole body of his responses is copied below with the author’s permission.


great to see the tests of our compressor!
I would like to add some notes and maybe clarify a little.
In fact, the user may specify the number of threads during compression.
An example command line to use 12 threads would be:
./colord compress-ont -t 12 -q org test.fq o.3rd

I would also make a comment about the difference between the original and decompressed file. I think it is a really important matter.
In CoLoRd we split the fastq file into three streams: DNA, quality, header, hence in the output you may see the sizes of each of these streams (the remaining data, i.e. meta and info are always very small and presented for completeness, yet not important for a typical user).
The DNA and header streams are always compressed losslessly.
The header size is relatively small (especially in the case of long reads) so it does not affect the compression ratio very much and a simple approach is sufficient.
The fanciest part of the algorithm is DNA compression which utilizes a very simple (for efficiency) assembly-like approach.
Since we are able to find a lot of similarities using this approach we are able to compress DNA stream very efficiently.
The quality stream is a different story and your note about downstream processing is very important.
The problem with the quality scores is that those are really hard to compress lossless and very often they are not very informative, at least it seems that the number of possible values of quality scores may be reduced without affecting downstream analyses.
In the paper, we show how lossy quality score compression affects some typical downstream analyses (invaluable help from Heng Li in this matter). In short, the conclusion is that it does not affect them very much (detailed results in the paper).
Having this in mind we made lossy quality score compression the default mode, but if one really needs the original quality scores there is a parameter for this, command example:
./colord compress-ont -q org test.fq o.3rd
In fact, CoLoRd allows for a couple of quality scores compression modes, but we have chosen the one that we think (basing on experiment results) is most appropriate for most of the regular users.
There are also three general modes (or priorities): memory, balanced, ratio, which allow controlling the resource requirements and compression ratio.
We are totally aware that the number of internal parameters of CoLoRd is large and non of the regular users will ever try to understand them, and we think that defaults should just work well. We really hope that the defaults we have chosen will do.
I think the above description (longer than I intended) explains the differences you notices. They are just from the default lossy quality compression.

The most upsetting part of your post is the sentence “For an amateur like me, I might just stick with more conventional compression tools for now…”.
Out of curiosity, if CoLoRd would compress quality scores lossless by default, would your conclusion be any different?

I do really like your final note about ARM compatibility. In our team, we are aware of the issue and I do really hope that we will slowly adjust our tools to be ARM-compatible. By using x86-64 specific headers we are often able to offer a much better performance of the code than in the opposite case.
Yet, it is probably better to have a code that may run on ARM, even if it is somehow less performant, than have a code that does not run on ARM at all.

Thanks for checking our compressor and posting your results!
Thank you for a quick response.

Thanks for the suggestion about -h. In fact, there is are two levels of help, the main one, that you have shown and the one deeper, dedicated to each subcommand, so for example you can do:
./colord compress-ont -h
it will show quite a big number of options, one of them is -t.
The reason for introducing two levels is that the options available depends on the command (e.g. compress-, decompress).
Setting the number of threads for decompression is not possible and decompression always uses a couple of threads (4 or 5 as far as I remember).
Maybe we should add in the general help information, that each subcommand has its help…

Well, I am not too upset and I totally understand your point of view. Also, I would like to thank you for the kind words.
This is obviously true that chances for any issues with well-known compressors are much less likely than in the case of new ones. Yet, I also feel like it is my obligation to try to convince people to use our software, especially if it offers much better compression ratios. The more users will use the software the better it should become because some bugs may be found and reported, although I hope that we tested the software well and on a wide range of inputs, so there (hopefully) will be not many bugs.

You may compress losslessly, just use “-q org” or “--qual org”, example:
./colord compress-ont -q org test.fq o.3rd
If you are not too busy, I would suggest you extend your experiments to include a lossless variant. The compression ratio will probably be worse, yet after decompression you will have a file identical to the one you had before compression.
Maybe we should expose in help the variants of quality compression modes more clearly…
I am not sure what the coverage of your dataset is; in general, the higher the coverage, the better the compression ratio CoLoRd should provide.
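One way to check the lossless claim independently is to hash the FASTQ before compression and again after a compress/decompress round trip. The snippet below is a minimal Python sketch of that comparison. It does not call colord itself, and the file name restored.fq in the usage note is a hypothetical name for the decompressed output.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless_round_trip(original_path, restored_path):
    """True when the two files are byte-for-byte identical (same digest)."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Calling is_lossless_round_trip("test.fq", "restored.fq") after decompressing should return True if the round trip really preserved every byte, and False for a lossy quality mode.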

I don’t mind you copy-pasting my response. In fact, I would be grateful if you do! (the same for any part of this post if you are interested).

Thanks again!

Halobacterium high molecular weight genome extraction test 1

The Halobacterium flongle sequencing date is approaching rapidly – I decided to take the week to clean up our in-house high-molecular weight DNA extraction protocol, adapt it for the S-layer membrane of Halobacterium (for example, the extensive freeze-thaw process we used for Deinococci and gram-negative microbes would not be necessary here) and test out the process against our Halobacterium strains.

The RNase and proteinase steps were omitted in the test run – alas, we’re a small lab and reagents need to be saved whenever they can (we take donations, folks!). The goal here was to get a feel for the physical resilience of the Halobacterium microbial pellets and their genome during the extraction conditions – a wildly important characteristic to know when preparing long-read sequencing experiments, along with looking at the amount of DNA we can expect from an extraction run.

I also wanted to confirm a previous recurring observation on the age of the Halobacterium culture correlating with the DNA yield/integrity, to a degree I’ve never seen with other gram-negatives and Deinococcus. So this genome extraction test was run on four separate samples: the Halobacterium NRC-1 control and Halobacterium acidotolerant mutant from June 27th 2021 (Or and Om), and the Halobacterium NRC-1 control and Halobacterium acidotolerant mutant from July 18th (Nr and Nm). The older cultures were grown to saturation in a 37C shaking incubator and were left out on a well-lit windowsill for the past 27 days, during which they’ve taken on a milky-yellow tone (a tell-tale sign of a dying culture – in an oxygenated environment Halobacterium cultures rapidly lose their carotenoid – as demonstrated by the other control sample of sealed Halobacterium cultures taken out on the same date and left at the same location maintaining their pink hue and surface cell cover). The newer cultures were inoculated on July 18th and grown in a shaking 37C incubator for the past four days, and show the characteristic reddish hue of Halobacterium.

The general process of the test extraction was as follows:

>Spin down 1ml of the sample for 1 minute at max speed 

>Decant the supernatant 
Right after suspension with dH2O. Simply adding water is enough to lyse Halobacteria, but without proteinase the mixture is extremely gelatinous and difficult to work with. Above, Nm sample. Below, Om sample.
>Resuspend the samples with 100ul sterile dH2O
Nm sample with sodium acetate and isopropanol. Clear layer forms above the sample – they need to be mixed together without potentially damaging the DNA strand
>Add 10% 3M Sodium Acetate first, 1/1 volume 100% isopropanol second

>Invert 10x to mix

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Spin down at maximum speed for 5 minutes

>Decant completely

>Add 1ml of 70% ethanol

>Resuspend with 3x pipetting - dislodge the pellet but do not lose it

>Decant and completely dry ethanol

>Add 200ul of 60C sterile dH2O

>Incubate sample at 37C with agitation for 2 hours

>Pipette up and down 10x
Here’s a demonstration of how stringy high molecular weight genome extractions can be
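The precipitation step above scales with the resuspension volume, so a small helper can keep the reagent amounts consistent between runs. This is only a sketch under two assumptions that the protocol does not spell out: "10%" is read as one tenth of the sample volume, and "1/1 volume" of isopropanol is read as one volume relative to the sample plus the sodium acetate already added. The function name is made up for illustration.

```python
def precipitation_volumes(sample_ul, acetate_fraction=0.10, isopropanol_ratio=1.0):
    """Return (sodium acetate ul, isopropanol ul) for a given sample volume.

    Assumptions (not stated explicitly in the protocol):
    - 3M sodium acetate is added at acetate_fraction of the sample volume
    - isopropanol is added at isopropanol_ratio of (sample + acetate) volume
    """
    acetate_ul = sample_ul * acetate_fraction
    isopropanol_ul = (sample_ul + acetate_ul) * isopropanol_ratio
    return acetate_ul, isopropanol_ul


# The 100 ul dH2O resuspension used in this test run
acetate, isopropanol = precipitation_volumes(100)
print(f"3M sodium acetate: {acetate:.0f} ul, 100% isopropanol: {isopropanol:.0f} ul")
```

If the 1/1 volume is instead meant relative to the sample alone, the isopropanol figure for a 100ul sample drops from about 110ul to 100ul.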

It should be noted that not using proteinase and RNase has definite effects on the samples and how they behave during processing. For example, the DNA/cell debris pellets after lysis and sodium acetate/isopropanol application thicken to an almost gelatinous, rubbery degree that won’t easily dissolve. It’s an interesting demonstration of liquid dense with high molecular weight polymer (much like liquid crystal), but could wreak havoc with downstream applications.

DNA concentrations of each of the samples measured with Nanodrop are as follows – but please note that dense high molecular weight DNA extract could take a while to fully dissolve, and give wildly different reads from run to run.

Newer Halobacterium NRC-1 control - 156.8 ng/ul
Older Halobacterium NRC-1 control - 53.9 ng/ul
Newer Halobacterium acidotolerant mutant - 508.6 ng/ul
Older Halobacterium acidotolerant mutant - 264.1 ng/ul
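To put the Nanodrop readings in terms of total DNA per extraction, the concentrations can be multiplied by the 200ul elution volume from the protocol. The sketch below assumes the full 200ul is recovered and takes the readings at face value, which, given the dissolution caveat above, makes these ballpark figures at best.

```python
ELUTION_VOLUME_UL = 200  # final "Add 200ul of 60C sterile dH2O" step

# Nanodrop readings from this test run, in ng/ul
nanodrop_ng_per_ul = {
    "newer control": 156.8,
    "older control": 53.9,
    "newer mutant": 508.6,
    "older mutant": 264.1,
}


def total_yield_ug(conc_ng_per_ul, volume_ul=ELUTION_VOLUME_UL):
    """Total DNA in micrograms, assuming the whole elution volume is recovered."""
    return conc_ng_per_ul * volume_ul / 1000.0


def fold_difference(newer, older):
    """How many times more concentrated the newer culture's extract is."""
    return newer / older


for name, conc in nanodrop_ng_per_ul.items():
    print(f"{name}: {total_yield_ug(conc):.1f} ug total")

control_fold = fold_difference(nanodrop_ng_per_ul["newer control"],
                               nanodrop_ng_per_ul["older control"])
mutant_fold = fold_difference(nanodrop_ng_per_ul["newer mutant"],
                              nanodrop_ng_per_ul["older mutant"])
print(f"newer/older: control {control_fold:.1f}x, mutant {mutant_fold:.1f}x")
```

On these readings the newer cultures come out at roughly 2.9 times (control) and 1.9 times (mutant) the concentration of the older ones.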
Electrophoresis of the test genomic extracts. From left to right, NEB 1kb Plus ladder, Newer Halobacterium NRC-1 control, Older Halobacterium NRC-1 control, Newer Halobacterium acidotolerant mutant, Older Halobacterium acidotolerant mutant, NEB 1kb Extend ladder.

In our experience, your garden variety gel electrophoresis remains a fundamental tool in testing the quality of the genomic DNA extraction for long-read sequencing processes. While there is a fundamental limit to the size of DNA molecule that can be resolved by gel electrophoresis, it tends to give a pretty reliable measure of the integrity of the molecule, showing both ballpark concentration of DNA as well as indications of possible over-shearing.

The gel result here is interesting – quite different from the types of HMW DNA patterns I’ve seen with other microbial extracts in the past. Both the control and the mutant strains show bands at around the 1 kb and 2~3 kb regions. The Halobacterium NRC-1 genome has been fully sequenced before (2000, I believe), and it is known to carry a 2 Mb chromosome and two plasmids, pNRC100 and pNRC200, at 190 kb and 365 kb. Could it be this is indicative of a natural shearing from the gel electrophoresis? The one bright band just above the top band of the extended ladder is interesting as well – whatever the fragment might be, it’s just above the 48.5 kb level, which is far below the size expected from the known Halobacterium NRC-1 genome. Could it be that our Halobacterium cultures have some surprises in store for us?

Either way, the protocol seems like it will work without issues – time to move on to second testing with proteinase and RNase treatment. Let’s see if the smaller fragments remain in the second test run as well.

Searching for Phi X 174

This is the first post under the ‘history’ heading on this online lab note, so I’ll start with a bit of a tangent on why there will be more science history related notes and posts on this site going forward.

My first serious long-term exposure to biology began with relationships shared with diybio communities and participation in a number of iGEM competitions (with subsequent focus on genetic engineering-restriction digest cloning forming the crux of what I understood to be biology). However, the first time I genuinely began to study biology as a topic was with a deep dive into Jacques Monod’s graduate thesis – “Recherches sur la croissance des cultures bacteriennes.” In the thesis Monod describes diauxic phenomena in E. coli cultures resulting from different carbon sources, based on growth curves of the microbes as measured by their dry weight (from what I can tell, microbial growth curve measurement remained the de facto experimental tool in fashion all the way into the 70’s, much like RNA-seq or single cell sequencing of our day). The strikingly simple, almost primitive looking experimental methods he employed in his thesis against the backdrop of ongoing World War II left a deep impression on me, especially with the historical hindsight of knowing that both the Jacob-Monod model of gene regulation and the later study of allostery build upon this thesis work – of E. coli in a variety of sugar waters.

The study left me with a deep curiosity for the history of biology on a practical level, in terms of being aware of the tools and methods of thinking behind some of the major – or at least personally interesting – themes in biological research, as a kind of preparatory step before getting into the more modern descriptions and practices of the same topic (and I’ve gotten quite a bit better at it too. Hunting down the precise book and passage with Winogradsky’s description of his columnal growth chamber – the Winogradsky column – took all of a short afternoon. More on this at a later post). So I guess it was only natural for this East River screener to delve into the origins of model phages in use today, going back to the days of Delbruck-Luria and beyond.

The phage phi X 174 has always been one of the more accessible phages for the amateur – it’s a coliphage with a long steady stream of research data behind it, with the distinction of being the first DNA genome to be sequenced, back in the days of Sanger (Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes JC, Hutchison CA, Slocombe PM, Smith M. Nucleotide sequence of bacteriophage φX174 DNA. Nature. 1977 Feb;265(5596):687-95. https://doi.org/10.1038/265687a0). The 5386 base pair genome is no longer the smallest genome on record (a variety of Microviridae members seem to compete for the position, with one of the smallest genomes coming in at 4248 base pairs for a ssDNA phage vB_RpoMi-Mimi described in Zhan Y, Chen F. The smallest ssDNA phage infecting a marine bacterium. Environmental microbiology. 2019 Jun;21(6):1916-28. https://doi.org/10.1111/1462-2920.14394) but still small enough to provide a more accessible model genome for an enterprising student or an amateur biologist. So, phi X 174 remains a poster child for the beginning of modern molecular biology – but where precisely did it come from? Who first isolated the coliphage phi x 174, from where, and why did they give it that name? I always found the general lack of interest in the history and origin of laboratory phage strains curious, especially considering the possibility of evolution and mutation in those strains over the decades of utility in the labs. Maybe this was a chance to run with the lessons from studying Monod’s thesis and figure out a way to complement the history of science with laboratory experiments.

The first and most broadly available account of the origin of phi x 174 (especially if your search begins on the internet) must be the article written by the inimitable Sydney Brenner, in his column ‘Loose Ends’ (Brenner S. Bacteriophage tales. Current Biology. 1997 Nov 1;7(11):R736-7). In the article Sydney Brenner primarily describes adventurous tales of phage/molecular biologists making do with their local sewage systems when encountering labs seemingly loath to part with their precious phage samples, perhaps to protect themselves from a scoop. The phage phi x 174 is brought up briefly toward the end of the article, quoting Francois Jacob:

François Jacob told me that whenever he wanted a phage he would take himself off to the nearest pharmacie, where all kinds of phage were sold as remedies for intestinal complaints. I traced the history of these and found that many had been isolated from the Paris sewers by two characters called Sertic and Boulgakov. The X in phage φX174 is not the letter ‘eks’ but the Roman numeral ten, indicating that it was the 174th isolate from a particularly good sewer in the Xe arrondissement.

My naive assumption early on in the search for the origin of the phage phi x 174 was that it would eventually lead back to some laboratory setting, maybe to the Lederbergs, or perhaps to the fledgling labs of the Delbruck-Luria phage school during the early days of modern molecular biology. Bottled remedies from Parisian street corner pharmacies, collected from their historic sewer system by some enigmatic sounding characters Sertic and Boulgakov, were just about the furthest thing I expected to find as the origin of one of the key models of modern molecular biology – at this point I had to dig more into the background of the characters. What sort of experimental pipeline did they have to isolate, purify and characterize phages at scale in turn of the century Paris?

Standard pubmed search with the names returned results clustered around 1920’s and 1930’s, mostly associated with Sertic, Vladimir. Boulgakov N. becomes a more prominent name toward the late 1940’s, with a number of untranslated and undigitized (at least from pubmed links) papers on phages of B. coli.

While I was quickly able to establish that the Sertic and Boulgakov behind discovery of phi x 174 phage were established phage researchers active around 1920’s to 1950’s (the original passage from Sydney Brenner unwittingly left me with an impression they might be a more entrepreneurial type of pharmacists or freelancers working for a lab, a mistaken impression that was quickly corrected during the research), I still could not pinpoint the precise research paper describing discovery and/or characterization of the phage phi x 174. This was still very early in my amateur-research life, and I was too reliant on the search engines and automatically compiled databases, not realizing that often the true essence of literature search can only be found in human curated citations and references written up as part of an existing literature (it’s been a little over half a decade since my initial research, and even now google scholar and pubmed don’t deliver anything close to a reliable answer to the origin of phi x 174 via brute force search, though they’ve certainly gotten a lot better at providing more options).

The real breakthrough in the search ended up coming from the reference section of a type of compendium of bacteriophage related topics, “The Bacteriophages. United Kingdom, Oxford University Press, USA, 2006. volume II”, precisely the reference 111 on page 144.

Sertic, V., and N. Bulgakov. 1935. Classification et identification des typhi-phage. C. R. Soc. Biol. Paris 119:1270-1272

It was a week or so into the search by the time I arrived at what was extremely likely to be the very first description of the phage phi x 174 in history. Yet even with the solid reference at hand, I wasn’t able to find a digitized version of the particular paper anywhere, let alone a translation. At the time the only way to access the paper seemed to be obtaining a hard copy from somewhere (which was the case for the Jacques Monod thesis – we weren’t able to find a digitized version anywhere, so we had to get an interlibrary loan through the New York Public Library system, take cellphone pictures of each of the pages and then translate them into English by hand). Alas, I wasn’t able to find any publicly accessible library system in the United States with a hard copy of the Comptes Rendus des seances de Societe Biologie 1935 – and lacking any connections to academia or someone in France, it looked like the search had come to a standstill with no recourse.

Another week went by while sending out requests to acquaintances and going off on more fruitless searches for either digitally or physically accessible copies of the phi x 174 references. It turns out, as with the reference based searches for specific research papers, human search tends to get us far better results than algorithmic ones, especially if the topic of the search veers off the beaten path. A friend of my friend, Cory Tobin, soon got in touch and let us know about a rather wonderful resource called Gallica (https://gallica.bnf.fr/) – a freely accessible digital library service of the Bibliotheque nationale de France, which indexes an impressive array of printed materials, including academic publications going back hundreds of years.

From there on it was a relatively trivial matter to find the correct volume and article – page 1270 from July 20th 1935 Report on meetings of the Society of Biology. Classification and identification of typhi-phage (as in, Salmonella typhi). The specific diagram containing what is likely the very first naming and characterization of the phage phi x 174, published July 20th of 1935, is attached below (page 1271). I marked the row for phi x 174 entry with a red dot.

July 20th, 1935. Page 1271 by Sertic V. and Boulgakov N. Phi x 174 phage entry is marked with a red dot, left side, first column.

A few details jumped out at me as I translated the chart and the accompanying texts. Sertic and Boulgakov carried out these collection and characterization experiments as part of the independent and privately owned Felix d’Herelle lab – the eponymous co-discoverer of phage and an early proponent of phage therapy. Their description of phi x 174 behavior includes that of plaque formation patterns associated with the phage, categorizing it as among those resulting in secondary colonies of the exposed hosts, with clear lysis in the center of the plaque and partial lysis toward the periphery. The phage is capable of passing through the most extensive of filtering apparatus they had at hand (they used ceramic ‘Chamberland candles’, otherwise also referred to as Pasteur-Chamberland filters – popular ceramic filtration devices whose legacy continues to this day).

An example of Chamberland filter (https://www.sciencehistory.org/distillations/the-filter-of-life)

Major characteristics of the phage are well defined in its naming scheme – unlike Sydney Brenner’s assumption, the phi x 174 name does not refer to Paris arrondissement X. Instead, phi represents phages with broader spectrum infection behavior – in this case infecting Eberthella typhi, Salmonella typhi, Shigella dysenteriae and other microbes with both ‘rough’ and ‘smooth’ colony forming behavior. X represents a category under the heading ‘group according to specific immunity’, with the particular phi 174 phage belonging to the immunity group X, described as ‘main antigen’, with phi x 174 in particular showing ‘accessory antigenic’ behavior for group VIII, XI and XII. I believe d’Herelle lab was attempting to categorize phages under their antigenic behavior, perhaps observing exposure to agents like rabbit blood. Another interesting observation describes the phi x 174 phage being able to survive up to 74 C heat, though I doubt if they ever described the specific duration for the heat exposure in the paper.

So, there it is – the account of the phage phi x 174 for the first time in history, along with an explanation of its namesake – phage sample 174 with broad spectrum of associated hosts and immune response property category X, under the internal classification scheme of Felix d’Herelle’s independent laboratory. Sertic V. and Boulgakov N. were not random freelancers of the Paris sewer system; they were pioneering researchers of the nascent phage research field right before the dawn of the DNA era.

Sertic and Boulgakov’s works could not have happened in isolation either – at one point one would have to ask the question of exactly how a French laboratory sample made it to the shores of the United States and other research laboratories around the world, and established itself as a model system. I pored through all published research records in the Comptes Rendus des seances de Societe Biologie under Sertic or Boulgakov name, narrowing down their activity to between 1929 and 1944. The record of phi x 174 was published in 1935, but they do specifically mention in the paper that sampling was done years prior to the publication itself. This leaves two possible paths for the laboratory lineage of phi x 174 we could reasonably pursue.

1) Felix d’Herelle brought an earlier, less characterized sample of phi x 174 as part of his phage library when he was hired as the chair of microbiology department at Yale, around 1926. His tenure lasted until 1933 when he resigned due to personal issues and left his apparently significant library of phages with his graduate students – which was later passed on to Delbruck and Luria, among others interested in utilizing the phage work.

2) Sertic and Boulgakov mention a frequent collaborator, one Igor Nicholas Asheshov, who introduced a specific membrane assisted method of filtering phages from wildtype samples (1 cc of sample was passed through 1 cm sq of membrane at what they refer to as 60 cm of negative pressure). Asheshov would later become a bacteriologist in chief of a research program focusing on inhibition of bacterial viruses at the New York Botanical Gardens, and his position allowed him to leave behind a significant paper trail of preserved letters with his contemporary researchers such as the Lederbergs. According to records he was rather prolific in disseminating his phage library brought over from the d’Herelle lab.

While the timing and circumstances of both possibilities remain more than a little suspect (Felix d’Herelle’s 1926 appointment sounds a little early for phi x 174 discovery, and Asheshov’s appointment to NYBG around late 1940’s sounds a little too late), I feel like there is a good reason to get a decent enough timeline for the lineage of phi x 174 – notably as a first step in tracking long term evolutionary behavior of viruses as laboratory samples. We need to remember that the life of phi x 174 began before the idea of cold-chains and -80 C storage. Transfer and storage of phages for the first two to four decades of its life involved vials and slant cultures and nothing more. What sort of changes did a mysterious Microviridae isolated in 1920~1930’s go through being passed from continent to continent, lab to lab? I’m currently designing a study to answer that very question, with the goal of comparing characterization of phi x 174 in various labs through time against a modern sample of phi x 174 recovered directly from the Paris sewers.

I’ll close this brief foray into amateur science history sleuthing with a translation of a later passage in the 1935 Sertic and Boulgakov paper – I found this one more mind-blowing than anything I’ve found through the course of this search, and perhaps you’ll get a kick out of this one too.

What makes the identification possible for the bacteriophage is the extreme variability of the characters of the various races of Bacteriophages, a fundamental character which has been pointed out by d'Herelle, of his first researches. According to its own expression, each race (strain) of Bacteriophages presents a "mosaic of properties", each of these properties being able to vary independently (7 *). The table accompanying this note amply confirms it.

Could it be that d’Herelle, Sertic and Boulgakov suspected mosaicism in inheritable traits of phages in 1930’s? Could this be a historic precursor to the modern concept of the mosaicism of phage genomes?

Halobacteria note; sedimentations from stoppered still culture flasks of NRC-1 mutant and control

As I’ve written about a couple of times in this lab note, I have ongoing still-cultures of Halobacterium NRC-1 control and its tentative acidotolerant mutant strain going on in our studio-lab. Still-culture makes it sound a little fancier than it is – what I’m essentially doing is inoculating some fresh media (ATCC 213 Halobacterium media, though the salt component is always a store bought one and I’ll probably end up formulating something more kitchen-friendly to save resources further…) with fresh overnight culture of NRC-1 and its mutant strain, putting a rubber stopper on the flask and throwing it in a light box for a year or more.

Our glorious long-term Halobacteria still-culture experiments. Pardon the disheveled appearance, I was also drying some flasks in there. The light in there is on 24/7, and the temperature is maintained at a toasty 30’C+ with the help of the egg incubator, and the heat from the LED panel itself.

A couple of days ago (Monday, July 5th) was the date to check the status of the NRC-1 and mutant culture flasks sealed since April of last year. Despite the year-old cultures in other flasks still going rather strong, the flasks from April (of 2020, so about 15 months ago) are looking quite bleached and dead. I suspect at about 50ml volume per culture, a year and at best a couple of months is the upper limit for continued survival of the Halobacterium NRC-1, and the crashing of the culture might happen at a rapid rate rather than through a gradual degradation. Both flasks uniformly show a thin white layer on the surface and some sedimentation at the bottom. A couple of observations – for some reason the media in the flask with NRC-1 control seems paler. Like a pale fluorescent yellow type of color. I don’t believe the media was originally of that coloration, especially since I have some of the original media from more than a year ago that shows a decidedly darker color- but of course this is still just an observation not backed by any real analysis.

Sedimentation from the still-culture flasks. Mutant NRC-1 strain on the left, control NRC-1 strain on the right

What surprised me the most was the drastic difference of sedimentation found built up on the bottom of each of the flasks. The control NRC-1 was looking mostly bleached out, white film on top surface of the media that would break with agitation, and some white residues at the bottom of the flask. The suspected acidotolerant NRC-1 mutant showed the same white film on top surface of the media, but a dark brick red residue build up on the bottom of the flask. I suspected some sort of contamination (in a Halo media? It’s been known to happen…), but for this particular date I actually had a repeat of the configuration in two separate flasks from April of 2020, and both (containing NRC-1 control and NRC-1 tentative mutant) flasks show the same pattern of drastically different sedimentation color.

Sedimentation tubes in front of their respective cell line flasks, though the flasks are test cultures from last month. Left, tentative mutant Halobacterium NRC-1 culture, June 27th 2021 with sedimentation tube from April 2020 culture of the same strain type. Right, control Halobacterium NRC-1 culture, June 27th 2021 with sedimentation tube from April 2020 culture of the same strain type.

Frankly I’m at a loss on what kind of analysis I could even run on this sedimentation/residue. I’ve set up more test still cultures, but I won’t be able to tell the results for at least another 15 months, so this will be a long wait.

Many eyes of Halobacteria – screening Archaea for retinalphototrophy

The Flongle sequencing of our potentially mutant Halobacteria NRC-1 is right around the corner (as soon as we get that interminable computer hardware issue out of the way), and I’ve been busy catching up on literature and compiling tools and data we might need for post-sequencing analysis.

I have limited goals for the eventual Flongle based genome assembly, revolving around the question of how we can pinpoint any differences between our suspected mutant and the reference NCBI NRC-1 genome solely based on what I suspect will be a genomic assembly of somewhat limited quality (Flongles aren’t meant for 30x full genome assembly of organisms, and most success stories I’ve heard involved plasmids and barcoded inputs), and of course, figuring out what sort of genetic markers we can keep an eye on that could tell us more about the evolutionary history of the Halobacteria class in general.
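As a rough sanity check on the 30x figure, the number of sequenced bases needed can be estimated from the genome size. The sketch below uses the rounded NRC-1 replicon sizes quoted in the extraction test 1 note (a 2 Mb chromosome plus the 190 kb and 365 kb plasmids). It ignores read length, error rate and uneven coverage, and any run yield fed into expected_coverage is a number you would have to supply yourself.

```python
# Rounded replicon sizes for Halobacterium NRC-1, as quoted in these notes
REPLICON_SIZES_BP = {
    "chromosome": 2_000_000,
    "pNRC100": 190_000,
    "pNRC200": 365_000,
}
GENOME_SIZE_BP = sum(REPLICON_SIZES_BP.values())


def bases_needed(target_coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total sequenced bases required for a given average coverage."""
    return target_coverage * genome_size_bp


def expected_coverage(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Average coverage obtained from a given number of sequenced bases."""
    return total_bases / genome_size_bp


print(f"Genome size: {GENOME_SIZE_BP / 1e6:.3f} Mb")
print(f"Bases needed for 30x: {bases_needed(30) / 1e6:.2f} Mb")
```

With these rounded sizes, 30x average coverage works out to a little under 77 Mb of sequence, before accounting for reads lost to quality filtering.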

The first step in finding markers for phylogenetic analysis was creating a SCG set for Halobacteria. This is the pfam based screening process looking at 50%+ protein coverage per entry accession, and then checking them for at least 90% representation among the group described earlier on this lab notebook. As this particular process for generating SCG set is aimed at rapid prototyping rather than accuracy, it’s usually important to look at the individual entries and note any particular idiosyncrasies. For example, I have two different SCG sets for Archaea – first one looking at Archaea in general, and second one I made with Halobacteria class species (both sets utilizing only complete genomes from NCBI).
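The two thresholds above can be sketched as a small filter over per-genome pfam hits. This is not the actual pipeline used for these sets, only an illustration under stated assumptions: the input is a hypothetical dictionary mapping each genome to (pfam accession, coverage fraction) pairs, "50%+ protein coverage" is read as a hit covering at least half of the protein, and a pfam only counts for a genome when it appears there exactly once (reading SCG as single-copy gene).

```python
from collections import Counter


def candidate_scg(hits_by_genome, min_coverage=0.5, min_representation=0.9):
    """Return pfam accessions that pass both screening thresholds.

    hits_by_genome: {genome_id: [(pfam_accession, coverage_fraction), ...]}
    (hypothetical input layout, e.g. parsed from an hmmscan table)
    """
    n_genomes = len(hits_by_genome)
    genomes_with_single_copy = Counter()
    for hits in hits_by_genome.values():
        # Count copies per pfam within this genome, ignoring low-coverage hits
        copies = Counter(acc for acc, cov in hits if cov >= min_coverage)
        for acc, n_copies in copies.items():
            if n_copies == 1:  # assumption: single-copy requirement
                genomes_with_single_copy[acc] += 1
    return sorted(
        acc
        for acc, n in genomes_with_single_copy.items()
        if n / n_genomes >= min_representation
    )
```

Loosening min_representation or dropping the single-copy check would give a larger, noisier candidate list, which is the kind of trade-off that makes a manual look at individual entries worthwhile.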

The Archaeal SCG HMM set contains a total of 76 genes, and the Halobacteria SCG HMM set contains a total of 215 genes. Under the gene screening criteria described above, one would assume the Halobacteria SCG HMM set would contain all the genes in the Archaea SCG HMM set, with the addition of genes widely distributed only within Halobacteria. Alas, that is not the case. Copied below are 6 SCG pfams that only show up during Archaea-wide screening.

PF00121 TIM             Triosephosphate isomerase
PF00368 HMG-CoA_red     HMG-CoA_red1;   Hydroxymethylglutaryl-coenzyme A reductase
PF01139 RtcB    UPF0027;        tRNA-splicing ligase RtcB
PF01951 Archease        DUF101;         Archease protein family (MTH1598/TM1083)
PF02996 Prefoldin       DUF232; Prefoldin subunit
PF13656 RNA_pol_L_2             RNA polymerase Rpb3/Rpb11 dimerisation domain
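The comparison between the two sets is plain set arithmetic, and the counts quoted in this note (76, 215, 6 and 145) can be cross-checked against each other. The sketch below does that bookkeeping and includes a generic helper for comparing two accession lists. Only the six Archaea-only accessions are taken from the list above; the helper name is made up for illustration.

```python
# The six accessions that only appear in the Archaea-wide screening
archaea_only = {"PF00121", "PF00368", "PF01139", "PF01951", "PF02996", "PF13656"}

ARCHAEA_SET_SIZE = 76   # total pfams in the Archaeal SCG HMM set
HALO_SET_SIZE = 215     # total pfams in the Halobacteria SCG HMM set
HALO_ONLY_COUNT = 145   # pfams reported as specific to the Halobacteria set
DUF_ONLY_COUNT = 46     # of those, classified as unknown function

# Both subtractions should give the same number of shared pfams
shared_via_archaea = ARCHAEA_SET_SIZE - len(archaea_only)
shared_via_halo = HALO_SET_SIZE - HALO_ONLY_COUNT
duf_percent = round(100 * DUF_ONLY_COUNT / HALO_ONLY_COUNT, 1)


def compare_sets(a, b):
    """Return (only in a, shared, only in b) for two accession collections."""
    a, b = set(a), set(b)
    return a - b, a & b, b - a


print(f"shared pfams: {shared_via_archaea} (Archaea side), {shared_via_halo} (Halobacteria side)")
print(f"unknown-function share of Halobacteria-only pfams: {duf_percent}%")
```

Both subtractions give 70 shared pfams, so the reported counts are internally consistent, and 46 out of 145 matches the 31.7% quoted in this note.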

And below are 145 SCG pfams specific to the Halobacteria SCG HMM set (out of a total of 215) – a whopping 46 (31.7% of the SCGs that only show up in Halobacteria) are classified under domains/families of unknown function, not even including DUF-adjacent genes with at least some degree of annotation! Despite Halobacteria being arguably the best studied models of Archaea, we still don’t know what a large share of their SCG candidate genes actually do (that DNA topoisomerase in the Halobacteria SCG set certainly stands out to me as well).

PF00025 Arf     arf;    ADP-ribosylation factor family
PF00125 Histone histone;        Core histone H2A/H2B/H3/H4
PF00137 ATP-synt_C              ATP synthase subunit C
PF00146 NADHdh          NADH dehydrogenase
PF00162 PGK             Phosphoglycerate kinase
PF00215 OMPdecase               Orotidine 5'-phosphate decarboxylase / HUMPS family
PF00224 PK              Pyruvate kinase, barrel domain
PF00275 EPSP_synthase   EPSP_syntase;   EPSP synthase (3-phosphoshikimate 1-carboxyvinyltransferase)
PF00278 Orn_DAP_Arg_deC         Pyridoxal-dependent decarboxylase, C-terminal sheet domain
PF00342 PGI             Phosphoglucose isomerase
PF00393 6PGD            6-phosphogluconate dehydrogenase, C-terminal domain
PF00475 IGPD            Imidazoleglycerol-phosphate dehydratase
PF00490 ALAD            Delta-aminolevulinic acid dehydratase
PF00499 Oxidored_q3     oxidored_q3;    NADH-ubiquinone/plastoquinone oxidoreductase chain 6
PF00507 Oxidored_q4     oxidored_q4;    NADH-ubiquinone/plastoquinone oxidoreductase, chain 3
PF00588 SpoU_methylase          SpoU rRNA Methylase family
PF00677 Lum_binding             Lumazine binding domain
PF00697 PRAI            N-(5'phosphoribosyl)anthranilate (PRA) isomerase
PF00719 Pyrophosphatase         Inorganic pyrophosphatase
PF00764 Arginosuc_synth         Arginosuccinate synthase
PF00800 PDT             Prephenate dehydratase
PF00814 TsaD    Glycoprotease; Peptidase_M22;   tRNA N6-adenosine threonylcarbamoyltransferase
PF00815 Histidinol_dh           Histidinol dehydrogenase
PF00885 DMRL_synthase           6,7-dimethyl-8-ribityllumazine synthase
PF00926 DHBP_synthase           3,4-dihydroxy-2-butanone 4-phosphate synthase
PF01025 GrpE            GrpE
PF01131 Topoisom_bac            DNA topoisomerase
PF01259 SAICAR_synt             SAICAR synthetase
PF01264 Chorismate_synt Chorismate_synth;       Chorismate synthase
PF01269 Fibrillarin             Fibrillarin
PF01379 Porphobil_deam          Porphobilinogen deaminase, dipyromethane cofactor binding domain
PF01432 Peptidase_M3            Peptidase family M3
PF01556 DnaJ_C  DnaJ_C; CTDII;  DnaJ C terminal domain
PF01634 HisG            ATP phosphoribosyltransferase
PF01641 SelR    DUF25;  SelR domain
PF01680 SOR_SNZ UPF0019;        SOR/SNZ family
PF01722 BolA            BolA-like protein
PF01761 DHQ_synthase            3-dehydroquinate synthase
PF01784 NIF3    DUF34;  NIF3 (NGG1p interacting factor 3)
PF01808 AICARFT_IMPCHas         AICARFT/IMPCHase bienzyme
PF01862 PvlArgDC        DUF44;  Pyruvoyl-dependent arginine decarboxylase (PvlArgDC)
PF01870 Hjc     DUF50;  Archaeal holliday junction resolvase (hjc)
PF01874 CitG            ATP:dephospho-CoA triphosphoribosyl transferase 
PF01876 RNase_P_p30     DUF53;  RNase P subunit p30
PF01887 SAM_adeno_trans DUF62;  S-adenosyl-l-methionine hydroxide adenosyltransferase
PF01900 RNase_P_Rpp14   DUF69;  Rpp14/Pop5 family
PF01923 Cob_adeno_trans DUF80;  Cobalamin adenosyltransferase
PF01928 CYTH    Adenylate_cyc_2;        CYTH domain
PF01933 CofD    UPF0052;        2-phospho-L-lactate transferase CofD
PF01940 DUF92           Integral membrane protein DUF92
PF01941 AdoMet_Synthase DUF93;  S-adenosylmethionine synthetase (AdoMet synthetase)
PF01949 DUF99           Protein of unknown function DUF99
PF01955 CbiZ    DUF105;         Adenosylcobinamide amidohydrolase
PF01959 DHQS    DUF109;         3-dehydroquinate synthase II
PF01967 MoaC            MoaC family
PF01983 CofC    DUF121; Guanylyl transferase CofC like
PF01996 F420_ligase     DUF129;         F420-0:Gamma-glutamyl ligase
PF02153 PDH             Prephenate dehydrogenase
PF02289 MCH             Cyclohydrolase (MCH)
PF02391 MoaE    MoeA; MoeE;     MoaE protein
PF02464 CinA            Competence-damaged protein
PF02548 Pantoate_transf         Ketopantoate hydroxymethyltransferase
PF02572 CobA_CobO_BtuR          ATP:corrinoid adenosyltransferase BtuR/CobO/CobP
PF02577 DUF151  DUF151; DNase-RNase;    Domain of unknown function (DUF151)
PF02592 Vut_1   DUF165; Putative vitamin uptake transporter
PF02598 Methyltrn_RNA_3 DUF171; Putative RNA methyltransferase
PF02602 HEM4            Uroporphyrinogen-III synthase HemD
PF02616 SMC_ScpA        DUF173; ScpA_ScpB;      Segregation and condensation protein ScpA
PF02624 YcaO    DUF181;         YcaO cyclodehydratase, ATP-ad Mg2+-binding
PF02632 BioY            BioY family
PF02649 GCHY-1  DUF198; Type I GTP cyclohydrolase folE2
PF02654 CobS            Cobalamin-5-phosphate synthase
PF02686 Glu-tRNAGln             Glu-tRNAGln amidotransferase C subunit
PF02700 PurS    UPF0062; PurC;  Phosphoribosylformylglycinamidine (FGAM) synthase
PF02784 Orn_Arg_deC_N           Pyridoxal-dependent decarboxylase, pyridoxal binding domain
PF03186 CobD_Cbib               CobD/Cbib protein
PF03684 UPF0179         Uncharacterised protein family (UPF0179)
PF03833 PolC_DP2                DNA polymerase II large subunit DP2
PF03966 Trm112p DUF343;         Trm112p-like protein
PF04013 Methyltrn_RNA_2 DUF358; Putative SAM-dependent RNA methyltransferase
PF04018 DUF368          Domain of unknown function (DUF368)
PF04027 DUF371          Domain of unknown function (DUF371)
PF04038 DHNA    DUF381; Dihydroneopterin aldolase
PF04135 Nop10p          Nucleolar RNA-binding protein, Nop10p family
PF04186 FxsA            FxsA cytoplasmic membrane protein 
PF04242 DUF424          Protein of unknown function (DUF424)
PF04289 DUF447          Protein of unknown function (DUF447)
PF04414 tRNA_deacylase  DUF516;         D-aminoacyl-tRNA deacylase
PF04493 Endonuclease_5  Endonuc_V;      Endonuclease V
PF04894 Nre_N   DUF650; Archaeal Nre, N-terminal
PF05173 DapB_C          Dihydrodipicolinate reductase, C-terminus
PF05854 MC1             Non-histone chromosomal protein MC1
PF06550 SPP     DUF1119;        Signal-peptide peptidase, presenilin aspartyl protease
PF06559 DCD             2'-deoxycytidine 5'-triphosphate deaminase (DCD)
PF06778 Chlor_dismutase         Chlorite dismutase
PF07754 HVO_2753_ZBP    DUF1610;        Small zinc finger protein HVO_2753-like, Zn-binding pocket
PF07826 IMP_cyclohyd            IMP cyclohydrolase-like protein
PF08540 HMG_CoA_synt_C          Hydroxymethylglutaryl-coenzyme A synthase C terminal
PF08617 CGI-121         Kinase binding protein CGI-121
PF09341 Pcc1            Transcription factor Pcc1
PF09721 Exosortase_EpsH         Transmembrane exosortase (Exosortase_EpsH)
PF09845 DUF2072         Zn-ribbon containing protein
PF09846 DUF2073         Uncharacterized protein conserved in archaea (DUF2073)
PF09876 DUF2103         Predicted metal-binding protein (DUF2103)
PF09883 DUF2110         Uncharacterized protein conserved in archaea (DUF2110)
PF09920 DUF2150         Uncharacterized protein conserved in archaea (DUF2150)
PF09999 DUF2240         Uncharacterized protein conserved in archaea (DUF2240)
PF10103 Zincin_2        DUF2342; Zinicin_2;     Zincin-like metallopeptidase
PF10977 DUF2797         Protein of unknown function (DUF2797)
PF11255 DUF3054         Protein of unknown function (DUF3054)
PF11755 DUF3311         Protein of unknown function (DUF3311)
PF13654 AAA_32          AAA domain
PF18446 DUF5611         Domain of unknown function (DUF5611)
PF18477 PIN_9           PIN like domain
PF19090 DUF5778         Family of unknown function (DUF5778)
PF19091 DUF5779         Family of unknown function (DUF5779)
PF19093 DUF5781         Family of unknown function (DUF5781)
PF19094 EMC6_arch       DUF5782;        EMC6-arch
PF19096 DUF5784         Family of unknown function (DUF5784)
PF19098 DUF5785         Family of unknown function (DUF5785)
PF19103 DUF5790         Family of unknown function (DUF5790)
PF19104 DUF5791         Family of unknown function (DUF5791)
PF19106 DUF5793         Family of unknown function (DUF5793)
PF19108 DUF5795         Family of unknown function (DUF5795)
PF19109 DUF5796         Family of unknown function (DUF5796)
PF19110 DUF5797         Family of unknown function (DUF5797)
PF19113 DUF5799         Family of unknown function (DUF5799)
PF19115 DUF5800         Family of unknown function (DUF5800)
PF19118 DUF5802         Family of unknown function (DUF5802)
PF19119 DUF5803         Family of unknown function (DUF5803)
PF19120 DUF5804         Family of unknown function (DUF5804)
PF19123 DUF5807         Family of unknown function (DUF5807)
PF19125 DUF5809         Family of unknown function (DUF5809)
PF19128 DUF5811         Family of unknown function (DUF5811)
PF19129 DUF5812         Family of unknown function (DUF5812)
PF19130 DUF5813         Family of unknown function (DUF5813)
PF19132 DUF5815         Family of unknown function (DUF5815)
PF19137 DUF5820         Family of unknown function (DUF5820)
PF19145 DUF5827         Family of unknown function (DUF5827)
PF19146 DUF5828         Family of unknown function (DUF5828)
PF19148 DUF5830         Family of unknown function (DUF5830)
PF19646 DUF6149         Family of unknown function (DUF6149)
PF19769 CPxCG_zf                CPxCG-related zinc finger
PF19792 DUF6276         Family of unknown function (DUF6276)
PF20024 DUF6432         Family of unknown function (DUF6432)

Just for the sake of reference, here are the pfams of single-copy genes that show up across both the Archaea and Halobacteria genome sets.

PF00164 Ribosom_S12_S23 S12; Ribosomal_S12;     Ribosomal protein S12/S23
PF00177 Ribosomal_S7    S7;     Ribosomal protein S7p/S5e
PF00203 Ribosomal_S19   S19;    Ribosomal protein S19
PF00237 Ribosomal_L22   L22;    Ribosomal protein L22p/L17e
PF00238 Ribosomal_L14   L14;    Ribosomal protein L14p/L23e
PF00252 Ribosomal_L16   L16;    Ribosomal protein L16p/L10e
PF00276 Ribosomal_L23   L23;    Ribosomal protein L23
PF00297 Ribosomal_L3    L3;     Ribosomal protein L3
PF00312 Ribosomal_S15   S15;    Ribosomal protein S15
PF00318 Ribosomal_S2    S2;     Ribosomal protein S2
PF00334 NDK             Nucleoside diphosphate kinase
PF00344 SecY    secY;   SecY
PF00347 Ribosomal_L6    L6;     Ribosomal protein L6
PF00366 Ribosomal_S17   S17;    Ribosomal protein S17
PF00380 Ribosomal_S9    S9;     Ribosomal protein S9/S16
PF00410 Ribosomal_S8    S8;     Ribosomal protein S8
PF00411 Ribosomal_S11   S11;    Ribosomal protein S11
PF00416 Ribosomal_S13   S13;    Ribosomal protein S13/S18
PF00572 Ribosomal_L13   L13;    Ribosomal protein L13
PF00573 Ribosomal_L4    L1e; Ribosomal_L1e;     Ribosomal protein L4/L1 family
PF00687 Ribosomal_L1    L1;     Ribosomal protein L1p/L10e family
PF00709 Adenylsucc_synt         Adenylosuccinate synthetase
PF00749 tRNA-synt_1c            tRNA synthetases class I (E and Q), catalytic domain
PF00750 tRNA-synt_1d            tRNA synthetases class I (R)
PF00827 Ribosomal_L15e          Ribosomal L15
PF00831 Ribosomal_L29           Ribosomal L29 protein
PF00832 Ribosomal_L39           Ribosomal L39 protein
PF00833 Ribosomal_S17e  Ribosomal_S17;  Ribosomal S17
PF00935 Ribosomal_L44   L44;    Ribosomal protein L44
PF01015 Ribosomal_S3Ae          Ribosomal S3Ae family
PF01090 Ribosomal_S19e  S19e;   Ribosomal protein S19e
PF01142 TruD            tRNA pseudouridine synthase D (TruD)
PF01157 Ribosomal_L21e  L21e;   Ribosomal protein L21e
PF01192 RNA_pol_Rpb6            RNA polymerase Rpb6 
PF01194 RNA_pol_N               RNA polymerases N / 8 kDa subunit
PF01198 Ribosomal_L31e          Ribosomal protein L31e
PF01200 Ribosomal_S28e          Ribosomal protein S28e
PF01201 Ribosomal_S8e           Ribosomal protein S8e
PF01282 Ribosomal_S24e          Ribosomal protein S24e
PF01351 RNase_HII               Ribonuclease HII
PF01496 V_ATPase_I      V_ATPase_sub_a;         V-type ATPase 116kDa subunit family  
PF01655 Ribosomal_L32e          Ribosomal protein L32
PF01667 Ribosomal_S27e          Ribosomal protein S27
PF01725 Ham1p_like              Ham1 family
PF01780 Ribosomal_L37ae         Ribosomal L37ae protein family
PF01813 ATP-synt_D              ATP synthase subunit D 
PF01864 CarS-like       DUF46;  CDP-archaeol synthase
PF01866 Diphthamide_syn         Putative diphthamide synthesis protein
PF01912 eIF-6   eIF6;   eIF-6 family
PF01948 PyrI            Aspartate carbamoyltransferase regulatory chain, allosteric domain
PF01981 PTH2    DUF119;UPF0099; Pep-tRNA_hydrol;        Peptidyl-tRNA hydrolase PTH2
PF01982 CTP-dep_RFKase  DUF120;         Domain of unknown function DUF120
PF01984 dsDNA_bind      DUF122;         Double-stranded DNA-binding domain
PF01990 ATP-synt_F              ATP synthase (F/14-kDa) subunit
PF01991 vATP-synt_E             ATP synthase (E/31 kDa) subunit
PF01992 vATP-synt_AC39          ATP synthase (C/AC39) subunit
PF01994 Trm56   DUF127; tRNA ribose 2'-O-methyltransferase, aTrm56
PF02006 PPS_PS  DUF137; Phosphopantothenate/pantothenate synthetase
PF03874 RNA_pol_Rpb4            RNA polymerase Rpb4
PF04010 DUF357          Protein of unknown function (DUF357)
PF04019 DUF359          Protein of unknown function (DUF359)
PF04104 DNA_primase_lrg         Eukaryotic and archaeal DNA primase, large subunit
PF04919 DUF655          Protein of unknown function (DUF655)
PF05221 AdoHcyase               S-adenosyl-L-homocysteine hydrolase
PF05833 NFACT_N FbpA;   NFACT N-terminal and middle domains
PF06026 Rib_5-P_isom_A          Ribose 5-phosphate isomerase A (phosphoriboisomerase A)
PF06093 Spt4            Spt4/RpoE2 zinc finger
PF13393 tRNA-synt_His           Histidyl-tRNA synthetase
PF16906 Ribosomal_L26           Ribosomal proteins L26 eukaryotic, L24P archaeal
PF17144 Ribosomal_L5e           Ribosomal large subunit proteins 60S L5, and 50S L18
Quick note on the process I used
The Halobacteria- and Archaea-specific lists of pfam accessions were extracted from HMMs I curated beforehand. It's your standard grep '^ACC' affair, with either cut or awk to clean up the output.
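For anyone following along, the accession-extraction step could look roughly like the sketch below. It assumes HMMER3-style profile files, where each model carries a line such as "ACC   PF00275.25", and it assumes the version suffix should be dropped so the accessions line up with pfamA.txt. The function name, the toy file and the version numbers in it are made up for illustration.

```shell
# Hypothetical helper: print the unique Pfam accessions found in an HMM file.
# Assumes ACC lines of the form "ACC   PF00275.25" (HMMER3 profile format).
# The awk step strips the trailing ".<version>" from the accession.
extract_pfam_acc() {
    grep '^ACC' "$1" | awk '{ sub(/\.[0-9]+$/, "", $2); print $2 }' | sort -u
}

# Toy stand-in for a curated HMM file; real profiles carry many more fields.
demo_hmm=$(mktemp)
printf 'NAME  EPSP_synthase\nACC   PF00275.25\n//\nNAME  PGI\nACC   PF00342.22\n//\n' > "$demo_hmm"
extract_pfam_acc "$demo_hmm"
rm -f "$demo_hmm"
```

Sorting at this stage is deliberate: the comparison steps that follow only behave predictably when both accession lists are sorted the same way.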

After that I use diff to find entries present in one HMM list but not in the other (both lists need to be sorted the same way for this to work). For example, for the Halobacteria-specific pfams:

diff archaea_hmm_pfam.txt halo_hmm_pfam.txt | grep ">" | cut -d ' ' -f 2 > halo_only_pfam.txt
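An alternative sketch of the same set difference uses comm, which reports lines unique to each of two sorted files without any diff markers to strip. This is not the command I actually ran, and the helper name and toy lists are made up for illustration.

```shell
# Hypothetical helper: accessions present in the second list but not in the first.
only_in_second() {
    tmp_first=$(mktemp); tmp_second=$(mktemp)
    # comm requires sorted input, so sort (and de-duplicate) both lists first
    sort -u "$1" > "$tmp_first"
    sort -u "$2" > "$tmp_second"
    # -1 hides lines unique to the first file, -3 hides lines common to both
    comm -13 "$tmp_first" "$tmp_second"
    rm -f "$tmp_first" "$tmp_second"
}

# Toy lists standing in for archaea_hmm_pfam.txt and halo_hmm_pfam.txt
demo_dir=$(mktemp -d)
printf 'PF00164\nPF00177\n' > "$demo_dir/archaea.txt"
printf 'PF00275\nPF00164\nPF00177\n' > "$demo_dir/halo.txt"
only_in_second "$demo_dir/archaea.txt" "$demo_dir/halo.txt"   # prints PF00275
rm -rf "$demo_dir"
```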

And the resulting file is used as input for grep against pfamA.txt - this step is to retrieve annotations for the pfam accessions.

grep -f halo_only_pfam.txt pfamA.txt | cut -f 1,2,3,4 > halo.temp

The output of the first grep is going to be noisy, since grep also retrieves entries that merely mention one of the accessions somewhere in their annotation text (this happens surprisingly often). That is why the output is trimmed to the first four columns, and why we then clean it up by running the same accession list against the trimmed file.

grep -f halo_only_pfam.txt halo.temp > halo_only_scg.list
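The two-pass grep can also be collapsed into a single exact match on the accession column. The sketch below assumes pfamA.txt is tab-separated with the accession in the first column (which is what the cut -f 1,2,3,4 output above suggests); the helper name and the toy table, including the made-up PF99999 row, are for illustration only.

```shell
# Hypothetical helper: keep only rows whose FIRST column is a wanted accession,
# then print the first four tab-separated columns.
annotate_pfam() {
    awk -F'\t' '
        NR == FNR { want[$1] = 1; next }    # first file: remember each accession
        ($1 in want) { print $1 "\t" $2 "\t" $3 "\t" $4 }
    ' "$1" "$2"
}

demo_dir=$(mktemp -d)
printf 'PF00275\n' > "$demo_dir/ids.txt"
printf 'PF00275\tEPSP_synthase\tEPSP_syntase;\tEPSP synthase\tsome comment\n' > "$demo_dir/table.tsv"
printf 'PF99999\tMade_up\t\tMade-up family\tcomment citing PF00275\n' >> "$demo_dir/table.tsv"
annotate_pfam "$demo_dir/ids.txt" "$demo_dir/table.tsv"   # only the PF00275 row survives
rm -rf "$demo_dir"
```

Because the comparison is against the first field only, a row that mentions an accession in its comment text never gets through, so no second cleanup pass is needed.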

Finding pfam accessions shared across both of the lists uses cat, sort, uniq -c and tr. 

cat archaea_hmm_pfam.txt halo_hmm_pfam.txt | sort | uniq -c | tr -d '[:blank:]' | grep '^2' | cut -d 'P' -f2 | sed 's/^/P/' > archaea_and_halo_pfam.txt
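A shorter sketch of the same intersection leans on uniq -d, which prints only the lines that occur more than once, so there is no count prefix to strip afterwards. The helper name and toy lists are made up for illustration.

```shell
# Hypothetical helper: accessions present in both lists.
shared_acc() {
    # sort -u each list first so a duplicate inside one list cannot pass for a shared entry
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# Toy lists standing in for archaea_hmm_pfam.txt and halo_hmm_pfam.txt
demo_dir=$(mktemp -d)
printf 'PF00164\nPF00177\n' > "$demo_dir/archaea.txt"
printf 'PF00275\nPF00164\nPF00177\n' > "$demo_dir/halo.txt"
shared_acc "$demo_dir/archaea.txt" "$demo_dir/halo.txt"   # prints PF00164 and PF00177
rm -rf "$demo_dir"
```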

grep -f is used again to screen against pfamA.txt, and once more we need to run it twice to clean out the noise.

grep -f archaea_and_halo_pfam.txt pfamA.txt | cut -f1,2,3,4 > temp
grep -f archaea_and_halo_pfam.txt temp > archaea_halo_shared_scg.list

While looking through the SCG sets, I noticed a curious absence of rhodopsin or any rhodopsin-related accessory genes in the Halobacteria data. To the best of my knowledge, Halobacteria are widely photosynthetic and contain the only examples of rhodopsin in the entire kingdom, so this certainly stood out. Since my labmate, Sebastian, was also looking for data on photosynthetic systems in Archaea in preparation for his research into possible alternate Archaeal photosystems, I decided to spend some time screening Archaeal genomes for the presence and distribution of rhodopsin (in this case, PF01036).

Phylogenetic tree of Archaea, 1142 genomes, midpoint rooted. Fuchsia indicates rhodopsin presence.

As expected, the presence of rhodopsin is almost exclusive to the class Halobacteria. Of the 1142 Archaea genomes surveyed, 337 genomes have rhodopsin hits: 336 Halobacteria and 1 Methanomicrobia. The reason why rhodopsin didn’t make it into the SCG set is quite clear as well. We have a whopping total of 1068 rhodopsin gene hits across the 337 genomes, with the distribution pattern below:

60 genomes with 1 rhodopsin gene
75 genomes with 2 rhodopsin genes
55 genomes with 3 rhodopsin genes
80 genomes with 4 rhodopsin genes
39 genomes with 5 rhodopsin genes
22 genomes with 6 rhodopsin genes
 4 genomes with 7 rhodopsin genes
 2 genomes with 9 rhodopsin genes
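The tallies above can be sanity-checked with a quick weighted sum, and the same kind of histogram can be rebuilt from a per-hit table. The sketch below assumes a hit table with one line per hit in the form "genome_id<TAB>hit_id"; that layout, the helper name and the toy data are assumptions for illustration, not the files I actually used.

```shell
# Sanity check on the reported tallies: "genomes copies" pairs from the list above.
printf '60 1\n75 2\n55 3\n80 4\n39 5\n22 6\n4 7\n2 9\n' |
    awk '{ genomes += $1; hits += $1 * $2 } END { print genomes, hits }'   # prints: 337 1068

# Hypothetical helper: count hits per genome, then count genomes per copy number.
copy_number_histogram() {
    cut -f1 "$1" | sort | uniq -c | awk '{ print $1 }' | sort -n | uniq -c |
        awk '{ print $1 " genomes with " $2 " rhodopsin gene(s)" }'
}

# Toy hit table: g1 has one hit, g2 and g3 have two hits each.
demo_hits=$(mktemp)
printf 'g1\th1\ng2\th2\ng2\th3\ng3\th4\ng3\th5\n' > "$demo_hits"
copy_number_histogram "$demo_hits"
rm -f "$demo_hits"
```

The weighted sum confirms that the eight rows account for all 337 genomes and all 1068 hits.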

I’m growing more curious about ancient rhodopsin in Archaea, especially considering the elusive nature of its evolutionary history – retinal-based phototrophy is estimated to have emerged anywhere between 4 billion and 600 million years ago (https://www.preprints.org/manuscript/202011.0700/v3). What could those ancient eyes have seen in the skies of our planet? Did it shape the rhodopsin adaptations as we see them now?

Granted, pfam based screening doesn’t necessarily focus on immediate protein homology, so at least some of the rhodopsin hits here are likely to be false positives (when it comes to pfam based screening, I normally prefer to check with protein alignment-cladogram and then move to manual curation if any branch looks especially strange – I’ll have to chew through this data at length at a later time).

What’s actually interesting, and completely unexpected from the initial screening, is that part of the class Halobacteria seems to be composed of genera that completely lack rhodopsin in their genomes. I can see entire clades of Halobacteria that maintain a uniform lack of rhodopsin, in a pattern suggesting that the common ancestors of these specific genera did not have a rhodopsin in the first place (or that the common ancestors at least evolved to kick rhodopsin off their genomes, and the progeny continued to evolve without it, of course). Could this be a key to figuring out the evolutionary time frame for the emergence of retinal-based phototrophy in our world? An exciting prospect!

Annotated lower half of the Archaea phylogenetic tree, representing Halobacteria genomes. Fuchsia indicates rhodopsin presence. Genera without rhodopsin presence in their member genomes are marked with a red line and name. From left to right – Natronococcus, Haladaptatus, Halococcus, Halalkalicoccus, Halopenitus, Haloparvum, Haloferax (albeit with two exceptional genomes).

The seemingly non-rhodopsin-utilizing genera of Halobacteria so far:

Natronococcus
Haladaptatus
Halococcus
Halalkalicoccus
Halopenitus
Haloparvum
Haloferax (granted, this particular genus has rhodopsin signals in two of the screened genomes, but it’s nowhere near as covered in rhodopsin as the rest of the members of Halobacteria).

(I have a sneaking suspicion that many rhodopsin-negative member species of the genera listed above might be alkaliphilic or alkali-tolerant)

One question to answer would be whether the evolution of the halorhodopsin screened here follows the evolutionary divergence of the Halobacteria genomes. The crazy number of multi-copy rhodopsins in Halobacterial genomes makes immediate comparison of the proteins difficult. I’m currently curating a set of Halobacteria genomes that showed only one rhodopsin hit (60 genomes found so far) and have extracted the representative rhodopsin-equivalent protein sequences from all of them. The plan is to create a genomic phylogenetic tree, and then compare it against a cladogram of the single-copy rhodopsin proteins seen in these particular species to check for any general evolutionary match. More updates on this one soon!
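Picking out the single-hit genomes is the easy part of this plan. A minimal sketch, assuming a hit table with one line per hit in the form "genome_id<TAB>hit_id" (an assumed layout; the helper name and toy data are made up), could be:

```shell
# Hypothetical helper: print the hit rows belonging to genomes with exactly one hit.
single_copy_hits() {
    # Two passes over the same file: count hits per genome, then print rows with count 1
    awk -F'\t' 'NR == FNR { n[$1]++; next } n[$1] == 1' "$1" "$1"
}

# Toy hit table: only g1 carries a single hit.
demo_hits=$(mktemp)
printf 'g1\th1\ng2\th2\ng2\th3\ng3\th4\ng3\th5\n' > "$demo_hits"
single_copy_hits "$demo_hits"   # prints the g1 row only
rm -f "$demo_hits"
```

The hit identifiers in the second column are then what would be handed to whichever tool pulls the matching protein sequences out of the proteome files.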

Tree Thinking Chapter II – Tree thinking and its importance in the development of evolutionary thought.

Tonight we’re continuing on with the weekly study of the Tree Thinking textbook, proceeding to chapter 2. “Tree thinking and its importance in the development of evolutionary thought.”

The chapter covers the transition from ladder thinking to tree thinking in evolutionary thought. It begins with Jean-Baptiste Lamarck’s concept of Scala Naturae, the ladder of life (described in his 1809 work Philosophie Zoologique), as an example of the ladder-based view of the variability of species prior to the emergence of tree thinking as an alternate conceptual model. It should be noted that the ‘ladder thinking’ view of life in nature wasn’t purely an invention of Lamarck’s time; the concept traces back to the great chain of being proposed by ancient Greek philosophers. Lamarck’s description was curious in that it seemed to suggest a degree of lineage from one living form to another, albeit in the form of hierarchical improvement from a lower state to a superior one, but it also suggested different beginnings for each of the ladders. Perhaps he also observed that it was impossible to fit all life onto one track of linear improvement (especially considering his belief in spontaneous generation), and sought to reconcile possible models of divergence and diversity in nature with the existing methods of thinking at the time.

An example of the scientific discourse beginning to address the particular idiosyncrasy can be seen with a passage from Zoonomia (1794), a mostly medical text published by Erasmus Darwin, Charles Darwin’s grandfather.

...would it be too bold to imagine, that in the great length of time, since the earth began to exist, perhaps millions of ages before the commencement of the history of mankind, would it be too bold to imagine, that all warm-blooded animals have arisen from one living filament... (Darwin 1794, Sect IV.8) 

This sort of evolutionary branching from one ancestor to many descendants is also described by another contemporary, Charles Lyell (mentor to Charles Darwin), in his criticism of the Lamarckian concept of evolution, where he mentions the concept of common ancestry and branching (referred to as ‘ramification’).

We know that individuals which are mere varieties of the same species, would, if their pedigree could be traced back far enough, terminate in a single stock; so according to the train of reasoning before described, the species of a genus, and even the genera of a great family, must have had a common point of departure. What then was the single stem from which so many varieties have ramified? (Lyell 1832, p.10)

Charles Darwin’s 1837 sketch from Notebook B, Entry 36. Text top: “I think. Case must be that one generation then should be as many living as now. To do this & to have many species in same genus (as is) requires extinction.” Text below: “Thus between A & B immense gap of relation. C & B the finest graduation. B & D rather greater distinction.”

The specific argument for diversity of species as a consequence of branching from a common ancestor starts in earnest with works of Charles Darwin, who was also drawn to the tree metaphor in his description of evolution as seen in the passage below.

The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth...
...The green and budding twigs may represent existing species: and those produced during former years may represent the long succession of extinct species...
...the great Tree of Life..., which fills with its dead and broken branches the crust of the earth, and covers the earth with ever-branching and beautiful ramifications (Darwin 1859, p.159)

According to Charles Darwin, a fundamental idea in the evolution of species is descent from a shared common ancestor, with natural selection following as the mechanism that produces variety among the progeny.

In considering the Origin of Species, it is quite conceivable that a naturalist... might come to the conclusion that each species had not been independently created, but had descended, like varieties, from other species. Nevertheless, such a conclusion, even if well founded, would be unsatisfactory, until it could be shown how the innumerable species inhabiting this world have been modified... (Darwin 1859, p.3)

Tree thinking is deeply aligned with the concept of a shared common ancestor, and has been used to support the idea of evolution itself against critics. Notable examples include:

  • Distant organisms have deep similarities that fulfill very different functions
  • More ancient fossils differ more from the modern counterparts

While the conceptual model of placing species at the tips of trees in branching relationships was quickly accepted among naturalists, the full evolutionary significance of the branching structure wasn’t addressed for close to a hundred years after Darwin. In fact, depictions of evolutionary relationships as phyletic series, with one living group implicated as the ancestor of another living group, remained commonplace even into the 1970’s. Such models often ended up suggesting the idea of advancement among the species, in a fashion similar to the one originally espoused by Jean Lamarck during the early days of the formation of evolutionary thought.

One of the reasons behind the delay might have been the lack of practical tools for tree construction. The first major attempt at creating tools for tree-like analysis of evolution was made by Ernst Haeckel (1834-1919), who applied the study of embryonic development to phylogenetic history, but the data gathered through such methods remained deeply ambiguous. This tradition, however, likely paved the way for the eventual development of phylogenetic systematics, led by Willi Hennig (entomologist, 1913-1976) and Walter Zimmermann (botanist, 1892-1980) from the 1930’s to the 1960’s.

Three major points raised by phylogenetic systematics have eventually come to define the modern field of systematics and phylogenetic study.

  1. Objective reality of evolutionary trees and of the common ancestors implied by the trees. However, while we can make well-supported inferences about the truth, we likely won’t be able to fully prove mathematically that those inferences are true.
  2. Degrees of relatedness among organisms should be understood in terms of the recency of their common ancestry. Unrelated lineages can converge on similar traits, and rates of evolution can differ among lineages, so judging relatedness by similarity alone can produce a false understanding of the classification of species or their evolutionary history.
  3. Phylogenetic relatedness should be the sole criterion for taxonomic classification. Only evolutionary relationships between two species or taxa should count, not their physical similarity (e.g. crocodiles are phylogenetically closer to birds than to lizards, despite their physical characters).

With the translation of Phylogenetic Systematics into English in 1966, the phylogenetic systematists (sometimes referred to as cladists) came head-to-head with the more traditional naturalists (who referred to themselves as traditional, or evolutionary, systematists), who sought to maintain similarity as a measure of phylogeny as much as lineage. A representative debate at the time was the one over the reality of the vertebrate class Reptilia. Phylogenetic systematists argued that natural groups must be composed of organisms that show a closer evolutionary relationship to other members of the group than to any organism outside it. The traditional systematists argued that ‘similarity’, in terms of physical character and ecological niche, should be measured alongside evolutionary relationship.

The contentious debate continued into the 1980’s and 1990’s, when advancements in statistical models, the availability of DNA sequences, and computational tools allowed for far more efficient comparison of supposed natural groups and tipped the balance in favor of classification based on phylogenetic systematics.

And this is mostly it for chapter II – the rest of the passages in the chapter investigate the prevalence of ladder thinking in social discussion and education of evolution in our contemporary society, going into some of the more popular tropes (if evolution is real why aren’t monkeys still becoming humans? – A routinely discussed example of the fallacy of linear improvement/advancement based view of evolution).

I’ll go into some of the more interesting papers cited in the chapter in a note down the road and put it under the history tab on this page – some fascinating background information on the references! It’s quite amazing how the very fundamental method of taxonomic classification was still being debated all the way into the 1990’s, with much of the societal-level discussion still bringing up concepts originally discussed by Jean Lamarck’s contemporaries back in the early 1800’s. We have a tendency to consider the concept and study of evolution as a solved problem – both the vast research space left to consider in academia and the level of societal dialog around the topic suggest otherwise.

Panchen AL. Classification, evolution, and the nature of biology. Cambridge University Press; 1992 Jun 26.

Browne J. Darwin's Origin of Species: a biography. Atlantic Books Ltd; 2012 Nov 1.

Penny D, Foulds LR, Hendy MD. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature. 1982 May;297(5863):197-200.

Archie JW. A randomization test for phylogenetic information in systematic data. Systematic Zoology. 1989 Sep 1;38(3):239-52.

Steel M, Penny D. Common ancestry put to the test. Nature. 2010 May;465(7295):168-9.

Theobald DL. A formal test of the theory of universal common ancestry. Nature. 2010 May;465(7295):219-22.

Hennig W. Phylogenetic systematics. University of Illinois Press; 1999.

Mayr E. The growth of biological thought: Diversity, evolution, and inheritance. Harvard University Press; 1982.

Hull DL. Science as a process: an evolutionary account of the social and conceptual development of science. University of Chicago Press; 2010 Dec 15.

O'Hara RJ. Population thinking and tree thinking in systematics. Zoologica Scripta. 1997 Oct;26(4):323-9.

Crisp MD, Cook LG. Do early branching lineages signify ancestral traits? Trends in Ecology & Evolution. 2005 Mar 1;20(3):122-8.

Omland KE, Cook LG, Crisp MD. Tree thinking for all biology: the problem with reading phylogenies as ladders of progress. BioEssays. 2008 Sep;30(9):854-67.

Next study post will be on chapter III, “What a phylogenetic tree represents”, which will go into the methods with which trees are formed and analyzed. It should be a pretty wild ride.