Thinking about amateur learning – and diving into the Tree Thinking Chapter 1

I’ve mostly stumbled my way into the amateur pursuit of biology, lacking a realistic path to studying the topic in a formal setting. While the unconventional route led me down a more curious path than many others experience, I’ve always been dogged by a suspicion that the lack of structured learning in my background left me wanting a deeper understanding of the fundamental concepts of biology.

Sure, I could discuss cloning strategies and assembly methods, but how confident am I in understanding and explaining phylogenetic trees, or synteny, or ortholog and paralog genes among different clades? An unfortunate reality is that those starting out in amateur biology or diybio are often inundated with what amounts to a PR blitz around the hot topic of the day, usually whatever research topic is already being pursued by larger research or corporate entities. This pattern leaves the amateur beholden to the media cycle in the choice of their studies, instead of being able to dictate the theme of their learning on their own terms and ability. The curious, by-gone deluge of tutorials on Gibson assembly squarely aimed at amateurs and high school students with hardly any reliable access to a PCR machine, funding for custom primers, or experience troubleshooting even a traditional restriction cloning pipeline attests to this.

As a life-long amateur I’m increasingly of the opinion that structure in learning might be more important than we give it credit for – a well formulated structure for knowledge (at least in the beginning) tends to reflect the collective pedagogical expertise gathered over generations. That kind of built-up repository of experience remains the primary driver of improvement in research and knowledge in our world, and our amateur practices would benefit from it as well. My personal experience suggests that amateur biologists must work at forming or adopting a learning tradition of our own in order to truly break new ground in practical research and improve our research-craft. Simply put – a fundamental part of amateur biology research is the pursuit of better amateur research practices, at least as long as we aim to improve our work.

So, how would we amateurs start creating our own tradition of independent learning in biology? A good way would be to avoid completely reinventing the wheel – identify a good pedagogical text aimed at newcomers to academia, study it, and adapt its elements in a more practical, experiment-focused manner.

Since I’ve been increasingly feeling the need to learn more about the tools of phylogenetics and tree-thinking methods, I’ll start my own study with the seminal text (as far as introductory textbooks on a general topic go) “Tree Thinking: An Introduction to Phylogenetic Biology” by David A. Baum and Stacey D. Smith. The plan is to go through the book at a leisurely pace of a chapter per week, making notes on the text and answering the end-of-chapter problems. I’ll likely only cover the written portion of the problem sets as opposed to all the multiple-choice ones. The goal is to study the topic at hand, not transcribe the whole textbook onto the web.

Baum DA, Smith SD. Tree Thinking: An Introduction to Phylogenetic Biology. Roberts and Company Publishers. 2013.

Chapter 1 – phylogenetic trees and their importance in modern biology.

The goal of the textbook is twofold – firstly, to instill the tree thinking method in students, allowing them to visualize evolution, organize knowledge of biological diversity, and use tree diagrams to analyze evolutionary phenomena. Secondly, to introduce students to the technical aspects of phylogenetics: how to organize evolutionary knowledge using phylogenetic trees, and how the methods work in principle.

In the context of the book, tree, phylogenetic tree and phylogeny are all used interchangeably to refer to branching evolutionary histories.

Phylogenetic trees provide a way to understand evolutionary life as a continuum, from individuals to populations to species and beyond. Every organism has one or two parents, allowing evolutionary histories to be traced through chains of ancestry. Evolutionary theory implies that the lineages of two different organisms will eventually converge upon a common ancestor, at varying depths depending on how closely related the organisms are. The concept of common ancestry is fundamental to tree thinking, and it can be applied at diverse scales, from single genes to populations.

A central concept in evolution is that traits change over time, and a phylogenetic tree provides a natural method for visualizing those changes in search of possible patterns. Reconstructing evolutionary changes in this way allows us to distinguish homologous traits from analogous traits (chapter 4) – e.g. the similarity of mammary glands across mammals is indicative of a common ancestor, but the wings of birds, bats and insects evolved independently and represent adaptations for similar biological functions.

Chapter 5 will address the concept of relatedness between organisms, based on the idea that the degree of relatedness between two organisms is defined by how far back we must go to reach their most recent common ancestor.
Defining relatedness in terms of common ancestry allows the creation of evolutionarily meaningful taxonomic systems, where organisms assigned to a particular taxon are more closely related to each other than to any organism outside the taxon.
A taxonomic system reflective of evolutionary history carries predictive value: a new organism found to belong to a particular taxon is likely to share key features common to most members of that taxon. In turn, a taxonomic system that does not fully reflect evolutionary history yields less reliable predictions.

Trees can be used to organize and depict the histories of genes as well. New genes generally arise by duplication of an existing gene, followed by divergence of one of the copies (chapter 6). Much like taxa, genes can be assigned to gene families and subfamilies. The position of a gene in such a gene tree can tell us much about its probable properties.

The book will also address the question of reconstructing history from phylogenetic trees in later chapters. General considerations for these methods include converting the abstract branch lengths of a tree into units of real time (chapter 11), and combining the time-calibrated tree with geographic information.
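The branch length conversion itself boils down to simple arithmetic – a branch length in substitutions per site divided by an assumed substitution rate gives elapsed time. A quick sketch with made-up numbers (real time calibration is of course far more involved, with rate variation and fossil calibrations):

```shell
#converting a branch length into real time - the numbers here are purely illustrative
#branch length = 0.02 substitutions/site, assumed clock rate = 1e-9 substitutions/site/year
awk 'BEGIN {
  len  = 0.02
  rate = 1e-9
  printf "%.0f years\n", len / rate
}'
#prints: 20000000 years
```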

The tree concept in biology is the result of scientific inference rather than direct observation (much like the Krebs cycle). Thus tree thinking requires understanding the principles that allow scientists to infer phylogenies from biological data. Chapters 7–9 will explore methods for inferring phylogenies, with a focus on the underlying concepts and principles.

Barcoding strange bacteria part III

Our last attempt at 16S barcode sequencing the East River bacteria wrapped up without a solid conclusion, owing to the less than spectacular sequence fidelity, and the Methylobacterium hit from the salvaged sequence remaining suspicious on account of the genus being a known contaminant of sequencing reagents.

Well, we prepped a fresh culture right after the first result came in and sent it out for another round of barcode sequencing with the same vendor (we use Genewiz in US East, if you’re curious). The results came in earlier this morning, and while it’s not the cleanest Sanger result I’ve seen, it certainly confirms our mystery East River microbe as a type of Methylobacterium.

The above CLI graph shows matching aligned base positions (sans Ns) between the forward and reverse Sanger reads as red asterisks – as you can see, gaps abound. This is not what I would call a good sequencing output. However, I should note that the quality of the sequencing results from our vendor, especially for traditional Sanger output like the 16S barcoding process, tends to be quite excellent. At this point the repetition of the somewhat lackluster result must have more to do with the difficulty of the template in question. I’m thinking the genome of the microbe might be especially repeat rich – we’ve had similar difficulties with Deinococcus Sanger verification.
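Incidentally, the asterisk map itself only takes a few lines of awk to produce – below is a toy sketch assuming two pre-aligned, equal-length read strings (the sequences and variable names are made up for illustration):

```shell
#toy sketch: print '*' where two pre-aligned reads agree, '-' where they differ,
#and a blank where either read has an undetermined base (N)
fwd="ACGTNACGT"   #stand-ins for the aligned forward/reverse read strings
rev="ACGTCACGA"
awk -v a="$fwd" -v b="$rev" 'BEGIN {
  for (i = 1; i <= length(a); i++) {
    ca = substr(a, i, 1); cb = substr(b, i, 1)
    if (ca == "N" || cb == "N") printf "%s", " "
    else printf "%s", (ca == cb) ? "*" : "-"
  }
  print ""
}'
#prints: **** ***-
```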

BLAST searches for the forward and reverse reads more or less confirm the results from last time, with perhaps a curious inclusion of an Asanoa species for the forward read. Methylobacterium is part of the class Alphaproteobacteria; Asanoa, I believe, belongs to an entirely different phylum, Actinobacteria. One could assume that some conserved core of the sequence got a partial match, but if that were the case shouldn’t more actinobacterial results show up in the search? Hmm.

Either way, I think there’s strong evidence to suspect the mystery bacterium is some sort of Methylobacterium – and quite likely an unknown one at that. Who knew a humble kitchen medium could be capable of isolating such an interesting specimen from the wild?

Chances are we’ll run a gamut of traditional microbiology tests on the East River microbe and proceed with Nanopore Flongle sequencing once we have a decent body of data. Looking forward to documenting the results! One major question in my mind – will we find bacteriochlorophyll-a in the new bacterium? Time to dig through existing genomes to find out.

Rethinking Truepera radiovictrix among Deinos – an amateur note

A particular midpoint rooted Deinococcus-Thermus phylogenetic tree I made during the last SCG HMM set curation has been on my mind for the last few days.

As you can see, expanding the screening gene set from the 16 gene universal HMM set to the newly created 193 gene HMM set (derived from complete genomes of the Deinococcus-Thermus phylum) swaps the clade position of Truepera radiovictrix from Deinococcus genus adjacent (as traditionally assumed) to a position more adjacent to the Marinithermus-Oceanithermus genera, and thus to the Thermus genus.

My impression of the Truepera radiovictrix position swap at the time was that I was observing some sort of noise, potentially produced by a midpoint rooted tree depicting a species already thought to be quite close to the theoretical tree midpoint of the Deinococcus-Thermus phylum (Truepera radiovictrix, while still within the order Deinococcales, is supposed to be distant enough to sit in its own family). It didn’t help that the SCG set being used was derived from a measly 49 complete genomes. Yet – what if the new data is closer to the evolutionary truth than the conventional representation of Truepera? Under the 193 gene set, the position of Truepera as a Thermus adjacent entity holds even if I root the tree on Truepera itself. The 49 genome origin of the new SCG set can be a cause for concern, but we cannot doubt that the genes exist, fit within the SCG criteria, and have sequences about as verified as they can get – thus the alignment results themselves are quite real.

The more I think about it, the more I feel a study to confirm the position of Truepera radiovictrix one way or another can only help my phylogenetics study, and potentially reveal an interesting evolutionary past for us.

As a quick replication test, I decided to make another screened SCG set from a larger Deinococcus-Thermus sample and build a new phylogenetic tree of the phylum, again using a larger genome pool. I used the esearch criteria below to screen out a list of 221 reference quality Deinococcus-Thermus genomes for processing, as described in the previous SCG HMM set generation post.

esearch -query '"Deinococcus-Thermus"[Organism] AND "latest refseq"[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly

The new Deinococcus-Thermus SCG HMM set contains 199 genes as opposed to the 193 genes in the initial HMM build – I’m assuming the larger genome pool allowed more genes to pass the 50% protein coverage and 90% phylum coverage screening criteria, and I am preparing a deeper dive into the differences between the two. Running a phylogenetic tree build on the 221 genome set with the new SCG HMM set results in the trees below, both midpoint rooted.

Even with the updated SCG HMM set and increased genome counts, both for SCG set building and phylogenetic tree building, Truepera radiovictrix (green band) is located closer to the Marinithermus-Oceanithermus cluster, on the branch leading down to Thermus. It’s only when we use the 16 gene SCG set, which tends to veer closer to rRNA based analysis (Hug, L., Baker, B., Anantharaman, K. et al. A new view of the tree of life. Nat Microbiol 1, 16048 (2016)), that we get the more conventional picture of Truepera placement near the branch leading down to the Deinococcus genus.

Let’s compare the new results against the result from the original Truepera radiovictrix sequencing paper (Ivanova, N., Rohde, C., Munk, C. et al. Complete genome sequence of Truepera radiovictrix type strain (RQ-24T). Stand Genomic Sci 4, 91–99 (2011)) – which continues to be the only paper of note (that I could find) dealing with the phylogeny of this very interesting species. Its phylogenetic location as a deeper branching member of Deinococcus is suggested by 16S rRNA sequencing, which might not be quite as reliable as a result coming from a fully curated SCG set derived from the members of the phylum.

Image taken from the paper Ivanova, N., Rohde, C., Munk, C. et al. Complete genome sequence of Truepera radiovictrix type strain (RQ-24T). Stand Genomic Sci 4, 91–99 (2011).

While I still remain deeply suspicious of the quality of the SCG HMM set I generated, I think there’s good enough reason to wonder whether Truepera radiovictrix has been mischaracterized as a Deinococcus-like species after all. The much vaunted radiation tolerance shared by Truepera and members of the Deinococcus genus (not even all of them, if my memory serves correctly!) might not be a specific enough molecular characteristic to place Truepera adjacent to the Deinococcus group. The alphaproteobacterium Methylobacterium radiotolerans and some members of other phyla and classes come to mind as a growing list of exceptions as well.

An interesting and unexpected takeaway of the new analysis is the phylogenetic placement of Deinobacterium chartae. If the custom SCG HMM set results are closer to the true depiction of the Truepera-Deinococcus-Thermus phylogeny, then there is a good chance Deinobacterium chartae could serve as a good root organism for Deinococcus genus specific phylogenetic analysis. Combined with the shifting placement of Marinithermus and Oceanithermus in close vicinity of Truepera radiovictrix, a deeper, higher resolution comparative genomics analysis between the four groups might offer interesting insight into the evolutionary mechanisms of one of the (probably) deeper branching phyla accessible to us.

As an aside – sharing analysis results in blog form has been great for organizing my own thoughts, but I’m beginning to think I need something more extensive to share the real data and work the posts are based on. Hopefully by next week I’ll have some fun jupyter/sos based data to share with interested amateur researchers.

Making use of a single copy gene HMM set – in a mostly bash way

In a previous lab note entry I outlined a workflow for curating a subscreened single copy gene HMM set from a selection of genomes and Pfam-HMMs, courtesy of the GToTree pipeline. The process works quite robustly, and I’ve been able to make more than a couple of very interesting datasets through it since then. There is an inherent weakness in relying on raw FTP download links from NCBI (post-COVID data access to NCBI has been incredibly shaky – either everyone in the world decided to ping NCBI all the time since then, or something changed in their IT policy, or both), to the point that one should perform a mandatory integrity check of datasets downloaded through such methods, but it is what it is.

The process outlined in that lab note, however, doesn’t describe how to obtain additional information (such as the protein faa sequences specifically corresponding to the SCG-HMM screening results, or the related CDS nucleotide sequences) from the results of the filtering process with the subscreened HMM, so I’ll take this opportunity to post some examples, both for others and as a reminder for myself.

Note that where or how one obtains the protein fasta or CDS fasta files used in the process below is left to the prerogative of the researcher – there are many different ways to do it, each with their own upsides and downsides. I would personally recommend getting familiar with the entrez-direct tools or the ncbi-genome-download python script set for most cases.

Let’s start with extracting all CDS fasta and protein faa files associated with the given subscreened HMM set.

#This series of commands assumes all gz'd CDS files (extension .fasta.gz) and all gz'd protein files (extension .faa.gz) are in the current directory for processing

#Making a temporary faa directory and concatenating them together in the current directory
mkdir working_faa
cp *.faa.gz working_faa
gunzip working_faa/*.gz
cat working_faa/*.faa > all_genes.faa
rm -r working_faa

#Making a temporary cds directory and concatenating them together in the current directory
mkdir working_cds
cp *.fasta.gz working_cds
gunzip working_cds/*.gz
cat working_cds/*.fasta > all_cds.fasta
rm -r working_cds

#Starting with protein faa files processing - hmmsearch of the protein fasta file of interest against the subscreened SCG.hmm set created in the previous lab note entry
hmmsearch --cut_ga --cpu 2 --tblout SCG_genes.scan subscreened_SCG.hmm all_genes.faa
#Isolating protein accession numbers associated with SCG profile 
awk '{print $1}' SCG_genes.scan | tail -n +4 | head -n -10 | sort | uniq > SCG_genes_protein.acc
#Creating JoinIntoGroupsOf function and processing grouped protein accession entry via entrez direct - final output will be concatenated into the .faa file
JoinIntoGroupsOf() { xargs -n "$@" echo | sed 's/ /,/g'; }
cat SCG_genes_protein.acc | JoinIntoGroupsOf 200 | xargs -n 1 sh -c 'efetch -email "" -db protein -id $0 -format fasta' > SCG_proteins.faa

#Continuing on to gathering nucleotide CDS fasta sequences related to the subscreened SCG HMM profile
#The general idea here is to have a bank of NCBI generated nucleotide fasta sequences and then sort out the ones corresponding to SCG protein accessions generated in the previous step. Final result of this process will be a list of nucleotide fasta headers that correspond to the SCG protein accessions 
#Below awk script maps the header lines of all_cds.fasta into memory, then checks each accession in SCG_genes_protein.acc against them
awk '
NR == FNR {if (/^>/) field1[$0] = $1; next}
{
  for (line in field1)
    if (line ~ $0)
      print field1[line]
}
' all_cds.fasta SCG_genes_protein.acc > scg_cds_filtered.list
#An alternative (and likely superior) process reads the accession list into memory first and then checks the CDS bank file against it
awk '
NR == FNR {list[$1]; next}
/^>/ {
  for (item in list)
    if ($0 ~ item) {
      print $1
      next
    }
}
' SCG_genes_protein.acc all_cds.fasta > scg_cds_filtered.list

#Once the filtered header list is ready, we can finally index the all_cds.fasta file with samtools and extract the sequences
#Below command will generate a .fai index file which will be the reference to pull SCG associated nucleotide sequences from 
samtools faidx all_cds.fasta
#At this stage I'd suggest manually checking the header format on scg_cds_filtered.list against the format in the .fai file to see if they're the same. NCBI nucleotide CDS fasta headers have a lot of information on them, but we won't need them for our purpose, and a lot of the extraneous information becomes noise for the samtools process
#Isolating relevant portion from the filtered header list generated in the last step
cut -d " " -f1 scg_cds_filtered.list | sed 's/>//g' > filtered_formatted.list
#Using the filtered list to isolate nucleotide CDS sequences of interest
xargs samtools faidx all_cds.fasta < filtered_formatted.list > SCG_cds.fasta
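As a final step, I’d also suggest a quick sanity check that the record counts line up – every accession in the filtered list should ideally yield one protein and one CDS record. A small sketch using the filenames from the steps above:

```shell
#quick sanity check - accession, protein and CDS record counts should agree
for f in SCG_genes_protein.acc SCG_proteins.faa SCG_cds.fasta; do
  if [ -f "$f" ]; then
    case "$f" in
      *.acc) printf '%s\t%s\n' "$f" "$(wc -l < "$f")" ;;
      *)     printf '%s\t%s\n' "$f" "$(grep -c '>' "$f")" ;;
    esac
  fi
done
```

Mismatched counts usually point at efetch dropping queries mid-download or at regex collisions in the header matching step.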

And if you want a quick note on how I usually obtain the .faa and nucleotide CDS .fasta files of interest –

#For nucleotide cds fasta files, I usually find ncbi-genome-download script to be an amazing time saver
ncbi-genome-download bacteria -A bacterial_assembly_accession_list.txt -F cds-fasta -o outputdir
#you can of course look for archaeal genes like below 
ncbi-genome-download archaea -A archaeal_assembly_accession_list.txt -F cds-fasta -o outputdir

#For protein sequences and simple nucleotide sequences I normally stick with entrez-direct. Here's how I'd download protein sequences from a list of accessions (normally obtained through entrez-direct itself or through the NCBI Genbank portal)
#Grouping queries so NCBI won't cut me off
JoinIntoGroupsOf() { xargs -n "$@" echo | sed 's/ /,/g'; }
#Feeding protein accessions into the entrez process
cat proteins.acc | JoinIntoGroupsOf 200 | xargs -n 1 sh -c 'efetch -email "" -db protein -id $0 -format fasta' > proteins.faa

The benefit of the above process is versatility – for example, instead of screening the whole protein coding regions of a particular taxon, I can use the protein screening section described above to extract the SCG protein sequences from Halobacterium NRC-1 for amino acid level analysis.

#hmmsearch through the SCG halobacteria HMM set created previously to isolate SCG protein accessions in Halobacterium NRC-1. I have a bunch of these processes self-contained in scripts, so they're automated at the system level
hmmsearch --cut_ga --cpu 30 --tblout halo_nrc.scan halobacteria.hmm GCF_000006805.1_protein.faa 

#Screening out SCG protein accessions from the scan output
awk '{print $1}' halo_nrc.scan | tail -n +4 | head -n -10 | sort | uniq > halo_nrc_scg_protein.acc

#Downloading the SCG protein sequences present in Halobacteria NRC-1  
JoinIntoGroupsOf() { xargs -n "$@" echo | sed 's/ /,/g'; }
cat halo_nrc_scg_protein.acc | JoinIntoGroupsOf 200 | xargs -n 1 sh -c 'efetch -email "my@email" -db protein -id $0 -format fasta' > halo_nrc1_scg_proteins.faa
Halobacterium NRC-1 amino acid preference in single copy genes – blue bars show first base and orange bars show second base
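The residue counts behind a figure like the one above can be pulled straight out of the faa file with coreutils – a quick sketch using the output file from the step above:

```shell
#residue frequency table from a protein fasta - skip headers, split the sequences
#into one character per line, then tally; prints residue<TAB>count, most abundant first
grep -v '^>' halo_nrc1_scg_proteins.faa \
  | fold -w1 \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{print $2 "\t" $1}'
```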

Cracking buffer for East River water sample

One thing I used to do with some of the environmental phage sampling runs in the past (both soil and liquid) was to simply crash them with cracking buffer and see what sort of bands I would get on a gel – of course, most of the time whole environmental samples would result in genomic smears, but occasionally I would see defined bands indicative of a dominant culture (with a genomic compartment small enough to be resolved on a typical electrophoresis gel). The effect can be very pronounced in enriched environmental cultures, for obvious reasons.

I’ve had some interesting experiences cracking unknown isolated cultures before as well. One day I hope to figure out how to correlate signals from such simple crack-and-gel processes to some useful observation.

Since we’re all waiting on pins and needles for the East River 16S sequencing result (which likely won’t be processed until early or mid next week), my lab partner reminded me that I ought to run the cracking buffer process on the mystery microbe in the meantime. The tradition of a simple microbial cracking buffer apparently goes way back among microbiologists, and there are many different variations of documented cracking buffers out there – I’d guess there are many more if we count the secretive recipes passed down from tech to tech in more of an oral tradition.

For now though, we used a protocol from here, transcribed below for convenience.

1.  For 50ml, add the following to a 50ml sterile conical tube:

2.5ml   1N NaOH
2.5ml   10% SDS
0.5ml    0.5M EDTA
25mg (or a few sprinkles to taste)  Bromophenol blue
4g         Sucrose

2. q.s. to 50ml using milliQ water.

3. Mix well by vortexing.

4. Can be stored at room temperature.

Note that we simply use distilled or autoclaved distilled water for most of these rough applications.
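For reference, the final concentrations in the 50 ml recipe work out via C1V1 = C2V2 – a quick awk check:

```shell
#final concentrations of the cracking buffer components (50 ml total volume)
awk 'BEGIN {
  total = 50.0   #ml
  printf "NaOH:    %.2f N (%.0f mM)\n", 2.5 * 1.0 / total, 2.5 * 1.0 / total * 1000
  printf "SDS:     %.1f %%\n", 2.5 * 10 / total
  printf "EDTA:    %.0f mM\n", 0.5 * 500 / total
  printf "Sucrose: %.0f %% (w/v)\n", 4 / total * 100
}'
#prints: NaOH 0.05 N (50 mM), SDS 0.5 %, EDTA 5 mM, Sucrose 8 % (w/v)
```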

Once the buffer was made, we spun down a 1 ml liquid aliquot of the ‘E’ sample East River bacterial culture, decanted, and resuspended the pellet in the residual media. 10 ul of the resuspended sample was added to 20 ul of the cracking buffer in a PCR tube, and the sample was heated at 95 C for 2 minutes, followed by a cool-down step to 25 C. The cracked cells were mixed with loading buffer, loaded onto a 0.5% agarose TAE gel and run for about 20 minutes.

The image is a little faint due to the quality of the camera used for the gel picture – but for the cracked East River bacteria samples on the right, you can clearly see one band per lane, far above the 10KB upper band of the ladder, yet far enough down that they are likely not genomic smears or shears. I’ve seen higher MW chromosomal extractions shear and give multiple banding patterns on regular electrophoresis before, but bands from such scenarios tend to bunch up together in the upper portion of the gel, resolving after a relatively longer run. I think the lower position of the bands suggests a possible plasmid in the East River bacterial sample, in roughly the 50KB to 100KB range.

Note that the reddish bands on the gel are likely to be some sort of protein contamination, binding with the mix of dyes in the sample – we’ve been seeing them in samples that haven’t been cleaned or column-filtered.

Writing the tree – Single Copy Gene set generation with pfam

Sometimes the availability of a good tool marks the beginning of great research questions, just like the case with the wonderful GToTree – a phylogenetics tool I view as the gold standard for utility, ease of use and documentation for researchers and students alike. The combination of a chance encounter with GToTree and the Nanopore MinION as my sequencing platform of choice began the interest in naturally occurring plasmids and their replication systems that now presides over most of my biology-related studies.

GToTree allows the use of custom single copy gene sets for phylogenetic tree building – I’ve been meaning to build my own SCG sets and test them against a variety of scenarios for a while now, so during the weekend I took a short break to run through the whole process of screening genes and sorting out SCG candidates, carried out primarily on an old Thinkpad X201 laptop (i5-520M, 8GB max ram) as per our personal need for benchmarks in more frugal computing practices (granted, the same process for a larger set of Halobacteria genomes was run on a 32 core, 128GB ram workstation).

The general gist of the SCG set compilation process is described here. The process is a simple one: use the Pfam-A.hmm file and the associated pfamA.txt table to create a sub-HMM file composed only of pfams covering over 50% of the source proteins they were derived from. This is done to prevent multiple pfams showing up in the presence of a single gene, which would introduce noise into pfam based SCG screening. The resulting sub-HMM is used for a general hmmsearch of all the known genes in a user supplied genome group (usually beginning with lists of genome assembly accessions leading to protein fasta groups), and the resulting output is re-screened to keep only the pfams occurring in over 90% of the genomes in the group. The final result is a set of pfams, each based on a single underlying gene, that shows up across over 90% of the genome group members – a rudimentary SCG set you can use for any number of downstream analyses.
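The 90% occurrence screen at the end of that process boils down to a simple tally – below is a toy sketch of the idea on a made-up two-column hit table (the filename and pfam IDs here are purely illustrative):

```shell
#toy demonstration of the 90%-occurrence screen used in SCG curation:
#build a tiny "pfam<TAB>genome" hit table, then keep pfams present in >=90% of genomes
printf 'PF00001\tg1\nPF00001\tg2\nPF00002\tg1\n' > toy_hits.tsv
awk -F'\t' '
  { pair[$1 "\t" $2] = 1; genomes[$2] = 1 }
  END {
    n = 0
    for (g in genomes) n++
    #count distinct genomes per pfam from the deduplicated pairs
    for (k in pair) { split(k, p, "\t"); count[p[1]]++ }
    for (pf in count)
      if (count[pf] >= 0.9 * n) print pf
  }
' toy_hits.tsv
#with 2 genomes, only PF00001 (present in both) survives the cutoff
```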

I tested two different methods for SCG set compilation – the first is the manual, primarily bash and R driven one based on PF34.0 (the instructions on the linked page call for PF32.0, which is a more noteworthy distinction than one might expect. More on that later). The second is an alpha version script recently introduced into the GToTree suite itself, gtt-gen-SCG-HMMs, which more or less automates the manual process in a single python script and automatically pulls whatever the most recent Pfam release is. The genome source for the Deinococcus-Thermus phylum is composed of 49 species selected under the criteria ‘”Deinococcus-Thermus”[Organism] AND “complete genome”[filter] AND latest[filter] NOT anomalous[filter] AND “has annotation”[Properties]’, available both through entrez-direct and through the NCBI Genbank assemblies page.

49 genomes is almost certainly too small a dataset to derive an SCG set from, so this is more of a practice/academic exercise than something that could be immensely useful – granted, some of the later phylogenetic tree data does show an interesting difference between a tree built with this set versus one built with a universal HMM set derived from research like this. I would normally go for a larger dataset, but for building an SCG set I think it’s worth working with a potentially smaller dataset for the sake of integrating complete genomes only, which in NCBI’s definition applies to assemblies that are gapless with no runs of 10 or more ambiguous (N) bases.

A copy of the code base used for the process is below

#downloading the latest pfam release for screening and annotation
#(the URL below points at the current Pfam release hosted on the EBI FTP)
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
#moving to filtering pfam stage 
#downloading the table associated with the Pfam-A.hmm file
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA.txt.gz
gunzip pfamA.txt.gz

#extracting only the pfam ID and coverage from the pfamA.txt file
##extracting pfam accession and coverage - field 34 for PF34 version specifically
cut -f 1,34 pfamA.txt > all_pfam_avg_covs.tsv

#extracting pfam accessions with coverage above 50% using awk 
awk '$2>50 {print $1}' all_pfam_avg_covs.tsv > coverage-filtered-pfam-IDs-no-version-info
#above script gives 9635 total pfam accessions

#extracting all pfam ids from the pfam.hmm file through ^ACC grep, switches space to tab and uses cut to isolate the second field - which includes the pf accession and the version number after dot.
grep "^ACC" Pfam-A.hmm | tr -s " " "\t" | cut -f2 > all-pfam-ids

#use a loop to grep through the all-pfam-ids and match coverage filtered ids with version info 
for i in $(cat coverage-filtered-pfam-IDs-no-version-info); do grep -m1 "$i" all-pfam-ids; done > coverage-filtered-pfam-IDs

#using the id list from above to create a coverage filtered pfam hmm file 
hmmfetch -f Pfam-A.hmm coverage-filtered-pfam-IDs > Pfam-A_coverage_filtered.hmm

#now that the subset .hmm is ready, it's time to get the genomes for the SCG set
esearch -query '"Deinococcus-Thermus"[Organism] AND "complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_Genbank FtpPath_RefSeq FtpPath_Assembly_rpt > dt_complete_ncbi_info.tsv

#49 total results as opposed to the 47 from the web portal - something to consider for the future? 
#comparing the output against web portal accession list with diff gives us two more accessions in favor of the esearch result
# GCF_002355995.1
# GCF_013401415.1

#applying header to the main tsv
cat <(echo "AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt" | tr " " "\t") dt_complete_ncbi_info.tsv > t && mv t dt_complete_ncbi_info.tsv

#extracting accession numbers
tail -n +2 dt_complete_ncbi_info.tsv | cut -f1 > all_genomes.accs

#building links to download the protein sets - from the above result we only have refseq links
grep "^GCF_" dt_complete_ncbi_info.tsv > refseqs

#compiling download links from initial data 
#skipping over GCA step since we have none in the initial data 
cut -f1,8 refseqs > refseq_accs_and_base_links
cut -f 2 refseq_accs_and_base_links > refseq_targets_p1
cut -f10 -d "/" refseq_targets_p1 > refseq_targets_p2
paste -d "/" refseq_targets_p1 refseq_targets_p2 | sed 's/$/_protein.faa.gz/' > refseq_targets
cut -f1 refseq_accs_and_base_links | sed 's/$/_protein.faa.gz/' > refseq_names
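To make the field arithmetic above concrete, here is the same transformation applied to a single base link (the accession is real – it shows up later in this post – but the ASM directory name here should be treated as illustrative):

```shell
# one RefSeq base link, as found in the FtpPath_RefSeq column
# (the assembly name after the accession is illustrative)
link="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/190/555/GCF_000190555.1_ASM19055v1"

# field 10 when splitting on "/" is the assembly directory basename
base=$(echo "$link" | cut -f10 -d "/")

# the protein fasta sits at <link>/<basename>_protein.faa.gz
echo "${link}/${base}_protein.faa.gz"
```

The directory basename appears twice in the final URL, which is why the paste step glues field 10 back onto the full path before tacking on the `_protein.faa.gz` suffix.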

#Downloading genome accession associated proteins
parallel --xapply -j 4 curl -L {1} -o {2} :::: refseq_targets :::: refseq_names
#I encountered multiple errors and corrupted files during the protein download stage - NCBI have been showing some issues post-covid
#manual check of file integrity might be necessary 
#script used for the next step erases faa file after processing, even in case of error
#it's a good idea to back up .faa files just in case
mkdir -p faas && cp *.faa.gz faas/
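A cheap way to automate that integrity check: gzip can test-decompress each archive without writing anything, so corrupted or truncated downloads surface immediately (a sketch; a more thorough check would compare against NCBI's md5checksums.txt file as well):

```shell
# gzip -t test-decompresses each archive without writing output;
# a non-zero exit status flags a truncated or corrupted download
for f in *.faa.gz; do
    [ -e "$f" ] || continue          # skip cleanly if nothing matched the glob
    gzip -t "$f" 2>/dev/null || echo "corrupted: $f"
done
```

Anything printed here should be re-downloaded before the faa-erasing step below gets a chance to eat it.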

#this part assumes you used conda to install the bit helper scripts
#which can be done with conda create -n gto_bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
conda activate gto_bit
bash all_genomes.accs
#above step creates an all_genes.faa file covering all genes across the 49 genomes, with 139119 entries (per grep -c ">")

#splitting the genes files as before
#for the split parameter - d=use numeric suffixes l=line per file, sub_ is prefix
mkdir split_seqs
cd split_seqs/
split -d -l 7000 ../all_genes.faa sub_ 
time find split_seqs/ -name "sub*" | parallel -I% --max-args 1 -j 4 hmmsearch --cut_ga --cpu 4 --tblout %.tab Pfam-A_coverage_filtered.hmm % > /dev/null
#above process took below time to complete 
real    85m36.039s
user    317m53.994s
sys     9m15.014s

#concatenating the hmm results together (output filename here is arbitrary)
cat split_seqs/*.tab > all_hmm_results.tab

#getting coverage filtered pfam ids from the custom made .hmm
grep "^ACC" Pfam-A_coverage_filtered.hmm | tr -s " " "\t" | cut -f2 > coverage_filtered_pfam_ids

#using the python helper script to compile a table holding all genomes as rows, all included pfams as columns, and counts in the cells
python3 -p coverage_filtered_pfam_ids -g all_genomes.accs -H -o all_bacterial_pfam_hmm_counts.tsv
#the table seems to be formatted properly despite how it looks with a head or a cat 
#as usual, always check command outputs manually before proceeding to other steps, especially if the process is new

#Start R
> tab <- read.table("all_bacterial_pfam_hmm_counts.tsv", header=T, row.names=1, sep="\t")
> head(tab[,1:10]) 
                PF09847.11 PF00244.22 PF16998.7 PF00389.32 PF04029.16
GCF_016865295.1          0          0         0          2          2
GCF_016865275.1          0          0         0          2          2
GCF_016629545.1          0          0         0          2          2
GCF_013401415.1          0          0         0          1          1
GCF_011604825.1          0          0         0          1          1
GCF_011604805.1          0          0         0          1          1
                PF04463.14 PF05635.13 PF03390.17 PF13532.8 PF10014.11
GCF_016865295.1          1          0          0         0          0
GCF_016865275.1          1          0          0         0          0
GCF_016629545.1          1          0          0         0          0
GCF_013401415.1          0          1          0         0          0
GCF_011604825.1          0          0          0         0          0
GCF_011604805.1          0          0          0         0          0
> dim(tab)
[1]   49 9635
> all_genomes <- row.names(tab)
> num_genomes_with_1_hit_for_each_pfam <- apply(tab, 2, function(x) sum(x == 1))
> perc_genomes_with_1_hit_for_each_pfam <- round(num_genomes_with_1_hit_for_each_pfam / length(all_genome) * 100, 2) 
Error: object 'all_genome' not found
> perc_genomes_with_1_hit_for_each_pfam <- round(num_genomes_with_1_hit_for_each_pfam / length(all_genomes) * 100, 2) 
> target_pfam_ids <- names(perc_genomes_with_1_hit_for_each_pfam[perc_genomes_with_1_hit_for_each_pfam >= 90]) 
> write(target_pfam_ids, "bacterial_scg_pfam_ids.txt")
> quit()
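Incidentally, the same threshold logic can be sketched in plain awk for anyone who'd rather stay in the shell – here run against a tiny made-up stand-in for the counts table (layout assumed: accession column first, then one column per pfam, with a label for the accession column in the header):

```shell
# toy stand-in for all_bacterial_pfam_hmm_counts.tsv - genomes as rows,
# pfam hit counts as columns (values made up for illustration)
printf 'genome\tPF1\tPF2\ng1\t1\t0\ng2\t1\t1\ng3\t1\t2\n' > toy_counts.tsv

# keep pfams found exactly once in at least 90% of genomes
awk -F "\t" '
NR==1 { for (i=2; i<=NF; i++) name[i]=$i; next }       # header row: pfam ids
{ rows++; for (i=2; i<=NF; i++) if ($i==1) one[i]++ }  # tally single-copy hits
END { for (i in name) if (one[i]/rows*100 >= 90) print name[i] }
' toy_counts.tsv
# only PF1 (exactly one hit in 3 of 3 genomes) clears the threshold
```

The column order of awk's `for (i in name)` isn't guaranteed, so pipe through sort if a stable id list matters downstream.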

#fetching the final SCG set out of the coverage-filtered hmm using hmmfetch
hmmfetch -f Pfam-A_coverage_filtered.hmm bacterial_scg_pfam_ids.txt > deinococcus_thermus.hmm

#the final HMM file contains 193 conserved single copy genes across deinococcus-thermus 

One thing to keep in mind about the beginning portion of the code above – the original instruction calls for isolating field 36 of the pfamA.txt file for the coverage info. Alas, that only seems correct for Pfam 32.0 – if you’re working with Pfam 34.0 the correct field is 34, as written in the code section above. Version-to-version differences like this are something to keep in mind when compiling an SCG set.

I tested the final output of the above process against the results from the built-in, alpha version gtt-gen-SCG-HMMs script. Despite working off the same set of pfams (version 34.0, I think), the results are slightly different. The manual process created an HMM file with 193 SCG entries, while the automated script created a 200 SCG entry file – all 193 entries from the manual HMM were found in the automated script’s results, with the seven extra pfams listed below.

PF01019.23      G_glu_transpept Gamma-glutamyltranspeptidase
PF01702.20      TGT     Queuine tRNA-ribosyltransferase
PF01761.22      DHQ_synthase    3-dehydroquinate synthase
PF02569.17      Pantoate_ligase Pantoate-beta-alanine ligase
PF04227.14      Indigoidine_A   Indigoidine synthase A like protein
PF04474.14      DUF554  Protein of unknown function (DUF554)
PF09862.11      DUF2089 Protein of unknown function (DUF2089)

I ran the new SCG HMM set against a GToTree built-in universal SCG HMM set (the 16-gene universal set) on the sample group of Deinococcus-Thermus genomes as a quick test. The results are interesting.

To start with, the deeper branching location of the Deinococcus proteolyticus & radiophilus group under the universal SCG HMM set is quite clearly wrong, so an unrooted or midpoint-rooted phylogenetic tree (or really a cladogram) based purely on the 16-gene set seems quite inaccurate. The organisms are placed in a more conventionally accepted location in the tree built with the custom SCG HMM, but that graph is very unusual in another aspect – the location of the species Truepera radiovictrix, one of the more common rooting points for the Deinococcus genus. When weighed against the new SCG set, Truepera jumps wholly over to the Thermus branch of the phylum, leaving the more conventionally placed Deinococcus peraridilitoris and maricopensis as the deepest branching members of Deinococcus.

Again, I want to stress that this result likely does not mean anything – the new SCG HMM file was made from far too few members of the phylum, and I agree with the author of GToTree that one should look for groups with at least a hundred genomes to sample from. I just thought it was an interesting demonstration of how a different SCG set one draws from can very dramatically alter the results of a phylogenetic tree building process.

Another utility of SCG sets is assessing the quality of a genome assembly – usually the higher the number of recovered SCGs and the lower their redundancy, the better the genome. Here are the SCG screening and recovery rates for our Deinococcus radiophilus assembly (MinION), the NCBI GenBank Deinococcus radiophilus assembly (Illumina), and Deinococcus proteolyticus (complete genome) as a control.

        Assembly GCF_000190555.1; Deinococcus proteolyticus

      Performing HMM search...
        Found 182 of the targeted 193 genes.
        Est. % comp: 93.78; Est. % redund: 0.52

        Assembly GCF_003966215.1; Deinococcus radiophilus (Illumina)

      Performing HMM search...
        Found 181 of the targeted 193 genes.
        Est. % comp: 93.26; Est. % redund: 0.52

        Our Deinococcus radiophilus assembly (MinIon)

      Performing HMM search...
        Found 180 of the targeted 193 genes.
        Est. % comp: 92.75; Est. % redund: 0.5

Not a bad showing for our genome assembly, especially considering it’s an amateur work based on a single flowcell, run in a dusty warehouse 🙂

Welp, time to process some more stuff for Halobacteria group – we’ll need them for the upcoming sequencing and analysis experiment.

Barcoding strange bacteria part II

In a past episode of our slow misadventures in nature-poking, we PCR’d out 16s barcoding sites from an East River bacterial sample and from a peculiar branching bacteria sampled from an iridescent biofilm formed around a mangrove culture. Right before sending out the samples, however, we found out about a new turn-key service offered by our local sequencing company that only requires a pellet (in an eppendorf) of whatever microbe we’re trying to barcode – no PCR or gel purification necessary. We changed our plans and decided to give the new service a test drive (while I have no issue with some manual work, the reagent/kit shortage following the pandemic is something we all need to be wary of, so the fewer frivolous processing steps we need, the better).

Sequencing results came back after about a week of processing – far longer than would normally be necessary with samples PCR’d and prepped ourselves, so I think those in a hurry will still need to do things the traditional way. The results arrived in standard Sanger sequencing format from our vendor: forward and reverse ab1 read files, plus a .seq fasta output from each of the ab1 files. The results are a little mixed in my eyes – the mangrove iridescent biofilm bacteria returned a clear match in the 16s result. Trace files (pulled through emboss-abiview) look like below (two static images and two moving gifs) – aside from the typical signal issues at the beginning of the read, each base is quite clearly defined, as we’ll see more clearly down this note.

Simply dumping just one of the forward-reverse Sanger sequencing pairs into BLAST is enough to tell us that whatever the bacteria is, it quite likely belongs to the genus Pseudomonas in class Gammaproteobacteria. Here are the top BLAST hits from the 16s forward read.

It’s the sequencing result from the mysterious East River bacteria that gives me pause – here are the trace files from the ab1 generated during the process.

It’s hard to see with the rather lackluster trace file visualization afforded by abiview, but in general the 16s sequencing result from the East River bacteria is a mess – things are more obvious when we look at the sequence-level data. The table below was generated by our in-house sequence counting tool, github located here.

The N count of indeterminate bases for the forward sequencing output from the mangrove sample P1 is a rather impressive 37. The reverse output fares okay as well, with 110 bases remaining unknown. The pattern shifts pretty drastically with the forward and reverse sequencing outputs from the East River bacteria sample, with a whopping 228 and 227 bases respectively too noisy to be determined one way or another (an interesting aside is the drastic GC% difference between the two bacterial samples, even accounting only for the 16s region).
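For reference, the raw N and GC tallies behind a table like that need nothing beyond coreutils – a minimal sketch on a made-up fragment (our actual counting tool does a bit more bookkeeping than this):

```shell
# made-up sanger-style fasta fragment standing in for a real .seq file
printf '>fwd_read\nATGCNNGGCCATNATGC\n' > sample.seq

# strip the fasta header, drop newlines, then count bases by class
seq=$(grep -v "^>" sample.seq | tr -d '\n')
n_count=$(printf '%s' "$seq" | tr -cd 'N' | wc -c)
gc_count=$(printf '%s' "$seq" | tr -cd 'GCgc' | wc -c)
total=$(printf '%s' "$seq" | wc -c)
awk -v n="$n_count" -v g="$gc_count" -v t="$total" \
    'BEGIN { printf "N: %d  GC%%: %.1f\n", n, g/t*100 }'
# N: 3  GC%: 47.1
```

tr -cd deletes everything *except* the listed characters, so piping through wc -c afterwards gives a per-class base count with no loops needed.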

With the 16s sequencing result for the East River bacteria in a somewhat concerning state, the BLAST result of the sequence fragment doesn’t do much to assure me of its reliability either. I can only get a workable result using blastn instead of the usual megablast.

BLAST output of the East River bacteria 16s forward read. Notice the 52% query cover – and the very concerning Methylobacterium genus

Conventional BLAST (megablast) is unable to detect anything in the reads, forward or reverse. A blastn search does show matches across Methylobacterium, but even then at about 52% cover and 70% identity in the best cases. Top hits for the reverse read start at a staggering 20% cover and 70% identity. Factoring in that Methylobacterium genus members are frequently associated with contamination of sequencing and DNA extraction reagents in molecular biology labs, I think there is cause for concern here – contamination would also explain the very high background noise on the traces.

While going through the Sanger results, I wrote a quick script that generates a reverse complement of the reverse read and aligns it against the forward read, producing a quick bird’s-eye view of matching base locations across the whole sequence block (along with a clw-formatted alignment file, using the muscle aligner). The github for the script is here. Who knew there would be such a dearth of command line tools for working with short PCR’d sequence fragments!
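For anyone rolling their own version of this, the reverse-complement step itself is a coreutils one-liner (rev comes from util-linux; tr handles the base swap and leaves Ns alone):

```shell
# reverse the read, then swap complementary bases (A<->T, G<->C)
printf 'ATGCNAA\n' | rev | tr 'ACGTacgt' 'TGCAtgca'
# TTNGCAT
```

The muscle alignment on top of that is where the actual script earns its keep.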

It’s easy to see that while the numbers of Ns across the aligned reads of the E1 and P1 samples are similar (E1 at 343 and P1 at 335), the distribution pattern is absolutely disastrous for the E1 sample, with the aligned bases across the forward and reverse reads lit up like a Christmas tree. The output from the P1 sample shows just about what one would expect from classic Sanger sequencing output – the first and last 200~300 bases are a mess, but there’s a solid core of overlapping alignment suggestive of a clean sample.

16s region for the East River bacteria will have to be sequenced again with a different sample – perhaps we should notify the vendor of the noisy results this time as well. Chances are I’ll send in a sample of the Halobacteria along with the re-do. I’m still hoping the East River bacteria will turn out to be something new – but current results are way too noisy to make anything of them. Oh well.

Weekend CRISPR tree trip

My initial attitude during the planning stages of the Deinococcus radiophilus sequencing project was that the organism would provide a good first step for delving deeper into the nitty-gritty details of modern microbiology, and that I’d eventually grow out of it. However, the more I study the genome of Deinococcus radiophilus, the more interesting the entire Deinococcus genus becomes. The often-touted radiation resistance and associated genome repair mechanisms are probably the least interesting aspects of the genus. It’s the deeper branching possibilities of the Deinococcus-Thermus phylum (here’s a recent iteration of the likelihood) toward the Last Common Bacterial Ancestor (quite likely no such physical entity ever existed, but the concept is still interesting enough to consider), along with the possibilities embedded in the seemingly unique plasmid type distribution pattern discovered through our own soon-to-be-published research, that together constantly suggest Deinococcus-Thermus’ possible role as a record of the history of microbial life on this planet. That is what intrigues me most, enough to keep me coming back to its member species again and again in search of new insights.

With the paper – hopefully – wrapping up, my recent focus is on the CRISPR-associated proteins and their repeat sites across the Deinococcus genus. The hope is to learn more about the types of selfish genome elements, plasmids, phages and such that members of the genus must have been exposed to, and to determine whether any of them shows a pattern of distribution across a particular evolutionary distance. There’s something about the analogy of genomes as history books with indices and methods of interpretation that I find intensely appealing, probably an effect of my ongoing interest in the history of science.

So over the weekend I decided to figure out a quick-and-dirty method to screen through the CRISPR-related data for the genus with the phylogenetic relationships of its members in mind. Initial screening was easy – a quick browse through the GenBank page to get the assembly accession IDs under specific categories (I usually work with two sets of data: a smaller one focusing on complete reference genomes, and a larger one composed of representative reference genomes. Results from the two tend to complement each other – many of the topics I look at care about whether a feature sits on a plasmid or a chromosome, and the representative dataset helps cut through the redundancies of ‘popular’ microbes with their multiple genome samples), using the accession IDs to obtain the associated protein sequences (usually via ncbi-genome-download or entrez-direct; the latter works best for automated workflows), and then running my custom pipeline for a high-sensitivity pfam screen. I found 20 pfams associated with most CRISPR-associated proteins (some far better represented than others) across known genomes of Deinococcus-Thermus (screened from a sample of about 221 genomes), and then found that CRISPR occurrences in the Deinococcus genus specifically tend to conform to one or more of five known CRISPR system types – I-B, I-C, I-E, III-A and III-B. CRISPR systems are hardly universal across the genus – of the 25-genome complete set, only 8 carried CRISPR proteins, and of the 57-genome representative set, 27 carried CRISPR proteins (the clustering of the proteins themselves seemed well represented and conserved across most genomes – I don’t think I saw any obviously broken system during the screening).

The next step was quick phylogenetic tree construction (with GToTree, from the genome sets used for CRISPR screening), with the goal of figuring out whether the distribution of a particular system followed a phylogenetic pattern. Here are the results, not accounting for the types of CRISPR systems and just counting the occurrences of associated proteins.

Here’s another one with the CRISPR protein count broken up by color corresponding to the specific system type: red is III-B, yellow is I-C, green is I-E, blue is III-A, purple is I-B. Maybe not quite as glaring a pattern as we’ve seen before, but breaking up the full CRISPR protein hit count into only the ones specifically responsible for a type does tell us a few things – like the prevalence of the III-B type system in the deeper branching members of the genus, and the widely spread I-E system combined with a clustering of CRISPR system types in a particular portion of the phylogenetic distribution (although it’s only visible on the representative reference genome set with its larger sample count). The almost complete lack of representation of type III-A and I-B systems is interesting as well, considering how often they show up in some other phyla. Considering that the III-B type system is derived from the archaeon Haloferax volcanii, one could suspect an evolutionary reason for the distribution in this graph – an archaeal system crossing over to a closer genus, then gradually falling off down the phylogenetic tree. Since I’m looking for ancient phages/selfish genome elements, looking at the interspaced repeat sites in the six genomes with III-B type systems might make a lot of sense.

Now – what could be the reason for the seeming gradual fall-off of CRISPR protein hits down the phylogenetic tree of the Deinococcus genus? Given the defense-related roles these systems might play in microbes, their disappearance might indicate an actively maintained function with real fitness weight.

For those interested – the entirety of this brief exploration took place in a linux shell using bash scripting, outside of the phylogenetic tree building with GToTree. Here’s a quick code example for extracting CRISPR-annotated genes associated with known type families from NCBI gffs, for reference. The script assumes multiple gff file targets in a single directory.

find . -type f -name '*.gff' | xargs grep -i "crispr" | grep -i "type" | awk -F ":" '{print $1,$6}' | awk -F ";" '{print $1,$3}' | sed 's/_ASM[^~]*product=/ /g' | awk '!/\.gff/'
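That one-liner leans heavily on the layout of NCBI’s gff attribute column; a more transparent (if wordier) variant pulls the product= value directly from column 9, sketched here against a single made-up gff line (the annotation text is illustrative, not from a real record):

```shell
# toy NCBI-style gff line (tab-separated; attributes live in column 9)
printf 'chr1\tRefSeq\tCDS\t1\t100\t.\t+\t0\tID=cds1;product=CRISPR-associated helicase Cas3, subtype I-C\n' > toy.gff

# print "<file> <product>" for every feature mentioning crispr and a type
for gff in *.gff; do
    awk -F "\t" -v f="$gff" '
    tolower($9) ~ /crispr/ && tolower($9) ~ /type/ {
        n = split($9, kv, ";")                  # ;-separated key=value pairs
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^product=/) { sub(/^product=/, "", kv[i]); print f, kv[i] }
    }' "$gff"
done
```

Splitting the attribute field on semicolons avoids the brittle colon-field counting of the one-liner, at the cost of a few more lines.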

Frugal bioinformatics – genome archiving bench test

Prior to our deep dive into sequencing and assembling the Deinococcus radiophilus genome, we shared the common wisdom among laypeople – and perhaps more of the research community than most would like to admit – that any bioinformatics process on the prokaryotic and phage genome scale is so trivial on modern hardware that one would simply need a laptop (perhaps not even a good one) to crunch through all research questions.

Welp, we were quickly shaken out of such happy delusions as the RAM usage for some of our assembly experiments (for a mere bacterial genome!) started inching toward the 100GB+ line, and routine housekeeping processes started going over 16GB of RAM and often into the 20GB+ range. Ever since then I’ve been very interested in the concept of frugal bioinformatics – what scale of bioinformatics processing and analysis can be considered reasonable given the limitations of common consumer computing hardware, something one might find in the U.S. in the couple-to-a-few-hundred-USD price bracket? Maybe I can build up enough of a body of experiments and benchmarks to get a mini-paper out at some point.

Against that background interest, I ran into an interesting discussion on twitter this afternoon regarding an extremely impressive compression of genomic fasta files using xz (I’m a little confused by the author’s description – he says it’s a 5GB file compressed down to 74MB, but the 5GB file is an already compressed gz file, so I’m not sure how it could be a fair efficiency comparison between the two, unless that wasn’t the intent). While my initial interest was in figuring out how stable/error-free the archive format would be given some controversy around its codebase and design decisions, I started thinking about how usable such a high-efficiency compression-decompression tool might be on consumer-level hardware, and decided to run some simple benchmarks to find out.

I kept the process simple – compressing the 4.7GB Nanopore fastq file from our Deinococcus sequencing project on a consumer laptop with an i7-4710HQ 2.50 GHz (4 cores, 8 threads) and 16GB of RAM. Five compression tools were used at their default settings, all running single-threaded: 7z, gzip, lzma, xz and zstd.
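The benchmark harness was nothing fancy – something along these lines (a reconstruction, not the exact commands I ran; the compressor list is trimmed to whatever is installed, and the small stand-in input below keeps the loop runnable anywhere):

```shell
# stand-in input so the loop runs anywhere - the real test used the
# 4.7GB nanopore fastq file instead
yes "ACGTACGTACGTACGT" | head -c 1000000 > input.bin

# time each available compressor at default settings, single threaded
for tool in gzip xz zstd; do
    command -v "$tool" > /dev/null || continue   # skip tools not installed
    cp input.bin "input.$tool"
    start=$(date +%s)
    "$tool" "input.$tool"                        # compresses in place
    echo "$tool took $(( $(date +%s) - start ))s"
done
ls -l input.*
```

Note that each tool names and handles its output differently (gzip and xz replace the input, zstd keeps it by default), which is why each gets its own copy of the input.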

The results were – interesting. Certainly not what I expected from the result seen in the original twitter post (which could very well be a quirk of the input file type). The compression methods produced two groups of output: 7z, lzma and xz achieved about 50% compression at a 2.0GB output file size, while gzip and zstd achieved about 43% compression at 2.3GB. The real surprise, however, was the time it took to compress these files, which varied wildly between tools.

Of the five, the 50% compression group stood out with extremely long processing times compared to the rest, with xz taking a whopping 99 minutes to compress the fastq file, followed by lzma at 96 minutes and 7z at 38 minutes. To put those numbers in perspective, gzip took 17 minutes and zstd took a blistering 1 minute – almost 100 times faster than xz in exchange for about 7 percentage points less compression.

I wondered whether the extreme processing times for xz and lzma were due to the single-threaded nature of the test, and whether members of the 43% compression group could close the size gap by maxing out the available compression level. So another informal test was run with xz/lzma and zstd set to the maximum possible compression level and set to utilize all 8 available threads on the computer.

And here’s where we run into a common problem pattern in frugal bioinformatics. Normally, someone working with a cluster in an academic environment would throw more threads at an especially slow process – but for some processes, increasing the number of concurrent threads comes with a corresponding increase in RAM usage. The zstd compression attempt with 8 threads and maximum compression depth needed more than 16GB of RAM to compress a 4.7GB file, so the process had to be rerun with the thread count cut down to 4, at which point it took a little over 10GB of RAM. What we get in return for the increased processing time (1 minute up to 45 minutes), 4x the threads and vastly increased RAM usage is a final file size decreased from 2.3GB to 2.0GB – hardly a worthy trade-off in most situations. Compression with lzma and xz fared similarly, with increased time and RAM usage producing a 1.9GB output instead of the previous 2.0GB. Again, not really worth the hassle.

For amateur researchers working on consumer hardware, or for researchers on private laptops without immediate ssh access to a lab cluster, zstd or even gzip seems like the generally superior choice for archiving – and for files on the scale of a raw read compilation from a bacterial genome, there are cases where single-threaded processing makes more sense. Who knew – more cores might not be the solution to all our problems (if you don’t have the RAM to back them up, that is).

East River sample housekeeping, and barcoding strange bacteria

The cultures of red-tinted bacteria (hopefully a bacteriochlorophyll-utilizing species of some sort) from the previous East River sampling run were getting a little long in the tooth, so I subcultured them into new, smaller autoclavable plates with a better seal, using the last of the PPES-K media. I’ve been using the little plates for continuous subculture of our Deinococcus radiophilus laboratory strain (another one I’m keeping around for mutant detection a year or two down the road), and they’ve been quite useful – it’s possible to pour thicker media for strain storage owing to the shape of the plates, and the natural snap-seal of the pieces seems far more resistant to drying out than regular petri dishes sealed with parafilm.

While preparing all the isolated samples and new plates, I finally prepped the East River culture for 16s sequencing as well, using the 16s primer sequence below (I forgot the original literature we got the primer sequence from; will update once found).



We’ll be sending out the sample along with another 16s PCR amplification from yet another mystery bacteria – a webbing type of species found in a biofilm layer of our lab mangrove culture. The particular biofilm in question is iridescent when exposed to light – I originally thought it was some sort of oil content pooling on top of the water, but it turns out there are cases of iridescent biofilm layers forming due to oxidation of iron in the environment, sometimes caused or driven by the bacteria in the biofilm itself (something I ran into reading a blog post on the subject).

Whatever the isolated strain is, it’s small enough to be a bacteria – and it shows interesting growth characteristics, such as the branching pattern and what seems like a branching envelope that develops around the round, cocci-like cells.

It goes to show, you don’t necessarily have to trek to faraway places to find things to study further. Crossing my fingers that these organisms are something new.