TL;DR: In this blog post, we look at phylogenetic trees for drug-resistant microorganisms. We cover the estimation process, including Python code snippets for calculating GC and GA content, and we use hierarchical clustering and distance-based methods to generate the trees. Along the way we discuss the challenges, the insights from our analysis, and why it matters to expand the organism dataset. Keep reading to learn more about phylogenetics and how it helps us understand microbial evolution.

Have you ever seen these kinds of diagrams before? 👇🏾

From xkcd.com https://xkcd.com/2269/ Hopefully this will not get you kicked out

If not, that's okay. Hopefully, by the end of this post, you'll know what they are and how to apply phylogeny in your work. Phylogenetic tree analysis is a technique that shows the relationships between organisms based on their sequences. If you have a background in machine learning, these trees look like the tree diagrams (dendrograms) you get from hierarchical clustering. Studies like this help us understand evolutionary relationships and patterns of divergence, and generate hypotheses about gene and protein function. Depending on where you look, they can even offer clues about drug targets. In this post, we'll build a tree from a conserved region shared by a handful of microorganisms. But first, let's cover some basics.

The term DNA gets thrown around a lot, often as an explanation for the traits we observe in animals (an observable trait is called a phenotype). DNA stands for deoxyribonucleic acid. In animal cells, DNA is packaged into chromosomes in the nucleus, and a small amount is also found in the mitochondria. DNA is present in nearly every living organism. Some viruses carry their genetic material as RNA (ribonucleic acid) instead, the same kind of molecule that cells produce when they transcribe DNA. Viral genomes can be single-stranded or double-stranded; the measles and influenza viruses are well-known examples of RNA viruses.

For a long time, and to some extent even now, we have used microscopy to tell microorganisms apart. However, morphology is not the best basis for classifying organisms: some look just like others (Mycobacterium, for example), and this can throw off your experiment. Carl Woese proposed using a conserved region instead, the ribosomal RNA region in bacteria (think of it as a string made up of the characters A, U, G and C). His work changed microbial ecology and taxonomy, because with his method we can discover organisms that we cannot grow in culture media, which helps us deal with the great plate count anomaly 😮. The amount of difference between the strings is a measure of the amount of evolution that separates the organisms. Remember this sentence; it will come in handy in other sections of this study.
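To make "the amount of difference between the strings" concrete, here is a minimal Python sketch of one simple way to measure it: the proportion of positions at which two aligned sequences differ (often called the p-distance). The function name and the toy sequences are made up for illustration, and this is not necessarily the distance used later in this post. It assumes the two sequences have already been aligned to the same length.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has an alignment gap ("-").
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


# Toy, made-up fragments (not real 16S rRNA sequences).
print(p_distance("AUGCAUGC", "AUGCAUGG"))  # 1 mismatch in 8 positions -> 0.125
```

The larger this number, the more the two sequences have diverged. Real analyses usually apply a correction on top of the raw proportion, because the same position can mutate more than once.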

Data

We'll use a multi-FASTA file, a common format for storing several sequences in a single file. Here, each record holds the same region (locus) from a different organism. In our case, that region is 16S ribosomal ribonucleic acid, or 16S rRNA for short, a component of the 30S small subunit of the prokaryotic ribosome. Why did I choose this region? It is short, about 1,500 nucleotides long, and contains both conserved and variable regions. The conserved regions help us identify the gene, whereas the variable regions, such as V4, tell microbial species apart. The region evolves slowly, and it codes for the RNA component of the small ribosomal subunit.
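If you have not worked with multi-FASTA before, the format is simple: each record starts with a header line beginning with ">", followed by one or more lines of sequence. Below is a minimal Python sketch that reads such text into a dictionary and reports each sequence's length and GC content. The function names and the example records are hypothetical; in practice a library such as Biopython (Bio.SeqIO) would usually be the safer choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse multi-FASTA text into {record_id: sequence}."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    # Sequences may be wrapped over several lines, so join them.
    return {name: "".join(lines) for name, lines in chunks.items()}


def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up records for illustration only (not real accessions or sequences).
example = """>seq_A hypothetical organism one
ACGTACGT
ACGT
>seq_B hypothetical organism two
GGCC
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), round(gc_content(seq), 2))
```

Counting the sequence length this way is also a quick check against the nucleotide counts reported for each record.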

We'll be working with the 16S rRNA regions of 6 bacteria from the NCBI nucleotide database, most of them obtained from human wounds. The organisms chosen are potentially dangerous, since they can cause wound infections that lead to bacteraemia or septicaemia. Initially, I was interested in looking at antibiotic-resistant strains only, but they proved very difficult to find in that database. Below are the accession numbers, the names of the bacteria (italicised), the region and the number of nucleotides (I counted these separately, then appended the result):

NB: the .1 at the end of an accession number is the version of that particular record.