[ARTICLE] Why and how do we annotate a genome?

The genetic code

A, T, G, C: these four letters do not look like much, but they are the alphabet of life. They stand for four molecules (bases), adenine, thymine, guanine and cytosine, which make up DNA in the form of nucleotides (base + sugar + phosphate group). Through an extremely complex and well-orchestrated mechanism, organisms use the information in their DNA to synthesize proteins and, from there, to build cells and organs. To better understand DNA is to better understand life itself.
Unlike our languages, all organisms, from bacteria to humans, share the same alphabet and the same words (amino acids). These amino acids (20 main ones) are the words that compose sentences (in our case, proteins). A gene is a sequence of nucleotides that is transcribed into RNA, which is in turn translated into a chain of amino acids to form a protein. To be recognised as amino acids, the bases are read in groups of three, called codons (e.g. “TTG” or “CTA”). Each codon corresponds to an amino acid, with a few exceptions: one start codon (which also codes for the amino acid methionine) marks where reading begins, and three stop codons mark where it ends (you can see them as the capital letter and the full stop of a sentence) (Figure 1). Most amino acids are coded by several codons, so some mutations do not change the resulting protein. This four-letter alphabet may seem simple to understand, but the complexity of how cells decipher it, and the numerous DNA structures that come with this information, make it extremely difficult to fully understand how DNA works. It promises, however, more breakthroughs in the future and is a wonderful subject of research.
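The reading of codons described above can be sketched in a few lines of Python. This is only an illustration: the function names (transcribe, translate) and the example sequence are made up for this article, and the table is the standard genetic code written in the DNA alphabet (T instead of U).

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Turn a DNA coding sequence into RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Read the DNA three bases at a time and stop at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the "full stop" of the sentence
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    example = "ATGTTGCTATAA"  # hypothetical sequence: start, TTG, CTA, stop
    print(transcribe(example))  # AUGUUGCUAUAA
    print(translate(example))   # MLL (methionine, leucine, leucine)
```

Here both “TTG” and “CTA” give the same amino acid (leucine, written L), which shows how several codons can code for one amino acid.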

Fig 1: The alphabet of nucleotides and how to decipher them into amino acids. Underlined: the amino acid threonine and one of its corresponding codons, ACA. Note that the base T (thymine) is replaced by U (uracil) because of the conversion (transcription) from DNA to RNA, which precedes the synthesis of proteins from amino acids (translation). [1]

Sequencing the genome

Now that we are on the same page, let’s talk about how we use this knowledge to study bacterial genomes. First of all, bacteria are microscopic cells with a relatively small genome. For comparison, the best-known bacterium, Escherichia coli, has a genome of about 4.6 Mb (million bases), whereas a human genome is around 3 Gb (billion bases). With such a compact genome, almost every nucleotide matters. Hence, it is crucial to know the exact nucleotide sequence of these bacterial genomes. Each bacterial strain has a unique genome, which can be used to tell strains apart, just as DNA is used in police investigations to identify a culprit.
There are several methods for sequencing DNA. The most famous one was developed by Frederick Sanger in 1977, and many other techniques have emerged since. The aim of all these methods is the same, namely to identify which of the four nucleotides is present at each position; only the technology used to achieve this goal differs. DNA is sequenced in several steps. The first step is to extract the DNA and prepare it for sequencing (different methods exist depending on the technology used). Usually, DNA is sequenced in short fragments (e.g. 150 bp), called reads. These fragments are the puzzle pieces of the genome. To rebuild the genome, these reads must be assembled into longer DNA fragments, called contigs, which are in turn assembled to rebuild the original DNA (Figure 2). To limit errors, each position is covered by several overlapping reads, and the final sequence is the consensus of all these overlaps. The cost of genome sequencing has dropped drastically since the first human genome was sequenced. You may have seen the famous line plot showing the cost of sequencing a human genome. At first, a single human genome cost around 2.7 billion USD [2]; by 2020, it was estimated at around a thousand USD [3]. As you recall, bacterial genomes are far shorter than ours, so sequencing them is an economical way to obtain valuable information.

Fig 2: Short-read sequencing requires the extracted DNA to be fragmented into short pieces before the sequencing step. The short sequences obtained (the “reads”) are then reassembled bioinformatically.
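To give an idea of how reads can be put back together, here is a toy sketch in Python that repeatedly merges the two fragments sharing the longest overlap. It is a deliberate simplification and not how production assemblers work: it assumes error-free reads from a single strand, ignores the consensus step, and the names (overlap, greedy_assemble) and the reads are invented for the example.

```python
def overlap(left, right, min_len=3):
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of fragments until no overlap remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three hypothetical 8-base reads taken from the sequence ATGGCGTACGTTAGC
    reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
    print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

If some reads do not overlap any other, the function returns several contigs instead of one, which mirrors what happens when parts of a genome are not covered by enough reads.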

Global and specific annotations

With the genome sequence known, the next step in bacterial annotation is to detect the presence of genes and structural elements. This is where “traditional biology” gives way to bioinformatics. Thanks to advances in computing, numerous programs and algorithms have been created for this sole purpose. One of these programs is Prokka [4], a free tool that produces high-quality annotations. We can distinguish two levels of annotation: global annotation (which retrieves as much information as possible about the genome) and specific annotation (which aims to detect specific genes, such as antibiotic resistance genes). These programs and algorithms detect genome elements with the help of several databases. They look for the four codons we talked about earlier (one start and three stops), which are a good indicator of the presence of a gene. The DNA sequence found between them is then compared to large databases to see whether the sequence from our bacterium matches a known gene.
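The search for start and stop codons can be illustrated with a small Python sketch. It is a minimal version written for this article, not the method used by Prokka or any other tool: it only scans the three reading frames of one strand, assumes ATG as the only start codon, and the name find_orfs, the min_codons parameter and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each ATG...stop stretch found.

    Positions are 0-based and `end` is exclusive; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three possible reading frames of one strand
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # look for the next candidate
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"  # hypothetical sequence
    print(find_orfs(sequence))  # [(2, 17, 'ATGAAATTTGGGTAA')]
```

Real gene finders go further, for example by scanning the opposite strand as well and by using statistical models to decide which candidates are likely to be true genes.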

The information retrieved is essential to understand what the bacteria can and cannot do. Many genome elements, once detected, reveal the capabilities of a bacterium. You have to know that DNA is the currency of the bacterial world. Bacteria exchange genes and even whole portions of their genomes, and they also carry small extra circular DNA molecules called plasmids, which are particularly suited to exchanging DNA. With global annotation, we can detect which genes or which genomic regions were incorporated into our bacterium. This is revealed by small details in the genome, such as the presence of DNA motifs known to enable genomic exchange, or a GC% (the proportion of guanine and cytosine) that differs from the rest of the genome, which is a good indicator that a DNA region comes from a different bacterium.

Fig 3: Different mechanisms of DNA exchange between bacteria. Transformation (a) is the uptake of free DNA fragments from the environment. Transduction (b) occurs when DNA is carried from one bacterium to another by viruses. Conjugation (c) is the transfer of DNA, typically plasmids, through direct contact between two cells. [5]
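The GC% signal mentioned above is easy to compute. The following Python sketch slides a window along a sequence and reports the windows whose GC% is far from the genome-wide value. The window size, step and threshold are arbitrary values chosen for illustration, and the function names and the toy genome are assumptions, not part of any specific tool.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a sequence (0.0 for an empty one)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def gc_windows(genome, window=1000, step=500):
    """GC% of each window, as (start position, GC%) pairs."""
    return [
        (start, gc_percent(genome[start:start + window]))
        for start in range(0, len(genome) - window + 1, step)
    ]


def atypical_windows(genome, window=1000, step=500, threshold=10.0):
    """Windows whose GC% differs from the whole genome by more than `threshold`."""
    overall = gc_percent(genome)
    return [
        (start, gc)
        for start, gc in gc_windows(genome, window, step)
        if abs(gc - overall) > threshold
    ]


if __name__ == "__main__":
    # Toy genome: an AT-rich background with a GC-rich island in the middle
    genome = "AT" * 20 + "GC" * 10 + "AT" * 20
    print(gc_percent(genome))  # 20.0
    print(atypical_windows(genome, window=20, step=20, threshold=30.0))
    # [(40, 100.0)]
```

A region flagged this way is only a candidate: an unusual GC% suggests a foreign origin but does not prove it on its own.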

Other crucial information to retrieve during specific annotation is the presence or absence of resistance genes and virulence factors. Resistance genes confer resistance to antibiotics, which is a major public health crisis, as well as to metals and other biocides used to prevent bacterial growth in the food industry. Last but not least, certain bacterial genes are involved in a process called biofilm formation. A biofilm is an extracellular structure that bacteria produce together to help them access nutrients and protect themselves from external threats. Biofilms are everywhere, from the surface of our teeth to medical instruments that have not been disinfected correctly. It is extremely important to detect and annotate these resistance genes and to identify which regions come from other bacteria. This two-step annotation is essential to keep track of bacterial evolution, and it helps to control epidemic outbreaks and foodborne diseases. Knowing where resistance genes for antibiotics or biocides come from is the key to isolating potential threats to human health.
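As a rough illustration of specific annotation, the sketch below screens a genome against a small reference database by counting shared short words (k-mers). It is a stand-in for the sequence comparison that real tools perform with dedicated alignment software, and everything in it is hypothetical: the function names, the thresholds, the gene names and the sequences themselves.

```python
def kmers(sequence, k):
    """Set of all words of length k found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def screen(genome, database, k=8, min_shared=0.9):
    """Return the database genes whose k-mers are mostly present in the genome.

    The result maps each gene name to the fraction of its k-mers that were found.
    """
    genome_kmers = kmers(genome.upper(), k)
    hits = {}
    for name, gene in database.items():
        gene_kmers = kmers(gene.upper(), k)
        if not gene_kmers:  # gene shorter than k: nothing to compare
            continue
        shared = len(gene_kmers & genome_kmers) / len(gene_kmers)
        if shared >= min_shared:
            hits[name] = shared
    return hits


if __name__ == "__main__":
    # Made-up sequences, not real resistance genes
    database = {
        "toy_resistance_gene": "ATGGCTAGCTAGGATCCGATTAA",
        "toy_absent_gene": "ATGCCCCCCGGGGGGTTTTTTAA",
    }
    genome = "TTTT" + database["toy_resistance_gene"] + "CCCC"
    print(screen(genome, database))  # {'toy_resistance_gene': 1.0}
```

Lowering min_shared makes the screen more tolerant of mutations in the gene, at the price of more false positives.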

Conclusion

We started with four letters, and now we see how resistant bacteria can be a problem for our society. Thanks to genome sequencing, assembly and annotation, we can obtain valuable information about the strengths and weaknesses of a bacterial strain. The annotation of a genome is like an ID card for a bacterium: it helps to select better antibiotic treatments for a patient, to choose the biocide used for cleaning medical or food equipment, and to detect pathogenic bacteria when genes involved in disease are found. Genome annotation will evolve with new technologies and is already a powerful tool for researchers and industry.

– Kévin Chateau, Trainee Bioinformatician, Biofortis –

REFERENCES:

[1] https://openoregon.pressbooks.pub/mhccbiology102/chapter/the-genetic-code/

[2] https://www.genome.gov/human-genome-project/Completion-FAQ

[3] https://www.nature.com/articles/s41436-019-0618-7

[4] https://github.com/tseemann/prokka

[5] https://www.nature.com/articles/nrmicro1325