Biology and Computer Science

Another pleasant surprise from learning something new

By Yihang Ho

Note: This essay touches on Biology, and I am no expert in the subject, so what I say here might not be 100% accurate.

Biology has always been my least favorite subject among the three main branches of science. However, I am happy that I have learned something really enjoyable in Biology, something very similar to what I know in Computer Science.

As most people know, deep in our cells we have something called DNA (deoxyribonucleic acid). DNA is a pair of polymer strands wound into a double helix. Each strand consists of nucleotide monomers, which are held together by phosphodiester bonds. Each nucleotide is made up of deoxyribose (a pentose sugar), a phosphate group and a nitrogenous base. In DNA, there are four types of nitrogenous base: adenine (A), cytosine (C), guanine (G) and thymine (T). The bases come in pairs, one on each strand of the double helix. Adenine always pairs with thymine, and cytosine always pairs with guanine. For example, if adenine appears on one strand, thymine appears at the corresponding position on the other strand. These pairs are held together by hydrogen bonds.
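The pairing rule is simple enough to write down as a lookup table. Here is a minimal Python sketch of it; the names `PAIRS` and `complement` are my own, and the sketch only matches bases position by position, ignoring the direction in which each strand is read.

```python
# Pairing rules from the paragraph above: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base found opposite each position of `strand`."""
    return "".join(PAIRS[base] for base in strand)

print(complement("ATGGGT"))  # TACCCA
```

Applying `complement` twice gives back the original strand, which is why either strand alone is enough to rebuild the other.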

What is so interesting about the nitrogen bases is that their sequence carries important genetic information, like how to produce various protein molecules. The idea behind the beauty of DNA is very similar to that of character encodings like ASCII and UTF-8.

Let's consider the ASCII encoding. ASCII defines how common characters, like the 52 letters of the English alphabet (uppercase and lowercase), the ten Arabic digits, and some punctuation, can be represented and stored in a computer system. Essentially, computers operate on bits, each of which has two possible states. A byte is a string of eight contiguous bits, so each byte can hold exactly 2^8 = 256 different values. ASCII itself is a 7-bit code and defines only 128 of them, from 0 to 127; 8-bit extensions such as ISO 8859-1 assign characters to the remaining values up to 255. For example, value 65 represents the character A, value 126 represents a tilde, ~, and so on.
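These numbers are easy to check in Python, whose built-in `ord` and `chr` convert between a character and its code. This is just a small illustration; the sample string "Hi~" and the variable name `encoded` are my own choices.

```python
# One byte is eight bits, each with two states.
print(2 ** 8)      # 256

# Character <-> number, as described above.
print(ord("A"))    # 65
print(chr(126))    # ~

# Encoding a short string to ASCII gives one byte value per character.
encoded = list("Hi~".encode("ascii"))
print(encoded)     # [72, 105, 126]
```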

As for DNA, there is actually a very similar system. Each nucleotide has four possible "states": A, C, G and T; a set of three contiguous nucleotides forms a codon. Now, there are some interesting similarities between a computer and DNA: in DNA, a "bit" has four possible states, compared to a binary bit, which has two possible states; a "byte" (codon) is made up of 3 "bits", compared to a computer byte, which is made up of eight bits; a codon can hold 4^3 = 64 different values, while a byte can hold 256. What is more interesting is that there is actually an encoding system defined for proteins. A protein is a chain of smaller units called amino acids, and 20 different amino acids are used to synthesize protein molecules. Of the 64 possible codons, 61 each represent one of the 20 amino acids (so most amino acids have more than one codon), and the remaining three act as stop signals. For example, GGT represents glycine, TGG represents tryptophan, and so on.
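To convince myself of the 4^3 = 64 figure, here is a small Python sketch that simply lists every possible codon. The lookup table is deliberately partial: it holds only the two codons mentioned above, not the full genetic code, and the names `BASES`, `codons` and `PARTIAL_TABLE` are my own.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triple of bases is one possible codon.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))  # 64

# Partial lookup table: only the examples from the text, not the full code.
PARTIAL_TABLE = {"GGT": "glycine", "TGG": "tryptophan"}
print(PARTIAL_TABLE["GGT"])  # glycine
```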

Computer memory is like a very long row of cells, each of which can hold either 0 or 1. In practice the cells are grouped into bytes, and each byte has an address, which is actually just a number. Hence, when a program runs, it needs addresses in order to access information in memory. Besides that, the program also needs to know when to stop reading. For strings in C/C++, the stopping position is marked by a byte with the value 0, which is also known as the null character. In other words, given an address, a program can read from that address, go to the next byte, and the next, and so on until it sees the null character.
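Python does not expose raw memory the way C does, but the idea can be imitated with a `bytes` object standing in for memory and an index standing in for the address. This is only a sketch under that assumption; `read_c_string` and the sample `memory` contents are made up for illustration.

```python
def read_c_string(memory: bytes, address: int) -> str:
    """Read characters starting at `address` until a 0 byte is reached."""
    chars = []
    while memory[address] != 0:      # 0 is the null character
        chars.append(chr(memory[address]))
        address += 1                 # move on to the next byte
    return "".join(chars)

# Two strings stored back to back, each ending with a null character.
memory = b"xxHello\x00World\x00"
print(read_c_string(memory, 2))  # Hello
print(read_c_string(memory, 8))  # World
```

The same block of memory yields different strings depending only on the starting address, and the reader never needs to be told the length in advance.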

Quite surprisingly, there is also a similar system for living things. A strand of nucleotides is like computer memory. The differences are: each "cell" (nucleotide) can hold A, C, G or T, and (I think) there is no such thing as a memory address. So how does the thing that actually reads the DNA sequence to produce proteins know where to start and stop reading? Well, Mother Nature has some clever schemes. Each protein recipe starts (and this is nearly universal, it doesn't matter if you are a human, a T. rex, an oak tree, etc.) with the amino acid called methionine, represented by ATG, and ends with one of three stop codons, nicknamed ochre (TAA), amber (TAG) and opal (TGA). Hence, TAA, TAG and TGA are like the null character in C/C++. When that thing is trying to build a protein and is looking for the "recipe", it will look for ATG, start reading from there and continue, three nucleotides at a time, until it sees a TAA, TAG or TGA.
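The start and stop rule described above can be sketched in a few lines of Python. This is a toy reader under my own simplifying assumptions (it takes the first ATG it finds and ignores everything else a real cell does); the function name `read_recipe` and the sample sequence are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # ochre, amber, opal

def read_recipe(dna: str) -> list:
    """Return the codons from the first ATG up to, but not including, a stop codon."""
    start = dna.find("ATG")
    if start == -1:
        return []                    # no start codon, nothing to read
    codons = []
    # Step through the strand three nucleotides at a time.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break                    # like hitting the null character
        codons.append(codon)
    return codons

print(read_recipe("CCATGGGTTGGTAACC"))  # ['ATG', 'GGT', 'TGG']
```

In the sample, the leading CC is skipped, reading begins at ATG, and the trailing CC after TAA is never looked at. If the strand runs out before a stop codon appears, this sketch just returns whatever it has read so far.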

Note: If you check any textbook or the Internet, you might think that the codons I use here are wrong. The problem is that you will rarely see thymine (T) in those codon tables. This is because those tables usually show the nucleotide sequence in an RNA strand, not a DNA strand, but I am referring to DNA here. RNA is the little brother of DNA. RNA is made up of a single strand instead of a double helix. Also, thymine does not appear in RNA; its place is taken by uracil (U). In this context, the role of RNA is to act as a middleman between DNA and the thing that produces protein.
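Converting between the two notations is a one-line substitution. The sketch below assumes the DNA codons are written the way I write them in this essay, so the RNA version is the same sequence with every T swapped for U; the function name `to_rna` is my own, and the real copying process in a cell is more involved than a text replacement.

```python
def to_rna(dna: str) -> str:
    """Rewrite a DNA sequence in RNA letters by replacing T with U."""
    return dna.replace("T", "U")

print(to_rna("ATGGGTTAA"))  # AUGGGUUAA
```

So the ATG in this essay shows up as AUG in a textbook table, and TAA, TAG and TGA show up as UAA, UAG and UGA.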