Bioinformatics Notes

Table of Contents

How Genes Work

Making Sense of Genomic Data

Basic Bash

Sequence Similarity and Blast

Genomics and Comparative Genomics


How Genes Work

What is a gene? A section of DNA that encodes a protein or functional RNA

Where are genes? Genes are encoded in DNA located in chromosomes. Each parent contributes DNA through a gamete, so an organism carries two copies of each gene, and the copy from one parent can differ from the copy from the other

Central Dogma- how genetic information flows in a cell. DNA (a gene) is transcribed into mRNA by RNA polymerase; the mRNA leaves the nucleus and goes to a ribosome, where it is translated into a polypeptide

Genotype- genetic information held by an organism

Phenotype- measurable physical properties that are a result of genotypes

Allele- a form (variant) of a gene

Codon- 3 nucleotides that specify a particular amino acid (or a stop signal)

Mutations- a permanent change in an organism’s DNA that may or may not affect phenotype. A mutation in germ cells can be inherited. A mutation in somatic cells is not inherited but can lead to a tumour or cancer

Silent- change in sequence that doesn’t change amino acid specified by codon

Missense- change in sequence that changes amino acid specified by codon

Nonsense- change in sequence that results in early stop codon

Frameshift- addition or deletion of nucleotides (in a number not divisible by 3) that shifts the reading frame of every codon after it

Chromosomal Mutation- large scale change in content or number of chromosomes (one too many, one too few, transposition- piece of one chromosome going to another)
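
The silent, missense, and nonsense definitions above can be checked with a short Python sketch. It assumes the standard genetic code and only handles single codon substitutions (no frameshifts or chromosomal changes). The function name classify_point_mutation is made up for these notes.

```python
# Standard genetic code, built from the usual TCAG ordering ("*" = stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def classify_point_mutation(ref_codon, alt_codon):
    """Classify a substitution inside one codon (hypothetical helper)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon == alt_codon:
        return "no change"
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid specified
    if alt_aa == "*":
        return "nonsense"    # early stop codon
    return "missense"        # different amino acid specified


print(classify_point_mutation("GAA", "GAG"))  # Glu to Glu
print(classify_point_mutation("GAG", "GTG"))  # Glu to Val
print(classify_point_mutation("TAC", "TAA"))  # Tyr to stop
```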

-proteins are the workers of cells and make up everything from enzymes to hormones

-new proteins are being designed today with shapes that don’t occur naturally, to carry out new processes

-protein shape depends on the R groups of its amino acids, which influence tertiary structure

peptide- a short chain of amino acids joined by peptide bonds (poly= many, so a polypeptide is a long chain)

-the carboxyl group of one amino acid becomes a carbonyl once it bonds to the amino group (NH2) of the next, forming a peptide bond

Primary structure- the linear sequence of amino acids joined by peptide bonds (built during translation)

DNA- nucleic acid made of two strands

made of nucleotides (“bases”)

4 nucleotides (Adenine, Cytosine, Guanine, Thymine)

each nucleotide has a sugar, a phosphate group, and a nitrogenous base (the base varies between A, C, T, and G)

RNA is built the same way as DNA but with ribose instead of deoxyribose (and uracil in place of thymine)

Nitrogenous bases

Pyrimidines- C,U,T

Purines- G,A

Nucleotides bond together through phosphodiester linkages to form a single strand of DNA, which creates ends called 5’ and 3’ (new strands are built 5’ to 3’ while the template is read 3’ to 5’)

The 5 and 3 positions on the sugar are important (the 5’ carbon carries the phosphate group, the 3’ carbon carries the hydroxyl group)

DNA double stranded- complementary and antiparallel, each strand has a 5’ and 3’ end

The 3’ to 5’ strand is the template used to make a new DNA strand 5’ to 3’
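
Because the two strands are complementary and antiparallel, the sequence of one strand fixes the sequence of the other. A minimal Python sketch, assuming the input is an unambiguous DNA string (only A, C, G, T) written 5’ to 3’:

```python
# Complementary base pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, also written 5' to 3'."""
    # Complement each base, then reverse because the strands are antiparallel.
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

Taking the reverse complement twice gives back the original strand, which is a quick sanity check.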

Making Sense of Genomic Data: Gene Calling and Genome Annotation

-Sanger sequencing was used up to about 2007, and then Illumina sequencing came into use, dramatically dropping the price

-What does this mean in terms of information?

a bit= 0 or 1, so base pairs can be represented in binary code like AT=00, TA=01, CG=10 and GC=11

This means we use 2 bits for every base pair, and there are 8 bits in a byte

The diploid human genome is 6 Gbp (gigabase pairs): 6 billion base pairs x 2 bits = 12 billion bits, which divided by 8 adds up to 1.5 gigabytes of information
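
The arithmetic can be written out as a small Python sketch. The 2-bit codes are the ones from these notes, keyed here by the base on the top strand (A means the AT pair, and so on), which is a simplification made for the example. The function names are made up.

```python
BITS_PER_BASE_PAIR = 2
BITS_PER_BYTE = 8

# Codes from the notes: AT=00, TA=01, CG=10, GC=11 (keyed by top-strand base).
TWO_BIT_CODE = {"A": "00", "T": "01", "C": "10", "G": "11"}


def encode_top_strand(seq):
    """Write a DNA sequence as a string of bits, 2 bits per base pair."""
    return "".join(TWO_BIT_CODE[base] for base in seq.upper())


def genome_size_bytes(base_pairs):
    """Storage needed for a genome at 2 bits per base pair."""
    return base_pairs * BITS_PER_BASE_PAIR / BITS_PER_BYTE


print(encode_top_strand("ACGT"))               # 00101101
print(genome_size_bytes(6_000_000_000) / 1e9)  # 1.5 (gigabytes)
```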

Where do you find genome data?

  • GOLD (Genomes OnLine Database), GenBank (hosted by the US National Center for Biotechnology Information, NCBI), and the European Bioinformatics Institute (EBI)

What do we do with all this data?

Genome sequencing projects: initial stages

sequence: get DNA in short chunks (reads) representing all DNA in the genome; assembly: use informatics to join the chunks and construct scaffolds, which represent chromosomes; annotation: what genes are present?

long stretches of assembled DNA are called contigs and are stored in FASTA format
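
A FASTA file is plain text: a header line starting with ">" followed by one or more lines of sequence. A minimal Python parser sketch, assuming well-formed input. The contig names in the example are invented.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # header text without the ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span many lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>contig_1 demo
ATGC
GGTA
>contig_2
TTAA
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```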

An annotated genome shows the predicted genes and the proteins they encode, which together contribute to the organism’s overall phenotype

Open Reading Frame ORF

-a sequence of DNA that has the potential to be translated

-contains start and stop codon

-before the start codon, a promoter for RNA polymerase and a binding site for the ribosome are needed

-there are 6 possible reading frames: 3 on the forward strand read 5’ to 3’, each shifted forward by one base, known as +1, +2, and +3

-the complementary strand gives 3 more reading frames in the antiparallel direction, read from the reverse complement, known as -1, -2, and -3

-to find open reading frames, you check all 6 possibilities and look for start and stop codons
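
A minimal Python sketch of the six-frame search, assuming ATG as the only start codon and TAA, TAG, TGA as stop codons. It ignores promoters and ribosome binding sites, so it is much simpler than a real gene caller. All function names are made up for these notes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """ORFs (start codon through stop codon) reading seq in steps of 3."""
    orfs = []
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                         # first start codon in this frame
        elif start is not None and codon in STOP_CODONS:
            orfs.append(seq[start:i + 3])     # include the stop codon
            start = None
    return orfs


def six_frame_orfs(seq):
    """Map frame name (+1..+3, -1..-3) to the ORFs found in that frame."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = orfs_in_frame(seq, offset)
        frames[f"-{offset + 1}"] = orfs_in_frame(rc, offset)
    return frames


print(six_frame_orfs("CCATGAAATAGCC")["+3"])  # ['ATGAAATAG']
```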

Basic Bash

Bash= Bourne Again SHell, type of scripting language used for file manipulation and program execution

commands, folders, and file names are case sensitive


options are separated by a space from the command

command = ls, option = -l, typed together as: ls -l

pwd allows you to see which directory you’re in

home folder is /jstevensbl, not the actual /home

ls foldername = list the contents of that folder

cd= change directory

cd .. = go up one level to the parent directory

Sequence similarity and Blast

What is sequence similarity?

sequence similarity- a quantification of the degree to which two sequences match one another, sequences must be aligned and a score determined for the match

a similar sequence will often have a similar function: a similar sequence tends to give a similar shape, and the same shape means it can perform the same function in the cell

At its core is the “theory of evolution”: descent from a common ancestor (homology) can be supported by similar sequences

Homologs- genes (or proteins) derived from a common ancestor, identified based on an arbitrary level of sequence similarity followed by further analysis. Homology is yes or no, not a percentage

Paralogs- homologous genes produced by gene duplication events

Orthologs- homologous genes produced by speciation

Why care about homology?

to learn about evolutionary relationships

to annotate a genome

to predict a structure of a protein

to conduct medical research

The general process of BLAST

  • conduct multiple pairwise local alignments
  • evaluate the quality of the alignments by calculating an alignment score, which uses algorithms based on our understanding of evolution
  • do a statistical analysis of the alignment scores, factoring in the length of the query sequence and the size of the search space; this gives the potential matches (subjects)
  • evaluate the results using statistical and biological knowledge

What is BLAST?

basic local alignment search tool

sequence similarity searcher, it’s an algorithm

local, not global alignments

splits the query into “words” (3 amino acids long, or 12 nucleotides long), builds a word list that also includes words with a one-letter variation, and scores how those align with the query

the query word list is searched against all sequences in the chosen database; wherever there is an exact match between a word and a subject sequence, that subject sequence is pulled out for further analysis
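
The word step can be sketched in Python. This is only the seeding idea (overlapping words, exact matches against a subject), not the real BLAST implementation: it skips the one-letter-variation words, the scoring, and the extension of hits. The function names and the example sequences are made up.

```python
def split_into_words(query, word_size):
    """All overlapping words of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_seed_hits(query, subject, word_size):
    """(query_position, subject_position) for every exact word match."""
    hits = []
    for q_pos, word in enumerate(split_into_words(query, word_size)):
        s_pos = subject.find(word)
        while s_pos != -1:                       # collect every occurrence
            hits.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return hits


print(split_into_words("MKVLA", 3))            # ['MKV', 'KVL', 'VLA']
print(find_seed_hits("MKVLA", "GGKVLAGG", 3))  # [(1, 2), (2, 3)]
```

A subject with at least one hit would be the one "pulled out for further analysis" in the notes above.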

longer query = chance of random match decreasing

shorter query = chance of random match increasing

Queries scored by substitution matrices


the S cutoff- statistical analysis based on how often you see matches due to random chance. BLAST finds lots of matches, sets a score threshold (the “S cutoff”), and then keeps the matches that score unexpectedly better than the rest of the matches
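
The "how often by random chance" idea is usually written with the Karlin-Altschul formula, E = K x m x n x e^(-lambda x S), where m is the query length, n is the size of the search space, and S is the alignment score. A Python sketch of that formula follows. K and lambda depend on the scoring system, and the values used in the example call are placeholders chosen for illustration, not values taken from these notes.

```python
import math


def expected_chance_hits(score, query_length, database_length, k, lam):
    """Karlin-Altschul estimate of matches scoring at least `score` by chance."""
    return k * query_length * database_length * math.exp(-lam * score)


# Placeholder parameters for illustration only.
K_EXAMPLE = 0.04
LAMBDA_EXAMPLE = 0.27

low = expected_chance_hits(40, 300, 1_000_000, K_EXAMPLE, LAMBDA_EXAMPLE)
high = expected_chance_hits(80, 300, 1_000_000, K_EXAMPLE, LAMBDA_EXAMPLE)
print(low > high)  # True: a higher score is less likely to be a chance match
```

The same score is less convincing in a bigger database, because m x n grows while the exponential term stays the same.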

Max score: Score of the single best aligned segment between the query and a subject (computed using a scoring matrix, as earlier)

Total score: Sum of the scores of all aligned segments from the same subject (along the query length)

Query cover: A measure of whether the subject aligns along the entire length of the query sequence.

E value: See earlier. (Number of matches with the same or better score expected by chance).

Ident (Identity): The percent of nucleotides in the query that align perfectly with those in the subject.

Accession: A unique identifier given to a sequence record by NCBI (or other database).
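
Two of these columns are simple to compute by hand. A Python sketch, assuming the alignment is given as two equal-length strings with "-" for gaps, and that identity is counted over all alignment columns (an assumption for this example). Query positions are 1-based and inclusive. The function names are made up.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where query and subject are identical."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"                 # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_query)


def query_cover(query_start, query_end, query_length):
    """Percent of the query length that is included in the alignment."""
    return 100.0 * (query_end - query_start + 1) / query_length


print(round(percent_identity("ACGT-A", "ACCTTA"), 1))  # 66.7
print(query_cover(11, 60, 100))                        # 50.0
```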

Genomics and Comparative genomics

Genome- an organism’s complete set of genetic material (DNA or RNA)

Term coined by Hans Winkler, a German botanist who combined gene and chromosome

Sometimes we specify mitochondrial genome and chromosomal genome

Comparative genomics- direct comparison of the genetic content of one organism against another; its main aim is to obtain a better biological understanding of many species. Things compared include:

-gene number, gene content, gene location

-length and number of coding regions (exons) within genes

-amount of noncoding DNA in each genome

metagenomes- the genomes present within an environmental sample

metagenomics- direct analysis of all genomes present within an environmental sample

Prokaryotic genome

one or two circular chromosomes


minimal amounts of noncoding DNA (high gene density)

operons (if multiple genes are needed for one function, they are grouped together and transcribed as one unit)

genes have no introns

Eukaryotic genome

multiple pairs of linear chromosomes

no plasmids

extensive amounts of noncoding DNA (lower gene density)

no operons

genes have introns

pan genome- all of the genes and other genetic elements found in any member of the species

core genome- the set of genes found in all members of the species

dispensable genome- genes that may be found only in some strains; the organism can still metabolize and divide without them
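
These three definitions map directly onto set operations. A Python sketch, where the strain names and gene lists are invented for the example:

```python
def pan_genome(strains):
    """Every gene found in any strain (union)."""
    return set().union(*strains.values())


def core_genome(strains):
    """Genes found in every strain (intersection)."""
    if not strains:
        return set()
    return set.intersection(*(set(genes) for genes in strains.values()))


def dispensable_genome(strains):
    """Genes found in some strains but not all (pan minus core)."""
    return pan_genome(strains) - core_genome(strains)


# Hypothetical strains of one species.
strains = {
    "strain_A": {"dnaA", "gyrB", "lacZ"},
    "strain_B": {"dnaA", "gyrB", "toxA"},
    "strain_C": {"dnaA", "gyrB"},
}
print(sorted(pan_genome(strains)))          # ['dnaA', 'gyrB', 'lacZ', 'toxA']
print(sorted(core_genome(strains)))         # ['dnaA', 'gyrB']
print(sorted(dispensable_genome(strains)))  # ['lacZ', 'toxA']
```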



As in DNA synthesis, a primer is bound to the 3’ to 5’ template strand so that new DNA can be built 5’ to 3’

DNA Replication Principles:

Structure of DNA

Need a template- the sequence being copied

Complementary Base pairing


the role of 3’OH group in phosphodiester bond formation, and structures of nucleotides

DNA sequencing- the process whereby the identity and order of nucleotides in a DNA molecule are determined

Protein sequencing came first, with Edman sequencing (beta chain of insulin published 1952)

RNA was the first nucleic acid sequenced (a tRNA molecule, 1965)

Fred Sanger invented Sanger Sequencing in 1977

PCR invented in 1983, amplifies specific sequence of DNA

Sanger Sequencing:

Relies on synthetic dideoxynucleotides (ddNTPs) aka chain terminator nucleotides

The OH group on the 3’ carbon of a deoxynucleotide is replaced by H to create a dideoxynucleotide, so synthesis must stop: without a 3’ OH, no phosphodiester bond can form with the next nucleotide

A pool of different size fragments is created, and a gel (flat or capillary) is used to separate short strands from long ones. The base at the end of each fragment (the dideoxynucleotide) is identified using fluorescent molecules, instead of the radioactive isotopes that used to be used

The information is collected as an electropherogram (chromatogram), where the sequence can be seen as a series of peaks
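
The logic of reading a Sanger gel can be imitated in Python. This is a toy model, not a simulation of the chemistry: it assumes one fragment is produced for every possible stopping point, and that the last base of each fragment is the labelled dideoxynucleotide. The function names are made up.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def synthesized_strand(template_3_to_5):
    """New strand (5' to 3') complementary to a template written 3' to 5'."""
    return template_3_to_5.upper().translate(COMPLEMENT)


def chain_terminated_fragments(new_strand):
    """One fragment for each position where a ddNTP could stop synthesis."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_gel(fragments):
    """Sort fragments short to long and read the labelled end base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


new_strand = synthesized_strand("TACGGT")
pool = chain_terminated_fragments(new_strand)
random.shuffle(pool)       # fragments are mixed together in the tube
print(read_gel(pool))      # ATGCCA
```

Sorting by length plays the role of the gel, and the terminal base of each fragment plays the role of the fluorescent colour.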

Dummies guide Sanger Sequencing:

DNA is composed of a sugar, base and phosphate

In order to determine sequence, an oxygen atom is removed from the nucleotide (the 3’ OH becomes H) to stop synthesis

Many different lengths are made and different bases have different colors

Colors are detected by a machine and show the sequence of one strand (opposite strand of primer), and from it we can determine the sequence of the other strand

Machine shows different color peaks for base present at specific points

Illumina Sequencing:

Used to sequence whole genomes and metagenomes

Sequencing by synthesis relies on reversible dye-terminators

Reversible dye-terminators are synthetic nucleotides added to the growing strand one at a time: each one stops synthesis for a moment when it is incorporated, a flash of light is recorded, the synthetic molecule is converted to a normal nucleotide, and synthesis moves on to the next nucleotide

This is done on many DNA fragments simultaneously, which is known as massively parallel sequencing (massively= big, parallel= all clusters emit light at the same time). It all happens in flow cells

The OH group on the 3’ carbon of the sugar is replaced by an OR blocking group, which can be converted back to OH after synthesis has stopped

Dye is attached to a cleavable linker, which is attached to the base

Sample is chopped up into bits creating multiple templates which are all attached to flow cell

Primer is added to all templates and then all of the reversible dye terminators are added at the same time. The one that is complementary to the next base of the template will base pair; after the flash is read, the dye is released and the OR group on the 3’ carbon is replaced with OH so synthesis can continue

Sequencing by synthesis- the perspective of a single DNA fragment being sequenced in one direction

The Illumina chip is divided into many little blocks known as tiles, which fluoresce depending on the base being added

Every template has x y coordinates on the chip, which help track where everything is

Where sequencing by synthesis fits into the larger process:

  • Prepare sample: (get DNA, cut it up) make a “library”, which is a bunch of different DNA pieces from a whole genome sample, then add adaptors to the ends
  • Attach tiny pieces of DNA (as single strands) to a slide (called a chip), which places the DNA to be sequenced on a solid surface
  • Amplify them into a bunch of clusters; each cluster contains many molecules of the same tiny piece of DNA attached to the slide, called clonal clusters
  • Sequence by synthesis: one strand (forward reads), the indices labeling the pieces, and the other strand (reverse reads), using reversible dye terminators, and read the sequence by detecting tiny flashes of light
  • Link the flashes of light with their position on the slide so that you know what the sequence of each cluster’s DNA is
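
The last two steps (one flash per cluster per cycle, then linking flashes to positions) can be imitated with a toy Python model. It assumes error-free chemistry, templates written 3’ to 5’, and an arbitrary colour for each base; the colour assignments, cluster coordinates, and function names are all invented for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Arbitrary dye colours chosen for this sketch.
DYE_COLOUR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOUR_TO_BASE = {colour: base for base, colour in DYE_COLOUR.items()}


def run_cycles(clusters, n_cycles):
    """One 'image' per cycle: {(x, y): colour of the base just added}."""
    images = []
    for cycle in range(n_cycles):
        image = {}
        for position, template in clusters.items():
            if cycle < len(template):
                added_base = COMPLEMENT[template[cycle]]  # complementary base pairs
                image[position] = DYE_COLOUR[added_base]  # flash recorded, then dye removed
        images.append(image)
    return images


def base_call(images):
    """Link the flashes at each (x, y) position into one read per cluster."""
    reads = {}
    for image in images:
        for position, colour in image.items():
            reads[position] = reads.get(position, "") + COLOUR_TO_BASE[colour]
    return reads


clusters = {(0, 0): "TACG", (5, 2): "GGCA"}  # templates written 3' to 5'
print(base_call(run_cycles(clusters, 4)))
# {(0, 0): 'ATGC', (5, 2): 'CCGT'}
```

Every cluster is read in the same cycle, which is the "massively parallel" part; the x y position is the only thing that keeps the reads apart.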