Microsatellites

Genomes are scattered with simple repeats: motifs ranging from a single base pair to several base pairs that occur in tandem. Based on the length of the repeated unit, they are classified as:

  • Satellites: Highly repeated segments of 100 nt or greater forming more or less uniform tracts 10^3 - 10^7 nt long
  • Minisatellites: Moderately repeated segments of 10 - 100 nt forming more or less uniform tracts 10^2 - 10^5 nt long
  • Microsatellites: Short segments of 1 - 6 bp repeated in more or less uniform tracts up to ~10^2 bp long. They are classified as Mono-, Di-, Tri-, Tetra-, Penta-, and Hexanucleotide repeats

In the literature, microsatellites are variously referred to as simple sequence repeats (SSR), short tandem repeats (STR), and variable number tandem repeats (VNTR). Although microsatellites occur in coding regions, they exist predominantly in non-coding regions, where they evolve neutrally. In coding regions, selection against frameshift mutations prevents their expansion, except in the case of trinucleotide repeats. Such expansions are important for various reasons. For instance, in humans trinucleotide expansion is associated with diseases such as Fragile X syndrome (expansion of CGG repeats to >200 copies in the 5' UTR of the FMR1 gene) and Friedreich ataxia (expansion of a GAA repeat in the FRDA gene).

 

Microsatellites as markers




Microsatellites have become premier genetic markers because (i) differences in the number of repeats at a microsatellite result in multiple alleles at a locus and (ii) microsatellites are inherited in a Mendelian fashion. As a result, microsatellites are used extensively as markers in forensics and in establishing kinship, and they have found widespread use in genetic analysis at or below the species level. Their generally high mutation rate ensures a high level of polymorphism, which makes them useful for examining relationships among individuals and breeding groups within populations, and for conservation genetics, population genetic structure analysis, linkage mapping, marker-assisted breeding, etc. Although the development of microsatellite markers is quite labour intensive, it proves effective and cost-effective in the long run.

 

Microsatellites for comparative genomic analysis of insects

Apart from being the source of popular genetic markers, microsatellites per se have attracted a lot of attention with respect to their evolution, distribution, expansion, mutation, and disintegration. Questions are also being asked about the functional role and the broader biological significance of microsatellites.

Genetic studies and whole genome sequence analysis have established various characteristics of microsatellites as listed below.

  • Microsatellites occur everywhere in the genome (both nuclear and organellar) and have been reported in all known living organisms. Their distribution within a genome as well as across genomes is NOT RANDOM
  • Variation in the repeat numbers of microsatellites is common and this trait has been exploited extensively as MARKERS. Most of the microsatellite databases offer species-specific marker information on microsatellite loci
  • Microsatellites are highly MUTABLE (including somatic mutations) with slippages often contributing to the expansion and contraction of the microsatellites, and point mutations leading to degradation

Insects exhibit the greatest genetic diversity on earth, a fact that has long puzzled mankind. Biologists have relied on insects to unravel many fundamental tenets of biology. Whole-genome sequences of insects have lived up to this reputation and have revealed immense variability in genome size and organization. Among others, five fully sequenced genomes are available: Drosophila melanogaster (as a model organism it provides the most annotated data), Anopheles gambiae (another Dipteran, and economically highly important as a vector), Tribolium castaneum (Coleoptera, a relatively older insect order), Apis mellifera (Hymenoptera, a relatively recent insect order) and Bombyx mori (a Lepidopteran, an order whose members include crop pests). Using these five fully sequenced insect genomes, the following questions may be addressed:

  • Are the microsatellites equally common everywhere in the genome?
  • Does the length of microsatellites have any relationship with their number?
  • Do the sequences flanking microsatellites have anything to do with the origin of microsatellites?
  • Does the microsatellite size affect microsatellite mutation rate?
  • Does the GC content of the microsatellite motif affect the length, repeat units, or mutation rate of microsatellites?
  • Do genomes possess hotspots and islands of microsatellites? In other words, do microsatellites occur as clusters (compound microsatellites)? Is there any favoured association of microsatellites in the compound repeats?
  • Do microsatellites occur as families of common flanking sequence in the genomes?


InSatDb, with an interactive interface, allows users to obtain genome level information on frequency and distribution of microsatellites motif-wise or across-the-board in a single genome or for comparative genomic analysis. One can access microsatellite cluster (compound repeats) information, and particulars of the microsatellites with common flanking sequences (microsatellite family).

 

 

Extraction Method

Extraction of Microsatellites

Microsatellites were extracted from five whole genome sequences, namely those of Bombyx mori, Drosophila melanogaster, Apis mellifera, Anopheles gambiae and Tribolium castaneum.

Source of Sequences

Bombyx mori                http://silkworm.genomics.org.cn/
Drosophila melanogaster    ftp://ftp.ncbi.nlm.nih.gov/genomes/
Apis mellifera             ftp://ftp.ncbi.nlm.nih.gov/genomes/
Anopheles gambiae          ftp://ftp.ncbi.nlm.nih.gov/genomes/
Tribolium castaneum        ftp://ftp.ncbi.nlm.nih.gov/genomes/

We extracted both perfect microsatellites and imperfect microsatellites (those interrupted by substitutions and indels).

Example:

Perfect microsatellite:

consensus pattern (2 bp): AT

microsatellite sequence : AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT

consensus sequence      : AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT

Imperfect microsatellite:

consensus pattern (4 bp): GATA

microsatellite sequence : GATA GATC GATA GAT- GATA GATA

consensus sequence      : GATA GATA GATA GATA GATA GATA

Size of microsatellites was restricted as follows:

Repeat type    Copy no.    Length (bp)
Mono           15          15
Di             7.5         15
Tri            5           15
Tetra          5           20
Penta          5           25
Hexa           5           30
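The size limits above can be applied as a simple filter. The following is a minimal sketch, assuming the tabulated lengths are minimum tract lengths in bp; the names MIN_LENGTH_BP, passes_size_filter and copy_number are hypothetical and not part of InSatDb.

```python
# Tract length limit (bp) per motif size, copied from the table above.
# Assumption: the values are read as minimum lengths.
MIN_LENGTH_BP = {1: 15, 2: 15, 3: 15, 4: 20, 5: 25, 6: 30}


def copy_number(motif: str, tract_length: int) -> float:
    """Number of motif copies in a tract (may be fractional, e.g. 7.5)."""
    return tract_length / len(motif)


def passes_size_filter(motif: str, tract_length: int) -> bool:
    """Return True if the tract meets the length limit for its motif size."""
    min_len = MIN_LENGTH_BP.get(len(motif))
    if min_len is None:
        # Motifs longer than 6 bp are not treated as microsatellites here.
        return False
    return tract_length >= min_len
```

For example, a 15 bp dinucleotide tract corresponds to 7.5 copies and passes the filter, whereas a 19 bp tetranucleotide tract does not.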

The whole genome sequences were submitted to Tandem Repeats Finder (TRF) version 4 (http://tandem.bu.edu/trf/trf.html) to extract microsatellites.

Input to the program consists of the following parameters:

Match    Mismatch    Indel
2        -4          -7
2        -5          -7

 

Detection parameters: matching probability Pm = 0.80 and indel probability Pi = 0.10.

A minimum alignment score of 30 was used. Microsatellites that meet or exceed this alignment score are reported.
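To illustrate how the match, mismatch and indel weights relate to the minimum score, the sketch below scores a tract that has already been aligned against its repeated consensus. It is a simplified additive scoring written for illustration only and is not TRF's own alignment procedure; the function name alignment_score and the "-" gap convention are assumptions.

```python
def alignment_score(aligned: str, consensus: str,
                    match: int = 2, mismatch: int = -5, indel: int = -7) -> int:
    """Sum per-column weights for two pre-aligned, equal-length strings.

    A "-" in either string marks an indel column.
    """
    if len(aligned) != len(consensus):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for a, c in zip(aligned, consensus):
        if a == "-" or c == "-":
            score += indel
        elif a == c:
            score += match
        else:
            score += mismatch
    return score


# The imperfect GATA example from this page: 22 matches, 1 mismatch, 1 indel.
imperfect = alignment_score("GATAGATCGATAGAT-GATAGATA", "GATA" * 6)
```

Under this simplified scoring, a perfect tract of n bp scores 2n, so a score of 30 corresponds to 15 bp, and the imperfect GATA example scores 44 - 5 - 7 = 32 with the weights 2, -5, -7.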

The microsatellites were extracted using two sets of parameters, 2,-4,-5 and 2,-5,-7, to maximise the number of microsatellites within the defined limits. The two result sets were then combined and redundancy was removed. Since TRF extracts microsatellites under the constraint of generating the best possible score for a repeat stretch, it often returns repeats such as “(AAAAA)5” for a stretch of 25 'A's, instead of (A)25. Such errors were corrected during the verification stage.
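One way such a correction could be made is to reduce each reported motif to its shortest repeating unit. This is a sketch of that idea, not the verification script actually used; smallest_period and normalise_repeat are hypothetical names.

```python
def smallest_period(motif: str) -> str:
    """Return the shortest unit whose repetition reproduces the motif."""
    n = len(motif)
    for p in range(1, n + 1):
        if n % p == 0 and motif[:p] * (n // p) == motif:
            return motif[:p]
    return motif


def normalise_repeat(motif: str, copies: float):
    """Rewrite (motif)copies using the shortest unit, keeping the tract length."""
    unit = smallest_period(motif)
    return unit, copies * len(motif) / len(unit)
```

With this, normalise_repeat("AAAAA", 5) gives the unit "A" with 25 copies, while a motif such as "GATA" is left unchanged.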

Extraction of Repetitive Elements and Genes

Repetitive elements: Short stretches of DNA with the capacity to move between different points within a genome.

The insect genomes were submitted to RepeatMasker (http://www.repeatmasker.org/) to extract the repetitive elements.

Genes: A unit of hereditary information. A gene is a piece of a DNA molecule that specifies the production of a particular protein.

The masked sequence from RepeatMasker was submitted to GENSCAN (http://genes.mit.edu/GENSCAN.html) to extract genes.

Locations of microsatellites were determined from the coordinates of the repetitive elements and genes. If a microsatellite falls on both a gene and a repetitive element, it is reported as being located on the repetitive element.
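The precedence rule can be sketched as follows. The example assumes inclusive (start, end) coordinates on a single sequence and treats any overlap as "falling on" a feature; the label "other" for microsatellites outside both feature types, and the function names, are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on one sequence


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_location(ssr: Interval,
                      genes: List[Interval],
                      repeats: List[Interval]) -> str:
    """Assign a location label, giving repetitive elements precedence over genes."""
    if any(overlaps(ssr, r) for r in repeats):
        return "repetitive element"
    if any(overlaps(ssr, g) for g in genes):
        return "gene"
    return "other"
```

Checking repetitive elements first is what implements the rule that a microsatellite lying on both feature types is assigned to the repetitive element.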

Extraction of Compound Repeats

A compound microsatellite repeat is defined here as an occurrence of two or more microsatellites contiguously with intervening non-repeat sequence of <= 70 bp.
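The definition above can be sketched as a single pass over sorted coordinates. This is an illustrative sketch, assuming non-overlapping microsatellites on one sequence given as inclusive (start, end) pairs; find_compound_repeats is a hypothetical name, not the script used to build the database.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on one sequence


def find_compound_repeats(ssrs: List[Interval],
                          max_gap: int = 70) -> List[List[Interval]]:
    """Group neighbouring microsatellites separated by <= max_gap bp."""
    groups: List[List[Interval]] = []
    current: List[Interval] = []
    for ssr in sorted(ssrs):
        # Intervening sequence between the previous end and this start.
        if current and ssr[0] - current[-1][1] - 1 <= max_gap:
            current.append(ssr)
        else:
            if len(current) >= 2:
                groups.append(current)
            current = [ssr]
    if len(current) >= 2:
        groups.append(current)
    return groups
```

Only groups with two or more members are returned, so isolated microsatellites are not reported as compound repeats.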

Extraction of microsatellite families

A microsatellite family consists of all those microsatellites occurring in a genome that possess highly matching flanking sequences, the stringency of the sequence match being: percentage match = 95% and alignment length = 85%.

Please note that various authors describe a microsatellite family as a set of microsatellites with similar features, most often the repeat motif. However, in the context of the present study, members of a microsatellite family are paralogous with respect to their flanking sequences.

For each microsatellite, 100 bp of 5' and 3' flanking sequence were extracted. Sequence matching was done by performing an all-versus-all BLAST with parameters e = 1, match = 95% and alignment length = 85%. Grouping of microsatellites based on sequence match (both +/+ and +/- alignments) was done using a Perl script.
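The grouping step can be pictured as finding connected components among the pairwise hits. The sketch below is written in Python rather than Perl and is not the original script; it assumes hits are supplied as (query id, subject id, percent identity, alignment length) tuples, that the 85% threshold is taken relative to the 100 bp flank, and that any qualifying hit links two microsatellites. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, int]  # (query, subject, % identity, alignment length)


def group_families(hits: Iterable[Hit], flank_len: int = 100,
                   min_identity: float = 95.0,
                   min_aln_frac: float = 0.85) -> List[List[str]]:
    """Cluster microsatellites linked by qualifying flank matches (union-find)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, identity, aln_len in hits:
        if query == subject:
            continue  # ignore self-hits from the all-versus-all search
        if identity >= min_identity and aln_len >= min_aln_frac * flank_len:
            parent[find(query)] = find(subject)

    families = defaultdict(set)
    for node in list(parent):
        families[find(node)].add(node)
    return sorted(sorted(members) for members in families.values()
                  if len(members) > 1)
```

Because membership is transitive in this sketch, two microsatellites can end up in the same family through a shared third member even if their own flanks do not match at the stated stringency.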