MATERIALS AND METHODS
The main motivation of Micro-Primers is to reduce the time and
computational effort required by the manual selection of microsatellite
candidates. To this end, several tools and scripts were integrated
within Micro-Primers to discover SSRs and design the respective primers
for subsequent in vitro amplification.
Internal & External Components
The Micro-Primers pipeline was written in Python version 3.6. The two
main internal components are implemented in the following scripts:
(i) install.py, which handles all the prerequisites needed for a proper
installation of Micro-Primers, and (ii) micro-primers.py, which is the
main script that defines the pipeline. Analysis settings are described
in the config.txt file, and parameters can be modified by users
according to their own needs. The folder software, provided together
with the Python scripts, holds all the scripts and external software
employed by micro-primers.py.
The Micro-Primers pipeline integrates several external components:
(i) Trimmomatic (Bolger, Lohse, & Usadel, 2014) for the removal of
the sequencing adapters; (ii) Cutadapt (Martin, 2011) for the removal of
the technology-specific adapter; (iii) FLASH (Magoč & Salzberg, 2011)
for the merging of paired-end reads (R1 and R2); (iv) MISA (Thiel,
Michalek, Varshney, & Graner, 2003) for the SSR search; (v) CD-HIT
(Fu, Niu, Zhu, Wu, & Li, 2012; Li & Godzik, 2006) for the removal of
redundancy; and (vi) Primer3 (Rozen & Skaletsky, 2000) for the primer
design.
Input Files & Pipeline
To run Micro-Primers, users only need to provide two FASTQ files
corresponding to both ends of a paired-end sequencing run. Samples
should come from a pool of (untagged) individuals of the same species so
the microsatellite selection can be optimized. SSR selection is
based on the number of alleles at each SSR locus, so the more
heterogeneous the sample is (i.e. containing individuals from distinct
populations across the species distribution), the better the final
result will be. Reads must come from a microsatellite library built
using a restriction enzyme and following an enrichment protocol such as
the one described in Garrett et al. (2017). The enrichment protocol is
performed after digestion so the target SSR motifs are the most
represented strands in the final library. A fragment size selection is
then performed on the enriched library to keep only fragments of an
average length lower than the maximum sequencing length, so that both
paired-end reads overlap when merged later on. The final fragment size
is important for microsatellite screening: each fragment must comprise
the full SSR pattern (variable in length) and two flanking regions long
enough for primer design.
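The size constraint can be illustrated with a little arithmetic. The sketch below is not part of Micro-Primers; the read length, the required overlap and the SSR length are hypothetical values chosen only for illustration. It assumes that two reads of equal length can be merged only if the fragment is no longer than twice the read length minus the required overlap.

```python
def max_mergeable_fragment(read_len: int, min_overlap: int) -> int:
    """Longest fragment whose two paired-end reads still overlap by min_overlap bases."""
    if min_overlap < 1 or min_overlap > read_len:
        raise ValueError("min_overlap must be between 1 and read_len")
    return 2 * read_len - min_overlap


def flank_budget(fragment_len: int, ssr_len: int) -> int:
    """Bases left for the two flanking regions once the SSR is accounted for."""
    if ssr_len > fragment_len:
        raise ValueError("the SSR cannot be longer than the fragment")
    return fragment_len - ssr_len


# Illustrative values only: 2 x 250 bp reads and a 10 bp required overlap.
longest = max_mergeable_fragment(read_len=250, min_overlap=10)
print(longest)                    # 490
print(flank_budget(longest, 40))  # 450 bases shared by the two flanks
```

A longer repeat tract leaves fewer bases for the flanks, which is why the fragment size has to accommodate both the SSR and the primer sites.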
Additionally, prior to the execution of Micro-Primers, users must
install all the external components and set the environment variables
through the script install.py. Users must also check the config.txt
file (the configuration parameters are described in the next
subsection) before executing the main script (micro-primers.py). Upon
execution, Micro-Primers follows the flowchart shown in Figure 1. It
begins by using Trimmomatic and Cutadapt to remove the sequencing and
technology-specific adapters, respectively, and the paired-end reads
are then merged with FLASH. Only sequence reads containing the
restriction enzyme pattern are kept by the pipeline. Various parameters
are then calculated and only the sequences that comply with the user's
specifications are selected. Next, the repeat region of each SSR is
removed from the sequences, and the flanking regions are aligned and
assigned to clusters using CD-HIT with the following parameters:
-c 0.95 -n 10 (a sequence identity threshold of 0.95 and a word length
of 10).
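The repeat-removal and clustering step can be sketched as follows. This is an illustration under stated assumptions, not the code of micro-primers.py: the function names and file names are hypothetical, the SSR coordinates are assumed to be 1-based and inclusive, and the nucleotide program cd-hit-est is assumed to be the CD-HIT executable in use. Only the two parameter values come from the text.

```python
from typing import List


def flanks_only(sequence: str, ssr_start: int, ssr_end: int) -> str:
    """Drop the repeat tract (1-based, inclusive coordinates) and join both flanks."""
    if not 1 <= ssr_start <= ssr_end <= len(sequence):
        raise ValueError("SSR coordinates fall outside the sequence")
    return sequence[: ssr_start - 1] + sequence[ssr_end:]


def cdhit_command(fasta_in: str, fasta_out: str) -> List[str]:
    """Argument list for the clustering run; the executable name is an assumption."""
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out,
            "-c", "0.95", "-n", "10"]


# Toy read: 8 bp flank, (AG)6 repeat, 8 bp flank.
read = "ACGTTGCA" + "AG" * 6 + "TTGACCGT"
print(flanks_only(read, 9, 20))  # ACGTTGCATTGACCGT
print(" ".join(cdhit_command("flanks.fasta", "clusters")))
```

Clustering on the flanks alone means that reads from the same locus group together even when their repeat tracts differ in length.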
Sequences belonging to the same cluster are sorted and the number of
different alleles in the cluster is computed. Only clusters with a
minimum number of alleles (set by the user in the config.txt file) are
retained, and a random sequence among the variants is selected as the
representative of each SSR locus. Every representative is then passed
to Primer3, and an output file containing both the primer information
and the number of alleles for each sequence is created according to the
primer specifications set by the user.
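The allele counting and the choice of a representative could look like the sketch below. It is an illustration only: the Read type and the function name are hypothetical, and an allele is assumed to be a distinct repeat count within a cluster, which the text does not state explicitly.

```python
import random
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Read(NamedTuple):
    read_id: str
    cluster: int   # cluster number assigned by the clustering step
    repeats: int   # number of motif repeats observed in this read


def pick_representatives(reads: List[Read], min_alleles: int,
                         seed: Optional[int] = None) -> Dict[int, Read]:
    """Keep clusters with enough distinct repeat counts; pick one read from each."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[Read]] = defaultdict(list)
    for read in reads:
        by_cluster[read.cluster].append(read)

    chosen: Dict[int, Read] = {}
    for cluster, members in sorted(by_cluster.items()):
        alleles = {m.repeats for m in members}  # assumed: one allele per repeat count
        if len(alleles) >= min_alleles:
            chosen[cluster] = rng.choice(members)  # random representative
    return chosen


reads = [
    Read("r1", 0, 8), Read("r2", 0, 10), Read("r3", 0, 12),
    Read("r4", 1, 6), Read("r5", 1, 6),
]
selected = pick_representatives(reads, min_alleles=3, seed=1)
print(sorted(selected))  # [0]
```

Cluster 1 is dropped because both of its reads carry the same repeat count, so it shows a single allele.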
Execution Parameters
As described previously, all the parameters that Micro-Primers needs to
perform the analysis properly must be duly set in the config.txt file.
This file contains four sections with different parameters to be
considered for the pipeline execution:
In the first section (Input Files), the user has to indicate the name of
the paired-end files that will be used as input in the analysis.
In the second section (CUTADAPT), the sequence of the adapters used
after the restriction enzyme digestion is required. These adapters are
necessary to transform the longer overhangs into blunt ends after the
enzyme digestion. Only sequences with these adapters are considered
‘true’ digested sequences.
In the third section (SSR), several parameters control the
microsatellite selection. The parameter MIN FLANK LEN indicates the
minimum length accepted for both flanking regions, where the primers
will be designed. The length of the flanking regions is critical to the
final outcome, since a very narrow window prevents the design of primers
and subsequently causes the exclusion of the respective SSR. Thus, any
sequence with a flanking region (at either end) shorter than the
specified length is discarded. MIN MOTIF REP sets the minimum number of
repeats that every SSR locus must have to be kept in the pipeline.
Specific SSR motif types can also be excluded from the output through
the EXC MOTIF TYPE parameter. Options for this parameter are c
(compound), c* (compound with imperfection) and p1 to p6 (repeated motif
of 1 to 6 nucleotides) (Thiel et al., 2003). Motif types to be discarded
should be separated by commas. The MIN ALLEL CNT option indicates the
minimum number of alleles for an SSR locus to be selected and it is
based on the observed alleles. In contrast, the parameter MIN ALLEL
SPECIAL DIF indicates the minimum potential number of alleles desired
for each locus, taking into account that not all alleles are represented
in the multi-individual sample. This potential number is assessed from
the difference in repeat number between the alleles with the highest and
lowest number of repeats, and only loci for which it satisfies the
minimum indicated in MIN ALLEL SPECIAL DIF are kept. The parameter
MIN ALLEL SPECIAL is used to enable (=1) or disable (=0) this option.
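The SSR parameters can be read as two layers of filtering, one per read and one per locus, sketched below. The field and function names are hypothetical (they only mirror the parameter names above), and the example values are illustrative. The text does not say how the special option combines with MIN ALLEL CNT; the sketch assumes it is an additional way for a locus to qualify.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SsrSettings:
    # Hypothetical field names mirroring the parameters of the SSR section.
    min_flank_len: int
    min_motif_rep: int
    exc_motif_type: FrozenSet[str]
    min_allel_cnt: int
    min_allel_special: bool
    min_allel_special_dif: int


def read_passes(left_flank: int, right_flank: int, repeats: int,
                motif_type: str, cfg: SsrSettings) -> bool:
    """Per-read checks: both flanks long enough, enough repeats, type not excluded."""
    return (min(left_flank, right_flank) >= cfg.min_flank_len
            and repeats >= cfg.min_motif_rep
            and motif_type not in cfg.exc_motif_type)


def locus_passes(repeat_counts: List[int], cfg: SsrSettings) -> bool:
    """Per-locus check on observed alleles, or on the repeat-count span if enabled."""
    if not repeat_counts:
        return False
    if len(set(repeat_counts)) >= cfg.min_allel_cnt:
        return True
    # Assumed reading of the special option: span between longest and shortest allele.
    span = max(repeat_counts) - min(repeat_counts)
    return cfg.min_allel_special and span >= cfg.min_allel_special_dif


cfg = SsrSettings(min_flank_len=50, min_motif_rep=5,
                  exc_motif_type=frozenset({"c", "c*", "p1"}),
                  min_allel_cnt=5, min_allel_special=True,
                  min_allel_special_dif=8)
print(read_passes(60, 55, 7, "p2", cfg))  # True
print(read_passes(60, 40, 7, "p2", cfg))  # False, right flank too short
print(locus_passes([6, 14], cfg))         # True, span of 8 repeats
print(locus_passes([6, 7], cfg))          # False
```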
Finally, in the fourth section (PRIMER3), the only requirement is to
indicate the path to the Primer3 settings file containing the standard
parameters of Primer3. These parameters can be changed according to the
user's needs, e.g., PRIMER_PRODUCT_SIZE_RANGE, PRIMER_OPT_SIZE,
PRIMER_OPT_TM and/or PRIMER_MAX_POLY_X, among others (all parameters are
listed at https://primer3.org/manual.html).
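For orientation, the sketch below builds one Primer3 input record in the Boulder-IO format (TAG=VALUE lines closed by a lone "="). It is an assumption-laden illustration: the function name, the toy template and the tag values are hypothetical, and the text does not state whether Micro-Primers passes tags per record or only through the settings file. The target start is taken as 0-based, which is the Primer3 default for PRIMER_FIRST_BASE_INDEX.

```python
from typing import Dict


def primer3_record(seq_id: str, template: str, ssr_start: int, ssr_len: int,
                   overrides: Dict[str, str]) -> str:
    """Build one Boulder-IO record that asks for primers flanking the SSR."""
    tags = {
        "SEQUENCE_ID": seq_id,
        "SEQUENCE_TEMPLATE": template,
        "SEQUENCE_TARGET": f"{ssr_start},{ssr_len}",  # <start>,<length>
    }
    tags.update(overrides)
    lines = [f"{key}={value}" for key, value in tags.items()]
    lines.append("=")  # a lone '=' terminates the record
    return "\n".join(lines) + "\n"


# Illustrative values only; they are not the defaults shipped with the pipeline.
record = primer3_record(
    "locus_1", "ACGTTGCA" + "AG" * 6 + "TTGACCGT", ssr_start=8, ssr_len=12,
    overrides={
        "PRIMER_PRODUCT_SIZE_RANGE": "100-300",
        "PRIMER_OPT_SIZE": "20",
        "PRIMER_OPT_TM": "60.0",
        "PRIMER_MAX_POLY_X": "4",
    },
)
print(record)
```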
PCR amplification primers are usually designed with a length of 20-25
nucleotides, and some particularities are required to avoid future
problems during genotyping (Dieffenbach, Lowe, & Dveksler, 1993;
Flores-Rentería & Krohn, 2013), such as the presence of G or C at the 3’
end, a GC percentage suitable for a proper melting temperature, and
similar melting temperatures in both primers so that their hybridization
takes place at the same time.
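These criteria can be written as simple checks on a primer pair, as in the sketch below. The function names, the GC window (40-60%) and the tolerated melting temperature difference (2 degrees) are illustrative assumptions rather than values taken from Micro-Primers or Primer3; melting temperatures are supplied by the caller instead of being computed here.

```python
from typing import Tuple


def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    if not primer:
        raise ValueError("primer must not be empty")
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def has_gc_clamp(primer: str) -> bool:
    """True when the 3' end (last base of the 5'->3' string) is G or C."""
    return primer.upper().endswith(("G", "C"))


def pair_ok(forward: str, reverse: str, tm_forward: float, tm_reverse: float,
            length_range: Tuple[int, int] = (20, 25),
            gc_range: Tuple[float, float] = (40.0, 60.0),   # assumed window
            max_tm_diff: float = 2.0) -> bool:              # assumed tolerance
    """Check length, 3' G/C, GC percentage and Tm similarity for a primer pair."""
    for primer in (forward, reverse):
        if not length_range[0] <= len(primer) <= length_range[1]:
            return False
        if not has_gc_clamp(primer):
            return False
        if not gc_range[0] <= gc_percent(primer) <= gc_range[1]:
            return False
    return abs(tm_forward - tm_reverse) <= max_tm_diff


fwd = "AGCTGACCTGAAGTCCATGC"
rev = "TGCAAGGTCCATCGATGGAC"
print(gc_percent(fwd))                # 55.0
print(pair_ok(fwd, rev, 60.1, 59.4))  # True
print(pair_ok(fwd, rev, 60.1, 55.0))  # False, melting temperatures too far apart
```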