RESULTS
This section presents the experimental results obtained by running our
pipeline on a real case study, following the same procedure that a
manual analysis would have required.
Micro-Primers’ Output
Running the Micro-Primers pipeline produces a single plain-text output
file with the information needed to amplify the SSR loci through their
representative sequences. Figure 2 shows a sample of the file and how it
is organized. It has eleven columns, and each line describes one primer
pair designed by Primer3 for an SSR recovered from the multi-individual
sample. From left to right, the first column (in red) is the
representative sequence of each cluster, preceded by a unique index that
makes it easy to identify (sequence ID). Lines with the same sequence ID
represent different primer pairs for the same SSR locus. The second
column gives the size (length) of the sequence resulting from PCR
amplification with the respective primer pair. The third and fourth
columns are the forward primer sequence and its melting temperature. The
fifth and sixth columns give the same information for the reverse
primer. The seventh column shows the specific motif/allele together with
the number of repeats found in the SSR representative. The column
labeled ‘Range’ shows the length range of the PCR amplicon across all
the alleles detected for the same SSR. The ninth column contains the
total number of alleles for the specific SSR locus. The tenth column
indicates the potential number of alleles to be found in the population,
estimated from the difference between the longest and shortest alleles
found. Finally, the eleventh column marks the best primer pair for each
locus (coded as “| BEST |”), as provided by Primer3.
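As a rough illustration of how this output could be consumed downstream,
the Python sketch below groups the lines by sequence ID and picks the
pair flagged as best for each locus. It assumes tab-separated fields in
the column order described above and no header line; the separator, the
class name PrimerRow, its field names, and the example rows are all
assumptions made for illustration and are not taken from the
Micro-Primers software.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class PrimerRow(NamedTuple):
    """One output line; field names are illustrative, not the tool's own."""
    seq_id: str             # column 1: indexed representative sequence
    product_size: int       # column 2: amplicon length
    fwd_primer: str         # column 3
    fwd_tm: float           # column 4
    rev_primer: str         # column 5
    rev_tm: float           # column 6
    motif: str              # column 7: motif and repeat count
    size_range: str         # column 8: 'Range'
    n_alleles: int          # column 9
    potential_alleles: int  # column 10
    is_best: bool           # column 11: True when flagged as BEST


def parse_line(line: str, sep: str = "\t") -> PrimerRow:
    """Split one line into eleven fields (the separator is an assumption)."""
    f = [x.strip() for x in line.rstrip("\n").split(sep)]
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return PrimerRow(f[0], int(f[1]), f[2], float(f[3]), f[4], float(f[5]),
                     f[6], f[7], int(f[8]), int(f[9]), "BEST" in f[10])


def best_pair_per_locus(lines: Iterable[str]) -> Dict[str, Optional[PrimerRow]]:
    """Group rows by sequence ID and return the row flagged BEST for each."""
    by_locus = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            row = parse_line(line)
            by_locus[row.seq_id].append(row)
    return {sid: next((r for r in rows if r.is_best), None)
            for sid, rows in by_locus.items()}


# Made-up rows, for illustration only (not taken from Figure 2).
EXAMPLE_LINES = [
    "1_seqA\t180\tACGTACGTACGTACGTACGT\t59.8\tTGCATGCATGCATGCATGCA\t60.1"
    "\t(AC)12\t172-188\t4\t9\t| BEST |",
    "1_seqA\t210\tACGTTGCAACGTTGCAACGT\t58.9\tTTGCAACGTTGCAACGTTGC\t60.4"
    "\t(AC)12\t202-218\t4\t9\t",
]
best = best_pair_per_locus(EXAMPLE_LINES)
print(best["1_seqA"].fwd_primer, best["1_seqA"].product_size)
```

Keeping every row per locus before selecting the flagged one makes it
easy to fall back to an alternative pair if the best one fails in the
laboratory.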
Performance Analysis
The Micro-Primers pipeline was tested with a dataset of bats from two
populations, Namibia and Botswana, comprising 15 and 21 individuals,
respectively (the dataset is available in the GitHub repository together
with the Micro-Primers software). Samples were pooled, enriched
separately for di- and tetra-repeat motifs following the protocol
established by Garrett et al. (2017), and sequenced together with an
Illumina MiSeq v2 kit (250 cycles, paired-end), targeting 300k reads.
Since the species is diploid and the sample contains 36 individuals, the
maximum number of alleles that can be found per locus is 72. The
analysis was carried out twice: by manually running each required
program one after the other, and by executing the Micro-Primers pipeline
with exactly the same parameters used in the manual run. As expected,
the results of both procedures were identical; the only difference was
the time needed to complete them. The manual process took no less than
24 hours, most of it spent on the manual selection of the clusters. Some
changes to the input format were also required for certain programs to
work properly, such as Primer3, for which sequence identifiers were
modified to include an index at the beginning. The goal was to avoid the
problems that some software has with long and redundant sequence names.
The automatic pipeline, in contrast, took less than 2 minutes to execute
the entire analysis, using a single core of an Intel i7 octa-core
processor with 64 GB of main memory. The only step with notable memory
demands is the trimming step carried out by the Trimmomatic component,
so in general minimal resources are needed.
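The upper bound of 72 alleles per locus follows directly from the
ploidy and the pooled sample size. A minimal Python check of this
arithmetic, where the function name is ours and purely illustrative:

```python
def max_alleles_per_locus(n_individuals: int, ploidy: int = 2) -> int:
    """Upper bound on distinct alleles at one locus in a pooled sample."""
    return ploidy * n_individuals


namibia, botswana = 15, 21  # individuals per population, as reported above
print(max_alleles_per_locus(namibia + botswana))  # 72
```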
In addition, four parameter configurations were tested to check the
performance of the pipeline and to evaluate the differences in the
number of microsatellite loci detected. The pipeline’s execution was
modified by changing parameters in the configuration file or in the
Primer3 settings file; the number of sequences remaining after each step
is presented in Table 1. The four configurations tested were: (i) the
default; (ii) activation of the special search with a minimal difference
between extreme alleles of 8; (iii) a change in the flanking region
length; and (iv) a change in the maximum difference in melting
temperature between forward and reverse primers in Primer3. As shown in
Table 1, the number of sequences that meet the requirements in the first
four pipeline steps is exactly the same in all cases, since none of the
modified parameters acts on these steps. The pipeline output changes
after Filter 2, depending on the configuration used.
The special feature MIN ALLEL SPECIAL DIF, based on the potential number
of alleles per locus, has a substantial impact on the final number of
loci kept and on the subsequent number of primers selected, compared
with the default setting, which is based on the observed number of
alleles. When the special parameter is activated with a minimal
difference between the extreme alleles of 8, the number of SSR loci
increases from 26 to 104, producing a total of 83 primer pairs.
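One plausible reading of the two selection criteria is sketched below in
Python. It is not the Micro-Primers implementation: the class Locus, the
function names, the unit of the difference (repeat counts), the default
threshold on observed alleles, and the way the two criteria are combined
are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Locus:
    name: str
    allele_repeats: List[int]  # repeat count of each allele observed (assumed unit)


def observed_alleles(locus: Locus) -> int:
    """Number of distinct alleles actually seen in the sample."""
    return len(set(locus.allele_repeats))


def extreme_difference(locus: Locus) -> int:
    """Difference between the longest and the shortest allele."""
    return max(locus.allele_repeats) - min(locus.allele_repeats)


def keep_locus(locus: Locus, min_alleles: int,
               special: bool = False, min_special_dif: int = 8) -> bool:
    """Default: threshold on observed alleles. With the special search
    (assumed behaviour), a wide gap between extreme alleles also qualifies."""
    if observed_alleles(locus) >= min_alleles:
        return True
    return special and extreme_difference(locus) >= min_special_dif


# Hypothetical loci: 'b' has only two observed alleles but a wide spread.
a = Locus("a", [10, 11, 12, 13, 14])
b = Locus("b", [6, 15])
print(keep_locus(b, min_alleles=5), keep_locus(b, min_alleles=5, special=True))
```

Under this reading, a locus with few observed alleles but a wide spread
between its extremes is retained, which is consistent with the larger
number of loci reported when the special search is activated.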
Variations in the minimal flanking region length at Filter 2 affect the
number of sequences that pass to the following steps, and thus the
number of SSR markers at the end. Higher values of the flanking region
parameter make the filter more restrictive, so fewer sequences are kept.
A compromise is needed between the length of the flanking region and the
capacity of Primer3 to design primers under the given parameter
settings. The shorter the flanking regions, the more sequences pass
through Filter 2, although most of them will not be processed by Primer3
because they are not long enough for primers to be designed without
overlapping the microsatellite region.
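The trade-off can be made concrete with a small Python sketch. It
assumes 0-based coordinates with an exclusive end for the SSR and a
single threshold applied to both flanks; the function names and the
numeric values, including the primer length of 18, are illustrative and
do not reproduce the Filter 2 code or any particular Primer3 settings.

```python
def flank_lengths(seq_len: int, ssr_start: int, ssr_end: int):
    """Return (left, right) flank lengths around an SSR.
    ssr_start is inclusive and ssr_end exclusive (0-based, an assumption)."""
    return ssr_start, seq_len - ssr_end


def passes_flank_filter(seq_len: int, ssr_start: int, ssr_end: int,
                        min_flank: int) -> bool:
    """Keep the sequence only if both flanks reach the minimum length."""
    left, right = flank_lengths(seq_len, ssr_start, ssr_end)
    return left >= min_flank and right >= min_flank


def primers_fit_outside_ssr(seq_len: int, ssr_start: int, ssr_end: int,
                            primer_len: int) -> bool:
    """A primer can avoid the SSR only if each flank is at least primer_len."""
    left, right = flank_lengths(seq_len, ssr_start, ssr_end)
    return left >= primer_len and right >= primer_len


# Hypothetical read of 100 bp with an SSR at positions 12..40.
seq_len, start, end = 100, 12, 40
print(passes_flank_filter(seq_len, start, end, min_flank=10))      # kept by a lax filter
print(primers_fit_outside_ssr(seq_len, start, end, primer_len=18)) # but no room for a primer
```

In this example the sequence passes a permissive flank threshold, yet
its left flank is too short to hold a primer that does not overlap the
repeat, which is the situation described in the paragraph above.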
At the end of the pipeline, changes in the maximum difference in melting
temperature between primers in Primer3 (MAX DIFF TM) induce, as
expected, variation in the number of primer pairs designed. Higher
values of this parameter increase the capacity of Primer3 to find primer
pairs in a sequence but, contrary to expectations, these pairs are not
necessarily the most suitable, and therefore fewer primers are selected
at the end.
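The effect of this constraint on the set of candidate pairs can be
sketched in a few lines of Python. The parameter discussed here
presumably corresponds to the Primer3 tag PRIMER_PAIR_MAX_DIFF_TM; the
candidate pairs and threshold values below are made up, and the sketch
covers only the acceptance test, not the later selection of the most
suitable pair.

```python
from typing import List, Tuple

# (pair name, forward Tm, reverse Tm) in degrees Celsius; made-up values.
CANDIDATES: List[Tuple[str, float, float]] = [
    ("p1", 59.8, 60.1),
    ("p2", 57.0, 61.5),
    ("p3", 55.2, 62.0),
]


def within_tm_diff(fwd_tm: float, rev_tm: float, max_diff_tm: float) -> bool:
    """True when the two melting temperatures differ by at most max_diff_tm."""
    return abs(fwd_tm - rev_tm) <= max_diff_tm


def accepted_pairs(pairs: List[Tuple[str, float, float]],
                   max_diff_tm: float) -> List[str]:
    """Names of the pairs that satisfy the melting-temperature constraint."""
    return [name for name, f, r in pairs if within_tm_diff(f, r, max_diff_tm)]


print(accepted_pairs(CANDIDATES, max_diff_tm=1.0))
print(accepted_pairs(CANDIDATES, max_diff_tm=5.0))
```

A looser threshold admits more candidates, but the additional pairs have
less balanced melting temperatures, so admitting them does not by itself
improve the final selection.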