Current genome-wide single nucleotide polymorphism (SNP) genotyping methods produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact of missing data on downstream analyses, and several algorithms are available to fill in missing genotypes. However, the lack of reference haplotype panels precludes the use of these methods in genomic studies of non-model organisms. As an alternative, machine learning algorithms can be employed to explore the genotype data and estimate the missing genotypes. Here, we propose an imputation method based on Self-Organizing Maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into neighboring neurons. We follow a classical approach: for each query missing SNP genotype, the method explores the genotype dataset to select SNP loci and build a training set, initializes and trains the neural network, and finally uses the SOM-derived clustering to impute the best genotype. To automate the imputation process, we have implemented GTIMPUTATION, an open-source application written in Python 3 with a user-friendly GUI. The method's performance was validated by comparing its accuracy, precision and sensitivity on several benchmark genotype datasets with those of other available imputation algorithms. Our approach produced highly accurate and precise genotype imputations and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
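The general idea of SOM-based imputation described above can be illustrated with a minimal sketch. It is not the GTIMPUTATION implementation: the 0/1/2 genotype encoding, the function names (`bmu`, `train_som`, `impute`), the grid size, the learning schedule and the majority-vote rule are all assumptions chosen for illustration, and the predictor SNP loci are passed in by hand rather than selected from the data. The sketch also assumes that the query individual has no missing values at the predictor loci and that at least one training individual exists.

```python
import math
import random
from collections import Counter

def bmu(weights, x):
    """Index of the best-matching unit: the neuron whose weights are closest to x."""
    return min(range(len(weights)),
               key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))

def train_som(samples, rows=2, cols=2, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns (weights, grid coordinates)."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each neuron from a randomly chosen training sample.
    weights = [list(rng.choice(samples)) for _ in coords]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood radius
        for x in rng.sample(samples, len(samples)):
            br, bc = coords[bmu(weights, x)]
            for n, (r, c) in enumerate(coords):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
                weights[n] = [w + lr * h * (v - w) for w, v in zip(weights[n], x)]
    return weights, coords

def impute(genotypes, ind, snp, predictors, **som_args):
    """Impute genotypes[ind][snp] (stored as None) from the predictor SNP columns."""
    # Training set: individuals genotyped at the target SNP and at all predictors.
    train = [(tuple(row[p] for p in predictors), row[snp])
             for row in genotypes
             if row[snp] is not None and all(row[p] is not None for p in predictors)]
    weights, coords = train_som([x for x, _ in train], **som_args)
    # Count the target genotypes of the training individuals mapped to each neuron.
    votes = {}
    for x, g in train:
        votes.setdefault(bmu(weights, x), Counter())[g] += 1
    query = tuple(genotypes[ind][p] for p in predictors)
    qr, qc = coords[bmu(weights, query)]
    # Use the occupied neuron nearest on the grid to the query's best-matching unit.
    nearest = min(votes, key=lambda n: (coords[n][0] - qr) ** 2 + (coords[n][1] - qc) ** 2)
    return votes[nearest].most_common(1)[0][0]
```

For example, with rows of genotypes coded as alternate-allele counts and `None` marking the missing call, `impute(data, ind=8, snp=3, predictors=[0, 1, 2])` returns the most frequent genotype at SNP 3 among the training individuals that cluster with individual 8.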
Functional annotation aims to assess the biochemical and biological functions of sets of genomic or transcriptomic sequences obtained from next-generation sequencing experiments. One common way to annotate such a set of sequences is to search for homologous sequences and access the related functional information deposited in genomic databases. Functional annotation is especially challenging for de novo transcriptome assemblies of non-model organisms, such as many plant species. In such cases, existing commercial and open-access general-purpose applications may not offer complete and accurate results. We present TOA (Taxonomy-oriented annotation), a user-friendly open-access application designed to establish functional annotation pipelines geared towards non-model plant species. TOA performs homology searches against proteins stored in the PLAZA platform databases, NCBI RefSeq Plant, Nucleotide Database and Non-Redundant Protein Sequence Database, and retrieves functional information for several gene ontology systems. The software's performance was validated by comparing the runtime, total number of annotated sequences and accuracy of the functional information obtained with TOA and with other open-access functional annotation solutions on several plant benchmark datasets. TOA outperformed the other software in terms of number of annotated sequences and accuracy of the annotation, and constitutes a good alternative to improve functional annotation in plants. TOA is especially recommended for gymnosperms or for low-quality sequence datasets of non-model plants.
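Homology-based annotation across several databases can be sketched as a simple cascade, in which each query sequence takes the functional terms of its best hit in the first database that yields one. This is an illustrative assumption, not a description of TOA's internals: the function `annotate`, the database labels, the e-value cutoff and the input layout (precomputed hits as `(subject_id, e-value)` pairs, for instance parsed from tabular search output) are all hypothetical.

```python
def annotate(query_ids, hits_by_db, db_order, go_terms, max_evalue=1e-6):
    """Give each query the GO terms of its best hit in the first database with a hit.

    hits_by_db: {db_name: {query_id: [(subject_id, evalue), ...]}} (hypothetical layout)
    go_terms:   {subject_id: [GO identifiers]}
    Returns (annotations, list of queries left unannotated).
    """
    annotations = {}
    pending = list(query_ids)
    for db in db_order:
        still_pending = []
        for q in pending:
            # Keep only hits that pass the significance cutoff.
            hits = [h for h in hits_by_db.get(db, {}).get(q, []) if h[1] <= max_evalue]
            if hits:
                subject, evalue = min(hits, key=lambda h: h[1])  # lowest e-value wins
                annotations[q] = {"db": db, "subject": subject, "evalue": evalue,
                                  "go": go_terms.get(subject, [])}
            else:
                still_pending.append(q)  # try the next database in the order
        pending = still_pending
    return annotations, pending
```

Ordering the databases from the most taxonomically specific to the most general means that a sequence is only compared against broad collections when the plant-specific ones return nothing significant, which is one way to favor annotations from closely related species.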