Data collection and preprocessing
The quality of occurrence data for the target species directly affects
the performance of SDMs. Public databases such as GBIF, a common source
of species occurrence records, have been shown to be incomplete and
subject to sampling bias (Beck et al., 2014; de Araujo et al., 2022;
Garcia‐Rosello et al., 2023). Improving the accuracy of species
distributions is an issue that cannot be ignored (Tulloch et al., 2016),
as overlooking it could lead to significant conservation challenges or
shortcomings. Thus, provided that statistical requirements are met, the
quality of species occurrence data matters more than their quantity.
In this study, most orchid occurrence data were obtained from our field
surveys in recent years (n=10470), and a smaller portion was obtained
from the National Specimen Information Infrastructure (n=963). All
records were rigorously screened to ensure accuracy. Following the
orchid life-form classification of Zhou et al. (2016), we divided the
orchid records into terrestrial (n=10794), epiphytic (n=193), and
mycoheterotrophic (n=446) groups. To limit spatial autocorrelation and
avoid overfitting, records were spatially filtered at a distance of
1 km. Finally, we prepared four data sets (all-data, t-data, e-data, and
m-data) for modelling, in order to compare the effects of physiological
characteristics on model results.
We considered both biotic and abiotic factors in our models. The latest
19 bioclimatic variables (30 arc-second resolution) were downloaded from
the WorldClim database (Fick & Hijmans, 2017). To reduce collinearity
among them, we performed a Pearson correlation analysis and retained
five bioclimatic variables (|r|<0.7; see details in Appendix S1.1).
Local vegetation (Hou, 2019), terrain features (e.g. elevation, slope,
and aspect) (Crain & Fernandez, 2020), and four major soil properties
(gravel, sand, silt, and clay) (Wieder et al., 2014) were also included
in our models because of their potential ecological effects.
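The collinearity screening of the bioclimatic variables can be sketched
as follows. The actual selection procedure is given in Appendix S1.1.
The greedy rule here (walk through the variables in a priority order and
keep each one only if |r|<0.7 against all variables already kept) is one
common variant and is an assumption. The variable labels and synthetic
values are illustrative and are not the study's data.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_uncorrelated(columns, threshold=0.7):
    """columns: dict mapping variable name -> list of values, given in
    priority order (the order is an assumption). A variable is kept
    only if |r| < threshold against every variable already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Synthetic values at hypothetical sample points, for illustration only.
random.seed(0)
bio1 = [random.gauss(0, 1) for _ in range(500)]
bio5 = [v + 0.1 * random.gauss(0, 1) for v in bio1]  # nearly collinear with bio1
bio12 = [random.gauss(0, 1) for _ in range(500)]     # independent of bio1

selected = select_uncorrelated({"bio1": bio1, "bio5": bio5, "bio12": bio12})
```

Because the result depends on the order in which variables are
considered, the priority order should reflect ecological relevance to
the target species rather than being arbitrary.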