Data collection and preprocessing
The performance of SDMs depends on the occurrence data of the target species. Public databases such as GBIF, a common source of species occurrence records, have been shown to be incomplete and affected by sampling bias (Beck et al., 2014; de Araujo M. L et al., 2022; Garcia‐Rosello et al., 2023). Improving the accuracy of predicted distributions is an issue that cannot be ignored (Tulloch et al., 2016), as overlooking it could lead to significant conservation challenges or shortcomings. Thus, provided that statistical requirements are met, the quality of species occurrence data matters more than their quantity.
In this study, most orchid occurrence data were obtained from our field surveys in recent years (n=10470), and a smaller portion was obtained from the National Specimen Information Infrastructure (n=963). All records were rigorously screened to ensure accuracy. Following the orchid life-form classification of Zhou et al. (2016), we divided the records into terrestrial (n=10794), epiphytic (n=193), and mycoheterotrophic (n=446) orchids. To limit spatial autocorrelation and avoid overfitting, a 1 km threshold was applied to the occurrence records. Finally, we prepared four data sets (all-data, t-data, e-data, and m-data) for modelling, in order to compare the effects of physiological characteristics on the models.
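As an illustration of how a 1 km distance threshold can be applied to occurrence records, the following minimal sketch performs greedy spatial thinning. It is not necessarily the procedure used in this study. The function names (`haversine_km`, `thin_records`), the greedy keep-first rule, and the example coordinates are all assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def thin_records(records, min_dist_km=1.0):
    """Greedily keep records lying at least min_dist_km from every kept record.

    Assumption: records are processed in input order and the first record
    of each cluster is the one retained.
    """
    kept = []
    for lat, lon in records:
        if all(haversine_km(lat, lon, k_lat, k_lon) >= min_dist_km
               for k_lat, k_lon in kept):
            kept.append((lat, lon))
    return kept


# Hypothetical (lat, lon) coordinates, for illustration only.
points = [(25.0000, 102.0000), (25.0010, 102.0000), (25.0200, 102.0000)]
thinned = thin_records(points)
print(thinned)
```

The second point lies roughly 0.1 km from the first and is dropped, while the third lies roughly 2.2 km away and is kept. The pairwise comparison scales quadratically with the number of records, so a spatial index or grid-based approach may be preferable for larger data sets.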
We considered both biotic and abiotic factors in our models. The latest 19 bioclimatic variables (30 arc-second resolution) were downloaded from the WorldClim database (Fick & Hijmans, 2017). We reduced collinearity among these variables using Pearson correlation analysis and retained five bioclimatic variables (|r|<0.7; see details in Appendix S1.1). Local vegetation (Hou, 2019), terrain features (e.g. elevation, slope, and aspect) (Crain & Fernandez, 2020), and four major soil properties (gravel, sand, silt, and clay) (Wieder et al., 2014) were also included in our models because of their potential ecological effects.
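The collinearity screening can be illustrated with a minimal sketch that retains a variable only if its absolute Pearson correlation with every variable already retained is below 0.7. The order in which variables are considered, the function name `filter_collinear`, and the synthetic data are assumptions introduced here. The selection actually used in this study is described in Appendix S1.1.

```python
import numpy as np


def filter_collinear(values, names, threshold=0.7):
    """Return names of variables whose pairwise |r| stays below threshold.

    values: 2-D array with one row per sample and one column per variable.
    Assumption: variables are visited in column order, so earlier columns
    take priority over later, correlated ones.
    """
    corr = np.corrcoef(values, rowvar=False)  # Pearson correlation matrix
    kept = []
    for j in range(len(names)):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


# Synthetic data for illustration only: the second column is a noisy copy
# of the first, and the third is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
values = np.column_stack([a, a + 0.05 * rng.normal(size=500), b])
names = ["var_a", "var_a_copy", "var_b"]

selected = filter_collinear(values, names)
print(selected)
```

In practice, the order of the columns could reflect ecological relevance, so that the more interpretable variable of each correlated pair is the one retained.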