DNA Barcoding, and Phylogenetic Placement of the Genus Euphorbia L. (Euphorbiaceae) in Egypt

(1) Background: The genus Euphorbia L. in Egypt is represented by 40 species, one subspecies, and three varieties which are distributed in almost all phytogeographical regions in Egypt. The genus is well known for its medicinal importance; however, various and sometimes anomalous morphological characters make the identifincation of the genus a dificult case. (2) Methods: In this study, six DNA markers: matK, rbcL, ETS, trnL intron, trnL spacer, and the entire ITS region (ITS1 + 5.8S + ITS2), as well as subunits ITS1 and ITS2 were evaluated singly and in combination to investigate their usage as potential DNA barcodes. The Maximum Likelihood (ML) and BLASTn analyses were conducted for 37 individuals representing 26 species of Egyptian Euphorbia. (3) Results: The BLASTn comparison of the newly generated DNA sequences of the Egyptian Euphorbia species showed that ITS, ITS1 and ITS2 subunits displayed high levels of species discrimination. On the other hand, the ML analysis of the DNA sequences of trnL intron yielded a better resolved phylogenetic tree, compared to the other regions. However, our phylogenetic analysis based on DNA sequences of other markers: matK, rbcL, trnL, and the entire ITS region, with additional sequences from GenBank have shown that E. dracunculoides, E. hyssopifolia, E. lasiocarpa and E. granulata are probably not monophyletic. (4) Conclusion: This study along with the widest taxon coverage in Egypt, emphasizes the importance of using DNA markers for precise identification and phylogenetic placement of the genus Euphorbia in Egypt within the whole genus.


Introduction
Genus Euphorbia L. is one of the largest angiospermic genera of Euphorbiaceae; it includes around 2200 species and has a cosmopolitan distribution [1,2]. Despite its great vegetative diversity, Euphorbia is morphologically characterized by having a highly reduced cyathiate inflorescence [3], and based on its phytogeographical distribution, habit information, leaf morphology and venation patterns, stipules characters, inflorescences branching, and seed characters, Euphorbia has been divided into four subgenera: E. subgenus Esula Pers., E. subgenus Athymalus Neck. ex Rchb. Wheeler, E. subgenus Chamaesyce Raf., and E. subgenus Euphorbia [1,[3][4][5][6]. On the other hand, in Egypt, genus Euphorbia is represented by 40 species, one subspecies, and three varieties and considered as one of the largest genera in the Egyptian flora [7]. It is distributed in all phytogeographical areas of the country with different habits and habitats [8].
DNA barcoding is a novel, cost-effective and rapid taxonomic method to identify organisms by the use of short standardized gene region(s), where morphological identification is challenging or not possible due to the condition of material [9]. In this tudy, we explored the utility of DNA brcoding of Egyptian Euphorbia, because various and sometimes anomalous morphological characters make the identifincation of the genus a dificult case. Therefore, six DNA markers: matK, rbcL, ETS, trnL intron, trnL spacer, and the entire ITS region (ITS1 + 5.8S + ITS2), as well as subunits ITS1 and ITS2 were evaluated singly and in combination to investigate their usage as potential DNA barcodes.

Taxon Sampling, DNA Extraction, Amplification and Sequencing
Fieldwork in this study was conducted from 2018 to 2019. Specimens were collected from different phytogeographical regions in Egypt. All specimens were identified in the field using reproductive and vegetative characters by the authors, based on books Flora of Boulos [7,8]. We also compared our specimens with the dry specimens at the SCUH, ASTU, CAI, CAIM as weel as CAIRC Herbaria. All voucher samples are curated in the ASTU Herbarium; herbaria acronyms were following to Thiers [10].
The genomic DNA was extracted from fresh material using the Cetyltrimethylammonium bromide (CTAB) protocol with some modifications. The PCR amplification performed in 15 μL volume containing 5 U/μL Taq DNA polymerase with 25 μM MgCl2, 10 μM of dNTPs, 10 μM of each primer. For the PCR profiles and primers we followed Baldwin and Markos [11] for the ETS region, White et al. [12] for the ITS region, Johnson and Soltis [13] for the matK region, CBOL [14] for the rbcL region, and Taberlet et al. [15] for the trnL-F region. Amplifications were conducted using an Applied Biosystems ® -VeritiTM 96-well thermal cycler. PCR products were sent to Eurofins Genomics, USA for purification and direct sequencing in both directions.

Sequence Editing, Alignment and Phylogenetic Analyses
Sequences were assembled and aligned using the Geneious alignment option in Geneious Pro 4.8.4 [16] and edited manually. All indels were scored as missing data. ITS1 (362 bp) and ITS2 (301 bp) regions were also extracted from the entre ITS alignments to see whether these regions are as useful as the entire ITS region. We chose our 20 outgroup taxa following Horn et al. [17].
Including outgroups and GenBank sequences, the final matK data matrix comprised 42 sequences, the rbcL data matrix comprised 98 sequences, the trnL intron data matrix contained 78 sequences, the trnL-F intergenic spacer data matrix contained 89 sequences, the ETS data matrix contained 23 sequences, the entire ITS data matrix contained 131 sequences, the ITS1 data matrix contained 124 sequences, the ITS2 data matrix contained 127 sequences, and the ITS + rbcL + matK + trnL intron + trnL-F spacer data matrix contained 147 sequences (Table 1). For the phylogeny and BLASTn analyses, we also concatenated the individual matrices using 'concatenate' option in Geneious Pro 4.8.4 [16]. The alignment details for the data matrices are provided in Table 1.
The substitution models for each of the individual genes were estimated using ModelFinder [18] implemented in IQ-TREE v.2.0 [19].
Eleven ML analyses were performed using RAxML version 8.2.12 [20] as implemented on CIPRES portal [21] (http://www.phylo.org/). We defined our outgroups and partitions for each dataset. Since RAxML does not support the best models for our datasets, we specified the GTRGAMMA model to each dataset. "Let RAxML halt bootstrapping automatically" option and was applied to each partition individually. Default maximum likelihood search options were selected. The best scoring trees with bootstrap values (BS) were saved. We used a cutoff of 50% to define support for "successful" resolution of monophyletic taxa.
For the total evidence, plastid and nuclear data matrices, Bayesian analyses were conducted using MrBayes 3.1.2 [22] as implemented on the CIPRES portal [21] (http://www.phylo.org/). MrBayes was run with four (one cold and three heated) Monte Carlo Markov chains (MCMC) and for ten million generations, sampling one tree in every 100 generations. This was repeated twice as independent runs, and the resulting parameter files were jointly visualized in Tracer [23]. Among the 100,000 trees obtained, the first 25,000 trees were discarded as "burn-in" and a maximum credibility tree and associated posterior probabilities (PP) were compiled using the remaining trees. The total evidence tree was visualized using the Interactive Tree of Life (iTOL) online tool (https://itol.embl.de/) [24].

BLAST Analysis
To evaluate the taxonomic resolution, all DNA regions were tested singly and in combination by running a BLASTn search in GenBank (Table 2). In addition to the 11 datasets used in the ML analyses, we created ten more datasets by concatenating individual datasets, and these were matK+ ITS1, matK+ ITS2, ITS1 + trnL intron, ITS1 + trnL-F spacer, ITS2 + trnL intron, ITS2 + trnL-F spacer, trnL intron + trnL-F spacer, ITS + trnL intron, ITS + trnL-F spacer, ITS + trnL intron + trnL-F spacer and ITS + matK + rbcL + trnL intron + trnL-F spacer (results not shown). However, since these analyses did not yield better results than the individual DNA barcodes, we did not include them in Table 2.
We used a cutoff of 90% species identity for the BLASTn similarity approach. Additionally, we 'BLASTed' our sequences again to see whether the first hit on the BLASTn results represented the correct identification [25]. Table 2. Identification success of DNA barcodes singly and in combination using Maximum Likelihood (ML) and BLASTn methods. Crosses (X) indicate the species was not monophyletic in the ML tree or no BLASTn results. For the ML analyses, BS values were included for each monophyletic taxon. For the BLASTn analyses, species identification cutoff results (percentages) were also indicated. One asterisk (*) indicates that the "first hit" on the BLASTn results was correctly identified (i.e., unambiguous identifications). Empty cells indicate that species was not sampled or only one taxon was sampled (ML analysis).
In terms of parsimony informative characters, compared to their lenghts, ITS1 and ITS2 showed the highest percentage (69% and 69.8%, respectively), followed by the entire ITS (55.5%), trnL-F spacer (37%) and trnL intron (31.5%). However, in terms of PCR and sequencing success, trnL-F spacer showed the highest PCR and sequencing success, followed by rbcL and ITS, respectively (Table  1). matK and trnL intron showed the lowest PCR and sequencing success. In terms of primer pairs used, while the matK and the ITS regions required two pairs of primers, others required only one pair of primers. Except the matK, trnL intron and the entire ITS regions, the length of all regions well suited for DNA barcoding (less than ~550 bp). On the other hand, in terms of alignment, in contrast to the rbcL, matK and the trnL-F spacer; the entire ITS (namely, ITS1 and ITS2 subunits) and the trnL intron regions were extremely difficult to align.
In terms of BLASTn and "first hit" BLASTn searches, ITS1, ITS2 and the entire ITS, were the most successful DNA barcodes, respectively, for the Egyptian Euphorbia (Table 2). Furthermore, both ITS1 and ITS2 were particularly successful in "first hit" BLASTn searches. Combining individual datasets did not improve neither the BLASTn search results, nor the "first hit" BLASTn search results (results not shown). Similarly, ETS nuclear region did not give any correct results in BLASTn and "first hit" BLASTn searches (please note that, for this reason we did not include the results of the ETS BLASTn search results in Table 2). Among the Euphorbia sequences, while E. paralias, E. helioscopia, E. hirta, E. hyssopifolia, E. prostrata and E. heterophylla always yielded correct identification for all DNA regions in the BLASTn searches (please note that only E. paralias and E. heterophylla were always correctly identified in the "first hit"); E. grossheimii (E. isthmia), E. falcata, E. chamaepeplus and E. forsskalii never yielded correct identifications ( Table 2).
Our ML results have shown that, in terms of retrieving monophyletic species trnL intron was the most successful DNA region, followed by the entire ITS, trnL-F spacer, ITS1 and ITS2, respectively ( Table 2). Both the matK and the rbcL regions yielded the lowest number of monophyletic species. Similar to the BLASTn search results, combining all regions (ITS + ETS + matK + rbcL + trnL intron + trnL-F spacer and 17 more combinations) did not resulted in better resolution (Table 2).
In terms of phylogenetic analyses, the GTR + G+I substitution model of molecular evolution was selected for the entire ITS region, the TIM3e + R4 model was selected for the matK and the ITS1 region, the SYM + I+G4 model was selected for the ITS2 region, the K3Pu + F+R2 model was selected for the rbcL region, the TIM2 + F+R3 model was selected for the trnL intron, and the GTR + F+R5 model was selected for the trnL-F spacer. Genus Euphorbia was monophyletic in only five of the analyses ( Table  2) (Table 2).

Discussion
Identification of morphologically challenging Egyptian Euphorbia using DNA barcodes would be extremely practical. However, our analyses have shown that the genus is a difficult DNA barcoding case, and species borders of some taxa within the genus are not clear (Table 2). While the rbcL + matK combination has been adopted as a standard DNA barcode for plants [14], to date several studies have shown that this standard combination is not applicable to many plant groups (e.g., [26]). Furthermore, while amplification and universal primer problems have been reported for the plastid matK region (e.g., [27]), low discrimination power has been reported for the rbcL region (e.g., [28]). Indeed, our results have shown that both the amplification and sequencing success were low for the matK region; however, in terms of primer pairs used, both the matK and the entire ITS regions required two pairs of primers. On the other hand, the rbcL region yielded lower parsimonyinformative characters compared to the other regions (Table 1). Furthermore, excepting the nuclear ETS region, the rbcL region was the least successful locus in our "BLASTn first hit" searches (please note that for the BLASTn searches, the rbcL region was not very different than the matK, trnL intron and trnL-F intergenic spacer results) ( Table 2). Therefore, sequencing the standard DNA barcode combination, namely rbcL + matK would be time and resource waste for the Egyptian Euphorbia.
Our results have shown that, in terms of BLASTn species discrimination, the ITS1 subunit was the most successful DNA barcode, followed by ITS2 and the entire ITS (Table 2). Furthermore, ITS1 and ITS2 subunits were particularly advantageous in the "first hit" BLASTn searches. However, in terms of retrieving monophyletic species in our ML analyses, particularly the ITS1 and ITS2 subunits were not as useful as the trnL intron and the entire ITS region ( Table 2).
Similar to the entire ITS region, the two components of the ITS region (i.e., ITS1 and ITS2) have been widely used in low-level plant systematic studies (i.e., genus and species level) due to their high nucleotide substitution rates. Furthermore, to date, several studies have shown that if there is an amplification or sequencing problem with the entire ITS region (e.g., requiring specific primers, PCR conditions and PCR additives, low PCR efficiency and difficulties in sequence recovery and alignment, particularly with degraded material), employing the ITS1 or ITS2 subunits as DNA barcodes are very practical, due to the existence of universal primers (i.e., has relatively conserved flanking regions) (please note that we did not test this for none of the loci), their short length, ease of amplification and sequencing even with highly degraded material (e.g., herbal medicine ingredients, museum and herbarium samples) [29][30][31]. On the other hand, several undesired qualifications, such as, possibility of fungal contamination, gene conversions, the presence of paralogous gene copies (i.e., incomplete lineage sorting) and cloning requirement in some cases, pseudogenes, recombination among copies, containing many indels (insertion-deletions) [29,[32][33][34][35][36] were reported not only for the entire ITS, but also for the ITS1 and ITS2 subunits. While our ITS sequences did not show double peaks in the chromatograms; yet, cloning may be required for other taxa, which could not be sequenced for the current study.
The trnL intron and the trnL-F intergenic spacer have also been frequently used in generic and specific level molecular taxonomy studies. In our ML analyses, trnL intron was the most successful DNA locus, in terms of retrieving monophyletic species, followed by trnL-F intergenic spacer and ITS (equally), and ITS1 (Table 2). Furthermore, both the trnL intron and the trnL-F intergenic spacer have advantages with respect to length (i.e., relatively short as a DNA barcode) and were easy to amplify and sequence with only one pair of primers However, the percentage of the parsimony-informative characters of the trnL intron and trnL-F intergenic spacer was much less than the ITS, ITS1 and ITS2 regions (Table 1), and in both the BLASTn and "first hit" BLASTn searches, these regions were not as successful as the ITS, ITS1 and ITS2 regions.

Conclusions
While in our BLASTn analyses, ITS, ITS1 and ITS2 subunits displayed high levels of species discrimination, in our ML analyses trnL intron yielded a better resolved phylogenetic tree, compared to the other regions. However, the unsatisfactory results of the trnL intron in the BLASTn analyses were noteworthy. Therefore, if there is not a complex evolutionary history, such as, paralogous sequences as a result of gene duplication and incomplete concerted evolution (i.e., cloning requirement) (e.g., [33]), we recommend using at least the ITS1, ITS2 or ITS regions as a DNA barcode for the Egyptian Euphorbia. Particularly, in the case of degraded material and/or sequencing and/or amplification problems, ITS1 and ITS2 subunits could be better options.