Genome Survey and Mitochondrial Genome Analysis of Wild Silkworm, Bombyx mandarina

Aimed to lay references for the whole genome sequencing and provide basic data for the genetic relationships between Bombyx mandarina and Bombyx mori, the genome of Bombyx mandarina is investigated and the mitochondrial genome is analyzed. Using the Illumina NovaSeq pair-end sequencing platform, a female Bombyx mandarina was sequenced, the genome size, heterozygosity and GC contend have been estimated based on the genomic survey analysis, and the genome of wild silkworm has been pre-assembled. Using the sequencing data, the mitochondrial genome sequences in the genome were extracted and assembled directly. As results, 25.8 GB clean data obtained, the estimated genome size of Bombyx mandarina is 456.5 Mb, the heterozygosity rate is 1.94%. After preliminary assembly, the scaffold N50 is 1 792 bp, scaffold number is 737055, contig N50 is 587 bp, contig number is 1477268. As assembly and annotation, the mitochondrial genome of wild silkworm is 15 662 bp, a total of 37 genes were arranged in the mitochondrial genome, which did not have a gene rearrangement. According the phylogenetic tree of mitochondrial genome, wild silkworm could be divided into north or south group according to the geographical source. Subgroup of Wild silkworm from the North China has the closest genetical relationship with the domestic silkworm. Since the genome of Qin-ba wild silkworm belong to the complex genome, integrating the second-, the thirdgeneration sequencing and Hi-C technology could be helpful to obtain high quality genomic of Bombyx mandarina. This study also supports the contention that domestic of silkworm descended from the northern of China, and implied that the Qin-ba mountain area could be one of the original regions of domestic silkworm.

The wild silkworm Bombyx mandarina (Lepidoptera: Saturniidae) is considered one of the important pests whose larvae feed mainly on mulberry trees. Wild silkworm could be divided into two geographical groups based on the chromosomal characteristics, one is the Chinese Bombyx mandarina that mainly distributed in China and Russian Far East that number of chromosomes is the same as Bombyx mori (2n = 56), and the other is the Japanese Bombyx mandarina that mainly distributed in Japan and South Korea that number of chromosomes is different from Bombyx mori (2n = 54) (Arunkumar et al., 2006;Nakamura et al., 1999;Shimada et al., 1995). Research shown that silkworms may have been initially domesticated in China as tri-moulting lines, then subjected to independent spreads along the Silk Road that gave rise to the development of most local strains, and further improved for modern silk production in Japan and China, having descended from diverse ancestral sources (Lu et al., 2002;Pan et al., 2008;Sun et al., 2012;Xia et al., 2009;Xiang et al., 2018). Without artificial selection and live in the wild, wild silkworm preserved abundant genetic diversity, and shown significant differences in cocoon quality, disease resistance, stress tolerance, growth and development when compared with domestic silkworm, so it could be germplasm resources of genetic improvement of domestic silkworm (Fang et al., 2020;Fang et al., 2015;Xiang et al., 2013;Zhang et al., 2015). The origin and domestication of Bombyx mori and Bombyx mandarina are basic scientific issues in sericulture, and also involve in the inheritance of the silk culture. Therefore, clarifying the genetic relationship and geographical origination of silkworms based on the nuclear genome and mitogenome has important sense to sericulture development and scientific research (Goldsmith et al., 2005;Liu and Lu, 2018).
Mitochondria is a semi-autonomous organelle found in the majority of eukaryotic cells (Boore J L, 1999;Russell et al., 2020). Mitochondrial genomes (mtDNAs) are semi-independently replicating entities in mitochondria and also the only extra-nuclear genetic material in lots of animals. The mtDNA in most insects is typically a double-stranded circular molecule of 14~20 kb in length. The number of coding genes in mtDNA is thirty-seven that include twenty-two transfer RNA, two ribosome RNA and thirteen protein-coding genes (Boore, 1999;Pan et al., 2008). The mitogenome is widely used for population genetic structure, phylogenetic and phylogeography study for its characteristic of brief molecule structure, highly conservation, maternal inheritance, rapid evolutionary rate, less frequent reorganization and high copy number (Simon et al., 2006;Chan et al., 2019;Min-Shan et al., 2018). The traditional methods to obtain mitogenome sequence are mainly based on sanger sequencing of PCR amplification. However, due to the lack of universal primer in many species, and the existence of special structures such as repetitive sequences and high AT content, it is difficult to obtain the full-length sequence of mtDNA by traditional methods. Therefore, high-throughput sequencing combined with mature bio-informatics methods have facilitate the assembly and annotation of mitogenome effectively, and become the most main technical way to obtaining the complete mitogenome .
Genome survey is an analysis to obtain bio-information such as heterozygote rate, GC content and genome size with low coverage high-throughput sequencing. It provides the basis for appropriate sequencing strategy and fine mapping of whole genome. For non-model organisms, the genome survey analysis is the basis for molecular mechanism study and genomic resources exploration. In this work, the genome size, GC content and heterozygote rate of Qin-ba wild silkworm were analyzed by Illumina NovaSeq high-throughput sequencing platform, and the mitogenome was assembled based on the sequencing data. The aim of this work is to provide a theoretical basis for whole genome sequencing, and the cluster analysis of silkworms from different geographical origins could contributes to a better comprehension of the genetic diversity and domestication of silkworm. This work could offer theoretical references for utilization of wild silkworm and inheritance of silk culture.

Sequencing data statistics
A total of 26.8 Gb sequence data were generated from the small-insert (300 bp) library of Bombyx mandarina. A total of 25.8 Gb clean bases were generated after the sequence data was filtered, low-quality and duplication sequences were trimmed, with 92% Q20 bases and 38.0% GC content which was used for assembly.

prediction of genome size, heterozygosity ratio
The estimation of genome size and heterozygosity ratio were based on the K-mer analysis ( Figure 1). Statistical analysis showed that the total number of K-mers was 22 366 412 989. The results shown that the 17-mer frequency distribution curve exhibited two peaks at depths of 24 and 49, respectively. Using the formula of genome size = total K-mer number / peak depth, the genome size of Qin-ba wild silkworm was estimated to be 456.5 MB. The first peak observed at 1/2 of peak depth displayed a high level of heterozygosity for this wild silkworm sample. Simulation analysis using the Arabidopsis thaliana genome revealed that it had a 1.94% heterozygosity rate, represent that the genome belonging to the complex genome with higher heterozygosis rate.

Preliminary genome assembly and GC content-sequencing depth analysis
SOAP denovo program was employed to preliminary genome assembly after filtered and removed low quality data reads (Table 1). The results shown that the software produced a contig with the N50 of 587 bp, the longest contig length of 23.641 kb, and the total length of 540.745 MB, while sequence with a scaffold N50 length of 1792 bp, the longest scaffold length of 64.381 kb, and the total length of 649.282 Mb.
The statistics of GC content and read depth were shown as Figure 2. It represents that most plots scatted in the area of GC 30~40% and depth 40~60, implied that there is no contaminant in the sequencing reads, and the heterozygosity result the read-depth clusters divided into two obvious layers. Ten thousand scaffolds were random selected to align to the NCBI Nucleotide (Nt) database. Its shown that 51.17% of them matched with Bombyx

Assembly and annotation of Mitochondrial genome
The full-length mitogenome of Bombyx mandarina was assembled from the high-throughput sequencing data, shown as Figure 3. Its shown that the mitogenome size is 15 662 bp with 37 genes in it. The number of genes for protein coding, transfer RNA and ribosomal RNA are 13, 22 and 2 respectively, while the length of transfer RNA genes variated from 61~71 bp. The protein coding genes include one cytochrome b apoenzyme (CYTB), two ATP synthase subunits (ATP6 and ATP8), three cytochrome oxidase subunits (COX1, COX2 and COX3) and seven NADH dehydrogenase subunits (ND1, ND2, ND3, ND4L, ND4, ND5 and ND6). Twenty-five intergenic-and five overlapping-regions distributed in the mitogenome of Qin-ba Bombyx mandarina ( Table 2). The gene order of wild silkworm was the same as that in fruit fly and domestic silkworm, and no re-arrangement observed.

Phylogenetic analysis
For cluster analysis of wild-and domestic-silkworm, the mitogenomes of silkworms were downloaded from GenBank to constructed the phylogenetic tree, and Drosophila melanogaster, Antheraea pernyi and Rondotia menciana were set as the out group (Figure 4). The result shown that the wild silkworms could be divided into Japanese, southern or northern of china sub groups according to their origin geographical areas. Mitogenome of wild silkworm that from Shandong, Liaoning and Shaanxi belong to the sub group of northern china. The wild silkworms from northern china have the closer relationship with domestic silkworm than that of Japanese and southern china.

Discussion
For non-model organism without genomic data, study on the genomic features could pave the foundation for molecular mechanism research and gene resource utilization. Flow cytometry and Feulgen spectrophotometry are the common approaches for genomic size estimation, while high throughput sequencing is another faster route to complete genomic survey. In this work, we carried out genome survey of wild silkworm from Ankang city, Shaanxi province, via the high-throughput sequencing. The results shown that the estimated genome size is 456.5 Mb, and the heterozygosity ratio is 1.94%. The preliminary genome assembly produced a contig with the N50 of 587 bp, the longest contig length of 23.641 kb, and the total length of 540.745 MB, while scaffold N50 length of 1792 bp, the longest scaffold length of 64.381 kb, and the total length of 649.282 Mb. The complete mitogenome length is 15662 bp with 37 genes in it, no gene arrangement observed. As phylogenetic analysis based on the mitogenomes, the china wild silkworm could be divided into northern and southern geographic sub groups, and the northern sub group has the closest relationship with domestic silkworm.
In general, genome with heterozygosity ratio up than 0.8% and repeat sequence ration outweigh 60% could be considered as complex genome, then it is hardly to obtain high quality genome through the conventional sequencing and assembly technology (Gao et al., 2018). Our work represents that genome of Qin-ba wild silkworm belongs to complex genome, and the results of preliminary genome assembly through Illumina sequencing is not ideal. Studies shown that it is hardly for the second-or the third-generation sequencing to overcome the repeats in genomes, while high-throughput/resolution chromosome conformation capture technology (Hi-C) can depict three-dimensional global chromatin interactions across eukaryotic genome. Therefore, integrating the third-generation sequencing technology with the Hi-C could be great helpful to obtain high quality genome of wild silkworm.
The wild silkworm Bombyx mandarina is the wild ancestor of the domestic silkworm Bombyx mori, but there remain single-or multi-geographical origin hypothesis of domestication. Chen DB revealed that Chinese B. mandarina populations represented two genetically distinctive subtypes in line with the geographic boundary of northern and southern China based on the mitogenome analysis, and the true wild ancestor of domestic silkworm is northern Chinese Bombyx mandarina, rather than southern Chinese Bombyx mandarina (Chen et al., 2019). Our work supported this notion, and implied that the Qin-ba mountain area could be one geography region of the origin of domestic silkworm. With of NGS technology emergency and decreased sequencing cost, functional genomics research has been promoted greatly as more and more insect genomes have been analyzed (Zhao, 2016). The genomic research of wild silkworm could be conductive to the development of pest control. Otherwise, the comparative genomics study in wild-and domestic-silkworm could result in new discovery in domestic mechanisms and gene resources, and facilitate the new germplasm creation and wild resource utilization. In this work, we survived the genome of Qin-ba wild silkworm by using high-throughput sequencing technology, and analyzed phylogenetic relationships of silkworm based on the mitogenomes, could provide a scientific basis for the future research on the whole genome mapping of wild silkworm.

Experimental materials
The wild silkworm was collected from Raofeng town, Shiquan county, Ankang City, and raising in Shaanxi Key Laboratory of Sericulture more than six years as conventional mulberry culture. One female individual was used to extracted genome, and then conserved under ultralow temperature for use.

Sample extraction and Sequencing data generation
Genomic DNA was extracted from the pupal sample by using DNeasy Blood & Tissue Kit (Qiagen, Germany) following the manufacturer's protocol. UV spectrophotometry was used to measure the concentration and purity the DNA. Agarose gel electrophoresis was used to determine the integrity of the template, and then sheared randomly into small fragments by ultrasonic wave. Two paired-end libraries with an insert size of 300 base pairs (bp) were constructed from fragmented random genomic DNA following the Illumina manufacturer's instructions and sequenced by Illumina NovaSeq sequencing platform. The clean reads were obtained by removing reads containing adapter and low-quality reads from raw data with NGS-QC-Generator Toolkit. All the downstream analyses were based on these clean data.

Prediction of genome size, heterozygosity ratio and Preliminary genome assembly
K-mer analysis was used to predict the genome size and heterozygosity ratio before genome assembly. A K-value of 17 was used for the prediction, analysis, and iterative selection of 17-bp base sequences from the clean reads. SOAPdenovo software was used to carry out preliminary genome assembly after the short tips and low-quality sequences were filtered.

Analysis of GC content and GC-depth statistics
The reads obtained from sequencing all the libraries were aligned to the initially obtained contigs. The filtered reads were aligned to this assembled sequence using SOAP to obtain the base depth. A window size of 10 kb was used for non-repetitive advancement in the sequence and calculation of the mean depth and GC content of every window to generate a GC depth plot to examine whether there was significant GC bias or bacterial contamination in the sequences.

Assembly and annotation of mitochondrial genome
The mitogenome was built using the NOVOPlasty pipeline v. 2.6.3 with default setting, and the sequence with GenBank no. MG604734.1 set as the reference sequence. The newly assembled mitochondrial genome was annotated and visualized in the GeSeq web server using the invertebrate genetic code. The mitogenome of this work were uploaded to GenBank with the accession no. MN400656.

Phylogenetic analysis
The mitogenomes of 25 domestic silkworms and 18 wild silkworms were downloaded from GenBank. The phylogenetic tree based on neighbor-joining algorithm, and genetic distance analysis based on Kimura 2-parameter model were conducted by MEGA-X software. The mitogenomes of Drosophila melanogaster, Antheraea pernyi and Rondotia menciana were set as the out group.