Edited by: Guangchuang Yu, Southern Medical University, China
Reviewed by: Xiaofeng Huang, Cornell University, United States; Max Robinson, Institute for Systems Biology, United States
This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Detection of CNVs (copy number variants) and ROH (runs of homozygosity) from SNP (single nucleotide polymorphism) genotyping data is often required in genomic studies. The post-analysis of CNV and ROH generally involves many steps, potentially across multiple computing platforms, which requires researchers to be familiar with many different tools. To get around this problem and improve research efficiency, we present an R package that integrates the summarization, annotation, map conversion, comparison and visualization functions involved in studies of CNV and ROH. This one-stop post-analysis system is standardized, comprehensive, reproducible, time-saving, and user-friendly for researchers working on humans and most diploid livestock species.
Genome-wide data have accumulated for large numbers of individuals of various species as the cost of single nucleotide polymorphism (SNP) genotyping continues to decrease. In addition to using these data for genome-wide association studies (GWAS) or genomic selection (GS), valuable information about copy number variants (CNVs) and runs of homozygosity (ROH) can be inferred from these genotypes, and a range of software products [such as PennCNV (
We have observed several common “pitfalls” in CNV analyses based on SNP genotyping data. The most frequent is annotating the candidate genes in a CNVR (copy number variation region) without considering the frequency of the CNVs, which can give undue weight to rare CNVs that affect only one or two samples. A second issue is that comparisons of CNVs between studies are often made only at the population level and not at the individual sample level. Comparison at the population level can reflect the ubiquitous nature of CNVs, whereas comparison at the individual level additionally provides information about the robustness of CNV detection algorithms. A third issue arises when comparing CNVRs that have been detected using different reference genomes, which requires converting the coordinates of the regions between the two genomes. These conversions require careful consideration, because the order of SNPs on chromosomes might differ between two reference assemblies, such that the lengths or even the chromosomal order of CNVs can change, which might lead to meaningless comparisons between CNVRs. A fourth common problem is reporting an incorrect number of overlapping CNVRs when presenting comparison results in a Venn diagram. The number of overlapping regions depends on which result set is taken as the reference: a single long interval generated by one approach might overlap multiple shorter intervals detected by another, so representing the results in a Venn diagram requires special annotation.
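The Venn diagram pitfall can be made concrete with a small sketch. The Python code below is an illustration only and is not part of HandyCNV; the function names and the coordinates are made up for the example. It counts overlapping regions in both directions and shows that the two counts need not agree.

```python
from typing import List, Tuple

# (chromosome, start, end) with closed coordinates; hypothetical format
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query: List[Interval], target: List[Interval]) -> int:
    """Number of query intervals that overlap at least one target interval."""
    return sum(any(overlaps(q, t) for t in target) for q in query)


# Made-up CNVRs from two detection approaches: set A has one long region
# on chromosome 1 that spans three short regions in set B.
set_a = [("1", 100, 900), ("2", 50, 80)]
set_b = [("1", 150, 200), ("1", 400, 450), ("1", 800, 850), ("3", 10, 20)]

a_in_b = count_overlapping(set_a, set_b)  # regions of A that overlap B
b_in_a = count_overlapping(set_b, set_a)  # regions of B that overlap A

print(f"{a_in_b} of {len(set_a)} regions in A overlap B")
print(f"{b_in_a} of {len(set_b)} regions in B overlap A")
```

Because the two directions give different counts here, a single number in the intersection of a Venn diagram would be ambiguous unless it states which set it was counted from.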
Some steps are also easily overlooked when performing ROH analysis on SNP genotyping data. For example, the SNP density distribution may not have been carefully examined before ROH inference. SNP density may differ along the chromosomes and between SNP chips, and ROH detection methods are strongly affected by characteristics such as SNP density, window size, tolerance of occasional heterozygosity in the run, and the presence of missing values in the detection window. Knowing the SNP density can therefore help in selecting better parameters for ROH detection. Moreover, when reporting candidate genes from the functional annotation of genes located in ROH regions, the frequencies of haplotypes within these genes are often not examined. This step can reveal the high-frequency genotypes of these genes, which is useful for designing further validation experiments and provides a valuable reference for others comparing the same genes in different populations genotyped with the same SNP chips.
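As a rough illustration of what examining SNP density involves, the sketch below counts SNPs in fixed, non-overlapping windows along each chromosome. It is a generic Python example rather than HandyCNV code; the function name, the 1 Mb default window, and the map positions are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def snp_density(
    snps: Iterable[Tuple[str, int]], window: int = 1_000_000
) -> Dict[Tuple[str, int], int]:
    """Count SNPs per (chromosome, window index) for fixed-size windows.

    Window index k covers positions k * window .. (k + 1) * window - 1.
    Windows that contain no SNPs are simply absent from the result.
    """
    counts: Counter = Counter()
    for chrom, pos in snps:
        counts[(chrom, pos // window)] += 1
    return dict(counts)


# Made-up map entries: (chromosome, base-pair position)
snp_map = [
    ("1", 120_000),
    ("1", 640_000),
    ("1", 990_000),
    ("1", 3_200_000),
    ("2", 10_000),
]

density = snp_density(snp_map)
for (chrom, idx), n in sorted(density.items()):
    print(f"chr{chrom} window {idx}: {n} SNPs")
```

Windows with few or no SNPs (here, the second and third megabase of chromosome 1) are where a fixed SNP-count threshold for ROH detection is most likely to behave differently from the rest of the genome.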
There are several common requirements when studying CNV and ROH patterns in a new species or population. These include preparing summary tables, making summary figures, generating CNVRs and plotting CNVR distribution maps with gene annotations, comparing CNVs and CNVRs between studies, converting genome coordinates and map files from one reference to another, finding high-frequency abnormal genomic regions, creating consensus gene lists, producing custom visualizations of results, and identifying haplotypes in regions of interest. We therefore built this open-source tool to provide a standardized, reproducible, time-saving and widely available one-stop post-analysis system that makes research simpler, more practical and more efficient while avoiding common “pitfalls” that can affect the accuracy and interpretability of these studies.
The functions provided by this package can be categorized into five sections: Conversion; Summary; Annotation; Comparison; and Visualization. The most useful features provided are: integrating summarized results, generating lists of CNVRs, annotating the results with known gene positions, plotting CNVR distribution maps, and producing customized visualizations of CNVs and ROHs with gene and other related information on one plot (
Example plots illustrating the main functions and output from the HandyCNV package.
The conversion section handles the conversions of genomic positions between two reference genomes, and provides two functions.
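One way to picture such a conversion, and the checks it needs, is sketched below in Python. This is not the HandyCNV implementation: it assumes, purely for illustration, that each interval is identified by the names of its boundary SNPs and that a target map gives each SNP's position in the second assembly. The function name and all SNP names and positions are hypothetical.

```python
from typing import Dict, Optional, Tuple

Position = Tuple[str, int]  # (chromosome, base-pair position)
Interval = Tuple[str, int, int]  # (chromosome, start, end)


def convert_interval(
    start_snp: str, end_snp: str, target_map: Dict[str, Position]
) -> Tuple[str, Optional[Interval]]:
    """Look up an interval's boundary SNPs in the target assembly.

    Returns a status string and the converted interval, or None when the
    interval cannot be converted safely.
    """
    if start_snp not in target_map or end_snp not in target_map:
        return "missing_snp", None
    chrom_s, pos_s = target_map[start_snp]
    chrom_e, pos_e = target_map[end_snp]
    if chrom_s != chrom_e:
        # The boundary SNPs landed on different chromosomes.
        return "chromosome_changed", None
    if pos_s > pos_e:
        # SNP order is reversed in the target assembly.
        return "order_reversed", None
    return "ok", (chrom_s, pos_s, pos_e)


# Made-up target map for three intervals defined by boundary SNP names
target_map = {
    "snp1": ("1", 1_050),
    "snp2": ("1", 2_400),
    "snp3": ("2", 900),
    "snp4": ("1", 800),
}

print(convert_interval("snp1", "snp2", target_map))
print(convert_interval("snp1", "snp3", target_map))
print(convert_interval("snp1", "snp4", target_map))
```

Returning an explicit status instead of silently swapping coordinates keeps problematic intervals visible, so they can be reviewed or excluded before CNVRs from the two assemblies are compared.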
The summary section contains a group of functions to summarize CNV results, generate CNVRs, and make CNVR distribution maps from CNV results. There is also a collection of functions to summarize ROH results, report frequencies of ROH regions, inbreeding coefficient by different length groups and to generate haplotypes on interesting ROH regions.
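The idea of generating CNVRs from a CNV list can be sketched as merging overlapping CNVs and recording how many samples support each merged region. The Python code below is a minimal illustration under that assumption; it is not the algorithm used by HandyCNV, and the function name, record layout, and data are invented for the example.

```python
from typing import Dict, List, Tuple

# (sample ID, chromosome, start, end); hypothetical record layout
CNV = Tuple[str, str, int, int]


def merge_cnvs(cnvs: List[CNV]) -> List[Dict]:
    """Merge overlapping CNVs on the same chromosome into regions.

    Each region keeps the set of contributing samples and their count.
    """
    regions: List[Dict] = []
    # Sort by chromosome, then start, so overlapping CNVs are adjacent.
    for sample, chrom, start, end in sorted(cnvs, key=lambda c: (c[1], c[2], c[3])):
        last = regions[-1] if regions else None
        if last is not None and last["chr"] == chrom and start <= last["end"]:
            last["end"] = max(last["end"], end)
            last["samples"].add(sample)
        else:
            regions.append(
                {"chr": chrom, "start": start, "end": end, "samples": {sample}}
            )
    for region in regions:
        region["n_samples"] = len(region["samples"])
    return regions


# Made-up CNV calls
cnvs = [
    ("s1", "1", 100, 200),
    ("s2", "1", 150, 300),
    ("s3", "1", 500, 600),
    ("s1", "2", 100, 200),
]

cnvrs = merge_cnvs(cnvs)
# Keep only regions supported by at least two samples.
common = [r for r in cnvrs if r["n_samples"] >= 2]

for r in cnvrs:
    print(r["chr"], r["start"], r["end"], r["n_samples"])
```

Carrying the sample count with each region makes it straightforward to filter out regions seen in only one or two samples before gene annotation.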
The functions used for reporting CNV results include
The annotation section facilitates downloading and formatting reference gene lists, and annotating genes on genomic intervals.
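Annotating genes on genomic intervals amounts to an interval overlap query against a reference gene list. The following Python sketch shows the idea with a simple pairwise comparison; it is not HandyCNV code, and the function name, gene names, and coordinates are placeholders.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)
Gene = Tuple[str, str, int, int]  # (gene name, chromosome, start, end)


def annotate(
    intervals: List[Interval], genes: List[Gene]
) -> List[Tuple[Interval, List[str]]]:
    """Attach to each interval the names of genes it overlaps."""
    annotated = []
    for chrom, start, end in intervals:
        hits = [
            name
            for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom and g_start <= end and start <= g_end
        ]
        annotated.append(((chrom, start, end), hits))
    return annotated


# Placeholder reference gene list and regions of interest
genes = [
    ("GENE_A", "1", 120, 180),
    ("GENE_B", "1", 290, 400),
    ("GENE_C", "2", 50, 90),
]
regions = [("1", 100, 300), ("1", 500, 600), ("2", 100, 200)]

annotation = annotate(regions, genes)
for region, names in annotation:
    print(region, names if names else "no genes")
```

The pairwise loop is adequate for a sketch; for genome-scale gene lists, a sorted or indexed interval structure would be the more usual choice.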
The comparison section consists of functions for comparing sets of CNVs (
Finally, twelve functions in HandyCNV are included in the visualization section; of these, five produce plots as a subset of their output, and have been mentioned previously:
The recommended pipeline contains 14 basic steps depending on the study purposes (
Pipeline of post analysis of CNV results using HandyCNV.
The pipeline for the post analysis of ROHs contains eight basic steps (
Pipeline of post-analysis of ROH in HandyCNV.
We now provide two example runs of the pipeline, using two previously published data sets: the first is a CNV list produced for a human population in Brazil (
The CNV results in this example were taken from a study published in 2020 that comprised 268 microarray samples from a human population in Brazil (
Analytical steps of example 1.
To replicate this example, we first need to download the dataset “Table S1 – Detailed information about all CNVs analyzed in our sample” (
The main outputs of example 1. Panel
A formatted, clean CNV list is returned as an object named “clean_cnv” in the working environment, and a brief summary table of CNV (see
We then take a quick look at the CNV distribution by reading the “clean_cnv” list as input and customizing parameters in
The CNV summary plot (see
For gene annotation steps, the reference gene list can be downloaded and formatted by assigning the genome version argument in
Finally, we can extract Sample IDs of CNVs that contain genes of interest (see
Since this example contains only one CNV result on one reference genome, the functions in the comparison and conversion sections are not applicable here. Users interested in these functions can browse the package vignette in the GitHub repository (
The genotype data used to detect ROH in this example is from the work of
Analytical steps of example 2.
To run this example, we first need to prepare the genotype data. The genotype files are read using the
The main outputs of example 2. Panel
Then, we invoke Plink 1.9 (
Once we get ROH results, we can run
In this example, we present visualizations of ROH on the whole of chromosome 22 (see
The horse reference gene list (“quaCab2”) was downloaded from the UCSC website (
Obtaining the haplotypes of genes requires phased genotype files. Here, we take chromosome 1 as an example to show how to use Plink 1.9 (
Finally, we take
Here we present a freely available, open-source R package called HandyCNV, which provides a comprehensive set of functions to summarize and visualize CNV and ROH (runs of homozygosity) results detected from SNP genotyping data.
Many good software packages have been developed for the detection of CNV and ROH from SNP chip data [such as PennCNV (
There are some limitations to this package. For example, the
The current release of HandyCNV is version 1.1.6, which can be installed in the R environment using the following code: remotes::install_github(repo = "JH-Zhou/HandyCNV@v.1.1.6"). The current development version can be found at the GitHub repository (
Publicly available datasets were analyzed in this study. This data can be found here: The human CNV lists used in Example 1 can be found in “Table S1 – Detailed information about all CNVs analyzed” at
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements. Ethical review and approval was not required for the animal study because no animal sampling, experiments or phenotype measurements were carried out in this study. The genotype data used in this analysis are from previous studies.
JZ conceived the analysis, compiled the package, and wrote the manuscript. LL contributed to code writing and testing, and reviewed the manuscript. TL contributed to package testing, proofreading of the manuscript, and vignette. DG and YS provided instruction for analysis, reviewed the manuscript, manual, and vignette. All authors contributed to the article and approved the submitted version.
TL is employed by Livestock Improvement Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
JZ was funded by the China Scholarship Council. YS was supported by the China Agricultural Research System of MOF and MARA.
We thank the two reviewers for their valuable comments, which have improved the scalability of the functions and the structural integrity of this paper. We also thank bioRxiv for accepting an earlier version of this manuscript as a preprint, and the GitHub platform for providing a place to store open-source code, which helped to promote our study to more users at an early stage. This package depends on several independently developed R packages, such as the Tidyverse family (
The Supplementary Material for this article can be found online at: