ADaCGH is a web-based tool for the analysis of aCGH data. We focus on the problems of calling gains and losses and the estimation of the number of copy changes. We include all the methods that have been shown in recent reviews (Willenbrock and Fridlyand, 2005; Lai et al., 2005) to be the best performers or among the best performers, as well as some methods with which we have had some previous positive experiences.
Most of the methods implemented have their very own ways of displaying results. To make it easier for the user, we have tried to provide a unified graphical output, while at the same time being faithful to the original approaches of the authors (e.g., "plateau plots" are included for the binary segmentation method).
Many of the methods have a variety of parameters. For the web-based application, we have tried to simplify as much as possible, setting most parameters to their default values, for several reasons. First, comparative analyses of performance often use the default parameters, so any assertion such as "method X works well" really means "method X with its default parameters works well". Second, after over a year of operation, we've found that most users do not change the default values we provide, but tend to find it confusing to face so many choices. Third, the interpretation of these default parameters is either difficult or completely abstract, requiring a good understanding of the statistical basis of the method (i.e., it precludes playing casually with parameters). Fourth, if you really want to play with parameters (maybe for research), all of our code is freely available. There, you can set parameters and modify settings as you please, in a fully programmable way.
It is common for genes/clones that map to the same location to appear several times in an array. Often, there are several spots for each clone, and there are several clones for each gene. Thus, the very first part of our analyses is computing the average log2 ratio for each gene/clone. The identifiers used to compute the average are those resulting from concatenating the name and position of each gene/clone.
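A minimal sketch of this averaging step, in Python, may help. It is an illustration only, not the code ADaCGH runs: the function name `average_replicates`, the tuple layout, and the "_" separator used to build the identifier are all assumptions.

```python
from collections import defaultdict


def average_replicates(spots):
    """Average log2 ratios over spots that share name and position.

    spots: iterable of (name, position, log2_ratio) tuples (assumed layout).
    Returns a dict mapping the concatenated identifier to the mean log2 ratio.
    """
    groups = defaultdict(list)
    for name, position, value in spots:
        # Identifier = name and position concatenated (separator is an assumption).
        groups[f"{name}_{position}"].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


spots = [("g1", 100, 0.2), ("g1", 100, 0.4), ("g2", 250, -1.0)]
averaged = average_replicates(spots)
```

Here the two spots for "g1" at position 100 collapse into a single entry, while "g2" is left as is.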
It is up to you to decide whether to work at the clone or the gene level. There are a variety of trade-offs involved here. We suggest you do some pre-processing before sending data to this application. If you have multiple spots per clone and missing data, we would probably suggest that you impute and then average replicates from the same spot using, for instance, our preprocessor tool.
None of the methods currently implemented in ADaCGH use the distance between clones in the analyses, but some of the ones we will include in the future will. For now, we take as the position of the clone/gene the mid point of that clone/gene. Other approaches are available and reasonable; this one has at least the nice feature of being invariant to where you start (p or q end). We use the mid point to order the clones/genes along a chromosome. In the very unlikely event that two or more clones/genes have the same mid point, a small uniform random variate is added to each of the tied mid points to break the ties.
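The ordering and tie-breaking just described can be sketched as follows. This is a hedged illustration, not the ADaCGH implementation: the function name `order_by_midpoint`, the size of the random variate (`jitter`), and the fixed seed are assumptions.

```python
import random


def order_by_midpoint(clones, jitter=1e-3, seed=0):
    """Order (name, start, end) records from one chromosome by mid point."""
    rng = random.Random(seed)
    # The mid point is the same whichever end is called "start".
    mids = [(start + end) / 2.0 for _, start, end in clones]
    # Count how many clones share each mid point.
    counts = {}
    for m in mids:
        counts[m] = counts.get(m, 0) + 1
    placed = []
    for (name, _, _), m in zip(clones, mids):
        if counts[m] > 1:
            # Tie: add a small uniform random variate (its size is an assumption).
            m += rng.uniform(0.0, jitter)
        placed.append((name, m))
    return sorted(placed, key=lambda item: item[1])


clones = [("c", 300, 400), ("a", 100, 200), ("b", 100, 200)]
ordered = order_by_midpoint(clones)
```

Clones "a" and "b" share the mid point 150, so each gets a tiny random offset; clone "c" keeps its exact mid point of 350.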
Centering is done on a per-array basis. Some methods do require that data be centered, and for others centering simplifies interpretation. We often prefer centering by subtracting the median, to avoid effects of outliers on the mean. If your data come from a normalization program such as DNMAD, the effect of centering will often be very small.
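As a small illustration of median centering on a single array (a sketch only; the function name `center_array` is hypothetical):

```python
from statistics import median


def center_array(log_ratios):
    """Subtract the array's median so that the centered median is zero."""
    med = median(log_ratios)
    return [x - med for x in log_ratios]


# One extreme value (5.0) barely affects the median, unlike the mean.
centered = center_array([0.1, 0.2, 5.0])
```

With the mean, the outlier 5.0 would shift every value down by about 1.77; with the median, the shift is only 0.2.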
Note, however, that you might prefer to carry out the centering outside ADaCGH, using certain clones that you know (how?) are not changed. As well, the X and Y chromosomes might require "special" treatment. For instance, if you have some samples that are XXXX, you probably do not want to center the data using the log-ratios from all the clones, since those from the X chromosome will be shifting the mean up artificially. One approach is to exclude from all analyses the sex chromosomes; this is easy if, e.g., you change the X and Y by any other letter (ADaCGH will exclude from all analyses any chromosome with a name that is different from 1 to 24 or X or Y).
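The exclusion rule in the last sentence can be pictured with a short sketch. It assumes that each row starts with the chromosome name stored as text; the names `ACCEPTED` and `keep_accepted` are hypothetical and this is not the ADaCGH code.

```python
# Accepted chromosome names: 1 to 24, X, and Y (as stated in the text).
ACCEPTED = {str(i) for i in range(1, 25)} | {"X", "Y"}


def keep_accepted(rows):
    """Keep only rows whose first field is an accepted chromosome name."""
    return [row for row in rows if str(row[0]) in ACCEPTED]


rows = [("1", 0.1), ("X", 0.3), ("Z", 0.9), ("25", 0.0)]
kept = keep_accepted(rows)
```

Renaming X to, say, "Z" is therefore enough to drop those clones from all analyses.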
This method tries to split each chromosome into contiguous regions of equal copy number. Olshen and Venkatraman reframe the problem as one of finding change-points, those places where the copy number changes. For each chromosome, the (ordered) data are split recursively until no further change points can be found. Determining if a particular point corresponds to a change point is done using a test whose reference (null) distribution is found via permutation. This method is described in detail in Olshen et al., 2004 and Lucito et al., 2003.
Following the results in Willenbrock and Fridlyand (2005) as well as recent applied publications with Fridlyand and Olshen as coauthors, we use CBS with the mergeLevels algorithm. This can reduce the number of final "states" returned, and provides a way to map from segments to "loss/gain/no-change" states. We use the mergeLevels algorithm as implemented in the Bioconductor package aCGH, maintained by J. Fridlyand. We have post-processed the mergeLevels output a little bit, so that the output has only three levels, -1, 0, 1, that you might want to map into "loss", "no change", and "gain". Note that we have made the assumption, in this mapping, that the "no change" class is the one that has the absolute value closest to zero, and any other classes are either gains or losses. When the data are normalized, the "no change" class should be the most common one.
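The post-processing into three levels can be sketched as below. This is a simplified illustration of the stated assumption (the level closest to zero is "no change"), not the actual ADaCGH code; `levels_to_states` is a hypothetical name and the input is assumed to be one merged level per clone.

```python
def levels_to_states(merged_levels):
    """Map merged segment levels to -1 (loss), 0 (no change), 1 (gain)."""
    # "No change" is taken to be the level with absolute value closest to zero.
    baseline = min(set(merged_levels), key=abs)
    states = []
    for level in merged_levels:
        if level == baseline:
            states.append(0)
        elif level > baseline:
            states.append(1)
        else:
            states.append(-1)
    return states


states = levels_to_states([-0.6, -0.6, 0.02, 0.02, 0.02, 0.5])
```

Note that under this rule a data set that was poorly centered could have its true "no change" level assigned to gain or loss, which is why normalization matters here.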
This method uses Hidden Markov Models (HMM). HMMs are a natural model for this type of data. Several homogeneous HMMs are fitted (with different numbers of underlying states) and then model selection, using AIC, is employed to select the number of states. Finally, and following the results in Willenbrock and Fridlyand (2005), mergeLevels is applied to the outcome.
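Model selection by AIC amounts to picking the candidate with the smallest value of 2p - 2 log L, where p is the number of free parameters and log L the maximized log-likelihood. A minimal sketch follows; the function name, the tuple layout, and the numbers are made up for illustration, and fitting the HMMs themselves is left to the aCGH package.

```python
def select_by_aic(candidates):
    """Return the number of states of the candidate with the smallest AIC.

    candidates: list of (n_states, log_likelihood, n_parameters) tuples
    (assumed layout; the values would come from the fitted HMMs).
    """
    def aic(candidate):
        _, loglik, n_params = candidate
        return 2 * n_params - 2 * loglik

    return min(candidates, key=aic)[0]


# Hypothetical fits with 1, 2, and 3 states: AIC = 244, 212, 222.
chosen = select_by_aic([(1, -120.0, 2), (2, -100.0, 6), (3, -99.0, 12)])
```

The three-state model fits slightly better than the two-state one, but not enough to pay for its extra parameters.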
This method, developed by Marioni et al. (2006), is similar to the above, except that the models used are non-homogeneous HMMs that can incorporate the distance between successive probes. Several non-homogeneous HMMs are fitted (with different numbers of underlying states) and then model selection, using AIC, is employed to select the number of states. Finally, mergeLevels is applied to the outcome.
Picard et al. (2005) model aCGH data as a random Gaussian process with abrupt changes (in the mean and, possibly, the variance). We only use the model with changes in the mean, following Picard et al. The key problem then becomes one of selecting the number of segments (and their location). We use the tilingArray implementation with additional code that follows Picard et al.'s suggestion for selecting the number of segments. In particular, the model with common variance allows detecting segments with one observation (i.e., singletons).
The original paper includes no details on mapping the segmentation results to the "gain/loss/no-change" classes. We thus use mergeLevels on the output. With this approach, CGHseg is one of the best overall performers (on par with Circular Binary Segmentation) in our comparison of several methods for aCGH analysis (see Supplementary Material to Rueda and Diaz-Uriarte for further discussion).
Using CGHseg requires setting one threshold for the adaptive penalization. (Briefly, this threshold is used to determine when the second derivative of the likelihood w.r.t. the number of segments is not large enough to justify further segments; see p. 13 of Picard et al. (2005).) The default value used in the original reference is -0.5. However, our experience with the simulated data in Willenbrock and Fridlyand (2005) indicates that, for those data, values around -0.005 are more appropriate (see Supplementary Material to Rueda and Diaz-Uriarte for further discussion). We recommend that users play around with the threshold until a reasonable value is found. It only makes sense for this value to be negative (even if very close to 0).
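To give a feeling for how such a threshold acts, here is a rough sketch of a second-difference rule. It is written under explicit assumptions and is not the ADaCGH or tilingArray code: the log-likelihoods are rescaled to run from 1 to Kmax, the discrete second difference is used in place of the second derivative, and the largest number of segments whose second difference is still below the threshold is chosen. The function name and the input values are made up.

```python
def select_num_segments(loglik, threshold=-0.5):
    """Pick the number of segments K from log-likelihoods for K = 1..Kmax.

    Assumed rule: rescale the log-likelihoods to [1, Kmax], take second
    differences, and return the largest K whose second difference is
    below `threshold` (1 if there is none).
    """
    kmax = len(loglik)
    span = loglik[-1] - loglik[0]
    if kmax < 3 or span == 0:
        return 1
    norm = [(kmax - 1) * (value - loglik[0]) / span + 1 for value in loglik]
    best = 1
    for k in range(2, kmax):  # K = 2 .. Kmax - 1 (1-based)
        second_diff = norm[k - 2] - 2 * norm[k - 1] + norm[k]
        if second_diff < threshold:
            best = k
    return best


# Hypothetical log-likelihoods: large improvements up to K = 3, small after.
loglik = [-100.0, -60.0, -30.0, -28.0, -27.0, -26.5]
k_strict = select_num_segments(loglik, threshold=-0.5)
k_loose = select_num_segments(loglik, threshold=-0.005)
```

With these made-up values, a threshold of -0.5 keeps 3 segments, whereas -0.005 accepts the small late improvements and keeps 5, which illustrates why the choice of threshold deserves some experimentation.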
Hupé et al. build upon work of Polzehl and Spokoiny on non-parametric likelihood methods for breakpoint detection. Hupé et al. include a final merging algorithm, specific for their method, that allows mapping results to the "gain/loss/no-change" status; we use this approach with GLAD.
This method uses wavelets for denoising the data. First, a Haar wavelet transform is applied to the log-ratios; next, the wavelet coefficients are shrunken or thresholded; then, the signal is reconstructed (inverse wavelet transform), leading to the smoothed or denoised data; finally, the smoothed data are clustered (using the "partitioning around medoids" method of Rousseeuw and coll.). Both the wavelet-based smoothing and the clustering are carried out chromosome by chromosome. This method is described in Hsu et al., 2005.
Any clusters that are closer together than minDiff are collapsed together to form a single cluster. We use as default value 0.25, the same default as used by Hsu et al., but this is something you might want to experiment with, depending on the variability in your data, ploidy of samples, etc.
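One possible reading of this collapsing step is sketched below. It is only an illustration: how the original code measures distance between clusters and what value the merged cluster takes are not described here, so the use of sorted cluster centers, chaining of neighbours, and the mean of the merged centers are assumptions, as is the name `collapse_clusters`.

```python
def collapse_clusters(centers, min_diff=0.25):
    """Merge cluster centers that are closer together than min_diff."""
    groups = []
    for center in sorted(centers):
        if groups and center - groups[-1][-1] < min_diff:
            # Closer than min_diff to the previous center: same cluster.
            groups[-1].append(center)
        else:
            groups.append([center])
    # Represent each merged cluster by the mean of its centers (assumption).
    return [sum(group) / len(group) for group in groups]


collapsed = collapse_clusters([-0.5, 0.0, 0.1, 0.6], min_diff=0.25)
```

With the default of 0.25, the centers at 0.0 and 0.1 merge into one cluster; with a much smaller value all four clusters would be kept.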
The original paper does not map to a set of "gain/loss/no-change" levels. We have followed the same approach as in CBS, and use here the mergeLevels procedure. It must be emphasized that this is an experimental procedure, not described in the original paper. Moreover, the wavelet-smoothing procedure returns smoothed values that rarely fall into a set of categories, so applying mergeLevels directly to them often leads to nonsense results. Thus, we apply mergeLevels after running the original clustering procedure of this method with a very small threshold for merging (currently set to 0.05, or five times smaller than the default of 0.25); some preliminary trials show that the final outcome from mergeLevels is not sensitive to small variations around this threshold.
This is an approach that was developed by Price and collaborators (Price et al., 2005), and that uses the Smith-Waterman algorithm to detect "islands" of positive scores in a set of ordered real values. Before applying the Smith-Waterman algorithm, the data are thresholded and adjusted for sign (depending on whether we want to calculate copy gains or losses).
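The core idea (threshold, adjust for sign, then find the highest-scoring island) can be sketched for a single ordered sequence as follows. This is not T. Price's code: the function names are hypothetical, the order of sign adjustment and thresholding is an assumption, the threshold "median + 0.2 * MAD" is the default mentioned elsewhere in this help, and the MAD scaling constant 1.4826 follows the default of R's `mad` function.

```python
from statistics import median


def mad(values, constant=1.4826):
    """Median absolute deviation, scaled as in R's mad() by default."""
    med = median(values)
    return constant * median([abs(v - med) for v in values])


def best_island(log_ratios, direction=1, k=0.2):
    """Return (score, start, end) of the highest-scoring island.

    direction=1 looks for gains, direction=-1 for losses.
    """
    signed = [direction * x for x in log_ratios]
    threshold = median(signed) + k * mad(signed)
    scores = [x - threshold for x in signed]
    best = (0.0, None, None)
    running, start = 0.0, 0
    for i, score in enumerate(scores):
        if running <= 0:
            # The running sum dropped to zero or below: start a new island.
            running, start = 0.0, i
        running += score
        if running > best[0]:
            best = (running, start, i)
    return best


data = [0.0, 0.1, -0.1, 0.0, 1.0, 1.2, 0.9, 0.0, -0.1, 0.1]
score, first, last = best_island(data, direction=1)
```

In this toy example the island covers positions 4 to 6 (counting from 0), where the three clearly raised log-ratios sit.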
A permutation test is carried out to compute how likely it is that we find higher-scoring islands when there is no structure in the data. As well, a robustness calculation indicates the sensitivity of the localization of the highest-scoring island to the threshold value (the threshold value that was used in the very first step, before applying the Smith-Waterman algorithm).
This method is slightly more sensitive to a -1 copy number change (equivalent to a fold change in signal of 2) than to a +1 copy number change (fold change of 1.5), for a diploid sample (T. Price, pers. communication). Note also that this method does not distinguish between a copy number increase of +1 and a copy number increase of +10 (this is also the case for ACE).
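The asymmetry is easy to see on the log2 scale. The following small calculation assumes an idealized sample in which the signal is exactly proportional to copy number; the function name is hypothetical.

```python
import math


def expected_log2_ratio(copy_change, ploidy=2):
    """Idealized log2 ratio for a given change in copy number."""
    return math.log2((ploidy + copy_change) / ploidy)


loss_one = expected_log2_ratio(-1)  # 1 copy vs 2: log2(1/2)
gain_one = expected_log2_ratio(+1)  # 3 copies vs 2: log2(3/2)
```

A loss of one copy moves the log2 ratio by a full unit, while a gain of one copy moves it by less than 0.6 units, so losses stand out more clearly from the noise.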
The method implemented here is a small departure from the one described originally: here we use all the data for a subject/array for the thresholding, but subsequent analyses (finding islands, robustness analysis, and permutation tests) are carried out chromosome by chromosome. This should allow detecting cases where a complete chromosome is gained or lost, as well as cases where smaller regions are gained or lost. Note that this approach is probably not advisable for the X chromosome, at least for males. You probably want to analyze the X separately, since you only expect one copy of the genomic DNA in males. Finally, the function for calculating the threshold (the default used by T. Price originally, which we also use: median + 0.2 * MAD) was validated by Price and collaborators on a 75-probe dataset that spanned the telomeric 2MB of chromosome 16p, using subjects with well-characterized deletions in the region and healthy controls. But it has not been validated with whole-genome chips or other DNA sources. (If you want to play around with the code, the sources are available from http://www.well.ox.ac.uk/~tprice/cgh/). (I thank Tom Price for his discussion on these issues; some sentences here are copied almost verbatim from his emails.)
For the permutation test, Number of permutation is, well, the number of random permutations used. For the robustness calculation, Precision indicates the number of intervals used when dividing the threshold values between the lowest and highest possible. For plotting, Largest p-value shown is the largest p-value for which we want a region to be shown, in red, in the plot. (Note that this last is a parameter in our implementation of the plots, not in the original code). In contrast to the previous version of ADaCGH, we now use a Bonferroni-corrected p-value (we divide the p-value by the number of chromosomes).
The implementation by T. S. Price allowed the specification of additional parameters. We have used the default ones (thresholding function = median + 0.2 * MAD; robustness calculation with median and median + 0.4 * MAD for the lowest and highest threshold values, respectively).
This is the method implemented in CGH-Explorer and described both in the manual and in Lingjærde et al., 2005. This method does not focus on estimating the number of copy gains per gene, but rather on finding regions with evidence for copy number changes. The emphasis, therefore, is on hypothesis testing. The data are first smoothed using a 5-neighbor running mean. Then, genes are segmented into regions of the same type (gain or loss), and a gene is regarded as having an altered copy number if it belongs to a "significant" segment. Significance is determined using the length and height (mean expression) of each segment and comparing them with the null, and then obtaining an estimate of the positive FDR (sensu Storey and Tibshirani).
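The initial smoothing step can be pictured with a short sketch. It assumes that "5-neighbor" means a centered window of five values (the gene itself plus two on each side) and that the window is simply truncated at the ends of the sequence; both are assumptions about details not given here, and `running_mean` is a hypothetical name.

```python
def running_mean(values, window=5):
    """Centered running mean; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


smoothed = running_mean([0.0, 0.0, 5.0, 0.0, 0.0])
```

A single spike of 5.0 is spread over its neighbours, so isolated extreme values contribute less to the segments found afterwards.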
This method assumes that the data have been centered appropriately, so that the expected response from normal DNA is zero. You should be careful, though, with how the centering is done when you have samples from different sexes or with altered ratios of the sex chromosomes. (For instance, you might want to center your data using only the mean computed from genes that are not on the X or Y chromosomes).
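Centering with the mean of the non-sex-chromosome genes, as suggested in the parenthesis, could be done before uploading with something like the sketch below. The function name and the labels used for the sex chromosomes are assumptions; adapt them to your own data.

```python
def center_on_autosomes(chromosomes, log_ratios, sex_labels=("X", "Y")):
    """Center all values using the mean of genes not on the sex chromosomes."""
    autosomal = [x for chrom, x in zip(chromosomes, log_ratios)
                 if chrom not in sex_labels]
    mean_autosomal = sum(autosomal) / len(autosomal)
    return [x - mean_autosomal for x in log_ratios]


centered = center_on_autosomes(["1", "2", "X"], [0.0, 0.2, 1.0])
```

The high value on the X chromosome is still shifted like all others, but it no longer influences the amount of the shift.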
The only parameter is the (desired) FDR. As part of the calculations, ACE returns a table with the pFDR (positive FDR) and the corresponding number of genes regarded as having altered copy number (over all the samples). You should enter here the pFDR you want, and we will return the results (figures and table of genes and their state) for the closest FDR to the one you entered.
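Choosing "the closest FDR to the one you entered" is a simple lookup in the table returned by ACE. A sketch, with a made-up table and hypothetical names:

```python
def closest_fdr_row(table, desired_fdr):
    """Return the (pFDR, n_genes) row whose pFDR is closest to desired_fdr."""
    return min(table, key=lambda row: abs(row[0] - desired_fdr))


# Hypothetical table of (pFDR, number of genes called as altered).
table = [(0.01, 12), (0.05, 40), (0.12, 95)]
row = closest_fdr_row(table, desired_fdr=0.10)
```

Note that the closest available pFDR may be larger than the one you asked for (0.12 rather than 0.10 in this made-up case), so check the value actually used.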
We use the cghMCR function, as provided in the cghMCR bioconductor package, by J. Zhang and B. Feng, which implements an algorithm originally proposed in Aguirre et al. (2004). This algorithm should be applicable to other methods, although the original code is applicable only to the CBS method. We have extended the code, so it can be used with all methods (except for PSW). Note that you can find "common regions" that do not correspond to any region of gain/loss found by the mergeLevels algorithm (see examples).
Please give careful consideration to the parameters. In particular, note that "gapAllowed" should be set to reasonable values for your platform and data (gapAllowed has units!). The default value of 500 refers to distances of Kb. Thus, for MCR to yield sensible results you should use position data in Kb, or else adjust this threshold to the scale of your data.
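The following sketch shows why the units matter. It joins adjacent altered spans whenever the gap between them is below a threshold, which is only a simplified picture of what "gapAllowed" controls and not the cghMCR code; the function name and the example coordinates are made up.

```python
def join_spans(spans, gap_allowed=500):
    """Join (start, end) spans separated by less than gap_allowed."""
    joined = []
    for start, end in sorted(spans):
        if joined and start - joined[-1][1] < gap_allowed:
            # Gap below the threshold: extend the previous span.
            joined[-1] = (joined[-1][0], max(joined[-1][1], end))
        else:
            joined.append((start, end))
    return joined


spans_bp = [(1_000_000, 2_000_000), (2_300_000, 3_000_000), (9_000_000, 9_500_000)]
spans_kb = [(start / 1000, end / 1000) for start, end in spans_bp]

joined_kb = join_spans(spans_kb)  # the 300 Kb gap is below 500, so spans join
joined_bp = join_spans(spans_bp)  # the same gap is 300000 in bp: nothing joins
```

The same data give two altered spans when positions are in Kb and three when they are in base pairs, with the threshold left at its default of 500.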
(We are using version 1.4.0 of cghMCR, released on 2006-10-03. Note that there have been important changes in the arguments used by cghMCR relative to earlier versions, such as version 1.2.0, and thus there are changes in the ADaCGH interface.)
There are four parameters. We copy, verbatim, the help from the cghMCR function. gapAllowed: an integer specifying low threshold of base pair number to separate two adjacent segments, belower which the two segments will be joined as an altered span. alteredLow: a positive number between 0 and 1 specifying the lower threshold percential value. Only segments with values falling below this threshold are considered as altered span. alteredHigh: a positive number between 0 and 1 specifying the upper threshold percential value. Only segments with values falling over this threshold are considered as altered span. recurrence: an integer between 1 and 100 that specifies the rate of occurrence for a gain or loss that are observed across sample. Only gains/losses with ocurrence rate grater than the threshold values are declared as MCRs.
You can input the data either in a single file or using two files. If you use two files, one of them is the file with the aCGH data and the other the file with the position or coordinates or chromosomal location information. If you enter a single file, the first four columns of that file are identical to the columns of the location information file. All data are to be entered in tab-separated text-files. Any rows starting with a "#" are considered comments.
In the aCGH data file rows are variables (generally genes), and columns represent subjects, or samples, or arrays. If you want to name your arrays place a line that starts with "#", after the "#" put "Name" or "NAME" or "name" and write the array names (separated by tabs), but leave no space between "#" and "name" or whatever; thus, write "#name", NOT "# name". If you use two files, the first column of the aCGH file is assumed to contain the ID information for genes, marker, or whatever. This column need not be the same as the first column of the coordinate or position information file.
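To check a file against these rules before uploading, a small reader along the following lines can be useful. It is a sketch for the two-file case (first column holds the IDs), not the parser used by ADaCGH; the function name is hypothetical.

```python
def read_acgh(text):
    """Parse tab-separated aCGH data; returns (array_names, ids, rows)."""
    names, ids, rows = None, [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            fields = line[1:].split("\t")
            # "#name" (no space after "#") introduces the array names.
            if fields[0] in ("Name", "NAME", "name"):
                names = fields[1:]
            continue  # any other "#" line is a comment
        fields = line.split("\t")
        ids.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, ids, rows


example = "#name\tA1\tA2\n# a comment\ng1\t0.1\t-0.2\ng2\t0.0\t0.5\n"
names, ids, rows = read_acgh(example)
```

A line written as "# name" would be treated as an ordinary comment here, so the arrays would end up unnamed.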
More on names: please, do not use spaces in the names of your arrays: they lead to problems when generating the thumbnails for the figures.
We assume all the aCGH data are normalized log ratios or similar. We carry out no checking of this. You can normalize your data using our DNMAD tool, though how best to normalize aCGH data is very much an open question. Note that the base of the log is really immaterial (though base 2 logs are the ones commonly used).
Note that you can analyze data from a variety of platforms. You can even analyze Affymetrix SNP-based arrays. For example, the recent paper by Yu et al., 2007 uses Affymetrix SNP arrays and uses a variety of methods (including some implemented here) to analyze the data. Of course, before being able to apply any of these methods to the SNP data, the original data need to be processed with a suitable tool (see Yu et al., 2007 for examples and further references) that estimates SNP-level copy numbers. Once you have those processed data available, you can enter them into ADaCGH.
(Some extra details on the Affymetrix SNP data: the initial data are the intensities from each of the 20 probe pairs for each SNP (10 for the A allele and 10 for the B allele; each probe pair includes a perfect match and a mismatch probe). The data from each probe pair are then subject to different possible transformations/averages and compared with a normal reference set (transformations go from the relatively simple ones, basically involving averages, in Huang et al., 2004, to complex ones as in the CARAT method of Huang et al., 2006, which includes probe selection and regression on GC content and fragment length) before actually applying a segmentation algorithm to detect copy number alterations.)
You must enter information about the position of the genes/clones in your arrays. The file must contain, in this exact order:
Please note that we will not check that the columns you use are in the right order or use reasonable names. This means you can use any column names you like (e.g., instead of start you could write "inicio" or "comienzo"), but this also means that you have to make sure the order conforms to what is explained here.
Where can you get position information? There are several sources available. You might want to check our IDconverter tool.
If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the names in the output figures. This information is returned from IDClight based on that provided by our IDConverter tool.
Some of the methods implemented can deal with missing values in the data. In addition, since the first step is averaging values over identifiers, many missing values "disappear". However, we have done no testing of the procedures implemented with data sets with missing values, and it is likely that you would run into problems if your data sets contain missing values. We plan to lift this restriction in the future. For now, you are advised to impute the data; you can do this, for instance, with our preprocessor tool.
A common output for all methods are some simple summary statistics per subject/array and per subject by chromosome combination. These statistics are provided before and after centering. If there are any relevant warnings in the initial processing of data (e.g., clones with chromosome names not among the accepted ones), these are also provided here.
Most methods return segmented plots (with genome-wide and chromosome-wide views). In all the plots, the original data are shown using small orange dots. In plots that show a subject/array, the chromosome numbers are indicated in the x-labels, and a vertical dotted grey line separates chromosomes. The main output shows thumbnails; click on the thumbnails to expand. If methods return "gain/loss/no-change" status, spots inferred as gained are shown in red, and those inferred as lost in green.
For the Price-Smith-Waterman method, we provide island plots for each subject/array, but we show two plots, one for regions of gain and one for regions of loss. In each plot (e.g., gain, array 1) we show the original data, as well as horizontal red lines for segments with a significant (i.e., p < Largest p-value shown) island; the vertical location of the segments is irrelevant (they are always drawn at the same height). We also provide a line that indicates the robustness. You want to focus on red segments with high robustness. Recall that both the p-values and the robustness are calculated here over each chromosome (even when the thresholding is done using the complete genome). Note that the p-values are not corrected for multiple testing at all. (The original R function from T. Price included an indication of the highest-scoring island; that is not provided here, as we've found it potentially confusing when analyzing 23 chromosomes).
If you have specified that the organism is human, the chromosome-wide plots will include a link to the appropriate chromosome of the Database of Genomic Variants. Note that you do not need to specify, for this to work, the type of identifier, simply the organism. We think this is the most useful type of link, and should allow you to relate your results to the already documented variation in humans.
A file with the output of the analysis. We show several columns: name of gene, chromosome, start, end, and mid-point and, for each subject/array, two or three columns: the original data, the smoothed data, and, if returned by the method, the inferred state. In most cases, the latter are columns with only three possible values: 0 (no change), -1 (loss), 1 (gain).
Warning: for some methods (e.g., CGH segmentation) "state" is provided as it is returned by the method. However, the same states do not necessarily mean the same thing over different chromosomes or over different arrays.
For PSW, there are two results files, one for gains and one for losses. Each of these files has the same five initial columns as above (name of gene, chromosome, start, end, and mid-point) and, for each subject/array, the sign (+1 if we were looking for gains, -1 if looking for losses), the robustness of an island of gain/loss that includes that gene/clone, and the p-value of that segment.
We now provide clickable plots for the segmentation results. Clicking on the thumbnails will open the Genome View plots. If you move the mouse over the figure, you will be told what chromosome it is. If you click, you will be taken to the Chromosome View plot. On the Chromosome View plot, hovering over a point will show you the ID of that point. If you click on the point, you will place a small blue box with the ID in place. To close, click on the "X". If, in addition, you have provided information on the type of id and organism (see above) you can click on the link in the box with the ID, and you will be taken to IDClight for additional information on that clone/gene. Please note that for the "All_arrays" plots, you need to click on the line at the log-ratio of 0, not on a dot at any other height.
It is now possible to send the results to PaLS, for those cases where assignment of genes to "gain", "loss", and "no change" is provided (for now, ACE, CBS with merge, and PSW). PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it might be easier to make biological sense of your results, because you are "annotating" your results with additional biological information.
Scroll to the bottom of the main output, where you will find the PaLS icon and three lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep playing with the very same gene list, modifying the options.
An example with fictitious data is provided here. You can play with the figures, get the additional information from IDClight, etc. Additional examples (from a somewhat less interactive version) are also available there.
This program was developed by Ramón Díaz-Uriarte and Oscar Rueda-Palacio, from the Spanish National Cancer Research Centre (CNIO). This tool uses Python for the CGI and R for the computations. The R library underlying the Price-Smith-Waterman method was kindly provided by Tom Price. The R code for the wavelet-based smoothing method was kindly provided by Doug Grove. The binary segmentation approach uses the DNAcopy R/BioConductor package by E. S. Venkatraman and Adam Olshen. The ACE algorithm was implemented from scratch in R and C by Oscar Rueda Palacio using, as basis, the Java code and documentation from Lingjærde et al. (available here). The analyses for GLAD, HMM, and BioHMM use the BioC packages from their corresponding authors: GLAD, by P. Hupé; snapCGH, by Mike L. Smith, John C. Marioni, Steven McKinney, and Natalie P. Thorne, for BioHMM; and the aCGH package, by Fridlyand and Dimitrov, for HMM. For CGH segmentation we use the functionality provided by the tilingArray package by Wolfgang Huber. Some of the above functions might have been modified slightly by R. Díaz-Uriarte to parallelize computations. The mergeLevels algorithm is from the aCGH package, by Fridlyand and Dimitrov, and the minimal common regions functionality from the cghMCR package by J. Zhang and B. Feng. Our code and/or some of the above packages use the R packages waveslim, by Brandon Whitcher; cluster, by Martin Maechler, based on the S original by Peter Rousseeuw, Anja Struyf and Mia Hubert, and initial R port by Kurt Hornik; Hmisc, by Frank E Harrell Jr; CGIwithR, by David Firth; GDD, by Simon Urbanek; Rmpi, by Hao Yu; imagemap, by Barry Rowlingson; R2HTML, by Eric Lecoutre; and papply, by D. Currie. We have also used and modified the ToolTip script from Dynamic Drive as well as taken some ideas from overLIB, by Erik Bosrup.
Our understanding of aCGH data and their analysis has benefited greatly from discussions with Sara Alvarez de Andrés, Tom Price and Doug Grove, as well as with Philippe Hupé, Adam Olshen and Franck Picard.
This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, and Linux Virtual Server for web server load-balancing.
We want to thank the authors and contributors of these great (and open source) tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R foundation for statistical computing.
Funding provided by Fundación de Investigación Médica Mutua Madrileña and Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time (5 days, currently). Anybody can access those directories, but as the names of the directories are not trivial, it is not easy for a third person to access your data.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all; thus, it is also possible for somebody else to look at your data while you are uploading or downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the CNIO to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.
Aguirre, Andrew J. and Brennan, Cameron and Bailey, Gerald and Sinha, Raktim and Feng, Bin and Leo, Christopher and Zhang, Yunyu and Zhang, Jean and Gans, Joseph D. and Bardeesy, Nabeel and Cauwels, Craig and Cordon-Cardo, Carlos and Redston, Mark S. and Depinho, Ronald A. and Chin, Lynda. (2004). High-resolution characterization of the pancreatic adenocarcinoma genome. PNAS, 101(24): 9067--9072.
Fridlyand J, Snijders AM, Pinkel D, Albertson DG. (2004). Hidden markov models approach to the analysis of array cgh data. Journal of Multivariate Analysis, 90:132--153.
Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6:211-26.
Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH (2004). Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics, 1: 287--299.
Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, Shapero MH. (2006). CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics, 7: 83.
Hupe P, Stransky N, Thiery JP, Radvanyi F, and Barillot E. (2004). Analysis of array cgh data: from signal ratio to gain and loss of dna regions. Bioinformatics, 20:3413--3422.
Lai WR, Johnson MD, Kucherlapati R, Park PJ. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array cgh data. Bioinformatics, 21:3763--3770.
Lingjærde OC, Baumbusch LO, Liestøl K, Glad I, Børresen-Dale AL. (2005). CGH-Explorer: a program for analysis of CGH-data. Bioinformatics, 21: 821--822.
Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, West JA, Rostan S, Nguyen KC, Powers S, Ye KQ, Olshen A, Venkatraman E, Norton L, Wigler M. (2003) Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res. 13:2291-305.
Marioni JC, Thorne NP, Tavaré S. (2006). Biohmm: a heterogeneous hidden markov model for segmenting array cgh data. Bioinformatics, 22:1144--1146.
Olshen AB, Venkatraman ES, Lucito R, Wigler M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557-72.
Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ. (2005). A statistical approach for array cgh data analysis. BMC Bioinformatics, 6:27.
Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJ. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res. 33:3455-64.
Willenbrock H, Fridlyand J. (2005). A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21: 4084-4091.
Yu T, Ye H, Sun W, Li K, Chen Z, Jacobs S, Bailey D, Wong D, Zhou X. (2007). A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) arrays. BMC Bioinformatics, 8: 145.