Log2 Ratio from Affymetrix Genome Wide SNP 6.0

First I downloaded and uncompressed the TESLA.tgz file from 1. The content of the new TESLA folder is the following:

[carleshf@lambdafunction affy6]$ ll TESLA/
total 202316
-rw-rw-r--. 1 carleshf carleshf 69058277 ene 25 12:00 TESLA_..._GenomeWideSNP_6_A01_223682.CEL
-rw-rw-r--. 1 carleshf carleshf 69064401 ene 25 12:00 TESLA_..._GenomeWideSNP_6_A02_223698.CEL
-rw-rw-r--. 1 carleshf carleshf 69043474 ene 25 12:00 TESLA_..._GenomeWideSNP_6_A03_223714.CEL

Then I got the Affymetrix Power Tools for GNU/Linux from 2 and stored at ~/Software.

The following step is to use the Affymetrix Power Tools software to extract allele-specific signal values from the raw CEL files (the ones in TESTLA). Here “allele-specific” refers to the fact that for each SNP, we have a signal measure for the A allele and a separate signal measure for the B allele.

To perform this operation we use the apt-probeset-summarize, from the Affymetrix Power Tools.

To use this tool on raw CEL files from Genome-Wide SNP 6.0 arrays we need the_library files_ corresponding to this technology. I downloaded and uncompressed the library files from 3 into ~/lib_affy6.

With the APT and the library files I run the apt-probeset-summarize as:

/home/chernandez/Software/apt-1.15.1/bin/apt-probeset-summarize 
    --cdf-file /home/chernandez/lib_affy6/GenomeWideSNP_6.cdf 
    --analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true 
    --out-dir apt 
    TESLA/*.CEL

After a minute the process finished and the folder apt was created with:

[carleshf@lambdafunction affy6]$ ll apt/
total 105864
-rw-rw-r--. 1 carleshf carleshf     36206 ene 21 16:58 apt-probeset-summarize.log
-rw-rw-r--. 1 carleshf carleshf      4305 ene 21 16:58 quant-norm.pm-only.med-polish.expr.report.txt
-rw-rw-r--. 1 carleshf carleshf 108356050 ene 21 16:58 quant-norm.pm-only.med-polish.expr.summary.txt

The generated file called quant-norm.pm-only.med-polish.expr.summary.txt contains the allele-specific signal for all the probes into the arrays. With the following command I created a file containing only the allele-specific signal for the SNP probes:

grep SNP apt/quant-norm.pm-only.med-polish.expr.summary.txt | grep -v AFFX > apt/snp.summary.txt

The generated file snp.summary.txt is a tabular file containing four columns, as we can see using the head command:

[carleshf@lambdafunction affy6]$ tail -2 apt/snp.summary.txt 
SNP_A-8574011-A 10.45193    10.40926    10.19732
SNP_A-8574011-B 8.64288      9.00405     9.15780

The first columns is the Affymetrix Identifier of each probe with the allele marker (the -A and the -B at the end of the probe’s name) the other 3 columns correspond to the signal read from each one of the three raw CEL files.

Then I downloaded and uncompressed the annotation file for Genome-Wide SNP 6.0 from 4 and started an R session.

I loaded the allele-specific signals with:

summary.file <- "apt/snp.summary.txt"
int.snp <- read.table(summary.file, header=TRUE, comment.char="#")
rownames(int.snp) <- int.snp$probeset_id

And converted to a matrix skipping the column with the probes’ identifiers with:

int.snp <- as(int.snp[ ,-1 ], "matrix")

The signal of each allele can be gotten using sequences by 2 but starting on at 1 and the other at 2.

alleleA <- int.snp[seq(1, nrow(int.snp), 2), ]
alleleB <- int.snp[seq(2, nrow(int.snp), 2), ]

I calculate and approximation of the LRR using a formula based on our experience in Affymetrix technologies:

library(Biobase) # for 'rowMedians'
R <- as(alleleA + alleleB, "matrix")
medians <- rowMedians(R)
LRR <- log2(sweep(R, 1, medians, FUN="/"))
rownames(LRR) <- gsub("-A", "", as.character(rownames(alleleA)))
colnames(LRR) <- colnames(int.snp)
rm("R", "medians")

Having the LRR I annotated each probe using the annotation file for Genome-Wide SNP 6.0:

annotation.file <- "/home/chernandez/annotations/GenomeWideSNP_6/GenomeWideSNP_6.na33.annot.csv"
ann <- read.csv(annotation.file, comment.char="#", header=TRUE)

I discarded the information I won’t use with:

ann <- ann[ , c("Probe.Set.ID", "Chromosome", "Physical.Position")]

And I did the annotation using the identifier of each probe (the column Probe.Set.ID), creating a data.frame with the annotation and the LRR for the three raw CEL files:

df <- cbind(ann[ann$Probe.Set.ID %in% rownames(LRR),], LRR)
colnames(df) <- c("Probe.Set.ID", "Chromosome", "Physical.Position", 
    "TESLA_A01_223682", "TESLA_A02_223698", "TESLA_A03_223714")

The resulting data.frame is the following:

head(df, 2)
      Probe.Set.ID Chromosome Physical.Position TESLA_A01_223682
3023 SNP_A-1780270          7          78599583       0.00000000
3024 SNP_A-1780271         15          33395779      -0.03363681
     TESLA_A02_223698 TESLA_A03_223714
3023      -0.02645902       0.09666207
3024       0.00000000       0.05463006

Now we can save the data.frame using the write.table and/or plot the LRR. Here you have a sample of the calculated LRR:

LRR of the 26 chromosomes of individual TESLA_..._A01_223682

References

  • 1 Sample Data – HapMap (TESLA batch): link
  • 2 Affymetrix Power Tools: link
  • 3 Library Files: link
  • 4 Annotation File: link
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: