From SnpSet to MAP and PED files for Affymetrix Genome Wide SNP

The following post is restricted to CEL files from Affymetrix GenomeWideSNP 5.0 and GenomeWideSNP 6.0.

The idea is to generate a MAP and a PED file from an amount of CEL files from one of those two technologies (in the post GenomeWideSNP 6.0).

First of all we must know the content of each file. The fields in a MAP file are:

  • Chromosome
  • Marker ID
  • Genetic distance
  • Physical position

While in a PED file they are:

  • Family ID
  • Sample ID
  • Paternal ID
  • Maternal ID
  • Sex (1=male; 2=female; other=unknown)
  • Affection (0=unknown; 1=unaffected; 2=affected)
  • Genotypes (space or tab separated, 2 for each marker. 0=missing)

Now, lets see what we have:

[carles@apollo affy6]$ ls

Ten CEL files of GenomeWideSNP 6.0 which are need to be converted to plink format.

To start with the conversion we open R and we load the two packages we will use:


Next we load the CEL names:

> celfiles <- list.celfiles()
> celfiles
 [1] "AR00.CEL" "AR01.CEL" "AR02.CEL" "AR03.CEL" "AR04.CEL" "AR05.CEL"
 [7] "AR06.CEL" "AR07.CEL" "AR08.CEL" "AR09.CEL"

We create a SnpSet with the crlmm method:

crlmmresult <- crlmm(celfiles, verbose=TRUE)

Now we load the annotation file [1] for the technology:

ann <- read.csv("GenomeWideSNP_6.na32.annot.csv", skip=21, row.names=1,

And we use it to create the content of the MAP file:

mapfile <- annot[ , c("Chromosome", "dbSNP.RS.ID", "ChrX.pseudo.autosomal.region.1", "Physical.Position")]

For PED file we clean the annotations to get only the ones we need. For that we request the genotypes to the SnpSet using the method calls:

calls <- calls(crlmmresult)
ann <-ann [rownames(calls), ]

We are going to need two matrices to store the values of the alleles:

alleleA <- alleleB <- matrix("", nrow=nrow(calls), ncol=ncol(calls), dimnames=dimnames(calls))

We look for the alleles on annotations and we check the strand to get the correct ones:

allA <- ann$Allele.A
allComplA <- c("A","C","G","T")[match(allA, c("T", "G", "C", "A"))]
forwA <- ifelse( ann$Strand=="+", allA, allComplA )

allB <- ann$Allele.B;
allComplB <- c("A","C","G","T")[match(allB, c("T", "G", "C", "A"))]
forwB <- ifelse(ann$Strand=="+", allB, allComplB)

With the correct alleles we fill the matrices:

for(ii in 1:ncol(calls)) {
    alleleA[ , ii] <- ifelse(calls[ , ii] < 3, forwA, forwB)
    alleleB[ , ii] <- ifelse(calls[ , ii] < 2, forwA, forwB)

Now we create a variable called ped to store all the content of the PED file and we put in it the alelles:

ped <- matrix("",ncol=nrow(alleleA) * 2 + 6, nrow=ncol(alleleA))
for(ii in 1:ncol(calls)) {
  ped[ii, (1:nrow(alleleA)) * 2 + 5] <- alleleA[ , ii]
  ped[ii, (1:nrow(alleleB)) * 2 + 6] <- alleleB[ , ii]

To fill the first 6 columns of the PED file I loaded an “info file” attached to my CEL files with the following content:

> info
      family_id      samples_id patern_id mattern_id gender_id affetion
 [1,] "homo sapiens" "AR00"     "0"       "0"        "1"       "0"
 [2,] "homo sapiens" "AR02"     "0"       "0"        "1"       "0"
 [3,] "homo sapiens" "AR04"     "0"       "0"        "1"       "0"
 [4,] "homo sapiens" "AR06"     "0"       "0"        "1"       "0"
 [5,] "homo sapiens" "AR08"     "0"       "0"        "1"       "0"
 [6,] "homo sapiens" "AR01"     "0"       "0"        "2"       "0"
 [7,] "homo sapiens" "AR03"     "0"       "0"        "2"       "0"
 [8,] "homo sapiens" "AR05"     "0"       "0"        "2"       "0"
 [9,] "homo sapiens" "AR07"     "0"       "0"        "2"       "0"
[10,] "homo sapiens" "AR09"     "0"       "0"        "2"       "0"

So to fill the first 6 columns is as easy as:

ped[ , 1] <- info[ , "family_id"]
ped[ , 2] <- info[ , "samples_id"]
ped[ , 3] <- info[ , "patern_id"]
ped[ , 4] <- info[ , "matern_id"]
ped[ , 5] <- info[ , "gender_id"]
ped[ , 6] <- info[ , "affetion"]

Finally we only need to save the variables mapfile and ped to disk with:

write.table(mapfile, file="", quote=FALSE, row.names=FALSE, col.names=FALSE)
write.table(ped, file="Affy6.ped", quote=FALSE, row.names=FALSE, col.names=FALSE )


  • [1] Annotation files for Affymetrix arrays: link
1 comment
  1. Jeongha Lee said:

    thanks for your script!
    but after I made two matched map and ped files(like, example_1.ped,, example_2.ped) by following your script, I faced with the problem.
    When I tried to merge them with plink, it said there were unexpected columns in pedfile (not matched with mapfile columns ). It said the map file contained N rows (SNP data), so there must be the ped file that contained 6+2*N columns(for me 1866250columns should be in ped file) but the observed pedfile had much less columns(1813206 columns was observed in my ped file).
    After several trouble shooting, i solved this problem by just switching several scripts.

    ann <- read.csv(“GenomeWideSNP_6.na35.annot.csv”, skip=18, row.names=1,
    calls <- calls(crlmmresult)
    ann <-ann [rownames(calls), ]
    mapfile <- ann[ , c(“Chromosome”, “dbSNP.RS.ID”, “ChrX.pseudo.autosomal.region.1”, “Physical.Position”)]

    after making mapfile that only contained the ann data that were correspond with rownames(calls), I could get matched columns number between expected ped file and observed ped file.
    So I think it would be little bit better with this switched script for merging map and ped file with plink.
    Anyway thanks again for your script.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: