Archive

R

This last week I used lost of Microsoft Excel files that I need in my R scripts. Hence I discovered the excellent and complete package XLConnect. But, for easy and fast working, its not the best solution.

So I codded a wrapper of XLConnect, calling it loadxls that allows to read and write Excel files in an easy way.

loadxls

The R package loadxls can be installed from its github repository with:

devtools::install_github("carleshf/loadxls")

It implements only 4 functions: read_all, read_sheet, write_all and write_sheet.

Functions

read_all

read_all(filename, environment = parent.frame(), verbose = TRUE)

This functions reads a given Excel file and loads each sheet as a data.frame in the current environment. The created objects will have the name of the sheet.

read_sheet

read_sheet(filename, sheetname, varname, environment = parent.frame(), verbose = TRUE)

This functions loads, instead of the full content of a file, only a given sheet. If the argument varname is supplied, the object loaded from the given sheet will take its name.

write_all

write_all(..., filename, verbose = TRUE)

This function writes all the objects given though ... to a new Excel file, saving each objects as a new sheet.

write_sheet

write_sheet(data, sheetname, filename, replace = FALSE, verbose = TRUE)

This function allows to save a single object to and giving the name of the sheet where it will be write.

Advertisements

I had a file containing ~1M SNPs in their rsid (and their position). I needed to complete the information with their chromosome.

Since the list is really large I use scan combined with a bash command (to know the length of the file).

I found this solution:

library("SNPlocs.Hsapiens.dbSNP.20120608")

## Create connection to big file
inputName <- "input_file.gen"
outputName <- "output_file.gen"
inputCon <- file(description = inputName, open = "r")
outputCon <- file(description = outputName, open = "w")

## We need to know the number of lines of the big file
## This will work for GNU/Linux environments
command <- paste("wc -l ", inputName, " | awk '{ print $1 }'", sep="")
nLines <- as.numeric(system(command = command, intern = TRUE))
rm(command)

## Loop over the file connection until end of lines
pb <- txtProgressBar(min = 0, max = nLines, style = 3)
for(ii in 1:nLines) {
  readLine <- scan(file = inputCon, nlines = 1, what = "list", quiet = TRUE)

  ## Get the chr from the SNP
  x <- tryCatch({
    x <- rsidsToGRanges(readLine[[2]])
    as.character(x@seqnames[1])
  }, error = function(e) {
    "---"
  })

  ## Update on list and save on output
  readLine[[1]] <- x
  writeLines(paste(readLine, collapse = " "), outputCon)

  setTxtProgressBar(pb, ii)
}

close(inputCon)
close(outputCon)

This solution is really slow but it gets the SNP’s chromosome and fills it as "---" when the SNP is not found in dbSNP.

WARNING: The SNP’s rsid must be located into the second column of the file. Take a look at readLine[[2]] in the for.


I am open to know different ways of doing this.

The requested R packages are:

library( Rsamtools )
library( TxDb.Hsapiens.UCSC.hg19.knownGene )
library( ShortRead )

We start from a set of aligned BAM files. The BAM files will be loaded using:

bamfls <- BamFileList( list.files( "...", pattern = "bam$", full.names = TRUE ) )

The reference comes as:

ex   <- unlist( exonsBy( TxDb.Hsapiens.UCSC.hg19.knownGene, "gene" ) )

And the function to extract the GC content is defined as:

getGCcontent <- function( bam_path, ex ) {
    if( missing( ex ) ) {
        stop( "Needed reference as 'ex'. ")
    }
    aln   <- readGappedReads( bam_path )
    olap  <- findOverlaps( aln, ex, ignore.strand = TRUE )
    hits  <- tabulate( queryHits( olap ), length( aln ) )
    keep  <- queryHits( olap ) %in% which( hits == 1 )
    reads <- qseq( aln )[ queryHits( olap )[ keep ] ]
    readGc <- rowSums( alphabetFrequency( reads, as.prob = TRUE )[ , c( "C", "G" ) ] )

    return( mean( readGc ) )
}

I usually create my R scripts inside a .Rnw files that I run with cacheSweave and the I convert to PDF. Since I started the PhD I am creating a lot of this files so I developed a script to run the .Rnw files and to create the corresponding .pdf file in a single command without needing to open R and run the source.

The script is the following:

# Get the arguments from the Rscript
args <- commandArgs( trailingOnly = TRUE )

# Get the sweave (original) and the tex (created with sweave) files
sweave_file <- args[ 1 ]
file_name   <- strsplit( sweave_file, "[/]" )[[ 1 ]]
tex_file    <- paste0( getwd(), "/", strsplit( file_name[ length( file_name ) ], "[.]")[[ 1 ]][ 1 ], ".tex" )

# Load cacheSweave and run the sweave file
library( cacheSweave )
Sweave( sweave_file, driver = cacheSweaveDriver)

# Convert the tex file to pdf
system( paste0( "pdflatex ", tex_file ) )
system("rm *.aux *.log *.out *.toc")

And I run it as:

Rscript ../../scripts/creaPdf.R ../../scripts/mosaicism_analysis.Rnw

The file ../../scripts/creaPdf.R is the one containing the code above. The file ../../scripts/mosaicism_analysis.Rnw is the one that will be executed (and this execution will create a .tex file) and converted to PDF.

To read from console some user input we can use the following:

ask <- function() {
    cat( "You will be asked for some values" )
    return( scan( n = 1 ) )
}

This lines generate the following behaviour:

R> value.1 <- ask()
You will be asked for some values
1: 7
Read 1 item

So when we shoe the content of value.1:

R> print( value.1 )
[1] 7

We can do the same asking for more values:

ask <- function( N = 1 ) {
    cat( "You will be asked for some values" )
    return( scan( n = N ) )
}

Getting a similar behaviour:

R> value.5 <- ask( N = 5 )
You will be asked for some values
1: 5
2: 4
3: 3
4: 2
5: 1
Read 5 items

And, of course, getting a similar result:

R> print( value.5 )
[1] 5 4 3 2 1

But, what happens if we type some other input different than numbers:

R> value.c <- ask()
You will be asked for some values
1: A
1: B
Error en scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() esperaba 'a real', obtuvo 'A'

This can be solve adding a new argument:

R> ask <- function( N = 1, type = double() ) {
    cat( "You will be asked for some values" )
    return( scan( n = N, what = type ) )
}

And be used as:

R> value.c <- ask( type = character() )
You will be asked for some values
1: A
Read 1 item

So we can get the character:

R> print( value.c )
[1] "A"

Introduction

MAD, acronym of Mosaic Alteration Detector, is an R package, depending from the package R-GADA, that allows to detect mosaic events using SNP data. The best feature of this package is that it allows to run the process over a big set of samples.

To run the analysis each sample must have a file, containing the following information:

  • Name (of the SNP)
  • Chromosome (where the SNP is placed)
  • Position (of the SNP in the chromosome)
  • The value for the log2 ratio (of the SNP)
  • The value of the B allele frequency (of the SNP)
  • And the given genotype (of the SNP)

All this information needs to be stored in a tab-splited file:

[carleshf@coruscan analysisstarwarsrawData]$ head padme 
Name    Chr     Position        Log.R.Ratio     GType   B.Allele.Freq
S-4DTYN 3       75333236        0.320067        AB      0.503067484662577
S-4DTYJ 2       137804803       -0.372684       BB      0.93558282208589
S-4DTYD 1       79235573        -0.224208       AA      0.00920245398773001
...     ...     ...             ...             ...     ...

Moreover, the R package MAD requires to place all the samples into a folder called rawData, let’s see:

analysis
  +-- starwats
        +-- rawData
              +-- padme
              +-- anakin
              +-- qui_gon
              +-- palpatine
              +-- watto
  +-- startrek
        +-- rawData
              +-- james_kirk
              +-- spock
              +-- uhura
              +-- scott

Having the correct structure, all the samples into the folder rawData, we place the R session into the upper folder, in the case starwars.

R> path <- getwd()
R> path
[1] "/home/carleshf/.../analysis/starwars"

Analysis

Being there we can start the mosaicism analysis, that is performed in three steps:

  1. The set-up step.
  2. The segmentation procedure step.
  3. The backward elimination step.

Set-up

library( mad )

object <- setupParGADA.B.deviation(
    folder    = path, 
    NumCols   = 6, 
    log2ratio = 4, 
    BAFcol    = 6, 
    GenoCol   = 5
)

Segmentation Procedure

parSBL(
    object, 
    estim.sigma2 = TRUE, 
    aAlpha       = 0.8
)

Backward Elimination

parBE.B.deviation(
    object, 
    T         = 8, 
    MinSegLen = 100
)

Results

At this point the analysis is finished. All the possible events have been detected ans stored into the variable called object. MAD allows us to export all this information as a table:

exportSegments2File(
    object, 
    file   = "mosaic_events.txt"
)

The content of this file follows:

[carleshf@coruscan analysisstarwars]$ head mosaic_events.txt 
IniProbe  EndProbe  LenProbe  qqBdev  chr  LRR    (s.e.)  Bdev   %HetWT  %Hom  State  sample
66690197  71078462  183       0.95    3    -0.64  0.21    0.188  0       8.2   2      palpatine
17309881  21421319  127       0.38    22   -0.24  0.34    0.032  4.7     51.2  2      watto
143559    15049329  495       0.89    18   -0.64  0.24    0.214  0       10.3  2      anakin

We can see that MAD gives us a lot of information. May be, the most import could be the region where the mosaic event was detected (from InitProbe to EndProbe), the chromosome containing the event (chr), the classification of the event (State) and the sample that suffers the event (sample).

The number that codifies the State of the detected abnormalities corresponds:

  1. UPD
  2. Deletion
  3. Duplication
  4. Trisomy
  5. LOH

References

For more information I refer you to the following:

  • The web page of the package – link
  • The vignette of the package – link
  • The paper where the method (package) was used – link

The following is a post/manual based on on the description of the installation process of R from CRAN (link). But I added the results of my experience.

First of all, it is needed to add the CRAN (or alternative) repository to the sources.list file. Edit /etc/apt/sources.list and add to it:

deb http://FAVORITE.CRAN.MIRROR/bin/linux/ubuntu trusty/

Changing FAVORITE.CRAN.MIRROR for one of the list. An example could be:

deb http://cran.rstudio.com/bin/linux/ubuntu trusty/

Now we update the repositories with:

sudo apt-get update

As a result we got a list of completed update but an error when the system tries to get the information for the added repository:

W: GPG error: http://cran.rstudio.com trusty/ Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY XXXXXXXX12345678

To add the key to the system we do:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-key 12345678

Just see that we add the last eight digits from the given key, not all. Now we can redo a:

sudo apt-get update

And we can install R with:

sudo apt-get upgrade
sudo apt-get install r-base