Archive

bioinformatics

This last week I worked with lots of Microsoft Excel files that I needed in my R scripts. That is how I discovered the excellent and complete package XLConnect. But for quick, everyday work it is not the most convenient solution.

So I coded a wrapper around XLConnect, called loadxls, that allows reading and writing Excel files in an easy way.

loadxls

The R package loadxls can be installed from its GitHub repository with:

devtools::install_github("carleshf/loadxls")

It implements only four functions: read_all, read_sheet, write_all and write_sheet.

Functions

read_all

read_all(filename, environment = parent.frame(), verbose = TRUE)

This function reads a given Excel file and loads each sheet as a data.frame in the current environment. The created objects take the names of the sheets.

read_sheet

read_sheet(filename, sheetname, varname, environment = parent.frame(), verbose = TRUE)

This function loads only a given sheet instead of the full content of a file. If the argument varname is supplied, the object loaded from the given sheet takes that name.

write_all

write_all(..., filename, verbose = TRUE)

This function writes all the objects given through ... to a new Excel file, saving each object as a new sheet.

write_sheet

write_sheet(data, sheetname, filename, replace = FALSE, verbose = TRUE)

This function saves a single object to a file, letting you choose the name of the sheet where it will be written.
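To make the pattern behind read_all concrete without inventing details of the R API, here is a hypothetical Python sketch: each sheet of a workbook becomes a variable, named after the sheet, in a target namespace (in R that namespace is the parent.frame() environment). Nothing here is the real loadxls code.

```python
# Hypothetical sketch of the read_all pattern (loadxls itself is R):
# each sheet of a workbook becomes a named entry in a target namespace.

def load_all(sheets, env):
    """Inject every sheet (name -> table) into the mapping 'env'
    and return the names that were created."""
    for name, table in sheets.items():
        env[name] = table
    return sorted(sheets)

workbook = {"samples": [["s1", 10], ["s2", 20]], "genes": [["BRCA1"]]}
ns = {}
loaded = load_all(workbook, ns)
print(loaded)         # ['genes', 'samples']
print(ns["samples"])  # [['s1', 10], ['s2', 20]]
```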

I had a file containing ~1M SNPs identified by their rsid (and their position). I needed to complete the information with their chromosome.

Since the list is really large, I used scan combined with a bash command (to get the number of lines in the file).

I found this solution:

library("SNPlocs.Hsapiens.dbSNP.20120608")

## Create connection to big file
inputName <- "input_file.gen"
outputName <- "output_file.gen"
inputCon <- file(description = inputName, open = "r")
outputCon <- file(description = outputName, open = "w")

## We need to know the number of lines of the big file
## This will work for GNU/Linux environments
command <- paste("wc -l ", inputName, " | awk '{ print $1 }'", sep="")
nLines <- as.numeric(system(command = command, intern = TRUE))
rm(command)

## Loop over the file connection until end of lines
pb <- txtProgressBar(min = 0, max = nLines, style = 3)
for(ii in 1:nLines) {
  readLine <- scan(file = inputCon, nlines = 1, what = "list", quiet = TRUE)

  ## Get the chr from the SNP
  x <- tryCatch({
    x <- rsidsToGRanges(readLine[[2]])
    as.character(x@seqnames[1])
  }, error = function(e) {
    "---"
  })

  ## Update on list and save on output
  readLine[[1]] <- x
  writeLines(paste(readLine, collapse = " "), outputCon)

  setTxtProgressBar(pb, ii)
}

close(inputCon)
close(outputCon)

This solution is really slow, but it gets the SNP’s chromosome, filling it in as "---" when the SNP is not found in dbSNP.

WARNING: The SNP’s rsid must be located in the second column of the file. Take a look at readLine[[2]] in the for loop.


I am open to hearing about different ways of doing this.
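One alternative, sketched in Python under two assumptions (the rsid sits in the second column, as above, and a plain dictionary stands in for the dbSNP lookup), streams the file in a single pass without counting lines first:

```python
def annotate_chromosomes(in_path, out_path, lookup):
    """Rewrite the first column of each line with the chromosome of the
    rsid found in the second column; '---' when the rsid is unknown."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            rsid = fields[1]                     # rsid assumed in column 2
            fields[0] = lookup.get(rsid, "---")  # dict stands in for dbSNP
            dst.write(" ".join(fields) + "\n")

# toy data instead of the real ~1M-SNP file
with open("input_file.gen", "w") as fh:
    fh.write("0 rs123 1000\n0 rs999 2000\n")

annotate_chromosomes("input_file.gen", "output_file.gen", {"rs123": "chr3"})
print(open("output_file.gen").read())
# chr3 rs123 1000
# --- rs999 2000
```

Since the file is never held in memory, there is no need for wc -l or a precomputed line count.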

Using the Dir and File classes we can create a function to list the contents of a given path:

    def list_files( path, filter: nil, full_names: false )
        files = Dir.entries( path ).select { | ff | ff != '.' && ff != '..' }
        files = files.map { | ff | File.join( path, ff ) } if full_names
        files = files.select { | ff | ff[ filter ] } if filter
        return files
    end

Moreover, the previous function allows getting the full names of the files (relative to the given path) and also filtering the result with a regular expression.

For example, having a folder with:

    [carleshf@lambdafunction Data]$ ll aligned/
    -rw-rw----+ 1 carleshf carleshf 2200853068 apl 21 06:06 mRNA10_AGTTCC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4778016 apl 21 06:06 mRNA10_AGTTCC_f2.bam.bai
    -rw-rw----+ 1 carleshf carleshf 1621816053 apl 21 06:06 mRNA10_CAGATC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4774552 apl 21 06:06 mRNA10_CAGATC_f2.bam.bai
    -rw-rw----+ 1 carleshf carleshf 2195712479 apl 21 06:06 mRNA10_CCGTCC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4445664 apl 21 06:06 mRNA10_CCGTCC_f2.bam.bai
    ...

We can list the .bam files with:

    list_files( 'aligned', full_names: true, filter: /\.bam$/ )

Let’s imagine we have a file-architecture like:

analysis ( folder )
    |
    +-- sample_name_1 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |    |
    |   |    +-- align_summary.txt
    |   |    ...
    |   +-- qc
    |   ...
    +-- sample_name_2 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |   |
    |   |   +-- align_summary.txt
    |   |   ...
    |   +-- qc
    |   ...
    ...

And we want to get the number and the percentage of duplicated reads from the file align_summary.txt, generated by tophat2. The file align_summary.txt contains a block like this one:

        Reads:
          Input     :  26373797
           Mapped   :  21928071 (83.1% of input)
            of these:   5067439 (23.1%) have multiple alignments (6356 have >20) 
83.1% overall read mapping rate.

So we can read it as a plain-text file and parse the 4th line to extract 5067439 and 23.1%. That can be done as:

from os import listdir
from os.path import join

def getDuplicateds( file_name ):
    # The 4th line reads: "of these:   5067439 (23.1%) have multiple alignments ..."
    with open( file_name, 'r' ) as fh:
        line = fh.readlines()[ 3 ].replace( '(', '' ).replace( ')', '' ).replace( '%', '' )
        return [ x for x in line.split( ' ' ) if x != '' ][ 2:4 ]

input_folder = "."
all_summary = [ join( input_folder, x, "tophat_out", "align_summary.txt" ) for x in listdir( input_folder ) ]
dupValue = [ getDuplicateds( ff ) for ff in all_summary ]

Moreover, we can add the sample’s name to the list, convert it into a data frame and save it as a .tsv:

import pandas as pd

def getName( facility_name ):
    x = facility_name.replace( ".fastq", '' ).split( '_' )
    return( x[ 1 ] + "_" + x[ 3 ] )

names    = [ getName( x ) for x in listdir( input_folder ) ]
dupValue = list( zip( names, [ x[ 0 ] for x in dupValue ], [ x[ 1 ] for x in dupValue ] ) )
df = pd.DataFrame( dupValue )
df.columns = [ "Sample Name", "Duplicated (#)", "Duplicated(%)" ]
df.to_csv( "duplicate_reads.tsv", sep = "\t", index = False )

So, the content of the output file is:

[carleshf@lambdafunction]$ head duplicate_reads.tsv 
Sample Name     Duplicated (#)  Duplicated(%)
mRNA4_I19       1892308 8.3
mRNA4_I5        5104835 22.7
mRNA4_I6        1722464 8.7
mRNA1_I13       3970997 14.4
mRNA1_I15       9024103 32.8
mRNA1_I19       2632723 9.2
mRNA1_I5        1175850 14.9
mRNA1_I6        221270  8.3
mRNA2_I14       7177404 19.5

The required R packages are:

library( Rsamtools )
library( TxDb.Hsapiens.UCSC.hg19.knownGene )
library( ShortRead )

We start from a set of aligned BAM files. The BAM files will be loaded using:

bamfls <- BamFileList( list.files( "...", pattern = "bam$", full.names = TRUE ) )

The reference comes as:

ex   <- unlist( exonsBy( TxDb.Hsapiens.UCSC.hg19.knownGene, "gene" ) )

And the function to extract the GC content is defined as:

getGCcontent <- function( bam_path, ex ) {
    if( missing( ex ) ) {
        stop( "Needed reference as 'ex'. ")
    }
    aln   <- readGappedReads( bam_path )
    olap  <- findOverlaps( aln, ex, ignore.strand = TRUE )
    hits  <- tabulate( queryHits( olap ), length( aln ) )
    keep  <- queryHits( olap ) %in% which( hits == 1 )
    reads <- qseq( aln )[ queryHits( olap )[ keep ] ]
    readGc <- rowSums( alphabetFrequency( reads, as.prob = TRUE )[ , c( "C", "G" ) ] )

    return( mean( readGc ) )
}
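The core of the computation (the per-read GC fraction, then its mean) can be sketched in plain Python, independently of Bioconductor; this only illustrates the alphabetFrequency/rowSums step, not the BAM handling:

```python
def gc_content(reads):
    """Mean fraction of G/C bases across a list of read sequences."""
    fractions = [sum(base in "GCgc" for base in read) / len(read)
                 for read in reads]
    return sum(fractions) / len(fractions)

print(gc_content(["GCGC", "ATAT", "GCAT"]))  # (1.0 + 0.0 + 0.5) / 3 = 0.5
```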

Introduction

MAD, an acronym for Mosaic Alteration Detector, is an R package, built on top of the package R-GADA, that detects mosaic events using SNP data. Its best feature is that it can run the process over a big set of samples.

To run the analysis, each sample must have a file containing the following information:

  • Name (of the SNP)
  • Chromosome (where the SNP is placed)
  • Position (of the SNP in the chromosome)
  • The value for the log2 ratio (of the SNP)
  • The value of the B allele frequency (of the SNP)
  • And the given genotype (of the SNP)

All this information needs to be stored in a tab-separated file:

[carleshf@coruscan analysis/starwars/rawData]$ head padme 
Name    Chr     Position        Log.R.Ratio     GType   B.Allele.Freq
S-4DTYN 3       75333236        0.320067        AB      0.503067484662577
S-4DTYJ 2       137804803       -0.372684       BB      0.93558282208589
S-4DTYD 1       79235573        -0.224208       AA      0.00920245398773001
...     ...     ...             ...             ...     ...

Moreover, the R package MAD requires placing all the samples into a folder called rawData, let’s see:

analysis
  +-- starwars
        +-- rawData
              +-- padme
              +-- anakin
              +-- qui_gon
              +-- palpatine
              +-- watto
  +-- startrek
        +-- rawData
              +-- james_kirk
              +-- spock
              +-- uhura
              +-- scott

Having the correct structure, with all the samples in the rawData folder, we place the R session in the parent folder, in this case starwars.

R> path <- getwd()
R> path
[1] "/home/carleshf/.../analysis/starwars"

Analysis

From there we can start the mosaicism analysis, which is performed in three steps:

  1. The set-up step.
  2. The segmentation procedure step.
  3. The backward elimination step.

Set-up

library( mad )

object <- setupParGADA.B.deviation(
    folder    = path, 
    NumCols   = 6, 
    log2ratio = 4, 
    BAFcol    = 6, 
    GenoCol   = 5
)

Segmentation Procedure

parSBL(
    object, 
    estim.sigma2 = TRUE, 
    aAlpha       = 0.8
)

Backward Elimination

parBE.B.deviation(
    object, 
    T         = 8, 
    MinSegLen = 100
)

Results

At this point the analysis is finished. All the detected events have been stored in the variable called object. MAD allows us to export all this information as a table:

exportSegments2File(
    object, 
    file   = "mosaic_events.txt"
)

The content of this file follows:

[carleshf@coruscan analysis/starwars]$ head mosaic_events.txt 
IniProbe  EndProbe  LenProbe  qqBdev  chr  LRR    (s.e.)  Bdev   %HetWT  %Hom  State  sample
66690197  71078462  183       0.95    3    -0.64  0.21    0.188  0       8.2   2      palpatine
17309881  21421319  127       0.38    22   -0.24  0.34    0.032  4.7     51.2  2      watto
143559    15049329  495       0.89    18   -0.64  0.24    0.214  0       10.3  2      anakin

We can see that MAD gives us a lot of information. Perhaps the most important fields are the region where the mosaic event was detected (from IniProbe to EndProbe), the chromosome containing the event (chr), the classification of the event (State) and the sample carrying the event (sample).

The number that codifies the State of a detected abnormality corresponds to:

  1. UPD
  2. Deletion
  3. Duplication
  4. Trisomy
  5. LOH
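When post-processing mosaic_events.txt, that code-to-label mapping can be applied directly; here is a small hypothetical Python sketch (the field names are taken from the table header above, the helper is mine):

```python
# State codes reported by MAD, per the list above
STATE_LABELS = {1: "UPD", 2: "Deletion", 3: "Duplication",
                4: "Trisomy", 5: "LOH"}

def label_event(row):
    """Return a copy of one parsed row with its numeric State
    field replaced by the human-readable label."""
    row = dict(row)
    row["State"] = STATE_LABELS[int(row["State"])]
    return row

event = {"chr": "3", "State": "2", "sample": "palpatine"}
print(label_event(event)["State"])  # Deletion
```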

References

For more information I refer you to the following:

  • The web page of the package – link
  • The vignette of the package – link
  • The paper where the method (package) was used – link