Archive

tools

Using bash and ImageMagick we can crop all the pictures in a folder in a single shoot:

fix='crop'
it=1
for file in $2/*; 
do
    if [[ -f $file ]]; 
    then
        output=$2/${fix}${it}.png
        echo "$file --> $output"
        convert -crop $1 "$file" "$output"
        it=$((it+1))
    fi
done

The script, called crop.sh is run as:

./crop.sh 525x240+675+150 ~/Pictures/

The first argument is the argument for the ImageMagick’s crop tool and the second argumen is the folder where the pictures to be cropped are.

Let’s imagine we have a file-architecture like:

analysis ( folder )
    |
    +-- sample_name_1 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |    |
    |   |    +-- align_summary.txt
    |   |    ...
    |   +-- qc
    |   ...
    +-- sample_name_2 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |   |
    |   |   +-- align_summary.txt
    |   |   ...
    |   +-- qc
    |   ...
    ...

And we want to get the number and the percentage of duplicated reads from the file align_summary.txt, generated using tophat2. The file align_summary.txt contains a string as this one:

        Reads:
          Input     :  26373797
           Mapped   :  21928071 (83.1% of input)
            of these:   5067439 (23.1%) have multiple alignments (6356 have >20) 
83.1% overall read mapping rate.

So it can be read as a plain-text file and parse the string to get the 4th line, parsing it to get the 5067439 and `23.1%“. That can be done as:

from os import listdir
from os.path import join, basename

def getDuplicateds( file_name ):
    with open( file_name, 'r' ) as fh:
        return( [ x for x in fh.readlines()[ 3 ].replace('(', '').replace(')', '').replace('%', '').split( ' ' ) if x != '' ][ 2:4 ] )

input_folder = "."
all_summary = [ join( x, "tophat_out", "align_summary.txt" ) for x in listdir( input_folder ) ]
dupValue = [ getDuplicateds( ff ) for ff in all_summary ]

Moreover we can add the sample’s name to the list, convert it into a data frame and save is as a .tsv:

import pandas as pd

def getName( facility_name ):
    x = facility_name.replace(".fastq", '' ).split( '_' )
    return( x[ 1 ] + "_" + x[ 3 ] )

names       = [ getName( x ) for x in listdir( input_folder ) ]
dupValue = zip( names, [ x[ 0 ] for x in dupValue ], [ x[ 1 ] for x in dupValue ] )
df = pd.DataFrame( dupValue )
df.columns = [ "Sample Name", "Duplicated (#)", "Duplicated(%)" ]
df.to_csv( "duplicate_reads.tsv", sep="t", index = False )

So, the content of the output file is:

[carleshf@lambdafunction]$ head duplicate_reads.tsv 
Sample Name     Duplicated (#)  Duplicated(%)
mRNA4_I19       1892308 8.3
mRNA4_I5        5104835 22.7
mRNA4_I6        1722464 8.7
mRNA1_I13       3970997 14.4
mRNA1_I15       9024103 32.8
mRNA1_I19       2632723 9.2
mRNA1_I5        1175850 14.9
mRNA1_I6        221270  8.3
mRNA2_I14       7177404 19.5

I usually create my R scripts inside a .Rnw files that I run with cacheSweave and the I convert to PDF. Since I started the PhD I am creating a lot of this files so I developed a script to run the .Rnw files and to create the corresponding .pdf file in a single command without needing to open R and run the source.

The script is the following:

# Get the arguments from the Rscript
args <- commandArgs( trailingOnly = TRUE )

# Get the sweave (original) and the tex (created with sweave) files
sweave_file <- args[ 1 ]
file_name   <- strsplit( sweave_file, "[/]" )[[ 1 ]]
tex_file    <- paste0( getwd(), "/", strsplit( file_name[ length( file_name ) ], "[.]")[[ 1 ]][ 1 ], ".tex" )

# Load cacheSweave and run the sweave file
library( cacheSweave )
Sweave( sweave_file, driver = cacheSweaveDriver)

# Convert the tex file to pdf
system( paste0( "pdflatex ", tex_file ) )
system("rm *.aux *.log *.out *.toc")

And I run it as:

Rscript ../../scripts/creaPdf.R ../../scripts/mosaicism_analysis.Rnw

The file ../../scripts/creaPdf.R is the one containing the code above. The file ../../scripts/mosaicism_analysis.Rnw is the one that will be executed (and this execution will create a .tex file) and converted to PDF.

For this examples the user is called darth_vader and the remote host death_star.gov.

Single File

From host to server

In this section the user wants to upload the file called rebel_ids.tsv from his localhost to remotehost.

$ scp rebel_ids.tsv darth_vader@death_star.gov:/home/darth_vader/targets 

From server to host

The user wants to download fr the remotehost the file called flight_path.gpx.

$ scp darth_vader@death_star.gov:/shared/patrols/darth_vader/flight_path.gpx /home/darth_vader/autopilot/ 

Multiple Files

From host to server

Now, the user wants to upload two files, alderaan.gpx and despayre.gpx, to his home into remotehost:

$ scp alderaan.gpx despayre.gpx darth_vader@death_star.gov

From server to host

The user wants to download two files, personal_solo.pdf and personal_chubaca.pdf from remotehost

$ scp darth_vader@death_star.gov:/shared/intelligence/targets/{personal_solo.pdf,personal_chubaca.pdf} .

Direcotry

From host to server

$ scp -r foo darth_vader@death_star.gov:/home/darth_vader/personal/remote/directory/bar 

From server to host

$ scp -r darth_vader@death_star.gov:/home/darth_vader/jedi_doc .

Introduction

MAD, acronym of Mosaic Alteration Detector, is an R package, depending from the package R-GADA, that allows to detect mosaic events using SNP data. The best feature of this package is that it allows to run the process over a big set of samples.

To run the analysis each sample must have a file, containing the following information:

  • Name (of the SNP)
  • Chromosome (where the SNP is placed)
  • Position (of the SNP in the chromosome)
  • The value for the log2 ratio (of the SNP)
  • The value of the B allele frequency (of the SNP)
  • And the given genotype (of the SNP)

All this information needs to be stored in a tab-splited file:

[carleshf@coruscan analysisstarwarsrawData]$ head padme 
Name    Chr     Position        Log.R.Ratio     GType   B.Allele.Freq
S-4DTYN 3       75333236        0.320067        AB      0.503067484662577
S-4DTYJ 2       137804803       -0.372684       BB      0.93558282208589
S-4DTYD 1       79235573        -0.224208       AA      0.00920245398773001
...     ...     ...             ...             ...     ...

Moreover, the R package MAD requires to place all the samples into a folder called rawData, let’s see:

analysis
  +-- starwats
        +-- rawData
              +-- padme
              +-- anakin
              +-- qui_gon
              +-- palpatine
              +-- watto
  +-- startrek
        +-- rawData
              +-- james_kirk
              +-- spock
              +-- uhura
              +-- scott

Having the correct structure, all the samples into the folder rawData, we place the R session into the upper folder, in the case starwars.

R> path <- getwd()
R> path
[1] "/home/carleshf/.../analysis/starwars"

Analysis

Being there we can start the mosaicism analysis, that is performed in three steps:

  1. The set-up step.
  2. The segmentation procedure step.
  3. The backward elimination step.

Set-up

library( mad )

object <- setupParGADA.B.deviation(
    folder    = path, 
    NumCols   = 6, 
    log2ratio = 4, 
    BAFcol    = 6, 
    GenoCol   = 5
)

Segmentation Procedure

parSBL(
    object, 
    estim.sigma2 = TRUE, 
    aAlpha       = 0.8
)

Backward Elimination

parBE.B.deviation(
    object, 
    T         = 8, 
    MinSegLen = 100
)

Results

At this point the analysis is finished. All the possible events have been detected ans stored into the variable called object. MAD allows us to export all this information as a table:

exportSegments2File(
    object, 
    file   = "mosaic_events.txt"
)

The content of this file follows:

[carleshf@coruscan analysisstarwars]$ head mosaic_events.txt 
IniProbe  EndProbe  LenProbe  qqBdev  chr  LRR    (s.e.)  Bdev   %HetWT  %Hom  State  sample
66690197  71078462  183       0.95    3    -0.64  0.21    0.188  0       8.2   2      palpatine
17309881  21421319  127       0.38    22   -0.24  0.34    0.032  4.7     51.2  2      watto
143559    15049329  495       0.89    18   -0.64  0.24    0.214  0       10.3  2      anakin

We can see that MAD gives us a lot of information. May be, the most import could be the region where the mosaic event was detected (from InitProbe to EndProbe), the chromosome containing the event (chr), the classification of the event (State) and the sample that suffers the event (sample).

The number that codifies the State of the detected abnormalities corresponds:

  1. UPD
  2. Deletion
  3. Duplication
  4. Trisomy
  5. LOH

References

For more information I refer you to the following:

  • The web page of the package – link
  • The vignette of the package – link
  • The paper where the method (package) was used – link

Two options are available to run .R files from command line:

  • R CMD BATCH
  • Rscript

R CMD BATH

It can be read that the R CMD BATH must be used as [1]:

R CMD BATCH [options] infile [outfile]

So lets see an example, creating an source.R file with the following content:

rnorm(n=5, mean=2)

Then run the script using the R CMD BATHC. The results of the executions is stored into a file called source.Rout:

[carleshf@lambdafunction tmp]$ R CMD BATCH source.R 
[carleshf@lambdafunction tmp]$ ll
total 8
-rw-rw-r--. 1 carleshf carleshf   42 feb  5 09:35 source.R
-rw-rw-r--. 1 carleshf carleshf 1020 feb  5 09:35 source.Rout

The content of the source.Rout is the result of the operation(s) we saved into the R script file (source.R):

[carleshf@lambdafunction tmp]$ cat source.Rout 

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

[...]

R> rnorm(n=5, mean=2)
+ 
> proc.time()
   user  system elapsed 
  0.166   0.021   0.181

The problem with R CMD BATCH is there is no way to pass arguments to the R script file from the command line in itself.

Rscript

It can be read that the Rscript must be used as [2]:

Rscript [options] [-e expression] file [args]

Using the same example as before, the source.R can be run as:

[carleshf@lambdafunction tmp]$ Rscript source.R

And no output is displayed nor stored. We can change our R script to save the calculated values:

res <- rnorm(n=5, mean=2)
write.table(res, file="source.Rout")

And the results, stored into source.Rout is:

[carleshf@lambdafunction tmp]$ cat source.Rout 
"x"
"1" 2.5619050922511
"2" 1.42802439164873
"3" 1.31925385276224
"4" 2.5502466455518
"5" 0.341230739600524

But with Rscript we can read the arguments from the command line using commandArgs(). So, using this commandArgs(), the source.R must be:

args <- commandArgs(trailingOnly = TRUE)
res <- rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))
write.table(res, file="source.Rout")

And it must be run as:

[carleshf@lambdafunction tmp]$ Rscript source.R 1 2

So, the content, stored into source.Rout, is:

[carleshf@lambdafunction tmp]$ cat source.Rout 
"x"
"1" 1.68212091424515

Resources

  • [1] CMD BATCH documentation: link
  • [2] Rscript documentation: link

The idea of this post it to create a recursive function in R to make a copy of a directory. So the schema will be the following one:

GeneralFunction <- function(from, to) {
    SpecificFunction <- function(from, to) {
        # Create the directory 'to'
        # Get the files in 'from'
        # Get the directories in 'from'
        # Copy the files from 'from' to 'to'
        # foreach directory in from:
        #    SpecificFunction(from + directory, to + directory) 
    }

    # Clean arguments 'from' and 'to'
    SpecificFunction(from, to)
}

The R code to implement that is the following:

CopyDirectory <- function(from, to, overwrite = TRUE, verbose = FALSE) {

    show <- function(...) {
        if (verbose) {
            message("[ CopyDirectory ]: ", ... )
        }
    }

    iCopyDirectory <- function(from, to, overwrite = TRUE, verbose = FALSE) {
        show("Creating new '", to, "'.")
        dir.create(to, showWarnings=FALSE)

        # Get and prepare the content to be copies
        from.dirs <- basename(list.dirs(from))
        from.dirs <- from.dirs[from.dirs != basename(from)]

        from.files <- list.files(from)
        from.files <- from.files[!(from.files %in% from.dirs)]

        # Copy files in this level
        d <- sapply(from.files, function(x){
                complete.from <- file.path(from, x, fsep=.Platform$file.sep)
                complete.to <- file.path(to, x, fsep=.Platform$file.sep)
                show("Coping file '", complete.from, "' to '",
                    complete.to, "'.")
                file.copy(complete.from, complete.to)
            }
        )
        rm("d")

        # Look for deeper levels and apply the same function
        d <- sapply(from.dirs, function(x) {
                complete.from <- file.path(from, x, fsep=.Platform$file.sep)
                complete.to <- file.path(to, x, fsep=.Platform$file.sep)
                iCopyDirectory(complete.from, complete.to, overwrite=FALSE,
                    verbose=verbose)
            }
        )
        rm("d")
    }

    # Clean 'from' and 'to' content
    from <- basename(from)
    to <- basename(to)

    # Remove first level
    if (overwrite) {
        show("Removing '", to, "'.")
        unlink(to, recursive = TRUE, force=TRUE)
    }

    # Start copying
    iCopyDirectory(from, to)
}