Working with CYCHP files from Affymetrix CytoScan HD

The tool apt-copynumber-cyto [1] from the Affymetrix Power Tools suite [2] generates a CYCHP file for each raw CEL file. Setting the argument --text-output to true also generates the ASCII text version of these files. The content and organization of these files are described in [3]. In summary, a CYCHP file is a pool of tables. Each table represents a data set, with its own definitions (columns) and specific samples (rows).

R is a widely used programming language in bioinformatics, statistics and biomedicine. R can read tabular files such as CSV and TSV, but it cannot read the text version of a CYCHP file directly, since that file holds several tables rather than one.

To load the content of the cychp.txt files into an R workspace, I wrote a Python script that splits a cychp.txt file into several pieces. Each piece is one of the inner tables of the CYCHP file, written out as a separate TSV file.

First I stored in a dictionary the possible headers we can find in the files (the values), each associated with an identifier (the keys):

headers = {
  'sds': 'Chromosome Display StartIndex MarkerCount MinSignal MaxSignal MedianCnState ' +
    'HomFrequency HetFrequency Mosaicism LOH MedianSignal',
  'cnds': 'ProbeSetName Chromosome Position Log2Ratio WeightedLog2Ratio SmoothSignal',
  'apds': 'ProbeSetName Chromosome Position AllelePeaks0 AllelePeaks1',
  'mabsds': 'Index SCAR',
  'cn_ds': 'SegmentID Chromosome StartPosition StopPosition MarkerCount MeanMarkerDistance ' +
    'State Confidence',
  'lohds': 'SegmentID Chromosome StartPosition StopPosition MarkerCount MeanMarkerDistance ' +
    'LOH Confidence',
  'cnnlohds': 'SegmentID Chromosome StartPosition StopPosition MarkerCount ' +
    'MeanMarkerDistance CNNeutralLOH Confidence',
  'genotype': 'Index Call Confidence ForcedCall ASignal BSignal SignalStrength Contrast',
}
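
One way to use this dictionary is to invert it, so that a header line read from the file can be looked up directly. The following is only a minimal sketch: it assumes the header line in the text file is tab-separated and otherwise identical to the strings above, identify_table is a hypothetical helper name, and only two entries of the dictionary are repeated to keep it short.

```python
# Two entries copied from the dictionary above, to keep the sketch short.
headers = {
    'mabsds': 'Index SCAR',
    'apds': 'ProbeSetName Chromosome Position AllelePeaks0 AllelePeaks1',
}

# Invert the mapping: header string -> table identifier.
by_header = {columns: name for name, columns in headers.items()}


def identify_table(line):
    """Return the identifier of the table whose header matches `line`, or None."""
    # Assumption: columns are tab-separated in the cychp.txt file.
    normalized = ' '.join(line.strip().split('\t'))
    return by_header.get(normalized)


print(identify_table('Index\tSCAR\n'))
print(identify_table('Some\tOther\tHeader\n'))
```

A header that is not in the dictionary returns None, which is the case the script handles by numbering the table instead of naming it.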

Knowing the header of each table in the CYCHP files, we can read the file line by line and break it apart, creating a TSV file for each table.

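def mach(header, known):
    # Assumed helper (its body is not shown in this post): return the key of
    # the known table whose header equals this line, or None if none matches.
    for name, columns in known.items():
        if columns == header:
            return name
    return None

with open("101_S3A_C5.cyhd.cychp.txt", 'r') as fi:
    out = "101_S3A_C5"
    fo = None          # currently open output file, if any
    nt, p = 1, '#'     # counter for unknown tables; previous line
    for line in fi:
        # A non-comment line right after a comment line is a table header.
        if p.startswith('#') and not line.startswith('#'):
            tn = mach(line.strip().replace('\t', ' '), headers)
            if tn is None:
                fo = open(out + '.table' + str(nt) + '.txt', 'w')
                nt += 1
            else:
                fo = open(out + '.' + tn + '.txt', 'w')
        # A comment line right after a data line closes the current table.
        if not p.startswith('#') and line.startswith('#') and fo is not None:
            fo.close()
            fo = None
        if fo is not None:
            fo.write(line)
        p = line
    # Close the last table if the file does not end with a comment line.
    if fo is not None:
        fo.close()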

The comment lines (the ones starting with #) are skipped. The script creates a file called 101_S3A_C5.<table name>.txt for each known table and 101_S3A_C5.table<number>.txt for each unknown table.
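
Before moving to R, a split file can be checked quickly from Python. This is a rough sketch using the standard csv module: read_table is a hypothetical helper, and the sample lines are invented values standing in for a split 'mabsds' file, not real CytoScan output.

```python
import csv


def read_table(lines):
    """Parse tab-separated lines into (column_names, rows_as_dicts)."""
    reader = csv.DictReader(lines, delimiter='\t')
    rows = list(reader)
    return reader.fieldnames, rows


# Invented two-row sample standing in for a split 'mabsds' file.
sample = ['Index\tSCAR\n', '0\t0.5\n', '1\t0.7\n']
columns, rows = read_table(sample)
print(columns)
print(len(rows))
```

The function accepts any iterable of lines, so an open file handle for one of the generated .txt files can be passed in place of the sample list.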

References

  • [1] apt-copynumber-cyto’s manual: link
  • [2] Affymetrix Power Tools: link
  • [3] CYCHP Format: link