CIPHER

CodIng sequence Prediction using HExameRs

CIPHER Additional Information

Score




Nucleotide hexamer frequencies have been shown to be a powerful way to distinguish between coding and non-coding sequences (Ruiz-Orera et al., 2014). We compute one coding score (CS) per nucleotide hexamer, as follows:
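Assuming the log-ratio definition given in Ruiz-Orera et al. (2014), the per-hexamer score can be written as below. The symbol names are a notational choice made here, and the base of the logarithm is not specified on this page.

```latex
\[
CS_{\mathrm{hexamer}(i)} = \log\!\left(
  \frac{\mathrm{freq}_{\mathrm{coding}}(\mathrm{hexamer}(i))}
       {\mathrm{freq}_{\mathrm{non\text{-}coding}}(\mathrm{hexamer}(i))}
\right)
\]
```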

The coding hexamer frequencies are obtained from all annotated coding sequences in protein-coding transcripts that encode experimentally validated proteins (except for zebrafish, for which all protein-coding transcripts were considered). The non-coding hexamer frequencies are calculated from the longest ORF in randomly selected intronic regions. The coding score of each hexamer is stored. We then use the following statistic to measure the coding score of an open reading frame (ORF):
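Consistent with the definitions of i and n that follow, and assuming the statistic is the plain arithmetic mean of the per-hexamer scores as in Ruiz-Orera et al. (2014):

```latex
\[
CS_{\mathrm{ORF}} = \frac{1}{n} \sum_{i=1}^{n} CS_{\mathrm{hexamer}(i)}
\]
```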

Where i indexes the hexamers in the ORF and n is the number of hexamers considered.
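As an illustration only, the per-hexamer table described above could be built along these lines. This is a minimal sketch, not the CIPHER implementation: the function names are hypothetical, the natural logarithm is assumed, every in-frame hexamer of the training sequences is counted, and hexamers absent from either set are skipped instead of being smoothed with a pseudocount.

```python
import math
from collections import Counter


def hexamer_frequencies(sequences):
    """Relative frequency of each in-frame hexamer across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A step of 3 keeps every hexamer in frame (dicodons).
        for start in range(0, len(seq) - 5, 3):
            counts[seq[start:start + 6]] += 1
    total = sum(counts.values())
    return {hexamer: count / total for hexamer, count in counts.items()}


def coding_score_table(coding_seqs, noncoding_seqs):
    """Log ratio of coding to non-coding frequency for each shared hexamer."""
    coding = hexamer_frequencies(coding_seqs)
    noncoding = hexamer_frequencies(noncoding_seqs)
    # Assumption: hexamers missing from either set are skipped, not smoothed.
    return {
        hexamer: math.log(coding[hexamer] / noncoding[hexamer])
        for hexamer in coding
        if hexamer in noncoding
    }
```

A positive value marks a hexamer that is relatively more frequent in the coding set, a negative value one that is more frequent in the non-coding set.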

The hexamers are taken in frame, in steps of 3 nucleotides (dicodons). We do not consider the initial hexamer, which starts with ATG, or the last hexamer, which contains the stop codon. We set two alternative minimum ORF length thresholds: 24 and 60 amino acids (corresponding to 72 and 180 nucleotides, or n = 21 and n = 57).
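The extraction rule and the values of n above can be checked with a short sketch. The function names and the toy score table are hypothetical, the ORF is assumed to run from the ATG through the stop codon, and treating a hexamer that is missing from the table as 0.0 is an assumption rather than documented behaviour.

```python
def orf_hexamers(orf):
    """In-frame hexamers of an ORF, without the first (ATG) and last (stop) ones."""
    orf = orf.upper()
    hexamers = [orf[start:start + 6] for start in range(0, len(orf) - 5, 3)]
    return hexamers[1:-1]


def orf_coding_score(orf, score_table):
    """Mean per-hexamer score over the hexamers kept by orf_hexamers."""
    hexamers = orf_hexamers(orf)
    if not hexamers:
        raise ValueError("ORF too short: no hexamers left to score")
    # Assumption: a hexamer missing from the table contributes 0.0.
    return sum(score_table.get(h, 0.0) for h in hexamers) / len(hexamers)


# ORFs of 72 and 180 nucleotides (24 and 60 codons, stop included).
short_orf = "ATG" + "GCT" * 22 + "TAA"
long_orf = "ATG" + "GCT" * 58 + "TAA"
print(len(orf_hexamers(short_orf)))  # 21
print(len(orf_hexamers(long_orf)))   # 57
```

An ORF of c codons yields c - 1 overlapping dicodons, so dropping the first and last leaves n = c - 3, which gives 21 and 57 for the two thresholds.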


Output example




Two files are produced by the program:


A) Score.txt: Tab-delimited text file with information about the ORFs with significant coding potential:


Sequence: Transcript identifier (id) in input FASTA file.

ORF_number: Numerical ORF id, ordered by the length of the ORF in the transcript.

ORF_pos: Relative position of the ORF in the transcript.

ORF_len: ORF length in nucleotides.

transcript_len: Transcript length in nucleotides.

coding_score: Computed coding score (see Score).


B) ORF.fa: FASTA file with the nucleotide sequence of every ORF detected with significant coding potential.
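For downstream filtering, the Score.txt columns listed above can be loaded with Python's standard csv module. This is a sketch under stated assumptions: the columns are taken to appear in the order listed, the first line is taken to be a header row (pass has_header=False otherwise), and ORF_number and ORF_pos are kept as text because their exact format is not described here. The function name read_scores is hypothetical.

```python
import csv

COLUMNS = ["Sequence", "ORF_number", "ORF_pos", "ORF_len",
           "transcript_len", "coding_score"]


def read_scores(path, has_header=True):
    """Read the tab-delimited score table into a list of dicts."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # Assumption: first line holds column names.
        for fields in reader:
            if not fields:
                continue  # Skip blank lines.
            row = dict(zip(COLUMNS, fields))
            row["ORF_len"] = int(row["ORF_len"])
            row["transcript_len"] = int(row["transcript_len"])
            row["coding_score"] = float(row["coding_score"])
            rows.append(row)
    return rows
```

For example, `[r for r in read_scores("Score.txt") if r["ORF_len"] >= 180]` would keep only the ORFs that meet the longer of the two length thresholds.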


Read more




The hexamer-based coding score has been used in the following studies:

Ruiz-Orera, J., Messeguer, X., Subirana, J. A., & Albà, M. M. (2014). Long non-coding RNAs as a source of new peptides. eLife, 3, 1–24.

Ruiz-Orera, J., Hernandez-Rodriguez, J., Chiva, C., Sabidó, E., Kondova, I., Bontrop, R., Marqués-Bonet, T., & Albà, M. M. (2015). Origins of de novo genes in human and chimpanzee. PLoS Genetics, 11(12), e1005721.