Protein ALignment Optimizer

Tutorial

1 Introduction

This tutorial explains how to gather and create the input files needed to run the PALO algorithm, using the ENSEMBL database through BioMart. The files can be used with either the online version of PALO or the downloadable version on GitHub.

You can create your own files or download them from any database as long as they match the input format explained in the File Format section.

2 File Format

Input files must be .txt or .csv files without a header row, with columns separated by tabs. You will need two different files:

2.1 Homologs file

This file contains the gene identifiers of the homologs you want to align. Each row can contain several homologous gene identifiers, as long as they belong to the same alignment. Identifiers must be separated by tabs.

A demo file with two groups of homologous genes to align can be found here.
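
As an illustration of the format described above, here is a minimal Python sketch that reads homologs rows into one list of gene identifiers per alignment group. The function name parse_homologs and the placeholder identifiers are assumptions made for this example; they are not part of PALO.

```python
def parse_homologs(lines):
    """Return one list of gene identifiers per non-empty row."""
    groups = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank rows
        # Identifiers within a row are separated by tabs.
        genes = [field.strip() for field in line.split("\t") if field.strip()]
        groups.append(genes)
    return groups


# Hypothetical rows: two groups of homologs, placeholder identifiers.
example = [
    "GENE_A1\tGENE_B1\tGENE_C1\n",
    "GENE_A2\tGENE_B2\n",
]
print(parse_homologs(example))
```

Each inner list corresponds to one alignment, so the number of lists equals the number of groups in the file.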

2.2 Species file

This file contains the isoform information for each species. The first column holds the gene identifier, the second the protein identifier, and the third the length of its protein-coding sequence (CDS).

A demo file can be found here.
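
The three-column layout can be checked before uploading with a short Python sketch like the one below. It groups isoforms by gene and rejects rows that do not have exactly three tab-separated columns or whose length is not an integer. The function name parse_species and the strict error handling are assumptions made for illustration; PALO's own validation may differ.

```python
from collections import defaultdict


def parse_species(lines):
    """Map each gene identifier to a list of (protein_id, cds_length) pairs."""
    isoforms = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank rows
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"row {number}: expected 3 tab-separated columns, got {len(fields)}"
            )
        gene_id, protein_id, length = (field.strip() for field in fields)
        # int() raises ValueError if the length column is not an integer.
        isoforms[gene_id].append((protein_id, int(length)))
    return dict(isoforms)
```

A gene with several isoforms simply appears on several rows, one per protein identifier.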

3 File Creation

Files can be created manually for a few genes, but the easiest way is to download them directly from ENSEMBL. Thorough tutorials covering the basic use of BioMart are available.

For the homologs file, either provide a list of protein-coding genes in the filters section or select the protein coding option as the gene type. The multi-species comparison filter can also help narrow the search to particular homologs.

For the species file, select the desired species and choose the gene ID, protein ID and CDS length as attributes.

Do this for each species you are using and concatenate all the resulting files into a single one. The easiest way to do that is with the cat command in Linux.
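
If cat is not available, the same concatenation can be sketched in Python. The version below also drops blank lines and makes sure every row ends with a newline, so the last row of one file is never glued to the first row of the next. The function names and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def merge_lines(sources):
    """Chain several iterables of lines, skipping blank lines and
    ensuring that every kept line ends with exactly one newline."""
    merged = []
    for source in sources:
        for line in source:
            stripped = line.rstrip("\r\n")
            if stripped.strip():
                merged.append(stripped + "\n")
    return merged


def concatenate_files(paths, out_path):
    """Write the rows of every file in `paths`, in order, into `out_path`."""
    sources = [Path(path).read_text().splitlines() for path in paths]
    Path(out_path).write_text("".join(merge_lines(sources)))
```

For example, concatenate_files(["human.txt", "mouse.txt"], "species.txt") would combine two per-species downloads into one species file, assuming those two files exist in the working directory.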

4 Upload files

Once your files are ready, the process is simple. Go to Run>Upload files and upload your species file first. The file can have any name, as long as it follows the correct format and has a .txt or .csv extension. Click the select button, choose your file, and then click the upload button or the play symbol next to the file. Once the file has been validated, you will be asked to upload the homologs file in the same way.

5 Run PALO

If the files were uploaded correctly, you will be redirected to a section where you can preview your uploaded files in an HTML table or run the algorithm directly by pressing the blue button.

After that, please be patient while your files are loaded into the programme. Progress will start to appear within a few seconds to a few minutes; when it reaches 100%, the results will load automatically.

The Run>Last result section shows the results of your last PALO run, provided you have not tried to upload any new files since then.

6 Download results

You can download an output file in which each row contains, separated by tabs, one protein ID for each homologous gene provided. Those are the proteins you should align. You can also download an error log file containing the IDs of the genes (the first one in each row of the homologs file) that the script could not process, either because information was missing for one of the transcripts (double-check that it is protein coding), because the group had too many combinations (more than 10,000,000), or because of another error.
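
To inspect the output, a short Python sketch such as the following can read the tab-separated rows and look up each selected protein in the species data, showing which gene it belongs to and how long its coding sequence is. The function names are assumptions made for illustration, and the species rows are assumed to be already available as (gene ID, protein ID, CDS length) tuples.

```python
def read_output(lines):
    """Return one list of protein IDs per non-empty output row."""
    return [
        line.rstrip("\r\n").split("\t")
        for line in lines
        if line.strip()
    ]


def annotate(output_rows, species_rows):
    """Attach (gene_id, cds_length) to every protein ID in the output.

    `species_rows` is an iterable of (gene_id, protein_id, cds_length).
    Proteins missing from the species data get (None, None).
    """
    by_protein = {
        protein: (gene, length) for gene, protein, length in species_rows
    }
    return [
        [(protein, *by_protein.get(protein, (None, None))) for protein in row]
        for row in output_rows
    ]
```

Each annotated row then lists the proteins chosen for one alignment together with their genes and lengths, which makes it easy to see how similar in length the selected isoforms are.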
