How to use Scipio (version 1.0)
Requirements
- Bioperl (download from bioperl.org )
- YAML Perl module (download from CPAN)
- BLAT binary (see instructions at UCSC)
- Genome of choice in fasta format (you can search diArk for sequenced eukaryotic genomes)
What the Scipio script does
The intention of the script is to locate protein query sequences in DNA targets.
BLAT does most of the work, however, there is some kind of postprocessing that is performed by the script. This includes the following tasks:
- BLAT does not try to align codons that are split by introns. The script searches for missing codons, preferring those that are split at splice sites, and adds the nucleotides to the corresponding exons.
- For each query sequence, most BLAT hits are discarded and only (the) one best hit is displayed. In our case, we are only interested in nearly 100% sequence identity.
- We are prepared for the (may be not so) rare case of genes that are found on different targets. First, all BLAT hits are collected, and sorted by score. Then non-overlapping hits are taken to form a collection of hits of the same query.
- Frameshifts that usually cause BLAT to split an exon into multiple separate matches, are joined back into a single match (this can optionally be disabled).
- Retrieve all the corresponding sequences and group them together with the BLAT results to form the output described below.
Usage
| scipio.pl [<options>] <target> <query> | ||
| <target> is a DNA file | ||
| <query> is a protein file | ||
| both in FASTA format | ||
| Options: | --blat_output = <filename> | name of BLAT output file (.psl); if not specified, defaults to the newest psl file in the working directory, or to Scipio<nnnn>_blat.psl if none exists |
| --force_new | run BLAT even if output file already exists (the default is never to run BLAT if the specified or default psl file would be overwritten) | |
| --verbose | show verbose information (including progress) | |
| --min_score = <value> | minimal score of the best partial hit for a query | |
| --min_identity = <value> | minimal identity in any hit (default is 0.9) | |
| --max_mismatch = <value> | maximal number of mismatches in any hit (0 means any number, the default) | |
| --split_on_fs | do not join matchings separated by seqshifts (the default is to consider them a single exon) | |
| --region_size = <value> | size of shown upstream/downstream sequence parts (defaults to 1000, maximum is 75000) | |
| --force_score = <value> | specify a score (between 0 and 1) that forces a hit to be shown even if it is contradicting a better one (these hits are not assembled together and shown as <queryname>_(1), etc.) | |
| --show = <list> | which keys are to be shown in YAML output (details see below) | |
| --show_intron = <list> | ||
| --show_exon = <list> | ||
| --show_gap = <list> | ||
| --show_undef | show also keys with undefined values | |
| --show_defaults | show keys in brackets in all cases | |
| Options passed to BLAT (ignored when BLAT is not run): | ||
| --blat_bin = <name> | name of BLAT executable, defaults to "blat" | |
| --blat_params = <params> | parameters passed to BLAT | |
| --blat_tilesize = <..> | (see BLAT usage information) | |
| --blat_score = <..> | ||
| --blat_identity = <..> | ||
Output overview
The script produces an output in YAML format. Additional scripts are supplied that transform the YAML output into other formats (e.g. GFF3). The output is organized as follows:
For each query, a list of the matched parts is given:
etc.
Usually, there is only one matched part (= BLAT hit) for each query, except in cases where queries are found on multiple non-overlapping targets (contigs).
At the end of the output, a list of the unmatched sequences is shown.
Description of keys
Keys for BLAT hits
Every BLAT hit is described by the following keys:| ID | The id of the BLAT hit (equals the line number in the psl file). |
| status | One of "complete", "partial", "incomplete" or "manual". "complete" means that Scipio had no problems locating the query, "partial" means that the hit is on one of multiple targets each matching a part of the query, "incomplete" means there might be the need to edit it manually. "manual" can be entered if the output was modified by hand. |
| reason | If status is "incomplete", the reason why. |
| prot_len | The length of the query (in amino acid coordinates). |
| prot_start | Start of the matched part of the query if larger than zero. |
| prot_end | End of the matched part of the query if less than prot_len. |
| prot_seq | The query sequence (in a partial hit: only the matched part). |
| target | The name of the target sequence. |
| target_len | The total length of the target sequence. |
| strand | "+" (= forward) or "-" (= reverse). |
| dna_start | The location of the hit. |
| dna_end | |
| matches | The number of matches if less than prot_len. |
| mismatches | The number of mismatches. |
| undetermined | The number of undetermined residues. |
| unmatched | The number of unmatched query residues. |
| additional | The number of additional residues in the translated target. |
| score | The score of the hit. |
| upstream | DNA Sequence upstream of hit, ending before start codon. |
| upstream_gap | (unaligned) DNA Sequence preceding a partial hit. |
| matchings | The locations of exons and introns, see next section. |
| stopcodon | Stop codon if present. |
| downstream | DNA Sequence downstream of hit, starting with stop codon if present. |
| downstream_gap | (unaligned) DNA Sequence following a partial hit. |
The commandline parameter --show = <comma-separated list> can be used to choose a user-defined collection of keys to be shown.
Keys for matchings
Every matching (intron/exon) is given with the following keys:| type | "intron", "exon" or "gap". |
| nucl_start | Location in the query (in nucleotide coordinates; nucl_end not for introns). |
| nucl_end | |
| dna_start | Location in the target. |
| dna_end | |
| seq | DNA sequence of the feature. |
Keys that appear only in exons:
| seqshifts | Location of seqshifts. |
| mismatchlist | Positions of mismatches (in amino acid coordinates). |
| undeterminedlist | Positions of undetermined residues (in query coordinates). |
| inframe_stopcodons | Positions of in-frame stopcodons (in query coordinates). |
| translation | Translation of the aligned part of the DNA sequence. |
| overlap | Number of nucleotides at the end of the target that are identical with the beginning of the following target, and not considered part of this location. |
The commandline parameters --show_exon = <...>, --show_intron = <...>, --show_gap = <...> can be used to choose user-defined output for matchings. The following additional keys can be activated by this:
| nucl_pos | In introns, a synonym for nucl_start(=nucl_end) |
| prot_start | The location transformed into residue coordinates rather than nucleotides. A remainder of 1 is rounded down, and 2 is rounded up. |
| prot_end | |
| prot_pos | (this one for introns) |
| prot_seq | Part of the query matching an exon, or unmatched part in a gap. |
Sequence coordinate conventions
All nucleotide and residue coordinates specify the number of preceding letters, that is, unlike in GFF, a location specified as starting at 5 and ending at 9 has length four, starting with the sixth and ending with the ninth character. (GFF would show this as start=6 end=9).
Locations on reverse strand are specified by negative numbers that represent the distance from the end of the reverse strand. Hence, a location specified as starting at -9 and ending at -5, refers to the reverse complement of the previous example (GFF would show this also as start=6 end=9).
Amino acid coordinates are rounded the following way:
If a location contains a split codon, our convention is to count it if its second nucleotide belongs to the location. Consequently, in reading frame 2 (meaning the last two nucleotides of the split codon are part of the next location) the start location is rounded down and the codon is considered part of the next location; conversely, a reading frame of 1 means that the codon is considered part of the previous location, and the starting point for the next location is rounded up when given in amino acid / codon coordinates.











