How to use Scipio (version 1.0)

Requirements

  • Bioperl (download from bioperl.org )
  • YAML Perl module (download from CPAN)
  • BLAT binary (see instructions at UCSC)
  • Genome of choice in fasta format (you can search diArk for sequenced eukaryotic genomes)

What the Scipio script does

The intention of the script is to locate protein query sequences in DNA targets.

BLAT does most of the work, however, there is some kind of postprocessing that is performed by the script. This includes the following tasks:

  • BLAT does not try to align codons that are split by introns. The script searches for missing codons, preferring those that are split at splice sites, and adds the nucleotides to the corresponding exons.
  • For each query sequence, most BLAT hits are discarded and only (the) one best hit is displayed. In our case, we are only interested in nearly 100% sequence identity.
  • We are prepared for the (may be not so) rare case of genes that are found on different targets. First, all BLAT hits are collected, and sorted by score. Then non-overlapping hits are taken to form a collection of hits of the same query.
  • Frameshifts that usually cause BLAT to split an exon into multiple separate matches, are joined back into a single match (this can optionally be disabled).
  • Retrieve all the corresponding sequences and group them together with the BLAT results to form the output described below.

Usage

scipio.pl [<options>] <target> <query>
<target>  is a DNA file
<query>  is a protein file
both in FASTA format
 
Options: --blat_output = <filename>name of BLAT output file (.psl); if not specified, defaults to the newest psl file in the working directory, or to Scipio<nnnn>_blat.psl if none exists
--force_newrun BLAT even if output file already exists (the default is never to run BLAT if the specified or default psl file would be overwritten)
--verbose show verbose information (including progress)
--min_score = <value> minimal score of the best partial hit for a query
--min_identity = <value> minimal identity in any hit (default is 0.9)
--max_mismatch = <value> maximal number of mismatches in any hit (0 means any number, the default)
--split_on_fs do not join matchings separated by seqshifts (the default is to consider them a single exon)
--region_size = <value> size of shown upstream/downstream sequence parts (defaults to 1000, maximum is 75000)
--force_score = <value> specify a score (between 0 and 1) that forces a hit to be shown even if it is contradicting a better one (these hits are not assembled together and shown as <queryname>_(1), etc.)
 
--show = <list>which keys are to be shown in YAML output (details see below)
--show_intron = <list>
--show_exon = <list>
--show_gap = <list>
 
--show_undefshow also keys with undefined values
--show_defaultsshow keys in brackets in all cases
 
Options passed to BLAT (ignored when BLAT is not run):
--blat_bin = <name> name of BLAT executable, defaults to "blat"
--blat_params = <params>parameters passed to BLAT
 
--blat_tilesize = <..>(see BLAT usage information)
--blat_score = <..>
--blat_identity = <..>

Output overview

The script produces an output in YAML format. Additional scripts are supplied that transform the YAML output into other formats (e.g. GFF3). The output is organized as follows:

For each query, a list of the matched parts is given:

<query 1>:
-<1st BLAT hit>
-<2nd BLAT hit>
...

<query 2>:
-<1st BLAT hit>

etc.

Usually, there is only one matched part (= BLAT hit) for each query, except in cases where queries are found on multiple non-overlapping targets (contigs).

At the end of the output, a list of the unmatched sequences is shown.


Description of keys

Keys for BLAT hits
Every BLAT hit is described by the following keys:
IDThe id of the BLAT hit (equals the line number in the psl file).
statusOne of "complete", "partial", "incomplete" or "manual". "complete" means that Scipio had no problems locating the query, "partial" means that the hit is on one of multiple targets each matching a part of the query, "incomplete" means there might be the need to edit it manually. "manual" can be entered if the output was modified by hand.
reasonIf status is "incomplete", the reason why.
prot_lenThe length of the query (in amino acid coordinates).
prot_startStart of the matched part of the query if larger than zero.
prot_endEnd of the matched part of the query if less than prot_len.
prot_seqThe query sequence (in a partial hit: only the matched part).
targetThe name of the target sequence.
target_lenThe total length of the target sequence.
strand"+" (= forward) or "-" (= reverse).
 
dna_startThe location of the hit.
dna_end
 
matchesThe number of matches if less than prot_len.
mismatchesThe number of mismatches.
undeterminedThe number of undetermined residues.
unmatchedThe number of unmatched query residues.
additionalThe number of additional residues in the translated target.
scoreThe score of the hit.
upstreamDNA Sequence upstream of hit, ending before start codon.
upstream_gap(unaligned) DNA Sequence preceding a partial hit.
matchingsThe locations of exons and introns, see next section.
stopcodonStop codon if present.
downstreamDNA Sequence downstream of hit, starting with stop codon if present.
downstream_gap (unaligned) DNA Sequence following a partial hit.

The commandline parameter --show = <comma-separated list>  can be used to choose a user-defined collection of keys to be shown.


Keys for matchings
Every matching (intron/exon) is given with the following keys:
type"intron", "exon" or "gap".
 
nucl_startLocation in the query (in nucleotide coordinates; nucl_end not for introns).
nucl_end
 
dna_startLocation in the target.
dna_end
 
seqDNA sequence of the feature.

Keys that appear only in exons:
seqshiftsLocation of seqshifts.
mismatchlistPositions of mismatches (in amino acid coordinates).
undeterminedlistPositions of undetermined residues (in query coordinates).
inframe_stopcodonsPositions of in-frame stopcodons (in query coordinates).
translationTranslation of the aligned part of the DNA sequence.
overlapNumber of nucleotides at the end of the target that are identical with the beginning of the following target, and not considered part of this location.

The commandline parameters --show_exon = <...>, --show_intron = <...>, --show_gap = <...> can be used to choose user-defined output for matchings. The following additional keys can be activated by this:

nucl_posIn introns, a synonym for nucl_start(=nucl_end)
prot_startThe location transformed into residue coordinates rather than nucleotides. A remainder of 1 is rounded down, and 2 is rounded up.
prot_end
prot_pos(this one for introns)
prot_seqPart of the query matching an exon, or unmatched part in a gap.

Sequence coordinate conventions

All nucleotide and residue coordinates specify the number of preceding letters, that is, unlike in GFF, a location specified as starting at 5 and ending at 9 has length four, starting with the sixth and ending with the ninth character. (GFF would show this as start=6 end=9).
Locations on reverse strand are specified by negative numbers that represent the distance from the end of the reverse strand. Hence, a location specified as starting at -9 and ending at -5, refers to the reverse complement of the previous example (GFF would show this also as start=6 end=9).
Amino acid coordinates are rounded the following way: If a location contains a split codon, our convention is to count it if its second nucleotide belongs to the location. Consequently, in reading frame 2 (meaning the last two nucleotides of the split codon are part of the next location) the start location is rounded down and the codon is considered part of the next location; conversely, a reading frame of 1 means that the codon is considered part of the previous location, and the starting point for the next location is rounded up when given in amino acid / codon coordinates.

link to diark
link to cymobase
link to motorprotein.de
MPG
MPI for biophysical chemistry
Uni-Goettingen
Informatik Uni-Goettingen