Glossary
- Accession number
- An identifier associated with a sequence or entity supplied by a particular biological database that uniquely identifies the given sequence or entity within that database.
- Annotation
- The process of commenting on the genomic consensus sequence: specifically, identifying genes and determining their function, and adding pertinent information (such as the gene coded for, the coding sequence, the amino acid sequence, or other commentary) to the database entry of a raw DNA sequence. See Structural Annotation and Functional Annotation for more details.
- Assembly
- The process of reconstructing larger genomic sequences from smaller randomly derived subsequences. This process relies on determining sequence similarity between overlapping sequences. Also refers to the collection of assembled sequences that are the output of an assembly program.
- Bioinformatics
- The science of managing and analyzing biological data using advanced computing techniques.
- Clear Range
- The portion of a sequencing read that contains high-quality sequence data. By definition, this excludes vector sequence and low-quality, error-prone sequence.
- Coding Sequence (CDS)
- That portion of a gene or mRNA which directly specifies the amino acid sequence of a predicted protein.
- Contig
- A contiguous length of assembled consensus genomic sequence in which the order of bases is known with a high level of confidence.
- Functional Annotation
- The functional characterization and classification of a sequence (typically protein coding). This typically consists of describing the biological function, enzymatic function, and localization using a variety of pre-defined terms or identifiers.
- Mate pair
- The set of two reads resulting from pairwise end sequencing, that is, from sequencing both ends of the same fragment.
- Open Reading Frame (ORF)
- A series of codons (nucleotide triplets) uninterrupted by termination codons. An unidentified nucleotide sequence has six potential reading frames, three on each strand, and each is potentially translatable into a protein.
- Sequencing Library
- A collection of fragments from a genome that are cloned within plasmid vectors or attached to adaptors.
- Sequencing Read
- A contiguous length of nucleotide bases that is generated using a sequencing machine.
- Scaffold
- A scaffold is a portion of the genome sequence reconstructed from end-sequenced whole-genome shotgun clones. Scaffolds are composed of a linear ordering (order & orientation) of contigs joined by mate pairs, as well as gaps. Celera Assembler uses complex criteria to build scaffolds, but every sequence gap in the output is spanned by at least two mate pairs.
- Singleton
- A singleton is a read that could not be assembled. Singletons can represent contamination, unique sequence with no overlap due to the fluctuation of random coverage, or sequence with so many overlaps that it could not be assembled efficiently.
- Structural Annotation
- The identification of notable features by position on a DNA, RNA, or protein sequence. Typically these include transcribed sequences, splice junctions, binding sites, functional motifs, active sites, and more.
- Transcriptome
- The full complement of activated genes, mRNAs, or transcripts in a particular sample at a particular time.
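The Open Reading Frame entry above mentions six potential reading frames. As a minimal sketch only, the Python below scans the three forward frames and the three reverse-complement frames of a DNA string and reports every run of codons that contains no termination codon. The function names, the `min_codons` parameter, and the use of the standard stop codons TAA, TAG, and TGA are assumptions made for illustration; real gene finders apply further criteria (for example, requiring a start codon).

```python
# Assumption: the standard genetic code's three stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def open_frames(seq, min_codons=1):
    """Yield (strand, frame, codons) for each stop-free run of codons.

    strand is "+" for the sequence as given and "-" for its reverse
    complement; frame is the 0-based offset (0, 1, or 2) on that strand.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = []
            # Step through complete codons only.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOP_CODONS:
                    if len(run) >= min_codons:
                        yield strand, frame, "".join(run)
                    run = []
                else:
                    run.append(codon)
            # Report a run that reaches the end of the sequence.
            if len(run) >= min_codons:
                yield strand, frame, "".join(run)


if __name__ == "__main__":
    for strand, frame, codons in open_frames("ATGAAATAGCCC", min_codons=2):
        print(strand, frame, codons)
```

Raising `min_codons` filters out the short stop-free runs that occur by chance in any sequence.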
[The following definitions are derived from the NCBI BLAST Guide Glossary]
- Alignment
- The process of lining up two or more gene or protein sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
- BLAST (Basic Local Alignment Search Tool). (Altschul et al.)
- A sequence comparison algorithm, optimized for speed, that searches sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold "S". The "T" parameter dictates the speed and sensitivity of the search. For additional details, see one of the BLAST tutorials (Query or BLAST) or the narrative guide to BLAST.
- Conservation
- Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
- E value (Expectation value)
- The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
- Filtering
- Also known as masking. The process of hiding regions of (nucleic acid or amino acid) sequence, such as low-complexity regions, that frequently lead to spurious high scores.
- Gap
- A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introducing a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extending the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
- Homology
- Similarity attributed to descent from a common ancestor.
- Identity
- The extent to which two (nucleotide or amino acid) sequences are invariant.
- Low Complexity Region
- Regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of one or a few residues.
- PSI-BLAST (Position-Specific Iterative BLAST)
- An iterative search using the BLAST algorithm. A profile is built after the initial search and is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile. Details can be found in this discussion of PSI-BLAST. (Altschul et al.)
- Query
- The input sequence with which all of the entries in a database are to be compared.
- Similarity
- The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
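The Gap and Identity entries above can be illustrated with a short sketch. The Python below takes two sequences that are already aligned (with "-" marking gap positions), deducts a larger amount when a gap is opened than when it is extended, and reports the fraction of identical columns. The function name and every scoring value (`match`, `mismatch`, `gap_open`, `gap_extend`) are hypothetical choices for illustration and are not BLAST defaults. Dividing identity by the full alignment length is only one of several conventions in use.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=5, gap_extend=2):
    """Return (score, identity_fraction) for two pre-aligned strings.

    Both strings must have the same non-zero length and use "-" for gaps.
    The first column of a gap costs gap_open; each further column of the
    same gap costs gap_extend (an assumed convention for this sketch).
    """
    if not a or len(a) != len(b):
        raise ValueError("aligned sequences must be non-empty and equal in length")

    score = 0
    identical = 0
    prev_gap = None  # which sequence held the gap in the previous column

    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            # Continuing a gap in the same sequence is an extension.
            score -= gap_extend if gap_in == prev_gap else gap_open
            prev_gap = gap_in
        else:
            prev_gap = None
            if x == y:
                score += match
                identical += 1
            else:
                score += mismatch

    # Identity over all alignment columns, gap columns included.
    return score, identical / len(a)


if __name__ == "__main__":
    score, identity = score_alignment("AC--GT", "ACTTGT")
    print(score, round(identity, 3))
```

With the illustrative values above, a single two-column gap costs 5 + 2 = 7, whereas two separate one-column gaps would cost 10, which is how a scoring scheme of this kind discourages scattering many short gaps through an alignment.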