Biosequence searching………..Understanding the basics
Biosequence searching is a very niche domain in biotechnology that involves searching for sequences including nucleic acid and amino acid sequences (usually referred to as SEQ ID No.). These are specialized searches that require both
the specific databases and technical expertise to search and interpret the results. Herein we look at some crucial aspects associated with biosequence searching including – (1) Sequence nature & complexity (2) Choice of databases
(3) Search Tools & algorithms (4) Sequence Alignment terminology.
Sequence searching is integral to various aspects of Research & Development including:
- Antibody sequences (Specially variable regions of Heavy & Light chain, CDRs)
- Recombinant proteins/ Enzymes/ Peptides
- Gene/ Genomic sequences (including mutations, substitutions)
- Recombinant vector/ components (Promoter, Terminator, Selection markers)
- Primers & Probes (including multiple primers for multiplex PCRs)
- SiRNA, Epitope, Motif search
- Modified/ Non-natural bases
Searches in this domain become increasingly challenging considering ocean of information being available in public domain. Multiple factors have lead to this spike in sequence data generation globally including completion of Human Genome Project, Sharp
reduction in sequencing costs, Biologic medicine taking the centre stage, Personalized medicine, and Diagnostic research around viral and other pathogenic diseases.
Sample sequence Alignment: % Identity: 97% & Query Cover: 100%
We present below aspects that are crucial for such searches and researches picks these tools on case-to-case basis based on the precise requirements.
nature and complexity: Understanding technical complexity of sequence search is key to conducting accurate sequence search. Sequence searches may range from small sequences (5 or less bases in length) to Medium sequences
(~ 100 bases in length), to Larger sequences (> 1,000 bases including genomic sequences). It is of utmost importance to know if sequence fragments may be of interest as some references may be claim/ disclose specific domains/ segments
of a large sequence.
Choice of Database: Commercial and non-fee search engines are available with their own coverage and search features. Some of the best known databases include databases hosted on STN (Registry, USGENE, DGENE, PCTGEN),
GenomeQuest, The Lens, and NCBI BLAST search. Optimal searching approach should use a commercial database as starting point and ensuring comprehensiveness by supplementing search with non-fee databases including The Lens.
Search Tools: Skilled sequence searcher uses appropriate searching tools/ algorithms based on the end requirement. Some basic tools include Blast search (Similarity search), Exact sequence search (exactly identical
sequence), Sub-sequence search (exactly identical sequence that may be part of larger sequence). A mix of similarity search and Exact sequence search is a commonly used strategy to ensure comprehensiveness of the dataset.
Sequence Alignment Terminology: Sequence search involves understanding of the sequence alignments between input query sequence and subject sequences identified by the database. Both % Identity (match between query
and subject sequence for the aligned portion) and % Query cover (% of query sequence being mapped to subject sequence) are two key aspects that help searches in identification of potentially relevant documents. % Identity of 80% is
generally used as a standard to broadly cover the domain but it may be varied depending on user requirements.
Sequence searches are critical for various searches including Patentability searches, FTO searches, Invalidity searches, Technology Landscape searches. Reach out to our team of experts for specific requirements on Biosequence
searching.