News and Blogs

May 11, 2021

How to conduct sequence based patent searches in biotechnology domain

Sequence based search are key in establishing patentability of novel sequences as well as understanding freedom aspects for commercialization of biotechnology based products and technologies. These searches are key to many aspects of biotechnology and life sciences including Therapeutic drugs (Monoclonal Antibody sequences, Recombinant proteins, Peptides, SiRNA), Diagnostics tests/ kits (Primers, Probes, Genetic sequences, Epitopes), Industrial enzymes, Expression systems (Vectors, Promoters, Terminators, Markers) , Research Tools (CRISP-Cas9, etc.). Modified sequences are key in many new inventions and include aspects like mutations, substitutions, as well as use of synthetic modified bases (e.g. psuedouridine has been used in several mRNA vaccines).

Below are the key steps to be conducted during a sequence search:
1. Identification of Correct sequence information
Sequence search starts with confirming the correct information of nucleotide or amino acid sequence for the query sequence. It is important to cross-check the sequence information with respective research team in-case of novel sequences or reliable sources (e.g. regulatory agencies) while picking sequence information of already available sequences (e.g. during development of biosimilar products).

2. Selection of appropriate databases and search algorithms
Since no single database provides comprehensive coverage of the sequence data, we recommend a combination of commercial databases for example, databases hosted on STN (Registry, DGENE), GenomeQuest and No-fee databases (The Lens, NCBI Blast). Further, only a few databases cover Non-patent Literature including scientific publications which include Registry and NCBI database.

Both, similarity search (e.g. Blast) and exact match/ subsequence search are the most frequently used search algorithms. Many experienced sequence searchers recommend a combination of blast similarity search and exact sequence search to prepare a robust and comprehensive dataset for sequence based analysis. Further, additional algorithm parameters should be adjusted based on individual sequence characteristics (including word size, matrix, etc.).

3. Analysis of sequence alignments
Alignment of query sequence with subject sequences (database derived sequence) should be analyzed in detail to shortlist potentially relevant references with respect to the input query sequence. Two key alignment parameters used for sequence analysis include:

% Identity - Match between query and subject sequence for the aligned portion

% Query cover - % of query sequence being mapped to subject sequence in an alignment

Cut-off values for these search parameter vary in range of 50% to 100% depending on search requirements.

Sample sequence Alignment: % Identity: 97% & Query Cover: 100%

Above representation provides a sample sequence alignment highlighting search alignment parameters.

4. Interpreting claim language during sequence analysis
Both sequence alignment as well as claim language should be considered during the analysis of sequence based dataset. Various restrictions in claims of the patents/ published applications should be understood – e.g. % identity limitations, mutations/ substitutions at specific positions, coverage of parts/ fragments of the claimed sequence. Claims related to primers/ probes frequently have coverage on fragments/ contiguous nucleotide bases.

Patent/Published Application Claimed subject matter
WO2021055395A1
(Assignee: Novozymes)
Isolated purified polypeptide having beta glucanase activity, wherein polypeptide has at least 70% identity to at least 99% identity to polypeptide of SEQ ID No: 8.
WO2014042693A1
(Assignee: LS9)
Variant of ester synthase polypeptide comprising SEQ ID NO: 2, wherein the peptide is genetically engineered to have at least one mutation at an amino acid position selected from positions including 4, 5, 7, 15, 80, 147, 166, 266, 395, 466, etc.
WO2015095392A1
(Assignee: Genentech)
Anti-cluster of differentiation 3 (CD3) antibody, comprising multiple hypervariable regions (HVRs)
(a) an HVR-H1 comprising the amino acid sequence of SEQ ID NO: 1 ;
(b) an HVR-H2 comprising the amino acid sequence of SEQ ID NO: 2;
(c) an HVR-H3 comprising the amino acid sequence of SEQ ID NO: 3;
WO2015051456A1
(Assignee: Univ. of Prince Edward Island)
Method of virus detection wherein primer is selected from an isolated nucleic acid having at least 80% identity to the sequence of SEQ ID NO: 4 or 9, or fragments thereof of at least 15 contiguous nucleotide


5. Additional searches for increasing data recall
Targeted sequence searches are preferred tools in many scenarios, but may not be sufficient to comprehensively cover the technology domain. In-order to increase overall recall of the search, sequence-based searches should be supplemented with focused keyword based searches. We further recommend additional searches including relevant Assignee/ Inventor/ Citation based searches to ensure that all the potentially relevant references have been captured in the search.

Sequence search analysis is a pretty niche domain and key challenges include collecting information scattered among various databases and lack of well-trained searchers. A properly laid-out sequence search methodology with details on appropriate databases and algorithms for various scenarios may be helpful in conducting comprehensive sequence based searches.