
May 21, 2020

Mutation Surveillance using COVID-19 Sequencing may be the key to designing Diagnostics & Therapeutics (Drugs and Vaccines)

As the coronavirus spreads across the globe through human-to-human transmission, it acquires mutations along the way, giving rise to new strains as it circulates in local communities. Tracking the mutation journey of the novel coronavirus (SARS-CoV-2, the cause of COVID-19) will be crucial in designing optimal diagnostic tools/kits and therapeutics such as drugs and vaccines. Research institutes and companies around the world are sequencing the virus with technologies such as Illumina (MiniSeq, MiSeq, NovaSeq, HiSeq), Oxford Nanopore (MinION, GridION), and Thermo Fisher (Ion Torrent).

Researchers across the world are generating and sharing sequence data for various coronavirus strains, which will help with the following:

  • Mapping the original source of SARS-CoV-2
  • Tracking the spread of COVID-19
  • Evaluating response to treatment (drugs/vaccines)
  • Evaluating diagnostic tests/kits for SARS-CoV-2
  • Correlating infection and fatality rates across mutant viral strains

Officially named SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), the novel coronavirus belongs to the β-coronavirus genus of enveloped, single-stranded, positive-sense RNA viruses. SARS-CoV-2 has a complete genome size of 29.9 kb, and the virion measures about 0.1 μm in diameter. Nearly two-thirds of the viral genome corresponds to the first open reading frame (ORF1a/b), which encodes 16 non-structural proteins (NSPs). The rest of the genome encodes four essential structural proteins, Spike (S) glycoprotein, small Envelope (E) protein, Matrix (M) protein, and Nucleocapsid (N) protein, along with several accessory proteins. SARS-CoV-2 uses the angiotensin-converting enzyme 2 (ACE2) receptor to enter human cells, as did SARS-CoV, which emerged in 2002.

Comparative sequence analysis
Bats are the best-known reservoirs for a wide variety of coronaviruses, including SARS-CoV. Rather than passing directly from bats to humans, the virus may have passed through an intermediate host (e.g. pangolin, civet) that played a critical role in crossing the species barrier into humans.

Virus details | Genomic size (bp) | Complete genome alignment | Spike protein alignment
--- | --- | --- | ---
SARS-CoV-2 (MN908947.3) | 29,903 | Reference sequence for all comparisons | Reference sequence for all comparisons
Bat CoV RaTG13 (MN996532.1) (QHR63300.2) | 29,885 | Identity: 96.12%, Query cover: 99% | Identity: 97.41%, Query cover: 100%
Pangolin CoV (Guangdong) | 29,825 | Identity: 90.57%, Query cover: 99% | Not available
Pangolin CoV (MT040335.1) (QIA48632.1) | 29,806 | Identity: 85.98%, Query cover: 99% | Identity: 92.38%, Query cover: 100%
SARS-CoV (nucleotide: AY278487.3; protein: AAU93320.1) | 29,745 | Identity: 80.24%, Query cover: 98% | Identity: 86.51%, Query cover: 91%
MERS-CoV (KF961221.1) | 30,090 | Sequence alignment is low when compared with SARS-CoV-2 | Sequence alignment is low when compared with SARS-CoV-2

Note: Complete genome alignments were run on NCBI BLAST using the megablast program for the RaTG13 and pangolin sequences, whereas discontiguous megablast was used for SARS-CoV to cover the complete sequence. Two different SARS-CoV records were used for the nucleotide and protein alignments.
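
To make the two figures in the table concrete, here is a minimal Python sketch of how percent identity and query cover can be computed from a gapped pairwise alignment. It assumes the usual convention that identity is the number of identical columns divided by the alignment length, and that query cover is the share of the full query taking part in the alignment. The function names and the toy sequences are hypothetical and are not taken from the BLAST runs above.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Identical columns divided by alignment length (gap columns never match)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != "-"
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
    )
    return 100.0 * matches / len(aligned_query)


def query_cover(aligned_query: str, query_length: int) -> float:
    """Share of the full-length query that takes part in the alignment."""
    covered = sum(base != "-" for base in aligned_query)
    return 100.0 * covered / query_length


# Toy alignment, made up for illustration (not real viral sequence).
query_aln = "ATGTTTGTT-TTCTTG"
subject_aln = "ATGTTCGTTATTCTTG"

print(percent_identity(query_aln, subject_aln))   # 14 of 16 columns match
print(query_cover(query_aln, query_length=20))    # 15 of 20 query bases aligned
```

A real BLAST report aggregates over several high-scoring segments, so the sketch only illustrates the definitions rather than reproducing the reported numbers.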

Mutation analysis in Spike region for small dataset
Researchers globally have already submitted more than seventeen thousand SARS-CoV-2 sequences. We randomly picked 10 recently collected SARS-CoV-2 sequences each from China, the USA, India, the United Kingdom, Italy, and Africa to analyze mutations in the Spike glycoprotein that may have a major impact on the development of therapeutics. These sequences were aligned with the reference SARS-CoV-2 sequence MN908947.3 (positions 21563–25384, the CDS for the spike protein).

Insights on mutations in Spike protein, by country/territory

China (10 sequences): Sequence alignment shows that 30% of the deposited sequences are identical to the reference Wuhan genome, whereas the remaining 70% each carry a single point mutation. The most frequent mutation is A23403G (40%), followed by C25207T (20%) and C21711T (10%). This is in line with expectations, as the virus genome showed no mutation or just a single mutation during the initial period of spread.

USA (10 sequences): No sequence was 100% identical to the reference in the spike region; every sample had one or two mutations. The most frequent mutation is A23403G (~80% of samples), at times accompanied by one of the following: A23586G, G22349T, C21590A. The few samples without A23403G showed a double mutation, C21575T and C24034T.

India (10 sequences): One deposited sequence has a spike region 100% identical to the reference genome. Unlike other countries, sequences deposited from India showed two key mutations, A23403G (50%) and C23929T (40%). Another key mutation based on the sequence listing is A24389M (M is the IUPAC ambiguity code for A or C, so the exact base is not resolved).

United Kingdom (10 sequences): None of the spike sequences used were identical to the reference genome; these samples had one to three point mutations. Around 90% of the deposited sequences carried A23403G, and some had one or two additional mutations such as G24488T, G25112T, or G21624T. C24981N also appears in many sequences, although N denotes an undetermined base rather than a specific substitution.

Italy (10 sequences): No sample was identical to the reference genome. All samples carried A23403G, and a couple of samples had additional mutations (G22363T, C21575T).

Africa (10 sequences): Five samples each were taken from the Republic of Congo and South Africa. All samples carried A23403G, and a couple of samples had additional mutations (A24012G, C22675T).

Note: The current analysis is limited to a small dataset of 10 recently deposited sequences from each of these regions. We suggest more rigorous research before drawing any conclusions about mutations in the spike protein.
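
The kind of comparison described above (align each sample to the reference spike CDS, list substitutions in genome coordinates, then report how often each one occurs) can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow actually used for the analysis: the function names are hypothetical, the input is assumed to be an already-computed gapped pairwise alignment that starts at position 21563, and only substitutions are reported (insertions and deletions are skipped).

```python
from collections import Counter
from typing import Dict, List

SPIKE_START = 21563  # 1-based start of the spike CDS in MN908947.3 (from the text)


def call_substitutions(ref_aln: str, sample_aln: str,
                       start: int = SPIKE_START) -> List[str]:
    """Return substitutions such as 'A23403G' from a gapped pairwise alignment."""
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    calls = []
    pos = start - 1  # genome coordinate of the last reference base consumed
    for ref_base, sample_base in zip(ref_aln.upper(), sample_aln.upper()):
        if ref_base == "-":
            continue  # insertion in the sample: no reference coordinate to report
        pos += 1
        if sample_base != "-" and sample_base != ref_base:
            calls.append(f"{ref_base}{pos}{sample_base}")
    return calls


def mutation_frequencies(per_sample_calls: List[List[str]]) -> Dict[str, float]:
    """Percentage of samples that carry each mutation."""
    counts = Counter(m for calls in per_sample_calls for m in set(calls))
    return {m: 100.0 * n / len(per_sample_calls) for m, n in counts.items()}


# Toy input, made up for illustration: one changed base at the fifth position.
print(call_substitutions("ATGTTTGT", "ATGTCTGT"))

# Hypothetical call sets for four samples.
samples = [["A23403G"], ["A23403G", "C21575T"], [], ["A23403G"]]
print(mutation_frequencies(samples))
```

Ambiguity codes in a sample (such as M or N) would be reported as substitutions by this sketch, which mirrors how A24389M and C24981N appear in the listings; a stricter version would filter them out.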

The above analysis shows a few striking aspects of mutation in SARS-CoV-2. The initial reference sequence was present among genomes submitted from China but was hardly observed in any other country, suggesting that it was mostly mutated strains that propagated to other countries through human-to-human transmission. Mutated sequences deposited in China had only a single mutation (A23403G being the most prominent). This mutation was present in almost all the genomes submitted from the United Kingdom, Italy, and Africa, with a few additional mutations in some genomes, suggesting that the virus acquired further mutations on top of this base mutation as it spread.

Sequences from the USA show that around 20% of samples carried a completely different set of mutations, without the prominent A23403G: a double mutation, C21575T and C24034T. A different trend was evident in genomes submitted from India, where, apart from A23403G, a second prominent mutation, C23929T, was present in genomes lacking A23403G. Another mutation, A24389M (position 24,389), was also observed in several samples. Overall, apart from the most frequent mutation, A23403G, which dominates in all regions, only C21575T is present in two different countries (the USA and Italy), which suggests that the majority of mutations are acquired as the virus circulates within the local population of a region.
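
The observation that only two mutations recur across regions can be checked directly from the mutations named in the per-region summaries. The Python sketch below does that with a simple inverted index; the mutation sets are copied from the text, while the function names are hypothetical. Because each set lists only the mutations mentioned in the summaries for 10 sequences per region, the result says nothing about mutations that were not reported.

```python
from collections import defaultdict

# Spike mutations named in the per-region summaries (10 sequences per region).
reported = {
    "China": {"A23403G", "C25207T", "C21711T"},
    "USA": {"A23403G", "A23586G", "G22349T", "C21590A", "C21575T", "C24034T"},
    "India": {"A23403G", "C23929T", "A24389M"},
    "United Kingdom": {"A23403G", "G24488T", "G25112T", "G21624T", "C24981N"},
    "Italy": {"A23403G", "G22363T", "C21575T"},
    "Africa": {"A23403G", "A24012G", "C22675T"},
}


def regions_by_mutation(per_region):
    """Invert {region: mutations} into {mutation: regions}."""
    index = defaultdict(set)
    for region, mutations in per_region.items():
        for mutation in mutations:
            index[mutation].add(region)
    return dict(index)


def shared_mutations(per_region, min_regions=2):
    """Mutations reported in at least `min_regions` regions."""
    return {
        mutation: sorted(regions)
        for mutation, regions in regions_by_mutation(per_region).items()
        if len(regions) >= min_regions
    }


shared = shared_mutations(reported)
for mutation in sorted(shared):
    print(mutation, shared[mutation])
```

With the sets above, the only mutations returned are A23403G (all six regions) and C21575T (Italy and the USA), matching the summary in the text.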

Please note that the single-nucleotide substitution A23403G corresponds to the amino acid change D614G in the spike protein.
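
The link between genome position 23403 and spike residue 614 follows from the CDS coordinates given earlier (21563 to 25384 in MN908947.3). A minimal Python sketch of the arithmetic, with a hypothetical function name:

```python
SPIKE_START = 21563  # 1-based, inclusive (from the text)
SPIKE_END = 25384    # 1-based, inclusive (from the text)


def genome_to_codon(pos: int, cds_start: int = SPIKE_START,
                    cds_end: int = SPIKE_END):
    """Map a 1-based genome position to (codon number, position within codon)."""
    if not cds_start <= pos <= cds_end:
        raise ValueError(f"position {pos} lies outside the CDS")
    offset = pos - cds_start          # 0-based offset into the CDS
    return offset // 3 + 1, offset % 3 + 1


codon_number, base_in_codon = genome_to_codon(23403)
print(codon_number, base_in_codon)   # 614 2
```

Position 23403 is therefore the second base of codon 614. Assuming the reference codon there is GAT (aspartate, D), replacing the middle A with G gives GGT (glycine, G), which is the D614G change.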

Mutation surveillance of SARS-CoV-2
Viruses mutate, and mutations spread or disappear through evolutionary selection and drift. Which mutations persist depends on the environment encountered during transmission, as initial mutations may arise during entry into an intermediate host or later during human-to-human transmission. Point mutations are the most common type and fall into three categories: substitution (one base is replaced by another), deletion (a base is removed from the sequence), and insertion (a base is added to the virus genome). Nucleotide substitutions within coding regions may be synonymous (silent, with no effect on the corresponding amino acid sequence), missense (changing an amino acid in the corresponding protein, e.g. changes in the spike protein sequence, including the receptor-binding domain), or nonsense (introducing a premature stop codon). Mutation analysis will provide key insights into virus infection rate as well as clinical aspects such as mortality rate, as some strains may cause asymptomatic or mild infection compared with the first virus isolated in Wuhan. Some studies suggest that mutations in the receptor-binding domain (RBD) of the spike protein alter infection efficiency in SARS-CoV-2, highlighting its importance as a therapeutic target. These mutation patterns are also key to designing therapeutics, through identification and consideration of drug-resistant viral phenotypes.
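
To illustrate how a single-base substitution in a coding region is sorted by its effect on the protein, here is a small Python sketch that translates the reference and mutated codons with the standard genetic code and labels the change. The function name and the example codons are hypothetical choices for illustration; in particular, GAT to GGT is used as the assumed codon change behind D614G.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change by its effect on the encoded amino acid."""
    # Codons with ambiguity codes (e.g. N or M) are not in the table
    # and raise KeyError, since their effect cannot be determined.
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop-lost"    # stop codon replaced by an amino acid
    return "missense"         # different amino acid


print(classify_substitution("GAT", "GGT"))  # D -> G
print(classify_substitution("GAT", "GAC"))  # D -> D
print(classify_substitution("TAT", "TAA"))  # Y -> stop
```

Insertions and deletions are outside the scope of this sketch, because they can shift the reading frame and need to be assessed against the whole downstream sequence.
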
With the unprecedented pace of mutation surveillance across the globe using various sequencing technologies, we may soon have a genomic atlas highlighting strains with key mutations that affect infection efficiency and fatality rates, based on the virus's evolution in various geographies.

PatInnovate Consulting group specializes in technology landscapes across various technology segments, including Life Sciences and Biotechnology. The PatInnovate team has proficiency in sequence searching related to diagnostic or therapeutic aspects using commercial and no-fee databases.

We would like to acknowledge the efforts of the following organizations in isolating samples and generating data for the analysis: GISAID, Union Hospital of Tongji Medical College, Huazhong University of Science and Technology, Hubei Provincial Center for Disease Control and Prevention, CR&WISCO GENERAL HOSPITAL, Huazhong University of Science and Technology, Fujian Center for Disease Control and Prevention, Chinese PLA Institute for Disease Control and Prevention, Shanghai Medical College, Fudan University, Shanghai Jiao Tong University School of Medicine & Shanghai Public Health Clinical Center, Guangdong Provincial Institution of Public Health, Scripps Medical Laboratory, Andersen lab at Scripps Research, LSUHS Emerging Viral Threat Laboratory, Microbial Genome Sequencing Center, NYU Langone Health, New York University School of Medicine, Gundersen Molecular Diagnostics Laboratory, Kabara Cancer Research Institute, Florida Bureau of Public Health Laboratories, NCDC/CSIR-IGIB, District Surveillance Unit, NIMHANS, B.J. Medical College and Civil hospital, Gujarat Biotechnology Research Centre, University of Edinburgh, COVID-19 Genomics UK (COG-UK) Consortium, University of Cambridge, Ospedale Civile S. Liberatore di Atri, Istituto Zooprofilattico Sperimentale dell’Abruzzo e Molise “G. Caporale”, ULSS9 Distretto di Bussolengo, Istituto Zooprofilattico Sperimentale delle Venezie, INMI Lazzaro Spallanzani IRCCS, Laboratory of Virology, INMI Lazzaro Spallanzani IRCCS, University of Milan, ARGO Open Lab Platform for Genome sequencing, Laboratory of Molecular Virology International Center for Genetic Engineering and Biotechnology (ICGEB), National Institute for Biomedical Research (INRB), Molecular Diagnostic Services, KRISP, KZN Research Innovation and Sequencing Platform, South China Agricultural University.