"Research, the internet and mining data"©

Analysis of DNA sequences

Human and Clinical Genetics - LUMC

Points to bear in mind
Start sites
Analytical tools
Searches
- general remarks
- search software differences
- databases to search
- search types (similarity, structures)
Sequence submission / updating (retrieve and ADD)
Exercises
- DNA analysis
- protein analysis

Points to bear in mind

Speed

work when everybody sleeps
- morning US / evening Japan
while waiting for a result
- start a second browser session
- use the option: receive result by E-mail (in .HTML format !!)

Which database ?

DNA or protein search
consider consequences of increasing complexity
- answers take more time
- higher probability of random hits (higher background)
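
The link between database size and background can be illustrated with the Karlin-Altschul expectation used by BLAST-type programs, E = K * m * n * exp(-lambda * S): the expected number of chance hits grows linearly with the database length n. The sketch below is only an illustration; the function name and the default values of K and lambda are assumptions chosen for the example, not the parameters of any particular program or scoring matrix.

```python
import math


def expected_chance_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    k and lam are placeholder values (assumptions for illustration);
    real values depend on the scoring system in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Same query and same score, searched against a database that is 10x larger:
small = expected_chance_hits(score=60, query_len=300, db_len=1e8)
large = expected_chance_hits(score=60, query_len=300, db_len=1e9)
print(f"chance hits (small db): {small:.3g}")
print(f"chance hits (large db): {large:.3g}")
```

A hit with a given score is therefore ten times more likely to be background in the larger database, which is one reason to search the smallest database that can answer the question.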

Understand the query form

use the "Help" or "Examples" button
which input sequence format is required ?
read the FAQ
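
Many query forms ask for FASTA format: a header line starting with ">" followed by the sequence on one or more lines. As a minimal sketch (the function name and the line width of 60 are assumptions for illustration), a raw sequence can be cleaned up and written in that format as follows:

```python
def to_fasta(identifier, sequence, width=60):
    """Return a FASTA record: '>identifier' followed by wrapped sequence lines."""
    # Remove spaces, tabs and line breaks that often come along with copy/paste.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


print(to_fasta("my_sequence", "atggcc ttaa\nggcc", width=10), end="")
```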

Reliability of data

input usually NOT curated ("transitive" vs. experimental proof)
repeat searches regularly
subscribe to update services

Retrieve and ADD

notify database of mistakes encountered
will your data ever be published ?

Search results

do NOT print results - all links are lost
SAVE an important result on disk
E-mail results to your home account

Start sites

General links to the sequences (DNA, protein, gene)
- Entrez Gene - all details on one page (name, chromosomal localization, reference sequences, links, etc.)
- Entrez - sequence retrieval
  - DNA
  - protein - for similar sequences, try the BLink link (far right-hand side of the results page; see the example for WHSC1L1, and from there the CDD-search)
  - genome maps
    - Ensembl
    - MapViewer - NCBI
  - taxonomy
  - PubMed (Literature)
Sequence databases
- NCBI - National Center for Biotechnology Information (GenBank, USA)
- EBI - European Bioinformatics Institute (Cambridge)
- DDBJ - DNA Data Bank of Japan
- SwissProt - Swiss Protein Database (curated !)
Sequence analysis tools
- annual database issue Nucleic Acids Research - 2007
- aBi - an extensive list from the Atelier BioInformatique
- BCM Search Launcher - Baylor College of Medicine (Houston)
- EMBL - European Molecular Biology Laboratory (Heidelberg)
List of links
- Genome Web
- Nucleic Acids Research webserver issue 2006
Dutch starting points
- CMBI (Centre for Molecular and Biomolecular Informatics, Nijmegen)
- our Course Bookmarks

Analytical tools

General

aBi On-line analysis tools: subject-ordered links to database searching, nucleic acid sequences, patterns in proteins, predictions on proteins, sequence alignment / phylogeny and analysis tools
Sequence utilities at Baylor: convert sequence formats to FASTA, RepeatMasker, primer selection, six frame translation, restriction enzyme recognition sites (WebCutter), Reverse and Complement.
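
Two of the utilities listed above, Reverse and Complement and six frame translation, are simple enough to sketch locally. The code below is a minimal illustration, not the Baylor implementation: the function names are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is translated as "X".

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Translate three frames on each strand; keys are '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```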

Sequence alignments

pairwise
- 2-BLAST: compare two sequences
- Baylor: pairwise alignment
multiple sequences
- ...
- at Baylor
- links from aBi On-line analysis tools

Software + descriptions

BioWWW
FTP-site: Indiana University directory of /molbio
UBiC software links

Manuals

Staden Manual
GCG - WWW demo

Searches

General

literature
- an absolute MUST: "Trends Guide to BioInformatics" (1998), Elsevier Trends Journal Supplement.
- further excellent background information: Smith, R.F. (1996). "Perspectives: sequence data base searching in the era of large-scale genomic sequencing". Genome Res. 6, 653-660.
understand background of software (e.g. BLAST vs. gapped-BLAST vs. FastA)
- influence of sequence errors (99% accurate = terrible)
- significance (best 100)
- faster = less sensitive
repeat search regularly
- the database becomes larger every minute
- subscribe to update service
larger database
- slower search
- increased chance of a random hit (although a low-probability hit might still be a real one)
redundancy in database
- use filters against "simple sequences" and repetitive DNA (e.g. RepeatMasker)
- search - strip - search - ... (vector sequences, repetitive sequences (Alu-repeat), (CA)n-repeats, one sequence/many submissions)
- perform selective search; specific database (e.g. cDNA (dbEST) vs. genomic DNA or species specific)
verify annotations: transitive vs. experimental proof
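
The remark on sequence errors ("99% accurate = terrible") becomes concrete with a little arithmetic. The sketch below assumes that errors are independent and equally likely at every position, which is a simplification; the function names are illustrative.

```python
def expected_errors(length, accuracy):
    """Expected number of wrong bases in a sequence of the given length."""
    return length * (1.0 - accuracy)


def prob_error_free(length, accuracy):
    """Probability that a stretch of the given length has no error at all."""
    return accuracy ** length


print(f"errors per 1000 bp at 99%: {expected_errors(1000, 0.99):.1f}")
print(f"P(300 bp ORF error-free):  {prob_error_free(300, 0.99):.3f}")
```

At 99% accuracy a 1 kb read carries about ten errors, and only about one in twenty 300 bp stretches is error-free. An insertion or deletion among those errors shifts the reading frame, so a protein-level comparison can fail even when the DNA is nearly correct.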

Search software differences (per search engine)

searching GenBank
BLAST searches (basic or advanced)
- PSI-BLAST - Position Specific Iterated
Similarity searches: EBI
others: search links from aBi

Databases to search

DNA or protein
NOTE: protein databases are curated but incomplete (SwissProt)
HTGS: unfinished genomic sequences from large sequencing efforts (not yet submitted as finished sequences)
non-redundant
dbEST: transcribed sequences only (cDNAs)
species specific

Search types

similarity searches
- DNA vs. DNA
- protein vs. protein
- DNA vs. protein (most sensitive)
  - protein folding is evolutionarily conserved; amino acid sequence comes closest
  - possibility to use relative scoring
- protein vs. DNA
structure searches
- general
  - predictions only
  - first mask (remove) repetitive sequences
- content
  - percentage GC
  - CpG-islands
- repetitive sequences
- gene identification (GRAIL, GeneID, FGENEH, etc.)
  - extensive list / comparison performance: gene recognition programs
  - literature:
    - bibliography of gene identification by computer
    - recent review: Claverie, J.M. (1997). Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735-1744.
  - gene: promoter, cap site, 5'UTR, ATG, open reading frame, STOP, 3'UTR, polyA-addition signal (AATAAA) and polyA-addition site; internally segmented into splice blocks (ag-EXON-gt)
  - to remember: the mean vertebrate coding gene has six 150 bp exons spanning ~30 kb, whereas one 300 bp ORF (ATG to stop) occurs randomly once per 36 kb of ssDNA (with 25% each of A, G, C and T)
    - compromise between sensitivity and specificity
    - trained on non-random set of genes
    - recognition: open reading frame, codon usage (compositional bias), six nucleotide words (hexamers), AG/GT exon flanking sequences, exon ranking, composite gene has ORF ("in frame assembly"), similarities in database
    - some programs assume one complete gene is present in the sequence; some look at one strand only
    - programs perform poorly on the first coding exon (with ATG) and the last coding exon (with STOP)
    - programs miss the 5'-UTR, 3'-UTR, one-exon genes, RNA-genes, nested genes (gene on other strand / in intron) and the presence of more than one gene
  - programs: FGENEH/HEXON, GeneID, GENSCAN, GENVIEW, GRAIL, Xpound (links to all programs)
one submission / many databases searched
- Genotator
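
The two content measures listed under structure searches (percentage GC and CpG-islands) can be computed directly from a sequence. The sketch below uses the observed/expected CpG ratio, (number of CpG x length) / (number of C x number of G). A commonly cited rule of thumb (Gardiner-Garden and Frommer, 1987) calls a stretch of at least 200 bp a CpG-island when GC exceeds 50% and this ratio exceeds 0.6. The function names are illustrative, and the thresholds are a convention rather than part of this course text.

```python
def gc_percentage(seq):
    """Percentage of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=50.0, min_ratio=0.6):
    """Apply the rule-of-thumb thresholds to one window of sequence."""
    return (len(seq) >= min_len
            and gc_percentage(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


window = "CGGCGCTACG" * 20  # artificial 200 bp example
print(gc_percentage(window), cpg_obs_exp(window), looks_like_cpg_island(window))
```

In practice the test is applied to a window that slides along the sequence, after repetitive sequences have been masked.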

Sequence submission / updating

(retrieve and ADD)

Software

Sequin (NCBI)
software package for installation on any Operating System, with several viewers and the possibility to connect directly to the Internet to retrieve sequences and literature and to perform database searches

Direct Internet submission
- BankIt (NCBI)
- Submit (EBI)

How to annotate sequences
- Genome annotation
- Gene Nomenclature Committee
