"Research, the internet and mining data"©
Analysis of DNA sequences
Human and Clinical Genetics - LUMC
Contents
Points to bear in mind
Speed
- work when everybody sleeps
- morning US / evening Japan
- while waiting for a result
- start a second Browser session
- use the option: receive result by E-mail (in .HTML format !!)
Which database ?
- DNA or protein search
- consider consequences of increasing complexity
- answers take more time
- higher probability to get random hits (higher background)
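The effect of database size on background can be made concrete with the bit-score form of the BLAST E-value, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below is illustrative only: the function name and the example numbers are assumptions, not part of the course material.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), the bit-score form of the BLAST E-value.
    All argument values used below are hypothetical examples.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same query and same score, searched against ever larger databases.
    for db_len in (1e8, 1e9, 1e10):
        e_value = expected_chance_hits(bit_score=40, query_len=300, db_len=db_len)
        print(f"database of {db_len:.0e} residues: E = {e_value:.3g}")
```

For a fixed score the expected number of random hits grows linearly with database size, which is why a selective (smaller) database lowers the background.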
Understand the query form
- use the "Help" or "Examples" button
- which input sequence format is required ?
- read the FAQ
Reliability of data
- input usually NOT curated ("transitive" vs. experimental proof)
- repeat searches regularly
- subscribe to update services
Retrieve and ADD
- notify database of mistakes encountered
- will your data ever be published ?
Search results
- do NOT print results - all links are lost
- SAVE an important result on disk
- E-mail results to your home account
Start sites
- General links to the sequences (DNA, protein, gene)
- Entrez Gene - all details on one page (name, chromosomal localization, reference sequences, links, etc.)
- Entrez - sequence retrieval
- Sequence databases
- NCBI - National Center for Biotechnology Information (GenBank, USA)
- EBI - European Bioinformatics Institute (Cambridge)
- DDBJ - DNA Data Bank of Japan
- SwissProt - Swiss Protein Database (curated !)
- Sequence analysis tools
- annual database issue Nucleic Acids Research - 2007
- aBi - an extensive list from the Atelier BioInformatique
- BCM Search Launcher - Baylor College of Medicine (Houston)
- EMBL - European Molecular Biology Laboratory (Heidelberg)
- List of links
Dutch starting points
Analytical tools
- General
- aBi On-line analysis tools: subject-ordered links to database searching, nucleic acid sequences, patterns in proteins, predictions on proteins, sequence alignment / phylogeny and analysis tools
- Sequence utilities at Baylor: convert sequence formats to FASTA, RepeatMasker, primer selection, six frame translation, restriction enzyme recognition sites (WebCutter), Reverse and Complement.
- Sequence alignments
- pairwise
- multiple sequences
- Software + descriptions
- Manuals
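Two of the sequence utilities listed above, "Reverse and Complement" and six frame translation, are simple enough to sketch directly. The code below is a minimal illustration, not the Baylor implementation: it assumes the standard genetic code, treats any codon containing a non-ACGT character as "X", and all function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence in frame 1; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAAGGC").items():
        print(frame, protein)
```

Stop codons are shown as "*", so an open reading frame appears as a stretch without "*" in one of the six frames.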
Searches
General
- literature
- an absolute MUST: "Trends Guide to BioInformatics" (1998), Elsevier Trends Journal Supplement.
- further excellent background information: Smith, R.F. (1996). "Perspectives: sequence data base searching in the era of large-scale genomic sequencing". Genome Res. 6, 653-660.
- understand background of software (e.g. BLAST vs. gapped-BLAST vs. FastA)
- influence of sequence errors (99% accurate = terrible)
- significance (best 100)
- faster = less sensitive
- repeat search regularly
- the database becomes larger every minute
- subscribe to update service
- larger database
- slower search
- increased chance of random hit (a hit of low significance might still be a real hit)
- redundancy in database
- use filters against "simple sequences" and repetitive DNA (e.g. RepeatMasker)
- search - strip - search - ... (vector sequences, repetitive sequences (Alu-repeat), (CA)n-repeats, one sequence/many submissions)
- perform selective search; specific database (e.g. cDNA (dbEST) vs. genomic DNA or species specific)
- verify annotations: transitive vs. experimental proof
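Why "99% accurate = terrible" can be shown with a little arithmetic. The sketch below assumes that sequencing errors occur independently and at a uniform rate per base, which is a simplification; the function names and the 300 bp example length are chosen for illustration only.

```python
def expected_errors(accuracy, length):
    """Expected number of erroneous bases in a sequence of `length` bases."""
    return (1.0 - accuracy) * length


def prob_error_free(accuracy, length):
    """Probability that all `length` bases are correct (independent errors)."""
    return accuracy ** length


if __name__ == "__main__":
    # A 300 bp stretch read at several per-base accuracies.
    for accuracy in (0.99, 0.999, 0.9999):
        print(
            f"accuracy {accuracy}: "
            f"{expected_errors(accuracy, 300):.2f} expected errors, "
            f"P(error-free) = {prob_error_free(accuracy, 300):.3f}"
        )
```

At 99% accuracy a 300 bp stretch carries about three errors on average and is error-free only about 5% of the time; a single insertion or deletion among those errors shifts the reading frame and weakens any protein-level comparison.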
Search software differences (per search engines)
Databases to search
- DNA or protein
NOTE: protein databases are curated but incomplete (SwissProt)
- HTGS: unfinished genomic sequence (not submitted sequences of large sequencing efforts)
- non-redundant
- dbEST: transcribed sequences only (cDNAs)
- species specific
Search types
- similarity searches
- DNA vs. DNA
- protein vs. protein
- DNA vs. protein (most sensitive)
- protein folding is evolutionarily conserved; amino acid sequence comes closest
- possibility to use relative scoring
- protein vs. DNA
- structure searches
- general
- predictions only
- first mask (remove) repetitive sequences
- content
- percentage GC
- CpG-islands
- repetitive sequences
- gene identification (GRAIL, gene-ID, fgeneh, etc.)
- extensive list / comparison performance: gene recognition programs
- literature:
- bibliography of gene identification by computer
- recent review: Claverie, J.M. (1997). Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735-1744.
- gene: promoter, cap site, 5'UTR, ATG, open reading frame, STOP, 3'UTR, polyA-addition signal (AATAAA) and polyA-addition site; internally segmented into spliced blocks (ag-EXON-gt)
- to remember: mean vertebrate coding gene has six 150 bp exons spanning ~30 kb; one 300 bp ORF (ATG to stop) randomly occurs once per 36 kb ssDNA (with 25% A, G, C and T)
- compromise between sensitivity and specificity
- trained on non-random set of genes
- recognition: open reading frame, codon usage (compositional bias), six nucleotide words (hexamers), AG/GT exon flanking sequences, exon ranking, composite gene has ORF ("in frame assembly"), similarities in database
- some assume one complete gene is present in sequence, some look on one strand only
- bad on first coding exon (with ATG) and last coding exon (with STOP)
- miss 5'-UTR, 3'-UTR, one-exon gene, RNA-gene, nested gene (gene on other strand / in intron), more than one gene
- programs: FGENEH/HEXON, GeneID, GENSCAN, GENVIEW, GRAIL, Xpound (links to all programs)
- one submission / many databases searched
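The "content" measures listed above (percentage GC, CpG-islands) can be computed with a few lines of code. This is a minimal sketch: the function names are made up for illustration, and the thresholds mentioned in the comments (length of at least 200 bp, GC above 50%, observed/expected CpG above 0.6) are the widely used Gardiner-Garden and Frommer criteria, which the course material itself does not specify.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed criteria: length, GC fraction and CpG obs/exp ratio."""
    return (
        len(seq) >= min_len
        and gc_content(seq) > min_gc
        and cpg_obs_exp(seq) > min_ratio
    )


if __name__ == "__main__":
    window = "CGGCGCATCG" * 25  # hypothetical 250 bp test window
    print(f"GC = {gc_content(window):.2f}, CpG o/e = {cpg_obs_exp(window):.2f}")
    print("CpG island candidate:", looks_like_cpg_island(window))
```

In practice such measures are evaluated in a sliding window along the masked sequence rather than once for the whole sequence.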
Sequence submission / updating (retrieve and ADD)
- Software
- Sequin (NCBI): software package for installation on any Operating System, with several viewers and the possibility to connect directly to the Internet to retrieve sequences and literature and to perform database searches
- Direct Internet submission
- How to annotate sequences