Dept. Human and Clinical Genetics - LUMC, “RESEARCH and the INTERNET” ©

Tasks: working with DNA sequences

(J.T. den Dunnen)

Content

General remarks
Sequence retrieval
Structural DNA analysis
- general (change format, restriction sites, reverse/complement, translate, primer design)
- homology searches
- from sequence to gene (repetitive sequences, gene/exon prediction)
- from EST to gene (homology search, EST contigs)
The human genome sequence
Sequence submission

General remarks

The tasks are listed in a logical order. Please browse through the tasks provided and select the one that best fits your interest, i.e. the one that extends your existing knowledge. For most tasks it is recommended to start from a link in the Course Bookmarks, although trying to locate the most appropriate start site through a general search engine (e.g. AltaVista, Excite, Lycos or another) is an instructive and worthwhile exercise. A specific sequence is chosen for each task, but please feel free to use your favourite sequence. A start can also be made through the links provided on the "DNA analysis" page.

Sequence retrieval

1. Retrieve a DNA sequence from GenBank
Go to the GenBank - Entrez server, select <Search Nucleotide> sequences, and search for the human dystrophin mRNA (cDNA) sequence (mutations in dystrophin cause Duchenne and Becker muscular dystrophy).

save the sequence (GenBank report) on disk as dystrophin.gb (at the bottom of the Nucleotide QUERY page select <PC> and <Text>)
save the sequence also in FASTA-format (on disk as dystrophin.fas)
suppose the NCBI computer cannot be accessed; try to retrieve the sequence from one of the other sequence databases, e.g. EMBL (Europe) or DDBJ (Japan)

NOTE: depending on the formulation of your query, large numbers of "hits" may appear. Experiment with your query by making it more specific, restricting it to specific fields (<Search Field> drop down menu) or by using the <Add Term(s) to Query> field on the results page. Alternatively, go to the LocusLink (NCBI) site and check whether dystrophin has been catalogued yet; if so, you get a direct link to a curated reference sequence from the RefSeq database (NCBI).
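
To check what was actually saved in the FASTA file from task 1, the format can be read with a few lines of Python. This is a minimal sketch, assuming the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the function name parse_fasta and the demo record are made up for illustration and are not the real dystrophin entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-line record, used only to show the layout.
example = """>demo_record hypothetical example
ATGCTTTGGT
GGGAAGAAGT
"""

for name, seq in parse_fasta(example):
    print(name.split()[0], len(seq))
```

For a file on disk, the same function can be applied to the text returned by open("dystrophin.fas").read().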

2. Other possibilities for retrieval from GenBank
Go to the GenBank - Entrez server and use it to answer the following:

was any sequence submitted with your contribution ?
did J.T. den Dunnen submit any sequence ?
what sequence is in AC009249 ?
find the sea urchin dystrophin sequence
- it is mentioned as unpublished, is this true ? (check PubMed)
find the human utrophin sequence (mRNA)
- retrieve and save it for later use (which format do you need ?)
is it possible to use Entrez to find a sequence containing "ctttgggaaaaggtgtaaga" ?

3. Information in GenBank

which independent databases does GenBank keep ?
what does the HTGS database contain ?
how many murine sequences does dbEST currently contain (what are EST's) ?

4. Sequences from other sources
Usually, general search engines cannot be used to find and retrieve DNA sequences. However, in exceptional cases, general searches may hit the sites of researchers working on specific subjects, which provide more detailed descriptions and/or even unpublished sequences.

try a general search engine (e.g. AltaVista, Excite, Lycos, other) with the sequence (try more than one):
- "cacacacacacaca"
- "tctgtatatcttcagaaataaaggcaggat", what is it ?, is it also available from GenBank ?
NOTE: since sequences are often listed in segments of 10 bp, separated by a space, you might need to modify your search accordingly.
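
The reformatting mentioned in the NOTE above can be done by hand, but a small helper makes it repeatable. The following is a sketch only; the function name in_blocks is an assumption for illustration.

```python
def in_blocks(seq, size=10):
    """Return the sequence in lower case, split into space-separated blocks."""
    seq = "".join(seq.split()).lower()  # drop any existing whitespace first
    return " ".join(seq[i:i + size] for i in range(0, len(seq), size))


query = "tctgtatatcttcagaaataaaggcaggat"
print(in_blocks(query))  # tctgtatatc ttcagaaata aaggcaggat
```

Keep in mind that a listing on a web page may break the sequence at different positions than your query does, so searching with a single 10 bp block or a shorter fragment may be needed.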

Structural analysis

I. General DNA analysis

NOTE: the Atelier BioInformatique (aBi) and BCM (section "Sequence Utilities") sites are good starting points for the DNA-analysis tools required.

1. Look for restriction sites
Take the dystrophin sequence retrieved and try to find whether EcoRI, NotI and SfiI sites are present.
NOTE: network software for restriction mapping can be found e.g. at the aBi-site, under <Nucleic acids sequences>, <Map (restriction)>.
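
To see what such a restriction mapping tool does, the search can be sketched in Python. The recognition sequences used are EcoRI (GAATTC), NotI (GCGGCCGC) and SfiI (GGCCNNNNNGGCC, N = any base); because all three are palindromic, scanning one strand is sufficient. The function name find_sites and the demo sequence are assumptions made up for this illustration.

```python
import re

# Recognition sequences; N stands for any base.
SITES = {
    "EcoRI": "GAATTC",
    "NotI": "GCGGCCGC",
    "SfiI": "GGCCNNNNNGGCC",
}


def find_sites(seq, sites=SITES):
    """Return {enzyme: [1-based start positions]} for each recognition site."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        # A lookahead pattern also reports overlapping occurrences.
        pattern = "(?=" + site.replace("N", "[ACGT]") + ")"
        hits[enzyme] = [m.start() + 1 for m in re.finditer(pattern, seq)]
    return hits


# Hypothetical test sequence containing one site for each enzyme.
demo = "TTGAATTCAAGCGGCCGCTTGGCCAAAAAGGCCTT"
print(find_sites(demo))  # {'EcoRI': [3], 'NotI': [11], 'SfiI': [21]}
```

The positions reported are the start of the recognition sequence, not the exact cut position within it.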

2. Calculate a primer pair for PCR
Take the dystrophin sequence retrieved and try to design a primer pair for the analysis of RNA samples, i.e. to determine whether the gene is transcribed in specific tissues.

verify that the primers are unique in the human genome (perform a Blast-search against all available human sequences)
can the primers be used to amplify dystrophin sequences from other organisms ? (perform a Blast-search against all available non-human sequences)

NOTE: use e.g. the Primer3 package (MIT). Other network software for primer selection can be found e.g. at the aBi-site, under <Nucleic acids sequences>, <PCR primer selection>.
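
Before submitting candidate primers to a Blast-search, two basic properties can be checked quickly: the GC content and a rough melting temperature. The sketch below uses the simple Wallace rule, Tm = 2 x (A+T) + 4 x (G+C), which is only a rough estimate for short oligonucleotides and is not the calculation Primer3 performs. The 20-mer used as input is simply the query string from the retrieval tasks, taken here as a hypothetical primer.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "ctttgggaaaaggtgtaaga"  # hypothetical primer, 20 nt
print(len(primer), gc_percent(primer), wallace_tm(primer))  # 20 40.0 56
```

The two primers of a pair should have similar values; large differences are a reason to redesign one of them.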

3. Look for open reading frames

Try to find the largest open reading frame (ORF) in the dystrophin sequence retrieved
NOTE: go e.g. to the BCM Search Launcher, select <Sequence Utilities>, Copy/Paste your sequence into the Query window and select <6 Frame Translation>
Take part of the presumed ORF and check whether it is correct by performing a Blast-search against the SwissProt protein database (select swissprot from the <Database> drop down menu).
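
What a six-frame translation does can be sketched as follows. This is an illustration under stated assumptions, not the BCM implementation: it uses the standard genetic code, defines an ORF as a stop-free stretch beginning with a methionine, and does not require a terminating stop codon at the end of the sequence. All function names and the short demo sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse and complement a DNA sequence (A<->T, C<->G)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def longest_orf(seq):
    """Return (frame, protein) of the longest stretch starting with M."""
    best = ("", "")
    for name, protein in six_frames(seq).items():
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (name, segment[start:])
    return best


demo = "CCATGGCTAAATGACC"  # hypothetical sequence
print(longest_orf(demo))  # ('+3', 'MAK')
```

For a full-length mRNA the longest ORF found this way is usually, but not necessarily, the true coding region, which is why the Blast check is worthwhile.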

4. Turn around a sequence
Turn around (i.e. reverse and complement) the dystrophin sequence retrieved.

translate the sequence into protein
what is the largest open reading frame (ORF) encoded on the reverse strand ?
is this ORF similar to any known proteins (perform a Blast-search)

II. Homology searches

1. Homologies in other organisms
Take the dystrophin sequence retrieved (dystrophin.gb) and select the 3' untranslated region. Perform a Blast-search against the non-redundant database

from how many organisms is the 3'UTR of the dystrophin gene known ?
perform a multiple alignment; are there conserved regions in the 3'UTR ?
repeat, now using a segment of the dystrophin protein, to find the dystrophin gene in other organisms; do sequences from new organisms appear ?
retrieve the 3'UTR sequences from these new sequences and add them to the multiple alignment; are the conserved sequences detected initially also conserved in the more distantly related organisms ?
are there segments in the dystrophin 3'UTR which generate homologies with other genes ?
if yes, take such a sequence and use it to perform a specific Blast search; do additional hits appear ?, what are these homologies ?
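
Once a multiple alignment of the 3'UTR sequences is available, conserved stretches can be picked out by eye or with a small script. The sketch below assumes the aligned sequences are already available as equal-length strings with "-" for gaps; it reports runs of columns that are identical in all sequences. The function name conserved_runs, the min_len threshold and the three short sequences are assumptions for illustration, not real dystrophin data.

```python
def conserved_runs(aligned, min_len=5):
    """Return 1-based inclusive (start, end) runs of fully identical columns."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    first = aligned[0].upper()
    length = len(first)
    runs, start = [], None
    for i in range(length + 1):
        # A column counts as conserved if it has no gap and one shared base.
        same = (i < length and first[i] != "-"
                and all(s[i].upper() == first[i] for s in aligned))
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start + 1, i))
            start = None
    return runs


# Hypothetical alignment of three sequences.
alignment = [
    "ACGTACGTTTGA-CCA",
    "ACGTACGATTGAGCCA",
    "ACGTACGTTTGAGCCT",
]
print(conserved_runs(alignment, min_len=4))  # [(1, 7), (9, 12)]
```

Requiring perfect identity is strict; with more distantly related organisms a looser criterion (e.g. allowing one mismatch per window) may be more informative.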

2. From EST-homologies to a consensus cDNA-sequence
Take the dystrophin sequence retrieved (dystrophin.gb) and select from the 3' untranslated region about 400 bp immediately upstream from the polyA-addition site. Perform a Blast-search against dbEST using the EST-extractor at TIGEM

for which of the clones hit is the sequence from the other end also available ?
repeat the search, using the possibilities provided, and extend the human EST's into one large consensus sequence. Repeat this effort for the murine EST's
- how many UniGene clusters do the EST-contigs cover ?
- to which human/mouse chromosome(s) do these UniGene clusters map ?
- in which tissues are these ESTs expressed ?
do the human and murine consensus cDNA-sequences include part of the coding region of the gene ?
compare the human and murine consensus sequences with those in the non-redundant database (GenBank); are there potential polymorphic regions or SNP's present ?

III. From sequence to gene

NOTE: difficult task

use task_sequence_1 to perform a BLAST search (against the non-redundant DNA database)
- select the human PAC containing the sequence, retrieve and save it as "task_seq2.gb"
- select and save (FastA format) a 40 kb segment surrounding this sequence as "task_seq2.fas"
verify whether task_seq2.fas contains repetitive DNA sequences
- which repetitive sequences does it contain
- try to obtain a clean copy (i.e. with the repetitive DNA sequences "removed" or "masked"); save it as "seq2_clean.fas"
use gene/exon prediction tools (at least three different packages) to calculate the presence of gene(s) / exon(s) in this DNA segment
- what is the most common prediction ?
does the region contain potential promoters ?
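
The idea of a "masked" copy, as asked for above, can be illustrated with a short sketch: the positions reported as repetitive are replaced by N, so that later homology searches and predictions ignore them while all coordinates stay the same. The interval format (1-based, inclusive start and end) and the function name mask_intervals are assumptions; the output format of the repeat-detection tool you use will differ and needs to be converted first.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace the given 1-based inclusive intervals with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        # Clip the interval to the sequence boundaries.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Hypothetical sequence with one repeat reported at positions 3-5.
print(mask_intervals("ACGTACGTAC", [(3, 5)]))  # ACNNNCGTAC
```

Masking rather than deleting the repeats keeps the sequence length unchanged, which makes it easier to compare exon predictions with the original task_seq2.fas.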

IV. From EST hits to a potential gene

NOTE: difficult task

use the seq2_clean.fas sequence (or task_seq2.fas, see above) to perform BLAST database searches against dbEST, the non-redundant database and the HTGS section
- which homologies do you detect ?
take the best hit from dbEST and use this sequence to repeat the dbEST BLAST search using the EST Extractor (at TIGEM)
- use the significant human hits, perform an overall alignment and use the contig(s) to align with the original seq2_clean.fas or task_seq2.fas sequence
  - are the split segments flanked by consensus splice donor and/or splice acceptor sites ?
  - how do these compare with the gene/exon predictions (see above) ?
  - how many UniGene clusters do the EST-contigs cover ?
  - to which human chromosome(s) do these UniGene clusters map ?
  - in which tissues are these ESTs expressed ?
- use the significant murine hits and perform an overall alignment
  - do the murine sequences extend the predicted gene structure ?
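
Checking the split segments for consensus splice sites can be sketched as follows. Most introns start with GT (splice donor) and end with AG (splice acceptor). The sketch assumes that the aligned EST segments are given as 1-based, inclusive exon coordinates on the forward strand of the genomic sequence; the function name check_splice_sites and the demo sequence are made up for illustration.

```python
def check_splice_sites(genomic, exons):
    """For each intron between consecutive exons, report its boundaries.

    Returns a list of (intron_start, intron_end, donor, acceptor, canonical).
    """
    genomic = genomic.upper()
    report = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        donor = genomic[end1:end1 + 2]             # first two intron bases
        acceptor = genomic[start2 - 3:start2 - 1]  # last two intron bases
        canonical = donor == "GT" and acceptor == "AG"
        report.append((end1 + 1, start2 - 1, donor, acceptor, canonical))
    return report


# Hypothetical gene: exon 1 (1-6), intron (7-19), exon 2 (20-25).
genomic = "ATGGCC" + "GTAAGTTTTTCAG" + "GATTGA"
print(check_splice_sites(genomic, [(1, 6), (20, 25)]))
# [(7, 19, 'GT', 'AG', True)]
```

If the gene lies on the reverse strand, the genomic sequence should be reverse-complemented first; otherwise the introns will appear to start with CT and end with AC.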

The human genome sequence

The human genome is currently being sequenced at an incredible rate. The current strategy is to determine a first draft sequence (finished spring 2000) and then focus on completing it (finished early 2003). As a result, sequences currently go to the high-throughput genomic sequences (HTGS) section of the database and not to the non-redundant (NR) section. This has several consequences:

the database to screen first for human sequences is the HTGS section
the definitive map will be available soon (EST's, markers, genes) although initially in a highly complex format
the HTGS section is a great resource for clones and contigs (DNA, genes, FISH-probes) and for polymorphic markers (CA, SNP and any simple sequence) in/near your region (gene of interest)

Sequence submission

1. A tool to prepare a sequence for submission

find, download and install the SEQUIN software package (for sequence submission and simple sequence analysis)
find and retrieve, in <.ASN1 format>, the human dystrophin mRNA sequence (see above)
NOTE: SEQUIN loads all details when the .ASN1 format is used.
open the human dystrophin sequence in SEQUIN and try tasks 1 and 3 of the section "I. General DNA analysis"

2. Update a sequence directly through the WWW

Go to the sites of either BankIt (NCBI) or Submit (EBI) and look at the possibilities of updating/submitting sequences directly using the Internet.
