Dept. Human and Clinical Genetics - LUMC, RESEARCH and the
INTERNET©
Tasks: working with DNA sequences
(J.T. den Dunnen)
Content
General remarks
The tasks are in a direct logical order. Please browse through the tasks provided and
select the one that best fits your interests, i.e. extends your existing knowledge. For
most tasks it is recommended to start from a link in the Course
Bookmarks, although trying to locate the most appropriate start site through a
general search engine (e.g. AltaVista, Excite, Lycos or another)
is an instructive and worthwhile exercise. A specific sequence has been chosen for each task,
but please feel free to use your favourite sequence. A start can
also be made through the links provided in the "DNA analysis"
page.
Sequence retrieval
1. Retrieve a DNA sequence from GenBank
Go to the GenBank - Entrez server, select <Search
Nucleotide> sequences, and search for the human dystrophin mRNA
(cDNA) sequence (mutations in dystrophin cause Duchenne and Becker muscular
dystrophy).
- save the sequence (GenBank report) on disk as dystrophin.gb
(at the bottom of the Nucleotide QUERY page select <PC>
and <Text>)
- save the sequence also in FASTA-format (on disk as dystrophin.fas)
- suppose the NCBI computer cannot be accessed; try to retrieve the sequence from one of
the other sequence databases, e.g. EMBL (Europe) or DDBJ (Japan)
NOTE: depending on how you formulate your query, a large number of "hits"
may appear. Experiment with your query by making it more specific, restricting it to specific
fields (<Search Field> drop down menu) or by using the <Add
Term(s) to Query> field on the results page. Alternatively, go to the LocusLink (NCBI) site and check whether
dystrophin has been catalogued yet; if so, you get a direct link to a curated reference
sequence from the RefSeq database (NCBI).
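The FASTA format asked for above is simple enough to write by hand: one header line starting with ">" followed by the bare sequence, wrapped at a fixed width. The following is a minimal sketch, not the converter used by any of the servers; the function names and the default line width of 60 are assumptions made for illustration.

```python
def write_fasta(path, header, sequence, width=60):
    """Write one sequence to `path` in FASTA format."""
    # Keep letters only: a GenBank ORIGIN block also contains
    # position numbers and spaces, which do not belong in FASTA.
    cleaned = "".join(ch for ch in sequence if ch.isalpha())
    with open(path, "w") as handle:
        handle.write(">" + header + "\n")
        for i in range(0, len(cleaned), width):
            handle.write(cleaned[i:i + width] + "\n")


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Reading the saved dystrophin.fas back with read_fasta is a quick way to confirm that the file contains exactly one record and that its length matches the GenBank report.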
2. Other possibilities for retrieval from GenBank
Go to the GenBank - Entrez
server and search for:
- was any sequence submitted with your contribution ?
- did J.T. den Dunnen submit any sequence ?
- what sequence is in AC009249 ?
- find the sea urchin dystrophin sequence
- it is mentioned as unpublished, is this true ? (check PubMed)
- find the human utrophin sequence (mRNA)
- retrieve and save it for later use (which format do you need ?)
- is it possible to use Entrez to find a sequence containing
"ctttgggaaaaggtgtaaga" ?
3. Information in GenBank
- which independent databases does GenBank keep ?
- what does the HTGS database contain ?
- how many murine sequences does dbEST currently contain (what are ESTs) ?
4. Sequences from other sources
Usually, general search engines cannot be used to find and retrieve
DNA sequences. However, in exceptional cases, general searches may hit sites of
researchers working on specific subjects, providing more detailed descriptions and/or even
unpublished sequences.
Structural analysis
I. General DNA analysis
NOTE: use the Atelier
BioInformatique (aBi) or BCM (section
"Sequence Utilities") sites, as a good starting point for the DNA-analysis tools
required.
1. Look for restriction sites
Take the dystrophin sequence retrieved and try to find whether EcoRI,
NotI and SfiI sites are present.
NOTE: network software for restriction mapping can be found e.g. at the aBi-site, under <Nucleic
acids sequences>, <Map (restriction)>.
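To check a result from the web tools by hand, the three recognition sequences can also be searched locally. This is a minimal sketch: the recognition sequences (EcoRI GAATTC, NotI GCGGCCGC, SfiI GGCCNNNNNGGCC) are standard, but the function name and the output layout are assumptions. All three sites are palindromic, so scanning one strand is sufficient.

```python
import re

# Recognition sequences; N stands for any base.
SITES = {"EcoRI": "GAATTC", "NotI": "GCGGCCGC", "SfiI": "GGCCNNNNNGGCC"}


def find_sites(sequence, sites=SITES):
    """Return {enzyme: [1-based start positions]} for each recognition site."""
    seq = sequence.upper()
    result = {}
    for enzyme, site in sites.items():
        # A lookahead pattern reports overlapping matches as well.
        pattern = "(?=" + site.replace("N", "[ACGT]") + ")"
        result[enzyme] = [m.start() + 1 for m in re.finditer(pattern, seq)]
    return result
```

The positions are those of the first base of the recognition sequence, not of the cut itself; an empty list means the enzyme does not cut the sequence.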
2. Calculate a primer pair for PCR
Take the dystrophin sequence retrieved and try to design a
primer pair for the analysis of RNA samples, i.e. to determine
whether the gene is transcribed in specific tissues.
- verify that the primers are unique in the human genome (perform a Blast-search against
all available human sequences)
- can the primers be used to amplify dystrophin sequences from other organisms ?
(perform a Blast-search against all available non-human sequences)
NOTE: use e.g. the Primer3 package
(MIT). Other network software for primer selection can be found e.g. at the aBi-site, under <Nucleic
acids sequences>, <PCR primer selection>.
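Before submitting candidate primers to a Blast-search, a quick sanity check of their GC content and approximate melting temperature can be useful. The sketch below uses the simple rule of thumb Tm = 2(A+T) + 4(G+C), which is only a rough estimate for short oligonucleotides and is not the calculation performed by Primer3; the function names are assumptions.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


def rule_of_thumb_tm(primer):
    """Rough melting temperature in degrees Celsius: 2(A+T) + 4(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc
```

Two primers of a pair should give similar values; a large difference suggests that one of them should be lengthened or moved.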
3. Look for open reading frames
- Try to find the largest open reading frame (ORF) in the dystrophin
sequence retrieved
NOTE: go e.g. to the BCM
Search Launcher, select <Sequence Utilities>,
Copy/Paste your sequence into the Query window and select <6 Frame
Translation>
- Take part of the ORF you consider correct and verify it by performing a Blast-search against the
SwissProt protein database (select swissprot from the <Database>
drop down menu).
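What a six-frame translation does can be sketched in a few lines. The example below translates all six frames with the standard genetic code and reports the longest stretch without a stop codon; it is an illustration under stated assumptions, not the BCM implementation. Note that "ORF" is taken here as any stop-free stretch, without requiring an ATG start, and that the frame numbering (+1 to +3, -1 to -3) is a convention chosen for this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse and complement a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(seq):
    """Return (frame, protein) for the longest stop-free stretch in six frames."""
    best = (None, "")
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for segment in translate(s[offset:]).split("*"):
                if len(segment) > len(best[1]):
                    best = (strand * (offset + 1), segment)
    return best
```

For a real mRNA the longest stretch is normally in a forward frame; a long stop-free stretch on the reverse strand of a short test sequence is not evidence of a gene.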
4. Turn around a sequence
Turn around, i.e. reverse and complement the dystrophin
sequence retrieved
- translate the sequence into protein
- what is the largest open reading frame (ORF) encoded on the reverse strand ?
- is this ORF similar to any known proteins (perform a Blast-search)
II. Homology searches
1. Homologies in other organisms
Take the dystrophin sequence retrieved (dystrophin.gb)
and select the 3' untranslated region. Perform a Blast-search against the
non-redundant database
- from how many organisms is the 3'UTR of the dystrophin gene known ?
- perform a multiple alignment; are there regions in the 3'UTR conserved ?
- repeat, now using a segment of the dystrophin protein, to find the dystrophin gene in
other organisms; do sequences from new organisms appear ?
- retrieve the 3'UTR sequences from these new sequences and add them to the multiple
alignment; are the conserved sequences detected initially also conserved in the more
distantly related organisms ?
- are there segments in the dystrophin 3'UTR which generate homologies with other genes ?
- if yes, take such a sequence and use it to perform a specific Blast search; do
additional hits appear ?, what are these homologies ?
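Once a multiple alignment of the 3'UTR sequences has been saved, conserved regions can be listed by inspecting the alignment column by column. The following is a minimal sketch that assumes the aligned sequences are available as equal-length strings with "-" for gaps; the function name is hypothetical and no particular alignment program is implied.

```python
def conserved_columns(aligned):
    """Return 1-based alignment columns where all sequences share one base."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    columns = []
    for i in range(length):
        column = {s[i].upper() for s in aligned}
        # A column counts as conserved only if it holds a single
        # symbol and that symbol is not a gap.
        if len(column) == 1 and "-" not in column:
            columns.append(i + 1)
    return columns
```

Runs of consecutive conserved columns are the candidate conserved segments to compare again after the more distantly related sequences have been added.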
2. From EST-homologies to a consensus cDNA-sequence
Take the dystrophin sequence retrieved (dystrophin.gb)
and select from the 3' untranslated region about 400 bp immediately upstream from the
polyA-addition site. Perform a Blast-search against dbEST using the EST-extractor at TIGEM
- for which of the clones hit is the sequence from the other end also available ?
- repeat the search, using the possibilities provided, and extend the human ESTs
into one large consensus sequence. Repeat this effort for the murine
ESTs
- how many UniGene clusters do the EST-contigs cover ?
- to which human/mouse chromosome(s) do these UniGene clusters
map ?
- in which tissues are these ESTs expressed ?
- do the human and murine consensus cDNA-sequences include part of the coding region of
the gene ?
- compare the human and murine consensus sequences with those in the non-redundant
database (GenBank); are there potential polymorphic regions or SNPs present ?
III. From sequence to gene
NOTE: difficult task
- use task_sequence_1 to perform a
BLAST search (against the non-redundant DNA database)
- select the human PAC containing the sequence, retrieve and save it as "task_seq2.gb"
- select and save (FastA format) a 40 kb segment surrounding
this sequence as "task_seq2.fas"
- verify whether task_seq2.fas contains repetitive DNA sequences
- which repetitive sequences does it contain ?
- try to obtain a clean copy (i.e. with the repetitive DNA sequences "removed"
or "masked"); save it as "seq2_clean.fas"
- use gene/exon prediction tools (at least three different
packages) to calculate the presence of gene(s) / exon(s) in this DNA segment
- what is the most common prediction ?
- does the region contain potential promoters ?
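Two steps of this task, cutting out a 40 kb segment around the BLAST hit and "masking" the repeats, are plain coordinate arithmetic and can be sketched as follows. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the repeat intervals are assumed to come from whichever repeat-detection tool you used.

```python
def extract_window(sequence, hit_start, hit_end, window=40000):
    """Return (start, end, subsequence) for a window centred on a hit.

    Coordinates are 1-based and inclusive; the window is shifted
    inwards when the hit lies close to either end of the sequence.
    """
    centre = (hit_start + hit_end) // 2
    start = max(1, centre - window // 2)
    end = min(len(sequence), start + window - 1)
    start = max(1, end - window + 1)
    return start, end, sequence[start - 1:end]


def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace every (start, end) interval (1-based, inclusive) by mask_char."""
    chars = list(sequence)
    for start, end in intervals:
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

Masking keeps the length of the sequence unchanged, so exon positions predicted on the masked copy can be compared directly with positions in the unmasked segment.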
IV. From EST hits to a potential gene
NOTE: difficult task
- use the seq2_clean.fas sequence (or task_seq2.fas,
see above) to perform BLAST database searches against dbEST, the non-redundant database
and the HTGS section
- which homologies do you detect ?
- take the best hit from dbEST and use this sequence to repeat the dbEST BLAST search
using the EST Extractor (at
TIGEM)
- use the significant human hits, perform an overall alignment
and use the contig(s) to align with the original seq2_clean.fas or
task_seq2.fas sequence
- are the split segments flanked by consensus splice donor and/or splice acceptor sites ?
- how do these compare with the gene/exon predictions (see
above) ?
- how many UniGene clusters do the EST-contigs cover ?
- to which human chromosome(s) do these UniGene clusters map ?
- in which tissues are these ESTs expressed ?
- use the significant murine hits and perform an overall
alignment
- do the murine sequences extend the predicted gene structure ?
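The splice-site question above can be checked mechanically once the EST contig has been aligned to the genomic segment. This is a minimal sketch assuming that the aligned segments (candidate exons) are given as 1-based, inclusive coordinates on the forward strand of the genomic sequence; it only tests the canonical GT (donor) and AG (acceptor) dinucleotides, and the function name is hypothetical.

```python
def check_splice_sites(genomic, exons):
    """Check the intron dinucleotides flanking each aligned segment.

    `exons` is a list of (start, end) pairs in genomic order. For the
    first segment no acceptor is expected and for the last no donor,
    so those entries are reported as None.
    """
    g = genomic.upper()
    report = []
    for index, (start, end) in enumerate(exons):
        # Last two intron bases before the segment, first two after it.
        acceptor = g[start - 3:start - 1] if start > 2 else ""
        donor = g[end:end + 2]
        report.append({
            "exon": index + 1,
            "acceptor_ok": None if index == 0 else acceptor == "AG",
            "donor_ok": None if index == len(exons) - 1 else donor == "GT",
        })
    return report
```

Segments that fail the check may indicate an alignment that needs shifting by a few bases, a gene on the reverse strand, or a hit that is not a true exon.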
The human genome sequence
The human genome is currently being sequenced at an incredible rate. The current strategy is
to determine a first draft sequence (finished spring 2000) and then focus on completing it
(finished early 2003). As a result, sequences currently go to the high-throughput
genomic sequences (HTGS) section of the database and not to the non-redundant
(NR) section. This has several consequences:
- the database to screen first for human sequences is the HTGS section
- the definitive map will be available soon (ESTs, markers, genes), although initially
in a highly complex format
- the HTGS section is a great resource for clones/contigs (DNA, genes, FISH-probes)
and for polymorphic markers (CA, SNP and any simple sequence) in/near your region
(gene of interest)
Sequence submission
1. A tool to prepare a sequence for submission
- find, download and install the SEQUIN software package (for
sequence submission and simple sequence analysis)
- find and retrieve, in <.ASN1 format>, the human
dystrophin mRNA sequence (see above)
(NOTE: SEQUIN loads all details when the .ASN1 format is used)
- open the human dystrophin sequence in SEQUIN and try tasks 1 and 3 of the section "I. General DNA analysis"
2. Update a sequence directly through the WWW
Go to the sites of either BankIt (NCBI) or
Submit (EBI) and look at the
possibilities of updating/submitting sequences directly using the Internet.