Bioinformatics thesis proposals
Objective
Here is a list of bioinformatics projects. Most of them are related to the
Danish-Chinese pig genome project, since we expect data from that project to
be used. However, the projects are not limited to these data and can be applied
to other types of data as well. Projects can also focus solely on the
development of a computational approach. Furthermore, projects are not limited
to the ones listed below. We are open to other proposals and ideas.
Facilities
We will attempt to give part- or full-time access to a desk where the project
can be carried out in a growing environment of bioinformatics people, where
support is a natural part of the environment. We have a local Linux
cluster to which the student will be granted access and, if relevant, access to
supercomputers at SDU (http://www.dcsc.sdu.dk) and DTU (CBS: http://www.cbs.dtu.dk/cbs/dcsc-cbs.php)
through the Danish Center for Scientific Computing (http://www.dcsc.dk).
MSc projects
The project proposals are listed in ARBITRARY order, so please take your time
to browse through all of them.
- Sequence annotation using prediction methods
One of the biggest problems with genome annotation is that it essentially
relies on similarity to existing, well-characterized sequences within the
genome and the genomes of other related organisms.
The most recent research (from expression experiments) has shown that
only about half of all expressed genes in the human genome are known! In the
pig genome project we observe the same trend, and it is of interest to
obtain structural and functional information about the unknown sequences.
In this project, various prediction methods for proteins as well as
non-coding RNAs can be integrated to obtain predictive knowledge. Such
knowledge is the first step in generating gene candidates for further
experimental characterization.
- Integrative computational approach for cleaning up raw sequenced DNA
When raw DNA sequences are converted to bits in the computer, they need to
be cleaned prior to use. They typically contain vector and other
undesired sequences. In addition, such sequences typically need to be
assembled with other such sequences. However, if the sequences contain
repetitive sequence, assembly becomes difficult and in the worst case
erroneous. Numerous programs exist for masking out repeats; however,
it is sometimes desirable to remove the masked part of the sequence,
depending on its location.
The project can involve integrating existing methods in combination with
(minor) implementation of additional principles. The focus can be removal
of undesired masked sequence regions. The whole setup can be tested on the
pig genome sequence data.
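As a rough illustration of location-dependent removal of masked regions, the following minimal Python sketch trims masked runs that lie at, or close to, either end of a read and leaves internal masked runs in place. It assumes hard masking with the characters X and N, and the function name `trim_masked_ends` and the `min_flank` threshold are hypothetical choices made for this example, not part of any existing tool.

```python
import re

# Assumption: masked positions are hard-masked as X or N.
MASKED_RUN = re.compile(r"[XN]+")


def trim_masked_ends(seq, min_flank=20):
    """Trim masked runs lying within min_flank bases of either end.

    A terminal masked run is removed together with the short unmasked
    flank outside it. Internal masked runs are kept, so an assembler
    can still see that the two sides belong to the same read.
    """
    while True:
        runs = list(MASKED_RUN.finditer(seq))
        if not runs:
            return seq
        first, last = runs[0], runs[-1]
        if first.start() < min_flank:
            # Masked run near the 5' end: drop it and the flank before it.
            seq = seq[first.end():]
        elif len(seq) - last.end() < min_flank:
            # Masked run near the 3' end: drop it and the flank after it.
            seq = seq[:last.start()]
        else:
            return seq


# Toy read: short flank + masked vector, insert with one internal
# masked repeat, and a masked tail.
read = "ACG" + "X" * 5 + "A" * 30 + "NNN" + "C" * 30 + "XXXX"
cleaned = trim_masked_ends(read)
```

Whether an internal masked run should also be cut out, splitting the read in two, is a design decision the project itself would have to evaluate on real data.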
- Distance constraints in protein structure
Protein structure has been studied for decades, and in recent years many
computational approaches to protein structure prediction have been constructed.
Though some methods can occasionally produce acceptable predictions,
none of them comes close to producing satisfactory predictions.
One approach to understanding a part of protein structure is to study the
distances between residues in proteins. Just as secondary structure prediction
methods (such as neural networks or hidden Markov models) can be constructed
to make predictions on single sequences, so can methods be made to predict
constraints on the physical distances between the residues in the chain.
In the project, statistical features of protein distances can be examined
and used to construct a neural network prediction method that can enhance
the prediction ability of existing methods.
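To make the notion of residue distances concrete, here is a small Python sketch that computes all pairwise distances from a list of 3D coordinates and lists the residue pairs lying within a cutoff. It assumes one coordinate per residue (for example the C-alpha atom); the 8 Å cutoff, the minimum sequence separation of 3 and the toy coordinates are illustrative assumptions only.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def contacts(coords, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j) with j - i >= min_separation and distance <= cutoff.

    Pairs close in sequence are skipped because they are always close
    in space and carry little structural information.
    """
    dist = distance_matrix(coords)
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dist[i][j] <= cutoff]


# Toy hairpin-like chain: three residues out, three residues back,
# spaced 3.8 Angstrom apart (made-up coordinates, not a real protein).
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
             (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
toy_contacts = contacts(toy_chain)
```

Statistics collected over such distance matrices for known structures are the kind of feature that could serve as training targets for a prediction method.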
- Analysis of genes with non-synonymous substitutions
A large number of single nucleotide polymorphisms (SNPs) have been
identified within the coding regions of pig genes. Some of these SNPs give
rise to different variants of the resulting proteins (non-synonymous
SNPs). It is of great interest to characterize the genes with
non-synonymous SNPs further in order to evaluate their functional
importance. This characterization can, for instance, start from
sequencing of the full-length transcript, comparative analysis and mapping,
and lead to functional analysis based on expression studies and
phenotypic characterization of animals exhibiting the different variants
of the gene in question.
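The distinction between synonymous and non-synonymous SNPs can be sketched in a few lines of Python: translate the affected codon before and after the substitution and compare the amino acids. The sketch assumes the standard genetic code, an in-frame coding sequence given as DNA, and a 0-based SNP position; the function name `classify_snp` and the toy sequence are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(cds, pos, alt):
    """Classify a substitution at 0-based position pos of an in-frame CDS.

    Returns (kind, reference amino acid, alternative amino acid).
    """
    start = pos - pos % 3          # first base of the affected codon
    offset = pos % 3               # position of the SNP inside the codon
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


toy_cds = "ATGGCTGAA"                     # Met-Ala-Glu (made-up example)
silent = classify_snp(toy_cds, 5, "C")    # GCT -> GCC, still Ala
changed = classify_snp(toy_cds, 7, "T")   # GAA -> GTA, Glu -> Val
```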
- Analysis of allele-specific gene expression in pigs
New data from the human field provide evidence that differential
expression is relatively common and that allelic differences in expression
are heritable. This expression pattern provides evidence for a model
whereby cis-acting genetic variation results in differential expression
between alleles. In turn, these variations could be of physiological
importance.
To analyse allele-specific gene expression in pigs, a number of
well-characterized single nucleotide polymorphisms identified within the coding
regions of specific genes will be selected as targets. Allele-specific
quantitative PCR will be performed on cDNA isolated from a range of
different tissues from different animals. Based on these analyses,
differential expression will be evaluated and its functional importance
will be considered.
- Prediction and characterization of non-coding RNAs and RNA structure
Until just a few years ago, RNA came in three flavours for most biomolecular
scientists: tRNA, rRNA and mRNA. However, textbooks are already in the
process of being completely rewritten: non-coding RNA (ncRNA) genes have
now been acknowledged to play a major role in gene regulation and
networking. One reason why ncRNAs have been ignored is that they are not
characterized by a well-defined pattern similar to the open reading frames of
protein-coding genes. In fact, their conservation is as much in structure,
where sequence similarity is erased by so-called compensatory base pair
changes: the base pair AU in one sequence can, for example, be replaced by a
CG pair in another, making similarity search hard. However, various
approaches that take the structure into consideration have been
constructed. In addition, mammals contain pseudogenes that closely
resemble ncRNAs in sequence. This further complicates the search.
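The compensatory base pair changes mentioned above can be illustrated with a short Python sketch. Given two aligned RNA sequences and a shared secondary structure in dot-bracket notation, it lists the paired positions where both bases differ between the sequences while each sequence still forms a valid pair. The toy sequences, the structure and the decision to count GU wobble pairs as valid are assumptions made for this example.

```python
# Watson-Crick pairs plus GU wobble pairs (an assumption of this sketch).
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_positions(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def compensatory_changes(seq_a, seq_b, structure):
    """Pairs where both bases changed but pairing is preserved."""
    found = []
    for i, j in paired_positions(structure):
        pair_a = seq_a[i] + seq_a[j]
        pair_b = seq_b[i] + seq_b[j]
        both_changed = seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]
        if both_changed and pair_a in VALID_PAIRS and pair_b in VALID_PAIRS:
            found.append((i, j, pair_a, pair_b))
    return found


structure = "((((...))))"
seq_a = "GGACGAAGUCC"
seq_b = "GGCCGAAGGUC"   # AU pair at (2, 8) replaced by CG
changes = compensatory_changes(seq_a, seq_b, structure)
```

Here the two sequences differ at three positions, yet both fold into the same hairpin; a pure sequence comparison sees only the mismatches, not the preserved structure.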
We have a whole range of interesting projects; some of them are:
- Computational scan for micro RNAs (miRNA). Micro RNAs are a distinct
class of ncRNAs involved in translational regulation. The mature miRNA is
diced and sliced from a stem-loop precursor by a molecular machinery
that much resembles that of siRNA.
- Integrated approach for scanning for ncRNAs. A number of different
methods work on different types of data, such as single or multiple
sequences, as well as different degrees of sequence similarity. Integrating
them into a pipeline will create a general search tool.
- Integrating multiple methods for folding of RNA sequences. As the
previous project, but combining methods for structure prediction rather
than gene finding.
- Advanced project: The Sankoff algorithm for simultaneous folding and
alignment of RNA sequences is O(L^(3N)) in time and O(L^(2N)) in memory for
N sequences of length L. Even for two sequences this is a heavy
complexity, hence constraining the algorithm will gain speed. This
project will be in collaboration with an ongoing PhD project.
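To get a feeling for why the unconstrained Sankoff algorithm is heavy even for two sequences, the following Python sketch evaluates the asymptotic bounds quoted above, read as L to the power 3N for time and L to the power 2N for memory. Constant factors are ignored, so the numbers are orders of magnitude only, and the function name `sankoff_cost` is a hypothetical helper.

```python
def sankoff_cost(length, n_seqs):
    """Order-of-magnitude time steps and memory cells for N sequences of length L."""
    time_steps = length ** (3 * n_seqs)
    memory_cells = length ** (2 * n_seqs)
    return time_steps, memory_cells


# Two sequences of 100 nucleotides: L^6 time steps and L^4 memory cells.
steps, cells = sankoff_cost(100, 2)
print(f"time ~ {steps:.0e} steps, memory ~ {cells:.0e} cells")
```

For two sequences of length 100 this gives on the order of 10^12 steps and 10^8 table cells, which is why restricting the search space is attractive.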
- Clustering of raw EST sequences for sequence assembly
Expressed sequence tags (ESTs) are sequences from expressed genes that have
been randomly fished out of some tissue from some organism. Usually these are
used to search for novel genes; however, they are often incomplete sequences
and need to be assembled into larger contigs to cover a larger portion of the
gene.
Sequence assembly of raw EST sequences is complicated not only by alternative
splice variants, but also by the fact that some EST sequences are
sequenced as the reverse complement due to experimental settings. The goal is
to develop an automated pipeline to detect such cases.
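One simple way to detect a reverse-complemented EST is sketched below in Python: count how many k-mers the EST shares with a reference sequence in each orientation and pick the orientation with more shared k-mers. The function names, the choice k = 8 and the toy sequences are assumptions for illustration; a real pipeline would more likely rely on alignment scores from an existing tool.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_to(reference, est, k=8):
    """Return the EST oriented like the reference, plus the detected strand.

    Ties (including no shared k-mers at all) default to "forward".
    """
    ref_kmers = kmers(reference.upper(), k)
    forward_hits = len(ref_kmers & kmers(est.upper(), k))
    reverse_hits = len(ref_kmers & kmers(reverse_complement(est), k))
    if reverse_hits > forward_hits:
        return reverse_complement(est), "reverse"
    return est.upper(), "forward"


# Made-up reference and an EST taken from it in reverse orientation.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
flipped_est = reverse_complement(reference[5:25])
oriented, strand = orient_to(reference, flipped_est)
```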
- Evolution of secondary metabolism in higher plants.
Cytochromes P450 and family 1 glycosyltransferases are key enzymes in the
biosynthesis of the wealth of secondary metabolites found in higher
plants. Genomic and cDNA sequencing programs for a number of model plants
have unravelled a wealth of information on genes and genomes. The aim of
the project is to obtain a better understanding of the evolution of these two
multigene families in terrestrial plants.
The project includes extracting gene sequences from public databases for
the model organisms (green algae, mosses, gymnosperms, and angiosperms)
and performing comparative analysis. Raw EST sequences will need to be
assembled into contigs prior to gene annotation, and raw genomic sequences
need to be annotated. The project will include PCR applications to
validate annotations. The deduced amino acid sequences of the annotated
genes will be used in phylogenetic tree analysis to infer their
evolutionary relationships.
This project is a collaboration between IPB (Søren Bak) and IBHV (Jan
Gorodkin). For details on this project, contact Søren Bak 3528 3346
(bak@kvl.dk). See also http://www.plbio.kvl.dk/plbio/cyanogen.htm.
Contact
For more information, contact
Jan Gorodkin
Email: gorodkin@bioinf.kvl.dk
Phone: 3538 3578
Comments, questions, etc., email
webmaster@bioinf.kvl.dk