Bioinformatics theses
The Royal Veterinary and Agricultural University (KVL)


Bioinformatics thesis proposals

Objective
Here is a list of bioinformatics projects. Most of them are related to the Danish-Chinese pig genome project, since these data are expected to be used. However, the projects are not limited to these data and can be applied to other types of data as well. Projects can also focus solely on the development of a computational approach. Furthermore, projects are not limited to the ones listed below. We are open to other proposals and ideas.

Facilities
We will attempt to give part- or full-time access to a desk where the project can be carried out in a growing environment of bioinformatics people, where support will be a natural part of the environment. We have a local Linux cluster to which the student will be granted access and, if relevant, access to supercomputers at SDU (http://www.dcsc.sdu.dk) and DTU (CBS: http://www.cbs.dtu.dk/cbs/dcsc-cbs.php) through the Danish Center for Scientific Computing (http://www.dcsc.dk).

MSc projects
The project proposals are listed in ARBITRARY order, so please take your time to browse through all of them.

  1. Sequence annotation using prediction methods

    One of the biggest problems with genome annotation is that it essentially relies on similarity to existing, well-characterized sequences within the genome and in the genomes of related organisms.

    Recent research (from expression experiments) has shown that about half of all expressed genes in the human genome are unknown! In the pig genome project we observe the same trend, and it is of interest to obtain structural and functional information about the unknown sequences. In this project, various prediction methods for proteins as well as non-coding RNAs can be integrated to obtain predictive knowledge. Such knowledge is the first step in generating gene candidates for further experimental characterization.


  2. Integrative computational approach for cleaning up raw sequenced DNA

    When raw DNA sequences are converted to bits in the computer, they need to be cleaned prior to use. They typically contain vector and other undesired sequences. In addition, such sequences typically need to be assembled with other such sequences. However, if the sequences contain repetitive sequence, assembly is made difficult and, in the worst case, erroneous. Numerous programs exist for masking out repeats; however, it is sometimes desirable to remove the masked part of the sequence, depending on its location.

    The project can involve integrating existing methods in combination with (minor) implementation of additional principles. The focus can be removal of undesired masked sequence regions. The whole setup can be tested on the pig genome sequence data.
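
    As an illustration of location-dependent removal of masked regions, here is a minimal sketch. It assumes that a repeat masker has already replaced repeats with a mask character ("X" here; "N" is also common). The function names and the min_run threshold are hypothetical and not part of any existing tool.

```python
import re


def trim_masked_ends(seq, mask_char="X"):
    """Remove masked runs at either end of a sequence; internal runs are kept."""
    return seq.strip(mask_char)


def split_on_long_masks(seq, mask_char="X", min_run=10):
    """Split a sequence at internal masked runs of at least min_run characters.

    Shorter masked runs are left in place, so the flanking sequence stays joined.
    """
    pattern = re.escape(mask_char) + "{%d,}" % min_run
    return [piece for piece in re.split(pattern, seq) if piece]


if __name__ == "__main__":
    read = "XXXACGT" + "X" * 12 + "GGCCXXTTXX"
    trimmed = trim_masked_ends(read)
    print(trimmed)
    print(split_on_long_masks(trimmed))
```

    Whether a masked region should be trimmed, split on, or kept is exactly the kind of rule the project could investigate; the thresholds above are placeholders.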


  3. Distance constraints in protein structure

    Protein structure has been studied for decades, and in recent years many computational approaches to protein structure prediction have been constructed. Though some methods can occasionally produce acceptable predictions, none of them is even close to producing satisfactory predictions.

    One approach to understanding a part of protein structure is to study the distances between residues in proteins. Just as secondary structure prediction methods (such as neural networks or hidden Markov models) can be constructed to make predictions on single sequences, so can methods be made to predict constraints on the physical distances between the residues in the chain.

    In the project, statistical features of protein distances can be examined and used to construct a neural network prediction method that can enhance the prediction ability of existing methods.
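
    To make the notion of residue distances concrete, the sketch below computes a pairwise distance matrix from one 3D coordinate per residue (for example the C-alpha atom) and lists residue pairs closer than a cutoff. The 8.0 cutoff, the minimum sequence separation and the function names are illustrative assumptions, not values prescribed by the project.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of Euclidean distances between residues.

    coords is a list of (x, y, z) tuples, one per residue.
    """
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            dist[i][j] = dist[j][i] = d
    return dist


def contact_pairs(coords, cutoff=8.0, min_separation=2):
    """List residue pairs (i, j) at least min_separation apart in the chain
    whose distance is at most cutoff."""
    dist = distance_matrix(coords)
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dist[i][j] <= cutoff]


if __name__ == "__main__":
    toy_coords = [(0, 0, 0), (3, 4, 0), (0, 0, 10), (0, 0, 6)]
    print(contact_pairs(toy_coords))
```

    Distance matrices of this kind, computed over known structures, could provide the training targets for a prediction method.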


  4. Analysis of genes with non-synonymous substitutions

    A large number of single nucleotide polymorphisms (SNPs) have been identified within the coding regions of pig genes. Some of these SNPs give rise to different variants of the resulting proteins (non-synonymous SNPs). It is of great interest to characterize the genes with non-synonymous SNPs further in order to evaluate their functional importance. This characterization can, for instance, start from sequencing of the full-length transcript, comparative analysis and mapping, and lead to functional analysis based on expression studies and phenotypic characterization of animals exhibiting the different variants of the gene in question.
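
    The distinction between synonymous and non-synonymous SNPs can be sketched as follows, assuming the standard genetic code, an in-frame coding sequence and a 0-based SNP position. The function name classify_snp and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(cds, position, alt_base):
    """Classify a SNP in an in-frame coding sequence.

    position is 0-based within cds. Returns (kind, reference amino acid,
    alternative amino acid), where kind is "synonymous" or "non-synonymous".
    """
    codon_start = position - position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = position - codon_start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    toy_cds = "ATGGCTGAA"  # Met-Ala-Glu
    print(classify_snp(toy_cds, 5, "C"))
    print(classify_snp(toy_cds, 7, "T"))
```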


  5. Analysis of allele-specific gene expression in pigs

    New data from the human field provide evidence that differential expression is relatively common and that allelic differences in expression are heritable. This expression pattern provides evidence for a model whereby cis-acting genetic variation results in differential expression between alleles. In turn, these variations could be of physiological importance.

    To analyse allele-specific gene expression in pigs, a number of well-characterized single nucleotide polymorphisms identified within the coding regions of specific genes will be selected as targets. Allele-specific quantitative PCR will be performed on cDNA isolated from a range of different tissues from different animals. Based on these analyses, differential expression will be evaluated and the functional importance will be considered.


  6. Prediction and characterization of non-coding RNAs and RNA structure

    Until just a few years ago, RNA came in three flavours for most biomolecular scientists: tRNA, rRNA and mRNA. However, textbooks are already in the process of being completely rewritten: non-coding RNA (ncRNA) genes have now been acknowledged to play a major role in gene regulation and networking. One reason why ncRNAs have been ignored is that they are not characterized by a pattern similar to the open reading frames of protein-coding genes. In fact, their conservation is as much in structure, where sequence similarity is erased by so-called compensatory base pair changes: the base pair AU in one sequence can, for example, be replaced by a CG pair in another, making similarity search hard. However, various approaches that take the structure into consideration have been constructed. In addition, mammals contain pseudogenes that closely resemble ncRNAs by sequence similarity. This further complicates the search.
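
    Compensatory base pair changes can be illustrated with a minimal sketch. It assumes two gap-free aligned RNA sequences of equal length and a shared secondary structure given in dot-bracket notation without pseudoknots; the function names are hypothetical.

```python
# Watson-Crick pairs plus the GU wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairs_from_dot_bracket(structure):
    """Return the sorted list of base-paired positions (i, j) in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def compensatory_changes(seq_a, seq_b, structure):
    """Find paired positions where both bases differ between the two
    sequences but both sequences can still form a canonical pair."""
    hits = []
    for i, j in pairs_from_dot_bracket(structure):
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        if (pair_a in CANONICAL and pair_b in CANONICAL
                and seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]):
            hits.append((i, j, "".join(pair_a), "".join(pair_b)))
    return hits


if __name__ == "__main__":
    # The AU pair at positions (1, 5) is replaced by a CG pair.
    print(compensatory_changes("GAAAAUC", "GCAAAGC", "((...))"))
```

    In the toy example the two sequences differ at two of seven positions, yet both still fold into the same stem.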

    We have a whole range of interesting projects. Some of them are:

    1. Computational scan for microRNAs (miRNAs). MicroRNAs are a distinct class of ncRNAs involved in translational regulation. The mature miRNA is diced and sliced from a stem-loop precursor by a molecular machinery that much resembles that of siRNA.

    2. Integrated approach for scanning for ncRNAs. A number of different methods work on different types of data, such as single or multiple sequences, and at different degrees of sequence similarity. Integrating them into a pipeline will create a general search tool.

    3. Integrating multiple methods for folding of RNA sequences. As in project 2, but combining methods for structure prediction rather than gene finding.

    4. Advanced project: The Sankoff algorithm for simultaneous folding and alignment of RNA sequences is O(L^(3N)) in time and O(L^(2N)) in memory for N sequences of length L. Even for two sequences this is a heavy complexity (O(L^6) in time and O(L^4) in memory), hence constraining the algorithm will gain speed. This project will be in collaboration with an ongoing PhD project.
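
    A back-of-the-envelope sketch of why constraints help: the first function evaluates the unconstrained growth terms (L to the power 3N in time, L to the power 2N in memory), and the second assumes one possible constraint for two sequences, namely that each position in the second sequence lies within delta of the corresponding position in the first. The band constraint and the function names are illustrative assumptions, not the method of the ongoing project.

```python
def sankoff_cost(length, n_seqs=2):
    """Growth terms (time, memory) for the unconstrained algorithm."""
    return length ** (3 * n_seqs), length ** (2 * n_seqs)


def banded_pairwise_cost(length, delta):
    """Growth terms (time, memory) for two sequences when every index in the
    second sequence is restricted to a band of half-width delta."""
    width = min(2 * delta + 1, length)
    return length ** 3 * width ** 3, length ** 2 * width ** 2


if __name__ == "__main__":
    full_time, full_mem = sankoff_cost(100)
    band_time, band_mem = banded_pairwise_cost(100, delta=5)
    print(full_time, full_mem)
    print(band_time, band_mem)
    print(full_time // band_time)  # approximate speed-up factor
```

    These are counts of dynamic programming cells and updates up to constant factors, not run times.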


  7. Clustering of raw EST sequences for sequence assembly

    Expressed sequence tags (ESTs) are sequences of expressed genes that have been randomly fished out of some tissue from some organism. Usually these are used to search for novel genes; however, they are often incomplete sequences and need to be assembled into larger contigs to cover a larger portion of the gene.

    Sequence assembly of raw EST sequences is complicated not only by alternative splice variants, but also by the fact that some EST sequences are sequenced as the reverse complement due to experimental settings. The goal is to develop an automated pipeline to detect such cases.
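
    One simple way to flag reverse-complemented ESTs is to compare how many k-mers an EST shares with a reference sequence (or another cluster member) in each orientation. The sketch below assumes exact k-mer matches and a hypothetical default of k=8; a real pipeline would need to tolerate sequencing errors.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(est, reference, k=8):
    """Guess whether est matches reference as given or reverse complemented,
    based on the number of shared k-mers in each orientation."""
    ref_kmers = kmers(reference, k)
    forward = len(kmers(est, k) & ref_kmers)
    reverse = len(kmers(reverse_complement(est), k) & ref_kmers)
    if forward == reverse:
        return "undetermined"
    return "forward" if forward > reverse else "reverse"


if __name__ == "__main__":
    reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
    est = reference[5:25]
    print(guess_orientation(est, reference))
    print(guess_orientation(reverse_complement(est), reference))
```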


  8. Evolution of secondary metabolism in higher plants

    Cytochromes P450 and family 1 glycosyltransferases are key enzymes in the biosynthesis of the wealth of secondary metabolites found in higher plants. Genomic and cDNA sequencing programs for a number of model plants have unravelled a wealth of information on genes and genomes. The aim of the project is to obtain a better understanding of the evolution of these two multigene families in terrestrial plants.

    The project includes extracting gene sequences from public databases of the model organisms (green algae, mosses, gymnosperms, and angiosperms) and performing comparative analysis. Raw EST sequences will need to be assembled into contigs prior to gene annotation, and raw genomic sequences need to be annotated. The project will include PCR applications to validate annotations. The deduced amino acid sequences of the annotated genes will be used in phylogenetic tree analysis to obtain their evolutionary relationships.

    This project is a collaboration between IPB (Søren Bak) and IBHV (Jan Gorodkin). For details on this project, contact Søren Bak 3528 3346 (bak@kvl.dk). See also http://www.plbio.kvl.dk/plbio/cyanogen.htm.

Contact
For more information, contact
Jan Gorodkin
Email: gorodkin@bioinf.kvl.dk
Phone: 3538 3578

Comments, questions, etc., email webmaster@bioinf.kvl.dk

Last updated June 9th, 2004 by Jan Gorodkin