Improved Splice Site Detection in Genie
Martin G. Reese, Frank H. Eeckman
Human Genome Informatics Group, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720
mgreese@lbl.gov, fheeckman@lbl.gov

David Kulp, David Haussler
Baskin Center for Computer Engineering and Computer Science, University of California, Santa Cruz, CA 95064, USA
dkulp@cse.ucsc.edu, haussler@cse.ucsc.edu
We present an improved splice site predictor for the gene finding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in gene finding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 82% of coding nucleotides correctly with a specificity of 81%, versus 74% and 81% in the older system. In further splice site experiments, we also looked at correlations between splice site scores and intron and exon lengths, as well as at the effect of distance to the nearest splice site on false positive rates.
1 Introduction
Current estimates for sequence output from the Human Genome Project are two Mbases per day, every day for the next seven years. Such a high throughput makes it important to develop new tools for annotation and analysis of genomic sequence. It is particularly difficult to identify exactly the coding regions in genomic DNA, from which one can deduce the structure of genes and gene products. Gene finding research over the past 15 years has concentrated on the recognition of short signal sequences and codon usage statistics. Signal sequences include promoters, start codons, splice sites, and stop codons. Of these, splice sites are especially important since they define the boundaries between exons and introns, and hence the exact extents of coding regions. Fickett [1] provides an overview and evaluation of the statistical measures for signal and content sensors. More recently, gene finding systems have been developed that employ many of the known recognition techniques in concert (examples include FGENEH [2], GenLang [3], and GENEMARK [4]). Current state-of-the-art gene finding methods combine multiple statistical measures with database homology searching to identify gene features. (See, for example, GRAILII, GeneID [7, 8], and GeneParser3 [9].)

We have designed a new gene finder of this type that we call Genie [10]. Our system is similar in design to GeneParser, but is based on a rigorous probabilistic framework throughout, even where the homology matching methods are employed. While Genie has performed well in comparison to other systems, all the best systems to date are still deficient in accurate prediction of the exon-intron boundaries, so any improvement in this area is likely to have a significant impact on the overall prediction process. Here we describe some recent experiments with Genie's splice site prediction method that resulted in significant improvements in overall performance.

Genie is an implementation of a generalized hidden Markov model (GHMM), that is, a hidden Markov model [12] whose states are arbitrary sub-models emitting variable-length sequences rather than single letters, as in a standard HMM. In the next section we give a brief introduction to GHMMs, using as an example a GHMM that defines a simple gene structure syntax. We then discuss how signal sensors are used to identify transitions, such as intron-exon boundaries, and how content sensors are used to "score" candidate regions, such as proposed exons, in the parse of a DNA sequence. Next, we describe our splice site recognition methodology and how it has been enhanced by using neural networks that rely on dinucleotide frequencies. Other experiments using our splice site predictors are described here as well.

Figure 1: A simple GHMM for a sequence containing a multiple exon gene. The arcs represent states that emit strings of bases and nodes represent transitions between states. The state labels are J5': 5' UTR, EI: Initial Exon, E: Internal Exon, I: Intron, EF: Final Exon, ES: Single Exon, and J3': 3' UTR. The node labels are B: Begin, S: Start Translation, D: Donor, A: Acceptor, T: Stop Translation, F: Final. The arrows imply a generation of bases from 5' to 3'.
2 Methodology
2.1 Basic System Framework
A generalized hidden Markov model is an enhancement of the standard hidden Markov model often used in time series pattern recognition in speech and computational biology. (See, among others, the tutorial by Rabiner and Juang [13] and the introduction to HMMs in biosequence analysis by Krogh et al. [12].) In a standard hidden Markov model, viewed as a generator, each state emits a single symbol. A GHMM describes a more general model in which each state can emit one or more symbols according to an arbitrary distribution. Each state represents an independent sub-model which may itself be a hidden Markov model or any other statistical model.

Figure 1 shows a simple GHMM that models eukaryotic gene structure. The GHMM is represented as a graph. The states in the model are shown as the arcs of the graph. Nodes in the graph represent transitions between states. (This is different from typical graphical representations of regular HMMs.) Each state corresponds to a sub-model of an abstract gene feature such as an "Internal Exon" (E) or an "Intron" (I). For any sequence of bases $x$ and state $q$, the sub-model associated with the state $q$ defines a likelihood for the sequence $x$. This likelihood is denoted $P(x \mid q)$. When the GHMM is viewed as a generative statistical model, this is the probability that the sequence $x$ is emitted when the hidden Markov process is in state $q$. These likelihood functions, one for each state, are part of the definition of the GHMM.

The graph of a GHMM has a unique source node B (for Begin) and a unique sink node F (for Final). The process of generating a string from a GHMM can be viewed as taking a random walk in the graph for the GHMM from the source to the sink. For any state $q$, the node that the arc for state $q$ leads to is denoted $\mathrm{node}(q)$. Once in this node, a next state is chosen at random from among the outgoing arcs from this node, independent of any previous choices made. The probability of choosing the next state $r$ is denoted $P(r \mid \mathrm{node}(q))$. For example, in Figure 1, the state I (Intron) leads to the node A (Acceptor). After the acceptor can come either the internal exon state (E) or the final exon state (EF). The former is chosen with probability $P(E \mid A)$ and the latter with probability $P(EF \mid A)$, where $P(E \mid A) + P(EF \mid A) = 1$.
These parameters are part of the definition of the GHMM, and are in practice determined from training data, as are the parameters defining the likelihood functions $P(x \mid q)$ defined above.

The full process of generating a string from a GHMM consists of a sequence of random choices. First a state $q_1$ is chosen from among the outgoing arcs of the source node B. Then a substring $x_1$ is generated according to the probability distribution $P(\cdot \mid q_1)$. Then a next state $q_2$ is selected from among the outgoing arcs from $\mathrm{node}(q_1)$. Then a substring $x_2$ is generated according to the probability distribution $P(\cdot \mid q_2)$, and so on, continuing like this until a state $q_k$ that leads to the sink node is selected. This state emits the last substring $x_k$. The full string emitted by the GHMM is the concatenation $X = x_1 \ldots x_k$ of all the substrings that are emitted. All random choices made in the process of generating the string $X$ are independent, except for the dependencies in the sequence $q_1 \ldots q_k$ of states, which form a Markov chain. In applications of GHMMs, this sequence of states is not observed; only the sequence $X$ is observed. Therefore they are called hidden Markov models.

We define a parse $\phi$ of the sequence $X$ to be a pair consisting of a sequence of states $q_1 \ldots q_k$ and a corresponding sequence of substrings $x_1 \ldots x_k$, where $X = x_1 \ldots x_k$, $q_1$ is a state arc coming out of the unique source node (B), and $q_k$ is a state arc leading to the unique sink node (F). The GHMM defines a joint likelihood of the sequence $X = x_1 \ldots x_k$ and the parse $\phi = (q_1 \ldots q_k, x_1 \ldots x_k)$, according to the generative model described above. It is the product of the probabilities of the substrings given the corresponding states and the probabilities of the transitions between states. That is,

$$P(X, \phi) = P(q_1 \mid B) \left( \prod_{i=1}^{k} P(x_i \mid q_i) \right) \left( \prod_{i=1}^{k-1} P(q_{i+1} \mid \mathrm{node}(q_i)) \right). \qquad (1)$$
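As an illustration of equation 1, the following minimal Python sketch evaluates the joint log-likelihood of one given parse. It is not Genie's code: the function name, the dictionary layout for the transition parameters, and the toy uniform emission score are all assumptions made for this example.

```python
import math

def parse_log_likelihood(states, substrings, start_prob, trans_prob,
                         emit_logprob, node_of):
    """Return log P(X, phi) for a parse phi = (q_1..q_k, x_1..x_k), as in equation 1.

    start_prob[q]       : P(q | B), probability of leaving the source node B via q
    trans_prob[node][r] : P(r | node), probability of choosing state r at a node
    emit_logprob(x, q)  : log P(x | q), the state's likelihood function
    node_of[q]          : the node that the arc for state q leads to
    (All four are hypothetical containers chosen for this sketch.)
    """
    if not states or len(states) != len(substrings):
        raise ValueError("a parse needs exactly one substring per state")
    # First factor: P(q_1 | B).
    total = math.log(start_prob[states[0]])
    # Second factor: product over i = 1..k of P(x_i | q_i).
    for q, x in zip(states, substrings):
        total += emit_logprob(x, q)
    # Third factor: product over i = 1..k-1 of P(q_{i+1} | node(q_i)).
    for q, r in zip(states, states[1:]):
        total += math.log(trans_prob[node_of[q]][r])
    return total

# Toy parameters (assumed): every base scores 0.25 regardless of state.
# This toy score ignores the length distribution a real sub-model would include.
toy_start = {"EI": 1.0}
toy_node_of = {"EI": "D", "I": "A", "EF": "T"}
toy_trans = {"D": {"I": 1.0}, "A": {"E": 0.6, "EF": 0.4}}
toy_emit = lambda x, q: len(x) * math.log(0.25)

ll = parse_log_likelihood(["EI", "I", "EF"], ["ATG", "GTAAG", "TAA"],
                          toy_start, toy_trans, toy_emit, toy_node_of)
```

Working in log space avoids numerical underflow when the product runs over long sequences.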
Given only the observed sequence $X$, using a variant of the Viterbi algorithm [13], we can calculate the parse $\phi$ that maximizes equation 1, i.e. the most likely parse of $X$. In a GHMM that represents gene structure, such as the one in Figure 1, this most likely parse represents the model's prediction of the most likely gene structure within the sequence $X$. This variant of the Viterbi algorithm used to find the most likely parse is a dynamic programming algorithm that is essentially the same as the one defined by Auger and Lawrence [14] to identify segment neighborhoods, by Sankoff [15] to optimally decompose a sequence into disjoint regions with particular properties, and by Gelfand and Roytberg [16], Snyder and Stormo [17], Stormo and Haussler [18], and many others to do gene finding. So we do not elaborate on it here. GHMMs place these previous approaches within a convenient and general probabilistic framework.

The GHMM in Figure 1 represents only the basic ordering of gene features, and fails to fully capture the syntactic restrictions of a "legal gene parse". In an ideal DNA sequence, the parse is "frame consistent", i.e., the total number of coding nucleotides is a multiple of three and the reading frame is consistent from exon to exon. We can add additional states to the model graph such that only frame consistent parses are allowed. Figure 2 shows the model graph representing the resulting frame consistent GHMM. The three levels represent the three frames. Exon lengths can be restricted in the likelihood functions $P(x \mid q)$ to equal 0, 1 or 2 modulo 3 for the various exon states in this GHMM in such a way as to enforce frame consistency (see Kulp et al. [19]). This more complex state structure is used by Genie.

Further extensions to the GHMM graph can also be added to make the model more realistic. For example, an arc leading back from node T to node S, labeled with a state that generates non-coding bases between genes, would allow the GHMM to model sequences that have multiple genes within them.
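The dynamic programming search for the most likely parse is not spelled out in the text. The Python sketch below is one possible segment-based Viterbi recursion for it, written under stated assumptions: every position is allowed as a segment boundary (Genie restricts boundaries to sites proposed by signal sensors), the parameter containers are the hypothetical ones named in the docstring, and the two-state toy model is invented purely to exercise the code.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def most_likely_parse(X, states, start_prob, trans_prob, node_of,
                      emit_logprob, sink="F"):
    """Return (log score, [(state, substring), ...]) of the best parse of X.

    best[j][q] is the best log score of a parse of X[:j] whose last state is q.
    Runs in O(n^2 * |states|^2) time because every boundary is considered.
    """
    n = len(X)
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for q in states:
            for i in range(j):
                emit = emit_logprob(X[i:j], q)
                if emit == NEG_INF:
                    continue
                if i == 0:
                    # First segment: leave the source node via q.
                    candidates = [(_log(start_prob.get(q, 0.0)), None)]
                else:
                    # Later segment: arrive from some previous state p.
                    candidates = [
                        (best[i][p] + _log(trans_prob.get(node_of[p], {}).get(q, 0.0)), p)
                        for p in states
                    ]
                for prev_score, p in candidates:
                    score = prev_score + emit
                    if score > best[j][q]:
                        best[j][q] = score
                        back[j][q] = (i, p)
    # The last state must lead to the sink node.
    finals = [q for q in states if node_of[q] == sink and best[n][q] > NEG_INF]
    if not finals:
        return NEG_INF, []
    q = max(finals, key=lambda s: best[n][s])
    score, parse, j = best[n][q], [], n
    while q is not None:
        i, p = back[j][q]
        parse.append((q, X[i:j]))
        j, q = i, p
    return score, parse[::-1]

# Toy model (assumed): state "A" emits only 'a', state "C" emits only 'c'.
def toy_emit(x, q):
    letter = "a" if q == "A" else "c"
    return len(x) * math.log(0.9) if set(x) == {letter} else NEG_INF

toy_score, toy_parse = most_likely_parse(
    "aacc", ["A", "C"], {"A": 1.0}, {"m": {"C": 1.0}},
    {"A": "m", "C": "F"}, toy_emit)
```

Restricting the inner loop over `i` to candidate sites from the signal sensors would reduce the quadratic cost, at the risk of excluding a site needed by the correct parse.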
2.2 Sensors
A sensor is a mechanism for recognizing or scoring a subsequence according to a model of an abstract gene feature. There are two types of sensors used in the Genie system: signal sensors and content sensors.
Figure 2: A GHMM including frame constraints. The additional acceptor and donor transition nodes ensure that only syntactically correct parses are considered.
2.3 Signal Sensor Models
Signal sensors are used to recognize transitions between states in a GHMM. This type of sensor is used in a pre-processing step to identify candidate sites where state transitions can occur. When only a limited number of sites are considered as possible locations of state transitions, the dynamic programming method used to find the most likely parse runs much faster. However, care must be taken here, because if an important site, such as a splice site needed in the correct or optimal parse, is not included, then the dynamic programming method will no longer find the correct solution. In the GHMMs shown in Figures 1 and 2, the nodes correspond to gene features such as acceptor sites, donor sites, and positions of start and stop translation. A typical signal sensor might be a neural network to recognize an acceptor site, as described further below.
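The pre-processing step can be pictured with a short Python sketch. It assumes that candidate sites are consensus dinucleotides whose sensor score reaches a threshold; the function name, the `score_fn` callback, and the toy scoring rule are hypothetical stand-ins for a trained signal sensor.

```python
def candidate_sites(seq, consensus, score_fn, threshold):
    """Return the positions of `consensus` in `seq` whose score is >= threshold.

    score_fn(seq, pos) stands in for a signal sensor such as a neural network.
    """
    seq = seq.upper()
    sites = []
    pos = seq.find(consensus)
    while pos != -1:
        if score_fn(seq, pos) >= threshold:
            sites.append(pos)
        # Continue one base later so overlapping occurrences are not skipped.
        pos = seq.find(consensus, pos + 1)
    return sites

# Toy scorer (assumed): favour a 'GT' that is immediately followed by 'A'.
def toy_score(seq, pos):
    return 1.0 if seq[pos + 2:pos + 3] == "A" else 0.0

strict = candidate_sites("AAGTCCGTAAGT", "GT", toy_score, 0.5)
lenient = candidate_sites("AAGTCCGTAAGT", "GT", toy_score, 0.0)
```

A low threshold keeps more sites and slows the parse search; a high threshold risks dropping a true site, which is the trade-off described above.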
2.4 Content Sensor Models
Content sensors are used to estimate the likelihood of a subsequence given a particular state in the GHMM. Some basic content sensors used by Genie were described in [10]. Since that paper, a more sophisticated type of content sensor has been developed for Genie. This new type of content sensor integrates evidence contributed from multiple sources and estimates a likelihood of a subsequence from the combined information.

In the new Genie content sensor, each source of evidence is called a component; each component is trained to recognize a specific feature. Figure 3 shows an example of a fictitious subsequence whose likelihood is being evaluated by an internal exon content sensor. The internal exon content sensor is composed of several components: a nucleotide component, a codon component, end-region components representing the regions adjacent to the acceptor and donor sites, and a database homology match component. A component returns a likelihood for each potential feature occurrence, called an "extent". In the figure, the maximum likelihood is determined by the joint probability of the extents shown at the bottom of the figure, i.e. an acceptor extent, followed by two nucleotide extents, a database match extent, and three codon extents. Again, we use dynamic programming to decompose the subsequence into a series of extents in such a way that the joint probability of all extents is maximized. This decomposition is then used to calculate the likelihood. This simple, efficient method encourages a modular approach to developing an effective gene finding system because components can be easily added to or removed from a content sensor.

Figure 3: A sample content sensor combines evidence from multiple components to derive a maximum likelihood of the sequence. The arrow shows the combination of component features corresponding to the maximum likelihood.
2.5 Identifying Splice Sites
The biological process of splicing is quite complex and involves various proteins. The mechanism responsible for the exact deletion of introns is probably related to gene conversion using a cDNA copy of the mRNA of a partially spliced intermediate RNA [20]. Clearly the intron-exon structure of genes has been very important in the generation of new genes during evolution (for an overview see Sharp [21]).

The problem of recognizing splice sites by computer analysis was first addressed by Brunak et al. [22]. They trained a backpropagation feedforward neural network with one layer of hidden units to recognize donor and acceptor sites, respectively. The best results were obtained by combining a neural network to recognize the consensus signal at the splice site with another one that predicted coding regions based on the statistical properties of codon usage and preference. Solovyev and Lawrence [23] described a prediction program for splice sites based on oligonucleotide composition around the actual splice site location combined with discriminant analysis. The splice site recognizer used in the gene finding program GRAIL [24] also uses a standard neural network approach. All the above-mentioned programs have a very high rate of false positives.

In the first Genie version [19] we implemented a feedforward neural network similar to the one described in Brunak et al. [22]. The sequence was encoded using 4 input units for each nucleotide, one hidden unit layer was used, and there was one yes/no output unit. For training we used standard backpropagation. We trained one network to recognize the donor sites, using a hidden layer of 10 units, and another one to recognize acceptor sites, using 40 hidden units. In contrast with Brunak et al., we trained our networks only on positive and negative examples that have consensus splice sites, i.e., 'GT' for the donor and 'AG' for the acceptor site. A positive example for a donor site is a window of 15 residues of DNA from -7 to +8 around the 'GT' in an actual human donor splice site, while a negative example is a window of the same size around a 'GT' that is in a neighborhood of plus or minus 40 nucleotides around a real splice site, but is not itself a real splice site. The training examples for the net that learns to recognize acceptor sites are similar, except that the window size was larger. Experiments showed that 41 residues from -21 to +20 around the consensus 'AG' were optimal for this application. We made no attempt to recognize the rare non-consensus splice sites that have been documented [25, 26].

Recently Henderson et al. [27] showed that neighboring nucleotides are very strongly correlated in the splice site consensus pattern. Based on these results, we have changed our input representation from a 4-bit code per base to a 16-bit code per nucleotide pair. Hence a window of 15 nucleotides is encoded as 14 pairs of adjacent nucleotides, and each pair is represented by 16 inputs that are all set to zero except the one in the position representing the letter pair in question, which is set to 1. This allows the network to easily model pairwise correlations between adjacent nucleotides. We also reduced the number of hidden units to 2 in the donor net and 10 in the acceptor net. This new network shows significantly improved prediction performance. We describe our experimental results below.
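The two input encodings can be compared with a small Python sketch. The ordering of the bases and of the 16 pair positions is an assumption made for this example (the text does not specify it), as are the function names and the sample window.

```python
BASES = "ACGT"
# Assumed ordering of the 16 nucleotide pairs: AA, AC, AG, AT, CA, ..., TT.
PAIR_INDEX = {a + b: 4 * i + j
              for i, a in enumerate(BASES)
              for j, b in enumerate(BASES)}

def encode_single(window):
    """Old scheme: 4 inputs per nucleotide, exactly one of them set to 1."""
    vec = []
    for base in window.upper():
        unit = [0] * 4
        unit[BASES.index(base)] = 1   # raises ValueError on non-ACGT symbols
        vec.extend(unit)
    return vec

def encode_dinucleotide(window):
    """New scheme: 16 inputs per pair of adjacent nucleotides."""
    w = window.upper()
    vec = []
    for a, b in zip(w, w[1:]):        # overlapping pairs: n bases give n - 1 pairs
        unit = [0] * 16
        unit[PAIR_INDEX[a + b]] = 1   # raises KeyError on non-ACGT symbols
        vec.extend(unit)
    return vec

donor_window = "CAGGTAAGTATCGGA"          # an arbitrary 15-base example window
old_inputs = encode_single(donor_window)        # 15 * 4  = 60 inputs
new_inputs = encode_dinucleotide(donor_window)  # 14 * 16 = 224 inputs
```

With this encoding a 41-base acceptor window yields 40 pairs, so the acceptor net would see 640 inputs. Each input unit now corresponds to a specific adjacent pair at a specific position, so a single weight can capture a pairwise correlation that the 4-bit code can only represent through hidden units.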
3 Experiments on splice site detection
Figure 4 and Figure 5 show the performance improvement using the 16-bit input coding for nucleotide pairs. The results shown are obtained by testing the neural network performance on an independent test set of 50 genes that were not included in the training set, and not closely homologous to genes in the training set. In testing, all false splice sites with the consensus dinucleotide that could be found in a neighborhood of plus or minus 40 nucleotides around a real splice site were included as negative examples.
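The selection of positive and negative donor examples can be sketched in Python as follows. The exact placement of the 'GT' inside the 15-base window is an assumption here (7 bases upstream of the G, then the 'GT', then 6 bases downstream); the function name and parameters are hypothetical.

```python
def donor_windows(seq, true_sites, upstream=7, window=15, radius=40):
    """Collect windows around consensus 'GT' sites.

    true_sites : positions (0-based index of the G) of annotated donor sites
    Returns (positives, negatives). A 'GT' that is not a true site counts as a
    negative only if it lies within `radius` bases of some true site.
    """
    seq = seq.upper()
    true_set = set(true_sites)
    positives, negatives = [], []
    pos = seq.find("GT")
    while pos != -1:
        start = pos - upstream
        # Skip sites too close to either end to fit a full window.
        if start >= 0 and start + window <= len(seq):
            win = seq[start:start + window]
            if pos in true_set:
                positives.append(win)
            elif any(abs(pos - t) <= radius for t in true_set):
                negatives.append(win)
        pos = seq.find("GT", pos + 1)
    return positives, negatives

# Toy sequence (assumed): 'GT' at positions 10 (true site), 22 and 84.
toy_seq = "A" * 10 + "GT" + "A" * 10 + "GT" + "A" * 60 + "GT" + "A" * 10
toy_pos, toy_neg = donor_windows(toy_seq, true_sites=[10])
```

The same routine would apply to acceptor sites by searching for 'AG' with a 41-base window.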
Figure 4: Donor: correct positives vs false positives. Performance of the new donor-sensing neural network (New) versus the old donor sensor (Old). Percentage false positive predictions are plotted on the x-axis and the corresponding percentage true positive predictions on the y-axis.
Figure 5: Acceptor: correct positives vs false positives. Performance of the new acceptor neural network (New) versus the old acceptor sensor (Old), plotted against the percentage of false positives.
Figure 6: Prediction scores for "false" 'GT'/'AG' sites in the neighboring regions of true splice sites, shown for the donor and acceptor neural networks (Donor NN, Acceptor NN). The score for each position in the sequence relative to the true splice site position is averaged over the entire data set.
One sees that in the interesting region where the threshold of the neural network is set so that the algorithm produces 1-10% false positive predictions, the new networks are much more sensitive. In particular, suppose one is able to tolerate a rate of 7% false positives in donor site prediction, i.e. in 7% of the cases where the network is given a window of 15 nucleotides centered around a 'GT' that is not a real donor site, it nevertheless classifies it as a real donor site. Then the net can achieve a rate of 98.67% true positives, with only 1.33% false negatives. The previous network had a corresponding false negative rate of 3.85%. Reducing the number of false negative splice site predictions is especially critical in a system that tries to find whole genes. As mentioned above, one false negative splice site prevents the system from ever constructing an entirely correct set of exons for the whole gene. In contrast, even in the presence of many false positives, so long as all the true positives are there, it is still possible for the dynamic programming/Viterbi optimization to select the correct set of exons, if there is enough gene content signal.

A number of additional experiments were done comparing the scores of a "false" splice site close to a true splice site versus scores of a false splice site far away from a true splice site. There appears to be a tendency for a false splice site close to the true splice site to have a lower score than a similar false splice site far away from any true site, which reduces the chance that the gene finder will mistakenly pick a wrong splice site very close to the true splice site. In particular, the scores for false splice sites are atypically low when they occur in a window of about 50 bases on both sides of a true donor site (Figure 6). While scores for false acceptor sites in a window of about 100 bp on the intron side are atypically low, false acceptor sites on the exon side show surprisingly high scores. We believe that some of these high scores are due to annotation errors in the genomic DNA database.

In our entire data set we find splice sites that "cut" the codon frame in all three positions. 41.3% of the splice sites are cut in frame 0, 38.7% in frame 1 and only 20.0% in frame 2. We believe that this significantly lower number of splice sites in frame 2 is due to the low coding value of the nucleotide in this codon position, which has little effect on the translated amino acid sequence. We also did a series of experiments where we trained three separate neural nets for donor site prediction, one for each frame, and similarly for acceptors. These experiments showed that splice sites in frames 0 and 2 are somewhat easier to predict than those in frame 1 (results not shown). We have not yet tried incorporating these networks into Genie.

Finally, we also looked at the correlations between exon length and the score of the flanking splice sites. Figure 7 and Figure 8 show the distribution of exon length versus the prediction score for donor and acceptor sites. Careful analysis shows that the exon length correlates with the score and therefore with the strength of the consensus pattern. This result on our new gene data set from GenBank version 95 from June 1996 confirms an earlier hypothesis [25]. It is particularly helpful in gene finding that short exons tend to have higher scoring splice sites.
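The operating points discussed in this section (for example, the true positive rate at a tolerated 7% false positive rate) can be read off from sensor scores as in the following Python sketch. The function name and the toy scores are assumptions; no scores from the paper are reproduced here.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Best true positive rate with a threshold whose false positive rate is <= max_fpr.

    A site is predicted positive when its score is >= the threshold.
    Returns (true positive rate, threshold); the threshold is None when no
    threshold satisfies the false positive limit.
    """
    best_tpr, best_thr = 0.0, None
    # Lowering the threshold can only raise both rates, so scan from the top
    # and stop as soon as the false positive limit is exceeded.
    for thr in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
        if fpr > max_fpr:
            break
        best_tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
        best_thr = thr
    return best_tpr, best_thr

# Toy scores (assumed) for true sites and for false consensus sites.
toy_true = [0.9, 0.8, 0.7, 0.4]
toy_false = [0.75, 0.3, 0.2, 0.1]
loose = tpr_at_fpr(toy_true, toy_false, max_fpr=0.25)
strict_point = tpr_at_fpr(toy_true, toy_false, max_fpr=0.0)
```

The false negative rate at the chosen threshold is simply one minus the returned true positive rate.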
3.1 Cross-validation Experiments on Whole Gene Prediction
We did a set of experiments on whole gene prediction using the new splice site sensors. The data set used during training and testing was a collection of 288 annotated, multiple-exon human DNA sequences from the GenBank sequence database. The data set was randomly partitioned into seven test sets of uniform size to be used in cross-validation experiments. For each test set, the content sensors were trained on the remaining training data and predictions were recorded for the sequences in the test set. Additional tests were performed on a data set of 570 vertebrate genes. This data set was used by Burset and Guigo as a benchmark for the comparison of many different gene finders [8].
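A seven-way cross-validation split of this kind can be sketched in Python as follows. The shuffling seed, the function names, and the round-robin assignment are assumptions; since 288 is not divisible by 7, the parts can only be approximately uniform (six of 41 sequences and one of 42).

```python
import random

def partition(items, n_parts=7, seed=0):
    """Randomly split `items` into n_parts lists whose sizes differ by at most one."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment after shuffling keeps the part sizes balanced.
    return [shuffled[i::n_parts] for i in range(n_parts)]

def cross_validation_splits(items, n_parts=7, seed=0):
    """Yield (training set, test set) pairs, one per part."""
    parts = partition(items, n_parts, seed)
    for k, test in enumerate(parts):
        train = [x for j, part in enumerate(parts) if j != k for x in part]
        yield train, test

# Stand-in identifiers for the 288 sequences.
sequence_ids = list(range(288))
splits = list(cross_validation_splits(sequence_ids))
```

Each sequence appears in exactly one test set, so every prediction is made by sensors that never saw that sequence during training.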
4 Results
Table 1 shows the results of running Genie using the new splice site detector on all seven test sets and the average results over the entire data set. We also tested Genie against the Burset/Guigo data set; results on this set, comparing our gene finder with other gene finding systems, are shown in Table 2.
The data set is available at ftp://genome.lbl.gov/pub/genesets/.
Figure 7: Exon length (bp) versus donor site score (Z-score) from the neural network
Figure 8: Exon length (bp) versus acceptor site score (Z-score) from the neural network
Table 1: Prediction results on seven test sets using the old and the new splice site predictors. "Per base" statistics refer to the ability to predict whether a nucleotide is coding or noncoding. "Exact exon" statistics refer to the ability to predict a complete exon exactly. B/G indicates results for the Burset/Guigo data set.

               Per base            Exact exon
Data set       Sn    Sp    AC      Sn    Sp    Avg   ME    WE
Old
Part 0         0.92  0.87  0.88    0.75  0.73  0.74  0.05  0.17
Part 1         0.65  0.73  0.63    0.46  0.47  0.46  0.26  0.28
Part 2         0.64  0.82  0.68    0.53  0.57  0.55  0.19  0.21
Part 3         0.73  0.81  0.72    0.56  0.59  0.57  0.19  0.22
Part 4         0.76  0.83  0.75    0.63  0.63  0.63  0.13  0.13
Part 5         0.70  0.83  0.73    0.56  0.55  0.55  0.19  0.20
Part 6         0.78  0.77  0.76    0.66  0.65  0.65  0.21  0.24
Average        0.74  0.81  0.74    0.59  0.59  0.59  0.17  0.21
New
Part 0         0.75  0.82  0.74    0.52  0.56  0.54  0.19  0.23
Part 1         0.84  0.80  0.79    0.63  0.60  0.61  0.15  0.25
Part 2         0.81  0.81  0.78    0.68  0.67  0.67  0.11  0.20
Part 3         0.91  0.89  0.89    0.80  0.80  0.80  0.09  0.11
Part 4         0.78  0.82  0.75    0.63  0.63  0.63  0.18  0.16
Part 5         0.83  0.81  0.79    0.63  0.60  0.61  0.12  0.22
Part 6         0.83  0.75  0.77    0.68  0.63  0.66  0.15  0.28
Average        0.82  0.81  0.79    0.65  0.64  0.64  0.14  0.21
B/G data set
Old            0.76  0.77  0.72    0.55  0.48  0.51  0.17  0.33
New            0.78  0.84  0.77    0.61  0.64  0.62  0.15  0.16
Table 2: A comparison of Genie with other gene finding systems. Tests were run on a set of 570 annotated sequences from different organisms (Burset/Guigo data set).

               Per base            Exact exon
Gene finder    Sn    Sp    AC      Sn    Sp    Avg   ME    WE
Genie          0.78  0.84  0.77    0.61  0.64  0.62  0.15  0.16
FGENEH         0.77  0.85  0.78    0.61  0.61  0.61  0.15  0.11
GeneID         0.63  0.81  0.67    0.44  0.45  0.45  0.28  0.24
GeneParser2    0.66  0.79  0.66    0.35  0.39  0.37  0.29  0.17
GenLang        0.72  0.75  0.69    0.50  0.49  0.50  0.21  0.21
GRAILII        0.72  0.84  0.75    0.36  0.41  0.38  0.25  0.10
SORFIND        0.71  0.85  0.73    0.42  0.47  0.45  0.24  0.14
Xpound         0.61  0.82  0.68    0.15  0.17  0.16  0.32  0.13
Table 3: Prediction results on the entire data set and the Burset/Guigo data set (currently only 302 genes tested) using the old and the new splice site recognizers and homology matches. In this table, we also include, for comparison, predictive results of GeneID+ and GeneParser3 as reported in Burset and Guigo. Only sequences of length less than 8000 were tested in the latter data set to provide comparable results with the other gene finders.

               Per base            Exact exon
Data set       Sn    Sp    AC      Sn    Sp    Avg   ME    WE
Genie data set
Average Old    0.80  0.84  0.79    0.66  0.66  0.66  0.15  0.21
Average New    0.86  0.85  0.83    0.69  0.69  0.69  0.12  0.18
B/G data set
Genie (Old)    0.95  0.91  0.91    0.77  0.74  0.76  0.04  0.13
Genie (New)    0.85  0.91  0.85    0.69  0.72  0.70  0.11  0.09
GeneID+        0.91  0.90  0.88    0.73  0.70  0.71  0.07  0.13
GeneParser3    0.86  0.91  0.86    0.56  0.58  0.57  0.14  0.09
In addition to the results for the Genie version based only on statistical properties trained from existing genes, Table 3 shows the results using our new scheme to incorporate homology information from a database, as discussed in Section 2.4. In accordance with the testing scheme established by Burset and Guigo [8], we report sensitivity and specificity with respect to per-base prediction of coding/noncoding and with respect to exact prediction of exons. The per-base sensitivity is the fraction of true coding bases predicted as coding, and the specificity is the fraction of all predicted coding bases that were correct. Similarly, the exon sensitivity is the fraction of true exons predicted exactly, and the specificity is the fraction of predicted exons that were correct. In these tests, correct exon prediction requires identification of the exact position of splice sites. Fully or partially overlapping predictions are not accepted. The approximate coefficient (AC) is described by Burset and Guigo as a preferred alternative to the correlation coefficient and is defined by

$$AC = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$$

where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives. In addition, we also report the fraction of true exons that were not identified either exactly or overlapping (Missing Exons) and the fraction of predicted exons that did not overlap any true exon (Wrong Exons).
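The evaluation measures defined above can be written out as a short Python sketch. The function names, the dictionary keys, and the representation of exons as inclusive (start, end) pairs are assumptions made for this example; the sketch does not guard against empty denominators.

```python
def per_base_metrics(tp, fp, tn, fn):
    """Per-base sensitivity, specificity and approximate coefficient (AC)."""
    sn = tp / (tp + fn)            # fraction of true coding bases predicted as coding
    sp = tp / (tp + fp)            # fraction of predicted coding bases that are correct
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1.0
    return sn, sp, ac

def exon_metrics(true_exons, pred_exons):
    """Exact-exon statistics for exons given as inclusive (start, end) pairs."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)          # both boundaries must match

    def overlaps(a, others):
        return any(a[0] <= b[1] and b[0] <= a[1] for b in others)

    sn = exact / len(true_set)
    sp = exact / len(pred_set)
    return {
        "Sn": sn,
        "Sp": sp,
        "Avg": (sn + sp) / 2,
        # Missing exons: true exons with no exact or overlapping prediction.
        "ME": sum(not overlaps(t, pred_set) for t in true_set) / len(true_set),
        # Wrong exons: predicted exons that overlap no true exon.
        "WE": sum(not overlaps(p, true_set) for p in pred_set) / len(pred_set),
    }

toy_base = per_base_metrics(tp=40, fp=10, tn=40, fn=10)
toy_exon = exon_metrics(true_exons=[(1, 10), (20, 30), (40, 50)],
                        pred_exons=[(1, 10), (22, 30), (60, 70)])
```

In the toy exon example the prediction (22, 30) overlaps a true exon without matching it exactly, so it counts against sensitivity and specificity but is neither a missing nor a wrong exon.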
5 Discussion
The work presented here extends the work and results reported in Kulp et al. [19] by adding two novel neural networks for donor and acceptor splice site predictions. Other experiments exploring the properties of the new splice site detectors are also reported. Our approach was motivated by the work of Henderson et al. [27], showing strong correlations in neighboring nucleotides at the splice site. The addition of the new networks increased the overall prediction accuracy by approximately 7%, and caused the number of missed exons to drop significantly. Per base sensitivity increased about 10%, and specificity rose 1%. These results show the importance of correct splice site predictions for the overall gene finding process used by Genie. The new input encoding using dinucleotides, which allows the net to easily exploit correlations between neighboring bases, resulted in more sensitive splice site detectors. Another interesting observation was made: nonfunctional GT and AG sites close to real splice sites have significantly lower scores than GT and AG sites far removed from real splice junctions. This phenomenon, which is not well understood, improves the performance of our gene finding methods. We also studied the length distributions of exons versus the scores of the flanking splice sites and found that shorter exons have stronger splice site consensus signals than average length exons. The longest exons also have very strong splice sites.
The overall performance of the new Genie compares quite favorably with the other gene finders on the Burset and Guigo data set. It should also be noted that this data set includes sequences from all vertebrates, whereas Genie was trained only on human DNA. We have developed a WWW interface for Genie. Researchers can submit sequences to our server and receive predictions by email. The URL for Genie is http://www-hgc.lbl.gov/projects/genie.html. The splice site predictors are separately accessible at http://www-hgc.lbl.gov/projects/splice.html. In future work we plan to extend Genie so that it can reliably find multiple genes in a single DNA sequence. We also plan to improve the statistical model used in the intron state of Genie, as well as the model for intergenic DNA. This can be accomplished by incorporating sensors for promoters, the transcription start site, DNA repeat sequences, and the overall structure of 5' and 3' untranslated regions. We are also planning to extend Genie so that it can incorporate homology hits to cDNA databases when these are available.
6 Acknowledgments
We would especially like to thank Kevin Karplus for his contributions to this work, particularly for the splice site profiles he built and for his valuable discussion. We would also like to extend our gratitude to Gary Stormo, Nomi Harris, and Richard Hughey for their assistance in the development of Genie. This work was supported in part by DOE grants no. DE-FG03-95ER62112 and DE-AC03-76SF00098. M.G. Reese and D. Haussler acknowledge support of the Aspen Center for Physics, Biosequence Analysis Workshop.
References
1. J. W. Fickett and C.-S. Tung. Assessment of protein coding measures. Nucl. Acids Res., 20:6441–6450, 1992.
2. V. Solovyev, A. Salamov, and C. Lawrence. Predicting internal exons by oligonucleotide composition and discriminant analysis of splicable open reading frames. Nucl. Acids Res., 22:5156–5163, 1994.
3. S. Dong and D. B. Searls. Gene structure prediction by linguistic methods. Genomics, 162:705–708, 1994.
4. M. Borodovsky and J. McIninch. Genmark: Parallel gene recognition for both DNA strands. Computers and Chemistry, 17(2):123–133, 1993.
5. Y. Xu, J. R. Einstein, M. Shah, and E. C. Uberbacher. An improved system for exon recognition and gene modeling in human DNA sequences. In ISMB-94, Menlo Park, CA, 1994. AAAI/MIT Press.
6. Y. Xu and E. Uberbacher. Gene prediction by pattern recognition and homology search. In ISMB-96, St. Louis, June 1996. AAAI Press.
7. R. Guigo, S. Knudsen, N. Drake, and T. Smith. Prediction of gene structure. J. Mol. Biol., 226:141–157, 1992.
8. Moises Burset and Roderic Guigo. Evaluation of gene structure prediction programs. Genomics (to appear), 34(3):353–367, 1996. Data set and evaluation results can be found at http://www.imim.es/GeneIdentification/Evaluation/Index.html.
9. E. Snyder and G. Stormo. Identification of protein coding regions in genomic DNA. JMB, 248, 1995.
10. D. Kulp, D. Haussler, M. Reese, and F. Eeckman. Integrating database homology in a probabilistic gene structure model. Submitted to PSB-97, January 1997.
11. E. E. Snyder and G. D. Stormo. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res., 21:607–613, 1993.
12. A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501–1531, February 1994.
13. L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, January 1986.
14. I. E. Auger and C. E. Lawrence. Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol., 51:39–54, 1989.
15. D. Sankoff and J. B. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
16. M. S. Gelfand and M. A. Roytberg. Prediction of the exon-intron structure by a dynamic programming approach. BioSystems, 30:173–182, 1993.
17. E. E. Snyder and G. D. Stormo. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucleic Acids Res., 21:607–613, 1992.
18. G. D. Stormo and D. Haussler. Optimally parsing a sequence into different classes based on multiple types of information. In ISMB-94, Menlo Park, CA, August 1994. AAAI/MIT Press.
19. D. Kulp, D. Haussler, M. Reese, and F. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. In ISMB-96, St. Louis, June 1996. AAAI Press.
20. G. R. Fink. Cell, 49:355–367, 1987.
21. P. A. Sharp. Split genes and RNA splicing. Cell, 77:805–815, 1994.
22. S. Brunak, J. Engelbrecht, and S. Knudsen. Prediction of human mRNA donor and acceptor sites from the DNA sequence. JMB, 220:49–65, 1991.
23. V. V. Solovyev and C. B. Lawrence. Identification of human gene functional regions based on oligonucleotide composition. In Proceedings, 1st International Conference on Intelligent Systems for Molecular Biology, Menlo Park, 1993. AAAI Press.
24. E. Uberbacher and R. Mural. Locating protein coding regions in human DNA sequences by a multiple sensor - neural network approach. Proceedings of the National Academy of Sciences of the United States of America, 88:11261–11265, 1991.
25. S. M. Mount, X. Peng, and E. Meier. Some nasty facts to bear in mind when predicting splice sites. In Gene-Finding and Gene Structure Prediction Workshop, Philadelphia, PA, October 1995.
26. P. Senapathy, M. B. Shapiro, and N. L. Harris. Splice junctions, branch point sites and exons: sequence statistics, identification, and applications to genome project. Meth. Enzymol., 183:252–278, 1990.
27. J. Henderson, S. Salzberg, and K. Fasman. Finding genes in human DNA with a hidden Markov model. In Proceedings, 4th International Conference on Intelligent Systems for Molecular Biology, St. Louis, June 1996. AAAI Press.









