Splice site detection using principal component analysis and case-based reasoning with support vector machine
Srabanti Maji*1 and Haripada Bhunia2
1 Computer Science Department
Sri Guru Harkrishan College of Management and Technology, Raipur, Bahadurgarh;
District: Patiala, Punjab, India
2 Department of Chemical Engineering
Thapar University, Patiala-147004, India
*Address Correspondence to this author at
Dr. Srabanti Maji
Computer Science Department,
Sri Guru Harkrishan College of Management and Technology, Raipur, Bahadurgarh;
District: Patiala, Punjab, India
E-mail address: srabantiindia@gmail.com, srabanti9@gmail.com
Tel: +91-9356006454
ABSTRACT
Identification of coding regions from a genomic DNA sequence is the foremost step
…
feature selection; and the final stage, in which a support vector machine (SVM) with a polynomial kernel is used for the final classification. In comparison with other methods, the proposed SpliceCombo model outperforms existing prediction models, achieving 97.25% sensitivity and 97.46% specificity for donor splice site prediction and 96.51% sensitivity and 94.48% specificity for acceptor splice site prediction.
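As an illustration of the staged design described above, the following minimal sketch chains PCA-based feature extraction into a polynomial-kernel SVM on synthetic one-hot encoded splice-site windows. It is not the authors' implementation: the window length, component count and kernel degree are assumptions for demonstration, and the intermediate case-based reasoning stage is omitted.

```python
# Illustrative sketch only (not the SpliceCombo implementation):
# PCA feature extraction followed by a polynomial-kernel SVM classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, window_len = 500, 40                 # 40-nt window around a candidate site (assumed)
X = rng.integers(0, 2, size=(n_samples, window_len * 4)).astype(float)  # one-hot A/C/G/T
y = rng.integers(0, 2, size=n_samples)          # 1 = true splice site, 0 = decoy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(
    PCA(n_components=20),                       # stage 1: dimensionality reduction
    SVC(kernel="poly", degree=3, C=1.0),        # final stage: polynomial-kernel SVM
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```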
Keywords: Gene Identification; Splice Site; Principal Component Analysis (PCA); Case-Based Reasoning (CBR); Support Vector Machine (SVM)
1. INTRODUCTION
Advances in genome sequencing technology have been generating an enormous amount of genomic sequence data, and a main objective of analysing these data is gene identification. In eukaryotes, prediction of coding regions depends on recognition of exon–intron structures, which is very challenging because of their structural complexity and great length. Analyses of the human genome indicate nearly 20,000–25,000 protein-coding genes [1], although earlier estimates placed the number close to 100,000, which suggests that a large number of genes may still be unidentified [2,3]. Most of the computational techniques
The vital components and techniques of gene cloning are as follows. The DNA sequence that contains the desired gene (EZH2) is amplified by the polymerase chain reaction (PCR). PCR, established by Kary Mullis in 1985, is widely used to amplify target DNA sequences (here EZH2) a billion-fold within several hours using a thermophilic polymerase (Taq), primers and other cofactors (Sambrook and Russell, 2001). Three crucial steps are involved: denaturation (at 95 °C), annealing of the forward and reverse primers (55–65 °C) and, lastly, primer extension (at 72 °C). After amplification, the desired sequence is integrated into a circular vector (pBluescript), forming the recombinant molecule. To make the insert and vector compatible, both were digested with EcoRI so that the same cohesive ends are generated in both, making them easier to ligate. EcoRI is a restriction enzyme of the type II endonuclease class that cuts within dsDNA at its recognition site "GAATTC" (Clark, 2010; Sambrook and Russell, 2001).
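The three-step cycle described above can be written out as a simple thermocycler programme, as sketched below. Only the temperatures come from the text; the hold times and the cycle count of 30 are assumptions for illustration.

```python
# Illustrative sketch only: the denaturation/annealing/extension cycle as a
# thermocycler programme. Hold times and cycle count are assumed values.
PCR_PROGRAMME = {
    "denaturation": {"temp_c": "95", "seconds": 30},
    "annealing":    {"temp_c": "55-65", "seconds": 30},   # primer-dependent
    "extension":    {"temp_c": "72", "seconds": 60},
    "cycles": 30,
}

def run_programme(prog: dict) -> None:
    """Print the steps a thermocycler would execute for this programme."""
    for cycle in range(1, prog["cycles"] + 1):
        for step in ("denaturation", "annealing", "extension"):
            p = prog[step]
            print(f"cycle {cycle:02d}: {step:<12} {p['temp_c']} degC for {p['seconds']} s")

if __name__ == "__main__":
    run_programme(PCR_PROGRAMME)
```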
Observe that the structure in (1) can be interpreted as a cluster-wise low-dimensional representation of each instance; accordingly, it can be readily adapted to our requirement. Specifically, we solve (3) for the entire dataset and use the resulting structure W0 as the new representation, feeding it into the data partition module, with the subscript "0" denoting the whole dataset. Note that we favour W0 over conventional label transformation strategies such as CPLST [32] for the following reasons: 1) the proposed procedure does not rely on the actual label matrix Y, as the training of CPLST does, and 2) correlations between instances can be explicitly introduced, which is appropriate for data partitioning. Our approach makes no particular assumptions about the choice of partition algorithm, so various procedures can be considered, including k-means clustering, locality-sensitive hashing (LSH), and adaptive methods such as Affinity Propagation clustering or ISODATA if sufficient prior knowledge is available. In our implementation, we use k-means clustering for its simplicity and
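A minimal sketch of the k-means partitioning step mentioned above is given below; it is not the paper's code, and the placeholder feature matrix and cluster count are assumptions for illustration.

```python
# Minimal sketch: partition instances with k-means and keep the cluster
# assignments, standing in for the data partition module described above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))            # placeholder low-dimensional representation

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
partition = kmeans.labels_                 # cluster index assigned to each instance
print("instances per cluster:", np.bincount(partition))
```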
The goal of feature extraction and selection is to reduce the dimensionality of the data. In this experiment the dimensions of the AVIRIS and HYDICE images were reduced to 20 from 220 and 191, respectively, using PCA. The PCA analysis shows that the image of principal component 1 is the brightest and sharpest of the PCA images, as illustrated in Figure 2.
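The sketch below shows this kind of PCA-based band reduction on a hyperspectral cube; it is not the original experiment, and the image dimensions and random data are placeholders where the real AVIRIS scene would be loaded.

```python
# Minimal sketch: reduce a 220-band hyperspectral cube to 20 principal components.
import numpy as np
from sklearn.decomposition import PCA

rows, cols, bands = 145, 145, 220          # AVIRIS-like cube (placeholder values)
cube = np.random.default_rng(0).normal(size=(rows, cols, bands))

pixels = cube.reshape(-1, bands)           # one spectrum per pixel
pca = PCA(n_components=20)
reduced = pca.fit_transform(pixels)        # shape: (rows*cols, 20)

pc1_image = reduced[:, 0].reshape(rows, cols)   # first principal-component image
print("explained variance of PC1: %.3f" % pca.explained_variance_ratio_[0])
```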
(PCR), which isolates small fragments of DNA that have a high degree of variability from
• Use these data to construct a map of these three genes in two steps.
Finally, it was found that a total of 62.1% to about 74.7% of the human genome was covered by either processed or primary transcripts.
Two different DNA sequencing techniques were used in this study. Sanger sequencing is a form of DNA sequencing in which the target DNA is copied many times, producing fragments of varying lengths. The ends of the fragments are marked with fluorescent chain-terminator nucleotides that indicate where each fragment ends. The fragments are then aligned according to the overlapping segments they share, which allows larger regions of DNA to be sequenced via capillary gel electrophoresis (Khan Academy, n.d.). This method was used to sequence the human genome, but it has since become one of the more expensive and less efficient ways to sequence.
Specific methodologies are implemented in DNA fingerprinting to obtain reliable and valid results. The main steps are isolation, cutting, sorting and analysis of the DNA.
Sophisticated software compared these parts against existing proteins of the human genome to determine the actual proteins in the samples. They found that the Maiden's profile of
Agarose gel for the detection of genomic DNA was prepared by adding 1 g of agarose to 100 ml of 1X TBE buffer and dissolving it by heating to boiling. The agarose was then left to cool to 55 °C before being poured into a casting plate to solidify. A comb was placed near one edge of the gel, and the gel was left to set. 1X TBE was poured into the gel tank and the gel plate was placed horizontally in the electrophoresis tank. The DNA samples were prepared by mixing 5 µl of DNA sample with 1 µl of loading buffer, and the samples were then added carefully to individual wells. The power was turned on at 45 V for 15 minutes and then 85 V for 1 hour (or at 5–8 V/cm) to run the DNA. Agarose gels were stained with ethidium bromide by immersing
The TARGet website allowed us to determine the flanking sequence and DNA sequence of the homologue, which we then transferred into Benchling. The Benchling website helped us determine the introns and exons present in our DNA sequence, as presented in Figure 3. Using Benchling, we also designed primers that cover as many exons and introns as possible. Once we had determined the forward and reverse primers to be used, we ordered them and, when they were received, added resuspension buffer according to the accompanying instructions. We then diluted the forward and reverse primers, created another PCR table as shown in Figure 4, prepared another master mix and ran electrophoresis on the products. Our results can be seen in Figure 5. We ran the same dilutions a second time to clarify the results of the first PCR reaction. Those results can be seen in Figure
The data are divided into training sets and test sets, and the training data are used to build a classification model that assigns each test sample to one category or the other. The SVM algorithm has been widely applied in the biological
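The sketch below shows this train/test protocol with an SVM, reporting sensitivity and specificity from the confusion matrix (the metrics quoted in the abstract). The data are synthetic placeholders, not the datasets used in this work.

```python
# Minimal sketch: train/test split, SVM classification, and sensitivity/
# specificity computed from the resulting confusion matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
y_pred = SVC(kernel="poly").fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```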
The Encyclopedia of DNA Elements (ENCODE) is a project designed to compare and contrast the repertoire of RNAs produced by human cells and to cross-verify the findings with other methods such as NGS. Five years after the start of the ENCODE project, just 1% of the human genome had been examined, and what was achieved was largely a confirmation of the results of previous studies.
Raw data were processed and normalized by the Robust Multichip Averaging (RMA) method (Irizarry et al., 2003) using the affy package of R (v. 3.1.3) (Team, 2004). The linear models for microarray data package, limma (Smyth, 2004), was used to classify chips into groups, and the Benjamini–Hochberg method (Benjamini and Hochberg, 1995) was used to correct for multiple testing. |logFC| > 2 and P-value < 0.05 were used as cutoffs to identify genes that are differentially expressed in
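The cutoff step can be written as a simple filter, as sketched below on a placeholder results table; the column names are assumptions, and the actual analysis described above was carried out in R with the affy and limma packages.

```python
# Minimal sketch of the differential-expression cutoff: keep genes with
# |logFC| > 2 and P-value < 0.05 from a placeholder results table.
import pandas as pd

results = pd.DataFrame({
    "gene":   ["G1", "G2", "G3", "G4"],
    "logFC":  [2.5, -3.1, 0.4, 2.2],
    "pvalue": [0.01, 0.002, 0.2, 0.08],
})

deg = results[(results["logFC"].abs() > 2) & (results["pvalue"] < 0.05)]
print(deg)   # genes passing both cutoffs
```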
A small-scale automobile parts manufacturing company produces a large number of products. A multi-product manufacturing facility has a wide-ranging process that involves a large number of variables, such as quality characteristics. Every product has different quality characteristics, which are measured on the manufacturing line. When multiple products are manufactured on a single processing line in a multi-product facility, individual process-monitoring strategies are used to monitor the process for each part. The challenge, therefore, is to develop statistical process control charts that detect faults using these quality characteristics.