Study of HBase: A NoSQL Database
Abstract:
With the expansion of the internet, the data collected from social media, search patterns and online transactions has grown exponentially. Relational databases have difficulty handling this rapid growth, so for many large-scale workloads they have been replaced with NoSQL databases. NoSQL databases are distributed databases that can store and process large volumes of data.
This study focuses on HBase, a column-oriented NoSQL database. HBase is Apache’s open source database, modeled after Google’s BigTable technology. It is written in Java, exposes a Java API, and is built on top of the Hadoop Distributed File System (HDFS) to store and process large quantities of data while maintaining reliability and fault tolerance. Many large enterprises, including Facebook, Twitter and Yahoo, use it to store and process large quantities of data in an efficient and cost-effective manner.
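The column-oriented layout that HBase takes from BigTable can be illustrated with a minimal in-memory sketch. This is not the HBase client API: `ToyTable`, `put`, `get` and `scan` are hypothetical names chosen for illustration. The sketch assumes the usual description of the model, in which a cell is addressed by row key and `family:qualifier` column, several timestamped versions of a cell may coexist, and rows are ordered by key. Logical counters stand in for real timestamps.

```python
from collections import defaultdict
from itertools import count


class ToyTable:
    """Hypothetical in-memory stand-in for one HBase-style table (not the real API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> list of (timestamp, value) versions
        self.rows = defaultdict(lambda: defaultdict(list))
        # Logical clock used instead of wall-clock timestamps (assumption).
        self._clock = count(1)

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # A write never overwrites; it adds a newer version of the cell.
        self.rows[row_key][column].append((next(self._clock), value))

    def get(self, row_key, column):
        versions = self.rows.get(row_key, {}).get(column, [])
        # Return the value with the highest timestamp, or None if absent.
        return max(versions)[1] if versions else None

    def scan(self, start, stop):
        # Rows are ordered by key, so a range scan covers start <= key < stop.
        return [key for key in sorted(self.rows) if start <= key < stop]


table = ToyTable(families=["info", "stats"])
table.put("user#001", "info:name", "Ada")
table.put("user#001", "info:name", "Ada L.")  # second version of the same cell
table.put("user#002", "stats:logins", 7)

latest = table.get("user#001", "info:name")
keys = table.scan("user#001", "user#002")
```

Rows need not share the same columns: `user#002` has no `info:name` cell, and nothing is stored for it.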
Together, HBase and HDFS allow low-cost hardware to provide a reliable, fault-tolerant and highly scalable data management solution that meets the challenges of rapidly growing data effectively.
Keywords: Big Data, NoSQL database, Hadoop, MapReduce, HBase, HDFS
I. Introduction:
The data managed on the internet has been rising exponentially. Every day, companies collect user data in large volumes from various sources such as online transactions, search queries, mobile
Hadoop is an open source Java framework for running applications that manipulate large amounts of data in distributed environments. Hadoop is composed of the HDFS file system (Hadoop Distributed File System) and a parallel execution environment. Within the Hadoop framework, several subprojects can be found, such as the implementation of MapReduce, the distributed data management system HBase, and Pig, a data flow language and structure for parallel execution.
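The MapReduce programming model mentioned above can be sketched in a single process. The code below is not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the sketch only assumes the standard three steps of the model: map each input record to key-value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: collapse the values of one key into a single count.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["HBase runs on HDFS", "HDFS stores HBase files"])
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data across the network; here everything runs in one list comprehension to keep the structure visible.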
The problem with relational databases is that their performance degrades significantly when handling exponential growth of semi-structured or unstructured data [4]. NoSQL databases possess properties called BASE (Basically Available, Soft state, Eventually consistent) that make them much more scalable than relational databases. According to the CAP theorem [5], a database system cannot have high consistency, high availability,
Relational database technology dominated web applications for more than 30 years. This technology can handle only a limited database load. However, internet technologies and the advent of smartphones have made web applications accessible to many users from any location with internet connectivity. In addition, web data is currently dominated by social networking and social media applications, including Facebook, Twitter, YouTube, Instagram and others. Such web applications are prone to high load on the database layer. As a result, relational database technology could not handle the database load for these applications. Even scaling out the application servers will not solve the database load
Hadoop Distributed File System (HDFS): It is designed to run on inexpensive hardware, and the data stored in it is distributed across machines. It is highly fault tolerant and provides high-throughput access for applications that require big data.
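How distribution and fault tolerance fit together in HDFS can be shown with a small sketch. It assumes the commonly documented behavior that a file is split into fixed-size blocks and each block is copied to several data nodes. The placement rule used here is simple round-robin, which is an assumption made for illustration; real HDFS placement is rack-aware. The names `place_blocks` and `readable_after_failure`, and the node names `dn1` to `dn4`, are hypothetical.

```python
import math


def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]} using round-robin placement (assumption)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Each replica of a block goes to a different node.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed):
    # The file stays readable if every block keeps at least one live replica.
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


MB = 1024 * 1024
layout = place_blocks(300 * MB, 128 * MB, ["dn1", "dn2", "dn3", "dn4"])
survives_two = readable_after_failure(layout, failed={"dn1", "dn2"})
survives_none = readable_after_failure(layout, failed={"dn1", "dn2", "dn3", "dn4"})
```

With three replicas per block, losing any two of the four nodes still leaves every block readable, which is the sense in which inexpensive hardware can still give a fault-tolerant store.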
Big data is a growing reality. SQL databases do not have the capacity to handle very large amounts of data. Applications now work with vast volumes of data, and data is increasing massively in almost every field, whether employee details or health records. The applications used to manage this type of data must therefore be modified, and so must the databases and warehouses where the data is stored. SQL can store data in different tables and databases, but retrieving it later is very difficult because it involves many join operations and complex transactions. In this paper we therefore propose to build an application for hospital management that handles patient health records. Our application uses a NoSQL database (here, MongoDB) for storing and retrieving the data. We are using MongoLab in our application deployment. Each record and its associated data will be stored in a single document, which simplifies data access. Unlike in SQL databases, the stored documents are schema-free while remaining similar to each other; this is a big advantage of NoSQL and helps in modelling unstructured data. We also use the tokenization concept to ensure security. We convert user credentials such as name, password, phone number and email ID into ASCII values and store them in a separate MongoDB database. The patients’ medical history, lab reports, medicine prescriptions etc. will be stored in a separate Mongo
“NoSQL practitioners focus on physical data model design rather than the traditional conceptual / logical data model process” (Hsieh, 2014). The mindset of data modelers has changed in recent years. The flexibility and scalability of NoSQL databases, and their ability to handle a variety of data from structured to unstructured, have led data modelers to think in more business-centric terms.
In modern times, the amount of data being stored is extremely large. Companies must deal with this abundance of data on a daily basis, both storing and analyzing it as fast as they can. One such company is Google, which not only stores data but also analyzes data from each user of its products. The platform Google uses for this database management is called BigQuery, which runs in the cloud and provides real-time information. In this survey, the inner workings of BigQuery are outlined to show how the platform accomplishes the job it is designed for.
NoSQL databases were created to address the Big Data problem by using a distributed system to deliver excellent performance in data storage and retrieval at very large scale. At this scale, pieces of the system often fail, and NoSQL is designed to handle these failures (Chow, 2013) (Ron, Shulman-Peleg, & Bronshtein, 2015). Various companies have adopted different sorts of non-relational databases, ordinarily referred to as
The author says that conventional data storage systems (databases) work well with structured data but break down under heavy workloads. He describes various distributed file systems such as GFS (Google File System), HDFS (Hadoop Distributed File System) and Amazon S3 (Simple Storage Service). All of these file systems handle unstructured data and support fault tolerance through data replication. S3 in particular integrates well with other Amazon services and provides big data processing capabilities to consumers at an affordable cost in a pay-as-you-go fashion. For storing unstructured and semi-structured data, the author presents solutions used at various corporations. He gives the examples of BigTable, used by Google, and PNUTS, used by Yahoo. One that caught my eye is the one proposed by Facebook, which is a hybrid data management system. It is hybrid in the sense that it combines features of row-based and column-based database systems. Upon further research I found that this new system enhances the performance of both query processing and load balancing [2]. The author then moves on to describe the various available cloud vendors. All of these Infrastructure as a Service (IaaS) providers employ virtualization technologies to maximize
The paper provides background and related literature on Big Data and studies the evolution from relational databases to current NoSQL databases, which has been fueled by the growth of Big Data and the importance of managing it. It surveys Big Data challenges from the perspective of its characteristics (Volume, Variety and Velocity) and attempts to study how those challenges can be addressed.
NoSQL is able to address the massive traffic loads experienced by database servers at corporations that specialize in data processing, such as Google, Facebook and Amazon. NoSQL technologies can provide near-constant availability, massive user concurrency and very fast responses. There are four primary NoSQL database implementation types in use today: document-based, wide-column (or columnar), key-value and graph. The different properties of SQL and NoSQL databases will be examined, and an overview of each NoSQL implementation type will be given along with an example.
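To make the four implementation types concrete, the sketch below shapes the same small fact ("user 42 is named Ada, lives in London and follows user 7") in each style using plain Python structures. No real database is involved, and every key, field name and value is a hypothetical example chosen for illustration.

```python
import json

# Key-value: one opaque value looked up by a single key.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document: a nested, self-describing record, queryable by field.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}, "follows": [7]}

# Wide column: row key -> column family -> qualifier -> value.
wide_column = {
    "user#42": {
        "info": {"name": "Ada", "city": "London"},
        "graph": {"follows#7": ""},
    }
}

# Graph: nodes with properties plus explicit, typed edges between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    # Traverse outgoing FOLLOWS edges and return the target names.
    return [
        nodes[dst]["name"]
        for src, relation, dst in edges
        if src == user_id and relation == "FOLLOWS"
    ]


city_from_kv = json.loads(key_value["user:42"])["city"]
city_from_doc = document["address"]["city"]
city_from_wide = wide_column["user#42"]["info"]["city"]
followed = followed_by(42)
```

The same question ("which city?") needs a full decode in the key-value form, a field path in the document form and a family plus qualifier in the wide-column form, while the relationship question is the one the graph form answers most directly.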
Another much-discussed database is Cassandra. It was originally developed by Facebook and open-sourced in 2008 [6]. Facebook was among the first to use the system, for its inbox search, and because the system performed well within its service level agreement requirements, other companies such as Netflix and Twitter embraced Cassandra as a storage engine and as a backend for their streaming services [9]. What is Cassandra? By most definitions, Cassandra is an open source, highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers while providing high availability with no single point of failure. Its main goals are high performance, robust clusters spanning several data centers, and low-latency operation for its clients, which is why businesses favor it. It is written in Java. Research on NoSQL systems concluded that the scalability of Cassandra surpasses that of the other database management systems at the largest numbers of nodes. It is designed as a distributed system that supports replication, including replication across data centers, and the replacement of failed nodes without downtime [2]. Cassandra integrates with other open source projects such as Hadoop and Apache Pig. It is similar to a relational database since
Recent advancements in internet communication and parallel computing have drawn the attention of a large number of commercial organizations and industries to adapting to recent changes in storage and retrieval methods. These include new data retrieval and mining schemas that enable firms to give their clients ample space for job processing and for storing personal data. Although new storage innovations allow user data to reach the petabyte scale, storage schemas are still under research to keep pace with this growth. One research outcome that has gained high popularity and become the need of the hour is Hadoop. Hadoop is developed by Apache based on the papers of
This report describes the process of designing NoSQL systems for data persistence, implementing the designs, and solving the tasks we were given. The dataset we worked with is a music dataset from Last.fm, and the designs for MongoDB, HBase and Neo4j are based on the dataset features and the given queries. The implementation includes creating the databases, setting up the schemas and running queries, followed by performance testing. Each system also went through design iterations to achieve higher performance.
NoSQL databases are used in social media applications and big data processing portals that handle huge, heterogeneous and unstructured data. They are used for faster access to records from large datasets at the back end. The AADHAAR card implementation in India was done using NoSQL databases because a huge amount of information is associated with each record, including text data, images, thumb impressions and iris detection data. A classical database system cannot handle datasets of such different types (image, text, video, audio, thumb impressions for pattern recognition, iris samples) simultaneously.