Hadoop is an open source Apache software framework written in Java. It runs on a cluster of commodity machines and provides both distributed storage and distributed processing of very large data sets. It is capable of processing data ranging in size from gigabytes to petabytes.
Architecture:
Hadoop follows a master/slave architecture.
The master is the NameNode and the slaves are the DataNodes. The NameNode maintains and manages the blocks stored on the DataNodes and is responsible for handling client requests for data. Hadoop splits each file into one or more blocks, and these blocks are stored on the DataNodes. Each block is replicated to provide the fault tolerance characteristic of Hadoop; the default replication factor is 3 and is configurable, as sketched below.
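As a minimal sketch of how the replication factor can be configured from a client, the snippet below sets the property programmatically and also changes the replication of an existing file through the standard FileSystem API. The file path used here is hypothetical.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: overriding the default replication factor of 3.
// The path "/data/example.txt" is hypothetical.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file after creation
        fs.setReplication(new Path("/data/example.txt"), (short) 2);
        fs.close();
    }
}
\end{verbatim}

The same property can of course be set cluster-wide in hdfs-site.xml instead of per client.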
Components:
1. Hadoop Distributed File System (HDFS): It is designed to run on inexpensive commodity hardware and stores data in a distributed manner. It is highly fault tolerant and provides high-throughput access for applications that work with big data.
2. Namenode: It manages the file system namespace and stores the metadata of the data blocks. The metadata is stored in the form of a namespace image and edit logs on the local disk. The NameNode knows the location of each data block on the DataNodes (see the sketch after this list).
3. Secondary Namenode: It is responsible for merging the namespace image and the edit log to create checkpoints. This helps the NameNode free up the space taken by edit log entries recorded before the latest checkpoint.
4. DataNode: Stores the actual blocks of data and serves read and write requests from clients. It also reports the blocks it holds to the NameNode through periodic heartbeats and block reports.
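The sketch below illustrates the division of roles just described: a client asks the NameNode (through the FileSystem abstraction) which DataNodes hold the replicas of each block of a file. The input path is hypothetical.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: querying block locations for a (hypothetical) file.
public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the hosts (DataNodes) storing its replicas
            System.out.println(String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
\end{verbatim}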
When a file is written to HDFS, it is divided into fixed-size blocks. The client first contacts the NameNode, which returns the list of DataNodes where the blocks can be stored. The data blocks are distributed across the Hadoop cluster. Figure \ref{fig.clusternode} shows the architecture of a Hadoop cluster node used for both computation and storage. The MapReduce engine (running inside a Java virtual machine) executes the user application. When the application reads or writes data, requests are passed through the Hadoop \textit{org.apache.hadoop.fs.FileSystem} class, which provides a standard interface for distributed file systems, including the default HDFS. An HDFS client is then responsible for retrieving data from the distributed file system by contacting a DataNode holding the desired block. In the common case, the DataNode is running on the same node, so no external network traffic is necessary. The DataNode, also running inside a Java virtual machine, accesses the data stored on local disk using normal file I/O.
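A minimal sketch of this write path is shown below: the client code only touches the \textit{org.apache.hadoop.fs.FileSystem} abstraction, while block placement and replication are handled by the NameNode and DataNodes behind the scenes. The output path is hypothetical.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

// Sketch: writing a small file to HDFS through the FileSystem API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
\end{verbatim}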
HDFS relies on the NameNode to maintain data consistency. The NameNode uses a transactional log file (the edit log) to record all changes made to the file system namespace.
This paper proposes a backup-task mechanism to mitigate straggler tasks, i.e., the final set of MapReduce tasks that take unusually long to complete. The simplified programming model proposed in the paper opened up the field of parallel computation to general-purpose programmers. The paper served as the foundation for the open source distributed computing software Hadoop; it also tackles various common error scenarios encountered in a compute cluster and provides fault tolerance at the framework level.
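In Hadoop this backup-task idea survives as "speculative execution". The sketch below shows how it can be toggled per job, assuming the Hadoop 2.x property names; everything else about the job is elided.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: enabling speculative (backup) execution so that slow straggler
// tasks are re-launched on other nodes. Property names assume Hadoop 2.x+.
public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);    // back up slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", true); // back up slow reduce tasks
        Job job = Job.getInstance(conf, "speculative-demo");
        // ... set mapper, reducer, input and output as usual before submitting
    }
}
\end{verbatim}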
There are two ways in which a cluster programming framework can manage access to the data on disk.
In a distributed database system, the databases and storage devices are not all attached to a single common processing unit such as one CPU. The data may be stored on multiple computers located at the same physical site, or it may be spread over a network of interconnected machines at different locations. In other words, a distributed database system consists of multiple data locations that do not share any physical components.
Hadoop implements the MapReduce parallel programming model. In a Hadoop cluster there are two kinds of nodes: the master node and the slave nodes. The master node runs the NameNode, DataNode, JobTracker, and TaskTracker processes, while each slave node runs the DataNode and TaskTracker processes. The NameNode manages the partitioning of the input data set into blocks and decides on which nodes they are stored. Finally, there are two core components of Hadoop: the HDFS layer and the MapReduce layer. The MapReduce layer reads from and writes to HDFS storage and processes the data in parallel, as in the word-count sketch below.
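The following is the classic word-count example written against the org.apache.hadoop.mapreduce API; it shows the map and reduce functions that the MapReduce layer runs in parallel over blocks stored in HDFS (a driver that submits this job appears later in the text).

\begin{verbatim}
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: the mapper emits (word, 1) pairs, the reducer sums the counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }
}
\end{verbatim}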
Hadoop employs the MapReduce paradigm of computing, which targets batch-job processing. It does not directly support real-time query execution, i.e., OLTP. Hadoop can be integrated with Apache Hive, whose HiveQL query language supports query firing, but this still does not provide OLTP operations (such as row-level updates and deletions) and has a long response time (on the order of minutes) due to the absence of pipelined execution, as illustrated by the sketch below.
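As a sketch of this batch-style query firing, the snippet below submits a HiveQL query from Java over the Hive JDBC driver; the HiveServer2 host, port, and the "logs" table are hypothetical, and the result arrives only after the underlying batch job finishes rather than with OLTP-style latency.

\begin{verbatim}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: firing a HiveQL query through JDBC (hypothetical host and table).
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
\end{verbatim}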
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest single cluster comprising several thousand servers.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. It makes use of commodity hardware, and it is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. Hadoop is the most widely used big data processing engine, with a simple master/slave setup. In most companies, big data is processed with Hadoop by submitting jobs to the master. The master distributes each job across its cluster and processes the map and reduce tasks sequentially. Nowadays, however, growing data needs and competition between service providers lead to an increasing number of jobs being submitted to the master. This concurrent job submission on Hadoop makes scheduling on the Hadoop cluster necessary so that the response time remains acceptable for each job. A typical job submission from the client side is sketched below.
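The driver below submits the word-count job sketched earlier to the master and waits for the scheduler to complete it. The input and output paths are hypothetical.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the word-count job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        // Submits the job and blocks until the scheduler finishes it
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
\end{verbatim}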
Larger blocks offer several advantages over smaller blocks. They reduce the number of interactions with the NameNode and also reduce the size of the metadata that needs to be stored on the NameNode. They reduce extra network overhead by keeping a persistent TCP connection to the DataNode. In Figure \ref{fig.executionTime}, the execution time for writing and deleting larger blocks is almost the same as for smaller blocks. HDFS blocks are large in comparison to disk blocks in order to minimize the cost of seeks. Thus, in our case, all interaction for deletion is done through a checkerNode. The checkerNode sends the delete command, and the rest is handled by the overwrite daemon on each node. Since the overwrite daemon reads the metadata from the inodes, it does not need to make a connection with the NameNode or the checkerNode. This reduces the extra network overhead and boosts the performance of deletion.
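As a small sketch of how a larger block size can be requested for a particular file (independent of the checkerNode and overwrite daemon, which are specific to the design above), the snippet below uses a standard FileSystem.create overload; the 256 MB value and the path are illustrative only.

\begin{verbatim}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: creating a file with a larger-than-default HDFS block size so
// fewer blocks (and fewer NameNode interactions) are needed.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 256L * 1024 * 1024; // 256 MB instead of the common 128 MB default
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),
                true,                                   // overwrite if it exists
                conf.getInt("io.file.buffer.size", 4096),
                (short) 3,                              // replication factor
                blockSize)) {
            out.write(new byte[]{0});
        }
        fs.close();
    }
}
\end{verbatim}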
In an attempt to manage their data correctly, organizations are realizing the importance of Hadoop for the expansion and growth of their business. According to a study by Gartner, an organization loses approximately 8.2 million USD annually through poor data quality. This happens even though 99 percent of organizations have their data strategies in place. The reason behind this is simple: organizations are unable to trace the bad data that exists within their data sets. This is one problem that can be easily addressed by adopting Hadoop testing methods, which allow you to validate all of your data at increased testing speeds and boost your data coverage, resulting in better data quality.
Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can also identify and implement more efficient ways of doing business.
Owing to the attractiveness of Hadoop, a number of companies began to bundle Hadoop and some related technologies into their own Hadoop distributions. Cloudera was the first vendor to offer Hadoop as a package and continues to be a leader in the industry. Its Cloudera CDH distribution, which contains all the open source components, is the most popular Hadoop distribution.
Over the years it has become essential to process large amounts of data with high precision and speed. Large amounts of data that can no longer be processed using traditional systems are called Big Data. Hadoop, a Linux-based framework of tools, addresses three main problems faced when processing Big Data that traditional systems cannot handle. The first problem is the speed of the data flow, the second is the size of the data, and the last is the format of the data. Hadoop divides the data and computation into smaller pieces, sends them to different computers, then gathers and combines the results and returns them to the application. This is done using MapReduce and HDFS, the Hadoop Distributed File System. The DataNode and NameNode parts of the architecture fall under HDFS.