GlusterFS is a scalable network file system implemented in C. Because it is open source, its features can be extended [8]. GlusterFS runs in user space and uses FUSE to connect to the kernel's virtual file system (VFS) layer [9].
Features in GlusterFS can easily be added or removed [8]. GlusterFS has the following components:
• GlusterFS server storage pool – a pool of storage nodes that together form a single global namespace. Nodes can be added to or removed from the pool dynamically.
• GlusterFS storage client – clients can mount the storage from any Linux file system and access it over protocols such as NFS, CIFS, HTTP, and FTP.
• FUSE – a fully functional file system can be built with FUSE, including features such as simple
That somewhat defeats the purpose of a high-availability storage cluster. Administrators must also synchronize the system time across all bricks. The lack of accessible disk space was clearly not GlusterFS's fault, and it is probably not a common scenario either, but the system should at least emit an error message in such a case.
2.4. HDFS File System
The Hadoop Distributed File System (HDFS) is a scalable, portable file system written in Java for the Hadoop framework. HDFS provides shell commands and a Java application programming interface (API) [12]. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster; the map and reduce functions can then be executed on smaller subsets of the larger data set, which provides the scalability needed for big data processing [12]. A Hadoop cluster nominally has a single NameNode plus a cluster of DataNodes, although redundancy options are available for the NameNode because of its criticality. Each DataNode serves blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to communicate with the NameNode and DataNodes.
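The splitting and replication described above can be sketched in a few lines. This is an illustration of the idea only, not Hadoop code: the tiny `BLOCK_SIZE`, the node names, and the round-robin placement are simplifying assumptions (real HDFS uses a 128 MB default block size and rack-aware replica placement).

```python
# Sketch: divide data into fixed-size blocks and spread replicas across
# DataNodes, in the spirit of HDFS. All names here are illustrative.

BLOCK_SIZE = 4          # bytes, tiny for demonstration (HDFS default: 128 MB)
REPLICATION = 3         # HDFS's default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return [[nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)]

blocks = split_into_blocks(b"hello hdfs world!")   # 17 bytes -> 5 blocks
placement = place_replicas(len(blocks), DATANODES)
print([len(b) for b in blocks])   # block sizes
print(placement)                  # replica locations per block
```

Because each block is placed on several distinct nodes, map and reduce tasks can be scheduled near a local copy of their input, which is the locality property the paragraph above relies on.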
Fig 5. HDFS Architecture [19]
HDFS stores large files across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence in principle does not require redundant array of independent disks (RAID) storage on the individual hosts.
The consequences of keeping all information in a single location can be catastrophic to a company's operations and to its integrity as valued by its clients. Remotely storing vital information should be the first precaution taken when building an archive of fragile data. Network-attached storage (NAS), typically used in the form of cloud or RAID devices, provides a safe approach to storing company information. "Common uses are central file storage, media streaming, print serving and backup for all the local drives on your network. You can even access most NAS drives from the Internet if desired" (Becky Waring, "How to Buy Network-Attached Storage Drives"). If NAS devices are not sufficient, file servers can support up to 25 users simultaneously and meet the high usage demands that typically come with a large network.
Data storage plays a major role in improving a company's performance, and it can take place either offline or online, in various formats.
Cluster software can access data on disk in two ways: asymmetric clustering and parallel clustering.
When a file is written to HDFS, it is divided into fixed-size blocks. The client first contacts the NameNode, which returns the list of DataNodes on which the actual data can be stored. The data blocks are then distributed across the Hadoop cluster. Figure \ref{fig.clusternode} shows the architecture of a Hadoop cluster node used for both computation and storage. The MapReduce engine (running inside a Java virtual machine) executes the user application. When the application reads or writes data, requests are passed through the Hadoop \textit{org.apache.hadoop.fs.FileSystem} class, which provides a standard interface for distributed file systems, including the default HDFS. An HDFS client is then responsible for retrieving data from the distributed file system by contacting a DataNode that holds the desired block. In the common case, the DataNode is running on the same node, so no external network traffic is necessary. The DataNode, also running inside a Java virtual machine, accesses the data stored on local disk using normal file I/O.
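The write path just described can be modeled as a small simulation. This is a hypothetical sketch, not the real Hadoop API: the class and method names (`NameNode.allocate_pipeline`, `DataNode.write_block`) are invented for illustration, and round-robin placement stands in for HDFS's rack-aware policy. It shows the essential flow: the client asks the NameNode for a pipeline of DataNodes, then streams the block to the first DataNode, which forwards it down the pipeline.

```python
# Hypothetical model of the HDFS write path: NameNode hands out a replica
# pipeline; the client streams the block to its head, which forwards it on.

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}          # block id -> DataNodes holding it

    def allocate_pipeline(self, block_id):
        # Round-robin stand-in for HDFS's rack-aware replica placement.
        start = block_id % len(self.datanodes)
        pipeline = [self.datanodes[(start + i) % len(self.datanodes)]
                    for i in range(self.replication)]
        self.block_map[block_id] = pipeline
        return pipeline

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data, downstream):
        self.blocks[block_id] = data              # store a local replica
        if downstream:                            # forward along the pipeline
            downstream[0].write_block(block_id, data, downstream[1:])

datanodes = [DataNode(f"dn{i}") for i in range(4)]
namenode = NameNode(datanodes)

# Client writes one block: get a pipeline, then stream to its head.
pipeline = namenode.allocate_pipeline(block_id=0)
pipeline[0].write_block(0, b"block-0 bytes", pipeline[1:])
print([dn.name for dn in pipeline if 0 in dn.blocks])
```

The pipelined forwarding is why the client only sends each block once even though three replicas end up on disk.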
When a write operation is performed on an existing file, the data is appended to the end of the file. This record-append approach serializes data writes in GFS.
GFS: The Google File System is a distributed file system developed by Google to provide efficient, reliable access to data. It was designed and implemented to meet the requirements of Google's data processing workloads. The file system consists of hundreds of inexpensive commodity storage machines and is accessed by many different client machines. Google's search engine generates huge amounts of data that must be stored; a GFS cluster has 1,000 nodes with 300 TB of disk storage.
Partitioning strategy: data is hierarchically partitioned into a set of directories; the placement and replication properties of directories are
In RAID 3, the disks spin in synchronization and data is striped at the byte level with a dedicated parity disk, so sequential read and write operations achieve high performance.
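The parity idea behind RAID 3 can be demonstrated with XOR. This is a simplified model, not a disk driver: the dedicated parity disk stores the XOR of the data disks, so the contents of any single failed disk can be reconstructed from the surviving disks plus the parity.

```python
# Sketch of RAID-style parity: parity disk = XOR of all data disks,
# so any one lost disk is recoverable from the others.

def xor_bytes(chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data_disks = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity = xor_bytes(data_disks)                 # dedicated parity disk

# Simulate losing disk 1 and rebuilding it from the survivors + parity.
rebuilt = xor_bytes([data_disks[0], data_disks[2], parity])
print(rebuilt == data_disks[1])  # True
```

Because every write must update the parity disk, that disk is touched on each operation; synchronized spindles in RAID 3 keep this from becoming a seek bottleneck for sequential workloads.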
HDFS is Hadoop's distributed file system, providing high-throughput access to data, high availability, and fault tolerance. Data is saved in large blocks, making it suitable for applications
[8] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In ICDCS, pages 617–624, 2002.
ext3, or the third extended file system, is a journaling file system commonly used by the Linux kernel. It is the default file system for many Linux distributions.
This server will also act as a centralized location to store files, with 74 TB of storage and a variety of RAID options that can be configured to ensure that there is no data loss. The server has an 18-core processor with up to 768 GB of RAM, ensuring high
Abstract - The Hadoop Distributed File System, a Java-based file system, provides reliable and scalable storage for data. It is the key component for understanding how a Hadoop cluster can be scaled over hundreds or thousands of nodes. The large amounts of data in a Hadoop cluster are broken down into smaller blocks and distributed across small, inexpensive servers by HDFS. MapReduce functions are then executed on these smaller blocks of data, providing the scalability needed for big data processing. In this paper I discuss Hadoop in detail: the architecture of HDFS, how it functions, and its advantages.
Everyone is moving to the "cloud" for storage, but what exactly does that mean? Where is the cloud, and is it secure? What are the benefits of storing data in the cloud?