References:
* [Apache Hadoop](https://hadoop.apache.org/docs/stable/index.html)
Abbreviations:
* **HDFS**: Hadoop Distributed File System
* **HDFS HA**: HDFS High Availability
* **QJM**: Quorum Journal Manager
* **ZKFC**: ZKFailoverController
* **ViewFS**: View File System
* **YARN**: Yet Another Resource Negotiator
## About Hadoop Cluster
A Hadoop Cluster consists of two clusters: an HDFS cluster and a YARN cluster, which are logically independent but usually physically together.
![[hadoop_hdsf_cluster_and_yarn_cluster.png]]
Typically, one machine in the cluster is designated exclusively as the `NameNode` and another machine exclusively as the `ResourceManager`. These are the masters.
Other services (such as the `Web App Proxy` and the `MapReduce Job History Server`) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both `DataNode` and `NodeManager`. These are the workers.
## HDFS
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS has a master/slave architecture. An HDFS cluster consists of **a single NameNode**, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are **a number of DataNodes**, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
![[hadoop_hdfs_architecture.png]]
Viewed another way, HDFS has two main layers:
- **Namespace**
- Consists of directories, files and blocks.
- It supports all the namespace related file system operations such as create, delete, modify and list files and directories.
- **Block Storage Service**, which has two parts:
- Block Management (performed in the Namenode)
- Provides Datanode cluster membership by handling registrations, and periodic heart beats.
- Processes block reports and maintains location of blocks.
- Supports block related operations such as create, delete, modify and get block location.
- Manages replica placement, block replication for under replicated blocks, and deletes blocks that are over replicated.
- Storage - is provided by Datanodes by storing blocks on the local file system and allowing read/write access.
![[hadoop_hdfs_two_layers.png]]
### FSImage and EditLog
The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FSImage, which lives in the NameNode's local file system.
The changes that occur to the file system metadata are recorded in a transaction log called the EditLog, which is also stored as a file in the NameNode's local file system.
When a NameNode starts up, it reads HDFS state from the fsimage, and then applies edits from the editlog. It then writes new HDFS state to the fsimage and starts normal operation with an empty editlog.
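As a rough mental model only (this is not Hadoop's actual implementation, and every class and method name below is hypothetical), the startup sequence described above can be sketched in Java as:

```java
import java.util.List;

// Conceptual sketch of the NameNode startup sequence described above.
// None of these types exist in Hadoop; they only illustrate the flow.
class NamespaceImageSketch {

    InMemoryNamespace loadFsImage(String fsImagePath) { /* deserialize checkpoint */ return new InMemoryNamespace(); }
    List<EditOp> readEditLog(String editLogPath)      { /* read logged transactions */ return List.of(); }
    void saveFsImage(InMemoryNamespace ns, String fsImagePath) { /* write new checkpoint */ }
    void truncateEditLog(String editLogPath)                   { /* start with an empty log */ }

    InMemoryNamespace startUp(String fsImagePath, String editLogPath) {
        InMemoryNamespace ns = loadFsImage(fsImagePath);   // 1. read HDFS state from the fsimage
        for (EditOp op : readEditLog(editLogPath)) {
            op.applyTo(ns);                                // 2. apply edits from the editlog
        }
        saveFsImage(ns, fsImagePath);                      // 3. write the new state back to the fsimage
        truncateEditLog(editLogPath);                      // 4. resume normal operation with an empty editlog
        return ns;
    }

    static class InMemoryNamespace { /* directories, files, block mapping */ }
    interface EditOp { void applyTo(InMemoryNamespace ns); }
}
```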
## HDFS HA
The prior HDFS architecture allows only a single NameNode, which is a single point of failure (SPOF) in an HDFS cluster.
The HDFS HA feature addresses the above by providing the option of **running redundant NameNodes in the same cluster** in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
### Using the Quorum Journal Manager
In a typical HA cluster, all of the NameNodes communicate with a group of separate daemons called JournalNodes (JNs).
When any namespace modification is performed by the Active NameNode, it durably logs a record of the modification **==to a majority of these JNs==**. The Standby NameNode(s) is capable of reading the edits from the JNs, and is constantly watching them for changes to the editlog. As the standby node sees the edits, it applies them to its own namespace.
In the event of a failover, the standby will ensure that it has read all of the edits from the JNs before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
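For illustration, the core QJM-related settings (normally placed in `hdfs-site.xml`) can be expressed through Hadoop's `Configuration` API. The nameservice ID, NameNode IDs, and hostnames below are placeholders, so treat this as a sketch rather than a complete HA setup:

```java
import org.apache.hadoop.conf.Configuration;

public class QjmHaConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Logical nameservice and its NameNode IDs (names are placeholders).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // The Active NameNode durably logs edits to this quorum of JournalNodes.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
        // Clients use this provider to locate the currently active NameNode.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```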
### Using the Conventional Shared Storage
In a typical HA cluster, all of the NameNodes have access to a directory on a shared storage device (e.g. an NFS mount from a NAS).
When any namespace modification is performed by the Active NameNode, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby NameNode(s) is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace.
In the event of a failover, the standby will ensure that it has read all of the edits from the shared storage before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
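A minimal sketch of the one setting that differs from the QJM case: `dfs.namenode.shared.edits.dir` points at a locally mounted shared directory instead of a `qjournal://` URI (the NFS mount path below is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;

public class NfsHaConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Shared edits live on an NFS directory mounted on every NameNode machine;
        // the path is a placeholder. The rest of the HA settings mirror the QJM sketch.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer1/dfs/ha-name-dir-shared");
        return conf;
    }
}
```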
### Automatic Failover using Zookeeper
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
- **Failure detection** - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode(s) that a failover should be triggered.
- **Active NameNode election** - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
The ZKFC is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
- **Health monitoring** - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
- **ZooKeeper session management** - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.
- **ZooKeeper-based election** - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby NameNode.
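The lock-znode mechanism itself is easy to illustrate with the plain ZooKeeper Java client. The sketch below is not the actual ZKFC code; the znode path, connect string, and node ID are made up. It only shows how an ephemeral znode gives exactly one contender the "active" role and releases it automatically when that contender's session expires:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LockZnodeElectionSketch {
    // Placeholder path; not ZKFC's real znode layout.
    private static final String LOCK_ZNODE = "/example-ha/active-lock";

    public static boolean tryBecomeActive(ZooKeeper zk, String myId) throws Exception {
        try {
            // Ephemeral znode: deleted automatically when this session expires,
            // which is what allows another node to win the next election.
            zk.create(LOCK_ZNODE, myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;   // "won the election" -> transition the local NameNode to active
        } catch (KeeperException.NodeExistsException e) {
            return false;  // some other node currently holds the lock
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> { });
        System.out.println("became active: " + tryBecomeActive(zk, "nn1"));
    }
}
```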
## HDFS Observer NameNode
The prior HDFS architecture allows a single Active NameNode and one or more Standby NameNode(s). One drawback of this architecture is that the Active NameNode can become a single bottleneck and be overloaded with client requests, especially in a busy cluster.
The Consistent Reads from HDFS Observer NameNode feature addresses the above by introducing a new type of NameNode called **Observer NameNode**. Similar to Standby NameNode, Observer NameNode keeps itself up-to-date regarding the namespace and block location information. In addition, it also has the ability to serve consistent reads, like Active NameNode.
Since read requests are the majority in a typical environment, this helps balance NameNode traffic and improves overall throughput.
In the new architecture, an HA cluster can consist of NameNodes in three different states: active, standby, and observer. ==State transitions can happen between active and standby, and between standby and observer, but not directly between active and observer.==
### Maintaining Client Consistency
To ensure read-after-write consistency within a single client, a state ID, implemented using the transaction ID within the NameNode, is introduced in RPC headers.
* When a client performs a write through the Active NameNode, it updates its state ID using the latest transaction ID from that node.
* When performing a subsequent read, the client passes this state ID to the Observer NameNode, which ensures that its own transaction ID has caught up with the request's state ID before serving the read.
To maintain consistency between multiple clients in the face of out-of-band communication, a new `msync()`, or `metadata sync`, command is introduced. When `msync()` is called on a client, it updates its state ID against the Active NameNode - a very lightweight operation - so that subsequent reads are guaranteed to be consistent up to the point of the `msync()`.
Upon startup, a client will automatically call `msync()` before performing any reads against an Observer NameNode, so that any writes performed prior to the initialization of the client will be visible.
In addition, there is a configurable "auto-msync" mode supported by ObserverReadProxyProvider which will automatically perform an `msync()` at some configurable interval, to prevent a client from ever seeing data that is more stale than a time bound. It is disabled by default since there is some RPC overhead associated with this.
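As a client-side sketch (the nameservice ID and path are placeholders; `ObserverReadProxyProvider` and `FileSystem#msync()` are the documented hooks for observer reads in recent Hadoop 3.x releases), enabling observer reads and forcing a consistent read point looks roughly like this:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route client calls through the observer-aware proxy provider
        // ("mycluster" is a placeholder nameservice ID; the usual HA RPC
        // addresses are assumed to be configured as well).
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

        FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), conf);
        // Bring the client's state ID up to the Active NameNode's latest
        // transaction ID, so the following read is at least that fresh.
        fs.msync();
        System.out.println(fs.exists(new Path("/user/example/data")));
    }
}
```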
## HDFS Federation
The prior HDFS architecture allows only a single namespace for the entire cluster. In that configuration, a single Namenode manages the namespace. HDFS Federation addresses the above and scales the name service horizontally by adding support for multiple NameNodes/namespaces to HDFS.
Federation uses multiple independent NameNodes/namespaces.
* The NameNodes are independent and do not require coordination with each other.
* The DataNodes are used as common storage for blocks by all the NameNodes. Each Datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports. They also handle commands from the Namenodes.
![[hadoop_hdfs_federation_architecture.png]]
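For illustration, a two-nameservice federation (normally configured in `hdfs-site.xml`; the nameservice IDs, hostnames, and ports here are placeholders) can be sketched with the `Configuration` API:

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Two independent nameservices sharing the same DataNodes.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "nn-host1.example.com:8020");
        conf.set("dfs.namenode.http-address.ns1", "nn-host1.example.com:9870");
        conf.set("dfs.namenode.rpc-address.ns2", "nn-host2.example.com:8020");
        conf.set("dfs.namenode.http-address.ns2", "nn-host2.example.com:9870");
        return conf;
    }
}
```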
### Block Pool
A Block Pool is a set of blocks that belong to a single namespace. Each block pool is managed independently.
A namespace and its block pool together are called a Namespace Volume. It is a self-contained unit of management.
## ViewFS
ViewFS is a virtual file system which implements the Hadoop file system interface just like HDFS and the local file system. It doesn't store data physically but offers a unified logical view by mapping virtual paths to actual file-system paths. This is achieved through a mount table.
### Use in HDFS Federation
HDFS Federation uses multiple NameNodes to manage different namespaces, which can complicate client access as clients need to know the location of each namespace. ViewFS comes in handy here by providing **a unified logical view across these multiple namespaces**. By configuring the mount table, ViewFS maps the paths under different NameNodes to a single set of virtual paths. As a result, clients can access data by simply referring to the virtual paths, regardless of which NameNode the data is actually stored under. This greatly simplifies the management and usage of an HDFS Federation cluster.
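A minimal client-side sketch of such a mount table (the mount-table name, link paths, and target nameservices are placeholders, and the target nameservices are assumed to be configured as in the federation sketch above):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Mount-table name, link paths and nameservices are placeholders.
        conf.set("fs.defaultFS", "viewfs://ClusterX");
        conf.set("fs.viewfs.mounttable.ClusterX.link./data", "hdfs://ns1/data");
        conf.set("fs.viewfs.mounttable.ClusterX.link./logs", "hdfs://ns2/logs");

        // The client sees one virtual namespace; /data and /logs resolve to
        // different NameNodes behind the scenes.
        FileSystem fs = FileSystem.get(URI.create("viewfs://ClusterX"), conf);
        System.out.println(fs.exists(new Path("/data")));
    }
}
```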
### View File System Overload Scheme
To use ViewFS, users need to update `fs.defaultFS` with the viewfs scheme (`viewfs://`) and copy the mount-table configuration to all the client nodes.
The View File System Overload Scheme is a ViewFS extension that addresses the above. It allows users to continue to use their existing `fs.defaultFS` configured scheme, or any new scheme name, instead of the `viewfs` scheme.
![[hadoop_viewfs_schemes.png]]
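A rough sketch of the overload-scheme settings (the cluster/authority name and mount link are placeholders; the property names follow the ViewFS Overload Scheme guide, but verify them against your Hadoop version):

```java
import org.apache.hadoop.conf.Configuration;

public class OverloadSchemeConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Keep the existing hdfs:// default FS, but back it with ViewFS.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("fs.hdfs.impl",
                 "org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme");
        // Target file system used when a mount link ultimately points at HDFS.
        conf.set("fs.viewfs.overload.scheme.target.hdfs.impl",
                 "org.apache.hadoop.hdfs.DistributedFileSystem");
        // Mount links are defined against the same authority ("mycluster").
        conf.set("fs.viewfs.mounttable.mycluster.link./data", "hdfs://ns1/data");
        return conf;
    }
}
```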
## Centralized Cache Management in HDFS
Centralized cache management in HDFS is an explicit caching mechanism that allows users to specify paths to be cached by HDFS. The NameNode will communicate with DataNodes that have the desired blocks on disk, and instruct them to cache the blocks in off-heap caches.
![[hadoop_hdfs_cache_architecture.png]]
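Cache pools and directives are usually managed with the `hdfs cacheadmin` CLI; the equivalent can be sketched with the `DistributedFileSystem` Java API as below (the pool name, path, and replication are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CacheDirectiveSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at an HDFS cluster.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Create a cache pool, then ask HDFS to cache everything under a path.
        dfs.addCachePool(new CachePoolInfo("hot-data"));
        long directiveId = dfs.addCacheDirective(
                new CacheDirectiveInfo.Builder()
                        .setPath(new Path("/warehouse/dim_tables"))
                        .setPool("hot-data")
                        .setReplication((short) 1)   // number of cached replicas
                        .build());
        System.out.println("cache directive id: " + directiveId);
    }
}
```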
## HDFS Router-based Federation
The problem with HDFS Federation is how to maintain the split of the subclusters, which forces users to connect to multiple subclusters and manage the allocation of folders/files across them.
A natural extension to this partitioned federation is to add a layer of software responsible for federating the namespaces. This extra layer allows users to access any subcluster transparently, lets subclusters manage their own block pools independently, and will support rebalancing of data across subclusters later.
To accomplish these goals, the federation layer directs block accesses to the proper subcluster, maintains the state of the namespaces, and provides mechanisms for data rebalancing. This layer must be scalable, highly available, and fault tolerant.
This federation layer comprises multiple components. The `Router` component has the same interface as a NameNode and forwards client requests to the correct subcluster, based on ground-truth information from a `State Store`. The `State Store` combines a remote `Mount Table` (in the flavor of [ViewFS](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ViewFs.html), but shared between clients) and utilization (load/capacity) information about the subclusters. This approach has the same architecture as [YARN federation](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/Federation.html).
![[hadoop_hdfs_rbf_architecture.png]]
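From a client's point of view, the Routers are typically exposed as one more HA nameservice. The sketch below assumes the default Router RPC port (8888) and uses placeholder hostnames and a placeholder nameservice ID:

```java
import org.apache.hadoop.conf.Configuration;

public class RouterClientConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Clients talk to the Routers as if they were a single HA nameservice.
        // Nameservice ID, hostnames and the 8888 RPC port are assumptions.
        conf.set("fs.defaultFS", "hdfs://ns-fed");
        conf.set("dfs.nameservices", "ns-fed");
        conf.set("dfs.ha.namenodes.ns-fed", "r1,r2");
        conf.set("dfs.namenode.rpc-address.ns-fed.r1", "router1.example.com:8888");
        conf.set("dfs.namenode.rpc-address.ns-fed.r2", "router2.example.com:8888");
        conf.set("dfs.client.failover.proxy.provider.ns-fed",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```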
## MapReduce
Hadoop MapReduce is a distributed data-processing framework. At its core, it offers a well-defined programming model. This model simplifies distributed data processing by dividing work into two main phases: the **Map** phase and the **Reduce** phase.
In the Map phase, input data is split and transformed into intermediate key-value pairs by a user-defined map function. The output of the Map phase serves as the input for the Reduce phase. During the Reduce phase, the framework groups the intermediate key-value pairs by key and applies a user-defined reduce function to aggregate the grouped values.
Beyond being just a programming concept, MapReduce has its own execution logic. It determines how to partition tasks, monitors task execution, and retries failed tasks.
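The canonical example of this model is word count, essentially the program from the MapReduce tutorial: the mapper emits `(word, 1)` pairs and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: values arrive grouped by word; sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```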
## YARN
YARN (Yet Another Resource Negotiator) is a general-purpose resource management and job-scheduling system in the Hadoop ecosystem. It plays a central role in managing cluster resources such as CPU and memory.
YARN consists of two main components: the **ResourceManager** and the **NodeManager**.
The ResourceManager is the global resource manager. It takes charge of allocating resources across the entire cluster. It receives resource requests from different ApplicationMasters and distributes resources based on pre-defined policies and the current resource availability.
The NodeManager runs on each node in the cluster. It is responsible for managing the resources of its local node, launching and monitoring containers that run application tasks.
![[hadoop_yarn_architecture.png]]
YARN enables multiple computing frameworks, including MapReduce, Spark, and Tez, to share cluster resources efficiently.
### ResourceManager HA
The prior YARN architecture allows only a single ResourceManager, which is a single point of failure in a YARN cluster.
The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this otherwise single point of failure.
The architecture is similar to that of [[#HDFS HA]]; refer to that section for details.
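For reference, the core ResourceManager HA settings (normally in `yarn-site.xml`; the cluster ID, RM IDs, hostnames, and ZooKeeper quorum below are placeholders) look roughly like this:

```java
import org.apache.hadoop.conf.Configuration;

public class RmHaConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Cluster ID, RM IDs, hostnames and the ZooKeeper quorum are placeholders.
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.cluster-id", "cluster1");
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
        conf.set("yarn.resourcemanager.hostname.rm1", "rm1.example.com");
        conf.set("yarn.resourcemanager.hostname.rm2", "rm2.example.com");
        // Active/Standby election state is kept in ZooKeeper.
        conf.set("hadoop.zk.address",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```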
### YARN Federation
The YARN Federation architecture federates a number of smaller YARN clusters, referred to as sub-clusters, into a larger federated YARN cluster comprising tens of thousands of nodes.
![[hadoop_yarn_federation_architecture.png]]
The applications running in this federated environment see a unified large YARN cluster and are able to schedule tasks on any node in the cluster. Under the hood, the federation system negotiates with the sub-clusters' RMs and provides resources to the application.
## HDFS, YARN and MapReduce
HDFS provides the data storage foundation for MapReduce and YARN. MapReduce jobs run on top of YARN, and YARN is responsible for allocating and managing resources for MapReduce jobs.