31. What is the Core subproject
in Hadoop?
A set of components and interfaces for distributed filesystems and
general I/O (serialization, Java RPC, persistent data structures).
32. What are the Hadoop
subprojects?
Pig, Chukwa, Hive, HBase, MapReduce, HDFS, ZooKeeper, Core, and Avro.
33. What is a split?
Hadoop divides the input to a MapReduce job into fixed-size pieces
called input splits, or just splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced.
On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster.
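The trade-off is easier to see with numbers. The following is a minimal sketch, not Hadoop's actual split calculation: the function name `num_splits` is hypothetical, and it simply divides a file into fixed-size pieces to show how the number of map tasks grows as the split size shrinks.

```python
MB = 1024 * 1024

def num_splits(file_size_bytes, split_size_bytes=64 * MB):
    """Number of fixed-size splits (one map task each) for a single file.

    Simplified illustration: plain ceiling division, ignoring any
    adjustments a real input format may apply.
    """
    if file_size_bytes <= 0:
        return 0
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes

# A 1 GB file with the default 64 MB split size needs 16 map tasks.
print(num_splits(1024 * MB))          # 16
# With 1 MB splits the same file needs 1024 map tasks, so task
# creation overhead becomes a much larger share of the job.
print(num_splits(1024 * MB, 1 * MB))  # 1024
```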
34. Map tasks write their output to
local disk, not to HDFS. Why is this?
Map output is intermediate output: it’s processed by reduce tasks
to produce the final output, and once the job is complete the map output can be
thrown away. So storing it in HDFS, with replication, would be overkill. If the
node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node
to recreate the map output.
35. Explain the MapReduce data flow
with a single reduce task.
The input to a single reduce task is normally the output from all
mappers.
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
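The merge step on the reduce side can be sketched in a few lines. This is an in-memory illustration only, assuming each map output is already a list sorted by key; the function name `run_single_reducer` and the sample data are hypothetical, and no network transfer or HDFS write is modeled.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def run_single_reducer(sorted_map_outputs, reduce_fn):
    """Merge already-sorted (key, value) lists, then reduce each key group."""
    # heapq.merge keeps the combined stream sorted without re-sorting it.
    merged = heapq.merge(*sorted_map_outputs, key=itemgetter(0))
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        results.append((key, reduce_fn(key, values)))
    return results

# Three map tasks, each with output sorted by key.
map_outputs = [
    [("apple", 1), ("cherry", 1)],
    [("apple", 1), ("banana", 1)],
    [("banana", 1), ("cherry", 1), ("cherry", 1)],
]
totals = run_single_reducer(map_outputs, lambda key, values: sum(values))
print(totals)  # [('apple', 2), ('banana', 2), ('cherry', 3)]
```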
36. Explain the MapReduce data flow
with multiple reduce tasks.
When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task. There can be many
keys (and their associated values) in each partition, but the records for every
key are all in a single partition. The partitioning can be controlled by a
user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works well.
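A hash-based partitioning step might look like the sketch below. It is an illustration rather than Hadoop's own partitioner: `zlib.crc32` stands in for the key's hash function, and the names `partition_for` and `partition_map_output` are hypothetical. The point it shows is that every record for a given key lands in the same partition.

```python
import zlib

def partition_for(key, num_reducers):
    """Choose a reducer index from a stable hash of the key.

    zlib.crc32 is an assumed stand-in for the key's hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(records, num_reducers):
    """Split one map task's (key, value) records into one list per reducer."""
    partitions = {i: [] for i in range(num_reducers)}
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
parts = partition_map_output(records, num_reducers=3)
for index, bucket in parts.items():
    print(index, bucket)
```

A partition may hold several different keys, but a key never appears in more than one partition.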
37. Explain the MapReduce data flow
with no reduce tasks.
It’s also possible to have zero reduce tasks. This can be
appropriate when you don’t need the shuffle since the processing can be carried
out entirely in parallel.
In this case, the only off-node data transfer occurs when the map tasks write to HDFS.
38. What is a block in HDFS?
Filesystems deal with data in blocks, which are an integral
multiple of the disk block size. Filesystem blocks are typically a few
kilobytes in size, while disk blocks are normally 512 bytes. HDFS also
has the concept of a block, but it is a much larger unit: 64 MB by default.
39. Why is a Block in HDFS So
Large?
HDFS blocks are large compared to disk blocks, and the reason is
to minimize the cost of seeks. By making a block large enough, the time to
transfer the data from the disk can be made to be significantly larger than the
time to seek to the start of the block. Thus a large file made of
multiple blocks is transferred at close to the disk transfer rate.
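A quick calculation makes the argument concrete. The figures below (a 10 ms seek time and a 100 MB/s transfer rate) are assumed, illustrative values rather than measurements, and the function name is hypothetical. The sketch computes the block size at which seek time is a chosen fraction of transfer time.

```python
def block_size_for_seek_overhead(seek_time_s, transfer_rate_bytes_per_s, seek_fraction):
    """Block size at which seek time equals `seek_fraction` of transfer time."""
    # If seek time is, say, 1% of transfer time, the transfer must last
    # seek_time / 0.01 seconds.
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_bytes_per_s * transfer_time_s

MB = 1_000_000  # decimal megabytes, to match a rate quoted in MB/s

# Assumed figures: 10 ms seek, 100 MB/s transfer, seek cost held to 1%.
size = block_size_for_seek_overhead(0.010, 100 * MB, 0.01)
print(size / MB)  # about 100, i.e. a block of roughly 100 MB
```

Under these assumptions the block has to be on the order of 100 MB, which is the same order of magnitude as the HDFS default and far larger than a filesystem block of a few kilobytes.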
40. File permissions in HDFS
HDFS has a permissions model for files and directories.
There are three types of permission: the read permission (r), the write permission (w) and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can’t execute a file on HDFS; for a directory, it is required to access its children.
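The rules above can be written as a small lookup table. This is only a sketch of the rules as stated here, not HDFS code: the operation names, the `REQUIRED` table, and `is_allowed` are hypothetical, and the permission string is assumed to be a three-character triplet such as `rw-`.

```python
# Which permission each operation needs, per the description above.
REQUIRED = {
    ("read", "file"): "r",
    ("list", "directory"): "r",
    ("write", "file"): "w",
    ("create", "directory"): "w",  # create an entry inside the directory
    ("delete", "directory"): "w",  # delete an entry inside the directory
}

def is_allowed(permissions, operation, kind):
    """Check a triplet such as 'rw-' or 'r-x' against one operation."""
    needed = REQUIRED[(operation, kind)]
    return needed in permissions

print(is_allowed("r--", "read", "file"))         # True
print(is_allowed("r-x", "create", "directory"))  # False
```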
41. What is Thrift in HDFS?
The Thrift API in the “thriftfs” contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.