Latest Hadoop Interview Questions and answers: Part 17

Why is Hadoop useful?
Hadoop is fault tolerant, meaning the system will simply redirect work to another node and resume processing when a node is lost. Hadoop is also schema-less and can absorb data of all types, sources, and structures, allowing for deeper analysis.

Which directory does Hadoop install to?
Hadoop is installed in /usr/lib/hadoop-0.20/

What are the four modules that make up the Apache Hadoop framework?
 Hadoop Common, which contains the common utilities and libraries necessary for Hadoop’s other modules.
 Hadoop YARN, the framework’s platform for resource-management
 Hadoop Distributed File System, or HDFS, which stores information on commodity machines
 Hadoop MapReduce, a programming model used to process large-scale sets of data
Which modes can Hadoop be run in? List a few features for each mode.
 Standalone, or local mode, which is one of the least commonly used environments. When it is used, it’s usually only for running MapReduce programs. Standalone mode lacks a distributed file system, and uses a local file system instead.
 Pseudo-distributed mode, which runs all daemons on a single machine. It is most commonly used in QA and development environments.
 Fully distributed mode, which is most commonly used in production environments. Unlike pseudo-distributed mode, fully distributed mode runs all daemons on a cluster of machines rather than a single one.
Where are Hadoop’s configuration files located?
Hadoop’s configuration files can be found inside the conf sub-directory.

List Hadoop’s three configuration files.
 hdfs-site.xml
 core-site.xml
 mapred-site.xml

What are “slaves” and “masters” in Hadoop?
In Hadoop, the slaves file lists the hosts that run datanode and task tracker daemons. The masters file lists the hosts that run the secondary namenode.
What is /etc/init.d?
/etc/init.d is the Linux directory that holds service (daemon) scripts. In Hadoop, you use the scripts there to check the status of daemons and where they're located.
What is a Namenode?
Namenode exists at the center of the Hadoop distributed file system cluster. It manages metadata for the file system, and datanodes, but does not store data itself.

How many Namenodes can run on a single Hadoop cluster?
Only one Namenode process can run on a single Hadoop cluster. The file system will go offline if this Namenode goes down.
What is a datanode?
Unlike Namenode, a datanode actually stores data within the Hadoop distributed file system. Datanodes run on their own Java virtual machine process.

How many datanodes can run on a single Hadoop cluster?
Hadoop slave nodes contain only one datanode process each.

What is job tracker in Hadoop?
Job tracker is used to submit and track jobs in MapReduce.
How many job tracker processes can run on a single Hadoop cluster?
As with the Namenode, there can only be one job tracker process running on a single Hadoop cluster. The job tracker runs in its own Java virtual machine process. If the job tracker goes down, all currently active jobs stop.

What sorts of actions does the job tracker process perform?
 Client applications send the job tracker jobs.
 Job tracker determines the location of data by communicating with Namenode.
 Job tracker finds task tracker nodes that have open slots for the data.
 Job tracker submits the job to task tracker nodes.
 Job tracker monitors the task tracker nodes for signs of activity (heartbeats). If a task tracker stops responding or shows too little activity, job tracker transfers the work to a different task tracker node.
 Job tracker receives a notification from task tracker if the job has failed. From there, job tracker might submit the job elsewhere, as described above. If it doesn’t do this, it might blacklist either the job or the task tracker.

How does job tracker schedule a job for the task tracker?
When a client application submits a job to the job tracker, the job tracker searches for a task tracker node with an empty slot, preferably on the same server as the datanode that holds the data to be processed.

What does the mapred.job.tracker property do?
mapred.job.tracker is the configuration property (set in mapred-site.xml) that identifies the host and port on which the job tracker runs; checking it tells you which node is currently acting as the job tracker.

What is “PID”?
PID stands for Process ID.

What is “jps”?
jps is a JVM command used to check whether the task tracker, job tracker, datanode, and Namenode daemons are running.
Is there another way to check whether Namenode is working?
Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.

How would you restart Namenode?
To restart Namenode, you could either run:
 su - hdfs
 /etc/init.d/hadoop-0.20-namenode stop
 /etc/init.d/hadoop-0.20-namenode start
or simply run stop-all.sh followed by start-all.sh.

What is “fsck”?
fsck stands for File System Check. In Hadoop it is run as 'hadoop fsck /' to report on the health of HDFS files and blocks.
What are the port numbers for job tracker, task tracker, and Namenode?
The default web UI port number for the job tracker is 50030, for the task tracker it is 50060, and for the Namenode it is 50070.

What is a “map” in Hadoop?
In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs a key-value pair according to the input type.
What is a “reducer” in Hadoop?
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

What are the parameters of mappers and reducers?
The four parameters for mappers are:
 LongWritable (input)
 Text (input)
 Text (intermediate output)
 IntWritable (intermediate output)
The four parameters for reducers are (a short Java sketch follows these lists):
 Text (intermediate output)
 IntWritable (intermediate output)
 Text (final output)
 IntWritable (final output)
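
To make these four parameter types concrete, here is a minimal word-count style sketch against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input (LongWritable, Text) -> intermediate output (Text, IntWritable)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // emit (word, 1)
        }
    }
}

// Reducer: intermediate (Text, IntWritable) -> final output (Text, IntWritable)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                         // add up the 1s for each word
        }
        context.write(key, new IntWritable(sum));
    }
}

The mapper's input pair is (byte offset, line of text), its intermediate output pair is (word, 1), and the reducer sums those 1s to produce the final (word, count) pair.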

Is it possible to rename the output file, and if so, how?
Yes, it is possible to rename the output file by utilizing a multiple-format output class, such as MultipleOutputs.
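
One way to do this with the newer org.apache.hadoop.mapreduce API is the MultipleOutputs class, which lets a reducer write its results under a base name of your choosing instead of the default part-r-00000 style names. A minimal sketch, where the 'counts' base name and the key/value types are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reducer that writes its results under the base name "counts",
// producing files named counts-r-00000, counts-r-00001, and so on.
public class RenamingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        mos.write(key, new IntWritable(sum), "counts");   // custom base name
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // flush and close the extra output streams
    }
}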
List the network requirements for using Hadoop.
 Secure Shell (SSH) for launching server processes
 Password-less SSH connection
Which port does SSH work on?
SSH works on the default port number, 22.

What is streaming in Hadoop?
As part of the Hadoop framework, streaming is a feature that lets engineers write MapReduce code in any language, as long as that programming language can read standard input and write standard output. Even though Hadoop is Java-based, the chosen language doesn't have to be Java; it can be Perl, Python, Ruby, etc.
If you want to customize MapReduce itself, however, Java must be used.
Hadoop is Java-based, remember, so it's pretty useful to know Java if you want to work with the framework.

What is the difference between Input Split and an HDFS Block?
InputSplit and HDFS Block both refer to the division of data, but InputSplit handles the logical division while HDFS Block handles the physical division.

What does the file hadoop-metrics.properties do?
The hadoop-metrics.properties file controls reporting in Hadoop.


Latest Hadoop Interview Questions and answers: Part 16


Latest Hadoop MapReduce Interview Questions and answers:

What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.
What are 'maps' and 'reduces'?
'Maps' and 'Reduces' are two phases of solving a query in HDFS. 'Map' is responsible for reading data from the input location and, based on the input type, generating a key-value pair, that is, an intermediate output on the local machine. 'Reducer' is responsible for processing the intermediate output received from the mapper and generating the final output.

What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent input parameters and the second two represent intermediate output parameters.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent intermediate output parameters and the second two represent final output parameters.
What do the master class and the output class do?
The master class is defined to update the master (the job tracker), and the output class is defined to write data onto the output location.
What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is 'text' (TextInputFormat).
Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as 'text'.
What does the text input format do?
In text input format, each line of the file is a record. The key is the byte offset of the line within the file and the value is the whole line of text; this is how the data gets processed by a mapper. The mapper receives the key as a 'LongWritable' parameter and the value as a 'Text' parameter.
What does the JobConf class do?
MapReduce needs to logically separate the different jobs running on the same cluster. The JobConf class helps to do job-level settings, such as declaring a job in the real environment. It is recommended that the job name be descriptive and represent the type of job being executed.
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating a key-value pair out of the mapper.
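
For context, here is a minimal old-API (org.apache.hadoop.mapred) driver sketch showing JobConf, setJobName, and setMapperClass together. It uses Hadoop's built-in IdentityMapper and IdentityReducer so that it is self-contained; the job name and command-line paths are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughDriver.class);
        conf.setJobName("pass-through");                 // descriptive job name

        conf.setMapperClass(IdentityMapper.class);       // map-side settings
        conf.setReducerClass(IdentityReducer.class);     // reduce-side settings

        conf.setOutputKeyClass(LongWritable.class);      // key = byte offset (default TextInputFormat)
        conf.setOutputValueClass(Text.class);            // value = the line itself

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);                          // submit the job and wait for it to finish
    }
}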


What do sorting and shuffling do?
Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing all the values for the same key to one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.

What does a split do?
Before the data is transferred from its hard disk location to the map method, there is a phase, or method, called the split. The split method pulls a block of data from HDFS into the framework. The Split class does not write anything; it reads data from the block and passes it to the mapper. By default, the split is taken care of by the framework. The split size is equal to the block size and is used to divide a block into a bunch of splits.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a custom splitter. Customization is a feature of Hadoop that can be invoked from the main method, as in the sketch below.
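
Besides writing a full custom InputFormat, a simpler knob (assuming the Hadoop 2.x mapreduce API) is to cap the maximum split size from the driver, which makes the framework create more, smaller splits. The 32 MB figure below is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small splits");

        // Cap each input split at 32 MB instead of the full HDFS block size,
        // so more (smaller) map tasks are created.
        FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // ... set the mapper, reducer, and output classes as usual, then submit.
    }
}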
What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values for a single key go to the same reducer, thus allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
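
A minimal custom partitioner sketch (new mapreduce API; the alphabetical rule and the class name are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends keys starting with a-m to reducer 0 and everything else to reducer 1,
// so every occurrence of the same key always lands on the same reducer.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0;                  // a single reducer gets everything
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class), and it only matters when the job runs with more than one reducer.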
How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. This is the beauty of parallel processing, in contrast to the other data processing tools available.
Can we rename the output file?
Yes, we can rename the output file by implementing a multiple-format output class.
Why can't we do aggregation (addition) in a mapper? Why do we require a reducer for that?
We cannot do aggregation (addition) in a mapper because sorting does not happen in the mapper; sorting happens only on the reducer side. The mapper is initialized separately for each input split, so while doing aggregation we would lose the value of the previous instance: every split gets its own mapper, and one mapper keeps no track of the values seen by the others.
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to do programming using MapReduce in any programming language that can accept standard input and produce standard output. It could be Perl, Python, Ruby, etc., and need not be Java. However, customization in MapReduce can only be done using Java and not any other programming language.
What is a Combiner?
A 'Combiner' is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by reducing the quantum of data that is required to be sent to the reducers.
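When the reduce function is commutative and associative, as in a word-count sum, the reducer class can usually double as the combiner. A driver sketch assuming the Hadoop 2.x API and the WordCountMapper/WordCountReducer classes sketched earlier in this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);

        job.setMapperClass(WordCountMapper.class);      // from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);   // local, map-side reduce
        job.setReducerClass(WordCountReducer.class);    // final reduce

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}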
What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
What happens in a TextInputFormat?
In TextInputFormat, each line in the text file is a record. Key is the byte offset of the line and value is the content of the line.
For instance, key: LongWritable, value: Text.
What do you know about KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a 'record'. Each line is split at the first separator character: everything before the separator is the key and everything after it is the value.
For instance, key: Text, value: Text.
What do you know about SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. Key and value are user defined. It is a specific compressed binary file format, optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
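As a small illustration of that hand-off (Hadoop 2.x mapreduce API; the two Job objects are assumed to be configured elsewhere), the first job writes SequenceFile output and the second job reads it back as its input:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobsSketch {
    // jobOne and jobTwo are assumed to be otherwise fully configured Jobs.
    static void wireFormats(Job jobOne, Job jobTwo) {
        // The first job writes its (key, value) pairs as a binary sequence file.
        jobOne.setOutputFormatClass(SequenceFileOutputFormat.class);

        // The second job reads those pairs back with their original writable types,
        // so no text parsing is needed between the two jobs.
        jobTwo.setInputFormatClass(SequenceFileInputFormat.class);
    }
}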
What do you know about NLineInputFormat?
NLineInputFormat splits the input so that each split contains exactly 'n' lines, and each mapper therefore receives 'n' lines of input.