Latest Hadoop Interview Questions-Part 1

Q1. What are the default configuration files that are used in Hadoop 
As of the 0.20 release, Hadoop supports the following read-only default configuration files
- src/core/core-default.xml
- src/hdfs/hdfs-default.xml
- src/mapred/mapred-default.xml

Q2. How will you make changes to the default configuration files 
Hadoop does not recommend changing the default configuration files; instead, it recommends making all site-specific changes in the following files
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
- core-default.xml : Read-only defaults for hadoop.
- core-site.xml: Site-specific configuration for a given hadoop installation.

Hence, if the same property is defined in both core-default.xml and core-site.xml, the value in core-site.xml (the same is true for the other two file pairs) is used, since site resources are loaded after the defaults.
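For illustration, here is a minimal sketch (not taken from the original answer) of how the Configuration class layers these resources; the extra resource name is hypothetical.

```java
// A minimal sketch showing how Hadoop's Configuration object layers resources:
// defaults first, then site files, then any user-supplied resource added later
// wins for non-final properties.
import org.apache.hadoop.conf.Configuration;

public class ConfigOrderDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();  // loads core-default.xml, then core-site.xml
        conf.addResource("my-job-overrides.xml");  // hypothetical extra resource, loaded last
        // A property defined in both core-default.xml and core-site.xml resolves to the
        // core-site.xml value, because later resources override earlier ones.
        System.out.println(conf.get("hadoop.tmp.dir"));
    }
}
```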

Q3. Consider a case scenario where you have set the property mapred.output.compress to true to ensure that all output files are compressed for efficient space usage on the cluster. If a cluster user does not want to compress data for a specific job, what will you recommend he do? 
Ask him to create his own configuration file, set mapred.output.compress to false in it, and load this file as a resource in his job.

Q4. In the above case scenario, how can you ensure that a user cannot override the configuration mapred.output.compress to false in any of his jobs
This can be done by marking the property as final, i.e. adding <final>true</final> to its definition in the site configuration file (e.g. mapred-site.xml); final properties cannot be overridden by job configurations.
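A hedged sketch of the per-job override from Q3, using the old mapred API; the class and file names are illustrative.

```java
// A user turns output compression off for one job. If the administrator marked
// mapred.output.compress as <final>true</final> in the site file, this override
// is ignored (Hadoop logs a warning and keeps the final value).
import org.apache.hadoop.mapred.JobConf;

public class NoCompressionJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(NoCompressionJob.class);
        conf.setBoolean("mapred.output.compress", false); // per-job override
        // Alternatively, package the override in its own XML file and load it:
        // conf.addResource("no-compression.xml");        // hypothetical user file
        // ... then set input/output paths, mapper, reducer and submit the job
    }
}
```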

Q5. Which of the following is the only required variable that needs to be set in the file conf/hadoop-env.sh for Hadoop to work 
- HADOOP_LOG_DIR
- JAVA_HOME
- HADOOP_CLASSPATH

The only required variable to set is JAVA_HOME, which needs to point to the <java installation> directory

Q6. List all the daemons required to run the Hadoop cluster 
- NameNode
- SecondaryNameNode
- DataNode
- JobTracker
- TaskTracker


Q7. What's the default port that the JobTracker web UI listens on
50030

Q8. What's the default port where the DFS NameNode web UI will listen on
50070


Latest Hadoop Interview Questions-part 2

Q11. Give an example scenario where a combiner can be used and where it cannot be used
There can be several examples; the following are the most common ones (see the sketch after this list):
- Scenario where you can use a combiner
  Getting the list of distinct words in a file

- Scenario where you cannot use a combiner
  Calculating the mean of a list of numbers 
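A minimal sketch (new mapreduce API, class name illustrative) of the "distinct words" case; the comments note why the "mean" case cannot simply reuse the reducer as a combiner.

```java
// Distinct words: the reducer just emits each key once, so it is safe to run it
// as a combiner as well. Mean: a combiner that averaged local values would be
// wrong, because the mean of partial means is not the overall mean.
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctWordsReducer
        extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text word, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emitting the key once per group is idempotent, so this class can be
        // registered both as the combiner and as the reducer:
        //   job.setCombinerClass(DistinctWordsReducer.class);
        //   job.setReducerClass(DistinctWordsReducer.class);
        context.write(word, NullWritable.get());
    }
}
```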

Q12. What is job tracker
Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

Q13. What are some typical functions of Job Tracker
The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker 

Q14. What is task tracker
Task Tracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker 

Q15. Whats the relationship between Jobs and Tasks in Hadoop
One job is broken down into one or many tasks in Hadoop. 

Q16. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do ?
It will restart the task on some other TaskTracker, and only if the task fails more than 4 times (the default setting, which can be changed) will it kill the job
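A small illustrative sketch (old mapred API) of the retry limit mentioned above; the chosen value of 6 is arbitrary.

```java
// A task is re-attempted up to 4 times by default before the job is declared
// failed; the limit can be changed per job.
import org.apache.hadoop.mapred.JobConf;

public class RetryConfigDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMaxMapAttempts(6);      // equivalent to mapred.map.max.attempts
        conf.setMaxReduceAttempts(6);   // equivalent to mapred.reduce.max.attempts
    }
}
```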

Q17. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism does Hadoop provide to combat this  
Speculative Execution 

Q18. How does speculative execution work in Hadoop  
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first. 
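A short sketch of how a job can toggle speculative execution, assuming the classic mapred API; the chosen settings are illustrative.

```java
// Speculative execution is on by default and can be toggled per job.
import org.apache.hadoop.mapred.JobConf;

public class SpeculativeExecDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMapSpeculativeExecution(true);     // mapred.map.tasks.speculative.execution
        conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution
    }
}
```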

Q19. Using the command line in Linux, how will you 
- see all jobs running in the hadoop cluster: hadoop job -list
- kill a job: hadoop job -kill jobid 

Q20. What is Hadoop Streaming  
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations 


Q21. What is the characteristic of the Streaming API that makes it flexible enough to run map reduce jobs in languages like Perl, Ruby, Awk etc.  
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a Map Reduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Latest Hadoop Interview Questions-part 3

Q22. What is Distributed Cache in Hadoop
Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Q23. What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the application read it  
Because the Distributed Cache is much faster. It copies the file to all TaskTrackers at the start of the job. If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy of the cached file. On the other hand, if the MR job reads the file from HDFS directly, then every mapper accesses it from HDFS, so a TaskTracker running 100 map tasks reads the file 100 times from HDFS. HDFS is also not very efficient when used like this.
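A minimal sketch of the classic DistributedCache API; the HDFS path and link name are hypothetical.

```java
// The driver registers an HDFS file; each task then reads the local copy that
// the framework placed on its node before the task started.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Driver side: ship /apps/lookup.txt (hypothetical path) to every task node.
        DistributedCache.addCacheFile(new URI("/apps/lookup.txt#lookup"), conf);
    }

    // Task side (e.g. in the mapper's setup/configure): locate the local copies.
    static Path[] localCopies(Configuration conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
```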

Q24. What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during the runtime of the application  
This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution

Q25. Have you ever used Counters in Hadoop? Give us an example scenario
Anybody who claims to have worked on a Hadoop project is expected to have used counters
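One common example scenario, sketched with the new mapreduce API: counting malformed input records. The class name, enum and validity check are illustrative.

```java
// Counters let each task report how many malformed vs. well-formed records it saw;
// the framework aggregates the totals across all tasks.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    enum RecordQuality { MALFORMED, WELL_FORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split(",").length < 3) {       // hypothetical validity check
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;                                          // skip the bad record
        }
        context.getCounter(RecordQuality.WELL_FORMED).increment(1);
        // ... normal map logic ...
    }
}
```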

Q26. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to a Hadoop job  
Yes. The FileInputFormat class provides methods to add multiple directories as input to a Hadoop job
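A short sketch, assuming the new mapreduce API; the input paths are hypothetical.

```java
// Adding several input directories to one job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        FileInputFormat.addInputPath(job, new Path("/data/2012/01"));
        FileInputFormat.addInputPath(job, new Path("/data/2012/02"));
        // or all at once:
        // FileInputFormat.setInputPaths(job, "/data/2012/01,/data/2012/02");
    }
}
```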

Q27. Is it possible to have Hadoop job output in multiple directories? If yes, then how  
Yes, by using the MultipleOutputs class
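A short sketch of the MultipleOutputs class (new mapreduce API); the named outputs are illustrative.

```java
// Declare two named outputs in the driver; the reducer then writes to them via a
// MultipleOutputs instance, e.g. mos.write("errors", key, value).
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "stats", TextOutputFormat.class, Text.class, Text.class);
    }
}
```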

Q28. What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The hadoop job will throw an exception and exit.
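A common defensive pattern, sketched here as an illustration (not part of the original answer; the output path is hypothetical): check for and remove a stale output directory before submitting the job.

```java
// Without this, resubmitting the job fails because the output directory exists.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputDirGuard {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/user/demo/wordcount-output");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);   // recursive delete of the stale output directory
        }
        // ... now configure the job with this path as its output and submit it ...
    }
}
```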

Q29. How can you set an arbitrary number of mappers to be created for a job in Hadoop  
This is a trick question. You cannot set it directly; the number of map tasks is determined by the number of input splits

Q30. How can you set an arbitrary number of reducers to be created for a job in Hadoop  
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting
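Both approaches, sketched with the old mapred API; the value 10 is arbitrary.

```java
import org.apache.hadoop.mapred.JobConf;

public class ReducerCountDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(10);              // programmatic
        conf.set("mapred.reduce.tasks", "10");   // or as a plain configuration setting
    }
}
```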

Latest Hadoop Interview Questions-part 4

Q31. How will you write a custom partitioner for a Hadoop job  
To have Hadoop use a custom partitioner you will have to do, at minimum, the following three things (see the sketch after this list):
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the Map Reduce job, either
  - add the custom partitioner to the job programmatically using the setPartitionerClass method, or
  - add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
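A minimal illustrative partitioner (new mapreduce API); the routing rule by first letter is made up for the example.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route keys by their first character so related keys land in the same reducer.
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : s.charAt(0);
        return Character.toLowerCase(first) % numPartitions;
    }
}
// Wiring it in the driver: job.setPartitionerClass(FirstLetterPartitioner.class);
```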

Q32. How did you debug your Hadoop code  
There can be several ways of doing this, but the most common ways are
- By using counters
- The web interface provided by the Hadoop framework

Q33. Did you ever build a production process in Hadoop? If yes, then what was the process when your Hadoop job failed for some reason
It's an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, e.g. an email is sent or their monitoring system raises an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, because unexpected data can very easily break the job.

Q34. Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, then how did you handle it
This is an open-ended question, but a candidate who claims to be an intermediate developer and has worked on a large data set (10-20 GB minimum) should have run into this problem. There can be many ways to handle it, but the most common way is to alter your algorithm and break the job down into more map reduce phases, or to use a combiner if possible.

Latest Hadoop Interview Questions -Part 5


1) Can you explain Big Data analytics in the context of Hadoop?

Big Data analytics deals with such a huge volume of complex data that it becomes very tedious to capture, store, process, retrieve, report on and analyze it using traditional database management tools or conventional data-processing techniques; Hadoop is a framework designed for exactly this kind of workload.

What is Hadoop or Hadoop Big Data?

• Apache Hadoop is an open-source software framework that supports the development of distributed, data-intensive applications.

• The Hadoop platform consists of the Hadoop kernel, the MapReduce component and HDFS (Hadoop Distributed File System).

• Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.

• The best-known technology used for large data is Hadoop.

• Two languages are identified as original Hadoop languages: Pig and Hive.

• In a Hadoop system, data is distributed across thousands of nodes and processed in parallel.

• Hadoop addresses the complexities of large volume, velocity & variety of data.

• Hadoop is strongly focused on batch processing.

• Hadoop can store petabytes of data reliably.

• Accessibility is ensured even if a machine fails or is removed from the network.

• One can use MapReduce programs to access and manipulate the data. The developer does not have to worry about where the data is stored; the data can be referenced through a single view provided by the master node, which stores the metadata of all files stored on the cluster.

2) Explain the JobTracker in Hadoop. How many instances of the JobTracker run on a Hadoop cluster?

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker's node location. The JobTracker is the single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions.

Client applications submit jobs to the JobTracker.

The JobTracker talks to the NameNode to determine the location of the data.

The JobTracker locates TaskTracker nodes with available slots at or near the data.

The JobTracker submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
3) What are the main differences between HDFS and NAS?

HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. Some important differences between HDFS and NAS:


1) In HDFS, data blocks are distributed across the local disks of all machines in a cluster, whereas in NAS the data is stored on dedicated hardware.


2) HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce because the data is stored separately from the computation.


3) HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

What is the difference between the MapReduce engine and an HDFS cluster?
An HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming module that is used to retrieve and analyze the data.

4) Does a 'map' resemble a pointer?

No, a map is not like a pointer.

6) Do we require separate servers for the NameNode and the DataNodes?


Yes, we need separate servers for the NameNode and the DataNodes. This is because the NameNode requires a highly configured system, since it stores information about the location of all the files stored in the different DataNodes, whereas the DataNodes only require systems with a low configuration.


8) Why is the number of maps equal to the number of input splits?


The number of map tasks is equal to the number of input splits because each split of the input's key/value pairs is processed by its own map task.


9) Which companies are using Hadoop? Give a few examples.

• Oracle
• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!



11) Is a job divided into maps?


No, a job is divided into input splits. Splits are created for the file; the file is placed on the DataNodes in blocks, and one map task is required for each split.


15) What are the two types of 'writes' in HDFS?

There are two types of writes in HDFS: posted and non-posted. A posted write is when we write it and forget about it, without worrying about the acknowledgement; it is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement; it is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, even though both kinds of write are asynchronous.

17) Why is 'reading' done in parallel but 'writing' is not in HDFS?


Reading is done in parallel because that way we can access the data faster. Writes are not done in parallel, because a parallel write operation could result in data inconsistency. For example, if two nodes try to write data to the same file in parallel, the first node does not know what the second node wrote, and vice versa, so it becomes ambiguous which data should be stored and retrieved.


19) Can Hadoop be compared to a NoSQL database like Cassandra?


Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it is a filesystem (HDFS) plus a distributed programming framework (MapReduce).



Why do we need Hadoop? 


Every day a large amount of unstructured data is being dumped into our machines. The major challenge is not storing large data sets in our systems but collecting and analyzing the big data in organizations, especially data that is present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and very cost-effectively. It uses the concept of MapReduce, which allows it to divide the query into smaller pieces and process them in parallel. This is also known as parallel computing.
 


Latest Hadoop Interview Questions -Part 6

1.   What is Hadoop ?

• Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.

• The entire Hadoop platform consists of the Hadoop kernel, the MapReduce component and HDFS (Hadoop Distributed File System)

• Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.

• The most well known technology used for Big Data is Hadoop

• Two languages are identified as original Hadoop languages: PIG and Hive.

• In a hadoop system, the data is distributed across thousands of nodes and processed in parallel 

• Hadoop deals with complexities of high volume, velocity & variety of data

• Hadoop is primarily geared towards batch processing

• Hadoop can store petabytes of data reliably

• Accessibility is ensured even if any machine breaks down or is removed from the network.

• One can use Map Reduce programs to access and manipulate the data. The developer need not worry about where the data is stored; he/she can reference the data through a single view provided by the Master Node, which stores the metadata of all the files stored across the cluster.


2.   What is Big Data?

Big Data is data that is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. Such data is difficult to capture, mine, and manage using traditional methods. There is so much hype in this space that there could be an extended debate just about the definition of big data.

Big Data technology is not restricted to large volumes. As of the year 2012, clusters that are considered big are in the 100-petabyte range.

Traditional relational databases, like Informix and DB2, provide proven solutions for structured data. Via extensibility they also manage unstructured data. The Hadoop technology brings new and more accessible programming techniques for working on massive data stores with both structured and unstructured data.


3.   Advantages of Hadoop

Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.

Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.

Linear Scalability: Every parallel technology makes claims about scaling up. Hadoop has genuine scalability, since a recent release expanded the limit on the number of nodes to beyond 4,000.

Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. The Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.


4.   Definition of Big data

According to Gartner, Big Data can be defined as high-volume, high-velocity and high-variety information requiring innovative and cost-effective forms of information processing for enhanced decision making.

5.   How does Big Data differ from a database?

Datasets which are beyond the ability of a database to store, analyze and manage can be defined as Big Data. Big Data technology extracts the required information from very large volumes, whereas the storage capacity of a database is limited.


6.   3 V of Big data - Explain (Important)

Big data can be defined with the help of 3 V (Volume, Velocity and Variety).

Volume: It describes the amount of data that is generated by organizations or individuals. Thus, it denotes the storage area limit.

Velocity: It describes the frequency at which the data is generated, changed, processed and shared. Thus, it denotes any access to the data in a specified time.

Variety: The data can be Structured or Unstructured or Semi-structured data.

The above 3 Vs were sufficient to define big data, but nowadays one more V (Value) is also defined.

Value: It is the outcome of analysing big data, i.e. the value that the analysis brings to the business.

7.   Who is using Hadoop? Give some examples.

• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!


8.   Hadoop Stack - Structure

[Figure: Hadoop stack structure]



9.   Pig for Hadoop - Give some points

Pig is a data-flow-oriented language for analyzing large data sets.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility.
Users can create their own functions to do special-purpose processing.

Features of Pig:

– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
– developed at Yahoo!


10.   Hive for Hadoop - Give some points

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY, etc.
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook


Latest Hadoop Interview Questions -Part 7

11.   Hadoop HDFS architecture

[Figure: Hadoop HDFS architecture]



12.   Map Reduce in Hadoop

Map Reduce:
It is a framework for processing huge datasets in parallel using a large number of computers, referred to as a cluster. It involves two processes, namely Map and Reduce.

[Figure: Hadoop MapReduce data flow]

Map Process:
In this process the input is taken by the master node, which divides it into smaller tasks and distributes them to the worker nodes. The worker nodes process these sub-tasks and pass the results back to the master node.

Reduce Process:
In this process the master node combines all the answers provided by the worker nodes to get the result of the original task. The main advantage of Map Reduce is that the map and reduce steps are performed in a distributed mode; since each map operation is independent, all maps can be performed in parallel, thereby reducing the total computing time.
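A compact, illustrative word-count example (new mapreduce API, class names are assumptions) showing the Map and Reduce steps described above.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: split each input line into words and emit (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```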


13.   What is a heartbeat in HDFS?

A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive a heartbeat, it concludes that there is a problem with the DataNode, or that the TaskTracker is unable to perform its assigned tasks.



14.   What is metadata?

Metadata is the information about the data stored in data nodes such as location of the file, size of the file and so on.



15.   What is a Data node?

Data nodes are the slaves which are deployed on each machine and provide the actual storage. 
These are responsible for serving read and write requests for the clients.



16.   What is a Name node?

Name node is the master node on which the job tracker runs and which holds the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and the single point of failure in HDFS.



17.   Is Namenode also a commodity?

No. 
Namenode can never be commodity hardware because the entire HDFS relies on it. 
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.



18.   Can Hadoop be compared to NOSQL database like Cassandra?

Though NOSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NOSQL. Hadoop is not a database. It’s a filesystem (HDFS) and distributed programming framework (MapReduce).



19.   What is Key value pair in HDFS?

Key value pair is the intermediate data generated by maps and sent to reducers for generating the final output.



20.   What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.


Latest Hadoop Interview Questions -Part 8

21.   What is a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.


22.   How is indexing done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS keeps storing the last part of the data, which says where the next part of the data will be. In fact, this is the basis of HDFS.


23.   History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: 
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.

Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.


24.   What is meant by Volunteer Computing?

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.

25.   How Hadoop differs from SETI (Volunteer computing)?

Although SETI (Search for Extra-Terrestrial Intelligence) may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.


26.   Compare RDBMS and MapReduce

Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear


27.   What is HBase?

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

28.   What is ZooKeeper?

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.


29.   What is Chukwa?

A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)


30.   What is Avro?

A data serialization system for efficient, cross-language RPC, and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)