Latest Hadoop Interview Questions -Part 5


1) Can you explain Hadoop Big Data analytics?

Big Data refers to volumes of complex data so huge that capturing, storing, processing, retrieving, reporting on, and analyzing them becomes very tedious using traditional database management tools or transaction-processing techniques. Hadoop Big Data analytics is the use of the Hadoop framework to analyze such data.

What is Hadoop?


Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.


• The Hadoop platform consists of the Hadoop kernel, the MapReduce component, and HDFS (Hadoop Distributed File System).

• Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.

• The best-known technology used for Big Data is Hadoop.

• Two languages are identified as original Hadoop languages: Pig and Hive.

• In a Hadoop system, data is distributed across thousands of nodes and processed in parallel.

• Hadoop handles the complexities of high volume, velocity, and variety of data.

• Hadoop is strongly geared toward batch processing.

• Hadoop can store petabytes of data reliably.

• Accessibility is ensured even if a machine fails or is removed from the network.

• One can use MapReduce programs to access and manipulate the data. The developer need not worry about where the data is stored; the data can be referenced through a single view provided by the master node, which stores the metadata of all files stored on the cluster.

2) Explain the JobTracker in Hadoop. How many instances of the JobTracker run on a Hadoop cluster?

The JobTracker is the daemon service that submits and monitors MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node's location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions in Hadoop.

Client applications submit jobs to the JobTracker.

The JobTracker talks to the NameNode to determine the location of the data.

The JobTracker locates TaskTracker nodes with available slots at or near the data.

The JobTracker submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to be avoided, or it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
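The "in or near the data" scheduling step above can be sketched in a few lines of Python. This is only an illustrative model, not Hadoop's actual scheduler; the tracker and slot structures are invented for the example:

```python
# Sketch (not Hadoop's real code) of how a JobTracker-style scheduler might
# prefer TaskTrackers on the nodes that already hold the input data.

def choose_tracker(trackers, data_nodes):
    """Pick a TaskTracker with a free slot, preferring data-local nodes."""
    # First pass: a tracker running on a node that holds the data (data-local).
    for t in trackers:
        if t["free_slots"] > 0 and t["host"] in data_nodes:
            return t["host"]
    # Fallback: any tracker with a free slot (non-local; data must be moved).
    for t in trackers:
        if t["free_slots"] > 0:
            return t["host"]
    return None  # no capacity anywhere: the task waits

trackers = [
    {"host": "node1", "free_slots": 0},
    {"host": "node2", "free_slots": 2},
    {"host": "node3", "free_slots": 1},
]
# Suppose the NameNode reported that the input block lives on node3.
print(choose_tracker(trackers, {"node3"}))  # node3 (data-local wins)
```

The same call with data on an unavailable node falls back to any free slot, which mirrors Hadoop's preference order of data-local, then rack-local, then remote tasks.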
3) What is the main difference between HDFS and NAS?

HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from them are significant. Some important differences between HDFS and NAS:


1) In HDFS, data blocks are distributed across the local disks of all machines in the cluster, whereas in NAS, data is stored on dedicated hardware.


2) HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce because the data is stored separately from the computation.



3) HDFS runs on a cluster of machines and provides redundancy through its replication protocol, whereas NAS is served by a single machine and does not provide data redundancy.

What is the difference between the MapReduce engine and the HDFS cluster?
The HDFS cluster is the name given to the entire configuration of master and slaves where data is stored. The MapReduce engine is the programming module used to retrieve and analyze data.

4) Is a map like a pointer?

No, a map is not like a pointer.

6) Do we require two servers for the NameNode and the DataNodes?

Yes, we need two different kinds of servers for the NameNode and the DataNodes. The NameNode requires a highly configured system because it stores information about the location of all files stored across the different DataNodes, whereas the DataNodes require systems with a lower configuration.


8) Why is the number of maps equal to the number of input splits?

The number of maps is equal to the number of input splits because each split's key/value pairs must be processed by its own map task.
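As a back-of-the-envelope illustration of the splits-to-maps arithmetic (the 64 MB figure is just an assumed split size; real defaults depend on the Hadoop version and configuration):

```python
import math

# Illustrative arithmetic only: one map task is created per input split.
# The split size typically defaults to the HDFS block size.

def number_of_maps(file_size_bytes, split_size_bytes):
    # Every full or partial split needs its own map task.
    return math.ceil(file_size_bytes / split_size_bytes)

block = 64 * 1024 * 1024                         # assumed 64 MB split size
print(number_of_maps(200 * 1024 * 1024, block))  # 4: three full splits + one partial
```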


9) Which clients are using Hadoop? Give a few examples.

• Oracle
• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!



11) Is a job divided among the maps?

No, a job is not divided among the maps. Instead, splits are created for the file, which is placed on the DataNodes in blocks. For each split, a map task is created.


15) What are the two types of 'writes' in HDFS?

There are two types of writes in HDFS: posted and non-posted. A posted write is when we write and forget about it, without waiting for an acknowledgement; it is similar to traditional postal mail. In a non-posted write, we wait for the acknowledgement; it is similar to modern courier services. Naturally, a non-posted write is more expensive than a posted write, although both writes are asynchronous.

17) Why is 'reading' done in parallel but 'writing' is not in HDFS?

Reading is done in parallel because it gives fast access to the data. Writing, however, is not done in parallel, because parallel writes could result in data inconsistency. For example, if two nodes try to write data to the same file in parallel, the first node does not know what the second node has written, and vice versa, so it becomes ambiguous which data should be stored and retrieved.
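The single-writer idea can be mimicked with ordinary threads: readers run concurrently because they do not change state, while writes are funneled through one lock. A toy Python sketch, not HDFS code:

```python
# Toy illustration of concurrent reads with serialized writes: reads don't
# change state, so many can proceed at once; uncoordinated writers would not
# see each other's changes. Plain Python threading, not HDFS internals.
import threading

data = ["block-A", "block-B", "block-C"]
write_lock = threading.Lock()   # single-writer discipline
results = []

def reader(i):
    results.append(data[i])     # reads can safely run in parallel

def writer(value):
    with write_lock:            # writes are serialized to stay consistent
        data.append(value)

threads = [threading.Thread(target=reader, args=(i,)) for i in range(3)]
threads.append(threading.Thread(target=writer, args=("block-D",)))
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # all three original blocks were read
print(data[-1])         # block-D was written exactly once
```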


19) Can Hadoop be compared to a NoSQL database like Cassandra?

Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it is a filesystem (HDFS) plus a distributed programming framework (MapReduce).



Why should we use Hadoop? 


Daily, a large amount of unstructured data is being dumped into our machines. The major challenge is not merely storing large data sets in our systems, but collecting and analyzing the big data in organizations, especially when the data resides on different machines at different locations. This is where the need for Hadoop arises. Hadoop can analyze the data on different machines at different locations very quickly and very cost-effectively. It uses the concept of MapReduce, which allows it to divide a query into smaller pieces and process them in parallel. This is also known as parallel computing.
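The divide-the-query idea can be sketched in a few lines of plain Python. This is a stand-in for MapReduce's behavior, not Hadoop code; the "machines" here are just threads on one box:

```python
# Minimal sketch of divide-and-process: split a "query" (counting matching
# records) into chunks and process the chunks in parallel, then combine.
from concurrent.futures import ThreadPoolExecutor

records = list(range(1_000))          # stand-in for data spread across machines

def count_evens(chunk):               # the work done on one "machine"
    return sum(1 for r in chunk if r % 2 == 0)

chunks = [records[i::4] for i in range(4)]   # divide the job into 4 pieces
with ThreadPoolExecutor(max_workers=4) as ex:
    partials = list(ex.map(count_evens, chunks))

print(sum(partials))   # 500, the same answer a serial scan would give
```

Combining the partial answers at the end corresponds to the reduce step; each chunk's scan is independent, which is what makes the parallelism safe.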
 


Latest Hadoop Interview Questions -Part 6

1.   What is Hadoop ?

• Apache Hadoop is a software framework (open source) which promotes data-intensive distributed applications.

• The entire Hadoop platform consists of the Hadoop kernel, the MapReduce component, and HDFS (Hadoop distributed file system)

• Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.

• The most well known technology used for Big Data is Hadoop

• Two languages are identified as original Hadoop languages: PIG and Hive.

• In a Hadoop system, the data is distributed across thousands of nodes and processed in parallel 

• Hadoop deals with complexities of high volume, velocity & variety of data

• Batch-centric processing is a core strength of Hadoop

• Hadoop can store petabytes of data reliably

• Accessibility is ensured even if any machine breaks down or is thrown out from network.

• One can use MapReduce programs to access and manipulate the data. The developer need not worry about where the data is stored; he/she can reference the data through a single view provided by the Master Node, which stores the metadata of all the files stored across the cluster.


2.   What is Big Data?

Big Data is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. Such data is difficult to capture, mine, and manage using traditional methods. There is so much hype in this space that there could be an extended debate just about the definition of big data.

Big Data technology is not restricted to large volumes. As of 2012, clusters that are big are in the 100-petabyte range.

Traditional relational databases, like Informix and DB2, provide proven solutions for structured data, and via extensibility they can also manage unstructured data. The Hadoop technology brings new, more accessible programming techniques for working on massive data stores with both structured and unstructured data.


3.   Advantages of Hadoop

Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.

Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.

Linear Scalability: Every parallel technology makes claims about scale-up. Hadoop has genuine scalability, since the latest release expands the limit on the number of nodes to beyond 4,000.

Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. The Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.


4.   Definition of Big data

According to Gartner, Big data can be defined as high volume, velocity and variety information requiring innovative and cost effective forms of information processing for enhanced decision making.

5.   How Big data differs from database ?

Datasets that are beyond the ability of a database to store, analyze, and manage can be defined as Big Data. Big Data technology extracts the required information from very large volumes, whereas a database is limited in the storage area it can handle.


6.   3 V of Big data - Explain (Important)

Big data can be defined with the help of 3 V (Volume, Velocity and Variety).

Volume: It describes the amount of data that is generated by organizations or individuals. Thus, it denotes the storage area limit.

Velocity: It describes the frequency at which the data is generated, changed, processed and shared. Thus, it denotes any access to the data in a specified time.

Variety: The data can be Structured or Unstructured or Semi-structured data.

The above 3 Vs were once considered sufficient to define big data, but nowadays one more V (Value) is often added.

Value: It is the outcome or ability of analysing big data which will leverage the business.

7.   Who are all using Hadoop? Give some examples.

• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!


8.   Hadoop Stack - Structure

[Image: Hadoop stack structure]



9.   Pig for Hadoop - Give some points

Pig is Data-flow oriented language for analyzing large data sets.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility.
Users can create their own functions to do special-purpose processing.

Features of Pig:

– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
– developed at Yahoo!
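To make the data-flow style concrete, here is a hypothetical Pig-Latin-like pipeline (sketched in the comments; the syntax shown there is only approximate) alongside an equivalent flow written in plain Python:

```python
# The Pig data-flow style (LOAD -> FILTER -> GROUP -> COUNT) mirrored in
# plain Python. The commented pseudo-Pig-Latin lines are illustrative only.
from collections import Counter

# raw = LOAD 'visits' AS (user, url);
raw = [("alice", "/home"), ("bob", "/home"), ("alice", "/about"),
       ("carol", "/home")]

# hits = FILTER raw BY url == '/home';
hits = [(user, url) for user, url in raw if url == "/home"]

# grouped = GROUP hits BY url;
# counts  = FOREACH grouped GENERATE group, COUNT(hits);
counts = Counter(url for _, url in hits)
print(counts["/home"])  # 3
```

Each step names a new relation derived from the previous one, which is what lets Pig plan the whole flow as a sequence of MapReduce jobs.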


10.   Hive for Hadoop - Give some points

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY, etc.
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook


Latest Hadoop Interview Questions -Part 7

11.   Hadoop HDFS architecture

[Image: Hadoop HDFS architecture]



12.   Map Reduce in Hadoop

Map reduce:
It is a framework for processing huge datasets in parallel across a large number of computers, referred to as a cluster. It involves two processes, namely Map and Reduce.

[Image: MapReduce in Hadoop]

Map Process:
In this process the input is taken by the master node, which divides it into smaller tasks and distributes them to the worker nodes. The worker nodes process these subtasks and pass the results back to the master node.

Reduce Process:
In this process the master node combines all the answers provided by the worker nodes to obtain the result of the original task. The main advantage of MapReduce is that the map and reduce steps are performed in distributed mode; since each map operation is independent, all maps can run in parallel, reducing the net computing time.


13.   What is a heartbeat in HDFS?

A heartbeat is a signal indicating that a node is alive. A data node sends heartbeats to the Name node, and a task tracker sends its heartbeats to the job tracker. If the Name node or job tracker does not receive the heartbeats, it concludes that there is some problem with the data node, or that the task tracker is unable to perform its assigned task.
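The timeout logic behind heartbeat monitoring can be sketched as follows. This is illustrative only; Hadoop's actual heartbeat intervals and failure checks differ:

```python
# Sketch of heartbeat-based failure detection: the master records the last
# heartbeat time per node and declares a node dead once it stays silent
# longer than the timeout.

def dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

last = {"datanode1": 100.0, "datanode2": 91.0, "datanode3": 99.5}
print(dead_nodes(last, now=101.0, timeout=3.0))  # ['datanode2']
```

Once a node lands on this list, the master stops assigning it work and re-replicates or reschedules whatever it was responsible for.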



14.   What is metadata?

Metadata is the information about the data stored in the data nodes, such as the location of the file, the size of the file, and so on.



15.   What is a Data node?

Data nodes are the slaves which are deployed on each machine and provide the actual storage. 
These are responsible for serving read and write requests for the clients.



16.   What is a Name node?

Name node is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and a single point of failure in HDFS.



17.   Is Namenode also a commodity?

No. 
Namenode can never be commodity hardware because the entire HDFS relies on it. 
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.



18.   Can Hadoop be compared to NOSQL database like Cassandra?

Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database. It’s a filesystem (HDFS) and a distributed programming framework (MapReduce).



19.   What is Key value pair in HDFS?

Key value pair is the intermediate data generated by the maps and sent to the reducers for generating the final output.



20.   What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.


Latest Hadoop Interview Questions -Part 8

21.   What is a rack?

A rack is a storage area in which datanodes are put together. A rack is a physical collection of datanodes stored at a single location, and there can be multiple racks in a single data center.


22.   How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS keeps storing the last part of the data, which points to where the next part of the data will be. In fact, this is the basis of HDFS.
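The idea can be modeled in a few lines of Python. This toy sketch mimics block splitting and offset bookkeeping only; it is not HDFS's real on-disk format:

```python
# Toy model of block-based storage: a file is cut into fixed-size blocks and
# the metadata records each block's offset, so the "next part" of the data
# is always locatable from the previous one.

def split_into_blocks(data, block_size):
    blocks = []
    for offset in range(0, len(data), block_size):
        blocks.append({"offset": offset,
                       "data": data[offset:offset + block_size]})
    return blocks

blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(len(blocks))                    # 3 blocks: 4 + 4 + 2 bytes
print([b["offset"] for b in blocks])  # [0, 4, 8]
```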


23.   History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: 
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.”

Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.


24.   What is meant by Volunteer Computing?

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.

25.   How Hadoop differs from SETI (Volunteer computing)?

Although SETI (Search for Extra-Terrestrial Intelligence) may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.


26.   Compare RDBMS and MapReduce

Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear


27.   What is HBase?

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

28.   What is ZooKeeper?

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.


29.   What is Chukwa?

A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)


30.   What is Avro?

A data serialization system for efficient, cross-language RPC, and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)