21. What is a rack?
A rack is a physical collection of datanodes stored together at a single location, typically sharing the same network switch. Different racks can be physically located at different places, and there can be multiple racks in a single location.
22. How is indexing done in HDFS?
Hadoop has its own way of indexing. Data is stored in blocks of a configured size, and once a block is written, HDFS stores, along with the last part of the data, a pointer to where the next part of the data is stored. The namenode keeps this block-level metadata for every file; in fact, this is the base of HDFS.
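As a rough illustration, the sketch below (assuming a reachable HDFS cluster; the file path /data/sample.txt is hypothetical) uses the standard FileSystem.getFileBlockLocations() API to list the block metadata that the namenode keeps for a file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockIndexDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connect to the default (HDFS) filesystem
        Path file = new Path("/data/sample.txt");   // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block: its offset, length, and the datanodes holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Each BlockLocation reports where one part of the file starts, how long it is, and which datanodes hold its replicas.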
23. History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene,
the widely used text search library. Hadoop has its origins in Apache Nutch, an
open source web search engine, itself a part of the Lucene project.
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.
24. What is meant by Volunteer Computing?
Volunteer computing projects work by breaking the problem they are
trying to solve into chunks called work units, which are sent to computers
around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.
25. How does Hadoop differ from SETI (volunteer computing)?
Although SETI (Search for Extra-Terrestrial Intelligence) may be
superficially similar to MapReduce (breaking a problem into independent pieces
to be worked on in parallel), there are some significant differences. The
SETI@home problem is very CPU-intensive, which makes it suitable for running on
hundreds of thousands of computers across the world, since the time to transfer a work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
26. Compare RDBMS and MapReduce
Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear
27. What is HBase?
A distributed, column-oriented database. HBase uses HDFS for its
underlying storage, and supports both batch-style computations using MapReduce
and point queries (random reads).
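As a rough sketch of a point query (random read), the example below assumes an already-created HBase table; the table name "users", column family "info", and qualifier "email" are hypothetical, and the older HTable client API is used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "users");            // hypothetical table name
        try {
            Get get = new Get(Bytes.toBytes("user_1001"));    // row key to look up
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
            Result result = table.get(get);                   // single random read
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}

Batch-style access, by contrast, would typically feed whole table scans into a MapReduce job rather than reading single rows.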
28. What is ZooKeeper?
A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for building
distributed applications.
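A minimal sketch of such a primitive is shown below, assuming a ZooKeeper ensemble reachable at localhost:2181 (the address and the znode path /locks/my-lock are illustrative): the client that succeeds in creating an ephemeral znode holds the lock, and the znode disappears automatically if that client's session dies. Real lock recipes are more elaborate (sequential znodes and watches), but the idea is the same.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (hypothetical address); no watcher logic needed for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        String lockPath = "/locks/my-lock";   // hypothetical lock znode
        try {
            // EPHEMERAL: the znode is deleted automatically if this client's session ends.
            zk.create(lockPath, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Lock acquired, doing work...");
            // ... critical section ...
            zk.delete(lockPath, -1);          // release the lock explicitly
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lock is held by another client.");
        } finally {
            zk.close();
        }
    }
}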
29. What is Chukwa?
A distributed data collection and analysis system. Chukwa runs
collectors that store data in HDFS, and it uses MapReduce to produce reports.
(At the time of this writing, Chukwa had only recently graduated from a
“contrib” module in Core to its own subproject.)
30. What is Avro?
A data serialization system for efficient, cross-language RPC, and
persistent data storage. (At the time of this writing, Avro had been created
only as a new subproject, and no other Hadoop subprojects were using it yet.)
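A minimal sketch of Avro's generic API is shown below (using a later Avro release than the one described above; the "User" schema is made up for illustration). It serializes one record to Avro binary and reads it back with the same schema.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a "User" record with a name and an id.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"id\",\"type\":\"long\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("id", 42L);

        // Serialize the record to Avro binary.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize it again using the same schema.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord copy = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(copy.get("name") + " / " + copy.get("id"));
    }
}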