Problems with small files and HDFS
A small file is one which
is significantly smaller than the HDFS block size. If you’re storing small
files, then you probably have lots of them (otherwise you wouldn’t turn to
Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the name node's memory, each of which occupies about 150 bytes, as a rule of thumb. So 10 million files, each using a block, amount to roughly 20 million objects (one per file and one per block) and would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware; a billion files is certainly not feasible.
Furthermore, HDFS is not geared up to accessing small files efficiently: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from data node to data node to retrieve each small file, all of which adds up to an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into sixteen 64MB blocks with 10,000 or so 100KB files. The 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features that help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit, which can run more than one split per map.
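As a rough illustration of the JVM reuse knob (old mapred API; the class name here is just a placeholder, not code from this post):

// Minimal sketch: enable JVM reuse so one JVM runs many map tasks
// instead of paying the startup cost once per small file.
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseExample.class);
        // -1 means "reuse the JVM for an unlimited number of tasks";
        // any positive value caps the number of tasks per JVM.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // equivalently: conf.setNumTasksToExecutePerJvm(-1);
        // ... set input/output formats and paths, then submit the job as usual
    }
}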
In general, Hadoop handles big files very well, but when the files are small it simply passes each small file to its own map() invocation, which is inefficient because it creates a large number of mappers. For example, 1,000 files of 2 to 3 MB each would need 1,000 mappers, which is very inefficient. Having too many small files can therefore be problematic in Hadoop. To solve this problem, we should merge many of these small files into one and then process them. Note that Hadoop is mainly designed for batch-processing a large volume of data rather than processing many small files. The main purpose of solving the small files problem is to speed up the execution of a Hadoop program by combining small files into bigger files, which shrinks the number of map() invocations executed and hence improves the overall performance of a Hadoop job.
The small files problem is twofold.
1) Small File problem in HDFS:
Storing lots of small files that are much smaller than the block size cannot be handled efficiently by HDFS. Reading through small files involves lots of seeks and lots of hopping from data node to data node, which in turn makes for inefficient data processing.
In the name node's memory, every file, directory, and block in HDFS is represented as an object of roughly 150 bytes, so a small file that occupies its own block costs around 300 bytes of metadata. Around 20 million such small files, each using a separate block, would therefore consume about 6 gigabytes of memory. With the hardware limitations we have, scaling up beyond this level is a problem: with a lot of files, the memory required to store the metadata is high and cannot scale beyond a limit.
2) Small File problem in MapReduce:
In MapReduce, a map task processes a block of data at a time. Many small files mean lots of blocks, which means lots of tasks and lots of bookkeeping by the Application Master. This slows overall cluster performance compared with processing large files.
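To make the effect concrete, the sketch below (new mapreduce API; the input directory /data/small-files is a hypothetical placeholder) asks the default input format how many splits, and therefore map tasks, a job would get. Each file smaller than a block becomes its own split:

// Sketch: count how many map tasks a directory of small files would produce.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CountSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/data/small-files"));

        // Each file smaller than a block yields one split, i.e. one map task.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("map tasks: " + splits.size());
    }
}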
Why are small files produced?
There are at least two cases:
1. The files are pieces of a larger logical file. Since HDFS has only recently supported appends, a very common pattern for saving unbounded files (e.g. log files) is to write them in chunks into HDFS.
2. The files are inherently small. Imagine a large corpus of images. Each image is a distinct file, and there is no natural way to combine them into one larger file.
These two cases require different solutions. For the first case, where the file is made up of records, the problem may be avoided by calling HDFS's sync() method every so often to continuously write large files. Alternatively, it's possible to write a program to concatenate the small files together, as sketched below.
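A minimal sketch of such a concatenation program, assuming the small files live under a hypothetical /data/small-files directory and that simple byte-level concatenation suits the record format (e.g. newline-delimited logs):

// Sketch: concatenate the files in one directory into a single large file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inDir = new Path("/data/small-files");   // placeholder input dir
        Path outFile = new Path("/data/merged/big-file"); // placeholder output

        try (FSDataOutputStream out = fs.create(outFile)) {
            for (FileStatus status : fs.listStatus(inDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        // copy the small file's bytes, keeping the output open
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }
}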
Solutions to the small files problem:
1. HAR files
2. Sequence Files
3. HBase (if latency is not an issue), and other options as well.
HAR files:
Hadoop Archives (HAR
files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of
files putting pressure on the name node’s memory. HAR files work by building a
layered filesystem on top of HDFS. A HAR file is created using the Hadoop archive
command, which runs a MapReduce job to pack the files being archived into a
small number of HDFS files. To a client using the HAR filesystem nothing has
changed: all of the original files are visible and accessible (albeit using a
har:// URL). However, the number of files in HDFS has been reduced.
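As a rough sketch of the workflow (the archive name files.har and the paths are placeholders): the archive is built with a command along the lines of hadoop archive -archiveName files.har -p /data/small-files /data/archives, and a client can then address it through the har:// scheme, for example to list its contents:

// Sketch: list the contents of a HAR through the layered har:// filesystem.
// har:///... addresses an archive stored on the default filesystem.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path har = new Path("har:///data/archives/files.har"); // placeholder
        FileSystem harFs = har.getFileSystem(conf); // resolves to HarFileSystem

        // The original small files are still visible here, even though HDFS
        // itself now holds only a handful of index and part files.
        for (FileStatus status : harFs.listStatus(har)) {
            System.out.println(status.getPath());
        }
    }
}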
Reading through files in
a HAR is no more efficient than reading through files in HDFS, and in fact may
be slower since each HAR file access requires two index file reads as well as
the data file read (see diagram). And although HAR files can be used as input
to MapReduce, there is no special magic that allows maps to operate over all
the files in the HAR co-resident on an HDFS block. It should be possible to build an input format that can take advantage of the improved locality of files in HARs, but it doesn't exist yet. Note that MultiFileInputSplit, even with the
improvements in HADOOP-4565 to choose files in a split that are node local,
will need a seek per small file. It would be interesting to see the performance
of this compared to a Sequence File, say. At the current time HARs are probably
best used purely for archival purposes.
Sequence Files:
The usual response to
questions about “the small files problem” is: use a Sequence File. The idea
here is that you use the filename as the key and the file contents as the
value. This works very well in practice. Going back to the 10,000 100KB files,
you can write a program to put them into a single Sequence File, and then you
can process them in a streaming fashion (directly or using MapReduce) operating
on the Sequence File. There are a couple of bonuses too. Sequence Files are splittable,
so MapReduce can break them into chunks and operate on each chunk
independently. They support compression as well, unlike HARs. Block compression is the best option in most
cases, since it compresses blocks of several records (rather than per record).
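A minimal sketch of that packing step, assuming Text filenames as keys, BytesWritable file contents as values, and placeholder paths:

// Sketch: pack a directory of small files into one block-compressed
// SequenceFile, keyed by filename, valued by the raw file bytes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inDir = new Path("/data/small-files");      // placeholder input
        Path outFile = new Path("/data/packed/files.seq"); // placeholder output

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (FileStatus status : fs.listStatus(inDir)) {
                if (!status.isFile()) continue;
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(0, contents);   // a small file fits in memory
                }
                // key = filename, value = file contents
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        }
    }
}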
It can be slow to convert
existing data into Sequence Files. However, it is perfectly possible to create
a collection of Sequence Files in parallel. (Stuart Sierra has written a very
useful post about converting a tar file into a Sequence File — tools like this
are very useful, and it would be good to see more of them). Going forward, it's best to design your data pipeline to write the data at the source directly into a Sequence File, if possible, rather than writing to small files as an intermediate step.
Unlike HAR files there is
no way to list all the keys in a Sequence File, short of reading through the
whole file. (Map Files, which are like Sequence Files with sorted keys,
maintain a partial index, so they can’t list all their keys either — see
diagram.)
Sequence File is rather Java-centric. TFile is designed to be cross-platform and to be a replacement for Sequence File, but it's not available yet.
HBase
If you are producing lots
of small files, then, depending on the access pattern, a different type of
storage might be more appropriate. HBase stores data in Map Files (indexed
Sequence Files), and is a good choice if you need to do MapReduce style
streaming analyses with the occasional random look up. If latency is an issue,
then there are lots of other choices — see Richard Jones’ excellent survey of
key-value stores.
Apache HBase™ is the
Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when
you need random, real-time read/write access to your Big Data. This project's
goal is the hosting of very large tables -- billions of rows X millions of
columns -- atop clusters of commodity hardware. Apache HBase is an open-source,
distributed, versioned, non-relational database modeled after Google's
Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just
as Bigtable leverages the distributed data storage provided by the Google File
System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and
HDFS.
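As a rough sketch of that access pattern (the table name small_files and the column family f are hypothetical), a small file can be stored as a single cell and fetched back with a random read:

// Sketch: store a small file as a cell in HBase and read it back randomly.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("small_files"))) {

            // Write: the row key is the file name, the bytes go into one cell.
            byte[] contents = Bytes.toBytes("...file bytes...");
            Put put = new Put(Bytes.toBytes("image-0001.png"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), contents);
            table.put(put);

            // Random read: fetch the file back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("image-0001.png")));
            byte[] fetched = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("data"));
            System.out.println("read " + fetched.length + " bytes");
        }
    }
}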