Describes the considerations for managing the chunk size for map tasks.
Files in the HPE Ezmeral Data Fabric filesystem are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important:
Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether using the file APIs or over NFS for the HPE Ezmeral Data Fabric, use the chunk size specified by the settings for the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file keeps its old chunk size settings. Further writes to the file use the file's current chunk size.
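For illustration, the following sketch sets a chunk size on a directory and then writes a file that inherits it; the /projects/test path and data.csv file name are only placeholders:
# Set a 128 MB chunk size (134217728 bytes, a valid multiple of 65,536) on a directory.
hadoop mfs -setchunksize 134217728 /projects/test
# A file written afterward inherits the directory's 128 MB chunk size.
hadoop fs -put data.csv /projects/test/data.csv
# hadoop mfs -ls reports the chunk size of each file; files written before
# the change keep their original chunk size.
hadoop mfs -ls /projects/test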
If the chunk size for a directory is set to zero (0), when an application makes a request for the block size, HPE Ezmeral Data Fabric returns 1073741824 (1 GB); however, hadoop mfs commands continue to display 0 for the chunk size.
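For example (a sketch; the /projects/archive path is hypothetical):
# Set the chunk size of a directory to zero.
hadoop mfs -setchunksize 0 /projects/archive
# Applications that request the block size now receive 1073741824 (1 GB),
# but hadoop mfs output still displays 0 for the chunk size.
hadoop mfs -ls /projects/archive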
Chunk size also affects parallel processing and random disk I/O during MapReduce applications. A
higher chunk size means less parallel processing because there are fewer map inputs, and
therefore fewer mappers. A lower chunk size improves parallelism, but results in higher
random disk I/O during shuffle because there are more map outputs. Set the
io.sort.mb parameter to a value between 120% and 150% of the chunk
size.
For example:
- If the chunk size is 256 MB (the default), set the io.sort.mb parameter to the default 380 MB.
- If the chunk size is 128 MB, set the io.sort.mb parameter to 190 MB.
- If the chunk size is greater than 256 MB, the io.sort.mb parameter should be at least 380 MB.
If you use Drill, set the store.parquet.block-size parameter in Drill so that the Parquet block size is the same as the chunk size in the HPE Ezmeral Data Fabric filesystem. See Configuring the Parquet Block Size for more information.
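As a sketch, an io.sort.mb value from the guidelines above can also be passed to a single job through Hadoop's generic -D option; the JAR name, main class, and paths below are placeholders:
# Run a job with io.sort.mb sized at roughly 150% of a 256 MB chunk size.
# The -D option is parsed only by jobs that handle generic options
# (for example, jobs launched through ToolRunner).
hadoop jar my-job.jar com.example.MyJob -Dio.sort.mb=380 /input /output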
You can set the chunk size for a given directory in two ways:
- Change the ChunkSize attribute in the .dfs_attributes file at the top level of the directory.
- Use the command hadoop mfs -setchunksize <size> <directory>.
For example, if the volume test is NFS-mounted at
/mapr/my.cluster.com/projects/test you can set the
chunk size to 268,435,456 bytes by editing the file
/mapr/my.cluster.com/projects/test/.dfs_attributes and
setting ChunkSize=268435456. To accomplish the same
thing from the hadoop shell, use the following
command:
hadoop mfs -setchunksize 268435456 /mapr/my.cluster.com/projects/test
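To verify the new setting, you can read the .dfs_attributes file back over NFS, or list the directory with hadoop mfs (a sketch; -lsd displays a directory's own attributes, including chunk size, in releases that support it):
cat /mapr/my.cluster.com/projects/test/.dfs_attributes
hadoop mfs -lsd /mapr/my.cluster.com/projects/test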