I understand that the block system in HDFS is a logical partition on top of the underlying file system.
But how is the file retrieved when I issue a cat command?
Let's say I have a 1 GB file. My default HDFS block size is 64 MB.
I issue the following command:
hadoop fs -copyFromLocal my1GBfile.db input/data/
The above command copies the file my1GBfile.db from my local machine to the input/data directory in HDFS.
I have 16 blocks to be copied and replicated (1 GB / 64 MB = 16).
If I have 8 datanodes, a single datanode might not have all the blocks needed to reconstitute the file.
When I issue the following command:
hadoop fs -cat input/data/my1GBfile.db | head
What happens now?
How is the file reconstituted? Although blocks are just logical partitions, how is the 1 GB file physically stored? It is stored on HDFS. Does each datanode get some physical portion of the file?
So by breaking the input 1 GB file into 64 MB chunks, we might break something at the record level (say, in the middle of a line). How is this handled?
I checked my datanode and I do see a blk_1073741825 file, which, when opened in an editor, actually displays the contents of the file.
So are the chunks of the file not logical, but a real partition of the data?
Kindly help clarify this.
As far as I understand your question, my answer is as follows.
First of all, you need to understand the difference between the HDFS block size and the InputSplit size.
Block size - An HDFS block (64/128/256 MB) actually contains the data of the original (1 GB) file. Internally, this data is ultimately stored in the (4/8 KB) blocks of the underlying file system (ext, etc.). So an HDFS block is a physical partition of the original file.
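To make the block arithmetic concrete, here is a minimal Python sketch. It is not HDFS code, and the function name `block_layout` is made up for illustration. It assumes the file is cut purely by byte offset into fixed-size blocks, with a possibly shorter last block and no regard for record boundaries.

```python
def block_layout(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds whatever bytes are left over.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
layout = block_layout(1024 * MB, 64 * MB)  # the 1 GB file from the question
print(len(layout))  # 16
```

With the numbers from the question this gives 16 blocks. A file whose size is not a multiple of the block size simply ends with a shorter block.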
InputSplit - A file is broken into input splits, which are logical partitions of the file. A logical partition only holds the address/location information of the blocks. Hadoop uses this logical representation (the input split) of the data stored in file blocks. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends.
In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
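The record-boundary handling can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's actual reader. It follows the convention that, as far as I know, line-oriented readers such as Hadoop's `LineRecordReader` use. The function name `read_split`, the sample data, and the 10-byte block size are all assumptions made for the example. Every split except the first skips its leading (possibly partial) line, and every split keeps reading past its end offset until the line it is in the middle of is complete.

```python
def read_split(data, start, end):
    """Return the newline-terminated records owned by the split [start, end)."""
    pos = start
    if start != 0:
        # Not the first split: the line we land in belongs to the previous
        # split, so skip ahead to just past the next newline.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # A line that starts at or before `end` is ours, even if it runs past
    # `end` into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl + 1
        records.append(data[pos:stop])
        pos = stop
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
block_size = 10  # tiny on purpose, so that lines straddle block boundaries
for start in range(0, len(data), block_size):
    end = min(start + block_size, len(data))
    print((start, end), read_split(data, start, end))
```

Here the first split returns the "alpha" and "bravo" lines (reading two bytes past its end to finish "bravo"), the second returns "charlie" and "delta", and the third returns nothing. Every line is read exactly once even though the byte ranges cut through the middle of lines.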
I hope the above clarifies the difference between block size and input split size.
Now, coming to your question on how 'hadoop fs -cat /' works:
All the information about the locations of the blocks is stored in the NameNode as metadata. The DataNodes send the address/location information of the blocks they hold to the NameNode, whether or not a record happens to be split across blocks.
So, when the client issues the 'cat' command to Hadoop, it basically sends a request to the NameNode: "I want to read fileA.txt, please give me the locations of all the blocks of this file." It is the NameNode's duty to provide the locations of the blocks stored on the various DataNodes.
Based on those locations, the client contacts the DataNodes directly for those blocks. Finally, the client reads the blocks in the same order in which they were stored in HDFS (the NameNode returns the addresses of all the blocks of the file to the client), which yields the complete file on the client.
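The read path described above can be mimicked with a toy model in Python. Nothing here is the real HDFS client API. Plain dictionaries stand in for the NameNode metadata and the DataNode storage, and the path, block IDs, node names, and the `cat` function are all hypothetical.

```python
# Toy "NameNode" metadata: file path -> ordered list of (block id, replica holders).
namenode = {
    "/input/data/fileA.txt": [
        ("blk_1", ["dn1", "dn3"]),
        ("blk_2", ["dn2", "dn3"]),
        ("blk_3", ["dn1", "dn2"]),
    ]
}

# Toy "DataNodes": node name -> {block id: bytes stored on that node}.
datanodes = {
    "dn1": {"blk_1": b"Hello, ", "blk_3": b"blocks!\n"},
    "dn2": {"blk_2": b"HDFS ", "blk_3": b"blocks!\n"},
    "dn3": {"blk_1": b"Hello, ", "blk_2": b"HDFS "},
}


def cat(path):
    """Rebuild a file by fetching its blocks in the order the metadata lists them."""
    out = bytearray()
    # Step 1: ask the "NameNode" where the blocks of the file live.
    for block_id, replicas in namenode[path]:
        # Step 2: read the block straight from one node that holds a replica.
        out += datanodes[replicas[0]][block_id]
    return bytes(out)


print(cat("/input/data/fileA.txt"))
```

No single node in this model holds the whole file, yet concatenating the blocks in metadata order reproduces it. Only block locations pass through the "NameNode"; the bytes themselves come from the "DataNodes".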