Is HDFS data transfer faster than SCP-based transfer?

+1 vote
504 views

I have a use case that requires transferring input files from remote storage over the SCP protocol (using the jSCH jar). To optimize this use case, I have pre-loaded all my input files into HDFS and modified the use case so that it copies the required files from HDFS instead. So, when the tasktrackers run, they copy the required input files from HDFS to their local directories.

All my tasktrackers are also datanodes. I can see that my use case now runs faster. The only modification in my application is that files are copied from HDFS instead of being transferred over SCP. My use case also involves parallel operations (run in the tasktrackers) that do a lot of file transfer, and all of these transfers are now HDFS copies.
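Roughly, what each map task now does is the following (a minimal sketch using the Hadoop FileSystem API; the paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLocalCopy {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS in the Configuration should point at the cluster's namenode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy one ~16 MB input file from HDFS into the task's local working directory
        Path src = new Path("/input/file-0001.dat");            // illustrative HDFS path
        Path dst = new Path("/tmp/task-local/file-0001.dat");   // illustrative local path
        fs.copyToLocalFile(src, dst);

        fs.close();
    }
}
```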

Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it because it uses TCP/IP? Can anyone give me sound reasons that explain the decrease in time?

posted Jan 24, 2014 by Deepti Singh

Is it a single file? Lots of files? How big are the files? Is the copy on a single node or are you running some kind of a MapReduce program?
It is not a single file; there are lots of small files. The files are stored in HDFS and the map operations copy the required files from HDFS. Only one map process runs on each node. Each file is about 16 MB.

2 Answers

+1 vote

When you write data into HDFS, the client streams roughly 64 KB of data at a time and pushes it through the datanode pipeline, and this continues until a full block is written (64 MB by default, a size the client can configure).

SCP, on the other hand, tries to buffer the entire file. Passing small chunks of data is faster than passing one large buffer.

Please check how writes happen in HDFS; that will give you a clearer picture.
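To make that concrete, here is a minimal sketch of an HDFS write that sets the block size explicitly (the path, replication factor and sizes are illustrative, not the only valid choices):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 64L * 1024 * 1024; // per-file block size chosen by the client
        short replication = 3;              // how many datanodes each block is pipelined to
        int bufferSize = 64 * 1024;         // data leaves the client in small chunks

        FSDataOutputStream out =
                fs.create(new Path("/input/sample.dat"), true, bufferSize, replication, blockSize);
        out.write("example payload".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}
```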

answer Jan 25, 2014 by Abhay Kulkarni
+1 vote

There is a lot of difference here. Both do use TCP underneath, but note that SCP encrypts the data in transit, while a stock HDFS configuration does not.

You can also ask SCP to compress the transfer via the "-C" argument (in case you had not already applied that before testing); it may narrow the difference. The encryption algorithm can also be switched to a weaker, cheaper one via "-c arcfour" if security is not a concern during the transfer.
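Since the question mentions jSCH, here is a rough sketch of what the same tuning might look like on that side; the host, credentials and cipher list are illustrative assumptions, not tested settings:

```java
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class ScpTuningSketch {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "remote-host", 22); // illustrative credentials

        // Roughly the jSCH equivalent of "scp -C": allow zlib compression in both directions
        session.setConfig("compression.s2c", "zlib@openssh.com,zlib,none");
        session.setConfig("compression.c2s", "zlib@openssh.com,zlib,none");

        // Roughly the equivalent of "-c arcfour": prefer a cheaper cipher when security allows
        session.setConfig("cipher.s2c", "arcfour,aes128-ctr");
        session.setConfig("cipher.c2s", "arcfour,aes128-ctr");

        session.setConfig("StrictHostKeyChecking", "no"); // demo only; verify host keys in real use
        session.setPassword("secret");                    // illustrative
        session.connect();
        // ... open a channel and run the scp transfer here ...
        session.disconnect();
    }
}
```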

answer Jan 25, 2014 by anonymous
Similar Questions
+1 vote

When a user uploads a file from local disk to HDFS, can I make it partition the file into blocks based on its content?

Meaning, if I have a file with one integer column, can I say that I want an HDFS block to contain only even numbers?

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I cannot get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log, but when I submit a MapReduce job, I cannot see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or a log that contains this information through some configuration? If it does, what should I configure?

+1 vote

Can anyone please explain what we mean by STREAMING DATA ACCESS in HDFS?

Data is usually copied to HDFS, and in HDFS the data is split across DataNodes in blocks.
Say, for example, I have an input file of 10240 MB (10 GB) in size and a block size of 64 MB. Then there will be 160 blocks.

These blocks will be distributed across DataNodes. Now the mappers will read data from these DataNodes keeping the DATA LOCALITY feature in mind (i.e. blocks local to a DataNode will be read by the map tasks running on that DataNode).

Can you please point out where "streaming data access in HDFS" comes into the picture here?

+1 vote

How can I verify the integrity of files copied from HDFS to the local filesystem? Does HDFS store MD5s of full files anywhere? From what I can find, FileSystem.getFileChecksum() is relevant to replication and not to comparison across filesystems.

The Data Integrity section in HDFS Architecture (http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) does not make it clear if, or how, copyToLocal verifies the integrity of the copied file.
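For what it is worth, one workaround (assuming a plain MD5 was recorded before the file was loaded into HDFS) is to recompute the MD5 over the local copy after copyToLocal; a minimal sketch, with an illustrative path:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class LocalMd5Sketch {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");

        // Stream the copied file through the digest without loading it all into memory
        try (InputStream in = new DigestInputStream(
                new FileInputStream("/tmp/copied-file.dat"), md5)) { // illustrative local path
            byte[] buf = new byte[64 * 1024];
            while (in.read(buf) != -1) {
                // reading advances the digest
            }
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("md5=" + hex);
    }
}
```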

...