Hadoop streaming - Class not found

+1 vote
177 views

I am trying to run an executable using Hadoop streaming 2.4.

My executable is my mapper, which is a Groovy script. The script uses a class from a jar file, which I am sending via the -libjars argument.

Hadoop streaming spawns the maps from an input file, with each line feeding one map. Although Hadoop ultimately executes the use case successfully, I see that some maps fail and are restarted later. The failure is a failure to locate the class: the script has some imports that are not found, even though the classes are all in the jar file.

I am tempted to think that when Hadoop launches the first few map tasks, the jar file is not yet "prepared" and made available to them, so the initial maps fail to locate the class; later, when they are restarted, the class is found and they run smoothly.

Is this correct? If not, can someone explain this behavior, and how I can get around it? Because of the retries, the use case takes a little more time to execute, and I fear that when I expand it this will cause a real performance delay.
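A minimal sketch of the kind of submission involved, assuming hypothetical names (mapper.groovy, mylib.jar) and placeholder paths; the generic options (-libjars, -files) must come before the streaming options, and -files also places the jar in every task's working directory so the external Groovy process can see it:

# Sketch only, not the poster's exact command; mapper.groovy, mylib.jar and
# the input/output paths are placeholders, and Groovy must be installed on
# the task nodes.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -libjars mylib.jar \
    -files mapper.groovy,mylib.jar \
    -input /user/me/input.txt \
    -output /user/me/output \
    -mapper "groovy -cp mylib.jar mapper.groovy"
# -libjars only affects the Java task classpath; -files localizes both files
# into the task working directory, so the mapper can reference the jar with a
# plain relative path (-cp mylib.jar) instead of relying on -libjars timing.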

posted Jul 23, 2014 by Luv Kumar


Similar Questions
+1 vote

We are currently facing a frustrating Hadoop streaming memory problem. Our setup:

  • our compute nodes have about 7 GB of RAM
  • Hadoop streaming starts a bash script which uses about 4 GB of RAM
  • therefore it is only possible to start one and only one task per node

Out of the box, each Hadoop instance starts about 7 containers with the default settings. Each Hadoop task forks a bash script that needs about 4 GB of RAM; the first fork works, but all following ones fail because they run out of memory. So what we are looking for is a way to limit the number of containers to only one. What we found on the internet:

  • yarn.scheduler.maximum-allocation-mb and mapreduce.map.memory.mb are set to values such that there is at most one container; this means mapreduce.map.memory.mb must be more than half of the node's maximum memory (otherwise there will be multiple containers).

Done right, this gives us one container per node, but it produces a new problem: since our Java process is now using at least half of the maximum memory, the child (bash) process we fork inherits the parent's memory footprint, and since the parent was already using more than half of the total memory, we run out of memory again. If we lower the map memory, Hadoop allocates 2 containers per node, which also run out of memory.
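An illustrative combination of those knobs, with placeholder numbers for a ~7 GB node and a hypothetical heavy_script.sh; the yarn.* caps are cluster-side settings that would live in yarn-site.xml rather than on the command line:

# Illustrative values only, not tested settings.
# Cluster-side caps (yarn-site.xml on each node):
#   yarn.nodemanager.resource.memory-mb  = 7168   (memory YARN may allocate on the node)
#   yarn.scheduler.maximum-allocation-mb = 7168   (largest single container allowed)
# Per-job: ask for one large container but keep the Java heap small, since the
# container limit applies to the whole process tree (JVM plus the forked bash).
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.map.memory.mb=6144 \
    -D mapreduce.map.java.opts=-Xmx512m \
    -files heavy_script.sh \
    -input /in -output /out \
    -mapper heavy_script.sh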

Since this problem is a blocker in our current project, we are evaluating adapting the source code to solve it, as a last resort. Any ideas on this are very much welcome.

+1 vote

Can anyone please explain what we mean by "streaming data access in HDFS"?

Data is usually copied to HDFS, and in HDFS the data is split into blocks across DataNodes.
Say, for example, I have an input file of 10240 MB (10 GB) and a block size of 64 MB. Then there will be 160 blocks.

These blocks will be distributed across the DataNodes. The mappers will then read data from these DataNodes with the data locality feature in mind (i.e. blocks local to a DataNode are read by the map tasks running on that DataNode).
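If it helps to see that split concretely, the block layout of a file can be listed directly; the path below is a placeholder, and with a 64 MB block size the 10 GB file should show up as 160 blocks spread over the DataNodes:

# Show every block of the file and the DataNodes holding its replicas.
hdfs fsck /user/me/input10g.dat -files -blocks -locations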

Can you please point out where "streaming data access in HDFS" comes into the picture here?

+1 vote

I have a test cluster of two machines, with Hadoop installed on both of them. I have configured the Hadoop cluster, but in the admin UI (as in the picture below) I see that both nodes are running on the master machine and the other machine has no Hadoop node.

On the master machine the following services are running:

~$ jps
26310 ResourceManager
27593 Jps
26216 DataNode
26135 NameNode
26557 NodeManager
26701 JobHistoryServer

On the slave machine:

~$ jps
2614 DataNode
2920 Jps
2707 NodeManager

I don't know why the slave is not joining the cluster (it was before). I tried shutting down all servers on both machines, formatting HDFS, and then restarting everything, but that did not help. Any help figuring out what's causing this behavior is appreciated.
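A few commands that may help narrow this down (run from the master; log paths assume the default layout):

# Which NodeManagers have registered with the ResourceManager?
yarn node -list -all
# Which DataNodes have registered with the NameNode?
hdfs dfsadmin -report
# If the slave daemons are running but missing from both lists, their logs
# (under $HADOOP_HOME/logs by default) usually state the reason, e.g. a
# clusterID mismatch after reformatting the NameNode, or fs.defaultFS /
# yarn.resourcemanager.hostname on the slave still pointing at localhost.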

+2 votes

I have a working version of Java 7 installed and I can execute Java programs on the workstation. When I start HDFS, the startup process aborts with the message "JAVA_HOME not set".

OS: Ubuntu 13.04 raring ringtail
Hadoop version: 2.1.1-beta
Java version: java-7-openjdk-amd64
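The start scripts read JAVA_HOME from etc/hadoop/hadoop-env.sh rather than from the interactive shell, so a common fix is to set it there explicitly; the path below is the usual Ubuntu location for openjdk-7-amd64, adjust it to your install:

# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, replace the ${JAVA_HOME}
# pass-through with an explicit path:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64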

0 votes

I have a small problem. I need to integrate a Hadoop web interface with our web application. I just need a Hadoop interface where we can run some Hadoop commands, something like:

1. cat: hadoop dfs -cat (prints the file contents)
2. chgrp: hadoop dfs -chgrp [-R] GROUP URI [URI …]
3. chmod: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI
4. hadoop dfsadmin -setSpaceQuota ********** /user/esammer
5. hadoop dfsadmin -report
6. copyFromLocal: hadoop dfs -copyFromLocal URI

For this I need a web interface. I have already installed Cloudera Manager. I am using this version: Cloudera Enterprise Data Hub Edition Trial 5.1.1 (#82 built by jenkins on 20140725-1608 git: cb9ebb729efc7929e1968b23dc6cf776086e20a7)
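One possible approach (my suggestion, not something the poster or Cloudera Manager prescribes) is to call WebHDFS, the NameNode's REST interface, from the web application instead of shelling out. It requires dfs.webhdfs.enabled=true; the host, port and paths below are placeholders, and the dfsadmin commands have no WebHDFS equivalent, so they would still need the CLI or the Java API:

# cat: read a file (mirrors hadoop dfs -cat)
curl -L "http://namenode:50070/webhdfs/v1/user/me/file.txt?op=OPEN"
# chmod: set permissions (mirrors hadoop dfs -chmod 755)
curl -X PUT "http://namenode:50070/webhdfs/v1/user/me/file.txt?op=SETPERMISSION&permission=755"
# chgrp: change the group (mirrors hadoop dfs -chgrp)
curl -X PUT "http://namenode:50070/webhdfs/v1/user/me/file.txt?op=SETOWNER&group=hadoop"
# copyFromLocal: two-step create (mirrors hadoop dfs -copyFromLocal)
curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/me/new.txt?op=CREATE"
# ...then PUT the file body to the DataNode URL returned in the Location header.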

...