combining files and no dupes

0 votes
25 views

Having looked at "man join", I wasn't sure of its use here.

There is an unknown number of files; the only constant is the extension .list
(for testing purposes I am using only two).

cat *.list >> output.joined | sort -u

How can I test that output.joined is indeed
the two lists combined, with dupes removed?

posted May 14, 2013 by anonymous

2 Answers

0 votes

Give

sort output.joined | uniq --all-repeated
a try. If the output is empty, there are no dupes.

answer May 14, 2013 by anonymous
0 votes

1) You probably want '>' rather than '>>' if you're only running this once. It makes no difference on a first run, but '>>' appends, so running the command again would add the lists to output.joined a second time.

2) Since you're sending the output of 'cat' to a file, the pipe won't get any input, so you're sorting nothing. If you actually want to capture the output in a file, you can use 'tee':

sort *.list | tee output | sort -u

or just run the two commands separately:

cat *.list > output
sort -u < output

If not, then "cat *.list|sort -u" is enough.
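
To check the result, one option is to test two things: that no line in the output repeats, and that every line from the inputs made it into the output. The following is only a sketch under assumptions: the scratch directory and the sample files a.list and b.list are made up for illustration, and it writes the de-duplicated result to output.joined with "sort -u" instead of "cat".

```shell
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'apple\nbanana\ncherry\n' > a.list
printf 'banana\ncherry\ndate\n' > b.list

# Combine every .list file and drop duplicate lines in one step.
sort -u -- *.list > output.joined

# Check 1: uniq -d prints one copy of each repeated line,
# so empty output means no dupes.
dupes=$(sort output.joined | uniq -d)

# Check 2: comm -23 prints lines that appear only in the first input,
# i.e. lines from the .list files that are missing from output.joined.
missing=$(sort -u -- *.list | comm -23 - output.joined)

if [ -z "$dupes" ] && [ -z "$missing" ]; then
    echo "output.joined looks right"
else
    echo "output.joined has problems"
fi
```

Both checks rely on sorted input, which is why output.joined is written by sort rather than by cat.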

answer May 14, 2013 by anonymous
Similar Questions
+1 vote

I have a roughly 5 GB file where each row is a key, value pair. I would like to use this as a "hashmap" against another large set of files. From searching around, one way to do it would be to turn it into a dbm like DBD and put it into a distributed cache. Another is by joining the data. A third one is putting it into HBase and using it for lookup.

I'm more familiar with the first approach, so it seems simpler to me. However, I have read that using a distributed cache for files beyond a few megabytes is not recommended because the file is replicated across all the data nodes. This doesn't seem that bad to me because I just pay this overhead once at the beginning of the job, and then each node gets a copy locally, right? If I were to go with a join, would it not increase the workload (more entries) and create the same network congestion issue? And wouldn't going with HBase mean making it a bottleneck?

What are the advantages and disadvantages of each solution compared with the others? What if, for example, that "hashmap" needs to come from, say, a 40 GB file? How would my options change? At which point would each option make sense?

+1 vote

I am facing some difficulty using join to display the array elements. Here is the code snippet

[code]
use strict;
use warnings;
my @fruits = qw/apple mango orange banana guava/;
#print '[', join '][', @fruits;
#print ']';
print '[', join '][', @fruits, ']';
[/code]

[output]
      [apple][mango][orange][banana][guava][]
[/output]

How can I eliminate the last empty square brackets [] from the output using a single print statement? I used two print statements, as shown in the commented-out lines of the snippet above. Any help is greatly appreciated.

+1 vote

I am trying to profile an application on the PowerPC architecture.
While profiling, the following error occurred when running the opreport command:

#opcontrol --start
#./exec
#opcontrol --stop
#opcontrol --dump
#opreport

opreport error : no sample files found.try using opcontrol --dump

