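I came across a blog post by a fellow blogger (whose handle is either 好好玩 or 好玩玩, I'm not sure which) and found a fun MapReduce problem in it: http://www.cnblogs.com/songhaowan/p/7239578.html
I'm not the sharpest, and a big pile of Java code makes me dizzy and anxious, so I worked through the problem in Python instead.
Problem: find common friends. For example, if C is in A's friend list and C is also in B's friend list, then C is a common friend of A and B. Each input line has the form person:friend1,friend2,...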
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
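To make the definition concrete, here is a minimal sketch that computes the common friends of two people as a set intersection, using the lists for A and B from the input above. The `friends` dict and the `common_friends` helper are names made up for this illustration; they are not part of the scripts that follow.

```python
# Friend lists for A and B, copied from the input above.
friends = {
    "A": ["B", "C", "D", "F", "E", "O"],
    "B": ["A", "C", "E", "K"],
}


def common_friends(x, y):
    """Return the sorted friends that appear in both x's and y's lists."""
    return sorted(set(friends[x]) & set(friends[y]))


print(common_friends("A", "B"))  # ['C', 'E']
```

The MapReduce job has to do the same thing for every pair of people, without ever holding the whole input in one place.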
m.py
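#!/home/hadoop/anaconda2/bin/python
# -*- encoding: utf-8 -*-
# Mapper (Python 2): reads "person:friend1,friend2,..." lines and prints
# "pair friend" whenever the current person shares a friend with a person
# seen earlier by this same mapper process.
import sys

result = {}  # person -> friend list, for every line read so far
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0:
        continue  # skip blank lines
    key, vals = line.split(':')
    val = vals.split(',')
    result[key] = val
    if len(result) == 1:
        continue  # first person: nobody to compare against yet
    for i in result[key]:      # each friend of the current person
        for j in result:       # each person read so far
            if i in result[j]:
                # Put the two names in order so each pair has one key.
                # When j == key, nothing is printed.
                if j < key:
                    print j + key, i
                elif j > key:
                    print key + j, i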
r.py
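# -*- encoding: utf-8 -*-
# Reducer (Python 2): collects the mapper's "pair friend" lines by pair.
import sys

result = {}  # pair -> list of common friends
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    k, v = line.split(' ')
    if k in result:
        result[k].append(v)
    else:
        result[k] = [v]
# Print each pair followed by its list of common friends.
for key, val in result.items():
    print key, val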
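Hadoop Streaming hands the reducer its input already sorted, so an alternative to the dict is to group runs of consecutive lines that share a pair. The following is only a sketch under that assumption, using the same "pair friend" line format; `reduce_lines` and `sample` are hypothetical names, and the code uses the `print()` function rather than the Python 2 print statement.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield (pair, friends) for each run of consecutive lines sharing a pair.

    Assumes the lines arrive sorted, as they do on a streaming reducer's stdin.
    """
    records = (line.split() for line in lines if line.strip())
    for pair, group in groupby(records, key=lambda rec: rec[0]):
        yield pair, [rec[1] for rec in group]


# Hypothetical sample of sorted mapper output.
sample = ["AB C", "AB E", "AC D", "AC F"]
for pair, common in reduce_lines(sample):
    print(pair, common)
```

In a real reducer, `sys.stdin` would be passed in place of `sample`. Only one pair's friends are held in memory at a time, which matters once the number of pairs grows.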
The command I ran
hadoop jar /home/hadoop/hadoop-2.7.2/hadoop-streaming-2.7.2.jar \
    -files /home/hadoop/test/m.py,/home/hadoop/test/r.py \
    -input GTHY -output GTHYout \
    -mapper 'python m.py' -reducer 'python r.py'
Job output
packageJobJar: [/tmp/hadoop-unjar2310332345933071298/] [] /tmp/streamjob8006362102585628853.jar tmpDir=null
17/08/31 14:47:59 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.228.200:18040
17/08/31 14:48:00 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.228.200:18040
17/08/31 14:48:00 INFO mapred.FileInputFormat: Total input paths to process : 1
17/08/31 14:48:00 INFO mapreduce.JobSubmitter: number of splits:2
17/08/31 14:48:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1504148710826_0003
17/08/31 14:48:01 INFO impl.YarnClientImpl: Submitted application application_1504148710826_0003
17/08/31 14:48:01 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1504148710826_0003/
17/08/31 14:48:01 INFO mapreduce.Job: Running job: job_1504148710826_0003
17/08/31 14:48:08 INFO mapreduce.Job: Job job_1504148710826_0003 running in uber mode : false
17/08/31 14:48:08 INFO mapreduce.Job: map 0% reduce 0%
17/08/31 14:48:16 INFO mapreduce.Job: map 100% reduce 0%
17/08/31 14:48:21 INFO mapreduce.Job: map 100% reduce 100%
17/08/31 14:48:21 INFO mapreduce.Job: Job job_1504148710826_0003 completed successfully
17/08/31 14:48:21 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=558
        FILE: Number of bytes written=362357
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=462
        HDFS: Number of bytes written=510
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=11376
        Total time spent by all reduces in occupied slots (ms)=2888
        Total time spent by all map tasks (ms)=11376
        Total time spent by all reduce tasks (ms)=2888
        Total vcore-milliseconds taken by all map tasks=11376
        Total vcore-milliseconds taken by all reduce tasks=2888
        Total megabyte-milliseconds taken by all map tasks=11649024
        Total megabyte-milliseconds taken by all reduce tasks=2957312
    Map-Reduce Framework
        Map input records=27
        Map output records=69
        Map output bytes=414
        Map output materialized bytes=564
        Input split bytes=192
        Combine input records=0
        Combine output records=0
        Reduce input groups=69
        Reduce shuffle bytes=564
        Reduce input records=69
        Reduce output records=33
        Spilled Records=138
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=421
        CPU time spent (ms)=2890
        Physical memory (bytes) snapshot=709611520
        Virtual memory (bytes) snapshot=5725220864
        Total committed heap usage (bytes)=487063552
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=270
    File Output Format Counters
        Bytes Written=510
17/08/31 14:48:21 INFO streaming.StreamJob: Output directory: GTHYout
Final result
hadoop@master:~/test$ hadoop fs -text GTHYout/part-00000
BD ['A', 'E']
BE ['C']
BF ['A', 'C', 'E']
BG ['A', 'C', 'E']
BC ['A']
DF ['A', 'E']
DG ['A', 'E', 'F']
DE ['L']
HJ ['O']
HK ['A', 'C', 'D']
HI ['A', 'O']
HO ['A']
HL ['D', 'E']
FG ['A', 'C', 'D', 'E']
LM ['E', 'F']
KO ['A']
AC ['D', 'F']
AB ['C', 'E']
AE ['B', 'C', 'D']
AD ['E', 'F']
AG ['C', 'D', 'E', 'F']
AF ['B', 'C', 'D', 'E', 'O']
EG ['C', 'D']
EF ['B', 'C', 'D', 'M']
CG ['A', 'D', 'F']
CF ['A', 'D']
CE ['D']
CD ['A', 'F']
IK ['A']
IJ ['O']
IO ['A']
HM ['E']
KL ['D']
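One thing stands out in this listing: there is no entry for A and H, even though both of their lists contain C, D, E and O. Every pair shown comes from A to G or from H to O. The most likely cause is that the job ran with 2 input splits while m.py only compares lines read by the same mapper process, so pairs that straddle the two splits are never checked. A common way around this is to work in two passes: first invert the data so each friend maps to the people who list that friend, then emit every pair of those people. The sketch below simulates both passes in plain Python on three input lines; `invert` and `common_friends_by_pair` are hypothetical names, and this has not been run on the cluster.

```python
from collections import defaultdict
from itertools import combinations


def invert(lines):
    """Pass 1: map each friend to the people whose lists contain that friend."""
    owners = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        person, friend_list = line.split(":")
        for friend in friend_list.split(","):
            owners[friend].append(person)
    return owners


def common_friends_by_pair(lines):
    """Pass 2: any two people who list the same friend share that friend."""
    pairs = defaultdict(list)
    for friend, people in sorted(invert(lines).items()):
        for a, b in combinations(sorted(people), 2):
            pairs[a + b].append(friend)
    return dict(pairs)


# Three lines from the input; A and H ended up in different splits in the job.
data = ["A:B,C,D,F,E,O", "B:A,C,E,K", "H:A,C,D,E,O"]
result = common_friends_by_pair(data)
print(result["AH"])  # ['C', 'D', 'E', 'O']
print(result["AB"])  # ['C', 'E']
```

As a streaming job, pass 1 would be one map/reduce step keyed by friend and pass 2 a second step keyed by pair. Because the shuffle brings all records for a key to one reducer, the result would no longer depend on how the input is split.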
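Looking back, I realized I had originally written the code without a single comment. Clearly I'm still a rookie who hasn't built good habits yet.
I've only recently started with big data and I'm not familiar with Java, so progress has been slow. I hope Python's light weight will help me explore more of the big data world.
If you spot any mistakes, please do point them out!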