Python development MapReduce series WordCount Demo

ringa_lee

Released: 2017-09-17 09:28:38

MapReduce is the core of Hadoop: it is the programming model that Hadoop uses for data processing. A MapReduce job usually splits the input data set into several independent blocks, which are processed by map tasks completely in parallel. The framework first sorts the output of the map tasks and then passes the result to the reduce tasks. Typically, both the input and the output of a job are stored in the file system. Our programming effort therefore centers on the mapper stage and the reducer stage.
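
To make the map, sort, reduce flow concrete, here is a minimal in-process sketch of the three stages. It is not how Hadoop is implemented; it only mimics the data flow on a single machine. The function names map_phase, reduce_phase and word_count are assumptions made for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every space-separated token."""
    for line in lines:
        for word in line.strip().split(' '):
            if word:  # skip empty tokens produced by repeated spaces
                yield (word, 1)


def reduce_phase(sorted_pairs):
    """Sum the counts of each run of equal keys in key-sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Chain the three stages: map, sort by key, reduce."""
    pairs = sorted(map_phase(lines), key=itemgetter(0))
    return list(reduce_phase(pairs))


if __name__ == "__main__":
    sample = ["ni hao ni hao", "hello ni"]
    for word, total in word_count(sample):
        print(word, total)
```

The sort in the middle is what lets the reduce stage see all pairs for one word next to each other.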

Let’s develop a MapReduce program from scratch and run it on a Hadoop cluster.
mapper code map.py:

    import sys

    # Emit one "word<TAB>1" record for every space-separated word on standard input.
    for line in sys.stdin:
        word_list = line.strip().split(' ')
        for word in word_list:
            print('\t'.join([word.strip(), str(1)]))

reducer code reduce.py:

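    import sys

    cur_word = None  # word whose counts are currently being accumulated
    cur_count = 0

    # The input arrives sorted by key, so equal words are adjacent.
    for line in sys.stdin:
        ss = line.strip().split('\t')
        if len(ss) < 2:
            continue

        word = ss[0].strip()
        count = ss[1].strip()
        if cur_word is None:
            cur_word = word
        if cur_word != word:
            # The key changed: emit the finished word and reset the counter.
            print('\t'.join([cur_word, str(cur_count)]))
            cur_word = word
            cur_count = 0

        cur_count += int(count)

    # Emit the last word, unless the input was empty.
    if cur_word is not None:
        print('\t'.join([cur_word, str(cur_count)]))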

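Note that reduce.py only compares each word with the previous one, so it depends on receiving its input sorted by key. The sketch below wraps the same logic in a function so the effect is easy to see. The name reduce_lines is an assumption for this illustration; it is not part of the original scripts.

```python
def reduce_lines(lines):
    """Same run-based logic as reduce.py, applied to any iterable of lines."""
    results = []
    cur_word, cur_count = None, 0
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) < 2:
            continue
        word, count = parts[0].strip(), int(parts[1])
        if cur_word is not None and word != cur_word:
            # The key changed: store the finished word and reset the counter.
            results.append((cur_word, cur_count))
            cur_count = 0
        cur_word = word
        cur_count += count
    if cur_word is not None:
        results.append((cur_word, cur_count))
    return results


if __name__ == "__main__":
    unsorted_pairs = ["ni\t1", "hao\t1", "ni\t1"]
    # Without sorting, "ni" is emitted twice because its pairs are not adjacent.
    print(reduce_lines(unsorted_pairs))
    # After sorting, each word forms one contiguous run and is emitted once.
    print(reduce_lines(sorted(unsorted_pairs)))
```

On the cluster the framework performs this sort between the map and reduce stages; when testing locally we have to do it ourselves.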

Input file src.txt (for testing; remember to upload it to HDFS before running on the cluster):

    hello
    ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao
    Dad would get out his mandolin and play for the family
    Dad loved to play the mandolin for his family he knew we enjoyed singing
    I had to mature into a man and have children of my own before I realized how much he had sacrificed
    I had to,mature into a man and,have children of my own before.I realized how much he had sacrificed

First, test locally to check that the result is correct. Run the following command:

cat src.txt | python map.py | sort -k 1 | python reduce.py

The command prints the following result:

    a    2
    and    2
    and,have    1
    ao    1
    before    1
    before.I    1
    children    2
    Dad    2
    enjoyed    1
    family    2
    for    2
    get    1
    had    4
    hao    33
    haoao    1
    haoni    3
    have    1
    he    3
    hello    1
    his    2
    how    2
    I    3
    into    2
    knew    1
    loved    1
    man    2
    mandolin    2
    mature    1
    much    2
    my    2
    ni    34
    of    2
    out    1
    own    2
    play    2
    realized    2
    sacrificed    2
    singing    1
    the    2
    to    2
    to,mature    1
    we    1
    would    1

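The result contains entries such as and,have, before.I and to,mature because map.py splits on single spaces only, so punctuation stays attached to the words. If that is not what you want, one possible variant is to extract words with a regular expression. This is only a sketch and not the mapper used in this article: the names tokenize and map_line are hypothetical, and the pattern assumes that a word is a run of ASCII letters or apostrophes. Using it would change the counts shown above.

```python
import re

# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(line):
    """Return the words in a line, ignoring punctuation and extra spaces."""
    return WORD_RE.findall(line)


def map_line(line):
    """Build the tab-separated "word<TAB>1" records for one input line."""
    return ['\t'.join([word, '1']) for word in tokenize(line)]


if __name__ == "__main__":
    for record in map_line("I had to,mature into a man and,have children"):
        print(record)
```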

Local testing shows that the code works, so the next step is to run it on the cluster. For convenience, I wrote a script, run.sh, so the job does not have to be submitted by hand each time.

    HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
    STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"

    INPUT_FILE_PATH="/home/input/src.txt"
    OUTPUT_PATH="/home/output"

    $HADOOP_CMD fs -rmr $OUTPUT_PATH

    $HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py

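Let’s analyze the script below:

    HADOOP_CMD: path to the hadoop executable in Hadoop's bin directory
    STREAM_JAR_PATH: path to the streaming JAR
    INPUT_FILE_PATH: input path of the source data on the Hadoop cluster
    OUTPUT_PATH: output path for the results on the Hadoop cluster. (Note: this directory must not exist when the job starts, so the script deletes it first. Important: on the first run the directory does not exist yet, so the delete command reports an error. You can create the output directory manually beforehand.)
    $HADOOP_CMD fs -rmr $OUTPUT_PATH

    $HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py
    # This part has a fixed format: it specifies the input and output paths and the mapper and reducer commands,
    # and distributes the mapper and reducer code files we wrote, because the other nodes of the cluster do not have them yet.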

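If you prefer to drive the job from Python instead of a shell script, the same command can be assembled as an argument list. The following is only a sketch: build_streaming_command is a hypothetical helper, not part of Hadoop, and it uses only the options that already appear in run.sh. It builds and prints the command without executing it.

```python
import shlex


def build_streaming_command(hadoop_cmd, jar_path, input_path, output_path,
                            mapper_file, reducer_file):
    """Hypothetical helper: return the argument list for the job in run.sh."""
    return [
        hadoop_cmd, "jar", jar_path,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + mapper_file,
        "-reducer", "python " + reducer_file,
        # Ship both scripts to the cluster nodes, as run.sh does with -file.
        "-file", "./" + mapper_file,
        "-file", "./" + reducer_file,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/home/hadoop/hadoop/bin/hadoop",
        "/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar",
        "/home/input/src.txt",
        "/home/output",
        "map.py",
        "reduce.py",
    )
    # Print a shell-quoted form of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To submit the job on a machine with Hadoop installed, one could run:
    # import subprocess; subprocess.check_call(cmd)
```

Passing the arguments as a list keeps "python map.py" together as a single argument, which is what the quotes do in the shell script.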

To count the records output by the reduce phase, run the following command:

cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l
Output in the command line: 43

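The same number can be cross-checked without the shell pipeline, because the reducer emits exactly one record per distinct word. Below is a small sketch that counts distinct words directly. The function name count_words and the two sample lines are assumptions for this illustration; it applies the same space-splitting rule as map.py.

```python
from collections import Counter


def count_words(lines):
    """Count space-separated tokens the way map.py and reduce.py do together."""
    counts = Counter()
    for line in lines:
        for word in line.strip().split(' '):
            if word:  # empty tokens never reach the reducer output
                counts[word] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Dad would get out his mandolin",
        "Dad loved to play the mandolin",
    ]
    counts = count_words(sample)
    # The number of distinct words equals the number of reduce output records.
    print(len(counts))
```

To check the real input, pass the lines of src.txt to count_words instead of the sample list.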

Open master:50030 in a browser to view the details of the job.

Kind      % Complete    Num Tasks    Pending    Running    Complete    Killed    Failed/Killed Task Attempts
map       100.00%       2            0          0          2           0         0 / 0
reduce    100.00%       1            0          0          1           0         0 / 0

The Map-Reduce Framework section of the page shows this counter:

Counter                  Map    Reduce    Total
Reduce output records    0      0         43

The total of 43 reduce output records matches the local line count, which shows that the whole process succeeded. Our first Hadoop program is complete.
