
Hadoop Study Notes (14): Using Pig

 

This post uses Pig for a very simple statistic:
given one nginx access log of a website, find which IPs hit the site frequently within a single day.

I did a similar count with awk before; see my earlier post for details.

First, the nginx log format looks like this:

121.42.0.88 - - [10/May/2016:03:23:04 +0800] "GET /index.html HTTP/1.1" 500 594 "http://img.zuobin.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;Alibaba.Security.Heimdall.1346402)"
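
Because the fields in this format are separated by spaces, the client IP is simply the first field of each line. A minimal shell sketch of that idea, run on the sample line above (`first_field` is a hypothetical helper name, not something from the original setup):

```shell
# Sample line copied from the log format shown above.
line='121.42.0.88 - - [10/May/2016:03:23:04 +0800] "GET /index.html HTTP/1.1" 500 594 "http://img.zuobin.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;Alibaba.Security.Heimdall.1346402)"'

# first_field (hypothetical helper): print the first space-delimited
# field of the string passed as $1.
first_field() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

first_field "$line"
```

The Pig script later in this post relies on the same property: it splits each line on a space with PigStorage(' ') and keeps only the first column.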

# Start HDFS and YARN first
./start-dfs.sh
./start-yarn.sh
# Then, on the NameNode, run: mr-jobhistory-daemon.sh start historyserver
# This starts the JobHistoryServer service on the NameNode; job run details can then be checked in the history server's logs
./mr-jobhistory-daemon.sh start historyserver

Pig operations:

# First, copy the log into HDFS
grunt> copyFromLocal /home/hadoop/access.log .
grunt> ls
hdfs://172.16.22.251:9005/user/hadoop/pig/access.log<r 1>   1272021

# Load the log split on spaces, keep the IP column, group by IP, and count the rows in each group
grunt> a = load 'access.log'
>> using PigStorage(' ')
>>  AS (ip,a1,a2,a3,a4,a5,a6,a7,a8);
grunt> b = foreach a generate ip;
grunt> c = group b by ip;
grunt> d = foreach c generate group,COUNT($1);
grunt> dump d;
# (part of the log output omitted)
2016-06-14 11:52:41,081 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-06-14 11:52:41,081 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:08,418 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-06-14 11:53:08,418 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:20,492 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:21,499 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:21,531 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:22,864 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:22,870 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:22,945 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:22,953 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:23,031 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-06-14 11:53:23,033 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.7.2   0.16.0  hadoop  2016-06-14 11:52:39     2016-06-14 11:53:23     GROUP_BY

Success!

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTime      AvgMapTime      MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime        Alias   Feature   Outputs
job_1464339368801_0004  1       1       16      16      16      16      9       9       9       9       a,b,c,d GROUP_BY,COMBINER       hdfs://172.16.22.251:9005/tmp/temp-942287448/tmp1415989805,

Input(s):
Successfully read 5959 records (1272400 bytes) from: "hdfs://172.16.22.251:9005/user/hadoop/pig/access.log"

Output(s):
Successfully stored 335 records (7146 bytes) in: "hdfs://172.16.22.251:9005/tmp/temp-942287448/tmp1415989805"

Counters:
Total records written : 335
Total bytes written : 7146
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1464339368801_0004

(1.192.26.82,8)
(119.5.161.3,7)
(121.42.0.34,2)
(121.42.0.35,4)
(121.42.0.54,4)
(121.42.0.63,1)
(121.42.0.71,6)
(121.42.0.82,2)
(121.42.0.86,2)
(121.42.0.88,1186)
(212.14.50.9,7)
(36.33.5.181,7)
(5.178.86.75,3)
(5.178.86.76,1)
(80.82.78.38,1)
(1.214.197.21,8)
(106.41.97.65,23)
(111.13.65.57,123)
(112.86.137.2,21)
(114.113.31.2,5)
(115.60.76.48,10)
(117.78.38.44,8)
(117.78.38.50,11)
(117.78.39.92,8)
(117.78.41.30,6)
(117.78.41.31,9)
(117.78.42.39,12)
(117.78.44.26,9)
(119.29.62.87,7)
(121.43.57.44,8)
(122.96.252.7,8)
(124.133.28.7,73)
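
The dump above is not sorted by count, so the heavy hitters have to be picked out by eye; in the portion shown, 121.42.0.88 stands out with 1186 requests. As a local cross-check, the same per-IP count can be sketched in shell with awk and sorted in descending order. This is only an illustrative sketch: `top_ips` is a hypothetical helper name, and it assumes the local copy of the log is still available.

```shell
# top_ips (hypothetical helper): count requests per client IP in an
# nginx access log and print the most frequent IPs first.
#   $1 = path to the log file
#   $2 = number of rows to print (defaults to 10)
top_ips() {
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$1" |
        sort -k1,1nr -k2,2 |
        head -n "${2:-10}"
}

# Example call, assuming the local file that was copied into HDFS earlier:
# top_ips /home/hadoop/access.log 10
```

Each output line is a count followed by the IP, so the first line should correspond to the largest count in the Pig result.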

