
Big Data: Hive Installation and a Web Log Analysis Example

Data search and analysis are essential skills for an ops engineer. “In reality, 90+% of MR jobs are generated by Hive SQL.”

1. Theory warm-up

In one sentence: Hive is a distributed data warehouse architecture built on top of the Hadoop file system. It provides data ETL (extract, transform, load) tooling, data storage management, and query and analysis capabilities for large datasets.

Plain-language understanding: Hive manages data stored in HDFS and provides an SQL-based query language on top of it.

Development perspective: Hive is designed to manipulate HDFS data with MapReduce.

Hive core: Hive data management covers:

(1) Metadata storage: Hive stores its metadata in an RDBMS such as Derby or MySQL; clients access the metadata database through the MetaStore server using the Thrift protocol.

(2) Data storage: all Hive data is stored in HDFS, and most queries are executed as MapReduce jobs. Users can organize Hive tables quite freely; they only need to tell Hive the column and row delimiters when creating a table so that Hive can parse the data. Hive has four data models; for example, a table called accesslog can be partitioned by event_day, and Hive's data storage location on HDFS is /user/hive/warehouse.

(3) Data access: three interfaces are provided: the client, the database interface, and the web interface.

2. Installation

Step 1: Install Hadoop. See "Setting up a pseudo-distributed Hadoop cluster and getting started with MapReduce programming" (to be continued).

Step 2: Install Hive.
wget http://mirror.bit.edu.cn/apache/hive/stable/apache-hive-1.1.0-bin.tar.gz

tar zxvf apache-hive-1.1.0-bin.tar.gz
The important parts of the directory layout are:

bin        executables: the hive command-line interface; HiveServer2; the beeline command-line interface
conf       configuration files: hive-site.xml
lib        jar packages
hcatalog   the HCatalog server
Logs go to /tmp/<username>/hive.log (e.g. /tmp/root/hive.log).

Step 3: Hive configuration - set the Hive environment variables
cp conf/hive-env.sh.template conf/hive-env.sh
vim conf/hive-env.sh

Edit:

HADOOP_HOME=/home/tanjiti/hadoop-2.6.0    # change this to your Hadoop installation directory

Step 4: Hive configuration - set where Hive data is stored in Hadoop HDFS

cp conf/hive-default.xml.template conf/hive-site.xml
vim conf/hive-site.xml

Edit:

<property>
    <name>hive.metastore.warehouse.dir</name>  <!-- Hive data storage directory; this is a location on HDFS -->
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>

<property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>  <!-- Hive's scratch (temporary) directory -->
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>

Create the corresponding directories in Hadoop HDFS:

bin/hdfs dfs -mkdir /user/hive/
bin/hdfs dfs -mkdir /user/hive/warehouse
bin/hdfs dfs -chmod g+w /user/hive/warehouse
bin/hdfs dfs -chmod g+w /tmp
Step 5: Hive configuration - store Hive metadata in MySQL

Hive needs to store its metadata in an RDBMS; by default it is configured to use the Derby database.

vim conf/hive-site.xml
The default settings are:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>  <!-- JDBC connection string Hive uses to reach the metastore database -->
    <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>  <!-- class name of the JDBC driver -->
    <description>Driver class name for a JDBC metastore</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>  <!-- database username -->
    <value>APP</value>
    <description>Username to use against metastore database</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>  <!-- database password -->
    <value>mine</value>
    <description>password to use against metastore database</description>
</property>

The Derby JDBC driver ships in lib/derby-10.11.1.1.jar.

Next, we configure MySQL to store the metastore.

1. Install MySQL
apt-get install mysql-server
2. Create an account and set its username and password
create user 'hive'@'%' identified by 'hive';
3. Grant privileges
grant all privileges on *.* to 'hive'@'%' with grant option;
4. Flush privileges so they take effect
flush privileges;
5. Create the database
create database hive;
A lesson learned in blood (added 2015-07-09): you must change the character set of the hive database to latin1, and you must do it when Hive starts up for the first time (otherwise just wait for it to hang the moment you try a drop operation).
alter database hive character set latin1;
6. Configure Hive
vim conf/hive-site.xml

Edit:

<property>
    <name>hive.metastore.local</name>
    <value>true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
</property>

7. Download the MySQL JDBC driver and place it in the lib directory

lib/mysql-connector-java-5.1.7-bin.jar
Step 6: Run the Hive command-line interface
bin/hive
Step 7: A first taste of HiveQL
hive> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';
OK
Time taken: 0.201 seconds
hive> load data local inpath '/home/tanjiti/apache-hive-1.1.0-bin/test.data' overwrite into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, numRows=0, totalSize=25, rawDataSize=0]
OK
Time taken: 0.463 seconds
hive> select * from test;
OK
1    tanjiti
2    kokcc
3    dani
Time taken: 0.218 seconds, Fetched: 3 row(s)

We can also see the data file in Hadoop HDFS; as you can see, every Hive table corresponds to a storage directory in Hadoop.

/hadoop-2.6.0/bin/hdfs dfs -cat /user/hive/warehouse/test/test.data
1,tanjiti
2,kokcc
3,dani
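Related to this table-to-directory mapping: the theory warm-up mentioned partitioning a table such as accesslog by event_day. As a minimal sketch (the accesslog table here is hypothetical, not one of the tables created in this walkthrough), each loaded partition becomes its own subdirectory under the table's directory:

-- hypothetical partitioned table; a loaded partition shows up in HDFS as e.g.
-- /user/hive/warehouse/accesslog/event_day=20150512/
create table accesslog(remote_addr string, request_uri string)
    partitioned by (event_day string)
    row format delimited fields terminated by '\t';

load data local inpath '/tmp/access-20150512.log'
    overwrite into table accesslog partition (event_day = '20150512');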
I really am a bug magnet and ran into a pile of errors.

Error 1:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:230)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Cause: a jline version conflict.

find ../ -name jline*
../hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jline-0.9.94.jar
../apache-hive-1.1.0-bin/lib/jline-2.12.jar

Solution: http://stackoverflow.com/questions/28997441/hive-startup-error-terminal-initialization-failed-falling-back-to-unsupporte
export HADOOP_USER_CLASSPATH_FIRST=true

Error 2:
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
        at org.apache.hadoop.fs.Path.initialize(Path.java:206)
        at org.apache.hadoop.fs.Path.<init>(Path.java:172)
        at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:515)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:458)
        ... 8 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:203)
        ... 11 more

Solution: replace ${system:java.io.tmpdir}/${system:user.name} with absolute paths.
vim conf/hive-site.xml

Edit:
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/tmp/hive</value>
    <description>Location of Hive run time structured log file</description>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/tmp/hive/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>

Error 3:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes

Solution: http://zh.hortonworks.com/community/forums/topic/hive-check-execute-failed-hdp1-3/
alter database hive character set latin1;

Error 4:

This error comes from a Java version that is too old; it is best to use Java 7 or later (Java 6 occupies roughly the same position as IE6 does among browsers).

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor version 51.0

3. Common HiveQL and writing UDFs

In one sentence: it is very similar to MySQL, so there is no pressure at all in learning it.

1. Common Hive client commands

hive -e "SQL statement"                   run a query given on the command line
hive -f test.hql                          run a Hive query from a file
hive> ! pwd;                              run a simple shell command
hive> dfs -ls /user/hive/warehouse;       run a Hadoop dfs command

2. Data types supported by Hive

These include primitive types and collection types; for log analysis, string, bigint, double, and map are usually enough.

3. Hive's default delimiters for splitting text files

\n separates rows and Ctrl+A separates fields (columns); you can of course specify your own delimiters, as sketched below.
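A minimal sketch of a table declaration with explicit delimiters (the clicklog table and its columns are made up for illustration):

create table clicklog(
    remote_addr string,
    tags        array<string>,
    kv          map<string,string>
)
row format delimited
    fields terminated by '\t'            -- column delimiter instead of the default Ctrl+A
    collection items terminated by ','   -- separator between array/map elements
    map keys terminated by ':'           -- separator between a map key and its value
lines terminated by '\n'
stored as textfile;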

4. Hive environment settings

set hive.cli.print.current.db=true;   -- show the current database in the prompt
set hive.cli.print.header=true;       -- print column headers in query results
set hive.exec.mode.local.auto=true;   -- local mode: avoid launching MapReduce jobs; useful when the data volume is small
set hive.mapred.mode=strict;          -- strict mode: when not running locally, apply strict query checks, e.g. WHERE must name a partition and ORDER BY must come with LIMIT
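For instance, with hive.mapred.mode=strict a query over a partitioned table would need to look something like the sketch below (it reuses the hypothetical accesslog table partitioned by event_day from earlier; none of the tables built in this walkthrough are partitioned):

-- strict mode insists that the WHERE clause prunes to a partition
-- and that ORDER BY is accompanied by LIMIT
select remote_addr, count(*) as cnt
from accesslog
where event_day = '20150512'
group by remote_addr
order by cnt desc
limit 10;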

5. HiveQL: getting familiar with HiveQL by analyzing web logs

A sample log entry and its fields:

127.0.0.1 [12/May/2015:15:16:30 +0800] sqli(194) BAN(226) 403  174 POST "/wp-content/plugins/store-locator-le/downloadcsv.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" "-" "-" "-" "query=addr,,1%26%2339;union(select*from((select%20md5(3.1415))a1%20join(select%202)a2))#" "application/x-www-form-urlencoded"

remote_addr             visitor IP
time_local              time
attack_type             attack type (type ID)
ban_type                event handling type (event response)
status                  HTTP response code
body_bytes_sent         number of body bytes sent
request_method          HTTP request method
request_uri             HTTP request URI
http_user_agent         User-Agent request header
http_x_forwarded_for    X-Forwarded-For request header
http_referer            Referer request header
http_cookie             Cookie request header
request_body            request body
http_content_type       Content-Type request header

Step 1: Create the database weblog

hive> create database if not exists weblog comment 'holds all web logs';

The database is stored at:

hive> dfs -ls /user/hive/warehouse/;
drwxrwxr-x - root supergroup 0 2015-05-12 15:01 /user/hive/warehouse/weblog.db

Step 2: Create the table nginxlog to store the raw logs

hive> use default;
hive> create table nginxlog(
          remote_addr string,
          time_local string,
          attack_type string,
          ban_type string,
          status string,
          body_bytes_sent string,
          request_method string,
          request_uri string,
          http_user_agent string,
          http_x_forwarded_for string,
          http_referer string,
          http_cookie string,
          request_body string,
          http_content_type string)
      row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
      with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w+\\(\\d*\\))\\s+(\\w+\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"")
      stored as textfile;

This input regex drained my HP bar by 99%.

Lesson: double-escape, double-escape, double-escape. Important things get said three times.
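To spell out what the double escaping means (an illustration, not a new step): the regex is written as a string literal inside the HiveQL statement, so every backslash the regex engine should see has to be doubled in the statement.

-- as written in the DDL:       "input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+..."
-- what RegexSerDe receives:    (\d+\.\d+\.\d+\.\d+)\s+...
-- so \\d becomes the regex token \d, \\. becomes a literal dot, and \\\" ends up as \" (an escaped double quote)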

It also taught me how to modify a table's SerDe properties.

Background: Hive uses an InputFormat object to split the input stream into records and an OutputFormat object to format records into the output stream, and it uses a SerDe (serializer/deserializer) to parse records into columns when reading and to encode columns back into records when writing.

hive> alter table nginxlog
    > set serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    > with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w*\\(\\d*\\))\\s+(\\w*\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"");

After the table is created successfully, we can see its storage location in Hadoop:

hive> dfs -ls /user/hive/warehouse/weblog.db/nginxlog;
Found 1 items
-rwxrwxr-x 1 root supergroup 1861896 2015-05-12 20:22 /user/hive/warehouse/weblog.db/nginxlog/access.log

We can inspect the table's structure:

hive> describe nginxlog;
OK
remote_addr             string
time_local              string
attack_type             string
ban_type                string
status                  string
body_bytes_sent         string
request_method          string
request_uri             string
http_user_agent         string
http_x_forwarded_for    string
http_referer            string
http_cookie             string
request_body            string
http_content_type       string
Time taken: 0.055 seconds, Fetched: 14 row(s)

Step 3: Load the raw log file

load data local inpath "/home/tanjiti/nginx/logs/access.log" overwrite into table nginxlog;

Step 4: Create another table to store the parsed URL data

create table urlparse(
    request_uri string,
    requestfilename string,
    param map<string,string>
);

Populate the urlparse table with the parsed URL data:

insert overwrite table urlparse
select request_uri,
       case when instr(request_uri,'?') == 0 then substr(request_uri,0,length(request_uri))
            else substr(request_uri,0,instr(request_uri,'?')-1) end as requestfilename,
       case when instr(request_uri,'?') == 0 then NULL
            else str_to_map(substr(request_uri,instr(request_uri,'?')+1),'&','=') end as param
from nginxlog;
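As a quick sanity check of what instr and str_to_map are doing here, a throwaway query like the following can be run (the values are made up; this assumes your Hive version accepts SELECT without FROM, which 1.1.0 does):

select instr('/index.php?a=1&b=2', '?'),            -- 11, the position of '?'
       substr('/index.php?a=1&b=2', 1, 10),         -- '/index.php', the request file name
       str_to_map('a=1&b=2', '&', '=');             -- {"a":"1","b":"2"}, the parameter map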

We can take a look at the stored data:

urlparse.request_uri    urlparse.requestfilename    urlparse.param    (column names)
/forummission.php    /forummission.php    NULL
/userapp.php?script=notice&view=all&option=deluserapp&action=invite&hash='%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23    /userapp.php    {"hash":"'%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23","action":"invite","option":"deluserapp","view":"all","script":"notice"}

Note: this parsing approach is very crude; requests that do not follow the url?k1=v1&k2=v2 format (including rewritten URLs) cannot be parsed correctly. A real application would need something better; this is only an example. One possible refinement is sketched below.
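One such refinement, sketched under the assumption that the built-in parse_url function fits your URLs better than the hand-rolled instr/substr logic (parse_url needs a full URL, so a dummy scheme and host are prepended):

select request_uri,
       parse_url(concat('http://host', request_uri), 'PATH')  as requestfilename,
       str_to_map(parse_url(concat('http://host', request_uri), 'QUERY'), '&', '=') as param
from nginxlog;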

Step 5: Explore URL features

We explore some features of the URLs from a statistical angle:

how many distinct URL requests each host has; and for those URL requests:
the distribution of the number of parameters
the distribution of parameter lengths
the enumeration of parameter names and the categories of their values: Word, ParaArray (e.g. text["fafa"], t[]), Other
the categories of parameter values: Digits (e.g. -123  +56  123.3  .3  1,123,123), Word, Email, PATH (windows/linux), URI, SafeText(-_.,:a-zA-Z0-9\s), Flag(Null), DuplicatePara(e.g. a=1&a=2), Base64, Encrypt(md5, sha1), Other

Extending this, this kind of exploration can be used to generate URL whitelists. Before exploring, of course, we need to clean the log source: strip abnormal entries such as attack logs, server error logs (keep only 2xx/3xx), static-resource logs (avi, jpg, and so on), duplicate logs and other noise, and normalize what is left, e.g. unify PATHs (collapse repeated //, convert / to \). But I digress; a rough cleaning filter is sketched below.
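A rough sketch of such a cleaning filter over the tables above (the status and extension rules are illustrative, not a complete rule set):

insert overwrite table urlparse
select request_uri,
       case when instr(request_uri,'?') == 0 then request_uri
            else substr(request_uri,0,instr(request_uri,'?')-1) end as requestfilename,
       case when instr(request_uri,'?') == 0 then NULL
            else str_to_map(substr(request_uri,instr(request_uri,'?')+1),'&','=') end as param
from nginxlog
where status rlike '^[23]'                                                             -- keep only 2xx/3xx responses
  and not (lower(request_uri) rlike '\\.(jpg|jpeg|png|gif|css|js|ico|avi)(\\?|$)');    -- drop obvious static resources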

Length-based judgments can simply use Chebyshev's theorem.

For numerical statistical analysis, Hive provides a number of built-in functions, for example:

for describing central tendency: the mean, avg;

for describing dispersion: variance var_pop, standard deviation stddev_pop, covariance covar_pop, and the correlation coefficient corr.

Some of these features can be implemented with built-in functions; others require writing user-defined functions. A sketch combining these built-ins with the Chebyshev idea follows.
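A hedged sketch of the Chebyshev-style length check using only the built-ins above: compute the mean and standard deviation of parameter-value lengths per request path, then flag values more than k standard deviations away (Chebyshev's inequality says at most 1/k² of values can be that far out, whatever the distribution). The tables and columns reuse the urlparse schema; the aliases and the k = 4 threshold are illustrative:

with lengths as (
    select requestfilename, length(pval) as len
    from urlparse lateral view explode(param) p as pkey, pval
    where param is not null
),
stats as (
    select requestfilename, avg(len) as mean_len, stddev_pop(len) as sd_len
    from lengths
    group by requestfilename
)
select l.requestfilename, l.len
from lengths l join stats s on l.requestfilename = s.requestfilename
where s.sd_len > 0
  and abs(l.len - s.mean_len) > 4 * s.sd_len;   -- k = 4: at most 1/16 of values should land out here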

Built-in functions: get the request path plus the arrays of parameter names and parameter values seen under it

select requestfilename,map_keys(param),map_values(param) from urlparse where param is not null limit 10;

Partial results:

/bbs/plugin.php ["action","identifier","fmid","module"] ["view","family","1+and+1=2+unIon+selecT+%201,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,group_concat(0x3a,0x3a,md5(3.1415),0x3a,0x3a),25,26,27,28,29,30,31--%20-","family"]
/wp-content/plugins/profiles/library/bio-img.php ["id"] ["-1%27%20AND%201=IF(2%3E1,BENCHMARK(10000000,MD5(CHAR(115,113,108,109,97,112))),0)--%20-"]

Built-in functions: get all query-string key-value pairs under index.php

hive> from (select explode(param) from urlparse where param is not NULL and requestfilename = '/index.php') e select distinct *;

Partial results:

view ../../../../../../../../../../../../../../../../../../boot.ini%00
view ../../../../../../../../../../../../../../../../etc/passwd%00
view c%3A%5CBoot.ini%00
view music
view object
view portfolio
view thread
view timereturns
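Extending this, a LATERAL VIEW over the parameter map can count how often each parameter name shows up per path, which helps with the parameter-name enumeration mentioned in step 5 (a sketch over the urlparse table above):

select requestfilename, pkey, count(*) as cnt
from urlparse lateral view explode(param) p as pkey, pval
where param is not null
group by requestfilename, pkey
order by cnt desc
limit 20;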

Built-in functions: get per-URL statistics of the parameter-count distribution

hive> select s.requestfilename as requestfilename,
             sum(distinct s.param_num) as sum,
             avg(s.param_num) as avg,
             max(s.param_num) as max,
             min(s.param_num) as min,
             variance(s.param_num) as variance,
             var_samp(s.param_num) as var_samp,
             stddev_pop(s.param_num) as stddev_pop,
             stddev_samp(s.param_num) as stddev_samp
      from (select requestfilename as requestfilename, size(param) as param_num
            from urlparse where param is not null) s
      group by s.requestfilename limit 10;

Partial results:

requestfilename sum avg max min variance var_samp stddev_pop stddev_samp
/ 21 2.4623655913978495 6 1 0.8077234362354029 0.816503038803179 0.8987343524286823 0.9036055770097808
//m_5_1/govdiropen/que_chooseusers.jsp 1 1.0 1 1 0.0 0.0 0.0 0.0

Step 6: Write a user-defined function - IP GEO lookup

We use the MaxMind IP database to locate remote_addr geographically.

1. Download the MaxMind GeoIP Java API and package it as a jar

git clone https://github.com/maxmind/geoip-api-java.git
cd geoip-api-java/
mvn clean install

This produces target/geoip-api-1.2.15-SNAPSHOT.jar.

2. Get the IP database data file

wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gzip -d GeoIP.dat.gz

3. Write the hive-geo UDF

For the source code, refer to https://raw.githubusercontent.com/edwardcapriolo/hive-geoip/master/src/main/java/com/jointhegrid/udf/geoip/GenericUDFGeoIP.java

The source code consists of four parts:

1. The function's usage documentation, which is what describe function shows:

@Description(name = "geoip",
    value = "_FUNC_(ip,property,database) - loads database into GEO-IP lookup "
          + "service, then looks up 'property' of ip. ",
    extended = "Example:\n"
          + "> SELECT _FUNC_(ip,'COUNTRY_CODE','/GeoIP.data') from src LIMIT 1;\n ")

2. The initialize phase: check that the incoming arguments are valid and determine their types; in this example, the first argument may be either a string or a long.

public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException

3. The implementation of the lookup logic; every place in a query where this function is applied will instantiate this class.

public Object evaluate(DeferredObject[] arguments) throws HiveException

4. Used for debugging:

public String getDisplayString(String[] children)

Custom non-aggregate functions can basically follow the structure of the source code above; tweak it to taste.

Note: writing aggregate functions is somewhat more involved.

The source code is on my GitHub: https://github.com/tanjiti/UDFExample/tree/master

Compile it into a jar in Eclipse; you can also do it from the command line — if you're interested, see part 4 below.

Extra: how to install the Fat Jar plugin in Eclipse

Help -> Install New Software -> enter http://kurucz-grafika.de/fatjar in the "Work with" field

4. What remains is Hive-side work

add jar /home/tanjiti/UDFExample/UDFExample.jar;     -- this sets the classpath, but it has some issues
add jar /home/tanjiti/UDFExample/lib/geoip-api-1.2.15-SNAPSHOT.jar;
add jar /home/tanjiti/UDFExample/lib/hive-exec-1.1.0.jar;
add file /tmp/GeoIP.dat;                             -- this actually uses Hadoop's distributed cache
create temporary function geoip as 'udfExample.GenericUDFGeoIP';
select geoip(remote_addr,"COUNTRY_NAME","/tmp/GeoIP.dat") from nginxlog limit 1;

Or:

select geoip(3514683273,"COUNTRY_NAME","/tmp/GeoIP.dat");

The result looks like:

xxx.xxx.xxx United States
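Once the temporary function is registered it can be combined with ordinary aggregation; for example, a hedged sketch of counting log entries per country (same geoip function and /tmp/GeoIP.dat as above):

select geoip(remote_addr, "COUNTRY_NAME", "/tmp/GeoIP.dat") as country,
       count(*) as cnt
from nginxlog
group by geoip(remote_addr, "COUNTRY_NAME", "/tmp/GeoIP.dat")
order by cnt desc
limit 10;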

To avoid unnecessary trouble, write full paths — full paths, full paths. Important things get said three times.

That is the end of the tutorial part. If you're interested, read on below for the tale of blood and tears.

------------------------------------------------------------ (blood-and-tears divider) ------------------------------------------------------------

As usual, a record of the bugs I ran into.

Error 5:

Exception in thread "main" java.lang.NoClassDefFoundError: com/maxmind/geoip/LookupService
    at udfExample.GenericUDFGeoIP.evaluate(GenericUDFGeoIP.java:133)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:145)
    at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:232)
    at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:958)
    at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1168)
    at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
    at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:192)
    at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:145)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10530)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10486)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3720)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3499)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:9011)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8966)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9812)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9705)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10141)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:286)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10152)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:192)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:222)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:421)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:307)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1112)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1160)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1039)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:754)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: com.maxmind.geoip.LookupService
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

Anyone with experience, or with some Java knowledge, knows that java.lang.NoClassDefFoundError is almost always a classpath problem.

But I'm a newbie: the error actually came up because I found Fat Jar packaging slow and switched to the unfamiliar command line to build the jar, and it took me quite a while to track down the cause... Fortunately I found a fix.

4. Compiling and packaging the jar from the command line

1. First, the source layout

├── bin                     # compiled class files
│   └── udfExample
│       └── GenericUDFGeoIP.class
├── lib                     # external jars used
│   ├── geoip-api-1.2.15-SNAPSHOT.jar
│   └── hive-exec-1.1.0.jar
├── mymainfest              # the manifest file, very important
├── src                     # java source files
│   └── udfExample
│       ├── GenericUDFGeoIP.java
│       └── GenericUDFNvl.java

2. Compile the source file

javac -d bin/ -sourcepath src/ -cp lib/hive-exec-1.1.0.jar:lib/geoip-api-1.2.15-SNAPSHOT.jar src/udfExample/GenericUDFGeoIP.java

-cp <path>            specify dependent libraries or class files
-sourcepath <path>    specify the source file path
-d <directory>        specify where the compiled class files go

3. Edit the manifest file (do not forget this step)

vim mymainfest

Edit:

Main-Class: udfExample.GenericUDFGeoIP
Class-Path: lib/geoip-api-1.2.15-SNAPSHOT.jar lib/hive-exec-1.1.0.jar

4. Package

jar cvfm UDFExample.jar mymainfest lib/* src/* -C bin .

-c  create a new jar
-v  verbose output
-f  specify the path of the jar file
-C  change to the given directory; the trailing . after -C <dir> means include all files under that directory
-m  specify the manifest file

After that, it's all Hive operations again.

Next, I plan to fill in the Hadoop MapReduce introduction that I still owe.

Hive documentation references:
https://github.com/edwardcapriolo/hive-geoip/
https://cwiki.apache.org/confluence/display/Hive/Home
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Programming Hive (the book)

【via@碳基体】
