软件世界网 购物 网址 三丰软件 | 小说 美女秀 图库大全 游戏 笑话 | 下载 开发知识库 新闻 开发 图片素材
多播视频美女直播
↓电视,电影,美女直播,迅雷资源↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
移动开发 架构设计 编程语言 Web前端 互联网
开发杂谈 系统运维 研发管理 数据库 云计算 Android开发资料
  软件世界网 -> 数据库 -> 使用hive查询把访问网络流量会话化 -> 正文阅读

[数据库]使用hive查询把访问网络流量会话化

《Hive编程指南》最后一章的Outbrain案例中,有一个把访问网络流量会话化的简单实现,但按照它的查询出来的结果是错的,于是自己重写了一个。
一、问题提出(摘自书中原文)
        为了分析网络流量,我们常常希望能够基于各种各样的标准来测量热度。一种方法就是将用户行为分解到会话中,一次会话代表单一的一次“使用”所包含的一系列操作。一个用户在一天内或者一个月中的某几天可以多次访问某个网站,但每一次访问肯定是不一样的。
        那么,什么是一个会话呢?一种定义是指相隔不超过30分钟的一连串的页面活动就是一个会话。也就是说,如果你去你的第1个页面,等待5分钟,然后去第2个页面,那么这是相同的会话。又等待25分钟后再到第3页,仍然是相同的会话。再等待1分钟跳转到第4页,这次会话将被打破了,这将不是第4个访问页面了,而是第2个会话中的第一个页面。
        一旦我们获得这些中断信息,我们就可以查看会话的属性信息,来看看发生了什么事而导致中断的。常规的方式就是通过会话长度来对链入的页面进行比较。
        乍一看,这似乎是一个完美的迭代过程。对于每个页面,保持倒计数,直到你找到第1个页面。但Hive是不支持迭代的。不过,还是可以解决这个问题。可以将这个过程分为4个阶段。
1. 识别哪些页面浏览是会话的初始者,或“起源”页面。
2. 对于每个页面,将其划分到正确的来源页面。
3. 将所有的页面浏览聚合到每个来源页面。
4. 对每个来源页面进行标记,然后计算每个会话的热度。
这种方式将产生一个表,其中每一行都表示一个完整的会话,然后用户就可以查询想知道的信息了。
二、实现过程
1. 设置
首先定义表session_test:
CREATE TABLE session_test
(
  st_user_id STRING,
  st_pageview_id STRING,
  st_page_url STRING,
  st_referrer_url STRING,
  st_timestamp DOUBLE
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
然后准备测试数据的CSV文件session_test.txt
user_1,2184914255, http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,http://ssx4501.blog.sohu.com/321517032.html,1459070426.712
user_1,2489633785,http://blog.sohu.com/,http://news.sohu.com/20160327/n442361110.shtml,1459070327.691
user_1,9638816121,http://www.sohu.com/,https://bj.122.gov.cn/views/inquiry.html,1459070227.725
user_1,7730804743,http://www.bjjtgl.gov.cn/wanchengyemian2/hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.563
user_1,6135947520,http://mail.sohu.com/bapp/36/main#mailList_-2,http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,1459070027.165
user_1,2005762692,http://www.56.com/u63/v_MTIxOTAzMTk2.html,http://www.56.com/u63/v_MTIxOTAzMTk2.html,1459069427.065
user_1,9885965958,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri=movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E8%94%A1%E6%9D%8E%E4%BD%9B%E5%B0%8F%E5%AD%90&rsv_pq=da0a1d6e001a2f26&rsv_t=ae7cF77GAFjqvIjZZAoitr6QZQEZHLOS5jaOILlrGrSf82o0ewqe3XyjTLQ&rsv_enter=1&rsv_n=2&rsv_sug3=1,1459060427.743
user_1,7781491477,https://www.baidu.com/,http://www.youku.com/,1459060227.538
user_1,4874646952,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,1459060127.538
user_1,4584871291,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=referrer_url&oq=mysql%20%E5%A4%9A%E8%A1%8Cuuid&rsv_pq=b37dd6bf000f9361&rsv_t=de34O52vUqUHuV5xSBtPC8H8TCY7QosiYwdWioYH04hjx%2FIyZ7PBogibMlA&rsv_enter=1&inputT=5220&rsv_sug3=33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%E5%A4%9A%E8%A1%8Cuuid,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E5%A4%9A%E8%A1%8Cuuid&oq=mysql%20%E9%80%90%E8%A1%8Cuuid&rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5Jt3mUBueH%2B93%2FDJMIOS9KNaP85ERoqiRFfSyIYOh4cSFNOwnRh%2Fcw&rsv_enter=1&inputT=5220&rsv_sug3=29&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&rsv_sug4=5943,1459060027.378
user_1,7240734452,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,1459059427.334
user_1,6310115480,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E9%80%90%E8%A1%8Cuuid&oq=mysql%20uuid&rsv_pq=db45a04e000eb323&rsv_t=a2cexGBrnrvDtSsubhPDhgtu0%2BmpbceYUGEqfhh49O3STPQfBGFTvQmx3LM&rsv_enter=1&inputT=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20uuid&oq=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92&rsv_pq=bd2966c9000d954a&rsv_t=2591i8XuLiiMERZ%2FcaOuYiLPWymHp4R4Z%2BJV5%2F4%2Fk3JbvShjVSiL0kDX09w&rsv_enter=1&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&bs=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92,1459057527.948
user_1,9725586547,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http://blog.csdn.net/runming918/article/details/7231259,1459057427.004
user_1,1995107575,http://blog.csdn.net/superhosts/article/details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html,1459055428.541
user_1,4376767486,http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_XMTUxMTg4MjQ3Ng==_ev_4.html?from=y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.252
user_2,2202616558, http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,http://ssx4501.blog.sohu.com/321517032.html,1459070426.913
user_2,2458809722,http://blog.sohu.com/,http://news.sohu.com/20160327/n442361110.shtml,1459070327.073
user_2,8893838191,http://www.sohu.com/,https://bj.122.gov.cn/views/inquiry.html,1459070227.167
user_2,2853640637,http://www.bjjtgl.gov.cn/wanchengyemian2/hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.527
user_2,1174630884,http://mail.sohu.com/bapp/36/main#mailList_-2,http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,1459070027.515
user_2,5635223546,http://www.56.com/u63/v_MTIxOTAzMTk2.html,http://www.56.com/u63/v_MTIxOTAzMTk2.html,1459069427.031
user_2,6497571917,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri=movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E8%94%A1%E6%9D%8E%E4%BD%9B%E5%B0%8F%E5%AD%90&rsv_pq=da0a1d6e001a2f26&rsv_t=ae7cF77GAFjqvIjZZAoitr6QZQEZHLOS5jaOILlrGrSf82o0ewqe3XyjTLQ&rsv_enter=1&rsv_n=2&rsv_sug3=1,1459060427.961
user_2,9736857005,https://www.baidu.com/,http://www.youku.com/,1459060227.972
user_2,9522138344,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,1459060127.819
user_2,3288917999,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=referrer_url&oq=mysql%20%E5%A4%9A%E8%A1%8Cuuid&rsv_pq=b37dd6bf000f9361&rsv_t=de34O52vUqUHuV5xSBtPC8H8TCY7QosiYwdWioYH04hjx%2FIyZ7PBogibMlA&rsv_enter=1&inputT=5220&rsv_sug3=33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%E5%A4%9A%E8%A1%8Cuuid,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E5%A4%9A%E8%A1%8Cuuid&oq=mysql%20%E9%80%90%E8%A1%8Cuuid&rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5Jt3mUBueH%2B93%2FDJMIOS9KNaP85ERoqiRFfSyIYOh4cSFNOwnRh%2Fcw&rsv_enter=1&inputT=5220&rsv_sug3=29&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&rsv_sug4=5943,1459060027.815
user_2,3790199378,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,1459059427.106
user_2,6413528694,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E9%80%90%E8%A1%8Cuuid&oq=mysql%20uuid&rsv_pq=db45a04e000eb323&rsv_t=a2cexGBrnrvDtSsubhPDhgtu0%2BmpbceYUGEqfhh49O3STPQfBGFTvQmx3LM&rsv_enter=1&inputT=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20uuid&oq=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92&rsv_pq=bd2966c9000d954a&rsv_t=2591i8XuLiiMERZ%2FcaOuYiLPWymHp4R4Z%2BJV5%2F4%2Fk3JbvShjVSiL0kDX09w&rsv_enter=1&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&bs=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92,1459057527.689
user_2,6745253283,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http://blog.csdn.net/runming918/article/details/7231259,1459057427.126
user_2,7438056632,http://blog.csdn.net/superhosts/article/details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html,1459055428.198
user_2,8615798962,http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_XMTUxMTg4MjQ3Ng==_ev_4.html?from=y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.636
最后向表中导入测试数据
LOAD DATA LOCAL INPATH '/home/grid/session_test.txt' INTO TABLE session_test;

2. 找到每个会话的起始页
CREATE TABLE sessionization_step_one_origins
AS
   SELECT t1.st_user_id AS ssoo_user_id,
          t1.st_pageview_id ssoo_pageview_id,
          t1.st_timestamp ssoo_timestamp
     FROM session_test t1
          LEFT JOIN
          (SELECT DISTINCT st_pageview_id
             FROM (SELECT b.st_pageview_id st_pageview_id,
                          IF (
                                 b.st_timestamp > a.st_timestamp
                             AND b.st_timestamp - a.st_timestamp < 1800,
                             1,
                             0)
                             c_nonorigin_flags
                     FROM session_test a
                          INNER JOIN session_test b
                             ON a.st_user_id = b.st_user_id) t1
            WHERE c_nonorigin_flags = 1) t2
             ON t1.st_pageview_id = t2.st_pageview_id
    WHERE t2.st_pageview_id IS NULL;
        在最内层子查询中用自关联查询出非起始页面的浏览ID,因为Hive目前还没有支持不等于的JOIN,所以用了一个IF判断给非起始页面做标记。最外层的查询使用外连接得到会话起始页。
        查询sessionization_step_one_origins表的记录如图1所示
  SELECT ssoo_user_id,
         ssoo_pageview_id,
         CAST (ssoo_timestamp AS DECIMAL (13, 3))
    FROM sessionization_step_one_origins
ORDER BY ssoo_user_id, ssoo_timestamp;

[img]http://img.blog.csdn.net/20160328113523015?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图1
3. 将PV分配到起始页面中
CREATE TABLE sessionization_step_two_origin_identification
AS
   SELECT st_user_id sstoi_user_id,
          st_pageview_id sstoi_pageview_id,
          ssoo_pageview_id sstoi_origin_pageview_id
     FROM (SELECT t2.st_user_id,
                  t2.st_pageview_id,
                  t1.ssoo_pageview_id,
                  IF (
                         t2.st_timestamp >= t1.current_ssoo_timestamp
                     AND t2.st_timestamp < t1.lead_ssoo_timestamp,
                     1,
                     0)
                     c_nonorigin_flags
             FROM (SELECT ssoo_user_id,
                          ssoo_pageview_id,
                          NVL (lead_ssoo_timestamp, 9999999999)
                             lead_ssoo_timestamp,
                          current_ssoo_timestamp
                     FROM (SELECT ssoo_user_id,
                                  ssoo_pageview_id,
                                  LEAD (
                                     ssoo_timestamp,
                                     1)
                                  OVER (PARTITION BY ssoo_user_id
                                        ORDER BY ssoo_timestamp)
                                     lead_ssoo_timestamp,
                                  ssoo_timestamp current_ssoo_timestamp
                             FROM sessionization_step_one_origins) t0) t1
                  INNER JOIN
                  (SELECT st_user_id, st_pageview_id, st_timestamp
                     FROM session_test) t2
                     ON t1.ssoo_user_id = t2.st_user_id) t
    WHERE t.c_nonorigin_flags = 1;
        最内层子查询使用LEAD窗口分析函数得到下一个时间戳,在当前时间戳和下一个时间戳之间的记录为当前会话起始页对应的PV。因为起始页对应的PV包含起始页自己,所以IF区间判断条件是左闭右开。
        查询sessionization_step_two_origin_identification表的记录如图2所示
SELECT sstoi_user_id, sstoi_pageview_id, sstoi_origin_pageview_id
  FROM sessionization_step_two_origin_identification;

[img]http://img.blog.csdn.net/20160328113638423?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center  
图2
4. 按起始页聚合
CREATE TABLE sessionization_step_three_origin_aggregation AS
SELECT 
    sstoi_user_id sstoa_user_id,
    sstoi_origin_pageview_id sstoa_origin_pageview_id,
    count(1) sstoa_pageview_count
FROM
    sessionization_step_two_origin_identification
GROUP BY sstoi_user_id , sstoi_origin_pageview_id;
        查询sessionization_step_three_origin_aggregation表的记录如图3所示
  SELECT sstoa_user_id, sstoa_origin_pageview_id, sstoa_pageview_count
    FROM sessionization_step_three_origin_aggregation
ORDER BY sstoa_user_id, sstoa_pageview_count;

[img]http://img.blog.csdn.net/20160328113736472?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图3
5. 按起始页属性聚合
CREATE TABLE sessionization_step_four_qualitative_labeling
AS
   SELECT a.a_user_id AS ssfql_user_id,
          a.a_origin_pageview_id AS ssfql_origin_pageview_id,
          b.b_timestamp AS ssfql_timestamp,
          b.b_page_url AS ssfql_page_url,
          b.b_referrer_url AS ssfql_referrer_url,
          a.a_pageview_count AS ssqfl_pageview_count
     FROM (SELECT sstoa_user_id AS a_user_id,
                  sstoa_origin_pageview_id AS a_origin_pageview_id,
                  sstoa_pageview_count AS a_pageview_count
             FROM sessionization_step_three_origin_aggregation) a
          JOIN
          (SELECT st_user_id AS b_user_id,
                  st_pageview_id AS b_pageview_id,
                  st_page_url AS b_page_url,
                  st_referrer_url AS b_referrer_url,
                  st_timestamp AS b_timestamp
             FROM session_test) b
             ON     a.a_user_id = b.b_user_id
                AND a.a_origin_pageview_id = b.b_pageview_id;
        查询sessionization_step_four_qualitative_labeling表的记录如图4所示
  SELECT ssfql_user_id,
         ssfql_origin_pageview_id,
         ssfql_timestamp,
         ssfql_page_url,
         ssfql_referrer_url,
         ssqfl_pageview_count
    FROM sessionization_step_four_qualitative_labeling
ORDER BY ssfql_user_id, ssfql_timestamp;

[img]http://img.blog.csdn.net/20160328113842035?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图4
6. 衡量热度
  SELECT PARSE_URL (ssfql_referrer_url, 'HOST') AS referrer_host,
         COUNT (1) AS session_count,
         AVG (ssqfl_pageview_count) AS avg_pvs_per_session,
         SUM (ssqfl_pageview_count) / COUNT (1) AS weighted_avg_pvs_per_session,
         MAX (ssqfl_pageview_count) AS max_pvs_per_session,
         MIN (ssqfl_pageview_count) AS min_pvs_per_session,
         COUNT (DISTINCT ssfql_user_id) AS unique_users
    FROM sessionization_step_four_qualitative_labeling
GROUP BY PARSE_URL (ssfql_referrer_url, 'HOST');
        查询结果如图5所示
[img]http://img.blog.csdn.net/20160328113936489?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图5
......显示全文...
    点击查看全文


上一篇文章      下一篇文章      查看所有文章
2016-03-28 21:48:11  
数据库 最新文章
Python&MySQL&PyQt
最新用python来操作mysql完全解析
mongodb的安装详解
1.PDO简介
《MySQL必知必会学习笔记》:高级联结
【翻译自mos文章】怎么对Microsoft(Office)
MyCAT全局表描述及示例
ocp
关于SQL数据表存储过程表名前缀换成dbo代码
数据库调优教程(二)慢查询数据准备
360图书馆 软件开发资料 文字转语音 购物精选 软件下载 美食菜谱 新闻资讯 电影视频 小游戏 Chinese Culture 股票 租车
生肖星座 三丰软件 视频 开发 短信 中国文化 网文精选 搜图网 美图 阅读网 多播 租车 短信 看图 日历 万年历 2018年7日历
2018-7-19 13:54:43
多播视频美女直播
↓电视,电影,美女直播,迅雷资源↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  软件世界网 --