软件世界网 购物 网址 三丰软件 | 小说 美女秀 图库大全 游戏 笑话 | 下载 开发知识库 新闻 开发 图片素材
多播视频美女直播
↓电视,电影,美女直播,迅雷资源↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
移动开发 架构设计 编程语言 Web前端 互联网
开发杂谈 系统运维 研发管理 数据库 云计算 Android开发资料
  软件世界网 -> 数据库 -> 使用hive查询把访问网络流量会话化 -> 正文阅读

[数据库]使用hive查询把访问网络流量会话化

《Hive编程指南》最后一章的Outbrain案例中,有一个把访问网络流量会话化的简单实现,但按照它的查询出来的结果是错的,于是自己重写了一个。
一、问题提出(摘自书中原文)
        为了分析网络流量,我们常常希望能够基于各种各样的标准来测量热度。一种方法就是将用户行为分解到会话中,一次会话代表单一的一次“使用”所包含的一系列操作。一个用户在一天内或者一个月中的某几天可以多次访问某个网站,但每一次访问肯定是不一样的。
        那么,什么是一个会话呢?一种定义是指相隔不超过30分钟的一连串的页面活动就是一个会话。也就是说,如果你去你的第1个页面,等待5分钟,然后去第2个页面,那么这是相同的会话。又等待25分钟后再到第3页,仍然是相同的会话。再等待1分钟跳转到第4页,这次会话将被打破了,这将不是第4个访问页面了,而是第2个会话中的第一个页面。
        一旦我们获得这些中断信息,我们就可以查看会话的属性信息,来看看发生了什么事而导致中断的。常规的方式就是通过会话长度来对链入的页面进行比较。
        乍一看,这似乎是一个完美的迭代过程。对于每个页面,保持倒计数,直到你找到第1个页面。但Hive是不支持迭代的。不过,还是可以解决这个问题。可以将这个过程分为4个阶段。
1. 识别哪些页面浏览是会话的初始者,或“起源”页面。
2. 对于每个页面,将其划分到正确的来源页面。
3. 将所有的页面浏览聚合到每个来源页面。
4. 对每个来源页面进行标记,然后计算每个会话的热度。
这种方式将产生一个表,其中每一行都表示一个完整的会话,然后用户就可以查询想知道的信息了。
二、实现过程
1. 设置
首先定义表session_test:
CREATE TABLE session_test
(
  st_user_id STRING,
  st_pageview_id STRING,
  st_page_url STRING,
  st_referrer_url STRING,
  st_timestamp DOUBLE
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
然后准备测试数据的CSV文件session_test.txt
user_1,2184914255, http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,http://ssx4501.blog.sohu.com/321517032.html,1459070426.712
user_1,2489633785,http://blog.sohu.com/,http://news.sohu.com/20160327/n442361110.shtml,1459070327.691
user_1,9638816121,http://www.sohu.com/,https://bj.122.gov.cn/views/inquiry.html,1459070227.725
user_1,7730804743,http://www.bjjtgl.gov.cn/wanchengyemian2/hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.563
user_1,6135947520,http://mail.sohu.com/bapp/36/main#mailList_-2,http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,1459070027.165
user_1,2005762692,http://www.56.com/u63/v_MTIxOTAzMTk2.html,http://www.56.com/u63/v_MTIxOTAzMTk2.html,1459069427.065
user_1,9885965958,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri=movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E8%94%A1%E6%9D%8E%E4%BD%9B%E5%B0%8F%E5%AD%90&rsv_pq=da0a1d6e001a2f26&rsv_t=ae7cF77GAFjqvIjZZAoitr6QZQEZHLOS5jaOILlrGrSf82o0ewqe3XyjTLQ&rsv_enter=1&rsv_n=2&rsv_sug3=1,1459060427.743
user_1,7781491477,https://www.baidu.com/,http://www.youku.com/,1459060227.538
user_1,4874646952,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,1459060127.538
user_1,4584871291,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=referrer_url&oq=mysql%20%E5%A4%9A%E8%A1%8Cuuid&rsv_pq=b37dd6bf000f9361&rsv_t=de34O52vUqUHuV5xSBtPC8H8TCY7QosiYwdWioYH04hjx%2FIyZ7PBogibMlA&rsv_enter=1&inputT=5220&rsv_sug3=33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%E5%A4%9A%E8%A1%8Cuuid,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E5%A4%9A%E8%A1%8Cuuid&oq=mysql%20%E9%80%90%E8%A1%8Cuuid&rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5Jt3mUBueH%2B93%2FDJMIOS9KNaP85ERoqiRFfSyIYOh4cSFNOwnRh%2Fcw&rsv_enter=1&inputT=5220&rsv_sug3=29&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&rsv_sug4=5943,1459060027.378
user_1,7240734452,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,1459059427.334
user_1,6310115480,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E9%80%90%E8%A1%8Cuuid&oq=mysql%20uuid&rsv_pq=db45a04e000eb323&rsv_t=a2cexGBrnrvDtSsubhPDhgtu0%2BmpbceYUGEqfhh49O3STPQfBGFTvQmx3LM&rsv_enter=1&inputT=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20uuid&oq=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92&rsv_pq=bd2966c9000d954a&rsv_t=2591i8XuLiiMERZ%2FcaOuYiLPWymHp4R4Z%2BJV5%2F4%2Fk3JbvShjVSiL0kDX09w&rsv_enter=1&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&bs=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92,1459057527.948
user_1,9725586547,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http://blog.csdn.net/runming918/article/details/7231259,1459057427.004
user_1,1995107575,http://blog.csdn.net/superhosts/article/details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html,1459055428.541
user_1,4376767486,http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_XMTUxMTg4MjQ3Ng==_ev_4.html?from=y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.252
user_2,2202616558, http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,http://ssx4501.blog.sohu.com/321517032.html,1459070426.913
user_2,2458809722,http://blog.sohu.com/,http://news.sohu.com/20160327/n442361110.shtml,1459070327.073
user_2,8893838191,http://www.sohu.com/,https://bj.122.gov.cn/views/inquiry.html,1459070227.167
user_2,2853640637,http://www.bjjtgl.gov.cn/wanchengyemian2/hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.527
user_2,1174630884,http://mail.sohu.com/bapp/36/main#mailList_-2,http://baike.baidu.com/link?url=stnTAgubmqJuhKT4Kf-P1wXR0rkNyuN1THCwLblz4eWqWxWly5kgTFtbBYVt8BCspvPj1BQNS9MoDyTcegvgjq#3,1459070027.515
user_2,5635223546,http://www.56.com/u63/v_MTIxOTAzMTk2.html,http://www.56.com/u63/v_MTIxOTAzMTk2.html,1459069427.031
user_2,6497571917,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri=movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E8%94%A1%E6%9D%8E%E4%BD%9B%E5%B0%8F%E5%AD%90&rsv_pq=da0a1d6e001a2f26&rsv_t=ae7cF77GAFjqvIjZZAoitr6QZQEZHLOS5jaOILlrGrSf82o0ewqe3XyjTLQ&rsv_enter=1&rsv_n=2&rsv_sug3=1,1459060427.961
user_2,9736857005,https://www.baidu.com/,http://www.youku.com/,1459060227.972
user_2,9522138344,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,1459060127.819
user_2,3288917999,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=referrer_url&oq=mysql%20%E5%A4%9A%E8%A1%8Cuuid&rsv_pq=b37dd6bf000f9361&rsv_t=de34O52vUqUHuV5xSBtPC8H8TCY7QosiYwdWioYH04hjx%2FIyZ7PBogibMlA&rsv_enter=1&inputT=5220&rsv_sug3=33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%E5%A4%9A%E8%A1%8Cuuid,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E5%A4%9A%E8%A1%8Cuuid&oq=mysql%20%E9%80%90%E8%A1%8Cuuid&rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5Jt3mUBueH%2B93%2FDJMIOS9KNaP85ERoqiRFfSyIYOh4cSFNOwnRh%2Fcw&rsv_enter=1&inputT=5220&rsv_sug3=29&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&rsv_sug4=5943,1459060027.815
user_2,3790199378,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,http://zhidao.baidu.com/link?url=_Cw4-_vo4XVI1lHfPuSDo7rRPeyDThHzietAOWHR52H_G1FlV8piTWbzNCDHQTGaxjSRHIox_DttBWLtmzS-8HFydpxG_IjUnbjryjASWay,1459059427.106
user_2,6413528694,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E9%80%90%E8%A1%8Cuuid&oq=mysql%20uuid&rsv_pq=db45a04e000eb323&rsv_t=a2cexGBrnrvDtSsubhPDhgtu0%2BmpbceYUGEqfhh49O3STPQfBGFTvQmx3LM&rsv_enter=1&inputT=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20uuid&oq=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92&rsv_pq=bd2966c9000d954a&rsv_t=2591i8XuLiiMERZ%2FcaOuYiLPWymHp4R4Z%2BJV5%2F4%2Fk3JbvShjVSiL0kDX09w&rsv_enter=1&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&bs=%E6%97%B6%E9%97%B4%E6%88%B3%20%E6%AF%AB%E7%A7%92,1459057527.689
user_2,6745253283,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http://blog.csdn.net/runming918/article/details/7231259,1459057427.126
user_2,7438056632,http://blog.csdn.net/superhosts/article/details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html,1459055428.198
user_2,8615798962,http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_XMTUxMTg4MjQ3Ng==_ev_4.html?from=y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.636
最后向表中导入测试数据
LOAD DATA LOCAL INPATH '/home/grid/session_test.txt' INTO TABLE session_test;

2. 找到每个会话的起始页
CREATE TABLE sessionization_step_one_origins
AS
   SELECT t1.st_user_id AS ssoo_user_id,
          t1.st_pageview_id ssoo_pageview_id,
          t1.st_timestamp ssoo_timestamp
     FROM session_test t1
          LEFT JOIN
          (SELECT DISTINCT st_pageview_id
             FROM (SELECT b.st_pageview_id st_pageview_id,
                          IF (
                                 b.st_timestamp > a.st_timestamp
                             AND b.st_timestamp - a.st_timestamp < 1800,
                             1,
                             0)
                             c_nonorigin_flags
                     FROM session_test a
                          INNER JOIN session_test b
                             ON a.st_user_id = b.st_user_id) t1
            WHERE c_nonorigin_flags = 1) t2
             ON t1.st_pageview_id = t2.st_pageview_id
    WHERE t2.st_pageview_id IS NULL;
        在最内层子查询中用自关联查询出非起始页面的浏览ID,因为Hive目前还没有支持不等于的JOIN,所以用了一个IF判断给非起始页面做标记。最外层的查询使用外连接得到会话起始页。
        查询sessionization_step_one_origins表的记录如图1所示
  SELECT ssoo_user_id,
         ssoo_pageview_id,
         CAST (ssoo_timestamp AS DECIMAL (13, 3))
    FROM sessionization_step_one_origins
ORDER BY ssoo_user_id, ssoo_timestamp;

[img]http://img.blog.csdn.net/20160328113523015?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图1
3. 将PV分配到起始页面中
CREATE TABLE sessionization_step_two_origin_identification
AS
   SELECT st_user_id sstoi_user_id,
          st_pageview_id sstoi_pageview_id,
          ssoo_pageview_id sstoi_origin_pageview_id
     FROM (SELECT t2.st_user_id,
                  t2.st_pageview_id,
                  t1.ssoo_pageview_id,
                  IF (
                         t2.st_timestamp >= t1.current_ssoo_timestamp
                     AND t2.st_timestamp < t1.lead_ssoo_timestamp,
                     1,
                     0)
                     c_nonorigin_flags
             FROM (SELECT ssoo_user_id,
                          ssoo_pageview_id,
                          NVL (lead_ssoo_timestamp, 9999999999)
                             lead_ssoo_timestamp,
                          current_ssoo_timestamp
                     FROM (SELECT ssoo_user_id,
                                  ssoo_pageview_id,
                                  LEAD (
                                     ssoo_timestamp,
                                     1)
                                  OVER (PARTITION BY ssoo_user_id
                                        ORDER BY ssoo_timestamp)
                                     lead_ssoo_timestamp,
                                  ssoo_timestamp current_ssoo_timestamp
                             FROM sessionization_step_one_origins) t0) t1
                  INNER JOIN
                  (SELECT st_user_id, st_pageview_id, st_timestamp
                     FROM session_test) t2
                     ON t1.ssoo_user_id = t2.st_user_id) t
    WHERE t.c_nonorigin_flags = 1;
        最内层子查询使用LEAD窗口分析函数得到下一个时间戳,在当前时间戳和下一个时间戳之间的记录为当前会话起始页对应的PV。因为起始页对应的PV包含起始页自己,所以IF区间判断条件是左闭右开。
        查询sessionization_step_two_origin_identification表的记录如图2所示
SELECT sstoi_user_id, sstoi_pageview_id, sstoi_origin_pageview_id
  FROM sessionization_step_two_origin_identification;

[img]http://img.blog.csdn.net/20160328113638423?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center  
图2
4. 按起始页聚合
CREATE TABLE sessionization_step_three_origin_aggregation AS
SELECT 
    sstoi_user_id sstoa_user_id,
    sstoi_origin_pageview_id sstoa_origin_pageview_id,
    count(1) sstoa_pageview_count
FROM
    sessionization_step_two_origin_identification
GROUP BY sstoi_user_id , sstoi_origin_pageview_id;
        查询sessionization_step_three_origin_aggregation表的记录如图3所示
  SELECT sstoa_user_id, sstoa_origin_pageview_id, sstoa_pageview_count
    FROM sessionization_step_three_origin_aggregation
ORDER BY sstoa_user_id, sstoa_pageview_count;

[img]http://img.blog.csdn.net/20160328113736472?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图3
5. 按起始页属性聚合
CREATE TABLE sessionization_step_four_qualitative_labeling
AS
   SELECT a.a_user_id AS ssfql_user_id,
          a.a_origin_pageview_id AS ssfql_origin_pageview_id,
          b.b_timestamp AS ssfql_timestamp,
          b.b_page_url AS ssfql_page_url,
          b.b_referrer_url AS ssfql_referrer_url,
          a.a_pageview_count AS ssqfl_pageview_count
     FROM (SELECT sstoa_user_id AS a_user_id,
                  sstoa_origin_pageview_id AS a_origin_pageview_id,
                  sstoa_pageview_count AS a_pageview_count
             FROM sessionization_step_three_origin_aggregation) a
          JOIN
          (SELECT st_user_id AS b_user_id,
                  st_pageview_id AS b_pageview_id,
                  st_page_url AS b_page_url,
                  st_referrer_url AS b_referrer_url,
                  st_timestamp AS b_timestamp
             FROM session_test) b
             ON     a.a_user_id = b.b_user_id
                AND a.a_origin_pageview_id = b.b_pageview_id;
        查询sessionization_step_four_qualitative_labeling表的记录如图4所示
  SELECT ssfql_user_id,
         ssfql_origin_pageview_id,
         ssfql_timestamp,
         ssfql_page_url,
         ssfql_referrer_url,
         ssqfl_pageview_count
    FROM sessionization_step_four_qualitative_labeling
ORDER BY ssfql_user_id, ssfql_timestamp;

[img]http://img.blog.csdn.net/20160328113842035?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图4
6. 衡量热度
  SELECT PARSE_URL (ssfql_referrer_url, 'HOST') AS referrer_host,
         COUNT (1) AS session_count,
         AVG (ssqfl_pageview_count) AS avg_pvs_per_session,
         SUM (ssqfl_pageview_count) / COUNT (1) AS weighted_avg_pvs_per_session,
         MAX (ssqfl_pageview_count) AS max_pvs_per_session,
         MIN (ssqfl_pageview_count) AS min_pvs_per_session,
         COUNT (DISTINCT ssfql_user_id) AS unique_users
    FROM sessionization_step_four_qualitative_labeling
GROUP BY PARSE_URL (ssfql_referrer_url, 'HOST');
        查询结果如图5所示
[img]http://img.blog.csdn.net/20160328113936489?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center
图5
......显示全文...
    点击查看全文


上一篇文章      下一篇文章      查看所有文章
2016-03-28 21:48:11  
数据库 最新文章
Python&MySQL&PyQt
最新用python来操作mysql完全解析
mongodb的安装详解
1.PDO简介
《MySQL必知必会学习笔记》:高级联结
【翻译自mos文章】怎么对Microsoft(Office)
MyCAT全局表描述及示例
ocp
关于SQL数据表存储过程表名前缀换成dbo代码
数据库调优教程(二)慢查询数据准备
360图书馆 论文大全 母婴/育儿 软件开发资料 网页快照 文字转语音 购物精选 软件 美食菜谱 新闻中心 电影下载 小游戏 Chinese Culture
生肖星座解梦 人民的名义 人民的名义在线看 三沣玩客 拍拍 视频 开发 Android开发 站长 古典小说 网文精选 搜图网 天下美图
中国文化英文 多播视频 装修知识库
2017-7-20 20:33:52
多播视频美女直播
↓电视,电影,美女直播,迅雷资源↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  软件世界网 --