根据网友的一个鹅厂面试题由来,原始内容摘录如下:求每个用户的最长连续登录天数,两个日期的间隔小于或等于 3 均视为连续登录。比如 01-01 号登录,最近的下一次登录是 01-04 号,两个日期的间隔等于 3 天,因此这两个日期之间的天数都算作活跃天数,一共 4 天。
目录导航
分析
首先明确,登录是否只考虑天,不考虑时间。也就是只要跨天就算1天,哪怕中间只间隔了1秒。比如2020-01-01 23:59:59 - 2020-01-02 00:00:00。
gbase> select datediff('2020-01-01 23:59:59','2020-01-02 00:00:00');
+-------------------------------------------------------+
| datediff('2020-01-01 23:59:59','2020-01-02 00:00:00') |
+-------------------------------------------------------+
| -1 |
+-------------------------------------------------------+
1 row in set (Elapsed: 00:00:00.00)
这涉及到了计算方法,如果是天,那么用datediff就可以了。 如果是24小时,那么就得用秒来计算,也就是用timestampdiff。
gbase> select timestampdiff(hour,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
| 0 |
+------+
1 row in set (Elapsed: 00:00:00.00)
gbase> select timestampdiff(minute,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
| 0 |
+------+
1 row in set (Elapsed: 00:00:00.00)
gbase> select timestampdiff(second,'2020-01-01 23:59:59','2020-01-02 00:00:00')diff;
+------+
| diff |
+------+
| 1 |
+------+
1 row in set (Elapsed: 00:00:00.00)
根据粒度不同,如果是天,datediff即可。 如果是24小时,那么用timestampdiff的hour即可。总之这个必须明确。
因为题目没有说,后面我们以默认的【天】为基本粒度。
原始数据登录信息
本信息包括了一天内多次登录,不同时间登录等情况。注意,如下特特意构造了1个月的中断登录时间。
create table test_login(user_id int,login_date timestamp);
insert into test_login values (1,'2020-01-01 00:01:00');
insert into test_login values (1,'2020-01-01 00:02:00');
insert into test_login values (1,'2020-01-01 00:03:00');
insert into test_login values (1,'2020-01-02 00:03:00');
insert into test_login values (1,'2020-01-05 00:03:00');
insert into test_login values (1,'2020-01-07 00:03:00');
insert into test_login values (1,'2020-01-11 00:03:00');
insert into test_login values (1,'2020-01-12 00:03:00');
insert into test_login values (1,'2020-01-13 00:03:00');
insert into test_login values (1,'2020-01-14 00:03:00');
insert into test_login values (1,'2020-01-17 00:03:00');
insert into test_login values (1,'2020-01-18 00:03:00');
insert into test_login values (1,'2020-01-19 00:03:00');
insert into test_login values (1,'2020-01-22 00:03:00');
insert into test_login values (1,'2020-02-22 00:03:00');
insert into test_login values (1,'2020-03-22 00:03:00');
insert into test_login values (1,'2020-03-23 00:03:00');
insert into test_login values (2,'2020-01-01 00:04:00');
insert into test_login values (2,'2020-01-01 00:05:00');
insert into test_login values (2,'2020-01-01 00:06:00');
insert into test_login values (2,'2020-01-02 00:06:00');
insert into test_login values (2,'2020-01-07 00:06:00');
insert into test_login values (2,'2020-01-11 00:06:00');
insert into test_login values (2,'2020-01-12 00:06:00');
insert into test_login values (2,'2020-01-14 00:06:00');
insert into test_login values (2,'2020-01-18 00:06:00');
insert into test_login values (2,'2020-01-19 00:06:00');
insert into test_login values (2,'2020-01-22 00:06:00');
查看原始数据
gbase> select * from test_login;
+---------+---------------------+
| user_id | login_date |
+---------+---------------------+
| 1 | 2020-01-01 00:01:00 |
| 1 | 2020-01-01 00:02:00 |
| 1 | 2020-01-01 00:03:00 |
| 1 | 2020-01-02 00:03:00 |
| 1 | 2020-01-05 00:03:00 |
| 1 | 2020-01-07 00:03:00 |
| 1 | 2020-01-11 00:03:00 |
| 1 | 2020-01-12 00:03:00 |
| 1 | 2020-01-13 00:03:00 |
| 1 | 2020-01-14 00:03:00 |
| 1 | 2020-01-17 00:03:00 |
| 1 | 2020-01-18 00:03:00 |
| 1 | 2020-01-19 00:03:00 |
| 1 | 2020-01-22 00:03:00 |
| 1 | 2020-02-22 00:03:00 |
| 1 | 2020-03-22 00:03:00 |
| 1 | 2020-03-23 00:03:00 |
| 2 | 2020-01-01 00:04:00 |
| 2 | 2020-01-01 00:05:00 |
| 2 | 2020-01-01 00:06:00 |
| 2 | 2020-01-02 00:06:00 |
| 2 | 2020-01-07 00:06:00 |
| 2 | 2020-01-11 00:06:00 |
| 2 | 2020-01-12 00:06:00 |
| 2 | 2020-01-14 00:06:00 |
| 2 | 2020-01-18 00:06:00 |
| 2 | 2020-01-19 00:06:00 |
| 2 | 2020-01-22 00:06:00 |
+---------+---------------------+
28 rows in set (Elapsed: 00:00:00.00)
如下我们先按照一步步的分析,解析整个步骤。
拿到本地登录的下次登录信息
通过lead函数,可以拿到下一行的数据信息。其中必须用user_id做partition,因为是多人。另外要通过登录信息正向排序,以便拿到上一次的。最终用
lag(login_date)over(partition by user_id order by login_date) login_date_next
拿到上一次的登录时间 login_date_last。 最后一次登录的下一次为NULL。
gbase> create table test_login_last as select user_id,login_date,lag(login_date)over(partition by user_id order by login_date) login_date_last from test_login order by user_id,login_date;
Query OK, 25 rows affected (Elapsed: 00:00:01.07)
gbase> select * from test_login_last;
+---------+---------------------+---------------------+
| user_id | login_date | login_date_last |
+---------+---------------------+---------------------+
| 1 | 2020-01-01 00:01:00 | NULL |
| 1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 |
| 1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 |
| 1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 |
| 1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 |
| 1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 |
| 1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 |
| 1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 |
| 1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 |
| 1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 |
| 1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 |
| 1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 |
| 1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 |
| 1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 |
| 1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 |
| 1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 |
| 1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 |
| 2 | 2020-01-01 00:04:00 | NULL |
| 2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 |
| 2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 |
| 2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 |
| 2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 |
| 2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 |
| 2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 |
| 2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 |
| 2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 |
| 2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 |
| 2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 |
+---------+---------------------+---------------------+
28 rows in set (Elapsed: 00:00:00.01)
计算登录时间差距
计算本次登录和上次登录的时间差,这里用datediff进行,如果是其它粒度,请参考第一节的介绍。
获得了diff字段,内容格式2个时间相差的【天数】。注意本例中,一天内的多次登录的天数差距是0.
gbase> create table test_login_diff as select user_id,login_date,login_date_last,datediff(login_date,login_date_last)diff from test_login_last;
Query OK, 25 rows affected (Elapsed: 00:00:00.86)
gbase> select * from test_login_diff;
+---------+---------------------+---------------------+------+
| user_id | login_date | login_date_last | diff |
+---------+---------------------+---------------------+------+
| 1 | 2020-01-01 00:01:00 | NULL | NULL |
| 1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 | 0 |
| 1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 | 0 |
| 1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 | 1 |
| 1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 | 3 |
| 1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 | 2 |
| 1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 | 4 |
| 1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 | 1 |
| 1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 | 1 |
| 1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 | 1 |
| 1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 | 3 |
| 1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 | 1 |
| 1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 | 1 |
| 1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 | 3 |
| 1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 | 31 |
| 1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 | 29 |
| 1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 | 1 |
| 2 | 2020-01-01 00:04:00 | NULL | NULL |
| 2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 | 0 |
| 2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 | 0 |
| 2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 | 1 |
| 2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 | 5 |
| 2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 | 4 |
| 2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 | 1 |
| 2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 | 2 |
| 2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 | 4 |
| 2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 | 1 |
| 2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 | 3 |
+---------+---------------------+---------------------+------+
28 rows in set (Elapsed: 00:00:00.01)
获得连续登录天数
连续登录,按题目要求,是间隔时间小于等于3。因为一个人可以出现多次登录中断,而每次中断就都需要重新计算【连续】天数,所以要先计算其中断的次数。
这里将登录天数>3的设置为1。
gbase> create table test_login_flag1 as select *,if(diff<=3,0,1) flag1 from test_login_diff order by user_id,login_date;
Query OK, 28 rows affected (Elapsed: 00:00:00.98)
gbase> select * from test_login_flag1;
+---------+---------------------+---------------------+------+-------+
| user_id | login_date | login_date_last | diff | flag1 |
+---------+---------------------+---------------------+------+-------+
| 1 | 2020-01-01 00:01:00 | NULL | NULL | 1 |
| 1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 | 0 | 0 |
| 1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 | 0 | 0 |
| 1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 | 1 | 0 |
| 1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 | 3 | 0 |
| 1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 | 2 | 0 |
| 1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 | 4 | 1 |
| 1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 | 1 | 0 |
| 1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 | 1 | 0 |
| 1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 | 1 | 0 |
| 1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 | 3 | 0 |
| 1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 | 1 | 0 |
| 1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 | 1 | 0 |
| 1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 | 3 | 0 |
| 1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 | 31 | 1 |
| 1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 | 29 | 1 |
| 1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 | 1 | 0 |
| 2 | 2020-01-01 00:04:00 | NULL | NULL | 1 |
| 2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 | 0 | 0 |
| 2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 | 0 | 0 |
| 2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 | 1 | 0 |
| 2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 | 5 | 1 |
| 2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 | 4 | 1 |
| 2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 | 1 | 0 |
| 2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 | 2 | 0 |
| 2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 | 4 | 1 |
| 2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 | 1 | 0 |
| 2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 | 3 | 0 |
+---------+---------------------+---------------------+------+-------+
28 rows in set (Elapsed: 00:00:00.02)
计算连续登录中断次数
也就是当前记录以前,出现过flag1=1的sum值。
sum(flag1)over(partition by user_id order by login_date) flag2
gbase> create table test_login_flag2 as select *,sum(flag1)over(partition by user_id order by login_date) flag2 from test_login_flag1;
Query OK, 28 rows affected (Elapsed: 00:00:01.26)
gbase> select * from test_login_flag2;
+---------+---------------------+---------------------+------+-------+-------+
| user_id | login_date | login_date_last | diff | flag1 | flag2 |
+---------+---------------------+---------------------+------+-------+-------+
| 1 | 2020-01-01 00:01:00 | NULL | NULL | 1 | 1 |
| 1 | 2020-01-01 00:02:00 | 2020-01-01 00:01:00 | 0 | 0 | 1 |
| 1 | 2020-01-01 00:03:00 | 2020-01-01 00:02:00 | 0 | 0 | 1 |
| 1 | 2020-01-02 00:03:00 | 2020-01-01 00:03:00 | 1 | 0 | 1 |
| 1 | 2020-01-05 00:03:00 | 2020-01-02 00:03:00 | 3 | 0 | 1 |
| 1 | 2020-01-07 00:03:00 | 2020-01-05 00:03:00 | 2 | 0 | 1 |
| 1 | 2020-01-11 00:03:00 | 2020-01-07 00:03:00 | 4 | 1 | 2 |
| 1 | 2020-01-12 00:03:00 | 2020-01-11 00:03:00 | 1 | 0 | 2 |
| 1 | 2020-01-13 00:03:00 | 2020-01-12 00:03:00 | 1 | 0 | 2 |
| 1 | 2020-01-14 00:03:00 | 2020-01-13 00:03:00 | 1 | 0 | 2 |
| 1 | 2020-01-17 00:03:00 | 2020-01-14 00:03:00 | 3 | 0 | 2 |
| 1 | 2020-01-18 00:03:00 | 2020-01-17 00:03:00 | 1 | 0 | 2 |
| 1 | 2020-01-19 00:03:00 | 2020-01-18 00:03:00 | 1 | 0 | 2 |
| 1 | 2020-01-22 00:03:00 | 2020-01-19 00:03:00 | 3 | 0 | 2 |
| 1 | 2020-02-22 00:03:00 | 2020-01-22 00:03:00 | 31 | 1 | 3 |
| 1 | 2020-03-22 00:03:00 | 2020-02-22 00:03:00 | 29 | 1 | 4 |
| 1 | 2020-03-23 00:03:00 | 2020-03-22 00:03:00 | 1 | 0 | 4 |
| 2 | 2020-01-01 00:04:00 | NULL | NULL | 1 | 1 |
| 2 | 2020-01-01 00:05:00 | 2020-01-01 00:04:00 | 0 | 0 | 1 |
| 2 | 2020-01-01 00:06:00 | 2020-01-01 00:05:00 | 0 | 0 | 1 |
| 2 | 2020-01-02 00:06:00 | 2020-01-01 00:06:00 | 1 | 0 | 1 |
| 2 | 2020-01-07 00:06:00 | 2020-01-02 00:06:00 | 5 | 1 | 2 |
| 2 | 2020-01-11 00:06:00 | 2020-01-07 00:06:00 | 4 | 1 | 3 |
| 2 | 2020-01-12 00:06:00 | 2020-01-11 00:06:00 | 1 | 0 | 3 |
| 2 | 2020-01-14 00:06:00 | 2020-01-12 00:06:00 | 2 | 0 | 3 |
| 2 | 2020-01-18 00:06:00 | 2020-01-14 00:06:00 | 4 | 1 | 4 |
| 2 | 2020-01-19 00:06:00 | 2020-01-18 00:06:00 | 1 | 0 | 4 |
| 2 | 2020-01-22 00:06:00 | 2020-01-19 00:06:00 | 3 | 0 | 4 |
+---------+---------------------+---------------------+------+-------+-------+
28 rows in set (Elapsed: 00:00:00.00)
计算每次中断内的连续登录间隔
根据user_id和flag2, 计算最大和最小登录时间的差距。
gbase> create table test_login_continue as select user_id,flag2,datediff(max(login_date),min(login_date)) cont from test_login_flag2 group by user_id,flag2 order by user_id,flag2;
Query OK, 8 rows affected (Elapsed: 00:00:01.11)
gbase> select * from test_login_continue;
+---------+-------+------+
| user_id | flag2 | cont |
+---------+-------+------+
| 1 | 1 | 6 |
| 1 | 2 | 11 |
| 1 | 3 | 0 |
| 1 | 4 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 0 |
| 2 | 3 | 3 |
| 2 | 4 | 4 |
+---------+-------+------+
8 rows in set (Elapsed: 00:00:00.01)
得到最长的连续登录间隔
这里根据题目要求,需要+1,1号到4号间隔3天,计算为4天连续登录。
gbase> select user_id,max(cont)+1 cont from test_login_continue group by user_id order by user_id;
+---------+------+
| user_id | cont |
+---------+------+
| 1 | 12 |
| 2 | 5 |
+---------+------+
2 rows in set (Elapsed: 00:00:00.15)