暂无图片
暂无图片
1
暂无图片
暂无图片
暂无图片

pyspark案例系列3-dataframe实现mysql的group_concat功能

原创 只是甲 2021-05-24
1083

一.问题描述

今天写pyspark遇到一个问题,要实现同mysql的GROUP_CONCAT函数的功能

数据1:

col1   col2
1        a
1        b
1        c
2        d
2        f

想要的结果1:

col1   new_col2
1        a,b,c
2        d,f

如果存在多列是否也可行
数据2:

col1   col2   col3
1        a        100
1        b        200
1        c        300
2        d        400
2        f        500

想要的结果2:

col1   new_col2    new_col3
1        a,b,c   100,200,300
2        d,f       400,500

二.解决方案

pyspark的collect_list可以实现如下需求

代码:

#!/usr/bin/env python from pyspark import SparkContext, SparkConf from pyspark.sql.session import SparkSession from pyspark.sql.functions import collect_list from pyspark.sql.functions import collect_set spark_conf = SparkConf().setMaster("local").setAppName("test1") sc = SparkContext(conf = spark_conf) spark = SparkSession(sc) df1 = spark.createDataFrame([('1','a'),('1','b'),('1','c'),('2','d'),('2','f')], ['col1', 'col2']) df1.groupBy("col1").agg(collect_list('col2').alias('new_col2')).show() df2 = spark.createDataFrame([('1','a','100'),('1','b','200'),('1','c','300'),('2','d','400'),('2','f','500')], ['col1', 'col2', 'col3']) df2.groupBy("col1").agg(*[collect_set(col) for col in ['col2','col3']]).show() sc.stop()

测试记录:

--snip--
21/04/15 17:26:41 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
21/04/15 17:26:41 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/04/15 17:26:41 INFO hive.metastore: Connected to metastore.
+----+---------+
|col1| new_col2|
+----+---------+
|   1|[a, b, c]|
|   2|   [d, f]|
+----+---------+
--snip--
21/04/15 17:30:43 INFO scheduler.TaskSetManager: Finished task 64.0 in stage 19.0 (TID 401) in 8 ms on localhost (executor driver) (75/75)
21/04/15 17:30:43 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool 
21/04/15 17:30:43 INFO scheduler.DAGScheduler: ResultStage 19 (showString at NativeMethodAccessorImpl.java:0) finished in 0.272 s
21/04/15 17:30:43 INFO scheduler.DAGScheduler: Job 9 finished: showString at NativeMethodAccessorImpl.java:0, took 0.278859 s
+----+-----------------+-----------------+
|col1|collect_set(col2)|collect_set(col3)|
+----+-----------------+-----------------+
|   1|        [c, b, a]|  [100, 300, 200]|
|   2|           [f, d]|       [500, 400]|
+----+-----------------+-----------------+
--snip--

参考:
1.https://www.cnblogs.com/TTyb/p/10196544.html

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

文章被以下合辑收录

评论