Spark 3.0: new features and usage examples

For our new program "Apache Spark for Data Engineers" and the webinar about the course on December 2, we have prepared a translation of an overview article on Spark 3.0.

Spark 3.0 shipped with a long list of significant improvements, including better performance through AQE, reading binary files, improved SQL and Python support, Python 3, Hadoop 3 integration, and ACID support.

In this article, the author tries to give examples of how to use these new features. This is the first article on Spark 3.0 features, and the series is expected to continue.

This article highlights the following Spark 3.0 features:

Adaptive Query Execution (AQE) framework
Support for new language versions
New UI for Structured Streaming
Reading binary files
Recursive folder traversal
Support for multi-character data delimiters (||)
New built-in Spark functions
Switch to the Proleptic Gregorian calendar
DataFrame tail
repartition function in SQL queries
Improved ANSI SQL compatibility

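Adaptive Query Execution (AQE)

Adaptive Query Execution (AQE) is one of the major features of Spark 3.0: it re-optimizes query plans at runtime, using statistics collected while the query executes.

Before 3.0, Spark optimized a query once, when it built the execution plan before the query started, and did not revisit the plan once execution was under way. AQE closes that gap by re-optimizing the plan from the statistics gathered as each stage completes.

Adaptive Query Execution (AQE) is disabled by default. To enable it, set spark.sql.adaptive.enabled to true. With AQE enabled, Spark is reported to perform better on the TPC-DS benchmark than Spark 2.4.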

AQE in Spark 3.0 includes 3 main features:

Coalescing shuffle partitions
Converting a sort-merge join to a broadcast join
Optimizing skew joins
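As a minimal sketch of turning AQE on, the property named above can be set when the session is built. The object name, application name, and local master are assumptions made for illustration, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

object AqeConfigExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqe-config-example") // hypothetical application name
      .master("local[*]")            // assumption: local run for illustration
      .config("spark.sql.adaptive.enabled", "true") // AQE is off by default in Spark 3.0
      .getOrCreate()

    // Read the setting back to confirm it applies to this session
    println(spark.conf.get("spark.sql.adaptive.enabled"))

    spark.stop()
  }
}
```

The same property can also be passed on the command line with --conf when submitting a job.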

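Support for new language versions

Spark 3.0 updates the language versions it supports:

Python 3 (Python 2.x is deprecated)
Scala 2.12
JDK 11

Beyond the language changes, Hadoop 3 is supported, along with Kafka 2.4.1 and other updates.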

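New UI for Spark Structured Streaming

Spark 3.0 adds a new tab to the Spark web UI for monitoring Structured Streaming queries.

The tab has 2 sections:

[Screenshot of the Structured Streaming tab. Source: Databricks]

"Active Streaming Queries" lists the queries that are currently running, and "Completed Streaming Queries" lists the queries that have finished.

Clicking a Run ID opens detailed statistics for that query, such as input rate, process rate, batch duration, and operation duration. More examples can be found at Databricks.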

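Reading binary files

Spark 3.0 adds a new "binaryFile" data source for reading binary files.

With binaryFile, DataFrameReader can read image, pdf, zip, gzip, tar, and other binary files into a DataFrame. Each file becomes a single row that holds the raw content of the file together with its metadata.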

val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina...|2020-07-25 10:11:...| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+

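Recursive folder traversal

Spark 3.0 adds the recursiveFileLookup option for reading files from nested subfolders. When it is set to true, DataFrameReader loads files recursively, walking through every folder and subfolder under the given path.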

spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")
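The two reader features described so far can be combined. The sketch below builds a small temporary folder tree and reads only the .bin files from it, at any depth. The file names, the object name, and the use of the pathGlobFilter option to filter by file name are choices made for this illustration, not taken from the original article.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.SparkSession

object RecursiveBinaryRead {
  // Builds a nested folder tree with three dummy files (hypothetical names) for the demo.
  def makeSampleTree(): Path = {
    val root = Files.createTempDirectory("binary-demo")
    val nested = Files.createDirectories(root.resolve("2020").resolve("07"))
    Files.write(root.resolve("a.bin"), Array[Byte](1, 2, 3))
    Files.write(nested.resolve("b.bin"), Array[Byte](4, 5))
    Files.write(nested.resolve("notes.txt"), "skip me".getBytes("UTF-8"))
    root
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("recursive-binary-read") // hypothetical application name
      .master("local[*]")               // assumption: local run for illustration
      .getOrCreate()

    val root = makeSampleTree()

    val df = spark.read
      .format("binaryFile")
      .option("recursiveFileLookup", "true") // descend into subfolders
      .option("pathGlobFilter", "*.bin")     // keep only files whose name matches
      .load(root.toString)

    // One row per matching file, with its path and size in bytes
    df.select("path", "length").show(false)

    spark.stop()
  }
}
```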

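Support for multi-character delimiters (||)

Spark 3.0 supports multi-character delimiters (||) when reading CSV files. For example, suppose a CSV file has the following content:

col1||col2||col3||col4
val1||val2||val3||val4

It can be read like this:

val df = spark.read
  .option("delimiter","||")
  .option("header","true")
  .csv("/tmp/data/douplepipedata.csv")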

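Spark 2.x did not support multi-character delimiters, so such files had to be converted to a single-character delimiter by a preprocessor before loading. Reading the file directly in Spark 2.x fails with the following error: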

throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||

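New built-in Spark functions

Spark 3.0 adds the following functions to the already large list of built-in Spark SQL functions.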

sinh, cosh, tanh, asinh, acosh, atanh, any, bit_and, bit_or, bit_count, bit_xor,
bool_and, bool_or, count_if, date_part, extract, forall, from_csv,
make_date, make_interval, make_timestamp, map_entries,
map_filter, map_zip_with, max_by, min_by, schema_of_csv, to_csv,
transform_keys, transform_values, typeof, version,
xxhash64

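Switch to the Proleptic Gregorian calendar

Earlier versions of Spark used a hybrid of two calendars: the Julian calendar for dates before 1582 and the Gregorian calendar for later dates.

This matches the behavior of the java.sql.Date API used in JDK 7 and earlier. JDK 8 introduced the java.time.LocalDate API, which uses the Proleptic Gregorian calendar.

Spark 3.0 adopts the Proleptic Gregorian calendar, which is already used by other data systems such as Pandas, R, and Apache Arrow. For dates on or after October 15, 1582, the Date & Timestamp functions behave in Spark 3.0 as they did before. Results may differ for dates before October 15, 1582.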
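The difference between the two calendars can be seen on the JVM without Spark. The sketch below asks both APIs for the day after October 4, 1582. It uses java.util.GregorianCalendar as a stand-in for the hybrid calendar behind java.sql.Date; that substitution and the object and method names are assumptions made for this illustration.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

object CalendarGapExample {
  // Proleptic Gregorian calendar (java.time): the Gregorian rules are extended backwards,
  // so October 5, 1582 is an ordinary date.
  def nextDayProleptic(): String =
    LocalDate.of(1582, 10, 4).plusDays(1).toString

  // Hybrid Julian/Gregorian calendar (java.util): the ten days skipped at the
  // 1582 cutover do not exist, so the next day is October 15.
  def nextDayHybrid(): String = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()
    cal.set(1582, Calendar.OCTOBER, 4)
    cal.add(Calendar.DAY_OF_MONTH, 1)
    val y = cal.get(Calendar.YEAR)
    val m = cal.get(Calendar.MONTH) + 1 // Calendar months are zero-based
    val d = cal.get(Calendar.DAY_OF_MONTH)
    f"$y%04d-$m%02d-$d%02d"
  }

  def main(args: Array[String]): Unit = {
    println(nextDayProleptic()) // 1582-10-05
    println(nextDayHybrid())    // 1582-10-15
  }
}
```

This ten-day gap is why results for dates before October 15, 1582 can differ between Spark 2.x and Spark 3.0.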

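Along with this change, Spark 3.0 adds new Date & Timestamp functions:

make_date(), make_timestamp(), make_interval().

make_date(year, month, day) – returns a Date built from the <year>, <month>, and <day> fields.

make_date(2014, 8, 13)
//returns 2014-08-13.

make_timestamp(year, month, day, hour, min, sec[, timezone]) – returns a Timestamp built from the <year>, <month>, <day>, <hour>, <min>, <sec>, and optional <timezone> fields.

make_timestamp(2014, 8, 13, 1, 10, 40.147)
//returns Timestamp 2014-08-13 01:10:40.147

make_timestamp(2014, 8, 13, 1, 10, 40.147, 'CET')

make_interval(years, months, weeks, days, hours, mins, secs) – returns an interval built from the given fields.

Passing invalid input to make_date() or make_timestamp() returns NULL.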

DataFrame.tail()

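Spark has a head() action that returns elements from the start of a DataFrame, but it had no tail() action, which Pandas in Python does support. Spark 3.0 adds tail(), which returns the last elements of a DataFrame. In Scala, tail() returns a scala.Array[T].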

val data=spark.range(1,100).toDF("num").tail(5)
data.foreach(print)
//Returns
//[95][96][97][98][99]

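repartition in SQL queries

Spark SQL queries lacked some of the operations available on Dataset/DataFrame. For example, Spark SQL did not support the repartition() function of Dataset. Spark 3.0 makes repartitioning available in SQL expressions. It is used to increase or decrease the number of partitions.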

val df=spark.range(1,10000).toDF("num")
println("Before re-partition :"+df.rdd.getNumPartitions)
df.createOrReplaceTempView("RANGE_TABLE")
// Repartition through a SQL hint
val df2=spark.sql("SELECT /*+ REPARTITION(20) */ * FROM RANGE_TABLE")
println("After re-partition :"+df2.rdd.getNumPartitions)
//Returns
//Before re-partition :1
//After re-partition :20

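Improved ANSI SQL compatibility

Spark is used by many data engineers who may already know ANSI SQL, so Spark 3.0 improves its compatibility with that standard. To enable it, set the Spark configuration property spark.sql.parser.ansi.enabled to true (in the released Spark 3.0.0 this property is named spark.sql.ansi.enabled).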
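A minimal sketch of switching ANSI mode on for a running session is shown below. It assumes the property name used by the Spark 3.0.0 release, spark.sql.ansi.enabled, rather than the spark.sql.parser.ansi.enabled spelling. The object name and the overflow query are illustrative choices, and the result is printed rather than asserted because the behavior depends on the Spark version in use.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object AnsiModeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ansi-mode-example") // hypothetical application name
      .master("local[*]")           // assumption: local run for illustration
      .getOrCreate()

    // Assumption: property name as of the Spark 3.0.0 release
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // Integer overflow probe: prints Failure(...) if ANSI mode rejects the
    // expression, or Success(...) with the computed rows otherwise.
    val result = Try(spark.sql("SELECT 2147483647 + 1 AS n").collect().toList)
    println(result)

    spark.stop()
  }
}
```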

