pyspark.sql.SparkSession.createDataFrame

SparkSession.createDataFrame(data: Union[pyspark.rdd.RDD[Any], Iterable[Any], PandasDataFrameLike, ArrayLike], schema: Union[pyspark.sql.types.AtomicType, pyspark.sql.types.StructType, str, None] = None, samplingRatio: Optional[float] = None, verifySchema: bool = True) → pyspark.sql.dataframe.DataFrame

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.

New in version 2.0.0.

Changed in version 3.4.0: Supports Spark Connect.
Parameters

data : RDD or iterable
    an RDD of any kind of SQL data representation (Row, tuple, int, boolean, etc.), or a list, pandas.DataFrame or numpy.ndarray.

schema : pyspark.sql.types.DataType, str or list, optional
    a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.

    When schema is a list of column names, the type of each column will be inferred from data.

    When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.

    When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later.

samplingRatio : float, optional
    the sample ratio of rows used for schema inference. The first few rows will be used if samplingRatio is None. See the final example below.

verifySchema : bool, optional
    verify data types of every row against schema. Enabled by default. See the final example below.

    New in version 2.1.0.

Returns

DataFrame
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
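As a minimal sketch of enabling that option (assuming the running SparkSession is named spark, as in the examples below, and using an illustrative pandas DataFrame), the configuration can be set at runtime before converting pandas data:

>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> import pandas
>>> pdf = pandas.DataFrame({'name': ['Alice'], 'age': [1]})
>>> spark.createDataFrame(pdf).collect()
[Row(name='Alice', age=1)]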
Examples
Create a DataFrame from a list of tuples.
>>> spark.createDataFrame([('Alice', 1)]).collect()
[Row(_1='Alice', _2=1)]
>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()
[Row(name='Alice', age=1)]
Create a DataFrame from a list of dictionaries.

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name='Alice')]
Create a DataFrame from an RDD.
>>> rdd = spark.sparkContext.parallelize([('Alice', 1)])
>>> spark.createDataFrame(rdd).collect()
[Row(_1='Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]
Create a DataFrame from Row instances.
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name='Alice', age=1)]
Create a DataFrame with the explicit schema specified.
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name='Alice', age=1)]
Create a DataFrame from a pandas DataFrame.
>>> import pandas
>>> spark.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]
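Create a DataFrame from a numpy.ndarray. This is a minimal sketch with explicit column names; it assumes the array's integer values are inferred as Spark long integers.

>>> import numpy as np
>>> spark.createDataFrame(np.array([[1, 2], [3, 4]]), ['a', 'b']).collect()
[Row(a=1, b=2), Row(a=3, b=4)]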
Create a DataFrame from an RDD with the schema in DDL formatted string.
>>> spark.createDataFrame(rdd, "a: string, b: int").collect() [Row(a='Alice', b=1)] >>> rdd = rdd.map(lambda row: row[1]) >>> spark.createDataFrame(rdd, "int").collect() [Row(value=1)]
When the schema does not match the data, an exception is thrown.

>>> spark.createDataFrame(rdd, "boolean").collect()
Traceback (most recent call last):
    ...
Py4JJavaError: ...