pyspark.pandas.DataFrame.describe

DataFrame.describe(percentiles: Optional[List[float]] = None) → pyspark.pandas.frame.DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
percentiles : list of float in range [0.0, 1.0], default [0.25, 0.5, 0.75]

A list of percentiles to be computed.

Returns
DataFrame

Summary statistics of the DataFrame provided.

See also

DataFrame.count

Count number of non-NA/null observations.

DataFrame.max

Maximum of the values in the object.

DataFrame.min

Minimum of the values in the object.

DataFrame.mean

Mean of the values.

DataFrame.std

Standard deviation of the observations.

Notes

For numeric data, the result’s index will include count, mean, std, min, 25%, 50%, 75%, max.
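
As a rough illustration of what each numeric label means, the following standard-library sketch builds the same set of rows for a small list. It is not the library's implementation: the helper name `describe_numeric` is hypothetical, and the nearest-rank percentile rule is an assumption chosen because it returns values that are present in the data, which agrees with the outputs shown in the examples on this page.

```python
import math
import statistics


def describe_numeric(values, percentiles=(0.25, 0.5, 0.75)):
    """Summarize numeric values using the row labels listed above.

    Hypothetical helper for illustration only. Needs at least two
    non-NaN values because the sample standard deviation is undefined
    for fewer.
    """
    # NaN values are excluded from the summary.
    data = sorted(float(v) for v in values if not math.isnan(v))
    n = len(data)
    summary = {
        "count": float(n),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
    }
    for p in percentiles:
        # Nearest-rank rule (an assumption): pick an existing element.
        rank = max(math.ceil(p * n), 1)
        summary[f"{p:.0%}"] = data[rank - 1]
    summary["max"] = data[-1]
    return summary


result = describe_numeric([1, 2, 3])
for label, value in result.items():
    print(f"{label:<6} {value}")
```

Note that `std` here is the sample standard deviation, which is what gives 1.0 for the values 1, 2, 3.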

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
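
The object-data labels can be read the same way. The sketch below uses only the Python standard library; `describe_object` is a hypothetical helper and the sample values are made up for illustration, so treat it as an explanation of the four labels rather than as the library's code.

```python
from collections import Counter


def describe_object(values):
    """Summarize object values as count, unique, top and freq.

    Hypothetical helper for illustration only; expects at least one
    non-null value.
    """
    # Null entries are left out, mirroring the exclusion of missing values.
    data = [v for v in values if v is not None]
    counts = Counter(data)
    # most_common(1) returns the value with the highest frequency.
    top, freq = counts.most_common(1)[0]
    return {
        "count": len(data),     # number of non-null observations
        "unique": len(counts),  # number of distinct values
        "top": top,             # most common value
        "freq": freq,           # how often the most common value occurs
    }


# Made-up sample data, not taken from the examples on this page.
summary = describe_object(["a", "b", "b", None])
print(summary)
```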

Examples

Describing a numeric Series.

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64

Describing a DataFrame. Only numeric fields are returned.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0

For multi-index columns:

>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0

>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64

Describing a DataFrame and selecting custom percentiles.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles=[0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0
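
In the output above, the call passes `[0.85, 0.15]` yet the rows appear as 15%, 50%, 85%: the labels are in ascending order and the median is present even though it was not requested. The sketch below reproduces that labelling in plain Python. The function name `percentile_labels` is hypothetical, and the rule that 50% is always added is an inference from this output, not a statement about the library's internals.

```python
def percentile_labels(percentiles=None):
    """Return row labels for the requested percentiles, in ascending order.

    Hypothetical helper for illustration only.
    """
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]  # documented default
    # The parameter is documented as floats in the range [0.0, 1.0].
    if any(p < 0.0 or p > 1.0 for p in percentiles):
        raise ValueError("percentiles must lie in the range [0.0, 1.0]")
    # Assumption: the median is added when missing, as the output suggests.
    wanted = sorted(set(percentiles) | {0.5})
    return [f"{p:.0%}" for p in wanted]


labels = percentile_labels([0.85, 0.15])
print(labels)
```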

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64

Describing a column from a DataFrame by accessing it as an attribute and selecting custom percentiles.

>>> df.numeric1.describe(percentiles=[0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64