pyspark.pandas.DataFrame.describe
DataFrame.describe(percentiles: Optional[List[float]] = None) → pyspark.pandas.frame.DataFrame
Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
percentiles : list of float in range [0.0, 1.0], default [0.25, 0.5, 0.75]
    A list of percentiles to calculate.
Returns
DataFrame
    Summary statistics of the DataFrame provided.
See also
DataFrame.count
    Count number of non-NA/null observations.
DataFrame.max
    Maximum of the values in the object.
DataFrame.min
    Minimum of the values in the object.
DataFrame.mean
    Mean of the values.
DataFrame.std
    Standard deviation of the observations.
Notes
For numeric data, the result’s index will include count, mean, std, min, 25%, 50%, 75%, max.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

Examples
Describing a numeric Series.

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64
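To make the non-percentile rows of this output concrete, the following plain-Python sketch recomputes count, mean, std, min and max for the same three values using only the standard library. It does not call pyspark, and it assumes that std is the sample standard deviation (n - 1 denominator), which is consistent with the 1.0 shown above. The name `summary` is illustrative only.

```python
import statistics

values = [1, 2, 3]

# Illustrative recomputation of the non-percentile rows; not the library's code.
summary = {
    "count": float(len(values)),
    "mean": float(statistics.mean(values)),
    # Assumption: sample standard deviation (divides by n - 1).
    "std": statistics.stdev(values),
    "min": float(min(values)),
    "max": float(max(values)),
}

for name, value in summary.items():
    print(f"{name:<8}{value}")
```

The percentile rows are left out on purpose, because this page does not state how they are computed.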
Describing a DataFrame. Only numeric fields are returned.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0
For multi-index columns:
>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0

>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64
Describing a DataFrame and selecting custom percentiles.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles=[0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0
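In the output above, the requested percentiles [0.85, 0.15] appear in ascending order, and a 50% row is present even though 0.5 was not requested. The sketch below mimics that labeling behavior in plain Python. `percentile_labels` is a hypothetical helper written for illustration, not part of pyspark, and the range check and the rule of always including the median are assumptions inferred from the parameter description and this example.

```python
from typing import List, Optional


def percentile_labels(percentiles: Optional[List[float]] = None) -> List[str]:
    """Hypothetical helper: build index labels such as '15%' from fractions."""
    if percentiles is None:
        # Documented default for the percentiles parameter.
        percentiles = [0.25, 0.5, 0.75]
    for p in percentiles:
        # Assumption: values outside the documented [0.0, 1.0] range are rejected.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"percentile {p} is outside [0.0, 1.0]")
    # Assumption: the median is always included and labels are sorted ascending,
    # as the example output suggests.
    ordered = sorted(set(percentiles) | {0.5})
    return [f"{p * 100:g}%" for p in ordered]


print(percentile_labels([0.85, 0.15]))
print(percentile_labels())
```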
Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64
Describing a column from a DataFrame by accessing it as an attribute and selecting custom percentiles.

>>> df.numeric1.describe(percentiles=[0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64
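The notes mention count, unique, top and freq for object data, but the examples on this page only cover numeric columns. As a plain-Python illustration of what those four entries mean, the sketch below summarizes a small list of strings with `collections.Counter`. `describe_objects` is a hypothetical helper, not a pyspark function, and treating `None` as the missing value is an assumption made for this sketch.

```python
from collections import Counter
from typing import Any, Dict, List


def describe_objects(values: List[Any]) -> Dict[str, Any]:
    """Hypothetical helper: count, unique, top and freq for object-like values."""
    # Assumption: None stands in for a missing value and is excluded.
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0, "unique": 0, "top": None, "freq": None}
    counts = Counter(present)
    # top is the most common value; freq is how often it occurs.
    top, freq = counts.most_common(1)[0]
    return {"count": len(present), "unique": len(counts), "top": top, "freq": freq}


result = describe_objects(["a", "b", "a", None])
print(result)
```

Here "a" occurs twice among three non-missing values, so it is the top value with a freq of 2.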