How to Handle Missing Values in a Time Series in Python

In this article, we discuss how to handle missing values in a time series using Python.

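A time series is a sequence of observations recorded at regular time intervals. Time series analysis is useful for tracking how a particular asset, security, or economic variable changes over time. Two questions come up right away: why do missing values appear in the data, and why do we need to handle them?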

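Handling missing data is an important part of preprocessing a dataset, because many machine learning algorithms do not support missing values.
A time series can have missing points because of problems reading or recording the data.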

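Why can't we simply replace the missing values with the global mean? Because time series data may contain structure such as seasonality or trend. Traditional approaches such as mean or mode imputation, or deleting the affected rows, are often inadequate because they can bias the data. Estimating the missing values with a procedure that uses the neighbouring observations generally keeps that bias smaller. Once the gaps are filled, the series is complete and ready for the next step of analysis or data mining.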
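To make the bias argument concrete, here is a minimal sketch on a made-up series with a pure upward trend. The series, the dates, and the positions of the gaps are all assumptions chosen for illustration. Because the true values are known, we can compare how far a global-mean fill and a neighbour-based fill land from them.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series with a pure upward trend: 10, 20, ..., 100
idx = pd.date_range("2021-01-03", periods=10, freq="W")
truth = pd.Series(np.arange(10, 110, 10, dtype=float), index=idx)

# Remove two observations to simulate recording failures
observed = truth.copy()
observed.iloc[[2, 7]] = np.nan

# Option A: replace every gap with the global mean of the observed values
mean_filled = observed.fillna(observed.mean())

# Option B: estimate each gap from its neighbours (linear interpolation)
interp_filled = observed.interpolate()

# Largest absolute error against the known true values
mean_error = (mean_filled - truth).abs().max()
interp_error = (interp_filled - truth).abs().max()
print("global mean error:  ", mean_error)
print("interpolation error:", interp_error)
```

On this trending series the global mean lands far from both missing points, while interpolation recovers them. Real data is noisier, so the difference will rarely be this clean.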

Method 1: Using the ffill() and bfill() methods

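These methods fill missing values based on their position in the sequence: each NaN is replaced with either the last observed non-NaN value or the next observed non-NaN value.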

Backfill (bfill): fills each gap with the next observed value.
Forward fill (ffill): fills each gap with the last observed value.

# import the libraries
import pandas as pd
import numpy as np
  
# dataframe with index as timeseries
time_sdata = pd.date_range("09/10/2021", periods=9, freq="W")
  
df = pd.DataFrame(index=time_sdata)
print(df)
  
# there are four missing values
df["example"] = [10001.0, 10002.0, 10003.0, np.nan,
                 10004.0, np.nan, np.nan, 10005.0, np.nan]
  
gfg1 = df.ffill()
print("Using ffill() function:-")
print(gfg1)
  
# Backfill the missing values with bfill().
# The last value stays NaN in the output because
# no later observation exists to copy back into it.
gfg2 = df.bfill()
print("Using bfill() function:-")
print(gfg2)

Output:

[Image: printed DataFrames after ffill() and bfill()]

Method 2: Using the interpolate() method

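This method is more sophisticated than the ffill() and bfill() methods above. It supports several interpolation methods, including "linear", "quadratic", and "nearest". Interpolation is a powerful way to fill missing values in time series data.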

# import the libraries
import pandas as pd
import numpy as np
  
# dataframe with index as timeseries
time_sdata = pd.date_range("09/10/2021", periods=9, freq="W")
  
df = pd.DataFrame(index=time_sdata)
print(df)
  
# there are four missing values
df["example"] = [10001.0, 10002.0, 10003.0, np.nan,
                 10004.0, np.nan, np.nan, 10005.0, np.nan]
  
# use interpolate() to fill the missing values;
# with no arguments it uses linear interpolation
# between the neighbouring observed values
dataframe1 = df.interpolate()
print(dataframe1)

Output:

[Image: printed DataFrame after interpolate()]
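The choice of method matters most when the timestamps are not evenly spaced. The sketch below is an illustration only: the dates and values are made up, and method="time" is a pandas option that the text above does not mention. It contrasts the default "linear" method, which treats the rows as equally spaced, with "time", which weights by the actual gap between timestamps. The "quadratic" and "nearest" methods are left out here because pandas delegates them to SciPy, which may not be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced timestamps: 1 day, then 3 days apart
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

# "linear" ignores the index and treats the rows as equally spaced,
# so the gap is filled with the midpoint of its neighbours
linear_fill = s.interpolate(method="linear")

# "time" uses the DatetimeIndex: the gap sits 1 day into a 4-day span,
# so it receives a quarter of the rise from 10 to 50
time_fill = s.interpolate(method="time")

print(linear_fill.iloc[1])  # 30.0
print(time_fill.iloc[1])    # 20.0
```

With a regular index, such as the weekly one used in the article's examples, the two methods give the same result.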

Method 3: Using the interpolate() method with the limit parameter

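The limit parameter sets the maximum number of consecutive NaN values to fill, forward or backward. In other words, a gap with more consecutive NaN values than this number is only partially filled.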

Syntax:

DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)

Note: only method='linear' is supported for a DataFrame/Series with a MultiIndex.

# import the libraries
import pandas as pd
import numpy as np
  
# dataframe with index as timeseries
time_sdata = pd.date_range("09/10/2021", periods=9, freq="W")
  
df = pd.DataFrame(index=time_sdata)
print(df)
  
# there are four missing values
df["example"] = [10001.0, 10002.0, 10003.0, np.nan,
                 10004.0, np.nan, np.nan, 10005.0, np.nan]
  
# Interpolate, filling at most 2 consecutive NaNs per gap, moving forward
# (no gap in this data is longer than 2, so every NaN gets filled)
dataframe = df.interpolate(limit=2, limit_direction="forward")
print(dataframe)

Output:

[Image: printed DataFrame after interpolate(limit=2, limit_direction="forward")]
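In the sample series above, no gap is longer than two values, so limit=2 never leaves anything unfilled. To see a partial fill, here is a minimal sketch on a made-up series (the values and dates are assumptions for illustration) with a gap of three NaNs and one trailing NaN. It compares both values of limit_direction.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a gap of three NaNs, plus one trailing NaN
idx = pd.date_range("2021-09-12", periods=6, freq="W")
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan], index=idx)

# Forward: fill at most 2 NaNs after each valid value.
# The third NaN in the long gap stays NaN; the trailing NaN is
# filled with the last valid value.
forward = s.interpolate(limit=2, limit_direction="forward")

# Backward: fill at most 2 NaNs before each valid value.
# The first NaN in the long gap stays NaN; the trailing NaN has
# no later valid value, so it stays NaN as well.
backward = s.interpolate(limit=2, limit_direction="backward")

print(pd.DataFrame({"original": s, "forward": forward, "backward": backward}))
```

Either direction leaves one hole in the three-value gap, which is the "partially filled" behaviour described above. Which end stays empty depends on limit_direction.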