根据最接近的DateTime合并两个Pandas DataFrames

根据最接近的DateTime合并两个Pandas DataFrames

在这篇文章中,我们将讨论如何根据最接近的日期时间来合并Pandas DataFrame。要学习如何合并数据框架,首先你必须学习如何创建一个数据框架,为此你必须参考《创建Pandas数据框架》一文。在创建完数据框架后,需要对其进行合并,为了合并数据框架,有一个名为merge_asof()的函数,当需要写这个函数时,可以写成。

pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')

注意:

  • 要了解这个函数的更多信息,请参考文章pandas.merge_asof() function in Python
  • 数据帧必须按键进行排序。

Step-by-step 步骤

第1步:导入pandas库

为了完成这项任务,我们必须导入名为Pandas的库。

import pandas as pd

步骤2:创建数据框架

在这一步,我们必须使用 “pd.DataFrame() “函数来创建数据帧。在这里,我们创建了两个数据框架,一个被命名为左,另一个被命名为右,因为我们的最后目标是根据最接近的DateTime来合并两个数据框架。它可以被写成。

left = pd.DataFrame( {

“time”: [pd.Timestamp(“2020-03-25 13:30:00.023”),

pd.Timestamp(“2020-03-25 13:30:00.023”),

pd.Timestamp(“2020-03-25 13:30:00.030”),

pd.Timestamp(“2020-03-25 13:30:00.041”),

pd.Timestamp(“2020-03-25 13:30:00.048”),

pd.Timestamp(“2020-03-25 13:30:00.049”),

pd.Timestamp(“2020-03-25 13:30:00.072”),

pd.Timestamp(“2020-03-25 13:30:00.075”)

],

“ticker”: [“GOOG”,”MSFT”,”MSFT”,”MSFT”,”GOOG”,”AAPL”,”GOOG”,”MSFT”],

“bid”: [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],

“ask”: [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]

})

right = pd.DataFrame( {

“time”: [

pd.Timestamp(“2020-03-25 13:30:00.023”),

pd.Timestamp(“2020-03-25 13:30:00.038”),

pd.Timestamp(“2020-03-25 13:30:00.048”),

pd.Timestamp(“2020-03-25 13:30:00.048”),

pd.Timestamp(“2020-03-25 13:30:00.048”)

],

“ticker”: [“MSFT”, “MSFT”, “GOOG”, “GOOG”, “AAPL”],

“price”: [51.95, 51.95, 720.77, 720.92, 98.0],

“quantity”: [75, 155, 100, 100, 100]

})

第3步:合并数据帧并打印它们

在这一步中,将使用函数 “pd.merge_asof() “来合并数据框架。merge_asof()函数的结果被存储在一个变量中,然后使用 “print() “来打印该变量。

# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
   
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
   
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(right, left, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)

输出 :

根据最接近的DateTime合并两个Pandas DataFrames

例子1:现在我们在merge_asof函数中改变左右Dataframe的位置。

# Importing the required package
import pandas as pd
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)

输出:

根据最接近的DateTime合并两个Pandas DataFrames

注意:因此,从我们的2个输出中可以看出,当我们把右边的DataFrame放在第一位时,输出中的行数是5,等于右边DataFrame中的行数;当左边的DataFrame放在第一位时,输出中的行数等于左边DataFrame中的行数。如果我们看一下这两个输出并进行比较,那么我们可以很容易的说merge_asof()与left-join相似,只是我们在最近的键上进行匹配,而不是等键。

例子2:我们只在报价时间和交易时间之间的2ms内进行。

# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time", by="ticker",
                                 tolerance=pd.Timedelta("2ms"))
 
# print the variable
print(merged_dataframe)

输出 :

根据最接近的DateTime合并两个Pandas DataFrames

例子3:我们只在报价时间和交易时间之间的10毫秒内进行统计,我们排除了时间上的完全匹配。然而,之前的数据会向前传播。

# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
   
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time", by="ticker",
                                 tolerance=pd.Timedelta("2ms"),
                                 allow_exact_matches=False)
 
# print the variable
print(merged_dataframe)

输出 :

根据最接近的DateTime合并两个Pandas DataFrames

例子4:当两个地方都使用同一个DataFrame时。在这个左边的Dataframe是在两边使用的。

# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
   
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
   
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, left, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)

输出 :

根据最接近的DateTime合并两个Pandas DataFrames

它将同一个数据框创建为两个框架,一个表示为x,另一个创建为y,即bid_x, bid_y, ask_x, ask_y。

Python教程

Java教程

Web教程

数据库教程

图形图像教程

大数据教程

开发工具教程

计算机教程