如何查找和删除Pandas数据框架中的重复列

让我们讨论一下如何在Pandas数据框架中查找和删除重复的列。首先，让我们创建一个简单的数据框架，列名为 “姓名”、”年龄”、”住所 “和 “标记”。

# Import pandas library 
import pandas as pd
  
# List of Tuples
students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
         ]
  
# Create a DataFrame object
df = pd.DataFrame(students, columns =['Name', 'Age', 'Domicile', 'Marks'])
  
# Print a original dataframe
df

输出:
如何查找和删除Pandas数据框架中的重复列？

代码1：在一个数据框架中查找重复的列。
为了找到重复的列，我们需要遍历一个DataFrame的所有列，对于每一个列，它将搜索DataFrame中是否有任何其他列已经存在相同的内容。如果有，那么该列名将被存储在重复列集中。最后，该函数将返回重复列的列名列表。

# import pandas library 
import pandas as pd
  
# This function take a dataframe
# as a parameter and returning list
# of column names whose contents 
# are duplicates.
def getDuplicateColumns(df):
  
    # Create an empty set
    duplicateColumnNames = set()
      
    # Iterate through all the columns 
    # of dataframe
    for x in range(df.shape[1]):
          
        # Take column at xth index.
        col = df.iloc[:, x]
          
        # Iterate through all the columns in
        # DataFrame from (x + 1)th index to
        # last index
        for y in range(x + 1, df.shape[1]):
              
            # Take column at yth index.
            otherCol = df.iloc[:, y]
              
            # Check if two columns at x & y
            # index are equal or not,
            # if equal then adding 
            # to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
                  
    # Return list of unique column names 
    # whose contents are duplicates.
    return list(duplicateColumnNames)
  
# Driver code
if __name__ == "__main__" :
  
    # List of Tuples
    students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
          ]
  
    # Create a DataFrame object
    df = pd.DataFrame(students, 
                         columns =['Name', 'Age', 'Domicile', 'Marks'])
  
  
    # Get list of duplicate columns
    duplicateColNames = getDuplicateColumns(df)
  
    print('Duplicate Columns are :')
        
    # Iterate through duplicate
    # column names
    for column in duplicateColNames :
       print('Column Name : ', column)

输出:
如何查找和删除Pandas数据框架中的重复列？

代码2：在一个DataFrame中删除重复的列。
为了删除重复的列，我们可以将用户定义的函数getDuplicateColumns()返回的重复列列表传递给Dataframe.drop()方法。

# import pandas library 
import pandas as pd
  
  
# This function take a dataframe
# as a parameter and returning list
# of column names whose contents 
# are duplicates.
def getDuplicateColumns(df):
  
    # Create an empty set
    duplicateColumnNames = set()
      
    # Iterate through all the columns 
    # of dataframe
    for x in range(df.shape[1]):
          
        # Take column at xth index.
        col = df.iloc[:, x]
          
        # Iterate through all the columns in
        # DataFrame from (x + 1)th index to
        # last index
        for y in range(x + 1, df.shape[1]):
              
            # Take column at yth index.
            otherCol = df.iloc[:, y]
              
            # Check if two columns at x & y
            # index are equal or not,
            # if equal then adding 
            # to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
                  
    # Return list of unique column names 
    # whose contents are duplicates.
    return list(duplicateColumnNames)
  
# Driver code
if __name__ == "__main__" :
  
    # List of Tuples
    students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
          ]
  
    # Create a DataFrame object
    df = pd.DataFrame(students, 
                        columns =['Name', 'Age', 'Domicile', 'Marks'])
  
    # Dropping duplicate columns
    rslt_df = df.drop(columns = getDuplicateColumns(df))
  
    print("Resultant Dataframe :")
  
    # Show the dataframe
    rslt_df

输出:

如何查找和删除Pandas数据框架中的重复列？