Converting an XML Structure to a DataFrame Using BeautifulSoup

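Here we use Python's BeautifulSoup package to convert an XML structure into a DataFrame. BeautifulSoup is a Python library for parsing HTML and XML documents, and it is widely used for web scraping. To install it, run: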

pip install beautifulsoup4

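We will use this library to extract data from an XML file and then convert the extracted data into a DataFrame. Building the DataFrame requires the pandas library.

Pandas: a Python library for data manipulation and analysis. To install it, run: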

pip install pandas

Note: BeautifulSoup's 'xml' parser is provided by the lxml library. If you are asked to install a parser library, run:

pip install lxml
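
To confirm that a parser for XML is installed before going further, a quick check along these lines may help. This is only a sketch: the helper name xml_parser_available is made up for illustration, and it relies on BeautifulSoup raising FeatureNotFound when the requested parser (lxml for the 'xml' feature) is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def xml_parser_available() -> bool:
    """Return True if BeautifulSoup can build an 'xml' parser (backed by lxml)."""
    try:
        # Parsing a tiny document forces BeautifulSoup to look up the parser.
        BeautifulSoup("<root/>", "xml")
    except FeatureNotFound:
        return False
    return True


if xml_parser_available():
    print("The 'xml' parser is available.")
else:
    print("No XML parser found; try: pip install lxml")
```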

Step-by-step implementation:

Step 1: Import the libraries.

from bs4 import BeautifulSoup  
import pandas as pd

First, we import the libraries the program uses. Here we import the BeautifulSoup class from the bs4 module, and import pandas under the alias "pd".

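Step 2: Read the XML file.

with open("gfg.xml", "r") as file:
    contents = file.read()

Here we use the open("filename", "mode") function to open the XML file named "gfg.xml" in read mode "r" and bind it to the variable "file". We then call read() to read the actual contents of the file. Opening the file in a with block closes it automatically once the contents have been read.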
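
The contents of gfg.xml are not reproduced in this article, so the sketch below writes a small stand-in file and reads it back in the same way. The file name sample_books.xml, the sample record, and the UTF-8 encoding are all assumptions made for illustration; the real file may be laid out differently.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data; the real gfg.xml may look different.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101">
    <author>Doe, Jane</author>
    <title>An Example Book</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>A made-up entry for illustration.</description>
  </book>
</catalog>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_books.xml"
    sample_path.write_text(SAMPLE_XML, encoding="utf-8")

    # Read the whole file into one string; the with block closes the file.
    with open(sample_path, "r", encoding="utf-8") as file:
        contents = file.read()

print(len(contents), "characters read")
print("file closed:", file.closed)
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.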

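Step 3: Parse the XML.

soup = BeautifulSoup(contents, 'xml')

Here we pass the file data stored in the 'contents' variable to the BeautifulSoup constructor, together with the name of the parser to use, 'xml'.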
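
As a small illustration of what the parsed object offers, the sketch below parses an inline string instead of a file. The XML snippet is invented for the example, and the 'xml' parser requires lxml to be installed.

```python
from bs4 import BeautifulSoup

# Made-up XML used only for this illustration.
xml_text = """<catalog>
  <book id="bk101">
    <author>Doe, Jane</author>
    <title>An Example Book</title>
  </book>
</catalog>"""

soup = BeautifulSoup(xml_text, "xml")

# find() returns the first matching element, or None if there is no match.
first_book = soup.find("book")
print(first_book.name)                       # the tag name
print(first_book["id"])                      # an attribute value
print(first_book.find("title").get_text())   # the text inside a child tag
```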

Step 4: Search for the data.

Here we extract the data. We use the find_all() method, which returns every element whose tag name matches the name passed to it.

authors = soup.find_all('author')
titles = soup.find_all('title')
prices = soup.find_all('price')
pubdate = soup.find_all('publish_date')
genres = soup.find_all('genre')
des = soup.find_all('description')

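Example:

authors = soup.find_all('author')

The matching elements are stored in the 'authors' variable. The call find_all('author') finds every author tag in the XML file and returns them as a list-like ResultSet of Tag objects, so 'authors' holds one element per author tag in the file. The text inside each element is read with get_text() in the next step. The other statements work the same way.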
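
The difference between the elements that find_all() returns and the text they contain can be seen in a short sketch. The XML below is invented for the example, and the variable author_names is not part of the original program.

```python
from bs4 import BeautifulSoup

# Made-up XML used only for this illustration.
xml_text = """<catalog>
  <book><author>Doe, Jane</author><title>First Example</title></book>
  <book><author>Roe, Richard</author><title>Second Example</title></book>
</catalog>"""

soup = BeautifulSoup(xml_text, "xml")
authors = soup.find_all("author")

print(type(authors).__name__)   # the container type returned by find_all()
print(len(authors))             # one entry per <author> tag
print(authors[0])               # the whole element, tags included
print(authors[0].get_text())    # only the text inside the element

# Pull the text out of every element in one pass.
author_names = [author.get_text() for author in authors]
print(author_names)
```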

Step 5: Get the text data from the XML.

data = []
for i in range(len(authors)):
    rows = [authors[i].get_text(), titles[i].get_text(),
            genres[i].get_text(), prices[i].get_text(),
            pubdate[i].get_text(), des[i].get_text()]
    data.append(rows)

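We now have all the data extracted from the XML file, held in separate lists by tag. Next we need to combine the data that belongs to each book. A for loop gathers the values for one book from the different lists into a list named "rows", then appends that row to another list named "data". This relies on every book having each of the six tags, so that index i refers to the same book in every list.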
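
If some records might lack a tag, the index-based loop would pair values from different books. One possible alternative, sketched below, builds each row from its own parent element. It assumes every record is wrapped in a book element, which is a guess about the file layout; the XML, the FIELDS list, and the helper text_or_none are all hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up XML; the second book deliberately has no <price> tag.
xml_text = """<catalog>
  <book>
    <author>Doe, Jane</author>
    <title>First Example</title>
    <price>44.95</price>
  </book>
  <book>
    <author>Roe, Richard</author>
    <title>Second Example</title>
  </book>
</catalog>"""

FIELDS = ["author", "title", "price"]


def text_or_none(parent, tag_name):
    """Return the text of the first matching child tag, or None if absent."""
    child = parent.find(tag_name)
    return child.get_text() if child is not None else None


soup = BeautifulSoup(xml_text, "xml")

# One row per <book>, so values from different books are never mixed.
data = [[text_or_none(book, field) for field in FIELDS]
        for book in soup.find_all("book")]
print(data)
```

A missing value shows up as None in its row, and pandas treats it as a missing value when the rows are loaded into a DataFrame.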

Step 6: Print the DataFrame.

Finally, we have the combined data for each book as separate rows. Now we convert this list into a DataFrame.

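df = pd.DataFrame(data, columns=['Author', 'Book Title',
                                 'Genre', 'Price', 'Publish Date',
                                 'Description'])
df['Price'] = pd.to_numeric(df['Price'])
print(df)

Output:

[figure: the resulting DataFrame]

Here the pd.DataFrame() constructor converts the list into a DataFrame. We pass it the list "data" and the names we want for the columns. The values extracted from the XML are all strings, so the Price column is converted to numbers separately with pd.to_numeric(), assuming the prices are plain numbers. Passing dtype=float to the constructor would try to cast every column, including the text ones.

We have now extracted the data from the XML file into a DataFrame with BeautifulSoup, stored as "df". To view the DataFrame, we print it with print().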
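
Because get_text() always returns strings, numeric and date columns need to be converted after the DataFrame is built. The sketch below shows one way to do that with pandas alone. The rows are invented, and the date format "%Y-%m-%d" is an assumption about how the dates are written in the file.

```python
import pandas as pd

# Made-up rows standing in for the list built from the XML file.
data = [
    ["Doe, Jane", "First Example", "44.95", "2000-10-01"],
    ["Roe, Richard", "Second Example", "5.95", "2000-12-16"],
]
df = pd.DataFrame(data, columns=["Author", "Book Title", "Price", "Publish Date"])

# Convert only the columns that hold numbers or dates; text columns stay as text.
df["Price"] = pd.to_numeric(df["Price"])
df["Publish Date"] = pd.to_datetime(df["Publish Date"], format="%Y-%m-%d")

print(df.dtypes)
print(df)
```

Converting column by column keeps the author and title columns as text while still allowing arithmetic on prices and date-based sorting or filtering.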

XML file used: gfg.xml

The complete implementation follows:

# Python program to convert an XML
# structure into a DataFrame using BeautifulSoup

# Import libraries
from bs4 import BeautifulSoup
import pandas as pd

# Open the XML file and read its contents
with open("gfg.xml", "r") as file:
    contents = file.read()

# Parse the contents with the 'xml' parser (requires lxml)
soup = BeautifulSoup(contents, 'xml')

# Extract the elements for each tag
authors = soup.find_all('author')
titles = soup.find_all('title')
prices = soup.find_all('price')
pubdate = soup.find_all('publish_date')
genres = soup.find_all('genre')
des = soup.find_all('description')

data = []

# Loop to store the text of each book's fields in a list named 'data'
for i in range(len(authors)):
    rows = [authors[i].get_text(), titles[i].get_text(),
            genres[i].get_text(), prices[i].get_text(),
            pubdate[i].get_text(), des[i].get_text()]
    data.append(rows)

# Convert the list into a DataFrame
df = pd.DataFrame(data, columns=['Author', 'Book Title', 'Genre',
                                 'Price', 'Publish Date',
                                 'Description'])

# Extracted values are strings; make the prices numeric
df['Price'] = pd.to_numeric(df['Price'])
print(df)

Output:

[figure: the resulting DataFrame]