How to select a subset of data using lexicographical slicing in Python Pandas?

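Introduction

Pandas has a dual selection capability: you can select a subset of data either by index position or by index label. In this post, I will show you how to select a subset of data using lexicographical slicing on the index labels.

Plenty of datasets are available online. Search for a movies dataset on kaggle.com. This post uses a movies dataset from Kaggle.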

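How to do it

  • Import the movies dataset with only the columns needed for this example.
import pandas as pd
import numpy as np

# Load only the required columns and use the movie title as the index
movies = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)

# Show five random rows
movies.sample(n=5)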
                               budget  vote_average  vote_count
title
Little Voice                        0           6.6          61
Grown Ups 2                  80000000           5.8        1155
The Best Years of Our Lives   2100000           7.6         143
Tusk                          2800000           5.1         366
Operation Chromite                  0           5.8          29
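
If you cannot reach the URL, the loading step can be tried offline. The following is a minimal sketch, assuming a small in-memory CSV built from three of the rows shown above; the `extra` column and the name `sample_movies` are hypothetical and exist only to show that `usecols` drops unlisted columns while `index_col` turns the title into the index.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV; "extra" is a made-up column that usecols should drop.
csv_text = """title,budget,vote_average,vote_count,extra
Little Voice,0,6.6,61,x
Grown Ups 2,80000000,5.8,1155,x
Tusk,2800000,5.1,366,x
"""

sample_movies = pd.read_csv(
    io.StringIO(csv_text),
    index_col="title",  # the title becomes the row index
    usecols=["title", "budget", "vote_average", "vote_count"],
)

print(sample_movies)
```
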
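  • I always recommend sorting the index, especially when it is made up of strings. With a sorted index you will notice the difference when working with a huge dataset.

What if I do not sort the index?

No problem, your code will just run forever. Just kidding. If the index labels are not sorted, Pandas has to go through the labels one by one to match your query. Imagine an Oxford dictionary without an index page: what would you do? With a sorted index you can jump straight to the label you want to extract, and the same is true for Pandas.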
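
To make the "jump straight to the label" idea concrete, here is a small sketch, assuming a hypothetical four-title index that is already in ascending order. `Index.slice_locs` returns the positions that bound a label slice; on a monotonic index pandas can find them by searching the sorted labels, even when the bounds themselves ("Aa", "Bb") are not titles in the index.

```python
import pandas as pd

# Hypothetical four-title index, already in ascending order.
sorted_titles = pd.Index(["Abandon", "Avatar", "Battleship", "Tusk"])

# Positions that bound the label slice "Aa":"Bb".
# Neither label exists in the index; a sorted index still allows the lookup.
start, stop = sorted_titles.slice_locs("Aa", "Bb")
between = list(sorted_titles[start:stop])

print(start, stop, between)
```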

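Let us first check whether the index is sorted.

# Check whether the index is sorted (is_monotonic was removed in pandas 2.0)
movies.index.is_monotonic_increasing

  • The check returns False, so the index is not sorted. We will now try to select the movies whose titles fall between "Aa" and "Bb", that is, the titles starting with A plus the first few starting with B. This is roughly like writing

select * from movies where title like 'A%'

# Try to slice the titles between "Aa" and "Bb" on the unsorted index
movies.loc["Aa":"Bb"]

The slice fails on the unsorted index: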

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844 try:
-> 4845 return self._searchsorted_monotonic(label, side)
   4846 except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806 raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768 return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770 def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910 if isinstance(key, slice):
   1911 self._validate_key(key, axis)
-> 1912 return self._get_slice_axis(key, axis=axis)
   1913 elif com.is_bool_indexer(key):
   1914 return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795 labels = obj._get_axis(axis)
-> 1796 indexer = labels.slice_indexer(
   1797 slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798 )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711 slice(1, 3)
   4712 """
-> 4713 start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715 # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924 start_slice = None
   4925 if start is not None:
-> 4926 start_slice = self.get_slice_bound(start, "left", kind)
   4927 if start_slice is None:
   4928 start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846 except ValueError:
   4847 # raise the original KeyError
-> 4848 raise err
   4849
   4850 if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840 # we need to look up the label
   4841 try:
-> 4842 slc = self.get_loc(label)
   4843 except KeyError as err:
   4844 try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646 return self._engine.get_loc(key)
   2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650 if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
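
The failure above can be reproduced on a much smaller frame. This is a sketch under stated assumptions: `toy` is a hypothetical four-row frame that reuses titles and budgets from the outputs in this post, deliberately left unsorted, and `slice_failed` is an illustrative flag.

```python
import pandas as pd

# Hypothetical four-row frame; the titles are deliberately NOT in sorted order.
toy = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 0]},
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Little Voice"], name="title"),
)

print(toy.index.is_monotonic_increasing)

# "Aa" is not a label in the index and the index is not monotonic,
# so pandas cannot work out where the slice should start.
try:
    toy.loc["Aa":"Bb"]
    slice_failed = False
except KeyError as err:
    slice_failed = True
    print("KeyError:", err)
```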

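  • Sort the index in ascending order, for example with movies = movies.sort_index(), then run the same command again to take advantage of the sorting for lexicographical slicing.
  • Now our data is set up for lexicographical slicing. Let us select all the movie titles from the letter 'A' up to the letter 'B' with movies.loc["Aa":"Bb"].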
                                      budget  vote_average  vote_count
title
Abandon                             25000000           4.6          45
Abandoned                                  0           5.8          27
Abduction                           35000000           5.6         961
Aberdeen                                   0           7.0           6
About Last Night                    12500000           6.0         210
...
Battle for the Planet of the Apes    1700000           5.5         215
Battle of the Year                  20000000           5.9          88
Battle: Los Angeles                 70000000           5.5        1448
Battlefield Earth                   44000000           3.0         255
Battleship                         209000000           5.5        2114

292 rows × 3 columns
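
The same two steps can be sketched on a small frame. The names `toy`, `toy_sorted`, and `subset` are hypothetical, and the rows reuse titles and budgets that appear in the outputs of this post; treat it as an illustration rather than the exact commands run on the full dataset.

```python
import pandas as pd

# Hypothetical four-row frame with an unsorted title index.
toy = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 0]},
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Little Voice"], name="title"),
)

# Sort the index in ascending order, then slice by label.
toy_sorted = toy.sort_index()
subset = toy_sorted.loc["Aa":"Bb"]

print(toy_sorted.index.is_monotonic_increasing)
print(subset)
```

Note that the slice keeps every title that sorts between "Aa" and "Bb", so it also returns titles starting with "Ba", such as Battleship.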

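Once the index is sorted, the monotonic check returns True:

True

The same slice behaves differently when the index is sorted in descending order, for example with movies.sort_index(ascending=False). The first rows of the descending index:

                           budget  vote_average  vote_count
title
Æon Flux                 62000000           5.4         703
xXx: State of the Union  60000000           4.7         549
xXx                      70000000           5.8        1424
eXistenZ                 15000000           6.7         475
[REC]²                    5600000           6.4         489

Running movies.loc["Aa":"Bb"] on this index returns only the column headers:

budget vote_average vote_count
title

This is an empty DataFrame, and it gives us no clue about what went wrong. On a descending index the slice bounds have to be given in descending order too, so let us reverse the letters and run it again with movies.loc["Bb":"Aa"].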

                            budget  vote_average  vote_count
title
B-Girl                           0           5.5           7
Ayurveda: Art of Being      300000           5.5           3
Away We Go                17000000           6.7         189
Awake                     86000000           6.3         395
Avengers: Age of Ultron  280000000           7.3        6767
...
About Last Night          12500000           6.0         210
Aberdeen                         0           7.0           6
Abduction                 35000000           5.6         961
Abandoned                        0           5.8          27
Abandon                   25000000           4.6          45

228 rows × 3 columns
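
The descending case can be sketched the same way. The names `toy_desc`, `empty`, and `reversed_subset` are hypothetical, and the rows again reuse titles and budgets from this post. The point is only that the order of the slice bounds has to match the sort order of the index.

```python
import pandas as pd

# Hypothetical four-row frame, sorted by title in DESCENDING order.
toy_desc = pd.DataFrame(
    {"budget": [2800000, 25000000, 209000000, 0]},
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Little Voice"], name="title"),
).sort_index(ascending=False)

# Ascending bounds on a descending index: no error, but nothing is selected.
empty = toy_desc.loc["Aa":"Bb"]

# Bounds given in the same (descending) order as the index.
reversed_subset = toy_desc.loc["Bb":"Aa"]

print(len(empty))
print(reversed_subset)
```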
