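How to select a subset of data using lexicographic slicing in Python Pandas?
Introduction
Pandas offers two ways to select a subset of data: by index position or by index label. In this article, I will show you how to select a subset of data using lexicographic slicing.
Datasets are easy to find with Google. Search kaggle.com for a movies dataset; this article uses the movies dataset from Kaggle.
How to do it
1. Import the movies dataset with only the columns required for this example, using the title column as the index.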
title | budget | vote_average | vote_count
Little Voice | 0 | 6.6 | 61
Grown Ups 2 | 80000000 | 5.8 | 1155
The Best Years of Our Lives | 2100000 | 7.6 | 143
Tusk | 2800000 | 5.1 | 366
Operation Chromite | 0 | 5.8 | 29
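The import step might look like the sketch below. To keep it runnable without the Kaggle download, the five rows shown above are embedded as CSV text; with the real file you would pass its path to pd.read_csv instead (the path and file name depend on your download and are not given here). The variable name movies matches the name that appears in the traceback later in this article.

```python
import io

import pandas as pd

# Stand-in for the Kaggle file: the five rows from the table above.
csv_text = """title,budget,vote_average,vote_count
Little Voice,0,6.6,61
Grown Ups 2,80000000,5.8,1155
The Best Years of Our Lives,2100000,7.6,143
Tusk,2800000,5.1,366
Operation Chromite,0,5.8,29
"""

# usecols restricts the import to the columns we need;
# index_col makes the movie title the row label.
movies = pd.read_csv(
    io.StringIO(csv_text),
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)

print(movies)
```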
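2. I always recommend sorting the index, especially when it is made up of strings. On a huge dataset, you will notice the difference a sorted index makes.
What if I don't sort the index?
No problem, your code will just run forever. Just kidding. If the index labels are not sorted, Pandas has to walk through the labels one by one to match your query. Imagine an Oxford dictionary with no alphabetical order: how would you find a word? With a sorted index you can jump straight to the label you want to extract, and so can Pandas.
First, let us check whether the index is sorted.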
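One way to run this check is the index attribute is_monotonic_increasing, sketched below on the five sample rows from step 1 (rebuilt here so the snippet runs on its own). The full dataset would be checked the same way.

```python
import pandas as pd

# The five sample rows from step 1, in their original (unsorted) order.
movies = pd.DataFrame(
    {
        "title": [
            "Little Voice",
            "Grown Ups 2",
            "The Best Years of Our Lives",
            "Tusk",
            "Operation Chromite",
        ],
        "budget": [0, 80000000, 2100000, 2800000, 0],
        "vote_average": [6.6, 5.8, 7.6, 5.1, 5.8],
        "vote_count": [61, 1155, 143, 366, 29],
    }
).set_index("title")

# True only if every label is >= the one before it.
is_sorted = movies.index.is_monotonic_increasing
print(is_sorted)  # False for these five rows
```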
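3. Clearly, the index is not sorted. We will try to select the movies whose titles start with "A" (the A% pattern). Writing this as the slice movies.loc["Aa": "Bb"] on the unsorted index fails with the following traceback: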
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807
ValueError: index must be monotonic increasing or decreasing
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()
KeyError: 'Aa'
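The same failure can be reproduced on a small scale. The sketch below rebuilds the five sample rows from step 1 and wraps the slice in try/except so the script keeps running. The exact traceback lines differ between Pandas versions, but the slice on an unsorted index with missing bound labels should still end in a KeyError.

```python
import pandas as pd

# The five sample rows from step 1, index left unsorted on purpose.
movies = pd.DataFrame(
    {
        "title": [
            "Little Voice",
            "Grown Ups 2",
            "The Best Years of Our Lives",
            "Tusk",
            "Operation Chromite",
        ],
        "budget": [0, 80000000, 2100000, 2800000, 0],
        "vote_average": [6.6, 5.8, 7.6, 5.1, 5.8],
        "vote_count": [61, 1155, 143, 366, 29],
    }
).set_index("title")

# "Aa" is not an existing label, and without a sorted index Pandas
# cannot work out where it would fall, so the slice is rejected.
try:
    movies.loc["Aa":"Bb"]
    raised = None
except KeyError as err:
    raised = err

print(type(raised).__name__, raised)
```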
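4. Sort the index in ascending order and run the same command again, this time taking advantage of lexicographic ordering for the slice.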
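A minimal sketch of this step, again on the five sample rows from step 1, is shown below. Since none of those five titles start with "A", the slice "Aa":"Bb" succeeds but returns no rows here; the second slice with the bounds "G" and "P" is my own choice, added only to show a non-empty result on the sample.

```python
import pandas as pd

# The five sample rows from step 1.
movies = pd.DataFrame(
    {
        "title": [
            "Little Voice",
            "Grown Ups 2",
            "The Best Years of Our Lives",
            "Tusk",
            "Operation Chromite",
        ],
        "budget": [0, 80000000, 2100000, 2800000, 0],
        "vote_average": [6.6, 5.8, 7.6, 5.1, 5.8],
        "vote_count": [61, 1155, 143, 366, 29],
    }
).set_index("title")

# Sort the titles in ascending (lexicographic) order.
movies_sorted = movies.sort_index()
print(movies_sorted.index.is_monotonic_increasing)  # True

# The slice that failed before now works; the bound labels do not
# need to exist in the index once it is sorted.
a_movies = movies_sorted.loc["Aa":"Bb"]
print(len(a_movies))  # 0, because the sample has no titles in that range

# Illustrative bounds chosen for the sample: titles from "G" up to "P".
g_to_p = movies_sorted.loc["G":"P"]
print(g_to_p)
```

On the full Kaggle dataset, the "Aa":"Bb" slice would return every title that sorts between those two strings. The comparison is case-sensitive, so uppercase letters sort before lowercase ones.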