Numbers and strings¶
This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.
An example¶
import numpy as np
import pandas as pd
pd.set_option('mode.chained_assignment','raise')
We return to the example data file that you may have seen in the text encoding page.
You can download the data file from {download}imdblet_latin.csv <../data/imdblet_latin.csv>
.
films = pd.read_csv('imdblet_latin.csv', encoding='latin1')
films.head()
Votes | Rating | Title | Year | Decade | |
---|---|---|---|---|---|
0 | 635139 | 8.6 | Léon | 1994 | 1990 |
1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
2 | 43090 | N/K | Paris, Texas (1984) | 1984 | 1980 |
3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |
Now imagine we are interested in the average rating across these films:
ratings = films['Rating'].copy()
ratings.mean()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
1601 try:
-> 1602 x = float(x)
1603 except (TypeError, ValueError):
ValueError: could not convert string to float: '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
1605 try:
-> 1606 x = complex(x)
1607 except ValueError as err:
ValueError: complex() arg is a malformed string
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/var/folders/_7/h5j1wn6n2b5c2qwh5dtr3jlm0000gq/T/ipykernel_21389/633626916.py in <module>
1 ratings = films['Rating'].copy()
----> 2 ratings.mean()
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
10749 )
10750 def mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10751 return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
10752
10753 setattr(cls, "mean", mean)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
10367
10368 def mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10369 return self._stat_function(
10370 "mean", nanops.nanmean, axis, skipna, level, numeric_only, **kwargs
10371 )
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
10352 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
10353 )
> 10354 return self._reduce(
10355 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
10356 )
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4390 )
4391 with np.errstate(all="ignore"):
-> 4392 return op(delegate, skipna=skipna, **kwds)
4393
4394 def _reindex_indexer(
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _f(*args, **kwargs)
91 try:
92 with np.errstate(invalid="ignore"):
---> 93 return f(*args, **kwargs)
94 except ValueError as e:
95 # we want to transform an object array
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
153 result = alt(values, axis=axis, skipna=skipna, **kwds)
154 else:
--> 155 result = alt(values, axis=axis, skipna=skipna, **kwds)
156
157 return result
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in new_func(values, axis, skipna, mask, **kwargs)
408 mask = isna(values)
409
--> 410 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
411
412 if datetimelike:
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in nanmean(values, axis, skipna, mask)
663
664 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 665 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
666
667 if axis is not None and getattr(the_sum, "ndim", False):
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
1607 except ValueError as err:
1608 # e.g. "foo"
-> 1609 raise TypeError(f"Could not convert {x} to numeric") from err
1610 return x
1611
TypeError: Could not convert 8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1 to numeric
The problem¶
The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.
We can see what type of thing Pandas has stored by looking at the dtype
attribute of a Series, or the dtypes
attribute of a data frame.
films.dtypes
Votes object
Rating object
Title object
Year int64
Decade int64
dtype: object
ratings.dtype
dtype('O')
In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.
Why does Pandas use text for the ‘Rating’ column?
A quick look at the first rows gives the answer:
ratings.head()
0 8.6
1 8.1
2 N/K
3 8.3
4 8.0
Name: Rating, dtype: object
The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.
If that wasn’t obvious, another way of checking where the problem value is, to apply
the function float
to the column values.
When we apply
a function to a Series, it does this:
For each value in the Series it:
Calls the function, with the value as the single argument.
Collects the new value returned from the function, and appends it to a new Series.
Returns the new Series.
The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.
Recall that the float
function converts the thing you pass into a floating
point value:
v = float('3.14')
v
3.14
type(v)
float
Now we try applying float
to the problematic column:
ratings.apply(float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/_7/h5j1wn6n2b5c2qwh5dtr3jlm0000gq/T/ipykernel_21389/1295716276.py in <module>
----> 1 ratings.apply(float)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
4355 dtype: float64
4356 """
-> 4357 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
4358
4359 def _reduce(
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply(self)
1041 return self.apply_str()
1042
-> 1043 return self.apply_standard()
1044
1045 def agg(self):
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
1096 # List[Union[Callable[..., Any], str]]]]]"; expected
1097 # "Callable[[Any], Any]"
-> 1098 mapped = lib.map_infer(
1099 values,
1100 f, # type: ignore[arg-type]
~/opt/anaconda3/lib/python3.8/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
ValueError: could not convert string to float: 'N/K'
One way of dealing with this problem is to make a recoding function.
A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.
def recode_ratings(v):
if v == 'N/K': # Return missing value for 'N/K'
return np.nan
# Otherwise make text value into a float
return float(v)
We test our function:
recode_ratings('8.3')
8.3
recode_ratings('N/K')
nan
We make a new Series by calling the recode function:
new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
0 8.6
1 8.1
2 NaN
3 8.3
4 8.0
Name: Rating, dtype: float64
We can insert this back into a copy of the original data frame:
films_fixed = films.copy()
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()
Votes | Rating | Title | Year | Decade | |
---|---|---|---|---|---|
0 | 635139 | 8.6 | Léon | 1994 | 1990 |
1 | 264285 | 8.1 | The Princess Bride | 1987 | 1980 |
2 | 43090 | NaN | Paris, Texas (1984) | 1984 | 1980 |
3 | 90434 | 8.3 | Rashômon | 1950 | 1950 |
4 | 427099 | 8.0 | X-Men: Days of Future Past | 2014 | 2010 |