Numbers and strings

This page covers a common problem when loading data into Pandas — when Pandas gets confused about whether values in a column are text or numbers.

An example

import numpy as np
import pandas as pd
pd.set_option('mode.chained_assignment','raise')

We return to the example data file that you may have seen in the text encoding page.

You can download the data file from {download}imdblet_latin.csv <../data/imdblet_latin.csv>.

films = pd.read_csv('imdblet_latin.csv', encoding='latin1')
films.head()
Votes Rating Title Year Decade
0 635139 8.6 Léon 1994 1990
1 264285 8.1 The Princess Bride 1987 1980
2 43090 N/K Paris, Texas (1984) 1984 1980
3 90434 8.3 Rashômon 1950 1950
4 427099 8.0 X-Men: Days of Future Past 2014 2010

Now imagine we are interested in the average rating across these films:

ratings = films['Rating'].copy()
ratings.mean()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
   1601         try:
-> 1602             x = float(x)
   1603         except (TypeError, ValueError):

ValueError: could not convert string to float: '8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
   1605             try:
-> 1606                 x = complex(x)
   1607             except ValueError as err:

ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/_7/h5j1wn6n2b5c2qwh5dtr3jlm0000gq/T/ipykernel_21389/633626916.py in <module>
      1 ratings = films['Rating'].copy()
----> 2 ratings.mean()

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
  10749         )
  10750         def mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10751             return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
  10752 
  10753         setattr(cls, "mean", mean)

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in mean(self, axis, skipna, level, numeric_only, **kwargs)
  10367 
  10368     def mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10369         return self._stat_function(
  10370             "mean", nanops.nanmean, axis, skipna, level, numeric_only, **kwargs
  10371         )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
  10352                 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
  10353             )
> 10354         return self._reduce(
  10355             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  10356         )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4390                 )
   4391             with np.errstate(all="ignore"):
-> 4392                 return op(delegate, skipna=skipna, **kwds)
   4393 
   4394     def _reindex_indexer(

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _f(*args, **kwargs)
     91             try:
     92                 with np.errstate(invalid="ignore"):
---> 93                     return f(*args, **kwargs)
     94             except ValueError as e:
     95                 # we want to transform an object array

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    153                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    154             else:
--> 155                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    156 
    157             return result

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in new_func(values, axis, skipna, mask, **kwargs)
    408             mask = isna(values)
    409 
--> 410         result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    411 
    412         if datetimelike:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in nanmean(values, axis, skipna, mask)
    663 
    664     count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 665     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    666 
    667     if axis is not None and getattr(the_sum, "ndim", False):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/nanops.py in _ensure_numeric(x)
   1607             except ValueError as err:
   1608                 # e.g. "foo"
-> 1609                 raise TypeError(f"Could not convert {x} to numeric") from err
   1610     return x
   1611 

TypeError: Could not convert 8.68.1N/K8.38.08.18.08.48.28.08.08.08.68.68.08.28.48.48.18.38.48.28.58.08.28.18.48.18.68.48.18.78.1 to numeric

The problem

The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.

We can see what type of thing Pandas has stored by looking at the dtype attribute of a Series, or the dtypes attribute of a data frame.

films.dtypes
Votes     object
Rating    object
Title     object
Year       int64
Decade     int64
dtype: object
ratings.dtype
dtype('O')

In fact both these bits of information say the same thing – that the ‘Rating’ column stores things in the “object” or “O” type. This is a general type that can store any Python value. It is the standard type that Pandas uses when storing text.

Why does Pandas use text for the ‘Rating’ column?

A quick look at the first rows gives the answer:

ratings.head()
0    8.6
1    8.1
2    N/K
3    8.3
4    8.0
Name: Rating, dtype: object

The film “Paris, Texas (1984)” has a value “N/K” for the rating. This can’t be a number, so Pandas stored this column in a format that allows it to store “N/K” as text.

If that wasn’t obvious, another way of checking where the problem value is, to apply the function float to the column values.

When we apply a function to a Series, it does this:

  • For each value in the Series it:

    • Calls the function, with the value as the single argument.

    • Collects the new value returned from the function, and appends it to a new Series.

  • Returns the new Series.

The result is a Series that is the same length as the original series, but where each value in the new series is the result of calling the function on the original value.

Recall that the float function converts the thing you pass into a floating point value:

v = float('3.14')
v
3.14
type(v)
float

Now we try applying float to the problematic column:

ratings.apply(float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/_7/h5j1wn6n2b5c2qwh5dtr3jlm0000gq/T/ipykernel_21389/1295716276.py in <module>
----> 1 ratings.apply(float)

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
   4355         dtype: float64
   4356         """
-> 4357         return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
   4358 
   4359     def _reduce(

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply(self)
   1041             return self.apply_str()
   1042 
-> 1043         return self.apply_standard()
   1044 
   1045     def agg(self):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
   1096                 # List[Union[Callable[..., Any], str]]]]]"; expected
   1097                 # "Callable[[Any], Any]"
-> 1098                 mapped = lib.map_infer(
   1099                     values,
   1100                     f,  # type: ignore[arg-type]

~/opt/anaconda3/lib/python3.8/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

ValueError: could not convert string to float: 'N/K'

One way of dealing with this problem is to make a recoding function.

A recoding function is a function that we will apply to a Series. That means that we call the function for every value in the Series. The function argument is the value from the series. The function returns the new value, for a new Series.

def recode_ratings(v):
    if v == 'N/K':  # Return missing value for 'N/K'
        return np.nan
    # Otherwise make text value into a float
    return float(v)

We test our function:

recode_ratings('8.3')
8.3
recode_ratings('N/K')
nan

We make a new Series by calling the recode function:

new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
0    8.6
1    8.1
2    NaN
3    8.3
4    8.0
Name: Rating, dtype: float64

We can insert this back into a copy of the original data frame:

films_fixed = films.copy()
films_fixed.loc[:, 'Rating'] = new_ratings
films_fixed.head()
Votes Rating Title Year Decade
0 635139 8.6 Léon 1994 1990
1 264285 8.1 The Princess Bride 1987 1980
2 43090 NaN Paris, Texas (1984) 1984 1980
3 90434 8.3 Rashômon 1950 1950
4 427099 8.0 X-Men: Days of Future Past 2014 2010