Advertisement
Advertisement


Creating an empty Pandas DataFrame, then filling it?


Question

I'm starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I'd like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I'm using Python 2.7.

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict
2019/02/25
1
493
2/25/2019 5:29:07 PM

Accepted Answer

Here's a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do these type of calculations for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
2017/01/31
346
1/31/2017 2:26:04 PM

The Right Way™ to Create a DataFrame

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).

The other advantage of this method is dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below, you will see elements in both that require handling the index appropriately).


Things you should NOT do

append or concat inside a loop

Here is the biggest mistake I've seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.

Empty DataFrame of NaNs

And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

enter image description here

Benchmarking code for reference.

2020/07/23

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional 

In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.

If I have to keep appending new data into this newDF from more than one oldDFs, I just use a for loop to iterate over pandas.DataFrame.append()

2020/01/08

Initialize empty frame with column names

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)

Performance considerations

If you are adding rows inside a loop consider performance issues. For around the first 1000 records "my_df.loc" performance is better, but it gradually becomes slower by increasing the number of records in the loop.

If you plan to do thins inside a big loop (say 10M‌ records or so), you are better off using a mixture of these two; fill a dataframe with iloc until the size gets around 1000, then append it to the original dataframe, and empty the temp dataframe. This would boost your performance by around 10 times.

2020/01/08

Assume a dataframe with 19 rows

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping Column A as a constant

test['A']=10

Keeping column b as a variable given by a loop

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value

2019/03/16