Advertisement
Advertisement


Using Pandas to pd.read_excel() for multiple worksheets of the same workbook


Question

I have a large spreadsheet file (.xlsx) that I'm processing using python pandas. It happens that I need data from two tabs in that large file. One of the tabs has a ton of data and the other is just a few square cells.

When I use pd.read_excel() on any worksheet, it looks to me like the whole file is loaded (not just the worksheet I'm interested in). So when I use the method twice (once for each sheet), I effectively have to suffer the whole workbook being read in twice (even though we're only using the specified sheet).

Am I using it wrong or is it just limited in this way?

Thank you!

2014/10/23
1
175
10/23/2014 4:21:45 AM

Accepted Answer

Try pd.ExcelFile:

xls = pd.ExcelFile('path_to_file.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')

As noted by @HaPsantran, the entire Excel file is read in during the ExcelFile() call (there doesn't appear to be a way around this). This merely saves you from having to read the same file in each time you want to access a new sheet.

Note that the sheet_name argument to pd.read_excel() can be the name of the sheet (as above), an integer specifying the sheet number (eg 0, 1, etc), a list of sheet names or indices, or None. If a list is provided, it returns a dictionary where the keys are the sheet names/indices and the values are the data frames. The default is to simply return the first sheet (ie, sheet_name=0).

If None is specified, all sheets are returned, as a {sheet_name:dataframe} dictionary.

2020/01/04
269
1/4/2020 12:43:04 PM


You can also use the index for the sheet:

xls = pd.ExcelFile('path_to_file.xls')
sheet1 = xls.parse(0)

will give the first worksheet. for the second worksheet:

sheet2 = xls.parse(1)
2015/02/25

You could also specify the sheet name as a parameter:

data_file = pd.read_excel('path_to_file.xls', sheet_name="sheet_name")

will upload only the sheet "sheet_name".

2020/03/29

pd.read_excel('filename.xlsx') 

by default read the first sheet of workbook.

pd.read_excel('filename.xlsx', sheet_name = 'sheetname') 

read the specific sheet of workbook and

pd.read_excel('filename.xlsx', sheet_name = None) 

read all the worksheets from excel to pandas dataframe as a type of OrderedDict means nested dataframes, all the worksheets as dataframes collected inside dataframe and it's type is OrderedDict.

2019/08/01

If you are interested in reading all sheets and merging them together. The best and fastest way to do it

sheet_to_df_map = pd.read_excel('path_to_file.xls', sheet_name=None)
mdf = pd.concat(sheet_to_df_map, axis=0, ignore_index=True)

This will convert all the sheet into a single data frame m_df

2020/08/11

Yes unfortunately it will always load the full file. If you're doing this repeatedly probably best to extract the sheets to separate CSVs and then load separately. You can automate that process with d6tstack which also adds additional features like checking if all the columns are equal across all sheets or multiple Excel files.

import d6tstack
c = d6tstack.convert_xls.XLStoCSVMultiSheet('multisheet.xlsx')
c.convert_all() # ['multisheet-Sheet1.csv','multisheet-Sheet2.csv']

See d6tstack Excel examples

2018/12/17