Processing Excel XLSX files with Python
Using Python’s ElementTree XML Parser, its possible to quickly parse data within a Microsoft Excel .XLSX file.
An .XLSX file consists of a large number of XML files that are compressed into
a single .ZIP file. Some Spreadsheets make large use of sharedStrings, where
are strings that are stored in a separate XML file (sharedStrings.xml
) which
then acts as the source of a lookup for the worksheets. Cells that have a s
value in the t
attribute for a cell contain a sharedString, whose contents
actually exist in another xml fie in the .XLSX. The following script should
form the basis of an .XLSX to .CSV convertor.
#!/usr/bin/env python
import xml.etree.ElementTree as ET
import codecs
import sys
import datetime
def excel_date_to_str(d):
# excel stores a date as days since 01/01/1900
# by adding 6933594 to the excel value, we can use python's date module to format it correctly
ret = None
try:
ret = datetime.date.fromordinal(int(d) + 693594).strftime("%A, %d %B %Y")
finally:
return ret
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
x = ET.parse(open("xl/sharedStrings.xml"))
sharedStrings = x.getroot()
last_v = None
last_type = None
for event, elem in ET.iterparse("xl/worksheets/sheet1.xml", events=('start', 'end')):
uri, tag = elem.tag.split("}")
if event == "start" and tag == "c": # start c tag
last_v = None
if "t" in elem.attrib:
last_type = elem.attrib["t"]
else:
last_type = None
elif event == "end" and tag == "c": # end c tag
if "r" in elem.attrib:
rc = elem.attrib["r"]
if last_v != None:
print "RC is ", rc, " = ", last_v
elif event == "start" and tag == "v": # start v tag
value = "".join(elem.itertext())
if last_type == "s":
last_v = "".join(sharedStrings[int(value)].itertext())
else:
last_v = value, "type is ", last_type, excel_date_to_str(value)