Processing Excel XLSX files with Python
Using Python’s ElementTree XML Parser, its possible to quickly parse data within a Microsoft Excel .XLSX file.
An .XLSX file consists of a large number of XML files that are compressed into
a single .ZIP file. Some Spreadsheets make large use of sharedStrings, where
are strings that are stored in a separate XML file (
then acts as the source of a lookup for the worksheets. Cells that have a
value in the
t attribute for a cell contain a sharedString, whose contents
actually exist in another xml fie in the .XLSX. The following script should
form the basis of an .XLSX to .CSV convertor.
#!/usr/bin/env python import xml.etree.ElementTree as ET import codecs import sys import datetime def excel_date_to_str(d): # excel stores a date as days since 01/01/1900 # by adding 6933594 to the excel value, we can use python's date module to format it correctly ret = None try: ret = datetime.date.fromordinal(int(d) + 693594).strftime("%A, %d %B %Y") finally: return ret UTF8Writer = codecs.getwriter('utf8') sys.stdout = UTF8Writer(sys.stdout) x = ET.parse(open("xl/sharedStrings.xml")) sharedStrings = x.getroot() last_v = None last_type = None for event, elem in ET.iterparse("xl/worksheets/sheet1.xml", events=('start', 'end')): uri, tag = elem.tag.split("}") if event == "start" and tag == "c": # start c tag last_v = None if "t" in elem.attrib: last_type = elem.attrib["t"] else: last_type = None elif event == "end" and tag == "c": # end c tag if "r" in elem.attrib: rc = elem.attrib["r"] if last_v != None: print "RC is ", rc, " = ", last_v elif event == "start" and tag == "v": # start v tag value = "".join(elem.itertext()) if last_type == "s": last_v = "".join(sharedStrings[int(value)].itertext()) else: last_v = value, "type is ", last_type, excel_date_to_str(value)