ExcelInterop vs OpenXML in CPython3

Does anyone have an example of using Excel Interop in CPython3 for opening, reading, and writing Excel files?

I have a whole bunch of scripts I wrote several years ago (IronPython - yay!!) that all used Excel Interop and were blazing fast (I am often dealing with Excel files with 20+ columns and 60,000+ rows).

Using the DSOffice.OpenXml methods in CPython3 worked fine for me in testing when opening small-ish test Excel files (5 cols, 100 rows), but when I scaled up to mega-sized files it locked up my computer for 20 minutes.

In researching here I have seen a lot of commentary about how Excel Interop and COM are difficult/near impossible in CPython3, and I have seen mention of using OpenXml in CPython3 with some examples, but I have not seen any complete examples of how to open an Excel spreadsheet, read all information on a sheet given the sheet name and file path, create a new Excel file and sheet and write data to it using OpenXml, etc.

I am very familiar with using Worksheet.UsedRange, UsedRange.Columns, UsedRange.Rows, and Row.Value2 for reading information from Excel and creating an array, and with writing an array to a new sheet using cellval = Worksheet.Range[A1, AA10000] and cellval.Value2 = array, but I haven't had any luck either translating these methods into something that works in CPython3 or finding the OpenXml equivalents.
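
For context, the IronPython pattern I'm describing looks roughly like this (simplified sketch only - the file path, sheet name, and range sizes are placeholders):

import clr
clr.AddReferenceByName('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel
from System import Array

ex = Excel.ApplicationClass()
workbook = ex.Workbooks.Open(r'C:\temp\input.xlsx')   # placeholder path
ws = workbook.Worksheets('Sheet1')                    # placeholder sheet name

# Read: one Value2 call per column of the used range (returns the whole column at once)
columns = [c.Value2 for c in ws.UsedRange.Columns]

# Write: build a 2D .NET array and push it to a range in a single Value2 assignment
data = Array.CreateInstance(object, 10000, 2)
# ... fill data ...
newsheet = workbook.Worksheets.Add()
newsheet.Range["A1", "B10000"].Value2 = data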

I frankly don't want to go the route of using any non-built-in 3rd party libraries that I then have to manage getting installed and maintained for many users across many offices (that was the beauty of Excel Interop - everyone had it and it worked like a charm).

Thanks

Sadly this hasn't been the case for most. The original Dynamo Excel nodes struggled mightily with stability as MS updates rolled out and broke the registry, causing the entirety of the interop to fail and forcing a machine-by-machine review of the issue... And because of how IT updates are rolled out, some very large companies had it broken, resolved, and broken again in a week... not fun.

The issue with using CPython for this interop package is that your classes will be operating a few levels under Excel, Dynamo, and Revit. CPython, being a C implementation of Python, runs closer to the hardware than the C#-based Python we're used to with IronPython.

That said, there are often ways around the limitations, such as assigning namespaces to classes and the like. But even then there are cases where things just won't run.

In any case it will be difficult to build a solution without a common starting point - can you post some boilerplate of what you are using now?

Note: I double checked. I hadn't run it in CPython yet, just IronPython. Just tested and it failed in CPython.

Just did some interop stuff last week... Can't say I'm impressed with the speed. It works OK in both IronPython and CPython3. But I'm not using Dynamo to run it - just pyRevit, running it directly.

Cell-by-cell is slow. Arrays, not so bad.

I may redo it in C#/.NET and see if that is faster.

And I'm going the interop route for the same reason as you. No extra libraries.

#! python3
import clr
import sys

import System
from System import Array
from System.Collections.Generic import *

clr.AddReference("RevitAPI")
clr.AddReference("RevitAPIUI")

from Autodesk.Revit.DB import *
from Autodesk.Revit.UI import *
import os

from System import Array
import time

# load the Excel COM type library by its GUID and import the Application class from it
excelTypeLibGuid = System.Guid("00020813-0000-0000-C000-000000000046")
clr.AddReferenceToTypeLibrary(excelTypeLibGuid)
from Excel import Application

doc = __revit__.ActiveUIDocument.Document
app = __revit__.Application
uidoc = __revit__.ActiveUIDocument

title = doc.Title

excel = Application()

# path to the test workbook on the user's desktop (note the escaped backslashes)
myexcelfile = os.path.expanduser('~\\Desktop\\test.xlsx')
fileexists = os.path.isfile(myexcelfile)
if fileexists:
    workbook = excel.Workbooks.Open(myexcelfile)
    worksheets = workbook.Worksheets
    ws = worksheets.Add()
    try:
        ws.Name = title
    except:
        # renaming fails if a sheet with this name already exists
        pass
    #excel.Visible = True
else:
    workbook = excel.Workbooks.Add()
    ws = workbook.ActiveSheet
    ws.Name = title
    cell = ws.Range["A1"]
    #workbook.saveAs(myexcelfile)
    #excel.visible = True
    #excel.Quit()
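
To illustrate the cell-by-cell vs array point with the same ws object: per-cell writes make one COM round trip each, while a single Value2 assignment on a range pushes the whole block across at once. Rough sketch only - sizes are arbitrary, and in CPython3/Python.NET the Range indexer may need the get_Range accessor instead:

# Slow: one COM call per cell
for r in range(1, 101):
    ws.Range["A" + str(r)].Value2 = "x"

# Much faster: one COM call for the whole block
data = Array.CreateInstance(object, 100, 2)
for r in range(100):
    data[r, 0] = "x"
    data[r, 1] = r + 1
ws.Range["A1", "B100"].Value2 = data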

Hi,
you can use Excel Interop with CPython3/Python.NET, but you need to cast each object to the correct interface

a small example here
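
The gist of the casting approach is something like this (minimal, untested sketch - the path and sheet index are placeholders):

import clr
clr.AddReference('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel

ex = Excel.ApplicationClass()
# cast each COM object to its interop interface before using its members
workbook = Excel.Workbooks(ex.Workbooks).Open(r'C:\temp\input.xlsx')   # placeholder path
ws = Excel.Worksheet(Excel.Sheets(workbook.Worksheets).get_Item(1))
used = Excel.Range(ws.get_UsedRange())
values = used.get_Value2()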

for OpenXML, you can find many examples on GitHub in C# (but porting them to Python is not simple)

https://github.com/search?l=C%23&q=OpenXml+CellReference&type=Code

I understand the argument, but if you can, take a look at the Python libraries pandas and openpyxl

you can also install the IronPython package

Below is some example code I use for reading in large Excel files - in this case an Excel file with 10 columns and 50,437 rows took about 2 seconds to read into IronPython, but I only read in columns A, B, C, and H (0, 1, 2, 8). Being able to read each column of the Excel file into a list in one line is very useful and very fast.

import clr
from System.Collections.Generic import *


clr.AddReferenceByName('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel
from System.Runtime.InteropServices import Marshal

clr.AddReference("System")
from System.Collections.Generic import List as cList

from System.Collections.Generic import *

from System import Array


#The inputs to this node will be stored as a list in the IN variables.
dataEnteringNode = IN

myfilepath = IN[0]
oldfilepath=IN[1]
sheetname=IN[2]

datalist=[]
colist=[]
framelist=[]
changelist=[]

def ExcelColNumToExcelColString(_num):
    if _num < 1:
        raise ValueError("Index is too small")
    result = ""
    while True:
        if _num > 26:
            _num, r = divmod(int(_num) - 1, 26)
            result = chr(r + ord('A')) + result
        else:
            return chr(_num + ord('A') - 1) + result

exold = Excel.ApplicationClass()
exold.Visible = True
exold.DisplayAlerts = False 

try:
	workbookold = exold.Workbooks.Open(oldfilepath)
	wsold = workbookold.Worksheets(sheetname)
	oldexcelfileval=1

	valold=[]

	xlRangeold=wsold.UsedRange
	rowsold=wsold.UsedRange.Rows.Count
	colsold=wsold.UsedRange.Columns.Count

	for c in wsold.UsedRange.Columns:
		valold.append(c.Value2)
	
	guidlistold=list(valold[0])
	assemlistold=list(valold[1])
	commlistold=list(valold[2])
	seqlistold=list(valold[8])


	exold.ActiveWorkbook.Close(False)

	Marshal.ReleaseComObject(wsold)
	Marshal.ReleaseComObject(workbookold)
	exold.Quit()
	Marshal.ReleaseComObject(exold)

except:
	oldexcelfileval=0
	Marshal.ReleaseComObject(exold)

OUT = seqlistold[4]

I'll put together another boilerplate example of writing to Excel this evening.

1 Like

Here is the example code for reading the same huge Excel file and then writing a new huge Excel file:
FYI - runtime was about 5 seconds.

import clr
from System.Collections.Generic import *


clr.AddReferenceByName('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel
from System.Runtime.InteropServices import Marshal

clr.AddReference("System")
from System.Collections.Generic import List as cList

from System.Collections.Generic import *

from System import Array


#The inputs to this node will be stored as a list in the IN variables.
dataEnteringNode = IN

myfilepath = IN[0]
oldfilepath=IN[1]
sheetname=IN[2]

datalist=[]
colist=[]
framelist=[]
changelist=[]

def ExcelColNumToExcelColString(_num):
    if _num < 1:
        raise ValueError("Index is too small")
    result = ""
    while True:
        if _num > 26:
            _num, r = divmod(int(_num) - 1, 26)
            result = chr(r + ord('A')) + result
        else:
            return chr(_num + ord('A') - 1) + result

exold = Excel.ApplicationClass()
exold.Visible = True
exold.DisplayAlerts = False 

try:
	workbookold = exold.Workbooks.Open(oldfilepath)
	wsold = workbookold.Worksheets(sheetname)
	oldexcelfileval=1

	valold=[]

	xlRangeold=wsold.UsedRange
	rowsold=wsold.UsedRange.Rows.Count
	colsold=wsold.UsedRange.Columns.Count

	for c in wsold.UsedRange.Columns:
		valold.append(c.Value2)
	
	exceldata0=list(valold[0])
	exceldata1=list(valold[1])
	exceldata2=list(valold[2])
	exceldata3=list(valold[3])
	exceldata4=list(valold[4])
	exceldata5=list(valold[5])
	exceldata6=list(valold[6])
	exceldata7=list(valold[7])
	exceldata8=list(valold[8])
	exceldata9=list(valold[9])
	


	exold.ActiveWorkbook.Close(False)

	Marshal.ReleaseComObject(wsold)
	Marshal.ReleaseComObject(workbookold)
	exold.Quit()
	Marshal.ReleaseComObject(exold)

except:
	oldexcelfileval=0
	Marshal.ReleaseComObject(exold)
	

# pack the ten column lists into a single 2D .NET array for a one-shot write
changearray=Array.CreateInstance(object,len(exceldata0),10)
for j in range(0,len(exceldata0)):
	changearray[j,0]=exceldata0[j]
	changearray[j,1]=exceldata1[j]
	changearray[j,2]=exceldata2[j]
	changearray[j,3]=exceldata3[j]
	changearray[j,4]=exceldata4[j]
	changearray[j,5]=exceldata5[j]
	changearray[j,6]=exceldata6[j]
	changearray[j,7]=exceldata7[j]
	changearray[j,8]=exceldata8[j]
	changearray[j,9]=exceldata9[j]
	
ex = Excel.ApplicationClass()
ex.Visible = True
ex.DisplayAlerts = False

try: 
	workbook = ex.Workbooks.Open(myfilepath)
except:
	workbook = ex.Workbooks.Add()
#Create a new Worksheet within our new Workbook
worksheet = workbook.Worksheets.Add()
worksheet.Name=sheetname

arraylen=len(changearray)
exceldatalen=len(exceldata1)
excolval=ExcelColNumToExcelColString(arraylen/exceldatalen)

# write the whole array to the sheet in a single Value2 assignment
cell2=worksheet.Range["A1",excolval+str(len(exceldata1))]
cell2.Value2=changearray


worksheet.SaveAs(unicode(myfilepath))


Marshal.ReleaseComObject(worksheet)
Marshal.ReleaseComObject(workbook)
ex.Quit()
Marshal.ReleaseComObject(ex)


OUT = exceldata1[4]

Nice little script, thanks for sharing!

I've adjusted it to suit my needs for a task, so I thought I'd share the outcome back. I have added progressive error checking at each stage, as I've found Excel interop can run into lots of issues along the way unless the user runs it in a set way (file closed initially, don't interact with Excel, don't close before it's finished reading, etc.). Tested in Revit 2022 latest build.

Simple excel reader.dyn (12.2 KB)

It seems pretty reliable, although it chugs on larger files (but gets there in the end).

import clr
from System.Collections.Generic import *

clr.AddReferenceByName('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel
from System.Runtime.InteropServices import Marshal

# Get filepath and sheetname
filepath    = IN[0]
sheetname   = IN[1]
replaceNull = IN[2]

# Clean data
def cleanNull(d):
	if d:
		return d
	else:
		return replaceNull

# Base outputs
errorCode = "Data read successfully."
success   = True
matrix    = []

# Make Excel application object
try:
	exApp = Excel.ApplicationClass()
	exApp.Visible = True
	exApp.DisplayAlerts = False
except:
	errorCode = "Error accessing Excel application."
	success   = False

# Try to get the Excel workbook
if success:
	try:
		workbook = exApp.Workbooks.Open(filepath)
		ws = workbook.Worksheets(sheetname)
	except:
		errorCode = "Could not find worksheet."
		success   = False

# Try to read worksheet
if success:
	# Get column data
	try:
		# Get used range
		xlRange=ws.UsedRange
		rows=ws.UsedRange.Rows.Count
		cols=ws.UsedRange.Columns.Count
		# Get column data
		for c in ws.UsedRange.Columns:
			vals = c.Value2
			cleanVals = [cleanNull(v) for v in vals]
			matrix.append(cleanVals)
	# Otherwise hit an error
	except:
		errorCode = "Worksheet found, reading error."
		success   = False

# Close Excel application
try:
	exApp.ActiveWorkbook.Close(False)
	Marshal.ReleaseComObject(ws)
	Marshal.ReleaseComObject(workbook)
	exApp.Quit()
	Marshal.ReleaseComObject(exApp)
except:
	errorCode = "Error closing Excel application."
	Marshal.ReleaseComObject(exApp)

# Return matrix and error report
OUT = [matrix, success, errorCode]

Thanks for checking it out Gavin. I noticed you used IronPython in your example. Per my original post, I'm trying to work out a way to do the same in the new (supported and maintained) CPython3. I know we can currently add IronPython back into Dynamo via the Package Manager (and I have done so), but looking long term I'm not sure how viable that is when building scripts and workflows that I want to work for years to come (much like these have for me for quite some time). Hence the desire to update these to the latest and greatest.

(Note that what I posted is only a small snippet of a roughly 1,500-line script.)

1 Like

Personally, my workaround is to limit Dynamo on 2023 jobs onwards and revert to pyRevit/add-ins. I guide users into installing the IronPython package if needed, but until IronPython 3 is in Dynamo it's in the naughty corner at company scale - CPython3 just has too many limits, some without viable .NET-suitable solutions unless packages are batch deployed and maintained.

2 Likes

So because of CPython, a power user directs an entire company away from using Dynamo...

Hope Autodesk is still monitoring this one.

I can't say I disagree - though my solution thus far is to begrudgingly fix scripts for 2023 on an "as needed" basis, which frankly stinks, but I don't want to completely get away from some of the ease of using Dynamo by switching things to a different platform. I've definitely come across some real head scratchers (like why getting the Family Type name is so buggy... and some weird results getting View Types), but this Excel thing is the first near brick wall I've run into.

It's probably a mixture of this combined with the way people run tools where I work. Most users I work with just want to smash a button and have a script run, so Dynamo is not always ideal in that scenario if errors can occur. Scripted coding (Python, C#, etc.) generally gives us more error catching and process terminating techniques, which also leads me away from Dynamo a bit.

We've had quite a few threads between Dynamo devs, package managers, and power users/script managers at firms, and the general consensus is that 'it is what it is'. We either see Dynamo development slow if they go back to align all old versions, or we have to make exceptions or find other options. I understand why these choices have been made, and they understand some people will choose not to follow them (e.g. look at all the CPython3 packages out there, or the popular packages that actually rewrote for CPython3). The Dynamo team are aware of the gripes, and have mentioned IronPython 3 may eventually come into the picture (but then IronPython 2 would need to leave, I think, which could be even more challenging). I think I'd choose integrated Dynamo builds over one Dynamo to rule all versions anyway, as the former means we can at least predict the builds people will run.

Personally I still use Dynamo a fair bit to prototype and develop quick project-specific solutions, and we still get a good amount of script use prior to 2023 (generally 100-200 successful script runs per week on live projects according to my logging system); it's just 2023 where there are some choices to be made depending on the average user's understanding of Dynamo beyond the Player window. Dynamo is still the gateway into programming I recommend to any users who come to me keen to learn.

In my eyes the true source of difficulty here goes back to multiple versions of Revit = multiple versions of Dynamo, as each year we will effectively get a new set of exceptions and catches to add to our scripts and packages if they try to span not just IronPython2/CPython3 changes but also API changes (some of which are significantly different and can only be solved by checking the Revit app version in if statements).

Ultimately it's a matter of being fully aware of your technical debt, and creating only as much as you can afford in future at a company. As I get deeper into programming this becomes ever more apparent to me.

I'll make a video on the channel about it at some point, as it comes up a fair bit.

Have you considered using NPOI (GitHub - dotnetcore/NPOI), a .NET library for reading and writing Microsoft Office binary and OOXML file formats?

It doesn't use COM, nor does it require an install of Excel, as it works on the open Office Open XML document formats, which solves all the headaches caused by the NuGet packages that do.
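
Reading a sheet through Python.NET might look roughly like this (untested sketch - the assembly names and dll location vary between NPOI versions, and the path/sheet name are placeholders):

import clr
import sys
sys.path.append(r'C:\libs\NPOI')          # wherever the NPOI dlls are deployed (placeholder)
clr.AddReference('NPOI')                  # core assembly; some NPOI versions split this differently
clr.AddReference('NPOI.OOXML')            # xlsx (OOXML) support
from NPOI.XSSF.UserModel import XSSFWorkbook
from System.IO import FileStream, FileMode, FileAccess

stream = FileStream(r'C:\temp\input.xlsx', FileMode.Open, FileAccess.Read)
workbook = XSSFWorkbook(stream)
sheet = workbook.GetSheet('Sheet1')       # or workbook.GetSheetAt(0)
rows = []
for i in range(sheet.FirstRowNum, sheet.LastRowNum + 1):
    row = sheet.GetRow(i)
    if row is None:                       # completely empty rows come back as null
        continue
    # each entry is an ICell; read .StringCellValue / .NumericCellValue as appropriate
    rows.append([row.GetCell(j) for j in range(row.FirstCellNum, row.LastCellNum)])
stream.Close()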

2 Likes

I think what interests me most is trying out the casting approach @c.poupin suggested. I'm hopeful that I can get that to work for both reading and writing Excel files.

@GavinCrump - totally understand users wanting to just mash a button. I have never used pyRevit and have just jumped straight to coding C# add-ins when I have something that deserves it. I like Dynamo because even though these scripts get used all the time, on each job there's a minor amount of customization, and it's pretty easy to find where a user ran into trouble (we are running these on models from 3rd parties most of the time, so there's no consistency of "these models are built using our company standard" - lots of one-off solutions).

1 Like

I spoke a bit about why the CPython interaction is so different from the IronPython one today in the office hour. Basically you're at a different level of the machine - running closer to the hardware, as it were - and as a result things which were directly translated now need to be explicitly stated. One example of this is CPython interface classes needing a namespace assigned to be called from Revit. This will be odd for a bit. To quote a very skilled Python user I work with: 'I can't say I have ever seen a technical reason you would want to do that before.'

And so the specifics of how you interact with Revit, Civil 3D, Excel, and all the rest will need to be different than they were in IronPython. And for a while some things may not be feasible as a result. Hence the IronPython2 package.

The team is aware that this is a lot of work, and so an IronPython 3 implementation is likely something they'll try to put together. This could likely be distributed and implemented like the IronPython 2 package was, meaning it could be run side by side with your other Python builds.

This is really the trigger for the situation. The team could have stuck with one or two updates a year and spent their time issuing hot fixes to get old builds working in new hosts, or they could give us the features we were asking for and the features they had to provide from a security standpoint. By breaking into its own cycle, we got a span of just over 13 months which saw the release of versions 2.13 - 2.17, which have a LOT of good stuff in them. Obviously not all of these are picked up by any one host application, and we may see some host applications lag behind others, but the overall ecosystem is much more advanced than it was before.

But now, as a result, the integration specifics will vary from host build to host build. Not ideal from a 'one graph runs in all parts of the ecosystem' aspect.

Too many things are changing on the host application side, the Dynamo side, and even the Microsoft side to assume you can have one version of a graph for all years without excessive work and multiple rounds of try/except to deal with version changes (or better yet, some if-statement triggers based on the Dynamo/host version). And so for now my advice for dealing with the changes continues to be this:

Do a save as at each update, and test for the breaking changes. If nothing fails, back to the library for office wide use. If something goes wrong fix it in ~5 minutes if you can, or save it to a new location to fix later.

The entirety of the testing process, less the decision to save to directory A or B, can be automated without too much work, as can most of the 'that will impact all of my graphs' issues which you may come across (i.e. bulk changing all Python in a package to CPython 3 to see which graphs automatically pass the upgrade).

Now, rather than focusing on all the other stuff going on in the world of Dynamo (I'm happy to set up a time to chat on such topics separately - just DM me), let's try to pivot back to the issue Ben has...

I do think OpenXML is likely the most stable in the long term (managing the deployment of that dependency shouldn't be that hard, right? - cue the video of the PM crying while every block goes in the square hole).

Seeing the size of the dataset and knowing speed is the issue, I can think of a few possible other workarounds for long-term use which would simplify the need to become a part-time software developer focused on Excel interop...

Perhaps defining the read and write range as an added overload method on the current nodes would suffice? You are basically increasing the speed by slashing the scope of the data being read by 50% or more. Assuming this is passed into Dynamo directly as lists, that could reduce scope a fair bit. Altering the structure from a list to a dictionary could also help in that respect, as dictionaries are significantly faster to serialize between nodes, but that would remove the capacity to use any form of iteration in Dynamo without reintroducing the list.

For now the methods shown by @c.poupin are promising - I will try to have a look at those next week (tied up today, and this weekend I'll be away from my CPU).

2 Likes

No need to defend the multiple version thing too much - I get it.

Personally my issue is that I have something that works and I want to port it over to CPython. The whole Excel Interop vs OpenXML thing is akin to me saying: yeah, my Southern US Redneck English works here, but I need to translate it to the Queen's English (porting Interop over to CPython and using casting, but not reinventing the wheel), versus saying: yeah, English no worky anymore, must learn Mandarin Chinese (switching from Interop to OpenXml - completely new methods, etc.).

I can't really alter the range of data read in or written out, as I need all of it... Essentially I'm taking models of steel framing (50,000-100,000 elements in Revit), getting location information and certain parameters/properties (like material grade as well as some job-specific parameters), and writing them out to Excel. Then when I get a new model from the engineer or detailer in a week or two, I repeat the process to write out a new set of data as well as compare the new set of data to the old set of data (recorded in Excel). Where changes of note have occurred, I write a simple text value to each element back in the Revit model so that I can then apply a view template to tell what has changed and in what way (new element, new location, material grade change). I don't ever take any of the data as lists back into Dynamo because I found that taking multiple lists of 50,000-100,000 items back into Dynamo tends to be a speed bottleneck. Job-specific alterations take place outside in the Dynamo environment, but all the heavy lifting occurs inside the Python script.
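
For what it's worth, the comparison step boils down to something like this once both data sets are in memory (simplified sketch - the real script tracks more parameters, and the field names here are just placeholders):

# old_data / new_data: {element_guid: (location, material_grade)}
def find_changes(old_data, new_data):
    changes = {}
    for guid, (location, grade) in new_data.items():
        if guid not in old_data:
            changes[guid] = "NEW ELEMENT"
        else:
            old_location, old_grade = old_data[guid]
            if location != old_location:
                changes[guid] = "NEW LOCATION"
            elif grade != old_grade:
                changes[guid] = "GRADE CHANGE"
    return changes

# each change label then gets written to a text parameter on the matching Revit element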

1 Like

Great analogy! :slight_smile:
I'll expand on it a bit. :smiling_imp:

At the moment, the IronPython > CPython move is more akin to going from communicating with words over tin cans on a string to telepathic communication on an intergalactic scale. We're still communicating the same thought, and so to a recipient the message feels the same, but what's actually happening under the hood has to be VERY different.

And so for most of what users do in Dynamo this pans out - we can use the same code to do the same thing without fail. However, when we get to the specifics of communicating (with x as y; calling a class for an interface interaction; etc.) things start to fall apart, as that particular set of instructions doesn't work in the new communication interface.

@c.poupin in trying the casting method I seem to have very quickly hit a wall. I am getting error messages telling me that Sheets objects don't have a property called "Item", so I am unable to call up the specific sheet I need to read. Any ideas?

I am able to cycle through the Sheets item and get a list of every sheet name (same as you had in your example), but when I try to call up an individual sheet using Item it doesn't work.

Python.NET does not handle virtual indexers well.
You can use get_Item instead:

ws = Excel.Worksheet(Excel.Sheets(workbook.Worksheets).get_Item(1))
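
get_Item also accepts the sheet name if you know which tab you want ('Sheet1' is a placeholder here):

ws = Excel.Worksheet(Excel.Sheets(workbook.Worksheets).get_Item("Sheet1"))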

Thanks @c.poupin

Here's the first portion (reading from Excel) re-written for CPython. I still have to work out the writing portion.

import sys
import clr
from System import Environment 

clr.AddReference('Microsoft.Office.Interop.Excel, Version=11.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c')
from Microsoft.Office.Interop import Excel
from System.Runtime.InteropServices import Marshal

clr.AddReference("System")
from System.Collections.Generic import List as cList

from System.Collections.Generic import *

from System import Array


#The inputs to this node will be stored as a list in the IN variables.
dataEnteringNode = IN

myfilepath = IN[0]
oldfilepath=IN[1]
sheetname=IN[2]

datalist=[]
colist=[]
framelist=[]
changelist=[]

def ExcelColNumToExcelColString(_num):
    if _num < 1:
        raise ValueError("Index is too small")
    result = ""
    while True:
        if _num > 26:
            _num, r = divmod(int(_num) - 1, 26)
            result = chr(r + ord('A')) + result
        else:
            return chr(_num + ord('A') - 1) + result

exold = Excel.ApplicationClass()
exold.Visible = True
exold.DisplayAlerts = False 

try:
        # cast each COM object to its interop interface before using its members
        workbookold = Excel.Workbooks(exold.Workbooks).Open(oldfilepath)
        wsoldall = Excel.Sheets(workbookold.Worksheets)
        wsold=Excel.Worksheet(wsoldall.get_Item(sheetname))
        oldexcelfileval=1
        wname=wsold.get_Name()
        valold=[]
        xlRangeold=Excel.Range(wsold.get_UsedRange())
        rowsold=Excel.Range(xlRangeold.get_Rows()).Count
        colsold=Excel.Range(xlRangeold.get_Columns()).Count
        
        # one Value2 call per column, casting each column object to Range first
        for c in xlRangeold.Columns:
            valold.append(Excel.Range(c).get_Value2())
        
        exceldata0=list(valold[0])
        exceldata1=list(valold[1])
        exceldata2=list(valold[2])
        exceldata3=list(valold[3])
        exceldata4=list(valold[4])
        exceldata5=list(valold[5])
        exceldata6=list(valold[6])
        exceldata7=list(valold[7])
        exceldata8=list(valold[8])
        exceldata9=list(valold[9])
        


        exold.ActiveWorkbook.Close(False)

        Marshal.ReleaseComObject(wsold)
        Marshal.ReleaseComObject(workbookold)
        exold.Quit()
        Marshal.ReleaseComObject(exold)

except:
        oldexcelfileval=0
        Marshal.ReleaseComObject(exold)



OUT=exceldata0,exceldata1
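
I expect the writing half will mirror the IronPython version with the same casts - something like this (untested sketch; the get_Range accessor is used where the indexer may fail, and optional COM arguments may need extra handling in Python.NET):

ex = Excel.ApplicationClass()
ex.Visible = True
ex.DisplayAlerts = False

try:
    workbook = Excel.Workbook(Excel.Workbooks(ex.Workbooks).Open(myfilepath))
except:
    workbook = Excel.Workbook(Excel.Workbooks(ex.Workbooks).Add())

worksheet = Excel.Worksheet(Excel.Sheets(workbook.Worksheets).Add())
worksheet.Name = sheetname

# pack the column lists into a 2D .NET array and write it in one Value2 assignment
changearray = Array.CreateInstance(object, len(exceldata0), 2)
for j in range(len(exceldata0)):
    changearray[j, 0] = exceldata0[j]
    changearray[j, 1] = exceldata1[j]

cell2 = Excel.Range(worksheet.get_Range("A1", "B" + str(len(exceldata0))))
cell2.Value2 = changearray

worksheet.SaveAs(myfilepath)
Marshal.ReleaseComObject(worksheet)
Marshal.ReleaseComObject(workbook)
ex.Quit()
Marshal.ReleaseComObject(ex)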

3 Likes

Upcoming versions of Dynamo will include the openpyxl module (and a few others) as standard.

See here Python Modules #13728
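
For reference, once openpyxl is available in the box, the equivalent read/write is pretty compact (sketch - paths and sheet names are placeholders):

from openpyxl import load_workbook, Workbook

# Read (read_only streams rows instead of loading the whole sheet into memory)
wb = load_workbook(r'C:\temp\input.xlsx', read_only=True)
ws = wb['Sheet1']
rows = [list(row) for row in ws.iter_rows(values_only=True)]
wb.close()

# Write
out = Workbook()
ws_out = out.active
ws_out.title = 'Export'
for row in rows:
    ws_out.append(row)
out.save(r'C:\temp\output.xlsx')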

1 Like