Thanks Andrew, appreciate the kind words and looking forward to hearing your thoughts after you try it.
You’re right that SQL-like queries can handle a lot. But the challenge is that they require a seasoned user who already knows what they’re looking for and where to find it.
IFC is complicated. The class hierarchy, property sets, quantity sets, and relationships are a lot to navigate even for experienced users. Then the authoring software adds another layer: Revit exports properties differently from ArchiCAD or Tekla, and each tool adds its own custom property sets and naming conventions. So the person validating the file needs to understand both the IFC schema itself and the quirks of whatever software created it.
A few real world examples where this becomes a problem:
Someone asks “show me all the external walls.” Simple enough. But what if some walls are classified as IfcBuildingElementProxy because the authoring software didn’t map them correctly? A property filter won’t catch those. The LLM can look at the properties, materials, and context and flag them anyway. It can even capture images and identify elements visually when nothing in the metadata marks them as walls.
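To make that concrete, here’s a minimal sketch of the kind of heuristic the assistant could apply to proxies. The element dicts, field names, and thresholds are all made up for illustration; this isn’t the IFCNodes API or real IFC parsing:

```python
# Hypothetical sketch: flag IfcBuildingElementProxy elements that are
# probably misclassified walls, using name/material/shape heuristics.
# Element dicts and thresholds are illustrative, not a real IFC data model.

def looks_like_wall(element):
    """Heuristic: proxies that are long, thin, and tall often hide walls."""
    name = (element.get("Name") or "").lower()
    material = (element.get("Material") or "").lower()
    if "wall" in name or material in {"brick", "concrete block"}:
        return True
    dims = element.get("Dimensions")  # (length, thickness, height) in metres
    if dims:
        length, thickness, height = dims
        return thickness < 0.5 and height > 2.0 and length > thickness * 4
    return False

elements = [
    {"Type": "IfcBuildingElementProxy", "Name": "Basic Wall 014", "Dimensions": (6.0, 0.2, 2.7)},
    {"Type": "IfcBuildingElementProxy", "Name": "Planter Box", "Dimensions": (1.0, 1.0, 0.5)},
    {"Type": "IfcWall", "Name": "Ext Wall 001", "Dimensions": (4.0, 0.3, 2.7)},
]

suspects = [e["Name"] for e in elements
            if e["Type"] == "IfcBuildingElementProxy" and looks_like_wall(e)]
print(suspects)  # ['Basic Wall 014']
```

A plain type filter on IfcWall would have returned only “Ext Wall 001” and silently missed the mislabeled proxy.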
Or imagine asking “which doors don’t meet fire rating requirements?” The fire rating might be stored under FireRating in the standard Pset_DoorCommon in one file, under FireResistance in another, or buried in a vendor-specific property set. A SQL query needs to know exactly where to look. The LLM can reason across all of them.
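Even the deterministic fallback for this ends up being a scan over several candidate names. A rough sketch, with the property-set layout mocked (a real pipeline would pull these from the file, e.g. via something like ifcopenshell’s pset utilities):

```python
# Hypothetical sketch: a fire rating lookup has to try several property-set
# and property names, because authoring tools disagree on where it lives.
# The pset dicts below are mocked stand-ins for extracted IFC data.

FIRE_RATING_KEYS = ("FireRating", "FireResistance", "Fire_Rating")

def find_fire_rating(psets):
    """Scan every property set on an element for any known fire-rating key."""
    for pset_name, props in psets.items():
        for key in FIRE_RATING_KEYS:
            if key in props:
                return props[key], f"{pset_name}.{key}"
    return None, None

door_a = {"Pset_DoorCommon": {"FireRating": "EI30", "IsExternal": False}}
door_b = {"Vendor_DoorData": {"FireResistance": "EI60"}}
door_c = {"Pset_DoorCommon": {"IsExternal": True}}  # rating missing entirely

for door in (door_a, door_b, door_c):
    print(find_fire_rating(door))
```

And this only covers the spellings you thought to list; the LLM can also catch a rating hiding under a name nobody anticipated.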
Or something like “find all rooms with inadequate ventilation.” That might require combining room volumes, window areas, mechanical equipment, and external standards. Not a single query. A chain of lookups plus external context.
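The chain-of-lookups shape of that question can be sketched like this. The 5% window-to-floor ratio stands in for an external building code, and the room data is invented, so treat it purely as an illustration of combining sources:

```python
# Hypothetical sketch: "inadequate ventilation" is a chain of lookups, not one
# query. Assumed rule: openable window area >= 5% of floor area, unless the
# room has mechanical ventilation. The threshold mimics an external standard.

MIN_WINDOW_TO_FLOOR = 0.05  # assumed external-standard ratio, not a real code

rooms = [
    {"Name": "Office 101", "FloorArea": 20.0, "Windows": [1.2, 0.8], "MechVent": False},
    {"Name": "Server Room", "FloorArea": 15.0, "Windows": [], "MechVent": True},
    {"Name": "Storage B1", "FloorArea": 30.0, "Windows": [0.6], "MechVent": False},
]

def inadequate(room):
    if room["MechVent"]:  # mechanical ventilation satisfies the mocked rule
        return False
    window_area = sum(room["Windows"])
    return window_area < room["FloorArea"] * MIN_WINDOW_TO_FLOOR

flagged = [r["Name"] for r in rooms if inadequate(r)]
print(flagged)  # ['Storage B1']
```

Each field here would come from a different lookup in practice (space quantities, window geometry, MEP elements), which is exactly why it’s a chain rather than a single query.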
So the way I see it, IFCNodes works as a two-layer system.
Layer one is the deterministic nodes. Filtering, spatial queries, clash detection, geometry extraction. For the experienced user who knows the schema and knows what they want. Fast, predictable, no tokens involved.
Layer two is the LLM assistant. For the less experienced user, or when information is scattered, or when elements are misclassified, or when you need to combine multiple sources including external standards. It figures out which tools to use and in what order.
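To sketch what layer two looks like mechanically: the assistant produces an ordered plan over the deterministic tools and executes it step by step. Everything below is mocked for illustration, including the hard-coded plan that a real LLM would generate from the question and the tool specs:

```python
# Hypothetical sketch of layer two: an ordered chain of deterministic tools.
# The plan is hard-coded here; in practice the LLM would emit it from the
# user's question. Tool names and signatures are invented for this example.

TOOLS = {
    "filter_elements": lambda args, ctx: [e for e in ctx if e["Type"] == args["type"]],
    "get_properties":  lambda args, ctx: [{**e, "Psets": e.get("Psets", {})} for e in ctx],
    "summarise":       lambda args, ctx: f"{len(ctx)} matching element(s)",
}

plan = [  # what the planner might emit for "which doors lack a fire rating?"
    {"tool": "filter_elements", "args": {"type": "IfcDoor"}},
    {"tool": "get_properties",  "args": {}},
    {"tool": "summarise",       "args": {}},
]

model = [
    {"Type": "IfcDoor", "Name": "D01"},
    {"Type": "IfcWall", "Name": "W01"},
]

ctx = model
for step in plan:  # each tool consumes the previous tool's output
    ctx = TOOLS[step["tool"]](step["args"], ctx)
print(ctx)  # 1 matching element(s)
```

The point is that each individual step stays deterministic and auditable; only the choice and ordering of steps comes from the model.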
Both layers have their place depending on the user and the problem.
On the vector database idea: yes, it’s definitely feasible and something I’m planning to explore. But I don’t see it as a replacement for the LLM, more like a complement. The vector database would handle retrieval, finding relevant elements or properties based on similarity. But you still need something to reason over the results, decide what to do next, chain multiple operations together, or explain findings in plain language. That’s where the LLM stays valuable.
So the ideal setup is probably both working together. Vector database for efficient retrieval, LLM for reasoning and orchestration.
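Here’s the shape of that split in a toy example. The embeddings are tiny hand-made vectors rather than output from a real model, so only the division of labour is meaningful:

```python
# Hypothetical sketch of the retrieval/reasoning split: a vector index ranks
# candidate elements by cosine similarity, then the reasoning step (the LLM's
# job) takes over. Embeddings are invented; a real setup would embed element
# names plus property text with an embedding model.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

index = {
    "IfcWall external brick":  [0.9, 0.1, 0.0],
    "IfcDoor fire-rated EI30": [0.1, 0.9, 0.1],
    "IfcSlab ground floor":    [0.2, 0.1, 0.9],
}

def retrieve(query_vec, k=2):
    """Retrieval layer: similarity search, deterministic and cheap."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

query = [0.85, 0.2, 0.05]  # pretend this embeds "external walls"
candidates = retrieve(query)
print(candidates)  # ['IfcWall external brick', 'IfcDoor fire-rated EI30']
# Reasoning layer (not shown): hand `candidates` to the LLM to chain
# follow-up tool calls and explain the result in plain language.
```

Retrieval narrows thousands of elements to a handful cheaply; the LLM then spends its tokens only on the reasoning over that handful.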