Principal Research Engineer, AEC Data, Generative AI

Autodesk · Enterprise · Toronto, ON +1

This Principal Research Engineer role focuses on building foundation models and generative AI tools for the AEC industry. The primary responsibilities involve developing scalable data pipelines for diverse AEC and infrastructure data sources, working with large-scale multi-modal datasets (text, geometric, terrain, reality capture) to design preprocessing and content understanding methods, and transforming unstructured data into representations suitable for machine learning. The role also involves collaborating with ML scientists to align data formats for downstream LLM training and fine-tuning, and applying data quality techniques. While the core is data engineering for ML (L0), there's a clear connection to downstream training and fine-tuning (L2).

What you'd actually do

  1. Collaborate with other engineers to develop scalable data pipelines for diverse AEC and infrastructure data sources used in production ML systems, including BIM models, CAD drawings, and infrastructure and transportation design data
  2. Work with large-scale infrastructure datasets—such as transportation networks, terrain models, and reality capture data—to enable machine learning workflows for infrastructure planning and engineering
  3. Work with large-scale, multi-modal datasets, including text and geometric data, to design novel preprocessing, augmentation, analysis, and content understanding methods
  4. Transform unstructured AEC and infrastructure data into representations suitable for machine learning
  5. Lead cross-functional collaboration with ML Research Scientists and Engineers to align data formats with downstream training and fine-tuning of LLMs

Skills

Required

  • MSc or PhD in Computer Science, Engineering, or a related field
  • 7+ years of experience in Machine Learning, Engineering, or related fields
  • Proven technical leadership, including leading complex projects and influencing technical direction in cross-functional teams
  • Strong experience in geometric data modeling and processing, including complex 2D/3D representations, computational geometry, and data architectures
  • Familiarity with machine learning concepts and frameworks and how data is represented for training
  • Proficiency in Python and strong software engineering practices
  • Ability to translate research ideas into production-grade systems
  • Excellent communication skills with ability to influence and guide technical decisions
  • Background in Architecture, Engineering, or Construction (AEC)

Nice to have

  • Experience with AEC data formats and workflows (e.g., BIM, IFC, CAD, and infrastructure or transportation design models)
  • Experience working with infrastructure or transportation design tools such as Autodesk Civil 3D, InfraWorks, or similar systems
  • Experience working with reality capture data, including point clouds or LiDAR datasets (e.g., Autodesk ReCap)
  • Experience delivering production ML or data systems
  • Strong foundations in core computer science (algorithms, systems, scalability)
  • Understanding of deep learning architectures (CNNs, Transformers) and familiarity with frameworks such as PyTorch
  • Experience building scalable data or ML pipelines in cloud environments (e.g., AWS, SageMaker)
  • Experience mentoring senior engineers or leading small technical teams
  • Track record of driving technical innovation and best practices

What the JD emphasized

  • Proven technical leadership, including leading complex projects and influencing technical direction in cross-functional teams
  • Strong experience in geometric data modeling and processing, including complex 2D/3D representations, computational geometry, and data architectures
  • Background in Architecture, Engineering, or Construction (AEC)

Other signals

  • develop scalable data pipelines for ML systems
  • design novel preprocessing, augmentation, analysis and content understanding for multi-modal datasets
  • transform unstructured data into representations suitable for machine learning
  • align data formats with downstream training and fine-tuning of LLMs