Tackling the Data Sourcing Problem in Construction Procurement with File Scraping Algorithms
1  CONSTRUCT/GEQUALTEC, FEUP DEC
2  BUILT CoLAB – Collaborative Laboratory for the Future Built Environment
Academic Editor: Jun WANG

Abstract:

The Architecture, Engineering, and Construction (AEC) sector has adopted machine learning (ML) tools more slowly than other industries with similar characteristics. A significant contributing factor is the limited availability of data, as ML techniques rely on large datasets to train algorithms effectively. Yet the construction process itself generates substantial data that characterise each project in detail. This abundance of project data stands in contrast with the difficulty ML developers face in sourcing sufficient data within the AEC industry.

In the specific case of Portuguese Construction Procurement, public construction projects must be submitted to online, openly accessible repositories. However, the consultation and extraction of procurement files is decentralised and not automated, making data aggregation difficult and time-consuming.

To address this, the paper presents a data-scraping algorithm that automatically collects files from construction procurement repositories to build an ML-ready dataset for training ML and Natural Language Processing (NLP) algorithms focused on the procurement phase of the Construction sector. The resulting procurement file dataset comprises bills of quantities (BoQ) and project specifications.
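
As a rough illustration of the file-collection step, the sketch below extracts document links from the HTML of a tender page and labels each one as a BoQ, a specification, or other. It is not the paper's algorithm: the file extensions, the filename keywords (based on the Portuguese terms "mapa de quantidades" and "caderno de encargos"), the function names, and the example URL are all assumptions made for illustration, and a real repository would likely need its own rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed values for illustration only; adjust to the actual repository.
FILE_EXTENSIONS = (".pdf", ".xls", ".xlsx", ".doc", ".docx", ".zip")
BOQ_KEYWORDS = ("mapa_quantidades", "boq")
SPEC_KEYWORDS = ("caderno_encargos", "specification")


class FileLinkParser(HTMLParser):
    """Collects absolute URLs of anchors that point to document files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose path ends in a known document extension.
        if href and href.lower().endswith(FILE_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def classify(url):
    """Label a file URL by simple keyword matching on its name."""
    name = url.lower()
    if any(keyword in name for keyword in BOQ_KEYWORDS):
        return "boq"
    if any(keyword in name for keyword in SPEC_KEYWORDS):
        return "specification"
    return "other"


def extract_procurement_files(html, base_url):
    """Return (url, label) pairs for every document link found in the page."""
    parser = FileLinkParser(base_url)
    parser.feed(html)
    return [(url, classify(url)) for url in parser.links]
```

In practice the HTML would come from an HTTP request to each tender page, and the returned URLs would then be downloaded and stored alongside their labels.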

In future studies, the dataset will be processed into a standardised format suitable for NLP-based BoQ task-matching algorithms, which aim to automate construction budgeting for tender proposals.

Keywords: Construction; Public Procurement; Contract Awarding; Scraping Algorithm; Database; Artificial Intelligence; Machine Learning; Natural Language Processing