Digital repositories that act as resource aggregators typically face challenges that fall roughly into three categories: extraction, improvement, and storage. The first category comprises issues arising from the variety of resource collection protocols (OAI-PMH, web crawling, web services, etc.) and of resource representations (XML, HTML, database tuples, unstructured documents, etc.). The second category comprises improvements to the collected information, such as mapping values to controlled vocabularies, normalizing date formats, and correcting malformed data. Finally, the third category deals with the destination of the downloaded resources: unification into a common database, sorting by certain criteria, etc.
This paper proposes an ETL (extract, transform, load) architecture for designing a software application that provides a comprehensive solution to the challenges posed by a digital repository acting as a resource aggregator.
The design and implementation aspects considered during the development of this tool are described, with particular emphasis on the main features of the architecture.
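As an illustration only, the three categories above can be read as the stages of a small pipeline. The sketch below is not the tool described in this paper; the sample XML, the field names, the vocabulary table, and all function names are hypothetical and chosen just to show one stage per category: extraction from an XML representation, improvement through a controlled vocabulary and date normalization, and storage into a single sorted collection.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample input standing in for harvested XML records.
SAMPLE_XML = """<records>
  <record><title> Paper A </title><date>03/02/2011</date><type>paper</type></record>
  <record><title>Thesis B</title><date>2010-12-31</date><type>Dissertation</type></record>
</records>"""

# Hypothetical controlled vocabulary for resource types.
TYPE_VOCABULARY = {"paper": "article", "dissertation": "thesis"}

# Date formats assumed for this sketch; a real source may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def extract(xml_text):
    """Extraction: turn each <record> element into a plain dict."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in record}
        for record in root.iter("record")
    ]


def normalize_date(value):
    """Return the date in ISO 8601 form, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(record):
    """Improvement: trim text, normalize the date, map the type."""
    raw_type = record.get("type", "").strip().lower()
    return {
        "title": record.get("title", "").strip(),
        "date": normalize_date(record.get("date", "")),
        "type": TYPE_VOCABULARY.get(raw_type, "other"),
    }


def load(records, store):
    """Storage: unify records into one collection sorted by date."""
    store.extend(records)
    store.sort(key=lambda r: r["date"] or "")
    return store


def run_pipeline(xml_text):
    """Chain the three stages over one XML document."""
    return load([transform(r) for r in extract(xml_text)], [])
```

Calling `run_pipeline(SAMPLE_XML)` returns the two sample records as cleaned dictionaries ordered by date. In this reading, each stage can be replaced independently, for example by swapping `extract` for an OAI-PMH harvester or `load` for a database writer.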