Preliminary test results for Fedora Ingestion Service
Notice regarding GSearch Service
During the tests, we noticed that if the Ingest Service stops for any reason (RPC errors, Internet connection problems, or cancellation of the task by the operator), GSearch deletes the index file and, after the incident, indexes only the last successfully ingested objects.
- Proposed solution: manage the Fedora Commons messaging service in these cases so that a notification is sent to the Repository Admin or, better, for massive ingests, stop GSearch indexing before the ingest and start it again afterwards in a separate background thread.
- 1,462,351 milliseconds (about 24 minutes) of manual indexing time for 212,713 objects, needed because the index file had been deleted after the workflow was cancelled from the console.
- 1,920,544 milliseconds (about 32 minutes) of manual indexing time for 258,493 objects after a massive ingest. The messaging service had been disabled.
- 3,433,129 milliseconds (about 57 minutes) of manual indexing time for 335,010 objects.
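The second variant of the proposed solution (stop indexing, ingest, then reindex in a background thread) can be sketched roughly as follows. This is only an illustration of the control flow: `IndexerStub`, `indexing_paused` and `bulk_ingest` are hypothetical names, the repository is modelled as a plain list, and none of this corresponds to a real GSearch or Fedora Commons API.

```python
import threading
from contextlib import contextmanager


class IndexerStub:
    """Hypothetical stand-in for the indexer; not a real GSearch API."""

    def __init__(self):
        self.listening = True   # True while per-object index updates are active
        self.indexed = []       # object IDs currently in the index

    def pause(self):
        self.listening = False

    def resume(self):
        self.listening = True

    def rebuild(self, object_ids):
        # Full reindex: replace the index with the given repository contents.
        self.indexed = list(object_ids)


@contextmanager
def indexing_paused(indexer):
    """Pause indexing for the duration of a bulk ingest, then always resume."""
    indexer.pause()
    try:
        yield
    finally:
        indexer.resume()


def bulk_ingest(repository, indexer, objects):
    """Ingest with indexing paused, then start a reindex in a background thread.

    Returns the started thread so the caller can join it or carry on.
    """
    with indexing_paused(indexer):
        for obj in objects:
            repository.append(obj)  # hypothetical ingest step
    worker = threading.Thread(
        target=indexer.rebuild, args=(list(repository),), daemon=True
    )
    worker.start()
    return worker


repository = ["obj:1"]
indexer = IndexerStub()
worker = bulk_ingest(repository, indexer, ["obj:2", "obj:3"])
worker.join()
print(indexer.indexed)
```

In this sketch an interrupted ingest only resumes indexing and lets the error propagate; whether to also notify the Repository Admin or trigger a reindex at that point is left open.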
Memory used by Ingestion Service application
- The Ingestion Service uses at most 320 MB of memory and 50% of a 1.8 GHz processor, because it uses the Xalan-Java 2.7.1 XSLT processor.
- Preparation for the ingest accounts for the peak memory use; the ingest itself uses only 60 MB of memory.
After testing the Ingest Service on the largest collection, Vascular plants (UNITS), the results are as follows:
- Preparing for the ingest takes 52 minutes. Of this total, file validation and splitting of multi-valued items take 35 minutes for 49,851 objects.
- The ingest itself ran from 12:50 PM to 5:45 PM, at about 10,000 objects per hour.
- The total number of objects in the K2N Fedora Commons repository as of 2009-09-27 5:45 PM was 335,010 digital objects.
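As a rough cross-check of the reported ingest rate, the figures above can be combined as follows. The sketch assumes that all 49,851 objects of the collection were ingested within the reported time window and that the ingest started on the same day it finished; neither assumption is stated explicitly in the report.

```python
from datetime import datetime

# Times taken from the report; the start date is an assumption (same day as the end).
start = datetime(2009, 9, 27, 12, 50)
end = datetime(2009, 9, 27, 17, 45)

# Assumption: the whole collection was ingested within this window.
objects_ingested = 49_851

elapsed_hours = (end - start).total_seconds() / 3600
rate_per_hour = objects_ingested / elapsed_hours

print(f"elapsed: {elapsed_hours:.2f} h")
print(f"rate: {rate_per_hour:.0f} objects/hour")
```

Under these assumptions the window is just under five hours and the rate comes out slightly above 10,000 objects per hour, which agrees with the reported figure.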
Metadata quality. Notices for data providers
- It is strongly recommended to supply a complete set of metadata: providers are advised to provide as much metadata as possible for each digital resource;
- It is strongly recommended to provide a unique Resource ID for each resource, to improve metadata management;
- Please do not copy and paste from HTML-formatted sources (HTML pages, TaxonPages, other Internet resources). Such metadata is pre-formatted with HTML, and our parsing, analysis and search tools have to work hard to strip the HTML tags at several stages;
- Please try to keep the metadata coherent, especially when you submit large collections with attachments: if a collection changes over time and its attachments change as well, please remove the attachment reference from any wiki collection pages to which the attachment no longer belongs;
- Keep the Metadata template, used for collections, coherent with the Infobox Organization template, used for provider descriptions. See the NatureGate wiki page, where the application initially could not find the provider automatically until a manual correction had been made.
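Two of the recommendations above (no HTML markup in metadata values, and unique Resource IDs) can be checked by providers before submission. The following is a minimal sketch using only the Python standard library; the record layout and the field name `resource_id` are hypothetical and do not reflect the actual Metadata template, and this is not the code used by the Ingestion Service.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that usually separate words, so a space is inserted where they occur.
_BREAKING_TAGS = {"br", "p", "div", "li"}


class _TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BREAKING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return value with HTML tags removed, entities decoded, whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return " ".join("".join(parser.parts).split())


def duplicate_ids(records, id_field="resource_id"):
    """Return the IDs that occur more than once; records without an ID are skipped."""
    counts = Counter(r[id_field] for r in records if r.get(id_field))
    return sorted(rid for rid, n in counts.items() if n > 1)


# Made-up sample records for illustration only.
records = [
    {"resource_id": "A1", "title": "<b>Quercus</b> robur &amp; relatives"},
    {"resource_id": "A1", "title": "Quercus petraea"},
    {"resource_id": "B2", "title": "<p>Fagus<br/>sylvatica</p>"},
]

cleaned_titles = [strip_html(r["title"]) for r in records]
print(cleaned_titles)
print(duplicate_ids(records))
```

Running a check like this on the provider side would surface duplicate IDs and leftover markup before the collection reaches the ingest tool.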