Data import
===========

Target :
--------

The goal is to provide an easy, pluggable way to import content into Nuxeo 5.

The entry point is the core API, accessed through the Platform service:

- via POJO on a standalone core
- via EJB remoting on a JBoss embedded core

Pipeline :
----------

To keep the solution pluggable, we use an NxRuntime service with 3 extension points:

- Data Reader: reads data from a source

  - read all data from an external source
  - read some data from the core (NXQL query)

  => returns simple DocumentModel-like artifacts (HashMaps or XML tree)

- Data Transformer: converts the data read to the storage structure

  - field mapping
  - metadata extraction

- Data Writer: writes data to the target destination

  - write to an external source
  - write to the core

  => the writer may implement batch writing to optimise transactions

NXDocumentPipe also supports a configuration extension point that defines the configuration of a pipe:

- one reader with one config
- one converter with one config
- one writer with one config

Starter kit :
-------------

Besides the DocumentPipe service, we must provide:

- a core reader

  - NXQL-based extraction
  - supports pagination

- a simple XSLT converter

  - DM => XML =XSLT=> XML => DM

- a simple field mapper
- a core writer

  - supports batch transactions
  - configuration for base path
  - configuration for event disconnection
  - starts reIndexall at the end

- an XML reader: reads XML representations of DocumentModels
- an XML writer: serializes a DocumentModel as a simple <doc><schema><field> tree

Thierry asked me to explain my use case, which could make use of this functionality:
I have a Nuxeo 5 instance whose core is filled with data. I would like to migrate this data to a new docType (with slightly different schemas). The transfer would go from my old core to the new one, where the data would be saved under the new docType. I hope this is clear.

Please check out Talend (http://www.talend.com). It is an ETL tool (Extract / Transform / Load) that should provide the right infrastructure for importing data from external sources. In this case, we would only write an NXCore loader/writer for Talend. The goal is to benefit from the data sources provided by Talend and offer more import sources to our customers without writing a specific connector for each product.
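
For the docType migration use case raised above, the reader / transformer / writer pipe could look roughly like the sketch below. It is only an illustration: the interface names, the ``fieldMapper`` helper, the field names (``note:text``, ``newnote:body``) and the type names are all hypothetical and are not part of the Nuxeo API. Documents are represented as plain maps, in the spirit of the "simple DocumentModel-like artifacts (HashMaps ...)" mentioned in the pipeline description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a reader -> transformer -> writer pipe; not the Nuxeo API. */
public class MigrationPipeSketch {

    /** Reads documents as plain maps (field name -> value). */
    interface DataReader { List<Map<String, Object>> read(); }

    /** Converts one document that was read into the target structure. */
    interface DataTransformer { Map<String, Object> transform(Map<String, Object> doc); }

    /** Writes transformed documents to the target. */
    interface DataWriter { void write(List<Map<String, Object>> docs); }

    /** Renames fields according to a mapping and sets the new document type. */
    static DataTransformer fieldMapper(String newType, Map<String, String> renames) {
        return doc -> {
            Map<String, Object> out = new HashMap<>();
            for (Map.Entry<String, Object> e : doc.entrySet()) {
                // Fields without an explicit mapping keep their name.
                out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
            }
            out.put("type", newType); // overrides the old type copied by the loop
            return out;
        };
    }

    /** Runs the pipe: read everything, transform each document, write the result. */
    static void run(DataReader reader, DataTransformer transformer, DataWriter writer) {
        List<Map<String, Object>> converted = new ArrayList<>();
        for (Map<String, Object> doc : reader.read()) {
            converted.add(transformer.transform(doc));
        }
        writer.write(converted);
    }

    /** Runs the sketch on one in-memory document and returns what the writer received. */
    public static List<Map<String, Object>> demo() {
        Map<String, Object> old = new HashMap<>();
        old.put("type", "OldNote");          // made-up source type
        old.put("note:text", "hello");       // made-up source field
        Map<String, String> renames = new HashMap<>();
        renames.put("note:text", "newnote:body");
        List<Map<String, Object>> target = new ArrayList<>();
        run(() -> Collections.singletonList(old), fieldMapper("NewNote", renames), target::addAll);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real setup the reader and writer would presumably both be core-backed (old core in, new core out), and only the transformer would carry the schema differences.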
Some more ideas about the future NXIO (other names welcomed).

Simple Readers :
================

Core Reader
-----------

Reads documents from the core and returns a paginated list of DocumentModels.
The data extraction will initially be configured with an NXQL query.

XML Reader
----------

Reads XML files and generates a list of DocumentModels.
The DMs may be very basic, but each needs at least:

- a path
- DataModels that are filled and named

About the XML format
--------------------

The input source will be configured via a URL: initially only the file:// protocol will be implemented.

The source may be:

- a folder containing XML or zip files
- a zip file (zipped folder containing XML files)

Each XML file represents one document, as a simple XML tree: document/schemas/schemaX/fieldY

The idea is that the document/schemas node should be valid against the schemas' XSDs.

The root node (document) will also have some special children:

- "transfert"

  - Source (label)
  - Date and hour of generation
  - BlobRepresentation

    - externalized
    - externalized with digest
    - base64 inline

  - RelativePath
  - signature/digest of the document

- "type"

  - Typename

- "facets" to store the list of facets of the document
- lifecycle
- "security": XML representation of the ACP

For a first implementation, I guess we can start without security and lifecycle.

All data in XML will be UTF-8.

The blobs can be stored inline in base64, or they can be externalized.
If documentA.xml contains the XML export, the externalized files will be at the same level.
The blob file names can be arbitrary: they just have to be referenced in the main XML file.

=> for example <externalizedBlob> file://documentA_fieldY.xxx </externalizedBlob>

In the mid term, it will also be useful to store a signature/digest of the externalized blob inside the XML file.

=> at first a simple SHA1 would be good

Simple Writers
==============

Core Writer
-----------

Reads a paginated list of DMs and creates them in the repository.
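
The page-by-page write loop of such a core writer could be sketched as follows. This is a minimal illustration under stated assumptions: ``Session``, ``RecordingSession``, ``paginate`` and ``writePages`` are hypothetical names, the session is an in-memory stand-in for the repository, and "one commit per page" is just one possible reading of the batch writing idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a writer that commits one batch per page; not the Nuxeo API. */
public class BatchWriterSketch {

    /** Stand-in for the repository session (assumed interface). */
    interface Session {
        void create(String path);
        void commit();
    }

    /** In-memory session that records what was created and how often commit ran. */
    static class RecordingSession implements Session {
        final List<String> created = new ArrayList<>();
        int commits = 0;
        public void create(String path) { created.add(path); }
        public void commit() { commits++; }
    }

    /** Splits the input into pages of at most pageSize items. */
    public static <T> List<List<T>> paginate(List<T> items, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> pages = new ArrayList<>();
        for (int from = 0; from < items.size(); from += pageSize) {
            int to = Math.min(from + pageSize, items.size());
            pages.add(new ArrayList<>(items.subList(from, to)));
        }
        return pages;
    }

    /** Creates every document under basePath, committing once per page. */
    static void writePages(Session session, String basePath, List<List<String>> pages) {
        for (List<String> page : pages) {
            for (String relativePath : page) {
                session.create(basePath + "/" + relativePath);
            }
            session.commit(); // one transaction per page instead of one per document
        }
    }

    /** Writes five documents in pages of two and returns the number of commits. */
    public static int demoCommits() {
        RecordingSession session = new RecordingSession();
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e");
        writePages(session, "/import", paginate(docs, 2));
        return session.commits;
    }

    public static void main(String[] args) {
        System.out.println(demoCommits()); // 5 documents in pages of 2 -> prints 3
    }
}
```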

Configuration includes:

- target core/domain
- base path
- base ACL
- default folder type

The default folder type is useful because in some systems folders are not documents: the XML export format only exports documents with their path, so folderish nodes may be missing. (Needed for processing Lotus Notes XML exports.)

XML Writer
----------

Writes the XML representation to the configured output URL (file only for now).

Configuration:

- target path
- externalize blobs: true/false

Simple converter
================

The converter may be used for:

- type mapping

  Defines the target type when reading from a file system.

  => at start we will just define a default type

  ==> all incoming nodes will get the default type

  => and also a simple type bijection (Benjamin needs something like that)

- RelativePath mapping

  May be used to define the path of the imported document when reading from a file system.

  => at start just enable/disable adding the source relative path as a prefix to the target path

- security

  Configurable ACL based on the metadata and some rules.

  => at start nothing

- filtering

  Hides some data when reading from the core.

  => at start a list of schemas that are not exported

ETL
===

The purpose is not to replace an ETL. But some very "content-oriented" mappings will be difficult to do with an ETL:

- type mapping
- rights mapping
- ...

In the mid term, an ETL like Talend could be plugged into the Nuxeo import/export pipe:

- via the XML file input/output
- via a set of connectors to Talend that provide readers and writers for NXIO

I think Bogdan S can write the component infrastructure: Runtime Service + facade + EP
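
As an illustration of the type mapping listed under "Simple converter", here is a minimal sketch combining a default type with a one-to-one type mapping. The class name, method names and type names (``NotesMemo``, ``Note``, ``File``) are made up for the example, and the rule "untyped or unmapped nodes get the default type" is an assumption about how the two options would combine.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the converter's type mapping; type names are made up. */
public class TypeMapperSketch {

    private final Map<String, String> forward = new HashMap<>();
    private final Map<String, String> backward = new HashMap<>();
    private final String defaultType;

    public TypeMapperSketch(String defaultType) {
        this.defaultType = defaultType;
    }

    /** Registers a source <-> target pair, rejecting anything that would break the bijection. */
    public TypeMapperSketch map(String source, String target) {
        if (forward.containsKey(source) || backward.containsKey(target)) {
            throw new IllegalArgumentException(
                    "mapping must be one-to-one: " + source + " -> " + target);
        }
        forward.put(source, target);
        backward.put(target, source);
        return this;
    }

    /** Target type for an incoming node; untyped or unmapped nodes get the default type. */
    public String toTarget(String sourceType) {
        if (sourceType == null) {
            return defaultType;
        }
        return forward.getOrDefault(sourceType, defaultType);
    }

    /** Reverse lookup, possible because the mapping is one-to-one; null if unknown. */
    public String toSource(String targetType) {
        return backward.get(targetType);
    }

    public static void main(String[] args) {
        TypeMapperSketch mapper = new TypeMapperSketch("File").map("NotesMemo", "Note");
        System.out.println(mapper.toTarget("NotesMemo")); // mapped type
        System.out.println(mapper.toTarget(null));        // default type
        System.out.println(mapper.toSource("Note"));      // reverse direction
    }
}
```

Keeping both directions in sync is what makes the mapping usable for export as well as import; a plain one-way map would not guarantee that.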
This could run as a service inside the nxp application or as a separate (external) tool.
Some aspects to consider:
1. Import/export should run in the background and/or as a long-running process, without affecting application performance.
2. Export: it should work on a snapshot of the database (or JCR repository). While an export is running, some data might change, and we could end up with an inconsistent exported repository if we simply retrieve the latest data.
3. Import: whatever the level at which data is written back to live storage, should events be sent as they are when documents are created or copied? At the very least, the cache needs to know when new documents are added. Some document post-creation processors may not need to be called (or should not be), so selective notification might be needed.