
Key: NXP-701
Type: New Feature
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Bogdan Stefanescu
Reporter: Eric Barroca
Votes: 1
Watchers: 2

Nuxeo Enterprise Platform

XML-based Import / Export for documents and repositories

Created: 02/03/07 14:54   Updated: 10/06/08 02:19
Component/s: Core
Affects Version/s: None
Fix Version/s: 5.1.RC

Time Tracking:
Original Estimate: 1 week
Remaining Estimate: 1 week
Time Spent: Not Specified

Issue Links:
Duplicate
 

Resolution Date: 20/06/07 16:22
Participants: Benjamin Jalon, Bogdan Stefanescu, Dragos Mihalache, Eric Barroca and Thierry Delprat
Date of First Response: 02/03/07 15:46
Tags:


Description
Implement an import/export feature that can export a list of documents (of any type) together with repository-related information (e.g. security, versioning).

The export format should be XML-based and reuse the XML Schema definitions of the content types.

Comments
Dragos Mihalache - 02/03/07 15:46
Implementation:
As a service in the NXP application, or as a separate (external) tool.

Some aspects :

1. Import/export should run in the background and/or as a long-running process, without affecting application performance.

2. Export: should work from a snapshot of the database (or JCR repository). Data may change while the export is running, so simply reading the latest data could leave us with an inconsistent exported repository.

3. Import: whatever the level at which data is written back to live storage, should events be sent as they are when creating/copying documents? At the very least the cache needs to know when new documents are added. Some document post-creation processors may not need to be (or should not be) called, so selective notification might be needed.


Thierry Delprat - 03/05/07 01:35
Data import
===========

Target :
--------
The goal is to provide an easy and pluggable way to import content into Nuxeo 5.
The entry point will be the core API, accessed via the Platform service:
 - via POJO on a stand-alone core
 - via EJB remoting on a JBoss-embedded core

Pipeline :
----------
In order to provide a pluggable solution, we use an NxRuntime service with 3 extension points:
 - Data Reader : reads data from a source
   - read all data from an external source
   - read some data from the core (NXQL query)

=> returns simple DocumentModel-like artifacts (HashMaps or XML trees)

 - Data Transformer : converts the read data to the storage structure
   - field mapping
   - meta-data extraction

 - Data Writer : writes data to the target destination
   - write to an external source
   - write to the core

=> The writer may implement batch writing to optimise transactions

NXDocumentPipe also supports a configuration extension point that defines the configuration of a pipe :
 - one reader with one config
 - one converter with one config
 - one writer with one config
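
A minimal sketch of the reader / transformer / writer pipe described above, in plain Java. All names here (DocumentPipeSketch, DataTransformer, DataWriter, fieldMapper, run) are hypothetical and are not the actual NXDocumentPipe API; documents are modelled as simple HashMaps, and the reader is just a java.util.Iterator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical pipe: one reader, one transformer, one writer. */
final class DocumentPipeSketch {

    /** Converts one read document to the storage structure. */
    interface DataTransformer {
        Map<String, Object> transform(Map<String, Object> doc);
    }

    /** Writes one batch of documents to the target destination. */
    interface DataWriter {
        void writeBatch(List<Map<String, Object>> batch);
    }

    /** A simple field mapper: renames fields, leaves unknown fields untouched. */
    static DataTransformer fieldMapper(Map<String, String> renames) {
        return doc -> {
            Map<String, Object> out = new HashMap<>();
            for (Map.Entry<String, Object> e : doc.entrySet()) {
                out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
            }
            return out;
        };
    }

    /** Runs the pipe, handing documents to the writer in batches; returns the document count. */
    static int run(Iterator<Map<String, Object>> reader, DataTransformer transformer,
                   DataWriter writer, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<Map<String, Object>> batch = new ArrayList<>();
        int count = 0;
        while (reader.hasNext()) {
            batch.add(transformer.transform(reader.next()));
            count++;
            if (batch.size() == batchSize) {
                writer.writeBatch(new ArrayList<>(batch)); // one batch per transaction
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            writer.writeBatch(new ArrayList<>(batch)); // flush the last, partial batch
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> source = new ArrayList<>();
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Hello");
        source.add(doc);
        Map<String, String> renames = new HashMap<>();
        renames.put("title", "dc:title");
        run(source.iterator(), fieldMapper(renames), batch -> System.out.println(batch), 10);
    }
}
```

In a real service the three parts would be contributed through the extension points and wired together by the pipe configuration; here they are passed in directly to keep the sketch self-contained.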

Starter kit :
-------------

Besides the DocumentPipe service, we must provide :
 - a core reader :
   - NXQL-based extraction
   - supports pagination

 - a simple XSLT converter
   - DM => XML =XSLT=> XML => DM

 - a simple field mapper

 - a core writer
   - supports batch transactions
   - configuration for base path
   - configuration for disabling events
   - starts a reIndexall at the end

 - an XML reader : reads XML representations of DocumentModels
 - an XML writer : serializes a DocumentModel as a simple <doc><schema><field> tree
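
As an illustration of the XML writer item, here is a minimal sketch that serializes a document to a doc/schema/field tree. The document is assumed to be a map of schema name to (field name to value); the class name SimpleXmlDocWriter and the use of a "name" attribute on each element are assumptions made for this example, not a defined export format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical writer: document = schema name -> (field name -> value). */
final class SimpleXmlDocWriter {

    /** Escapes the characters that are not allowed in XML text or attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serializes the schemas as a <doc><schema><field> tree, in map iteration order. */
    static String toXml(Map<String, Map<String, String>> schemas) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (Map.Entry<String, Map<String, String>> schema : schemas.entrySet()) {
            sb.append("<schema name=\"").append(escape(schema.getKey())).append("\">");
            for (Map.Entry<String, String> field : schema.getValue().entrySet()) {
                sb.append("<field name=\"").append(escape(field.getKey())).append("\">")
                  .append(escape(field.getValue()))
                  .append("</field>");
            }
            sb.append("</schema>");
        }
        return sb.append("</doc>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "A & B");
        Map<String, Map<String, String>> doc = new LinkedHashMap<>();
        doc.put("dublincore", fields);
        System.out.println(toXml(doc));
    }
}
```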

Benjamin Jalon - 03/05/07 16:06
Thierry asked me to describe my use case, since it could use this functionality :
I have a Nuxeo 5 instance whose Core is filled with data.
I would like to migrate this data to a new docType (with slightly different schemas).
The transfer will go from my old Core to the new one, where the data will be saved under the new docType.

I hope that's clear...

Eric Barroca - 08/05/07 15:13
Please check out Talend (http://www.talend.com). It's an ETL (Extract / Transform / Load) tool that should provide the right infrastructure for data import from external sources. In this case, we would only write an NXCore loader/writer for Talend. The goal is to benefit from the data sources provided by Talend, so that we can offer more import sources to our customers without writing a specific connector for each product.

Thierry Delprat - 11/05/07 04:19
Some more ideas about the future NXIO (other names welcome).

Simple Readers :
================

Core Reader
-----------
Reads documents from the core and returns a paginated list of DocumentModels.
Data extraction will initially be configured by giving an NXQL query.

XML Reader
----------
Reads XML files and generates a list of DocumentModels.
The DMs may be very basic but need to have at least :
 - a path
 - DataModels filled and named

About the XML format
--------------------
The input source will be configured via a URL : initially only the file:// protocol will be implemented.
The source may be :
 - a folder containing XML or zip files
 - a zip file (zipped folder containing XML files)
Each XML file represents one document : a simple XML tree document/schemas/schemaX/fieldY
The idea is that the document/schemas node should be valid against the schemas' XSDs.
The root node (document) will also have some special children :
 - "transfert"
   - Source (label)
   - Date and time of generation
   - BlobRepresentation
     - externalized
     - externalized with digest
     - base64 inline
   - RelativePath
   - signature/digest of the document
 - "type"
   - Typename
   - "facets" to store the list of facets of the document
   - lifecycle
 - "security" : ACP XML representation
For a first implementation I guess we can start without security and lifecycle.

All data in XML will be UTF-8.
Blobs can be stored inline in base64.
Blobs can also be externalized.
If documentA.xml contains the XML export, the externalized files will be at the same level.
The blob file names can be arbitrary : they just have to be referenced in the main XML file.
=> for example <externalizedBlob> file://documentA_fieldY.xxx </externalizedBlob>

In the mid term, it will be useful to also store a signature/digest of the externalized blob inside the XML file.
=> at first a simple SHA1 would be good
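
A small sketch of the two blob representations mentioned above, using only the JDK: base64 for inline blobs and a SHA-1 hex digest for externalized ones. The class and method names (BlobDigestSketch, sha1Hex, inlineBase64, matches) are hypothetical, and how the digest would actually be stored in the XML file is left open.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Hypothetical helpers for inline and externalized blobs. */
final class BlobDigestSketch {

    /** Returns the SHA-1 digest of the blob content as lowercase hex. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Encodes a blob for the "base64 inline" representation. */
    static String inlineBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Checks an externalized blob against the digest recorded in the main XML file. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        byte[] blob = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(inlineBase64(blob)); // prints YWJj
        System.out.println(sha1Hex(blob));      // prints a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```

On import, a reader could recompute the digest of each externalized file and reject the document when matches(...) returns false.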

Simple Writers
==============

Core Writer
-----------
Reads a paginated list of DMs and creates them in the repository.
Configuration includes :
 - target core/domain
 - base path
 - base ACL
 - default folder type
The default folder type is useful because in some systems folders are not documents : since the XML export format only exports documents with their paths, some folderish nodes may be missing.
(Needed for processing Lotus Notes XML exports)
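
To make the missing-folder case concrete, here is a sketch that computes which ancestor paths are absent from a list of exported document paths, so that a writer could create them with the default folder type first. The class and method names are hypothetical, and paths are assumed to be absolute and '/'-separated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: finds folder paths that the export did not include. */
final class MissingFolderSketch {

    /** Returns the ancestor paths missing from docPaths, each parent before its children. */
    static List<String> missingFolders(List<String> docPaths) {
        Set<String> known = new HashSet<>(docPaths);
        Set<String> missing = new LinkedHashSet<>(); // keeps insertion order, no duplicates
        for (String path : docPaths) {
            String[] segments = path.split("/");
            StringBuilder current = new StringBuilder();
            // Walk every ancestor of the document, excluding the document itself.
            for (int i = 0; i < segments.length - 1; i++) {
                if (segments[i].isEmpty()) {
                    continue; // skip the empty segment before the leading '/'
                }
                current.append('/').append(segments[i]);
                String ancestor = current.toString();
                if (!known.contains(ancestor)) {
                    missing.add(ancestor);
                }
            }
        }
        return new ArrayList<>(missing);
    }

    public static void main(String[] args) {
        List<String> exported = Arrays.asList("/notes/2007/memo1", "/notes/2007/memo2");
        System.out.println(missingFolders(exported)); // prints [/notes, /notes/2007]
    }
}
```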

XML Writer
----------
Writes the XML representation to the configured output URL (for now file only).
Configuration :
 - target path
 - externalize blobs true/false
 

Simple converter
================

The converter may be used for :

 - type mapping
 Defines the target type when reading from a file system.
 => at start we will just define a default type
 ==> all incoming nodes will get the default type
 => and also a simple type bijection (Benjamin needs something like that)

 - RelativePath mapping
 May be used to define the path of the document to import when reading from a file system.
 => at start, just enable/disable adding the source relative path as a prefix to the target path.

 - security
 Configurable ACLs based on the meta-data and some rules
 => at start nothing

 - filtering
 Hides some data when reading from the core
 => at start a list of schemas that are not exported
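
The type mapping and filtering items above could look roughly like the following sketch. It assumes one possible reading of the notes: a type found in the mapping is translated, and anything else (including nodes with no type) gets the default type. ConverterSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical converter helpers: type mapping and schema filtering. */
final class ConverterSketch {

    /** Maps a source type to a target type; unmapped or missing types get the default. */
    static String mapType(String sourceType, Map<String, String> typeMap, String defaultType) {
        if (sourceType == null || sourceType.isEmpty()) {
            return defaultType;
        }
        String target = typeMap.get(sourceType);
        return target != null ? target : defaultType;
    }

    /** Inverts a type mapping; fails if it is not a bijection (two sources, one target). */
    static Map<String, String> invert(Map<String, String> typeMap) {
        Map<String, String> inverse = new HashMap<>();
        for (Map.Entry<String, String> e : typeMap.entrySet()) {
            if (inverse.put(e.getValue(), e.getKey()) != null) {
                throw new IllegalArgumentException("not a bijection: " + e.getValue());
            }
        }
        return inverse;
    }

    /** Returns a copy of the document's schemas without the excluded ones. */
    static <V> Map<String, V> filterSchemas(Map<String, V> schemas, Set<String> excluded) {
        Map<String, V> kept = new LinkedHashMap<>();
        for (Map.Entry<String, V> e : schemas.entrySet()) {
            if (!excluded.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> typeMap = new HashMap<>();
        typeMap.put("OldNote", "Note");
        System.out.println(mapType("OldNote", typeMap, "File")); // prints Note
        System.out.println(mapType(null, typeMap, "File"));      // prints File
        Set<String> excluded = new HashSet<>();
        excluded.add("internal");
        Map<String, String> schemas = new LinkedHashMap<>();
        schemas.put("dublincore", "kept");
        schemas.put("internal", "dropped");
        System.out.println(filterSchemas(schemas, excluded).keySet()); // prints [dublincore]
    }
}
```

Having the inverse of the bijection available means the same mapping table could drive both directions of a migration, which seems close to what the docType migration use case needs.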

ETL
===
The purpose is not to replace an ETL.
But some very "content-oriented" mappings will be difficult to do with an ETL :
 - type mapping
 - rights mapping
 - ...

In the mid term, an ETL like Talend could be plugged into the Nuxeo import/export pipe :
 - via the XML file input/output
 - via a set of connectors to Talend that provide readers and writers for NXIO.



Thierry Delprat - 11/05/07 04:22
I think Bogdan S can write the component infrastructure : Runtime Service + facade + EP