
How large a dataset can be uploaded to a Generic Member Node (GMN) via the DataONE API?

asked 2014-08-21 15:54:36 -0500 by lmoyers1

How large a dataset can be uploaded to a Generic Member Node (GMN) via the DataONE API? Have we tested this out to see if there are boundaries?

Comments

Just for clarity, DataONE makes a distinction between Science Data and Science Metadata (SciMeta) objects. Some of the SciMeta standards that DataONE supports allow datasets to be embedded within the metadata document. The Coordinating Node will only accept a SciMeta object of up to 1 gigabyte, which bounds the size of any Member Node's SciMeta objects.

waltz (2014-08-22 00:35:08 -0500)

1 answer

answered 2014-08-22 12:29:45 -0500 by rnahf, updated 2014-08-22 12:31:45 -0500

(This answer is summarized from an internal email thread on the DataONE developers list and from waltz's comment on the question above.)

If the dataset (data file) is registered as one of the 'DATA' format types (see the complete list), you should be fine: GMN imposes no file size limit for any format type. However, if the dataset's format is a 'METADATA' format type (NetCDF, for example), then after it reaches the GMN node it will also be uploaded to a DataONE Coordinating Node so that its record can be built in the central search index, and the Coordinating Node does not accept METADATA objects larger than 1 GB.

So any dataset represented as a METADATA format type has an effective size limit of 1 GB: even though it could be uploaded to GMN successfully, it would never be registered with the Coordinating Node.
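
One practical way to tell which rule applies before uploading is to look up the object's formatId in the Coordinating Node's format list. Here is a minimal sketch, assuming the public CN REST endpoint GET /cn/v2/formats (which returns an ObjectFormatList XML document); the 'netCDF-3' formatId in the last line is also an assumption, so check the list for the exact identifier:

    import xml.etree.ElementTree as ET

    import requests

    CN_FORMATS_URL = "https://cn.dataone.org/cn/v2/formats"

    def _local(tag):
        # Strip any XML namespace: '{uri}name' -> 'name'.
        return tag.rsplit("}", 1)[-1]

    def format_type(format_id):
        """Return the formatType ('DATA', 'METADATA', ...) for a DataONE formatId."""
        resp = requests.get(CN_FORMATS_URL, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        for elem in root.iter():
            if _local(elem.tag) != "objectFormat":
                continue
            fields = {_local(c.tag): (c.text or "").strip() for c in elem}
            if fields.get("formatId") == format_id:
                return fields.get("formatType")
        raise ValueError("unknown formatId: " + format_id)

    # Assumed formatId for NetCDF; a METADATA result means the 1 GB CN limit applies.
    print(format_type("netCDF-3"))

If the lookup returns 'DATA', only the usual large-transfer considerations discussed below apply.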

Following are a couple of salient points from the discussion on testing, in response to the idea of trying to upload a 1 TB file into GMN:

Doing this through the API will be limited by the restrictions of transferring any large file over HTTP, and by any fragilities that may exist in the specific target MN implementation (one hopes that there are none). That is to say, on a stable local network it should be possible, but across the public internet it may be flaky due to dropped connections. That said, we regularly transfer 0.5 TB across the internet globally with GBIF, and we only have to drop into retry mechanisms occasionally.
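
As an illustration of such a retry mechanism, here is a minimal sketch assuming the DataONE MNStorage.create REST pattern (a multipart POST to /v2/object with pid, object, and sysmeta fields); the base URL, file paths, and retry parameters are placeholders, and requests_toolbelt's MultipartEncoder is used so the multipart body streams from disk rather than being buffered in memory:

    import time

    import requests
    from requests_toolbelt import MultipartEncoder

    MN_BASE_URL = "https://gmn.example.org/mn"  # hypothetical GMN instance

    def create_with_retries(pid, object_path, sysmeta_path, attempts=5):
        """POST a large object to the Member Node, retrying dropped connections."""
        for attempt in range(1, attempts + 1):
            try:
                # Re-open the files on every attempt so a retry streams the
                # object from the beginning, not from a half-consumed handle.
                with open(object_path, "rb") as obj, open(sysmeta_path, "rb") as sm:
                    encoder = MultipartEncoder(fields={
                        "pid": pid,
                        "object": (object_path, obj, "application/octet-stream"),
                        "sysmeta": (sysmeta_path, sm, "text/xml"),
                    })
                    resp = requests.post(
                        MN_BASE_URL + "/v2/object",
                        data=encoder,
                        headers={"Content-Type": encoder.content_type},
                        timeout=(30, 3600),  # generous read timeout for big files
                    )
                resp.raise_for_status()
                return resp
            except (requests.ConnectionError, requests.Timeout):
                if attempt == attempts:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff

Note that each retry re-sends the whole object from the beginning, which is part of what makes very large transfers over a flaky connection expensive.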

A reply from the GMN developer:

It should work as long as the network connection is stable during the transfer...

One important concern when dealing with such large objects is to make sure that both the client and the MN simply stream the data through to/from disk and don't attempt to buffer up the whole object in memory. Both the client library for Python and GMN were designed to avoid any buffering, and I performed tests with GB-sized files to make sure none was occurring.

Most systems will run into problems trying to buffer a 1 TB file in memory!
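
To make the streaming-versus-buffering distinction concrete, here is a minimal sketch using the plain requests library (the URLs and file names are placeholders, not the DataONE API): passing an open file handle lets requests stream the body from disk in small chunks, whereas calling .read() first would pull the entire object into memory.

    import requests

    UPLOAD_URL = "https://gmn.example.org/upload"                 # hypothetical
    DOWNLOAD_URL = "https://gmn.example.org/mn/v2/object/my-pid"  # hypothetical

    # Risky for terabyte-scale objects: .read() buffers the whole file in
    # memory before the request body is even constructed.
    #   requests.post(UPLOAD_URL, data=open("huge_dataset.nc", "rb").read())

    # Streaming upload: requests reads the open handle from disk in small
    # chunks, so memory use stays flat regardless of object size.
    with open("huge_dataset.nc", "rb") as fh:
        requests.post(UPLOAD_URL, data=fh, timeout=(30, 3600))

    # Streaming download: stream=True plus iter_content() writes the response
    # to disk without ever holding the whole object in memory.
    with requests.get(DOWNLOAD_URL, stream=True, timeout=(30, 3600)) as resp:
        resp.raise_for_status()
        with open("downloaded.nc", "wb") as out:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                out.write(chunk)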

Hope that helps.

