
How can we download a specific set of metadata fields for a large number of metadata records?

asked 2013-03-20 15:21:23 -0500

pouchard

updated 2013-03-21 01:33:32 -0500

vieglais

We are interested in getting metadata for a large number of records from DataONE. For example, we'd like to be able to retrieve the identifier, title, abstract, and keywords for all publicly accessible metadata documents in DataONE. Rather than retrieving each metadata document through the DataONE API individually and then parsing them to extract the needed metadata, it would be a lot more efficient for us to get a data dump of the metadata documents in DataONE, which we can then parse ourselves to get only these fields. How can we go about retrieving this large number of records?


1 answer


answered 2013-03-20 15:29:45 -0500

updated 2013-05-09 15:27:07 -0500

We've already parsed all of those fields out of all of the metadata documents and indexed them, so by far the easiest way to get the information you want is to use our CN query() service. To avoid overloading the server, please request at most 1000 records per query. Each query should take only a second or so to run, so it should take only a couple of minutes to run through all of the metadata documents that we have and get those fields. To do that, you would write a short script (a line or two) that loops over calls to our query service, and on each pass through the loop you would use curl to execute a query like this:

https://cn.dataone.org/cn/v1/query/solr/?fl=id,title,abstract,keywords&q=formatType:METADATA&rows=1000&start=0

Note the last two parameters (rows and start). The rows parameter indicates how many records to retrieve, and start indicates where in the record set to start retrieving records. So in your first call, set start=0 to get the first block of 1000, in your second call set start=1000 to get the second block of 1000 records, and so on. When a call returns fewer than 1000 records, you have retrieved all of them (see the numFound field in the query response for the total number of records that match your query).
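The paging loop described here could be sketched in Python roughly as follows. This is a minimal sketch, not DataONE's own client: the function names (page_url, fetch_docs, fetch_all) are hypothetical, it assumes the query string is attached to the endpoint with "?", and it assumes each record comes back as a <doc> element inside a standard Solr XML response.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint taken from the answer; the "?" separator is an assumption.
BASE_URL = "https://cn.dataone.org/cn/v1/query/solr/"
PAGE_SIZE = 1000  # the answer asks for at most 1000 records per query


def page_url(start, rows=PAGE_SIZE, query="formatType:METADATA"):
    """Build the query URL for one block of records (hypothetical helper)."""
    params = {
        "fl": "id,title,abstract,keywords",
        "q": query,
        "rows": rows,
        "start": start,
    }
    # Keep ",", ":" and "*" unescaped so the URL reads like the one above.
    return BASE_URL + "?" + urlencode(params, safe=",:*")


def fetch_docs(url):
    """Download one page and return its <doc> elements as a list."""
    with urlopen(url) as response:
        root = ET.parse(response).getroot()
    return list(root.iter("doc"))


def fetch_all(fetch_page=fetch_docs, rows=PAGE_SIZE):
    """Yield every record, advancing start by rows until a short page arrives."""
    start = 0
    while True:
        docs = fetch_page(page_url(start, rows))
        yield from docs
        if len(docs) < rows:  # fewer than a full block: nothing left to fetch
            break
        start += rows
```

Calling fetch_all() with no arguments would query the live service one block at a time; passing a different fetch_page callable makes it possible to try the loop offline.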

Note that many of these metadata documents are newer versions of the same metadata, so if you only want the newest revision of any given metadata document, you would want to use a query that filters out all of the obsoleted revisions. To do that, you could use a query like this:

https://cn.dataone.org/cn/v1/query/solr/?fl=id,title,abstract,keywords&q=formatType:METADATA+-obsoletedBy:*&rows=1000&start=0

At present (April 2013), there are about 127,000 metadata documents overall, and about 46,000 once you filter out obsoleted revisions. So retrieving only the current revisions will be even faster, since there are fewer blocks to fetch.
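As a rough illustration of the two query variants and of reading the total match count, here is a small Python sketch. The helper names (metadata_query_url, num_found) and the sample response are made up for illustration; the sketch assumes the query string follows a "?" and that numFound appears as an attribute of the <result> element, as it does in standard Solr XML output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the answer; the "?" separator is an assumption.
BASE_URL = "https://cn.dataone.org/cn/v1/query/solr/"


def metadata_query_url(start=0, rows=1000, current_only=True):
    """Build a query URL, optionally excluding obsoleted revisions."""
    q = "formatType:METADATA"
    if current_only:
        # Exclude any record that has an obsoletedBy value.
        q += " -obsoletedBy:*"
    params = {
        "fl": "id,title,abstract,keywords",
        "q": q,
        "rows": rows,
        "start": start,
    }
    # The space in q is encoded as "+", matching the URL in the answer.
    return BASE_URL + "?" + urlencode(params, safe=",:*")


def num_found(response_xml):
    """Read the total number of matching records from a Solr XML response."""
    result = ET.fromstring(response_xml).find(".//result")
    return int(result.get("numFound"))


# Made-up response skeleton, only to show where numFound lives.
SAMPLE_RESPONSE = (
    '<response><result name="response" numFound="3" start="0">'
    "</result></response>"
)
total = num_found(SAMPLE_RESPONSE)
```

Knowing the total up front lets a script estimate how many blocks of 1000 it will need before it starts paging.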

In either case, the returned document is a Solr result set with all of the fields encoded in an easily parseable XML format. Here's an example of what you will get back for one of the metadata documents:

<doc>
    <str name="abstract">
    To establish a long term data base on the nutrient dynamics of a salt marsh estuarine system.   
    This data can be used in correlation with a number of other estuarine data sets to obtain a 
    broader definition of the over all estuarine ecosystem.
    </str>
    <str name="id">doi:10.6073/AA/knb-lter-nin.2578.3</str>
    <arr name="keywords">
        <str>nutrient dynamics</str>
        <str>North Inlet Estuary</str>
        <str>Baruch Institute</str>
        <str>Georgetown, South Carolina</str>
    </arr>
    <str name="title">
    Daily Water Sample Chlorophyll a, and Phaeophytin a data for North Inlet Estuary system, Georgetown, SC.
    </str>
</doc>
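One possible way to turn a <doc> element like this into a plain Python dictionary is sketched below. The function name doc_to_record is hypothetical, and the sketch assumes that single-valued fields arrive as <str> elements and multi-valued fields as <arr> elements containing <str> children, as in the example above. The sample is a shortened copy of that example.

```python
import xml.etree.ElementTree as ET


def doc_to_record(doc):
    """Convert one Solr <doc> element into a dict keyed by field name."""
    record = {}
    for field in doc:
        name = field.get("name")
        if field.tag == "arr":
            # Multi-valued field: collect each child value into a list.
            record[name] = [(item.text or "").strip() for item in field]
        else:
            # Single-valued field: collapse line breaks and extra spaces.
            record[name] = " ".join((field.text or "").split())
    return record


# Shortened copy of the example record shown above.
SAMPLE_DOC = """<doc>
    <str name="id">doi:10.6073/AA/knb-lter-nin.2578.3</str>
    <arr name="keywords">
        <str>nutrient dynamics</str>
        <str>North Inlet Estuary</str>
    </arr>
    <str name="title">
    Daily Water Sample Chlorophyll a, and Phaeophytin a data for North Inlet Estuary system, Georgetown, SC.
    </str>
</doc>"""

record = doc_to_record(ET.fromstring(SAMPLE_DOC))
```

Collapsing whitespace is a design choice here: the abstract and title in the response carry the line breaks and indentation of the original metadata, which is rarely wanted in a flat table.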

I think this should do what you want.


Comments

This answer is great. I would just add that by appending "&wt=csv" or "&wt=json" to the example queries above, you can get the response in comma-separated values (CSV) format or JavaScript Object Notation (JSON), respectively.

skye (2013-03-20 15:59:56 -0500)
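If the wt=json or wt=csv variants are used, the responses can be read with Python's standard library, roughly as sketched here. The helper names and both sample payloads are made up for illustration. The sketch assumes the usual Solr JSON layout (records under response, then docs) and assumes the CSV output starts with a header row of field names.

```python
import csv
import io
import json


def docs_from_json(text):
    """Return the list of records from a Solr JSON response body."""
    payload = json.loads(text)
    return payload["response"]["docs"]


def rows_from_csv(text):
    """Return one dict per record from a CSV response with a header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up payloads, only to show the assumed shape of each format.
SAMPLE_JSON = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "example-1", "title": "First record"}]}}'
)
SAMPLE_CSV = "id,title\nexample-1,First record\n"

json_docs = docs_from_json(SAMPLE_JSON)
csv_rows = rows_from_csv(SAMPLE_CSV)
```

JSON keeps multi-valued fields such as keywords as lists, which may make it the more convenient of the two when those fields matter.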
