We've already parsed all of those fields out of the metadata documents and indexed them, so by far the easiest way to get the information you want is to use our CN query() service (http://mule1.dataone.org/ArchitectureDocs-current/apis/CN_APIs.html#CNRead.query). To avoid overloading the server, please query for a maximum of 1000 records at a time. Each query should only take a second or so to run, so it should only take a couple of minutes to work through all of the metadata documents we have and retrieve those fields. To do that, write a short script that calls our query service in a loop, using curl each time through the loop to execute a query like this:

https://cn.dataone.org/cn/v1/query/solr/?fl=id,title,abstract,keywords&q=formatType:METADATA&rows=1000&start=0

Note the last two parameters, rows and start. rows indicates how many records to retrieve, and start indicates where in the result set to begin retrieving. So in your first call set start=0 to get the first block of 1000 records, in your second call set start=1000 to get the second block, and so on. When a call returns fewer than 1000 records, you've retrieved them all.
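
The paging loop can be sketched in a few lines of Python. This just builds the sequence of query URLs; the endpoint and parameters are the ones from the example query above, and the record count passed in is an assumption you'd replace with the numFound value from your first response:

```python
from urllib.parse import urlencode

BASE = "https://cn.dataone.org/cn/v1/query/solr/"

def page_urls(num_found, rows=1000):
    """Build one query URL per block of `rows` records, stepping `start`."""
    params = {
        "fl": "id,title,abstract,keywords",
        "q": "formatType:METADATA",
        "rows": rows,
    }
    return [
        BASE + "?" + urlencode({**params, "start": start})
        for start in range(0, num_found, rows)
    ]

# With ~127,000 matching records this yields 127 URLs; each one can then
# be fetched with curl or urllib.request.urlopen.
urls = page_urls(127000)
```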

Note that many of these metadata documents are revisions of the same underlying metadata, so if you only want the newest revision of any given metadata document, use a query that filters out all of the obsoleted revisions, like this:

https://cn.dataone.org/cn/v1/query/solr/?fl=id,title,abstract,keywords&q=formatType:METADATA+-obsoletedBy:*&rows=1000&start=0

At present (April 2013), there are about 127,000 metadata documents overall, and about 46,000 once obsoleted revisions are filtered out, so that query will be even faster to page through (check the numFound field in the query response to see how many records match your query).

In either case, the return document is a Solr result set with all of the fields encoded in an easily parseable XML format. Here's an example of what you will get back for one of the metadata documents:

<doc>
    <str name="abstract">
    To establish a long term data base on the nutrient dynamics of a salt marsh estuarine system.   
    This data can be used in correlation with a number of other estuarine data sets to obtain a 
    broader definition of the over all estuarine ecosystem.
    </str>
    <str name="id">doi:10.6073/AA/knb-lter-nin.2578.3</str>
    <arr name="keywords">
        <str>nutrient dynamics</str>
        <str>North Inlet Estuary</str>
        <str>Baruch Institute</str>
        <str>Georgetown, South Carolina</str>
    </arr>
    <str name="title">
    Daily Water Sample Chlorophyll a, and Phaeophytin a data for North Inlet Estuary system, Georgetown, SC.
    </str>
</doc>
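
Since the response is plain XML, the fields are easy to pull out with Python's standard-library ElementTree. This sketch parses a trimmed copy of the example <doc> above; a full response wraps many <doc> elements inside a <result> element, so you'd iterate over those in practice:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example <doc> element shown above.
sample = """<doc>
    <str name="id">doi:10.6073/AA/knb-lter-nin.2578.3</str>
    <arr name="keywords">
        <str>nutrient dynamics</str>
        <str>North Inlet Estuary</str>
    </arr>
    <str name="title">Daily Water Sample Chlorophyll a data, North Inlet Estuary</str>
</doc>"""

doc = ET.fromstring(sample)

# Single-valued fields are direct <str> children; multi-valued fields
# (like keywords) are <arr> elements holding nested <str> values.
record = {el.get("name"): el.text for el in doc.findall("str")}
record["keywords"] = [kw.text for kw in doc.findall("arr[@name='keywords']/str")]

print(record["id"])        # doi:10.6073/AA/knb-lter-nin.2578.3
print(record["keywords"])  # ['nutrient dynamics', 'North Inlet Estuary']
```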

I think this should do what you want.
