ArXiv OAI-PMH arXivRaw publication metadata (Q12237)

From MaRDI portal
Revision as of 16:06, 20 February 2025 by Importer (Created a new Item)
Dataset published at Zenodo repository.
Language: English
Label: ArXiv OAI-PMH arXivRaw publication metadata
Description: Dataset published at Zenodo repository.

    Statements

    This dataset contains OAI-PMH metadata for all ArXiv publications up until 2024-04-23 in the arXivRaw XML format. The metadata has been harvested using the metha Go package v0.3.3 [1] on go1.18.

    Specifically, harvesting was run on a small HPC cluster using the following SLURM script. The script had to be scheduled twice due to the connection being reset by the peer (see combined-slurm.out). metha caters for these situations and is able to pick up where it left off with cumulative harvesting.

        #!/bin/bash
        #SBATCH --job-name=metha
        #SBATCH --nodes=1
        #SBATCH --cpus-per-task=1
        #SBATCH --ntasks=1
        #SBATCH --time=10-20:00:00

        module purge
        echo "Installing Go module."
        module add go/go-1.18/go-1.18-gcc-9.4.0-okbjyoy
        echo "Installed Go module: $(go version)."
        echo "Installing metha."
        go install -v github.com/miku/metha/cmd/...@latest
        echo "Installed metha: $(retracted/go/bin/metha-sync -v)"
        echo "Harvesting ArXiv OAI-PMH metadata in format 'arXivRaw' from http://export.arxiv.org/oai2."
        retracted/go/bin/metha-sync -T 5m -base-dir /scratch/retracted/arxiv -format "arXivRaw" http://export.arxiv.org/oai2
        # For the second run, '-from' was specified to pick up the harvest where it was left off.
        # retracted/go/bin/metha-sync -from 2020-09-29 -T 5m -base-dir /scratch/retracted/arxiv -format "arXivRaw" http://export.arxiv.org/oai2
        echo "Done."
        exit 0

    Dataset contents

    This deposit of the dataset contains the following files:

        metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz: An archive containing the gzipped files (*.xml.gz) produced by metha, which in turn contain the XML metadata files. The gzipped files in the archive are named following the pattern YYYY-MM-DD-<8-digit zero-padded 0-index file count>.xml.gz, e.g., 2024-03-24-00000001.xml.gz.
        README.md: This file, containing basic information about the dataset and deposit.
        combined-slurm.out: The combined SLURM log for the two consecutive SLURM runs that produced the dataset. Run-specific information has been retracted.

    Reproducibility

    As the OAI-PMH metadata is not static but may change at any time, this dataset isn't fully reproducible. However, running the same metha version on the same Go version with the same commands should yield very similar results, although they will contain newer metadata.

    Licenses

    All ArXiv OAI-PMH metadata is licensed under CC0-1.0. combined-slurm.out is licensed under CC0-1.0. README.md is licensed under CC0-1.0. Licenses are documented in a machine-readable manner following the REUSE 3.0 Specification. License deeds are included in this deposit as .txt files named using the respective SPDX license identifiers.

    [1] Martin Czygan, Thomas Gersch, ACz-UniBi, Justin Kelly, Gunnar Þór Magnússon, dvglc, Natanael Arndt. (2024). miku/metha: v0.3.3 (v0.3.3). Zenodo. doi:10.5281/zenodo.10940212.
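    The nested layout described under "Dataset contents" (a tar archive holding gzipped XML files) can be read without unpacking everything to disk. The following is a minimal Python sketch, not part of the deposit: the helper names parse_member_name and iter_xml_members are hypothetical, the regular expression is derived only from the file-name pattern stated above, and the sketch assumes nothing about the XML structure inside each file.

```python
import gzip
import re
import tarfile
from typing import Iterator, Optional, Tuple

# Assumption: member names end in YYYY-MM-DD-<8 digits>.xml.gz, as stated in
# the dataset description. A leading directory prefix is tolerated because
# the layout inside the archive is not documented.
MEMBER_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})-(\d{8})\.xml\.gz$")


def parse_member_name(name: str) -> Optional[Tuple[str, int]]:
    """Return (date, file count) for a matching name, or None otherwise."""
    match = MEMBER_NAME.search(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def iter_xml_members(archive_path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, decompressed XML bytes) for each *.xml.gz member."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that does not follow the pattern.
            if not member.isfile() or parse_member_name(member.name) is None:
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Each member is itself gzip-compressed, so decompress it again.
            # Note: this reads one whole member into memory at a time.
            with gzip.open(raw, mode="rb") as xml_file:
                yield member.name, xml_file.read()
```

    A caller could, for example, loop over iter_xml_members("metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz") and pass each byte string to an XML parser of choice. Reading member by member keeps disk usage low, at the cost of holding one decompressed file in memory at a time.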
    25 April 2024

    Identifiers
