Tapeworm

From HPCwiki
Revision as of 16:25, 3 March 2026 by Prins0891 (talk | contribs)

Tapeworm: Automated tape-archival of old datasets

Warning:

  • This documentation page is under construction and may contain errors.
  • The Tapeworm application is in beta and may contain errors.

https://tapeworm.anunna.wur.nl/

Tapeworm helps you manage data on /archive by identifying datasets that are no longer actively used and preparing them for tape archival. The goal is simple: keep our warm storage available for active work, while safely preserving older data on tape.

With Tapeworm, you can:

  • See which of your datasets are being considered for tape archival.
  • Review planned moves before they happen.
  • Approve, snooze, or block moves when needed.
  • Add metadata to help describe archived datasets. The metadata is stored on tape with the data and can be used to find and retrieve the dataset later, should you need it again.

If you do nothing, Tapeworm will continue with the planned move after the review period. That is why we recommend checking your pending actions regularly. You will also receive notification emails about pending actions.

How Tapeworm works

  1. Tapeworm scans /archive and builds an index of datasets with their size, owner, and last active use.
  2. A policy engine checks which datasets look stale (for example: inactive for 30+ days and larger than 1 GB).
  3. Matching datasets are marked as planned and shown in your overview.
  4. You are notified by email that Tapeworm plans to move data you own.
  5. You can review and change what should happen, or block the move(s) entirely.
  6. If no action is taken within the 4-week wait period, planned moves can become scheduled and are then executed.
  7. The data is moved to tape and removed from /archive.
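
The selection in steps 2, 3 and 6 can be sketched roughly as follows. This is an illustration only, not Tapeworm's actual code: the Dataset record, the function names and the dictionary layout are hypothetical, and the thresholds are simply the example values from step 2 and the 4-week wait period from step 6. The real policy may use different rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed thresholds, copied from the example in step 2 and the wait period in step 6.
MIN_AGE = timedelta(days=30)
MIN_SIZE_BYTES = 1 * 1000**3  # 1 GB
REVIEW_PERIOD = timedelta(weeks=4)


@dataclass
class Dataset:
    """Hypothetical index record; not Tapeworm's real schema."""
    path: str
    owner: str
    size_bytes: int
    last_active: datetime


def is_stale(ds: Dataset, now: datetime) -> bool:
    """Step 2: inactive long enough and large enough to be a candidate."""
    return (now - ds.last_active) >= MIN_AGE and ds.size_bytes > MIN_SIZE_BYTES


def plan_moves(datasets, now):
    """Steps 3 and 6: mark stale datasets as planned and attach a review deadline."""
    return [
        {
            "path": ds.path,
            "owner": ds.owner,
            "status": "planned",
            "review_until": now + REVIEW_PERIOD,
        }
        for ds in datasets
        if is_stale(ds, now)
    ]
```

In this sketch a dataset that is old but small, or large but recently used, is never planned, which mirrors the "30+ days and larger than 1 GB" example.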

Who sees what?

  • Regular users see only their own datasets and actions.
  • Group admins/contacts see data for their configured group(s), in addition to their own data.
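
As a rough illustration of the visibility rule above, the filter below returns what a given user would see. The field names "owner" and "group" and the function name are assumptions made for this sketch; how Tapeworm actually stores ownership and group membership is not documented here.

```python
def visible_datasets(datasets, user, admin_groups=()):
    """Return the datasets a user may see.

    Regular users pass no admin_groups and see only their own data.
    Group admins/contacts additionally see data of their configured groups.
    """
    groups = set(admin_groups)
    return [
        d for d in datasets
        if d["owner"] == user or d["group"] in groups
    ]
```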

User pages

1) Overview

This is your action page. It shows items that currently need your decision.

For each candidate, you can:

  • Approve: proceed with the tape move. The move is scheduled for the next day.
  • Deny: stop this move and create an override for this path. Tapeworm will not try to move this dataset/path again until you remove the override.
  • Snooze: postpone the decision to a future date.
  • Edit metadata: add key/value notes for archived data. These values are stored on tape with the data and can be used to find and retrieve datasets from tape.

You can also select multiple rows and apply actions in bulk.
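
The effect of the three decisions can be sketched as below. This is a simplified model based only on the descriptions above, not the application's real logic: the function names, the dictionary keys and the "denied" status label are hypothetical.

```python
from datetime import date, timedelta


def apply_action(candidate, action, today, overrides, snooze_until=None):
    """Return an updated copy of a move candidate after a user decision."""
    updated = dict(candidate)
    if action == "approve":
        # Approved moves are scheduled for the next day.
        updated["status"] = "scheduled"
        updated["move_date"] = today + timedelta(days=1)
    elif action == "deny":
        # Denying stops the move and records an override for the path.
        updated["status"] = "denied"  # hypothetical status label
        overrides.add(candidate["path"])
    elif action == "snooze":
        # Snoozing postpones the decision to a future date.
        if snooze_until is None or snooze_until <= today:
            raise ValueError("snooze needs a date in the future")
        updated["review_until"] = snooze_until
    else:
        raise ValueError(f"unknown action: {action}")
    return updated


def apply_bulk(candidates, action, today, overrides, snooze_until=None):
    """Apply the same action to several selected rows."""
    return [
        apply_action(c, action, today, overrides, snooze_until)
        for c in candidates
    ]
```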

2) Datasets

This page shows your discovered datasets, their sizes, and last activity times. Tapeworm cannot know which data belongs together, so its grouping into 'datasets' is a best guess. If the grouping shown on this page is wrong, you can change how Tapeworm should handle these datasets.

Important:

  • If a dataset already has an active move candidate, scheduling controls are disabled.
  • The dataset list is informational; move decisions are handled through the Schedule page.

3) Schedule

This page shows move candidates and their status over time.

Common statuses:

  • Planned (or planned + notified): under review.
  • Scheduled: move is planned for a specific date.
  • Executing / Tape staged / On tape: move is in progress or completed.
  • Error: the move needs admin attention. We may contact you, or we may resolve it ourselves.

Once a move is already executing or completed, schedule-changing actions are locked.
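
The locking rule can be pictured as a small status check. This is a sketch under assumptions: the enum and function names are invented for illustration, and the Error status is treated as locked because the FAQ on disabled buttons lists it alongside executing, staged and on tape.

```python
from enum import Enum


class Status(Enum):
    PLANNED = "planned"
    SCHEDULED = "scheduled"
    EXECUTING = "executing"
    TAPE_STAGED = "tape staged"
    ON_TAPE = "on tape"
    ERROR = "error"


# Statuses in which schedule-changing actions are no longer offered (assumed set).
LOCKED = frozenset(
    {Status.EXECUTING, Status.TAPE_STAGED, Status.ON_TAPE, Status.ERROR}
)


def can_edit_schedule(status: Status) -> bool:
    """True while the move is still under review or only scheduled."""
    return status not in LOCKED
```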

4) Overrides

Overrides tell Tapeworm to ignore specific paths in future planning.

Use overrides when:

  • a project is still active and needs to remain on /archive
  • policy suggestions are not appropriate for that location

If you agree that the dataset can in principle be moved to tape, but you don't (yet) know when, you can choose to postpone/snooze the archival instead of overriding it.

Overrides apply to the selected path and everything below it.
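
"The selected path and everything below it" means matching on whole path components, not on a plain string prefix. A minimal sketch of that rule, with a hypothetical function name and example paths that are not taken from Tapeworm itself:

```python
from pathlib import PurePosixPath


def is_overridden(path, overrides):
    """True if path equals an override path or lies somewhere below it."""
    candidate = PurePosixPath(path)
    for override in overrides:
        root = PurePosixPath(override)
        # Compare path components, so /archive/proj does not match /archive/project2.
        if candidate == root or root in candidate.parents:
            return True
    return False
```

With an override on /archive/proj, the path /archive/proj/run1/data is covered, while the sibling /archive/project2 is not, even though its name starts with the same characters.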

5) History

This page shows completed archival moves. When a dataset has been successfully archived and finalized, it is removed from active scheduling pages and moved into history.

History helps you answer:

  • what was moved,
  • when it moved,
  • where it went on tape.

Group pages

Group admins have a separate set of pages for their group scope:

  • Group overview
  • Group datasets
  • Group schedule
  • Group overrides
  • Group history

If you manage more than one group, you can switch group scope in the selector at the top of the group pages.

Notifications (email)

Tapeworm sends email updates when actions are pending, when dates are approaching, or when a move changes state.

Emails typically include:

  • dataset path,
  • size and last activity,
  • current status,
  • review/scheduled date.

Notification types you may receive:

  • Action required: please approve, snooze, or deny.
  • Reminder: review date is approaching.
  • Informational: move status changed (for example scheduled, staging, or completed).
  • Escalation: sent to group contacts when no user response is received.

Please read these emails carefully: they are your chance to adjust decisions before execution.

Best practices for users

  • Check your Overview page regularly.
  • Use Snooze if you need time to validate impact.
  • Add metadata when approving important datasets.
  • Use Overrides for known exceptions.
  • If unsure, contact HPC support before a scheduled move date.

FAQ

What happens if I do nothing?

Planned items can move forward automatically after the review window (typically 4 weeks).

Can I undo after tape staging?

Not directly in Tapeworm. Retrieval is done via the tape/iRODS workflow. See: https://irods.wur.nl/userguide/tape_retrieval/

What does “completed” mean?

Completed means Tapeworm saw the tape workflow finish and finalized the move. Before finalization, the system verifies the archive in iRODS and only then removes the staged source copy.

Why is an action button disabled?

Usually because the move has already progressed (executing, tape staged, on tape, or error), so the schedule can no longer be changed.

Why do I see “planned + notified”?

That means the dataset move is planned and a notification has already been sent.

Need help?

If anything is unclear, or you think a move is incorrect but you cannot alter it in the provided GUI, please open an HPC support ticket.