Latest revision as of 10:26, 19 March 2026

Tapeworm: Automated tape-archival of old datasets

https://tapeworm.anunna.wur.nl/

Tapeworm helps you manage data on /archive by identifying datasets that are no longer actively used and preparing them for tape archival. The goal is simple: keep our warm storage available for active work, while safely preserving older data on tape.

With Tapeworm, you can:

  • See which of your datasets are being considered for tape archival
  • Review planned moves before they happen
  • Approve, snooze, or block moves when needed
  • Add metadata to describe archived datasets. The metadata is stored on tape together with the data and can be used to find and retrieve datasets from tape later

If you do nothing, Tapeworm will continue with the planned move after the review period. That is why we recommend checking your pending actions regularly. You will also receive notification emails about pending actions.

From time to time, the user pages may temporarily be unavailable during maintenance. In that case, Tapeworm will show a short maintenance page instead of the normal interface.

How Tapeworm works

  1. Tapeworm scans /archive and builds an index of paths and filesystem metadata.
  2. A dataset discovery step groups those paths into datasets, determines size, owner, and last active use, and makes them available in the GUI. Very small datasets are filtered out and will not be shown.
  3. A policy engine checks which discovered datasets look stale (for example: inactive for 60 days or more and larger than 1 GB).
  4. Matching datasets are marked as planned and shown in your overview.
  5. You are notified by email that Tapeworm plans to move data you own.
  6. You can review and change what should happen, or block the move(s) entirely.
  7. If no action is taken within the 4-week review period, planned moves can become scheduled and are then executed.
  8. Data is moved to tape and removed from /archive.
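The policy check in step 3 can be pictured with a small sketch. This is not Tapeworm's actual code: the `Dataset` record and the function names are hypothetical, and the thresholds are simply the example values mentioned above (60 days, 1 GB, assuming decimal units). The real policy may use different values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Example thresholds taken from the text above; the real policy may differ.
MIN_AGE = timedelta(days=60)
MIN_SIZE_BYTES = 1 * 1000**3  # 1 GB, assuming decimal units


@dataclass
class Dataset:  # hypothetical record, not Tapeworm's actual data model
    path: str
    owner: str
    size_bytes: int
    last_active: datetime


def is_stale(ds: Dataset, now: datetime) -> bool:
    """Return True if the dataset matches the example policy."""
    old_enough = now - ds.last_active >= MIN_AGE
    large_enough = ds.size_bytes > MIN_SIZE_BYTES
    return old_enough and large_enough


def plan_moves(datasets: list[Dataset], now: datetime) -> list[Dataset]:
    """Select the datasets that would be marked as planned."""
    return [ds for ds in datasets if is_stale(ds, now)]
```

In this sketch a dataset is only picked up when both conditions hold, so a large dataset that was used last week and a tiny dataset that has been idle for a year are both left alone.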

Who sees what?

  • Regular users see only their own datasets and actions.
  • Group admins/contacts see data for their configured group(s), in addition to their own data.
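As a rough illustration of the visibility rule, the check below returns whether a viewer would see a given dataset. The function name and parameters are hypothetical and only mirror the two bullets above; how Tapeworm actually maps datasets to groups is not described here.

```python
def can_see(viewer: str, admin_groups: set[str], owner: str, group: str) -> bool:
    """Hypothetical visibility check mirroring the two rules above.

    viewer       -- user name of the person looking at the page
    admin_groups -- groups for which the viewer is admin/contact (may be empty)
    owner        -- user name of the dataset owner
    group        -- group the dataset belongs to (assumed to be known)
    """
    # Everyone sees their own data; group admins also see their group's data.
    return owner == viewer or group in admin_groups
```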

User pages

1) Overview

This is your action page. It shows items that currently need your decision.

For each candidate, you can:

  • Approve: proceed with the tape move. The move is scheduled for the next day
  • Deny: stop this move and create an override for this path. Tapeworm will not try to move this dataset/path again until you remove the override
  • Snooze: postpone the decision to a future date
  • Edit metadata: add key/value notes for archived data. These values are stored on tape together with the data and can be used to find and retrieve datasets from tape

You can also select multiple rows and apply actions in bulk.

2) Datasets

This page shows your discovered datasets, their sizes, and last activity times. Tapeworm cannot know which data actually belongs together, so the groupings shown here are only its best guess at what should be considered a 'dataset'. If a grouping on this page is wrong, you can change how Tapeworm should handle the affected paths.

Important:

  • If a dataset already has an active move candidate, scheduling controls are disabled.
  • The dataset list is informational; move decisions are handled through the Schedule page.

3) Schedule

This page shows move candidates and their status over time.

Common statuses:

  • Planned: under review
  • Scheduled: move is set to run on a specific date
  • Executing / Tape staged / On tape: move is in progress or completed
  • Error: move needs admin attention. We may contact you, or we may resolve it ourselves

Once a move is already executing or completed, schedule-changing actions are locked.
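The statuses and the locking rule can be summarised in a short sketch. The enum and function names are hypothetical; the set of locked statuses follows the description on this page (executing, staged, on tape, error), and treating Planned and Scheduled as still editable is an assumption.

```python
from enum import Enum


class Status(Enum):  # hypothetical names for the statuses listed above
    PLANNED = "Planned"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    TAPE_STAGED = "Tape staged"
    ON_TAPE = "On tape"
    ERROR = "Error"


# Statuses in which schedule-changing actions are no longer offered.
LOCKED = {Status.EXECUTING, Status.TAPE_STAGED, Status.ON_TAPE, Status.ERROR}


def schedule_is_locked(status: Status) -> bool:
    """Return True if approve/deny/snooze would be disabled for this status."""
    return status in LOCKED
```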

4) Overrides

Overrides tell Tapeworm to ignore specific paths in future planning.

Use overrides when:

  • a project is still active and needs to remain on /archive
  • policy suggestions are not appropriate for that location

If you agree that the dataset can in principle be moved to tape, but you don't (yet) know when, you can choose to postpone/snooze the archival instead of overriding it.

Overrides apply to the selected path and everything below it.
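To make "the selected path and everything below it" concrete, here is a minimal sketch of such a prefix match. The function name is hypothetical and this is not Tapeworm's actual matching code; it only illustrates that matching works on whole path components, so an override on /archive/projA does not cover /archive/projA2.

```python
from pathlib import PurePosixPath


def is_overridden(path: str, overrides: list[str]) -> bool:
    """Return True if path equals an override path or lies below one."""
    candidate = PurePosixPath(path)
    for override in overrides:
        root = PurePosixPath(override)
        # Compare whole path components, not raw string prefixes.
        if candidate == root or root in candidate.parents:
            return True
    return False
```

For example, with an override on /archive/projA, the path /archive/projA/run1/data.bin is covered, while /archive/projA2/run1 and /archive itself are not.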

5) History

This page shows completed archival moves. When a dataset has been successfully archived and finalized, it is removed from active scheduling pages and moved into history.

Group pages

Group admins have a separate set of pages for their group scope:

  • Group overview
  • Group datasets
  • Group schedule
  • Group overrides
  • Group history

If you manage more than one group, you can switch group scope in the selector at the top of the group pages.

Notifications (email)

Tapeworm sends email updates when actions are pending, when dates are approaching, or when the status of a move changes.

Emails typically include:

  • Dataset path
  • Size
  • Last activity
  • Current status
  • Review/scheduled date

Notification types you may receive:

  • Action required: please approve, snooze, or deny
  • Reminder: review date is approaching
  • Informational: move status changed (for example scheduled, staging, or completed)
  • Escalation: sent to group contacts when no user response is received

Please read these emails carefully, as they are your chance to adjust decisions before execution.

Best practices for users

  • Check your Overview page regularly
  • Use Snooze if you need time to validate impact
  • Add metadata when approving important datasets
  • Use Overrides for known and persisting exceptions
  • If unsure, contact HPC support before a scheduled move date

FAQ

What happens if I do nothing?

Planned items can move forward automatically after the review window (4 weeks).
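As a worked example of the 4-week window, the sketch below computes the date after which a planned item could move forward. The function names are hypothetical, and whether Tapeworm counts the window from the planning date and treats the last day as inclusive is an assumption.

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(weeks=4)  # review window stated above


def review_deadline(planned_on: date) -> date:
    """Date on which the review window ends (assumed to start at planning)."""
    return planned_on + REVIEW_WINDOW


def review_window_passed(planned_on: date, today: date) -> bool:
    """Return True once the window has ended and the item may move forward."""
    return today >= review_deadline(planned_on)
```

Under these assumptions, an item planned on 19 March 2026 would have a review deadline of 16 April 2026.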

Can I undo after tape staging?

Not directly in Tapeworm. Retrieval is done via the tape/iRODS workflow. See: https://irods.wur.nl/userguide/tape_retrieval/

What does “completed” mean?

Completed means Tapeworm saw the tape workflow finish and finalized the move. Before finalization, the system verifies the archive in iRODS and only then removes the staged source copy.

Why is an action button disabled?

Usually because the move has already progressed (executing/staged/on tape/error), so schedule edits are no longer valid.

Why do some paths on /archive/ not appear as datasets in Tapeworm?

Tapeworm only shows paths that are discovered as datasets and pass a minimum-size threshold. Very small paths, single text files, and other tiny items are intentionally filtered out.

Need help?

If anything is unclear, or you think a move is incorrect but you cannot alter it in the provided GUI, please open an HPC support ticket.