Archival Storage: Difference between revisions
m Haars0011 moved page Tapeworm to Archival Storage: IA migration §6: rename Tapeworm → Archival Storage (leaving redirect) (via move-page on MediaWiki MCP Server)
IA migration §6: expand into Archival Storage: add /archive + iRODS sections, convert <b>→bold, fix backup schedule, frame Tapeworm as a section (via update-page on MediaWiki MCP Server)
Latest revision as of 11:50, 18 June 2026
/archive is Anunna's warm storage tier for data you want to keep but are not actively computing on. Storing data there costs less than on Lustre, and datasets that have not been used for a long time are moved on to long-term tape archive — automatically, through the Tapeworm service — to keep the warm tier free for active work.
The /archive filesystem
/archive is a mount that is only accessible from the login nodes. It is cheaper than Lustre, but it cannot be used for compute work, and it is only available to WUR users.
/archive is backed up to tape on the same schedule as your home directory — restorable from roughly a week of history (see Backup Policy). That backup is separate from the long-term tape archival described below: a backup is a short-term safety copy, while archival moves data off /archive for long-term keeping.
Tapeworm
Tapeworm (https://tapeworm.anunna.wur.nl/) helps you manage data on /archive by identifying datasets that are no longer actively used and preparing them for tape archival. The goal is simple: keep our warm storage available for active work, while safely preserving older data on tape.
With Tapeworm, you can:
- See which of your datasets are being considered for tape archival
- Review planned moves before they happen
- Approve, snooze, or block moves when needed
- Add metadata to describe archived datasets. The metadata is included on tape and can be used to view or retrieve the datasets from tape, should you need to in the future
If you do nothing, Tapeworm will continue with the planned move after the review period. That is why we recommend checking your pending actions regularly. You will also receive notification emails about pending actions.
From time to time, the user pages may temporarily be unavailable during maintenance. In that case, Tapeworm will show a short maintenance page instead of the normal interface.
How Tapeworm works
- Tapeworm scans /archive and builds an index of paths and filesystem metadata.
- A dataset discovery step groups those paths into datasets, determines size, owner, and last active use, and makes them available in the GUI. Very small datasets are filtered out and will not be shown.
- A policy engine checks which discovered datasets look stale (for example: 60+ days old and larger than 1 GB).
- Matching datasets are marked as planned and shown in your overview.
- You will be notified by email that Tapeworm plans to move data you own.
- You can review and change what should happen, or block the move(s) entirely.
- If no action is taken, after a wait period of 4 weeks, planned moves can become scheduled and then executed.
- Data is moved to tape, and removed from /archive.
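The policy check and the review window in the steps above can be sketched in Python. This is an illustration only, not Tapeworm's actual code: the `Dataset` class, the `is_stale` and `earliest_move_date` functions, and the exact thresholds (60 days, 1 GB read as 10^9 bytes, 4 weeks) are assumptions based on the example values mentioned in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed example thresholds, taken from the text; the real policy may differ.
STALE_AFTER = timedelta(days=60)
MIN_SIZE_BYTES = 10**9            # "1 GB", assumed to mean 10^9 bytes
REVIEW_WINDOW = timedelta(weeks=4)


@dataclass
class Dataset:
    """Hypothetical record for one discovered dataset."""
    path: str
    size_bytes: int
    last_active: datetime


def is_stale(ds: Dataset, now: datetime) -> bool:
    """Return True if the dataset is both old enough and large enough."""
    old_enough = (now - ds.last_active) >= STALE_AFTER
    large_enough = ds.size_bytes > MIN_SIZE_BYTES
    return old_enough and large_enough


def earliest_move_date(planned_on: datetime) -> datetime:
    """A planned move can only become scheduled after the review window."""
    return planned_on + REVIEW_WINDOW


now = datetime(2026, 6, 18)
old_big = Dataset("/archive/projA/run1", 5 * 10**9, datetime(2026, 1, 10))
recent = Dataset("/archive/projA/run2", 5 * 10**9, datetime(2026, 6, 1))
small = Dataset("/archive/projA/notes", 10**6, datetime(2025, 1, 1))

# Only the old *and* large dataset becomes a planned move.
planned = [ds.path for ds in (old_big, recent, small) if is_stale(ds, now)]
print(planned)
print(earliest_move_date(now).date())
```

In this sketch a dataset must satisfy both conditions to be planned, so a recently used large dataset and an old but tiny one are both left alone.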
Who sees what?
- Regular users see only their own datasets and actions.
- Group admins/contacts see data for their configured group(s), in addition to their own data.
User pages
1) Overview
This is your action page. It shows items that currently need your decision.

For each candidate, you can:
- Approve: proceed with the tape move. The move is scheduled for the next day
- Deny: stop this move and configure an override for this path. Tapeworm will not try to move this dataset/path again until you remove the override
- Snooze: postpone the decision to a future date
- Edit metadata: add key/value notes for archived data. These values are included on tape and can be used to view/retrieve datasets on tape
You can also select multiple rows and apply actions in bulk.
2) Datasets
This page shows your discovered datasets, their sizes, and last activity times. Tapeworm cannot know which data belongs together, so its grouping of paths into a 'dataset' is a best guess. If the grouping on this page is wrong, you can change how Tapeworm should handle these datasets.

Important:
- If a dataset already has an active move candidate, scheduling controls are disabled.
The dataset list is informational; move decisions are handled through the Schedule page.
3) Schedule
This page shows move candidates and their status over time.

Common statuses:
- Planned: under review
- Scheduled: move is planned for a specific date
- Executing / Tape staged / On tape: move is in progress or completed
- Error: move needs admin attention. We may contact you, or we may resolve it ourselves
Once a move is already executing or completed, schedule-changing actions are locked.
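The locking rule can be pictured as a small status check. The sketch below is hypothetical: the `MoveStatus` enum and `can_change_schedule` function are illustrative names, not part of Tapeworm, and treating Error as locked is an assumption based on the FAQ entry about disabled action buttons.

```python
from enum import Enum


class MoveStatus(Enum):
    """Statuses named on the Schedule page; the enum itself is hypothetical."""
    PLANNED = "Planned"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    TAPE_STAGED = "Tape staged"
    ON_TAPE = "On tape"
    ERROR = "Error"


# Statuses in which the move has not started yet, so the schedule can still change.
EDITABLE = {MoveStatus.PLANNED, MoveStatus.SCHEDULED}


def can_change_schedule(status: MoveStatus) -> bool:
    """Return True while approve/deny/snooze style actions are still allowed."""
    return status in EDITABLE


for status in MoveStatus:
    print(f"{status.value}: editable={can_change_schedule(status)}")
```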
4) Overrides
Overrides tell Tapeworm to ignore specific paths in future planning.

Use overrides when:
- a project is still active and needs to remain on /archive
- policy suggestions are not appropriate for that location
If you agree that the dataset can in principle be moved to tape, but you don't (yet) know when, you can choose to postpone/snooze the archival instead of overriding it.
Overrides apply to the selected path and everything below it.
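To make "the selected path and everything below it" concrete, here is a minimal sketch of such a prefix match. It is not Tapeworm's implementation; the function name `is_overridden` and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def is_overridden(path: str, overrides: list[str]) -> bool:
    """Return True if `path` equals an override path or lies below one."""
    candidate = PurePosixPath(path)
    for override in overrides:
        root = PurePosixPath(override)
        # Compare whole path components, so /archive/projAB is NOT
        # treated as being below /archive/projA.
        if candidate == root or root in candidate.parents:
            return True
    return False


overrides = ["/archive/projA"]
print(is_overridden("/archive/projA", overrides))
print(is_overridden("/archive/projA/run1/data.bin", overrides))
print(is_overridden("/archive/projAB/run1", overrides))
```

Matching on path components rather than on raw string prefixes avoids accidentally covering sibling directories whose names merely start with the same characters.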
5) History
This page shows completed archival moves. When a dataset has been successfully archived and finalized, it is removed from active scheduling pages and moved into history.

Group pages
Group admins have a separate set of pages for their group scope:
- Group overview
- Group datasets
- Group schedule
- Group overrides
- Group history
If you manage more than one group, you can switch group scope in the selector at the top of the group pages.
Notifications (email)
Tapeworm sends email updates when actions are pending, dates are approaching, or the state of a move changes.
Emails typically include:
- Dataset path
- Size
- Last activity
- Current status
- Review/scheduled date
Notification types you may receive:
- Action required: please approve, snooze, or deny
- Reminder: review date is approaching
- Informational: move status changed (for example scheduled, staging, or completed)
- Escalation: sent to group contacts when no user response is received
Please read these emails carefully — they are your chance to adjust decisions before execution.
Best practices for users
- Check your Overview page regularly
- Use Snooze if you need time to validate impact
- Add metadata when approving important datasets
- Use Overrides for known and persisting exceptions
- If unsure, contact HPC support before a scheduled move date
Tapeworm FAQ
What happens if I do nothing?
Planned items can move forward automatically after the review window (4 weeks).
Can I undo after tape staging?
Not directly in Tapeworm. Retrieval is done via the tape/iRODS workflow. See: https://irods.wur.nl/userguide/tape_retrieval/
What does "completed" mean?
Completed means Tapeworm saw the tape workflow finish and finalized the move. Before finalization, the system verifies the archive in iRODS and only then removes the staged source copy.
Why is an action button disabled?
Usually because the move has already progressed (executing/staged/on tape/error), so schedule edits are no longer valid.
Why do some paths on /archive/ not appear as datasets in Tapeworm?
Tapeworm only shows paths that are discovered as datasets and pass a minimum-size threshold. Very small paths, single text files, and other tiny items are intentionally filtered out.
Need help?
If anything is unclear, or you think a move is incorrect but you cannot alter it in the provided GUI, please open an HPC support ticket.
Manual tape access with iRODS
Anunna hosts its own iRODS instance, with which you can push data to the WUR tape storage for archiving at very low cost. For general usage, see https://irods.wur.nl/. The best approach is to loosely follow the course on that site using your own data, and to use your personal space for data upload and transfer to tape.
Be sure to check whether the data is correctly stored on tape before you remove your data.
On Anunna there are some differences and additions to the linked site:
- The zone is HPC.
- With iinit you can initialise the iRODS environment. Use your account password.
- With ils you can see your available iRODS collections. You need that as a destination location for itape.
- We have a function to ease uploads (use -h for help): itape.
- We have aliases to ease checking the status of your archive process (it takes a while): itapestat and itapestatnp. The first is for human use: it shows a paginated status of all your files. The latter dumps all the info, so you can e.g. use grep to filter.
- If you remove data with irm within iRODS, the data isn't actually removed but moved to a trash bin. The advantage is that you can retrieve it if the removal was in error; the disadvantage is that the data will keep costing money. To empty it, see irmtrash -h.
Because of hardware limitations on the backend tape storage, the size limit per file for our tape archive is 5 TB.
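Before pushing a directory to tape, it can help to check for files above this limit. The following is a minimal sketch, not an official Anunna tool: the function name `find_oversized_files` is made up, and reading "5 TB" as 5 × 10^12 bytes is an assumption (the limit might be meant in binary units).

```python
import os

# Assumption: "5 TB" is interpreted as decimal terabytes.
TAPE_FILE_LIMIT_BYTES = 5 * 10**12


def find_oversized_files(root: str, limit: int = TAPE_FILE_LIMIT_BYTES) -> list[tuple[str, int]]:
    """Walk `root` and return (path, size) for every regular file larger than `limit`."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks; only real files count
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or is unreadable; ignore it here
            if size > limit:
                oversized.append((full, size))
    return sorted(oversized)
```

Calling `find_oversized_files` on your own directory before an upload returns an empty list when every file fits; any file it reports would need to be split or repackaged first.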