Commit 643ddc52 authored by anarcat

trac archive documentation

parent eeb20129
......@@ -582,6 +582,49 @@ have to adjust our workflows to work around this. In some cases, we can use
gitlab milestone pages or projects that do not need a wiki page as a work
around.
### Trac Archival
A copy of all Trac web pages was stored in the [Internet
Archive](http://archive.org/)'s [Wayback Machine](http://web.archive.org/), thanks to [ArchiveBot](https://www.archiveteam.org/index.php?title=ArchiveBot), a tool
developed by [ArchiveTeam](https://www.archiveteam.org/), of which anarcat is somewhat a part.
First, a list of tickets was created:

    seq 1 40000 | sed 's#^#https://trac.torproject.org/projects/tor/ticket/#'

This list was uploaded to anarcat's pastebin (using [pubpaste](https://gitlab.com/anarcat/pubpaste)) and fed
into ArchiveBot with:

    !ao < https://paste.anarc.at/publish/2020-06-17/trac.torproject.org-tickets-1-40000-final.txt
    !ao https://paste.anarc.at/publish/2020-06-17/trac.torproject.org-tickets-1-40000-final.txt

This tells ArchiveBot to crawl each ticket individually, and then
archive the list itself as well.
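
Before uploading, the ticket list can be written to a file and sanity-checked. The following is a minimal sketch built on the same `seq | sed` pipeline. The `ticket_urls` function name and the `tickets.txt` file name are assumptions made for illustration, not part of the original procedure.

```shell
# Hypothetical helper: print one Trac ticket URL per line for the
# inclusive range given as $1..$2 (same pipeline as the one-liner above).
ticket_urls() {
    seq "$1" "$2" | sed 's#^#https://trac.torproject.org/projects/tor/ticket/#'
}

# Write the full list to a local file (file name is an assumption).
ticket_urls 1 40000 > tickets.txt

# Quick checks: number of lines, then the first and last URL.
wc -l < tickets.txt
head -n 1 tickets.txt
tail -n 1 tickets.txt
```

The line count should match the size of the range, and the first and last lines should end in `/ticket/1` and `/ticket/40000` respectively.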
Simultaneously, a full crawl of the entire site (and first level
outgoing links) was started, with:

    !a --explain "Trac migrated to GitLab, readonly" https://trac.torproject.org/

A list of excludes was added to ignore traps and infinite loops. The
crawl was slowed down with a 500-1000ms delay to avoid hammering the server.
(TODO: add the actual exclude lists and commands.)
The results will be accessible in the Wayback Machine a few days after
the crawl. Another crawl was performed back in 2019, so the known full
archives of Trac are as follows:
* [June 2019 ticket crawl](https://archive.fart.website/archivebot/viewer/job/5vytc): 6h30, 29892 files, 1.9 GiB
* [June 2019 full crawl](https://archive.fart.website/archivebot/viewer/job/bpu6j): 5 days, 7h30, 732488 files, 105.4 GiB
* [June 2020 ticket crawl](https://archive.fart.website/archivebot/viewer/job/c4xu3): 4h30, 33582 files, 1.9 GiB
* [June 2020 full crawl]() (TBD, still processing, should appear [in
the viewer shortly](https://archive.fart.website/archivebot/viewer/?q=trac.torproject.org))
This information can be recovered from the `*-meta.warc.gz`
(text) files at the above URLs. This was done as part of [ticket
40003](https://gitlab.torproject.org/tpo/tpa/services/-/issues/40003).
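
Since the `*-meta.warc.gz` files are gzip-compressed text, they can be inspected with standard tools once downloaded. The following is a minimal sketch, assuming a meta WARC has already been fetched locally. The `count_in_meta_warc` function name and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: count the lines of a gzip-compressed meta WARC
# ($1) that contain a fixed string ($2).
# grep -a treats the input as text even if it contains binary bytes,
# and -F matches the string literally rather than as a regex.
count_in_meta_warc() {
    gzip -dc -- "$1" | grep -a -c -F -- "$2"
}
```

For example, `count_in_meta_warc some-job-meta.warc.gz 'ticket/'` would report how many lines of that file mention a ticket URL. Piping `gzip -dc` into `less` instead gives a way to read the file directly.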
### History
* lost in the mists of time: migration from Bugzilla to Flyspray (40
......@@ -607,6 +650,10 @@ around.
* 2020-06-14 21:22UTC: Trac wiki migrated
* 2020-06-15 18:30UTC: bugs.torproject.org redirects to gitlab
* 2020-06-16 02:15UTC: GitLab launch announced to tor-internal
* 2020-06-17 12:33UTC: ArchiveBot starts crawling all Trac tickets and
the entire Trac website
* 2020-06-23: ArchiveBot completes the full Trac crawl, Trac is fully
archived on the Internet Archive
## Design
<!-- how this is built -->