IS&T, as far as I know, does not have an across-the-board standard,
with the possible exception of letting TSM back up everything on a server.
ASST (Server Ops) manages the TSM backup service. The automatic
system they set up on all of their servers nightly backs up every
file that has changed, and marks when files are deleted. They have a
staged system: things are stored first in a 'ready reserve' RAID
array, then in a slower file store, and finally in a tape archive.
Things get transferred to tape after about (I believe) a year, and
the tapes are kept for 7 years, since that is the retention period
required for business records. If we are going to use their service
for our archives, then the archives will be kept for 7 years by
default. I am not certain whether they will hold the tapes for less
time, since separating our files from the files of other systems on
the same tape takes work, and the process is largely automatic. The
Enterprise solution, which will store up to 100 Tbytes, costs $65 a month.
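The nightly pass described above - back up what changed, mark what was
deleted - can be sketched in miniature. This is not TSM's actual
mechanism; the function, paths, and JSON manifest here are all invented
for illustration:

```python
import json
import shutil
from pathlib import Path

def nightly_backup(source: Path, backup_root: Path, state_file: Path) -> dict:
    """Copy files changed since the last run and note deletions.

    Hypothetical sketch of an incremental pass; real TSM tracks far
    more metadata, but the shape of the nightly job is the same.
    """
    # Manifest from the previous run: relative path -> mtime.
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {}
    changed = []

    for path in source.rglob("*"):
        if path.is_file():
            rel = str(path.relative_to(source))
            mtime = path.stat().st_mtime
            current[rel] = mtime
            if previous.get(rel) != mtime:  # new or modified since last run
                dest = backup_root / rel
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, dest)
                changed.append(rel)

    # Anything in the old manifest but no longer on disk was deleted.
    deleted = [rel for rel in previous if rel not in current]

    state_file.write_text(json.dumps(current))
    return {"changed": changed, "deleted": deleted}
```

Run nightly from cron, the manifest makes each pass cheap: unchanged
files are skipped, and deletions are recorded rather than silently lost.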
As for what Stellar does, I have asked Craig, and this is his response:

Quick answer: layers of sometimes complex stuff. You can check with
ops on the details of what they handle.

Database: live replication to standby databases; GQL can explain.
Files: Stellar stores them twice, on separate partitions, just in
case, and a script copies the new files to the backup server.

That's the main approach: trying to be sure nothing gets lost and
that recovery from disaster doesn't require reloading gigabytes.

But ops also runs TSM on the machines, backing up the files and the
Oracle database logs in a way which in principle allows recovery.
Check with them on the details of scheduling etc.
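Craig's two-layer file approach - write twice to separate partitions,
then ship anything new to the backup server - might look roughly like
this. The function names and mount points are mine, not Stellar's:

```python
import shutil
from pathlib import Path

def store_twice(content: bytes, rel_path: str,
                primary: Path, mirror: Path) -> None:
    """Write a file to two separate partitions, just in case.

    Hypothetical sketch of Stellar's dual-write; the real code and
    partition layout will differ.
    """
    for root in (primary, mirror):
        dest = root / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(content)

def copy_new_to_backup(primary: Path, backup: Path) -> list:
    """Copy files not yet present on the backup server's store."""
    copied = []
    for path in primary.rglob("*"):
        if path.is_file():
            rel = path.relative_to(primary)
            dest = backup / rel
            if not dest.exists():  # only new files cross the wire
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, dest)
                copied.append(str(rel))
    return copied
```

The point of the design is that a disaster on one partition never
requires reloading everything: the mirror and the backup copy are
each complete on their own.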
As for what we are currently doing: we build an archive file of all
of the files in the Alfresco repository weekly, put changed files
into an incremental archive file daily, and store those, plus a
transactional dump of the database, on our archive server daily. The
next day, all of those files get stored into TSM for long-term
storage. They are currently held on the local machines for 2 weeks,
and on our archive server for 6 months. I will soon be deploying a
new backup script that, instead of building an archive file, builds a
copy of the directory structure on the archive server, eliminates the
local archive stores, and decreases the retention time on our archive
server to 4 months. Everything would still be stored into TSM the
next day. I am also planning to switch to keeping a 'living' backup
on the recovery servers, to shorten the mean time from the emergency
outage phone call to being back online. Unless we decide that
something different should be done, that is the plan for the next
iteration of updates to the Thalia service readiness configuration.

Technically, all files on the Thalia repository servers are being
stored into TSM on the day they are created/changed, and their
deletions are marked. However, because Alfresco needs the state of
the database to be synchronized with the state of the file store,
these file-store backups are not reliable on their own, and may not
work if used to restore a Thalia repository. If we could synchronize
the database backup with the TSM backup, then we might be able to
avoid most of this mess altogether, but at this point we do not have
that ability.

We'd like to keep the archives for as long as is prudent, from a
systems administration standpoint, and to cover any business needs.

As it stands now, we are able to keep the archives on our
servers for 4 to 6 months, and in TSM for up to 7 years, without
changing the technology we are using. Storing archives for less time
merely saves disk/tape space, which is already purchased or provided for.
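The 4-to-6-month window on our own servers ultimately reduces to
pruning anything older than a cutoff. A minimal sketch - the helper
name and flat archive-directory layout are assumptions, not how our
current scripts are written:

```python
import time
from pathlib import Path

def prune_archives(archive_dir: Path, max_age_days: int) -> list:
    """Delete archive files older than the retention window.

    Hypothetical helper: the policy 'keep archives for N months'
    reduces to this single mtime comparison.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in archive_dir.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Shortening the window just changes `max_age_days`; as noted above, the
only thing saved is space we have already paid for.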
The only downside, which I forgot to include previously, is that the
archives are being stored on one of our systems, not one provided to
us by Server Ops. Wilson had previously stated that this was not
acceptable as a long-term strategy, because we do not have 24/7
access to the colo, but we can get a Server Ops technician over a
weekend for a leased system.

Also, the system in question, al-dente.mit.edu, only has a 2 Terabyte
RAID array in it. While this means that 2 drives would need to fail
before we lose data, it also means that this setup will not see us
through the end of 2009 if we continue to perform complete backups on
a weekly basis and only archive changes during the week (assuming our
current projections of storage usage are accurate). Given that the
amount of storage space provided to us by Server Ops on the Alfresco
repositories is also below the projected 664 Gbytes, some plans will
need to be made in the not-too-distant future.
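The capacity concern is back-of-the-envelope arithmetic. Only the 2
Tbyte array and the 664 Gbyte projection come from the numbers above;
the full-backup and incremental sizes plugged in below are assumed
figures for illustration:

```python
def weeks_of_headroom(capacity_gb: float, weekly_full_gb: float,
                      daily_incr_gb: float) -> float:
    """How many weekly backup cycles fit in the available space.

    Each cycle stores one complete archive plus six daily
    incrementals; retention pruning is ignored, so this is a
    worst-case floor.
    """
    per_cycle = weekly_full_gb + 6 * daily_incr_gb
    return capacity_gb / per_cycle

# With fulls near the 664 Gbyte projection and small daily deltas,
# a 2 Tbyte array holds only about three weekly cycles:
# weeks_of_headroom(2000, 660, 5) -> roughly 2.9
```

Even with generous pruning, fulls at that size exhaust the array in
under a month of cycles, which is why the array will not last through
the end of 2009 at the current pace.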