The John F. Kennedy presidential library digital archive that went live online today is the result of a four-year, $10 million project to digitize hundreds of thousands of documents, audio tapes, film, photos and other artifacts collected in the 50 years since JFK's inauguration. And with millions of documents yet to be digitized, the archiving process is likely to continue for decades.
The Digital Archive Project team from the JFK Foundation uses storage from EMC Corp. and an Iron Mountain disaster recovery (DR) center to archive and protect the data. AT&T provides the web hosting, and Raytheon designed and implemented the system.
The foundation's digital archivist Erica Boudreau said the goal was to make documents that had only been available to visitors of the JFK Library in Boston open to anybody with an Internet connection.
"Without the digital archive, you have to be onsite and go into a research room to see the video content and documents on display," she said.
However, archiving and protecting the digital data is a massive undertaking.
The JFK Library is divided into 13 collections. Because of the large amounts of data that needed to be digitized, the library's archivist picked six of the collections to go online for today's launch, including interactive exhibits dealing with the space program, the Cuban Missile Crisis and Civil Rights.
Since 2006, the library has digitized 200,000 documents; 300 reels of audio tape containing more than 1,245 telephone calls, speeches and meetings; 300 museum artifacts; 72 reels of film; and 1,500 photos. All the material is stored at high resolution to preserve fidelity.
Boudreau said the biggest data storage challenge was getting files such as oversized maps and old videos into the system. "These files are very large, and we're digitizing at the highest possible resolution," she said. "We have to capture our metadata, keep it with the file and deliver that content to a website. Oversized documents like maps were difficult to work with because of scaling. One video replication job is working on a 4 TB file share, and could take weeks to replicate."
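To put the replication figure in perspective, the short Python sketch below works out the average throughput implied by copying a 4 TB file share over a span of weeks. The article does not give an exact duration or link speed, so the one-, two- and three-week durations and the use of decimal terabytes are assumptions made purely for illustration.

```python
# Rough estimate of the sustained transfer rate implied by replicating
# a 4 TB file share over a number of weeks.
# Assumptions (not from the article): decimal units (1 TB = 10**12 bytes)
# and the specific week counts tried below.

TB = 10**12  # bytes in one decimal terabyte (assumed unit)


def implied_throughput_mb_per_s(size_bytes: int, days: float) -> float:
    """Average rate in MB/s needed to copy size_bytes in the given number of days."""
    seconds = days * 24 * 60 * 60
    return size_bytes / seconds / 10**6


if __name__ == "__main__":
    for weeks in (1, 2, 3):
        rate = implied_throughput_mb_per_s(4 * TB, weeks * 7)
        print(f"{weeks} week(s): about {rate:.2f} MB/s sustained")
```

Under these assumptions, the job would average only a few megabytes per second, which helps explain why a single large video share can tie up replication for so long.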
The JFK Foundation's IT team uses EMC Documentum and Captiva software to ingest and organize the data, and mirrors the data between EMC Celerra NS-120 network-attached storage (NAS) and Centera archiving systems at the library's primary site in Boston and Iron Mountain's DR site in Boyers, Pa.
The size of the archive is expected to increase to about 117 TB by 2016, although the project team hopes to reduce the capacity it requires by applying Captiva's LZW lossless compression to document files.
Mirrors replace tape for backup
The library's IT specialist Tim Fitzpatrick said he started mirroring between sites to protect the files because tape backups were taking more than a week to complete. "Tape backups were becoming unmanageable," he said. "We established two sets of mirrors between the production environment and the DR location, and we replicate between the two. That protects us against a disaster where data is lost on a production system."
Boudreau said the digital archive project is funded mostly through private donations, and the technology partners have donated equipment and services. But Fitzpatrick said there are still some 48 million pages left to digitize before all of the JFK collections are online, and the library is still collecting material.
"At our current rate," Fitzpatrick said, "it will take over 100 years to get everything digitized."
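Fitzpatrick's estimate can be turned into a rough ceiling on the current digitization rate. The Python sketch below divides the 48 million remaining pages by the 100-year figure from the quote. Because the quote says "over 100 years," the result is an upper bound on the rate, and the 250 working days per year is an assumption that does not come from the article.

```python
# Back-of-the-envelope bound on the digitization rate implied by the quote.
# From the article: about 48 million pages remain, and finishing would take
# "over 100 years" at the current rate.
# Assumption (not from the article): 250 working days per year.

PAGES_REMAINING = 48_000_000
YEARS_LOWER_BOUND = 100
WORKING_DAYS_PER_YEAR = 250  # assumed


def max_pages_per_year(pages: int, years: float) -> float:
    """Upper bound on the yearly rate if the work takes at least `years`."""
    return pages / years


def years_to_finish(pages: int, pages_per_year: float) -> float:
    """Years needed to digitize `pages` at a constant yearly rate."""
    return pages / pages_per_year


if __name__ == "__main__":
    yearly = max_pages_per_year(PAGES_REMAINING, YEARS_LOWER_BOUND)
    daily = yearly / WORKING_DAYS_PER_YEAR
    print(f"At most {yearly:,.0f} pages per year")
    print(f"At most {daily:,.0f} pages per working day (assumed calendar)")
    # Hypothetical speed-ups, to show how the timeline would scale.
    for factor in (2, 5, 10):
        print(f"{factor}x faster: {years_to_finish(PAGES_REMAINING, yearly * factor):.0f} years")
```

Even a tenfold increase over that ceiling would still leave roughly a decade of work, consistent with the expectation that archiving will continue for decades.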