Core Actions
Attention
All examples on this page assume that the HC_CACHE_STORAGE variable has been set. If you do not know what this means, read the Caching and being nice to web archives section first.
For sampling from a collection or converting it into different forms, Hypercane offers the core actions of:
sample
- for creating a sample of a collection
report
- for generating a report on collection metadata, named entities, curation behavior, and more
synthesize
- for generating output for other tools, like Archives Unleashed Toolkit or Raintale
sample
Hypercane’s sample action allows a user to provide input and sample from it using one of the following algorithms:
true-random
- samples k mementos from the input, randomly
filtered-random
- removes off-topic mementos, near-duplicates, and then randomly samples k mementos from the remainder
dsa1
- executes an updated version of AlNoamany’s original sampling algorithm; may also be specified using alnoamany
systematic
- chooses every jth memento from the input
For example, to randomly sample 5 mementos from Trove collection 8125, type the following:
hc sample true-random -i trove -a 8125 -k 5 -o randomly-sampled.tsv
or to intelligently sample a set of approximately 28 mementos from Archive-It collection 694:
hc sample dsa1 -i archiveit -a 694 -o sampled-with-dsa1.tsv
or to systematically sample every 4th memento from the mementos in mementos.tsv:
hc sample systematic -i mementos -a mementos.tsv -o sampled-systematically.tsv -j 4
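To make the difference between true-random and systematic concrete, here is a minimal Python sketch of the two ideas. It is not Hypercane's implementation: the function names, the example URIs, and the choice to count "every jth" starting from the jth item are assumptions made for illustration.

```python
import random


def true_random_sample(mementos, k, seed=None):
    """Return k mementos chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    # Guard against asking for more items than the input holds.
    return rng.sample(mementos, min(k, len(mementos)))


def systematic_sample(mementos, j):
    """Return every jth memento, counting from 1 (the jth, 2jth, ...)."""
    if j < 1:
        raise ValueError("j must be a positive integer")
    # Assumption: counting starts at the jth item, not the first.
    return mementos[j - 1::j]


# Hypothetical URI-Ms used only for illustration.
mementos = [f"https://archive.example/memento/{i}" for i in range(1, 13)]

random_pick = true_random_sample(mementos, k=5, seed=42)
every_fourth = systematic_sample(mementos, j=4)
```

A random sample changes from run to run unless a seed is fixed, while a systematic sample depends only on the order of the input.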
Hypercane’s sample action can also execute the following algorithms on output provided by its cluster action:
stratified-random
- chooses j random mementos from each cluster
stratified-systematic
- chooses every jth memento from each cluster
random-cluster
- randomly chooses j clusters from the input and returns their mementos
random-oversample
- randomly chooses mementos from clusters until those clusters are the same size as the largest cluster
random-undersample
- randomly chooses mementos from clusters until those clusters are the same size as the smallest cluster
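The cluster-based algorithms can be pictured with a short Python sketch. It illustrates the general techniques and is not Hypercane's code. It assumes clusters are held as a dictionary from a cluster ID to a non-empty list of mementos, and that oversampling repeats existing mementos (sampling with replacement). All names and data are hypothetical.

```python
import random


def stratified_random(clusters, j, rng):
    """Choose up to j random mementos from each cluster."""
    return {cid: rng.sample(members, min(j, len(members)))
            for cid, members in clusters.items()}


def random_undersample(clusters, rng):
    """Randomly keep mementos so every cluster matches the smallest one."""
    target = min(len(members) for members in clusters.values())
    return {cid: rng.sample(members, target)
            for cid, members in clusters.items()}


def random_oversample(clusters, rng):
    """Randomly repeat mementos so every cluster matches the largest one."""
    target = max(len(members) for members in clusters.values())
    # Assumption: extra items are drawn with replacement from the same cluster.
    return {cid: members + rng.choices(members, k=target - len(members))
            for cid, members in clusters.items()}


# Hypothetical clusters used only for illustration.
clusters = {
    "0": ["a1", "a2", "a3", "a4"],
    "1": ["b1", "b2"],
    "2": ["c1", "c2", "c3"],
}
rng = random.Random(7)

strat = stratified_random(clusters, 2, rng)
under = random_undersample(clusters, rng)
over = random_oversample(clusters, rng)
```

Undersampling discards mementos from larger clusters, whereas oversampling in this sketch introduces duplicates into smaller ones.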
Type hc sample --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc sample dsa1 --help.
report
Hypercane can produce reports for use in storytelling and rudimentary collection analysis. The following report styles are available for use with the report action:
metadata
- the metadata scraped from an Archive-It collection; output is a JSON file
image-data
- provides information about all embedded images discovered in the input and ranks them so Raintale has a striking image for the story; output is a JSON file
seed-statistics
- calculates metrics on the original resources discovered in the input, as mentioned in Jones et al. in 2018; output is a JSON file
metadata-statistics
- calculates metrics on the metadata discovered in the input, as used across collections by Jones et al. in 2019; output is a JSON file
html-metadata
- a report on the metadata available in each memento’s HTML META tag, as applied by Jones et al. in 2021; output is a JSON file
growth
- calculates metrics on the collection growth, as described in Jones et al. in 2018; output is a JSON file
terms
- provides all terms discovered in the input, including their frequency, document frequency, probability, and corpus-wide TF-IDF; output is a tab-delimited file
entities
- provides a list of all entities discovered in the input, including frequency, probability, and corpus-wide TF-IDF; output is a tab-delimited file
For example, to generate a report on the metadata for Archive-It collection 8788 and save it in a file named 8788-metadata.json:
hc report metadata -i archiveit -a 8788 -o 8788-metadata.json
or to do the same for Trove collection 13742:
hc report metadata -i trove -a 13742 -o 13742-metadata.json
or to generate an entity report from a list of mementos:
hc report entities -i mementos -a memento-file.tsv -o entity-report.json
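The measures named for the terms and entities reports (frequency, document frequency, probability, and corpus-wide TF-IDF) can be sketched as follows. The definitions here are assumptions, not taken from Hypercane: frequency is the total count across all documents, probability is that count divided by the total number of tokens, and TF-IDF is the corpus frequency multiplied by log(N / document frequency). Hypercane's tokenization and exact formulas may differ.

```python
import math
from collections import Counter


def term_report(documents):
    """Compute per-term corpus metrics from a list of tokenized documents."""
    frequency = Counter()           # total occurrences across the corpus
    document_frequency = Counter()  # number of documents containing the term
    for tokens in documents:
        frequency.update(tokens)
        document_frequency.update(set(tokens))

    total_terms = sum(frequency.values())
    n_docs = len(documents)
    rows = []
    for term, freq in frequency.most_common():
        df = document_frequency[term]
        rows.append({
            "term": term,
            "frequency": freq,
            "document_frequency": df,
            "probability": freq / total_terms,
            # Assumed weighting: corpus frequency times log(N / df).
            "tfidf": freq * math.log(n_docs / df),
        })
    return rows


# Hypothetical, already-tokenized documents used only for illustration.
docs = [
    "web archive collection".split(),
    "web archive memento".split(),
    "memento sample".split(),
]
report = {row["term"]: row for row in term_report(docs)}
```

Under these assumed definitions, a term that appears in every document receives a TF-IDF of zero, which is why very common words fall to the bottom of such a ranking.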
Type hc report --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc report growth --help.
synthesize
Hypercane’s synthesize action generates output for other tools in formats such as WARC, JSON, or a set of files in a directory. The synthesize action supports the following output formats:
warcs
- (experimental) for generating a directory of WARCs
files
- for generating a directory of mementos
bpfree-files
- for generating a directory of boilerplate-free mementos
raintale-story
- for generating a JSON file suitable as input for Raintale
combine
- combine the output from several Hypercane runs together
To synthesize Archive-It collection 694 into a set of WARCs stored in output-directory:
hc synthesize warcs -i archiveit -a 694 -o output-directory
To synthesize Archive-It collection 694 into a set of WARCs stored in output-directory without any embedded images, JavaScript, or stylesheets:
hc synthesize warcs -i archiveit -a 694 -o output-directory --no-download-embedded
To synthesize a Raintale story built from the output of other Hypercane commands:
hc synthesize raintale-story -i mementos -a story-mementos.tsv \
--imagedata imagedata.json --termdata sumgrams.json \
--entitydata entities.json --collection_metadata metadata.json \
--title "Archive-It Collection" -o raintale-story.json
Type hc synthesize --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc synthesize warcs --help.
For more examples and a discussion of using synthesize, please read this blog post.
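The commands on this page share the shape hc ACTION STYLE -i INPUT_TYPE -a INPUT_ARGUMENT -o OUTPUT, so they are easy to script. Below is a hedged Python sketch that assembles such commands and, by default, only prints them. The helper names are hypothetical, and the sketch assumes that the TSV written by sample can be passed back in with -i mementos, as the file names used on this page suggest. It uses only flags that appear above.

```python
import os
import shlex
import subprocess


def build_hc_command(action, style, input_type, input_arg, output, extra=()):
    """Assemble an hc argument list in the form used on this page."""
    return ["hc", action, style, "-i", input_type, "-a", str(input_arg),
            "-o", output, *extra]


def run_hc(command, dry_run=True):
    """Run an hc command, or only return its shell form when dry_run is set."""
    if dry_run:
        return shlex.join(command)
    # The examples on this page assume HC_CACHE_STORAGE has been set.
    if "HC_CACHE_STORAGE" not in os.environ:
        raise RuntimeError("HC_CACHE_STORAGE must be set before running hc")
    subprocess.run(command, check=True)
    return shlex.join(command)


steps = [
    build_hc_command("sample", "dsa1", "archiveit", 694,
                     "sampled-with-dsa1.tsv"),
    build_hc_command("report", "entities", "mementos",
                     "sampled-with-dsa1.tsv", "entity-report.json"),
    build_hc_command("synthesize", "warcs", "mementos",
                     "sampled-with-dsa1.tsv", "output-directory",
                     extra=["--no-download-embedded"]),
]

# Dry run: collect the commands as shell strings without executing them.
printed = [run_hc(step) for step in steps]
for line in printed:
    print(line)
```

Passing dry_run=False executes each step with subprocess.run and stops at the first failure, since check=True raises an exception on a non-zero exit status.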