Core Actions
Attention
All examples on this page assume that the HC_CACHE_STORAGE variable has been set. If you do not know what this means, read the Caching and being nice to web archives section first.
For sampling from a collection or converting it into different forms, Hypercane offers the core actions of:
sample - for creating a sample of a collection
report - for generating a report on collection metadata, named entities, curation behavior, and more
synthesize - for generating output for other tools, like Archives Unleashed Toolkit or Raintale
sample
Hypercane’s sample action allows a user to provide input and sample from it with one of the following algorithms:
true-random - samples k mementos from the input, randomly
filtered-random - removes off-topic mementos, near-duplicates, and then randomly samples k mementos from the remainder
dsa1 - executes an updated version of AlNoamany’s original sampling algorithm, may also be specified using alnoamany
systematic - chooses every jth memento from the input
For example, to randomly sample 5 mementos from Trove collection 8125, type the following:
hc sample true-random -i trove -a 8125 -k 5 -o randomly-sampled.tsv
or to intelligently sample a set of approximately 28 mementos from Archive-It collection 694:
hc sample dsa1 -i archiveit -a 694 -o sampled-with-dsa1.tsv
or to systematically sample every 4th memento from the mementos in mementos.tsv:
hc sample systematic -i mementos -a mementos.tsv -o sampled-systematically.tsv -j 4
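The ideas behind true-random and systematic can be illustrated with a minimal Python sketch. This is not Hypercane’s implementation: the function names and the example URI-Ms are hypothetical, and the sketch assumes that "every jth memento" means the jth, 2jth, 3jth, and so on, counting from 1.

```python
import random


def true_random_sample(mementos, k, seed=None):
    """Pick k distinct items uniformly at random (the idea behind true-random)."""
    rng = random.Random(seed)
    return rng.sample(mementos, k)


def systematic_sample(mementos, j):
    """Keep every jth item.

    Assumption: counting starts at 1, so items j, 2j, 3j, ... are kept.
    """
    if j < 1:
        raise ValueError("j must be a positive integer")
    return mementos[j - 1::j]


# Hypothetical URI-Ms standing in for the contents of mementos.tsv
mementos = [f"https://archive.example/memento/{n}" for n in range(1, 13)]

random_five = true_random_sample(mementos, 5, seed=42)
every_fourth = systematic_sample(mementos, 4)
print(every_fourth)
```

With twelve input items and j set to 4, the systematic sketch keeps the 4th, 8th, and 12th items. A random sample, by contrast, differs from run to run unless a seed is fixed.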
Hypercane’s sample action can also execute the following algorithms on output provided by its cluster action:
stratified-random - chooses j random mementos from each cluster
stratified-systematic - chooses every jth memento from each cluster
random-cluster - randomly chooses j clusters from the input and returns their mementos
random-oversample - randomly chooses mementos from clusters until those clusters are the same size as the largest cluster
random-undersample - randomly chooses mementos from clusters until those clusters are the same size as the smallest cluster
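To make the cluster-based algorithms concrete, here is a rough Python sketch of three of them. It is an illustration under stated assumptions, not Hypercane’s code: the cluster dictionary and function names are hypothetical, oversampling is assumed to pad a cluster by drawing its own members with replacement, and stratified sampling is assumed to return the whole cluster when it has fewer than j members.

```python
import random


def stratified_random(clusters, j, rng):
    """Choose up to j random members from each cluster.

    Assumption: a cluster smaller than j is returned whole.
    """
    return {
        cid: rng.sample(members, min(j, len(members)))
        for cid, members in clusters.items()
    }


def random_undersample(clusters, rng):
    """Shrink every cluster to the size of the smallest one."""
    smallest = min(len(members) for members in clusters.values())
    return {cid: rng.sample(members, smallest) for cid, members in clusters.items()}


def random_oversample(clusters, rng):
    """Grow every cluster to the size of the largest one.

    Assumption: padding is drawn with replacement from the cluster itself.
    """
    largest = max(len(members) for members in clusters.values())
    result = {}
    for cid, members in clusters.items():
        extra = [rng.choice(members) for _ in range(largest - len(members))]
        result[cid] = list(members) + extra
    return result


# Hypothetical cluster assignments: cluster id -> memento identifiers
clusters = {"0": ["a1", "a2", "a3", "a4"], "1": ["b1", "b2"], "2": ["c1"]}
rng = random.Random(7)

stratified = stratified_random(clusters, 2, rng)
undersampled = random_undersample(clusters, rng)
oversampled = random_oversample(clusters, rng)
print({cid: len(members) for cid, members in oversampled.items()})
```

Undersampling discards mementos from large clusters, while oversampling repeats mementos from small ones. Which trade-off is acceptable depends on how the sample will be used.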
Type hc sample --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc sample dsa1 --help.
report
Hypercane can produce reports for use in storytelling and rudimentary collection analysis. The following report styles are available for use with the report action:
metadata - the metadata scraped from an Archive-It collection; output is a JSON file
image-data - provides information about all embedded images discovered in the input and ranks them so Raintale has a striking image for the story; output is a JSON file
seed-statistics - calculates metrics on the original resources discovered in the input, as mentioned in Jones et al. in 2018; output is a JSON file
metadata-statistics - calculates metrics on the metadata discovered in the input, as used across collections by Jones et al. in 2019; output is a JSON file
html-metadata - a report on the metadata available in each memento’s HTML META tag, as applied by Jones et al. in 2021; output is a JSON file
growth - calculates metrics on the collection growth, as described in Jones et al. in 2018; output is a JSON file
terms - provides all terms discovered in the input, including their frequency, document frequency, probability, and corpus-wide TF-IDF; output is a tab-delimited file
entities - provides a list of all entities discovered in the input, including frequency, probability, and corpus-wide TF-IDF; output is a tab-delimited file
For example, to generate a report on the metadata for Archive-It collection 8788 and save it in a file named 8788-metadata.json:
hc report metadata -i archiveit -a 8788 -o 8788-metadata.json
or to do the same for Trove collection 13742:
hc report metadata -i trove -a 13742 -o 13742-metadata.json
or to generate an entities report from a list of mementos:
hc report entities -i mementos -a memento-file.tsv -o entity-report.json
Type hc report --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc report growth --help.
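The per-term statistics named in the terms report (frequency, document frequency, probability, and TF-IDF) can be sketched in a few lines of Python. This is an illustration only: the field names are hypothetical rather than Hypercane’s actual column headers, and the TF-IDF formula used here (corpus frequency times the log of the inverse document frequency) is one common definition that may differ from the one Hypercane applies.

```python
import math
from collections import Counter


def term_statistics(documents):
    """Compute simple corpus-level statistics for each term.

    `documents` is a list of token lists, one list per memento.
    """
    term_freq = Counter()  # occurrences across the whole corpus
    doc_freq = Counter()   # number of documents containing the term
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    total_terms = sum(term_freq.values())
    n_docs = len(documents)

    rows = []
    for term, freq in term_freq.items():
        df = doc_freq[term]
        rows.append({
            "term": term,
            "frequency": freq,
            "document_frequency": df,
            "probability": freq / total_terms,
            # Assumed definition: corpus frequency * log(N / df)
            "tfidf": freq * math.log(n_docs / df),
        })
    # Most frequent first; ties broken alphabetically
    return sorted(rows, key=lambda r: (-r["frequency"], r["term"]))


# Hypothetical tokenized mementos
documents = [
    ["flood", "river", "flood"],
    ["river", "rescue"],
    ["flood", "rescue", "rescue"],
]

rows = term_statistics(documents)
for row in rows:
    print(row["term"], row["frequency"], row["document_frequency"])
```

Document frequency counts each term at most once per memento, which is why a term repeated many times on a single page can have a high frequency but a low document frequency.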
synthesize
Hypercane’s synthesize action converts its input into formats that other tools can consume, such as WARC, JSON, or a set of files in a directory. The synthesize action supports the following output formats:
warcs - (experimental) for generating a directory of WARCs
files - for generating a directory of mementos
bpfree-files - for generating a directory of boilerplate-free mementos
raintale-story - for generating a JSON file suitable as input for Raintale
combine - combine the output from several Hypercane runs together
To synthesize Archive-It collection 694 into a set of WARCs stored in output-directory:
hc synthesize warcs -i archiveit -a 694 -o output-directory
To synthesize Archive-It collection 694 into a set of WARCs stored in output-directory without any embedded images, JavaScript, or stylesheets:
hc synthesize warcs -i archiveit -a 694 -o output-directory --no-download-embedded
To synthesize a Raintale story built from the output of other Hypercane commands:
hc synthesize raintale-story -i mementos -a story-mementos.tsv \
--imagedata imagedata.json --termdata sumgrams.json \
--entitydata entities.json --collection_metadata metadata.json \
--title "Archive-It Collection" -o raintale-story.json
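When a command like this is generated from a script, building the argument list programmatically avoids shell quoting mistakes in values such as the title. The following Python sketch assembles the same command using only the flags shown above. The function name and its parameters are hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def raintale_story_command(mementos_file, imagedata, termdata, entitydata,
                           collection_metadata, title, output_file):
    """Build the argument list for the raintale-story example shown above."""
    return [
        "hc", "synthesize", "raintale-story",
        "-i", "mementos", "-a", mementos_file,
        "--imagedata", imagedata,
        "--termdata", termdata,
        "--entitydata", entitydata,
        "--collection_metadata", collection_metadata,
        "--title", title,
        "-o", output_file,
    ]


argv = raintale_story_command(
    mementos_file="story-mementos.tsv",
    imagedata="imagedata.json",
    termdata="sumgrams.json",
    entitydata="entities.json",
    collection_metadata="metadata.json",
    title="Archive-It Collection",
    output_file="raintale-story.json",
)

# shlex.join quotes arguments that contain spaces, such as the title
print(shlex.join(argv))
```

On a machine where hc is installed, the list could be passed to subprocess.run(argv, check=True) to execute the command without going through a shell.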
Type hc synthesize --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc synthesize warcs --help.
For more examples and a discussion of using synthesize, please read this blog post.