Advanced Actions
Attention
All examples on this page assume that the HC_CACHE_STORAGE variable has been set. If you do not know what this means, read the Caching and being nice to web archives section first.
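Because every example depends on that variable, a script can check for it before calling Hypercane. The following is a minimal sketch; require_cache_storage is a hypothetical helper name, not part of Hypercane, and it only tests that the variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper (not part of Hypercane): fail early when
# HC_CACHE_STORAGE is unset or empty.
require_cache_storage() {
    if [ -z "${HC_CACHE_STORAGE:-}" ]; then
        echo "HC_CACHE_STORAGE is not set; see 'Caching and being nice to web archives'." >&2
        return 1
    fi
    return 0
}
```

A script could then guard each call, e.g., require_cache_storage && hc identify mementos -i trove -a 13524 -o mementos.tsv.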
To build your own sampling algorithms or explore a collection in different ways, Hypercane also offers the advanced actions of:
- identify - for discovering one Memento object from another
- filter - for filtering the documents from the input based on some criteria
- cluster - for clustering the documents from the input based on an algorithm and features
- score - for scoring the documents from the input based on some scoring function
- order - for ordering the documents in the input based on some feature
identify
The identify action produces one type of Memento Protocol object from another. With identify, a user can submit a list of URI-Ts and get the full list of URI-Ms or URI-Rs discovered in those TimeMaps. They can also submit a list of URI-Ms and produce the corresponding URI-Ts or URI-Rs. Hypercane discovers this information via the Memento Protocol.
For example, to identify the mementos in Trove collection 13524:
hc identify mementos -i trove -a 13524 -o mementos.tsv
or to identify the TimeMaps in Archive-It collection 8788:
hc identify timemaps -i archiveit -a 8788 -o timemap-output-file.tsv
or to identify the original resources based on the mementos in file mementos.tsv:
hc identify original-resources -i mementos -a mementos.tsv -o original-resources.tsv
or to discover the TimeMaps for the mementos in the file mementos.tsv:
hc identify timemaps -i mementos -a mementos.tsv -o timemaps.tsv
The identify action supports the following criteria:
- mementos - discover all mementos in the input; for performance, it only discovers URI-Ms and assumes that the archive reports the mementos accurately
- timemaps - discover all TimeMaps in the input; for performance, it only discovers URI-Ts and assumes that the archive reports the TimeMaps accurately
- original-resources - discover all original resources in the input
The following diagram indicates how Hypercane handles discovering one type of memento object based on another.
Type hc identify --help for more information on all available options. The --help argument can also be supplied to a single option for more information, e.g., hc identify mementos --help.
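When several collections need the same treatment, the identify command shown above can be wrapped in a loop. This is only a sketch: identify_collections is a hypothetical helper, the output file naming is an assumption, and the hc flags are the ones already used on this page.

```shell
# Hypothetical helper (not part of Hypercane): run `hc identify mementos`
# once per Archive-It collection ID given as an argument.
identify_collections() {
    for id in "$@"; do
        # Stop at the first failure so a broken run is not silently ignored.
        hc identify mementos -i archiveit -a "$id" -o "mementos-${id}.tsv" || return 1
    done
}
```

For example, identify_collections 8788 694 would write mementos-8788.tsv and mementos-694.tsv.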
filter
Hypercane’s filter action accepts either include-only or exclude as subactions. These modifiers then apply to the criteria specified. For example, to discover mementos in Archive-It collection 8788 and list only those in English and Spanish, run the following:
hc filter include-only languages --lang en,es -i archiveit -a 8788 -o english-spanish-mementos.txt
or to exclude mementos containing the string cheese from a list of mementos:
hc filter exclude containing-pattern --pattern 'cheese' -i mementos -a mementofile.tsv -o mementos-without-cheese.tsv
The filter action supports the following criteria for include-only:
- on-topic - only list mementos that are on-topic
- non-duplicates - only list mementos that are not duplicates
- containing-pattern - only list mementos whose content contains the given pattern
- languages - only list mementos written in the given languages
- near-datetime - only list mementos between the datetimes provided, needed for SPFT stories
- containing-url-pattern - only list mementos where the URI-R contains the given pattern, needed for FPST stories
- highest-score-per-cluster - only list the highest ranking memento in each cluster; requires that the input contain clusters generated by the cluster action
- largest-clusters - only list mementos from the k largest clusters; requires that the input contain clusters generated by the cluster action
- score - only list mementos that match a given score range; requires that the input contain scoring information generated by the score action
Type hc filter include-only --help for more information on the available options for each criterion. The --help argument can also be supplied to a single option for more information, e.g., hc filter include-only languages --help.
The following criteria exist for exclude:
- languages - exclude mementos with the given languages (specified with --lang)
- off-topic - execute the Off-Topic Memento Toolkit to exclude off-topic mementos
- near-duplicates - employ Simhash to exclude mementos that are near-duplicates
- containing-pattern - exclude mementos that contain the given regular expression pattern
Type hc filter exclude --help for more information on the available options for each criterion. The --help argument can also be supplied to a single option for more information, e.g., hc filter exclude off-topic --help.
cluster
Clustering divides a collection into meaningful sub-collections for further processing, such as scoring and filtering, to produce exemplars. Some clustering functions, like dbscan and kmeans, require that the user specify a feature through the --feature argument.
For example, to cluster the mementos from the file mdt-slices.tsv by the feature TF Simhash using the clustering algorithm DBSCAN, type the following:
hc cluster dbscan -i mementos -a mdt-slices.tsv -o sliced-and-clustered.tsv --feature tf-simhash
or to cluster all of the mementos in Pandora Subject 82 by their original resource URI-R:
hc cluster original-resource -i pandora-subject -a 82 -o 82-original-resources.tsv
or to cluster the mementos in Archive-It collection 694 by Latent Dirichlet Allocation:
hc cluster lda -i archiveit -a 694 -o clustered.tsv
The following criteria exist for cluster:
- time-slice - cluster the collection via AlNoamany’s time slicing algorithm
- dbscan - cluster the collection via the user-supplied feature and DBSCAN
- lda - cluster the collection via Latent Dirichlet Allocation (LDA) topic modeling
- kmeans - cluster the collection via the user-supplied feature and K-means clustering
- domainname - cluster the collection by the domain name of each URI-R
- original-resource - cluster the collection by URI-R
Type hc cluster --help for more information on the available options for each criterion. The --help argument can also be supplied to a single option for more information, e.g., hc cluster lda --help.
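To give a feel for what grouping by domain name means, the following sketch extracts the host from each URI-R and counts how many URI-Rs share it. It is an illustration of the idea only, not Hypercane's implementation; uri_host and count_by_host are hypothetical helpers, and URIs with embedded credentials or without a path separator before the query string are not handled.

```shell
# Hypothetical helper: print the host portion of a URI-R.
uri_host() {
    rest=${1#*://}      # drop the scheme, e.g. "https://"
    rest=${rest%%/*}    # drop the path
    rest=${rest%%:*}    # drop any port
    printf '%s\n' "$rest"
}

# Hypothetical helper: read one URI-R per line from standard input and
# print "<count> <host>" for each distinct host.
count_by_host() {
    while IFS= read -r uri; do
        if [ -n "$uri" ]; then
            uri_host "$uri"
        fi
    done | sort | uniq -c
}
```

Each distinct host in the output corresponds to one group of URI-Rs that share a domain name.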
score
Scoring is also a staple of summarization because it helps rank multiple candidates. Hypercane’s score action allows the user to choose from multiple scoring criteria.
To score the mementos in the file input_mementos.tsv by AlNoamany’s scoring function:
hc score dsa1-scoring -i mementos -a input_mementos.tsv -o scored_mementos.tsv
or to score the mementos in Trove collection 13123 using BM25 with the query cheese:
hc score bm25 -i trove -a 13123 -o scored_mementos.tsv --query "cheese"
or to score the mementos found in the TimeMaps in the file timemaps.tsv by how well those mementos create a card on Facebook and Twitter:
hc score simple-card-score -i timemaps -a timemaps.tsv -o scored_mementos.tsv
The following criteria are available for score:
- dsa1-score - AlNoamany’s scoring function; requires a Memento Damage server
- bm25 - score each memento by how well it matches a given term as per BM25; requires that the user specify a term via --query
- image-count - score each memento by the number of images it contains; this can be useful for image-rich story types
- simple-card-score - score each memento by how well it would be represented as a social card on Facebook and Twitter as applied by Jones et al. in 2021
- path-depth - score each memento by its path depth, as defined by McCown et al. in 2015
- url-category-score - score each memento by the web page categories established by Padia et al.
Type hc score --help for more information on the available options for each criterion. The --help argument can also be supplied to a single option for more information, e.g., hc score bm25 --help.
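As an illustration of the idea behind a path depth score, the sketch below counts the non-empty path segments of a URL. This is one simple reading of "path depth" and an assumption on my part; the definition by McCown et al. and Hypercane's implementation may treat trailing slashes or index pages differently. path_depth is a hypothetical helper.

```shell
# Hypothetical helper: count the non-empty path segments of a URL.
path_depth() {
    rest=${1#*://}                   # drop the scheme
    case $rest in
        */*) path=${rest#*/} ;;      # keep what follows the host
        *)   path= ;;                # no path at all
    esac
    path=${path%%[?#]*}              # drop query string and fragment
    depth=0
    while [ -n "$path" ]; do
        segment=${path%%/*}          # text before the next slash
        if [ -n "$segment" ]; then
            depth=$((depth + 1))
        fi
        case $path in
            */*) path=${path#*/} ;;  # move past that slash
            *)   path= ;;            # last segment consumed
        esac
    done
    printf '%s\n' "$depth"
}
```

Under this reading, a site's home page has depth 0 and each additional directory or file name adds one.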
order
Sorting content is critical for conveying meaning. For storytelling, we often want articles to flow in chronological order.
To order the mementos in scored_mementos.tsv by publication date as extracted by newspaper3k, falling back to memento-datetime when no publication date is available:
hc order pubdate-else-memento-datetime -i mementos -a scored_mementos.tsv -o ordered_mementos.tsv
The following criteria are available for order:
- pubdate-else-memento-datetime - order mementos, if possible, by publication date as extracted by newspaper3k, otherwise fall back to memento-datetime
- memento-datetime - order mementos by memento-datetime
- score - order mementos by a given score, as produced by the score action
Type hc order --help for more information on the available options for each criterion. The --help argument can also be supplied to a single option for more information, e.g., hc order memento-datetime --help.
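The actions on this page are meant to be chained, with each step reading the file the previous step wrote. The following sketch shows one possible chain built only from the criteria and flags listed above. build_story, its arguments, and the intermediate file names are hypothetical, and some criteria may need extra options not covered on this page, so check each command's --help before relying on it.

```shell
# Hypothetical helper (not part of Hypercane) chaining the advanced actions.
# $1 = Archive-It collection ID, $2 = BM25 query, $3 = existing working directory
build_story() {
    collection=$1; query=$2; dir=$3
    # Each step runs only if the previous one succeeded.
    hc identify mementos -i archiveit -a "$collection" -o "$dir/mementos.tsv" &&
    hc filter exclude near-duplicates -i mementos -a "$dir/mementos.tsv" -o "$dir/unique.tsv" &&
    hc cluster time-slice -i mementos -a "$dir/unique.tsv" -o "$dir/clustered.tsv" &&
    hc score bm25 -i mementos -a "$dir/clustered.tsv" -o "$dir/scored.tsv" --query "$query" &&
    hc filter include-only highest-score-per-cluster -i mementos -a "$dir/scored.tsv" -o "$dir/best.tsv" &&
    hc order memento-datetime -i mementos -a "$dir/best.tsv" -o "$dir/story.tsv"
}
```

For example, build_story 8788 "cheese" ./work would leave the ordered result in ./work/story.tsv, with the intermediate files kept for inspection.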