| Title: | Highlight Conserved Edits Across Versions of a Document |
|---|---|
| Description: | Input multiple versions of a source document, and receive HTML code for a highlighted version of the source document indicating the frequency of occurrence of phrases in the different versions. This method is described in Chapter 3 of Rogers (2024) <https://digitalcommons.unl.edu/dissertations/AAI31240449/>. |
| Authors: | Center for Statistics and Applications in Forensic Evidence [aut, cph, fnd], Rachel Rogers [aut, cre] (ORCID: <https://orcid.org/0000-0002-4145-9630>), Susan VanderPlas [aut] (ORCID: <https://orcid.org/0000-0002-3803-0972>) |
| Maintainer: | Rachel Rogers <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.0.1.9000 |
| Built: | 2026-05-11 06:45:25 UTC |
| Source: | https://github.com/rachelesrogers/highlightr |
This function provides the frequency of collocations in comments that correspond to the provided source document.
collocation_frequency(
  tbl,
  source_row,
  text_column,
  collocate_length = 5,
  fuzzy = FALSE,
  n_bands = 50,
  threshold = 0.7,
  n_gram_width = 4,
  band_width = 8
)
| tbl | data frame containing documents, where each row represents a document |
|---|---|
| source_row | row containing text to be treated as source |
| text_column | string indicating the name of the column containing derivative text |
| collocate_length | the length of the collocation. Default is 5 |
| fuzzy | whether or not to use fuzzy matching in collocation calculations |
| n_bands | number of bands used in MinHash algorithm passed to |
| threshold | Jaccard distance threshold to be considered a match passed to |
| n_gram_width | width of n-grams used in Jaccard distance calculation passed to |
| band_width | width of band used in MinHash algorithm passed to |
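The fuzzy-matching arguments refer to Jaccard similarity between sets of n-grams. The base R sketch below illustrates that measure only. It assumes character n-grams, which this page does not confirm, and the helper names `char_ngrams()` and `jaccard_similarity()` are hypothetical and not part of highlightr. It does not implement MinHash banding.

```r
# Hypothetical helper: the set of unique character n-grams of a string.
# Assumption: n-grams are built from characters, not words.
char_ngrams <- function(x, width = 4) {
  n <- nchar(x) - width + 1
  if (n < 1) return(character(0))
  starts <- seq_len(n)
  unique(substring(x, starts, starts + width - 1))
}

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard_similarity <- function(a, b, width = 4) {
  set_a <- char_ngrams(a, width)
  set_b <- char_ngrams(b, width)
  all_grams <- union(set_a, set_b)
  if (length(all_grams) == 0) return(NA_real_)
  length(intersect(set_a, set_b)) / length(all_grams)
}

jaccard_similarity("the blue bird", "the blue birds")
```

Under these assumptions, two phrases would count as a fuzzy match when their similarity reaches the chosen `threshold`.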
Collocations are sequences of words present in the source document. For example, the phrase "the blue bird flies" contains one collocation of length 4 ("the blue bird flies"), two collocations of length 3 ("the blue bird" and "blue bird flies"), and three collocations of length 2 ("the blue", "blue bird", and "bird flies"). This function counts the number of corresponding phrases in the 'notes', or the derivative documents. This count is divided by the number of times the phrase occurs in the source document. When fuzzy matching is included, indirect matches are included with a weight of (n*d)/m, where n is the frequency of the fuzzy collocation, d is the Jaccard similarity between the transcript and note collocation, and m is the number of closest matches for the note collocation.
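The counting rule described above can be sketched in base R. This is a simplified illustration of exact matching and of the fuzzy weight formula, not the package's implementation. The helpers `collocations()`, `relative_frequency()`, and `fuzzy_weight()` are hypothetical names, and the two example notes are made up for the demonstration.

```r
# Hypothetical helper: every run of k consecutive words in a text.
collocations <- function(text, k) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(words) - k + 1
  if (n < 1) return(character(0))
  vapply(seq_len(n),
         function(i) paste(words[i:(i + k - 1)], collapse = " "),
         character(1))
}

# For each source collocation: occurrences across all notes, divided by
# the number of times the phrase occurs in the source itself.
relative_frequency <- function(source, notes, k) {
  src <- collocations(source, k)
  note_colls <- unlist(lapply(notes, collocations, k = k))
  vapply(src, function(p) sum(note_colls == p) / sum(src == p), numeric(1))
}

# Weight of an indirect match: (n * d) / m, with n, d, m as defined above.
fuzzy_weight <- function(n, d, m) (n * d) / m

source_text <- "the blue bird flies"
notes <- c("the blue bird sings", "a blue bird flies")

collocations(source_text, 2)
relative_frequency(source_text, notes, k = 2)
fuzzy_weight(n = 4, d = 0.5, m = 2)
```

In this toy input, "blue bird" appears in both notes and once in the source, so it receives the largest value of the three length-2 collocations.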
a dataframe of the transcript document with collocation values by word
src_row <- which(notepad_example$ID == "source")
merged_frequency <- collocation_frequency(notepad_example, src_row, "Text")
This assigns colors based on frequency to the words in the transcript.
collocation_plot(
  frequency_doc,
  colors = c("#f251fc", "#f8ff1b"),
  values = "Freq",
  order = "word_num",
  text = "words"
)
| frequency_doc | document of frequencies (returned from collocation_frequency()) |
|---|---|
| colors | list for color specification for the gradient. Default is c("#f251fc","#f8ff1b") |
| values | column name of values to use in gradient calculation. Default is "Freq", corresponding to the document returned from collocation_frequency() |
| order | column name corresponding to the word order of the text. Default is "word_num", corresponding to the document returned from collocation_frequency() |
| text | column name corresponding to text to map the gradient to. Default is "words", corresponding to the document returned from collocation_frequency() |
list of plot, plot object, and frequency
# Identify Source Row
src_row <- which(notepad_example$ID == "source")

# Calculate Frequency
merged_frequency <- collocation_frequency(notepad_example, src_row, "Text")

# Create a plot object to assign colors based on frequency
freq_plot <- collocation_plot(merged_frequency)
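To give a feel for what a two-color gradient over frequencies means, the sketch below maps numeric values onto the default colors of collocation_plot() by linear interpolation in base R. It is only an illustration. The helper `gradient_colors()` is hypothetical, the package may compute its colors differently, and the input is assumed to contain no missing values.

```r
# Hypothetical helper: map numeric values onto a two-color gradient.
# Assumption: `values` has no NA entries.
gradient_colors <- function(values, colors = c("#f251fc", "#f8ff1b")) {
  rng <- range(values)
  span <- diff(rng)
  # Rescale to [0, 1]; if all values are equal, use the low end of the scale
  scaled <- if (span == 0) rep(0, length(values)) else (values - rng[1]) / span
  ramp <- grDevices::colorRamp(colors)   # returns RGB rows on a 0-255 scale
  rgb_vals <- ramp(scaled)
  grDevices::rgb(rgb_vals[, 1], rgb_vals[, 2], rgb_vals[, 3],
                 maxColorValue = 255)
}

freq <- c(0, 0.5, 1)
gradient_colors(freq)
```

The lowest value receives the first color and the highest value the second, with intermediate values blended between them.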
Adds html tags to create a highlighted testimony corresponding to word frequency.
To render correctly, the object produced from highlighted_text() can be added outside of a code chunk in an .Rmd document in the `r highlighted_text()` format.
Alternatively, the html output can be saved by using the xml2 package as follows:
xml2::write_html(xml2::read_html(highlighted_text()), "filepath.html")
highlighted_text(plot_object, labels = c("", ""))
| plot_object | plot object resulting from collocation_plot() |
|---|---|
| labels | lower and upper labels for the gradient scale |
html code for highlighted text
# Identify Source Row
src_row <- which(notepad_example$ID == "source")

# Calculate Frequency
merged_frequency <- collocation_frequency(notepad_example, src_row, "Text")

# Create a plot object to assign colors based on frequency
freq_plot <- collocation_plot(merged_frequency)

# Add html tags to create a highlighted version of the source document
page_highlight <- highlighted_text(freq_plot)
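As a rough picture of what "adding html tags" for highlighting can look like, the base R sketch below wraps each word in a span with a background color. The markup that highlighted_text() actually emits is not shown on this page, so the tag structure here is an assumption, and `highlight_words()` is a hypothetical helper. Words are assumed to be HTML-safe already.

```r
# Hypothetical helper: wrap each word in a <span> carrying its color.
# Assumption: `words` and `colors` have the same length.
highlight_words <- function(words, colors) {
  stopifnot(length(words) == length(colors))
  spans <- sprintf('<span style="background-color: %s">%s</span>',
                   colors, words)
  paste(spans, collapse = " ")
}

html <- highlight_words(c("the", "blue", "bird"),
                        c("#F251FC", "#F5A88C", "#F8FF1B"))
cat(html, "\n")
```

A string built this way could be written to a file or placed inline in an .Rmd document in the same manner as the output of highlighted_text().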
Participant comments for the initial description used in the jury perception study
notepad_example
A data frame with 126 rows and 2 columns:
| ID | Participant identifier, as well as source document identifier |
|---|---|
| Text | Participant notes, as well as source transcript |
Jury Perception Study (see Rogers (2024) https://digitalcommons.unl.edu/dissertations/AAI31240449/)
Text corresponding to versions of the Wikipedia article for Highlighter
wiki_pages
A data frame with 300 rows and 1 column:
text of the Wikipedia page for Highlighter
Wikipedia: https://en.wikipedia.org/w/index.php?title=Highlighter&action=history