API References

Base.merge! (Method)
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
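A minimal usage sketch (corpus contents are illustrative; note that dtm2 may be mutated):

julia> crps1 = Corpus([StringDocument("To be or not to be")]);
       crps2 = Corpus([StringDocument("To become or not to become")]);
       update_lexicon!(crps1); update_lexicon!(crps2);
       dtm1 = DocumentTermMatrix(crps1);
       dtm2 = DocumentTermMatrix(crps2);

julia> merge!(dtm1, dtm2)   # dtm1 now covers both documents, with terms re-sorted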

TextAnalysis.columnindices (Method)
columnindices(terms::Vector{String})

Creates a column index lookup dictionary from a vector of terms.
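A small sketch of the expected result, assuming each term maps to its position in the input vector:

julia> TextAnalysis.columnindices(["To", "be", "or"])
Dict{String,Int64} with 3 entries:
  "or" => 3
  "be" => 2
  "To" => 1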

TextAnalysis.coo_matrix (Method)
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix of size n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions.

Example

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999

TextAnalysis.coom (Method)
coom(c::CooMatrix)

Access the co-occurrence matrix field coom of a CooMatrix c.

TextAnalysis.coom (Method)
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.
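A usage sketch (assuming a Corpus entity whose lexicon has been updated; the window value is illustrative):

julia> crps = Corpus([StringDocument("this is a text about an apple")]);

julia> update_lexicon!(crps);

julia> coom(crps, Float16; window=3)   # builds the CooMatrix, then returns its matrix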

TextAnalysis.counter2 (Method)

counter2 builds the conditional distribution that the score functions use to calculate the conditional frequency distribution.

TextAnalysis.dtm (Method)
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)

Creates a simple sparse matrix representation from a DocumentTermMatrix object.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Array{Int64,2}:
 1  2  0  1  1  1
 1  0  2  1  1  1

TextAnalysis.dtv (Method)
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Individual documents do not have a lexicon associated with them, so we have to pass in a lexicon as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
 1  2  0  1  1  1

TextAnalysis.everygram (Method)
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible ngrams generated from a sequence of items, as an Array{String,1}.

Example

julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
 10-element Array{Any,1}:
  "or"
  "not"
  "To"
  "be"
  "To be"
  "or not"
  "be or"
  "be or not"
  "To be or"
  "To be or not"

TextAnalysis.extend! (Method)
extend!(model::NaiveBayesClassifier, dictElement)

Add dictElement to the dictionary of the Classifier model.
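A usage sketch (class labels and word are illustrative):

julia> m = NaiveBayesClassifier([:spam, :non_spam]);

julia> extend!(m, "offer")   # "offer" is now part of the model's dictionary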

TextAnalysis.features (Method)
features(::AbstractDict, dict)

Compute an Array of feature counts: for each element of dict, the count stored for it in the input AbstractDict (0 when absent).
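A sketch of the assumed behavior (the names counts and dict are illustrative):

julia> counts = Dict("spam" => 2, "this" => 1);

julia> dict = ["this", "is", "spam"];

julia> [get(counts, word, 0) for word in dict]   # equivalent to features(counts, dict)
3-element Array{Int64,1}:
 1
 0
 2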

TextAnalysis.fit! (Method)
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.
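A usage sketch (labels and text are illustrative; see also the NaiveBayesClassifier example further below):

julia> m = NaiveBayesClassifier([:spam, :non_spam]);

julia> fit!(m, "this is spam", :spam);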

TextAnalysis.fmeasure_lcs (Function)
fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on WLCS.

Arguments

  • RLCS - Recall Factor
  • PLCS - Precision Factor
  • β - Weighting parameter (β > 1 favors recall)
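For reference, a sketch of the usual ROUGE-style combination of these factors (an assumed definition, not taken from the source):

fmeasure(R_lcs, P_lcs, β) = (1 + β^2) * R_lcs * P_lcs / (R_lcs + β^2 * P_lcs)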

TextAnalysis.frequent_terms (Function)
frequent_terms(crps, alpha=0.95)

Find the frequent terms from a Corpus, occurring in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Array{String,1}:
 "is"
 "This"
 "Document"

See also: remove_frequent_terms!, sparse_terms

TextAnalysis.hash_dtm (Method)
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)

Represents a Corpus as a matrix of hashed term counts with N columns, where N is the cardinality of the hash function.
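A usage sketch (the corpus and cardinality are illustrative):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> h = TextHashFunction(10);

julia> hash_dtm(crps, h)   # a 2×10 matrix of hashed term counts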

TextAnalysis.hash_dtv (Method)
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)

Represents a document as a vector with N entries.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
 0  2  0  0  1  3  0  0  0  0

julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

TextAnalysis.index_hash (Method)
index_hash(str, TextHashFunc)

Maps a string to an integer index using the hash function.

Parameters:

  • str - the string to be hashed
  • TextHashFunc - TextHashFunction type object

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7

TextAnalysis.inverse_index (Method)
inverse_index(crps::Corpus)

Shows the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.
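A usage sketch (assuming the inverse index has been built with update_inverse_index!):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> update_inverse_index!(crps);

julia> inverse_index(crps)["not"]   # indices of the documents containing "not"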

TextAnalysis.language! (Method)
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")

julia> language!(d, Languages.Spanish())

julia> d.metadata.language
Languages.Spanish()

See also: language, languages, languages!

TextAnalysis.languages! (Method)
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)

Update the languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In this case, the number of documents must equal the length of the vector.
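A usage sketch (document texts are illustrative):

julia> crps = Corpus([StringDocument("Hello"), StringDocument("Hola")]);

julia> languages!(crps, [Languages.English(), Languages.Spanish()])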

See also: languages, language!, language

TextAnalysis.lda (Method)
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64)

Perform Latent Dirichlet allocation.

Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Return values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
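A usage sketch (topic count, iteration count, and hyperparameter values are illustrative):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> update_lexicon!(crps);

julia> ϕ, θ = lda(DocumentTermMatrix(crps), 2, 1000, 0.1, 0.1);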

TextAnalysis.lexical_frequency (Method)
lexical_frequency(crps::Corpus, term::AbstractString)

Tells us how often a term occurs across all of the documents.
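A usage sketch (the corpus is illustrative; updating the lexicon first is assumed to be required):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> update_lexicon!(crps);

julia> lexical_frequency(crps, "to")   # relative frequency of "to" across the corpus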

TextAnalysis.lexicon (Method)
lexicon(crps::Corpus)

Shows the lexicon of the corpus.

Lexicon of a corpus consists of all the terms that occur in any document in the corpus.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
                          StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lexicon(crps)
Dict{String,Int64} with 0 entries

TextAnalysis.lookup (Method)

Look up a sequence of words in the vocabulary.

Returns an Array of String.
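A usage sketch, consistent with the Vocabulary example further below ("b" falls below the cutoff and maps to the unknown label):

julia> vocabulary = Vocabulary(["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"], 2);

julia> lookup(vocabulary, ["a", "b", "d"])
3-element Array{Any,1}:
 "a"
 "<unk>"
 "d"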

TextAnalysis.lsa (Method)
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.
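A usage sketch (the corpus is illustrative; the result is assumed to be an SVD factorization of the tf-idf weighted matrix):

julia> crps = Corpus([StringDocument("this is a text"),
                      StringDocument("this is another text")]);

julia> update_lexicon!(crps);

julia> lsa(crps)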

TextAnalysis.ngramize (Method)
ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
  "be or not" => 1
  "or not to" => 1
  "To be or"  => 1

TextAnalysis.ngramizenew (Method)
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}

ngramizenew is used to output ngrams of the given orders from a sequence of words.

Example

julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq, 2)
 7-element Array{Any,1}:
  "To be" 
  "be or" 
  "or not"
  "not To"
  "To not"
  "not To"
  "To not"

TextAnalysis.ngrams (Method)
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)

Access the document text as n-gram counts.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> ngrams(sd)
 Dict{String,Int64} with 7 entries:
  "or"   => 1
  "not"  => 1
  "to"   => 1
  "To"   => 1
  "be"   => 1
  "be.." => 1
  "."    => 1

TextAnalysis.onegramize (Method)
onegramize(lang, tokens)

Create the unigrams dict for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2

TextAnalysis.padding_ngram (Method)
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

padding_ngram is used to pad a sentence on the left and/or right and output ngrams of order n.

It also pads the original input Array of strings.

Example

julia> example = ["1","2","3","4","5"]

julia> padding_ngram(example, 2, pad_left=true, pad_right=true)
 6-element Array{Any,1}:
  "<s> 1" 
  "1 2"   
  "2 3"   
  "3 4"   
  "4 5"   
  "5 </s>"

TextAnalysis.predict (Method)
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

TextAnalysis.prepare! (Method)
prepare!(doc, flags)
prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

  • strip_patterns
  • strip_corrupt_utf8
  • strip_case
  • stem_words
  • tag_part_of_speech
  • strip_whitespace
  • strip_punctuation
  • strip_numbers
  • strip_non_letters
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_articles
  • strip_prepositions
  • strip_pronouns
  • strip_stopwords
  • strip_sparse_terms
  • strip_frequent_terms
  • strip_html_tags

Example

julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is   document of "

TextAnalysis.prob (Function)

Get the probability of a word given its context.

In other words, for the given context, calculate the conditional frequency distribution of the word.

TextAnalysis.prune! (Method)
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
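A usage sketch (document contents are illustrative):

julia> crps = Corpus([StringDocument("one two"), StringDocument("two three")]);

julia> update_lexicon!(crps);

julia> m = DocumentTermMatrix(crps);

julia> prune!(m, [1])   # drop document 1; terms occurring only there are compacted away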

TextAnalysis.remove_case! (Method)
remove_case!(doc)
remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.

Example

julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"

See also: remove_case

TextAnalysis.remove_frequent_terms! (Function)
remove_frequent_terms!(crps, alpha=0.95)

Remove terms in crps occurring in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
"     1"
julia> text(crps[2])
"     2"

See also: remove_sparse_terms!, frequent_terms

TextAnalysis.remove_html_tags! (Method)
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)

Remove HTML tags from the StringDocument or the documents in crps. Does not work for document types other than StringDocument.

Example

julia> html_doc = StringDocument(
             "
               <html>
                   <head><script language=\"javascript\">x = 20;</script></head>
                   <body>
                       <h1>Hello</h1><a href=\"world\">world</a>
                   </body>
               </html>
             "
            )
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet:  <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"

See also: remove_html_tags

TextAnalysis.remove_patterns! (Method)
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in a document or Corpus. Does not modify FileDocument or a Corpus containing FileDocument.

See also: remove_patterns
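A usage sketch (the regex and text are illustrative):

julia> sd = StringDocument("Foo 123 Bar 456");

julia> remove_patterns!(sd, r"\d+");

julia> text(sd)   # the digit runs are removed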

TextAnalysis.remove_sparse_terms! (Function)
remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms in crps, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "

See also: remove_frequent_terms!, sparse_terms

TextAnalysis.remove_whitespace! (Method)
remove_whitespace!(doc)
remove_whitespace!(crps)

Squash multiple whitespaces to a single space and remove all leading and trailing whitespace in a document or crps. A no-op for FileDocument, TokenDocument, and NGramDocument.

See also: remove_whitespace
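A minimal sketch of the documented behavior:

julia> sd = StringDocument("  Hello   world  ");

julia> remove_whitespace!(sd);

julia> text(sd)
"Hello world"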

TextAnalysis.remove_words! (Method)
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.

Example

julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"

TextAnalysis.score (Function)
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given its context in an InterpolatedLanguageModel.

Applies Kneser-Ney or Witten-Bell smoothing depending on the subtype.

TextAnalysis.score (Function)
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given its context in an MLE model.
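A usage sketch (vocabulary and training tokens are illustrative; the model functor call producing the counts is assumed):

julia> voc = ["my", "name", "is", "salman", "khan", "and", "he"];

julia> train = ["khan", "is", "my", "good", "friend", "and", "he", "is", "my", "brother"];

julia> model = MLE(voc);

julia> fit = model(train, 2, 2);        # bigram counts as a DefaultDict

julia> score(model, fit, "my", "is")    # P("my" | "is")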

TextAnalysis.score (Method)
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of a word given its context.

Applies additive smoothing, as in the Lidstone and Laplace (gammamodel) models.

TextAnalysis.sentence_tokenize (Method)
sentence_tokenize(language, str)

Split str into sentences.

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Array{SubString{String},1}:
 "Here are few words!"
 "I am Foo Bar."

See also: tokenize

TextAnalysis.sparse_terms (Function)
sparse_terms(crps, alpha=0.05)

Find the sparse terms from a Corpus, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Array{String,1}:
 "1"
 "2"

See also: remove_sparse_terms!, frequent_terms

TextAnalysis.standardize! (Method)
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              TokenDocument("Document 2"),
		              NGramDocument("Document 3")])
A Corpus with 3 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 1 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens


julia> standardize!(crps, NGramDocument)

julia> crps
A Corpus with 3 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 3 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

TextAnalysis.stem! (Method)
stem!(doc)
stem!(crps)

Stems the document or documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument or a Corpus made of such documents.
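A usage sketch (exact stemmer output may vary):

julia> sd = StringDocument("They write, it writes");

julia> stem!(sd);

julia> text(sd)   # e.g. "They write , it write"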

TextAnalysis.stem! (Method)
stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first document).

TextAnalysis.summarize (Method)
summarize(doc; ns=5)

Summarizes the document and returns the ns most relevant sentences. ns defaults to 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
 "Assume this Short Document as an example."
 "This has too foo sentences."

TextAnalysis.tag_scheme! (Method)
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported-

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]

julia> tag_scheme!(tags, "BIO1", "BIOES")

julia> tags
8-element Array{String,1}:
 "S-LOC"
 "O"
 "S-PER"
 "B-MISC"
 "E-MISC"
 "B-PER"
 "I-PER"
 "E-PER"

TextAnalysis.text (Method)
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)

Access the text of Document as a string.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> text(sd)
"To be or not to be..."

TextAnalysis.tf! (Method)
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same number of nonzeros as dtm.

See also: tf, tf_idf, tf_idf!

TextAnalysis.tf! (Method)
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly if dtm and tf are the same matrix.

See also: tf, tf_idf, tf_idf!

TextAnalysis.tf (Method)
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

TextAnalysis.tf_idf! (Method)
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have the same number of nonzeros.

See also: tf, tf!, tf_idf

TextAnalysis.tf_idf! (Method)
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

dtm and tf-idf must be matrices of the same dimensions.

See also: tf, tf!, tf_idf

TextAnalysis.tf_idf (Method)
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0

See also: tf, tf!, tf_idf!

TextAnalysis.titles! (Method)
titles!(crps, vec::Vector{String})
titles!(crps, str)

Update the titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a vector, set the title of the ith document to the corresponding ith element of the vector vec. In the latter case, the number of documents must equal the length of the vector.
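A usage sketch (titles are illustrative):

julia> crps = Corpus([StringDocument("Doc 1"), StringDocument("Doc 2")]);

julia> titles!(crps, ["First doc", "Second doc"])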

See also: titles, title!, title

TextAnalysis.tokens (Method)
tokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))

Access the document text as a token array.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> tokens(sd)
7-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be.."
    "."

TextAnalysis.weighted_lcs (Function)
weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weighting_function::Function)

Compute the Weighted Longest Common Subsequence of X and Y.
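A usage sketch (the token sequences and flag values are illustrative, and sqrt is an assumed weighting function):

julia> weighted_lcs(["the", "cat", "sat"], ["the", "cat", "ate"], true, false, sqrt)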

WordTokenizers.tokenize (Method)
tokenize(language, str)

Split str into words and other tokens such as punctuation.

Example

julia> tokenize(Languages.English(), "Too foo words!")
4-element Array{String,1}:
 "Too"
 "foo"
 "words"
 "!"

See also: sentence_tokenize

TextAnalysis.CooMatrix (Type)

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} - the actual COOM; elements represent co-occurrences of two terms within a given window
  • terms::Vector{String} - a list of terms that represent the lexicon of the document or corpus
  • column_indices::OrderedDict{String, Int} - a map between the terms and the columns of the co-occurrence matrix

TextAnalysis.CooMatrix (Method)
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
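A construction sketch (corpus text and keyword values are illustrative; the lexicon is updated so that terms can be omitted):

julia> crps = Corpus([StringDocument("this is a text about an apple")]);

julia> update_lexicon!(crps);

julia> C = CooMatrix{Float32}(crps; window=3);

julia> coom(C)   # the underlying sparse co-occurrence matrix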

TextAnalysis.Corpus (Method)
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              StringDocument("Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

TextAnalysis.DocumentMetadata (Method)
DocumentMetadata(language, title::String, author::String, timestamp::String)

Stores basic metadata about a Document.


Arguments

  • language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
  • title::String : What is the title of the document? Defaults to "Untitled Document".
  • author::String : Who wrote the document? Defaults to "Unknown Author".
  • timestamp::String : When was the document written? Defaults to "Unknown Time".

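A construction sketch using the defaults listed above:

julia> md = DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time")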

TextAnalysis.DocumentTermMatrix (Method)
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int},terms::Vector{String})

Represent documents as a matrix of word counts.

Allows us to apply linear algebra operations and statistical techniques. The lexicon needs to be updated before use.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

TextAnalysis.FileDocument (Method)
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah

TextAnalysis.KneserNeyInterpolated (Method)
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type providing a Kneser-Ney interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.
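A construction sketch (vocabulary and discount value are illustrative):

julia> lm = KneserNeyInterpolated(["to", "be", "or", "not", "to", "be"], 0.75)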

TextAnalysis.Laplace (Type)
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate a Laplace type for providing Laplace-smoothed scores.

In addition to the initialization arguments of the base ngram model, the number by which to increase the counts is fixed at gamma = 1.

TextAnalysis.Lidstone (Method)
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate a Lidstone type for providing Lidstone-smoothed scores.

In addition to the initialization arguments of the base ngram model, it requires the number gamma by which to increase the counts.

TextAnalysis.MLE (Method)
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type providing MLE ngram model scores.

Implementation of Base Ngram Model.

TextAnalysis.NGramDocument (Method)
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of n-grams: a map from UTF8 n-grams to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                     "or" => 1, "not" => 1,
                                     "to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

TextAnalysis.NaiveBayesClassifier (Method)
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], Symbol[:spam, :non_spam], Array{Int64}(0,2))

julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], Symbol[:spam, :non_spam], [2 1; 2 1; 2 1])

julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], Symbol[:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])

julia> predict(m, "is this a spam")
Dict{Symbol,Float64} with 2 entries:
  :spam     => 0.59883
  :non_spam => 0.40117

TextAnalysis.StringDocument (Method)
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
"To be or not to be..."

julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

TextAnalysis.TextHashFunction (Method)
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the Hash Trick, in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N.

Parameters:

  • cardinality - Max index used for hashing (default 100)
  • hash_function - function used for hashing process (default function present, see code-base)

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

TextAnalysis.TokenDocument (Method)
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represents a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

TextAnalysis.Vocabulary (Type)
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")

Stores language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2) 
  Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>") 

julia> vocabulary.vocab
  Dict{String,Int64} with 4 entries:
   "<unk>" => 1
   "c"     => 3
   "a"     => 3
   "d"     => 2

Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
 3

julia> "c" in keys(vocabulary.vocab)
 true

julia> vocabulary.vocab["d"]
 2

julia> "d" in keys(vocabulary.vocab)
 true

Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
 false

julia> "<unk>" in keys(vocabulary.vocab)
 true

We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup(vocabulary, "a")
 "a"

julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary ,word)
 5-element Array{Any,1}:
  "a"    
  "<unk>"
  "d"    
  "c"    
  "a"

If given a sequence, it will return an Array{Any,1} of the looked up words as shown above.
   
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
 1

julia> vocabulary.vocab["b"]
 1

TextAnalysis.WittenBellInterpolated (Method)
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type providing an interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.