Getting our hands dirty with the tm package for R. We will use data from the 20 Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/
library(tm)
## Loading required package: NLP
Assuming the files are already in the working directory:
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tm_0.6 NLP_0.1-5
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.7 formatR_0.10 htmltools_0.2.6
## [5] knitr_1.10.5 parallel_3.1.1 rmarkdown_0.5.1 slam_0.1-32
## [9] stringr_0.6.2 tools_3.1.1 yaml_2.1.13
sci.electr.train <- Corpus(DirSource("sci.electronics", encoding = "UTF-8"),
                           readerControl = list(reader = readPlain, language = "en"))
# getReaders()  # lists the available reader functions
sci.electr.train
## <<VCorpus (documents: 591, metadata (corpus/indexed): 0/0)>>
str(sci.electr.train[[1]])
## List of 2
## $ content: chr [1:45] "From: et@teal.csn.org (Eric H. Taylor)" "Subject: Re: HELP_WITH_TRACKING_DEVICE" "Summary: underground and underwater wireless methods" "Keywords: Rogers, Tesla, Hertz, underground, underwater, wireless, radio" ...
## $ meta :List of 7
## ..$ author : chr(0)
## ..$ datetimestamp: POSIXlt[1:1], format: "2015-09-03 20:11:31"
## ..$ description : chr(0)
## ..$ heading : chr(0)
## ..$ id : chr "52434"
## ..$ language : chr "en"
## ..$ origin : chr(0)
## ..- attr(*, "class")= chr "TextDocumentMeta"
## - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
sci.electr.train[[1]]
## <<PlainTextDocument (metadata: 7)>>
## From: et@teal.csn.org (Eric H. Taylor)
## Subject: Re: HELP_WITH_TRACKING_DEVICE
## Summary: underground and underwater wireless methods
## Keywords: Rogers, Tesla, Hertz, underground, underwater, wireless, radio
## Nntp-Posting-Host: teal.csn.org
## Organization: 4-L Laboratories
## Expires: Fri, 30 Apr 1993 06:00:00 GMT
## Lines: 36
##
## In article <00969FBA.E640FF10@AESOP.RUTGERS.EDU> mcdonald@AESOP.RUTGERS.EDU writes:
## >[...]
## >There are a variety of water-proof housings I could use but the real meat
## >of the problem is the electronics...hence this posting. What kind of
## >transmission would be reliable underwater, in murky or even night-time
## >conditions? I'm not sure if sound is feasible given the distortion under-
## >water...obviously direction would have to be accurate but range could be
## >relatively short (I imagine 2 or 3 hundred yards would be more than enough)
## >
## >Jim McDonald
##
## Refer to patents by JAMES HARRIS ROGERS:
## 958,829; 1,220,005; 1,322,622; 1,349,103; 1,315,862; 1,349,104;
## 1,303,729; 1,303,730; 1,316,188
##
## He details methods of underground and underwater wireless communications.
## For a review, refer to _Electrical_Experimenter_, March 1919 and June 1919.
##
## Rogers' methods were used extensively during the World War, and was
## unclassified after the war. Supposedly, the government rethought this
## soon after, and Rogers was convieniently forgotten.
##
## The bottom line is that all antennas that are grounded send HALF of
## their signal THRU the ground. The half that travels thru space is
## quickly dissapated (by the square of the distance), but that which
## travels thru the ground does not disapate at all. Furthermore,
## the published data showed that when noise drowned out regular
## reception, the underground antennas would recieve virtually noise-free.
##
## IF you find this hard to believe, then refer to the work of the
## man who INVENTED wireless: Tesla. Tesla confirmed that Rogers' methods
## were correct, while Hertzian wave theory was completely "abberant".
##
## ----
## ET "Tesla was 100 years ahead of his time. Perhaps now his time comes."
## ----
# convert to lower case
sci.electr.train.p <- tm_map(sci.electr.train, content_transformer(tolower))
# remove punctuation
sci.electr.train.p <- tm_map(sci.electr.train.p, removePunctuation)
# collapse extra whitespace
sci.electr.train.p <- tm_map(sci.electr.train.p, stripWhitespace)
sci.electr.train.p[[1]]
## <<PlainTextDocument (metadata: 7)>>
## from ettealcsnorg eric h taylor
## subject re helpwithtrackingdevice
## summary underground and underwater wireless methods
## keywords rogers tesla hertz underground underwater wireless radio
## nntppostinghost tealcsnorg
## organization 4l laboratories
## expires fri 30 apr 1993 060000 gmt
## lines 36
##
## in article 00969fbae640ff10aesoprutgersedu mcdonaldaesoprutgersedu writes
##
## there are a variety of waterproof housings i could use but the real meat
## of the problem is the electronicshence this posting what kind of
## transmission would be reliable underwater in murky or even nighttime
## conditions im not sure if sound is feasible given the distortion under
## waterobviously direction would have to be accurate but range could be
## relatively short i imagine 2 or 3 hundred yards would be more than enough
##
## jim mcdonald
##
## refer to patents by james harris rogers
## 958829 1220005 1322622 1349103 1315862 1349104
## 1303729 1303730 1316188
##
## he details methods of underground and underwater wireless communications
## for a review refer to electricalexperimenter march 1919 and june 1919
##
## rogers methods were used extensively during the world war and was
## unclassified after the war supposedly the government rethought this
## soon after and rogers was convieniently forgotten
##
## the bottom line is that all antennas that are grounded send half of
## their signal thru the ground the half that travels thru space is
## quickly dissapated by the square of the distance but that which
## travels thru the ground does not disapate at all furthermore
## the published data showed that when noise drowned out regular
## reception the underground antennas would recieve virtually noisefree
##
## if you find this hard to believe then refer to the work of the
## man who invented wireless tesla tesla confirmed that rogers methods
## were correct while hertzian wave theory was completely abberant
##
##
## et tesla was 100 years ahead of his time perhaps now his time comes
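The cleaned document above shows tokens fused together ("electronicshence", "mcdonaldaesoprutgersedu"). A rough base R sketch of the three steps makes the cause visible: punctuation is deleted outright rather than replaced by a space. The function `clean_text` is a hypothetical helper written for illustration; it approximates the transformations and is not tm's own code.

```r
# Base R approximation of the three cleaning steps (illustrative only).
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)   # delete punctuation, inserting nothing in its place
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace into one space
  x
}

clean_text("Subject: Re: HELP_WITH_TRACKING_DEVICE")
clean_text("water-proof   housings")
```

If fused tokens are unwanted, one option is to replace punctuation with a space before stripping whitespace, at the cost of splitting e-mail addresses and hyphenated words into pieces.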
tdm <- TermDocumentMatrix(sci.electr.train.p)
# inspect a submatrix: first 20 terms, first 10 documents
inspect(tdm[1:20, 1:10])
## <<TermDocumentMatrix (terms: 20, documents: 10)>>
## Non-/sparse entries: 2/198
## Sparsity : 99%
## Maximal term length: 31
## Weighting : term frequency (tf)
##
## Docs
## Terms 52434 52446 52464 52717 52718 52719
## 000 0 0 0 0 0 0
## 0016 0 0 0 0 0 0
## 002 0 0 0 0 0 0
## 0022 0 0 0 0 0 0
## 00235 0 0 0 0 0 0
## 003 0 0 0 0 0 0
## 003800 0 0 0 0 0 0
## 004418 0 0 0 0 0 0
## 0047 0 0 0 0 0 0
## 00472 0 0 0 0 0 0
## 0078 0 1 0 0 0 0
## 008 0 0 0 0 0 0
## 00969fbae640ff10aesoprutgersedu 1 0 0 0 0 0
## 00indextxt 0 0 0 0 0 0
## 0108 0 0 0 0 0 0
## 01760 0 0 0 0 0 0
## 01775 0 0 0 0 0 0
## 01a 0 0 0 0 0 0
## 01x01 0 0 0 0 0 0
## 02115 0 0 0 0 0 0
## Docs
## Terms 52721 52722 52723 52724
## 000 0 0 0 0
## 0016 0 0 0 0
## 002 0 0 0 0
## 0022 0 0 0 0
## 00235 0 0 0 0
## 003 0 0 0 0
## 003800 0 0 0 0
## 004418 0 0 0 0
## 0047 0 0 0 0
## 00472 0 0 0 0
## 0078 0 0 0 0
## 008 0 0 0 0
## 00969fbae640ff10aesoprutgersedu 0 0 0 0
## 00indextxt 0 0 0 0
## 0108 0 0 0 0
## 01760 0 0 0 0
## 01775 0 0 0 0
## 01a 0 0 0 0
## 01x01 0 0 0 0
## 02115 0 0 0 0
dim(tdm)
## [1] 12189 591
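To see what the 12189 x 591 matrix contains, here is a minimal sketch that builds a term-document matrix by hand for three toy documents. The documents and the names `docs`, `tokens`, and `toy_tdm` are invented for illustration; tm stores the real matrix in a sparse format, which this dense version does not attempt to reproduce.

```r
# Three toy documents, already lower-cased and stripped of punctuation.
docs <- c(d1 = "ground wire ground", d2 = "wire outlet", d3 = "tesla radio")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector per document
terms  <- sort(unique(unlist(tokens)))        # the vocabulary, in alphabetical order

# Count each vocabulary term in each document: rows are terms, columns are documents.
toy_tdm <- sapply(tokens, function(tok) as.integer(table(factor(tok, levels = terms))))
rownames(toy_tdm) <- terms

toy_tdm
dim(toy_tdm)
```

Most cells are zero even in this tiny example, which is why the real matrix reports 99% sparsity.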
rownames(tdm)[5000:5110]
## [1] "futservaustinibmcomrg" "future"
## [3] "futurenet" "fyi"
## [5] "fyzzicks" "g22226nextworkrosehulmanedu"
## [7] "g7hwngb7khw" "g8870"
## [9] "g90h6721hipporuacza" "g90k3853alpharuacza"
## [11] "g92m3062alpharuacza" "g92m3062hipporuacza"
## [13] "gaas" "gadget"
## [15] "gadgeteers" "gadgets"
## [17] "gaff" "gain"
## [19] "gainaltering" "gait"
## [21] "gaithersburg" "galen"
## [23] "galenpiceacfnrcolostateedu" "galvanic"
## [25] "galvanized" "galvonometerlike"
## [27] "game" "gameboy"
## [29] "gamecocks" "games"
## [31] "gammahutfi" "gandler"
## [33] "ganged" "ganter"
## [35] "ganterifiunibasch" "gap1pkveuinnduk"
## [37] "gap1pli7ginni6b" "garage"
## [39] "garages" "garden"
## [41] "gardi" "garfield"
## [43] "garlicsbscom" "gary"
## [45] "garygwarrenmentorgcom" "gas"
## [47] "gaskets" "gasoline"
## [49] "gasses" "gasturbine"
## [51] "gate" "gates"
## [53] "gatesource" "gateway"
## [55] "gather" "gauge"
## [57] "gaussian" "gave"
## [59] "gaz" "gaze"
## [61] "gcarterinfoservcom" "gcfi"
## [63] "gcfis" "gd3004"
## [65] "gear" "geared"
## [67] "gecmarconi" "gecplessey"
## [69] "gee" "geekshhhhhh"
## [71] "gel" "gen"
## [73] "genashor" "gendel"
## [75] "general" "generally"
## [77] "generalpurpose" "generate"
## [79] "generated" "generateradiate"
## [81] "generates" "generating"
## [83] "generation" "generator"
## [85] "generatorflashlightsiren" "generators"
## [87] "geneva" "geniuses"
## [89] "gentle" "gentleman"
## [91] "gently" "genuinely"
## [93] "geo" "geographic"
## [95] "geophysical" "george"
## [97] "georgedhodge" "georgia"
## [99] "gerald" "geraldbeltonozoneholecom"
## [101] "gerard" "gergnetcomcom"
## [103] "german" "germany"
## [105] "gerrit" "gerritlaosinhstgtsuborg"
## [107] "gerry" "get"
## [109] "gets" "getting"
## [111] "geva"
# unnormalised tf-idf weighting, with stop words removed
tdm <- TermDocumentMatrix(sci.electr.train.p,
                          control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))
# inspect a submatrix: first 20 terms, first 10 documents
inspect(tdm[1:20, 1:10])
## <<TermDocumentMatrix (terms: 20, documents: 10)>>
## Non-/sparse entries: 2/198
## Sparsity : 99%
## Maximal term length: 31
## Weighting : term frequency - inverse document frequency (tf-idf)
##
## Docs
## Terms 52434 52446 52464 52717 52718
## 000 0.000000 0.000000 0 0 0
## 0016 0.000000 0.000000 0 0 0
## 002 0.000000 0.000000 0 0 0
## 0022 0.000000 0.000000 0 0 0
## 00235 0.000000 0.000000 0 0 0
## 003 0.000000 0.000000 0 0 0
## 003800 0.000000 0.000000 0 0 0
## 004418 0.000000 0.000000 0 0 0
## 0047 0.000000 0.000000 0 0 0
## 00472 0.000000 0.000000 0 0 0
## 0078 0.000000 9.207014 0 0 0
## 008 0.000000 0.000000 0 0 0
## 00969fbae640ff10aesoprutgersedu 9.207014 0.000000 0 0 0
## 00indextxt 0.000000 0.000000 0 0 0
## 0108 0.000000 0.000000 0 0 0
## 01760 0.000000 0.000000 0 0 0
## 01775 0.000000 0.000000 0 0 0
## 01a 0.000000 0.000000 0 0 0
## 01x01 0.000000 0.000000 0 0 0
## 02115 0.000000 0.000000 0 0 0
## Docs
## Terms 52719 52721 52722 52723 52724
## 000 0 0 0 0 0
## 0016 0 0 0 0 0
## 002 0 0 0 0 0
## 0022 0 0 0 0 0
## 00235 0 0 0 0 0
## 003 0 0 0 0 0
## 003800 0 0 0 0 0
## 004418 0 0 0 0 0
## 0047 0 0 0 0 0
## 00472 0 0 0 0 0
## 0078 0 0 0 0 0
## 008 0 0 0 0 0
## 00969fbae640ff10aesoprutgersedu 0 0 0 0 0
## 00indextxt 0 0 0 0 0
## 0108 0 0 0 0 0
## 01760 0 0 0 0 0
## 01775 0 0 0 0 0
## 01a 0 0 0 0 0
## 01x01 0 0 0 0 0
## 02115 0 0 0 0 0
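The value 9.207014 in the output above is consistent with unnormalised tf-idf computed as tf * log2(N / df), with N = 591 documents. The sketch below checks that arithmetic and applies the same formula to a toy count matrix. Here df = 1 is an assumption (the output shows only 10 of the 591 documents), and `tfidf_raw` is a hypothetical helper, not a tm function.

```r
n_docs <- 591   # documents in the corpus, from the VCorpus summary
tf     <- 1     # raw count of the term in document 52434
df     <- 1     # assumed: the term occurs in no other document
tf * log2(n_docs / df)

# The same weighting for a whole dense count matrix (rows = terms, columns = documents).
tfidf_raw <- function(counts) {
  idf <- log2(ncol(counts) / rowSums(counts > 0))  # one idf value per term
  counts * idf                                     # recycles idf down each column, i.e. per row
}

toy <- rbind(rare = c(1, 0), common = c(1, 1))
colnames(toy) <- c("d1", "d2")
tfidf_raw(toy)
```

A term that occurs in every document gets idf = log2(1) = 0, so its weight vanishes no matter how often it appears.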
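# normalised tf-idf; minWordLength and minDocFreq come from older tm releases and appear
# to be ignored by tm 0.6 (the terms listed below still include one-off message IDs)
tdm <- TermDocumentMatrix(sci.electr.train.p, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
# inspect a submatrix: terms 400 to 410, first 10 documents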
inspect(tdm[400:410, 1:10])
## <<TermDocumentMatrix (terms: 11, documents: 10)>>
## Non-/sparse entries: 0/110
## Sparsity : 100%
## Maximal term length: 37
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
##
## Docs
## Terms 52434 52446 52464 52717 52718
## 1qihcl9riusenetinscwruedu 0 0 0 0 0
## 1qk158kcpbigbirdhricom 0 0 0 0 0
## 1qk4hjqosvtserfccvtedu 0 0 0 0 0
## 1qk724inn474hpcolcolhpcom 0 0 0 0 0
## 1ql7ugi50sunbocsmqeduau 0 0 0 0 0
## 1qlg9od7qsequoiaccsdutseduau 0 0 0 0 0
## 1qmisfodpsdlwarrenmentorgcom 0 0 0 0 0
## 1qngqlinnnp8shelleyuwashingtonedu 0 0 0 0 0
## 1qnroed1nusenetinscwruedu 0 0 0 0 0
## 1qpgsiinn31pdiplodocuscisohiostateedu 0 0 0 0 0
## 1qpj5titgvelaacsoaklandedu 0 0 0 0 0
## Docs
## Terms 52719 52721 52722 52723 52724
## 1qihcl9riusenetinscwruedu 0 0 0 0 0
## 1qk158kcpbigbirdhricom 0 0 0 0 0
## 1qk4hjqosvtserfccvtedu 0 0 0 0 0
## 1qk724inn474hpcolcolhpcom 0 0 0 0 0
## 1ql7ugi50sunbocsmqeduau 0 0 0 0 0
## 1qlg9od7qsequoiaccsdutseduau 0 0 0 0 0
## 1qmisfodpsdlwarrenmentorgcom 0 0 0 0 0
## 1qngqlinnnp8shelleyuwashingtonedu 0 0 0 0 0
## 1qnroed1nusenetinscwruedu 0 0 0 0 0
## 1qpgsiinn31pdiplodocuscisohiostateedu 0 0 0 0 0
## 1qpj5titgvelaacsoaklandedu 0 0 0 0 0
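The call above asks for a minimum word length of 2 and a minimum document frequency of 5, yet the listing still shows message-ID tokens that are unlikely to occur in five documents. As a hedged illustration of what such a filter is presumably meant to do, here is a base R sketch on an invented count matrix; `counts` and its values are made up, and this is not how tm applies its own control options.

```r
# Toy counts: rows are terms, columns are six documents.
counts <- rbind(
  ground                   = c(1, 2, 0, 1, 3, 1),
  wire                     = c(0, 1, 1, 1, 1, 1),
  "1qk158kcpbigbirdhricom" = c(1, 0, 0, 0, 0, 0),
  a                        = c(1, 1, 1, 1, 1, 1)
)
colnames(counts) <- paste0("d", 1:6)

doc_freq <- rowSums(counts > 0)   # number of documents each term appears in

# Keep terms with at least 2 characters that appear in at least 5 documents.
keep     <- nchar(rownames(counts)) >= 2 & doc_freq >= 5
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

Filtering on document frequency before weighting removes one-off tokens such as message IDs, which otherwise dominate the vocabulary.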
freqterms <- findFreqTerms(tdm, 3, 10)
str(freqterms)
## chr [1:34] "any" "anyone" "are" "but" "can" "does" ...
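The terms returned above are mostly common words, which can be surprising for a tf-idf matrix. A plausible reading is that the bounds 3 and 10 are compared against each term's total across all documents, and with tf-idf weighting that total is a sum of weights rather than a raw count. The sketch below shows that selection rule on a toy matrix; `find_freq_terms` is a hypothetical stand-in, and its equivalence to tm's `findFreqTerms` is an assumption.

```r
# Select terms whose row total lies within [lowfreq, highfreq].
find_freq_terms <- function(m, lowfreq = 0, highfreq = Inf) {
  totals <- rowSums(m)
  names(totals[totals >= lowfreq & totals <= highfreq])
}

toy <- rbind(
  any    = c(1.0, 1.5, 1.0),   # total 3.5
  ground = c(6.0, 5.0, 4.0),   # total 15
  geva   = c(0.5, 0.0, 0.0)    # total 0.5
)
find_freq_terms(toy, 3, 10)
```

To select terms by raw occurrence counts instead, the same rule can be applied to a matrix built with plain term-frequency weighting.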
findAssocs(tdm, c("ground", "anyone"), c(0.6, 0.6))
## $ground
## outlets wire connected breaker neutral outlet
## 0.74 0.68 0.67 0.64 0.62 0.61
##
## $anyone
## numeric(0)
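The association scores above can be read as correlations between a term's row and every other term's row across documents, keeping those at or above the limit. Here is a minimal sketch of that idea on invented data; `assoc` is a hypothetical helper and the use of Pearson correlation is an assumption about what `findAssocs` computes.

```r
# Toy matrix: rows are terms, columns are five documents.
m <- rbind(
  ground = c(2, 0, 1, 0, 3),
  wire   = c(1, 0, 1, 0, 2),
  tesla  = c(0, 1, 0, 2, 0)
)

# Correlate one term with all others and keep those at or above corlimit.
assoc <- function(m, term, corlimit) {
  r <- cor(t(m))[term, ]                 # correlations of `term` with every term
  r <- r[names(r) != term]               # drop the self-correlation
  r <- r[!is.na(r) & r >= corlimit]      # apply the threshold
  sort(round(r, 2), decreasing = TRUE)
}

assoc(m, "ground", 0.6)
```

An empty result, as for "anyone" above, simply means no other term reaches the chosen limit for that word.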