Getting our hands dirty with the tm package for R. We will use data from the 20 Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/

library(tm)
## Loading required package: NLP

Assuming the files are already in the working directory:

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tm_0.6    NLP_0.1-5
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.4    evaluate_0.7    formatR_0.10    htmltools_0.2.6
##  [5] knitr_1.10.5    parallel_3.1.1  rmarkdown_0.5.1 slam_0.1-32    
##  [9] stringr_0.6.2   tools_3.1.1     yaml_2.1.13
  1. Load the files
sci.electr.train <- Corpus(DirSource("sci.electronics", encoding = "UTF-8"), readerControl=list(reader=readPlain,language="en"))
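To try the loading step without downloading the dataset, the following minimal sketch writes two tiny files to a temporary directory and reads them the same way. The directory name `sci.electronics.demo`, the file contents, and the object name `demo` are made up for illustration; only the `DirSource` / `readPlain` call mirrors the one above.

```r
library(tm)

# Hypothetical stand-in for the "sci.electronics" folder: two small files in a
# temporary directory, named like the newsgroup message ids.
demo_dir <- file.path(tempdir(), "sci.electronics.demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Subject: grounding", "Is the outlet grounded?"),
           file.path(demo_dir, "52434"))
writeLines(c("Subject: re: grounding", "Check the breaker first."),
           file.path(demo_dir, "52446"))

# Same reader settings as in the tutorial call.
demo <- VCorpus(DirSource(demo_dir, encoding = "UTF-8"),
                readerControl = list(reader = readPlain, language = "en"))

length(demo)            # number of files read
meta(demo[[1]], "id")   # the document id is taken from the file name
```

Each file becomes one document, which is why the document ids in the real corpus look like message numbers.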
  1. Inspect the corpus
# getReaders()
sci.electr.train
## <<VCorpus (documents: 591, metadata (corpus/indexed): 0/0)>>
str(sci.electr.train[[1]])
## List of 2
##  $ content: chr [1:45] "From: et@teal.csn.org (Eric H. Taylor)" "Subject: Re: HELP_WITH_TRACKING_DEVICE" "Summary: underground and underwater wireless methods" "Keywords: Rogers, Tesla, Hertz, underground, underwater, wireless, radio" ...
##  $ meta   :List of 7
##   ..$ author       : chr(0) 
##   ..$ datetimestamp: POSIXlt[1:1], format: "2015-09-03 20:11:31"
##   ..$ description  : chr(0) 
##   ..$ heading      : chr(0) 
##   ..$ id           : chr "52434"
##   ..$ language     : chr "en"
##   ..$ origin       : chr(0) 
##   ..- attr(*, "class")= chr "TextDocumentMeta"
##  - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
sci.electr.train[[1]]
## <<PlainTextDocument (metadata: 7)>>
## From: et@teal.csn.org (Eric H. Taylor)
## Subject: Re: HELP_WITH_TRACKING_DEVICE
## Summary: underground and underwater wireless methods
## Keywords: Rogers, Tesla, Hertz, underground, underwater, wireless, radio
## Nntp-Posting-Host: teal.csn.org
## Organization: 4-L Laboratories
## Expires: Fri, 30 Apr 1993 06:00:00 GMT
## Lines: 36
## 
## In article <00969FBA.E640FF10@AESOP.RUTGERS.EDU> mcdonald@AESOP.RUTGERS.EDU writes:
## >[...]
## >There are a variety of water-proof housings I could use but the real meat
## >of the problem is the electronics...hence this posting.  What kind of
## >transmission would be reliable underwater, in murky or even night-time
## >conditions?  I'm not sure if sound is feasible given the distortion under-
## >water...obviously direction would have to be accurate but range could be
## >relatively short (I imagine 2 or 3  hundred yards would be more than enough)
## >
## >Jim McDonald
## 
## Refer to patents by JAMES HARRIS ROGERS:
## 958,829; 1,220,005; 1,322,622; 1,349,103; 1,315,862; 1,349,104;
## 1,303,729; 1,303,730; 1,316,188
## 
## He details methods of underground and underwater wireless communications.
## For a review, refer to _Electrical_Experimenter_, March 1919 and June 1919.
## 
## Rogers' methods were used extensively during the World War, and was
## unclassified after the war. Supposedly, the government rethought this
## soon after, and Rogers was convieniently forgotten.
## 
## The bottom line is that all antennas that are grounded send HALF of
## their signal THRU the ground. The half that travels thru space is
## quickly dissapated (by the square of the distance), but that which
## travels thru the ground does not disapate at all. Furthermore,
## the published data showed that when noise drowned out regular
## reception, the underground antennas would recieve virtually noise-free.
## 
## IF you find this hard to believe, then refer to the work of the
## man who INVENTED wireless: Tesla. Tesla confirmed that Rogers' methods
## were correct, while Hertzian wave theory was completely "abberant".
## 
## ----
##  ET   "Tesla was 100 years ahead of his time. Perhaps now his time comes."
## ----
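The `str()` output shows that each document is a list with a `content` part and a `meta` part. As a small sketch, both can be read with the accessors `content()` and `meta()`. The one-document corpus `toy` below is invented for illustration and is built from a character vector rather than from the newsgroup files.

```r
library(tm)

# Hypothetical one-document corpus built from a character vector.
toy <- VCorpus(VectorSource("Subject: grounding question"))
doc <- toy[[1]]

content(doc)            # the text itself, as a character vector
meta(doc, "id")         # a single metadata field
meta(doc, "language")   # "en" is the default language tag
```

On the real corpus, the same calls would be applied to `sci.electr.train[[1]]`.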
  1. Pre-processing
# convert all text to lowercase
sci.electr.train.p <- tm_map(sci.electr.train, content_transformer(tolower))
# remove punctuation
sci.electr.train.p <- tm_map(sci.electr.train.p, removePunctuation)
# collapse runs of whitespace into a single space
sci.electr.train.p <- tm_map(sci.electr.train.p, stripWhitespace)

sci.electr.train.p[[1]]
## <<PlainTextDocument (metadata: 7)>>
## from ettealcsnorg eric h taylor
## subject re helpwithtrackingdevice
## summary underground and underwater wireless methods
## keywords rogers tesla hertz underground underwater wireless radio
## nntppostinghost tealcsnorg
## organization 4l laboratories
## expires fri 30 apr 1993 060000 gmt
## lines 36
## 
## in article 00969fbae640ff10aesoprutgersedu mcdonaldaesoprutgersedu writes
## 
## there are a variety of waterproof housings i could use but the real meat
## of the problem is the electronicshence this posting what kind of
## transmission would be reliable underwater in murky or even nighttime
## conditions im not sure if sound is feasible given the distortion under
## waterobviously direction would have to be accurate but range could be
## relatively short i imagine 2 or 3 hundred yards would be more than enough
## 
## jim mcdonald
## 
## refer to patents by james harris rogers
## 958829 1220005 1322622 1349103 1315862 1349104
## 1303729 1303730 1316188
## 
## he details methods of underground and underwater wireless communications
## for a review refer to electricalexperimenter march 1919 and june 1919
## 
## rogers methods were used extensively during the world war and was
## unclassified after the war supposedly the government rethought this
## soon after and rogers was convieniently forgotten
## 
## the bottom line is that all antennas that are grounded send half of
## their signal thru the ground the half that travels thru space is
## quickly dissapated by the square of the distance but that which
## travels thru the ground does not disapate at all furthermore
## the published data showed that when noise drowned out regular
## reception the underground antennas would recieve virtually noisefree
## 
## if you find this hard to believe then refer to the work of the
## man who invented wireless tesla tesla confirmed that rogers methods
## were correct while hertzian wave theory was completely abberant
## 
## 
##  et tesla was 100 years ahead of his time perhaps now his time comes
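The output shows that removing punctuation deletes the characters outright, so `et@teal.csn.org` turns into the single token `ettealcsnorg`. A rough base R sketch of the three steps on one header line makes this visible. The regular expressions are an approximation of what the tm transformations do, not a copy of their source, and the variable names are illustrative.

```r
line <- "From: et@teal.csn.org (Eric H. Taylor)"

lowered  <- tolower(line)                        # step 1: lowercase
no_punct <- gsub("[[:punct:]]+", "", lowered)    # step 2: delete punctuation
cleaned  <- gsub("[[:space:]]+", " ", no_punct)  # step 3: collapse whitespace
cleaned

# Possible alternative (an assumption, not what the tutorial does): replace
# punctuation with a space so that the parts of an address stay separate.
spaced <- gsub("[[:punct:]]+", " ", lowered)
spaced <- gsub("[[:space:]]+", " ", spaced)
spaced <- sub(" +$", "", spaced)                 # drop the trailing space
spaced
```

Whether glued tokens or split tokens are preferable depends on the task; glued addresses inflate the vocabulary with terms that occur in only one document.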
  1. Create the term-document matrix (TDM)
tdm <- TermDocumentMatrix(sci.electr.train.p)
# inspect a submatrix: the first 20 terms and the first 10 documents
inspect(tdm[1:20, 1:10])
## <<TermDocumentMatrix (terms: 20, documents: 10)>>
## Non-/sparse entries: 2/198
## Sparsity           : 99%
## Maximal term length: 31
## Weighting          : term frequency (tf)
## 
##                                  Docs
## Terms                             52434 52446 52464 52717 52718 52719
##   000                                 0     0     0     0     0     0
##   0016                                0     0     0     0     0     0
##   002                                 0     0     0     0     0     0
##   0022                                0     0     0     0     0     0
##   00235                               0     0     0     0     0     0
##   003                                 0     0     0     0     0     0
##   003800                              0     0     0     0     0     0
##   004418                              0     0     0     0     0     0
##   0047                                0     0     0     0     0     0
##   00472                               0     0     0     0     0     0
##   0078                                0     1     0     0     0     0
##   008                                 0     0     0     0     0     0
##   00969fbae640ff10aesoprutgersedu     1     0     0     0     0     0
##   00indextxt                          0     0     0     0     0     0
##   0108                                0     0     0     0     0     0
##   01760                               0     0     0     0     0     0
##   01775                               0     0     0     0     0     0
##   01a                                 0     0     0     0     0     0
##   01x01                               0     0     0     0     0     0
##   02115                               0     0     0     0     0     0
##                                  Docs
## Terms                             52721 52722 52723 52724
##   000                                 0     0     0     0
##   0016                                0     0     0     0
##   002                                 0     0     0     0
##   0022                                0     0     0     0
##   00235                               0     0     0     0
##   003                                 0     0     0     0
##   003800                              0     0     0     0
##   004418                              0     0     0     0
##   0047                                0     0     0     0
##   00472                               0     0     0     0
##   0078                                0     0     0     0
##   008                                 0     0     0     0
##   00969fbae640ff10aesoprutgersedu     0     0     0     0
##   00indextxt                          0     0     0     0
##   0108                                0     0     0     0
##   01760                               0     0     0     0
##   01775                               0     0     0     0
##   01a                                 0     0     0     0
##   01x01                               0     0     0     0
##   02115                               0     0     0     0
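The header of the `inspect()` output reports 2 non-zero cells against 198 zero cells. A short sketch of how the 99% sparsity figure follows from those counts, using a dense stand-in matrix `m` that is constructed only for this illustration:

```r
# Dense stand-in for the 20 x 10 submatrix shown above: all zeros except the
# two cells that hold a count of 1.
m <- matrix(0, nrow = 20, ncol = 10)
m[11, 2] <- 1   # term "0078" in document 52446
m[13, 1] <- 1   # term "00969fbae640ff10aesoprutgersedu" in document 52434

non_sparse <- sum(m != 0)   # cells with a non-zero count
sparse     <- sum(m == 0)   # cells that are zero
sparsity   <- sparse / (non_sparse + sparse)

round(100 * sparsity)       # percentage of zero cells
```

Most terms occur in very few documents, so storing only the non-zero cells, as tm does, saves a lot of memory.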
dim(tdm)
## [1] 12189   591
rownames(tdm) [5000:5110]
##   [1] "futservaustinibmcomrg"       "future"                     
##   [3] "futurenet"                   "fyi"                        
##   [5] "fyzzicks"                    "g22226nextworkrosehulmanedu"
##   [7] "g7hwngb7khw"                 "g8870"                      
##   [9] "g90h6721hipporuacza"         "g90k3853alpharuacza"        
##  [11] "g92m3062alpharuacza"         "g92m3062hipporuacza"        
##  [13] "gaas"                        "gadget"                     
##  [15] "gadgeteers"                  "gadgets"                    
##  [17] "gaff"                        "gain"                       
##  [19] "gainaltering"                "gait"                       
##  [21] "gaithersburg"                "galen"                      
##  [23] "galenpiceacfnrcolostateedu"  "galvanic"                   
##  [25] "galvanized"                  "galvonometerlike"           
##  [27] "game"                        "gameboy"                    
##  [29] "gamecocks"                   "games"                      
##  [31] "gammahutfi"                  "gandler"                    
##  [33] "ganged"                      "ganter"                     
##  [35] "ganterifiunibasch"           "gap1pkveuinnduk"            
##  [37] "gap1pli7ginni6b"             "garage"                     
##  [39] "garages"                     "garden"                     
##  [41] "gardi"                       "garfield"                   
##  [43] "garlicsbscom"                "gary"                       
##  [45] "garygwarrenmentorgcom"       "gas"                        
##  [47] "gaskets"                     "gasoline"                   
##  [49] "gasses"                      "gasturbine"                 
##  [51] "gate"                        "gates"                      
##  [53] "gatesource"                  "gateway"                    
##  [55] "gather"                      "gauge"                      
##  [57] "gaussian"                    "gave"                       
##  [59] "gaz"                         "gaze"                       
##  [61] "gcarterinfoservcom"          "gcfi"                       
##  [63] "gcfis"                       "gd3004"                     
##  [65] "gear"                        "geared"                     
##  [67] "gecmarconi"                  "gecplessey"                 
##  [69] "gee"                         "geekshhhhhh"                
##  [71] "gel"                         "gen"                        
##  [73] "genashor"                    "gendel"                     
##  [75] "general"                     "generally"                  
##  [77] "generalpurpose"              "generate"                   
##  [79] "generated"                   "generateradiate"            
##  [81] "generates"                   "generating"                 
##  [83] "generation"                  "generator"                  
##  [85] "generatorflashlightsiren"    "generators"                 
##  [87] "geneva"                      "geniuses"                   
##  [89] "gentle"                      "gentleman"                  
##  [91] "gently"                      "genuinely"                  
##  [93] "geo"                         "geographic"                 
##  [95] "geophysical"                 "george"                     
##  [97] "georgedhodge"                "georgia"                    
##  [99] "gerald"                      "geraldbeltonozoneholecom"   
## [101] "gerard"                      "gergnetcomcom"              
## [103] "german"                      "germany"                    
## [105] "gerrit"                      "gerritlaosinhstgtsuborg"    
## [107] "gerry"                       "get"                        
## [109] "gets"                        "getting"                    
## [111] "geva"
  1. Use TF-IDF weighting
tdm <- TermDocumentMatrix(sci.electr.train.p, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE))
# inspect a submatrix: the first 20 terms and the first 10 documents
inspect(tdm[1:20, 1:10])
## <<TermDocumentMatrix (terms: 20, documents: 10)>>
## Non-/sparse entries: 2/198
## Sparsity           : 99%
## Maximal term length: 31
## Weighting          : term frequency - inverse document frequency (tf-idf)
## 
##                                  Docs
## Terms                                52434    52446 52464 52717 52718
##   000                             0.000000 0.000000     0     0     0
##   0016                            0.000000 0.000000     0     0     0
##   002                             0.000000 0.000000     0     0     0
##   0022                            0.000000 0.000000     0     0     0
##   00235                           0.000000 0.000000     0     0     0
##   003                             0.000000 0.000000     0     0     0
##   003800                          0.000000 0.000000     0     0     0
##   004418                          0.000000 0.000000     0     0     0
##   0047                            0.000000 0.000000     0     0     0
##   00472                           0.000000 0.000000     0     0     0
##   0078                            0.000000 9.207014     0     0     0
##   008                             0.000000 0.000000     0     0     0
##   00969fbae640ff10aesoprutgersedu 9.207014 0.000000     0     0     0
##   00indextxt                      0.000000 0.000000     0     0     0
##   0108                            0.000000 0.000000     0     0     0
##   01760                           0.000000 0.000000     0     0     0
##   01775                           0.000000 0.000000     0     0     0
##   01a                             0.000000 0.000000     0     0     0
##   01x01                           0.000000 0.000000     0     0     0
##   02115                           0.000000 0.000000     0     0     0
##                                  Docs
## Terms                             52719 52721 52722 52723 52724
##   000                                 0     0     0     0     0
##   0016                                0     0     0     0     0
##   002                                 0     0     0     0     0
##   0022                                0     0     0     0     0
##   00235                               0     0     0     0     0
##   003                                 0     0     0     0     0
##   003800                              0     0     0     0     0
##   004418                              0     0     0     0     0
##   0047                                0     0     0     0     0
##   00472                               0     0     0     0     0
##   0078                                0     0     0     0     0
##   008                                 0     0     0     0     0
##   00969fbae640ff10aesoprutgersedu     0     0     0     0     0
##   00indextxt                          0     0     0     0     0
##   0108                                0     0     0     0     0
##   01760                               0     0     0     0     0
##   01775                               0     0     0     0     0
##   01a                                 0     0     0     0     0
##   01x01                               0     0     0     0     0
##   02115                               0     0     0     0     0
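The value 9.207014 in the output can be reproduced by hand. With `normalize = FALSE`, the weight is the raw term count multiplied by `log2(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below assumes that the term occurs once, in a single document of the 591; that assumption is consistent with the printed value but is not stated in the output.

```r
n_docs   <- 591   # documents in the corpus, from dim(tdm)
tf       <- 1     # assumed: the term occurs once in document 52434
doc_freq <- 1     # assumed: no other document contains the term

idf    <- log2(n_docs / doc_freq)
weight <- tf * idf
weight            # close to the 9.207014 shown by inspect()

# A term present in every document gets weight zero, however often it occurs.
log2(n_docs / n_docs)
```

This is why rare tokens such as message ids receive the highest weights in the matrix.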
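  1. Other filters
# Note: minWordLength and minDocFreq are option names from older tm releases.
# tm 0.6 appears to ignore them silently (the output below still lists rare
# tokens); the current option names are wordLengths and bounds.
tdm <- TermDocumentMatrix(sci.electr.train.p, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
# inspect a submatrix: terms 400 to 410 and the first 10 documents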
inspect(tdm[400:410, 1:10])
## <<TermDocumentMatrix (terms: 11, documents: 10)>>
## Non-/sparse entries: 0/110
## Sparsity           : 100%
## Maximal term length: 37
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##                                        Docs
## Terms                                   52434 52446 52464 52717 52718
##   1qihcl9riusenetinscwruedu                 0     0     0     0     0
##   1qk158kcpbigbirdhricom                    0     0     0     0     0
##   1qk4hjqosvtserfccvtedu                    0     0     0     0     0
##   1qk724inn474hpcolcolhpcom                 0     0     0     0     0
##   1ql7ugi50sunbocsmqeduau                   0     0     0     0     0
##   1qlg9od7qsequoiaccsdutseduau              0     0     0     0     0
##   1qmisfodpsdlwarrenmentorgcom              0     0     0     0     0
##   1qngqlinnnp8shelleyuwashingtonedu         0     0     0     0     0
##   1qnroed1nusenetinscwruedu                 0     0     0     0     0
##   1qpgsiinn31pdiplodocuscisohiostateedu     0     0     0     0     0
##   1qpj5titgvelaacsoaklandedu                0     0     0     0     0
##                                        Docs
## Terms                                   52719 52721 52722 52723 52724
##   1qihcl9riusenetinscwruedu                 0     0     0     0     0
##   1qk158kcpbigbirdhricom                    0     0     0     0     0
##   1qk4hjqosvtserfccvtedu                    0     0     0     0     0
##   1qk724inn474hpcolcolhpcom                 0     0     0     0     0
##   1ql7ugi50sunbocsmqeduau                   0     0     0     0     0
##   1qlg9od7qsequoiaccsdutseduau              0     0     0     0     0
##   1qmisfodpsdlwarrenmentorgcom              0     0     0     0     0
##   1qngqlinnnp8shelleyuwashingtonedu         0     0     0     0     0
##   1qnroed1nusenetinscwruedu                 0     0     0     0     0
##   1qpgsiinn31pdiplodocuscisohiostateedu     0     0     0     0     0
##   1qpj5titgvelaacsoaklandedu                0     0     0     0     0
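The terms listed above look like message ids, which are unlikely to occur in five or more documents, so the length and document-frequency filters do not seem to have taken effect. In recent tm versions these filters are expressed with the control options `wordLengths` and `bounds`. A minimal sketch on an invented three-document corpus (the texts and the threshold of 2 documents are illustrative only):

```r
library(tm)

# Hypothetical toy corpus, small enough to check the result by eye.
docs <- c("the ground wire is connected to the outlet",
          "the neutral wire and the ground wire",
          "a breaker protects the outlet")
toy <- VCorpus(VectorSource(docs))

tdm_toy <- TermDocumentMatrix(
  toy,
  control = list(
    wordLengths = c(2, Inf),            # keep terms with at least 2 characters
    bounds = list(global = c(2, Inf))   # keep terms found in at least 2 documents
  )
)

Terms(tdm_toy)   # only terms shared by two or more documents remain
```

On the newsgroup corpus, the analogous call would use `bounds = list(global = c(5, Inf))` to keep terms that appear in at least five documents.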
  1. Frequent terms and associations
freqterms <- findFreqTerms(tdm, 3, 10)
str(freqterms)
##  chr [1:34] "any" "anyone" "are" "but" "can" "does" ...
findAssocs(tdm, c("ground", "anyone"), c(0.6, 0.6))
## $ground
##   outlets      wire connected   breaker   neutral    outlet 
##      0.74      0.68      0.67      0.64      0.62      0.61 
## 
## $anyone
## numeric(0)
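The association scores are correlations between term rows of the matrix: two terms score high when they tend to appear in the same documents. A base R sketch with an invented 3 x 5 count matrix shows the idea; the numbers are illustrative and are not taken from the corpus.

```r
# Rows are terms, columns are five hypothetical documents.
m <- rbind(
  ground = c(2, 0, 1, 0, 3),
  outlet = c(1, 0, 1, 0, 2),
  anyone = c(0, 1, 0, 1, 0)
)

# Pearson correlation between two term rows.
r_ground_outlet <- cor(m["ground", ], m["outlet", ])
r_ground_anyone <- cor(m["ground", ], m["anyone", ])

r_ground_outlet   # high: the two terms occur in the same documents
r_ground_anyone   # negative: the terms occur in different documents
```

A term such as "anyone", which is spread across unrelated posts, correlates strongly with no other term, which would explain the empty `numeric(0)` result at the 0.6 threshold.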