Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Make a specific row the column names for the specified data.frame
Description
Many tables in Word documents are in twisted formats where there may be
labels or other oddities mixed in that make it difficult to work with the
underlying data. This function makes it easy to identify a particular row
in a scraped data.frame as the one containing column names and
have it become the column names, removing it and (optionally) all of the
rows before it (since that's usually what needs to be done).
Usage
assign_colnames(dat, row, remove = TRUE, remove_previous = remove)
Arguments
dat
can be any data.frame but is intended for use with
ones returned by this package
row
numeric value indicating the row number that is to become the column names
remove
remove row specified by row after making it
the column names? (Default: TRUE)
remove_previous
remove any rows preceding row? (Default:
TRUE but will be assigned whatever is given for
remove).
Value
data.frame
See Also
docx_extract_all, docx_extract_tbl
Examples
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)
# get all the tables
tbls <- docx_extract_all_tbls(real_world)
# make table 1 better
assign_colnames(tbls[[1]], 2)
# make table 5 better
assign_colnames(tbls[[5]], 2)
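The row-promotion behavior described above can be illustrated without a Word file. The following is a minimal base R sketch, not the package's implementation: promote_row() and the scraped data.frame are hypothetical names made up for this illustration, and the argument handling simply follows the Arguments section above.

```r
# Hypothetical stand-in for a scraped table: a label row, the real header row,
# then two rows of data.
scraped <- data.frame(
  V1 = c("Table 1: Sites", "site", "A", "B"),
  V2 = c("", "count", "10", "12"),
  stringsAsFactors = FALSE
)

# promote_row() is an illustrative helper, NOT part of docxtractr.
promote_row <- function(dat, row, remove = TRUE, remove_previous = remove) {
  # Take the chosen row's values as the new column names.
  new_names <- as.character(unlist(dat[row, ], use.names = FALSE))
  # Collect the rows to drop: the header row itself and, optionally,
  # everything above it.
  drop <- integer(0)
  if (remove) drop <- c(drop, row)
  if (remove_previous && row > 1) drop <- c(drop, seq_len(row - 1))
  if (length(drop) > 0) dat <- dat[-drop, , drop = FALSE]
  names(dat) <- new_names
  rownames(dat) <- NULL
  dat
}

cleaned <- promote_row(scraped, 2)
print(cleaned)
```

Here row 2 becomes the column names and rows 1 and 2 are dropped, leaving only the two data rows.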
Convert a Document (usually PowerPoint) to a PDF
Description
Convert a Document (usually PowerPoint) to a PDF
Usage
convert_to_pdf(path, pdf_file = sub("[.]pptx", ".pdf", path))
Arguments
path
path to the document, can be PowerPoint or DOCX
pdf_file
output PDF file name. By default, creates a PDF in the
same directory as the path file.
Details
This functionality requires LibreOffice and the soffice binary it contains. See
set_libreoffice_path for more information.
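The default for pdf_file shown in Usage is a plain substitution on path, so it only rewrites names containing ".pptx". The sketch below evaluates that same expression on two made-up paths (both file names are hypothetical) to show why an explicit pdf_file is probably worth passing for a DOCX input.

```r
# The default expression from the Usage section, wrapped in a helper
# (default_pdf_name is an illustrative name, not a package function).
default_pdf_name <- function(path) sub("[.]pptx", ".pdf", path)

pptx_out <- default_pdf_name("slides/ex.pptx")
docx_out <- default_pdf_name("docs/data.docx")

print(pptx_out)  # "slides/ex.pdf"
print(docx_out)  # "docs/data.docx" (unchanged: the pattern does not match)
```

Because the pattern does not match a .docx path, the default output name would equal the input name in that case.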
Examples
## Not run:
path = system.file("examples/ex.pptx", package="docxtractr")
pdf <- convert_to_pdf(path, pdf_file = tempfile(fileext = ".pdf"))
path = system.file("examples/data.docx", package="docxtractr")
pdf_doc <- convert_to_pdf(path, pdf_file = tempfile(fileext = ".pdf"))
## End(Not run)
Get number of comments in a Word document
Description
Get number of comments in a Word document
Usage
docx_cmnt_count(docx)
Arguments
docx
docx object read with read_docx
Value
numeric
Examples
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
Returns information about the comments in the Word document
Description
Returns information about the comments in the Word document
Usage
docx_describe_cmnts(docx)
Arguments
docx
docx object read with read_docx
Examples
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
docx_describe_cmnts(cmnts)
Returns a description of all the tables in the Word document
Description
This function will attempt to discern the structure of each of the tables
in docx and print this information.
Usage
docx_describe_tbls(docx)
Arguments
docx
docx object read with read_docx
Examples
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
docx_tbl_count(complx)
docx_describe_tbls(complx)
Extract all tables from a Word document
Description
Extract all tables from a Word document
Usage
docx_extract_all(docx, guess_header = TRUE, preserve = FALSE, trim = TRUE)
Arguments
docx
docx object read with read_docx
guess_header
should the function make a guess as to the existence of
a header in a table? (Default: TRUE)
preserve
preserve line breaks within a cell? (Default: FALSE). NOTE: This overrides trim.
trim
trim leading/trailing whitespace (if any) in cells? (default: TRUE)
Value
list of data.frames or an empty list if no
tables exist in docx
See Also
assign_colnames, docx_extract_tbl
Examples
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)
# get all the tables
tbls <- docx_extract_all_tbls(real_world)
Extract all comments from a Word document
Description
Extract all comments from a Word document
Usage
docx_extract_all_cmnts(docx, include_text = FALSE)
Arguments
docx
docx object read with read_docx
include_text
if TRUE then the text associated with the comment will
also be included
Value
data_frame of comment id, author & text
Examples
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))
docx_cmnt_count(cmnts)
docx_describe_cmnts(cmnts)
docx_extract_all_cmnts(cmnts)
Extract all tables from a Word document
Description
Extract all tables from a Word document
Usage
docx_extract_all_tbls(docx, guess_header = TRUE, preserve = FALSE, trim = TRUE)
Arguments
docx
docx object read with read_docx
guess_header
should the function make a guess as to the existence of
a header in a table? (Default: TRUE)
preserve
preserve line breaks within a cell? (Default: FALSE). NOTE: This overrides trim.
trim
trim leading/trailing whitespace (if any) in cells? (default: TRUE)
Value
list of data.frames or an empty list if no
tables exist in docx
See Also
assign_colnames, docx_extract_tbl
Examples
# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
docx_tbl_count(real_world)
# get all the tables
tbls <- docx_extract_all_tbls(real_world)
Extract a table from a Word document
Description
Given a document read with read_docx and a table to extract (optionally
indicating whether there was a header or not and if cell whitespace trimming is
desired), extract the contents of the table to a data.frame.
Usage
docx_extract_tbl(
docx,
tbl_number = 1,
header = TRUE,
preserve = FALSE,
trim = TRUE
)
Arguments
docx
docx object read with read_docx
tbl_number
which table to extract (defaults to 1)
header
assume first row of table is a header row? (default: TRUE)
preserve
preserve line breaks within a cell? Default: FALSE. NOTE: This overrides trim.
trim
trim leading/trailing whitespace (if any) in cells? (default: TRUE)
Value
data.frame
See Also
docx_extract_all, docx_extract_tbl,
assign_colnames
Examples
doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))
docx_extract_tbl(doc3, 3)
intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
docx_extract_tbl(intracell_whitespace, 2, preserve=FALSE)
docx_extract_tbl(intracell_whitespace, 2, preserve=TRUE)
Get number of tables in a Word document
Description
Get number of tables in a Word document
Usage
docx_tbl_count(docx)
Arguments
docx
docx object read with read_docx
Value
numeric
Examples
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
docx_tbl_count(complx)
Extract Data Tables and Comments from 'Microsoft' 'Word' Documents
Description
Microsoft Word 'docx' files provide an XML structure that is fairly straightforward to navigate, especially when it applies to Word tables. The 'docxtractr' package provides tools to determine table count + table structure and extract tables from Microsoft Word 'docx' documents. It also provides tools to determine comment count and extract comments from Word 'docx' documents.
Author(s)
Bob Rudis (bob@rud.is)
Make Column Names Great Again
Description
Replace punctuation and spaces in column names with underscores and convert the names to lower case.
Usage
mcga(tbl)
Arguments
tbl
a data.frame-like object
Value
whatever class tbl was but with truly great, really great column names. They're amazing.
Trust me. They'll be incredible column names once we're done.
Examples
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))
tbls <- docx_extract_all_tbls(real_world)
mcga(assign_colnames(tbls[[1]], 2))
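To make the renaming rule concrete, here is a rough base R approximation applied to a few made-up column names. The exact regular expressions, and the trimming of leading and trailing underscores, are assumptions for illustration and may differ from what mcga does internally.

```r
# Hypothetical messy column names of the kind found in scraped tables.
messy <- c("Site Name", "Count (n)", "% Change")

# Assumed rule: lower-case, turn each run of punctuation/whitespace into a
# single underscore, then strip any underscore left at either end.
tidy <- tolower(messy)
tidy <- gsub("[[:punct:][:space:]]+", "_", tidy)
tidy <- gsub("^_|_$", "", tidy)

print(tidy)  # "site_name" "count_n" "change"
```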
Display information about the document
Description
Display information about the document
Usage
## S3 method for class 'docx'
print(x, ...)
Arguments
x
docx object
...
ignored
Read in a Word document for table extraction
Description
Reads a Word document from a local file path or a URL pointing to a .docx file.
Can also take a .doc file as input if LibreOffice is installed
(see https://www.libreoffice.org/ for more info and to download).
Usage
read_docx(path, track_changes = NULL)
Arguments
path
path to the Word document
track_changes
if not NULL (the default) then must be one of
"accept" or "reject" which will, respectively, accept all or
reject all changes. NOTE: this functionality relies on the
pandoc utility being available on the system PATH. Both
system PATH and the RSTUDIO_PANDOC (RStudio ships with
a copy of pandoc) environment variables will be checked.
If no pandoc binary is found then a warning will be issued
and the document will be read without integrating or ignoring
any tracked changes. The original Word document will not be modified
and this feature only works with docx files.
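Before passing track_changes, it can help to check whether a pandoc binary is discoverable in the two places mentioned above. The following is a small sketch using only base R; find_pandoc() is a hypothetical helper, and treating RSTUDIO_PANDOC as a directory that contains the binary is an assumption, not something stated in this manual.

```r
# find_pandoc() is an illustrative helper, NOT part of docxtractr.
find_pandoc <- function() {
  # 1. Look for pandoc on the system PATH.
  on_path <- unname(Sys.which("pandoc"))
  if (nzchar(on_path)) return(on_path)

  # 2. Fall back to RSTUDIO_PANDOC (assumed here to name a directory).
  rstudio_dir <- Sys.getenv("RSTUDIO_PANDOC")
  if (nzchar(rstudio_dir)) {
    candidates <- file.path(rstudio_dir, c("pandoc", "pandoc.exe"))
    found <- candidates[file.exists(candidates)]
    if (length(found) > 0) return(found[1])
  }

  # Nothing found: return an empty string, mirroring Sys.which().
  ""
}

pandoc_path <- find_pandoc()
if (nzchar(pandoc_path)) {
  message("pandoc found at: ", pandoc_path)
} else {
  message("pandoc not found; tracked changes would not be processed")
}
```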
Examples
doc <- read_docx(system.file("examples/data.docx", package="docxtractr"))
class(doc)
doc <- read_docx(
system.file("examples/trackchanges.docx", package="docxtractr"),
track_changes = "accept"
)
## Not run:
# from a URL
budget <- read_docx(
"http://rud.is/dl/1.DOCX")
## End(Not run)
Point to Local soffice.exe File
Description
Function to set an option that points to the local LibreOffice file
soffice.exe.
Usage
set_libreoffice_path(path)
Arguments
path
path to the LibreOffice soffice file
Details
For a list of possible file path locations for soffice.exe,
see https://github.com/hrbrmstr/docxtractr/issues/5#issuecomment-233181976
Value
Returns nothing, function sets the option variable
path_to_libreoffice.
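Since the function works by setting an option, the stored value can be inspected with getOption(). The sketch below mimics that mechanism in base R under stated assumptions: set_path_sketch() is a hypothetical stand-in, and it skips any validation (such as checking that the file exists) that the real function may perform.

```r
# set_path_sketch() is an illustrative stand-in, NOT the package function.
set_path_sketch <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  # Store the path in the option named in the Value section.
  options(path_to_libreoffice = path)
  invisible(NULL)
}

# Remember any existing value so the sketch can restore it afterwards.
old_value <- getOption("path_to_libreoffice")

set_path_sketch("local/path/to/soffice.exe")
stored <- getOption("path_to_libreoffice")
print(stored)

# Restore the previous setting.
options(path_to_libreoffice = old_value)
```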
Examples
## Not run:
set_libreoffice_path("local/path/to/soffice.exe")
## End(Not run)