detex
a filter to strip TeX commands from a .tex file.
see also :
tex
Synopsis
detex [ -clnstw ] [ -e environment-list ] [ filename[.tex] ... ]
examples
source
echo "Word count:" `detex "$1" | wc -w`
echo "Symbol count:" `detex "$1" | wc -m`
source
WCW=`detex thesis.tex | wc -w`
echo -n $WCW " "
sleep 1
done
else
twitter | tail -n 1 | gawk '{print $3;}' > .count || exit
WCW=`detex thesis.tex | wc -w`
OLD_WCW=`cat .count`
if [ $WCW -gt $(($OLD_WCW + 100)) ]
then
source
(( PDFTOTEX_COUNT = (${PDFTOTEX_COUNT_1} + ${PDFTOTEX_COUNT_2} + ${PDFTOTEX_COUNT_3}) / 3 ))
DETEX_COUNT=$(detex ${TEX_FILE} | wc -w)
(( WORD_COUNT_MEAN = (${PDFTOTEX_COUNT} + 4 * ${DETEX_COUNT}) / 5 ))
echo "There are ${WORD_COUNT_MEAN} (${PDFTOTEX_COUNT}, ${DETEX_COUNT}) on ${PAGE_COUNT} pages"
exit $EXIT;
source
else
detex $1 | tr -cd '0-9A-Z a-z\n' | wc -w
fi
source
echo Detexed words:" "`detex thesis.tex | wc -w`
echo LaTeX files:" "`ls *.tex | wc -l`
echo Total dir size:" "`du -sh . | cut -f 1`
echo 5 most used words:" "`detex thesis.tex | tr " "[:punct:] "\n" | tr [:upper:] [:lower:] | sort | uniq -c | sort -rn | awk 'length($NF)>4 {print $NF"("$1")"}' | head -n 5`
description
Detex (Version 2.6) reads each file in sequence, removes all
comments and TeX control sequences, and writes the remainder
to the standard output. All text in math mode and display mode
is removed. By default, detex follows \input commands. If a file
cannot be opened, a warning message is printed and the command
is ignored. If the -n option is used, no \input or \include
commands will be processed. This allows single file processing.
If no input file is given on the command line,
detex reads from standard input.
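
As a minimal sketch of the input handling described above, the following snippet writes a small sample file, runs detex on it with -n, and then feeds the same file through standard input. The file name sample.tex and its contents are made up for illustration, and the snippet assumes detex is on the PATH (otherwise it only prints a notice).

```shell
# Create a throwaway sample file; the name and contents are illustrative only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.tex" <<'EOF'
\documentclass{article}
\begin{document}
Plain words stay. % this comment is dropped
Inline math $E = mc^2$ is removed.
\end{document}
EOF

if command -v detex >/dev/null 2>&1; then
    # -n: process this file only, without following \input or \include.
    detex -n "$tmpdir/sample.tex"
    # No filename given: detex reads from standard input.
    detex < "$tmpdir/sample.tex" | wc -w
else
    echo "detex is not installed; skipping" >&2
fi
```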
If the magic sequence "\begin{document}" appears in the text,
detex assumes it is dealing with LaTeX source and recognizes
additional constructs used in LaTeX. These include the \include
and \includeonly commands. The -l option can be used to force
LaTeX mode and the -t option can be used to force TeX mode
regardless of input content.
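
The mode switches matter most for fragments that lack "\begin{document}", such as a chapter file processed on its own. The sketch below runs both modes on a hypothetical chapter.tex so the two outputs can be compared; it assumes detex is installed and makes no claim about the exact text each mode prints.

```shell
# A chapter fragment without \begin{document}; name and contents are illustrative.
tmpdir=$(mktemp -d)
cat > "$tmpdir/chapter.tex" <<'EOF'
Visible text before the equation.
\begin{equation}
a^2 + b^2 = c^2
\end{equation}
Visible text after the equation.
EOF

if command -v detex >/dev/null 2>&1; then
    # -l: treat the input as LaTeX even though \begin{document} is absent.
    detex -l "$tmpdir/chapter.tex"
    # -t: treat the input as plain TeX regardless of its content.
    detex -t "$tmpdir/chapter.tex"
else
    echo "detex is not installed; skipping" >&2
fi
```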
Text in various environment modes of LaTeX is ignored. The
default modes are array, eqnarray, equation, figure, mathmatica,
picture, table and verbatim. The -e option can be used to
specify a comma separated environment-list of environments to
ignore. The list replaces the defaults so specifying an empty
list effectively causes no environments
to be ignored.
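
Because the -e list replaces the defaults rather than extending them, any default environment left off the list is no longer ignored. A small sketch, assuming detex is installed and using a made-up file envs.tex:

```shell
# Sample with a verbatim block; name and contents are illustrative only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/envs.tex" <<'EOF'
\documentclass{article}
\begin{document}
Body text.
\begin{verbatim}
Listingword
\end{verbatim}
\end{document}
EOF

if command -v detex >/dev/null 2>&1; then
    # Default list: verbatim is among the ignored environments.
    detex "$tmpdir/envs.tex"
    # Custom list: only equation and eqnarray are ignored,
    # so the verbatim text is no longer dropped.
    detex -e equation,eqnarray "$tmpdir/envs.tex"
else
    echo "detex is not installed; skipping" >&2
fi
```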
The -c option can be used in LaTeX mode to have detex echo the
arguments to \cite, \ref, and \pageref macros. This can be
useful when sending the output to a
style checker.
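
To see the effect of -c, the two runs below can be compared side by side. This is only a sketch: the file cite.tex and the keys knuth and intro are invented, and detex is assumed to be installed.

```shell
# Sample with citation and reference macros; keys are illustrative only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/cite.tex" <<'EOF'
\documentclass{article}
\begin{document}
As shown by \cite{knuth} in Section \ref{intro}.
\end{document}
EOF

if command -v detex >/dev/null 2>&1; then
    # Without -c the arguments of \cite and \ref are not echoed.
    detex "$tmpdir/cite.tex"
    # With -c the arguments are echoed into the output.
    detex -c "$tmpdir/cite.tex"
else
    echo "detex is not installed; skipping" >&2
fi
```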
Detex assumes the standard character classes are being used for
TeX. Detex allows white space between control sequences and
magic characters like ’{’ when recognizing things like LaTeX
environments.
If the -w flag is given, the output is a word list, one ’word’
(string of two or more letters and apostrophes beginning with a
letter) per line, and all other characters ignored. Without -w
the output follows the original, with the deletions mentioned
above. Newline characters are preserved where possible so that
the lines of output match the input as closely as possible.
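
Since -w already emits one word per line, a word-frequency count needs no extra splitting step. The helper top_words below is a hypothetical name, not part of detex, and the sample file is invented; the sketch assumes detex and the usual sort, uniq and head utilities are available.

```shell
# top_words FILE [N]: hypothetical helper that prints the N most frequent
# words in FILE (default 5), using the one-word-per-line output of detex -w.
top_words() {
    detex -w "$1" | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -n "${2:-5}"
}

# Plain-text sample; name and contents are illustrative only.
tmpdir=$(mktemp -d)
printf 'alpha beta alpha gamma alpha beta\n' > "$tmpdir/words.tex"

if command -v detex >/dev/null 2>&1; then
    top_words "$tmpdir/words.tex" 3
else
    echo "detex is not installed; skipping" >&2
fi
```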
The TEXINPUTS environment variable is used to find \input and
\include files. Like TeX, it interprets a leading or trailing
’:’ as the default TEXINPUTS. It does not support the ’//’
directory expansion magic
sequence.
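
A sketch of how TEXINPUTS might be set for a project that keeps its chapters in a subdirectory. The directory layout and file names are assumptions made for the example; the trailing ':' asks for the default search path to be appended, as described above.

```shell
# Hypothetical layout: main.tex pulls in chapters/intro.tex via \input.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/chapters"
printf 'Text from the included chapter.\n' > "$tmpdir/chapters/intro.tex"
cat > "$tmpdir/main.tex" <<'EOF'
\documentclass{article}
\begin{document}
\input{intro}
\end{document}
EOF

if command -v detex >/dev/null 2>&1; then
    # Search chapters/ first; the trailing ':' appends the default path.
    TEXINPUTS="$tmpdir/chapters:" detex "$tmpdir/main.tex"
else
    echo "detex is not installed; skipping" >&2
fi
```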
Detex now handles the basic TeX ligatures as a special case,
replacing the ligatures with acceptable character substitutes.
This eliminates spelling errors introduced by merely removing
them. The ligatures are \aa, \ae, \oe, \ss, \o, \l (and their
upper-case equivalents). The special "dotless" characters \i
and \j are also replaced with i and j respectively.
Note that previous versions of detex would replace control
sequences with a space character to prevent words from running
together. However, this caused accents in the middle of words to
break words, generating "spelling errors" that were not
desirable. Therefore, the new version merely removes these
accents. The old functionality can be essentially duplicated by
using the -s
option.
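
The difference between the two behaviours is easiest to judge by running both on a line with accented words. The sample below is invented and the sketch assumes detex is installed; no particular output is claimed for either run.

```shell
# Sample with accents inside words; name and contents are illustrative only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/accents.tex" <<'EOF'
A na\"{\i}ve caf\'e example.
EOF

if command -v detex >/dev/null 2>&1; then
    # Current behaviour: control sequences are removed.
    detex "$tmpdir/accents.tex"
    # -s: control sequences are replaced with a space, as in older versions.
    detex -s "$tmpdir/accents.tex"
else
    echo "detex is not installed; skipping" >&2
fi
```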
diagnostics
Nesting of \input is allowed but the number of opened files must
not exceed the system’s limit on the number of simultaneously
opened files. Detex ignores unrecognized option characters
after printing a warning message.
bugs
Detex is not a complete TeX interpreter, so it can be confused
by some constructs. Most errors result in too much rather
than too little output.
Running LaTeX source without a "\begin{document}" through
detex may produce errors.
Suggestions for
improvements are (mildly) encouraged.
see also
tex (1L)
author
Daniel Trinkle,
Computer Science Department, Purdue University