% -*- mode: noweb; noweb-code-mode: lua-mode -*-
\documentclass{article}
\usepackage{fullpage}
\usepackage{noweb,url}
\usepackage[hypertex]{hyperref}
\noweboptions{smallcode}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\def\NbibTeX{{\rm N\kern-.05em{\sc bi\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\let\bibtex\BibTeX
\let\nbibtex\NbibTeX
\title{A Replacement for \bibtex\\
(Version <VERSION>)}
\author{Norman Ramsey}
\setcounter{tocdepth}{2} %% keep TOC on one page
\def\lbrace{\char123}
\def\rbrace{\char125}
\begin{document}
@
\maketitle
\tableofcontents
\clearpage
\section{Overview}
The code herein comprises the ``nbib'' package,
which is a collection of tools to help authors take better
advantage of \BibTeX\ data, especially when working in collaboration.
The driving technology is that instead of using \BibTeX\
``keys,'' which are chosen arbitrarily and idiosyncratically,
nbib builds a bibliography by searching the contents of
citations.
\begin{itemize}
\item
\texttt{nbibtex} is a drop-in replacement for \texttt{bibtex}.
Authors' \verb+\cite{+\ldots\kern-2pt \verb+}+
commands are interpreted either as classic \bibtex\ keys (for
backward compatibility) or as search commands.
Thus, if your
bibliography contains the classic paper on type inference, \texttt{nbibtex}
should find it using a citation like
\verb+\cite{damas-milner:1978}+, or
\verb+\cite{damas-milner:polymorphism}+, or perhaps even simply
\verb+\cite{damas-milner}+---\emph{regardless} of the \bibtex\ key you
may have chosen.
The same citations should also work with
your coauthors' bibliographies, even if they are keyed
differently.
\item
\texttt{nbibfind} uses the nbib search engine on the command line. If you
know you are looking for a paper by Harper and Moggi, you can just
type
\begin{verbatim}
nbibfind harper-moggi
\end{verbatim}
and see what comes out.
\item
To help you work with coauthors who don't have the nbib package,
\texttt{nbibmake}\footnote
{Not yet implemented.}
examines a {\LaTeX} document and builds a custom
\texttt{.bib} file
just for that document.
\end{itemize}
\noindent
The package is written in a combination of~C and Lua:
\begin{itemize}
\item
Because I want nbib to be able to handle bibliographies with thousands
or tens of thousands of entries,
the code to parse a \texttt{.bib} ``database'' is written in~C.
A~computer bought in 2003 can parse over 15,000~entries per second.
\item
Because the search for \bibtex\ entries requires string searching on
every entry,
the string search is also written in~C (and uses Boyer-Moore).
\item
Because string manipulation is much more easily done in Lua,
all the code that converts a \bibtex\ entry into printed matter is
written in Lua,
as is all the ``driver'' code that implements various programs.
\end{itemize}
The net result is that \texttt{nbibtex} is about five times slower
than classic \texttt{bibtex}.
This slowdown is easy to observe when printing a bibliography
of several thousand entries,
but on a typical paper with fewer than fifty citations and a personal
bibliography with a thousand entries,
the pause is imperceptible.
\subsection{Compatibility}
I've made every effort to make \nbibtex\ compatible with \bibtex, so
that \nbibtex\ can be used on existing papers and should produce
the same output as \bibtex.
Regrettably, compatibility means avoiding modern treatment
of non-ASCII characters, such as are found in the ISO Latin-1
character set:
classic \bibtex\ simply treats every non-ASCII character as a letter.
\begin{itemize}
\item
It would be pleasant to set \nbibtex\ to use an
ISO~8859-1 locale, but this leads to incompatible output:
\nbibtex\ forces characters to lower case that \bibtex\ leaves alone.
<<pleasant code that results in incompatible output>>=
do
local locales =
{ "en_US", "en_AU", "en_CA", "en_GB", "fr_CA", "fr_CH", "fr_FR", }
for _, l in pairs(locales) do
if os.setlocale(l .. '.iso88591', 'ctype') then break end
end
end
@
\item
A much less pleasant alternative would be to abandon the support that Lua
provides for distinguishing letters from nonletters and instead
to try to do some sort of system-dependent character classification,
as is done in \bibtex.
I~don't have the stomach for it.
\item
The most principled solution I~can imagine would be to define a
special ``\bibtex\ locale,'' whose sole purpose would be to guarantee
compatibility with \bibtex.
But this potential solution looks like a
nightmare for software distribution.
\item
What I've done is proceed blithely with the user's current
locale, throwing in a hack here or there as needed to guarantee
compatibility with the test cases I~have in the default locale
I~happen to use.
The most notable case is [[bst.purify]], which is used to generate
keys for sorting.
\end{itemize}
Expedience carries the day. Feh.
@
\section{Parsing \texttt{.bib} files}
This section reads the \texttt{.bib} file(s).
<<nbib.c>>=
#include <stdio.h>
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <stdlib.h>
#include <stdarg.h>
#include <lua.h>
#include <lauxlib.h>
<<type definitions>>
<<function prototypes>>
<<initialized and uninitialized data>>
<<macro definitions>>
<<Procedures and functions for input scanning>>
<<function definitions>>
@
\subsection{Internal interfaces}
\subsubsection {Data structures}
For convenience in keeping function prototypes uncluttered,
all state associated with reading a particular \bibtex\ file is stored
in a single [[Bibreader]] abstraction.
That state is divided into three groups:
\begin{itemize}
\item
Fields that say what file we are reading and what is our position
within that file
\item
A~buffer that holds one line of the \texttt{.bib} file currently being
scanned
\item
State accessible from Lua: an interpreter;
a list of strings from the \texttt{.bib} preamble, which is exposed to
the client;
a warning function provided by the client;
and a macro table provided by the client and updated by
[[@string]] commands
\end{itemize}
In the buffer,
the meaningful characters are in the half-open interval $[{}[[buf]],
[[lim]])$,
and we reserve space for a sentinel at~[[lim]].
The invariant is that $[[buf]] \le [[cur]] < [[lim]]$
and $[[buf]]+[[bufsize]] \ge [[lim]]+1$.
<<type definitions>>=
typedef struct bibreader {
const char *filename; /* name of the .bib file */
FILE *file; /* .bib file open for read */
int line_num; /* line number of the .bib file */
int entry_line; /* line number of last seen entry start */
unsigned char *buf, *cur, *lim; /* input buffer */
unsigned bufsize; /* size of buffer */
char entry_close; /* character expected to close current entry */
lua_State *L;
int preamble; /* reference to preamble list of strings */
int warning; /* reference to universal warning function */
int macros; /* reference to macro table */
} *Bibreader;
@
The [[is_id_char]] array is used to define a predicate that says
whether a character is considered part of an identifier.
<<initialized and uninitialized data>>=
bool is_id_char[256]; /* needs initialization */
#define concat_char '#' /* used to concatenate parts of a field defn */
@
\subsubsection {Scanning}
Most internal functions are devoted to some form of scanning.
The model is a bit like Icon: scanning may succeed or fail, and it has
a side effect on the state of the reader---in particular the value of
the [[cur]] pointer, and possibly also the contents of the buffer.
(Unlike Icon, there is no backtracking.)
Success is represented as nonzero and failure as zero, using type [[bool]].
<<function prototypes>>=
typedef int bool;
@
Function [[getline]] refills the buffer with a new line (and updates
[[line_num]]), returning failure on end of file.
<<function prototypes>>=
static bool getline(Bibreader rdr);
@
Several scanning functions come in two flavors,
which depend on what happens at the end of a line:
the [[_getline]] flavor refills the buffer and keeps scanning;
the normal flavor fails.
Here are some functions that scan for combinations of particular
characters, whitespace, and nonwhite characters.
<<function prototypes>>=
static bool upto1(Bibreader rdr, char c);
static bool upto1_getline(Bibreader rdr, char c);
static void upto_white_or_1(Bibreader rdr, char c);
static void upto_white_or_2(Bibreader rdr, char c1, char c2);
static void upto_white_or_3(Bibreader rdr, char c1, char c2, char c3);
static bool upto_nonwhite(Bibreader rdr);
static bool upto_nonwhite_getline(Bibreader rdr);
@ Because there is always whitespace at the end of a line, the
[[upto_white_*]] flavor cannot fail.
@
Here are some more sophisticated scanning functions.
None attempts to return a value; instead each function scans past the
token in question, which the client can then find between the old and
new values of the [[cur]] pointer.
<<function prototypes>>=
static bool scan_identifier (Bibreader rdr, char c1, char c2, char c3);
static bool scan_nonneg_integer (Bibreader rdr, unsigned *np);
@
Continuing from low to high level, here are
functions used to scan fields, about which more below:
<<function prototypes>>=
static bool scan_and_buffer_a_field_token (Bibreader rdr, int key, luaL_Buffer *b);
static bool scan_balanced_braces(Bibreader rdr, char close, luaL_Buffer *b);
static bool scan_and_push_the_field_value (Bibreader rdr, int key);
@
Two utility functions used after scanning:
The [[lower_case]] function overwrites buffer characters with their
lowercase equivalents.
The [[strip_leading_and_trailing_space]] function removes leading and
trailing space characters from a string on top of the Lua stack.
<<function prototypes>>=
static void lower_case(unsigned char *p, unsigned char *lim);
static void strip_leading_and_trailing_space(lua_State *L);
@
\subsubsection{Other functions}
<<function prototypes>>=
static int get_bib_command_or_entry_and_process(Bibreader rdr);
int luaopen_bibtex (lua_State *L);
@
\subsubsection{Commands}
In addition to database entries, a \texttt{.bib} file may contain
the [[comment]], [[preamble]], and [[string]] commands.
Each is implemented by a function of type [[Command]], which is
associated with the name by [[find_command]].
<<function prototypes>>=
typedef bool (*Command)(Bibreader);
static Command find_command(unsigned char *p, unsigned char *lim);
static bool do_comment (Bibreader rdr);
static bool do_preamble(Bibreader rdr);
static bool do_string (Bibreader rdr);
@
\subsubsection{Error handling}
The [[warnv]] function is used to call the warning function supplied
by the Lua client.
In addition to the reader, it takes as arguments the number of results
expected and the signature of the arguments.
(The warning function may receive any combination of string~([[s]]),
floating-point~([[f]]), and integer~([[d]]) arguments;
the [[fmt]] string gives the sequence of the arguments that follow.)
<<function prototypes>>=
static void warnv(Bibreader rdr, int nres, const char *fmt, ...);
@
There's a lot of crap here to do with reporting errors.
An error in a function called direct from Lua
pushes [[false]] and a message and returns~[[2]];
an error in a boolean function pushes the same but returns failure to
its caller.
I~hope to replace this code with native Lua error handling ([[lua_error]]).
<<macro definitions>>=
#define LERRPUSH(S) do { \
if (!lua_checkstack(rdr->L, 10)) assert(0); \
lua_pushboolean(rdr->L, 0); \
lua_pushfstring(rdr->L, "%s, line %d: ", rdr->filename, rdr->line_num); \
lua_pushstring(rdr->L, S); \
lua_concat(rdr->L, 2); \
} while(0)
#define LERRFPUSH(S,A) do { \
if (!lua_checkstack(rdr->L, 10)) assert(0); \
lua_pushboolean(rdr->L, 0); \
lua_pushfstring(rdr->L, "%s, line %d: ", rdr->filename, rdr->line_num); \
lua_pushfstring(rdr->L, S, A); \
lua_concat(rdr->L, 2); \
} while(0)
#define LERR(S) do { LERRPUSH(S); return 2; } while(0)
#define LERRF(S,A) do { LERRFPUSH(S,A); return 2; } while(0)
/* next: cases for Boolean functions */
#define LERRB(S) do { LERRPUSH(S); return 0; } while(0)
#define LERRFB(S,A) do { LERRFPUSH(S,A); return 0; } while(0)
@
\subsection{Reading a database entry}
Syntactically, a \texttt{.bib} file is a
sequence of entries, perhaps with a few \texttt{.bib} commands thrown
in.
Each entry consists of an at~sign, an entry
type, and, between braces or parentheses and separated by commas, a
database key and a list of fields. Each field consists of a field
name, an equals sign, and a nonempty list of field tokens separated by
[[concat_char]]s. Each field token is either a nonnegative number, a
macro name (like `jan'), or a brace-balanced string delimited by
either double quotes or braces. Finally, case differences are
ignored for all but delimited strings and database keys, and
whitespace characters and ends-of-line may appear in all reasonable
places (i.e., anywhere except within entry types, database keys, field
names, and macro names); furthermore, comments may appear anywhere
between entries (or before the first or after the last) as long as
they contain no at~signs.
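
For concreteness, here is a made-up entry (hypothetical, not drawn from
any real bibliography) that exercises each kind of field token: strings
delimited by quotes and by braces, a nonnegative number, a macro name,
and two tokens concatenated by the [[concat_char]]:

```bibtex
@article{doe:types,
  author = "Jane Doe",
  title  = {On {T}ype Inference},
  year   = 1990,
  month  = jan,
  note   = "revised " # "version",
}
```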
This function reads a database entry and pushes it on the Lua stack.
Any commands encountered before the database entry are executed.
If no entry remains, the function returns~0.
<<function definitions>>=
#undef ready_tok
#define ready_tok(RDR) do { \
if (!upto_nonwhite_getline(RDR)) \
LERR("Unexpected end of file"); \
} while(0)
static int get_bib_command_or_entry_and_process(Bibreader rdr) {
unsigned char *id, *key;
int keyindex;
bool (*command)(Bibreader);
getnext:
<<scan [[rdr]] up to and past the next [[@]] sign and skip white space (or return 0)>>
id = rdr->cur;
if (!scan_identifier (rdr, '{', '(', '('))
LERR("Expected an entry type");
lower_case (id, rdr->cur); /* ignore case differences */
<<if $[{}[[id]], \mbox{[[rdr->cur]]})$ points to a command, execute it and go to [[getnext]]>>
lua_pushlstring(rdr->L, (char *) id, rdr->cur - id); /* push entry type */
rdr->entry_line = rdr->line_num;
ready_tok(rdr);
<<scan past opening delimiter and set [[rdr->entry_close]]>>
ready_tok(rdr);
key = rdr->cur;
<<set [[rdr->cur]] to next whitespace, comma, or possibly [[}]]>>
lua_pushlstring(rdr->L, (char *) key, rdr->cur - key); /* push database key */
keyindex = lua_gettop(rdr->L);
lua_newtable(rdr->L); /* push table of fields */
ready_tok(rdr);
for (; *rdr->cur != rdr->entry_close; ) {
<<absorb comma (breaking if followed by [[rdr->entry_close]])>>
<<read a field-value pair and set it in the field table, which is on top of the Lua stack>>
ready_tok(rdr);
}
rdr->cur++; /* skip past close of entry */
return 3; /* entry type, key, table of fields */
}
@
<<scan [[rdr]] up to and past the next [[@]] sign and skip white space (or return 0)>>=
if (!upto1_getline(rdr, '@'))
return 0; /* no more entries; return nil */
assert(*rdr->cur == '@');
rdr->cur++; /* skip the @ sign */
ready_tok(rdr);
@
<<if $[{}[[id]], \mbox{[[rdr->cur]]})$ points to a command, execute it and go to [[getnext]]>>=
command = find_command(id, rdr->cur);
if (command) {
if (!command(rdr))
return 2; /* command put (false, message) on Lua stack; we're done */
goto getnext;
}
@
An entry is delimited either by braces or by parentheses;
in order to recognize the correct closing delimiter, we put it in
[[rdr->entry_close]].
<<scan past opening delimiter and set [[rdr->entry_close]]>>=
if (*rdr->cur == '{')
rdr->entry_close = '}';
else if (*rdr->cur == '(')
rdr->entry_close = ')';
else
LERR("Expected entry to open with { or (");
rdr->cur++;
@
I'm not quite sure why stopping at~[[}]] is conditional on the closing
delimiter in this way.
<<set [[rdr->cur]] to next whitespace, comma, or possibly [[}]]>>=
if (rdr->entry_close == '}') {
upto_white_or_1(rdr, ',');
} else {
upto_white_or_2(rdr, ',', '}');
}
@
At this point we're at a nonwhite token that is not the closing
delimiter.
If it's not a comma, there's big trouble---but even if it is,
the database may be using comma as a terminator, in which case a
closing delimiter signals the end of the entry.
<<absorb comma (breaking if followed by [[rdr->entry_close]])>>=
if (*rdr->cur == ',') {
rdr->cur++;
ready_tok(rdr);
if (*rdr->cur == rdr->entry_close) {
break;
}
} else {
LERR("Expected comma or end of entry");
}
@
The syntax for a field is \emph{identifier}\texttt{=}\emph{value}.
The field name is forced to lower case.
<<read a field-value pair and set it in the field table, which is on top of the Lua stack>>=
if (id = rdr->cur, !scan_identifier (rdr, '=', '=', '='))
LERR("Expected a field name");
lower_case(id, rdr->cur);
lua_pushlstring(rdr->L, (char *) id, rdr->cur - id); /* push field name */
ready_tok(rdr);
if (*rdr->cur != '=')
LERR("Expected '=' to follow field name");
rdr->cur++; /* skip over the [['=']] */
ready_tok(rdr);
if (!scan_and_push_the_field_value(rdr, keyindex))
return 2;
strip_leading_and_trailing_space(rdr->L);
<<if field is not already set, set it; otherwise warn>>
@
Official \bibtex\ does not permit duplicate entries for a single
field.
But in entries on the net, you see lots of such duplicates in such
unofficial fields as \texttt{reffrom}.
Because classic \bibtex\ doesn't report errors on fields that aren't
advertised by the \texttt{.bst} file, we don't want to just blat out a
whole bunch of warning messages.
So instead we dump the problem on the warning function provided by the Lua
client.
We therefore can't simply set the field in the field table:
we first look it up, and
if it is nil, we set it; otherwise we warn.
<<if field is not already set, set it; otherwise warn>>=
lua_pushvalue(rdr->L, -2); /* push key */
lua_gettable(rdr->L, -4);
if (lua_isnil(rdr->L, -1)) {
lua_pop(rdr->L, 1);
lua_settable(rdr->L, -3);
} else {
lua_pop(rdr->L, 1); /* off comes old value */
warnv(rdr, 0, "ssdsss", /* tag, file, line, cite-key, field, newvalue */
"extra field", rdr->filename, rdr->line_num,
lua_tostring(rdr->L, keyindex),
lua_tostring(rdr->L, -2), lua_tostring(rdr->L, -1));
lua_pop(rdr->L, 2); /* off come key and new value */
}
@
\subsection{Scanning functions}
\subsubsection{Scanning functions for fields}
@
While scanning fields, we are not operating in a toplevel function, so
the error handling for [[ready_tok]] needs to be a bit different.
<<Procedures and functions for input scanning>>=
#undef ready_tok
#define ready_tok(RDR) do { \
if (!upto_nonwhite_getline(RDR)) \
LERRB("Unexpected end of file"); \
} while(0)
@
Each field value is accumulated into a [[luaL_Buffer]] from the Lua
auxiliary library.
The buffer is always called~[[b]];
for conciseness, we use the macro [[copy_char]] to add a character to
it.
<<Procedures and functions for input scanning>>=
#define copy_char(C) luaL_putchar(b, (C))
@
A field value is a sequence of one or more tokens separated by a
[[concat_char]] ([[#]]~mark).
A~precondition for calling [[scan_and_push_the_field_value]] is that
[[rdr]] is pointing at a nonwhite character.
<<Procedures and functions for input scanning>>=
static bool scan_and_push_the_field_value (Bibreader rdr, int key) {
luaL_Buffer field;
luaL_checkstack(rdr->L, 10, "Not enough Lua stack to parse bibtex database");
luaL_buffinit(rdr->L, &field);
for (;;) {
if (!scan_and_buffer_a_field_token(rdr, key, &field))
return 0;
ready_tok(rdr); /* cur now points to [[concat_char]] or end of field */
if (*rdr->cur != concat_char) break;
else { rdr->cur++; ready_tok(rdr); }
}
luaL_pushresult(&field);
return 1;
}
@ Because [[ready_tok]] can [[return]] in case of error, we can't write
\begin{quote}
[[for(; *rdr->cur == concat_char; rdr->cur++, ready_tok(rdr))]].
\end{quote}
@
A field token is either a nonnegative number, a macro name (like
`jan'), or a brace-balanced string delimited by either double quotes
or braces.
Thus there are four possibilities for the first character
of the field token: If it's a left brace or a double quote, the
token (with balanced braces, up to the matching closing delimiter) is
a string; if it's a digit, the token is a number; if it's anything
else, the token is a macro name (and should thus have been defined by
either the \texttt{.bst}-file's \texttt{macro} command or the \texttt{.bib}-file's
\texttt{string} command). This function returns [[false]] if there was a
serious syntax error.
<<Procedures and functions for input scanning>>=
static bool scan_and_buffer_a_field_token (Bibreader rdr, int key, luaL_Buffer *b) {
unsigned char *p;
unsigned number;
*rdr->lim = ' ';
switch (*rdr->cur) {
case '{': case '"':
return scan_balanced_braces(rdr, *rdr->cur == '{' ? '}' : '"', b);
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
p = rdr->cur;
scan_nonneg_integer(rdr, &number);
luaL_addlstring(b, (char *)p, rdr->cur - p);
return 1;
default:
/* named macro */
p = rdr->cur;
if (!scan_identifier(rdr, ',', rdr->entry_close, concat_char))
LERRB("Expected a field part");
lower_case (p, rdr->cur); /* ignore case differences */
/* missing warning of macro name used in its own definition */
lua_pushlstring(rdr->L, (char *) p, rdr->cur - p); /* stack: name */
lua_getref(rdr->L, rdr->macros); /* stack: name macros */
lua_insert(rdr->L, -2); /* stack: name macros name */
lua_gettable(rdr->L, -2); /* stack: name defn */
lua_remove(rdr->L, -2); /* stack: defn */
<<if top of stack is nil, pop it and warn of undefined macro; else buffer it>>
return 1;
}
}
@
Here's another warning that's kicked out to the client.
Reason: standard \bibtex\ complains only if it intends to use the
entry in question.
<<if top of stack is nil, pop it and warn of undefined macro; else buffer it>>=
{ int t = lua_gettop(rdr->L);
if (lua_isnil(rdr->L, -1)) {
lua_pop(rdr->L, 1);
lua_pushlstring(rdr->L, (char *) p, rdr->cur - p);
warnv(rdr, 1, "ssdss", /* tag, file, line, key, macro */
"undefined macro", rdr->filename, rdr->line_num,
key ? lua_tostring(rdr->L, key) : NULL, lua_tostring(rdr->L, -1));
if (lua_isstring(rdr->L, -1))
luaL_addvalue(b);
else
lua_pop(rdr->L, 1);
lua_pop(rdr->L, 1);
} else {
luaL_addvalue(b);
}
assert(lua_gettop(rdr->L) == t-1);
}
@
This \texttt{.bib}-specific function scans and buffers a string with
balanced braces, stopping just past the matching [[close]].
The original \bibtex\ tries to optimize the common case of a field with
no internal braces; I~don't.
A~precondition for calling this function is that [[rdr->cur]] point at
the opening delimiter.
Whitespace is compressed to a single space character.
<<Procedures and functions for input scanning>>=
static int scan_balanced_braces(Bibreader rdr, char close, luaL_Buffer *b) {
unsigned char *p, *cur, c;
int braces = 0; /* number of currently open braces *inside* string */
rdr->cur++; /* scan past left delimiter */
*rdr->lim = ' ';
if (isspace(*rdr->cur)) {
copy_char(' ');
ready_tok(rdr);
}
for (;;) {
p = rdr->cur;
upto_white_or_3(rdr, '}', '{', close);
cur = rdr->cur;
for ( ; p < cur; p++) /* copy nonwhite, nonbrace characters */
copy_char(*p);
*rdr->lim = ' ';
c = *cur; /* will be whitespace if at end of line */
<<depending on [[c]], return or adjust [[braces]] and continue>>
}
}
@
Beastly complicated:
\begin{itemize}
\item
Space is compressed and scanned past.
\item
A closing delimiter ends the scan at brace level~0 and otherwise is
buffered.
\item
Braces adjust the [[braces]] count.
\end{itemize}
<<depending on [[c]], return or adjust [[braces]] and continue>>=
if (isspace(c)) {
copy_char(' ');
ready_tok(rdr);
} else {
rdr->cur++;
if (c == close) {
if (braces == 0) {
luaL_pushresult(b);
return 1;
} else {
copy_char(c);
if (c == '}')
braces--;
}
} else if (c == '{') {
braces++;
copy_char(c);
} else {
assert(c == '}');
if (braces > 0) {
braces--;
copy_char(c);
} else {
luaL_pushresult(b); /* restore invariant */
LERRB("Unexpected '}'");
}
}
}
@
\subsubsection {Low-level scanning functions}
Scan the reader up to the character requested or end of line;
fails if not found.
<<function definitions>>=
static bool upto1(Bibreader rdr, char c) {
unsigned char *p = rdr->cur;
unsigned char *lim = rdr->lim;
*lim = c;
while (*p != c)
p++;
rdr->cur = p;
return p < lim;
}
@
Scan the reader up to the character requested or end of file;
fails if not found.
<<function definitions>>=
static int upto1_getline(Bibreader rdr, char c) {
while (!upto1(rdr, c))
if (!getline(rdr))
return 0;
return 1;
}
@
Scan the reader up to the next whitespace or the one character requested.
Always succeeds, because the end of the line is whitespace.
<<function definitions>>=
static void upto_white_or_1(Bibreader rdr, char c) {
unsigned char *p = rdr->cur;
unsigned char *lim = rdr->lim;
*lim = c;
while (*p != c && !isspace(*p))
p++;
rdr->cur = p;
}
@
Scan the reader up to the next whitespace or either of two characters requested.
<<function definitions>>=
static void upto_white_or_2(Bibreader rdr, char c1, char c2) {
unsigned char *p = rdr->cur;
unsigned char *lim = rdr->lim;
*lim = c1;
while (*p != c1 && *p != c2 && !isspace(*p))
p++;
rdr->cur = p;
}
@
Scan the reader up to the next whitespace or any of three characters requested.
<<function definitions>>=
static void upto_white_or_3(Bibreader rdr, char c1, char c2, char c3) {
unsigned char *p = rdr->cur;
unsigned char *lim = rdr->lim;
*lim = c1;
while (!isspace(*p) && *p != c1 && *p != c2 && *p != c3)
p++;
rdr->cur = p;
}
@
This function scans over whitespace characters, stopping either at
the first nonwhite character or the end of the line, respectively
returning [[true]] or [[false]].
<<function definitions>>=
static bool upto_nonwhite(Bibreader rdr) {
unsigned char *p = rdr->cur;
unsigned char *lim = rdr->lim;
*lim = 'x';
while (isspace(*p))
p++;
rdr->cur = p;
return p < lim;
}
@
Scan past whitespace up to end of file if needed;
returns true iff nonwhite character found.
<<function definitions>>=
static int upto_nonwhite_getline(Bibreader rdr) {
while (!upto_nonwhite(rdr))
if (!getline(rdr))
return 0;
return 1;
}
@
\subsubsection{Actual input}
<<function definitions>>=
static bool getline(Bibreader rdr) {
char *result;
unsigned char *buf = rdr->buf;
int n;
result = fgets((char *)buf, rdr->bufsize, rdr->file);
if (result == NULL)
return 0;
rdr->line_num++;
for (n = strlen((char *)buf); buf[n-1] != '\n'; n = strlen((char *)buf)) {
/* failed to get whole line */
rdr->bufsize *= 2;
buf = rdr->buf = realloc(rdr->buf, rdr->bufsize);
assert(buf);
if (fgets((char *)buf+n,rdr->bufsize-n,rdr->file)==NULL) {
n = strlen((char *)buf) + 1; /* -1 below is incorrect without newline */
break; /* file ended without a newline */
}
}
rdr->cur = buf;
rdr->lim = buf+n-1; /* trailing newline not in string */
return 1;
}
@
\subsubsection{Medium-level scanning functions}
This procedure scans for an identifier, stopping at the first
character that is not an [[is_id_char]]; it refuses to scan at all if
the first character is a digit.
It succeeds only if the identifier is nonempty and ends at a
whitespace character, at the end of the line, or at one of the
``specified'' characters [[c1]], [[c2]], or [[c3]];
otherwise it fails and leaves [[rdr->cur]] unchanged.
By convention, when some calling
code really wants just one or two ``specified'' characters, it merely
repeats one of the characters.
<<Procedures and functions for input scanning>>=
static int scan_identifier (Bibreader rdr, char c1, char c2, char c3) {
unsigned char *p, *orig, c;
orig = p = rdr->cur;
if (!isdigit(*p)) {
/* scan until end-of-line or an [[illegal_id_char]] */
*rdr->lim = ' '; /* an illegal id character and also white space */
while (is_id_char[*p])
p++;
}
c = *p;
if (p > rdr->cur && (isspace(c) || c == c1 || c == c2 || c == c3)) {
rdr->cur = p;
return 1;
} else {
return 0;
}
}
@
This function scans for a nonnegative integer, stopping at the first
nondigit; it writes the resulting integer through [[np]].
It returns
[[true]] if the token was a legal nonnegative integer (i.e., consisted
of one or more digits).
<<Procedures and functions for input scanning>>=
static bool scan_nonneg_integer (Bibreader rdr, unsigned *np) {
unsigned char *p = rdr->cur;
unsigned n = 0;
*rdr->lim = ' '; /* sentinel */
while (isdigit(*p)) {
n = n * 10 + (*p - '0');
p++;
}
if (p == rdr->cur)
return 0; /* no digits */
else {
rdr->cur = p;
*np = n;
return 1;
}
}
@
This (currently unused) procedure scans for an integer, stopping at
the first nondigit; it discards the value it scans. It returns [[true]] if
the token was a legal integer (i.e., consisted of an optional
minus sign followed by one or more digits).
<<unused Procedures and functions for input scanning>>=
static bool scan_integer (Bibreader rdr) {
unsigned char *p = rdr->cur;
int n = 0;
int sign = 0; /* number of characters of sign */
*rdr->lim = ' '; /* sentinel */
if (*p == '-') {
sign = 1;
p++;
}
while (isdigit(*p)) {
n = n * 10 + (*p - '0');
p++;
}
if (p == rdr->cur)
return 0; /* no digits */
else {
rdr->cur = p;
return 1;
}
}
@
\subsection{C~utility functions}
@
<<function definitions>>=
static void lower_case(unsigned char *p, unsigned char *lim) {
for (; p < lim; p++)
*p = tolower(*p);
}
@
<<function definitions>>=
static void strip_leading_and_trailing_space(lua_State *L) {
const char *p;
int n;
assert(lua_isstring(L, -1));
p = lua_tostring(L, -1);
n = lua_strlen(L, -1);
if (n > 0 && (isspace(*p) || isspace(p[n-1]))) {
while(n > 0 && isspace(*p))
p++, n--;
while(n > 0 && isspace(p[n-1]))
n--;
lua_pushlstring(L, p, n);
lua_remove(L, -2);
}
}
@
\subsection{Implementations of the \bibtex\ commands}
On encountering an [[@]]\emph{identifier}, we ask if the
\emph{identifier} stands for a command and if so, return that command.
<<function definitions>>=
static Command find_command(unsigned char *p, unsigned char *lim) {
int n = lim - p;
assert(lim > p);
#define match(S) (!strncmp(S, (char *)p, n) && (S)[n] == '\0')
switch(*p) {
case 'c' : if (match("comment")) return do_comment; else break;
case 'p' : if (match("preamble")) return do_preamble; else break;
case 's' : if (match("string")) return do_string; else break;
}
return (Command)0;
}
@
%% \webindexsort{database-file commands}{\quad \texttt{comment}}
The \texttt{comment} command is implemented for SCRIBE compatibility. It's
not really needed because \BibTeX\ treats (flushes) everything not
within an entry as a comment anyway.
<<function definitions>>=
static bool do_comment(Bibreader rdr) {
return 1;
}
@
%% \webindexsort{database-file commands}{\quad \texttt{preamble}}
The \texttt{preamble} command lets a user have \TeX\ stuff inserted (by the
standard styles, at least) directly into the \texttt{.bbl} file. It is
intended primarily for allowing \TeX\ macro definitions used within
the bibliography entries (for better sorting, for example). One
\texttt{preamble} command per \texttt{.bib} file should suffice.
A \texttt{preamble} command has either braces or parentheses as outer
delimiters. Inside is the preamble string, which has the same syntax
as a field value: a nonempty list of field tokens separated by
[[concat_char]]s. There are three types of field tokens---nonnegative
numbers, macro names, and delimited strings.
This module does all the scanning (that's not subcontracted), but the
\texttt{.bib}-specific scanning function
[[scan_and_push_the_field_value]] actually stores the
value.
<<function definitions>>=
static bool do_preamble(Bibreader rdr) {
ready_tok(rdr);
<<scan past opening delimiter and set [[rdr->entry_close]]>>
ready_tok(rdr);
lua_rawgeti(rdr->L, LUA_REGISTRYINDEX, rdr->preamble);
lua_pushnumber(rdr->L, lua_objlen(rdr->L, -1) + 1);
if (!scan_and_push_the_field_value(rdr, 0))
return 0;
ready_tok(rdr);
if (*rdr->cur != rdr->entry_close)
LERRFB("Missing '%c' in preamble command", rdr->entry_close);
rdr->cur++;
lua_settable(rdr->L, -3);
lua_pop(rdr->L, 1); /* remove preamble */
return 1;
}
@
%% \webindexsort{database-file commands}{\quad \texttt{string}}
The \texttt{string} command is implemented both for SCRIBE compatibility
and for allowing a user to override a \texttt{.bst}-file \texttt{macro}
command, to define one that the \texttt{.bst} file doesn't, or to engage in
good, wholesome, typing laziness.
The \texttt{string} command does mostly the same thing as the
\texttt{.bst}-file's \texttt{macro} command (but the syntax is different and the
\texttt{string} command compresses white space). In fact, later in this
program, the term ``macro'' refers to either a \texttt{.bst} ``macro'' or a
\texttt{.bib} ``string'' (when it's clear from the context that it's not
a \texttt{WEB} macro).
A \texttt{string} command has either braces or parentheses as outer
delimiters. Inside is the string's name (it must be a legal
identifier, and case differences are ignored---all upper-case letters
are converted to lower case), then an equals sign, and the string's
definition, which has the same syntax as a field value: a nonempty
list of field tokens separated by [[concat_char]]s. There are three
types of field tokens---nonnegative numbers, macro names, and
delimited strings.
<<function definitions>>=
static bool do_string(Bibreader rdr) {
unsigned char *id;
int keyindex;
ready_tok(rdr);
<<scan past opening delimiter and set [[rdr->entry_close]]>>
ready_tok(rdr);
id = rdr->cur;
if (!scan_identifier(rdr, '=', '=', '='))
LERRB("Expected a string name followed by '='");
lower_case(id, rdr->cur);
lua_pushlstring(rdr->L, (char *)id, rdr->cur - id);
keyindex = lua_gettop(rdr->L);
ready_tok(rdr);
if (*rdr->cur != '=')
LERRB("Expected a string name followed by '='");
rdr->cur++;
ready_tok(rdr);
if (!scan_and_push_the_field_value(rdr, keyindex))
return 0;
ready_tok(rdr);
if (*rdr->cur != rdr->entry_close)
LERRFB("Missing '%c' in macro definition", rdr->entry_close);
rdr->cur++;
lua_getref(rdr->L, rdr->macros);
lua_insert(rdr->L, -3);
lua_settable(rdr->L, -3);
lua_pop(rdr->L, 1);
return 1;
}
@
\subsection{Interface to Lua}
First, we define Lua access to a reader.
<<function definitions>>=
static Bibreader checkreader(lua_State *L, int index) {
return luaL_checkudata(L, index, "bibtex.reader");
}
@
The reader's [[__index]] metamethod provides access to the
[[entry_line]] and [[preamble]] values as if they were fields of the
Lua table.
It also provides access to the [[next]] and [[close]] methods of the
reader object.
<<function definitions>>=
static int reader_meta_index(lua_State *L) {
Bibreader rdr = checkreader(L, 1);
const char *key;
if (!lua_isstring(L, 2))
return 0;
key = lua_tostring(L, 2);
if (!strcmp(key, "next"))
lua_pushcfunction(L, next_entry);
else if (!strcmp(key, "entry_line"))
lua_pushnumber(L, rdr->entry_line);
else if (!strcmp(key, "preamble"))
lua_rawgeti(L, LUA_REGISTRYINDEX, rdr->preamble);
else if (!strcmp(key, "close"))
lua_pushcfunction(L, closereader);
else
lua_pushnil(L);
return 1;
}
@
Here are the functions exported in the [[bibtex]] module:
<<function prototypes>>=
static int openreader(lua_State *L);
static int next_entry(lua_State *L);
static int closereader(lua_State *L);
<<initialized and uninitialized data>>=
static const struct luaL_reg bibtexlib [] = {
{"open", openreader},
{"close", closereader},
{"next", next_entry},
{NULL, NULL}
};
@
\newcommand\nt[1]{\rmfamily{\emph{#1}}}
\newcommand\optional[1]{\rmfamily{[}#1\rmfamily{]}}
To create a reader, we call
\begin{quote}
\texttt{openreader(\nt{filename},
\optional{\nt{macro-table}, \optional{\nt{warn-function}}})}
\end{quote}
The warning function will be called in one of the following ways:
\begin{itemize}
\item
warn([["extra field"]], \emph{file}, \emph{line}, \emph{citation-key},
\emph{field-name}, \emph{field-value})\\
Duplicate definition of a field in a single entry.
\item
warn([["undefined macro"]], \emph{file}, \emph{line}, \emph{citation-key},
\emph{macro-name})\\
Use of an undefined macro.
\end{itemize}
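To make the calling conventions concrete, here is a hypothetical warning
function of the shape [[openreader]] expects; the message formats below are
my own invention for illustration, not the ones \nbibtex\ actually prints.

```lua
-- Hypothetical sketch (not part of nbibtex): a warning function with the
-- calling conventions listed above, dispatching on the tag argument.
local function warn(tag, file, line, key, ...)
  if tag == "extra field" then
    local field = ...   -- first extra argument is the duplicated field name
    return string.format("%s:%d: ignoring duplicate field '%s' in entry '%s'",
                         file, line, field, key)
  elseif tag == "undefined macro" then
    local macro = ...
    return string.format("%s:%d: string name '%s' is undefined in entry '%s'",
                         file, line, macro, key)
  end
end
```

Such a function would be passed as the third argument to \texttt{bibtex.open}.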
<<function definitions>>=
#define INBUF 128 /* initial size of input buffer */
/* filename * macro table * warning function -> reader */
static int openreader(lua_State *L) {
const char *filename = luaL_checkstring(L, 1);
FILE *f = fopen(filename, "r");
Bibreader rdr;
if (!f) {
lua_pushnil(L);
lua_pushfstring(L, "Could not open file '%s'", filename);
return 2;
}
<<set items 2 and 3 on stack to hold macro table and optional warning function>>
rdr = lua_newuserdata(L, sizeof(*rdr));
luaL_getmetatable(L, "bibtex.reader");
lua_setmetatable(L, -2);
rdr->line_num = 0;
rdr->buf = rdr->cur = rdr->lim = malloc(INBUF);
assert(rdr->buf);
rdr->bufsize = INBUF;
rdr->file = f;
rdr->filename = malloc(lua_strlen(L, 1)+1);
assert(rdr->filename);
strncpy((char *)rdr->filename, filename, lua_strlen(L, 1)+1);
rdr->L = L;
lua_newtable(L);
rdr->preamble = luaL_ref(L, LUA_REGISTRYINDEX);
lua_pushvalue(L, 2);
rdr->macros = luaL_ref(L, LUA_REGISTRYINDEX);
lua_pushvalue(L, 3);
rdr->warning = luaL_ref(L, LUA_REGISTRYINDEX);
return 1;
}
@
<<set items 2 and 3 on stack to hold macro table and optional warning function>>=
if (lua_type(L, 2) == LUA_TNONE)
lua_newtable(L);
if (lua_type(L, 3) == LUA_TNONE)
lua_pushnil(L);
else if (!lua_isfunction(L, 3))
luaL_error(L, "Warning value to bibtex.open is not a function");
@
Reader method [[next_entry]] takes no parameters.
On success it returns a triple (\emph{type}, \emph{key},
\emph{field-table}).
On error it returns (\texttt{false}, \emph{message}).
On end of file it returns nothing.
<<function definitions>>=
static int next_entry(lua_State *L) {
Bibreader rdr = checkreader(L, 1);
if (!rdr->file)
luaL_error(L, "Tried to read from closed bibtex.reader");
return get_bib_command_or_entry_and_process(rdr);
}
@
Closing a reader recovers its resources;
the [[file]] field of a closed reader is [[NULL]].
<<function definitions>>=
static int closereader(lua_State *L) {
Bibreader rdr = checkreader(L, 1);
if (!rdr->file)
luaL_error(L, "Tried to close closed bibtex.reader");
fclose(rdr->file);
rdr->file = NULL;
free(rdr->buf);
rdr->buf = rdr->cur = rdr->lim = NULL;
rdr->bufsize = 0;
free((void*)rdr->filename);
rdr->filename = NULL;
rdr->L = NULL;
luaL_unref(L, LUA_REGISTRYINDEX, rdr->preamble);
rdr->preamble = 0;
luaL_unref(L, LUA_REGISTRYINDEX, rdr->warning);
rdr->warning = 0;
luaL_unref(L, LUA_REGISTRYINDEX, rdr->macros);
rdr->macros = 0;
return 0;
}
@
To help implement the call to the warning function, we have [[warnv]].
If there is no warning function, we return the number of nils specified by [[nres]].
<<function definitions>>=
static void warnv(Bibreader rdr, int nres, const char *fmt, ...) {
const char *p;
va_list vl;
lua_rawgeti(rdr->L, LUA_REGISTRYINDEX, rdr->warning);
if (lua_isnil(rdr->L, -1)) {
lua_pop(rdr->L, 1);
while (nres-- > 0)
lua_pushnil(rdr->L);
} else {
va_start(vl, fmt);
for (p = fmt; *p; p++)
switch (*p) {
case 'f': lua_pushnumber(rdr->L, va_arg(vl, double)); break;
case 'd': lua_pushnumber(rdr->L, va_arg(vl, int)); break;
case 's': {
const char *s = va_arg(vl, char *);
if (s == NULL) lua_pushnil(rdr->L);
else lua_pushstring(rdr->L, s);
break;
}
default: luaL_error(rdr->L, "invalid parameter type %c", *p);
}
lua_call(rdr->L, p - fmt, nres);
va_end(vl);
}
}
@
Here's where the library is initialized.
This is the only exported function in the whole file.
<<function definitions>>=
int luaopen_bibtex (lua_State *L) {
luaL_newmetatable(L, "bibtex.reader");
lua_pushstring(L, "__index");
lua_pushcfunction(L, reader_meta_index);
lua_settable(L, -3); /* metatable.__index = reader_meta_index */
luaL_register(L, "bibtex", bibtexlib);
<<initialize the [[is_id_char]] table>>
return 1;
}
@
In an identifier, we can accept any printing character except the ones
listed in the [[nonids]] string.
<<initialize the [[is_id_char]] table>>=
{
unsigned c;
static unsigned char *nonids = (unsigned char *)"\"#%'(),={} \t\n\f";
unsigned char *p;
for (c = 0; c <= 0377; c++)
is_id_char[c] = 1;
for (c = 0; c <= 037; c++)
is_id_char[c] = 0;
for (p = nonids; *p; p++)
is_id_char[*p] = 0;
}
@
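The predicate is easy to restate in Lua; this sketch merely mirrors the
C~initialization above and is only illustrative.

```lua
-- Illustrative Lua restatement of the C table above: a byte is legal in an
-- identifier iff it is neither a control character nor in the nonids set.
local nonids = "\"#%'(),={} \t\n\f"
local is_id_char = { }
for c = 0, 255 do is_id_char[c] = true end
for c = 0, 31 do is_id_char[c] = false end      -- control characters
for i = 1, #nonids do is_id_char[string.byte(nonids, i)] = false end
```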
\subsection{Main function for the nbib commands}
This code is the standalone main function for all the nbib commands.
\nextchunklabel{c-main}
<<nbibtex.c>>=
#include <stdlib.h>
#include <stdio.h>
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
extern int luaopen_bibtex(lua_State *L);
extern int luaopen_boyer_moore (lua_State *L);
int main (int argc, char *argv[]) {
int i, rc;
lua_State *L = luaL_newstate();
static const char* files[] = { SHARE "/bibtex.lua", SHARE "/natbib.nbs" };
#define OPEN(N) lua_pushcfunction(L, luaopen_ ## N); lua_call(L, 0, 0)
OPEN(base); OPEN(table); OPEN(io); OPEN(package); OPEN(string); OPEN(bibtex);
OPEN(boyer_moore);
for (i = 0; i < sizeof(files)/sizeof(files[0]); i++) {
if (luaL_dofile(L, files[i])) {
fprintf(stderr, "%s: error loading configuration file %s\n",
argv[0], files[i]);
exit(2);
}
}
lua_pushstring(L, "bibtex");
lua_gettable(L, LUA_GLOBALSINDEX);
lua_pushstring(L, "main");
lua_gettable(L, -2);
lua_newtable(L);
for (i = 0; i < argc; i++) {
lua_pushnumber(L, i);
lua_pushstring(L, argv[i]);
lua_settable(L, -3);
}
rc = lua_pcall(L, 1, 0, 0);
if (rc) {
fprintf(stderr, "Call failed: %s\n", lua_tostring(L, -1));
lua_pop(L, 1);
}
lua_close(L);
return rc;
}
@
\section{Implementation of \texttt{nbibtex}}
From here out, everything is written in Lua (\url{http://www.lua.org}).
The main module is [[bibtex]], and style-file support is in the
submodule [[bibtex.bst]].
Each has a [[doc]] submodule, which is intended as machine-readable
documentation.
<<bibtex.lua>>=
<<if not already present, load the C code for the [[bibtex]] module>>
local config = config or { } --- may be defined by config process
local workaround = {
badbibs = true, --- don't look at bad .bib files that come with teTeX
}
local bst = { }
bibtex.bst = bst
bibtex.doc = { }
bibtex.bst.doc = { }
bibtex.doc.bst = '# table of functions used to write style files'
@
Not much code is executed during startup, so the main issue is to
manage declaration before use.
I~have a few forward declarations in
[[<<declarations of internal functions>>]]; otherwise, count only on
``utility'' functions being declared before ``exported'' ones.
<<bibtex.lua>>=
local find = string.find
<<declarations of internal functions>>
<<Lua utility functions>>
<<exported Lua functions>>
<<check constant values for consistency>>
return bibtex
@
The Lua code relies on the C~code.
How we get the C~code depends on how
\texttt{bibtex.lua} is used; there are two alternatives:
\begin{itemize}
\item
In the distribution, \texttt{bibtex.lua} is loaded by the C~code in
chunk~\subpageref{c-main}, which defines the [[bibtex]] module.
\item
For standalone testing purposes, \texttt{bibtex.lua} can be loaded
directly into an
interactive Lua interpreter, in which case it loads the [[bibtex]]
module as a shared library.
\end{itemize}
<<if not already present, load the C code for the [[bibtex]] module>>=
if not bibtex then
local nbib = require 'nbib-bibtex'
bibtex = nbib
end
@
\subsection{Error handling, warning messages, and logging}
<<Lua utility functions>>=
local function printf (...) return io.stdout:write(string.format(...)) end
local function eprintf(...) return io.stderr:write(string.format(...)) end
@
I have to figure out what to do about errors --- the current code is bogus.
Among other things, I should be setting error levels.
<<Lua utility functions>>=
local function bibwarnf (...) eprintf(...); eprintf('\n') end
local function biberrorf(...) eprintf(...); eprintf('\n') end
local function bibfatalf(...) eprintf(...); eprintf('\n'); os.exit(2) end
@
Logging? What logging?
<<Lua utility functions>>=
local function logf() end
@
\subsubsection{Support for delayed warnings}
Like classic \bibtex, \nbibtex\ typically warns only about entries
that are actually used.
This functionality is implemented by function [[hold_warning]], which
keeps warnings on ice until they are either returned by
[[held_warnings]] or thrown away by [[drop_warnings]].
The function [[emit_warning]] emits a warning message eagerly when
called;
it is used to issue warnings about entries we actually use, or if the
[[-strict]] option is given, to issue every warning.
<<Lua utility functions>>=
local hold_warning -- function suitable to pass to bibtex.open; holds
local emit_warning -- function suitable to pass to bibtex.open; prints
local held_warnings -- returns nil or list of warnings since last call
local drop_warnings -- drops warnings
local extra_ok = { reffrom = true }
-- set of fields about which we should not warn of duplicates
do
local warnfuns = { }
warnfuns["extra field"] =
function(file, line, cite, field, newvalue)
if not extra_ok[field] then
bibwarnf("Warning--I'm ignoring %s's extra \"%s\" field\n--line %d of file %s\n",
cite, field, line, file)
end
end
warnfuns["undefined macro"] =
function(file, line, cite, macro)
bibwarnf("Warning--string name \"%s\" is undefined\n--line %d of file %s\n",
macro, line, file)
end
function emit_warning(tag, ...)
return assert(warnfuns[tag])(...)
end
local held
function hold_warning(...)
held = held or { }
table.insert(held, { ... })
end
function held_warnings()
local h = held
held = nil
return h
end
function drop_warnings()
held = nil
end
end
@
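The hold/collect/drop pattern is easy to exercise in isolation; this toy run
assumes nothing beyond the shape of the functions above.

```lua
-- Toy demonstration of the deferred-warning pattern above.
local held
local function hold_warning(...) held = held or { }; table.insert(held, { ... }) end
local function held_warnings() local h = held; held = nil; return h end
local function drop_warnings() held = nil end

hold_warning("extra field", "a.bib", 10, "key1", "note", "x")
hold_warning("undefined macro", "a.bib", 12, "key1", "tcs")
local ws = held_warnings()   -- two held warnings; the buffer is now empty
```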
\subsection{Miscellany}
All this stuff is dubious.
<<Lua utility functions>>=
function table.copy(t)
local u = { }
for k, v in pairs(t) do u[k] = v end
return u
end
@
<<Lua utility functions>>=
local function open(f, m, what)
  local file, msg = io.open(f, m)
  if file then
    return file
  else
    (what or bibfatalf)('Could not open file %s: %s', f, msg)
  end
end
@
<<exported Lua functions>>=
local function entries(rdr, empty)
assert(not empty)
return function() return rdr:next() end
end
bibtex.entries = entries
bibtex.doc.entries = 'reader -> iterator # generate entries'
@
\subsection{Internal documentation}
We attempt to document everything!
<<exported Lua functions>>=
function bibtex:show_doc(title)
local out = bst.writer(io.stdout, 5)
local function outf(...) return out:write(string.format(...)) end
local allkeys, dkeys = { }, { }
for k, _ in pairs(self) do table.insert(allkeys, k) end
for k, _ in pairs(self.doc) do table.insert(dkeys, k) end
table.sort(allkeys)
table.sort(dkeys)
for i = 1, table.getn(dkeys) do
outf("%s.%-12s : %s\n", title, dkeys[i], self.doc[dkeys[i]])
end
local header
for i = 1, table.getn(allkeys) do
local k = allkeys[i]
if k ~= "doc" and k ~= "show_doc" and not self.doc[k] then
if not header then
outf('Undocumented keys in table %s:', title)
header = true
end
outf(' %s', k)
end
end
if header then outf('\n') end
end
bibtex.bst.show_doc = bibtex.show_doc
@
Here is the documentation for what's defined in C~code:
<<exported Lua functions>>=
bibtex.doc.open = 'filename -> reader # open a reader for a .bib file'
bibtex.doc.close = 'reader -> unit # close open reader'
bibtex.doc.next = 'reader -> type * key * field table # read an entry'
@
\subsection{Main function for \texttt{nbibtex}}
Actually, the same main function does for both \texttt{nbibtex} and
\texttt{nbibfind}; depending on how the program is called, it
delegates to [[bibtex.bibtex]] or [[bibtex.run_find]].
<<exported Lua functions>>=
bibtex.doc.main = 'string list -> unit # main program that dispatches on argv[0]'
function bibtex.main(argv)
if argv[1] == '-doc' then -- undocumented internal doco
bibtex:show_doc('bibtex')
bibtex.bst:show_doc('bst')
elseif find(argv[0], 'bibfind$') then
return bibtex.run_find(argv)
elseif find(argv[0], 'bibtex$') then
return bibtex.bibtex(argv)
else
error("Call me something ending in 'bibtex' or 'bibfind'; when called\n "..
argv[0]..", I don't know what to do")
end
end
@
<<exported Lua functions>>=
local permissive = false -- nbibtex extension (ignore missing .bib files, etc.)
local strict = false -- complain eagerly about errors in .bib files
local min_crossrefs = 2 -- how many crossref's required to add an entry?
local output_name = nil -- output file if not default
local bib_out = false -- output .bib format
bibtex.doc.bibtex = 'string list -> unit # main program for nbibtex'
function bibtex.bibtex(argv)
<<set bibtex options from [[argv]]>>
if table.getn(argv) < 1 then
bibfatalf('Usage: %s [-permissive|-strict|...] filename[.aux] [bibfile...]',
argv[0])
end
local auxname = table.remove(argv, 1)
local basename = string.gsub(string.gsub(auxname, '%.aux$', ''), '%.$', '')
auxname = basename .. '.aux'
local bblname = output_name or (basename .. '.bbl')
local blgname = basename .. (output_name and '.nlg' or '.blg')
local blg = open(blgname, 'w')
-- Here's what we accumulate by reading .aux files:
local bibstyle -- the bibliography style
local bibfiles = { } -- list of files named in order of file
local citekeys = { } -- list of citation keys from .aux
-- (in order seen, mixed case, no duplicates)
local cited_star = false -- .tex contains \cite{*} or \nocite{*}
<<using file [[auxname]], set [[bibstyle]], [[citekeys]], and [[bibfiles]]>>
if table.getn(argv) > 0 then -- override the bibfiles listed in the .aux file
bibfiles = argv
end
<<validate contents of [[bibstyle]], [[citekeys]], and [[bibfiles]]>>
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>
blg:close()
end
@
Options are straightforward.
<<set bibtex options from [[argv]]>>=
while table.getn(argv) > 0 and find(argv[1], '^%-') do
if argv[1] == '-terse' then
-- do nothing
elseif argv[1] == '-permissive' then
permissive = true
elseif argv[1] == '-strict' then
strict = true
elseif argv[1] == '-min-crossrefs' and argv[2] and find(argv[2], '^%d+$') then
min_crossrefs = assert(tonumber(argv[2]))
table.remove(argv, 1)
elseif string.find(argv[1], '^%-min%-crossrefs=(%d+)$') then
local _, _, n = string.find(argv[1], '^%-min%-crossrefs=(%d+)$')
min_crossrefs = assert(tonumber(n))
elseif string.find(argv[1], '^%-min%-crossrefs') then
biberrorf("Ill-formed option %s", argv[1])
elseif argv[1] == '-o' then
output_name = assert(argv[2])
table.remove(argv, 1)
elseif argv[1] == '-bib' then
bib_out = true
elseif argv[1] == '-help' then
help()
elseif argv[1] == '-version' then
printf("nbibtex version <VERSION>\n")
os.exit(0)
else
biberrorf('Unknown option %s', argv[1])
help(2)
end
table.remove(argv, 1)
end
@
<<Lua utility functions>>=
local function help(code)
printf([[
Usage: nbibtex [OPTION]... AUXFILE[.aux] [BIBFILE...]
Write bibliography for entries in AUXFILE to AUXFILE.bbl.
Options:
-bib write output as BibTeX source
-help display this help and exit
-o FILE write output to FILE (- for stdout)
-min-crossrefs=NUMBER include item after NUMBER cross-refs; default 2
-permissive allow missing bibfiles and (some) duplicate entries
-strict complain about any ill-formed entry we see
-version output version information and exit
Home page at http://www.eecs.harvard.edu/~nr/nbibtex.
Email bug reports to nr@eecs.harvard.edu.
]])
os.exit(code or 0)
end
@
\subsection{Reading all the aux files and validating the inputs}
We pay attention to four commands: [[\@input]], [[\bibdata]],
[[\bibstyle]], and [[\citation]].
<<using file [[auxname]], set [[bibstyle]], [[citekeys]], and [[bibfiles]]>>=
do
local commands = { } -- table of commands we recognize in .aux files
local function do_nothing() end -- default for unrecognized commands
setmetatable(commands, { __index = function() return do_nothing end })
<<functions for commands found in .aux files>>
commands['@input'](auxname) -- reads all the variables
end
@
<<functions for commands found in .aux files>>=
do
local auxopened = { } --- map filename to true/false
commands['@input'] = function (auxname)
if not find(auxname, '%.aux$') then
bibwarnf('Name of auxfile "%s" does not end in .aux\n', auxname)
end
<<mark [[auxname]] as opened (but fail if opened already)>>
local aux = open(auxname, 'r')
logf('Top-level aux file: %s\n', auxname)
for line in aux:lines() do
local _, _, cmd, arg = find(line, '^\\([%a%@]+)%s*{([^%}]+)}%s*$')
if cmd then commands[cmd](arg) end
end
aux:close()
end
end
<<mark [[auxname]] as opened (but fail if opened already)>>=
if auxopened[auxname] then
error("File " .. auxname .. " cyclically \\@input's itself")
else
auxopened[auxname] = true
end
@
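The single pattern above does all the parsing of aux-file lines.
For example (a sketch, using a made-up citation list):

```lua
-- The aux-file command pattern from the chunk above, applied to one line.
local line = [[\citation{damas-milner:1978,appel:1992}]]
local _, _, cmd, arg = string.find(line, '^\\([%a%@]+)%s*{([^%}]+)}%s*$')
-- cmd names the command; arg is the brace-delimited argument
```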
\bibtex\ expects the names of \texttt{.bib} files to be separated by commas.
The names are forced to lower case and should contain no spaces,
and the [[\bibdata]] command should appear exactly once.
<<functions for commands found in .aux files>>=
do
local bibdata_seen = false
function commands.bibdata(arg)
assert(not bibdata_seen, [[LaTeX provides multiple \bibdata commands]])
bibdata_seen = true
for bib in string.gmatch(arg, '[^,]+') do
assert(not find(bib, '%s'), 'bibname from LaTeX contains whitespace')
table.insert(bibfiles, string.lower(bib))
end
end
end
@
The style should be unique, and it should be known to us.
<<functions for commands found in .aux files>>=
function commands.bibstyle(stylename)
if bibstyle then
biberrorf('Illegal, another \\bibstyle command')
else
bibstyle = bibtex.style(string.lower(stylename))
if not bibstyle then
bibfatalf('There is no nbibtex style called "%s"', stylename)
end
end
end
@
We accumulated cited keys in [[citekeys]].
Keys may be duplicated, but the input should not contain two keys that
differ only in case.
<<functions for commands found in .aux files>>=
do
local keys_seen, lower_seen = { }, { } -- which keys have been seen already
function commands.citation(arg)
for key in string.gmatch(arg, '[^,]+') do
assert(not find(key, '%s'),
'Citation key {' .. key .. '} from LaTeX contains whitespace')
if key == '*' then
cited_star = true
elseif not keys_seen[key] then --- duplicates are OK
keys_seen[key] = true
local low = string.lower(key)
<<if another key with same lowercase, complain bitterly>>
if not cited_star then -- no more insertions after the star
table.insert(citekeys, key) -- must be key, not low,
-- so that keys in .bbl match .aux
end
end
end
end
end
@
<<if another key with same lowercase, complain bitterly>>=
if lower_seen[low] then
biberrorf("Citation key '%s' inconsistent with earlier key '%s'",
key, lower_seen[low])
else
lower_seen[low] = key
end
@
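A tiny, self-contained rerun of the check shows how a clash is detected
(the keys here are invented):

```lua
-- Sketch of the case-consistency check on citation keys.
local keys_seen, lower_seen, clash = { }, { }, nil
local function citation(key)
  if not keys_seen[key] then     -- exact duplicates are OK
    keys_seen[key] = true
    local low = string.lower(key)
    if lower_seen[low] then
      clash = { key, lower_seen[low] }   -- keys differ only in case
    else
      lower_seen[low] = key
    end
  end
end
citation('Milner:1978')
citation('Milner:1978')   -- exact duplicate: silently accepted
citation('milner:1978')   -- differs only in case: flagged
```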
After reading the variables, we do a little validation.
I~can't seem to make up my mind what should be done incrementally
while things are being read.
<<validate contents of [[bibstyle]], [[citekeys]], and [[bibfiles]]>>=
if not bibstyle then
bibfatalf('No \\bibliographystyle in original LaTeX')
end
if table.getn(bibfiles) == 0 then
bibfatalf('No .bib files specified --- no \\bibliography in original LaTeX?')
end
if table.getn(citekeys) == 0 and not cited_star then
biberrorf('No citations in document --- empty bibliography')
end
do --- check for duplicate bib entries
  local i = 1
  local seen = { }
  while i <= table.getn(bibfiles) do
    local bib = bibfiles[i]
    if seen[bib] then
      bibwarnf('Multiple references to bibfile "%s"', bib)
      table.remove(bibfiles, i)
    else
      seen[bib] = true
      i = i + 1
    end
  end
end
@
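The same remove-while-scanning idiom, shown on a plain list of strings:

```lua
-- Deduplicate a list in place, preserving first occurrences.
local files = { 'a', 'b', 'a', 'c', 'b' }
local i, seen = 1, { }
while i <= #files do
  if seen[files[i]] then
    table.remove(files, i)     -- shifts later elements left; do not advance i
  else
    seen[files[i]] = true
    i = i + 1
  end
end
-- files is now { 'a', 'b', 'c' }
```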
\subsection{Reading the entries from all the \bibtex\ files}
These are diagnostics that might be written to a log.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
logf("bibstyle == %q\n", bibstyle.name)
logf("consult these bibfiles:")
for _, bib in ipairs(bibfiles) do logf(" %s", bib) end
logf("\ncite these papers:\n")
for _, key in ipairs(citekeys) do logf(" %s\n", key) end
if cited_star then logf(" and everything else in the database\n") end
@
Each bibliography file is opened with [[openbib]].
Unlike classic \bibtex, we can't simply select the first entry
matching a citation key.
Instead, we read all entries into [[bibentries]] and do searches later.
The easy case is when we're not permissive: we put all the entries
into one list, just as if they had come from a single \texttt{.bib} file.
But if we're permissive, duplicates in different bibfiles are OK: we
will search one bibfile after another and stop after the first
successful search---thus instead of a single list, we have a list of
lists.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
local bibentries = { } -- if permissive, list of lists, else list of entries
local dupcheck = { } -- maps lower key to entry
local preamble = { } -- accumulates preambles from all .bib files
local got_one_bib = false -- did we open even one .bib file?
<<definition of function [[openbib]], which sets [[get_one_bib]] if successful>>
local warnings = { } -- table of held warnings for each entry
local macros = bibstyle.macros() -- must accumulate macros across .bib files
for _, bib in ipairs(bibfiles) do
local bibfilename, rdr = openbib(bib, macros)
if rdr then
local t -- list that will receive entries from this reader
if permissive then
t = { }
table.insert(bibentries, t)
else
t = bibentries
end
local localdupcheck = { } -- lower key to entry; finds duplicates within this file
for type, key, fields, file, line in entries(rdr) do
if type == nil then
break
elseif type then -- got something without error
local e = { type = type, key = key, fields = fields,
file = bibfilename, line = rdr.entry_line }
warnings[e] = held_warnings()
<<definition of local function [[not_dup]]>>
local ok1, ok2 = not_dup(localdupcheck), not_dup(dupcheck) -- evaluate both
if ok1 and ok2 then
table.insert(t, e)
end
end
end
for _, l in ipairs(rdr.preamble) do table.insert(preamble, l) end
rdr:close()
end
end
if not got_one_bib then
bibfatalf("Could not open any of the following .bib files: %s",
table.concat(bibfiles, ' '))
end
@ Because the preamble is accumulated as the \texttt{.bib} file is
read, it must be copied at the end.
@
Here we open files.
If we're not being permissive, we must open each file successfully.
If we're permissive, it's enough to get at least one.
To find the pathname for a bib file, we use [[bibtex.bibpath]].
<<definition of function [[openbib]], which sets [[get_one_bib]] if successful>>=
local function openbib(bib, macros)
macros = macros or bibstyle.macros()
local filename, msg = bibtex.bibpath(bib)
if not filename then
if not permissive then biberrorf("Cannot find file %s.bib", bib) end
return
end
local rdr = bibtex.open(filename, macros, strict and emit_warning or hold_warning)
if not rdr and not permissive then
biberrorf("Cannot open file %s.bib", bib)
return
end
got_one_bib = true
return filename, rdr
end
@
\subsubsection{Duplication checks}
There's a great deal of nuisance to checking the integrity of a
\texttt{.bib} file.
<<definition of local function [[not_dup]]>>=
<<abstraction exporting [[savecomplaint]] and [[issuecomplaints]]>>
local k = string.lower(key)
local function not_dup(dup)
local e1, e2 = dup[k], e
if e1 then
-- do return false end --- avoid extra msgs for now
local diff = entries_differ(e1, e2)
if diff then
local verybad = not permissive or e1.file == e2.file
local complain = verybad and biberrorf or bibwarnf
if e1.key == e2.key then
if verybad then
savecomplaint(e1, e2, complain,
"Ignoring second entry with key '%s' on file %s, line %d\n" ..
" (first entry occurred on file %s, line %d;\n"..
" entries differ in %s)\n",
e2.key, e2.file, e2.line, e1.file, e1.line, diff)
end
else
savecomplaint(e1, e2, complain,
"Entries '%s' on file %s, line %d and\n '%s' on file %s, line %d" ..
" have keys that differ only in case\n",
e1.key, e1.file, e1.line, e2.key, e2.file, e2.line)
end
elseif e1.file == e2.file then
savecomplaint(e1, e2, bibwarnf,
"Entry '%s' is duplicated in file '%s' at both line %d and line %d\n",
e1.key, e1.file, e1.line, e2.line)
elseif not permissive then
savecomplaint(e1, e2, bibwarnf,
"Entry '%s' appears both on file '%s', line %d and file '%s', line %d"..
"\n (entries are exact duplicates)\n",
e1.key, e1.file, e1.line, e2.file, e2.line)
end
return false
else
dup[k] = e
return true
end
end
@
Calling [[savecomplaint(e1, e2, complain, ...)]] takes the complaint
[[complain(...)]] and associates it with entries [[e1]] and [[e2]].
If we are operating in ``strict'' mode, the complaint is issued right
away; otherwise
calling [[issuecomplaints(e)]] issues the complaint lazily.
In non-strict, lazy mode, the outside world arranges to issue only
complaints with entries that are actually used.
<<abstraction exporting [[savecomplaint]] and [[issuecomplaints]]>>=
local savecomplaint, issuecomplaints
if strict then
function savecomplaint(e1, e2, complain, ...)
return complain(...)
end
function issuecomplaints(e) end
else
local complaints = { }
local function save(e, t)
complaints[e] = complaints[e] or { }
table.insert(complaints[e], t)
end
function savecomplaint(e1, e2, ...)
save(e1, { ... })
save(e2, { ... })
end
local function call(c, ...)
return c(...)
end
function issuecomplaints(e)
for _, c in ipairs(complaints[e] or { }) do
call(unpack(c))
end
end
end
@
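In the lazy (non-strict) case, a complaint is just an argument list replayed
on demand. Here is a stripped-down model, independent of the chunk above:

```lua
-- Stripped-down model of saving complaints per entry and replaying them.
local unpack = unpack or table.unpack   -- Lua 5.1 vs later
local complaints = { }
local function savecomplaint(e1, e2, complain, ...)
  local c = { complain, ... }
  for _, e in ipairs { e1, e2 } do
    complaints[e] = complaints[e] or { }
    table.insert(complaints[e], c)
  end
end
local function issuecomplaints(e)
  for _, c in ipairs(complaints[e] or { }) do
    c[1](unpack(c, 2))
  end
end
local ea, eb, issued = { }, { }, { }
savecomplaint(ea, eb, function(msg) table.insert(issued, msg) end, "dup keys")
issuecomplaints(ea)   -- replays the complaint saved for entry ea only
```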
<<Lua utility functions>>=
-- return 'key' or 'type' or 'field <name>' at which entries differ,
-- or nil if entries are the same
local function entries_differ(e1, e2, notkey)
if e1.key ~= e2.key and not notkey then return 'key' end
if e1.type ~= e2.type then return 'type' end
for k, v in pairs(e1.fields) do
if e2.fields[k] ~= v then return 'field ' .. k end
end
for k, v in pairs(e2.fields) do
if e1.fields[k] ~= v then return 'field ' .. k end
end
end
@
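To make the behavior concrete, here is the same function exercised on two toy
entries (repeated verbatim so the example is self-contained):

```lua
-- entries_differ, as above, plus two toy entries to compare.
local function entries_differ(e1, e2, notkey)
  if e1.key ~= e2.key and not notkey then return 'key' end
  if e1.type ~= e2.type then return 'type' end
  for k, v in pairs(e1.fields) do
    if e2.fields[k] ~= v then return 'field ' .. k end
  end
  for k, v in pairs(e2.fields) do
    if e1.fields[k] ~= v then return 'field ' .. k end
  end
end
local e1 = { key = 'dm', type = 'article', fields = { year = '1982' } }
local e2 = { key = 'DM', type = 'article', fields = { year = '1982' } }
-- they differ in key, unless key comparison is suppressed with notkey
```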
I've seen at least one bibliography with identical entries listed
under multiple keys. (Thanks, Andrew.)
<<Lua utility functions>>=
-- every entry is identical to every other
local function all_entries_identical(es, notkey)
if table.getn(es) == 0 then return true end
for i = 2, table.getn(es) do
if entries_differ(es[1], es[i], notkey) then
return false
end
end
return true
end
@
\subsection{Computing and emitting the list of citations}
A significant complexity added in \nbibtex\ is that a single entry may
be cited using more than one citation key.
For example, [[\cite{milner:type-polymorphism}]] and
[[\cite{milner:theory-polymorphism}]] may well specify the same paper.
Thus, in addition to a list of citations, I~also keep track of the set
of keys with which each entry is cited, as well as the first such key.
The function [[cite]] manages all these data structures.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
local citations = { } -- list of citations
local cited = { } -- (entry -> key set) table
local first_cited = { } -- (entry -> key) table
local function cite(c, e) -- cite entry e with key c
local seen = cited[e]
cited[e] = seen or { }
cited[e][c] = true
if not seen then
first_cited[e] = c
table.insert(citations, e)
end
end
@
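Citing one entry under two different keys exercises all three data
structures; here is a self-contained rerun of [[cite]]:

```lua
-- The cite bookkeeping above, exercised on one entry cited under two keys.
local citations, cited, first_cited = { }, { }, { }
local function cite(c, e)
  local seen = cited[e]
  cited[e] = seen or { }
  cited[e][c] = true
  if not seen then
    first_cited[e] = c
    table.insert(citations, e)
  end
end
local e = { key = 'orig-key' }
cite('milner:type-polymorphism', e)
cite('milner:theory-polymorphism', e)
-- one citation record; the first key used wins
```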
When the dust settles, we adjust members of each citation record:
the first key actually used becomes [[key]],
the original key becomes [[orig_key]], and other keys go into [[also_cited_as]].
<<using [[cited]] and [[first_cited]], adjust fields [[key]] and [[also_cited_as]]>>=
for i = 1, table.getn(citations) do
local c = citations[i]
local key = assert(first_cited[c], "citation is not cited?!")
c.orig_key, c.key = c.key, key
local also = { }
for k in pairs(cited[c]) do
if k ~= key then table.insert(also, k) end
end
c.also_cited_as = also
end
@
For each actual [[\cite]] command in the original {\LaTeX} file, we
call [[find_entry]] to find an appropriate \bibtex\ entry.
Because a [[\cite]] command might match more than one paper, the
results may be ambiguous.
We therefore produce a list of all \emph{candidates} matching the
[[\cite]] command.
If we're permissive, we search one list of entries after another,
stopping as soon as we get some candidates.
If we're not permissive, we have just one list of entries overall, so
we search it and we're done.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
local find_entry -- function from key to citation
do
local cache = { } -- (citation-key -> entry) table
function find_entry(c)
local function remember(e) cache[c] = e; return e end -- cache e and return it
if cache[c] or dupcheck[c] then
return cache[c] or dupcheck[c]
else
local candidates
if permissive then
for _, entries in ipairs(bibentries) do
candidates = query(c, entries)
if table.getn(candidates) > 0 then break end
end
else
candidates = query(c, bibentries)
end
assert(candidates)
<<from the available [[candidates]], choose one and [[remember]] it>>
end
end
end
@
If we have no candidates, we're hosed.
Otherwise, if all the candidates are identical (most likely when there
is a unique candidate, but still possible otherwise),\footnote
{Andrew Appel has a bibliography in which the \emph{Definition of
Standard~ML} appears as two different entries that are identical
except for keys.}
we take the first.
Finally, if there are multiple, distinct candidates to choose from,
we take the first and issue a warning message.
To avoid surprising the unwary coauthor, we put a warning message into
the entry as well, from which it will go into the printed bibliography.
<<from the available [[candidates]], choose one and [[remember]] it>>=
if table.getn(candidates) == 0 then
biberrorf('No .bib entry matches \\cite{%s}', c)
elseif all_entries_identical(candidates, 'notkey') then
logf("Query '%s' produced unique candidate %s from %s\n",
c, candidates[1].key, candidates[1].file)
return remember(candidates[1])
else
local e = table.copy(candidates[1])
<<warn of multiple candidates for query [[c]]>>
e.warningmsg = string.format('[This entry is the first match for query ' ..
'\\texttt{%s}, which produced %d matches.]',
c, table.getn(candidates))
return remember(e)
end
@
I can do better later\ldots
<<warn of multiple candidates for query [[c]]>>=
bibwarnf("Query '%s' produced %d candidates\n (using %s from %s)\n",
c, table.getn(candidates), e.key, e.file)
bibwarnf("First two differ in %s\n", entries_differ(candidates[1], candidates[2], true))
@
The [[query]] function uses the engine described in Section~\ref{sec:query}.
<<definition of [[query]], used to search a list of entries>>=
function query(c, entries)
local p = matchq(c)
local t = { }
for _, e in ipairs(entries) do
if p(e.type, e.fields) then
table.insert(t, e)
end
end
return t
end
bibtex.query = query
bibtex.doc.query = 'query: string -> entry list -> entry list'
<<declarations of internal functions>>=
local query
local matchq
bibtex.doc.matchq = 'matchq: string -> predicate --- compile query string'
bibtex.matchq = matchq
@
Finally we can compute the list of entries:
search on each citation key, and if we had [[\cite{*}]] or
[[\nocite{*}]], add all the other entries as well.
The [[cite]] command takes care of avoiding duplicates.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
for _, c in ipairs(citekeys) do
local e = find_entry(c)
if e then cite(c, e) end
end
if cited_star then
for _, es in ipairs(permissive and bibentries or {bibentries}) do
logf('Adding all entries in list of %d\n', table.getn(es))
for _, e in ipairs(es) do
cite(e.key, e)
end
end
end
<<using [[cited]] and [[first_cited]], adjust fields [[key]] and [[also_cited_as]]>>
@
I've always hated \bibtex's cross-reference feature, but I believe
I've implemented it faithfully.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
bibtex.do_crossrefs(citations, find_entry)
@
With the entries computed, there are two ways to emit:
as another \bibtex\ file or as required by the style file.
So that we can read from [[bblname]] before writing to it,
the opening of [[bbl]] is carefully delayed to this point.
<<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>>=
<<emit warnings for entries in [[citations]]>>
local bbl = bblname == '-' and io.stdout or open(bblname, 'w')
if bib_out then
bibtex.emit(bbl, preamble, citations)
else
bibstyle.emit(bbl, preamble, citations)
end
if bblname ~= '-' then bbl:close() end
@
Here's a function to emit a list of citations as \bibtex\ source.
<<exported Lua functions>>=
bibtex.doc.emit =
'outfile * string list * entry list -> unit -- write citations in .bib format'
function bibtex.emit(bbl, preamble, citations)
local warned = false
if preamble[1] then
bbl:write('@preamble{\n')
for i = 1, table.getn(preamble) do
bbl:write(string.format(' %s "%s"\n', i > 1 and '#' or ' ', preamble[i]))
end
bbl:write('}\n\n')
end
for _, e in ipairs(citations) do
local also = e.also_cited_as
if also and table.getn(also) > 0 then
for _, k in ipairs(e.also_cited_as or { }) do
bbl:write(string.format('@%s{%s, crossref={%s}}\n', e.type, k, e.key))
end
if not warned then
warned = true
bibwarnf("Warning: some entries (such as %s) are cited with multiple keys;\n"..
" in the emitted .bib file, these entries are duplicated (using crossref)\n",
e.key)
end
end
emit_tkf.bib(bbl, e.type, e.key, e.fields)
end
end
@
<<emit warnings for entries in [[citations]]>>=
for _, e in ipairs(citations) do
if warnings[e] then
for _, w in ipairs(warnings[e]) do emit_warning(unpack(w)) end
end
end
@
\subsection{Cross-reference}
If an entry contains a [[crossref]] field,
that field is used as a key to find the parent, and the entry inherits
missing fields from the parent.
If the parent is cross-referenced sufficiently often (i.e., at least
[[min_crossrefs]] times), it may be added
to the citation list, in which case the style file knows what to do
with the [[crossref]] field.
But if the parent is not cited sufficiently often,
it disappears, and so does the [[crossref]] field.
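The field-inheritance rule can be seen in miniature.
This is a hypothetical sketch with made-up field values; the real code
operates on full entry records, as shown in the chunk that follows.

```lua
-- Sketch of crossref field inheritance: the child keeps the fields it
-- already has and inherits only the ones it is missing from the parent.
local parent = { booktitle = 'Proc. POPL', year = '1982', publisher = 'ACM' }
local child  = { title = 'A Paper', year = '1983', crossref = 'popl82' }
for k, v in pairs(parent) do
  child[k] = child[k] or v -- inherit field if missing
end
-- child.year is still '1983'; child.booktitle is now 'Proc. POPL'
```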
<<exported Lua functions>>=
bibtex.doc.do_crossrefs = "citation list -> unit # add crossref'ed fields in place"
function bibtex.do_crossrefs(citations, find_entry)
local map = { } --- key to entry (on citation list)
local xmap = { } --- key to entry (xref'd only)
local xref_count = { } -- entry -> number of times xref'd
<<make [[map]] map lower-case keys in [[citations]] to their entries>>
for i = 1, table.getn(citations) do
local c = citations[i]
if c.fields.crossref then
local lowref = string.lower(c.fields.crossref)
local parent = map[lowref] or xmap[lowref]
if not parent and find_entry then
parent = find_entry(lowref)
xmap[lowref] = parent
end
if not parent then
biberrorf("Entry %s cross-references to %s, but I can't find %s",
c.key, c.fields.crossref, c.fields.crossref)
c.fields.crossref = nil
else
xref_count[parent] = (xref_count[parent] or 0) + 1
local fields = c.fields
fields.crossref = parent.key -- force a case match!
for k, v in pairs(parent.fields) do -- inherit field if missing
fields[k] = fields[k] or v
end
end
end
end
<<add oft-crossref'd entries from [[xmap]] to the list in [[citations]]>>
<<remove [[crossref]] fields for entries with seldom-crossref'd parents>>
end
<<make [[map]] map lower-case keys in [[citations]] to their entries>>=
for i = 1, table.getn(citations) do
local c = citations[i]
local key = string.lower(c.key)
map[key] = map[key] or c
end
<<add oft-crossref'd entries from [[xmap]] to the list in [[citations]]>>=
for _, e in pairs(xmap) do -- includes only missing entries
if xref_count[e] >= min_crossrefs then
table.insert(citations, e)
end
end
<<remove [[crossref]] fields for entries with seldom-crossref'd parents>>=
for i = 1, table.getn(citations) do
local c = citations[i]
if c.fields.crossref then
local parent = xmap[string.lower(c.fields.crossref)]
if parent and xref_count[parent] < min_crossrefs then
c.fields.crossref = nil
end
end
end
@
\subsection{The query engine (i.e., the point of it all)}
\label{sec:query}
The query language is described in the man page for [[nbibtex]].
Its implementation is divided into two parts:
the internal predicates which are composed to form a query predicate,
and the parser that takes a string and produces a query predicate.
Function [[matchq]] is declared [[local]] above and is the only
function visible outside this block.
<<exported Lua functions>>=
do
if not boyer_moore then
require 'boyer-moore'
end
local bm = boyer_moore
local compile = bm.compilenc
local search = bm.matchnc
-- type predicate = type * field table -> bool
-- val match : field * string -> predicate
-- val author : string -> predicate
-- val matchty : string -> predicate
-- val andp : predicate option * predicate option -> predicate option
-- val orp : predicate option * predicate option -> predicate option
-- val matchq : string -> predicate --- compile query string
<<definitions of query-predicate functions>>
<<definition of [[matchq]], the query compiler>>
<<definition of [[query]], used to search a list of entries>>
bibtex.matchq = matchq -- export here, once the local has been defined
end
@
\subsubsection{Query predicates}
The common case is a predicate for a named field.
We also have some special syntax for ``all fields'' and the \bibtex\
``type,'' which is not a field.
<<definitions of query-predicate functions>>=
local matchty
local function match(field, string)
if string == '' then return nil end
local pat = compile(string)
if field == '*' then
return function (t, fields)
for _, v in pairs(fields) do if search(pat, v) then return true end end
end
elseif field == '[type]' then
return matchty(string)
else
return function (t, fields) return search(pat, fields[field] or '') end
end
end
@
Here's a type matcher.
<<definitions of query-predicate functions>>=
function matchty(string)
if string == '' then return nil end
local pat = compile(string)
return function (t, fields) return search(pat, t) end
end
@
We make a special case of [[author]] because it really means ``author
or editor.''
<<definitions of query-predicate functions>>=
local function author(string)
if string == '' then return nil end
local pat = compile(string)
return function (t, fields)
return search(pat, fields.author or fields.editor or '')
end
end
@
We conjoin and disjoin predicates, being careful to use tail calls
(not [[and]] and [[or]]) in order to save stack space.
<<definitions of query-predicate functions>>=
local function andp(p, q)
-- associate to right for constant stack space
if not p then
return q
elseif not q then
return p
else
return function (t,f) if p(t,f) then return q(t,f) end end
end
end
<<definitions of query-predicate functions>>=
local function orp(p, q)
-- associate to right for constant stack space
if not p then
return q
elseif not q then
return p
else
return function (t,f) if p(t,f) then return true else return q(t,f) end end
end
end
@
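To illustrate how the combinators compose, here is a self-contained
sketch: [[andp]] is copied from the chunk above, and the two sample
predicates are hypothetical.
A~predicate takes an entry's type and field table; [[nil]] stands for
``no constraint,'' which [[andp]] treats as an identity.

```lua
-- andp as above; nil arguments mean 'no constraint'
local function andp(p, q)
  if not p then return q
  elseif not q then return p
  else return function (t, f) if p(t, f) then return q(t, f) end end
  end
end
local has_author = function (t, fields) return fields.author ~= nil end
local is_article = function (t, fields) return t == 'article' end
local both = andp(has_author, is_article)
-- both('article', { author = 'Milner' }) yields true
-- both('book',    { author = 'Milner' }) yields a false result
-- andp(nil, has_author) is has_author itself
```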
\subsubsection{The query compiler}
The function [[matchq]] takes the syntax explained in the man page and
produces a predicate.
<<definition of [[matchq]], the query compiler>>=
function matchq(query)
local find = string.find
local parts = split(query, '%:')
local p = nil
if parts[1] and not find(parts[1], '=') then
<<add to [[p]] a match for [[parts[1]]] as author>>
table.remove(parts, 1)
if parts[1] and not find(parts[1], '=') then
<<add to [[p]] a match for [[parts[1]]] as title or year>>
table.remove(parts, 1)
if parts[1] and not find(parts[1], '=') then
<<add to [[p]] a match for [[parts[1]]] as type or year>>
table.remove(parts, 1)
end
end
end
for _, part in ipairs(parts) do
if not find(part, '=') then
biberrorf('bad query %q --- late specs need = sign', query)
else
local _, _, field, words = find(part, '^(.*)=(.*)$')
assert(field and words, 'bug in query parsing')
<<add to [[p]] a match for [[words]] as [[field]]>>
end
end
if not p then
bibwarnf('empty query---matches everything\n')
return function() return true end
else
return p
end
end
@
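For instance, a query such as \texttt{damas-milner:polymorphism:1982}
is carved up like this.
(A~sketch using plain [[string.find]]; the real [[split]] can be told
to skip braced material.)

```lua
-- Split s into the substrings separated by pat (simplified copy of [[split]])
local function split(s, pat)
  local t, i = { }, 1
  while true do
    local j, k = string.find(s, pat, i)
    if not j then table.insert(t, string.sub(s, i)); break end
    table.insert(t, string.sub(s, i, j - 1))
    i = k + 1
  end
  return t
end
local parts = split('damas-milner:polymorphism:1982', '%:')
-- parts[1] is the author spec, parts[2] the title words, parts[3] the year
local authors = split(parts[1], '%-')
-- authors[1] is 'damas'; authors[2] is 'milner'
```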
Here's where an unnamed key defaults to author or editor.
<<add to [[p]] a match for [[parts[1]]] as author>>=
for _, word in ipairs(split(parts[1], '-')) do
p = andp(author(word), p)
end
<<add to [[p]] a match for [[parts[1]]] as title or year>>=
local field, words = find(parts[1], '%D') and 'title' or 'year', parts[1]
<<add to [[p]] a match for [[words]] as [[field]]>>
<<add to [[p]] a match for [[parts[1]]] as type or year>>=
if find(parts[1], '%D') then
local ty = nil
for _, word in ipairs(split(parts[1], '-')) do
ty = orp(matchty(word), ty)
end
p = andp(p, ty) --- check type last for efficiency
else
for _, word in ipairs(split(parts[1], '-')) do
p = andp(p, match('year', word)) -- check year last for efficiency
end
end
@
There could be lots of matches on a year, so we check years last.
<<add to [[p]] a match for [[words]] as [[field]]>>=
for _, word in ipairs(split(words, '-')) do
if field == 'year' then
p = andp(p, match(field, word))
else
p = andp(match(field, word), p)
end
end
@
\subsection{Path search and other system-dependent stuff}
To find a bib file, I rely on the \texttt{kpsewhich} program,
which is typically found on Unix {\TeX} installations and which
should find the same bib files as classic \bibtex.
<<Lua utility functions>>=
assert(io.popen)
local function capture(cmd, raw)
local f = assert(io.popen(cmd, 'r'))
local s = assert(f:read('*a'))
assert(f:close()) --- can't get an exit code
if raw then return s end
s = string.gsub(s, '^%s+', '')
s = string.gsub(s, '%s+$', '')
s = string.gsub(s, '[\n\r]+', ' ')
return s
end
@
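The trimming in [[capture]] can be seen in isolation (a sketch with a
made-up sample string standing in for \texttt{kpsewhich} output):
leading and trailing whitespace disappear, and internal newlines
collapse to single spaces.

```lua
-- The three gsub passes from [[capture]], applied to sample output
local s = '  /usr/share/texmf/bib/misc.bib\nmore.bib  \n'
s = string.gsub(s, '^%s+', '')     -- strip leading whitespace
s = string.gsub(s, '%s+$', '')     -- strip trailing whitespace
s = string.gsub(s, '[\n\r]+', ' ') -- newlines become single spaces
-- s is now '/usr/share/texmf/bib/misc.bib more.bib'
```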
Function [[bibpath]] is normally called on a bibname in a {\LaTeX}
file, but because a bibname may also be given on the command line,
we add \texttt{.bib} only if not already present.
Also, because a name containing a slash is taken as an explicit
pathname, we check such a name for readability directly rather than
asking \texttt{kpsewhich}.
<<exported Lua functions>>=
bibtex.doc.bibpath = 'string -> string # from \\bibliography name, find pathname of file'
function bibtex.bibpath(bib)
if find(bib, '/') then
local f, msg = io.open(bib)
if not f then
return nil, msg
else
f:close()
return bib
end
else
if not find(bib, '%.bib$') then
bib = bib .. '.bib'
end
local pathname = capture('kpsewhich ' .. bib)
if string.len(pathname) > 1 then
return pathname
else
return nil, 'kpsewhich cannot find ' .. bib
end
end
end
@
\section{Implementation of \texttt{nbibfind}}
\subsection{Output formats for \bibtex\ entries}
We can emit a \bibtex\ entry in any of three formats:
[[bib]], [[terse]], and [[full]].
An emitter takes as arguments the type, key, and fields of the entry,
and optionally the name of the file the entry came from.
<<Lua utility functions>>=
local emit_tkf = { }
@
The simplest format is legitimate \bibtex\ source:
<<exported Lua functions>>=
function emit_tkf.bib(outfile, type, key, fields)
outfile:write('@', type, '{', key, ',\n')
for k, v in pairs(fields) do
outfile:write(' ', k, ' = {', v, '},\n')
end
outfile:write('}\n\n')
end
@
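For example, writing into a string buffer instead of a file handle
(a~sketch: the buffer object and the sample entry are hypothetical,
and [[emit_bib]] here is a copy of the function above):

```lua
-- A minimal object with a write method, collecting output in its array part
local out = { }
function out:write(...)
  for _, s in ipairs{...} do self[#self + 1] = s end
end
local function emit_bib(outfile, type, key, fields)
  outfile:write('@', type, '{', key, ',\n')
  for k, v in pairs(fields) do
    outfile:write('  ', k, ' = {', v, '},\n')
  end
  outfile:write('}\n\n')
end
emit_bib(out, 'article', 'damas-milner', { title = 'Principal type-schemes' })
-- table.concat(out) is now
-- @article{damas-milner,
--   title = {Principal type-schemes},
-- }
```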
For the other two formats, we devise a string representation.
In principle, we could go with an ASCII form of a full-blown style,
but since the purpose is to identify the entry in relatively few
characters, it seems sufficient to spit out the author, year, title,
and possibly the source.
``Full'' output shows the whole string; ``terse'' output truncates it
to fit on one line.
<<exported Lua functions>>=
do
local function bibstring(type, key, fields, bib)
<<define local [[format_lab_names]] as for a bibliography label>>
local names = format_lab_names(fields.author) or
format_lab_names(fields.editor) or
fields.key or fields.organization or '????'
local year = fields.year
local lbl = names .. (year and ' ' .. year or '')
local title = fields.title or '????'
if bib then
key = string.gsub(bib, '.*/', '') .. ': ' .. key
end
local answer =
bib and
string.format('%-25s = %s: %s', key, lbl, title) or
string.format('%-21s = %s: %s', key, lbl, title)
local where = fields.booktitle or fields.journal
if where then answer = answer .. ', in ' .. where end
answer = string.gsub(answer, '%~', ' ')
for _, cs in ipairs { 'texttt', 'emph', 'textrm', 'textup' } do
answer = string.gsub(answer, '\\' .. cs .. '%A', '')
end
answer = string.gsub(answer, '[%{%}]', '')
return answer
end
function emit_tkf.terse(outfile, type, key, fields, bib)
outfile:write(truncate(bibstring(type, key, fields, bib), 80), '\n')
end
function emit_tkf.full(outfile, type, key, fields, bib)
local w = bst.writer(outfile)
w:write(bibstring(type, key, fields, bib), '\n')
end
end
@
<<define local [[format_lab_names]] as for a bibliography label>>=
local format_lab_names
do
local fmt = '{vv }{ll}'
local function format_names(s)
local s = bst.commafy(bst.format_names(fmt, bst.namesplit(s)))
return (string.gsub(s, ' and others$', ' et al.'))
end
function format_lab_names(s)
if not s then return s end
local t = bst.namesplit(s)
if table.getn(t) > 3 then
return bst.format_name(fmt, t[1]) .. ' et al.'
else
return format_names(s)
end
end
end
@
Function [[truncate]]
returns enough of a string to fit in [[n]] columns, with ellipses as
needed.
<<Lua utility functions>>=
local function truncate(s, n)
local l = string.len(s)
if l <= n then
return s
else
return string.sub(s, 1, n-3) .. '...'
end
end
@
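Its behavior at the boundary (a self-contained copy of the function):
strings within the budget pass through unchanged, and longer strings
lose characters to make room for the ellipsis.

```lua
-- Copy of [[truncate]]: fit s in n columns, with ellipsis if needed
local function truncate(s, n)
  if string.len(s) <= n then return s end
  return string.sub(s, 1, n - 3) .. '...'
end
-- truncate('short', 10) is 'short'
-- truncate('0123456789', 8) is '01234...', exactly 8 characters
```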
\subsection{Main functions for \texttt{nbibfind}}
<<exported Lua functions>>=
bibtex.doc.run_find = 'string list -> unit # main program for nbibfind'
bibtex.doc.find = 'string * string list -> entry list'
function bibtex.find(pattern, bibs)
local es = { }
local p = matchq(pattern)
for _, bib in ipairs(bibs) do
local rdr = bibtex.open(bib, bst.months(), hold_warning)
for type, key, fields in entries(rdr) do
if type == nil then
break
elseif not type then
io.stderr:write('Something disastrous happened with entry ', key, '\n')
elseif key == pattern or p(type, fields) then
<<emit held warnings, if any>>
table.insert(es, { type = type, key = key, fields = fields,
bib = table.getn(bibs) > 1 and bib })
else
drop_warnings()
end
end
rdr:close()
end
return es
end
function bibtex.run_find(argv)
local emit = emit_tkf.terse
while argv[1] and find(argv[1], '^-') do
if emit_tkf[string.sub(argv[1], 2)] then
emit = emit_tkf[string.sub(argv[1], 2)]
else
biberrorf('Unrecognized option %s', argv[1])
end
table.remove(argv, 1)
end
if table.getn(argv) == 0 then
io.stderr:write(string.format('Usage: %s [-bib|-terse|-full] pattern [bibs]\n',
string.gsub(argv[0], '.*/', '')))
os.exit(1)
end
local pattern = table.remove(argv, 1)
local bibs = { }
<<make [[bibs]] the list of pathnames implied by [[argv]]>>
local entries = bibtex.find(pattern, bibs)
for _, e in ipairs(entries) do
emit(io.stdout, e.type, e.key, e.fields, e.bib)
end
end
@
If we have no arguments, search all available bibfiles.
Otherwise, an argument with a~[[/]] is a pathname, and
an argument without~[[/]] is a name as it would appear in
[[\bibliography]].
<<make [[bibs]] the list of pathnames implied by [[argv]]>>=
if table.getn(argv) == 0 then
bibs = all_bibs()
else
for _, a in ipairs(argv) do
if find(a, '/') then
table.insert(bibs, a)
else
table.insert(bibs, assert(bibtex.bibpath(a)))
end
end
end
@
<<emit held warnings, if any>>=
local ws = held_warnings()
if ws then
for _, w in ipairs(ws) do
emit_warning(unpack(w))
end
end
@
To search all bib files, we lean heavily on \texttt{kpsewhich}, which is
distributed with the Web2C version of {\TeX}, and which knows exactly
which directories to search.
<<post-split Lua utility functions>>=
local function all_bibs()
local pre_path = assert(capture('kpsewhich -show-path bib'))
local path = assert(capture('kpsewhich -expand-path ' .. pre_path))
local bibs = { } -- list of results
local inserted = { } -- set of inserted bibs, to avoid duplicates
for _, dir in ipairs(split(path, ':')) do
local files = assert(capture('echo ' .. dir .. '/*.bib'))
for _, file in ipairs(split(files, '%s')) do
if readable(file) then
if not (workaround.badbibs and (find(file, 'amsxport%-options') or
find(file, '/plbib%.bib$')))
then
if not inserted[file] then
table.insert(bibs, file)
inserted[file] = true
end
end
end
end
end
return bibs
end
bibtex.all_bibs = all_bibs
@ Notice the [[workaround.badbibs]], which prevents us from searching
some bogus bibfiles that come with Thomas Esser's te{\TeX}.
@
It's a pity there's no more efficient way to see whether a file is
readable than to try to open it, but that's portability for you.
<<Lua utility functions>>=
local function readable(file)
local f, msg = io.open(file, 'r')
if f then
f:close()
return true
else
return false, msg
end
end
@
\section{Support for style files}
A \bibtex\ style file is used to turn a \bibtex\ entry into {\TeX} or
{\LaTeX} code suitable for inclusion in a bibliography.
It can also be used for many other wondrous purposes, such as
generating HTML for web pages.
In classic \bibtex, each style file is written in a rudimentary,
unnamed, stack-based language,
which is described in the document ``Designing \bibtex\ Styles,''
often distributed as \texttt{btxhak.dvi}.
One of the benefits of \nbibtex\ is that styles can instead be written
in Lua, which is a much more powerful language---and perhaps even
easier to read.
But while Lua has amply powerful string-processing primitives, it
lacks some of the primitives that are specific to \bibtex.
Most notable among these primitives is the machinery for parsing and
formatting names (of authors, editors and so on).
That machinery is re-implemented here.
If documentation seems scanty, consult the original \texttt{btxhak}.
@
In classic \bibtex, each style is its own separate file.
Here, we share code by allowing a single file to register multiple
styles.
<<exported Lua functions>>=
bibtex.doc.register_style =
[[string * style -> unit # remember style with given name
type style = { emit : outfile * string list * citation list -> unit
, style : table of formatting functions # defined document types
, macros : unit -> macro table
}]]
bibtex.doc.style = 'name -> style # return style with given name, loading on demand'
do
local styles = { }
function bibtex.register_style(name, s)
assert(not styles[name], "Duplicate registration of style " .. name)
styles[name] = s
s.name = s.name or name
end
function bibtex.style(name)
if not styles[name] then
local loaded
if config.nbs then
loaded = loadfile(config.nbs .. '/' .. name .. '.nbs')
if loaded then loaded() end
end
if not loaded then
require ('nbib-' .. name)
end
if not styles[name] then
bibfatalf('Tried to load a file, but it did not register style %s\n', name)
end
end
return styles[name]
end
end
@
\subsection{Special string-processing support}
A great deal of \bibtex's processing depends on giving a special
status to substrings inside braces;
indeed, when such a substring begins with a backslash, it is called a
``special character.''
Accordingly, we provide a function to search for a pattern
\emph{outside} balanced braces.
<<Lua utility functions>>=
local function find_outside_braces(s, pat, i)
local len = string.len(s)
local j, k = string.find(s, pat, i)
if not j then return j, k end
local jb, kb = string.find(s, '%b{}', i)
while jb and jb < j do --- scan past braces
--- braces come first, so we search again after close brace
local i2 = kb + 1
j, k = string.find(s, pat, i2)
if not j then return j, k end
jb, kb = string.find(s, '%b{}', i2)
end
-- either pat precedes braces or there are no braces
return string.find(s, pat, j) --- 2nd call needed to get captures
end
@
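A comparison with plain [[string.find]] (using a self-contained copy
of the function above): the comma inside the braced group is invisible
to the brace-aware search.

```lua
-- Copy of [[find_outside_braces]]: find pat in s, ignoring matches
-- that fall inside balanced braces
local function find_outside_braces(s, pat, i)
  local j, k = string.find(s, pat, i)
  if not j then return j, k end
  local jb, kb = string.find(s, '%b{}', i)
  while jb and jb < j do -- scan past braces
    local i2 = kb + 1
    j, k = string.find(s, pat, i2)
    if not j then return j, k end
    jb, kb = string.find(s, '%b{}', i2)
  end
  return string.find(s, pat, j)
end
-- string.find('{Lastname, First} more', ',') finds the comma at 10,
-- but find_outside_braces on the same arguments finds nothing
```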
\subsubsection{String splitting}
Another common theme in \bibtex\ is a list represented as a string.
A~list of names is represented as a string with individual names
separated by ``and.''
A~name itself is a list of parts separated by whitespace.
So here are some functions to do general splitting.
When we don't care about the separators, we use [[split]];
when we care only about the separators, we use [[splitters]];
and
when we care about both, we use [[odd_even_split]].
<<Lua utility functions>>=
local function split(s, pat, find) --- return list of substrings separated by pat
find = find or string.find -- could be find_outside_braces
local len = string.len(s)
local t = { }
local insert = table.insert
local i, j, k = 1, true
while j and i <= len + 1 do
j, k = find(s, pat, i)
if j then
insert(t, string.sub(s, i, j-1))
i = k + 1
else
insert(t, string.sub(s, i))
end
end
return t
end
@
Function [[splitters]] returns a table that, when interleaved with the
result of [[split]], reconstructs the original string.
<<Lua utility functions>>=
local function splitters(s, pat, find) --- return list of separators
find = find or string.find -- could be find_outside_braces
local t = { }
local insert = table.insert
local j, k = find(s, pat, 1)
while j do
insert(t, string.sub(s, j, k))
j, k = find(s, pat, k+1)
end
return t
end
@
Function [[odd_even_split]] puts the substrings between matches of the
sought-for pattern in the odd-numbered slots and the matches themselves
in the even-numbered slots.
<<Lua utility functions>>=
local function odd_even_split(s, pat)
local len = string.len(s)
local t = { }
local insert = table.insert
local i, j, k = 1, true
while j and i <= len + 1 do
j, k = find(s, pat, i)
if j then
insert(t, string.sub(s, i, j-1))
insert(t, string.sub(s, j, k))
i = k + 1
else
insert(t, string.sub(s, i))
end
end
return t
end
@
As a special case, we may want to pull out brace-delimited substrings:
<<Lua utility functions>>=
local function brace_split(s) return odd_even_split(s, '%b{}') end
@
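The relationship between [[split]] and [[splitters]] can be checked on
a small example (self-contained copies using plain [[string.find]]):
interleaving the two lists reconstructs the original string.

```lua
-- Simplified copies of [[split]] and [[splitters]]
local function split(s, pat)
  local t, i = { }, 1
  while true do
    local j, k = string.find(s, pat, i)
    if not j then table.insert(t, string.sub(s, i)); break end
    table.insert(t, string.sub(s, i, j - 1))
    i = k + 1
  end
  return t
end
local function splitters(s, pat)
  local t = { }
  local j, k = string.find(s, pat, 1)
  while j do
    table.insert(t, string.sub(s, j, k))
    j, k = string.find(s, pat, k + 1)
  end
  return t
end
local words = split('one two  three', '%s+')     -- {'one','two','three'}
local seps  = splitters('one two  three', '%s+') -- {' ', '  '}
-- interleaving reconstructs the input:
-- words[1]..seps[1]..words[2]..seps[2]..words[3] == 'one two  three'
```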
Some things need splits.
<<Lua utility functions>>=
<<post-split Lua utility functions>>
@
\subsubsection{String lengths and widths}
Function [[text_char_count]] counts characters,
but a ``special character'' (a brace group beginning with a backslash)
counts as just one character.
It is based on \bibtex's [[text.length$]] function.
<<Lua utility functions>>=
local function text_char_count(s)
local n = 0
local i, last = 1, string.len(s)
while i <= last do
local special, splast, sp = find(s, '(%b{})', i)
if not special then
return n + (last - i + 1)
elseif find(sp, '^{\\') then
n = n + (special - i + 1) -- by statute, it's a single character
i = splast + 1
else
n = n + (splast - i + 1) - 2 -- don't count braces
i = splast + 1
end
end
return n
end
bst.text_length = text_char_count
bst.doc.text_length = "string -> int # length (with 'special' char == 1)"
@
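The counting rules in action (a self-contained copy of the function,
assuming [[find]] is [[string.find]]): a brace group beginning with a
backslash is one character, and other braces are not counted at all.

```lua
local find = string.find
-- Copy of [[text_char_count]]
local function text_char_count(s)
  local n = 0
  local i, last = 1, string.len(s)
  while i <= last do
    local special, splast, sp = find(s, '(%b{})', i)
    if not special then
      return n + (last - i + 1)
    elseif find(sp, '^{\\') then
      n = n + (special - i + 1) -- the whole group is one character
      i = splast + 1
    else
      n = n + (splast - i + 1) - 2 -- don't count the braces
      i = splast + 1
    end
  end
  return n
end
-- text_char_count('hello') is 5
-- text_char_count('{\\ss}trasse') is 7: the special char plus 'trasse'
-- text_char_count('a{bc}d') is 4: the braces themselves don't count
```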
Sometimes we want to know not how many characters are in a string, but
how much space we expect it to take when typeset.
(Or rather, we want to compare such widths to find the widest.)
This is original \bibtex's [[width$]] function.
The code should use the [[char_width]] array, for which
[[space]] is the only whitespace character given a nonzero printing
width. The widths here are taken from Stanford's June~'87
\texttt{cmr10}~font and represent hundredths of a point (rounded), but since
they're used only for relative comparisons, the units have no meaning.
<<exported Lua functions>>=
do
local char_width = { }
local special_widths = { ss = 500, ae = 722, oe = 778, AE = 903, OE = 1014 }
for i = 0, 255 do char_width[i] = 0 end
local char_width_from_32 = {
278, 278, 500, 833, 500, 833, 778, 278, 389, 389, 500, 778, 278, 333,
278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278,
278, 778, 472, 472, 778, 750, 708, 722, 764, 681, 653, 785, 750, 361,
514, 778, 625, 917, 750, 778, 681, 778, 736, 556, 722, 750, 750,
1028, 750, 750, 611, 278, 500, 278, 500, 278, 278, 500, 556, 444,
556, 444, 306, 500, 556, 278, 306, 528, 278, 833, 556, 500, 556, 528,
392, 394, 389, 556, 528, 722, 528, 528, 444, 500, 1000, 500, 500,
}
for i = 1, table.getn(char_width_from_32) do
char_width[32+i-1] = char_width_from_32[i]
end
bst.doc.width = "string -> faux_points # width of string in 1987 cmr10"
function bst.width(s)
assert(false, 'have not implemented width yet')
end
end
@
\subsection{Parsing names and lists of names}
Names in a string are separated by \texttt{and} surrounded by nonnull
whitespace.
Case is not significant.
<<exported Lua functions>>=
local function namesplit(s)
local t = split(s, '%s+[aA][nN][dD]%s+', find_outside_braces)
local i = 2
while i <= table.getn(t) do
while find(t[i], '^[aA][nN][dD]%s+') do
t[i] = string.gsub(t[i], '^[aA][nN][dD]%s+', '')
table.insert(t, i, '')
i = i + 1
end
i = i + 1
end
return t
end
bst.namesplit = namesplit
bst.doc.namesplit = 'string -> list of names # split names on "and"'
@
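For example (a simplified sketch with plain [[string.find]]; the real
[[namesplit]] uses [[find_outside_braces]], so an \texttt{and} inside
braces does not split, and it also cleans up consecutive \texttt{and}s):

```lua
-- Simplified namesplit: break a name list on ' and ', case-insensitively
local function namesplit(s)
  local t, i = { }, 1
  while true do
    local j, k = string.find(s, '%s+[aA][nN][dD]%s+', i)
    if not j then table.insert(t, string.sub(s, i)); break end
    table.insert(t, string.sub(s, i, j - 1))
    i = k + 1
  end
  return t
end
local t = namesplit('Damas and Milner and others')
-- t is {'Damas', 'Milner', 'others'}
```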
<<exported Lua functions>>=
local sep_and_not_tie = '%-'
local sep_chars = sep_and_not_tie .. '%~'
@
To parse an individual name, we want to count commas.
We first remove leading white space (and [[sep_char]]s), and trailing
white space (and [[sep_char]]s) and commas, complaining for each
trailing comma.
We then represent the name as two sequences: [[tokens]] and
[[trailers]].
The [[tokens]] are the names themselves, and the [[trailers]] are the
separator characters between tokens.
(A~separator is white space, a dash, or a tie, and multiple separators
in sequence are frowned upon.)
The [[commas]] table becomes an array mapping the comma number to the
index of the token that follows it.
<<exported Lua functions>>=
local parse_name
do
local white_sep = '[' .. sep_chars .. '%s]+'
local white_comma_sep = '[' .. sep_chars .. '%s%,]+'
local trailing_commas = '(,[' .. sep_chars .. '%s%,]*)$'
local sep_char = '[' .. sep_chars .. ']'
local leading_white_sep = '^' .. white_sep
<<name-parsing utilities>>
function parse_name(s, inter_token)
if string.find(s, trailing_commas) then
biberrorf("Name '%s' has one or more commas at the end", s)
end
s = string.gsub(s, trailing_commas, '')
s = string.gsub(s, leading_white_sep, '')
local tokens = split(s, white_comma_sep, find_outside_braces)
local trailers = splitters(s, white_comma_sep, find_outside_braces)
<<rewrite [[trailers]] to hold a single separator character each>>
local commas = { } --- maps each comma to the index of the token that follows it
for i, t in ipairs(trailers) do
string.gsub(t, ',', function() table.insert(commas, i+1) end)
end
local name = { }
<<parse the name tokens and set fields of [[name]]>>
return name
end
end
bst.parse_name = parse_name
bst.doc.parse_name = 'string * string option -> name table'
@
A~name has up to four parts: the most general form is either
``First von Last, Junior'' or
``von Last, First, Junior'', but various vons and Juniors can be
omitted.
The name-parsing algorithm is baroque and is transliterated from the
original \bibtex\ source, but the principle is clear:
assign the full version of each part to the four fields
[[ff]], [[vv]], [[ll]], and [[jj]];
and
assign an abbreviated version of each part to the fields
[[f]], [[v]], [[l]], and [[j]].
<<parse the name tokens and set fields of [[name]]>>=
local first_start, first_lim, last_lim, von_start, von_lim, jr_lim
-- variables mark subsequences; if start == lim, sequence is empty
local n = table.getn(tokens)
<<local parsing functions>>
local commacount = table.getn(commas)
if commacount == 0 then -- first von last jr
von_start, first_start, last_lim, jr_lim = 1, 1, n+1, n+1
<<parse first von last jr>>
elseif commacount == 1 then -- von last jr, first
von_start, last_lim, jr_lim, first_start, first_lim =
1, commas[1], commas[1], commas[1], n+1
divide_von_from_last()
elseif commacount == 2 then -- von last, jr, first
von_start, last_lim, jr_lim, first_start, first_lim =
1, commas[1], commas[2], commas[2], n+1
divide_von_from_last()
else
biberrorf("Too many commas in name '%s'", s)
end
<<set fields of name based on [[first_start]] and friends>>
@
The von name, if any, goes from the first von token to the last von
token, except the last name is entitled to at least one token.
So to find the limit of the von name, we start just before the last
token and wind down until we find a von token or we hit the von start
(in which latter case there is no von name).
<<local parsing functions>>=
function divide_von_from_last()
von_lim = last_lim - 1;
while von_lim > von_start and not isVon(tokens[von_lim-1]) do
von_lim = von_lim - 1
end
end
@
OK, here's one form.
<<parse first von last jr>>=
local got_von = false
while von_start < last_lim-1 do
if isVon(tokens[von_start]) then
divide_von_from_last()
got_von = true
break
else
von_start = von_start + 1
end
end
if not got_von then -- there is no von name
while von_start > 1 and find(trailers[von_start - 1], sep_and_not_tie) do
von_start = von_start - 1
end
von_lim = von_start
end
first_lim = von_start
@
The last name starts just past the last token, before the first
comma (if there is no comma, there is deemed to be one at the end
of the string), for which there exists a first brace-level-0 letter
(or brace-level-1 special character), and it's in lower case, unless
this last token is also the last token before the comma, in which
case the last name starts with this token (unless this last token is
connected by a [[sep_char]] other than a [[tie]] to the previous token, in
which case the last name starts with as many tokens earlier as are
connected by non[[tie]]s to this last one (except on Tuesdays
$\ldots\,$), although this module never sees such a case). Note that
if there are any tokens in either the von or last names, then the last
name has at least one, even if it starts with a lower-case letter.
@
The string separating tokens is reduced to a single ``separator
character.''
A~comma always trumps other separator characters.
Otherwise, if there's no comma, we take the first character, be it a
separator or a space.
(Patashnik considers that multiple such characters constitute
``silliness'' on the user's part.)
<<rewrite [[trailers]] to hold a single separator character each>>=
for i = 1, table.getn(trailers) do
local s = trailers[i]
assert(string.len(s) > 0)
if find(s, ',') then
trailers[i] = ','
else
trailers[i] = string.sub(s, 1, 1)
end
end
@
<<set fields of name based on [[first_start]] and friends>>=
<<definition of function [[set_name]]>>
set_name(first_start, first_lim, 'ff', 'f')
set_name(von_start, von_lim, 'vv', 'v')
set_name(von_lim, last_lim, 'll', 'l')
set_name(last_lim, jr_lim, 'jj', 'j')
@
We set long and short forms together; [[ss]]~is the long form and
[[s]]~is the short form.
<<definition of function [[set_name]]>>=
local function set_name(start, lim, long, short)
if start < lim then
-- string concatenation is quadratic, but names are short
<<definition of [[abbrev]], for shortening a token>>
local ss = tokens[start]
local s = abbrev(tokens[start])
for i = start + 1, lim - 1 do
if inter_token then
ss = ss .. inter_token .. tokens[i]
s = s .. inter_token .. abbrev(tokens[i])
else
local ssep, nnext = trailers[i-1], tokens[i]
local sep, next = ssep, abbrev(nnext)
<<possibly adjust [[sep]] and [[ssep]] according to token position and size>>
ss = ss .. ssep .. nnext
s = s .. '.' .. sep .. next
end
end
name[long] = ss
name[short] = s
end
end
@
Here is the default for a character between tokens:
a~tie is the default space character between the last two tokens of
the name part, and between the first two tokens if the first token is
short enough; otherwise, a space is the default.
<<possibly adjust [[sep]] and [[ssep]] according to token position and size>>=
if find(sep, sep_char) then
-- do nothing; sep is OK
elseif i == lim-1 then
sep, ssep = '~', '~'
elseif i == start + 1 then
sep = text_char_count(s) < 3 and '~' or ' '
ssep = text_char_count(ss) < 3 and '~' or ' '
else
sep, ssep = ' ', ' '
end
@
The von name starts with the first token satisfying [[isVon]],
unless that is the last token.
A~``von token'' is simply one that begins with a lower-case
letter---but those damn specials complicate everything.
<<Lua utility functions>>=
local upper_specials = { OE = true, AE = true, AA = true, O = true, L = true }
local lower_specials = { i = true, j = true, oe = true, ae = true, aa = true,
o = true, l = true, ss = true }
<<name-parsing utilities>>=
function isVon(s)
local lower = find_outside_braces(s, '%l') -- first nonbrace lowercase
local letter = find_outside_braces(s, '%a') -- first nonbrace letter
local bs, ebs, command = find_outside_braces(s, '%{%\\(%a+)') -- \xxx
if lower and lower <= letter and lower <= (bs or lower) then
return true
elseif letter and letter <= (bs or letter) then
return false
elseif bs then
if upper_specials[command] then
return false
elseif lower_specials[command] then
return true
else
local close_brace = find_outside_braces(s, '%}', ebs+1)
lower = find(s, '%l') -- first lowercase, anywhere in the token
letter = find(s, '%a') -- first letter, anywhere in the token
return lower and lower <= letter
end
else
return false
end
end
@
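The brace-free core of this rule can be seen in isolation.
Here is a stripped-down sketch (plain modern Lua, and an assumption on my
part that it matches [[isVon]] only on tokens with no braces or specials):
a token is a von token exactly when its first letter is lower case.

```lua
-- Simplified sketch: ignores braces and TeX specials entirely,
-- so it agrees with isVon only on plain ASCII tokens.
local function is_von_plain(tok)
  local c = string.match(tok, '%a')        -- first letter, if any
  return c ~= nil and c == string.lower(c)
end

assert(is_von_plain('van'))
assert(is_von_plain('de'))
assert(not is_von_plain('Beethoven'))
assert(not is_von_plain('...'))            -- no letter at all
```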
An abbreviated token is the first letter of a token, except again we
have to deal with the damned specials.
<<definition of [[abbrev]], for shortening a token>>=
local function abbrev(token)
local first_alpha, _, alpha = find(token, '(%a)')
local first_brace = find(token, '%{%\\')
if first_alpha and first_alpha <= (first_brace or first_alpha) then
return alpha
elseif first_brace then
local i, j, special = find(token, '(%b{})', first_brace)
if i then
return special
else -- unbalanced braces
return string.sub(token, first_brace)
end
else
return ''
end
end
@
\subsection{Formatting names}
Lacking Lua's string-processing utilities, classic \bibtex\ defines a
way of converting a ``format string'' and a name into a formatted
name.
I~find this formatting technique painful, but I also wanted to preserve
compatibility with existing bibliography styles, so I've implemented
it as accurately as I~can.
The interface is not quite identical to classic \bibtex;
a style can use [[namesplit]] to split names and then
[[format_name]] to format a single one,
or it can throw caution to the winds and call [[format_names]] to
format a whole list of names.
<<exported Lua functions>>=
bst.doc.format_names = "format * name list -> string list # format each name in list"
function bst.format_names(fmt, t)
local u = { }
for i = 1, table.getn(t) do
u[i] = bst.format_name(fmt, t[i])
end
return u
end
@
A \bibtex\ format string contains its variable elements inside braces.
Thus, we format a name by replacing each braced substring of the
format string.
<<exported Lua functions>>=
do
local good_keys = { ff = true, vv = true, ll = true, jj = true,
f = true, v = true, l = true, j = true, }
bst.doc.format_name = "format * name -> string # format 1 name as in bibtex"
function bst.format_name(fmt, name)
local t = type(name) == 'table' and name or parse_name(name)
-- at most one of the important letters, perhaps doubled, may appear
local function replace_braced(s)
local i, j, alpha = find_outside_braces(s, '(%a+)', 2)
if not i then
return '' --- can never be printed, but who are we to complain?
elseif not good_keys[alpha] then
biberrorf ('The format string %q has an illegal brace-level-1 letter', s)
elseif find_outside_braces(s, '%a+', j+1) then
biberrorf ('The format string %q has two sets of brace-level-1 letters', s)
elseif t[alpha] then
local k = j + 1
local t = t
<<make [[k]] follow inter-token string, if any, rebuilding [[t]] as needed>>
local head, tail = string.sub(s, 2, i-1) .. t[alpha], string.sub(s, k, -2)
<<adjust [[tail]] to account for discretionality of ties, if any>>
return head .. tail
else
return ''
end
end
return (string.gsub(fmt, '%b{}', replace_braced))
end
end
@
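For a concrete feel for these format strings, here is a stripped-down
model of the semantics (not the exported function; it assumes modern Lua
and omits special characters, inter-token strings, and tie adjustment):
each braced group survives, with its key replaced by the corresponding
name part, exactly when that part is present.

```lua
-- Stripped-down model of bibtex name formatting: a braced group such
-- as '{, jj}' contributes its surrounding text plus the named part,
-- or vanishes entirely when the part is absent.
local function format_name_sketch(fmt, name)
  return (string.gsub(fmt, '%b{}', function(group)
    local i, j, key = string.find(group, '(%a+)')
    if not key or not name[key] then return '' end
    return string.sub(group, 2, i - 1) .. name[key]
        .. string.sub(group, j + 1, -2)
  end))
end

local knuth = { ff = 'Donald E.', ll = 'Knuth' }
assert(format_name_sketch('{ff }{vv }{ll}{, jj}', knuth) == 'Donald E. Knuth')
assert(format_name_sketch('{ll}{, ff}', knuth) == 'Knuth, Donald E.')
```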
<<make [[k]] follow inter-token string, if any, rebuilding [[t]] as needed>>=
local kk, jj = find(s, '%b{}', k)
if kk and kk == k then
k = jj + 1
if type(name) == 'string' then
t = parse_name(name, string.sub(s, kk+1, jj-1))
else
error('Style error -- used a pre-parsed name with non-standard inter-token format string')
end
end
@
<<adjust [[tail]] to account for discretionality of ties, if any>>=
if find(tail, '%~%~$') then
tail = string.sub(tail, 1, -2) -- denotes hard tie
elseif find(tail, '%~$') then
if text_char_count(head) + text_char_count(tail) - 1 >= 3 then
tail = string.gsub(tail, '%~$', ' ')
end
end
@
\subsection{Line-wrapping output}
Function [[bst.writer]] takes an open output handle, plus an optional
indentation (default~2), and returns an object whose [[write]] method
buffers its arguments: complete lines are emitted as they form, lines
longer than [[max_print_line]] are broken at white space, and each
continuation line is prefixed with the indentation.
My [[max_print_line]] appears to be off by one from Oren Patashnik's.
<<exported Lua functions>>=
local min_print_line, max_print_line = 3, 79
bibtex.hard_max = max_print_line
bibtex.doc.hard_max = 'int # largest line that avoids a forced line break (for wizards)'
bst.doc.writer = "io-handle * int option -> object # result:write(s) buffers and breaks lines"
function bst.writer(out, indent)
indent = indent or 2
assert(indent + 10 < max_print_line)
indent = string.rep(' ', indent)
local gsub = string.gsub
local buf = ''
local function write(self, ...)
local s = table.concat { ... }
local lines = split(s, '\n')
lines[1] = buf .. lines[1]
buf = table.remove(lines)
for i = 1, table.getn(lines) do
local line = lines[i]
if not find(line, '^%s+$') then -- no line of just whitespace
line = gsub(line, '%s+$', '')
while string.len(line) > max_print_line do
<<emit initial part of line and reassign>>
end
out:write(line, '\n')
end
end
end
assert(out.write, "object passed to bst.writer does not have a write method")
return { write = write }
end
<<emit initial part of line and reassign>>=
local last_pre_white, post_white
local i, j, n = 1, 1, string.len(line)
while i and i <= n and i <= max_print_line do
i, j = find(line, '%s+', i)
if i and i <= max_print_line + 1 then
if i > min_print_line then last_pre_white, post_white = i - 1, j + 1 end
i = j + 1
end
end
if last_pre_white then
out:write(string.sub(line, 1, last_pre_white), '\n')
if post_white > max_print_line + 2 then
post_white = max_print_line + 2 -- bug-for-bug compatibility with bibtex
end
line = indent .. string.sub(line, post_white)
elseif n < bibtex.hard_max then
out:write(line, '\n')
line = ''
else -- ``unbreakable''
out:write(string.sub(line, 1, bibtex.hard_max-1), '%\n')
line = string.sub(line, bibtex.hard_max)
end
@
<<check constant values for consistency>>=
assert(min_print_line >= 3)
assert(max_print_line > min_print_line)
@
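The break-point search above can be pictured on its own: scan the
whitespace runs, remember the last one that starts at or before the
column limit, and break there. A minimal sketch in modern Lua (my own
simplification: [[min_print_line]], indentation, and the unbreakable
case are all omitted):

```lua
-- Break a line at the last whitespace run starting at or before
-- column maxcol + 1; return the head and the remaining tail (or nil).
local function wrap_once(line, maxcol)
  if #line <= maxcol then return line, nil end
  local brk, i = nil, 1
  while true do
    local a, b = string.find(line, '%s+', i)
    if not a or a > maxcol + 1 then break end
    brk = a
    i = b + 1
  end
  if not brk then return line, nil end   -- no usable break point
  return string.sub(line, 1, brk - 1),
         (string.gsub(string.sub(line, brk), '^%s+', ''))
end

local head, tail = wrap_once('aaa bbb ccc', 7)
assert(head == 'aaa bbb' and tail == 'ccc')
```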
\subsection{Functions copied from classic \bibtex}
\paragraph{Adding a period}
Find the last non-[[}]] character, and if it is not a sentence
terminator, add a period.
<<exported Lua functions>>=
do
local terminates_sentence = { ["."] = true, ["?"] = true, ["!"] = true }
bst.doc.add_period = "string -> string # add period unless already .?!"
function bst.add_period(s)
local _, _, last = find(s, '([^%}])%}*$')
if last and not terminates_sentence[last] then
return s .. '.'
else
return s
end
end
end
@
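The effect is easiest to see by running the same pattern standalone
(a copy of the logic above, for illustration; note how trailing closing
braces are skipped when looking for the deciding character):

```lua
local terminates = { ['.'] = true, ['?'] = true, ['!'] = true }
local function add_period(s)
  -- the last character that is not a closing brace decides
  local _, _, last = string.find(s, '([^%}])%}*$')
  if last and not terminates[last] then return s .. '.' end
  return s
end

assert(add_period('The {TeX}book') == 'The {TeX}book.')
assert(add_period('What is life?') == 'What is life?')
assert(add_period('{Gone}') == '{Gone}.')
```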
\paragraph{Case-changing}
Classic \bibtex\ has a [[change.case$]] function, which takes an
argument telling whether to change to lower case, upper case, or
``title'' case (which has initial letters capitalized).
Because Lua supports first-class functions, it makes more sense just
to export three functions: [[lower]], [[title]], and [[upper]].
<<exported Lua functions>>=
do
bst.doc.lower = "string -> string # lower case according to bibtex rules"
bst.doc.upper = "string -> string # upper case according to bibtex rules"
bst.doc.title = "string -> string # title case according to bibtex rules"
<<utilities for case conversion>>
<<definitions of case-conversion functions>>
end
@
Case conversion is complicated by the presence of brace-delimited
sequences, especially since there is one set of conventions for a ``special
character'' (brace-delimited sequence beginning with {\TeX} control
sequence) and
another set of conventions for other brace-delimited sequences.
To deal with them, we typically do an ``odd-even split'' on balanced
braces,
then apply a ``normal'' conversion function to the odd elements and a
``special'' conversion function to the even elements.
The application is done by [[oeapp]].
<<utilities for case conversion>>=
local function oeapp(f, g, t)
for i = 1, table.getn(t), 2 do
t[i] = f(t[i])
end
for i = 2, table.getn(t), 2 do
t[i] = g(t[i])
end
return t
end
@
Upper- and lower-case conversion are easiest.
Non-specials are hit directly with [[string.lower]] or
[[string.upper]];
for special characters, we use a utility called [[convert_special]].
<<definitions of case-conversion functions>>=
local lower_special = convert_special(string.lower)
local upper_special = convert_special(string.upper)
function bst.lower(s)
return table.concat(oeapp(string.lower, lower_special, brace_split(s)))
end
function bst.upper(s)
return table.concat(oeapp(string.upper, upper_special, brace_split(s)))
end
@
Here is [[convert_special]].
If a special begins with an alphabetic control sequence, we convert
only elements between control sequences.
If a special begins with a nonalphabetic control sequence, we convert
the whole special as usual.
Finally, if a special does not begin with a control sequence, we leave
it the hell alone.
(This is the convention that allows us to put [[{FORTRAN}]] in a
\bibtex\ entry and be assured that capitalization is not lost.)
<<utilities for case conversion>>=
function convert_special(cvt)
return function(s)
if find(s, '^{\\(%a+)') then
local t = odd_even_split(s, '\\%a+')
for i = 1, table.getn(t), 2 do
t[i] = cvt(t[i])
end
return table.concat(t)
elseif find(s, '^{\\') then
return cvt(s)
else
return s
end
end
end
@
Title conversion doesn't fit so nicely into the framework.
Function [[lower_later]] lowers all but the first letter of a string.
<<utilities for case conversion>>=
local function lower_later(s)
return string.sub(s, 1, 1) .. string.lower(string.sub(s, 2))
end
@
For title conversion, we don't mess with a token that follows a colon.
Hence, we must maintain [[prev]] and can't use [[convert_special]].
<<definitions of case-conversion functions>>=
local function title_special(s, prev)
if find(prev, ':%s+$') then
return s
else
if find(s, '^{\\(%a+)') then
local t = odd_even_split(s, '\\%a+')
for i = 1, table.getn(t), 2 do
local prev = t[i-1] or prev
if find(prev, ':%s+$') then
assert(false, 'bugrit')
else
t[i] = string.lower(t[i])
end
end
return table.concat(t)
elseif find(s, '^{\\') then
return string.lower(s)
else
return s
end
end
end
@
Internal function [[recap]] deals with the damn colons.
<<definitions of case-conversion functions>>=
function bst.title(s)
local function recap(s, first)
local parts = odd_even_split(s, '%:%s+')
parts[1] = first and lower_later(parts[1]) or string.lower(parts[1])
for i = (first and 3 or 1), table.getn(parts), 2 do
parts[i] = lower_later(parts[i])
end
return table.concat(parts)
end
local t = brace_split(s)
for i = 1, table.getn(t), 2 do -- elements outside specials get recapped
t[i] = recap(t[i], i == 1)
end
for i = 2, table.getn(t), 2 do -- specials are, well, special
local prev = t[i-1]
if i == 2 and not find(prev, '%S') then prev = ': ' end
t[i] = title_special(t[i], prev)
end
return table.concat(t)
end
@
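The colon rule, stripped of braces and specials, can be sketched
standalone (modern Lua; my simplification, not the exported
[[bst.title]]): the segment starting the string and each segment after
a colon keep their first character, and everything else is lowered.

```lua
-- cf. lower_later above: keep the first character, lower the rest
local function lower_later(s)
  return string.sub(s, 1, 1) .. string.lower(string.sub(s, 2))
end

local function title_sketch(s)
  local pieces, start = {}, 1
  while true do
    local i, j = string.find(s, ':%s+', start)
    pieces[#pieces + 1] = lower_later(string.sub(s, start, i and j or -1))
    if not i then break end
    start = j + 1
  end
  return table.concat(pieces)
end

assert(title_sketch('An Essay: ON TYPES') == 'An essay: On types')
-- and why {MLton} needs protective braces in a real entry:
assert(title_sketch('MLton User Guide') == 'Mlton user guide')
```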
\paragraph{Purification}
Purification (classic [[purify$]]) involves removing non-alphanumeric
characters.
Each sequence of ``separator'' characters becomes a single space.
<<exported Lua functions>>=
do
bst.doc.purify = "string -> string # remove nonalphanumeric, non-sep chars"
local high_alpha = string.char(128) .. '-' .. string.char(255)
local sep_white_char = '[' .. sep_chars .. '%s]'
local disappears = '[^' .. sep_chars .. high_alpha .. '%s%w]'
local gsub = string.gsub
local function purify(s)
return gsub(gsub(s, sep_white_char, ' '), disappears, '')
end
-- special characters are purified by removing all non-alphanumerics,
-- including white space and sep-chars
local function spurify(s)
return gsub(s, '[^%w' .. high_alpha .. ']+', '')
end
local purify_all_chars = { oe = true, OE = true, ae = true, AE = true, ss = true }
function bst.purify(s)
local t = brace_split(s)
for i = 1, table.getn(t) do
local _, k, cmd = find(t[i], '^{\\(%a+)%s*')
if k then
if lower_specials[cmd] or upper_specials[cmd] then
if not purify_all_chars[cmd] then
cmd = string.sub(cmd, 1, 1)
end
t[i] = cmd .. spurify(string.sub(t[i], k+1))
else
t[i] = spurify(string.sub(t[i], k+1))
end
elseif find(t[i], '^{\\') then
t[i] = spurify(t[i])
else
t[i] = purify(t[i])
end
end
return table.concat(t)
end
end
@
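A brace-free sketch of purification (my own reduction: the real
[[sep_chars]] is defined elsewhere in the program, and high-numbered
characters are kept there; here hyphen and tie are assumed as the
separator characters, purely for illustration):

```lua
-- Sketch without specials: separator characters and white space each
-- become a space; every other non-alphanumeric character vanishes.
local sep_chars = '%-%~'   -- assumed value, for illustration only
local function purify_sketch(s)
  s = string.gsub(s, '[' .. sep_chars .. '%s]', ' ')
  return (string.gsub(s, '[^%w%s]', ''))
end

assert(purify_sketch("O'Malley-Smith") == 'OMalley Smith')
assert(purify_sketch('Knuth~1984') == 'Knuth 1984')
```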
\paragraph{Text prefix}
Function [[text_prefix]] (classic [[text.prefix$]]) takes an initial
substring of a string, with the proviso that a \bibtex\ ``special
character'' sequence counts as a single character.
<<exported Lua functions>>=
bst.doc.text_prefix = "string * int -> string # take first n chars with special == 1"
function bst.text_prefix(s, n)
local t = brace_split(s)
local answer, rem = '', n
for i = 1, table.getn(t), 2 do
answer = answer .. string.sub(t[i], 1, rem)
rem = rem - string.len(t[i])
if rem <= 0 then return answer end
if t[i+1] == nil then return answer end -- string ended outside any special
if find(t[i+1], '^{\\') then
answer = answer .. t[i+1]
rem = rem - 1
else
<<take up to [[rem]] characters from [[t[i+1]]], not counting braces>>
end
end
return answer
end
<<take up to [[rem]] characters from [[t[i+1]]], not counting braces>>=
local s = t[i+1]
local braces = 0
local sub = string.sub
for i = 1, string.len(s) do
local c = sub(s, i, i)
if c == '{' then
braces = braces + 1
elseif c == '}' then
braces = braces - 1
else
rem = rem - 1
if rem == 0 then
return answer .. string.sub(s, 1, i) .. string.rep('}', braces)
end
end
end
answer = answer .. s
@
\paragraph{Emptiness test}
Function [[empty]] (classic [[empty$]]) tells if a value is empty;
i.e., it is missing (nil) or it is only white space.
<<exported Lua functions>>=
bst.doc.empty = "string option -> bool # is string there and holding nonspace?"
function bst.empty(s)
return s == nil or not find(s, '%S')
end
@
\subsection{Other utilities}
\paragraph{A stable sort}
Function [[bst.sort]] is like [[table.sort]] only stable.
It is needed because classic \bibtex\ uses a stable sort.
Its interface is the same as [[table.sort]].
<<exported Lua functions>>=
bst.doc.sort = 'value list * compare option -> unit # like table.sort, but stable'
function bst.sort(t, lt)
lt = lt or function(x, y) return x < y end
local pos = { } --- position of each element in original table
for i = 1, table.getn(t) do pos[t[i]] = i end
local function nlt(x, y)
if lt(x, y) then
return true
elseif lt(y, x) then
return false
else -- elements look equal
return pos[x] < pos[y]
end
end
return table.sort(t, nlt)
end
@
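The decorate-by-position trick is worth seeing in action. A standalone
copy in modern Lua ([[#t]] for [[table.getn]]); note that [[pos]] is
keyed by the elements themselves, so duplicate scalar values would share
one position, but the distinct tables styles actually sort are fine:

```lua
local function stable_sort(t, lt)
  lt = lt or function(x, y) return x < y end
  local pos = {}
  for i = 1, #t do pos[t[i]] = i end       -- remember original order
  table.sort(t, function(x, y)
    if lt(x, y) then return true
    elseif lt(y, x) then return false
    else return pos[x] < pos[y]            -- equal: keep original order
    end
  end)
end

local recs = { { key = 'b', n = 1 }, { key = 'a', n = 2 }, { key = 'b', n = 3 } }
stable_sort(recs, function(x, y) return x.key < y.key end)
assert(recs[1].n == 2 and recs[2].n == 1 and recs[3].n == 3)
```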
\paragraph{The standard months}
Every style is required to recognize the months, so we make it easy to
create a fresh table with either full or abbreviated months.
<<exported Lua functions>>=
bst.doc.months = "string option -> table # macros table containing months"
function bst.months(what)
local m = {
jan = "January", feb = "February", mar = "March", apr = "April",
may = "May", jun = "June", jul = "July", aug = "August",
sep = "September", oct = "October", nov = "November", dec = "December" }
if what == 'short' or what == 3 then
for k, v in pairs(m) do
m[k] = string.sub(v, 1, 3)
end
end
return m
end
@
\paragraph{Comma-separated lists}
The function [[commafy]] takes a list and inserts commas and
[[and]] (or [[or]]) using American conventions.
For example,
\begin{quote}
[[commafy { 'Graham', 'Knuth', 'Patashnik' }]]
\end{quote}
returns [['Graham, Knuth, and Patashnik']],
but
\begin{quote}
[[commafy { 'Knuth', 'Plass' }]]
\end{quote}
returns [['Knuth and Plass']].
<<exported Lua functions>>=
bst.doc.commafy = "string list -> string # concat separated by commas, and"
function bst.commafy(t, andword)
andword = andword or 'and'
local n = table.getn(t)
if n == 1 then
return t[1]
elseif n == 2 then
return t[1] .. ' ' .. andword .. ' ' .. t[2]
else
local last = t[n]
t[n] = andword .. ' ' .. t[n]
local answer = table.concat(t, ', ')
t[n] = last
return answer
end
end
@
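The examples above can be checked by running a standalone copy of the
logic (modern Lua); the last element is decorated temporarily and then
restored, so the caller's table is unchanged:

```lua
local function commafy(t, andword)
  andword = andword or 'and'
  local n = #t
  if n == 1 then
    return t[1]
  elseif n == 2 then
    return t[1] .. ' ' .. andword .. ' ' .. t[2]
  else
    local last = t[n]
    t[n] = andword .. ' ' .. t[n]        -- temporarily decorate
    local answer = table.concat(t, ', ')
    t[n] = last                          -- and restore
    return answer
  end
end

assert(commafy { 'Graham', 'Knuth', 'Patashnik' } == 'Graham, Knuth, and Patashnik')
assert(commafy { 'Knuth', 'Plass' } == 'Knuth and Plass')
assert(commafy({ 'tea', 'coffee' }, 'or') == 'tea or coffee')
```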
\section{Testing and so on}
Here are a couple of test functions I used during development that I
thought might be worth keeping around.
<<exported Lua functions>>=
bibtex.doc.cat = 'string -> unit # emit the named bib file in bib format'
function bibtex.cat(bib)
local rdr = bibtex.open(bib, bst.months())
if not rdr then
rdr = assert(bibtex.open(assert(bibtex.bibpath(bib)), bst.months()))
end
for type, key, fields in entries(rdr) do
if type == nil then
break
elseif not type then
io.stderr:write('Error on key ', key, '\n')
else
emit_tkf.bib(io.stdout, type, key, fields)
end
end
bibtex.close(rdr)
end
@
<<exported Lua functions>>=
bibtex.doc.count = 'string list -> unit # take list of bibs and print number of entries'
function bibtex.count(argv)
local bibs = { }
local macros = { }
local n = 0
<<make [[bibs]] the list of pathnames implied by [[argv]]>>
local function warn() end
for _, bib in ipairs(bibs) do
local rdr = bibtex.open(bib, macros, warn)
for type, key, fields in entries(rdr) do
if type == nil then
break
elseif type then
n = n + 1
end
end
rdr:close()
end
printf("%d\n", n)
end
@
<<exported Lua functions>>=
bibtex.doc.all_entries = "bibname * macro-table option -> preamble * citation list # all entries in bib file"
function bibtex.all_entries(bib, macros, warn)
macros = macros or bst.months()
warn = warn or emit_warning
local rdr = bibtex.open(bib, macros, warn)
if not rdr then
rdr = assert(bibtex.open(assert(bibtex.bibpath(bib)), macros, warn),
"could not open bib file " .. bib)
end
local cs = { }
local seen = { }
for type, key, fields in entries(rdr) do
if type == nil then
break
elseif not type then
io.stderr:write(key, '\n')
elseif not seen[key] then
seen[key] = true
table.insert(cs, { type = type, key = key, fields = fields, file = bib,
line = rdr.entry_line })
end
end
local p = assert(rdr.preamble)
rdr:close()
return p, cs
end
@
\section{Laundry list}
THINGS TO DO:
\begin{itemize}
\item TRANSITION THE C~CODE TO LUA NATIVE ERROR HANDLING ([[luaL_error]] and [[pcall]])
\item NO WARNING FOR DUPLICATE FIELDS NOT DEFINED IN .BST?
\item STANDARD WARNING FOR REPEATED ENTRY?
\item
NOT ENFORCED: An entry type must be
defined in the \texttt{.bst} file if this entry is to be included in the
reference list.
\item
THE WHOLE BST-SEARCH THING NEEDS MORE CARE.
BibTeX searches the directories in the path defined by the BSTINPUTS
environment variable for .bst files. If BSTINPUTS is not set, it uses
the system default. For .bib files, it uses the BIBINPUTS environment
variable if that is set, otherwise the default. See tex(1) for the
details of the searching.
If the environment variable TEXMFOUTPUT is set, BibTeX attempts to put
its output files in it, if they cannot be put in the current directory.
Again, see tex(1). No special searching is done for the .aux file.
\item
RATIONALIZE ERROR MACHINERY WITH WARNING, ERROR, AND FATAL CASES --
AND COUNTS.
\item
Here are some things that \bibtex\ does that \nbibtex\ should do:
\begin{enumerate}
\item
Writes a log file
\item
Counts warnings, or if there is an error, counts errors instead
\end{enumerate}
\end{itemize}
\end{document}
@