Forum: Other

Post: UsingVoltDB doc corrections and suggestions

Feb 28, 2012

p.1 "As have" should be "So have" to make a complete sentence.

p.1 "costly disk accesses" should be "costly disk access"?

p.12 reads "• Partition columns do not need to have unique values, but they cannot be null. Numeric fields can be zero and string or character fields can be empty, but the column cannot contain a null value."
This point may call for a specific side-note to Oracle users, reminding them that standard SQL, unlike Oracle, makes a clear distinction between a null string value and an empty string which is non-null.

p.13 3.2.2 Is a new user supposed to come with an understanding of "For clusters with a K-safety value greater than zero"?
There should be some simpler way to describe this configuration in more generic terms, something like "When running queries on duplicated data for greater reliability".

p.21-22 3.2.4 reads "the parameter that is used as the hash value for the partition" and later "using the FLIGHTID column and the flight_id parameter as the hash value." which seems late to be introducing the technical term "hash value" when "partitioning column" would suffice as in "the parameter that is used for the partitioning column" and "using the FLIGHTID column and the flight_id parameter as its value."

p.22 3.2.4 reads "single-partitioned store procedure", should be "single-partitioned stored procedure"

p.24 3.3.2 reads "the LookupFlight store procedure", should be "the LookupFlight stored procedure"

p.33 4.2 reads "the stored procedure is partitioned on the RESERVEID column", should be "the stored procedure is partitioned on the FLIGHTID column"

p.39 6.1.2 "tag (as a child of" uses the wrong font

p.40 6.2 reads "It is important that all nodes in the cluster can resolve the hostname or IP address of the lead node you specify." which can be dropped since it is redundant with stronger statements in Section 6.1.3 covering ALL nodes.

p.40 6.2 reads "naming voltsvr1 as the leader node", should be "naming voltsvr1 as the lead node"

p.42 6.4 reads "} (catch Exception e) {", should be "} catch (org.voltdb.client.ProcCallException e) {" for better style, as shown in the appendix.
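
To illustrate the style point for p.42, here is a minimal sketch of catching the specific exception type rather than a blanket Exception. The ProcCallException class below is only a local stand-in so the sketch compiles without the VoltDB client library, and callProcedure is a hypothetical helper that simulates a failed call; neither is taken from the manual.

```java
class CatchSpecificExample {
    // Hypothetical helper that simulates a stored procedure call failing.
    static void callProcedure(boolean fail) throws ProcCallException {
        if (fail) {
            throw new ProcCallException("procedure aborted");
        }
    }

    static String invoke(boolean fail) {
        try {
            callProcedure(fail);
            return "ok";
        } catch (ProcCallException e) {
            // Only procedure-call failures are handled here, so unrelated
            // runtime errors are not silently swallowed.
            return "procedure failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(false));
        System.out.println(invoke(true));
    }
}

// Local stand-in (assumption) for org.voltdb.client.ProcCallException.
class ProcCallException extends Exception {
    ProcCallException(String message) {
        super(message);
    }
}
```

With the narrower catch, a programming error such as a NullPointerException would propagate instead of being reported as a failed procedure call.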

p.43 6.5.2 reads "The start command allows for a single command" should be "The start action allows for a single command".

p.54 9.1 reads "Use save to write a snapshot of the current data to disk.", which would read more clearly and consistently as "Save a snapshot of the current data to disk (using @SnapshotSave)." The general issue in this section is that clarity would be helped by reducing the number of slightly different terms used for the same thing, "save" (as noun and verb), "save the data", "use save", "the save operation", "(use) the save command", "the @SaveSnapshot procedure". For another example, "When you issue a save command" should be "When you save the data".

p.55 9.1 reads "In the case of manual saves, " which seems an inaccurate way of expressing "When using the @SnapshotSave procedure". The documentation would apply to any use of the @SnapshotSave procedure, even if, as seems likely, it were part of some "non-manual" user-scripted operation. Since the current discussion centers on the @SnapshotSave procedure, there seems to be an implicit qualifier of "when not using the Enterprise Manager". There may be another implicit qualifier of "when not running an automated snapshot", but automated snapshots have not been previously mentioned. This issue recurs in section 9.1.1. It seems like there should be a more accurate distinction than "manual", maybe "explicit use of @SnapshotSave"? A better way of explaining the "synchronous" function argument may be to document the setting as a strongly recommended option (without qualification, in all cases) and then document the fact that the alternative option exists for internal use by the Enterprise Manager, which is capable of handling concurrent transactions without data loss. The use of the term "manual" in section 9.3 to characterize ALL non-scheduled snapshots (arguably even those driven by Enterprise Manager?) further muddies the waters.

p.55 9.1 reads "Note that every node in the cluster uses the same absolute path, so the path specified must be valid, must exist on every node, and must not already contain data from any previous saves using the same unique identifier, or the save will fail." but "must be valid" is redundant with "must exist". The description of how data can be migrated to surviving nodes when a node is removed suggests that there may be serious consequences to having the path specify a location on a shared network drive mounted with NFS. This should be warned against here, since a network drive is suggested elsewhere as a solution to path consistency (when sharing code and configuration data).

p.55 9.1 reads "the appropriate save set exists", should be "the appropriate snapshot files exist" to avoid introducing a new term.

p.55 9.1.1 reads "It is a good practice to examine the return value of the save operation to make sure all partitions are saved as expected." which is only marginally helpful advice without a more detailed description of how an "as expected" return value differs from an unexpected one. The same comment applies to section 9.1.2.

p.56 9.1.3 reads "the database schema, procedure source files, ...", should be "the database schema, the procedure source files, ..."

p.56 titled "Adding Nodes to the Database" contains the only description of removing nodes so should be titled "Adding or Removing Database Nodes"

p.56 The discussion of migrating snapshots from removed nodes would be improved by an example of using the command line to copy snapshots across the network from a removed node to a surviving node for a given snapshot directory and unique identifier.

p.60 10 reads "every transaction (that is, stored procedure)", should be "every transaction (that is, stored procedure invocation)".

p.60 10.1 reads "the logs begin to fill up", which hints rather too subtly that log size is bounded, but only really makes sense when the log size configuration is fully explained. Maybe "the logs could begin to require significant disk space" would be more understandable at this point, or maybe "the logs begin to fill up the limited space allocated to them (See Section 10.3.1 Log Size)." would be clearer.

p.61 10.1 reads "In reverse, when it is time to "replay" the logs, if a database starts with either the start or recover option (as described in Section 6.5.2, “Command Logging and Recovery”) once the server nodes establish a quorum, they start by restoring...", where the sense of being reversed is not really clear (or very helpful?). Suggested rewording: "When a database is started with either the start or recover option (as described in Section 6.5.2, “Command Logging and Recovery”) and the server nodes have established a quorum, it is time to "replay" the logs. This process starts by restoring..."

p.62 10.3.1 reads "Note that the log size specifies the initial size. " which only partially addresses the point that follows. More to the point, if I understand correctly, "Note that the log size specifies the initial size allocation and the size at which the snapshot creation begins."

p.63 10.3.3 reads "In other words, the results for all of the transactions since the last write are held on the server until the next write occurs." contradicting other documentation that states that transaction requests, not results, are logged. Also, the references to "the last write" and "the next write" are misleading since "write" is used elsewhere to refer to log file writes, rather than to snapshot files (the assumed intent, here). Suggestion: "In other words, each transaction in the log since the last snapshot is stored to disk until the next snapshot truncates the log."

p.63 10.3.3 reads "no transactions are lost", which would be technically more accurate as "no successful transactions are lost" or "no transaction is lost without an error condition reported back to the client application".

p.63 10.3.3 reads "the interval between writes (i.e. the frequency) while the results are held, adds to the latency...", should be "the delay until the next log write (i.e. controlled by the frequency), while logged transactions are accumulated in memory, adds to the latency...", again to avoid referencing "results" and to avoid the suggestion that the full "time frequency" (vs. more likely about half?) is added to the latency.
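
To back up the "about half" estimate in the previous item, here is a rough sketch. It assumes transactions arrive evenly spread across one write interval and that each one waits only until the next periodic log write; the 200 ms interval is a made-up value, not a documented default.

```java
class LogWriteLatencyExample {
    // Mean wait until the next periodic write, for arrivals spread evenly
    // across one interval. Each arrival is placed at the midpoint of its slot.
    static double meanWaitMillis(double intervalMillis, int arrivals) {
        double totalWait = 0.0;
        for (int i = 0; i < arrivals; i++) {
            double arrivalTime = (i + 0.5) * intervalMillis / arrivals;
            totalWait += intervalMillis - arrivalTime;
        }
        return totalWait / arrivals;
    }

    public static void main(String[] args) {
        double interval = 200.0; // hypothetical write interval in milliseconds
        System.out.println("mean added wait (ms): " + meanWaitMillis(interval, 1000));
    }
}
```

Under these assumptions the mean added wait comes out at half the interval, so the full interval is the worst case rather than the typical cost.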

p.63 10.3.3 reads "avoid adding undo latency", should be "avoid adding undue latency"

p.65 11 reads "...the system keeps a log of every stored procedure (or "command") as it is invoked." which appears to be the first (only?) place in the document that explicitly identifies "command" as a synonym for "stored procedure (invocation)". Elsewhere, in particular throughout Chapter 10 Command Logging and Recovery, the term "command" (distinct from "command log" or "command logging") is never used for a "stored procedure invocation". The prevalent term is "transaction". Should this line use "transaction" as well to be consistent with the prior chapter (even if seemingly inconsistent with the "Command Logging" feature name and the "commandlog*" keywords)? In the broader context of the document, "command" is overwhelmingly more likely to refer to shell commands or to informally referenced system procedures like "save" and "recover".

p.65 11 reads "how to recover in the case of an eventual system failure", in which "eventual" doesn't seem to contribute much other than an unfortunate sense of "inevitable". Suggestion: "how to recover in the event of a system failure"

p.70 11.4.1 reads "accept requests thinking they are the only viable copy", should be the less anthropomorphic "accept requests as if they are the only viable copy". Consider adding "There is no way to retain the effects of the separately committed transactions in both of the separated copies once the network connection is restored".

p.73 12.1 reads "certain tables in the schema as sources for export", should be "certain tables as export-only tables, sources for exported data," so that the upcoming reference to "insert data into the export-only tables" makes sense.

p.75 12.3 reads "As mentioned before," should be "As mentioned above,"

p.76 12.3 There is a case mismatch between the schema file "CREATE TABLE Reservation_final" and the project definition file ''.
Is this required or allowed? If so, that fact should be called out in the text.

p.76 12.3 reads "In reverse", should be "Conversely"

p.78 12.5 reads "For example, one client writes the serialized data to a sequence of files while another could insert it into an analytic database.", should be "For example, a client could write the serialized data to a sequence of files; another could insert it into an analytic database." Avoid using "while" which could be misinterpreted as implying simultaneous operation of clients.

p.78 12.5.2 reads "@SnapShotSave", should be "@SnapshotSave"

p.79 12.6 should be consistent in terminology "export-to-file client" and "export-to-Hadoop client" vs. the few times using "file client" and "Hadoop client".

p.84 13 reads "there are a number of different", should be "you have a number of different"

p.88 14.1.2 reads "IP address", should be "server name or IP address"

p.88+ 14.1.3/4 reads "declare" in several places, all of which should be "define"

p.89 14.1.4 reads "declare a callback structure and method that will be used", should be "define a callback class with a method that will be called"

p.89 14.1.4 reads "Then, when you go to make the actual stored procedure invocation, you declare an callback instance and invoke the procedure, using both the procedure structure and the callback instance:", should be "To asynchronously invoke the stored procedure, define an instance of the callback class and pass a reference to the procedure object and a shared pointer to the callback object to the client's invoke method"

p.89 14.1.4 code examples reading "client->invoke" should be "client.invoke"

p.89 14.1.4 reads "until told not to. (That is, until a callback returns a value of false.)", should be "until a callback returns a value of false."

p.90 14.2 should follow the fixed font convention for URLs

p.94 14.2.3 reads "VoltDB does it best", should be "VoltDB does its best"

p.94 14.2.3 reads "a JDBC-escaped timestamp", should be "a JDBC-encoded timestamp"

p.95 14.2.4 reads '12345 "2010-07-01 12:30:21"', should have a more realistic number of digits for milliseconds since epoch, more like '1278001821000 "2010-07-01 12:30:21"'.
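
For anyone checking the suggested number, here is a small sketch of the arithmetic. It assumes the example time is a local time at UTC-4 (US Eastern daylight time); that offset is my assumption, not something stated in the manual.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class EpochMillisExample {
    // Convert a local date-time at a fixed UTC offset to milliseconds since the epoch.
    static long toEpochMillis(LocalDateTime local, ZoneOffset offset) {
        return local.toInstant(offset).toEpochMilli();
    }

    public static void main(String[] args) {
        LocalDateTime example = LocalDateTime.of(2010, 7, 1, 12, 30, 21);
        // Assumed offset: UTC-4.
        System.out.println(toEpochMillis(example, ZoneOffset.ofHours(-4)));
        // The same wall-clock time read as UTC gives a value four hours earlier.
        System.out.println(toEpochMillis(example, ZoneOffset.UTC));
    }
}
```

With the UTC-4 assumption the result is 1278001821000, matching the value proposed above; read as UTC it would be 1277987421000.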

p.95 14.2.4 Figure 14.1 uses "status" in two places and "statusstring" once. The text explanation that follows documents one "status" component, not identifying which of the two it is describing, and does not document "statusstring" at all.

p.96 14.2.5 reads "unreachable, The database may not be stated yet", should be "unreachable, the database may not be started yet"

p.97 14.2.5 reads "But the consequence is that you must be more organized in how you handle errors as a consequence." should be "But the consequence is that you must be more organized in how you handle errors."

p.97 14.2.5 reads "1. First check to see that the HTTP request itself can be performed.", should be "1. First check to see that the HTTP request itself completed without error."

p.97 14.3.1. reads 'using the "jdbc:" protocol, followed by "voltdb://", the server name, a colon, and the port number. In other words, the complete JDBC connection url is "jdbc:voltdb://{server}:{port}".' should be 'in the following format: "jdbc:voltdb://{server}:{port}"'.
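
A short example in the text might make the format even clearer. The sketch below only assembles the URL string from the documented pattern; the helper name is hypothetical, voltsvr1 is borrowed from the manual's earlier examples, and 21212 is my assumption for the port.

```java
class JdbcUrlExample {
    // Build a connection URL following the "jdbc:voltdb://{server}:{port}" pattern.
    static String voltdbJdbcUrl(String server, int port) {
        if (server == null || server.isEmpty()) {
            throw new IllegalArgumentException("server must not be empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return "jdbc:voltdb://" + server + ":" + port;
    }

    public static void main(String[] args) {
        // Hypothetical server name and assumed port.
        System.out.println(voltdbJdbcUrl("voltsvr1", 21212));
    }
}
```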

p.97 14.3.2 should follow the fixed font convention for URLs

p.100 CREATE INDEX reads "when using the index as a key.", should be "when matching value(s) or range(s) of the named column(s)".

"• Although the definition of a default value for columns is allowed in the schema definition, currently INSERT statements must specify values for all columns in the table at runtime."
appears to contradict

p. 106 INSERT
"The last example assigns values for the employee ID and the first and last names, but not the middle initial. This query will only succeed if the MI column is nullable or has a default value defined in the database schema.
INSERT INTO employee (emp_id, lastname, firstname) VALUES (145304, "Doe", "John")"

p.106 INSERT example statements lack the terminating semicolon.

p.109 UPDATE reads "when using the INSERT statement", should be "when using the UPDATE statement".

p.113 Appendix D reads "(frequently referred to as deployment)" should be "(frequently referred to as deployment.xml or the deployment file)"

p.113 Appendix D reads "identifying allowed users and passwords" should be "the identification of allowed users and passwords"

p.113 Appendix D reads "The project definition file is a" should be "The configuration file is a"
Feb 29, 2012
Thank you so much for your suggestions. We especially appreciate the time and care you took. We will consider these changes for an upcoming release.