7July_13_2008 Preventing Injection Attacks with Syntax Embeddings
http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2007-003.pdf
use of syntax embedding to prevent injection vulnerabilities in lang-independent ways

One lang that constructs sentences in another lang: SQL, XQuery, XPath, XML or shell cmds
usually done using unhygienic string manipulation
injection attacks (one of the largest classes of security problems)
SQL-construction is more likely vulnerable than not
often in host language that dynamically computes SQL queries (like PHP)
CGI scripts call unix shell cmds
unhygienically constructed HTML
cross-site scripting (XSS)
malicious javascript code.
-Injections prevented by escaping external input... but injection is still possible
   attacker can escape the escape, and much has been done w/ escape chars
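
sketch (mine, not from the paper): unhygienic string concat in Python w/ stdlib sqlite3. The `users` table, the inputs and `naive_escape` are made up for illustration; the backslash-style dialect in the second part is an assumption (SQLite itself does not use backslash escapes, so that part is only shown as string manipulation).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a-secret"), ("bob", "b-secret")])

def lookup_unhygienic(name):
    # Query built by plain string concatenation: the input becomes SQL text.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

honest = lookup_unhygienic("alice")
# Attacker input closes the string literal and appends a tautology,
# so the WHERE clause is true for every row.
injected = lookup_unhygienic("x' OR '1'='1")

def naive_escape(s):
    # Flawed escaping for an assumed backslash-style dialect: quotes are
    # escaped, but backslashes already present in the input are not.
    return s.replace("'", "\\'")

# Input starts with a backslash, so the added backslash gets escaped itself
# and the quote is left free to terminate the literal.
escaped = naive_escape("\\' OR 1=1 --")
```

the honest call returns one row, the injected one returns every secret; `escaped` starts with two backslashes followed by a bare quote, i.e. the escape has been escaped.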

-better sol'n (in this paper) is to use an API to build the sentence.
  can ensure that injections are impossible by construction
  string literals: API code takes care of escaping.
  type system ensures well-formedness of the sentence
  (this is unattractive due to gap bet. programmer and syntax of guest language, which is a
  domain-specific language (DSL))
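
sketch (mine): what "build the sentence via an API" could look like. The classes `Ident`, `StringLit`, `Eq` and `Select` are hypothetical, not the paper's generated API; run-time isinstance checks stand in for a static type system, and quote doubling is assumed as the escaping rule (standard SQL).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ident:
    name: str
    def __post_init__(self):
        # Identifiers must be plain names; nothing else is accepted.
        if not (isinstance(self.name, str) and self.name.isidentifier()):
            raise ValueError("not a valid identifier: %r" % (self.name,))
    def to_sql(self):
        return self.name

@dataclass(frozen=True)
class StringLit:
    value: str
    def to_sql(self):
        # Escaping is done here, once, for every literal (quotes are doubled).
        return "'" + self.value.replace("'", "''") + "'"

@dataclass(frozen=True)
class Eq:
    left: Ident
    right: StringLit
    def __post_init__(self):
        # Only well-formed comparisons can be constructed.
        if not isinstance(self.left, Ident) or not isinstance(self.right, StringLit):
            raise TypeError("Eq expects Ident = StringLit")
    def to_sql(self):
        return self.left.to_sql() + " = " + self.right.to_sql()

@dataclass(frozen=True)
class Select:
    column: Ident
    table: Ident
    where: Eq
    def to_sql(self):
        return "SELECT %s FROM %s WHERE %s" % (
            self.column.to_sql(), self.table.to_sql(), self.where.to_sql())

query = Select(Ident("secret"), Ident("users"),
               Eq(Ident("name"), StringLit("x' OR '1'='1")))
sql = query.to_sql()
```

the attack string stays inside one literal, but the construction code no longer looks like SQL at all, which is the gap the notes above complain about.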

this paper: combining security of using an API w/ ease of string manipulation, by embedding the syntax of the guest lang into the syntax of the host lang. (pioneered by meta-programming)
ex: SQL-in-Java
preprocessor (assimilator) translates code in combined lang into Java code that calls API generated from guest lang grammar.
embedding is not a new idea (e.g. SQL-92 standard embeds SQL into C programs)

contributions:
-comprehensive sol'n to injection attacks via construction
-generic enough to be easily adapted to new host and guest langs
-generic in two ways: one via language embedding (modular, scannerless parsing), the other through generating the underlying APIs from a context-free grammar of the guest lang. the assimilator that translates guest lang to API calls can be applied to any host lang and combos of guest langs (NO meta-programming required)
-well-formedness of constructed guest lang sentences can be ensured at run-time (as well as statically)
-ambiguities are dealt with automatically instead of having the programmer disambiguate such things as antiquotations

prototype: StringBorg (after MetaBorg) http://www.stringborg.org/

core prob underlying injection attacks -- the query is parsed after construction, and the result does not correspond to the intended grammatical structure. structures not easily compared (accd to this author--but look at parse trees)
StringBorg handles this as preprocessing step, then constructs code
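
sketch (mine): a crude stand-in for "look at parse trees". It compares token-kind sequences rather than real parse trees, using a toy tokenizer for a tiny SQL subset; `TOKEN`, `shape` and the query template are all made up here and this is not how StringBorg works.

```python
import re

# Toy lexer: string literals (with '' doubling), words, numbers, a few operators.
TOKEN = re.compile(
    r"\s*(?:(?P<str>'(?:[^']|'')*')"
    r"|(?P<word>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<num>\d+)"
    r"|(?P<op>[=<>*,;()]))")

def shape(query):
    """Return the sequence of token kinds, with literal values abstracted away."""
    kinds, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError("cannot tokenize at position %d" % pos)
        kind = m.lastgroup
        # Keywords, identifiers and operators keep their text; literals do not.
        kinds.append(m.group(kind).upper() if kind in ("word", "op") else kind)
        pos = m.end()
    return kinds

template = "SELECT secret FROM users WHERE name = '%s'"
intended = shape(template % "alice")
other_benign = shape(template % "bob")
attacked = shape(template % "x' OR '1'='1")
```

benign inputs give the same shape; the injected input adds `OR`, an extra comparison and two extra literals, so the structure differs from what the programmer wrote.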

overview: syntax of guest lang is embedded in host lang, combined syntax is used to write programs, assimilator parses source file and transforms embedded code into invocations of the API (produced by the API generator)
thus preventing SQL injection attacks by ALWAYS checking lexical values
-this method is language independent

discussion:
Static typechecking has two major disadvantages
 (1) the programmer has to know all these syntactical categories and their mapping to types of the host language and
 (2) no ambiguities are allowed, which makes the syntax embedding more difficult to use.
advantage: static checking provides more static guarantees, but this is not a security advantage.
"both the statically and dynamically typed back-ends guarantee statically that an injection attack cannot occur.
The dynamic or static typechecking only checks for programming errors, not for problems with input provided by the user.
The generated APIs will never throw an ‘injection attack exception’; the exceptions that can occur are either related to illegal characters in the input (e.g. the newline in SQL) or are programming errors. The last category of exceptions does not depend on particular inputs, but only on execution paths, which are easier to detect using testing."

Prevented classes of injection attacks
attacks classified by injection mechanism or intent of the attack.
FROM paper directly----
Injection through user input is the mechanism of using specially crafted user input to construct a query
that has a different parse tree then originally intended. StringBorg prevents these attacks by checking the
syntax of lexical values and automatic escaping of all strings.
Injection through cookies differs from injection through user input by exploiting input from cookies,
which are sometimes naively assumed to be controlled by a web application. StringBorg checks and
escapes all strings, irrespective of their origin, thus disabling this injection mechanism.
Injection through server variables, such as HTTP headers, employs yet another origin of strings to
perform an attack. Again, these attacks are prevented since StringBorg escapes all strings.
Second-order injection attacks indirectly perform the attack by first introducing a malicious input in the
system (e.g. database), which is used later as the input of an affected query. Again, these attacks are
prevented since StringBorg checks and escapes all strings, whether they originate directly from the user
or not.
Tautology-based attacks use an injection mechanism to craft a query where the condition always evaluates
to true. StringBorg prevents the mechanisms of injection attacks from being applied, which implies that
crafting tautologies is impossible.
Union query attacks are related to tautologies, but allow access to different tables than the ones originally
involved in the query. Similar to tautology attacks, StringBorg prevents the mechanisms that are used.
Piggy-backed queries are malicious queries added to be executed in addition to the original query. Again,
StringBorg prevents the mechanisms that are used.
Illegal query attacks are used to trigger syntax, type or logical errors. This often results in an error report
that reveals information about possible exploits. StringBorg only throws an exception if an input string
contains invalid characters that could not be escaped. StringBorg disables the construction of syntactically
invalid queries.
    NOTE:"An embedding that allows conversion of input strings to table and column names
    (which is not the case in our embeddings). It is advisable to disallow this conversion and only allow
    literal table and column names. In general, allowing users to input identifiers can introduce a plenitude of
    options for manipulating the intended semantics of the constructed guest sentence."
Inference attacks are related to illegal query attacks. They can be applied if a site is protected not to show
error messages. By observing the success or failure of queries, the setup of the database can indirectly
still be examined. The prevention of inference attacks does not differ from illegal query attacks.
Stored procedure attacks are a class of all known attacks applied to stored procedures. If stored procedures
compose queries based on user input, then the same method for structured construction should be
applied.
Alternate encoding attacks avoid detection and prevention of an attack by concealing the actual query
in a different syntax or character encoding, which tricks the detection and prevention techniques into
interpreting the query in a different way then the actual processor of the guest language does. In all
known embeddings, StringBorg prevents encoding attacks since the encoding itself is escaped and lexical
strings are checked syntactically.
---END FROM paper directly

NOT GUARANTEED to prevent for all guest languages (e.g. Unicode escapes in Java are allowed for any input char, not just in string literals)
So if Java is the guest lang, Java's Unicode escapes can be used to terminate a string literal and inject. (not caught by lexical checking)
DFA does not unescape Unicode escapes
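
sketch (mine): why a lexical check that does not unescape can be bypassed. `SAFE_CONTENTS`, `check_lexical` and `unescape_unicode` are hypothetical; the unescaping is a simplified imitation of Java's early \uXXXX translation (it ignores details such as repeated `u`s and backslash counting).

```python
import re

# Hypothetical lexical check for the contents of a string literal in a
# Java-like guest language: no raw double quote and no raw newline.
SAFE_CONTENTS = re.compile(r'[^"\n]*')

def check_lexical(value):
    return SAFE_CONTENTS.fullmatch(value) is not None

def unescape_unicode(text):
    # Simplified: replace each \uXXXX by the character it denotes,
    # as happens before tokenizing in the guest language.
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

# No raw quote anywhere in the payload, only its Unicode escape.
payload = r'\u0022 + secret + \u0022'
passes_check = check_lexical(payload)

# The value is wrapped as a literal, then the guest language unescapes it.
literal = '"' + payload + '"'
after_unescape = unescape_unicode(literal)
```

the payload passes the check, yet after unescaping the "literal" has become two empty strings with `secret` concatenated in between. checking the unescaped form instead (sol'n 2 below) would reject it.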

sol'n's
1. escape sequence can be escaped
2. unescaping rules defined next to escape rules
3. syntax definition of guest lang restricted not to support unicode escape sequences
4. syntax def formalism extended to lexical escape

use of unexpected char encodings can hide an attack.

StringBorg
--relies heavily on modular syntax def and parser generation, by SDF and scannerless Generalized-LR parsing
need syntax of host and guest expressible in a context-free grammar (not all langs are!)
--error reporting: quality of error msgs important...
--efficient parser composition: need a parser for every combo of langs... parser generation too expensive as part of compilation, so done separately, which lacks "plug-in" flexibility. future: parse table plug-ins

RELATED WORK
this work does not alleviate the need for static or dynamic analysis techniques as they apply to existing code that is more traditional

Explicit escaping and filtering (escape input and filter malicious inputs) ... requires programmers to get it right each time
APIs: SQL DOM: safe SQL w/ query construction behind an API that ensures string literals are escaped, via construction: SQL abstract syntax trees generated from a specific db schema. ensures typing of queries wrt db.
Safe Query Objects: queries defined in plain Java, compiled using OpenJava into JDO calls (can be seen as embedding convenient syntax for queries into a host lang; assimilation is translation into JDO.)
 
LINQ, syntactic hygiene: (Haskell Server Pages) Cω and LINQ provide XML literals, enable XML output; host expression converted to an expression tree that can be processed in arbitrary ways. (not extensible to other host langs)

Static analysis techniques JDBC Checker, 'tainted data' , etc....

Run-time detection techniques: AMNESIA automaton for query strings, WASP 'trusted' string, SQLCHECK wrappers with markers
they detect but don't recover... so DoS attack still possible (they just shut down the server or something?)

SQL-specific techniques SQL-92 has embedding for specific langs,
  prepared statements allow safe construction
  stored procedures (as long as it is called in a safe way)
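
sketch (mine): prepared-statement style construction, using parameter binding in Python's stdlib sqlite3 as a stand-in for JDBC-style prepared statements. Table and inputs are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "a-secret"))

def lookup(name):
    # The statement text is fixed; the value is bound to the ? placeholder
    # and is never spliced into the SQL text.
    rows = conn.execute("SELECT secret FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]

found = lookup("alice")
# The attack string is treated as an (unmatched) name, not as SQL.
attack = lookup("x' OR '1'='1")
```

safe for values, but placeholders only cover values; the shape of the query itself still has to be fixed in advance.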

MetaBorg provides an embedded domain-specific syntax for using libraries.
 scannerless Generalized-LR alg to parse embedded domain-specific lang and Stratego prog transformation lang for assimilation of embedded code to host lang. http://www.program-transformation.org/Stratego/MetaBorg


CONCLUSION
"The main advantage over previous approaches is that it
makes injections impossible by construction, and that it is generic—it is not necessary to produce APIs and
assimilators for each element of the cross-product of host and guest languages {Java, C#, PHP, Perl, . . .}×
{SQL, JDOQL, HQL, EJBQL, OQL, XML, HTML, XPath, XQuery, Shell, . . .}, but only to perform a
relatively small amount of work for each host and guest language."