
Checks and Validations of Entries Received through Components

Implementation Considerations

Entries and input data that are supplied through component interfaces must be checked for syntactical and semantic correctness.

Component interfaces in this context comprise:

  • User interfaces.

  • Communication protocol end-points.

  • File system and database access.

  • Public interfaces of components, for example, APIs, and command interpreters.

Syntactical correctness includes checks for the data type and format, including:

  • Allowed length.

  • Allowed characters.

  • Allowed ranges, for example, case-sensitive alphanumeric characters: A-Z, a-z, 0-9.

  • Allowed values, also known as a 'white list' check.

Semantic correctness validation includes checks for the technical and the business context, such as the status of business objects or interdependencies of arguments. In addition, the maximum size of the input must be checked against a defined limit. Incorrect input should be rejected and should not cause the product to transition into an uncontrolled state.

The fundamental principle is not to trust the client or any input coming from an unknown or untrusted source. Inappropriate input validation leads to numerous well-known vulnerabilities, including extremely dangerous injection attacks such as Cross-Site Scripting (XSS), SQL injection, command and code injection, buffer overflow attacks, and directory and path traversal.

The following are examples of input and output of potentially critical entries:

  • SQL injection: users supply SQL fragments such as ' OR ' in input fields; the output is an SQL statement.

  • Directory and path traversal: users supply a file name containing '..'.

  • Command injection: users enter shell metacharacters such as '>', '|', ';'.

  • XSS (reflected): users supply URL parameters such as <script; the output is HTML and JavaScript.

  • XSS (persistent): the input is read from a data store and contains, for example, <script; the output is HTML and JavaScript.
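As an illustration of the SQL injection case above, the following sketch (Python with the standard sqlite3 module; the table and payload are hypothetical) contrasts string concatenation with bound parameters, which treat input strictly as data:

```python
import sqlite3

# Hypothetical table and payload, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('ALICE', 's3cret')")

malicious = "' OR '1'='1"

# Unsafe: concatenation lets the payload become part of the SQL syntax.
unsafe_sql = "SELECT secret FROM users WHERE name = '" + malicious + "'"
leaked = conn.execute(unsafe_sql).fetchall()  # every row is returned

# Safe: a bound parameter is always treated as data, never as SQL.
safe = conn.execute("SELECT secret FROM users WHERE name = ?",
                    (malicious,)).fetchall()  # no row matches
```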

A complementary protection measure to input validation is output encoding (escaping). Output encoding serves as a kind of 'double bottom', a safety net in case something slipped through the input validation fence.
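As a minimal sketch of output encoding, Python's standard html.escape can serve as that 'double bottom' when input is later embedded in HTML (the payload below is illustrative):

```python
import html

# Attacker-controlled input that slipped through input validation.
user_input = "<script>alert('Hacker')</script>"

# Escaping turns the markup characters into entities, so a browser
# displays the text instead of executing it as JavaScript.
encoded = html.escape(user_input, quote=True)
print(encoded)  # &lt;script&gt;alert(&#x27;Hacker&#x27;)&lt;/script&gt;
```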

The following are some use cases and the impact for failing to perform necessary checks and validations:

  • Confidentiality

    Missing or insufficient input validation can lead to various dangerous injection attacks that can, in the worst case, end in a complete confidentiality compromise, for example, by traversing and reading the server file system or reading database tables.

    The impact, however, cannot be generalized and needs to be evaluated on a case-by-case basis.

  • Integrity

    Missing or insufficient input validation can lead to various dangerous injection attacks that can, in the worst case, end in a complete integrity compromise, for example, by changing the content of database tables through SQL injection.

    The impact, however, cannot be generalized and needs to be evaluated on a case-by-case basis.

  • Availability

    Missing or insufficient input validation can lead to various dangerous injection attacks that can end in a complete availability compromise, for example, by dropping important database tables or by deleting and changing files in the server file system through directory and path traversal.

    The impact, however, cannot be generalized and needs to be evaluated on a case-by-case basis.

  • Compliance

    A system with missing input validation cannot guarantee confidentiality and integrity of its data. Thus, compliance with respect to auditing (SOX, FDA Part 11) and data protection can be heavily impacted.

  • Total Cost of Ownership (TCO)

    In general, missing input validation does not in itself generate TCO for the customer; however, due to the compliance impact, customers could be forced to protect their data, which would clearly increase the TCO.

  • Total Cost of Development (TCD)

    Since the basis for a sufficient input validation is a good architectural trust model, implementing proper input validation consumes additional development resources and thus increases the TCD (but the money is very well spent).

Input validation is a very important cornerstone of a sound security concept; however, applications very often fail to implement proper input validation. Below are some of the reasons for this failure:

  • Input is not recognized as such.

  • The source of input is trusted.

  • The false implicit assumption that somebody else has already done the validation.

  • Performance is deemed to be more important than security.

  • Tight development schedules do not leave enough time for secure programming.

The following sections give an overview of what to take into account when designing and implementing secure input validation in your application.

Syntactical Correctness

In the ideal case, the platform used, such as .NET, Java, or NetWeaver ABAP, enforces strongly typed variables; the platform then implicitly takes care of a large portion of the syntactical input validation tasks.


A variable of type '16-bit signed INTEGER' should automatically be checked by the platform, and the input rejected if it contains any characters other than 0-9 (and an optional sign), contains decimal fractions, or exceeds or under-runs the boundaries of the respective data type, in this example -32,768 .. +32,767.


Always make as much use as possible of input validation features provided by your platform.

Always use meaningful typing for variables in method and function interfaces (no strings, if possible).
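Where the platform does not enforce strong typing, the check described above can be emulated explicitly. A minimal sketch in Python (the function name and pattern are illustrative), rejecting anything that is not a 16-bit signed integer:

```python
import re

# Optional sign followed by digits only; nothing else is accepted.
_INT_PATTERN = re.compile(r"[+-]?\d+\Z")

def parse_int16(raw: str) -> int:
    """Emulate the platform check for a 16-bit signed INTEGER:
    allowed characters, no fractions, range -32768..32767."""
    raw = raw.strip()
    if not _INT_PATTERN.match(raw):
        raise ValueError(f"not an integer: {raw!r}")
    value = int(raw)
    if not -32768 <= value <= 32767:
        raise ValueError(f"out of range for INT16: {value}")
    return value
```

Rejecting early with an exception keeps the application out of an uncontrolled state, as demanded above.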

  • Existence and length check

    If not done automatically by the platform, the application should check if all mandatory and required parts are contained in the provided input.


    Make sure that received input contains all required information. An attacker should not be able to harm the application by simply omitting required variables.

    If not done automatically by the platform, the application should check if the length of provided input data matches the defined length of receiving data structures.


    Make sure that received input matches the length restrictions defined in the application. An attacker should not be able to harm the application by providing data which is too long OR too short for the receiving data structure.

    Even if the platform provides support in checking for mandatory input data and allowed length, these features must be used in a correct way.


    Even the best-in-class platform with strongly typed variable enforcement does not help if you, the developer, do not make proper use of this valuable feature, for example, by typing all input variables of a class method or function module as STRING.

    Strings make input validation extremely difficult or even impossible due to missing restrictions. Additionally, the 'unlimited' length of a string gives attackers all the space needed for crafting their 'deadly injections'.

  • Type check

    If not done automatically by the platform, the application must check the correct type of input data provided.


    Attackers should not be able to harm the application by simply feeding input data of the wrong type to it, for example, strings instead of dates.

    The validation function has to check that the type of input data, and the type of the receiving data structure match.

  • Range check

    If not done automatically by the platform, the application must make sure that the provided input is inside the range defined for the receiving data structure.

    As a prerequisite, the application has to define those limits in each and every case. Clearly, this is only possible for variables with a definable range of values.


    Attackers should not be able to harm the application by feeding input data that is exceeding or under-running the defined upper and lower limits of a data structure.

    The validation function has to check that input data is inside the defined range.
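    The existence, length, type, and range checks above can be combined into a single validation routine. A minimal sketch in Python; the field definitions are hypothetical and would normally come from the application's data model:

```python
# Hypothetical field definitions; a real application would derive
# these from its data model instead of hard-coding them.
FIELDS = {
    "quantity": {"required": True, "type": int, "min": 1, "max": 9999},
    "comment":  {"required": False, "type": str, "max_len": 200},
}

def validate(data: dict) -> list:
    """Return a list of validation errors; an empty list means the input passed."""
    errors = []
    for name, spec in FIELDS.items():
        if name not in data:
            if spec.get("required"):
                errors.append(f"{name}: missing mandatory field")  # existence
            continue
        value = data[name]
        if not isinstance(value, spec["type"]):                    # type
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "max_len" in spec and len(value) > spec["max_len"]:     # length
            errors.append(f"{name}: longer than {spec['max_len']}")
        if "min" in spec and value < spec["min"]:                  # range
            errors.append(f"{name}: below {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{name}: above {spec['max']}")
    return errors
```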

  • Canonicalization

    Whenever input data can be provided in multiple or polymorphic forms, it is necessary to transform the data into its simplest, shortest and thus unambiguous representation before applying any further checks.

    This is especially important if strong typing of variables is missing, for example, some script languages, or, if all input data is transferred and received as STRINGs, for example, a Servlet consuming URL parameters of an HTTP request.


    An application does not distinguish small and capital letters in user names and thus it has been decided to always store user names in capital letters in the database.

    On the user interface, however, it is still possible to enter user names with mixed mode, for example, AdMiNisTraTor, adminisTrator.

    Without canonicalization, a comparison like 'Administrator' equals 'ADMINISTRATOR' would return FALSE as a result.

    Canonicalization, however, would transform the user input 'Administrator' into 'ADMINISTRATOR' (in this case), thus removing all mixed mode, and the check would be successful.

    Canonicalization is not a fixed single algorithm that can be reused all over the place. In fact, canonicalization is always strongly bound to contracts, for example, 'user names are always stored in capital letters' or 'the shortcut ..\ moves up one directory in the file system'.

    A canonicalization routine needs to know all details of these contracts and based on this information reduce the provided input data to the minimal, shortest, and unambiguous representation.

    Typically, platforms provide canonicalization routines for URLs, file system paths, code pages, and many more.


    Always use canonicalization routines provided by the used platform. Do not try to code these routines yourself; the risk of creating loopholes which can later be exploited by an attacker is just too high.


    A physical path in a file system can be addressed in many different ways, for example:

    \\fileserver123\system\documents\

    \\fileserver123\system\documents\..\..\etc\passwd

    Assume an application defines '\\fileserver123\system\documents\' as the application's root path, and the logic just checks whether the provided path in the input data starts with this root path (access shall be allowed to all directories underneath this 'root' directory BUT NOT ABOVE).

    In this case, the second example path would just pass the check successfully, since it starts with the literal '\\fileserver123\system\documents\'. However, the notation '..\..\' would move up to the file server's root directory and with '\etc\passwd' the attacker would dive directly into the heart of the server.

    With proper canonicalization, the second example path would have been reduced to the shortest, simplest, and unambiguous representation, which is in this case '\\fileserver123\etc\passwd' which is clearly not a subdirectory of '\\fileserver123\system\documents\' and the check would thus fail.


    Many current attack methods exploit careless handling of polymorphic representations in order to evade filter mechanisms.

    Thus, canonicalization is a fundamental puzzle piece of a successful security concept, especially in cases where strong typing is not enforced, for example, web applications receiving URL parameters.

    When developing canonicalization routines, pay special attention to the following:

      • The canonical representation fully depends on the expected syntax of the final receiver of the data, for example, file system, Web server, database (DBMS).

      • Certain control characters or shortcuts might be interpreted differently on different platforms.

      • Take into account different notations on different platforms, for example, '\' on Windows and '/' on Linux OS.

      • Be aware of double-encoded characters, and check that you work in the same character space, for example, Unicode or ASCII.

      • Remember that combinations of ASCII and HEX characters may represent malicious code.

      • Think about case sensitivity.

      • Never use black lists in your algorithm.
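    The file system example above can be sketched in Python using the platform's own canonicalization routines (os.path.realpath, os.path.commonpath), as recommended; the root path is hypothetical:

```python
import os.path

# Hypothetical document root; the text's example uses a UNC share.
ROOT = os.path.realpath("/srv/system/documents")

def is_inside_root(user_path: str) -> bool:
    """Canonicalize first, then compare: '..' segments are resolved
    before the prefix check, so traversal cannot slip through."""
    canonical = os.path.realpath(os.path.join(ROOT, user_path))
    return os.path.commonpath([canonical, ROOT]) == ROOT

# A naive startswith() check would accept the traversal path; after
# canonicalization it resolves outside ROOT and is rejected.
print(is_inside_root("report.txt"))        # True
print(is_inside_root("../../etc/passwd"))  # False
```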

  • White list checks

    Wherever possible, input values shall be checked against a meaningful(!) white list. Here too, most platforms offer support which helps reduce the TCD for white list checks to a minimum.


    Domain values (ABAP), enumerations (Java, .NET), ranges (ABAP), and foreign key relationships in relational databases.

    If it is not possible to base a white list check on one of the standard features provided by your platform, the check has to be explicitly implemented by the development team in the respective application.


    A commonly found concept is to programmatically check an input value against a white list of allowed values stored in a file or database table, for example, customizing tables, and configuration files.

    A white list is an explicit list of allowed values, however, white lists can also contain patterns, regular expressions, or wildcards.


    A white list for allowed user names simply contains the regular expression [A-Z0-9]{4,20}, which means that only user names containing capital letters and numbers, with a minimum length of 4 and a maximum length of 20 characters, are allowed.

    User names with small letters or special characters would be rejected.

    Another example: the white list entry 'ORDER-????.TXT' would be programmatically interpreted to allow only file names starting with 'ORDER-', followed by a 4-digit numerical suffix and the file type .TXT.

    In these cases, the application logic must take care that the comparison algorithm does not have loopholes which could end up as security vulnerabilities.



    Allowing wildcards in a white list can reduce TCO for the customer and keep the white list short and readable, however, the concept of wildcards also comes with a risk.

    Entering wildcards like * or 'ALL' into the white list immediately renders this valuable security feature completely useless.

    The most extreme example in this context is the * or 'ALL' wildcard in access control lists or other user permissions and authorizations.

    If input data can be received in polymorphic representations, for example, relative paths or relative URLs, canonicalization is a mandatory step before checking input against a white list.
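    The user-name white list from the example above can be sketched with a regular expression check; note that the pattern is anchored so the whole input must match:

```python
import re

# White list from the text: capital letters and digits, length 4 to 20.
USER_NAME = re.compile(r"[A-Z0-9]{4,20}\Z")

def is_allowed_user_name(name: str) -> bool:
    # Anchored check (match + \Z): the WHOLE input must match the
    # pattern; a bare search() would also accept embedded payloads.
    return USER_NAME.match(name) is not None

print(is_allowed_user_name("ADMIN01"))  # True
print(is_allowed_user_name("admin"))    # False: small letters
print(is_allowed_user_name("AB"))       # False: too short
```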

  • Last Resort: Black Lists (highly exceptional)

    There may be cases where you cannot even implement a white list filter. In this situation, you should at least write a black list filter function, for this is better than doing no filtering at all.

    Doing nothing here regularly has detrimental effects for the security of your application.


    Black lists are a fallback only.

    At this point, it is extremely important to acknowledge that black list filters are vulnerable by nature and so can only be a fallback solution.

    Whenever possible, use a white list filter. If it is not feasible, work towards making it so.

    Only if it is still absolutely impossible, use a black list filter.

    In the latter case, document this as a vulnerability in the application's internal documentation.
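    A minimal sketch of such a last-resort black list filter (the entries are illustrative), including the kind of evasion that makes black lists vulnerable by nature:

```python
# Illustrative black list of known-bad substrings; by nature incomplete.
BLACK_LIST = ("<script", "';", "../")

def passes_black_list(value: str) -> bool:
    lowered = value.lower()
    return not any(bad in lowered for bad in BLACK_LIST)

print(passes_black_list("hello"))        # True
print(passes_black_list("<script>x"))    # False: known pattern caught
# Vulnerable by nature: an encoded variant the list does not know
# evades the filter (which is why white lists are preferred).
print(passes_black_list("%3Cscript%3E")) # True
```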

  • Sanitization

    Sanitization is the endeavor of trying to repair malicious input data by cutting out or rewriting the dangerous parts.

    Relying on sanitization as a protection mechanism is always dangerous, since a single flaw in the algorithm is sometimes enough for an attacker to exploit the application.


    A sanitization algorithm chops input data into separate parts, searching for those starting with the '<' character or ending with the '>' character.

    As a second step, the algorithm removes the first or last character, respectively.

    Possible attack:

    An attacker could just double the critical characters, for example, <<script>>.

    The algorithm would return <script>: the dangerous tag survives, due to the missing iterative approach, that is, searching and cutting as long as there are still '<' or '>' characters available.


    In general, the chances of failing with sanitization are too high; thus, we recommend not using sanitization at all. Instead, applications should reject input data containing characters that are not allowed and throw an error message.

    If you really need sanitization in your application, make sure to have an additional protection measure in place, for example, output encoding.

    Do not rely on sanitization as the only protection.
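    The flawed single-pass algorithm from the example above can be sketched as follows; it shows why the doubled characters survive:

```python
def naive_sanitize(fragment: str) -> str:
    """Single-pass version of the flawed algorithm described above:
    strip one leading '<' and one trailing '>', but only once."""
    if fragment.startswith("<"):
        fragment = fragment[1:]
    if fragment.endswith(">"):
        fragment = fragment[:-1]
    return fragment

# Doubling the critical characters defeats the single pass: after one
# round of stripping, a complete <script> tag is still present.
print(naive_sanitize("<<script>>"))  # <script>
```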

  • Sanitization and Canonicalization

    Canonicalization is a transformation of a possible polymorphic representation of information into the shortest and unambiguous form. However, afterwards, it is still the same information.

    Example: ..\..\documents\file.txt and c:\documents\file.txt are two different pointers to the same file.

    Sanitization, in contrast, applies real changes to the information, for example, removing <script> tags.


    '<script>alert('Hacker')</script>' and 'alert('Hacker')' are totally different.

    The first representation would be executed by a Web browser as JavaScript code; the second would be displayed as text.

Semantic Correctness

In addition to the basic syntactical correctness checks, you must check input data for semantic correctness. This includes dependencies between arguments, the allowed status of objects, and rule-based checks.


A method is called to trigger a state change of a travel request from 'requested' to 'approved'.

Logically, a travel request can have both the status 'requested' and the status 'approved', so syntactically both statuses belong to the white list of allowed values. However, from a business perspective, the transition from 'requested' to 'approved' is bound to some preconditions: the requestor belongs to the respective cost center [white list check], the approval of the cost center manager is available [allowed status of business object], the travel cost is covered by the available budget [dependency on the status of another business object], and the travel cost complies with corporate cost limits and other travel restrictions, for example, no first class flights to Hawaii [rule-based checks].

This may look like a matter of purely functional correctness, but that is only partly true. If the application does not take care of these checks, an attacker could provoke inconsistent or unforeseen states in a business system and use this for crafting further attacks.

Semantic checks are by nature only rarely provided by a platform (exception: frameworks like rule engines) and thus have to be implemented by the applications.
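The travel request example can be sketched as an explicit semantic check; all object fields, transition rules, and the cost limit below are hypothetical:

```python
# Hypothetical transition table, fields, and cost limit.
ALLOWED_TRANSITIONS = {"requested": {"approved", "rejected"},
                       "approved": {"closed"}}

def approve(request: dict, budget_left: float) -> None:
    """Semantic checks before the state change, mirroring the text."""
    if "approved" not in ALLOWED_TRANSITIONS.get(request["status"], set()):
        raise ValueError(f"illegal transition from {request['status']!r}")
    if not request.get("manager_approval"):   # allowed status of the object
        raise ValueError("cost center manager approval missing")
    if request["cost"] > budget_left:         # dependency on another object
        raise ValueError("travel cost exceeds remaining budget")
    if request["cost"] > 5000:                # rule-based check
        raise ValueError("corporate cost limit exceeded")
    request["status"] = "approved"

request = {"status": "requested", "manager_approval": True, "cost": 1200.0}
approve(request, budget_left=10000.0)
print(request["status"])  # approved
```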


As always, input validation also has 'grey zones' and limits where a meaningful check is very difficult to apply, or where a check would severely limit the business functionality built on top or unreasonably increase TCO for the customer.

The frequently cited example is 'name and address fields': Names and addresses are difficult to properly limit from a size perspective, if your software is supposed to run all over this planet.

White lists for names and addresses, in addition, are also very difficult to fill due to the variety and sheer amount of possibilities.


Longest name of a city in Europe: Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch (58 characters, city in Wales, Great Britain)

Longest first name of a human being (according to the Guinness Book of Records 1978, born 29th of February 1904): First name(s): Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus

Looking at these extreme examples, the conclusion could be that input validation is not possible and that name and address fields should therefore always be typed as STRINGs; but this would be completely wrong.

Also name and address fields (and similar fields with a broad size variance) should have a reasonable size limit.


Even if the size of input varies a lot or is difficult to predict, input variables or database fields should not be typed as STRING but limited to a reasonable length.

In addition, critical characters (characters useful for crafting attacks) should be disallowed if possible and if the limitation would not hinder business functionality.

As an example, if you can exclude the '<' and '>' characters in input fields, this exclusion makes the life of an XSS attacker much more difficult.

The less input validation is possible, the more important it is to apply proper output encoding.