A Brief Discussion of PHP Automated Code Audit Technology

Original source: exploit

0x00

Since I have had little to post on the blog lately, I will summarize what I am currently working on and treat that as a post, focusing on some of the techniques used in the project. There are already many PHP automated audit tools available, including open source ones such as RIPS and Pixy and commercial ones such as Fortify. RIPS currently exists only in its first version, and because it does not support analysis of object-oriented PHP, its results are no longer very satisfactory. Pixy is based on data flow analysis but supports only PHP 4. Fortify is a commercial product, which makes it impractical to study. Domestically, research on PHP automated auditing is mostly done by companies. Most current tools use simple token stream analysis or, more crudely, regular expression matching, and their results are mediocre.

0x01

The technique I want to discuss today is an approach to PHP automated auditing based on static analysis, which is also the approach taken in my project. Regular expressions are clearly inadequate for effective variable analysis and taint analysis, or for coping with the many flexible syntactic forms found in PHP scripts. The approach introduced here audits code using static code analysis and data flow analysis.

First of all, I think an effective audit tool should contain at least the following modules:

1. Compiler front-end module
The compiler front-end module mainly uses abstract syntax tree (AST) construction and control flow graph (CFG) construction from compiler technology to convert source files into a form suitable for back-end static analysis.

2. Global information collection module
This module collects project-wide information about the source files under analysis, such as the class definitions present in the audited project and, for each method, its name, parameters, and the starting and ending line numbers of its definition block. Collecting these in advance speeds up the subsequent static analysis.

3. Data flow analysis module
This module differs from the data flow analysis algorithms of compiler technology: in this project it focuses more on handling the characteristics of the PHP language itself. When a call to a sensitive function is found during interprocedural or intraprocedural analysis, data flow analysis is performed on the function's sensitive parameters; that is, the changes the variable undergoes are tracked in preparation for the subsequent taint analysis.

4. Vulnerable code analysis module
This module performs taint analysis based on the global variables, assignment statements, and other information collected by the data flow analysis module. It mainly targets dangerous parameters of sensitive sinks, such as the first parameter of the mysql_query function, and obtains the corresponding data flow information by backtracking. If the parameter shows signs of being user-controlled during backtracking, this is recorded. If the dangerous parameter has gone through an encoding or sanitization operation, that is recorded as well. Taint analysis is completed by tracking and analyzing the data of dangerous parameters in this way.
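
As a rough illustration of what "dangerous parameters in sensitive sinks" could look like as configuration, here is a minimal sketch in Python. It is not the project's actual code: the `Sink` structure, the `SINKS` table, the vulnerability labels, and the `dangerous_arguments` helper are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sink:
    """A sensitive PHP function and the parameters worth backtracking."""
    name: str
    vuln_type: str          # hypothetical label, e.g. "SQLI" or "XSS"
    dangerous_args: tuple   # zero-based parameter positions


# Hypothetical sink table; a real tool would load this from configuration.
SINKS = {
    "mysql_query": Sink("mysql_query", "SQLI", (0,)),  # first parameter
    "exec": Sink("exec", "CMDI", (0,)),
    "echo": Sink("echo", "XSS", (0,)),
}


def dangerous_arguments(func_name, args):
    """Return (position, argument) pairs that need backtracking."""
    sink = SINKS.get(func_name)
    if sink is None:
        return []  # not a known sink, nothing to trace
    return [(i, args[i]) for i in sink.dangerous_args if i < len(args)]
```

For a call such as mysql_query($sql, $link), only the first argument would be handed to the backtracking step, while calls to functions that are not in the table are ignored.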

0x02

With these modules in place, how can they be combined into an effective automated audit process? The general flow of the analysis system I used is as follows:

1. Framework initialization

First, initialize the analysis framework. This mainly means collecting information about all user-defined classes in the source project to be analyzed, including class names, class properties, class method names, and the paths of the files in which the classes are defined.
These records are stored in the global context class Context, which is designed as a singleton and stays resident in memory so that later analysis stages can use it.
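
A minimal sketch of such a singleton context, written in Python for illustration only. The name Context comes from the text above; `ClassRecord`, `add_class`, and `find_class` are hypothetical names for this example, and keying records by class name alone is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ClassRecord:
    """Information collected about one user-defined class."""
    name: str
    file_path: str
    properties: list = field(default_factory=list)
    methods: list = field(default_factory=list)


class Context:
    """Global store of collected records; only one instance ever exists."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.classes = {}  # class name -> ClassRecord
        return cls._instance

    def add_class(self, record):
        self.classes[record.name] = record

    def find_class(self, name):
        return self.classes.get(name)


# Every call to Context() returns the same object, so records added
# during initialization are visible to all later analysis stages.
Context().add_class(ClassRecord("User", "lib/User.php", ["name"], ["save"]))
```

Because the records are keyed by class name, two classes with the same name in different files would overwrite each other here; a fuller version would need to include the file path or namespace in the key.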

2. Determine Main File

Next, determine whether each PHP file is a Main File. PHP has no main function as such. Most PHP files in a web application fall into two types: calling and definition. Definition-type files define business classes, utility classes, utility functions, and so on; they are not accessed by users directly but are provided for calling-type files to call. What actually handles user requests are the calling-type files, such as a global index.php. Static analysis mainly targets these calling-type files that handle user requests, that is, the Main Files. The criterion is:
Once AST parsing is complete, check whether the lines of code belonging to class definitions and method definitions in a PHP file exceed a certain proportion of all lines in the file. If so, the file is treated as a definition-type file; otherwise it is a Main File and is added to the list of file names to be analyzed.
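
The criterion above might be sketched as follows (Python, for illustration). The function name `is_main_file` and the 0.6 threshold are assumptions; the post does not state the actual proportion used. The line ranges are assumed to come from the start and end line numbers recorded for each class and function definition in the AST.

```python
def is_main_file(total_lines, definition_ranges, threshold=0.6):
    """Decide whether a PHP file is a Main File (calling type).

    definition_ranges holds inclusive (start_line, end_line) pairs for
    every class and function definition in the file. threshold is a
    hypothetical cut-off for the share of definition lines.
    """
    if total_lines == 0:
        return False  # an empty file has nothing to analyze
    covered = set()
    for start, end in definition_ranges:
        # A set avoids counting a method twice when its range also
        # falls inside the range of its enclosing class.
        covered.update(range(start, end + 1))
    return len(covered) / total_lines <= threshold
```

A 100-line file in which a class occupies lines 1 to 90 would be classified as a definition-type file, while a file with only a ten-line helper function would remain a Main File.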

3. Construction of the AST (abstract syntax tree)

This project is itself written in PHP. For AST construction, we rely on PHP Parser, an excellent existing implementation of PHP AST construction.
That open source project is also written in PHP and can parse most PHP constructs, such as if, while, switch, array declarations, method calls, and global variables. It handles part of this project's compiler front-end work very well.

4. CFG (control flow graph) construction

Use the CFGBuilder method in the CFGGenerator class. The method is defined as follows:

The idea is to build the CFG recursively. The input is the collection of nodes obtained by traversing the AST. During traversal, the type of each element (node) in the collection is examined, for example whether it is a branch, jump, or terminating statement, and the CFG is constructed according to the node type.
The jump conditions of branch and loop statements should be stored on the edges (Edge) of the CFG to facilitate data flow analysis.
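
The recursive construction can be illustrated with a small Python sketch over a toy AST. This is not the CFGBuilder method itself: the tuple-based node format, the `Block`, `Edge`, and `CFG` classes, and the `build` function are all assumptions made for this example, and only plain statements, if/else, and return are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    id: int
    stmts: list = field(default_factory=list)


@dataclass
class Edge:
    src: int
    dst: int
    condition: str = ""  # empty string means an unconditional edge


class CFG:
    def __init__(self):
        self.blocks = []
        self.edges = []

    def new_block(self):
        block = Block(len(self.blocks))
        self.blocks.append(block)
        return block

    def connect(self, src, dst, condition=""):
        self.edges.append(Edge(src.id, dst.id, condition))


def build(cfg, nodes, current):
    """Add nodes to the CFG starting in block `current`.

    Returns the block where control continues afterwards, or None if
    the path ended (for example with a return statement).
    """
    for node in nodes:
        kind = node[0]
        if kind == "if":
            _, cond, then_nodes, else_nodes = node
            then_entry = cfg.new_block()
            else_entry = cfg.new_block()
            join = cfg.new_block()
            # The branch condition is stored on the outgoing edges.
            cfg.connect(current, then_entry, cond)
            cfg.connect(current, else_entry, "!(" + cond + ")")
            for entry, body in ((then_entry, then_nodes), (else_entry, else_nodes)):
                end = build(cfg, body, entry)  # recurse into the branch
                if end is not None:
                    cfg.connect(end, join)
            current = join
        elif kind == "return":
            current.stmts.append(node[1])
            return None
        else:
            current.stmts.append(node[1])
    return current


# Toy AST for: $id = $_GET['id']; if (is_numeric($id)) {...} else { return; } ...
toy_ast = [
    ("stmt", "$id = $_GET['id']"),
    ("if", "is_numeric($id)",
     [("stmt", "$sql = 'SELECT name FROM t WHERE id=' . $id")],
     [("return", "return")]),
    ("stmt", "mysql_query($sql)"),
]
cfg = CFG()
exit_block = build(cfg, toy_ast, cfg.new_block())
```

Keeping the condition on each edge means a later data flow pass can tell that the mysql_query call is only reached along the path where is_numeric($id) held.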

5. Collection of data flow information

For a block of code, the information most worth collecting is assignment statements, function calls, constants (const, define), and registered variables (extract, parse_str).
Assignment statements are collected for later variable tracking. In the implementation, I use a structure to represent the assigned value and its location. Other data is identified and obtained from the AST. For example, for a function call, determine whether the variable has been escaped, encoded, and so on, or whether the called function is a sink (such as mysql_query).
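
One possible shape for this per-block information is sketched below in Python. The `Assignment` and `BlockSummary` structures, the dictionary-based statement format, and the `summarize` function are hypothetical; the post only says that a structure records the assigned value and its location.

```python
from dataclasses import dataclass, field


@dataclass
class Assignment:
    target: str  # variable being assigned
    value: str   # source text of the right-hand side
    line: int    # location, used later for backtracking


@dataclass
class BlockSummary:
    assignments: list = field(default_factory=list)
    calls: list = field(default_factory=list)       # (name, args, line)
    constants: dict = field(default_factory=dict)   # from define()
    registers_variables: bool = False               # extract()/parse_str() seen


REGISTERING_FUNCTIONS = {"extract", "parse_str"}


def summarize(statements):
    """Collect data flow facts from the statements of one code block."""
    summary = BlockSummary()
    for stmt in statements:
        if stmt["kind"] == "assign":
            summary.assignments.append(
                Assignment(stmt["target"], stmt["value"], stmt["line"]))
        elif stmt["kind"] == "call":
            name, args = stmt["name"], stmt["args"]
            if name == "define" and len(args) == 2:
                summary.constants[args[0]] = args[1]
            elif name in REGISTERING_FUNCTIONS:
                # These can create arbitrary variables from user input.
                summary.registers_variables = True
            summary.calls.append((name, args, stmt["line"]))
    return summary
```

The `registers_variables` flag is kept separate because extract and parse_str can introduce variables that never appear on the left side of an assignment, so a backtracking pass that finds no assignment for a variable may still need to treat it as user-controlled.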

6. Variable sanitization and encoding information processing

$clearsql = addslashes($sql);
In an assignment statement whose right-hand side is a filter function (user-defined or built-in), the return value of the called function is sanitized; that is, addslashes is added to the sanitization tags of $clearsql.
When a function call is found, determine whether the function name is a safe function listed in the configuration file.
If it is, add the sanitization tag to the symbol at that location.
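
The tagging step could look roughly like this (a Python sketch under assumptions). The `SAFE_FUNCTIONS` set stands in for the configuration file, and `Symbol` and `apply_assignment` are hypothetical names. Dropping all tags when the function is not configured as safe is a deliberately conservative choice made for this example, since an unknown function might undo earlier escaping.

```python
from dataclasses import dataclass, field

# Stand-in for the safe functions listed in the configuration file.
SAFE_FUNCTIONS = {"addslashes", "htmlspecialchars", "intval"}


@dataclass
class Symbol:
    name: str
    sanitizers: list = field(default_factory=list)  # sanitization tags


def apply_assignment(symbols, target, func_name, arg):
    """Model the PHP statement `$target = func_name($arg);`."""
    source = symbols.get(arg)
    if func_name in SAFE_FUNCTIONS:
        # Keep tags already on the argument and add the new one.
        tags = (list(source.sanitizers) if source else []) + [func_name]
    else:
        # Unknown function: assume it may undo earlier sanitization.
        tags = []
    symbols[target] = Symbol(target, tags)
    return symbols[target]


symbols = {"$sql": Symbol("$sql")}
clearsql = apply_assignment(symbols, "$clearsql", "addslashes", "$sql")
```

After this call, the symbol for $clearsql carries the addslashes tag, which a later taint check can compare against the kind of sink the variable eventually reaches.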

7. Interprocedural analysis

If a call to a user-defined function is found during the audit, interprocedural analysis must be performed: the code block of the specific method is located in the analyzed project, and the variables are carried into it for analysis.
The difficulties lie in how to backtrack variables, how to handle methods with the same name in different files, how to support analysis of class method calls, how to save user-defined sinks (for example, if myexec calls the exec function without effective sanitization, myexec should also be treated as a dangerous function), and how to classify user-defined sinks (SQLI, XSS, XPATH, and so on).
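
To make the myexec example concrete, here is a heavily simplified Python sketch of how user-defined sinks might be discovered and classified. It is an illustration, not the project's algorithm: function bodies are assumed to be straight-line lists of toy statements, each call takes a single variable argument, and the names `BUILTIN_SINKS`, `SANITIZERS`, `summarize_function`, and `find_user_sinks`, as well as the "CMDI" label, are hypothetical.

```python
BUILTIN_SINKS = {"exec": "CMDI", "mysql_query": "SQLI", "echo": "XSS"}
SANITIZERS = {"escapeshellarg", "addslashes", "htmlspecialchars"}


def summarize_function(params, body, known_sinks):
    """Return {param_index: vuln_type} for parameters reaching a sink.

    body is a list of ("assign", target, func_or_None, source_var)
    and ("call", callee, arg_var) tuples.
    """
    # origin maps a variable to the parameter it was derived from.
    origin = {p: i for i, p in enumerate(params)}
    reached = {}
    for stmt in body:
        if stmt[0] == "assign":
            _, target, func, source = stmt
            if source in origin and func not in SANITIZERS:
                origin[target] = origin[source]
            else:
                origin.pop(target, None)  # sanitized or unrelated value
        elif stmt[0] == "call":
            _, callee, arg = stmt
            if callee in known_sinks and arg in origin:
                reached[origin[arg]] = known_sinks[callee]
    return reached


def find_user_sinks(functions):
    """Repeat until no new user-defined sink is discovered."""
    sinks = dict(BUILTIN_SINKS)
    changed = True
    while changed:
        changed = False
        for name, (params, body) in functions.items():
            if name in sinks:
                continue
            reached = summarize_function(params, body, sinks)
            if reached:
                # Simplification: keep a single vulnerability type.
                sinks[name] = sorted(set(reached.values()))[0]
                changed = True
    return {n: t for n, t in sinks.items() if n not in BUILTIN_SINKS}


functions = {
    "myexec": (["$cmd"], [("call", "exec", "$cmd")]),
    "safe_exec": (["$cmd"], [("assign", "$c", "escapeshellarg", "$cmd"),
                             ("call", "exec", "$c")]),
    "run": (["$x"], [("assign", "$y", None, "$x"),
                     ("call", "myexec", "$y")]),
}
user_sinks = find_user_sinks(functions)
```

Repeating the pass until nothing changes lets wrappers of wrappers (run calling myexec calling exec) be picked up regardless of the order in which functions are visited. The sketch does not address same-named methods in different files or class method calls, which the text lists as open difficulties.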

The processing flow is as follows:

8. Taint analysis

After the steps above, the last task is taint analysis, which mainly targets risky functions built into the system, such as echo, which may lead to XSS. The dangerous parameters of each dangerous function must be analyzed effectively. This analysis includes determining whether effective sanitization has been performed (such as escaping or regular expression matching) and designing algorithms to trace back through the variable's earlier assignments and other transformations. This is undoubtedly a test of a security researcher's engineering ability, and it is also the most important stage of automated auditing.
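
A minimal backtracking sketch in Python, under strong simplifying assumptions: the code is straight-line, every right-hand side is a single variable optionally wrapped in one function call, and the `USER_SOURCES` and `SANITIZERS` lists and the `trace_back` function are hypothetical. A real implementation would walk the CFG and handle concatenation, branches, and sink-specific sanitizers.

```python
USER_SOURCES = ("$_GET", "$_POST", "$_COOKIE", "$_REQUEST")
SANITIZERS = {"addslashes", "intval", "htmlspecialchars"}


def trace_back(var, assignments, before_line):
    """Trace `var` backwards through earlier assignments.

    assignments is a list of (line, target, func_or_None, source_expr)
    tuples ordered by line. Returns (verdict, visited_lines), where
    verdict is "tainted", "sanitized", or "unknown".
    """
    path = []
    current, line = var, before_line
    while True:
        if current.startswith(USER_SOURCES):
            return "tainted", path
        # Most recent assignment to the variable before the current line.
        prior = [a for a in assignments if a[1] == current and a[0] < line]
        if not prior:
            return "unknown", path
        line, _, func, source = prior[-1]
        path.append(line)
        if func in SANITIZERS:
            return "sanitized", path
        current = source  # keep following the right-hand side


# $id = $_GET['id'];  $sql = $id;  $clean = addslashes($sql);
assignments = [
    (1, "$id", None, "$_GET['id']"),
    (2, "$sql", None, "$id"),
    (3, "$clean", "addslashes", "$sql"),
]
```

Calling trace_back("$sql", assignments, 4) walks from line 2 to line 1 and reports the value as tainted, while tracing "$clean" stops at the addslashes call on line 3 and reports it as sanitized. The loop always terminates because each step moves to a strictly earlier line.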

0x03

As the introduction above shows, implementing your own automated audit tool involves many pitfalls. I ran into many difficulties in my own attempts, and static analysis does have certain limitations. For example, the string transformation process that dynamic analysis obtains easily is difficult to reproduce in static analysis. This is not something a technical breakthrough can solve; it follows from the limitations of static analysis itself. So if pure static analysis is to achieve low false positive and false negative rates, it ultimately has to borrow some dynamic ideas, such as simulating the execution of code inside eval and handling string transformation functions and regular expressions. Also, in some MVC-based frameworks, such as the CI framework, the code is very scattered; for example, the data sanitization code may be placed in an extension of the input class. For PHP applications like these, I think a universal audit framework is hard to achieve, and they should be handled individually.

The above is only a rough summary of my current attempts (not yet fully implemented), shared for reference. After all, I am just a college student, not a professional. I hope it inspires more security researchers to pay attention to this field.