Figure 1 - Potential threats for DOM-based XSS. (Diagram: a central "Source" node branching out to "HTML element," "HTTP request," "API Responses," "Cross-Origin Message," and further unnamed sources.)

Semmle QL: DOM XSS vulnerability hunting

In two previous blog posts (part 1 and part 2), we talked about using Semmle QL in C and C++ codebases to find vulnerabilities such as integer overflow, path traversal, and those leading to memory corruption. In this post, we will explore applying Semmle QL to web security by hunting for one of the most common types of client-side vulnerabilities: DOM-based cross-site scripting (XSS).

You may already know that an XSS attack can be either server-side or client-side. Most Microsoft services are built on top of ASP.NET on the server side, while the client side is heavily based on TypeScript and JavaScript. This post goes through the process of determining sources (the starting points of an input) and sinks (where the input ends up, perhaps to be leveraged into a vulnerability) on the target, and shows how to eliminate false positives and unexploitable cases to reduce the time spent on code auditing. Our target is Outlook Web (outlook.office.com), a part of Office 365.

Note: Before reading this, I suggest you read the previous parts of this series to understand some of the underlying concepts, especially data flow analysis and taint tracking.

Defining the source

A source is where the application receives data provided by a user. From an attacker’s point of view, an interesting source is one they can easily control on the victim’s side. For example, an attacker is unlikely to control the User-Agent field of an HTTP request merely by luring a victim to a malicious page (where an exploit chain typically starts), which makes it an uninteresting source.

In an initial review, we observe that sources in this situation can come from several places, such as HTML element properties, HTTP request headers, API responses, cross-origin message channels, or the URL itself (location.search, location.href, location.hash, location.pathname).

Narrowing the set of sources is hard and expensive, and it risks introducing false negatives by mistakenly dropping a real source. Instead, we can evaluate and then eliminate unexploitable paths later by overriding the isSanitizer predicate of the data flow configuration.

Moreover, taint tracking will be used to track the flow here, since our data probably passes through operations whose results are derived from it and are therefore tainted as well (part 2 covers this). For example, the data flow node corresponding to the string “Hello world!” should taint the res variable:
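As a rough illustration of the URL-based sources listed above, the sketch below pulls out the parts of a link that page script can read and that an attacker chooses when crafting the link. The helper name urlSources and the example link are made up for this post; in a browser the same values would come from location.href, location.pathname, location.search and location.hash.

```javascript
// Hypothetical helper: list the URL-derived values that page script can read
// and that an attacker controls by crafting the link a victim clicks.
function urlSources(href) {
  const url = new URL(href);
  return {
    href: url.href,
    pathname: url.pathname,
    search: url.search, // includes the leading "?"
    hash: url.hash,     // includes the leading "#"
  };
}

// Made-up link carrying attacker-chosen data in the query and the fragment.
const sources = urlSources(
  "https://mail.example/inbox?next=javascript:alert(1)#view=compose"
);
console.log(sources);
```

Every field of the returned object is attacker-influenced, which is why the URL is usually the first source worth checking.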

var res = "Hello world!".substr(14);
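To make the idea of tainting concrete, here is a toy runtime model. It is only an analogy and not how QL works internally: QL reasons about the code statically, while this sketch attaches a flag to values at run time. The class TaintedString and its methods are invented for the illustration.

```javascript
// Toy model of taint propagation (an analogy, not QL's implementation):
// a value carries a "tainted" flag, and string operations copy that flag
// to their result.
class TaintedString {
  constructor(value, tainted) {
    this.value = value;
    this.tainted = tainted;
  }

  // The result is derived from this string, so it inherits the flag,
  // even though the resulting text is different.
  substring(start) {
    return new TaintedString(this.value.substring(start), this.tainted);
  }

  // A concatenation is tainted if either operand is tainted.
  concat(other) {
    return new TaintedString(this.value + other.value, this.tainted || other.tainted);
  }
}

const input = new TaintedString("Hello world!", true); // attacker-controlled
const res = input.substring(6);
const banner = new TaintedString("Message: ", false).concat(res);
console.log(res.value, res.tainted, banner.tainted);
```

Plain data flow would only follow the exact value from one place to another; taint tracking also follows results that are merely derived from it, such as res and banner here.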

Let’s keep it simple and wide open by treating every node as a source in the TaintTracking configuration, using the following model:

class Cfg extends TaintTracking::Configuration {

  Cfg() { this = "Track data flows to XSS sink" }

  override predicate isSource(DataFlow::Node source) { any() }

...

}

Defining the sink

XSS sinks are extremely abundant because of the complexity of the Web. As usual, we need to manually review the target and get to know it before we start writing a query. After a while, we should be ready to list some promising sinks for our analysis. In this post, we focus on three of the biggest and best-known sinks: Location, Document, and ReactJS.

Location Sink

A Location sink is a place where the user’s browser is made to navigate somewhere else, which can happen in various ways (see the forms below). These sinks can be vulnerable to XSS through a common vector: injecting a URL with the javascript: scheme, which makes the browser execute JavaScript code.

Assignment:
  location = "javascript:alert(document.domain)"
  window.location = "javascript:alert(document.domain)"
  document.location = "javascript:alert(document.domain)"
  location.href = "javascript:alert(document.domain)"

Method Call:
  open("javascript:alert(document.domain)")
  window.open("javascript:alert(document.domain)")
  location.assign("javascript:alert(document.domain)")
  location.replace("javascript:alert(document.domain)")

Note: Sinks related to HTML elements and others are beyond the scope of this post.
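To see why a value reaching one of these sinks matters, the following sketch checks whether a string would be interpreted as a javascript: URL when used as a navigation target. It relies on the standard URL class; the function name isJavascriptUrl and the base URL are assumptions chosen for the illustration, and this is not a complete defense against malicious URLs.

```javascript
// Sketch: decide whether a string would be treated as a javascript: URL when
// used as a navigation target. The base URL is only a placeholder origin.
function isJavascriptUrl(target, base = "https://outlook.office.com/") {
  try {
    // URL parsing lowercases the scheme and ignores leading whitespace,
    // so simple string comparisons on the raw value are not reliable.
    return new URL(target, base).protocol === "javascript:";
  } catch {
    return false; // input that cannot be parsed as a URL
  }
}

console.log(isJavascriptUrl("javascript:alert(document.domain)"));
console.log(isJavascriptUrl("  JavaScript:alert(document.domain)"));
console.log(isJavascriptUrl("/mail/inbox"));
console.log(isJavascriptUrl("https://example.com/?q=javascript:alert(1)"));
```

Only the first two inputs put the javascript: scheme at the head of the URL. In the last two it is absent or sits inside the path or query, where it is harmless for navigation.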

Let’s get started.

First, we look for an assignment whose left side is the global object location and whose right side is the sink node we are interested in. References to the location object can be obtained with the predicate DataFlow::globalVarRef(string name), which gets an access to the global variable called name.

The following query uncovers both window.location=... and location=..., since both refer to the same global object.

class LocationXSS_Sink extends DataFlow::Node {

  LocationXSS_Sink() {

    exists(Assignment m | m.getLhs() = DataFlow::globalVarRef("location").asExpr() |

      this.asExpr() = m.getRhs()

    )

     ...

  }

}

The next step is to find the two remaining assignments: document.location=... and location.href=...

As is evident, both of these are expressions that write a value to an object property. DataFlow::SourceNode offers a predicate to help us with this, getAPropertyWrite(string prop_name), which we can use to track down all the nodes writing to the property prop_name. The predicate getAPropertyWrite returns a DataFlow::PropWrite, so we declare a variable of that type to capture it and then let the sink node be the right side of the assignment.

By adding one more set of sink nodes, as in the following QL, we’re able to identify all location sinks that take the form of an assignment statement.

exists(DataFlow::PropWrite pw |

      DataFlow::globalVarRef("document").getAPropertyWrite("location") = pw //document.location = ...

      or

      DataFlow::globalVarRef("location").getAPropertyWrite("href") = pw //location.href = ...

    |

      this = pw.getRhs()

)

The next step is to identify the location sinks that take the form of a function call. We start by getting the data flow nodes that refer to the global variable named open. In particular, the following QL query lists the calls whose target is either open(...) or window.open(...):

import javascript

select DataFlow::globalVarRef("open").getACall()

Besides getAPropertyWrite, the DataFlow library also provides a predicate named getAMethodCall that finds all method calls on a SourceNode (which is a supertype of GlobalVarRefNode). Here is a query that locates any call to either location.assign or location.replace.

import javascript

 

from DataFlow::MethodCallNode call

where

  call = DataFlow::globalVarRef("location").getAMethodCall("assign") or

  call = DataFlow::globalVarRef("location").getAMethodCall("replace")

select call

Putting all of these together, the final query to identify this type of sink looks like this:

class LocationXSS_Sink extends DataFlow::Node {

  LocationXSS_Sink() {

    exists(DataFlow::CallNode call |

      call = DataFlow::globalVarRef("open").getACall() // window.open(...) and open(...)

      or

      call = DataFlow::globalVarRef("location").getAMethodCall("assign")

      or

      call = DataFlow::globalVarRef("location").getAMethodCall("replace")

    |

      this = call.getArgument(0)

    )

    or

    exists(Assignment m | m.getLhs() = DataFlow::globalVarRef("location").asExpr() |

      this.asExpr() = m.getRhs() // this uncovers `location=...` and `window.location=...`

    )

    or

    exists(DataFlow::PropWrite pw |

      DataFlow::globalVarRef("document").getAPropertyWrite("location") = pw //document.location = ...

      or

      DataFlow::globalVarRef("location").getAPropertyWrite("href") = pw //location.href = ...

    |

      this = pw.getRhs()

    )

  }

}
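The query above finds these sinks statically. Purely as a runtime illustration of the same sink forms, the sketch below uses a stand-in for the browser’s location object that records every value reaching location.href, location.assign or location.replace. The names makeFakeLocation and hits are invented for this example, and the stand-in does not navigate anywhere.

```javascript
// Stand-in for the browser's location object (an assumption so the example
// runs outside a browser). It records each value that reaches a sink.
function makeFakeLocation(log) {
  const target = {
    assign(url) { log.push(["location.assign", url]); },
    replace(url) { log.push(["location.replace", url]); },
  };
  return new Proxy(target, {
    // Intercept property writes such as `location.href = ...`.
    set(obj, prop, value) {
      if (prop === "href") log.push(["location.href", value]);
      obj[prop] = value;
      return true;
    },
  });
}

const hits = [];
const fakeLocation = makeFakeLocation(hits);
const untrusted = "javascript:alert(document.domain)";

fakeLocation.href = untrusted;   // assignment form
fakeLocation.assign(untrusted);  // method call form
fakeLocation.replace(untrusted); // method call form

console.log(hits.map(([sink]) => sink));
```

Each recorded entry corresponds to one disjunct of the QL class: a property write for href, and the first argument of a call for assign and replace.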

Document Sinks

As with the previous one, let’s divide this type of sink into two separate forms, as below:

Assignment:
  element.innerHTML = ""
  element.outerHTML = ""

Method Call:
  document.write("")
  document.writeln("")
  node.insertAdjacentHTML("afterend", "")
  jquery_method.html("")

Note: there are many jQuery methods other than .html() that accept an HTML string. However, in this codebase, I didn’t observe much usage of those methods, and they don’t look vulnerable.

First, we’re looking for any data flow node that writes a value to an object’s innerHTML or outerHTML property. This can be done with a simple query:

import javascript

from DataFlow::PropWrite pw

where pw.getPropertyName().regexpMatch("(innerHTML|outerHTML)")

select pw

For the “method call” part, with a similar approach to the previous sink, we’re interested in the arguments passed to calls to the write and writeln methods of the global DOM object document. For insertAdjacentHTML and html calls, on the other hand, the sink nodes we’re looking for are the second and first arguments, respectively.

Finally, we define a sink following the above conditions with a query:

class Document_Sinks extends DataFlow::Node {

  Document_Sinks() {

    exists(DataFlow::MethodCallNode call, int argPos |

      call = DataFlow::globalVarRef("document").getAMethodCall("write")

      or

      call = DataFlow::globalVarRef("document").getAMethodCall("writeln")

      or

      call.getCalleeName() = "insertAdjacentHTML" and argPos = 1

      or

      call.asExpr().(JQueryMethodCall).getCalleeName() = "html" and argPos = 0

    |

      this = call.getArgument(argPos)

    )

  }

}

Note: we are using the inline cast .(JQueryMethodCall), which is built into the javascript library, to neatly filter out any call node that does not correspond to a jQuery method. Because JQueryMethodCall is an expression class, we first need to get the expression of the data flow node call before doing the cast. Also, because write and writeln can accept multiple arguments that together form the HTML string, we leave argPos unconstrained for the document methods so that getArgument catches all of them.
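The reason for leaving argPos unconstrained can be shown with a small stand-in for the document object. The object fakeDocument below is an assumption made so the example runs outside a browser; like the real document.write, it appends all of its arguments to the output in order.

```javascript
// Minimal stand-in for the browser's document object. Like the real
// document.write, it appends every argument to the output, in order.
const fakeDocument = {
  output: "",
  write(...parts) {
    this.output += parts.join("");
  },
};

const untrusted = "<img src=x onerror=alert(1)>"; // attacker-supplied value
fakeDocument.write("<p>", untrusted, "</p>");

// The payload was passed as the second argument, yet it still ends up
// in the generated markup.
console.log(fakeDocument.output);
```

A sink definition that only looked at the first argument would miss this call, which is why every argument position is treated as a sink for write and writeln.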

ReactJS XSS sinks

In the Outlook Web App (OWA) codebase, the developers have also adopted ReactJS, which is a fast and convenient way to build user interfaces. So there is another sink to take care of in such cases.

dangerouslySetInnerHTML is a prop that lets a developer push an HTML string directly into a React element when it is rendered (the markup syntax is called JSX). It looks something like this in the codebase:

export default class HtmlContent extends React.Component<HtmlContentProps, {}> {

...

    render() {

        /* tslint:disable:react-no-dangerous-html */

        return (

            <div

                ref={ref => (this.htmlContentRef = ref)}

                dangerouslySetInnerHTML={{ __html: this.props.html }}

            />

        );

        /* tslint:enable:react-no-dangerous-html */

    }

As noted in the code, the developers use tslint to maintain code quality and to help avoid some well-known issues. This React component is responsible for writing HTML out to the document after the content has been carefully sanitized, since it may contain untrusted data.

Now, let’s get back to QL. Fortunately, the built-in library has a module, semmle.javascript.JSX, that provides classes and predicates for working with JSX code. For example, the following query finds the places where HtmlContent is used:

import javascript

 

from JSXElement jsx

where jsx.getName() = "HtmlContent"

select jsx 

Moreover, the JSXAttribute class helps us identify properties/attributes in JSX code. The target sink here is the value of the property __html inside an object that is passed as the value of the JSX attribute named dangerouslySetInnerHTML. We model it as a sink as follows:

class ReactDangerousSetInnerHTMLSinks extends DataFlow::Node {

  ReactDangerousSetInnerHTMLSinks() {

    exists(JSXAttribute attr, DataFlow::ObjectLiteralNode obj |

      attr.getName() = "dangerouslySetInnerHTML" and

      attr.getValue() = obj.asExpr() and

      obj.hasPropertyWrite("__html", this)

    )

  }

}

While checking that everything works with this approach, I found a missing case that has a different code pattern (see the example below). Here the attribute value is the return value of a function call, so the above query cannot handle it. The flow analysis fails where the function prepareForInnerHTML is called:

<div

className={this.props.className}

id={'Example' + this.props.id}

dangerouslySetInnerHTML={this.prepareForInnerHTML()}

 />

......

    private prepareForInnerHTML = () => {

        let text = this.processData();

        return { __html: text };

    };

To solve this, we can take a slightly different approach by modifying the sink to make it more general:

class ReactDangerousSetInnerHTMLSinks extends DataFlow::Node {

  ReactDangerousSetInnerHTMLSinks() {

    exists(JSXAttribute attr |

      attr.getName() = "dangerouslySetInnerHTML" and attr.getValue() = this.asExpr()

    )

  }

}

Next, we add an additional taint step so that taint flows from a value written to the __html property of an object literal to the data flow node of that object itself. In the predicate below, html_value is the data flow node written to the property __html and obj is the enclosing object.

override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {

    exists(DataFlow::ObjectLiteralNode obj, DataFlow::Node html_value |

      obj.hasPropertyWrite("__html", html_value) and

      succ = obj and

      pred = html_value

    )

 }

With this QL, we should be able to cover all cases.
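The two code patterns can be compared outside React with a simplified sketch. The function renderDiv is a made-up stand-in for a renderer and is not React’s actual implementation; prepareForInnerHTML mirrors the helper from the example above but takes the text as a parameter so that the snippet is self-contained.

```javascript
// Simplified stand-in for how a renderer consumes dangerouslySetInnerHTML
// (not React's actual implementation): it uses the __html property as-is.
function renderDiv(props) {
  return "<div>" + props.dangerouslySetInnerHTML.__html + "</div>";
}

// Second pattern from the text: the object literal is built inside a helper.
function prepareForInnerHTML(text) {
  return { __html: text };
}

const untrustedHtml = "<img src=x onerror=alert(1)>";

// First pattern: inline object literal at the attribute.
const inline = renderDiv({ dangerouslySetInnerHTML: { __html: untrustedHtml } });
// Second pattern: object returned from a function call.
const viaHelper = renderDiv({ dangerouslySetInnerHTML: prepareForInnerHTML(untrustedHtml) });

console.log(inline === viaHelper);
```

Both calls produce the same markup, so the sink has to be the attribute value itself, with the step from the __html value to its enclosing object supplied separately.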

Sorting it out

So far so good. However, the queries return too many results, including risk-free or unexploitable places where we don’t want to spend much time. This leads us to the next problem: how can we avoid wasting time on safe results? Consider location sinks as an example. On any ordinary site, if we walk through the results of our query, we observe that a large number of the source nodes are constant strings, which an attacker cannot control at all. There is no reason to show them in the results. As the previous part of this blog post series mentioned, the predicate isSanitizer can be used to eliminate paths that we’re not interested in, by placing a sanitizer on any node that meets our conditions.

override predicate isSanitizer(DataFlow::Node node) {

    node.asExpr() instanceof ConstantString

}

A bunch of nodes are excluded when we rerun the analysis with the above definition.

As we keep manually reviewing the remaining results, we notice many nodes calling functions such as getAttachmentUrl, getModuleUrl, and getHelpUrl. They are unlikely to be exploitable because their return values always have a prefix that looks something like /some_path/...{controlled value}... This prefix means that it’s impossible to inject the javascript: URI scheme at the head of the URL. Moreover, all of them follow the same naming pattern: get...Url. Here is the final model for these characteristics:

override predicate isSanitizer(DataFlow::Node node) {

    node.asExpr() instanceof ConstantString

    or

    node.(DataFlow::CallNode).getCalleeName().regexpMatch("get.*Url")

}
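The reasoning behind this sanitizer can be checked with a short sketch. The helper getHelpUrl below is a hypothetical reimplementation in the style described above (a fixed path prefix followed by the controlled value), not the real function from the codebase, and the base origin is a placeholder. The sketch also shows the naming filter: QL’s regexpMatch has to match the whole string, so the JavaScript equivalent needs explicit anchors.

```javascript
// Hypothetical helper in the style of getAttachmentUrl/getHelpUrl: the
// controlled value is appended after a fixed path prefix.
function getHelpUrl(topic) {
  return "/some_path/" + topic;
}

// Resolve a navigation target the way a page on this placeholder origin would.
function schemeOf(target) {
  return new URL(target, "https://outlook.office.com/").protocol;
}

// Whole-string match, mirroring regexpMatch("get.*Url") in the QL above.
const calleePattern = /^get.*Url$/;

const direct = schemeOf("javascript:alert(1)");               // value used as-is
const prefixed = schemeOf(getHelpUrl("javascript:alert(1)")); // value behind the prefix
const matches = ["getAttachmentUrl", "getModuleUrl", "getUrlParams"]
  .map((name) => calleePattern.test(name));

console.log(direct, prefixed, matches);
```

With the prefix in place the scheme stays that of the page, whatever the controlled value is. The name-based filter is still a heuristic: it trusts that every function matching get...Url really adds such a prefix, so new matches are worth a quick manual check.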


Figure: window.open sink data flow

We repeat the same strategy with the other sinks: figure out the patterns of non-vulnerable code, then remove uninteresting nodes by defining a sanitizer. This makes a security engineer’s life easier, scaling up code review capabilities and reducing the effort spent on targets with huge codebases.

Conclusions

One great advantage of this methodology is that we can apply most of the work to other targets. There are plenty of classes and predicates in QL that we can also use to investigate other components such as HTML elements, AngularJS, Electron, and others. There are even some existing configurations and queries provided by Semmle for identifying DOM-based XSS issues. However, in this post, I wanted to show you how we could build one from scratch so you can make your own.

As security engineers adopting Semmle QL, we should keep improving the quality of our queries to make them more accurate and smarter. Also, do not forget to keep learning from clever vulnerabilities identified by white-hat security researchers. These can help us discover new potential threats and sinks.

Finally, Semmle QL is a very promising tool. Using Semmle QL to analyze the Outlook Web App codebase and then manually tracing the data flows led me to two Important-severity cross-site scripting vulnerabilities among 88 Location sinks, 50 Document sinks and 11 ReactJS sinks. Isn’t it cool when you can find security vulnerabilities with a query? There is room to develop creative solutions to a problem. The only limit here may be our imagination.

Luật Nguyễn, MSRC Vulnerabilities & Mitigations team.
