Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update documentation on data flow in Go (and some small fixes for java) #18511

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

owen-mc
Copy link
Contributor

@owen-mc owen-mc commented Jan 16, 2025

The documentation for data flow in Go was not up-to-date.

Note to reviewers: the vast majority of this is copied from the equivalent docs for java and C#, with a few changes made to make it accurate for Go. So I recommend either reviewing the diff between this new file and the java and C# versions or just focussing on the specific examples that are given.

@Copilot Copilot bot review requested due to automatic review settings January 16, 2025 12:07
@owen-mc owen-mc requested a review from a team as a code owner January 16, 2025 12:07

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.

Files not reviewed (7)
  • docs/codeql/codeql-language-guides/analyzing-data-flow-in-go.rst: Language not supported
  • docs/codeql/codeql-language-guides/analyzing-data-flow-in-java.rst: Language not supported
  • docs/codeql/codeql-language-guides/codeql-for-go.rst: Language not supported
  • docs/codeql/codeql-language-guides/modeling-data-flow-in-go-libraries.rst: Language not supported
  • docs/codeql/writing-codeql-queries/about-data-flow-analysis.rst: Language not supported
  • docs/codeql/writing-codeql-queries/creating-path-queries.rst: Language not supported
  • go/docs/language/learn-ql/go/library-modeling-go.rst: Language not supported

Tip: Copilot code review supports C#, Go, Java, JavaScript, Markdown, Python, Ruby and TypeScript, with more languages coming soon. Learn more

@github-actions github-actions bot added the Go label Jan 16, 2025
Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 3-5 LGTM

@owen-mc
Copy link
Contributor Author

owen-mc commented Jan 16, 2025

For the record, I was able to build the docs on my mac by running brew install sphinx-doc and then sphinx-build -b html . ~/sphinx-output in the folder docs/codeql/. The readme for our docs is here.

Copy link
Contributor

@smowton smowton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks plausible; slight tweaks

Flow sources
~~~~~~~~~~~~

The data flow library contains some predefined flow sources. The class ``RemoteFlowSource`` (defined in ``semmle.code.java.dataflow.FlowSources``) represents data flow sources that may be controlled by a remote user, which is useful for finding security problems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we note / recommend ActiveThreatModelSource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's more advanced. This guide gives a simple approach that works. (Also, if we want to do that, we should do it separately for all the languages.)

Comment on lines +345 to +348
(
urlParse.hasQualifiedName("url", "Parse") or
urlParse.hasQualifiedName("url", "ParseRequestURI")
) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(
urlParse.hasQualifiedName("url", "Parse") or
urlParse.hasQualifiedName("url", "ParseRequestURI")
) and
urlParse.hasQualifiedName("url", ["Parse", "ParseRequestURI"]) and

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deliberately didn't do this because I think it's a bit harder to understand. I think this guide should just be a simple approach that is easy to understand and which works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could mention it as a note underneath the example. "For brevity, we could also shorten ... to ...".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my mind, the aim of this guide is not to give the best way to write things, but to help the reader use data flow and to be as clear as possible. I think that introducing a new notation to do the same thing does not help with that, especially when it isn't very easy to see at a glance what it is doing.

Co-authored-by: Chris Smowton <smowton@github.com>
smowton
smowton previously approved these changes Jan 16, 2025
Copy link
Contributor

@smowton smowton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, mild preference for using set notation since it shows an example of the best way to write this, but this is fine

Copy link
Member

@mbg mbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for tackling this and putting this together ✨

I have a few suggestions for improving the clarity of the guide, even though I appreciate some of the bits I commented on are copied from the other languages' guides.

Comment on lines 77 to 82
.. code-block:: go

temp := x;
y := temp + ", " + temp;

If ``x`` is a tainted string then ``y`` is also tainted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential point for confusion with this example is that it may not be clear to readers whether temp := x requires taint flow analysis or just y := ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's an odd example, with the capacity for confusion. I've changed it to just (each language's version of) y := "Hello " + x in all the language guides where it appears.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I'd be surprised if we have a language where (the equivalent of) temp := x requires taint-flow analysis, I'd be careful with just changing examples for other languages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've advertised the change more widely among the language teams.

DataFlow::localFlow(DataFlow::exprNode(src), DataFlow::exprNode(call.getArgument(0)))
select src

Then we can make the source more specific, for example an access to a parameter. This query finds where a public parameter is passed to ``os.Open(..)``:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording of this could be improved with something like this:

Suggested change
Then we can make the source more specific, for example an access to a parameter. This query finds where a public parameter is passed to ``os.Open(..)``:
To restrict sources to only parameters, rather than arbitrary expressions, we can modify this query as follows:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly prefer the original. It has more narrative flow ("start with a simple query, then slowly add restrictions to get something more interesting"). Can you explain what you don't like about it, and how your suggestion is an improvement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The narrative is good, and I wouldn't change that. The intention with my suggestion was to keep it as well. However, the flow/grammar/style of the first sentence that's currently here isn't great:

  • Since the reader will have looked over / thought about / possibly played with the code before this, the Then ... feels disconnected here as an opening to the paragraph. It implies that we are continuing some line of work, even though the code example accomplished everything the previous paragraph set out to do. This is why, in my suggestion, the sentence begins by setting out what the next goal is.
  • "the source" is ambiguous and possibly misleading. Does it refer to the input to localFlow or (the) source discovered by evaluating the query? If a reader assumes the latter case, "the" implies that this query only ever finds one source.
  • Similarly with "access to a parameter".
  • The "[..], for example [..]" fragment doesn't read right. It might be better if there was a verb in it (e.g. ", for example by constraining [the source(s)] to function parameters")

Also, while going through this again, does Parameter give us parameters or variable accesses to parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter extends DeclaredVariable, so I'm pretty sure it's parameters rather than variable accesses to parameters.

DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(call.getArgument(0)))
select p

This query finds calls to formatting functions where the format string is not hard-coded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This" is potentially ambiguous as it could refer to the query above the text:

Suggested change
This query finds calls to formatting functions where the format string is not hard-coded.
The following query finds calls to formatting functions where the format string is not hard-coded.

I also think that "hard-coded" is a bit ambiguous. One person might understand this to mean "a constant that's provided directly as argument". Your interpretation here is: "a constant that is defined locally in the scope of the function somewhere". Another person might say "a constant that's defined anywhere". We should clarify what the query is intended to find.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Lots of the language guides says "This query ...:". Would using a colon at the end of this sentence (and other equivalent ones) solve the problem? Or would you still like "The following"?
  • I agree it could be clearer, but that is a bigger change than I want to do in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using a colon at the end of this sentence (and other equivalent ones) solve the problem? Or would you still like "The following"?

Yes, a colon would help, but resolving the ambiguity at the start of the sentence would still be good as well since it then doesn't require the reader to read/scan the entire sentence to know what it's about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to do this for multiple languages.


import go

from StringOps::Formatting::Range format, CallExpr call, Expr formatString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps there could be a short sentence explaining what StringOps::Formatting::Range is or a link to other documentation?

Exercises
~~~~~~~~~

Exercise 1: Write a query that finds all hard-coded strings used to create a ``url.URL``, using local data flow. (`Answer <#exercise-1>`__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment about "hard-coded" as above.

Using global data flow
~~~~~~~~~~~~~~~~~~~~~~

The global data flow library is used by implementing the signature ``DataFlow::ConfigSig`` and applying the module ``DataFlow::Global<ConfigSig>``:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passive voice makes this ambiguous. I would write "We can use global data flow by [..]", but if the agreed on style dictates that the passive must be used, then something like "A query can use the global data flow library by [..]"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this:

Suggested change
The global data flow library is used by implementing the signature ``DataFlow::ConfigSig`` and applying the module ``DataFlow::Global<ConfigSig>``:
To use the global data flow library, implement the signature ``DataFlow::ConfigSig`` and apply the module ``DataFlow::Global<ConfigSig>``:

I note that java has this: You use the global data flow library by implementing the signature ``DataFlow::ConfigSig`` and applying the module ``DataFlow::Global<ConfigSig>``: . I don't like it, though it is in the active voice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't like using the second person ("You") for this. Third-person, active voice is best in my opinion. Your suggestion with the imperative voice is OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, didn't your suggestion use "we", which is first person, rather than third person?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant to write "first person plural".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adopted your suggestion for all language guides where this sentence appears.


- ``isSource`` - defines where data may flow from.
- ``isSink`` - defines where data may flow to.
- ``isBarrier`` - optional, restricts the data flow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better description than "restricts"? All the predicates "restrict" data flow in some way.

Suggested change
- ``isBarrier`` - optional, restricts the data flow.
- ``isBarrier`` - optional, breaks the data flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to optional, defines where data flow is blocked in all the language guides where it appears.

Exercises
~~~~~~~~~

Exercise 2: Write a query that finds all hard-coded strings used to create a ``url.URL``, using global data flow. (`Answer <#exercise-2>`__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern about "hard-coded".


Exercise 3: Write a class that represents flow sources from ``os.Getenv(..)``. (`Answer <#exercise-3>`__)

Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flows from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Data flows" sounds odd to me. How about "data flow paths" instead?

Suggested change
Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flows from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)
Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flow paths from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done this for all the language guides where it appears.

Comment on lines +345 to +348
(
urlParse.hasQualifiedName("url", "Parse") or
urlParse.hasQualifiedName("url", "ParseRequestURI")
) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could mention it as a note underneath the example. "For brevity, we could also shorten ... to ...".

@owen-mc
Copy link
Contributor Author

owen-mc commented Jan 20, 2025

@mbg I think I've addressed all your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants