DWARF debug support #6

Open
scpmw wants to merge 5086 commits into master from profiling-import

Conversation


@scpmw scpmw commented Mar 13, 2014

Currently under consideration as the basis for implementing stack traces. See ticket for further discussion:

http://hackage.haskell.org/trac/ghc/ticket/3693

tibbe and others added 30 commits July 23, 2014 21:03
Duplicate record fields would not be detected when given a type
with multiple data constructors, and the first data constructor
had a record field r1 and any consecutive data constructors
had multiple fields named r1.

This fixes #9156 and was reviewed in https://phabricator.haskell.org/D87
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
This patch was provoked by Trac #5610, which I finally got a moment to look at.

In the end I added a new data type ErrUtils.Validity,

  data Validity
    = IsValid            -- Everything is fine
    | NotValid MsgDoc    -- A problem, and some indication of why

with some suitable combinators, and used it where appropriate (which touches
quite a few modules).  The main payoff is that error messages improve for
FFI type validation.
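A minimal, self-contained sketch of the Validity type and the kind of combinator it enables. The combinator names `andValid` and `allValid` are illustrative stand-ins, not necessarily the exact set added to ErrUtils, and `MsgDoc` is simplified to a plain string here.

```haskell
type MsgDoc = String   -- stand-in for GHC's pretty-printed documents

data Validity
  = IsValid           -- everything is fine
  | NotValid MsgDoc   -- a problem, and some indication of why

-- Succeed only if both checks succeed, reporting the first failure.
andValid :: Validity -> Validity -> Validity
andValid IsValid v   = v
andValid notValid _  = notValid

-- Collapse a list of checks into a single result.
allValid :: [Validity] -> Validity
allValid = foldr andValid IsValid
```

With combinators like these, a validity check for (say) an FFI type can be built from smaller checks and still report the first specific reason for failure.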
after changes in 92587bf.

This problem was noticed on ghcspeed (although only by accident,
unfortunately, as a change from 0 to 1 is not reported in the summary).
The general approach is to add a new field to the package database,
reexported-modules, which is considered by the module finder as a source
of possible module declarations.  Unlike declaring stub module files,
multiple reexports of the same physical package under the same name do
not result in an ambiguous import.

Has submodule updates for Cabal and haddock.

NB: When a reexport renames a module, that renaming is *not* accessible
from inside the package.  This is not so much a deliberate design choice
as for implementation expediency (reexport resolution happens only when
a package is in the package database.)

TODO: Error handling when there are duplicate reexports/etc is not very
well tested.

Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>

Conflicts:
	compiler/main/HscTypes.lhs
	testsuite/.gitignore
	utils/haddock
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
This also removes the short-lived NO_OVERLAP pragma, and renames
OVERLAP to OVERLAPS.

An instance may be annotated with one of four pragmas, to control its
interaction with other overlapping instances:

  * OVERLAPPABLE:
    this instance is ignored if a more specific candidate exists

  * OVERLAPPING:
    this instance is preferred over more general candidates

  * OVERLAPS:
    both OVERLAPPING and OVERLAPPABLE (i.e., the previous GHC behavior).
    When compiling with -XOverlappingInstances, all instances are OVERLAPS.

  * INCOHERENT:
    same as before (see manual for details).
    When compiling with -XIncoherentInstances, all instances are INCOHERENT.
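The per-instance pragmas above can be used like this. This is a hedged, minimal sketch: the `Describe` class and instances are invented for illustration, but the pragma placement matches the syntax this patch introduces (GHC 7.10 and later).

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- A hypothetical class with one general and one specific instance.
class Describe a where
  describe :: a -> String

-- General candidate: ignored whenever a more specific instance exists.
instance {-# OVERLAPPABLE #-} Describe a where
  describe _ = "something"

-- More specific candidate: preferred over the general one.
instance {-# OVERLAPPING #-} Describe Int where
  describe n = "an Int: " ++ show n

main :: IO ()
main = do
  putStrLn (describe (3 :: Int))  -- resolves to the OVERLAPPING instance
  putStrLn (describe True)        -- falls back to the OVERLAPPABLE instance
```

Annotating individual instances this way replaces the old module-wide -XOverlappingInstances switch, which turned the behaviour on for every instance at once.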
Summary:
Today's hardware is much faster, so it makes sense to report timings
with more precision, and possibly help reduce rounding-induced
fluctuations in the nofib statistics.

This commit increases the precision of all timings previously reported
with a granularity of 10ms to 1ms. For instance, the `+RTS -S` output is
now rendered as:

    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
    bytes     bytes     bytes   user   elap     user     elap
   641936     59944    158120  0.000  0.000    0.013    0.001    0    0  (Gen:  0)
   517672     60840    158464  0.000  0.000    0.013    0.002    0    0  (Gen:  0)
   517256     58800    156424  0.005  0.005    0.019    0.007    0    0  (Gen:  1)
   670208      9520    158728  0.000  0.000    0.019    0.008    0    0  (Gen:  0)

  ...

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        24 colls,     0 par    0.002s   0.002s     0.0001s    0.0002s
  Gen  1         3 colls,     0 par    0.011s   0.011s     0.0038s    0.0055s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.005s  (  0.006s elapsed)
  GC      time    0.014s  (  0.014s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    0.032s  (  0.020s elapsed)

Note that this change also requires associated changes in the nofib
submodule.

Test Plan: tested with modified nofib

Reviewers: simonmar, nomeata, austin

Subscribers: simonmar, relrod, carter

Differential Revision: https://phabricator.haskell.org/D97
Summary: Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>

Test Plan: validate

Reviewers: hvr, simonmar, austin

Subscribers: simonmar, relrod, carter

Differential Revision: https://phabricator.haskell.org/D98
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
Signed-off-by: Austin Seipp <austin@well-typed.com>
hvr and others added 28 commits August 17, 2014 13:09
On Linux/i386 the 64-bit `__builtin_ctzll()` intrinsic doesn't get
inlined by GCC; rather, a call to the short `__ctzdi2` runtime function
is inserted when needed into compiled object files.

This causes failures for the four test-cases

  TEST="T8639_api T8628 dynCompileExpr T5313"

with error messages of the kind

  dynCompileExpr: .../libraries/ghc-prim/dist-install/build/libHSghcpr_BE58KUgBe9ELCsPXiJ1Q2r.a: unknown symbol `__ctzdi2'
  dynCompileExpr: dynCompileExpr: unable to load package `ghc-prim'

This workaround forces GCC on 32-bit x86 to express `hs_ctz64` in
terms of the 32-bit `__builtin_ctz()` (this is no loss, as there's no
64-bit BSF instruction on i686 anyway) and thus avoids the problematic
out-of-line runtime function.

Note: `__builtin_ctzll()` is used since
      e0c1767 (re #9340)
This became dead with 1e87c0a
and was probably just missed.

I plan to re-use the freed up `mkPreludeTyConUnique 23` slot soon
for a new `bigNatTyConKey` (as part of the #9281 effort)
Dead code
Not too interesting, just trying to get it out of the diff.
Doesn't make much of a difference, but keeping unused variables around
seems like a bit of a waste?
This patch introduces "SourceNote" tickishs that carry a reference to the
original source code. They are meant to be passed along the compilation
pipeline with as little disturbance to optimization processes as possible.

Generation is triggered by command line parameter -g. It's free and
fits with the intended end result (generation of DWARF). Internally we
say that we compile with "debugging", which is probably at least
slightly confusing given the plethora of other debugging options we have.

Note that this pass creates *lots* of tick nodes. We take care to
remove duplicated and overlapping source ticks, which gets rid of most
of them. Possible optimization could be to make Tick carry a list of
Tickishs instead of one at a time.

Keeping ticks from getting into the way of Core transformations is
tricky, but doable. The changes in this patch produce identical Core
in all cases I tested (nofib). We should probably look for a way to
make a test-case out of this.

Fix CoreLint problem

Caused by yet another instance of failing to look through ticks
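The tick-deduplication step mentioned above (removing duplicated and overlapping source ticks) can be modelled roughly as follows. `Span`, `covers` and `dedupTicks` are invented names for this sketch; the real pass operates on GHC's RealSrcSpan values attached to Core.

```haskell
-- A source span, simplified to a pair of character offsets.
data Span = Span { spanStart :: Int, spanEnd :: Int }
  deriving (Eq, Show)

-- One span covers another if it fully contains it.
covers :: Span -> Span -> Bool
covers (Span a b) (Span c d) = a <= c && d <= b

-- Keep a tick only if no already-kept tick covers its span.
dedupTicks :: [Span] -> [Span]
dedupTicks = foldl keep []
  where
    keep acc s
      | any (`covers` s) acc = acc         -- duplicated or overlapping: drop
      | otherwise            = acc ++ [s]  -- genuinely new: keep
```

Even a simple containment rule like this gets rid of most of the redundant ticks, since desugaring tends to produce many notes pointing at nested sub-spans of the same expression.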
This allows having, say, HPC ticks, automatic cost centres and source
notes active at the same time.
This is basically just about continuing to maintain source notes after
the Core stage. Unfortunately, this is more involved than it might seem,
as there are more restrictions on where ticks are allowed to show up.

Design decisions:

* We replace the StgTick / StgSCC constructors with a unified StgTick
  that can carry any tickish.

* For handling constructor or lambda applications, we generally float
  ticks out.

* Note that thanks to the NonLam placement, we know that source notes
  can never appear on lambdas. This means that as long as we are careful
  to always use mkTick, we will never violate CorePrep invariants.

* Where CorePrep floats out lets, we make sure to wrap them in the same
  spirit as FloatOut.

* Detecting selector thunks becomes a bit more involved, as we can run
  into ticks at multiple points.
This patch allows source notes to refer to --ddump-to-file Core dumps, so
we can have debugging data refer directly to places in the Core. The
implementation is slightly tricky, as we couldn't find a way to get Pretty
to generate line number information for us. Instead, we now generate
"annotations" into the dump that get stripped out later, yielding line
numbers as we go along.
These tickishs are meant to carry the (simplified and prepared) Core
through the later compilation stages.

Notes:

* Core notes are only useful in certain scenarios (mostly profiling),
  and will end up taking up significant space in object files. We therefore
  use another GHC flag (-fsave-core) to decide whether we annotate them or
  not.

* Annotations happen after CorePrep. This is slightly tricky, as CoreToStg
  moves ticks around even after this point. We have to be careful to ensure
  ticks end up where we intend them to be.

* We take the easy route and just "point" into the Core code directly.
  This is slightly awkward given that Core is normally a more
  straightforward data structure. We have to short-circuit Eq/Ord, for
  example.

* We only annotate the interesting control flow points, which are
  either top-level or let binding bodies as well as case branches.

* In order to establish an identity and later perform sub-expression
  checks, we save a binder (of the binding or case) and the case
  constructor (if applicable, otherwise __DEFAULT).
This patch adds CmmTick nodes to Cmm code. On their own these ticks are
not useful yet, as there will be many blocks that lack annotation - and
we have no way of deriving them.

Notes:

* We use this design over, say, putting ticks into the entry node of all
  blocks, as it seems to work better alongside existing optimisations.
  Now granted, the reason for this is that currently GHC's main Cmm
  optimisations seem to mainly reorganize and merge code, so this might
  change in the future.

* We have the Cmm parser generate a few source notes as well. This is
  relatively easy to do - worst thing is that it blows up the CmmParse
  implementation a bit.
This patch solves the scoping problem of CmmTick nodes: If we just put
CmmTicks into blocks we have no idea what exactly they are meant to cover.
Here we introduce nested scopes, represented as lists of uniques. The
"nesting" relation is given by the subset relation. For example a tick
declared in a block with, say, scope [b,a] now scopes over all blocks
that have at least a tick scope of [b,a], so for example also [c,b,a].

Notes:

* This makes it easy to express most optimisations: it is easy to
  generate new blocks that share all ticks with existing blocks, and it
  is even possible to merge blocks to have combined contexts, simply
  by merging the scope lists. If this happens, we actually end up with
  an (acyclic) scope graph instead.

* Given that the code often passes Cmm around "head-less", we have to
  make sure that its intended scope does not get lost. To keep the amount
  of passing-around to a minimum we define a CmmAGraphScoped type synonym
  here that just bundles the scope with a portion of Cmm to be assembled
  later.

* We introduce new scopes at somewhat random places, aligning with
  getCode calls. This works surprisingly well, but we might have to
  add new scopes into the mix later on if we find things to be too
  coarse-grained.
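The scoping scheme described above can be sketched like this. The type and function names are assumptions made for the example; the real definitions live in GHC's Cmm code, and uniques are of course not plain `Int`s there.

```haskell
import Data.List (union)

type Unique    = Int
type TickScope = [Unique]   -- e.g. [b, a] for a scope nested inside a

-- A tick declared in scope s covers every block whose scope contains at
-- least the uniques of s: a tick in [b,a] also covers blocks in [c,b,a].
scopesOver :: TickScope -> TickScope -> Bool
scopesOver tickScope blockScope = all (`elem` blockScope) tickScope

-- Merging two blocks combines their contexts by merging scope lists.
-- This is the step that can turn the scope "tree" into an acyclic graph.
mergeScopes :: TickScope -> TickScope -> TickScope
mergeScopes = union
```

Because "nesting" is just the subset relation, a freshly generated block can share all ticks of an existing block simply by reusing its scope list, and merged blocks need no bookkeeping beyond the list union.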
This is meant as a tool for the debugger to determine past values of
registers, most critically the stack pointer Sp.

* We declare yet another new constructor for CmmNode - and this time
  there's actually little choice, as unwind information can and will
  change mid-block. We don't actually make use of these capabilities,
  and back-end support would be tricky (generate new labels?), but it
  feels like the right way to do it.

* Even though we only use it for Sp so far, we allow CmmUnwind to specify
  unwind information for any register. This is pretty cheap and could
  come in useful in future.

* We allow full CmmExpr expressions for specifying unwind values. The
  advantage here is that we don't have to make up new syntax, and can e.g.
  use the WDS macro directly. On the other hand, the back-end will now
  have to simplify the expression until it can sensibly be converted
  into DWARF byte code - a process which might fail, yielding NCG panics.
  On the other hand, when you're writing Cmm by hand you really ought to
  know what you're doing.
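A hedged sketch of what such an unwind node carries: a register together with an expression for recovering its caller-side value. All names here are illustrative simplifications of GHC's Cmm types, not the actual definitions.

```haskell
-- A few of the STG global registers, heavily simplified.
data GlobalReg = Sp | SpLim | Hp
  deriving (Eq, Show)

-- An unwind expression: how a debugger can recompute a register's
-- previous value from the current machine state.
data UnwindExpr
  = UwConst Int            -- a known constant value
  | UwReg GlobalReg Int    -- a register plus a byte offset
  deriving (Eq, Show)

-- An unwind node pairs a register with its recovery expression.
data CmmUnwind = CmmUnwind GlobalReg UnwindExpr
  deriving (Eq, Show)

-- "The caller's Sp is the current Sp plus two words" (64-bit target):
spUnwind :: CmmUnwind
spUnwind = CmmUnwind Sp (UwReg Sp 16)
```

In the patch the right-hand side is a full CmmExpr rather than this two-constructor type, which is exactly why the back-end may have to simplify it before it can be turned into DWARF byte code.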
The purpose of the Debug module is to collect all required information
to generate debug information (DWARF etc.) in the back-ends. Our main
data structure is the "debug block", which carries all information we have
about a block of code that is going to get produced.

Notes:

* Debug blocks are arranged into a tree according to tick scopes. This
  makes it easier to reason about inheritance rules. Note however that
  tick scopes are not guaranteed to form a tree, in which case we end
  up discarding some information here. This is however not too relevant
  in realistic scenarios, I feel.

* This is also where we decide what source location we regard as
  representing a code block the "best". The heuristic is basically that
  we want the most specific source reference that comes from the same file
  we are currently compiling. This seems to be the most useful choice in
  my experience.

* We are careful to not be too lazy so we don't end up breaking streaming.
  Debug data will be kept alive until the end of codegen, after all.

* We change native assembler dumps to happen right away for every Cmm group.
  This simplifies the code somewhat and is consistent with how pretty much
  all of GHC handles dumps with respect to streamed code.
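The "best source location" heuristic described above, sketched as a standalone function. `SrcRef` and its fields are invented for the example; in GHC the candidates would be source notes carrying RealSrcSpans.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- A simplified source reference: originating file plus span size.
data SrcRef = SrcRef { refFile :: String, refWidth :: Int }
  deriving (Eq, Show)

-- Prefer references from the file currently being compiled, and among
-- those pick the most specific (smallest) span.
bestSrcRef :: String -> [SrcRef] -> Maybe SrcRef
bestSrcRef _       []   = Nothing
bestSrcRef curFile refs = Just (minimumBy (comparing refWidth) candidates)
  where
    sameFile   = filter ((== curFile) . refFile) refs
    candidates = if null sameFile then refs else sameFile
```

Falling back to foreign-file references only when nothing local is available matches the intuition in the commit message: a specific span in the module being compiled is almost always the most useful thing to show a user.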
This generates DWARF, albeit indirectly using the assembler. This is
the easiest (and, apparently, quite standard) method of generating the
.debug_line DWARF section.

Notes:

* Note we have to make sure that .file directives appear correctly
  before the respective .loc. Right now we ppr them manually, which makes
  them absent from dumps. Fixing this would require .file to become a
  native instruction.

* We have to pass a lot of things around the native code generator. I
  know Ian did quite a bit of refactoring already, but having one common
  monad could *really* simplify things here...

* To support SplitObjs, we need to emit/reset all DWARF data at every
  split. We use the occasion to move split marker generation to
  cmmNativeGenStream as well, so debug data extraction doesn't have to
  choke on it.
This is where we actually make GHC emit DWARF code. The info section
contains all the general meta information bits as well as an entry for
every block of native code.

Notes:

* We need quite a few new labels in order to properly address starts
  and ends of blocks.

* There is no DWARF language ID for Haskell, so we arbitrarily choose a
  number derived from dW_LANG_lo_user and 'hs'. This feels like the right
  thing to do, even though sometimes DWARF tools get confused by any
  unknown value in this field.

* Thanks to Nathan Howell for taking the initiative to get our own Haskell
  language ID for DWARF!

Mac OS port
This is telling debuggers such as GDB how to "unwind" a program state,
which allows them to walk the stack up.

Notes:

* The code is quite general, perhaps unnecessarily so. Unless we get more
  unwind information, only the first case of pprSetUnwind will get used -
  and pprUnwindExpr and pprUndefUnwind will never be called. It just so
  happens that this is a point where we can get a lot of features
  cheaply, even if we don't use them.

* When determining what location to show for a return address, most
  debuggers check the map for "rip-1", assuming that's where the "call"
  instruction is. For tables-next-to-code, that happens to always
  be the end of an info table. We therefore cheat a bit here by shifting
  .debug_frame information so it covers the end of the info table, as
  well as generating a .loc directive for the info table data.

  Debuggers will still show the wrong label for the return address, though.
  Haven't found a way around that one yet.
The conversion to DWARF is always lossy, so we put all the extra
bits of information into an extra object file section (.debug-ghc).

Notes:

* We use the eventlog format. This might seem like a slightly arbitrary
  choice, but makes it easy to copy debug data into eventlogs later in
  order to do profiling. In the meantime, it's well-defined and extensible,
  so until we run out of record IDs there's no strong reason against it
  either.

* Core notes now cause the complete Core to be copied. We are reasonably
  smart about this: We never emit a piece of Core twice, and use a compact
  binary representation for most Core constructors.

  On the other hand, we just pretty-print types as well as names and emit
  them as strings. This can sometimes lead to packets becoming too large
  for the eventlog format to handle (we had types break the 20k loc mark).
  In order to not run into these kinds of problems, we just omit packets
  that are longer than a certain threshold.

* The amount of data generated here is significant. We therefore use fairly
  low-level generation code using memory buffers. Furthermore, we include
  the data as a string, escaped using another well-optimized low-level
  routine. All this might make it hard to read debug data in the assembly,
  but is absolutely required for debugging not to become a significant
  resource hog.

* The eventlog IDs used here were chosen primarily to avoid collisions.
  If this code gets merged they should be adjusted appropriately.
This is an example of how to set more complicated unwind rules - here
for returnToSched & co. Note that we have to work around a few issues here:
The unwind declaration needs to be the first node in the block, so we
move SAVE_THREAD_STATE accordingly. That's ugly - a system that handles
unwind rules in the middle of a block would be better.
This sets up the infrastructure for sample-based profiling. Namely, we
now read the debug information from .debug_ghc and associate them with
the (relocated) IP ranges. Furthermore, we generate stub debug data for
symbol tables, which allows debugging tools to identify e.g. procedures
from linked C code.

This patch also sets up everything needed for actually emitting samples.
We try to be very general here - samples *could* for example become
cost-centres if we want to support cost-centre based profiling at some
point down the line.
This is one of the cheapest possible ways to get profiling data: The
nursery grows block-wise, with each block getting requested separately
after the last block filled up. We simply fill an array with the sources
of these requests, and get a nice overview of allocation hot spots out
of it.
This casts heap profiling as a source of IP samples. This works because
the closure header pointers are code pointers at the same time - so by
identifying them we get a good view of memory residency.
This is a bit of an experiment - we can theoretically re-use the heap
allocation profiling facilities for identification of instances where
we allocate a lot of stack space (enough to warrant requesting new blocks!).
This is however less useful due to the fact that the allocation of new
stack blocks is, in fact, quite rare. We might have to decrease stack
chunk size (-kc?) to get anything useful out of it.

Also the implementation is pretty hacky...
Now it actually works for multi-threaded programs and pauses correctly for
garbage collections.

The way the code is distributed between rts/Timer.c and rts/posix/Itimer.c
is a bit awkward, might need more work.
Simon Marlow doesn't like this approach, but at this point I am getting
quite fed up with having to set LD_LIBRARY_PATH for every single Haskell
program...
@scpmw scpmw force-pushed the profiling-import branch from 3dc907a to 830f6e7 Compare August 21, 2014 19:13
@Tarrasch

Maybe we should close this in favor of the differential patch?

https://phabricator.haskell.org/D169
