
Conversation

@thomash-acinq (Contributor) commented Nov 19, 2021

Currently when a payment fails we ban the channel for some time (60 seconds by default). There are several problems with that:

  • If we try to relay a huge payment and it fails, we may still be able to relay a smaller payment.
  • If the ban duration is too short, the channel still won't be able to relay payments when we unban it. Many channels are unbalanced and inactive, and if our path-finding algorithm considers them attractive (because they're big and cheap), they'll keep coming back and make our payment attempts fail.

In this PR, I introduce a probabilistic estimation of the channel balance to replace this binary behavior.
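
Concretely, the gist of the approach can be sketched as follows (a deliberately simplified illustration, not the actual code: the BalanceEstimate introduced in this PR also tracks the total capacity, per-bound timestamps and a half-life so that confidence in the bounds decays over time):

import fr.acinq.eclair.MilliSatoshi

// Instead of banning a channel after a failure, keep a [low, high] range for its balance
// and narrow it with every payment result.
case class BalanceRange(low: MilliSatoshi, high: MilliSatoshi) {
  // a payment of `amount` failed on this channel => the balance is below `amount`
  def couldNotSend(amount: MilliSatoshi): BalanceRange =
    if (amount <= low) BalanceRange(MilliSatoshi(0), amount) // our lower bound was wrong, discard it
    else if (amount < high) copy(high = amount)
    else this
  // a payment of `amount` went through => the balance was at least `amount`
  def couldSend(amount: MilliSatoshi): BalanceRange =
    if (amount > low) copy(low = amount) else this
}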

@thomash-acinq thomash-acinq force-pushed the balance-estimate branch 3 times, most recently from dcff914 to 794ddf6 Compare November 24, 2021 15:28
@thomash-acinq thomash-acinq force-pushed the balance-estimate branch 2 times, most recently from 02542d7 to b6d57d9 Compare December 27, 2021 10:32
@codecov-commenter commented Dec 27, 2021

Codecov Report

Merging #2071 (333c0db) into master (0b807d2) will decrease coverage by 0.07%.
The diff coverage is 81.98%.

@@            Coverage Diff             @@
##           master    #2071      +/-   ##
==========================================
- Coverage   87.58%   87.50%   -0.08%     
==========================================
  Files         165      166       +1     
  Lines       12788    12875      +87     
  Branches      563      530      -33     
==========================================
+ Hits        11200    11266      +66     
- Misses       1588     1609      +21     
Impacted Files Coverage Δ
...r/acinq/eclair/payment/send/PaymentLifecycle.scala 86.39% <56.25%> (-0.49%) ⬇️
...src/main/scala/fr/acinq/eclair/router/Router.scala 93.29% <77.77%> (-1.65%) ⬇️
...scala/fr/acinq/eclair/router/BalanceEstimate.scala 83.92% <83.92%> (ø)
.../src/main/scala/fr/acinq/eclair/router/Graph.scala 97.19% <94.87%> (+0.03%) ⬆️
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala 91.84% <100.00%> (+0.03%) ⬆️
...cinq/eclair/remote/EclairInternalsSerializer.scala 97.87% <100.00%> (ø)
...cala/fr/acinq/eclair/router/RouteCalculation.scala 95.10% <100.00%> (ø)
...main/scala/fr/acinq/eclair/router/Validation.scala 90.87% <100.00%> (-1.44%) ⬇️
...main/scala/fr/acinq/eclair/io/PeerConnection.scala 85.44% <0.00%> (-0.39%) ⬇️
... and 1 more

@thomash-acinq thomash-acinq force-pushed the balance-estimate branch 7 times, most recently from 18078b3 to 333c0db Compare December 28, 2021 14:53
@thomash-acinq thomash-acinq marked this pull request as ready for review December 28, 2021 15:08
@t-bast (Member) left a comment


I think it's a really good direction to explore! I like the model; I've left some comments, mostly on what functions it should expose, but the actual logic sounds good to me.

I would scope this PR to only create the balance estimates though, without the path-finding changes. I expect that we will need to iterate on it after seeing how it behaves in practice, before we can start an A/B experiment to actually use it.

def baseline(capacity: Satoshi, halfLife: FiniteDuration): BalanceEstimate = noChannels(halfLife).addChannel(capacity)
}

case class BalancesEstimates(balances: Map[(PublicKey, PublicKey), BalanceEstimate], defaultHalfLife: FiniteDuration) {
Member

Instead of using a (PublicKey, PublicKey) pair and having to do the LexicographicalOrdering check in many places, why not create a case class BalanceEdge(a: PublicKey, b: PublicKey) that enforces this invariant (a < b)?
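
A hedged sketch of what such a wrapper could look like (the ordering helper is the LexicographicalOrdering check mentioned above; the `between` constructor and the exact shape are only an illustration, not the final API):

import fr.acinq.bitcoin.Crypto.PublicKey
import fr.acinq.bitcoin.LexicographicalOrdering

// The smart constructor normalizes the argument order so the invariant a < b holds by construction
// and callers no longer need to repeat the lexicographical check at every call site.
case class BalanceEdge(a: PublicKey, b: PublicKey) {
  require(LexicographicalOrdering.isLessThan(a.value, b.value), "BalanceEdge nodes must be lexicographically ordered")
}

object BalanceEdge {
  def between(x: PublicKey, y: PublicKey): BalanceEdge =
    if (LexicographicalOrdering.isLessThan(x.value, y.value)) BalanceEdge(x, y) else BalanceEdge(y, x)
}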

otherSide.couldNotSend(totalCapacity - amount, timestamp).otherSide

def didSend(amount: MilliSatoshi, timestamp: TimestampSecond): BalanceEstimate =
copy(low = (low - amount) max MilliSatoshi(0), high = (high - amount) max MilliSatoshi(0))
Member

It looks like you forgot to update the timestamps?
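
For reference, a possible version of didSend with the timestamps refreshed, as the comment suggests (just a sketch of the suggestion, not necessarily the final code):

def didSend(amount: MilliSatoshi, timestamp: TimestampSecond): BalanceEstimate =
  copy(
    low = (low - amount) max MilliSatoshi(0), lowTimestamp = timestamp,
    high = (high - amount) max MilliSatoshi(0), highTimestamp = timestamp
  )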

def didSend(amount: MilliSatoshi, timestamp: TimestampSecond): BalanceEstimate =
copy(low = (low - amount) max MilliSatoshi(0), high = (high - amount) max MilliSatoshi(0))

def addChannel(capacity: Satoshi): BalanceEstimate = copy(high = high + toMilliSatoshi(capacity), totalCapacity = totalCapacity + capacity)
Member

I don't think we should sum the capacity of individual channels, since HTLCs cannot be dynamically split.
I think we should use the max instead.
Otherwise I'm afraid this will not work well for nodes that have a lot of small channels (we'll have a high value that may be N times the actual size of their biggest channel, which isn't actually usable).

It does make removeChannel harder though. Maybe we can simply replace it with a removeEdge function that is called when the last channel between a pair of nodes has been closed?

Or instead of computing the capacity ourselves, we can let external callers that have access to the graph object tell us what capacity to use for a given edge (i.e. we would expose an updateCapacity function and it's the caller's responsibility to take the max of all existing channels)? If we do that I think we get rid of both addChannel and removeChannel and replace them with addOrUpdateCapacity (which creates the edge if not created before, and sets its capacity) and removeEdge.
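
A hedged sketch of the API shape being suggested (names come from the comment above, bodies are assumptions; `updateCapacity` is the hypothetical method on BalanceEstimate that the comment mentions, it does not exist in this diff, and the caller is responsible for computing the capacity, e.g. the max over parallel channels):

case class BalancesEstimates(balances: Map[(PublicKey, PublicKey), BalanceEstimate], defaultHalfLife: FiniteDuration) {
  // creates the edge if it did not exist yet, otherwise just overwrites its capacity
  def addOrUpdateCapacity(a: PublicKey, b: PublicKey, capacity: Satoshi): BalancesEstimates =
    copy(balances = balances.updated((a, b), balances.get((a, b)) match {
      case Some(balance) => balance.updateCapacity(capacity) // hypothetical helper, see above
      case None => BalanceEstimate.baseline(capacity, defaultHalfLife)
    }))

  // called when the last channel between the pair of nodes has been closed
  def removeEdge(a: PublicKey, b: PublicKey): BalancesEstimates =
    copy(balances = balances.removed((a, b)))
}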


For a very long time I was in favor of combining parallel channels for uncertainty estimates into one large virtual channel with the sum of the capacities. However, this becomes really tricky if these channels have different fee policies, and one may wish to include those as an additional feature in the cost function of the min-cost flow problem. I finally arrived at the point where I maintain those parallel channels in my data structure. While I believe it messes up the uncertainty prediction, in particular as node operators may choose to forward on one of the other channels, I think it is a "them" problem and not a "me" problem.

Member

However this becomes really tricky if these channels have different fee policies

Do note that it doesn't make any sense IMO to have different fee policies on different channels with the same peer (because of non-strict forwarding).

@renepickhardt renepickhardt left a comment

I aborted my review as I cannot make any sense of the semantics of the canSend method, which I guess is the main point where you estimate the actual probability of a payment being successful given your belief about the network.

Overall I can state that a lot of the correct ideas and approaches are in here, and some of my comments are very detailed and nuanced because of that. So please don't be offended if I sound overly critical, but rather take this as a form of respect.

In between the comments I have linked to my reference implementation of handling the probabilities. I strongly encourage you to read through the glossary of that notebook (especially terms like cost, uncertainty cost, linearized integer uncertainty unit cost, unit cost and features) and the UncertaintyChannel class itself. Maybe this can help resolve the current misunderstandings that I/we seem to have about the semantics of your computation.

Notable differences between your code and mine, which should probably be set aside when comparing our approaches:

  • My code also accounts, in the probability estimates, for the in-flight HTLCs that we currently put on the network due to an ongoing payment session.
  • My code currently does not decay the belief about the network. Instead I reset the learnt information before every new payment session (which is obviously not preferable).
    • However, as noted below, I would decay the low and high bound instead of using a weighted average between prior and conditional probability.
  • My code is mainly created for educational purposes and it does not yet update the belief in both directions of the channel when we have updates (you encouraged me to do this sooner rather than later, as I guess this may be confusing for others if it is missing).
  • Also, as you can guess, instead of Yen's algorithm I use the more accurately modeled min-cost flow problem for predicting candidate onions for the MPP problem. (For that reason I also linearize the routing cost, which is easiest when ignoring non-zero-base-fee channels.)

*/
private def decay(amount: MilliSatoshi, successProbabilityAtT: Double, t: TimestampSecond): Double = {
val decayRatio = 1 / math.pow(2, (TimestampSecond.now() - t) / halfLife)
val baseline = 1 - amount.toLong.toDouble / toMilliSatoshi(totalCapacity).toLong.toDouble


This assumes a uniform distribution of the liquidity in channels. Recent research indicates that this is only the case for about 70% of the channels. The other 30% are drained. Thus a probability of 0.3 * 1/2 + 0.7 * (1 - a/(c+1)) seems more accurate. Of course, while operating the node one may collect statistics and constantly estimate the 30/70 split. Alternatively one may estimate the split per channel, as the research also reported that there seems to be an active kernel in the network where one may safely assume a uniform distribution.

In particular, the success probability is not 1 - a/c but rather 1 - a/(c+1), though the difference is negligible in most cases.

By the way: baseline was a very confusing term. Maybe call it priorSuccessProbability, as it encodes P(X > amt), and the other one conditionalSuccessProbability, as it should be the success probability under the conditional information that the random variable is larger than minLiquidity and smaller than maxLiquidity, which in math terms may be expressed as P(X > amt | minLiquidity < X < maxLiquidity).
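
As a hedged illustration of the prior described above (the 30/70 split is the figure quoted in this comment, not something measured in this PR, and the function name follows the naming suggested here):

// Probability that a payment of `amount` fits in a channel of capacity `capacity`, before any
// observation: a mixture of "drained" channels (roughly a coin flip on which side holds all the
// liquidity) and channels with uniformly distributed liquidity, where P(X >= a) = 1 - a/(c+1).
def priorSuccessProbability(amount: MilliSatoshi, capacity: MilliSatoshi): Double = {
  val uniform = 1.0 - amount.toLong.toDouble / (capacity.toLong.toDouble + 1)
  0.3 * 0.5 + 0.7 * uniform
}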

private def decay(amount: MilliSatoshi, successProbabilityAtT: Double, t: TimestampSecond): Double = {
val decayRatio = 1 / math.pow(2, (TimestampSecond.now() - t) / halfLife)
val baseline = 1 - amount.toLong.toDouble / toMilliSatoshi(totalCapacity).toLong.toDouble
baseline * (1 - decayRatio) + successProbabilityAtT * decayRatio


I am a bit surprised that you do a weighted average here. Why not just decay minLiquidity towards 0 and maxLiquidity towards capacity with the given halfLife? I did not check how these two compare to each other, but it seems cleaner than using a combination of the prior knowledge and the conditional knowledge.

Contributor Author

Decaying low towards 0 doesn't make sense. At a time t we have a given probability distribution for the balance, and decaying means going back to the prior (which is our uniform baseline); that's why I'm averaging the distribution at time t and the baseline.
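
To illustrate the behaviour being defended here, the decay from the diff above can be read as a standalone blend between the fresh observation and the uniform prior (a sketch using the same formula as the code, with hypothetical parameter names):

import scala.concurrent.duration._

def decayed(prior: Double, observed: Double, elapsed: FiniteDuration, halfLife: FiniteDuration): Double = {
  val decayRatio = 1 / math.pow(2, elapsed / halfLife)
  prior * (1 - decayRatio) + observed * decayRatio
}

// elapsed = 0          => decayRatio = 1   => we fully trust the observation
// elapsed >> halfLife  => decayRatio -> 0  => we are back to the uniform prior, not at 0 or 1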

def couldNotSend(amount: MilliSatoshi, timestamp: TimestampSecond): BalanceEstimate =
if (amount < high) {
if (amount > low) {
copy(high = amount, highTimestamp = timestamp)


I think the high estimate has to be updated independently of the low estimate.

Contributor Author

high is updated regardless of the value of low.

if (amount > low) {
copy(high = amount, highTimestamp = timestamp)
} else {
// the balance is actually below `low`, we discard our lower bound


But I agree with the next line that if the amount is below low, one should discard the low estimate.

// the balance is actually below `low`, we discard our lower bound
copy(low = MilliSatoshi(0), lowTimestamp = timestamp, high = amount, highTimestamp = timestamp)
}
} else {


I don't understand what you are doing here. If you use the conditional probabilities properly, you would never come to the situation where you send something larger than your last estimate.

Sure the decay may change this but then I think the new amount should just be the new high estimate as indicated two comments above.

val pLow = decay(low, 1, lowTimestamp)
val pHigh = decay(high, 0, highTimestamp)

if (amount < low) {


In this case the probability of sending the amount should be 1 as you are certain that you can send the amount!

However, taking your notation, a - x is the amount that can be sent. But I do not understand the semantics of x * pLow and what is achieved by dividing everything by the lower bound of the channel a. It certainly does not give you the satoshis that may safely be sent, and it most certainly does not give you an accurate probability.

I have similar issues understanding the next 5 lines.

Contributor Author

No.
low was the lower bound at some time t. If you call canSend at that time t, then yes, the result will be 1, but if you call it later it will be lower than 1.

* @param totalCapacity total capacity of all the channels between the pair of nodes
* @param halfLife time after which the certainty of the lower/upper bounds is halved
*/
case class BalanceEstimate private(low: MilliSatoshi, lowTimestamp: TimestampSecond, high: MilliSatoshi, highTimestamp: TimestampSecond, totalCapacity: Satoshi, halfLife: FiniteDuration) {


I think LiquidityEstimate would be a much better name, because it is the uncertainty in liquidity information that yields a problem. I know in our early work we also called it Balance, but this term turned out to be very confusing in many conversations, which is why I arrived at the point of consistently calling it Liquidity.
That being said, if you look at the reference implementation that I did for the stuff that I think you try to achieve here, I use the term UncertaintyChannel for the class that quantifies my belief about the channel's liquidity and estimates probabilities and costs that can be used in routing decisions (in particular for the min-cost flow solver), and UncertaintyNetwork for the entire network that maintains such information.

You can find a very well documented notebook with a glossary at https://github.com/renepickhardt/mpp-splitter/blob/pickhardt-payments-simulation-dev/Pickhardt-Payments-Simulation.ipynb which is currently up for review at: renepickhardt/mpp-splitter#11


/* Estimate the probability that we can successfully send `amount` through the channel
*
* We estimate this probability with a piecewise linear function:


I agree with the gist / goal here, but I don't see this being reflected in the code, nor do I understand why this is the comment for the function removeChannel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the comment for removeChannel, it's for canSend.
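
For readers of this thread, here is a hedged reconstruction of the piecewise-linear estimate that the doc comment describes (the exact expressions in the PR may differ and degenerate cases such as low == high are ignored; pLow and pHigh are the decayed confidences in the two bounds, as computed by the decay calls shown earlier):

def canSendSketch(amount: MilliSatoshi, low: MilliSatoshi, high: MilliSatoshi, capacity: MilliSatoshi, pLow: Double, pHigh: Double): Double = {
  if (amount < low) {
    // between 0 and `low`: interpolate between 1 (we can always send 0) and pLow
    1.0 + (pLow - 1.0) * amount.toLong.toDouble / low.toLong.toDouble
  } else if (amount > high) {
    // between `high` and the capacity: interpolate between pHigh and 0 (we can never send more than the capacity)
    pHigh * (capacity - amount).toLong.toDouble / (capacity - high).toLong.toDouble
  } else {
    // between the two bounds: interpolate between pLow and pHigh
    pLow + (pHigh - pLow) * (amount - low).toLong.toDouble / (high - low).toLong.toDouble
  }
}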

def baseline(capacity: Satoshi, halfLife: FiniteDuration): BalanceEstimate = noChannels(halfLife).addChannel(capacity)
}

case class BalancesEstimates(balances: Map[(PublicKey, PublicKey), BalanceEstimate], defaultHalfLife: FiniteDuration) {


As mentioned above, I call this the UncertaintyNetwork.

* @param includeLocalChannelCost if the path is for relaying and we need to include the cost of the local channel
*/
private def addEdgeWeight(sender: PublicKey, edge: GraphEdge, prev: RichWeight, currentBlockHeight: BlockHeight, weightRatios: Either[WeightRatios, HeuristicsConstants], includeLocalChannelCost: Boolean): RichWeight = {
private def addEdgeWeight(sender: PublicKey, edge: GraphEdge, balance: BalanceEstimate, prev: RichWeight, currentBlockHeight: BlockHeight, weightRatios: Either[WeightRatios, HeuristicsConstants], includeLocalChannelCost: Boolean): RichWeight = {


Assuming the canSend function returns the actual conditional probability given your knowledge, the entire thing here is a bit weird to me. IIRC Dijkstra works by minimizing the sum of weights. However an onion is successful if it is successful at every hop, which means the probabilities along the path have to be multiplied.
So just factoring in your probability weights from your estimates would again break the probabilistic model.

Without deeper understanding I initially used the negative logarithm to transform the multiplication of probabilities into an addition that can be used in Dijkstra or for min-cost flow problems with separable cost functions. However, while creating my reference implementation and tinkering with feature engineering, I realized that for a channel of capacity c we have -log(P(X >= c)) = log(c + 1), which means that the result is just the entropy of the channel. This is what made me realize that the negative logarithms are not just a neat mathematical trick but rather something that I now call the uncertainty cost. This can, by the way, easily be linearized and transferred to a unit cost that is comparable to the fee rate of routing. I strongly suggest looking at the glossary of https://github.com/renepickhardt/mpp-splitter/blob/pickhardt-payments-simulation-dev/Pickhardt-Payments-Simulation.ipynb where I explain this all in more detail. A lot of the choices in feature engineering become very natural and clean when thinking about the situation from the perspective of uncertainty costs (or linearized integer uncertainty unit costs).
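
A hedged illustration of the log trick described above (just the mathematical identity, not how eclair actually weights edges):

// Dijkstra minimizes a sum of edge weights, while the success probability of a route is the
// product of per-hop probabilities. Using -log(p) as the additive weight makes the two agree:
// minimizing sum_i(-log(p_i)) = -log(prod_i(p_i)) is the same as maximizing the product.
def uncertaintyCost(successProbability: Double): Double = -math.log(successProbability)

// e.g. two hops with probabilities 0.9 and 0.5:
// uncertaintyCost(0.9) + uncertaintyCost(0.5) == uncertaintyCost(0.9 * 0.5)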

Contributor Author

This is unrelated to this PR, I've fixed it in #2257
However I'm not convinced the fix is actually useful in practice. By using 1/p we can give a cost to failed attempts, which is something concrete; when using -log(p) we fix Dijkstra, but the penalty factor does not correspond to anything we care about.

lockedFundsRisk = config.getDouble("locked-funds-risk"),
failureCost = getRelayFees(config.getConfig("failure-cost")),
hopCost = getRelayFees(config.getConfig("hop-cost")),
usePastRelaysData = config.getBoolean("use-past-relay-data"),
Member

That config value isn't defined in reference.conf

* @param usePastRelaysData use data from past relays to estimate the balance of the channels
*/
case class HeuristicsConstants(lockedFundsRisk: Double, failureCost: RelayFees, hopCost: RelayFees)
case class HeuristicsConstants(lockedFundsRisk: Double, failureCost: RelayFees, hopCost: RelayFees, usePastRelaysData: Boolean)
Member

Why isn't this called useBalanceEstimates? It would be more coherent, wouldn't it?

@thomash-acinq thomash-acinq marked this pull request as draft May 10, 2022 11:44
@t-bast (Member) commented Jun 8, 2022

Since we merged #2272, we can probably close this PR for now? We'll gather mainnet data from our node and will later submit a new PR to use the balance estimation once our model seems to be accurate.

@thomash-acinq (Contributor, Author)

No, this PR also contains the part that uses the data for path-finding. There is no reason to throw that away.

@t-bast (Member) commented Jun 8, 2022

That won't be lost, you just have to keep the branch...?

All of the comments made on the balance estimation part will be obsolete once you rebase, making it unnecessarily harder for people to read through the PR. It feels better to me to start from scratch and open a new PR that only deals with using the balance estimation, to keep it focused.
