Many datasets on the internet accidentally expose location information by allowing distance-based search (“find everything within 3 miles around this point”) from arbitrary points of origin. I happened upon such a dataset and wanted to try to rediscover the original location through trilateration. Trilateration, commonly misnamed triangulation, is the process of finding a location L given three other locations A, B, C and their distances to L (let’s call them dA, dB and dC). Surprisingly, I was not able to find many good, sure-fire solutions when I wanted to solve a trilateration problem, so here’s a summary of what I ended up doing.
Given the aforementioned setup with absolutely accurate values and under the assumption that the earth is flat, L can be found as the intersection of the three circles around A, B and C with radii dA, dB and dC. Unfortunately, the earth is neither flat nor do we commonly have perfect values for the radii (and potentially the coordinates of A, B and C).
With reasonably good data however, we can compute L as the intersection of the spheres around A, B and C with their respective radii. Let’s look at that; I hacked together this 3D demo (zoomable and pannable by dragging and mouse-wheel scrolling):
In this example, the three known points are Moscow, Madrid and London, and the radii are the distances from Munich, Germany, calculated using the Haversine formula. If you zoom into the intersection of the spheres, we actually do hit Munich pretty well, and the spheres don’t seem to overlap, so that’s a start.
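For reference, a minimal Haversine implementation (spherical earth model with a 6371 km mean radius; the function name is mine):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere approximating the earth.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))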
Let’s assume our input is 3 sets of latitude/longitude and distance in kilometers. We need to map this to a coordinate system where we can actually determine the intersection somewhat easily. A good candidate is ECEF. Ideally, a library should be used as this gets iffy very quickly: the forward mapping from geo coordinates to ECEF is still reasonably straightforward, but the back-mapping is an iterative algorithm and not trivial. A great library for this with bindings to many languages is PROJ.4.
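With pyproj, Python’s PROJ bindings, the round trip looks roughly like this (newer pyproj versions prefer the Transformer API over pyproj.transform):

import pyproj

# WGS84 geodetic (lon/lat/height) <-> geocentric ECEF (x/y/z in meters).
geodetic = pyproj.Proj(proj="latlong", datum="WGS84")
ecef = pyproj.Proj(proj="geocent", datum="WGS84")

x, y, z = pyproj.transform(geodetic, ecef, 11.58, 48.14, 0.0)  # Munich: lon, lat, h
lon, lat, h = pyproj.transform(ecef, geodetic, x, y, z)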
Once we’ve mapped the geo coordinates to xyz coordinates on the earth ellipsoid’s surface, we can proceed to determining the intersection of the spheres around our xyz points and map the point of intersection back to geo coordinates using PROJ.4. Determining the intersection can be done as described in this stackexchange post. The post suggests manually converting coordinates back and forth between ECEF and geo coordinates; I’ll show in the next section why that may not be such a good idea.
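The construction from that post, sketched with numpy: p1, p2 and p3 are sphere centers as ECEF arrays, r1, r2 and r3 the corresponding distances in meters. It returns the two candidate points above and below the plane of the three centers.

import numpy as np

def sphere_intersection(p1, p2, p3, r1, r2, r3):
    # Build a local frame: p1 at the origin, p2 on the x-axis,
    # p3 in the x-y plane; then solve for x, y, z in that frame.
    ex = (p2 - p1) / np.linalg.norm(p2 - p1)
    i = np.dot(ex, p3 - p1)
    ey = p3 - p1 - i * ex
    ey /= np.linalg.norm(ey)
    ez = np.cross(ex, ey)
    d = np.linalg.norm(p2 - p1)
    j = np.dot(ey, p3 - p1)
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y = (r1**2 - r3**2 + i**2 + j**2) / (2 * j) - (i / j) * x
    # Clamp for noisy data where the spheres don't quite touch.
    z = np.sqrt(max(r1**2 - x**2 - y**2, 0.0))
    base = p1 + x * ex + y * ey
    return base + z * ez, base - z * ez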
The reason I am investigating this is that I happened upon a website which allows me to search for things and sorts the results by distance from my location. It does not reveal the location of the search results. I figured I might be able to search for them from 3 different locations and do the trilateration, which, when tried visually by drawing circles on Google Maps, looked very promising.
I first tried the approach posted here. This does in fact yield points which are in the right vicinity but not quite where the location actually is. I initially
only looked at the distance between the trilateration results and the actual
location. Plotting the results on a map is a lot more helpful in understanding
what the problem is:
The actual location is Haimhausen. We observe two problems: First, the points used for A, B and C are all south-east of Haimhausen, and it looks like we fall a little short of the target on the east-west axis. Then, much worse, the points are quite spread out on the north-south axis. This led me to suspect that the conversion from geo coordinates to ECEF listed on Stack Overflow is not accurate enough for what I wanted. The post mentions that “if using an ellipsoid this step is slightly different”, which is quite the understatement. In fact, the code for doing the proper ECEF conversion is fantastically more complicated. A good example for validating one’s implementation (and for looking at the code used in their implementation) can be found here.
I decided to try using Python’s PROJ.4 bindings to make this projection easier and see if it improves the result at all. Lo and behold, using an ellipsoid:
This is vastly better than before. The north-south and west-east axes seem to be about equally skewed now. I was however still convinced that this could be improved. My suspicion was that the distance metric used for calculating the geo-coordinate distance pairs on the server differed from Haversine or Vincenty. I played around with the numbers a little bit and eventually tried multiplying all distances by a small correction factor. This yields a much better result and makes the trilaterated points cluster a lot more densely. I tried this with many known points from the dataset and this constant helps all trilaterations about equally. It looks like, internally, the source of the dataset maps all geo coordinates onto a flat x-y plane and calculates distances on that plane. Result with the correction factor:
The reason I’m showing more than one result point is because I actually have
10 source points with distances and this is the trilateration algorithm run
on each combination of three out of those 10 source points.
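In code, that is simply one run per 3-subset (trilaterate standing in for the pipeline sketched above):

from itertools import combinations

results = [trilaterate(a, b, c)
           for a, b, c in combinations(source_points, 3)]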
Trilateration as an optimization problem.
I mentioned this to a friend who suggested thinking about it as an optimization problem. I pondered this for a bit and googled it and that quickly got me excited: We can likely express this in a fairly clean way and have all the hard work be done by a nonlinear solver. This reduces code, does away with coordinate system conversions and - even better - allows taking all source points into account at the same time.
I hacked this up in Python. We try to minimize the sum of squared differences between the computed distances from the target point L to all input points A, B, … etc. and the reported distances. Let’s assume L is an array consisting of latitude and longitude, and the input is an array points consisting of one array per point containing latitude, longitude and distance, in that order. The Python then comes out to this:
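A reconstruction of that function from the description, assuming geopy’s vincenty (renamed geodesic in geopy 2.x):

from geopy.distance import vincenty  # geopy.distance.geodesic in newer geopy

def distfunc(L, points):
    # Sum of squared differences between the distance from the candidate
    # point L and the reported distance, over all input points.
    distsum = 0.0
    for lat, lon, dist_km in points:
        d = vincenty((L[0], L[1]), (lat, lon)).km
        distsum += (d - dist_km) ** 2
    return distsum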
We can use scipy.optimize to minimize distfunc using the fmin_powell method. This comes down to:
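Again reconstructed along the lines described:

from scipy.optimize import fmin_powell

L = fmin_powell(distfunc, [0, 0], args=(points,))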
One thing we have to do, however, is to either pick the initial guess ([0,0] in the example call) close to the answer, or to make sure that distfunc returns something large for values of L for which it is not defined. Otherwise, the result may lie outside of the range of valid geo coordinates, simply because the vincenty function happily produces distances even for illegal inputs, except that these are not well-defined.
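One way to do the latter, as a sketch (the penalty value is arbitrary):

# Inside distfunc, before any distances are computed:
if not (-90 <= L[0] <= 90 and -180 <= L[1] <= 180):
    return 1e10  # large penalty outside the valid lat/lon range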
In addition to constraining illegal values for L, for my input data set I had to account for the fact that all distances seemed to be off by a factor. This can easily be done with this approach by adding a third variable to L, the magic factor, and having it optimized as well. This changes the distance computation to
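# Sketch: with the [0,0,0] initial guess below, one plausible form is to
# scale each computed distance by (1 + L[2]), where L[2] is the factor.
d = vincenty((L[0], L[1]), (lat, lon)).km * (1 + L[2])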
and requires the initial guess to be extended by one more zero, to [0,0,0]. I tried the algorithm with a dataset which had 30 points with
distances and a known result for L. The following list is the number of points together with median, average and 90th-percentile error from the actual point in kilometers.
With 10 points, the error is 50 centimeters (less than 2 feet). Seeing as most geo coordinates in databases are probably geocoded, this is more than enough to reconstruct the original address of a point. Visually, the red pin is the original, the blue one is the trilaterated point:
I know, the red one is basically invisible. Let’s zoom in:
A remaining question is why this only starts delivering good results at 8 points and not fewer. I am guessing this is because of the additional distance factor, which is only correctly determined with 8 or more points. Since the API I am reverse-trilaterating allows me to get an essentially unlimited number of points, this is not a big issue.
Protecting yourself from trilateration.
For me, it’s great that I can recover the location information from the provided data; it gives me an edge over other consumers of the data. In general though, you probably don’t want to make this mistake; you want to protect the location information your customers enter into your website. This is actually subtle. At first sight, adding a random number to all reported distances seems like an easy fix. Let’s do that and run the optimization algorithm again. This is with up to 250m added or subtracted for every distance:
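As a sketch of that experiment (distances in kilometers, so ±0.25 km):

import random

noisy_points = [(lat, lon, d + random.uniform(-0.25, 0.25))
                for lat, lon, d in points]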
Even though we added ±250m to all points, we found the original coordinates with an error of only 120 meters. This is what that looks like:
That’s not that far off and we can do even better by adding more points (even though this gets better quite slowly).
From my point of view, the real fix is to NOT use the user-supplied geo location in the distance formula to begin with. For instance, one could select a representative point P within 500 meters and calculate all distances from there. This way, the user-supplied location is not even part of the distance calculation, and at most the selected representative point can be trilaterated back from the data. If the same point P is chosen for every location within 500 meters of it, a reasonable degree of anonymity can be achieved in urban areas.
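A minimal sketch of such snapping; the cell size is a rough approximation (0.005° of latitude is about 550 m, and longitude cells shrink towards the poles):

def representative_point(lat, lon, cell_deg=0.005):
    # Everyone inside the same grid cell shares the same point P.
    return (round(lat / cell_deg) * cell_deg,
            round(lon / cell_deg) * cell_deg)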
If you own any stocks or funds you probably want to track the allocation, meaning how much money is in stocks and bonds respectively. Some fairly well-known, low-key investment strategies rely on balancing this ratio.
In Ledger, allocations can be modelled easily through allocation-based accounts and automated transactions. This is outlined here.
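Roughly the shape the linked approach takes; the account name and split are made up:

= /^Assets:Funds:VFIFX/
    (Allocation:US Stock)         0.54
    (Allocation:Non US Stock)     0.36
    (Allocation:Bonds)            0.10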
If you do this however, you still have to manually update the allocation for every fund you own, which is a nuisance, especially if you own something like a target retirement fund, which naturally and purposefully changes its allocation strategy over time.
In a nutshell, we need an automated way of creating records like this:

38.00$ Non US Stock
80.00$ US Stock

so that running

ledger bal Allocation: -X '$' -%

roughly yields

19.00% Non US Stock
40.00% US Stock
A reasonable source for asset allocation per fund is Morningstar. There, each fund has a Portfolio page which shows asset allocation. Let’s try to extract that.
First, let’s use Ledger to determine what kind of commodities we might need to
figure out allocation for:
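Ledger’s commodities command does that directly; assuming the journal lives in main.dat:

ledger -f main.dat commodities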
Then we load the Morningstar page and extract all relevant data using Nokogiri. I have used Selenium to do scraping for banks before, but here that seems unnecessary: Morningstar renders the HTML server-side which, combined with a DOM parser, allows for very easy data extraction.
Essentially, we use a few XPath expressions to find the relevant data in the page and extract the asset allocation, roughly as sketched below. The code omitted here is a very dumb heuristic to determine allocation when we are not dealing with a fund or when no data is available. We use the category name, which is usually something like “Total Stock Market Fund” or “Total International Bond Market Fund”, as a basis for the allocation decision and otherwise just assume it’s a stock, for the simple reason that I don’t trade individual bonds.
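A sketch of that extraction; the URL and XPath are hypothetical stand-ins for Morningstar’s actual, frequently changing markup:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("https://www.morningstar.com/funds/xnas/VFIFX/portfolio"))
doc.xpath("//table[contains(@class, 'allocation')]//tr").each do |row|
  cells = row.xpath("./td").map { |td| td.text.strip }
  next if cells.size < 2
  label, pct = cells
  puts "#{label}: #{pct}"
end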
If you use Ledger CLI, chances are you will use more commodities than just the dollar. This is actually one of Ledger’s best features. For instance, I track our 401k in Ledger, and it allows me to list the ticker symbols of the funds we have purchased inside the 401k as well as the fact that this is pre-tax money. Our opening balances look something like this:
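Something along these lines, with made-up amounts:

2016/07/01 * Opening Balances
    Assets:Retirement:401k    120.000 VFIFX @ 29.50p$
    Assets:Retirement:401k     80.000 VTTSX @ 29.80p$
    Equity:Opening Balances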
Roughly, this means that we have two different funds in our 401k, with ticker symbols VFIFX and VTTSX respectively. Also, the funds are not valued in dollars but in p$, which is our commodity for pre-tax money. The idea is that this money, if accessed today, has a much lower actual value because we’d have to pay federal, state and penalty taxes on it. Therefore, the price database for this looks like this:
P 2016/07/19 p$ 0.55$
P 2016/07/31 VFIFX 30.15p$
P 2016/07/31 VTTSX 30.15p$
where the first row is an approximation of federal tax rate + state tax rate + 10% penalty. The last two rows are the current prices in dollars but expressed in p$ to indicate that we haven’t paid taxes on this yet.
Now, while I am not very concerned about the detailed movement of the funds we own, I want to be able to pull current prices for them without looking up every ticker symbol manually. The following script roughly achieves this goal; a brief walkthrough follows.
First, we query all commodity names from Ledger and convert them into a hashmap. Some cleanup is required as a few symbols in my ledger file don’t correspond to their Google Finance equivalents.
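As a sketch, with a made-up alias table:

# Query commodities via the ledger CLI and map each ledger symbol to its
# Google Finance equivalent, skipping the ones that aren't tickers.
commodities = `ledger -f main.dat commodities`.split("\n").map(&:strip)
commodities -= ["$", "p$"]
aliases = { "VFIFX" => "MUTF:VFIFX" } # hypothetical fix-ups
tickers = commodities.map { |c| [c, aliases.fetch(c, c)] }.to_h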
Now we just need to retrieve current data and output it in the right format for
Ledger’s price db.
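The output side, sketched; fetch_quote is a hypothetical helper wrapping whichever quote API is used (the post used Google Finance, since discontinued):

require 'date'

today = Date.today.strftime("%Y/%m/%d")
tickers.each do |ledger_symbol, quote_symbol|
  price = fetch_quote(quote_symbol) # hypothetical, returns a float
  # Funds held pre-tax would be printed in p$ instead of $.
  puts "P #{today} #{ledger_symbol} #{price}$"
end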
When trying to migrate away from Mint to Ledger, one of the major obstacles is the work of entering every 3 dollar coffee shop transaction manually. Ledger does provide some help in this area through the use of ledger xact, yet as of today, we enter 25 transactions per week into Ledger, and a large chunk of them is from our Chase credit cards. Automation seems to be the easiest way to make this a lot less painful.
Headless Selenium to the rescue.
Selenium commonly runs while interacting with an actual, visible browser. This however makes windows pop up on my machine which I don’t care about, so I migrated to a headless implementation, in this case PhantomJS, early on. This makes development a little trickier because you can’t observe what Selenium does and where it gets stuck, but screenshotting mostly alleviates this problem.
The language of choice for this was Ruby, for no other reason than me being pretty familiar with it and it offering Selenium bindings as well as a reasonable amount of example code. Curiously, Chase seems to require the user-agent to look non-automated, so this is the first hurdle to take care of while automating:
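A sketch of the override with the Ruby Selenium bindings; the capability key is PhantomJS/GhostDriver specific and the user-agent string is just an example:

require 'selenium-webdriver'

capabilities = Selenium::WebDriver::Remote::Capabilities.phantomjs(
  "phantomjs.page.settings.userAgent" =>
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 " \
    "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
)
driver = Selenium::WebDriver.for(:phantomjs, desired_capabilities: capabilities)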
I mostly copied the user-agent string from my actual browser and that seemed to be enough. Since PhantomJS is a full browser, it takes care of all cookie handling and other features that might be required to get the Chase website to work.
Login & Frames :-(
What’s not trivial is realizing that the Chase website actually uses frames. Frames require special handling in Selenium: when elements can’t be found, it is often because they sit inside a frame which needs to be entered first.
For Chase, we first navigate to a URL displaying the login box:
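The URL below is a from-memory placeholder, not necessarily Chase’s current login address:

driver.navigate.to "https://chaseonline.chase.com/"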
Now we enter the frame the login inputs are located in and fill them:
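Frame name, element ids and the debug flag are hypothetical reconstructions:

driver.switch_to.frame "logonbox"
driver.find_element(id: "userId-input-field").send_keys username
driver.find_element(id: "password-input-field").send_keys password
driver.save_screenshot "login.png" if debug
driver.find_element(id: "signin-button").click
driver.switch_to.default_content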
The code above enters the frame, fills the form, takes a screenshot if in debug mode and then logs us in. The last line exits the frame again as the remainder of the interaction happens outside of frames.
After successfully submitting the login, Chase may ask for a second authentication factor. The code on GitHub handles this case by asking for the OTP token that Chase emails to the address stored in their database; here I’ll omit it for brevity.
Account discovery and CSV download.
Next we have to wait until the accounts overview is loaded, extract all account numbers and download a CSV file with the latest transactions.
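A sketch with hypothetical selectors; the CSV download itself depends on Chase internals and is omitted here:

wait = Selenium::WebDriver::Wait.new(timeout: 30)
wait.until { driver.find_elements(css: ".account-tile").any? }
account_ids = driver.find_elements(css: ".account-tile")
                    .map { |tile| tile["data-account-id"] }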
This is pretty much it. Now we have a CSV file for all transactions for all accounts accessible through the supplied chase login. This can easily be translated to Ledger, either through its import methods or through Reckon.
I use the following Reckon command to convert Chase CSV to Ledger:
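Roughly along these lines; the file names are placeholders and the exact flags are best checked against reckon --help:

reckon -f chase.csv -a Liabilities:Chase -l main.dat -o chase.dat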
In the last 4 months, I did not have to change any of the code. I did however once have to enable screenshotting to notice that Chase was serving a notice which had to be acknowledged before reaching the accounts page. After manually doing that, the script worked again without modification. I have scripts for 2 other banks and I have not modified any of them either. Bank websites don’t seem to change that much, it turns out.
I’ve recently switched all of my family’s accounting from manually looking at bank websites + mint.com to Ledger. This post describes the first part of that fairly long but ultimately successful journey: motivation and challenges.
Motivation A: If your bank dropped $1,000 from your account, would you notice?
I think my dad would have. He had all statements printed. In Germany, you do this at a small machine in the bank building (or you have the printouts mailed to you), and by putting them into a folder, you see old and new balances as well as the transactions between the two. This presumably makes spotting that something is off much easier.
For me, this is probably a no. My checking account balance is usually fairly high since our rent is paid by check, and that is a few thousand dollars that have to be in the account when the check is cashed. Some days, I would probably notice a thousand dollars missing. Other days I wouldn’t.
Worse yet, I just read an intriguing email thread at work where someone’s bank took credit card payments for someone else’s credit card out of their checking account. The colleague who, quite embarrassedly actually, reported this did not notice the $100-ish payments that did not belong to them going out of their account for 5 years. I think this is not completely unlikely to happen to others as well.
Now does this mean that I am losing money from my bank misreporting my balance or payments going out that shouldn’t? Probably not, but at the same time I feel I might just not notice, which is bad.
Motivation B: How many accounts do you have and what is the sum of all balances?
This sounds like a fantastic problem to have but realistically I simply don’t
know. We have 5ish credit cards in the US, we have 3 credit cards in Germany,
we have 2 checking accounts in the US, 2 checking accounts in Germany, we have
401ks, we have all kinds of other weird financial products and very little cash.
And really the only reason that I know we have all these random accounts is
because I just put all of this into ledger. This is part of why this is
important and cool work. You actually have one place that is aware of all
accounts and it knows how to deal with different commodities and convert between
them so you can get a digestible picture.
Mint partially solves this. If you only have accounts in the US and all of them
work with Mint, then this could be a decent tool. I never much cared for many
parts of the interface but if you are comfortable with it and Motivation A
doesn’t matter to you then this is probably not the worst solution.
Getting from Mint to Ledger.
I made one simplifying assumption first: I won’t care about the past. While this is not at all true, it just didn’t seem worth it to import a year or more of past transactions and categorize them from scratch. Granted, it would give insight into past behavior, but the last two or so months are probably sufficient for that.
Then I needed to solve the most fundamental issue: how do I stop myself from starting to enter things in Ledger and then stopping 2 weeks later when it becomes a chore? My solution to this is automation. I do not enjoy copy and pasting stuff from my bank’s website. Neither do I want to log into 5 websites every three days just so that my finances are up to date. What simplifies this is that we really only use one credit card frequently, one checking account and one bank in Germany. It seemed reasonable to automate all transaction downloading for these three financial institutions and to maintain that code. With a little luck one might even be able to find a second person to help maintain it if open sourced.
What I’ve ended up with now is a solution consisting of:
Ledger file in Dropbox.
Automated downloading of transactions from Chase, our Credit Union, and our German bank.
Automated report generation from the ledger file.
The next post will talk about the implementation of the automated downloader for Chase as this had about the most caveats. The code will go on GitHub when that post is ready.