Reverse-engineering exact location information through trilateration

25 Dec 2016
Many datasets on the internet accidentally expose location information by allowing distance-based search (“find everything within 3 miles around this point”) from arbitrary points of origin. I happened upon such a dataset and wanted to try to rediscover the original location through trilateration.
Trilateration, which is commonly wrongly referred to as triangulation, is the process of finding a location L given three other locations A, B and C and their distances to L (let’s call those dA, dB and dC). Very surprisingly to me, I was not able to find a lot of good, sure-fire solutions to this when I wanted to solve a trilateration problem, so here’s a summary of what I did. Code is here.
A linear algebra approach.
Given the aforementioned setup with absolutely accurate values and under the assumption that the earth is flat, L can be found as the intersection of the three circles around A, B and C with their respective radii. Unfortunately, neither is the earth flat nor do we commonly have perfect values for the radii (and potentially the coordinates of A, B and C). With reasonably good data, however, we can compute L as the intersection of the spheres around A, B and C with their respective radii. Let’s look at that; I hacked together this 3d demo (you can zoom and pan it by dragging and mouse wheel scrolling):
In this example, the three known points are Moscow, Madrid and London and the radii are the distances from Munich, Germany, calculated using the Haversine formula. If you zoom into the intersection of the spheres, we actually do hit Munich pretty well and the spheres don’t seem to overlap, so that’s a start.
Let’s assume our input is 3 sets of latitude/longitude and distance in kilometers. We need to map this to a coordinate system where we can actually determine the intersection somewhat easily. A good candidate is ECEF. Ideally, a library should be used as this gets iffy very quickly. The forward mapping from geo coordinates to ECEF is still reasonably straightforward; the back-mapping is an iterative algorithm and not trivial. A great library for this with bindings to many languages is PROJ.4.
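For reference, the forward mapping is compact enough to sketch by hand on the WGS84 ellipsoid, at least for points at height 0. This is a simplified stand-in of my own, not the PROJ.4 code path just mentioned, which is what you should use in practice:

```python
import math

# WGS84 ellipsoid constants
A = 6378137.0                  # semi-major axis, meters
F = 1 / 298.257223563          # flattening
E2 = F * (2 - F)               # first eccentricity squared

def geo_to_ecef(lat_deg, lon_deg):
    """Map latitude/longitude in degrees (height 0) to ECEF x/y/z in meters."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    # prime vertical radius of curvature at this latitude
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
    x = n * math.cos(lat) * math.cos(lon)
    y = n * math.cos(lat) * math.sin(lon)
    z = n * (1 - E2) * math.sin(lat)
    return x, y, z
```

The reverse mapping has no closed form, which is exactly why a library is the better choice.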
Once we’ve mapped the geo coordinates to xyz coordinates on the earth’s ellipsoid’s surface, we can now proceed to determining the intersection of the spheres around our xyz points and map the point of intersection back to geo coordinates using PROJ.4. Determining the intersection can be done as described in this stackexchange post. The post suggests manually converting coordinates back and forth between ECEF and geo coordinates; I’ll show in the next section why that may not be such a good idea.
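To make the geometry concrete, here is a self-contained sketch of the sphere-intersection step in Cartesian coordinates, following the standard trilateration derivation (my own illustration, not the stackexchange post’s exact code):

```python
import math

def trilaterate(p1, p2, p3, r1, r2, r3):
    """Intersect three spheres with centers p1..p3 (xyz tuples) and radii
    r1..r3. Returns the two candidate intersection points."""
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    add = lambda a, b: tuple(x + y for x, y in zip(a, b))
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    mul = lambda a, s: tuple(x * s for x in a)
    norm = lambda a: math.sqrt(dot(a, a))

    # build an orthonormal frame with p1 at the origin and p2 on the x axis
    ex = mul(sub(p2, p1), 1 / norm(sub(p2, p1)))
    i = dot(ex, sub(p3, p1))
    ey_raw = sub(sub(p3, p1), mul(ex, i))
    ey = mul(ey_raw, 1 / norm(ey_raw))
    ez = (ex[1] * ey[2] - ex[2] * ey[1],          # ex cross ey
          ex[2] * ey[0] - ex[0] * ey[2],
          ex[0] * ey[1] - ex[1] * ey[0])
    d = norm(sub(p2, p1))
    j = dot(ey, sub(p3, p1))

    # solve for the intersection in that frame
    x = (r1 ** 2 - r2 ** 2 + d ** 2) / (2 * d)
    y = (r1 ** 2 - r3 ** 2 + i ** 2 + j ** 2) / (2 * j) - i * x / j
    z2 = r1 ** 2 - x ** 2 - y ** 2
    if z2 < 0:
        raise ValueError("spheres do not intersect")
    z = math.sqrt(z2)
    base = add(add(p1, mul(ex, x)), mul(ey, y))
    return add(base, mul(ez, z)), add(base, mul(ez, -z))
```

With earth-surface data, one of the two candidates lies near the ellipsoid surface; that one is L.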
The reason I am investigating this is that I happened upon a website which allows me to search for things, sorted by distance from my location. It does not reveal the location of the search results. I figured I might be able to search for them from 3 different locations and do the trilateration, which, when tried visually by drawing circles on Google Maps, looked very promising.
I first tried the approach posted here. This does in fact yield points which are in the right vicinity but not quite where the location actually is. I initially only looked at the distance between the trilateration results and the actual location. Plotting the results on a map is a lot more helpful in understanding what the problem is:
The actual location is Haimhausen. We observe two problems: First, the points used for A, B and C are all south-east of Haimhausen, and it looks like we fall a little short of the target on the east-west axis. Second, and much worse, the points are quite spread out on the north-south axis. This led me to suspect that the conversion from geo coordinates to ECEF listed on Stack Overflow is not accurate enough for what I wanted. The post mentions that “if using an ellipsoid this step is slightly different”, which is quite the understatement. In fact, the code for doing the proper ECEF conversion is fantastically more complicated. A good reference for validating one’s implementation (and for looking at the code used in their implementation) can be found here.
I decided to try using python’s PROJ.4 bindings to make this projection easier and see if it improves the result at all. Lo and behold, using an ellipsoid:
This is vastly better than before. The north-south as well as the west-east axis seem to be about equally skewed now. I was, however, still convinced that this could be improved. My suspicion was that the distance metric used for calculating the geo-coordinate distance pairs on the server differed from Haversine or Vincenty. I played around with the numbers a little bit and eventually tried multiplying all distances with a small factor. This yields a much better result and makes the trilaterated points cluster a lot more densely. I tried this with many known points from the dataset and this constant helps all trilaterations about equally. It looks like internally, the source of the data set maps all geo coordinates onto a flat x-y plane and calculates distances on that plane. Result with the correction factor:
The reason I’m showing more than one result point is because I actually have 10 source points with distances and this is the trilateration algorithm run on each combination of three out of those 10 source points.
Trilateration as an optimization problem.
I mentioned this to a friend who suggested thinking about it as an optimization problem. I pondered this for a bit and googled it and that quickly got me excited: We can likely express this in a fairly clean way and have all the hard work be done by a nonlinear solver. This reduces code, does away with coordinate system conversions and - even better - allows taking all source points into account at the same time.
I hacked this up in python. We try to minimize the squared differences between the computed distances from a target point L to all input points A, B, etc. and the reported distances. Let’s assume L is an array consisting of latitude and longitude, and the input points are an array points consisting of one array per point containing latitude, longitude and distance, in that order. The python now comes out to this:
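A minimal version of such an objective could look like this (a sketch, not the original code: I substitute a haversine great-circle distance for geopy’s vincenty to stay self-contained):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (a stand-in for the Vincenty distance)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distfunc(L, points):
    """Sum of squared errors between computed and reported distances."""
    return sum((haversine_km(L[0], L[1], p[0], p[1]) - p[2]) ** 2
               for p in points)
```

At the true location the computed and reported distances agree and the objective goes to zero; everywhere else it grows quickly.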
We can use scipy.optimize to minimize distfunc using the fmin_powell method. This comes down to:
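A self-contained sketch of that call might look as follows (assumptions: a haversine stand-in for Vincenty, and simulated input distances measured to Munich, which plays the unknown location here):

```python
import math
from scipy.optimize import fmin_powell

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (a stand-in for the Vincenty distance)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# three known points; the "reported" distances are simulated by measuring
# to Munich (48.1374, 11.5755), the location we pretend not to know
munich = (48.1374, 11.5755)
anchors = [(55.7558, 37.6173),   # Moscow
           (40.4168, -3.7038),   # Madrid
           (51.5074, -0.1278)]   # London
points = [(la, lo, haversine_km(munich[0], munich[1], la, lo))
          for la, lo in anchors]

def distfunc(L, points):
    return sum((haversine_km(L[0], L[1], la, lo) - d) ** 2
               for la, lo, d in points)

# initial guess deliberately placed reasonably close to the answer
result = fmin_powell(distfunc, [45.0, 10.0], args=(points,), disp=False)
```

With exact simulated distances, result lands essentially on top of Munich.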
One thing we have to do, however, is to either pick the initial guess ([0,0] in the example call) close to the answer, or to make sure that distfunc returns something large for values of L for which it is not defined. Otherwise, the result can end up outside the range of valid geo coordinates, simply because the vincenty function produces distances even for illegal inputs, except that they are not well-defined.
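One way to implement that constraint is to wrap the objective with a penalty for out-of-range inputs (a sketch; the cutoff value is arbitrary):

```python
def constrained(distfunc):
    """Wrap an objective so that out-of-range latitude/longitude inputs
    return a huge value, keeping the optimizer inside valid coordinates."""
    def wrapped(L):
        if not (-90.0 <= L[0] <= 90.0 and -180.0 <= L[1] <= 180.0):
            return 1e12  # arbitrary large penalty
        return distfunc(L)
    return wrapped
```

Derivative-free methods like Powell tolerate this kind of hard cliff reasonably well, which is part of why the approach works at all.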
In addition to constraining illegal values for L, for my input data set I had to take into account the fact that all distances seemed to be off by a constant factor. With this approach, that is easily done by adding a third variable, the magic factor, to L and having it optimized as well. This changes the distance computation to scale every computed distance by that factor, and requires the initial guess to be extended by one more zero, to [0,0,0]. I tried the algorithm with a dataset which had 30 points with distances and a known result for L. The following list shows the number of points used together with the median, average and 90th-percentile error from the actual point, in kilometers:
 3  Median: 15.916440  Average: 13.894293  90th-percentile: 18.584305
 4  Median: 14.115303  Average:  9.638456  90th-percentile: 17.695530
 5  Median: 14.087114  Average: 10.022226  90th-percentile: 17.623656
 6  Median: 14.207656  Average: 12.087994  90th-percentile: 20.107994
 7  Median: 14.012734  Average: 12.129031  90th-percentile: 20.082207
 8  Median:  0.000563  Average:  0.000600  90th-percentile:  0.000651
 9  Median:  0.000549  Average:  0.000555  90th-percentile:  0.000639
10  Median:  0.000534  Average:  0.004449  90th-percentile:  0.000802
With 10 points, the median error is about 50 centimeters (less than 2 feet). Seeing as most geo coordinates in databases are probably geocoded, this is more than enough for reconstructing the original address of a point. Visually, the red pin is the original, the blue one is the trilaterated point:
I know, the red one is basically invisible. Let’s zoom in:
A remaining question is why this only starts delivering good results at 8 points and not less. I am guessing this is because of the introduction of the additional distance factor which is only correctly determined with 8 or more points. Since the API I am reverse trilaterating allows me to essentially get an unlimited number of points, this is not a big issue.
Protecting yourself from trilateration.
For me, it’s great that I can recover the location information from the provided data; it gives me an edge over other consumers of the data. In general though, you probably don’t want to make this mistake, and should protect the location information your customers enter into your website. This is actually subtle. At first sight, adding a random number to all reported distances seems like an easy fix. Let’s do that and run the optimization algorithm again. This is with up to 250m randomly added or subtracted for every distance:
 7  Median: 0.177487  Average: 5.900967  90th-percentile: 17.687421
 8  Median: 0.161858  Average: 0.171432  90th-percentile:  0.213890
Well, this is already accurate to roughly 200m at 8 points. We can improve by using way more points:
31 Median: 0.121494 Average: 0.121494 90th-percentile: 0.121494
Even though we added ±250m to all distances, we found the original coordinates with an error of only 120 meters. This is what that looks like:
That’s not that far off and we can do even better by adding more points (even though this gets better quite slowly).
From my point of view, what one has to do is NOT use the user-supplied geo location in the distance formula to begin with. For instance, one could select a representative point P within 500 meters and calculate all distances from there. This way, the user-supplied location is not even part of the distance calculation, and at most the selected representative point can be trilaterated back from the data. However, if the same point P is chosen for every location within 500 meters of it, then a reasonable degree of anonymity can be achieved in urban areas.
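As an illustration of that idea (a hypothetical sketch; the grid scheme and cell size are my own choices, not a prescription), snapping every location to the center of a fixed grid cell gives all nearby users the same representative point:

```python
import math

def representative_point(lat, lon, cell_km=0.5):
    """Snap a location to the center of a fixed ~cell_km grid cell, so that
    every user inside the same cell shares one representative point P."""
    deg_lat = cell_km / 111.32                 # ~km per degree of latitude
    snap = lambda v, step: (math.floor(v / step) + 0.5) * step
    slat = snap(lat, deg_lat)
    # longitude degrees shrink with latitude; use the snapped latitude so
    # all points in a cell compute the identical step
    deg_lon = cell_km / (111.32 * math.cos(math.radians(slat)))
    return slat, snap(lon, deg_lon)
```

Note that two locations on opposite sides of a cell boundary still get different representative points, so this trades exactness for simplicity; it only needs to be good enough that P, not the user’s true location, is what any trilateration recovers.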
The code for this is here.