#### A Fair Distance Function
#### by Jan Behrens and Björn Swierczek, Berlin, May 11, 2017


When developing democratic proposition development and decision making software, the processing of user-generated geospatial data poses certain questions when it comes to the ordering of search results by location-based relevance. One aspect of location-based relevance is the *distance* of a geographical object from/to a given point (e.g. a search center). While we usually mean the euclidean or spheroidal surface distance [Note: The spheroidal surface distance is the shortest possible distance on the reference spheroid used to model Earth.] when we speak of “distances”, other mathematical distance functions are thinkable and useful, as will be shown in this article. Beside proposing a particular distance function, we will also consider practical aspects of database indexing when using such a distance function.


== Mathematic preliminaries ==

We will use the term “distance function” for a mathematical function ‘f’ that maps two geographical objects ‘G’ and ‘S’ (each being a subset of Earth's surface ‘E’) to a non-negative real number ‘f(G, S) in R0+’. Within this article, however, we will only consider those functions where the second argument ‘S’ is a singular point (which we will refer to as the “search center”), while the first argument ‘G’ may be any other geographical shape (e.g. a point, path, or polygon). Therefore, the functions considered within this article are not “metrics” because a metric maps two objects of the same set to a non-negative real number. Furthermore, we will allow a distance function to return zero even if the two objects are not equal; e.g. the distance between a path and a point on that path may be zero even if the point and the path are two distinct (non-equal) objects.

A distance of zero will denote the highest possible relevance for an object ‘G’ in regard to a certain search center ‘S’; higher numbers will denote less relevant matches.

Note that even when we write ‘G subset of or equal to E’, we only consider those ‘G’ which are finite unions of basic geographical objects, and not, for example, fractals or non-measurable sets.


== Nearest-neighbor searches and distance functions ==

Distance functions are a necessity for database queries in the form of “Show me the object(s) which are closest to my location”. While the distance function is the only mandatory prerequisite for such a nearest-neighbor search (and totally sufficient for a linear search), a fast indexing system requires further support functions to speed up nearest-neighbor searches. A working set of such support functions can be found at the GiST framework that is used by PostgreSQL. [GiST] At first sight, we will focus on the distance function though, keeping in mind that a final solution will require further considerations in order to implement fast indexing techniques which are necessary for scalable applications.


== The trivial approach ==

The easiest function to be used for nearest-neighbor searches is to determine the shortest possible spheroidal surface distance ‘d’ (or euclidean distance in case of flat maps) between a search center ‘S’ and the geographical object ‘G’.

Figure 1 “The shortest way (euclidean distance) from a geographical object ‘G’ to a search center ‘S’”:

    A geographical object ‘G’ of a random shape next to a point ‘S’.
    The shortest distance between ‘S’ and ‘G’ is ‘d’.

End of Figure 1.


== Challenge I: Object size ==

While, at a first glance, the trivial approach of simply determining the shortest possible distance might seem to be straight forward and fair, a closer look reveals a serious problem: objects that cover large areas have a higher chance to return a shorter distance (or even zero if they cover the search center). If these objects are part of user-generated content (e.g. created by an initiator of a LiquidFeedback initiative), then users might be tempted to create intentionally oversized geometric objects in order to optimize search results for their content. Honest users aiming to select exactly those regions or locations that are really affected by their initiatives would be at disadvantage to users entering oversized and thus incorrect geometries. Even if it is still possible that some other users penalize such attempts by giving respective bad ratings to proposals with wrong or slightly wrong geospatial metadata, it would still be possible that the overall data quality is compromised due to tactical considerations of certain users. As shown in the section “Requirements for a fair distance function” below, we can compensate the advantage of big objects in a proper way.

Figure 2 “Big objects having advantages in comparison to small objects”:

    Two geographical objects ‘G_small’ and ‘G_big’, of which the first is
    smaller than the second one and contained in the second one.

    The shortest distance from three points ‘S_1’, ‘S_2’, and ‘S_3’ outside
    those two objects to ‘G_small’ is always greater than the shortest distance
    of these three points to ‘G_big’.

End of Figure 2.


== Challenge II: Relevance ==

Following the demand of a democratically driven proposition development process that is moderated by the users (i.e. collectively moderated), location-based relevance will not only depend on geographical properties but also on the ratings of the users (i.e. voters). Depending on the kind of search, we would also need to take the voter or supporter situation (e.g. “likes”) into account when sorting data by (location-based) relevance. One method to include user ratings for the location-based relevance of objects would be to simply divide the geographical distance by the number of votes such that an object that is 90 times as far away but has 100 times more votes than another object will appear first in the list of entries based on location-based relevance. While this may affect the efficiency of index lookups inside a database (and thus require further consideration, see section “Weighted nearest-neighbor searches” below), it is otherwise easy to convert any distance function to a function which will take the weight of an object into account (e.g. by including a simple division as explained before).

It should be noted that in many cases, the number of votes isn't suitable to be directly taken into consideration as weight. This is because clone-proofness is an important property of proposition development and decision making systems. [PLF, section 4.11] In case of competing proposals with geographical metadata, the “Harmonic Weighting” algorithm (compare [PLF, subsection 4.10.1]) might be a more suitable approach.

In the context of multi-winner elections, the counting scheme used by “Harmonic Weighting” has also been described by Thiele in 1895. [Janson] [Skowron] For the sake of weighting geographical objects, however, we require more than a selection of candidates (like in multi-winner voting systems) and even more than a ranking of all candidates (which is a natural by-product of sequentially working multi-winner voting systems): we require the system to return a weight (i.e. a real number) instead of just a rank for each candidate. See [PLF, Appendix B] for an example (column “Harmonic Weight” in table on [PLF, p.177]).


== Requirements for a fair distance function ==

Our aim is that increasing the size of a geographical object doesn't give a general advantage to appear earlier in nearest-neighbor searches. It should be noted that the “size” in this context doesn't necessarily mean “area” or “length”. For example, neither a huge polygon nor a set of thousand points or a long line (note that the latter two have an area of zero) should gain an advantage over a single point. When we speak of “general advantage”, we must define this in mathematical terms since changing the size or location of an object can always optimize the distance in regard to a *particular* search center. With “general advantage” we rather mean that the distance ‘f(G, S)’ (with ‘G’ being a geographical object and ‘S’ being a search center point) can't be optimized in regard to *all* possible search centers, or by demanding that the average result of the distance function ‘f’ (or the squared distance function, see Figure 3) over all possible search points is constant.

Figure 3 “Constancy of average of square of distance”:

    For all geographical objects G in
    { x subset of or equal to E | x not empty }:
    Surface integral (S in E) [ (f(G, S))^2 ] dA = const,
    where E is the set of all points on Earth's surface
    and dA is the infinitesimal surface area.

End of Figure 3.

The requirement in Figure 3, however, is not sufficient for a fair distance function because ‘f(G, S)’ is not bounded: raising ‘f(G, S)’ to a very high value for some search centers would allow lowering ‘f(G, S)’ for many other search centers.

A stricter requirement would be that the area for which the distance function yields values equal to or less than a limit ‘L’ is independent of ‘G’ and must only depend on ‘L’ (e.g. the area must be equal to ‘pi * L^(2)’ as demanded by the formula given in Figure 4).

Figure 4 “Strict requirement for the area where the distance function yields values smaller than a certain limit”:

    Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2

End of Figure 4.

This stricter requirement is automatically fulfilled for all ‘L in R0+’ if we only consider singular points as geographical objects on a flat map and if ‘f’ is the euclidean distance function between the two points ‘G’ and ‘S’, because the area around a point ‘G’ where the distance to that point is less than a limit ‘L’ is exactly ‘pi * L^(2)’. [Note: The surface area of a circle with radius ‘r’ is ‘pi * r^(2)’.] If we are able to fulfill this requirement also for other geographical objects ‘G’ which are not singular points (e.g. polygons of any size or multiple points), no general advantage could be gained by adding locations or areas to ‘G’, because the size of the area, where a search center ‘S’ could be located to return an ‘f(G, S)’ lower than a certain value, would be independent of ‘G’.

Unfortunately the requirement in Figure 4 is generally infeasible for geographical objects that have an area greater than zero because it would conflict with treating all points inside the object's area equally. We can still demand the requirement depicted in Figure 4 to be true for all ‘L in R0+’ when the geographical object ‘G’ has an area of zero (i.e. only consist of points and paths). For all other objects, we propose to violate this criterion. Notwithstanding, our proposal will still fulfill the requirement in the limiting case where small geographical objects (in terms of covered area) in relation to their distance from the search point are being considered.

Figure 5 “Fulfillment of formula from Figure 4 for all ‘L’ being large enough”:

    Let |G| be the size of the area covered by G (e.g.
    |G| = 0 if G only consists of singular points),
    then there exists a factor c_0 in R+ such that for all
    geographical objects G subset of or equal to E and all
    all L >= c_0 * |G|:

    Surface integral (S in {x in E | f(G, x) <= L}) [ 1 ] dA = pi * L^2.

End of Figure 5.


== Proposal for a fair distance function ==

Considering what has been discussed above, we propose the algorithm described in Figure 6 to serve as a “fair distance function” which is parameterized with a search center point ‘S’ and a geographical object ‘G’ on the spheroid. An example of its application is shown in Figure 7.

Figure 6 “Proposed algorithm”:

    Let ‘c_1’ and ‘c_2’ be two constants, where
    1/2 < c_1 <= 1  and  c_2 = (c_1 ^ 2) / (2 * c_1 - 1).

    1. Calculate the surface area ‘|G|’ of the geographical object ‘G’
       (zero if ‘G’ only consists of points and paths but not, for
       example, filled polygons).

    2. Calculate the shortest spheroidal surface distance ‘d’ between
       the geographical object ‘G’ and the search center point ‘S’
       (zero if ‘S’ is located inside or touching ‘G’).

    3. Calculate the surface area ‘|G_extra|’ covered by all points on
       the spheroid which are not in ‘G’ but whose shortest spheroidal
       surface distance to ‘G’ is less than ‘d’.

    4. Let  R  =  min ( c_1 * |G| + c_2 * |G_extra|  ,  A + E ).

    5. The result of the fair distance function is:
       f(G, S) = sqrt ( R / pi ).

End of Figure 6.

Figure 7 “Example calculation where ‘G’ is a union of 4 basic geographical objects (line, triangle, two points), and where ‘S not in G’”:

    4 black geographical objects marked as ‘G’ are surrounded by a gray margin
    that follows the shape of the four objects and is just thick enough to
    touch a point ‘S’ that is outside of ‘G’.

    f(G, S) = sqrt ( R / pi ).
    R  =  min ( c_1 * |G| + c_2 * |G_extra|  ,  A + E ).

End of Figure 7.

The choice of ‘c_1’ directly corresponds to a penalty for search center points ‘S’ that are located inside or touch the geographical object ‘G’: the result of the fair distance function equals to ‘sqrt(c_1 * |G| / pi)’ in this case (which is proportional to the radius of a circle with the same surface area than the geographical object ‘G’). In order to fulfill the demand stated in Figure 3, the second constant ‘c_2’ is chosen dependent on ‘c_1’ such that the penalties for search center points ‘S’ inside the geographical object ‘G’ are compensated on the outside (see Figure 9). An optimal value for ‘c_1’ would be ‘1/2’ because in case of a geographical object which consists of a huge number of singular points randomly scattered over a certain surface area, the statistical average for ‘|G_extra|’ is half of that area.

Figure 8 “For an ‘S’ randomly placed inside the outer shape, the grey area ‘G_extra’ will cover half of the area within the outer shape in the average case”:

    A shape is filled with black points in a regular pattern. An additional
    point ‘S’ is also within that shape. Each black point is surrounded by a
    gray margin that is just thick enough to touch that point ‘S’.

End of Figure 8.

Therefore, no (statistical) advantage could be gained by replacing filled areas of a geographical object with a huge number of singular points covering that surface area.

Setting ‘c_1 = 1/2’, however, would cause a discontinuity because ‘lim [c_1 -> 1/2] c_2 = infinity’. A reasonable compromise seems to be ‘(c_1, c_2) = (2/3, 4/3)’. This way, search center points that are located inside the geographical object are slightly over-penalized, but the discontinuity is solved smoothly.

Figure 9 “Picking ‘c_2’ such that the formula in Figure 3 is fulfilled”:

    A graph with ‘S ordered by f(G, S)’ on the x-axis and ‘R proportional to
    f(G, S)^2’ on the y-axis. A dashed line marks a linear relation. A solid
    line is constant and equal to ‘c_1 * |G|’ for ‘S in G’. This part of the
    solid line is labeled with ‘treat all S in G equal’. Right of it, the
    solid line has a slope of ‘c_2’ until ‘c_0 * |G|’ on the x-axis.
    Afterwards the solid line overlaps with the dashed line.

    The areas between the solid line and the dashed line are labeled ‘I_1’,
    ‘I_2’, and ‘I_3’. ‘I_1’ is above the dashed line, while ‘I_2’ and ‘I_3’
    are below the dashed line.

    I_1 - I_2: excess penalty for S in G.
    I_3:       compensation for extra penalty (‘c_2’ is chosen such that
               ‘I_3 = I_1 - I_2’).

    c_0 := (c_2 - c_1) / (c_2 - 1).

    Surface integral (S in E) [ (f(G, S))^2 ] dA = const.

    The integral is independent of ‘|G|’ if ‘c_0 * |G|’ is not exceeding
    Earth's surface, because ‘I_1 - I_2 - I_3 = 0’ and the integral will then
    cover ‘I_1’ through ‘I_3’.

End of Figure 9.

In the limiting case where small geographical objects in relation to their distance from the search point ‘S’ are being considered, we know that ‘|G_extra| >> |G|’ from which follows that ‘R = |G| + |G_extra|’. Then ‘R’ is equal to the area of all points with spheroidal surface distance to the geographical object ‘G’ equal to or smaller than ‘d’ (see Figure 7). The area for which the distance function yields values below a limit ‘L’ is therefore equal to ‘pi * L^(2)’ because from ‘f(G, S) <= sqrt(R/pi) = L’ follows that ‘R <= pi * L^(2)’. Therefore, the demand stated in Figure 4 is fulfilled in this limiting case (see Figure 5 for the precise condition; factor ‘c_0’ is given in Figure 9).


== Implementation as part of the PostgreSQL extension “pgLatLon” ==

PostgreSQL is an open-source database management system. [PostgreSQL] The fair distance function has been implemented as part of a PostgreSQL extension named “pgLatLon”, [pgLatLon] which was originally contributed as part of LiquidFeedback. The implementation uses numerical integration (similar to the Monte Carlo method) in order to determine the area ‘|G|’ and extended area ‘|G_extra|’ on the Earth spheroid. While the area of a polygon can also be calculated more easily and more accurately, the calculation of the extended area ‘|G_extra|’ seems to be rather difficult if numerical calculation was to be avoided.

The sample points for numerical intergration on the spheroid (as calculated by the “pgl_sample_points” function) are generated by using a spiral with sample points occurring at the golden angle, similar to a pattern found in many plants, e.g. sunflowers.

Figure 10 “Using the golden angle to create sample points for numerical integration”:

    t is element of { 0.5, 1.5, 2.5, 3.5, …, n-0.5 }.
    x = sqrt(t) * cos(alpha * t).
    y = sqrt(t) * sin(alpha * t).

    alpha = (3 - sqrt(5)) * 180 degrees = approx. 137.51 degrees.

    A spiral starts in the center and turns counter-clockwise. Points are
    placed on the spiral, and the angle between two successive points in
    regard to the beginning of the spiral in the center is alpha.

End of Figure 10.

Figure 11 “C implementation in ‘pgLatLon’”:

    /* half of (spherical) earth's surface area */
    #define PGL_HALF_SURFACE (PGL_RADIUS * PGL_DIAMETER * M_PI)

    /* golden angle in radians */
    #define PGL_GOLDEN_ANGLE (M_PI * (sqrt(5) - 1.0))

    /* create a list of sample points covering a bounding circle 
       and return covered area */
    static double pgl_sample_points(
      pgl_point *center,  /* center of bounding circle */
      double radius,      /* radius of bounding circle */
      int samples,        /* number of sample points (MUST be positive!) */
      pgl_point *result   /* pointer to result array */
    ) {
      double double_share = 2.0;  /* double of covered share of earth's surface */
      double double_share_div_samples;  /* double_share divided by sample count */
      int i;
      double t;  /* parameter of spiral laid on (spherical) earth's surface */
      double x, y, z;  /* normalized coordinates of point on non-rotated spiral */
      double sin_phi;  /* sine of sph. coordinate of point of non-rotated spiral */
      double lambda;   /* other sph. coordinate of point of non-rotated spiral */
      double rot = (0.5 - center->lat / 180.0) * M_PI;  /* needed rot. (in rad) */
      double cos_rot = cos(rot);  /* cosine of rotation by latitude */
      double sin_rot = sin(rot);  /* sine of rotation by latitude */
      double x_rot, z_rot;  /* normalized coordinates of point on rotated spiral */
      double center_lon = center->lon;  /* second rotation in degree */
      /* add safety margin to bounding circle because of spherical approximation */
      radius *= PGL_SPHEROID_A / PGL_RADIUS;
      /* if whole earth is covered, use initialized value, otherwise calculate
         share of covered area (multiplied by 2) */
      if (radius < PGL_MAXDIST) double_share = 1.0 - cos(radius / PGL_RADIUS);
      /* divide double_share by sample count for later calculations */
      double_share_div_samples = double_share / samples;
      /* generate sample points */
      for (i=0; i<samples; i++) {
        /* use an offset of 1/2 to avoid too dense clustering at spiral center */
        t = 0.5 + i;
        /* calculate normalized coordinates of point on non-rotated spiral */
        z = 1.0 - double_share_div_samples * t;
        sin_phi = sqrt(1.0 - z*z);
        lambda = t * PGL_GOLDEN_ANGLE;
        x = sin_phi * cos(lambda);
        y = sin_phi * sin(lambda);
        /* rotate spiral by latitude value of bounding circle */
        x_rot = cos_rot * x + sin_rot * z;
        z_rot = cos_rot * z - sin_rot * x;
        /* set resulting sample point in result array */
        /* (while performing second rotation by bounding circle longitude) */
        result[i].lat = 180.0 * (atan(z_rot / fabs(x_rot)) / M_PI);
        result[i].lon = center_lon + 180.0 * (atan2(y, x_rot) / M_PI);
      }
      /* return covered area */
      return PGL_HALF_SURFACE * double_share;
    }

    /* fair distance between point and cluster (see README file for explanation) */
    /* NOTE: sample count passed as third argument MUST be positive */
    static double pgl_fair_distance(
      pgl_point *point, pgl_cluster *cluster, int samples
    ) {
      double distance;       /* shortest distance from point to cluster */
      pgl_point *points;     /* sample points for numerical integration */
      double area;           /* area covered by sample points */
      int i;
      int inner = 0;         /* number of sample points within cluster */
      int outer = 0;         /* number of sample points outside cluster but
                                within cluster enlarged by distance */
      double result;
      /* calculate shortest distance from point to cluster */
      distance = pgl_point_cluster_distance(point, cluster);
      /* if cluster consists of a single point or has no bounding circle with
          positive radius, simply return distance */
      if (
        (cluster->nentries==1 && cluster->entries[0].entrytype==PGL_ENTRY_POINT) ||
        !(cluster->bounding.radius > 0)
      ) return distance;
      /* if cluster consists of two points which are twice as far apart, return
         distance between point and cluster multiplied by square root of two */
      if (
        cluster->nentries == 2 &&
        cluster->entries[0].entrytype == PGL_ENTRY_POINT &&
        cluster->entries[1].entrytype == PGL_ENTRY_POINT &&
        pgl_distance(
          PGL_ENTRY_POINTS(cluster, 0)[0].lat,
          PGL_ENTRY_POINTS(cluster, 0)[0].lon,
          PGL_ENTRY_POINTS(cluster, 1)[0].lat,
          PGL_ENTRY_POINTS(cluster, 1)[0].lon
        ) >= 2.0 * distance
      ) {
        return distance * M_SQRT2;
      }
      /* otherwise create sample points for numerical integration and determine
         area covered by sample points */
      points = palloc(samples * sizeof(pgl_point));
      area = pgl_sample_points(
        &cluster->bounding.center,
        cluster->bounding.radius + distance,  /* pad bounding circle by distance */
        samples,
        points
      );
      /* perform numerical integration */
      if (distance > 0) {
        /* point (that was passed as argument) is outside cluster */
        for (i=0; i<samples; i++) {
          /* count sample points within cluster */
          if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
          /* count sample points outside of cluster but within cluster enlarged by
             distance between point (that was passed as argument) and cluster */
          else if (
            pgl_point_cluster_distance(points+i, cluster) < distance
          ) outer++;
        }
      } else {
        /* if point is within cluster, just count sample points within cluster */
        for (i=0; i<samples; i++) {
          if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
        }
      }
      /* release memory for sample points needed for numerical integration */
      pfree(points);
      /* if enlargement was less than doubling the area, then combine inner and
         outer sample point counts with different weighting */
      /* (ensures fairness in such a way that the integral of the squared result
         over all possible point parameters is independent of the cluster) */
      if (outer < inner) result = (2*inner + 4*outer) / 3.0;
      /* otherwise weigh inner and outer points the same */
      else result = inner + outer;
      /* convert area into distance (i.e. radius of a circle with the same area) */
      result = sqrt(area * (result / samples) / M_PI);
      /* return result only if it is greater than the distance between point and
         cluster to avoid unexpected results because of errors due to limited
         precision */
      if (result > distance) return result;
      /* otherwise return distance between point and cluster */
      else return distance;
    }


    /* half of (spherical) earth's surface area */
    #define PGL_HALF_SURFACE (PGL_RADIUS * PGL_DIAMETER * M_PI)
    
    /* golden angle in radians */
    #define PGL_GOLDEN_ANGLE (M_PI * (sqrt(5) - 1.0))
    
    /* create a list of sample points covering a bounding circle 
        and return covered area */
    static double pgl_sample_points(
      pgl_point *center,  /* center of bounding circle */
      double radius,      /* radius of bounding circle */
      int samples,        /* number of sample points (MUST be positive!) */
      pgl_point *result   /* pointer to result array */
    ) {
      double double_share = 2.0;  /* double of covered share of earth's surface */
      double double_share_div_samples;  /* double_share divided by sample count */
      int i;
      double t;  /* parameter of spiral laid on (spherical) earth's surface */
      double x, y, z;  /* normalized coordinates of point on non-rotated spiral */
      double sin_phi;  /* sine of sph. coordinate of point of non-rotated spiral */
      double lambda;   /* other sph. coordinate of point of non-rotated spiral */
      double rot = (0.5 - center->lat / 180.0) * M_PI;  /* needed rot. (in rad) */
      double cos_rot = cos(rot);  /* cosine of rotation by latitude */
      double sin_rot = sin(rot);  /* sine of rotation by latitude */
      double x_rot, z_rot;  /* normalized coordinates of point on rotated spiral */
      double center_lon = center->lon;  /* second rotation in degree */
      /* add safety margin to bounding circle because of spherical approximation */
      radius *= PGL_SPHEROID_A / PGL_RADIUS;
      /* if whole earth is covered, use initialized value, otherwise calculate
          share of covered area (multiplied by 2) */
      if (radius < PGL_MAXDIST) double_share = 1.0 - cos(radius / PGL_RADIUS);
      /* divide double_share by sample count for later calculations */
      double_share_div_samples = double_share / samples;
      /* generate sample points */
      for (i=0; i<samples; i++) {
        /* use an offset of 1/2 to avoid too dense clustering at spiral center */
        t = 0.5 + i;
        /* calculate normalized coordinates of point on non-rotated spiral */
        z = 1.0 - double_share_div_samples * t;
        sin_phi = sqrt(1.0 - z*z);
        lambda = t * PGL_GOLDEN_ANGLE;
        x = sin_phi * cos(lambda);
        y = sin_phi * sin(lambda);
        /* rotate spiral by latitude value of bounding circle */
        x_rot = cos_rot * x + sin_rot * z;
        z_rot = cos_rot * z - sin_rot * x;
        /* set resulting sample point in result array */
        /* (while performing second rotation by bounding circle longitude) */
        result[i].lat = 180.0 * (atan(z_rot / fabs(x_rot)) / M_PI);
        result[i].lon = center_lon + 180.0 * (atan2(y, x_rot) / M_PI);
      }
      /* return covered area */
      return PGL_HALF_SURFACE * double_share;
    }
    
    /* fair distance between point and cluster (see README file for explanation) */
    /* NOTE: sample count passed as third argument MUST be positive */
    static double pgl_fair_distance(
      pgl_point *point, pgl_cluster *cluster, int samples
    ) {
      double distance;       /* shortest distance from point to cluster */
      pgl_point *points;     /* sample points for numerical integration */
      double area;           /* area covered by sample points */
      int i;
      int inner = 0;         /* number of sample points within cluster */
      int outer = 0;         /* number of sample points outside cluster but
                                within cluster enlarged by distance */
      double result;
      /* calculate shortest distance from point to cluster */
      distance = pgl_point_cluster_distance(point, cluster);
      /* if cluster consists of a single point or has no bounding circle with
          positive radius, simply return distance */
      if (
        (cluster->nentries==1 && cluster->entries[0].entrytype==PGL_ENTRY_POINT) ||
        !(cluster->bounding.radius > 0)
      ) return distance;
      /* if cluster consists of two points which are twice as far apart, return
          distance between point and cluster multiplied by square root of two */
      if (
        cluster->nentries == 2 &&
        cluster->entries[0].entrytype == PGL_ENTRY_POINT &&
        cluster->entries[1].entrytype == PGL_ENTRY_POINT &&
        pgl_distance(
          PGL_ENTRY_POINTS(cluster, 0)[0].lat,
          PGL_ENTRY_POINTS(cluster, 0)[0].lon,
          PGL_ENTRY_POINTS(cluster, 1)[0].lat,
          PGL_ENTRY_POINTS(cluster, 1)[0].lon
        ) >= 2.0 * distance
      ) {
        return distance * M_SQRT2;
      }
      /* otherwise create sample points for numerical integration and determine
          area covered by sample points */
      points = palloc(samples * sizeof(pgl_point));
      area = pgl_sample_points(
        &cluster->bounding.center,
        cluster->bounding.radius + distance,  /* pad bounding circle by distance */
        samples,
        points
      );
      /* perform numerical integration */
      if (distance > 0) {
        /* point (that was passed as argument) is outside cluster */
        for (i=0; i<samples; i++) {
          /* count sample points within cluster */
          if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
          /* count sample points outside of cluster but within cluster enlarged by
              distance between point (that was passed as argument) and cluster */
          else if (
            pgl_point_cluster_distance(points+i, cluster) < distance
          ) outer++;
        }
      } else {
        /* if point is within cluster, just count sample points within cluster */
        for (i=0; i<samples; i++) {
          if (pgl_point_in_cluster(points+i, cluster, true)) inner++;
        }
      }
      /* release memory for sample points needed for numerical integration */
      pfree(points);
      /* if enlargement was less than doubling the area, then combine inner and
          outer sample point counts with different weighting */
      /* (ensures fairness in such a way that the integral of the squared result
          over all possible point parameters is independent of the cluster) */
      if (outer < inner) result = (2*inner + 4*outer) / 3.0;
      /* otherwise weigh inner and outer points the same */
      else result = inner + outer;
      /* convert area into distance (i.e. radius of a circle with the same area) */
      result = sqrt(area * (result / samples) / M_PI);
      /* return result only if it is greater than the distance between point and
          cluster to avoid unexpected results because of errors due to limited
          precision */
      if (result > distance) return result;
      /* otherwise return distance between point and cluster */
      else return distance;
    }

End of Figure 11.


== Completing the GiST interface with a distance estimator function ==

While the distance function is a mandatory prerequisite for a nearest-neighbor search, further support functions are needed to speed up nearest-neighbor searches when using database indices. PostgreSQL provides the GiST interface to enable fast nearest-neighbor searches for custom functions and operators. “pgLatLon” includes facilities to create indices on geographical objects, including support for nearest-neighbor search using (a) the spheroidal surface distance or (b) the “fair distance” function as defined above. However, the corresponding GiST distance estimator function implemented by pgLatLon uses the spheroidal surface distance function in both cases. It is still possible to do nearest-neighbor searches for the “fair distance” function because the distance estimator function of the GiST framework is always allowed to return smaller values, though the performance may be less optimal.


== Considering the voting weight ==

The second challenge mentioned above is to not only consider geographical properties but also the ratings of others users (i.e. voters) when performing a search. For a pair of a single search center point ‘S’ and a geographical object ‘G’, this can be easily achieved by dividing the value ‘R’ (as calculated in the algorithm explained above) by a number representing the strength of support by the voters. In the easiest case, this could be the total number of votes. However, in order to create a clone-proof process, vote counting mechanisms should be used where clone-proofness is ensured (e.g. restrict voters to vote only for one object or use a clone-proof counting scheme such as Harmonic Weighting).


== Weighted nearest-neighbor searches ==

As previously noted, the distance estimator function of the GiST framework is allowed to return distances shorter than the actual distances. While the penalty of the fair distance function only increases the returned distance (at least in case of a flat map or, by approximation, in case of short distances on the spheroid [Note: The difference between a flat map and a spheroid used to model Earth tends towards zero for small distances. “pgLatLon” version 0.10 ensures that the fair distance function never returns values smaller than the actual spheroidal surface distance (see last if-else-clause in Figure 11, for example), which introduces a tiny error (that tends towards zero for small distances) but ensures that the GiST functions behave consistently which is required for proper index operation.]), considering voting weight might decrease a returned distance. Therefore, it is no longer feasible to use the spheroidal surface distance as an estimation for distances that have been re-weighted according to the number of votes. Whenever an additional weight of an object is taken into account, the index should store such a weight in the index tree so that the corresponding GiST distance estimator can consider this weight by decreasing the returned estimation for the distance accordingly. In the same fashion, the estimated distance can be increased when a geographical object has an area ‘|G| > 0’. While the latter is not necessary (since it is always allowed to return distances shorter than the actual distance), it may speed up calculation. Both adjustments, however, will require to store additional data in the index tree.

Just storing this data in this index tree (and returning the worst case in case of a non-leaf node) doesn't allow for a fast index operation yet: the tree would also need to be organized in a way that considers the additional data. There are many possibilities to achieve this (R-trees, kd-trees, fractals, etc.), and the choice of index structure goes beyond the scope of this article. In either case, a performant implementation would not be trivial.

Nonetheless, it is possible to work around this issue in SQL by performing SELECTs with an (exponentially) increasing LIMIT clause within a custom function. This approach is less optimal than using a specialized index, but an example is given in pgLatLon's source code (version 0.10) in lines 27 thorugh 85 of the file “create_test_db.schema.sql”.


== Acknowledgements ==

The development of “pgLatLon”, including the fair distance function, was contributed to LiquidFeedback by FlexiGuided GmbH, Berlin and co-funded by the European Union's Horizon 2020 research and innovation programme under grant agreement n° 693514 (“WeGovNow”).

“pgLatLon” is available as an open-source software under the terms of the MIT-License and may be downloaded at the project page of the Public Software Group e.V.: http://www.public-software-group.org/pgLatLon


This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 693514.


References:
[Evolution] Jan Behrens: The Evolution of Proportional Representation in LiquidFeedback. In “The Liquid Democracy Journal on electronic participation, collective moderation, and voting systems”, Issue 1, March 20, 2014, pp. 32-41. ISSN 2198-9532. Published by Interaktive Demokratie e. V., available at http://www.liquid-democracy-journal.org/issue/1/The_Liquid_Democracy_Journal-Issue001-04-The_evolution_of_proportional_representation_in_LiquidFeedback.html
[GiST] PostgreSQL 9.6.2 Documentation, Chapter VII (Internals), Section 61 (GiST). Available at https://www.postgresql.org/docs/9.6/static/gist-intro.html
[Janson] Svante Janson: “Phragmén's and Thiele's election methods”, November 27, 2016, subsection 4.3 (‘Thiele's elimination method’) on page 15. Available at https://arxiv.org/abs/1611.08826v1
[pgLatLon] “pgLatLon”, a spatial database extension for PostgreSQL. Homepage at http://www.public-software-group.org/pgLatLon
[PLF] Behrens, Kistner, Nitsche, Swierczek: “The Principles of LiquidFeedback”. ISBN 978-3-00-044795-2. Published January 2014 by Interaktive Demokratie e. V., available at http://principles.liquidfeedback.org/
[PostgreSQL] The open source object-relational database system “PostgreSQL”. Homepage at https://www.postgresql.org/
[Skowron] Skowron, Lackner, Brill, Peters, Elkind: “Proportional Rankings”, December 5, 2015, ranking rule ‘Reverse SeqPAV’ on page 4. Published on arXiv.org at https://arxiv.org/abs/1612.01434v1


=============================================================================
This file is part of:

The Liquid Democracy Journal
on electronic participation, collective moderation, and voting systems
Issue 5, Berlin 2017-05-11 (electronic version 2018-06-21, rev3 2019-02-24)

The full issue is available at: http://www.liquid-democracy-journal.org/issue/5/

All rights reserved
Copyright © 2017, 2018 Interaktive Demokratie e. V., Berlin, Germany
http://www.interaktive-demokratie.org/

Published by: Interaktive Demokratie e. V., Berlin, Germany
Edited by: Jan Behrens, Axel Kistner, Andreas Nitsche, Björn Swierczek

Imprint and contact: http://www.liquid-democracy-journal.org/imprint
Subscription and archive: http://www.liquid-democracy-journal.org/

ISSN-L: 2198-9532
ISSN print version: 2198-9532
ISSN electronic version: 2199-1758

Filename: The_Liquid_Democracy_Journal-Issue005-03-A_Fair_Distance_Function.txt
Encoding: UTF-8