Country code checking using Cantor Pairing
Posted June 13, 2022 by quy-le ‐ 12 min read
Solutions are presented from the easiest to harder for implementation. You can jump directly to section 4 for Cantor Pairing solution.
1. Problem
When I first started working as a developer I used to work with Domain Name System (DNS) and got a task to detect if the top-level-domain of a DNS record is a country-code type or not.
For example:
1zing.vn => true (Vietnam)
2ddc.moph.go.th => true (Thailand)
3bbc.co.uk => true (UK)
4mozilla.org => false
5google.com => false
The logic here is to check if the last part after dot of the domain belongs to the list of ISO 3166-2 country codes.
Let’s assume we already did other validations and string manipulations to get the last 2 characters (lowercase ASCII, assuming again) from the domain.
How can we detect if the 2-character-input is a country code or not?
1// List of 247 country codes
2var countryCodes = "ad,ae,af,ag,ai,al,am,an,ao,aq,ar,as,at,au,aw,ax,az,ba,bb,bd,be,bf,bg,bh,bi,bj,bl,bm,bn,bo," +
3"br,bs,bt,bv,bw,by,bz,ca,cc,cd,cf,cg,ch,ci,ck,cl,cm,cn,co,cr,cu,cv,cx,cy,cz,de,dj,dk,dm,do,dz,ec,ee,eg,eh,er," +
4"es,et,fi,fj,fk,fm,fo,fr,ga,gb,gd,ge,gf,gg,gh,gi,gl,gm,gn,gp,gq,gr,gs,gt,gu,gw,gy,hk,hm,hn,hr,ht,hu,id,ie,il," +
5"im,in,io,iq,ir,is,it,je,jm,jo,jp,ke,kg,kh,ki,km,kn,kp,kr,kw,ky,kz,la,lb,lc,li,lk,lr,ls,lt,lu,lv,ly,ma,mc,md," +
6"me,mf,mg,mh,mk,ml,mm,mn,mo,mp,mq,mr,ms,mt,mu,mv,mw,mx,my,mz,na,nc,ne,nf,ng,ni,nl,no,np,nr,nu,nz,om,pa,pe,pf," +
7"pg,ph,pk,pl,pm,pn,pr,ps,pt,pw,py,qa,re,ro,rs,ru,rw,sa,sb,sc,sd,se,sg,sh,si,sj,sk,sl,sm,sn,so,sr,ss,st,sv,sy," +
8"sz,tc,td,tf,tg,th,tj,tk,tl,tm,tn,to,tr,tt,tv,tw,tz,ua,ug,um,us,uy,uz,va,vc,ve,vg,vi,vn,vu,wf,ws,ye,yt,za,zm,zw"
9
10// input is the 2 lowercase ASCII characters (e.g: aa, gg, nz, op, vn, zz)
11func IsCountryCode(input string) bool {
12 // TODO: Implement logic for this function
13}
This problem is really simple and I’m sure any decent developer can come up with a solution for it within minutes.
Let’s pause reading a bit and think how are you gonna implement this IsCountryCode
function?
time.Sleep(5*time.Minute)
Ok, welcome back.
I’m going to implement the logic code/test/benchmark in Go, but all approaches should be easy enough to implement on any other programming languages.
2. Array solution
The first solution is storing each country code in an array, then inside the IsCountryCode
function, we compare the input
string with each item in the array.
1var ccArr []string
2
3func init() {
4 // ccArr holds all 247 country codes.
5 // => []string{"ad", "ae", ..., "zw"}
6 ccArr = strings.Split(countryCodes, ",")
7}
8
9// IsCountryCodeByArray checks if the given input matches any item in the ccArr array.
10func IsCountryCodeByArray(input string) bool {
11 for _, cc := range ccArr {
12 if cc == input {
13 return true
14 }
15 }
16 return false
17}
This solution is simple and in fact really fast if the input string is in the couple of first country codes. For example: IsCountryCodeByArray("ad")
only takes 1 array access, IsCountryCodeByArray("ae")
takes 2 array access an so on.
Array access is O(1)
and really fast, especially if the array can fit into the CPU cache like in this case (more on this in another post).
But this solution is naive because it becomes worse quickly if the input is at the ending part of the ccArr
array, or in the worst case scenario, not existing in the list. In the worst case scenario, we have to traverse over the whole array in order to return false.
The benchmark (see test and benchmark code at the end of post) below shows the order of magnitude (150 times) slower in worst case versus happy case.
1# IsCountryCodeByArray("ad") => Hit, best case scenario
2# IsCountryCodeByArray("zz") => Miss, worst case scenario
3
4$ go test -run=^$ -bench=BenchmarkCheckCountryCode -cpu 1
5BenchmarkCheckCountryCode/___array_naive_hit 192150000 6.40 ns/op
6BenchmarkCheckCountryCode/___array_naive_miss 1280067 920 ns/op
Algorithm analysis:
- Time complexity:
O(1)
.- Best case: Need 1 array (memory) access => O(1).
- Worst case: Need 247 array access => O(247) => O(1).
- Space complexity:
O(1)
.- Array has 247 items, each requires 2 bytes (2 ASCII characters) so in total of 494 bytes.
Assuming we’re not counting space overhead from the string/array datastructe here (e.g.: each string in Go has at least 8 bytes overhead 1).
- Array has 247 items, each requires 2 bytes (2 ASCII characters) so in total of 494 bytes.
Idea
Because the list of country code is alphabetically sorted, you can try to improve this solution by failing-fast while checking the input against the items in
ccArr
.
For example:IsCountryCodeByArray("ah")
.
You only need to check until theai
item in theccArr
, if you reachedai
, we can be sure that theah
input string is not existed in theccArr
array.
Try to implement and analyze that solution, how much will it improve over the naive solution and on which case it will be as bad as the naive solution?Also, how about binary search on the sorted country code list?
3. Generic hashmap solution
A better solution is storing the list of country code in a generic (hash)map, then in the IsCountryCodeByMap
function, we only need to check if the input
is a key in the map or not.
1// ccMap holds 247 country codes as the key.
2// We don't need to use map value here, so defines it as struct{} to eliminate overhead.
3var ccMap = make(map[string]struct{}, 247)
4
5func init() {
6 // Put country codes into the ccMap
7 ccs := strings.Split(countryCodes, ",")
8 for _, cc := range ccs {
9 ccMap[cc] = struct{}{}
10 }
11}
12
13// IsCountryCodeByMap checks if the given input exists in the ccMap or not.
14func IsCountryCodeByMap(input string) bool {
15 _, ok := ccMap[input]
16 return ok
17}
Even though this solution is slower than the naive array solution above in (some) happy cases.
Benchmark result shows a much better result in the worst case scenario.
The result in both happy and worst case scenario is so consistent which is why 99.99% of the time this is the only solution you’ll need in production code.
Do you know that even Object in Javascript is just a map/dictionary? 2
1$ go test -run=^$ -bench=BenchmarkCheckCountryCode -cpu 1
2BenchmarkCheckCountryCode/____map_string_hit 147317772 8.438 ns/op
3BenchmarkCheckCountryCode/____map_string_miss 124363000 9.644 ns/op
Algorithm analysis:
- Time complexity:
O(1)
.- Same O(1) in both worst and best case scenarios as we need the same 1 hashmap check.
- Space compexity:
O(1)
.- Each key need 2 bytes and we have 247 keys so in total 494 bytes same as the naive array solution above.
Also assuming that we don’t count the overhead from string/hashmap datastructure itself.
- Each key need 2 bytes and we have 247 keys so in total 494 bytes same as the naive array solution above.
Idea
The type of the map’s key matter, try to use a map with integer key instead of string and check if there is any improvement?
If our map has more items, what is the advantages of integer map over string and what is the possible downsides?
4. Cantor pairing solution
As we said above, most of the time, the hashmap solution above is good enough in real life code.
So please take these Cantor solutions as the fun experiments when we try to explore outside of the “normal” border.
Let’s recap the strong points from two solutions above, array solution has very fast memory access, while map solution provides stable time complexity in both happy and worst cases.
How can we combine the advantage of both two solutions into one?
The idea is somehow we can map each country code into a unique number, then use that number as the index on the checking array. If we can do so, when checking the country code, we only need 1 array (memory) access for either happy or worst case, which is really fast as shown in the array solution above.
Experienced readers might realise that what we’re trying to do here is re-implementing a hashmap (hash the country code string to the integer key of the map).
The difference is we’re trying to design a perfect hash function (PHF) for our hashmap. PHF is generally easy to design when we have a fixed input set, and in this problem, we know before hand that the input will ranging from aa
to zz
([aa, zz]
) so it seems like a perfect place.
So how can we design the PHF to map country code to a unique number? A quick research will lead us to the pairing function:
“In mathematics, a pairing function is a process to uniquely encode two natural numbers into a single natural number.” - Wikipedia
Given $\pi$ as the Cantor pairing function, $\pi$ has two important characteristics:
$\pi(k1, k2) = n$, and $n$ is unique in the set of natural numbers $\N$.
=> We can be sure each country code will be mapped to a unique integer number, no collision to handle.$\pi(k1, k2) \neq \pi(k2, k1)$.
=> We can be sure
ad
andda
will be mapped to two different numbers.
That’s exactly what we’re looking for!
Cantor pairing function is defined as below:
$$\pi(k1, k2) \coloneqq \frac{(k1+k2)(k1+k2+1)}{2}+k2$$
It’s easy enough to understand so let’s implement it as our PHF which translates the 2 ASCII characters input to a unique number and use it as the array index.
Our input is ranging from aa
to zz
, which can be translated to the tuple of [0, 0]
to [25, 25]
. We have π("aa") => π(0, 0) = 0
and π("zz") => π(25, 25) = 1300
, so we will need an array with the size of 1301 to hold all the possible input.
1// => Size of the checking array.
2var cantorArr = make([]bool, 1301, 1301)
3
4func init() {
5 // Mark the index of country code in the cantorArr array
6 ccs := strings.Split(countryCodes, ",")
7 for _, cc := range ccs {
8 idx := cantorPair(cc)
9 cantorArr[idx] = true
10 }
11}
12
13// IsCountryCodeByCantorPairing uses Cantor pairing to calculate the index of the given string
14// in the ccArray array.
15func IsCountryCodeByCantorPairing(input string) bool {
16 idx := cantorPair(input)
17 return cantorArr[idx]
18}
19
20// cantorPair returns a unique number for the given input using Cantor pairing function.
21func cantorPair(input string) uint16 {
22 // Reduce unnecessary gap from ASCII [0:97]
23 // e.g.: 'a' (97) => (0), 'z' (122) => (25)
24 k1 := uint16(input[0] - 97)
25 k2 := uint16(input[1] - 97)
26 return uint16(((k1+k2)*(k1+k2+1))/2 + k2)
27}
The benchmark shows using Cantor pairing as the hash function, we can achieve 5 times faster than generic hashmap solution.
1$ go test -run=^$ -bench=BenchmarkCheckCountryCode -cpu 1
2BenchmarkCheckCountryCode/cantor_pairing_hit 584080164 2.056 ns/op
3BenchmarkCheckCountryCode/cantor_pairing_miss 580287861 2.057 ns/op
Algorithm analysis:
- Time complexity:
O(1)
.- For both happy and worst case, we only need 1 array access check so it’s O(1).
- Space complexity:
O(1)
.- We need 1301 boolean items in the array so actual memory usage is 1301 bytes (not countring overhead from array data structure itself).
How can Cantor pairing solution so much faster than generic hashmap solution when basically both two solutions are hashmap implementation?
- Because generic hashmap uses a generic hash function which works with a much bigger number (non-determined) of keys so it’s much slower compare to our Cantor pairing hash function.
- Cantor pairing hash function here is a PHF, which tailored for this set of data (247 country codes, 1301 posible input) only. So it’s much simpler and we don’t have to deal with hash collision.
- Cantor pairing solution operate directly on a small array which can fit directly on CPU cache makes it much faster for CPU to access.
While generic hashmap data structure is much more complicated (e.g.: handle different types of key/value, resolve collisions…) and may need pointer reference to the items which cannot leverage CPU cache.
5. Cantor pairing with bitmap solution
The Cantor pairing solution above is really fast but has one down side is using more memory, how can we optimize it also?
Because we only need to check the existency of the given input in the list of country code, so we can use 1 bit (0 means not in list, 1 means in list) for each input in the set [aa, zz]
.
=> Instead of using 1301 bytes for the checking array, we can operate on bit level using bitmap, so the size required can be reduced by 8 (8 bits makes a byte) to 1301/8 = 163 bytes.
1// Max Cantor pair: p(z, z) ~= p(25, 25) ~= 1300 bits ~= 163 bytes
2bitMap = make([]byte, 163, 163)
3
4func init() {
5 // Mark the index of country code in the cantorArr array bitmap
6 ccs := strings.Split(countryCodes, ",")
7 for _, cc := range ccs {
8 idx := cantorPair(cc)
9 bitMap[idx/8] = setBit(bitMap[idx/8], byte(idx%8))
10 }
11}
12
13// IsCountryCodeByCantorPairing uses Cantor pairing to calculate the index of the given string
14// in the bitmap.
15func IsCountryCodeByCantorBitmap(input string) bool {
16 idx := cantorPair(input)
17 return hasBit(bitMap[idx/8], byte(idx%8))
18}
19
20// cantorPair returns a unique number for the given input using Cantor pairing function.
21func cantorPair(input string) uint16 {
22 // Reduce unnecessary gap from ASCII [0:97]
23 // e.g.: 'a' (97) => (0), 'z' (122) => (25)
24 k1 := uint16(input[0] - 97)
25 k2 := uint16(input[1] - 97)
26 return uint16(((k1+k2)*(k1+k2+1))/2 + k2)
27}
28
29// setBit sets the bit at pos in the byte n.
30func setBit(n byte, pos byte) byte {
31 n |= (1 << pos)
32 return n
33}
34
35// hasBit checks the bit at pos in the byte n.
36func hasBit(n byte, pos byte) bool {
37 return (n & (1 << pos)) > 0
38}
In order to save up memory, we have to do more calculation to figure out the location of checking bit. Benchmark shows that we are 1.1 times slower than direct access on array index in exchange of 8 times lower in memory, which is a good deal.
1$ go test -run=^$ -bench=BenchmarkCheckCountryCode -cpu 1
2BenchmarkCheckCountryCode/_cantor_bitmap_hit 528360094 2.275 ns/op
3BenchmarkCheckCountryCode/_cantor_bitmap_miss 531354234 2.256 ns/op
Algorithm analysis:
- Time complexity:
O(1)
.- For both happy and worst case, we only need 1 array access check so it’s O(1).
- Space complexity:
O(1)
.- We need 1301 items in bitmap so so actual memory usage is 1301 bits or 163 bytes (not counting overhead from array data structure itself).
6. Conclusion
- Most of the time, generic hashmap is good enough.
- Small array access can be faster than hashmap, while working with small set of data, maybe it’s better to use array. CPU cache matter.
- When working with fixed input set, it’s good time to think about perfect hash function.
- Test, benchmark and profiling is valuable when optimizing.
- Micro-benchmarking is overkill most of the time and premature optimization is the root of all evils 3.
- This Cantor pairing PHF can become bloat quickly, e.g.
cantorPairing(500, 500) = 501000
, so it cannot be used as a generic hash function.
What you should take away from this post is the way of solving specific problem only.
Source code:
- main.go: Implementation for all solutions.
- main_test.go: Test and benchmark.
Or see it on Github.
1$ go test -run=^$ -bench=BenchmarkCheckCountryCode -cpu 1
2BenchmarkCheckCountryCode/___array_naive_hit 305538489 3.779 ns/op
3BenchmarkCheckCountryCode/____map_string_hit 147317772 8.438 ns/op
4BenchmarkCheckCountryCode/cantor_pairing_hit 584080164 2.056 ns/op
5BenchmarkCheckCountryCode/_cantor_bitmap_hit 528360094 2.275 ns/op
6BenchmarkCheckCountryCode/___array_naive_miss 1913680 621.7 ns/op
7BenchmarkCheckCountryCode/____map_string_miss 124363000 9.644 ns/op
8BenchmarkCheckCountryCode/cantor_pairing_miss 580287861 2.057 ns/op
9BenchmarkCheckCountryCode/_cantor_bitmap_miss 531354234 2.256 ns/op