The majority of the SQL-dialects admittedly offer to the users a rich set of data types. The most commonly used among them are probably the Integer and Float of different precision, the Boolean, the Timestamp and the Time data type. Yet, these basic data types are not sufficient for covering all the analytical needs that may arise. So, how can I handle an IP address in SQL?
There are also some other data types that require special handling, like IP addresses and ranges. A typical case of an analysis that utilizes IP addresses would probably be the filtering of the range of the IP addresses related to certain logged in actions or events in order to identify the companies from which the visitors to your website come from.
This is part #5 of our helpful SQL short guides:
- Table and Data Partitioning
- SQL type casting
- SQL pattern matching
- Organizing SQL Queries
- Handling Datatypes for IP addresses in SQL (this part)
In clickstream analysis, being able to group actions performed by the same user or multiple users within the same company can provide an analyst very useful information. Another common application has to do with the monitoring and administration of systems and services like digging into the logs of a web server trying to troubleshoot a specific problem.
Storing this information as varchar(), although common, can become troublesome very easily due to the large variety of possible representations on the one hand and the incompatibility with the string operators on the other.
Although they are essentially positive 32 or 128-bit integers, they are more commonly represented as blocks of 8 or 16-bit integers written in decimal and separated by dots in IPv4 or written in hexadecimal and separated by colons in IPv6. Especially IPv6 addresses can be stored with upper- or lower-case hexadecimal letters since they are case-insensitive. Also, leading zeros can be suppressed for abbreviation and so :0000: or :0: are used for a group of zeros.
For shortening more than one consecutive groups of zeros the :: annotation is used (but only once in an address) while when there are more than one sets of consecutive groups of zeros to shorten then only the leftmost set gets suppressed. It becomes evident that this lack of uniform representation complicates manipulation when stored as strings.
Furthermore, even in cases when a lot of manual work is being done towards adaptation to a fixed standard, the majority of the operators available for strings won’t work as desired on IP addresses resulting in misleading and unexpected outcomes.
Luckily for us, there are database systems, like PostgreSQL and BigQuery, that offer special data types dedicated to network operations. In some others, like MS SQL Server and Amazon Redshift, although these capabilities are not directly supported, there are still some hacks that one can do.
The offered data types include inet and cidr for storing host and network addresses respectively in either IPv4, IPv6 or both. Even when data is not initially stored as such, you can always cast string into inet or cidr, if needed. This can be done simply with the :: annotation as shown below:
The simplest operations that one can perform with these data types include subtraction, bitwise operations and basic numeric comparisons, as presented in the following examples:
- ‘192.168.1.1’::inet + 256
- ‘192.168.2.1’::inet – 256
- ‘192.168.1.5’::inet < ‘192.168.1.6’ :: inet
For more advanced manipulation, containment operators and utility functions are also available. More specifically, there are operators to check if one value contains or is contained by another (>> and <<), or the same with equality (>>= and <<=), or containment going either way (&&). They consider only the network parts of the two addresses (ignoring any host part) and determine whether one network is identical to or a subnet of the other.
- ‘192.168.1.5’::inet << ‘192.168.1/24’::inet
- ‘192.168.1/24’::inet>> ‘192.168.1.5’::inet
- ‘192.168.1/24’::inet >>= ‘192.168.1/24’::inet
- ‘192.168.1/24’::inet <<= ‘192.168.1/24’::inet
- ‘192.168.1/24’::inet && ‘192.168.1.80/28’::inet
Available one can find the three major bitwise operations including AND(&), OR(|) and NOT(~).
- ~ ‘192.168.1.6’::inet
- ‘192.168.1.6’::inet & ‘0.0.0.255’::inet
- ‘192.168.1.6’::inet | ‘0.0.0.255’::inet
PostgreSQL also offers a set of a functions for a bunch of other useful operations such as the following:
- network(inet): extract network part of address
- broadcast(inet): broadcast address for network
- host(inet): extract IP address as text
BigQuery is another database system that supports some functions that can be used when working with IP addresses. Namely these are the FORMAT_IP(), PARSE_IP(), FORMAT_PACKED_IP() and PARSE_PACKED_IP().
In cases when the IP addresses are stored as integers in the BigQuery database you can use the FORMAT_IP() function which takes as input an integer value and converts its 32 least significant bits in an IPv4 address in human-readable format. If stored as strings, it is recommended that you convert them into integers in order to perform further manipulation.
Furthermore, BigQuery supports a very helpful function, the PARSE_IP(), that can be used for checking if an IP address has a valid format or not. PARSE_IP() takes as input a string representing a human-readable address and returns 1 if its format is valid or NULL if it is not.
For facilitating both IPv4 and IPv6, BigQuery offers the FORMAT_PACKED_IP() and PARSE_PACKED_IP() functions which take as input packed_ip and readable IP in a packed format accordingly and perform almost the same operations as their simpler versions, FORMAT_IP() and PARSE_IP(). The only difference is that in the case of valid input the PARSE_PACKED_IP() returns an IP address in BYTES.
The following tables illustrates some examples of these functions taken from BigQuery documentation:
MS SQL Server
Unfortunately, SQL Server does not offer a data type suitable for storing IP addresses in a format that would facilitate reading by humans. So, one approach widely used involves storing addresses as strings with the aforementioned problems in manipulation.
Another approach which is probably saner suggests storing IP addresses as binary numbers. Ultimately, it is the “truest” to what it represents and allows easy subnetting if needed.
In addition, in contrast to the varchar representation, it takes up much less space. A 32-bit IPv4 can be stored as binary using 4 bytes while its varchar representation would take up 15 bytes. Accordingly, a 128-bit IPv6 requires 16 bytes if stored numerically and 39 bytes as varchar.
However, storing addresses as binary number requires reverse transformation according to the display conversion of its binary content in case a human inspection is required.
Fortunately, defining functions that can be used for this back and forth conversion is not too difficult as many implementations can be found online on StackOverflow and other sites.
Adopting Jerry’s Birchler implementation you can create the following functions:
Unfortunately, although Redshift is based on PostgreSQL, it does not support all of PostgreSQL functionalities. Among them belong the convenient inet and cidr data type discussed previously. As with MS SQL Server you can store IP address, both IPv4 and IPv6, either as varchar or as numeric. As already mentioned, the binary representation is much more preferable than any other as it reflects the true nature of an IP address.
The first thing one has to do towards this goal is to convert the IP address to an integer. This can be done easily with the following select
SELECT (split_part('192.168.1.1', '.', 1)::bigint << 24) + (split_part('192.168.1.1', '.', 2)::bigint << 16) + (split_part('192.168.1.1', '.', 3)::bigint << 8) + (split_part('192.168.1.1', '.', 4)::bigint) FROM table;
Normally you would have to go on and convert this integer into binary in order to be able to perform any bitwise operation. However, Redshift offers its users functions suitable for bitwise operations that take as input integers, convert them internally to binaries, perform the operation and then transform them back to integer. These functions are the BIT_AND and the BIT_OR.
Yet, if you insist in converting manually the initial IP address into a binary value you can use the STROL() function that takes as input an integer and converts it according to the specified base which in our case is equal to 2.
SELECT (strol(split_part(ip_addr, '.', 1),2) << 24) + (strol(split_part(ip_addr, '.', 2),2) << 16) + (strol(split_part(ip_addr, '.', 3),2) << 8) + (strol(split_part(ip_addr, '.', 4),2)) FROM table;
It becomes evident that IP addresses cannot be easily handled as regular numeric data without undergoing multiple transformations. This can increase substantially the complexity of the calculations required especially in database systems that do not support dedicated data types and relevant functions like PostgreSQL does.
Yet do not be afraid. You can always consult the queries we have prepared for you in this post and perform your tasks without a care!