Running SQL Queries on plain text files

Although storing data in a plain text file can never replace a full-blown relational database management server, they are perfect for small jobs and quick reporting. This paper explores the available options when data is stored in plain text files.

For the purpose of discussion, we chose the following examples.

  • 1. Running SQL queries on a log file generated by a web server
  • 2. Extracting a subset of data from a relational database into plain text files and running SQL queries.

Example 1

Most web servers generate log files containing the web hits made to that server. These log files are in tabular format and therefore, can be treated as a database table. The following section shows how to run SQL queries on a log file generated by Microsoft IIS Server.

File Preparation

Although the log files generated by IIS are in tabular format, the first few lines contain a descriptive message that serves as a comment and should be removed. Following is a sample file containing logged messages.

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2006-02-20 00:00:24
#Fields: date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes
2006-02-20 00:00:24 GET /index.htm - - 192.168.188.225 HTTP/1.1 - - 200 269
2006-02-20 00:03:00 GET /robots.txt - - 192.168.188.24 HTTP/1.0 - - 404 1814
2006-02-20 00:03:00 GET /default.htm - - 192.168.188.24 HTTP/1.0 - - 200 18299

The first step is to remove the extra lines that appear at the beginning of the document. Notice that the forth line contain column headings for each column that appear in the data. This information is useful. To convert the text file in an SQL-friendly format, we remove the first three lines and the first word from the fourth line. Our sample file, after removing the comments, should look like as follows.

date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes
2006-02-20 00:00:24 GET /index.htm - - 192.168.188.225 HTTP/1.1 - - 200 269
2006-02-20 00:03:00 GET /robots.txt - - 192.168.188.24 HTTP/1.0 - - 404 1814
2006-02-20 00:03:00 GET /default.htm - - 192.168.188.24 HTTP/1.0 - - 200 18299

Consider changing some of the field names. For example,

  • Rename "Date" and "Time" to "rDate" and "rTime" respectively; these names are typically treated as keywords in SQL.
  • Remove dashes and replace them with underscores. A dash can be treated as a minus operator and may result in errors.
  • Remove the parenthesis from field names.

Using ODBC as the SQL engine

Since plain text file has no intelligence, we need some sort of a querying engine that can processes SQL queries and return desired results. The obvious choice for this engine is Microsoft Jet Engine, which comes bundled with the OS and most computers running Windows already have it installed on their machine. The Jet Engine includes ODBC drivers for several file-based data sources and one of them in a plain text file. We can use this text driver as our querying engine.

Besides this engine, we will also need a querying tool that can take SQL queries from the user, submit it to the engine and display the generated results. This tool is WinSQL, which can be downloaded from here

Setting up ODBC

In order to treat this IIS log as database table, we need to create an ODBC DSN using a driver that works with text files. Follow the steps below to create a Text DSN for your text file.

  1. Rename the file; change its extension to .txt from .log. By default, the ODBC driver for text files does not recognize the .log extension.
  2. Open ODBC administrator from Control Panel.
  3. Create a new User DNS for Text files.
  4. Define the format of your text file.
    Text ODBC
  5. Click the "Select Directory" button to pick a folder where your log files are located.
  6. Click the "Define Format" button. If you do not see this button, click "Options" to expand this window.
    Define Text ODBC
  7. Select the desired file name from left.
  8. Check the "Column Name Header" check box, which makes the driver think that the field names are stored in the first line.
  9. Change the "Format" from CSV delimited to Custom Delimited and set the Delimiter to a SPACE.
  10. Click the "Guess" button the on upper right hand side.
  11. Notice how the column names are filled automatically. This is because the first line in the file contain column names. Although the column names are filled, you still need to specify their data type. Select every column one at a time and ensure that the associated data type is CHAR except for the column "sc-bytes", which contains the total number of bytes transferred - leave that as Integer. Specifying integer as data type will allow us to run mathematical operations on the column.
  12. Change the name of column Date and Time to something else, like RDATE and RTIME. This is important as Date and Time can be confused with keywords in ODBC.
  13. Click ok all the way, until ODBC manager closes.

What happens in the background?

Creating a text DNS can be a tedious task; particularly for text files with too many columns. However, once you understand what happens in the background, you may be able to save some time by defining the design manually.

The ODBC DSN for Text files creates a file called SCHEMA.INI in the directory that you choose for your database. This .INI file holds the schema for the tables. Once you understand the layout of this file, you can manually edit it for future use, which is usually much faster and less tedious.

The layout of SCHEMA.INI file for the IIS Log file we created earlier using ODBC Manager is shown below.

[LogFile.txt]
ColNameHeader=True
CharacterSet=OEM
Format=Delimited( )
Col1=rdate Char Width 255
Col2=rtime Char Width 255
Col3=method Char Width 255
Col4=uri Char Width 255
Col5=query Char Width 255
Col6=username Char Width 255
Col7=ip Char Width 255
Col8=version Char Width 255
Col9=user_agent Char Width 255
Col10=referer Char Width 255
Col11=status Integer
Col12=bytes Integer


This file is broken into sections; each section specifies the schema of a file. The text within square bracket contains the name of the file, followed by a name-value pair defining the format of the file as well as column definitions.

The most important name-value pair is the "Format" tag, which defines the format of the file. Following table shows the possible values this tag accepts.

Format statementDescription
Format=TabDelimitedFields in the file are delimited by tabs.
Format=CSVDelimitedFields in the file are delimited by commas (comma-separated values).
Format=Delimited(custom character)Fields in the file are delimited by any character you choose to input into the dialog box. All except the double quote (") are allowed, including blank.
Format=FixedLengthFields in the file are of a fixed length.

The following table defines the rest of the variables in the file

Variable nameDescription
ColNameHeaderThis is set to true if the first column contains the name of the columns.
MaxScanRowsUse the MaxScanRows option to indicate how many rows should be scanned when determining the column types. If you set MaxScanRows to 0, the entire file is scanned..
Coln The following entries designate fields in a table by using the column number (Coln) option, which is optional for character-delimited files and required for fixed-length files.

The format for Coln option is

Coln = ColumnName type [Width #]

Possible value for data types are:
  • Char
  • Float
  • Integer
  • LongChar
  • Date date

    Besides these data types, you can also specify the types that are valid types for Microsoft Jet Engine.
  • Connecting from WinSQL

    Once the ODBC DNS is correctly configured, you can start WinSQL and connect to your newly created data source. If you do not have a copy of WinSQL, you can download it from here. When you start WinSQL, select the appropriate DSN from the list of available entries.

    Define Text ODBC

    Once connected, WinSQL will treat this log file as if it were a database table and will allow you to run SELECT as well as INSERT queries. Note that this driver does not support DELETE or UPDATE statements.

    Verifying connection and data

    To verify the connection to your database, run a select statement or simply click on Catalog, locate the file and click "Browse Data"

    Define Text ODBC

    Query Examples

    Following section show some queries that you can run from WinSQL against an IIS Log file.

    Query 1 - Calculate amount of data transfer for different file types such Zipped files, HTML and images

    select 'Zip Files' as `File Type `, -- Returns the type of file
        sum(bytes) as `Total Bytes`, -- Returns total bytes for this type
        count(*) as `Number of Hits` -- Number of hits for this type
    from ex060220.txt
    where uri like '%.zip'
    and bytes is not null

    union  -- use UNION to merge results for multiple types

    select 'Images', sum(bytes),
        count(*)
    from ex060220.txt
    where uri like '%.gif' or uri like '%.jpg'
    and bytes is not null

    union

    select 'HTML Files', sum(bytes),
        count(*)
    from ex060220.txt
    where (uri like '%.htm'
        or uri like '%.html')
    and bytes is not null
    order by 2 desc

    Query 2 - List pages that resulted in an error

    select uri, status, count(*)
    from logFile.txt
    where status >= 400
    group by uri, status
    order by 3 desc

    Query 3 - What type of browsers clients use.

    select 'Internet Explorer' as Browser, count(*) as `Total Hits`
    from ex060220.txt
    where {fn LOCATE('MSIE', user_agent)} > 0
    -- Notice the use of string function LOCATE

    union

    select 'Firefox', count(*)
    from ex060220.txt
    where {fn LOCATE('Firefox', user_agent)} > 0

    union

    select 'Konqueror', count(*)
    from ex060220.txt
    where {fn LOCATE('Konqueror', user_agent)} > 0

    union

    select 'Opera', count(*)
    from ex060220.txt
    where {fn LOCATE('Opera', user_agent)} > 0

    union

    select 'Safari', count(*)
    from ex060220.txt
    where {fn LOCATE('Safari', user_agent)} > 0

    Query 4 - Find out which site users come from. In other words, extract the referrer site.

    select {fn SUBSTRING(referrer, 8 , -- Use Substring function to extract
            -- the domain name. Start with the 8th character as the first
            -- 7 are http://
            {fn LOCATE('/', referer, 8)} - 8 )},
            count(*)
    from ex060220.txt t
    where {fn LOCATE('/', referer, 8)} > 1
        and {fn LENGTH(referer)} > 8
        and referer like 'http://%' -- This will only get referers
            -- for HTTP. Use a UNION query to get HTTPS or any other
            -- Protocol
    group by {fn SUBSTRING(referer, 8 ,
        {fn LOCATE('/', referer, 8)} - 8 )}
    order by 2 desc

    Same query can be written using a temporary table. The following scripts creates a temporary table, inserts data into it, calculates the referrers and finally, drops the table.

    -- Create a temporary table
    create table Referrers.txt(
        refName char(255)
    )
    go
    -- Insert data into this table
    insert into Referrers.txt(refName)
    select {fn SUBSTRING(referer, 8 , {fn LOCATE('/', referer, 8)} - 8 )}
    from ex060220.txt t
    where {fn LOCATE('/', referer, 8)} > 1
    go
    -- Calculate the referrer count
    select refName, count(*)
    from Referrers.txt
    group by refName
    order by 2 desc
    go
    -- Finally, drop the table
    drop table Referrers.txt




    Example 2

    The second example in this document relates to extracting a subset of data from a relational database, storing that data in plain text files and running SQL queries on them. It really does not matter which RDBMS you use as the source database. However, it is important that we do talk briefly about the database design.

    The diagram below shows the design for tables. Notice that there are relationships among these tables.
    Sample database design

    Exporting data into text files

    The first step in our exercise is to extract a subset of data from a relational database. For the sake of argument, we select all records from every table where the customer lives in New Jersey. Following steps illustrate how to achieve this task.
    1. Connect to the source database
    2. First, let's export the "Customers" table by saving the results of the following query.

      select *
      from CUSTOMER
      where STATE = 'NJ'

    3. Type the above query in the query window and click the export button. Alternatively, you can also select Query/Export from the main menu. This brings up the Export wizard in WinSQL.
    4. Since we are treating our text files as a database, you pick the last option instead of the first one, which reads, "Export to a new table in another database".
      Database export

      Database export
    5. Use Customers.txt as the new table name, which represents the text file where records will be exported.
    6. In the export dialog box, select the same DSN that we used for example 1
    7. Follow the instructions on the screen and click finish to export. For more information on how to use the export wizard, refer to WinSQL User's Guide

      Query for ORDERS table
      select o.*
      from ORDERS o, CUSTOMER c
      where o.C_ID = c.C_ID
      and c.STATE = 'NJ'


      Query for EMPLOYEE table
      select distinct e.*
      from ORDERS o, CUSTOMER c, EMPLOYEE e
      where o.C_ID = c.C_ID and o.E_SSN = e.E_SSN
      and c.STATE = 'NJ'


      Query for LINEITEM table
      select distinct l.*
      from ORDERS o, CUSTOMER c, LINEITEM l
      where o.C_ID = c.C_ID and o.O_ID = l.O_ID
      and c.STATE = 'NJ'


      Query for PRODUCTS table
      select distinct p.*
      from ORDERS o, CUSTOMER c, LINEITEM l, PRODUCT p
      where o.C_ID = c.C_ID and o.O_ID = l.O_ID and l.P_ID = p.P_ID
      and c.STATE = 'NJ'

    After successfully exporting all five tables, you will see their corresponding text files on your local hard disk. You should now be able to run SQL Queries on the text files as if the data is actually coming from an RDBMS

    Specifying relationships among text files

    Note that the ODBC driver for text file does not support the concept of foreign keys. Therefore, it is not possible to define relationships among these tables at a database level. However, WinSQL allows you to create local relationships that work independently of the back-end database.

    Following steps illustrate how to create local relationships within WinSQL.

    1. Connect to the Text DSN.
    2. Click on the Catalog Tab.
    3. Open the nodes for Tables, locate the Customer.txt table and click on "Manage Relationships".
    4. Click on the "Add Relationships" button to add a new relation.
      Local relationship
    5. The new relationship screen allows users to create one-to-many or many-to-one relationships. To create a relation between CUSTOMER and ORDERS, select one-to-many.
    6. Repeat this process for all five tables.
    7. The manage relationships node in the catalog window displays relationships for all tables.

    Manage relationship

    Why do you need local relationships?

    WinSQL uses relationships when you write queries. For example, when you write a SELECT statements containing both CUSTOMER and ORDERS table, the Intelli Tip feature automatically recognizes this relation and displays the appropriate WHERE clause.
    Auto JOIN

    Another advantage of having relationships is the ability to drill down within results. The drill down feature is only applicable if you run a query against one table and the result set contains either primary for foreign keys from that table. Columns containing primary and foreign keys are displayed in a different color and clicking the plus sign for a particular cell brings up the associated value from the secondary table.

    Drill down

    Running complex queries

    Although the ODBC text driver does not support most of the advanced features, like indexes and stored procedures, it does allow you to write complex JOINS between tables. For example, examine the following SQL script, which involves four tables in the JOIN statement and uses an aggregate function "sum", which is part of ANSI SQL.

    Query 2 - List pages that resulted in an error

    select FNAME, LNAME, sum(t2.QTY * t3.COST) as Total
    from Customers.txt t,
        Orders.txt t1,
        LineItem.txt t2,
        Product.txt t3
    where t.C_ID = t1.C_ID
        and t2.O_ID = t1.O_ID
        and t3.P_ID = t2.P_ID
        and t.CITY = 'Princeton'
    group by FNAME, LNAME

    Limitations

    While on one hand the ODBC driver for text provides a convenient way of querying data stored in a text file, on the other hand it does impose certain limitations. The list below shows some of these limitations.

    • You can only issue SELECT and INSERT statements. UPDATE and DELETE are not supported.
    • You can issue a CREATE TABLE statement but cannot specify a primary key, default value or foreign key constraints.
    • You cannot create Indexes. As a result, the queries take much longer to run.
    • You cannot create Triggers, Functions or Stored Procedures.

    Summary

    This document provides some helpful tips related to querying text files using WinSQL. It shows the reader how to write simple SQL statements against any text file containing data in tabular format. Moreover, the paper talks about some features in WinSQL that help users understand the relationships among different entities stored in text files.

    Navigation

    Social Media

    Powered by 10MinutesWeb.com