Revolving around the core of technology
Although storing data in a plain text file can never replace a full-blown relational database management server, they are perfect for small jobs and quick reporting. This paper explores the available options when data is stored in plain text files.
For the purpose of discussion, we chose the following examples.
Most web servers generate log files containing the web hits made to that server. These log files are in tabular format and therefore, can be treated as a database table. The following section shows how to run SQL queries on a log file generated by Microsoft IIS Server.
Although the log files generated by IIS are in tabular format, the first few lines contain a
descriptive message that serves as a comment and should be removed. Following is a sample file
containing logged messages.
#Software: Microsoft Internet Information Services 6.0
|
The first step is to remove the extra lines that appear at the beginning of the document. Notice that the
forth line contain column headings for each column that appear in the data. This information is useful.
To convert the text file in an SQL-friendly format, we remove the first three lines and the first word from
the fourth line. Our sample file, after removing the comments, should look like as follows.
date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes
|
Consider changing some of the field names. For example,
Since plain text file has no intelligence, we need some sort of a querying engine that can processes SQL queries and return desired results. The obvious choice for this engine is Microsoft Jet Engine, which comes bundled with the OS and most computers running Windows already have it installed on their machine. The Jet Engine includes ODBC drivers for several file-based data sources and one of them in a plain text file. We can use this text driver as our querying engine.
Besides this engine, we will also need a querying tool that can take SQL queries from the user, submit it to the engine and display the generated results. This tool is WinSQL, which can be downloaded from here
In order to treat this IIS log as database table, we need to create an ODBC DSN using a driver that works with text files. Follow the steps below to create a Text DSN for your text file.
Creating a text DNS can be a tedious task; particularly for text files with too many columns.
However, once you understand what happens in the background, you may be able to save some time by
defining the design manually.
The ODBC DSN for Text files creates a file called SCHEMA.INI in the directory that you choose
for your database. This .INI file holds the schema for the tables. Once you understand the layout
of this file, you can manually edit it for future use, which is usually much faster and less tedious.
The layout of SCHEMA.INI file for the IIS Log file we created earlier using ODBC Manager is shown below.
[LogFile.txt]
|
Format statement | Description |
Format=TabDelimited | Fields in the file are delimited by tabs. |
Format=CSVDelimited | Fields in the file are delimited by commas (comma-separated values). |
Format=Delimited(custom character) | Fields in the file are delimited by any character you choose to input into the dialog box. All except the double quote (") are allowed, including blank. |
Format=FixedLength | Fields in the file are of a fixed length. |
The following table defines the rest of the variables in the file
Variable name | Description |
ColNameHeader | This is set to true if the first column contains the name of the columns. |
MaxScanRows | Use the MaxScanRows option to indicate how many rows should be scanned when determining the column types. If you set MaxScanRows to 0, the entire file is scanned.. |
Coln |
The following entries designate fields in a table by using the column number (Coln) option, which is optional for character-delimited files and required for fixed-length files.
The format for Coln option is Coln = ColumnName type [Width #] Possible value for data types are: Besides these data types, you can also specify the types that are valid types for Microsoft Jet Engine. |
Once the ODBC DNS is correctly configured, you can start WinSQL and connect to your newly created data source.
If you do not have a copy of WinSQL, you can download it from here.
When you start WinSQL, select the appropriate DSN from the list of available entries.
Once connected, WinSQL will treat this log file as if it were a database table and will allow you to run
SELECT as well as INSERT queries. Note that this driver does not support DELETE or UPDATE statements.
To verify the connection to your database, run a select statement or simply click on Catalog,
locate the file and click "Browse Data"
Following section show some queries that you can run from WinSQL against an IIS Log file.
Query 1 - Calculate amount of data transfer for different file types such Zipped files, HTML and images
select 'Zip
Files' as `File Type
`, -- Returns the type of
file sum(bytes) as `Total Bytes`, -- Returns total bytes for this type count(*) as `Number of Hits` -- Number of hits for this type from ex060220.txt where uri like '%.zip' and bytes is not null union -- use UNION to merge results for multiple types select 'Images', sum(bytes), count(*) from ex060220.txt where uri like '%.gif' or uri like '%.jpg' and bytes is not null union select 'HTML Files', sum(bytes), count(*) from ex060220.txt where (uri like '%.htm' or uri like '%.html') and bytes is not null order by 2 desc |
Query 2 - List pages that resulted in an error
select uri, status, count(*) from logFile.txt where status >= 400 group by uri, status order by 3 desc |
Query 3 - What type of browsers clients use.
select 'Internet Explorer' as Browser, count(*) as `Total Hits` from ex060220.txt where {fn LOCATE('MSIE', user_agent)} > 0 -- Notice the use of string function LOCATE union select 'Firefox', count(*) from ex060220.txt where {fn LOCATE('Firefox', user_agent)} > 0 union select 'Konqueror', count(*) from ex060220.txt where {fn LOCATE('Konqueror', user_agent)} > 0 union select 'Opera', count(*) from ex060220.txt where {fn LOCATE('Opera', user_agent)} > 0 union select 'Safari', count(*) from ex060220.txt where {fn LOCATE('Safari', user_agent)} > 0 |
Query 4 - Find out which site users come from. In other words, extract the referrer site.
select {fn SUBSTRING(referrer, 8 , -- Use Substring function to extract -- the domain name. Start with the 8th character as the first -- 7 are http:// {fn LOCATE('/', referer, 8)} - 8 )}, count(*) from ex060220.txt t where {fn LOCATE('/', referer, 8)} > 1 and {fn LENGTH(referer)} > 8 and referer like 'http://%' -- This will only get referers -- for HTTP. Use a UNION query to get HTTPS or any other -- Protocol group by {fn SUBSTRING(referer, 8 , {fn LOCATE('/', referer, 8)} - 8 )} order by 2 desc |
Same query can be written using a temporary table. The following scripts creates a temporary table,
inserts data into it, calculates the referrers and finally, drops the table.
-- Create a temporary table create table Referrers.txt( refName char(255) ) go -- Insert data into this table insert into Referrers.txt(refName) select {fn SUBSTRING(referer, 8 , {fn LOCATE('/', referer, 8)} - 8 )} from ex060220.txt t where {fn LOCATE('/', referer, 8)} > 1 go -- Calculate the referrer count select refName, count(*) from Referrers.txt group by refName order by 2 desc go -- Finally, drop the table drop table Referrers.txt |
The second example in this document relates to extracting a subset of data from a relational database,
storing that data in plain text files and running SQL queries on them. It really does not matter which RDBMS
you use as the source database. However, it is important that we do talk briefly about the database design.
The diagram below shows the design for tables. Notice that there are relationships among these tables.
select * from CUSTOMER where STATE = 'NJ' |
select o.* from ORDERS o, CUSTOMER c where o.C_ID = c.C_ID and c.STATE = 'NJ' |
select distinct e.* from ORDERS o, CUSTOMER c, EMPLOYEE e where o.C_ID = c.C_ID and o.E_SSN = e.E_SSN and c.STATE = 'NJ' |
select distinct l.* from ORDERS o, CUSTOMER c, LINEITEM l where o.C_ID = c.C_ID and o.O_ID = l.O_ID and c.STATE = 'NJ' |
select distinct p.* from ORDERS o, CUSTOMER c, LINEITEM l, PRODUCT p where o.C_ID = c.C_ID and o.O_ID = l.O_ID and l.P_ID = p.P_ID and c.STATE = 'NJ' |
After successfully exporting all five tables, you will see their corresponding text files on your local hard disk. You should now be able to run SQL Queries on the text files as if the data is actually coming from an RDBMS
Note that the ODBC driver for text file does not support the concept of foreign keys. Therefore, it is not
possible to define relationships among these tables at a database level. However, WinSQL allows you to create
local relationships that work independently of the back-end database.
Following steps illustrate how to create local relationships within WinSQL.
WinSQL uses relationships when you write queries. For example, when you write a SELECT statements containing
both CUSTOMER and ORDERS table, the Intelli Tip feature automatically recognizes this relation and displays
the appropriate WHERE clause.
Another advantage of having relationships is the ability to drill down within results.
The drill down feature is only applicable if you run a query against one table and the result set
contains either primary for foreign keys from that table. Columns containing primary and foreign keys are
displayed in a different color and clicking the plus sign for a particular cell brings up the associated
value from the secondary table.
Although the ODBC text driver does not support most of the advanced features, like indexes and stored procedures, it does allow you to write complex JOINS between tables. For example, examine the following SQL script, which involves four tables in the JOIN statement and uses an aggregate function "sum", which is part of ANSI SQL.
Query 2 - List pages that resulted in an error
select FNAME, LNAME, sum(t2.QTY * t3.COST) as Total from Customers.txt t, Orders.txt t1, LineItem.txt t2, Product.txt t3 where t.C_ID = t1.C_ID and t2.O_ID = t1.O_ID and t3.P_ID = t2.P_ID and t.CITY = 'Princeton' group by FNAME, LNAME |
While on one hand the ODBC driver for text provides a convenient way of querying data stored in a text file, on the other hand it does impose certain limitations. The list below shows some of these limitations.
This document provides some helpful tips related to querying text files using WinSQL. It shows the reader how to write simple SQL statements against any text file containing data in tabular format. Moreover, the paper talks about some features in WinSQL that help users understand the relationships among different entities stored in text files.