First, most of my SQL scripts are multiple statements, typically 6+, ranging up as high as 100. When I'm reading and trying to digest such scripts, the long format described by the author, particularly putting each column on its own row, makes it difficult to digest the script. I'm forced to scroll constantly to make sense of the statements in relation to one another. Instead, I prefer my SQL to be more compact, so I can view as much of the total script as possible.
To accomplish this, and to maintain readability, I structure each statement something like this:
CREATE/INSERT/UPDATE line
SELECT line
FROM line
JOINS (if present)
ON (if present)
WHERE (if present)
AND (if present)
Additional SQL (qualify, group by, order by)
;
If I have a lot of columns, the select portion will get split into several lines, usually where there is a case when, where there are column operations, or when the line reaches 200ish characters.
This makes the most sense to me, as each command (create, update, insert, select, from, join, on, where, and, group by) is on its own line, followed by the information relevant to it.
The same argument applies to writing normal code. The semi-colon exists so that you can put more statements on a line, and make your code fit into your editor, right?
You don't agree?
What is the difference between coding in that language and SQL? I submit that it is only the amount of it you write.
I spent several years of my life focusing on reporting, and spending more time writing/maintaining SQL than writing any other language. In that time I discovered that complex SQL queries are a language like anything else. For "hello world" you can get away with anything. But as soon as you are doing complex stuff, the layout matters.
Did you know that it can make a difference whether a condition is in your ON or your WHERE? It can. (Think left joins.) Did you know that the location/order of the ON clauses can make a difference? It does. Is it visually obvious where this particular condition is? It should be. Did you know that the order you put things in within your query can have performance impacts? It shouldn't, but it does (particularly for MySQL - MySQL is stupid).
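The ON-vs-WHERE point is easy to demonstrate. Here is a minimal sketch using sqlite with made-up table and column names (users/orders): with a LEFT JOIN, a filter on the right-hand table keeps unmatched rows if it lives in ON, but silently turns the join into an inner join if it lives in WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 50);
""")

# Condition in ON: every user is kept; non-matching orders become NULL.
in_on = conn.execute("""
    SELECT u.name, o.amount
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id AND o.amount > 100
""").fetchall()
# in_on == [('alice', None), ('bob', None)]

# Condition in WHERE: applied after the join, so the NULL rows are
# filtered out and the LEFT JOIN behaves like an INNER JOIN.
in_where = conn.execute("""
    SELECT u.name, o.amount
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    WHERE o.amount > 100
""").fetchall()
# in_where == []
```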
If you've got 200ish character lines and you are unwilling to format, well, I'm glad that I don't work with you. Because I'm likely to be asked to figure it out at some point, and I don't want to maintain crap like that.
Perhaps I didn't explain myself well. I also noticed that my spacing got messed up. Each SQL command (create, update, insert, select, from, join, on, and, where, group by, etc...) is on its own line, followed by the portion relevant to it. My SQL is still formatted, but the preference is towards putting each "section" of the statement on its own line, rather than putting each element on a separate line.
For example (I've added \ as a line break marker in case those get lost again):
select \
a.*, \
b.column1, \
b.column2, \
b.column3 \
from \
my_table a \
left join \
my_other_table b \
on \
a.col = b.col \
and \
a.col2 = b.col2 \
where \
a.col = condition \
group by \
column1, column2 \
;
vs.
select a.*, b.column1, b.column2, b.column3 \
from my_table a \
left join my_other_table b \
on a.col = b.col \
and a.col2 = b.col2 \
where a.col = condition \
group by column1, column2 \
;
I'll take the 2nd approach any day, particularly when I have statements above and below that reference that statement, because I can more easily understand the context of the entire script. I also prefer it because, in my mind, there is greater continuity: my_table is related to from, so it makes sense that it should follow it. I read left to right, and don't need to drop down to a new line.
That is more reasonable than what I thought you were saying.
However what happens when your list of columns is long? What happens if you want to include a CASE statement in a field?
I use vim and personally solve the scrolling problem with :split. This is particularly important in making sure that the SELECT and GROUP BY match up. (I am perpetually annoyed that the GROUP BY is not inferred from the SELECT. Unfortunately multiple databases have invented different inconsistent behavior for missing stuff in a GROUP BY, so there is absolutely no possibility of getting agreement on the convenient default of grouping on all non-aggregate functions that appear in the SELECT and HAVING clauses.)
When my list of columns is long, I split it up into multiple lines, usually 7-8 per line (where I work, column names are capped at 30 chars, hence the 200 character estimate), unless it is a case statement. In case statements, each case/when clause gets its own line, so you'll have:
case when condition then result \
when condition_2 then result_2 \
...\
when condition_n then result_n end as column_name \
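For concreteness, here is a runnable version of that CASE layout, using sqlite and a made-up orders table (the bucket names and thresholds are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(5,), (50,), (500,)])

# Each when clause on its own line, with the terminating
# "end as column_name" on the last one.
rows = conn.execute("""
    SELECT case when amount < 10  then 'small'
                when amount < 100 then 'medium'
                else 'large' end as size
    FROM orders
    ORDER BY amount
""").fetchall()
# rows == [('small',), ('medium',), ('large',)]
```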
All of my queries and statements follow the same logic and set of rules, I just prefer a more compact view than most it seems (which could indicate that I'm optimizing for a different set of constraints/preferences).
I'll have to look into using :split.
Honestly I don't think about formatting that much anymore, as I've started storing most of my statements as "metrics" which can be easily repurposed for use in another script. Each metric contains the name of the metric, the columns added, the metric source (usually a table, but can also be a select stmt), the possible join conditions, extra SQL like where, group by, qualify, and any indices. When I want to use that metric in another script, I can do so by specifying the metric name, the table or select statement I'd like to append the metric's columns to, and my join conditions. It works great for simple to moderately complex queries, and provides you a good starting point for really complex queries. Plus, it auto-formats everything how I like it. I also use comments to explain what the statement is doing, so someone theoretically should be able to get a pretty good idea of what's going on just by reading that.
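A minimal sketch of what such a metric record might look like, in Python. All names here (Metric, render_join, the example tables) are hypothetical, not the commenter's actual system; the point is just that a metric carries enough metadata to render itself as a join against any base source.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str          # common name of the information being added
    columns: list      # columns this metric adds
    source: str        # a table name, or a parenthesized SELECT
    join_cols: list    # columns it can be joined on
    extra_sql: str = ""  # optional WHERE / GROUP BY / QUALIFY

    def render_join(self, base_table: str, join_on: list) -> str:
        """Render SQL appending this metric's columns to base_table."""
        cols = ", ".join(f"b.{c}" for c in self.columns)
        conds = " AND ".join(f"a.{l} = b.{r}" for l, r in join_on)
        sql = (f"SELECT a.*, {cols}\n"
               f"FROM {base_table} a\n"
               f"LEFT JOIN {self.source} b\n"
               f"  ON {conds}")
        if self.extra_sql:
            sql += "\n" + self.extra_sql
        return sql

phone = Metric("phone_info", ["phone_number"], "user_phones", ["user_id"])
sql = phone.render_join("my_table", [("user_id", "user_id")])
print(sql)
```

Because the metadata is structured, searching metrics by name, column, source, or join condition falls out almost for free.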
I really only think about formatting when I'm reading others' code and trying to make sense of it, which is where I find the formatting like the author mentioned most annoying/frustrating (perhaps because it's different than how I think/do things?).
My personal experience when I tried to clump things together is that I lost track of what columns were where. My eyes scan vertically really fast. And once I was used to it for large queries, well, what works for large queries works for small as well if you're used to it.
I do something different that you might like to borrow. If you're working in a database with real tmp tables (PostgreSQL, MySQL and MS SQL all do, Oracle does not) you can control execution plans for complex queries by creating a series of tmp tables in order, adding indexes if you want, then joining them down the chain. I initially did this for performance reasons, but quickly found that I could easily reuse the logic to create specific tmp tables in different places, and that a series of tmp tables was much more readable than complex subqueries.
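The chain-of-tmp-tables idea can be sketched with sqlite temp tables (the events schema and step names are made up for illustration): each logical step is materialized and optionally indexed, and the next step joins the previous one instead of nesting a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, kind TEXT);
    INSERT INTO events VALUES (1,'click'),(1,'buy'),(2,'click');

    -- Step 1: materialize the first step as a temp table and index it.
    CREATE TEMP TABLE clicks AS
        SELECT user_id, COUNT(*) AS n_clicks
        FROM events WHERE kind = 'click' GROUP BY user_id;
    CREATE INDEX temp.idx_clicks ON clicks(user_id);

    -- Step 2: the next step joins the previous temp table down the chain.
    CREATE TEMP TABLE buyers AS
        SELECT c.user_id, c.n_clicks
        FROM clicks c
        JOIN events e ON e.user_id = c.user_id AND e.kind = 'buy';
""")
rows = conn.execute("SELECT user_id, n_clicks FROM buyers").fetchall()
# rows == [(1, 1)]
```

Temp tables in sqlite, as in PostgreSQL/MySQL/MS SQL, are private to the connection and need no special permissions, which is exactly the property the comment below contrasts with Oracle.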
You may wonder why I specifically dissed Oracle, after all they claim to have tmp tables. They do... but all tmp table definitions live in a global namespace that requires special permissions to write to, and are shared across all database handles. All of the other databases that I named allow you to create/destroy them on the fly without special permissions, and have no issue if two database handles want to use the same tmp table name with different definitions.
This opens you up for all sorts of nasty interaction problems. And increases the cognitive overhead of the technique to the point where it becomes not worthwhile.
I work mainly in MySQL and Teradata (which has tmp tables which are called volatile tables), and I do exactly what you describe when creating complex queries. My metric system is just a way to build those temp tables more rapidly.
This retrieves the metric, and returns a string of (Teradata) SQL in the form:
CREATE VOLATILE MULTISET TABLE add_metric_name AS (
SELECT a.*, b.cols_added_1, b.cols_added_2,...,b.cols_added_n
FROM my_other_table a
LEFT JOIN join_src b
ON a.my_join_condition_1 = b.join_col_1
AND a.my_join_condition_2 = b.join_col_2
...
AND a.my_join_condition_n = b.join_col_n
[If there is extra sql, like where a.condition = X, or group by's like group by 1, 2, it would show up here. SQL here can reference the join columns and table name in an add_metric stmt]
) WITH DATA PRIMARY INDEX(indice_1, indice_2) ON COMMIT PRESERVE ROWS;
I can also store entire create tmp table chains as metrics, with the last table appending all of the information from that chain to another source (I do this by storing the chain as preparatory SQL, which is run before the create add_metric_name statement).
It also allows me to search all of my metrics on a number of different dimensions: the common name of the information I am adding (metric name), column names, tables, join conditions (particularly useful - It helps you map how you'll get from one metric to another), indices, or any combination of the above. For example I can find all metrics that have the word phone in the metric name and are joinable on user_id.
I'm aware of Oracle's lack of tmp tables. My fiancée has to use Oracle SQL at work, and I quickly discovered its lack of tmp tables when trying to help her solve a SQL issue.
Yup, they look like similar solutions to similar problems.
One of the things that I built into reports at that location was the ability to see all of the tmp tables that had been created, and the ability to stop the report on any particular one and display that. I built this as a debugging aid for myself, but was quite surprised when finance came to me one day and said, "Report X is going wrong on step Y - it looks like you're filtering out duplicate records."
I've started a blog where I post an idea a day, both to help me remember, but also to help train myself to think more creatively. A number of the ideas could function as startups.
A lot of the ideas skew towards tech and software, but some of them are actual products that could be built, and some are just half-baked ideas that might be incrementally useful for an existing product.
That's close, but I think it would still require too much clicking, and that the interface could be drastically improved. A lot of the actions I do are repeated, so I am trying to figure out a way to minimize the typing and clicking I do. However, I don't know which actions I will be doing at any given time, so I can't script them.