How I write SQL (craigkerstiens.com)
51 points by rachbelaid on May 1, 2013 | hide | past | favorite | 83 comments



A neat trick for writing SQL that I learned while working with Oracle consultants:

    SELECT field1
    , field2
    , field3
    , another_field
    FROM ...
By placing the comma at the beginning of the line, instead of at the end, we can very easily reorganize the sequence without fiddling with commas most of the time:

    SELECT field1
    , field3
    , another_field
    , field2
    FROM ...


Along the same lines, it makes it easier to comment out items when debugging, especially if you add in tricks like SELECT 0.

For example multiple lines in this query would cause an error:

    SELECT a.field1,
      a.field2,
    --  b.field1,
    --  b.field2,
      c.field1,
    --  b.field3
    FROM a,
    --  b,
      c
    WHERE --a.field1 = b.field1
      AND a.field2 = c.field1
This query would not cause an error (note that DUAL is a dummy table in Oracle that contains one column and one row):

    SELECT 0
    , a.field1
    , a.field2
    --, b.field1
    --, b.field2
    , c.field1
    --, b.field3
    FROM dual 
    , a
    --, b
    , c
    WHERE 0=0
    --  AND a.field1 = b.field1
      AND a.field2 = c.field1
It does add a 0 column to the result set but that can be dealt with or removed after development.
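For anyone who wants to try this, here's a minimal runnable sketch of the trick using Python's sqlite3 (the tables and columns are made up to match the example above; SQLite has no DUAL, so a one-row subquery stands in for it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (field1 INTEGER, field2 INTEGER);
    CREATE TABLE c (field1 INTEGER);
    INSERT INTO a VALUES (1, 10);
    INSERT INTO c VALUES (10);
""")

# With SELECT 0, 0=0, and the DUAL stand-in anchoring each clause,
# any ", column", ", table", or "AND ..." line can be commented out
# without touching its neighbours.
query = """
    SELECT 0
    , a.field1
    --, a.field2
    , c.field1
    FROM (SELECT 0) AS dual
    , a
    , c
    WHERE 0=0
      AND a.field2 = c.field1
"""
rows = con.execute(query).fetchall()
print(rows)  # [(0, 1, 10)]
```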


I do something similar:

    select 
          site.name
        , site.plan
        , site.status
        , domain.domain
        , subscription.updated_at
    from
        subscription
        inner join site on subscription.site_id = site.id
        inner join domain on domain.site_id = site.id
    where
        subscription.status = 'inactive'
        and subscription.plan = 'pro'
        and domain.created_at <= '2012-04-01'
    order by
        subscription.updated_at desc
When I was writing a lot of SQL I found this a lot easier to work with (when you're commenting things out a lot during development, for example). Also, I'm a geek, so I like stuff lined up.

I understand the arguments for putting commas at the end of the line. For example, Google says in its style guide that when reading, having the comma at the end alerts the reader to the continuation on the next line. In languages that allow trailing commas (Python, PHP, C#, etc.) that's what I do. I put them at the front otherwise.

It's certainly not worth an overreaction though. Having commas on either end of the line isn't going to kill anyone. I've written 1000s of lines of sql in both styles and the world hasn't imploded yet.

As always, be consistent with the surrounding code - that's more important than any other rules (especially ones with a purely subjective basis).


By putting the commas at the beginning, you can easily reorganize the sequence without messing with the commas for all but the first element. By putting the commas at the end, you can easily reorganize the sequence without messing with the commas for all but the final element. I do not see how this is any different. Maybe the description should be more of the form "I can append entries to the end by simply adding a single line"?


For the WHERE clauses one can use:

  WHERE 1=1
  AND ...
  --AND ...
  AND ...
I do not like having commas at the beginning of the lines though.


> WHERE 1=1

ORM frameworks frequently use this when they create SQL.
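As a sketch of why generators do this (hypothetical helper, not taken from any particular ORM): starting from WHERE 1=1 lets every condition be appended uniformly as "AND ...", with no special case for the first one.

```python
def build_query(table, conditions):
    """Naive illustration of WHERE 1=1 in generated SQL; a real ORM
    would use bind placeholders and proper escaping."""
    sql = f"SELECT * FROM {table} WHERE 1=1"
    for cond in conditions:
        sql += f" AND {cond}"
    return sql

print(build_query("site", []))
# SELECT * FROM site WHERE 1=1
print(build_query("site", ["status = 'active'", "plan = 'pro'"]))
# SELECT * FROM site WHERE 1=1 AND status = 'active' AND plan = 'pro'
```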


This reminds me about something that I like in C#, you can leave an extra comma at the end of a sequence and the compiler doesn't freak out on you.


Python is the same way - it lets you have trailing commas for lists, tuples, and dictionaries.
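A quick illustration, with one caveat about tuples worth knowing:

```python
# Trailing commas are legal in list, dict, and set literals, so lines
# can be reordered or commented out without touching their neighbours.
fields = [
    "field1",
    "field2",
    # "field3",
    "another_field",
]
print(fields)  # ['field1', 'field2', 'another_field']

# Caveat: for a one-element tuple the trailing comma is not optional;
# it is what makes the expression a tuple at all.
print((1,) == (1))  # False: (1) is just the integer 1
```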


JavaScript is almost the same way. It works in some browsers, but not in others. :-(


This exact issue caused us to deploy code that completely broke our application in IE7. There was a trailing comma in a list which all browsers ignore except IE6/7.

To this day we still give that dev (the team lead) a hard time about the trailing comma. We've since implemented mandatory linting on all our JS.


So often I've wished SQL worked like that. Go goes the other way and requires the trailing comma when a multi-line sequence's closing brace is on its own line.


I do the same in C++:

    myClass::myClass(int aa,int bb)
       : a(aa)
       , b(bb)
       , p(new int[aa])
       , c(17)
    {}
Yes, that looks weird at first, but you get used to it (and, to be honest, everything in C looks weird at first).


No, no, no. Comma first variable listing is the worst thing ever in the history of the world.

Having the comma in the proper place has such a negligible cost (oh no I might have to delete the comma on the last element and add it to the formerly-last element!), and having it like this looks so god awful and doesn't really save you anything if you're swapping the first element instead of the last one.


The sensibilities for everything you just said are entirely subjective, and change as you are exposed to them. If you spend all day every day writing SQL, it is different than if you do it a little here and there. The entire codebase of Oracle Apps looks like this, so I found I grew accustomed to it over time and saw the benefits. Editing commas actually can be significant in terms of time and drag (this is even more apparent when editing Erlang).

Being very dogmatic, and in a discussion of ideas saying "that's the worst thing in the world" is easily worse than comma placement. So surely commas as indentation isn't the worst thing in the world.


I do SQL most days. Commas at the end of the line is more advantageous than not, in my book. I find this best:

  select
      Field1,
      Field2,
      AnotherField
  from
      Table1
So clean, so consistent! No all-uppercase that adds almost no benefit at significant cost. Fields lined up for easy reading and editing.

The comma controversy could be eliminated if the SQL standard was like Go's, where even the last line would have a comma.


It's not the worst thing in the world. It's the worst thing in the history of the world. Get it right, gosh.

Seriously, though, hideous syntax is bad. It makes life miserable for everyone who has to work on your code after you. Don't do it, no matter what good reasons you think you have.


"Ugly" is a subjective judgement. "Worst thing in the history of the world" is a subjective judgement, and egregious hyperbole, to boot. Saying that undermines pretty much all the credibility of any rational, non-hyperbolic argument you might also have offered in support of your point.

I find having to re-run my query because I forgot to delete the comma following the penultimate item in the SELECT list when I commented the ultimate one out "ugly".

Additionally, everyone I have ever explained this concept to has said some variation of, "Wow, that's a great idea. Thanks!" No one has ever said, "That's the worst thing in the history of the world" ... until you.

I've been doing database work professionally for a decade now. I've explained this concept to a lot of people.


Chances are if you're explaining this to someone they have less experience than you so they wouldn't be in a position to tell you that this is a bad practice.

Like I said, objectively, you're just moving the extra work from moving around the last element to moving around the first element. You're not really saving any time. Plus even if you accept the argument that you are saving time (pretend you live in a world where you never move the first element) the time you save is negligible.

If you hadn't resorted to this ugly syntax, you should have developed habits that lead you to fix your commas whenever you move the last element around.

Obviously you can do what you want, but if you were on my team, I would not allow this syntax.


> Like I said, objectively, you're just moving the extra work from moving around the last element to moving around the first element.

Except that moving the last element (more specifically appending after it) is by far the more common operation. And when you mess up on the first element, it is visually obvious that you have done so.

> If you hadn't resorted to this ugly syntax, you should have developed habits that lead you to fix your commas whenever you move the last element around.

I worked with SQL for years, and never developed the habit reliably enough. Once I was introduced to the idea of the leading comma, I noticed the difference.

> Obviously you can do what you want, but if you were on my team, I would not allow this syntax.

Then let's both be glad that I'm not on your team, because there are some co-workers that I don't want to have to put up with.


To speak frankly, you seem pretty belligerent in your crusade against the comma-firsters. For whatever reason, this is apparently holy war territory for you, and you're willing to make assumptions about me as a person, and my abilities as a technologist, based on where I put a goddamned comma.

Thanks for that.


My experience.

Changed formatting is a shock the first time. But it does not take long to retrain yourself to find it aesthetically more pleasing.

When you're making a quick fix to a batch job that does not hit the problematic query for 30 minutes, the cost of messing up the comma is 30 minutes. (Yeah, I know, it all should be properly factored out, and unit tested. But there is a lot of lightly tested code in the real world that behaves just like I said.) Because SQL is stored as plain text, there is real value in making your mental compiler have to do no work to notice stupid syntactical stuff.


Couldn't you use an IDE that understands SQL? IntelliJ does this - it recognises the SQL dialect you're using and can even validate against the tables & columns in your schema.


Discussing IDEs is a different religious war. Outside of the Java world, people are not so wedded to IDEs.

Going back to my last job that was mostly SQL, I was working in Perl, with SQL embedded in TT templates. Good luck finding an IDE that understands that.

In that job the comma placement was critical. Which column would be the last column would depend on what options the report was being generated with. The point is that people would ask for many variations on the same report. One person would want a top level view, another would want to break out event type, another would want to break it out by which type of advertising campaign was thrown at it - and all were being generated out of the same templated SQL so I didn't have to maintain code in parallel. But the result is that in the code I couldn't make assumptions about which column was going to wind up being last.

Thanks to leading comma formatting, it was sufficient to require that the FIRST field stayed the same. Which proved to never be a problem.


That's a methodology problem you're describing.

The solution should not be hideous syntax.


There are a ton of companies out there doing interesting things that are willing to pay well for working on code that has that methodology problem. But they are not willing to undertake a full rewrite of their system at the outset before you can start doing anything useful.

At that point you can choose which is more important - the religious war or getting on with your life. When I encounter things like that my vote is a bit of both. I bring habits that allow me to cope with the methodological problem, and incrementally try to create a space where I don't have a problem.

In the process I've found that my preference for trailing commas was a useless idiosyncrasy that I could live without. When I tried living without it, I quickly found that leading commas made my life more pleasant, and I learned to like that formatting style.


I don't like having comma first listings, but it makes it incredibly easy to comment out parts of a query. I find commenting out parts of a query essential when trying to debug really long, complex queries.


Not sure why the strenuous objection to the comma first. I too code that way and a couple of SQL formatting tools such as http://poorsql.com/ have the comma first as the default setting. An added advantage to the one mentioned by the original comment is being able to utilise column editing features.


Why is "after the element" the proper place for a comma?


That's what I was wondering. The comma is just a separator. Plus, it's only really needed BEFORE each field, not after the preceding field.


That's where it goes in English.


It's probably more pleasant and easier to read for the next person.


(I can't reply to my reply)

Comma first seems more consistent to me... unless you write your WHERE clauses like this:

  WHERE a=b AND
  a=d AND
  e=f


The same reason that you should put spaces before and after your logical operators. The same reason you should use consistent spacing on the left and right parenthesis. The same reason you should use caps for the SQL and lower case for all table and field names.

Because it makes your SQL more legible.

In English, commas go after the previous word, not before the next word. The convention is the same in every programming language. The fact that most compilers and interpreters don't require it is not an excuse for abusing it.


I disagree with you regarding legibility. I've been using commas like this in SQL for 10 years and I find it to be much more legible and functional this way.

Besides the subjective argument of legibility, what other benefits does it have?

Having a comma preceding something is an indicator that it is one of many. When debugging you can easily comment out the column, or table from a query.


Again, you can comment out lines with the normal syntax too. There's no reason for this gimmick syntax. In theory it makes the last 2 elements easier to deal with but it makes the first two harder. It's silly, and it's ugly.


Lowercase for table names? I think table names should be in PascalCase.

And I personally prefer the rest of SQL to be in lowercase, so it does not attract too much attention. Usually in SQL the table names and column names are the most important information.


My personal style diverges considerably.

First, most of my SQL scripts are multiple statements, typically 6+, ranging up as high as 100. When I'm reading and trying to digest such scripts, the long format described by the author, particularly putting each column on its own row, makes it difficult to easily digest the script. I'm forced to scroll constantly to make sense of the statements in relation to one another. Instead, I prefer my SQL to be more compact, so I can view as much of the total script as possible.

To accomplish this, and to maintain readability, I structure each statement something like this:

  CREATE/INSERT/UPDATE line
  SELECT line
  FROM line
  JOINS (if present)
  ON (if present)
  WHERE (if present)
  AND (if present)
  Additional SQL (qualify, group by, order by)
  ;

If I have a lot of columns, the select portion will get split into several lines, usually where there is a case when, column operations, or when the line is 200ish characters long.

This makes the most sense to me, as each command (create, update, insert, select, from, join, on, where, and, group by) is on its own line, followed by the information relevant to it.


The same argument applies to writing normal code. The semi-colon exists so that you can put more statements on a line, and make your code fit into your editor, right?

You don't agree?

What is the difference between coding in that language and SQL? I submit that it is only the amount of it you write.

I spent several years of my life focusing on reporting, and spending more time writing/maintaining SQL than writing any other language. In that time I discovered that complex SQL queries are a language like anything else. For "hello world" you can get away with anything. But as soon as you are doing complex stuff, the layout matters.

Did you know that it can make a difference whether a condition is in your ON or your WHERE? It can. (Think left joins.) Did you know that the location/order of the ON statements can make a difference? It does. Is it visually obvious where this particular condition is? It should be. Did you know that the order you put things in in your query can have performance impacts? It shouldn't, but it does (particularly for MySQL - MySQL is stupid).

If you've got 200ish character lines and you are unwilling to format, well, I'm glad that I don't work with you. Because I'm likely to be asked to figure it out at some point, and I don't want to maintain crap like that.
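The ON-versus-WHERE point is easy to demonstrate; here's a minimal sketch with Python's sqlite3 (invented tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT);
    CREATE TABLE payments (order_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO payments VALUES (1, 50);
""")

# Condition in the ON clause: unmatched left rows survive, NULL-padded.
on_rows = con.execute("""
    SELECT o.id, p.amount
    FROM orders o
    LEFT JOIN payments p ON o.id = p.order_id AND p.amount > 10
""").fetchall()

# Same condition in WHERE: the NULL-padded rows fail the test, silently
# turning the LEFT JOIN back into an inner join.
where_rows = con.execute("""
    SELECT o.id, p.amount
    FROM orders o
    LEFT JOIN payments p ON o.id = p.order_id
    WHERE p.amount > 10
""").fetchall()

print(on_rows)     # [(1, 50), (2, None)]
print(where_rows)  # [(1, 50)]
```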


Perhaps I didn't explain myself well. I also noticed that my spacing got messed up. Each SQL command (create, update, insert, select, from, join, on, and, where, group by, etc...) is on its own line, followed by the portion relevant to it. My SQL is still formatted, but the preference is towards putting each "section" of the statement on its own line, rather than each element on a separate line.

For example:

    select
        a.*,
        b.column1,
        b.column2,
        b.column3
    from
        my_table a
    left join
        my_other_table b
    on
        a.col = b.col
    and
        a.col2 = b.col2
    where
        a.col = condition
    group by
        column1, column2
    ;

vs.

    select a.*, b.column1, b.column2, b.column3
    from my_table a
    left join my_other_table b
    on a.col = b.col
    and a.col2 = b.col2
    where a.col = condition
    group by column1, column2
    ;

I'll take the 2nd approach any day, particularly when I have statements above and below that reference that statement, because I can more easily understand the context of the entire script. I also prefer this, because in my mind, there is greater continuity, my_table is related to from, so it makes sense that it should follow it. I read left to right, and don't need to move down to a new line.


That is more reasonable than what I thought you were saying.

However what happens when your list of columns is long? What happens if you want to include a CASE statement in a field?

I use vim and personally solve the scrolling problem with :split. This is particularly important in making sure that the SELECT and GROUP BY match up. (I am perpetually annoyed that the GROUP BY is not inferred from the SELECT. Unfortunately multiple databases have invented different inconsistent behavior for missing stuff in a GROUP BY, so there is absolutely no possibility of getting agreement on the convenient default of grouping on all non-aggregate functions that appear in the SELECT and HAVING clauses.)


Sorry for the confusion.

When my list of columns is long, I split it up into multiple lines, usually 7-8 per line (where I work, column names are capped at 30 chars, hence the 200 character estimate), unless it is a case statement. In case statements, each case/when clause get its own line, so you'll have:

    case when condition then result
         when condition_2 then result_2
         ...
         when condition_n then result_n
    end as column_name

All of my queries and statements follow the same logic and set of rules, I just prefer a more compact view than most it seems (which could indicate that I'm optimizing for a different set of constraints/preferences).

I'll have to look into using :split.

Honestly I don't think about formatting that much anymore, as I've started storing most of my statements as "metrics" which can be easily repurposed for use in another script. Each metric contains the name of the metric, the columns added, the metric source (usually a table, but can also be a select stmt), the possible join conditions, extra SQL like where, group by, qualify, and any indices. When I want to use that metric in another script, I can do so by specifying the metric name, the table or select statement I'd like to append the metric's columns to, and my join conditions. It works great for simple to moderately complex queries, and provides you a good starting point for really complex queries. Plus, it auto-formats everything how I like it. I also use comments to explain what the statement is doing, so someone theoretically should be able to get a pretty good idea of what's going on just by reading that.

I really only think about formatting when I'm reading others' code and trying to make sense of it, which is where I find the formatting like the author mentioned most annoying/frustrating (perhaps because it's different than how I think/do things?).


My personal experience when I tried to clump things together is that I lost track of what columns were where. My eyes scan vertically really fast. And once I was used to it for large queries, well, what works for large queries works for small as well if you're used to it.

> Honestly I don't think about formatting that much anymore, as I've started storing most of my statements as "metrics" which can be easily repurposed for use in another script. Each metric contains the name of the metric, the columns added, the metric source (usually a table, but can also be a select stmt), the possible join conditions, extra SQL like where, group by, qualify, and any indices. When I want to use that metric in another script, I can do so by specifying the metric name, the table or select statement I'd like to append the metric's columns to, and my join conditions. It works great for simple to moderately complex queries, and provides you a good starting point for really complex queries. Plus, it auto-formats everything how I like it. I also use comments to explain what the statement is doing, so someone theoretically should be able to get a pretty good idea of what's going on just by reading that.

I do something different that you might like to borrow. If you're working in a database with real tmp tables (PostgreSQL, MySQL and MS SQL all do, Oracle does not) you can control execution plans for complex queries by creating a series of tmp tables in order, adding indexes if you want, then joining them down the chain. I initially did this for performance reasons, but quickly found that I could easily reuse the logic to create specific tmp tables in different places, and that a series of tmp tables was much more readable than complex subqueries.

You may wonder why I specifically dissed Oracle; after all, they claim to have tmp tables. They do... but all tmp table definitions are in a global namespace that requires special permissions to write to, and are shared across all database handles. All of the other databases that I named allow you to create/destroy them on the fly without special permissions, and do not have an issue if two database handles want to use the same tmp table name with different definitions.

This opens you up for all sorts of nasty interaction problems. And increases the cognitive overhead of the technique to the point where it becomes not worthwhile.
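A small sqlite3 sketch of the tmp-table-chain technique (invented tables; SQLite's temp tables are per-connection, like the PostgreSQL/MySQL/MS SQL ones described):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, kind TEXT);
    INSERT INTO events VALUES (1, 'click'), (1, 'buy'), (2, 'click');
""")

# Stage 1: materialise an intermediate result as a temp table.
con.execute("""
    CREATE TEMP TABLE clicks AS
    SELECT user_id, COUNT(*) AS n_clicks
    FROM events WHERE kind = 'click'
    GROUP BY user_id
""")

# Stage 2: index it (not possible with an inline subquery) ...
con.execute("CREATE INDEX clicks_user ON clicks (user_id)")

# ... then join the next stage against it, down the chain.
rows = con.execute("""
    SELECT c.user_id, c.n_clicks, COUNT(b.kind) AS n_buys
    FROM clicks c
    LEFT JOIN events b ON b.user_id = c.user_id AND b.kind = 'buy'
    GROUP BY c.user_id
    ORDER BY c.user_id
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 1, 0)]
```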


It would seem we're each doing what works for us.

I work mainly in MySQL and Teradata (which has tmp tables, called volatile tables), and I do exactly what you describe when creating complex queries. My metric system is just a way to build those temp tables more rapidly.

I use two main functions to manipulate metrics:

  create_metric(metric_name,{cols_added},join_src,{join_cols},{extra_sql},{indices});
This stores the metric for later use.

  add_metric(metric_name, my_other_table, {my_join_conditions});
This retrieves the metric, and returns a string of (Teradata) SQL in the form:

  CREATE VOLATILE MULTISET TABLE add_metric_name AS (
  SELECT a.*, b.cols_added_1, b.cols_added_2, ..., b.cols_added_n
  FROM my_other_table a
  LEFT JOIN join_src b
  ON a.my_join_condition_1 = b.join_col_1
  AND a.my_join_condition_2 = b.join_col_2
  ...
  AND a.my_join_condition_n = b.join_col_n
  [If there is extra SQL, like WHERE a.condition = X or GROUP BY 1, 2, it would show up here. SQL here can reference the join columns and table name in an add_metric stmt.]
  ) WITH DATA PRIMARY INDEX(indice_1, indice_2) ON COMMIT PRESERVE ROWS;
I can also store entire create tmp table chains as metrics, with the last table appending all of the information from that chain to another source (I do this by storing the chain as preparatory SQL, which is run before the create add_metric_name statement.)

It also allows me to search all of my metrics on a number of different dimensions: the common name of the information I am adding (metric name), column names, tables, join conditions (particularly useful - It helps you map how you'll get from one metric to another), indices, or any combination of the above. For example I can find all metrics that have the word phone in the metric name and are joinable on user_id.

I'm aware of Oracle's lack of tmp tables. My fiancée has to use Oracle SQL at work, and I quickly discovered its lack of tmp tables when trying to help her solve a SQL issue.


Yup, they look like similar solutions to similar problems.

One of the things that I built into reports at that location was the ability to see all of the tmp tables that had been created, and the ability to stop the report on any particular one and display that. I built this as a debugging aid for myself, but was quite surprised when finance came to me one day and said, "Report X is going wrong on step Y - it looks like you're filtering out duplicate records."

I like having users that will debug my stuff. :-)


My personal style has a lot of similarities, but with some glaring differences. The biggest is that I put the comma in front of the next item, rather than trailing the one before. What this means is that when I add a new thing to the list of columns, I'm less likely to leave one out.

See http://bentilly.blogspot.com/2011/02/sql-formatting-style.ht... for what this looks like in practice.


I put the commas in front as well, but the reason I started doing it that way is that you can then comment out (and uncomment) any column without breaking the query.


I use this formatting style as well - it's also easier to format as the columns can all be aligned with tabs.


You also write FROM/JOIN/ON indentations like I do.


Enjoy. A CTE, Xpath, and Oracle-specific hierarchical bits for a data report I'm generating. My favorite part was the XML in a CLOB column where all the values were stored in XML element attributes. I tend to build these monstrosities up in a very repl-like way. SQL actually led me to Lisp . . .

BTW, I can't imagine any of the below is legible, but I find myself enjoying these types of things.

  with bgu as 
  (select
    protocol_code,
    application_code,
    description,
    lab,
    proj."seqno",
    proj."project",
    proj."protocol",
    'false' "nexttag",
    'true' "materialtag"
  from
    protocol,
    xmltable('for $a in /*
                 for $n at $nidx in //*/MATERIAL
                 return <result>
                 <project>{fn:data($n/@project)}</project>
                 <protocol>{fn:data($n/@protocol)}</protocol>
                      </result>'
             passing xmltype(protocol.xml)
             columns
                 "seqno" for ordinality,
                 "project" varchar2(100),
                 "protocol" varchar2(100)) proj
  where
    active = 'Y'
  union
  select
    protocol_code,
    application_code,
    description,
    lab,
    proj."seqno",
    proj."project",
    proj."protocol",
    'true' "nexttag",
    'false' "materialtag"
  from
    protocol,
    xmltable('for $a in /*
                 for $n at $nidx in //*/NEXT
                 return <result>
                 <project>{fn:data($n/@project)}</project>
                 <protocol>{fn:data($n/@protocol)}</protocol>
                      </result>'
             passing xmltype(protocol.xml)
             columns
                 "seqno" for ordinality,
                 "project" varchar2(100),
                 "protocol" varchar2(100)) proj
  where
    active='Y'
  order by 1,2,4 )
  select 
    * 
  from 
    (select distinct 
          protocol_code,
          sys_connect_by_path(protocol_code,'||||'),
          connect_by_isleaf "isleaf"
    from bgu
    where protocol_code like 'PROTNAME%'
    connect by nocycle prior "protocol" = protocol_code)
  where "isleaf"=1


I can't stand the XMLTable syntax. I like the idea of turning an xpath expression into a result set, but in practice it's a massive pain in the ass.


XML and Oracle drive me crazy in general. You wind up with neither a relational model nor a generally useful way of data storage/exchange.

I am almost always working with legacy schemas, many generated by long-extinct tools or developers, and being a genuinely curious person, wandering down strange code paths to see if I can satisfy random reporting requests.


I'm just biding my time until someone makes a "Daily WTF" website for monster queries. My worst was this sync that took data from one database and put it into another one. The thing, though, is the source database was a CRM system that is on its 3rd iteration. So there's a million random fields from old systems that have accumulated over the years that need to be checked.


http://oracle-wtf.blogspot.com/

This might have been what you're looking for if it was still updated. Still some funny stuff there though.


This is a breeze. Don't make me paste some of the wacky SQL I've seen generated from Oracle's BI tools when they are being abused.


I don't write SQL like that. He has a good point about using a line for each AND/OR condition in the WHERE clause and starting each line with the AND/OR (easier to remove).

But I don't think all queries should be written across several lines. If a query is simple enough (e.g. no aggregation, GROUP BY, or HAVING, at most one WHERE condition, and few fields to select), it can be written in one line, like:

SELECT field FROM table WHERE value = @value

That's easy and straightforward to understand and readable (which was initially part of SQL's goals, although we can discuss at a later date how successful that was). Complicated SQL queries - which are bound to happen - follow - in my style - a more straightforward approach:

There are new lines after every grouping (FROM, WHERE, GROUP BY, ORDER BY, HAVING, etc.) and indenting. In fact, I prefer to indent my fields twice, because I consider FROM, WHERE, etc. to be subordinate to SELECT (even if part of an INSERT). I also tend to avoid extra keywords such as 'OUTER' in 'LEFT OUTER JOIN' (there is no such thing as 'LEFT INNER JOIN').

I then follow each JOIN statement (indented compared to the ground table in FROM) with the table name, a new line, an indent, and the ON condition. If another table relies on that table being joined, I indent it further to indicate that, so I can quickly look at my code and see which tables go through which (useful if you have a lot of relationship tables).

It should be noted, however, that I most commonly write against Sybase at work, which tends to have no preference - style-wise - on whether you write the keywords in upper case. My co-workers do not, so I don't either. But when I write my own projects against MySQL/MariaDB, I use upper case. But I generally follow the same style.
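Pulling that description together, here is one plausible reading of the layout, sketched in Python so it runs against sqlite3 (the tables are invented): keywords on their own lines, fields double-indented, each ON on the line after its join, and the regions join indented further because it can only be reached through customers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, region_id INTEGER);
    CREATE TABLE regions (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10);
    INSERT INTO customers VALUES (10, 100);
    INSERT INTO regions VALUES (100, 'EU');
""")

# The layout, not the data, is the point here.
query = """
    select
            o.id,
            r.name
    from
        orders o
        join customers c
            on c.id = o.customer_id
            join regions r
                on r.id = c.region_id
    where
        r.name = 'EU'
"""
rows = con.execute(query).fetchall()
print(rows)  # [(1, 'EU')]
```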


If I'm doing a simple query, I still structure it as

  select blah, blha, blah
  from dual
  where ...
I find that even for short ones, lines quickly get very long and illegible if you're smushing them all onto one line.


Please, please, please don't perform your JOINs with comma-separated items in the FROM clause. As soon as you also need to do an OUTER JOIN, you're going to get unexpected results.

Per the SQL spec, explicit JOIN-style syntax binds more tightly than FROM-list elements, which can not only constrain the query planner in the choices it can make, but can actually yield a different result.


I wonder why so many people hate writing SQL with such a poisonous intensity these days. I remember my database-systems class being one of my favorite classes when I was in school. What's wrong with relational algebra and writing SQL? I always thought they were fun.


I was hoping to see some examples of more complex WHERE clauses. They can get a little messy when you start mixing ANDs and ORs at different levels of nesting.

Also, why are these two CTEs

  wine_tags as (
    SELECT DISTINCT
      unnest(tags) as tag,
      wine_id
    FROM 
      app_rating
    GROUP BY 
      wine_id, tags),

  wine_detail as (
    SELECT
      app_wine.name as name,
      app_wine.id,
      app_winery.name as winery
    FROM
      app_wine,
      app_winery
    WHERE app_wine.winery_id = app_winery.id
   ) 
handling closing parentheses differently? This affects how you organize nested parentheses.


My rule is that if I'm mixing ANDs and ORs in a WHERE, I use parentheses and extra indentation.

Not everyone knows that AND binds more tightly than OR. Nor should they have to.
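For example, against a hypothetical account table:

    SELECT id
    FROM account
    WHERE
        (
            status = 'active'
            OR status = 'trial'
        )
        AND created_at >= '2013-01-01'

Without the parentheses, the AND would grab only the second status comparison.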


I'm not a big fan of this style. It's technically 'organized' (in the sense that there's a definite consistency), but it's vertical to the point of reduced readability: trivial queries can take up dozens of lines.


> trivial queries can take up dozens of lines.

My general rule-of-thumb is that trivial things can have syntax constraints relaxed, as long as it makes them more readable, not less. I may adopt the style of the article for a complex query, but for something simpler I may go with:

    SELECT   foo, bar, baz
    FROM     my_table
    WHERE    foo = 1 AND bar > 10
    GROUP BY bar
    ORDER BY baz
But that becomes less readable once there are more than a couple of fields or conditions on each line. Up to that point, I think it's fairly concise and efficient.

That said, if your SQL is intermixed with code whose sole purpose isn't to deal with that SQL, you are probably in for hurt later on anyway. If it is safely quarantined into data access routines of some sort, the size doesn't really matter, since the function or method containing the query should really be about that query.


The fact that the first thing he does with the tags is unnest them is, IMO, material evidence for a traditional 1NF formulation. It's worth considering that using arrays is a violation of first normal form. That's a good indicator of how obvious Codd et al. thought this rule was.

Other than that, I use a variant of this style, where I put things on one line if possible (especially GROUP BY and ORDER BY). And I tend to line up my joins like so:

    FROM foo f
    JOIN bar b ON b.foo_id = f.id
    LEFT JOIN baz ON ...


I would tend to normalise arrays too, but I can see the argument that when getting an 'app' and turning it into an object in $programming_language (say), it's rather a pain to have to go through the sql result set building up an object from the denormalised data (and arguably results in a lot of wasted data being transmitted). It gets particularly bad if you have more than one 1-many relationship on the object you're trying to build up.


If you're using an ORM, and you almost certainly are, it's no different from all the other work you do to map your objects. If you're querying directly, array_agg on the way out is the same work as unnest is on the way in. You can have your cake and eat it too. Tagging has been discussed a lot recently as a good excuse to use arrays in Postgres, but since the whole point of tagging is to enhance searching, I think it really misses the point to do it this way. It appeals to our intuitions only because we think of rows as being large and string arrays as being small, but in practice this sort of intuitive "optimization" is almost always wrong.
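A sketch of both directions, assuming a normalized wine_tag table alongside the article's app_rating:

    -- On the way out: collapse normalized tags into an array per wine
    SELECT wine_id, array_agg(tag) AS tags
    FROM wine_tag
    GROUP BY wine_id;

    -- On the way in: explode an array column into normalized rows
    INSERT INTO wine_tag (wine_id, tag)
    SELECT DISTINCT wine_id, unnest(tags)
    FROM app_rating;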


> If you're using an ORM, and you almost certainly are

I'm not a web developer - I spend more of my time in raw-sql-land, so this discussion is largely academic to me :-). That said, even if you're using an ORM, it's presumably not exactly free for the ORM to produce an object out of the thousands of distinct rows you can end up getting if you have multiple one to many relationships.

In general, the availability of arrays for the purposes of data transport - less so actual storage - in postgres is rather helpful for these sort of problems. I find working with other DBs to do object storage quite constricting by comparison.


Without some numbers I see no reason to believe that Postgres would be more efficient at transporting arrays than rows. One of those is a highly optimized use-case that literally everybody relies on every day. The other one is a corner case. I'm not saying it's impossible but I'm going to need more than a good feeling to buy it.


Perhaps my wording was unclear - I'm not talking about postgres' efficiency, but efficiency of processing the data that postgres produces.

Consider the example of an ORM encoding an object with three multi-valued properties, each with 100 values. One encoding is a single row with three arrays in it; the other is the result set from joining the three separate tables. In the end, it's quite plainly obvious that it's going to be cheaper (and, frankly, easier) to construct the object from a row with three 100-element arrays in it than from a million-row (100 x 100 x 100) result set. Yes, this is absolutely a corner case (particularly with that many multi-valued props of such cardinality), but I'm not really sure why it requires numbers to prove - unless there's some kind of JDBC/ODBC technique for dealing with this situation that I'm simply unaware of?
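A sketch with hypothetical person/email/phone/address tables - the naive triple join multiplies the row count, while correlated array_agg subselects return one row per person:

    -- 100 x 100 x 100 = 1,000,000 rows per person
    SELECT p.id, e.address, ph.number, a.city
    FROM person p
    JOIN email e ON e.person_id = p.id
    JOIN phone ph ON ph.person_id = p.id
    JOIN address a ON a.person_id = p.id;

    -- One row per person, three arrays of 100
    SELECT p.id,
        (SELECT array_agg(e.address) FROM email e   WHERE e.person_id = p.id)  AS emails,
        (SELECT array_agg(ph.number) FROM phone ph  WHERE ph.person_id = p.id) AS phones,
        (SELECT array_agg(a.city)    FROM address a WHERE a.person_id = p.id)  AS cities
    FROM person p;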


I almost agree with you, but I put all reserved words (including ON) on a new line.


What it comes down to is: use any type of formatting as opposed to no formatting. Too often people crank out code, don't comment it and don't format it. This applies to every language. As far as SQL syntax goes, what irks me are:

* No use of spaces.

* Obscure use of aliases. If you have 4 tables all aliased as a,b,c,d then you are making life hard for someone else.

* Not capitalizing operations or reserved words.


I don't often deal with SQL and when I do, it is usually quite simple. But, I think there is actually an error in the first step:

   SELECT 
     avg(rating),
     wine_id
   FROM 
     app_wine
   GROUP BY
     wine_id;
The query should be against the app_rating table. It does get corrected later on, but I was confused for a while.
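With that fix, the first step would read:

    SELECT
      avg(rating),
      wine_id
    FROM
      app_rating
    GROUP BY
      wine_id;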


One thing I wish more people would do is to indent joins appropriately; e.g.

    ...
    Inner join a on a.... = main..
      Inner join b on b... = a...
        Inner join c on c... = b...
      Inner join x on x... = a...
The syntax of SQL is a real basket case; readers need all the help you can give them.


Whenever I see that, it takes me too long to tell it apart from a bunch of random indentation. Whether this indentation scheme is needlessly time-consuming or genuinely necessary depends a lot on the type of query and the underlying schema.


Postgres's WITH ... SELECT is very handy, but in this case I would create three views instead. It seems that wine_ratings, wine_tags and wine_detail could be used in other queries, and having them as views would prevent duplicating their code.
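For instance, the article's first CTE as a standalone view:

    CREATE VIEW wine_tags AS
        SELECT DISTINCT
            unnest(tags) AS tag,
            wine_id
        FROM app_rating
        GROUP BY wine_id, tags;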


I follow a similar style as well. One of the things I love about SQLyog is that it has a 'format query' button that will make any query readable.

It's a huge help when analyzing bad queries that New Relic might spit out.


I have developed a very similar style, for the same reasons. It's funny to see this blog post, because I have gobs of SQL just like it on my other monitor.



I really like the WITH statement in Postgres.

It allows you to break down the logic and compose the result.

Any idea if it's standard SQL?


It's part of the ANSI SQL:1999 standard: http://en.wikipedia.org/wiki/SQL:1999


[strikeout, see comment above, it's in SQL99]

MSSQL has a similar construct with Common Table Expressions (CTEs) though, so it's not unique to PG.


Another tip for using WITH on MSSQL - prefix it with semicolon like this: ;WITH

MSSQL makes it so that people pick up bad habits over the years, and often you'll find missing semicolons at the end of statements. While the parser will intelligently handle missing semicolons for the majority of scripts, it will bomb 100% of the time (at least up to 2008R2) on WITH.


Thanks for the details.


I recently dealt with a lot of this while discussing/resolving some of these issues at https://github.com/jdorn/sql-formatter



