Show HN: IMDb SQL Best Movie Finder

dewey · 2024-11-05T10:19:22 1730801962

The raw data is available here in case you are curious: https://datasets.imdbws.com

cube2222 · 2024-11-05T10:30:29 1730802629

It's so nice that they make this data available in such a fashion.

That said, it would be amazing to have (anonymized) data about raw ratings per person, and be able to build your own personal recommenders based on that... One can dream!

account42 · 2024-11-05T11:15:54 1730805354

> It's so nice that they make this data available in such a fashion.

Is it, or are they just trying to preempt a more open alternative?

dewey · 2024-11-05T11:22:00 1730805720

It always seems more like a forgotten artifact on their end that is used by universities and for research. They are not really a competitor to the open platforms, as contributing on IMDb is very limited.

Cyph0n · 2024-11-05T14:03:54 1730815434

TMDB, TVDB, AniDB & Anilist (anime)

croisillon · 2024-11-05T12:46:40 1730810800

what are open competitors to IMDb?

adren123 · 2024-11-05T15:54:04 1730822044

Wikidata.org, the database behind Wikipedia, is a good alternative as a general database that includes films and related crew. See, for instance, Avatar (2009): https://www.wikidata.org/wiki/Q24871

It cannot be queried in SQL, though; it uses SPARQL instead: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service

miah_ · 2024-11-05T13:54:49 1730814889

I really like imcdb, or The Internet Movie Car Database. Want to know what cars were in a movie? Check imcdb!

https://www.imcdb.org/

croisillon · 2024-11-05T18:21:27 1730830887

haha, thanks, we need a movie database database!

dewey · 2024-11-05T12:51:27 1730811087

TMDB and many others, some specific to Anime or other types of content.

plopz · 2024-11-05T14:43:15 1730817795

You should check out movielens, they have a couple recommenders as well as some data sets you can download.

bambax · 2024-11-05T11:11:31 1730805091

Interesting! But it seems these datasets don't contain the description of each movie? Do you know where tools such as Jellyfin or Plex get them? Do they simply scrape the website?

jam3sn · 2024-11-05T11:16:51 1730805411

The Movie Database (TMDB) is a good resource for this. Likewise, TVMaze offers some details / media missing from TMDB for TV shows.

https://www.themoviedb.org/ https://www.tvmaze.com/

indigodaddy · 2024-11-05T12:38:22 1730810302

I found OMDB to be the best for my spaghetti of code

https://www.omdbapi.com/

https://github.com/jgbrwn/my-upc/blob/main/app.py

https://my-upc.com

jam3sn · 2024-11-05T13:57:42 1730815062

Not heard of OMDB before, thanks, will give it a go

JamesSwift · 2024-11-05T14:18:48 1730816328

By default, Jellyfin uses TMDB (TheMovieDb) and OMDB (The Open Movie Database) as its metadata providers

GlumWoodpecker · 2024-11-05T13:52:16 1730814736

They have an official advanced search as well:

https://www.imdb.com/search/title/

nzealand · 2024-11-05T15:36:32 1730820992

Try a Bayesian sort.

You will get a better result.

SELECT * EXCLUDE (titleType, primaryTitle, language) FROM 'imdb01-11-2024.parquet' WHERE (region is null and titleType = 'movie') ORDER BY ( (numVotes * averageRating)+(1000007))/ (numVotes+1000000) desc LIMIT 100

afiodorov · 2024-11-05T16:19:36 1730823576

That's awesome. Care to explain? Looks like Laplace estimator of some kind?

P.S. Query sharing is supported:

https://www.imdb-sql.com/?query=SELECT+*+EXCLUDE+%28titleTyp...

afiodorov · 2024-11-05T16:43:42 1730825022

Ah, the 7 might have been a typo. Maybe it was meant to be *7? Then that's exactly the Laplace estimator: we start with a prior of 1_000_000 votes of 7.0 and then adjust as we accumulate real votes.

nzealand · 2024-11-05T21:06:59 1730840819

Yup. The correct query is:

SELECT * EXCLUDE (titleType, primaryTitle, language) FROM 'imdb01-11-2024.parquet' WHERE (region is null and titleType = 'movie') ORDER BY ( (numVotes * averageRating)+(100000 * 7))/ (numVotes+100000) desc LIMIT 100
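To see what the ORDER BY expression in the corrected query does, here is a minimal Python sketch of the same formula. The prior of 100000 votes at 7.0 comes from the query above; the function name and the three sample titles with their vote counts are made up purely for illustration.

```python
def bayesian_score(avg_rating, num_votes, prior_mean=7.0, prior_votes=100_000):
    """Blend the real votes with a prior of `prior_votes` imaginary votes at `prior_mean`."""
    return (num_votes * avg_rating + prior_votes * prior_mean) / (num_votes + prior_votes)


# Hypothetical titles: (name, averageRating, numVotes)
titles = [
    ("Obscure Gem", 9.8, 50),
    ("Popular Classic", 9.0, 2_000_000),
    ("Mid Hit", 8.0, 100_000),
]

# Sort by the blended score, highest first, as the SQL ORDER BY ... DESC does.
ranked = sorted(titles, key=lambda t: bayesian_score(t[1], t[2]), reverse=True)

for name, rating, votes in ranked:
    print(f"{name}: raw {rating}, adjusted {bayesian_score(rating, votes):.2f}")
```

A title with only a handful of votes stays close to the prior mean of 7.0 no matter how high its raw average is, while a title with millions of votes keeps almost all of its raw rating. That is why the sort pushes tiny-sample outliers down the list.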

roman_soldier · 2024-11-05T11:12:59 1730805179

Nice, I like it, anyway to filter out Indian movies though?

NiekvdMaas · 2024-11-05T14:19:11 1730816351

FWIW this is possible through the imdb.com search interface: https://www.imdb.com/search/title/?title_type=feature&releas...

afiodorov · 2024-11-10T13:12:36 1731244356

This query appears to work as well

https://www.imdb-sql.com/?query=SELECT+*+EXCLUDE+%28titleTyp...

Suppafly · 2024-11-05T18:12:32 1730830352

>anyway to filter out Indian movies though?

I feel like I'd pay money for Netflix to add that option.

kaon_ · 2024-11-05T11:21:26 1730805686

This is cool. I don't watch so many movies anymore, so there must be many good ones I've missed. Is there a functioning natural language query somewhere?

"I want to watch a science fiction movie that is not a super hero movie. Movies I love are Dark City and Daybreakers. Prioritize movies of short length, nothing more than 2h30. Filter out movies I've already rated, here's my Watchlist"

6510 · 2024-11-05T12:28:31 1730809711

You need to watch Just Imagine 1930.

The movie has countless things we know from much later productions and countless things not seen elsewhere.

It takes place in the distant future, 1980, a place where everyone has a flying car, people have numbers instead of names, the government decides who you can marry and babies come out of vending machines.

One IMDB comment described it as: I sat on the tip of my chair wondering if it could get any stranger and then it did!

afiodorov · 2024-11-05T11:23:08 1730805788

It'd be trivial to ask an LLM to generate the query, but since it's a client-side app there's nowhere to store the API key, so each user would have to supply one, which is a bit of an awkward experience.

yoavm · 2024-11-05T11:13:09 1730805189

Love seeing things like this! I think static databases are underrated and we're going to see more of that soon. I'm working on a similar project - https://github.com/bjesus/teatime - for book reading. Any Github repo can be a database source, as long as it contains some JSON configuration file and has the matching repo tags.

smusamashah · 2024-11-05T12:59:44 1730811584

Is it querying an external dataset or is the whole dataset loaded in browser first?

Currently there are only a few columns. Can there be more? For example, I was looking for movies based on age rating (for my kid or myself).

afiodorov · 2024-11-05T13:03:30 1730811810

The whole dataset is loaded: http://www.imdb-sql.com/imdb01-11-2024.parquet (100 MB, 1.6 million titles).

It has these columns: https://www.imdb-sql.com/?query=SELECT+*+FROM+%27imdb01-11-2...

It is produced using https://github.com/afiodorov/imdb-sql/blob/main/notebooks/im...

From this data: https://developer.imdb.com/non-commercial-datasets/

The age rating is not part of the imdb non-commercial dataset.

resurge · 2024-11-05T11:11:51 1730805111

Hah, someone else trying to find some good movies they haven't watched?

I've also been working on something recently to quickly search in the IMDB datasets. No hosted version yet though.

https://github.com/jeroenpelgrims/imdbsearch

Mainly also to learn some rust/htmx.

nasarquis · 2024-11-05T10:31:14 1730802674

Nice. It would be great if you include a filter by apps such as Netflix, Prime and Max

joeyagreco · 2024-11-07T19:31:44 1731007904

I've been looking for something like that dropdown / query builder on the site.. any idea what was used for that?

sarp · 2024-11-07T19:44:02 1731008642

Looks like react query builder: https://react-querybuilder.js.org/

afiodorov · 2024-11-09T15:47:09 1731167229

soco · 2024-11-05T10:26:59 1730802419

Took me a while to discover the SQL can be edited...

afiodorov · 2024-11-05T10:28:28 1730802508

Sorry about that. I prefer for my UI to be discoverable.

P.S. We can share links too https://www.imdb-sql.com/?query=select+1

The SQL flavor is https://duckdb.org/docs/sql/query_syntax/select
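For anyone who wants to build such share links from a script, here is a small Python sketch. It assumes the site simply reads the SQL from the `query` URL parameter with spaces encoded as `+`, which is what the links in this thread look like; the helper name `share_link` is made up, and how the app itself encodes links is not confirmed here.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

BASE_URL = "https://www.imdb-sql.com/"


def share_link(sql):
    """Build a link with the SQL in the `query` parameter (spaces become '+')."""
    # '*' is kept unescaped to match the look of the links posted in the thread.
    return f"{BASE_URL}?query={quote_plus(sql, safe='*')}"


def sql_from_link(link):
    """Recover the SQL text from a share link."""
    return parse_qs(urlsplit(link).query)["query"][0]


link = share_link("select 1")
print(link)
print(sql_from_link(link))
```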

Pikamander2 · 2024-11-05T10:33:45 1730802825

> I prefer for my UI to be discoverable.

I'm not quite sure what you mean by this.

If you gave the SQL a light gray border and perhaps overflow-y:scroll, it would make it much more obvious that it can be edited directly without needing any other adjustments.

afiodorov · 2024-11-05T11:02:33 1730804553

Thanks for the suggestion, will try to improve soon.

If anyone with actual css experience wants to open a PR in the meantime I'll be very happy!

_thisdot · 2024-11-05T12:26:29 1730809589

Taking a look right now. I found the source code from another comment in this HN thread. It would be better if you linked it in the app.

afiodorov · 2024-11-05T12:35:03 1730810103

Shouldn't be hard to find since the repo name on GitHub is the same as the domain :).

I just run npm run dev locally. You'll need the parquet file of the data in public/ though; you can get it from http://www.imdb-sql.com/imdb01-11-2024.parquet

afiodorov · 2024-11-09T13:13:25 1731158005

srmarm · 2024-11-05T10:57:43 1730804263

Yeah, I also struggled with the purple box - I assumed I'd change the options, click run and it'd reload the data accordingly. Almost quit before I realised you have to build and then run. If the changes in the purple box were reflected in the SQL code preview that would be much better.

That being said it's an impressive demo of what the browser can do without breaking a sweat! Will try and discover some new stuff to watch now!

afiodorov · 2024-11-05T11:16:10 1730805370

Browsers are becoming OS'es :).

thatguy288 · 2024-11-05T12:37:36 1730810256

How can I exclude movies from a certain country?

afiodorov · 2024-11-05T12:50:01 1730811001

doesn't appear possible from https://developer.imdb.com/non-commercial-datasets/

NiekvdMaas · 2024-11-05T14:19:35 1730816375

The imdb.com web search allows filtering on country: https://www.imdb.com/search/title/?title_type=feature&releas...

paol · 2024-11-05T10:44:28 1730803468

Seems to be some normalization problem with the data, right in the 1st page of the default query there's a duplicate entry.

afiodorov · 2024-11-05T10:59:06 1730804346

Good spot, will deduplicate in the next iteration.

However, titles are often repeated due to region/language variations.

paol · 2024-11-05T11:35:27 1730806527

Since you're denormalizing to a single table, I think the correct way to handle this would be to aggregate all the titles into the title column.
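A rough sketch of that aggregation, using Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra installs. The table layout, the ids and the sample rows are hypothetical and not the site's actual schema; only the idea of collapsing the regional title variants into one row per movie is taken from the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (id TEXT, title TEXT, region TEXT, averageRating REAL, numVotes INTEGER)"
)

# Hypothetical rows: the same movie appears once per regional title.
rows = [
    ("id1", "Spirited Away", "US", 8.6, 800000),
    ("id1", "Sen to Chihiro no kamikakushi", "JP", 8.6, 800000),
    ("id2", "Dark City", "US", 7.6, 200000),
]
conn.executemany("INSERT INTO titles VALUES (?, ?, ?, ?, ?)", rows)

# One row per movie; all title variants are joined into a single column.
aggregated = conn.execute(
    """
    SELECT id, GROUP_CONCAT(title, ' / ') AS titles, averageRating, numVotes
    FROM titles
    GROUP BY id, averageRating, numVotes
    ORDER BY id
    """
).fetchall()

for row in aggregated:
    print(row)
```

The order of the variants inside the joined column is not guaranteed by this query.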

That said, "Untitled Pixar Animation Project" is basically garbage data, but that's a harder problem to solve...

afiodorov · 2024-11-09T14:06:14 1731161174

deduped all rows with a simple .unique() call in polars before saving

croisillon · 2024-11-05T10:32:42 1730802762

as someone who is tinkering in that area, i say really nice job!

trizoza · 2024-11-05T10:20:29 1730802029

I love it. Thanks!

wickedOne · 2024-11-05T10:55:13 1730804113

nicely done!

would be even better if modifications to the actual select part would somehow persist when building the query

afiodorov · 2024-11-05T11:05:47 1730804747

Will think about improving the ui in this aspect...

afiodorov · 2024-11-09T14:10:32 1731161432

done :)

coryfklein · 2024-11-05T17:09:51 1730826591

IMDb must have a lot of Indian users!

afiodorov · 2024-11-05T18:58:41 1730833121

Bollywood is the second largest film industry!

maxboone · 2024-11-05T10:09:37 1730801377

Is the source available somewhere?

afiodorov · 2024-11-05T10:10:06 1730801406

https://github.com/afiodorov/imdb-sql

bufferoverflow · 2024-11-05T16:56:16 1730825776

Looks like it's dead.

afiodorov · 2024-11-05T18:27:35 1730831255

The first load takes some time: it is downloading a parquet file that's 100 MB and has more than 1.5 million titles of series and movies.

cranberryturkey · 2024-11-05T09:20:15 1730798415

this is pretty cool. pretty slow though. maybe scrape in parallel?

afiodorov · 2024-11-05T09:22:30 1730798550

It's slow on the first load as you download a 100mb parquet file with all the IMDb data. It'll be cached on the second load though (using the browser's IndexedDB as the cache).

Not sure what you mean by scraping in parallel.

maleldil · 2024-11-05T10:42:40 1730803360

100 Mb is a lot of data to download. Maybe you should be upfront about it and, hopefully, require some user input before you do it.

netsharc · 2024-11-05T10:47:14 1730803634

Groan... Are you hosting this 100MB (and not mb)? If yes, I hope the bill will teach a lesson. If not, 100 visitors is already 10GB of bandwidth, that's quite abusive.

Maybe I'm just an old fart, and blowing 10GB of data is cheap nowadays?

Jach · 2024-11-05T12:03:08 1730808188

OVH and Hetzner offer cheap unmetered VPS instances, Cloudflare CDN doesn't have any bandwidth restrictions... Yes 10GB of data is cheap. If they were serving off EC2, then after the first 100 GB of the month it's 9 cents/GB or 90 cents per 10GB, which is absurd, but still not terribly expensive in the grand scheme of things: what would be a surprise lesson-teaching bill, $90 after a TB / 10,000 visitors?
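The arithmetic above can be checked with a short Python sketch. The 100 MB file size, the 9 cents/GB rate and the 100 GB free allowance are the figures quoted in this thread, not verified current pricing, and the function names are made up for illustration.

```python
FILE_MB = 100        # size of the parquet file, as stated in the thread
PRICE_PER_GB = 0.09  # EC2 egress rate quoted above (unverified)
FREE_GB = 100        # monthly free allowance quoted above (unverified)


def transfer_gb(visitors, file_mb=FILE_MB):
    """Total download volume in GB, using decimal units (1 GB = 1000 MB)."""
    return visitors * file_mb / 1000


def egress_cost(visitors, price_per_gb=PRICE_PER_GB, free_gb=0):
    """Cost in dollars for first-time visitors who each download the file once."""
    billable_gb = max(transfer_gb(visitors) - free_gb, 0)
    return billable_gb * price_per_gb


print(transfer_gb(100))                          # GB used by 100 visitors
print(round(egress_cost(10_000), 2))             # 1 TB, ignoring the free allowance
print(round(egress_cost(10_000, free_gb=FREE_GB), 2))  # 1 TB with the allowance applied
```

Repeat visits are not counted here, since the file is cached in the browser after the first load.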

I'm surprised some users think it's a lot on an individual scale. Consider an ideal persona for this submission: query some movies, then go stream several GBs for one. The 100 MB payload isn't much in comparison. I admit it is kind of bad form for mobile users who might be on metered data plans, and a warning and trigger for manual action for those devices would be kinder.

One thing that might help reset your notion of what a lot of bandwidth is would be to browse around with your network tab in the developer console open for a day. nytimes home page is 14 MB and they get a ton of traffic. Even a corporate blog on the HN front page right now that could just be the tiny compressed text is 2 MB. Single image loads on many pages can be 1 MB or more. Glancing at the submission, the response headers indicate it seems they're serving from S3 Cloudfront, which is free for the first TB per month, though after that it gets back to absurd pricing. AWS is not price competitive.

thimabi · 2024-11-05T11:53:26 1730807606

I have never imagined that simply by using your app it would download and cache a 100 MB file on my phone. If I hadn’t read the discussion here, I wouldn’t even know about it.

That’s an irresponsible waste of both bandwidth and storage. You should have really made that clear in the website.

afiodorov · 2024-11-05T19:48:50 1730836130

100 MB is not much - about 2 mins on YouTube

jafitc · 2024-11-05T11:16:00 1730805360

I think you should consider trimming that file.

Exclude movies with a very low number of ratings, or potentially very low scores too.

The long tail reduction would be significant

afiodorov · 2024-11-05T12:39:21 1730810361

I initially loved looking for obscure stuff, e.g. setting region to soviet union. It surely is the case that 99% of the users want 10% of the data at most. I'll have to work on the ability to select the file and download & cache it only if the relevant query asks for it.

smusamashah · 2024-11-05T13:02:15 1730811735

Is that data file also in your github repo or is it a third party source?

afiodorov · 2024-11-05T16:23:16 1730823796

The data was produced using this script: https://github.com/afiodorov/imdb-sql/blob/main/notebooks/im...

It is hosted on http://www.imdb-sql.com/imdb01-11-2024.parquet

In fact the main reason this project exists is

a) I wrote a jupyter notebook ages ago that'd join the raw data into a queryable form.

b) I committed it and forgot about it after the initial viewing of top movies/series.

c) For Halloween I wanted to find a well-rated horror movie, a genre I don't watch much.

d) I found my notebook but it was a drag to get it running again. First, pandas would keep throwing OOM errors so I had to migrate to polars. Secondly, I had to find a spare working laptop since an iPad is my primary off-work computing device as of late. Finally, the schema is not so intuitive - I need to look up things like region/language codes.

I want something I can just leave unattended but still be able to use a few months later without going through a dev-env set-up. A static website with all the data is one potential way of avoiding maintaining it as it should keep running, just like an exe would keep running on Windows in the past.

I avoid maintaining personal projects as my software job requires me to maintain software and I don't find it enjoyable to do the same in my spare time.

cranberryturkey · 2024-11-05T09:22:56 1730798576

oh i thought you were converting html to sql. nevermind.

Crazyontap · 2024-11-05T12:41:40 1730810500

One neat trick I discovered for finding good movies on IMDb is to delve into the review history of users who share my unique tastes, especially when they diverge from mainstream opinions.

For example, I found "Paris, Texas" to be pretty disappointing, yet most reviews were overwhelmingly positive. So, I sought out others who also didn't enjoy it and explored their review history to find movies we both agreed on.

Occasionally, you'll hit the jackpot and find an avid reviewer whose taste aligns perfectly with yours, providing a treasure trove of excellent movie recommendations.

I like to call these users my "IMDb doppelgangers."
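The manual search described above could be sketched roughly like this in Python. Everything here is hypothetical: IMDb's non-commercial datasets do not include per-user ratings, so the reviewer names, the ratings and the unnamed "Movie W/X/Y/Z" titles are invented, and the mean-absolute-difference measure is just one simple choice of similarity.

```python
def taste_distance(mine, theirs, min_shared=2):
    """Mean absolute rating difference on titles both users rated, or None if too few overlap."""
    shared = mine.keys() & theirs.keys()
    if len(shared) < min_shared:
        return None
    return sum(abs(mine[t] - theirs[t]) for t in shared) / len(shared)


def find_doppelganger(mine, others):
    """Return the name of the reviewer whose shared ratings are closest to mine."""
    scored = [
        (dist, name)
        for name, ratings in others.items()
        if (dist := taste_distance(mine, ratings)) is not None
    ]
    return min(scored)[1] if scored else None


def recommendations(mine, theirs, min_rating=8):
    """Titles the other reviewer rated highly that I have not rated yet."""
    return sorted(t for t, r in theirs.items() if r >= min_rating and t not in mine)


# Invented ratings on a 1-10 scale.
mine = {"Paris, Texas": 4, "Dark City": 9, "Daybreakers": 8}
others = {
    "reviewer_a": {"Paris, Texas": 9, "Dark City": 5, "Movie X": 9},
    "reviewer_b": {"Paris, Texas": 3, "Dark City": 9, "Movie Y": 9, "Movie Z": 6},
    "reviewer_c": {"Daybreakers": 8, "Movie W": 10},  # only one shared title, so skipped
}

twin = find_doppelganger(mine, others)
print(twin, recommendations(mine, others[twin]))
```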

munchler · 2024-11-05T12:55:52 1730811352

This is the premise of Criticker.com. You rate movies, and it will show you recommendations from similar movie watchers. They call it a "Taste Compatibility Index".

https://www.criticker.com/tcis/

jogi21 · 2024-11-05T15:46:49 1730821609

Criticker has been around since 2005, is super accurate with movie and tv recommendations and they're now owned by a big company!

wodenokoto · 2024-11-05T12:51:19 1730811079

There used to be a recommendation engine called tastekid where you could search for multiple movie likes and dislikes in the same query.

Another plus was you didn’t even need to create a profile. Maybe that was their downfall

KetoManx64 · 2024-11-05T13:27:11 1730813231

They still exist. They rebranded to TasteDive, but are still doing the same service: https://tastedive.com/

cptaj · 2024-11-05T12:50:25 1730811025

So, obvious question, is there a database of IMDB reviews?