Okay I tried it. I got interrupted twice for like ~12 minutes total, making the time I spent coding *checks terminal history* also 12 minutes. I made the assumption (would have asked if live) that if a user visits "A-B-C-D-E-F", then the program should identify "B-C-D" (etc.) as a visited path as well, and not only "A-B-C" and "D-E-F", which I felt made it quite a bit trickier than perhaps intended (but this seems like the only correct solution to me). The code I came up with for the first question, where you "cat" (without UUOC! Heh) the log file data into the program:
```python
import sys

unfinishedPaths = {}  # [user] = [path1, path2, ...] = [[page1, page2], [page1]]
finishedPaths = {}    # [path] = count

for line in sys.stdin:
    user = line.split(',')[0].strip()
    page = line.split(',')[1].strip()
    if user not in unfinishedPaths:
        unfinishedPaths[user] = []
    deleteIndex = []
    for pathindex, path in enumerate(unfinishedPaths[user]):
        path.append(page)
        if len(path) == 3:
            deleteIndex.append(pathindex)
    for pathindex in deleteIndex:
        serializedPath = ' -> '.join(unfinishedPaths[user][pathindex])
        if serializedPath in finishedPaths:
            finishedPaths[serializedPath] += 1
        else:
            finishedPaths[serializedPath] = 1
        del unfinishedPaths[user][pathindex]
    unfinishedPaths[user].append([page])

for k in sorted(finishedPaths, key=lambda x: finishedPaths[x], reverse=True):
    print(str(k) + ' with a count of ' + str(finishedPaths[k]))
```
Not tested properly because no expected output is given, but from concatenating your sample data a few times and introducing a third person, the output looks plausible. And I just noticed I failed: it says top 3, not print everything in order (I guess I expect the user to use `| head -3`, since it's a command-line program).
I needed to look up the parameter/argument that turns out to be called "key" for sorted() so I didn't do it all by heart (used html docs on the local filesystem for that, no web search or LLM), and I had one bout of confusion where I thought I needed to have another for loop inside of the "for pathindex, path in ..." (thinking it was "for pathsindex, paths in", note the plural). Not sure I'd have figured that one out with interview stress.
This is definitely trickier than fizzbuzz or similar. I'd budget at least 20 minutes for a great candidate having bad nerves and bad luck, which makes it fairly long given that you have follow-up questions and probably also want to get to other topics like team fit and compensation expectations at some point.
At a glance it seems correct, but there are a lot of inefficiencies, which may or may not be acceptable depending on the interview level/role.
Major:
1. Sorting finishedPaths is unnecessary given it only asks for the most frequent one (not the top 3 btw)
2. Deleting from the middle of the unfinishedPaths list is slow because it needs to shift the subsequent elements
3. You're storing effectively the same information 3 times in unfinishedPaths ([A, B, C], [B, C], [C])
Minor:
1. line.split is called twice
2. Way too many repeated dict lookups that could be easily avoided (in particular the 'if key (not) in dict: do_something(dict[key])' stuff should be done using dict.get and dict.setdefault instead)
3. deleteIndex doesn't need to be a list, it's always at most 1 element
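For the repeated-lookup point, a minimal sketch of what I mean, reusing your `finishedPaths` counting (hypothetical data):

```python
finishedPaths = {}

# One dict.get call instead of an 'in' check plus a separate lookup/assignment
for serializedPath in ['A -> B -> C', 'A -> B -> C', 'B -> C -> D']:
    finishedPaths[serializedPath] = finishedPaths.get(serializedPath, 0) + 1
```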
I realized at least the double-calling of line.split while writing the second instance, but figured I'm in an interview (not a take-home where you polish it before handing in), and this is more about getting a working solution fairly quickly, since there are more questions and topics and most interviews are 1h; from there the interviewer will steer towards what issues they care about. But then I never had to do live coding in an interview, so perhaps I'm wrong? Or I'm overoptimizing what would take a handful of seconds to improve.
That at most one of a user's paths hits length == 3 per input line is an insight I hadn't realized; that's minor point #3, but I guess it also shows up in major points #2 and #3, because it means you can design the whole thing differently -- each user having a rolling buffer of 3 elements and a pointer, perhaps. (I guess this is the sort of conversation to have with the interviewer.)
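Something like this, I imagine (an untested sketch with made-up names; `collections.deque` with `maxlen=3` does the rolling-buffer part, so no pointer needed):

```python
from collections import deque

lastPages = {}   # user -> deque of that user's last 3 pages
pathCounts = {}  # serialized 3-page path -> count

def feed(line):
    user, page = (part.strip() for part in line.split(','))
    buf = lastPages.setdefault(user, deque(maxlen=3))
    buf.append(page)  # at maxlen, the deque drops the oldest page automatically
    if len(buf) == 3:
        key = ' -> '.join(buf)
        pathCounts[key] = pathCounts.get(key, 0) + 1

for sample in ['u1,A', 'u1,B', 'u1,C', 'u1,D']:
    feed(sample)
# pathCounts now holds 'A -> B -> C' and 'B -> C -> D', one each
```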
Defaultdict, yeah, I know of it; I don't remember the API by heart, so I don't use it. Not sure the advantage is worth it, but yep, it would look cleaner.
Got curious about the performance now. Downloading 1M lines of my web server logs and formatting them so that IPaddr=user and URI=page (size is now 65MB), the code runs in 3.1 seconds. I'm not displeased with 322k lines/sec for a quick/naive solution in CPython, I must say. One might argue that for an average webshop, more engineering time would just be wasted :) but of course a better solution would be better.
Finally, I was going to ask what you meant with major point #1 since the task does say top 3 but then I read it one more time and...... right. I should have seen that!
As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)
For anyone curious, the top results from my access logs are
/ -> / -> / with a count of 6120
/robots.txt -> /robots.txt -> /robots.txt with a count of 4459
/ -> /404.html -> / with a count of 4300
> As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)
You need the list regardless, just do `max` instead of `sort` at the end, which is O(N) rather than O(N log N). Likewise, returning top 3 elements can still be done in O(N) without sorting (with heapq.nlargest or similar), although I agree that you probably shouldn't expect most interviewees to know about this.
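For illustration, with a hypothetical `finishedPaths` dict:

```python
import heapq

finishedPaths = {'A -> B -> C': 5, 'B -> C -> D': 9, 'C -> D -> E': 2}

# Single most frequent path in O(N), no sort needed
top1 = max(finishedPaths, key=finishedPaths.get)

# Top k in O(N log k) without sorting the whole dict
top3 = heapq.nlargest(3, finishedPaths, key=finishedPaths.get)
```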
As for the rest, as I've said, it depends on the candidate level. From a junior it's fine as-is, although I'd still want them to be able to fix at least some of those issues once I point them out. I'd expect a senior to be able to write a cleaner solution on their own, or at most with minimal prompting (e.g. "Can you optimize this?").
FYI, defaultdict and setdefault are not the same thing:
```python
from collections import defaultdict

d = defaultdict(list)
d[key].append(value)
```

vs

```python
d = {}
d.setdefault(key, []).append(value)
```
setdefault is useful when you only want the "default" behavior in one piece of code but not in others.
> / -> / -> / with a count of 6120
> /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
Your solution looks alright. I think you could use a defaultdict() to clean up a few lines of code, and I don't fully understand why you have two nested loops inside your file processing loop.
It could be slow on large log files because it keeps the whole log in memory. You could speed it up significantly by doing a `.shift()` at the point when you `.slice(-3)` so that you only track the last 3 pages for any user.
edit: wait, now I need to know: did I get hired?