Okay I tried it. I got interrupted twice for like ~12 minutes total, making the time I spent coding *checks terminal history* also 12 minutes. I made the assumption (would have asked if live) that if a user visits "A-B-C-D-E-F", then the program should identify "B-C-D" (etc.) as a visited path as well, and not only "A-B-C" and "D-E-F", which I felt made it quite a bit trickier than perhaps intended (but this seems like the only correct solution to me). The code I came up with for the first question, where you "cat" (without UUOC! Heh) the log file data into the program:
```python
import sys

unfinishedPaths = {}  # [user] = [path1, path2, ...] = [[page1, page2], [page1]]
finishedPaths = {}    # [path] = count

for line in sys.stdin:
    user = line.split(',')[0].strip()
    page = line.split(',')[1].strip()
    if user not in unfinishedPaths:
        unfinishedPaths[user] = []
    deleteIndex = []
    for pathindex, path in enumerate(unfinishedPaths[user]):
        path.append(page)
        if len(path) == 3:
            deleteIndex.append(pathindex)
    for pathindex in deleteIndex:
        serializedPath = ' -> '.join(unfinishedPaths[user][pathindex])
        if serializedPath in finishedPaths:
            finishedPaths[serializedPath] += 1
        else:
            finishedPaths[serializedPath] = 1
        del unfinishedPaths[user][pathindex]
    unfinishedPaths[user].append([page])

for k in sorted(finishedPaths, key=lambda x: finishedPaths[x], reverse=True):
    print(str(k) + ' with a count of ' + str(finishedPaths[k]))
```
Not tested properly because no expected output is given, but from concatenating your sample data a few times and introducing a third person, the output looks plausible. And I just noticed I failed: it says top 3, not print everything in order (I guess I expect the user to use `| head -3`, since it's a command-line program).
I needed to look up the parameter/argument that turns out to be called "key" for sorted() so I didn't do it all by heart (used html docs on the local filesystem for that, no web search or LLM), and I had one bout of confusion where I thought I needed to have another for loop inside of the "for pathindex, path in ..." (thinking it was "for pathsindex, paths in", note the plural). Not sure I'd have figured that one out with interview stress.
This is definitely trickier than fizzbuzz or similar. I'd budget at least 20 minutes for a great candidate having bad nerves and bad luck, which makes it fairly long given that you have follow-up questions and probably also want to get to other topics like team fit and compensation expectations at some point.
At a glance it seems correct, but there are a lot of inefficiencies, which may or may not be acceptable depending on the interview level/role.
Major:
1. Sorting finishedPaths is unnecessary given it only asks for the most frequent one (not the top 3 btw)
2. Deleting from the middle of the unfinishedPaths list is slow because it needs to shift the subsequent elements
3. You're storing effectively the same information 3 times in unfinishedPaths ([A, B, C], [B, C], [C])
Minor:
1. line.split is called twice
2. Way too many repeated dict lookups that could be easily avoided (in particular the 'if key (not) in dict: do_something(dict[key])' stuff should be done using dict.get and dict.setdefault instead)
3. deleteIndex doesn't need to be a list, it's always at most 1 element
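For the repeated-lookup point, a minimal sketch of what I mean, reusing your `finishedPaths` counting (hypothetical data):

```python
finishedPaths = {}

# One dict.get call instead of an 'in' check plus a separate lookup/assignment
for serializedPath in ['A -> B -> C', 'A -> B -> C', 'B -> C -> D']:
    finishedPaths[serializedPath] = finishedPaths.get(serializedPath, 0) + 1
```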
I realized at least the double-calling of line.split while writing the second instance, but figured I'm in an interview (not a take-home where you polish it before handing in), and this is more about getting a working solution fairly quickly, since there are more questions and topics and most interviews are 1h; from there the interviewer will steer towards what issues they care about. But then I never had to do live coding in an interview, so perhaps I'm wrong? Or I'm overoptimizing what would take a handful of seconds to improve.
That at most one of a user's paths hits length == 3 per input line is an insight I hadn't realized; that's minor point #3, but I guess it also shows up in major points #2 and #3, because it means you can design the whole thing differently -- each user having a rolling buffer of 3 elements and a pointer, perhaps. (I guess this is the sort of conversation to have with the interviewer.)
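Something like this, I imagine (an untested sketch with made-up names; `collections.deque` with `maxlen=3` does the rolling-buffer part, so no pointer needed):

```python
from collections import deque

lastPages = {}   # user -> deque of that user's last 3 pages
pathCounts = {}  # serialized 3-page path -> count

def feed(line):
    user, page = (part.strip() for part in line.split(','))
    buf = lastPages.setdefault(user, deque(maxlen=3))
    buf.append(page)  # at maxlen, the deque drops the oldest page automatically
    if len(buf) == 3:
        key = ' -> '.join(buf)
        pathCounts[key] = pathCounts.get(key, 0) + 1

for sample in ['u1,A', 'u1,B', 'u1,C', 'u1,D']:
    feed(sample)
# pathCounts now holds 'A -> B -> C' and 'B -> C -> D', one each
```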
Defaultdict, yeah, I know of it; I don't remember the API by heart, so I don't use it. Not sure the advantage is worth it, but yep, it would look cleaner.
Got curious about the performance now. Downloading 1M lines of my web server logs and formatting them so that IPaddr=user and URI=page (size is now 65MB), the code runs in 3.1 seconds. I'm not displeased with 322k lines/sec for a quick/naive solution in CPython, I must say. One might argue that for an average webshop, more engineering time would just be wasted :) but of course a better solution would be better.
Finally, I was going to ask what you meant with major point #1 since the task does say top 3 but then I read it one more time and...... right. I should have seen that!
As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)
For anyone curious, the top results from my access logs are
/ -> / -> / with a count of 6120
/robots.txt -> /robots.txt -> /robots.txt with a count of 4459
/ -> /404.html -> / with a count of 4300
> As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)
You need the list regardless, just do `max` instead of `sort` at the end, which is O(N) rather than O(N log N). Likewise, returning top 3 elements can still be done in O(N) without sorting (with heapq.nlargest or similar), although I agree that you probably shouldn't expect most interviewees to know about this.
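For illustration, with a hypothetical `finishedPaths` dict:

```python
import heapq

finishedPaths = {'A -> B -> C': 5, 'B -> C -> D': 9, 'C -> D -> E': 2}

# Single most frequent path in O(N), no sort needed
top1 = max(finishedPaths, key=finishedPaths.get)

# Top k in O(N log k) without sorting the whole dict
top3 = heapq.nlargest(3, finishedPaths, key=finishedPaths.get)
```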
As for the rest, as I've said, it depends on the candidate level. From a junior it's fine as-is, although I'd still want them to be able to fix at least some of those issues once I point them out. I'd expect a senior to be able to write a cleaner solution on their own, or at most with minimal prompting (e.g. "Can you optimize this?").
FYI, defaultdict and setdefault are not the same thing:
```python
from collections import defaultdict

d = defaultdict(list)
d[key].append(value)
```

vs

```python
d = {}
d.setdefault(key, []).append(value)
```
setdefault is useful when you only want the "default" behavior in one piece of code but not in others.
> / -> / -> / with a count of 6120
> /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
Your solution looks alright. I think you could use a defaultdict() to clean up a few lines of code, and I don't fully understand why you have two nested loops inside your file processing loop.
It could be slow on large log files because it keeps the whole log in memory. You could speed it up significantly by doing a `.shift()` at the point when you `.slice(-3)` so that you only track the last 3 pages for any user.
edit: wait, now I need to know: did I get hired?