I had a large CSV that I wanted to analyze with duckdb. I was having a weird problem: I couldn’t read it withread_csv(...) due to a confusing error:
Error: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n
I say “confusing” because I had specified new_line='\n', so I don’t know why it was expecting \r\n. But I assumed it was something with the line endings.
Oddly, if I split it up with my split.py, then used read_csv('part_*csv', ...) it would work fine.
But if I split it by year using [[Permanent/4 things to know about awk|awk]] :
for yy in `seq 2014 2023`; do
    awk -F';' -v "year=^$yy" '$2~year' all_bursts.csv > bursts_$yy.csv & 
done
then it still failed with the same problem.
I finally managed to open one of the smaller files and saw ^M at the end, confirming that it was from a newline problem. The quickest way to fix it wasn’t obvious, but in the end, sed was the simplest:
 sed -e 's/\r//' -i.bak all_bursts.csv
(which left me the .bak file in case I did it wrong)
But still, why was the python-split version working?
To check, I made dummy file with some \n and some \r\n
f = open("test1", 'w')
f.write("aaa\nbbb\r\nccc\n")
If I simply cat the results, you can’t tell…
$ cat test1
aaa
bbb
ccc
But there’s a -e which displays non-printing characters, so the line endings get shown:
$ cat -e test1
aaa$
bbb^M$
ccc$
So this is what was messing up duckdb. How did python get rid of it? 
Apparently, python will automatically get rid of the double line ending when you .write to a file:
>>> f = open("test1")
>>> f2 = open("test2", "w")
>>> for line in f:
...     f2.write(line)
...
4
4
4
Note the 4! Even though
>>> f.write("aaa\nbbb\r\nccc\n")
13
gave 13 bytes written, the line-by-line writing only did 12 total, and the extra byte is gone:
$ cat -e test[12]
aaa$
bbb^M$
ccc$
aaa$
bbb$
ccc$
\r do in python?I realized I didn’t actually know what the “carriage return” did. It turns out the simple answer is “return to the line beginning”. So if we print one before a new line is printed, it will overwrite the existing text in the line:
>>> print("This will be\roverwritten")
overwrittene